Text Generation with Markov Chains

A gentle introduction to Markov Chains and how to use them for generating text.

(Interactive demo: choose a sample text, e.g. Oh, the Places You'll Go! by Dr. Seuss, along with the n-gram type, the n-gram size, and the generated text length, then view the generated text.)

How it works

An n-gram is simply a contiguous sequence of n units of text (e.g. characters or words) drawn from a given text. Take, for example, this excerpt from Green Eggs and Ham.

I am Sam Sam I am That Sam-I-am! That Sam-I-am! I do not like that Sam-I-am!

(I a), ( am), (am ), (m S), ( Sa) and (Sam) are the first six character tri-grams (three-character sequences).
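In code, extracting n-grams is just a sliding window over the string. Here's a minimal sketch (a throwaway demo class of my own, not part of the generator we'll build below) that prints exactly those six tri-grams:

class TrigramDemo {
  public static void main(String[] args) {
    String text = "I am Sam"; // enough of the excerpt for the first six tri-grams
    int n = 3;
    // Slide a window of n characters across the text.
    for (int i = 0; i + n <= text.length(); i++) {
      System.out.println("(" + text.substring(i, i + n) + ")");
    }
  }
}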

Here's another list, this time of bi-grams (two-character sequences), along with the number of times each occurs in the text.

( a): 5
( d): 1
( i): 5
( l): 1
( n): 1
( s): 5
( t): 3
(am): 10
(at): 3
(do): 1
(e ): 1
(ha): 3
(i ): 6
(ik): 1
(ke): 1
(li): 1
(m ): 9
(no): 1
(o ): 1
(ot): 1
(sa): 5
(t ): 4
(th): 3

To keep things simple, I've lowercased the characters and excluded punctuation.
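Producing those counts in code is straightforward; here's a minimal sketch (again a throwaway demo class of my own) that tallies bi-grams from the normalized text:

import java.util.HashMap;
import java.util.Map;

class BigramCountDemo {
  public static void main(String[] args) {
    String text = "i am sam sam i am that sam i am that sam i am i do not like that sam i am";
    Map<String, Integer> counts = new HashMap<>();
    // Tally every two-character window.
    for (int i = 0; i + 2 <= text.length(); i++) {
      counts.merge(text.substring(i, i + 2), 1, Integer::sum);
    }
    System.out.println(counts.get("am")); // 10
  }
}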

N-grams alone are nothing special; it's their number of occurrences and the characters that follow each occurrence that are interesting. In the above analysis, we've only listed the number of times each bi-gram occurred. Let's replace that count with the actual character that followed each occurrence.

( a) -> [m, m, m, m, m]
( d) -> [o]
( i) -> [ ,  ,  ,  ,  ]
( l) -> [i]
( n) -> [o]
( s) -> [a, a, a, a, a]
( t) -> [h, h, h]
(am) -> [ ,  ,  ,  ,  ,  ,  ,  ,  ]
(at) -> [ ,  ,  ]
(do) -> [ ]
(e ) -> [t]
(ha) -> [t, t, t]
(i ) -> [a, a, a, a, d, a]
(ik) -> [e]
(ke) -> [ ]
(li) -> [k]
(m ) -> [s, s, i, t, i, t, i, i, i]
(no) -> [t]
(o ) -> [n]
(ot) -> [ ]
(sa) -> [m, m, m, m, m]
(t ) -> [s, s, l, s]
(th) -> [a, a, a]

By recording the "next" character that follows each occurrence of an n-gram, we now have a tool to statistically predict the next character, and with it the next n-gram. For example, the next character after (i ) is d with probability 1/6 and a with probability 5/6. (Note that (am) occurs 10 times but has only 9 next characters: its final occurrence ends the text, so nothing follows it.) Let's continue with this example and see what we end up with.

"(i )" -> [a, a, a, a, d, a]
"i( a)"

Suppose the randomly chosen character is a. We drop the oldest character, i, and form the new bi-gram ( a) by appending a, then look up its next possible characters.

"i( a)" -> [m, m, m, m, m]
"i (am)"

Here are a few more iterations.

"i (am)" -> [ ,  ,  ,  ,  ,  ,  ,  ,  ,  ]
"i a(m )" -> [s, s, i, t, i, t, i, i, i]
"i am( t)" -> [h, h, h]
"i am (th)" -> [a, a, a]
"i am t(ha)" -> [t, t, t]
"i am th(at)" -> [ ,  ,  ]
"i am tha(t )" -> [s, s, l, s]
"i am that( l)" -> [i]
"i am that (li)" -> [k]
"i am that l(ik)" -> [e]
"i am that li(ke)" -> [ ]
"i am that lik(e )" -> [t]
...
"i am that like tha(t )" -> [s, s, l, s]

Continuing this process yields a chain of states, visited according to the probability distribution of the bi-grams, thus generating new text. This process of moving from one state to the next, based only on the current state and its transition probabilities, is the basis of a Markov chain, and it's what we are going to build to generate text.
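Under the hood, the only randomness we need is a uniform pick from a stored next-character list; the duplicates act as probability weights. Here's a minimal sketch of that pick for (i ) (the demo class is my own, and List.of requires Java 9+):

import java.util.List;
import java.util.Random;

class WeightedPickDemo {
  public static void main(String[] args) {
    // Characters observed after (i ); duplicates encode the distribution.
    List<Character> next = List.of('a', 'a', 'a', 'a', 'd', 'a');
    Random random = new Random();
    // A uniform pick yields 'a' with probability 5/6 and 'd' with 1/6.
    System.out.println(next.get(random.nextInt(next.size())));
  }
}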

For the remainder of this post we'll build a simple Markov chain, modeled with n-grams and use it to generate some text. Here's a very high-level overview of how our text generator will work.

  1. tokenize a corpus into characters
  2. build a Markov model of the characters from the corpus
    1. every model entry will contain an n-gram and a list of next characters
  3. generate new text

We start with a simple tokenizer, splitting our text into character tokens.

class SimpleMarkovTextGenerator {

  /**
   * Tokenize (i.e. split) the given corpus into an array of character tokens.
   * @param corpus text corpus to tokenize
   * @return array of character tokens, or an empty array
   */
  char[] tokenize(String corpus) {
    if (corpus == null)
      return new char[]{};
    return corpus.toLowerCase()        // lowercase everything
      .replaceAll("[.!\\-',?\n]", " ") // remove punctuation
      .replaceAll("\\s+", " ")         // trim to single space
      .trim()
      .toCharArray();
  }
}
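As a quick sanity check (a throwaway demo assuming the class as defined so far, which still has its default constructor), tokenizing our excerpt reproduces the normalized text we analyzed above:

class TokenizeDemo {
  public static void main(String[] args) {
    char[] tokens = new SimpleMarkovTextGenerator().tokenize(
        "I am Sam Sam I am That Sam-I-am! That Sam-I-am! I do not like that Sam-I-am!");
    // prints: i am sam sam i am that sam i am that sam i am i do not like that sam i am
    System.out.println(new String(tokens));
  }
}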

Next we iterate through the character tokens and build a model (i.e. a map from each n-gram to the list of characters that follow it).

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class SimpleMarkovTextGenerator {
  final Map<String, List<Character>> model;
  final int ngramSize;

  SimpleMarkovTextGenerator(String corpus, int ngramSize) {
    this.ngramSize = ngramSize;
    this.model = buildModel(corpus);
  }

  ...

  /**
   * Return a Map representing a Markov model of n-grams to next characters.
   * @param corpus text to tokenize and build n-grams from
   * @return map of each n-gram to the characters that followed it
   */
  Map<String, List<Character>> buildModel(String corpus) {
    return buildModel(tokenize(corpus));
  }

  /**
   * Return a simple Map representing a Markov model of n-grams to next characters.
   * @param characters tokens to build n-grams from
   * @return map of each n-gram to the characters that followed it
   */
  Map<String, List<Character>> buildModel(char[] characters) {
    Map<String, List<Character>> model = new HashMap<>();
    for (int i = 0; i < characters.length - ngramSize; i++) {
      // the n-gram starting at i, and the character immediately following it
      String ngram = String.copyValueOf(characters, i, ngramSize);
      char nextChar = characters[i + ngramSize];
      model.computeIfAbsent(ngram, k -> new ArrayList<>()).add(nextChar);
    }
    return model;
  }
}
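To check that the model matches our hand-built lists, we can build one from the excerpt with an n-gram size of 2 and look up (i ) (another throwaway demo, assuming both classes sit in the same package):

class ModelDemo {
  public static void main(String[] args) {
    SimpleMarkovTextGenerator markov = new SimpleMarkovTextGenerator(
        "I am Sam Sam I am That Sam-I-am! That Sam-I-am! I do not like that Sam-I-am!", 2);
    // prints [a, a, a, a, d, a], matching the hand-built list above
    System.out.println(markov.model.get("i "));
  }
}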

We've restructured the class a little to make it easier to work with. Finally, to generate our text, we'll use the following algorithm.

  1. start with a "seed" n-gram
  2. look up the n-gram in our model and grab the next possible characters
  3. randomly choose one of those characters
  4. append the chosen character to our generated text
  5. build the next n-gram from the character and repeat from step 2 until the text is long enough

class SimpleMarkovTextGenerator {
  ...

  /**
   * Generate text from the model, starting from the given n-gram.
   * @param ngram the starting n-gram (the seed)
   * @param len the number of characters to generate after the seed
   * @return the generated text
   */
  String generate(String ngram, int len) {
    Random random = new Random();
    StringBuilder sb = new StringBuilder(ngram);
    while (len-- > 0) {
      List<Character> nextCharacters = model.get(ngram);
      if (nextCharacters == null) // dead end: this n-gram never occurred in the corpus
        break;
      char nextChar = nextCharacters.get(random.nextInt(nextCharacters.size()));
      sb.append(nextChar);
      // slide the window: drop the oldest character, append the chosen one
      ngram = ngram.substring(1) + nextChar;
    }
    return sb.toString();
  }
}
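Note the sliding window in generate: ngram.substring(1) + nextChar drops the oldest character and appends the newly chosen one, mirroring the (i ) to i( a) step in the walk-through above. The null check guards against dead ends, such as a seed n-gram that never appears in the corpus.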

Let's generate a couple of texts from our original Green Eggs and Ham excerpt.

class SimpleMarkovTextGenerator {
  ...

  public static void main(String[] args) {
    String corpus = "I am Sam Sam I am That Sam-I-am! That Sam-I-am! I do not like that Sam-I-am!";
    String startNgram = "i ";
    int ngramSize = 2;
    int textLen = 30;

    SimpleMarkovTextGenerator markov = new SimpleMarkovTextGenerator(corpus, ngramSize);
    System.out.println(markov.generate(startNgram, textLen));
    System.out.println(markov.generate(startNgram, textLen));
    System.out.println(markov.generate(startNgram, textLen));
  }
}

And here is some sample text from our new generator, powered by a Markov model using bi-grams.

i do not sam sam i am i do not s
i am sam i am that sam i am i am
i am that sam sam sam i am i am

Try it with uni-grams. What kind of results do you get compared to, say, tri-grams? Anyway, that was a brief introduction to Markov chains and n-grams.

Demo and post source code on GitHub.