[SUMMARY] Markov Chains (#74)

I owe myself a signed copy of my book. Err, I mean, this was a popular quiz!
That's always great because I have so many wonderful selections of code to talk
about in the summary, but it can also be hard because I have so many wonderful
selections of code to talk about in the summary. All of the solutions were
terrifically educational, even the ones I don't have time to discuss below.
Everyone who reads this summary is bound by Ruby Law (can be tried in the Ruby
Courts) to go look through the rest of the submissions.

Now that we have the legal formalities out of the way, let's dig into a
solution. Jon Egil Strand wrote in with descriptions of several practical
applications for Markov Chains and this bit of code:

  class MarkovChain
    def initialize(text)
      @words = Hash.new
      wordlist = text.split
      wordlist.each_with_index do |word, index|
        add(word, wordlist[index + 1]) if index <= wordlist.size - 2
      end
    end
  
    def add(word, next_word)
      @words[word] = Hash.new(0) if !@words[word]
      @words[word][next_word] += 1
    end
  
    def get(word)
      return "" if !@words[word]
      followers = @words[word]
      sum = followers.inject(0) {|sum,kv| sum += kv[1]}
      random = rand(sum)+1
      partial_sum = 0
      next_word = followers.find do |word, count|
        partial_sum += count
        partial_sum >= random
      end.first
      next_word
    end
  end
  
  mc = MarkovChain.new(
    File.read("Agatha Christie - The Mysterious Affair at Styles.txt")
  )
  
  sentence = ""
  word = "Murder"
  until sentence.count(".") == 4
    sentence << word << " "
    word = mc.get(word)
  end
  puts sentence << "\n\n"

Let's examine the MarkovChain object first, as it is where the real work is
done. In initialize() you can see that a Hash is created, the text is split()
on whitespace, and then each pair of adjacent words is passed to the add()
method.

In add(), the first word is used to key the Hash we saw created in initialize(),
adding a frequency Hash as the value the first time a new word is seen. The
following word is then added to the counts in the frequency Hash. Obviously, we
are just working with a first-order chain here, since we only ever track the
word that immediately precedes the current word.
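To make that structure concrete, here is a hand-rolled sketch of the same loop
on a made-up six-word text (the text and variable names are just for
illustration, not from Jon's solution):

```ruby
# Build the first-order frequency Hash the same way initialize() and add() do.
words = Hash.new
wordlist = "the cat sat on the cat".split
wordlist.each_with_index do |word, index|
  next unless index <= wordlist.size - 2
  words[word] ||= Hash.new(0)            # add a frequency Hash on first sight
  words[word][wordlist[index + 1]] += 1  # count the follower
end
words["the"]  # => {"cat" => 2}, since "the" is followed by "cat" twice
```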

The real work is done in the get() method, where we pass in the current word
and get back a reasonable follow-up word. The word is checked to make sure we
have seen it, then the frequency Hash is pulled for that word. Next, the
frequencies of all possible follow-up words are summed and a random selection is
made in that range. Finally, the frequency Hash is walked one last time, and
the random number is used to select and return the indicated word.
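That weighted pick is worth isolating. A small sketch (the followers Hash and
the pick() helper are hypothetical, not part of Jon's class) shows how the
running partial sum maps a roll onto a word:

```ruby
# pick() mirrors the selection logic in get(); roll is the 1-based random draw.
def pick(followers, roll)
  partial_sum = 0
  followers.find { |_, count| (partial_sum += count) >= roll }.first
end

followers = {"wrote" => 3, "weapon" => 1}  # made-up frequency Hash
pick(followers, 1)  # => "wrote"   (rolls 1 through 3 land on "wrote")
pick(followers, 4)  # => "weapon"  (only a roll of 4 reaches "weapon")
```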

The rest of the solution code follows the class. First a MarkovChain is
constructed from an Agatha Christie novel. Then a variable is prepared to hold
the output (called sentence, but it will actually hold a few sentences) and a
starting word appropriate to the content is hand-picked. The script then loops,
using get() to choose words one by one until it has collected at least four
sentences of output, which are dumped to the user before we exit.

The above code does not deal with punctuation at all, which can be both good and
bad. The upside is that you will see some natural punctuation, like commas, and
it will generally even appear in reasonable places, since it stays attached to
the word it followed in the original text. The downside is that you may get something
like an opening quote, but never get the closing quote. Sometimes it works out
naturally, and sometimes it doesn't.

Now, in this whole discussion, I was loosely throwing around this "frequency
Hash" term that I never bothered to define. Allow me to fix that now. The
frequency Hash is obviously just a Hash with keys being the words that can
follow and values being a count of how many times that word was seen to follow
in the original text. Dominik Bathon included excellent examples of how this
looks in his submission email. Using Dominik's trivial example text of:

  q s a w s. a s w q s. q w s a a w. s q a w s.

Jon's code will produce a Hash tree that looks like:

  { "w." => {"s" => 1},
    "w" => {"s." => 2, "q" => 1, "s" => 1},
    "a" => {"w." => 1, "a" => 1, "w" => 2, "s" => 1},
    "s." => {"a" => 1, "q" => 1},
    "q" => {"a" => 1, "w" => 1, "s." => 1, "s" => 1},
    "s" => {"w" => 1, "a" => 2, "q" => 1} }

You can really see how punctuation changes things there, since "w" and "w." are
different words.

The right-hand side of that (the values of the top-level Hash) shows the
frequency breakdown. "w" occurs twice after an "a", but "s" occurs just once
following the same word.

Dominik then went on to show how a person might change that representation.
Here are two significant changes:

  { "w" => { "q" => ["s"],
             "s" => [".", "a", "."] },
    "a" => { "a" => ["w"],
             "w" => ["s", ".", "s"],
             "s" => ["w"] },
    "q" => { "a" => ["w"],
             "w" => ["s"],
             "s" => ["a", "."] },
    "s" => { "w" => ["q"],
             "a" => ["w", "a"],
             "q" => ["a"] } }

Note that punctuation is handled differently here, because I'm now using
Dominik's examples and his code used a different implementation. In Dominik's
code, periods, exclamation marks, and question marks are treated as their own
words and most other punctuation is ignored.

Try to ignore the tree structure of the above for now and just focus again
on that right-hand side. This is still just a frequency Hash, though it is now
disguised as an Array. If you have the Hash {"a" => 1, "." => 2}, you could
select the next word just as we saw Jon do earlier. However, the Array [".",
"a", "."] represents the same data and we can select a word just by randomly
selecting a member. The duplication ensures that the odds are still the same.
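The equivalence is easy to demonstrate: expanding the counts into duplicates
gives an Array where a plain uniform pick carries the same weights (toy data
again):

```ruby
counts  = {"a" => 1, "." => 2}
# Expand each count into that many copies of the word...
choices = counts.map { |word, n| [word] * n }.flatten  # => ["a", ".", "."]
# ...so a uniform random pick is now a weighted pick.
word = choices[rand(choices.size)]
```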

The other obvious change is that we are now taking into account an order for the
chain. This works out pretty naturally, just by nesting Hashes to the depth of
the order. We then walk the tree to find out what comes next. For example, if
we have a "w" followed by an "a", we get to our now familiar Array of choices
for the next word: [".", "a", "."].

But wait, Dominik goes one step further:

  { ["q", "a"] => ["w"],
    ["q", "w"] => ["s"],
    ["s", "q"] => ["a"],
    ["w", "q"] => ["s"],
    ["s", "w"] => ["q"],
    ["a", "s"] => ["w"],
    ["w", "s"] => [".", "a", "."],
    ["s", "a"] => ["w", "a"],
    ["q", "s"] => ["a", "."],
    ["a", "a"] => ["w"],
    ["a", "w"] => ["s", ".", "s"] }

This is exactly the same as the tree above, except that the Hash tree of
preceding words has been flattened into a single Array of words, used to key the
Hash. This is possible because Ruby's Arrays are hashable. Isn't that handy?
The gain is speed. Since we only have to make one Hash lookup to get to the
frequency Array, it works a little quicker.
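A quick sketch of the trick (toy data): because two equal Arrays hash to the
same value, the word pair works directly as a Hash key:

```ruby
# Arrays define hash and eql?, so equal Arrays find the same Hash slot.
freq  = { ["a", "w"] => ["s", ".", "s"] }
state = ["a", "w"]                 # built elsewhere, but equal to the key
freq[state]  # => ["s", ".", "s"] - one lookup, no tree to walk
```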

Now that we have gone through the trouble of understanding Dominik's data
structure, the code should be a breeze. Let's take a look:

  # ...
  
  if $0 == __FILE__
    order = 2
    n = 10
    while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
      if $1 == "o"
        order = Integer($2)
      else
        n = Integer($2)
      end
      ARGV.shift
    end
    mc = MarkovChainer.new(order)
    mc.add_text(ARGF.read)
    n.times {
      puts mc.generate_sentence
    }
  end

This is the last little bit of Dominik's code, but it kicks the process off.
You can see that it selects defaults for the order and number of sentences to
produce, but then updates them from the provided command-line options. Next, a
MarkovChainer is built and fed the combined contents of ARGF. Finally, the
requested number of sentences are generated with a simple method call repeated
inside of an iterator.

Let's look at the pieces of MarkovChainer used to read text:

  class MarkovChainer
    attr_reader :order
    def initialize(order)
      @order = order
      @beginnings = []
      @freq = {}
    end
  
    def add_text(text)
      # make sure each paragraph ends with some sentence terminator
      text.gsub!(/\n\s*\n/m, ".")
      text << "."
      seps = /([.!?])/
      sentence = ""
      text.split(seps).each { |p|
        if seps =~ p
          add_sentence(sentence, p)
          sentence = ""
        else
          sentence = p
        end
      }
    end
  
    # ...
  
    private
  
    def add_sentence(str, terminator)
      words = str.scan(/[\w']+/)
      return unless words.size > order # ignore short sentences
      words << terminator
      buf = []
      words.each { |w|
        buf << w
        if buf.size == order + 1
          (@freq[buf[0..-2]] ||= []) << buf[-1]
          buf.shift
        end
      }
      @beginnings << words[0, order]
    end
  
    # ...
    
  end

The initialize() method shows a familiar setup, save that new Array we haven't
talked about yet. We will tackle that when it comes up.

add_text() isn't much more complicated. It ensures the document it is dealing
with contains complete sentences, then breaks it on the sentence-ending
punctuation characters I mentioned earlier. The trick here is that you need to notice the
parentheses in the Regexp handed to split(). Ordinarily, split() will remove
the match used to divide the chunks, but when the match includes capturing
parentheses, those captures are returned as items of their own. So, the final
iterator of the method reads the sentence then the punctuation that followed it
and hands those over to add_sentence().
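You can see that capture behavior with a throwaway string in irb:

```ruby
seps = /([.!?])/
"It was dark. Who did it?".split(seps)
# => ["It was dark", ".", " Who did it", "?"]
# Without the capturing parentheses the terminators would be discarded:
"It was dark. Who did it?".split(/[.!?]/)
# => ["It was dark", " Who did it"]
```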

add_sentence() is where the data structure we analyzed earlier gets built. The
sentence is separated into words, which, don't forget, include the sentence
terminator. Then those words are pushed into a buffer one at a time. When the
buffer has enough words stored to satisfy the order, it starts filling the Hash
of frequency Arrays for this chain. Finally, we see what that extra Array is
used for. It holds beginnings, or groups of words used to start a sentence
(when we wouldn't have previous words to check frequencies for). You can see
those being stored in the last line of this method.
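Here is the sliding window replayed by hand on the first sentence of Dominik's
example text, order 2, outside the class:

```ruby
order = 2
words = %w[q s a w s] << "."   # the scanned words plus the terminator
freq  = {}
buf   = []
words.each do |w|
  buf << w
  if buf.size == order + 1
    (freq[buf[0..-2]] ||= []) << buf[-1]  # key: first two words, value: follower
    buf.shift                             # slide the window forward
  end
end
freq
# => {["q", "s"]=>["a"], ["s", "a"]=>["w"], ["a", "w"]=>["s"], ["w", "s"]=>["."]}
```

Each entry here is the first contribution to the matching row of the flattened
Hash shown earlier; the other three sentences fill in the rest.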

Here's the other side of that class, text writing:

  class MarkovChainer
    # ...
  
    def generate_sentence
      res = @beginnings[rand(@beginnings.size)].dup
      loop {
        unless nw = next_word_for(res[-order, order])
          return res[0..-2].join(" ") + res.last
        end
        res << nw
      }
    end
  
    private
  
    # ...
  
    def next_word_for(words)
      arr = @freq[words]
      arr && arr[rand(arr.size)]
    end
  
  end

Not much here, as you can see. With the data structure built, the problem
practically solves itself.

The interface method, generate_sentence(), starts by selecting a beginning from
the Array of possible beginnings. It then iterates, using next_word_for() to
grab follow-up words until nil is returned (which happens after a sentence
terminator, since no follow-ups were recorded for those). Whatever we have at
that point becomes the returned sentence, after adding a sprinkle of whitespace
between the words.
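With a toy chain where every state has exactly one follower, the loop's
behavior is fully deterministic (the data below is invented for illustration):

```ruby
order = 1
freq  = { ["Murder"] => ["was"], ["was"] => ["announced"],
          ["announced"] => ["."] }
beginnings = [%w[Murder]]
res = beginnings[rand(beginnings.size)].dup  # dup so the beginning survives
loop do
  followers = freq[res[-order, order]]
  break unless followers           # nil after the terminator: sentence done
  res << followers[rand(followers.size)]
end
res[0..-2].join(" ") + res.last  # => "Murder was announced."
```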

The helper method, next_word_for(), just does the lookup of the frequency
Array. Don't let that fancy last line throw you; it's just a shortcut to keep
from calling []() when arr is nil.

A huge thank you to all who came out of hiding to get the quiz rolling again. I
just knew I would find a problem you couldn't resist eventually.

The response to the Ruby Quiz contest has been strong and tomorrow we will start
our run of contributed quizzes with an offering from Pat Eyler...