Here is my solution to this very interesting quiz.
My MarkovChainer can work with any order and is sentence oriented: I split the input text with /([.!?])/ and then scan each sentence with /[\w']+/, so all punctuation except ".", "!", and "?" is ignored.
It also stores the first <order> words of each sentence, to use them as start words when generating sentences.
The script accepts command-line arguments for the order (-o) and the number of sentences to generate (-n); the defaults are -o2 -n10. It then uses ARGF.read as the input text and prints n generated sentences.
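The splitting and scanning described above can be sketched like this (a standalone illustration, not a fragment of the actual script):

```ruby
# split with a capturing group keeps the terminators in the result,
# so sentence text and terminator alternate in the array.
text  = "Hello there! How are you, Jim? Fine."
parts = text.split(/([.!?])/)
# parts == ["Hello there", "!", " How are you, Jim", "?", " Fine", "."]

# scanning a sentence keeps only word characters and apostrophes,
# so the comma is dropped
words = " How are you, Jim".scan(/[\w']+/)
# words == ["How", "are", "you", "Jim"]
```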
Now let's talk a bit about the data structure.
My first implementation used trees of hashes, and because symbols are faster as hash keys than strings, I stored the words as symbols:
For the text "q s a w s. a s w q s. q w s a a w. s q a w s.", that looked like:
{:q=>{:s=>{:"."=>1, :a=>1}, :a=>{:w=>1}, :w=>{:s=>1}},
:s=>{:q=>{:a=>1}, :w=>{:q=>1}, :a=>{:a=>1, :w=>1}},
:w=>{:q=>{:s=>1}, :s=>{:"."=>2, :a=>1}},
:a=>{:s=>{:w=>1}, :a=>{:w=>1}, :w=>{:"."=>1, :s=>2}}}
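For illustration, such an order-2 tree can be built from a tokenized sentence (terminator included as a pseudo-word) roughly like this; add_to_tree is a made-up helper for this sketch, not part of my solution:

```ruby
# Count each (w1, w2) -> w3 transition in a nested hash of hashes.
def add_to_tree(tree, words)
  words.each_cons(3) do |w1, w2, w3|
    level  = (tree[w1.to_sym] ||= {})
    counts = (level[w2.to_sym] ||= Hash.new(0))
    counts[w3.to_sym] += 1
  end
end

tree = {}
add_to_tree(tree, %w[q s a w s .])
tree[:q][:s]  # the successors of "q s" with their counts
tree[:w][:s]  # "w s" was followed once by the terminator
```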
Then I thought it might be faster to put all the last words into one array instead of building a frequency hash, because I built that array anyway when looking for the next word. That looked like:
{:q=>{:s=>[:a, :"."], :a=>[:w], :w=>[:s]},
:s=>{:q=>[:a], :a=>[:w, :a], :w=>[:q]},
:a=>{:s=>[:w], :a=>[:w], :w=>[:s, :".", :s]},
:w=>{:q=>[:s], :s=>[:".", :a, :"."]}}
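With this layout, choosing the next word is just a uniform pick from the array; repeated entries automatically carry the frequencies. A small sketch with the order-2 data from above:

```ruby
# :s appears twice in the successor array, so it is picked
# twice as often as the terminator :"."
chain = { :a => { :w => [:s, :".", :s] } }
candidates = chain[:a][:w]
nxt = candidates[rand(candidates.size)]
```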
And it also was faster. Then I wanted to know whether symbols really are faster than strings:
{"w"=>{"q"=>["s"], "s"=>[".", "a", "."]},
"a"=>{"a"=>["w"], "w"=>["s", ".", "s"], "s"=>["w"]},
"q"=>{"a"=>["w"], "w"=>["s"], "s"=>["a", "."]},
"s"=>{"w"=>["q"], "a"=>["w", "a"], "q"=>["a"]}}
It turned out that strings are faster. I think that is because converting a symbol to a string (when generating the sentences) needs one hash lookup and creates a new String object, so the faster hash lookup of symbols doesn't pay off.
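A micro-benchmark along these lines shows the effect (only a sketch of the methodology, not the exact test I ran; the numbers depend on the Ruby version):

```ruby
require "benchmark"

sym_chain = { :a => { :w => [:s, :".", :s] } }
str_chain = { "a" => { "w" => ["s", ".", "s"] } }
n = 200_000

Benchmark.bm(8) do |b|
  b.report("symbols") do
    n.times do
      arr = sym_chain[:a][:w]
      arr[rand(arr.size)].to_s  # to_s needed when emitting the sentence
    end
  end
  b.report("strings") do
    n.times do
      arr = str_chain["a"]["w"]
      arr[rand(arr.size)]       # already a String, nothing to convert
    end
  end
end
```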
And finally I wanted to see if at least the hash tree is worth it. I tried the following:
{["q", "a"]=>["w"],
["q", "w"]=>["s"],
["s", "q"]=>["a"],
["w", "q"]=>["s"],
["s", "w"]=>["q"],
["a", "s"]=>["w"],
["w", "s"]=>[".", "a", "."],
["s", "a"]=>["w", "a"],
["q", "s"]=>["a", "."],
["a", "a"]=>["w"],
["a", "w"]=>["s", ".", "s"]}
Well, the hash tree also turned out to be slower than this simple flat hash: multiple hash lookups are slower than computing the hash of a multi-element array and doing one lookup.
(I also tried joined strings as keys (e.g. "q a"=>["w"]), but they are slower than array keys because more transformations are needed.)
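In the flat layout, looking up the successors of the last <order> words is a single hash access with an array key:

```ruby
freq = {
  ["w", "s"] => [".", "a", "."],
  ["s", "a"] => ["w", "a"],
}
key = ["w", "s"]             # the last two generated words
candidates = freq[key]       # one lookup; Array#hash covers both words
nxt = candidates[rand(candidates.size)]
# "." is twice as likely as "a" here
```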
So, the moral of all this:
- don't use symbols if they have to be converted to a string often
- hash lookups might be slower than you think
- premature optimization...
Dominik
class MarkovChainer
  attr_reader :order

  def initialize(order)
    @order = order
    @beginnings = []
    @freq = {}
  end

  def add_text(text)
    # make sure each paragraph ends with some sentence terminator;
    # work on a copy so the caller's string is not mutated
    text = text.gsub(/\n\s*\n/m, ".") + "."
    seps = /([.!?])/
    sentence = ""
    text.split(seps).each { |p|
      if seps =~ p
        add_sentence(sentence, p)
        sentence = ""
      else
        sentence = p
      end
    }
  end

  def generate_sentence
    # dup, so that appending words doesn't mutate the stored beginning
    res = @beginnings[rand(@beginnings.size)].dup
    loop {
      unless nw = next_word_for(res[-order, order])
        return res[0..-2].join(" ") + res.last
      end
      res << nw
    }
  end

  private

  def add_sentence(str, terminator)
    words = str.scan(/[\w']+/)
    return unless words.size > order # ignore short sentences
    words << terminator
    buf = []
    words.each { |w|
      buf << w
      if buf.size == order + 1
        (@freq[buf[0..-2]] ||= []) << buf[-1]
        buf.shift
      end
    }
    @beginnings << words[0, order]
  end

  def next_word_for(words)
    arr = @freq[words]
    arr && arr[rand(arr.size)]
  end
end
if $0 == __FILE__
  order = 2
  n = 10
  while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
    if $1 == "o"
      order = Integer($2)
    else
      n = Integer($2)
    end
    ARGV.shift
  end
  mc = MarkovChainer.new(order)
  mc.add_text(ARGF.read)
  n.times {
    puts mc.generate_sentence
  }
end