[QUIZ] Markov Chains (#74)

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion.

···

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is about text generation. That's right, we're going to
teach your computer to weave tall tales.

At its most basic level, a solution might be:

  >> (1..30).map { (("a".."z").to_a + [" "] * 10)[rand(36)] }.join
  => "fb mcr hhluesjbhtf swm eehokmi"

However, let's make our goal to get as close to English looking sentences as
possible. One way you might do this is using a technique called Markov Chains.

To use Markov Chains, you read some text document(s), making note of which
characters commonly follow which characters or which words commonly follow other
words (it works for either scale). Then, when generating text, you just select
a character or word to output, based on the characters or words that came before
it.

The number of previous characters or words considered is called the "order" and
you can adjust that to try and find a natural feel. For example, here is some
generated text using a second order word chain derived from the Sherlock Holmes
novel "The Hound of the Baskervilles" by Arthur Conan Doyle:

  The stars shone cold and bright, while a crushing weight of responsibility
  from my shoulders. Suddenly my thoughts with sadness. Then on the lady's
  face. "What can I assist you?"

If you need text's to prime your program with, I suggest searching Project
Gutenberg:

  http://www.gutenberg.org/

I'm pumped! I've been reading this mailing list for the last 3 months
and I finally feel ready to try out a quiz. This one sounds fun!

···

On Fri, 2006-04-07 at 22:02 +0900, Ruby Quiz wrote:

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is about text generation. That's right, we're going to
teach your computer to weave tall tales.

At its most basic level, a solution might be:

  >> (1..30).map { (("a".."z").to_a + [" "] * 10)[rand(36)] }.join
  => "fb mcr hhluesjbhtf swm eehokmi"

However, let's make our goal to get as close to English looking sentences as
possible. One way you might do this is using a technique called Markov Chains.

To use Markov Chains, you read some text document(s), making note of which
characters commonly follow which characters or which words commonly follow other
words (it works for either scale). Then, when generating text, you just select
a character or word to output, based on the characters or words that came before
it.

The number of previous characters or words considered is called the "order" and
you can adjust that to try and find a natural feel. For example, here is some
generated text using a second order word chain derived from the Sherlock Holmes
novel "The Hound of the Baskervilles" by Arthur Conan Doyle:

  The stars shone cold and bright, while a crushing weight of responsibility
  from my shoulders. Suddenly my thoughts with sadness. Then on the lady's
  face. "What can I assist you?"

If you need text's to prime your program with, I suggest searching Project
Gutenberg:

  http://www.gutenberg.org/

I think it would be really neat to apply this to music composition. I'm
just not sure how you would get a corpus for that (maybe parsing
lilypond files?).

--Aaron

···

On Fri, Apr 07, 2006 at 10:02:28PM +0900, Ruby Quiz wrote:

This week's Ruby Quiz is about text generation. That's right, we're going to
teach your computer to weave tall tales.

At its most basic level, a solution might be:

  >> (1..30).map { (("a".."z").to_a + [" "] * 10)[rand(36)] }.join
  => "fb mcr hhluesjbhtf swm eehokmi"

However, let's make our goal to get as close to English looking sentences as
possible. One way you might do this is using a technique called Markov Chains.

Ruby Quiz wrote:

  The stars shone cold and bright, while a crushing weight of responsibility
  from my shoulders. Suddenly my thoughts with sadness. Then on the lady's
  face. "What can I assist you?"
  
I can't tell if this is a bug, or success beyond my wildest dreams:

···

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
She wore handily the woven craft of lips. Besotted! My anglican parrot wenches freely. Three dozen Iceland begin to unwind. Hope. Charity. My outrage cannot succumb to rainswept chance. Now chance he the broadly winding spar.

This stock is on the move! You'll never fail to satisfy her again! Home loans from 0.025%!

V t A G R A from $ 3.50 per item
C 1 @ L I S from $3.75 per item
V Ax L 1 U M from $1.20 per item

Premier PARMACY - your c e r t i f i e d online parmacy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I took usenet for my corpus, then used a genetic algorithm with SpamAssassin for scoring.

And, uh, I wrote it 6 days ago.... :wink:

-dB

--
David Brady
ruby_talk@shinybit.com
"Every normal man must be tempted, at times, to spit on his hands, hoist the Jolly Roger, and start slitting throats." -- H. L. Mencken

Hi all-

Here is my first attempt at a Ruby Quiz (attached file: 'markovtext.rb'). I'm new to Ruby, mostly paying the bills over the past years with Perl work. This quiz was fun, and I learned a lot from doing this. Coming from Perl, I often rely upon auto-vivification, so I needed to figure out how to work around this. Perhaps there are improved ways of going about this, and I'd appreciate any feedback on how to go about it better.

My solution is fairly straight forward, using a second order word chain, which I imagine many others will be implementing in similar ways. As noted in the comments, I only start making my paragraph on the second sentence, in order to get greater diversity in how the paragraph starts. I'm also striping out some of the punctuation from the end result, as some texts give orphaned punctuation marks.

I had the most fun from the quiz running my script against various texts, and seeing the results. My favorite has been getting results from "Huckleberry Finn", though the grammar gets a bit off at times, but that relates to the input text. For comparison, one can see much more typical sentence structure from "Journey to the Center of the Earth".

Example results:

markovtext.rb (3.23 KB)

···

==

Huck Finn via 'markovtext.rb':

And above all, don't you try to run in the dark he says: Yes, Mars Sid, A dog. Cur'us dog, too. Does you want it, you can go and write poetry after them out loud. One bill said, The celebrated Dr. Armand de Montalban, of Paris, would lecture on the floor and tied up the bank. I couldn't help it, and she snatched a kiss of him, and never think about it; and I went along up the smoothness and the judge for him, being a mystery, and he'd cipher out a speech, all full of it, and I would kill me, dey sk'yers me so.

==

Journey to the Center of the Earth via 'markovtext.rb':

I gave way to the south-east. We have passed through the treatises of Cassanion, and all his time at Altona, staying with a translation? This IS the Icelandic Professor. At this rate we shall see. So says the Professor, no doubt, was pursuing his observations or taking notes, for in three days we must go back to the dull thuds of the country might be held up by the descent commenced. I can but try Spanish, French, Italian, Greek, or Hebrew.

==

Cheers,
-albert

Here is my solution. It was so much fun I couldn't help but keep making
small tweaks. Here are a few of my favourite bits of output with varying
word limits, from various sources (including rubytalk):

markov.rb (9.28 KB)

···

====
thou wast question'd my algorithm? -- posted delight hath.

FasterCSV can also reason, having methods 2006 08:58 am, after lunch.

April 1st joke. I'm subscribed to ruby-dev and the other hand, I've
thought of it...hrrmmm... And now I fear, that the Mozilla Foundation
relicensed everything as MPL and GNU GPL provides. It is unfortunate
that it should be able to check against multiple types as well Hi.

Ruby Quiz: 1. Please do not hesitate dive in to it, on the C:\ruby\lib
\ruby\1.8\cgi\session.rb file, $350 convoluted I'm licenses David 11:26
install the 'ruby-db2-0.4' package and all that...

I'll blow this police whistle from my shoulders. Suddenly and makes it
incompatible. But I might have agreed tosuch without having to support
meta-progarmming. Perhaps an example how to handle initialize_copy: The
mixin manages some attributes and the other hand, I've thought of
it...hrrmmm...

Before using it, you have to give it some knowledge - the -l option
accepts a filename or URL to learn from (omit the filename for stdin).
After chomping through it, the knowledge will get dumped out to a file
ready to be used for generation. You can do this multiple times to
incrementally teach it.

If you want to use a large body of input I recommend storing each dump
in it's own file, which you can do with the -f option. Then, specify one
or more -f options when running generation to use the specified
knowledge file(s).

Once you've got some input, run markov.rb, specifying your -f options if
you're not using the default 'chainstore' file, and optionally
generation parameters such as -w N to limit to at most N words.

There's a few options you can pass to control both learning and
generation, there's a bit of doc available from the -h option.

I posted a few chomped input files at:
  http://roscopeco.co.uk/code/ruby-quiz-entries/74/
  http://roscopeco.co.uk/code/ruby-quiz-entries/74/README

Thanks again for this quiz, I've gotten a fair few lol moments out of
it :slight_smile:

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

Great quiz. It will be interesting to see how others have solved this
problem.
Here is my submission.

To use run the following:

$ cat <some_text_file> | ./rand_text.rb

Or you can give options

$ cat <some_text_file> | ./rand_text.rb -o 2 -n 10

-o : the order. Which is the number of previous words to consider
-n : the number of sentences to output

I used an hash of arrays to keep track of the possible state transitions.
The key is the current state, and the contents of the array is the possible
next states. When generating the output I randomly select elements from this
array. I always start with the first 'n' number of words in the original
text, where 'n' is the order.

There is a sample output where <some_text_file> is Moby Dick and using the
default parameters of order = 2, and number of sentences = 10.

rand_text.rb (2.91 KB)

···

==
Call me Ishmael.
Some years ago--never mind how long precisely --having little or no money in
my soul; whenever I find myself involuntarily pausing before coffin
warehouses, and bringing up the rear of every kind whatsoever.
It is a damp, drizzly November in my purse, and nothing particular to
interest me on shore, I thought I would sail about a little and see the
mummies of those creatures in their huge bake-houses the pyramids.
No, when I go to sea, I go to sea as a Commodore, or a Captain, or a Cook.
I abandon the glory and distinction of such offices to those who like them.
For my part, I abominate all honorable respectable toils, trials, and
tribulations of every kind whatsoever.
It is quite as much as I can.
This is my substitute for pistol and ball.
With a philosophical flourish Cato throws himself upon his sword; I quietly
take to the royal mast-head.
True, they rather order me about--however they may thump and punch me about,
I have of driving off the spleen, and regulating the circulation

On 4/7/06, Ruby Quiz <james@grayproductions.net> wrote:

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz
until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby Talk follow the discussion.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is about text generation. That's right, we're going
to
teach your computer to weave tall tales.

At its most basic level, a solution might be:

        >> (1..30).map { (("a".."z").to_a + [" "] * 10)[rand(36)] }.join
        => "fb mcr hhluesjbhtf swm eehokmi"

However, let's make our goal to get as close to English looking sentences
as
possible. One way you might do this is using a technique called Markov
Chains.

To use Markov Chains, you read some text document(s), making note of which
characters commonly follow which characters or which words commonly follow
other
words (it works for either scale). Then, when generating text, you just
select
a character or word to output, based on the characters or words that came
before
it.

The number of previous characters or words considered is called the
"order" and
you can adjust that to try and find a natural feel. For example, here is
some
generated text using a second order word chain derived from the Sherlock
Holmes
novel "The Hound of the Baskervilles" by Arthur Conan Doyle:

        The stars shone cold and bright, while a crushing weight of
responsibility
        from my shoulders. Suddenly my thoughts with sadness. Then on the
lady's
        face. "What can I assist you?"

If you need text's to prime your program with, I suggest searching Project
Gutenberg:

        http://www.gutenberg.org/

Here's my solutions. I wrote this 2.5 times. The first version worked, but it was slow as molasses. The second one is much much faster. The 2.5th inherits from the second one and produces more readable, complete sentences

first, slow version:
% cat markov.rb
class WordTable
   def initialize(src, separator = /\W+/, word_shape = /\w+/)
     @src = src
     @separator = separator
     @word_shape = word_shape
   end

   def build_table
     @table = Hash.new { |h, k| h[k] = [] }
     pos = 0
     while line = @src.gets
       line = line.chomp.downcase
       words = line.split(@separator).select { |potential_word|
         potential_word.match(@word_shape)
       }
       words.each do |word|
         @table[word] << pos
         pos += 1
       end
     end
     self
   end
   def words
     @table.keys
   end

   def positions_for(word)
     @table[word]
   end

   def followers_for(word)
     positions = positions_for(word)
     followers_positions = positions.map { |i| i + 1 }
     following_words = self.words.select { |key_word|
       followers_positions.any? { |pos| positions_for(key_word).include?(pos) }
     }
     following_words
   end
end

class ChainBuilder
   attr_accessor :chain_length
   def initialize(initial_word, word_table, chain_length = 10)
     @initial_word = initial_word
     @word_table = word_table
     @chain_length = chain_length
   end

   def distributed_followers(word)
     distd_followers = []
     followers = @word_table.followers_for(word)
     positions_of_word = @word_table.positions_for(word)
     followers.each do |follower|
       follower_positions = @word_table.positions_for(follower)
       count = follower_positions.select { |pos|
         prec = pos - 1
         positions_of_word.include?(prec)
       }.length
       distd_followers += [follower] * count
     end
     distd_followers
   end
   def randomized_next_word(word)
     choices = distributed_followers(word)
     choices[ rand(choices.length) ]
   end
   def chain
     res_chain = [@initial_word]
     (self.chain_length - 1).times do
       res_chain << randomized_next_word(res_chain.last)
     end
     res_chain
   end
end

if $0 == __FILE__
   if ARGV[0].nil?
     STDERR.puts "Please provide a corpus."
     STDERR.puts "#$0 usage: #$0 <corpus file name> [chain length] [initial word]"
     exit 1
   end
   chain_len = (ARGV[1] || 10).to_i
   wt = nil

   File.open(ARGV[0]) do |file|
     #wt = WordTable.new(file, //, /./) # try by characters
     wt = WordTable.new(file)
     wt.build_table
   end
   init_word = (ARGV[2] || wt.words[ rand( wt.words.length ) ] )

   chain_builder = ChainBuilder.new(init_word, wt, chain_len)
   chain = chain_builder.chain
   puts chain.join(' ').capitalize
end

Second, quicker version:
% cat markov2.rb
class MarkovChainBuilder
   class Entry < Struct.new(:word_index, :successors)
   end
   def initialize(src, separator = /\W+/, word_shape = /\w+/, downcase = true)
     @src = src
     @separator = separator
     @word_shape = word_shape
     @downcase = downcase
     build_tables
   end
   def build_tables
     @word_list = []
     @successor_lists = {}
     last_word = nil
     @src.each do |line|
       line = line.downcase if @downcase
       line = line.chomp
       words_for_line = line.split(@separator).select{ |w| w.match(@word_shape)}
       words_for_line.each do |word|
         unless @successor_lists.has_key? word
           entry = @successor_lists[word] = Entry.new
           @word_list << word
           entry.word_index = @word_list.length - 1
           entry.successors = []
         end

         unless last_word.nil?
           @successor_lists[last_word].successors << @successor_lists[word].word_index
         end
         last_word = word
       end
     end
   end
   def distributed_successors_for(word)
     @successor_lists[word].successors.map { |index| @word_list[index] }
   end
   def randomized_next_for(word)
     succs = distributed_successors_for(word)
     succs[ rand(succs.length) ]
   end
   def markov_chain(initial_word, len = 10)
     res = [initial_word]
     (len - 1).times do
       res << randomized_next_for(res.last)
     end
     res
   end
   def words
     @word_list
   end
   private :build_tables
end

if $0 == __FILE__
   if ARGV[0].nil?
     STDERR.puts "Please provide a corpus."
     STDERR.puts "#$0 usage: #$0 <corpus file name> [chain length] [initial word]"
     exit 1
   end
   chain_len = (ARGV[1] || 10).to_i
   mc = nil

   File.open(ARGV[0]) do |file|
     mc = MarkovChainBuilder.new(file)
   end
   init_word = (ARGV[2] || mc.words[ rand( mc.words.length ) ] )

   chain = mc.markov_chain(init_word, chain_len)
   puts chain.join(' ').capitalize
end

And now the third version which is basically the second version, but looks for sentences:
% cat smarter_markov.rb
require 'markov2'
class SmarterMarkovChainBuilder < MarkovChainBuilder
   def initialize(src)
     super(src, /\s+/, /\S+/, false)
   end
   def markov_chain(initial_word, len = 10)
     initial_chain = super
     index_of_last_word = nil
     initial_chain.each_index do |idx|
       if initial_chain[idx] =~ /\.|\?|!/
         index_of_last_word = idx
         break
       end
     end
     if index_of_last_word
       initial_chain[0..index_of_last_word]
     else
       initial_chain
     end
   end
end
if $0 == __FILE__
   mc = nil
   File.open(ARGV[0]) do |file|
     mc = SmarterMarkovChainBuilder.new(file)
   end
   start_words = mc.words.select { |w| w =~ /^[A-Z]/ }
   init_word = start_words[ rand(start_words.length) ]
   puts mc.markov_chain(init_word, 200).join(' ').gsub(/"/,'')
end

Hi all,

here is my solution. Its a rather stupid approach, but as many simple
and short solutions its better than i had expected.

It uses some gsubs to simplify the input text and surround each word
and punctuation by exactly one space character.

Scan is used to find occurrences of the last <order> words and
store the next words respectively. (even multiple times)

From this array one is choose randomly (which preserves the original
frequency of occurrence because the words are stored multiple times
in the array, increasing the chance of being picked)

Thats it. It works well for files of several MB and gets even faster
if the order is higher.

usage: quiz74.rb <order> [input files]

···

--------------------------------------------------------------------
before, line, order = ['.'], '', ARGV.shift.to_i

txt = ARGF.read.tr('"/*\\()[]', ' ').downcase
txt = txt.gsub(/[^'\w]/, ' \0 ').gsub(/\s+/, ' ')

500.times do
  ary = txt.scan(/ #{before.join(' ')} (\S+)/).transpose.first
  before << ary[rand(ary.size)]
  before.shift if before.length > order
  (puts line; line = '') if line.length > 50 && /\w/ =~ before.last
  line << ( /\w/ =~ before.last && !line.empty? ? ' ' : '') << before.last
end
--------------------------------------------------------------------

cheers

Simon

Hi,

this is the first time i participate in a ruby-quiz. It was quite
funny programming this, and testing it out.
I know that my solution is far from perfect, but since I
invested some time in it (I'm quite new to ruby, so had to look up a
lot of stuff) I'm submitting it.
It does nothing fancy, and it misses some error-checking. The main data
structure is a Hash, it's explained in the code.
If I did something completely stupid it would be nice if someone tells
me :slight_smile:

Thanks,
Tom

markov.rb (2.18 KB)

Here is my solution to this very interesting quiz.

My MarkovChainer can work with any order and it is sentence oriented: I split the input text with /([.!?])/ and then scan each sentence with /[\w']+/, so all punctuation (except ".", "!" and "?") is ignored.
It also stores the first <order> words of each sentence, to use them as start words when generating sentences.

The script accepts command line arguments for the order (-o) and the number of sentences to generate (-n), the defaults are -o2 -n10. Then it uses ARGF.read as input text and prints n generated sentences.

Now lets talk a bit about the data structure:

My first implementation used trees of hashes and because symbols are faster as hash keys than strings, I stored the words as symbols:

For the text "q s a w s. a s w q s. q w s a a w. s q a w s.", that looked like:

{:q=>{:s=>{:"."=>1, :a=>1}, :a=>{:w=>1}, :w=>{:s=>1}},
  :s=>{:q=>{:a=>1}, :w=>{:q=>1}, :a=>{:a=>1, :w=>1}},
  :w=>{:q=>{:s=>1}, :s=>{:"."=>2, :a=>1}},
  :a=>{:s=>{:w=>1}, :a=>{:w=>1}, :w=>{:"."=>1, :s=>2}}}

Then I thought it might be faster to put all the last words into one array instead of building a frequency hash, because I built that array anyway, when I was looking for the next word: That looked like:

{:q=>{:s=>[:a, :"."], :a=>[:w], :w=>[:s]},
  :s=>{:q=>[:a], :a=>[:w, :a], :w=>[:q]},
  :a=>{:s=>[:w], :a=>[:w], :w=>[:s, :".", :s]},
  :w=>{:q=>[:s], :s=>[:".", :a, :"."]}}

And it also was faster. Then I wanted to know if symbols really are faster then strings:

{"w"=>{"q"=>["s"], "s"=>[".", "a", "."]},
  "a"=>{"a"=>["w"], "w"=>["s", ".", "s"], "s"=>["w"]},
  "q"=>{"a"=>["w"], "w"=>["s"], "s"=>["a", "."]},
  "s"=>{"w"=>["q"], "a"=>["w", "a"], "q"=>["a"]}}

It turned out that strings are faster. I think that is because converting a symbol to a string (when generating the sentences) needs one hash lookup and generates a new String object. So the faster hash lookup of symbols doesn't pay off.

And finally I wanted to see if at least the hash tree is worth it. I tried the following:

{["q", "a"]=>["w"],
  ["q", "w"]=>["s"],
  ["s", "q"]=>["a"],
  ["w", "q"]=>["s"],
  ["s", "w"]=>["q"],
  ["a", "s"]=>["w"],
  ["w", "s"]=>[".", "a", "."],
  ["s", "a"]=>["w", "a"],
  ["q", "s"]=>["a", "."],
  ["a", "a"]=>["w"],
  ["a", "w"]=>["s", ".", "s"]}

Well, the hash tree also turned out to be slower than this simple flat hash (multiple hash lookups are slower than computing the hash of a multi element array and one lookup).
(I also tried joined strings as keys (e.g. "q a"=>["w"]), but they are slower than the array keys, because more transformations are needed)

So, the morale of all this:
- don't use symbols if they have to be converted to a string often
- hash lookups might be slower than you think
- premature optimization...

Dominik

class MarkovChainer
   attr_reader :order
   def initialize(order)
     @order = order
     @beginnings = []
     @freq = {}
   end

   def add_text(text)
     # make sure each paragraph ends with some sentence terminator
     text.gsub!(/\n\s*\n/m, ".")
     text << "."
     seps = /([.!?])/
     sentence = ""
     text.split(seps).each { |p|
       if seps =~ p
         add_sentence(sentence, p)
         sentence = ""
       else
         sentence = p
       end
     }
   end

   def generate_sentence
     res = @beginnings[rand(@beginnings.size)]
     loop {
       unless nw = next_word_for(res[-order, order])
         return res[0..-2].join(" ") + res.last
       end
       res << nw
     }
   end

   private

   def add_sentence(str, terminator)
     words = str.scan(/[\w']+/)
     return unless words.size > order # ignore short sentences
     words << terminator
     buf = []
     words.each { |w|
       buf << w
       if buf.size == order + 1
         (@freq[buf[0..-2]] ||= []) << buf[-1]
         buf.shift
       end
     }
     @beginnings << words[0, order]
   end

   def next_word_for(words)
     arr = @freq[words]
     arr && arr[rand(arr.size)]
   end

end

if $0 == __FILE__
   order = 2
   n = 10
   while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
     if $1 == "o"
       order = Integer($2)
     else
       n = Integer($2)
     end
     ARGV.shift
   end
   mc = MarkovChainer.new(order)
   mc.add_text(ARGF.read)
   n.times {
     puts mc.generate_sentence
   }
end

I'm glad.

These can be quite a bit of fun, depending on what text you prime them with... :wink:

James Edward Gray II

···

On Apr 7, 2006, at 8:53 AM, Charlie Bowman wrote:

I'm pumped! I've been reading this mailing list for the last 3 months
and I finally feel ready to try out a quiz. This one sounds fun!

or ascii based tab files for us banjo pickers!

···

On Sat, 2006-04-08 at 00:06 +0900, Aaron Patterson wrote:

On Fri, Apr 07, 2006 at 10:02:28PM +0900, Ruby Quiz wrote:
>
> This week's Ruby Quiz is about text generation. That's right, we're going to
> teach your computer to weave tall tales.
>
> At its most basic level, a solution might be:
>
> >> (1..30).map { (("a".."z").to_a + [" "] * 10)[rand(36)] }.join
> => "fb mcr hhluesjbhtf swm eehokmi"
>
> However, let's make our goal to get as close to English looking sentences as
> possible. One way you might do this is using a technique called Markov Chains.

I think it would be really neat to apply this to music composition. I'm
just not sure how you would get a corpus for that (maybe parsing
lilypond files?).

--Aaron

Good quiz this one.

Markov Chains are highly practical creations, we use them extensively when
calulating life insurance and pensions schemes in Mercer.

Wikipedia has a nice summary of other applications, I recommend it hihgly:

At University, they taught us Markov Chains multiple times for various
uses, but none of them used this linguistic approach. It's very intuitive,
and so much more fun than the hard-core introductions we were given. So if
I ever teach, I'll teach this quiz.

Here we go:

As primer for this writeup I've used Agatha Cristie's "The mysterious
affair at Styles", a 57_000 words english novel starring Poirot, her
famous detective.

To establish the chain of words I use a Hash of Hashes, where the first is
a simple lookup of words, and the second containt the words that folow.
The second also counts how many times each word follows the leader.

Example:

Murder of Emily Agnes...
Murder against Alfred Inglethorp ...
Murder against some person ...
Murder against him right ...

@words["Murder"] >> {"of"=>1, "against"=>3}

The counts are to weigh the randomness when selecting the next word. I
wanted a selection process where rare followers in the real world are rear
in this world as well. After 'Murder', 'of' will pop up every forth time.
This is done here

    sum = candidates.inject(0) {|sum,kv| sum += kv[1]}
    random = rand(sum)+1
    partial_sum = 0
    next_word = candidates.find do |word, count|
      partial_sum += count
      partial_sum >= random
    end.first

sum is the total number of followers (4 for 'Murder'). I pick a random
number 0..3, add +1 for indexing and iterate my way up the list of
followers, and keep track of how many words I have passed on my way. When
we reach (or pass for the first time) the random limit, the word at hand
is the word we want. Since the block finds the first ['word', count] pair
that satifies our condition, (f.ex ['against', 3]), I use .first to get
it.

That's it. Now just build a sentence by starting with a word and move
along it's path. My syntectic rules are very simple, a sentence is
complete when it containts a period ".", thereby opening up for som silly
grammar. But it doesn't really matter, grammar processing isn't the text
of the day.

The results are good enoughby far, and quite entertaining too. But alas,
even Mr Poirot would be hard pressed to solve this case.

Output # 1

Murder of sight. Mary Cavendish was vital to connect the charred paper. "A
great risk, it was a book?" "Yes, often. I could certainly had already
been tested!" I was before I asked.

Output # 2

Murder against Alfred Inglethorp, I fancied that, mon ami, it is broken
bell, and the matter, hoping that a note." Miss Howard had a day we came
to warn him more, had kindly offered to make amends; which weighed with
the Hall in the company. Miss Howard was that?" "She made it was the
Hall--you'n a penknife, and went round Styles Court in some time I voiced
the room?" "I was inclined to call for some friends one side. His words of
horrors business to make a very fond of her--yes, I made a message to the
slips of dismissal.

=== code ===

class MarkovChain
  def initialize(text)
    @words = Hash.new
    wordlist = text.split
    wordlist.each_with_index do |word, index|
      add(word, wordlist[index + 1]) if index <= wordlist.size - 2
    end
  end

  def add(word, next_word)
    @words[word] = Hash.new(0) if !@words[word]
    @words[word][next_word] += 1
  end

  def get(word)
    return "" if !@words[word]
    followers = @words[word]
    sum = followers.inject(0) {|sum,kv| sum += kv[1]}
    random = rand(sum)+1
    partial_sum = 0
    next_word = followers.find do |word, count|
      partial_sum += count
      partial_sum >= random
    end.first
    #puts "Selected #{next_word} from #{candidates.inspect}"
    next_word
  end
end

mc = MarkovChain.new(File.read("Agatha Christie - The Mysterious Affair at
Styles.txt"))

sentence = ""
word = "Murder"
until sentence.count(".") == 4
  sentence << word << " "
  word = mc.get(word)
end
puts sentence << "\n\n"

=== code ===

···

--
Jon Egil Strand
Phone: +47 98232340
jes@luretanker.no

My approach isn't terribly complex, but it was fun to write. It
requires that you have a plain text file "book.txt" in the directory of
execution. I used http://www.gutenberg.org/dirs/etext91/alice30.txt for
all of my tests and my favorite phrase that it came up with was:

Beautiful soup beauootiful soooop soooop of evidence said the queen turning to the dormouse.

How is works is loads each word into a hash stripping punctuation and
capitalization associating each word with an array of common words that
it is followed by. One word is picked by random to start us off then it
picks randomly commonly chained words from there till it ends up with a
word that was at the end of sentence in the original document.

Kind Regards,

74.rb (732 Bytes)

···

--------------------------------------------------------------------
Barry Dmytro
badcherry@mailc.net
http://badcherry.org/
--------------------------------------------------------------------

Your code looks very clean to me! A minor thing I noticed is that you could simplify:

   .gsub(/[\"\(\)]/,"")

to:

   .delete('"()')

Can you show an example using Perl's auto-vivification that feels funny in Ruby? Perhaps we would have better ideas after seeing it...

Thanks for the submission!

James Edward Gray II

···

On Apr 9, 2006, at 8:45 AM, Albert Vernon Smith wrote:

Coming from Perl, I often rely upon auto-vivification, so I needed to figure out how to work around this. Perhaps there are improved ways of going about this, and I'd appreciate any feedback on how to go about it better.

Nothing stupid no. Here are a few extremely minor suggestions.

         File.open(filename) do |io|
             io.each do |line|
    ...
             end
         end

That pattern can be simplified to:

   File.foreach(filename) do |line| ... end

And...

             if @frequencies[tupel].include?(word)
                 @frequencies[tupel][word]+=1
             else
                 @frequencies[tupel][word]=1
             end

If you set the Hash to default to a value of 0, you could simplify the above to just:

   @frequencies[tupel][word]+=1

All in all, you're code is quite nice.

Thanks for the submission.

James Edward Gray II

···

On Apr 10, 2006, at 7:08 AM, Tom Rauchenwald wrote:

If I did something completely stupid it would be nice if someone tells me :slight_smile:

class WordHash
  attr_accessor :words
  def initialize(words)
    @words = {}
    words.each_with_index{|x,i|self.add(x,words[i-1],words[i+1])}
  end

  def add(word,prev,nex)
    @words[word] ||= {}
    @words[word]['prev'] ||= Hash.new{|k,v|k[v]=0}
    @words[word]['prev'][prev] += 1
    @words[word]['next'] ||= Hash.new{|k,v|k[v]=0}
    @words[word]['next'][nex] += 1
  end
end
f = ARGF.read.join.tr('"/*\\()[]\'', ' ').downcase
words = f.gsub(/[^'\w]/, ' \0 ').gsub(/\s+/, ' ').split(/\W/)
words = words.map{|x|x.strip}
w = WordHash.new(words)
@k = (a=w.words).keys[rand(a.size)]
wi = [@k.capitalize]
100.times do
  c=(a=w.words[@k]['next']).keys[rand(a.keys.size)]
  if c != ' '
  wi << c
  @k = w.words.keys.find{|y|y==c}||rand(a.keys.size)
  else
    next
  end
end
puts wi.join(" ").gsub(/\s+/,' ')+'.'

···

--
Posted via http://www.ruby-forum.com/.

I used symbols in my solution, but primarily because I was trying to
make the hash itself require as little memory as possible to allow for
vast bodies of input (which I admit I've not really tried yet) and I
figured a hash on symbols would be smaller than a hash on strings.

I guess this means that GC is pushed by lots of string objects created
and destroyed during the generation run, but I tended to aim for memory
efficiency over speed for this one, as long as it was running 'fast
enough'.

···

On Tue, 2006-04-11 at 02:15 +0900, Dominik Bathon wrote:

So, the morale of all this:
- don't use symbols if they have to be converted to a string often
- hash lookups might be slower than you think
- premature optimization...

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

Banjo? I'm pretty sure he said "music", so I'm not sure why you are bringing
up banjos. :slight_smile:

What's the difference between a banjo and a(n)Š

Chain Saw:

   1. a chain saw has a dynamic range.
   2. you can turn a chain saw off.

   3. South American Macaw: one is loud, obnoxious, and noisy; and the other
is a bird.

   4. Harley Davidson Motorcycle: you can tune a Harley.

   5. Onion: no one cries when you cut up a banjo.

   6. Trampoline: you take your shoes off to jump on a trampoline.

   7. Uzi: an uzi only repeats forty times.

Sorry - couldn't help myself on this one :slight_smile:

Keith

···

On 4/7/06 10:14 AM, "Charlie Bowman" <charlie@castlebranch.com> wrote:

or ascii based tab files for us banjo pickers!