If you have 20000 unique words and want to represent the probabilities
of an n-chain of them, you'll need (20000)^n values, i.e.,
1.6*10^17 for n=4. Your data is so sparse that maybe one in a billion
possible 4-chains will actually show up in your probability hash
(and a hash entry actually needs more than one byte, which makes the
memory problem even worse).
So I don't think you'll be able to reduce the memory requirement a
great deal ...
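(To put numbers on that with a quick irb check -- the corpus size below
is only an assumption, there to make the one-in-a-billion point
concrete:)

  cells    = 20_000**4      # => 160_000_000_000_000_000 possible 4-chains
  observed = 200_000_000    # at most one distinct 4-chain per corpus word
  observed.to_f / cells     # => 1.25e-09, roughly one cell in a billion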
D'oh, I was afraid of that.
I can think of two things (not necessarily disjoint).
1.) Group individual words together into classes of words, and
calculate the probabilities for those fewer classes. I'm not quite
sure how well this will work for you, but I'd try to use some kind of
Huffman coding concept on the n-chains:
Huffman coding - Wikipedia (https://en.wikipedia.org/wiki/Huffman_coding)
I'd then break off building the tree at some point and put the
remaining words into a class of rare words ... This would introduce
some overhead in getting back from classes to words, though, and there
will be many words with different meanings that are nevertheless
equally rare.
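A rough sketch of what the class idea could look like in Ruby -- a plain
frequency cutoff standing in for the full Huffman tree, with the cutoff
and the word counts made up:

  RARE = :__rare__

  # Frequent words keep their own class; everything below the cutoff is
  # lumped into a single catch-all class of rare words.
  def build_classes(word_counts, min_count = 5)
    classes = {}
    word_counts.each do |word, count|
      classes[word] = count >= min_count ? word : RARE
    end
    classes
  end

  word_counts = { "the" => 120, "fox" => 7, "sesquipedalian" => 1 }
  classes = build_classes(word_counts)
  # => { "the" => "the", "fox" => "fox", "sesquipedalian" => :__rare__ }

  # The chain is then trained on class sequences instead of raw words:
  %w[the sesquipedalian fox].map { |w| classes.fetch(w, RARE) }
  # => ["the", :__rare__, "fox"]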
Well, this project is mostly for fun (a plugin for an IRC bot, as I
mentioned), so even if 'odd' words are chosen, it just adds to the
fun.
I like this 'word class' idea. I could put words with very few
occurrences into the 'rare' class, and then pick among them with an
auxiliary letter-based Markov chain (which would pick as many
characters as necessary to isolate a word within the class). An
interesting challenge.
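Something along these lines might be the shape of that letter-based
pick -- a hypothetical helper that samples the next character in
proportion to how many rare words share the current prefix, rather
than a full character-level chain trained on a corpus:

  def pick_rare_word(rare_words)
    prefix = ""
    loop do
      candidates = rare_words.select { |w| w.start_with?(prefix) }
      return candidates.first if candidates.size <= 1
      # possible next characters, weighted by how many words share them
      nexts = candidates.map { |w| w[prefix.length] }.compact
      prefix += nexts.sample   # crude stand-in for a weighted Markov step
    end
  end

  pick_rare_word(%w[sesquipedalian serendipity snorkel])
  # => e.g. "serendipity"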
2.) Be less strict with the probability values -- you could assume
that the probability of any particular 6-chain is just the product of
the probabilities of the five successive two-chains it contains (the
independence hypothesis), unless you can reject that hypothesis with a
statistical test (I'd use the chi-square test -- Ruby code for it can
be found in the statistics2 package by Shin-ichiro Hara).
So keep only the two-chain probabilities, plus those longer-chain
probabilities which deviate from the independence hypothesis strongly
enough that the chi-square test rejects it.
You can control how much data is kept this way by changing the alpha
level of the test.
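A minimal sketch of that pruning rule -- the counts are invented, and
the chi-square statistic is compared by hand against the 0.05 critical
value for one degree of freedom instead of going through statistics2:

  # Keep a 3-chain explicitly only if its observed count deviates enough
  # from what the two 2-chains it contains would predict.
  def keep_chain?(observed, expected, critical = 3.841)  # alpha 0.05, 1 df
    return false if expected.zero?
    chi2 = (observed - expected)**2 / expected.to_f
    chi2 > critical            # rejection => store the longer chain
  end

  bigrams  = { %w[the quick] => 40, %w[quick fox] => 25 }
  unigrams = { "quick" => 50 }

  # Expected count of "the quick fox" under the independence hypothesis:
  # count(the quick) * P(fox | quick)
  expected = bigrams[%w[the quick]] *
             (bigrams[%w[quick fox]].to_f / unigrams["quick"])  # => 20.0

  keep_chain?(22, expected)    # => false, the 2-chains explain it fine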
There is a book by Jeremy Spinrad, "Efficient Graph Representations",
where you can find further ideas on how to represent a graph.
This sounds like an excellent idea. I'll look up these references,
thanks.
And something else:
What is the probability of an n-chain of words that wasn't actually
encountered in the training set? It's not zero, even though the hash
created from the training data says so ...
If you think of the incidence matrix A=(a_ij) representing the
probabilities of finding a 2-chain of words i-j, all entries for which
no such chain was observed are zero. Now, in a technique called
"Latent Semantic Analysis"
(https://en.wikipedia.org/wiki/Latent_semantic_analysis)
this matrix is reconstructed as a product of matrices of lower rank,
which introduces non-zero values where the entries used to be zero.
Thus, all of a sudden, there is a small probability of finding that
particular chain after all.
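A small sketch of that reconstruction in plain Ruby -- no linear-algebra
library, just power iteration for the dominant singular pair of a toy
count matrix, enough to show the zero cells picking up small non-zero
values:

  def mat_vec(a, v)    # A * v, for a matrix given as an array of rows
    a.map { |row| row.zip(v).inject(0.0) { |s, (x, y)| s + x * y } }
  end

  # Rank-1 approximation A ~ u * v^T via power iteration.
  def rank1_approximation(a, iterations = 100)
    v = Array.new(a.first.size, 1.0)
    iterations.times do
      u = mat_vec(a, v)
      v = mat_vec(a.transpose, u)
      norm = Math.sqrt(v.inject(0.0) { |s, x| s + x * x })
      v = v.map { |x| x / norm }
    end
    u = mat_vec(a, v)
    u.map { |ui| v.map { |vj| ui * vj } }   # outer product u * v^T
  end

  counts = [[4.0, 0.0, 1.0],
            [2.0, 1.0, 0.0],
            [0.0, 3.0, 2.0]]
  approx = rank1_approximation(counts)
  # cells that were zero in counts (e.g. approx[0][1]) are now small
  # positive numbers -- after renormalising, those are the non-zero
  # probabilities for the previously unseen chains.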
Aha, excellent link, thanks. This is really interesting. I wonder if
it's possible to take it to higher orders and write the n-th level
tensor as a product of lower-level tensors ... and I wonder what kind
of speed hit it would take.
I join Ed Borasky in suggesting a number-crunching language with a
Ruby interface for dealing with large amounts of data, but maybe you
can reduce it all a bit?
Well, going outside Ruby would sort of defeat some of the (learning)
purposes of this project, but for a serious application it's something
I should really look into, especially for the matrix manipulations
suggested in your last points.
Thanks a lot for the suggestions,
On Jul 30, 6:15 pm, "Axel Etzold" <AEtz...@gmx.de> wrote:
--
Giuseppe "Oblomov" Bilotta