If you have 20000 unique words and want to represent the probabilities
of an n-chain of them, you'll need (20000)^n values, i.e.,
1.6*10^17 for n=4. Your data is so sparse that maybe one in a billion
possible 4-chains will actually show up in your probability hash
(and a hash entry actually needs more than one byte, which makes the
memory problem even worse).
So I don't think you'll be able to reduce the memory requirement a
great deal ...
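(To put numbers on that with a quick irb check -- the corpus size below
is only an assumption, there to make the one-in-a-billion point
concrete:)

  cells    = 20_000**4      # => 160_000_000_000_000_000 possible 4-chains
  observed = 200_000_000    # at most one distinct 4-chain per corpus word
  observed.to_f / cells     # => 1.25e-09, roughly one cell in a billion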
D'oh, I was afraid of that.
I can think of two things (not necessarily disjoint).
1.) Group individual words together into classes of words, and
calculate the probabilities for those fewer classes. I'm not quite
sure how well this will work for you, but I'd try to use some kind of
Huffman coding concept on the n-chains:
Huffman coding - Wikipedia (https://en.wikipedia.org/wiki/Huffman_coding)
I'd then break off building the tree at some point and put the
remaining words into a class of rare words ... This would introduce
some overhead in getting back from classes to words, though, and there
will be many words with different meanings that are nevertheless
equally rare.
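A rough sketch of what the class idea could look like in Ruby -- a plain
frequency cutoff standing in for the full Huffman tree, with the cutoff
and the word counts made up:

  RARE = :__rare__

  # Frequent words keep their own class; everything below the cutoff is
  # lumped into a single catch-all class of rare words.
  def build_classes(word_counts, min_count = 5)
    classes = {}
    word_counts.each do |word, count|
      classes[word] = count >= min_count ? word : RARE
    end
    classes
  end

  word_counts = { "the" => 120, "fox" => 7, "sesquipedalian" => 1 }
  classes = build_classes(word_counts)
  # => { "the" => "the", "fox" => "fox", "sesquipedalian" => :__rare__ }

  # The chain is then trained on class sequences instead of raw words:
  %w[the sesquipedalian fox].map { |w| classes.fetch(w, RARE) }
  # => ["the", :__rare__, "fox"]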
Well, this project is mostly for fun (a plugin for an IRC bot, as I
mentioned), so even if 'odd' words are chosen, it just adds to the
fun.
I like this 'word class' idea. I could put words with very few
occurrences into the 'rare' class, and then pick among them with an
auxiliary letter-based Markov chain (which would pick as many
characters as necessary to isolate a word within the class). An
interesting challenge.
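Something along these lines might be the shape of that letter-based
pick -- a hypothetical helper that samples the next character in
proportion to how many rare words share the current prefix, rather
than a full character-level chain trained on a corpus:

  def pick_rare_word(rare_words)
    prefix = ""
    loop do
      candidates = rare_words.select { |w| w.start_with?(prefix) }
      return candidates.first if candidates.size <= 1
      # possible next characters, weighted by how many words share them
      nexts = candidates.map { |w| w[prefix.length] }.compact
      prefix += nexts.sample   # crude stand-in for a weighted Markov step
    end
  end

  pick_rare_word(%w[sesquipedalian serendipity snorkel])
  # => e.g. "serendipity"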
2.) Be less strict with the probability values -- you could assume
that the probability of any particular 6-chain is just the product of
the probabilities of the five successive two-chains it contains (the
independence hypothesis), unless you can reject that hypothesis with a
statistical test (I'd use the chi-square test -- Ruby code for it can
be found in the statistics2 package by Shin-ichiro Hara).
So keep only the two-chain probabilities, plus those longer-chain
probabilities which deviate from the independence hypothesis strongly
enough that the chi-square test rejects it.
You can control how much data is kept this way by changing the alpha
level of the test.
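A minimal sketch of that pruning rule -- the counts are invented, and
the chi-square statistic is compared by hand against the 0.05 critical
value for one degree of freedom instead of going through statistics2:

  # Keep a 3-chain explicitly only if its observed count deviates enough
  # from what the two 2-chains it contains would predict.
  def keep_chain?(observed, expected, critical = 3.841)  # alpha 0.05, 1 df
    return false if expected.zero?
    chi2 = (observed - expected)**2 / expected.to_f
    chi2 > critical            # rejection => store the longer chain
  end

  bigrams  = { %w[the quick] => 40, %w[quick fox] => 25 }
  unigrams = { "quick" => 50 }

  # Expected count of "the quick fox" under the independence hypothesis:
  # count(the quick) * P(fox | quick)
  expected = bigrams[%w[the quick]] *
             (bigrams[%w[quick fox]].to_f / unigrams["quick"])  # => 20.0

  keep_chain?(22, expected)    # => false, the 2-chains explain it fine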
There is a book by Jeremy Spinrad, "Efficient Graph Representations",
where you can find further ideas on how to represent a graph.
This sounds like an excellent idea. I'll look up these references,
thanks.
And something else:
What is the probability of an n-chain of words that wasn't actually
encountered in the training set? It's not zero, even though the hash
created from the training data says so ...
If you think of the incidence matrix A=(a_ij) representing the
probabilities of finding a 2-chain of words i-j, all entries for which
no such chain was observed are zero. Now, in a technique called
"Latent Semantic Analysis"
(https://en.wikipedia.org/wiki/Latent_semantic_analysis)
this matrix is reconstructed as a product of matrices of lower rank,
which introduces non-zero values where the entries used to be zero.
Thus, all of a sudden, there is a small probability of finding that
particular chain after all.
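A small sketch of that reconstruction in plain Ruby -- no linear-algebra
library, just power iteration for the dominant singular pair of a toy
count matrix, enough to show the zero cells picking up small non-zero
values:

  def mat_vec(a, v)    # A * v, for a matrix given as an array of rows
    a.map { |row| row.zip(v).inject(0.0) { |s, (x, y)| s + x * y } }
  end

  # Rank-1 approximation A ~ u * v^T via power iteration.
  def rank1_approximation(a, iterations = 100)
    v = Array.new(a.first.size, 1.0)
    iterations.times do
      u = mat_vec(a, v)
      v = mat_vec(a.transpose, u)
      norm = Math.sqrt(v.inject(0.0) { |s, x| s + x * x })
      v = v.map { |x| x / norm }
    end
    u = mat_vec(a, v)
    u.map { |ui| v.map { |vj| ui * vj } }   # outer product u * v^T
  end

  counts = [[4.0, 0.0, 1.0],
            [2.0, 1.0, 0.0],
            [0.0, 3.0, 2.0]]
  approx = rank1_approximation(counts)
  # cells that were zero in counts (e.g. approx[0][1]) are now small
  # positive numbers -- after renormalising, those are the non-zero
  # probabilities for the previously unseen chains.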
Aha, excellent link, thanks. This is really interesting. I wonder if
it's possible to take it to higher orders and write the n-th level
tensor as a product of lower-level tensors ... and I wonder what kind
of speed hit it would take.
I join Ed Borasky in suggesting a number-crunching language with a
Ruby interface for dealing with large amounts of data, but maybe you
can reduce it all a bit?
Well, going outside Ruby would sort of defeat some of the (learning)
purposes of this project, but for a serious application it's something
I should really look into, especially for the matrix manipulations
suggested in your last points.
Thanks a lot for the suggestions,
On Jul 30, 6:15 pm, "Axel Etzold" <AEtz...@gmx.de> wrote:
--
Giuseppe "Oblomov" Bilotta