Binary Tree vs. Hash

Hi ruby fans,

I have a question about algorithm efficiency.

I have used hash in my ruby program. It worked fine. However, I found
that the memory consumed by a hash is tremendous. While the hash file is
stored on disk (using pstore) it is merely 100-200KB, but in memory, it
consumes approximately 5-10 Megabytes! I don’t know if it is related to
the cache algorithm used by ruby?

I tried to reduce this memory requirement and thought of using Binary
Tree. Just did some test in Delphi, it seems good. I loaded an English
dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
seems much better than hash.

Now my question is, since hash is O(1) and BTree is O(LOG(N)), does it
mean that hash is born to be less storage-efficient than BTree? How can
I get the best of both BTree and hash?

Thanks a lot.

Shannon

“Xiangrong Fang” xrfang@hotmail.com schrieb im Newsbeitrag
news:20030526135729.583F.XRFANG@hotmail.com

Hi ruby fans,

I have a question about algorithm efficiency.

I have used hash in my ruby program. It worked fine. However, I found
that the memory consumed by a hash is tremendous. While the hash file is
stored on disk (using pstore) it is merely 100-200KB, but in memory, it
consumes approximately 5-10 Megabytes! I don’t know if it is related to
the cache algorithm used by ruby?

How did you measure this? Is there a way to know the amount of memory
used of an object graph?

I tried to reduce this memory requirement and thought of using Binary
Tree. Just did some test in Delphi, it seems good. I loaded an English
dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
seems much better than hash.

I don’t think you can compare memory consumption across programming
languages like that.

Now my question is, since hash is O(1) and BTree is O(LOG(N)), does it
mean that hash is born to be less storage-efficient than BTree? How can
I get the best of both BTree and hash?

It’s a typical tradeoff: you can’t have both - speed AND
minimal memory consumption. If you have to keep memory consumption low
I think there is a tree implementation in the RAA. Another option would
be to use an Array, keep that sorted and implement (or use) a binary
search, which gives quite similar fetch performance as a tree. Insert
might or might not be slower.
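
Something along these lines would do (an untested sketch; "words.txt" is just
a stand-in for whatever word list you load, sorted once up front):

# Hand-rolled binary search over a sorted Array of strings
def sorted_include?(words, key)
  lo, hi = 0, words.size - 1
  while lo <= hi
    mid = (lo + hi) / 2
    case key <=> words[mid]
    when 0  then return true
    when -1 then hi = mid - 1
    else         lo = mid + 1
    end
  end
  false
end

words = File.readlines("words.txt").map { |w| w.chomp }.sort
sorted_include?(words, "ruby")   # => true/false in O(log n) comparisons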

Regards

robert

Xiangrong Fang wrote:

Hi ruby fans,

I have a question about algorithm efficiency.

I have used hash in my ruby program. It worked fine. However, I found
that the memory consumed by a hash is tremendous. While the hash file is
stored on disk (using pstore) it is merely 100-200KB, but in memory, it
consumes approximately 5-10 Megabytes! I don’t know if it is related to
the cache algorithm used by ruby?

I tried to reduce this memory requirement and thought of using Binary
Tree. Just did some test in Delphi, it seems good. I loaded an English
dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
seems much better than hash.

If you’re looking for memory efficiency and are simply doing lookups,
perhaps you should look at using a Bloom filter. You could pack a 90,000
word dictionary into 64kb. By coincidence my latest Kata is about them:

http://pragprog.com/pragdave/Practices/Kata/KataFive.rdoc,v
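
For a rough idea of the technique (this is not Dave’s Kata solution), here is
a sketch of a Bloom filter in Ruby. It carves up to four probe positions out
of a single MD5 digest; "words.txt" is a made-up file name, and the right
number of bits and probes is exactly what gets debated further down the
thread:

require 'digest/md5'

# A minimal Bloom filter sketch: m bits, up to 4 probe positions taken
# from one 128-bit MD5 digest.
class BloomFilter
  def initialize(bits, probes)
    @m, @k = bits, probes      # @k <= 4 with a single MD5 digest
    @bitmap = 0                # an Integer used as a bit array
  end

  def positions(word)
    digest = Digest::MD5.hexdigest(word).to_i(16)
    (0...@k).map { |i| ((digest >> (i * 32)) & 0xffffffff) % @m }
  end

  def add(word)
    positions(word).each { |p| @bitmap |= (1 << p) }
  end

  def include?(word)
    positions(word).all? { |p| @bitmap[p] == 1 }
  end
end

bf = BloomFilter.new(64 * 1024 * 8, 4)                     # 64KB worth of bits
File.readlines("words.txt").each { |w| bf.add(w.chomp) }   # hypothetical word list
bf.include?("ruby")   # true, or a false positive with small probability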

Cheers

Dave

“Xiangrong Fang” xrfang@hotmail.com wrote in message
news:20030526135729.583F.XRFANG@hotmail.com

Hi ruby fans,

I have a question about algorithm efficiency.

OK

I have used hash in my ruby program. It worked fine. However, I found
that the memory consumed by a hash is tremendous. While the hash file is
stored on disk (using pstore) it is merely 100-200KB, but in memory, it
consumes approximately 5-10 Megabytes! I don’t know if it is related to
the cache algorithm used by ruby?

I’m assuming a 32-bit architecture below.

I don’t know how the hash works in Ruby, or how you dumped it to disk.
But if the on-disk format is non-indexed, so that it has to be loaded into
memory before becoming useful, the disk file can be an awful lot smaller
than the in-memory structure.
A hash can be quite memory consuming. Each entry in the hash typically has:
hashkey, reference to real key, reference to data, reference to next
collision.

This is typically 4 32-bit words plus the actual key.

For efficient allocation you typically have 25% extra entries, but this can
vary, so you actually use 5 x 32-bit words in this case.
You also need a bucket table. The bucket table could be the array of entries
directly, but it is better to have a separate table. Each bucket takes up one
reference to an entry, and the bucket table is preferably more empty than
full (which is why it is better not to use the entry table directly).
How much overcapacity you use in the bucket table is a design decision, but
for at least half empty you need two references per entry.

So we are totalling 7 x 32-bit words per stored entry, not counting the keys
themselves (strings or whatever). Keys are typically short, even for strings,
so one or two 32-bit words probably.

A total rough estimate is therefore 8 32-bit words per stored entry in the
hash table.
If you have 2 32-bit words' worth of data per entry stored in a flat,
unindexed file, you use roughly 4 times that memory in an in-memory hash.
This suggests that Ruby should use around 1-2MB if your on-disk
data structure uses 100-200KB. However, Ruby probably has a lot of extra
per-object overhead for keys.
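
As a back-of-the-envelope check of that estimate (using the 90,000-entry
dictionary mentioned earlier purely for scale):

words_per_entry = 8          # rough estimate above: entry + bucket + key
bytes_per_word  = 4          # 32-bit architecture
entries         = 90_000     # the dictionary size mentioned earlier
entries * words_per_entry * bytes_per_word   # => 2_880_000, roughly 3MB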

I tried to reduce this memory requirement and thought of using Binary
Tree. Just did some test in Delphi, it seems good. I loaded an English
dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
seems much better than hash.

Binary trees are typically implemented as red-black trees. They have a
reference to the data, references to the left and right children, and often
also a reference to the parent. And they need some bits to store the node’s
color. This can be done by tagging a bit in one of the pointers, but
typically a node uses 5 32-bit words plus the keys. If we use 2 32-bit words
for storing keys, we also end up using 7 32-bit words per entry.
That is only marginally better than hash tables.

Now my question is, since hash is O(1) and BTree is O(LOG(N)), does it
mean that hash is born to be less storage-efficient than BTree? How can
I get the best of both BTree and hash?

Don’t worry too much about the log(N) in B-Trees. It is not log2(N)
but logM(N), where M is in the range of 10 to 1000. For most practical
purposes the B-Tree can be viewed as close to O(1). However, the constant
factor may be somewhat larger than for hash tables. It just means a B-Tree
scales well.
A binary tree does not scale nearly as well. Here you truly get log2(N)
performance.
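
To put rough numbers on that, for a billion keys and the branching factors
mentioned above:

include Math
n = 1_000_000_000
log(n) / log(2)      # => ~29.9 levels in a binary tree
log(n) / log(1000)   # => 3.0   levels in a disk-style B-Tree (M = 1000)
log(n) / log(10)     # => 9.0   levels in a cache-sized in-memory B-Tree (M = 10)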

A binary tree and a B-Tree keep an order relation; a hash does not

  • or rather:

Usually you do not get to keep insertion order in hashtables, but it is in
fact possible to do so without paying any significant time/space overhead.
I have written such a hashtable, but only implemented it for 32-bit keys
(which can therefore be stored in-place in the hash entry). This is not
sorting, but it is extra ordering information.

Searching a binary tree is usually faster than most alternatives for small
(<100) collections.
This assumes the tree allocates nodes from a common pool, so all nodes are
close to each other (for cache performance).
Arrays are faster for (<10) items (linear search, not binary search).

A hash table generally performs very well for medium to large amounts of
data.
One problem is the memory consumption, which hurts cache efficiency. Another
problem is the large bucket table with its random access pattern. This is
also bad for cache performance. However, despite these problems, a good hash
is so fast that it still manages to beat more clever data structures, except
in the competition for size.

A B-Tree uses [reference to key, reference to data]. It also uses additional
memory for internal tree nodes (assuming all data is stored at leaf entries).
Had the B-Tree been binary, it would use the same amount of memory for
internal nodes as for leaf storage, but the branching is higher, so the
memory consumption drops. I don’t have any exact measures here, but the cost
of internal nodes is limited.
A B-Tree operates at about 2/3 of allocated memory, so you need 3 32-bit
words plus internal node overhead. A rough estimate is therefore 4 32-bit
words per entry. Then you need to add the overhead for any keys you store
externally. Thus we get to about 5 32-bit words per entry.

The best way to work with in-memory B-Trees is to ensure a large branching
factor that is still cache friendly.
Furthermore, we want to scan a B-Tree node linearly and not use binary
search, because binary search is too slow for anything below 100 entries.
A good choice for in-memory nodes appears to be around 7-15 entries,
but that depends on the processor and the data stored. If the key is
external, it may be relevant to use binary search to reduce the number of
comparisons.
The B-Tree generally has very good cache performance, except it can be hurt
by external keys.

The Judy tree / trie is even more complex than the B-Tree but is more memory
efficient than B-Trees and may be slightly more cache friendly.

The B-Tree and the Judy tree both scale well to very large data structures,
whereas a hash table gets into trouble as soon as you start thrashing to
disk. A binary tree is just not as efficient, long before memory becomes a
problem.

The Judy trie does not have a problem with external keys. I have developed a
B-Trie, which is a stack of B-Trees each holding a part of the key. It uses
too much memory because there are many mostly empty B-Tree nodes, but this
can be fixed by optimizing B-Trees for very small collections.

I believe the B-Tree is easier to update than Judy trees and is better for
combined in-memory and on-disk data structures. However, for strictly
in-memory operation the Judy tree may be better.

Another data structure, the skip-list, has become very popular because it is
10 times easier to implement than an efficient in-memory B-Tree. It has been
claimed to be better than B-Trees.
This is not true. A skip-list jumps all over memory and has poor cache
performance. It suffers from pointer chasing. I have done some benchmarks,
and I’ve seen one other person do similar investigations reaching the same
conclusion.
This does not mean you shouldn’t use skip-lists. They are still useful
data structures - they may be better than the red-black trees which are
typically used for map implementations.

B-Trees are very kind to efficient memory allocation. Judy trees and skip
lists allocate all sorts of different memory sizes. A B-Tree only allocates
one node size (and possibly one smaller node size for the optimized special
case of very small collections).

Mikkel

Hi Robert,

I have used hash in my ruby program. It worked fine. However, I found
that the memory consumed by a hash is tremendous. While the hash file is
stored on disk (using pstore) it is merely 100-200KB, but in memory, it
consumes approximately 5-10 Megabytes! I don’t know if it is related to
the cache algorithm used by ruby?

How did you measure this? Is there a way to know the amount of memory
used of an object graph?

I have no good way to measure this. What I did is to watch the total
memory required by the program, with or without the hash. I don’t know
if there is a way to do what you mentioned. Maybe some ruby experts
know?

It’s a typical tradeoff: you can’t have both - speed AND
minimal memory consumption. If you have to keep memory consumption low
I think there is a tree implementation in the RAA. Another option would
be to use an Array, keep that sorted and implement (or use) a binary
search, which gives quite similar fetch performance as a tree. Insert
might or might not be slower.

In my program lookup is very frequent, but update is rare. I tried using
Array.include?, but it seems to use sequential search, not binary. The
performance is unbearable. Does anyone know if ruby’s internal
Array.include? uses binary search or not?

Thanks!

Shannon

···

On Mon, 26 May 2003 17:18:35 +0900 “Robert Klemme” bob.news@gmx.net wrote:

How about skip-lists???

http://c2.com/cgi/wiki?SkipList

···

On Mon, 26 May 2003 10:44:16 +0200, Robert Klemme wrote:

“Xiangrong Fang” xrfang@hotmail.com schrieb im Newsbeitrag

Now my question is, since hash is O(1) and BTree is O(LOG(N)), does it
mean that hash is born to be less storage-efficient than BTree? How can
I get the best of both BTree and hash?

It’s a typical tradeoff: you can’t have both - speed AND
minimal memory consumption. If you have to keep memory consumption low
I think there is a tree implementation in the RAA. Another option would
be to use an Array, keep that sorted and implement (or use) a binary
search, which gives quite similar fetch performance as a tree. Insert
might or might not be slower.


Simon Strandgaard

I have a question about algorithm efficiency.

I have used hash in my ruby program. It worked fine. However, I found
that the memory consumed by a hash is tremendous. While the hash file is
stored on disk (using pstore) it is merely 100-200KB, but in memory, it
consumes approximately 5-10 Megabytes! I don’t know if it is related to
the cache algorithm used by ruby?

IIRC you were using that for a spam filter or something like that ?

I tried to reduce this memory requirement and thought of using Binary
Tree. Just did some test in Delphi, it seems good. I loaded an English
dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
seems much better than hash.

If you’re looking for memory efficiency and are simply doing lookups,
perhaps you should look at using a Bloom filter. You could pack a 90,000
word dictionary into 64kb. By coincidence my latest Kata is about them:

http://pragprog.com/pragdave/Practices/Kata/KataFive.rdoc,v

Are you sure the given example (80000 words into 16KB) is right?
It just doesn’t seem to work according to the reference given in your
page:

16 * 8 * 1024 / 80000 = 1.6384 bits per word

=> max hashes k = 1. P(false positive) = .46!!

However with 90000 words and 64KB, P(false positive) = .061 (k=4) which
is much better :-)

···

On Mon, May 26, 2003 at 10:40:00PM +0900, Dave Thomas wrote:



“Dave Thomas” dave@pragprog.com wrote in message
news:3ED21933.7080504@pragprog.com

Xiangrong Fang wrote:

If you’re looking for memory efficiency and are simply doing lookups,
perhaps you should look at using a Bloom filter. You could pack a 90,000
word dictionary into 64kb. By coincidence my latest Kata is about them:

http://pragprog.com/pragdave/Practices/Kata/KataFive.rdoc,v

I also recently found an article on lossless storage for lookup only, i.e.
it correctly answers yes or no to a membership query.
The method is based on finite state machines.
It could be used with attached data at some additional complexity - it
requires you to track the search path and use bits from each branch to
generate an index for the data reference.
I guess this is a fancy way of creating perfect hashes.

For multi-word searching I also found an article on using vector spaces.
Each document is given a vector and your query is also given a vector.
You enumerate all known words. Each word becomes a position in the vector.
The value in the vector is the number of occurrences of that word.
You locate the document with the smallest distance from the query vector,
using a Euclidean or other measure of distance. This also handles inexact
queries and is supposedly memory efficient.
The query now needs to look up each word in some hashtable or similar to
locate the vector index.
The documents need not be stored in memory, only their vector representation
(which can be compressed).
A dumb implementation would now scan all document vectors against the query,
but I’m sure there is a more clever way.
Perhaps this is an exercise for a new kata?
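
A toy sketch of the idea, with tiny made-up documents, plain Euclidean
distance and a full scan over the vectors - exactly the dumb implementation
mentioned above:

# Word-count vectors over a shared vocabulary, query matched by distance
def count_words(text)
  counts = Hash.new(0)
  text.downcase.scan(/\w+/) { |w| counts[w] += 1 }
  counts
end

def to_vector(counts, vocabulary)
  vocabulary.map { |w| counts[w] }
end

def distance(a, b)
  Math.sqrt(a.zip(b).inject(0) { |sum, (x, y)| sum + (x - y) ** 2 })
end

docs = { "doc1" => "hash tables trade memory for speed",
         "doc2" => "b-trees keep keys sorted on disk" }
vocabulary = docs.values.map { |t| count_words(t).keys }.flatten.uniq.sort
vectors = docs.map { |name, text| [name, to_vector(count_words(text), vocabulary)] }

query = to_vector(count_words("sorted keys on disk"), vocabulary)
best  = vectors.min { |a, b| distance(a[1], query) <=> distance(b[1], query) }
best[0]   # => "doc2", the nearest document vector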

Mikkel

Dave Thomas wrote:

…you should look at using a Bloom filter. You could pack a 90,000
word dictionary into 64kb.

As long as you don’t mind missing 1 in 64-ish spelling errors.

A Bloom filter reaches optimal performance when there are an equal
number of set and unset bits (each probe into the bitmap
therefore yields maximum selectivity). 64Kb is 524288 bits,
which means you must use around six hash values from each word
in your 90,000 word dictionary. That yields false hit
performance of 1:64, which IMO is insufficient for a useful
spell checker.

I mentioned this not to criticise, but to show folk how to
choose the best number of hash probes for a Bloom filter of
a given size, and how to predict the false hit rate.

I learnt this stuff by experiment while building a spell-checker.
I used 16-18 probes for each word, yielding a false hit rate of
1 in 65000+. For that performance, your 90,000 word dictionary
needs a filter of 200Kb+, not 64Kb.

Clifford.

Hi Mikkel,

Thanks for the detailed explanations. I have some questions regarding
your explanation:

  1. You mentioned B-Tree and binary tree; it seems that they are different?

  2. I didn’t use any hash algorithm of my own; I just used Ruby’s hash,
    and used Ruby’s PStore to dump it to disk. The memory usage is an
    estimate from monitoring the task list in win2k. Do you have any
    comments about the efficiency of using Ruby’s hash and pstore?

  3. My application is related to using the N-gram method to cut Chinese text
    into words (a Chinese bigram is a 4-byte string). Do you have any comments
    on which algorithm is better? I am very interested in your comments
    about SkipList; can you expand more?

Thanks a lot!

Shannon

I am not sure what your program is doing, but if it's something similar to
a search engine, then you might consider using an inverted word list,
which need not be loaded into memory at all.

Regards,

  Michael

···

On Mon, May 26, 2003 at 06:29:23PM +0900, Xiangrong Fang wrote:

> It's a typical tradeoff: you can't have both - speed AND
> minimal memory consumption. If you *have* to keep memory consumption low
> I think there is a tree implementation in the RAA. Another option would
> be to use an Array, keep that sorted and implement (or use) a binary
> search, which gives quite similar fetch performance as a tree. Insert
> might or might not be slower.
In my program lookup is very frequent, but update is rare. I tried using
Array.include?, but it seems to use sequential search, not binary. The
performance is unbearable. Does anyone know if ruby's internal
Array.include? uses binary search or not?

--
ruby -r complex -e 'c,m,w,h=Complex(-0.75,0.136),50,150,100;puts "P6\n#{w} #{h}\n255";(0...h).each{|j|(0...w).each{|i|
n,z=0,Complex(.9*i/w,.9*j/h);while n<=m&&(z-c).abs<=2;z=z*z+c;n+=1 end;print [10+n*15,0,rand*99].pack("C*")}}'|display

No, it can’t because arrays don’t have to be sorted. “include?” is mixed
in by Enumerable which uses “each” to iterate over the elements.

···

On 2003-05-26 18:29:23 +0900, Xiangrong Fang wrote:

performance is unbearable. Does anyone know if ruby’s internal
Array.include? is using binary search or not?



I have no good way to measure this. What I did is to watch the total
memory required by the program, with or without the hash. I don’t know
if there is a way to do what you mentioned. Maybe some ruby experts
know?

You are probably creating a lot of temporary garbage objects, so you’re
measuring the size of hash + garbage pool.

You could try ‘GC.start’ to do garbage collection, but I don’t know if Ruby
then un-mallocs the storage and returns it to the O/S.

In my program lookup is very frequent, but update is rare. I tried using
Array.include?, but it seems to use sequential search, not binary. The
performance is unbearable. Does anyone know if ruby’s internal
Array.include? uses binary search or not?

It doesn’t, because Array doesn’t have any requirement to be sorted.

Regards,

Brian.

···

On Mon, May 26, 2003 at 06:29:23PM +0900, Xiangrong Fang wrote:

In fact Dave sort of solved the Kata already :-)
http://pragprog.com/pragdave/Tech/Blog/Searching.rdoc,v

···

On Tue, May 27, 2003 at 02:00:18AM +0900, MikkelFJ wrote:

For multi word searching I also found an article on using vector spaces.
Each document is given a vector and your query is also given a vector.
You enumerate all known words. Each word becomes a position in the vector.
The value in the vector is the number of word occurrences.
You locate the document with the smallest distance from the query vector
using some euclidian or other measure
of distance. This also handles inexact queries and is supposedly memory
efficient.
The query would now need to look up each word in some hashtable or similar
to locate the vector index.
The documents need not be stored in-memory, only their vector representation
(which can be compressed).
A dumb implementation would now scan all document vectors against the query,
but I’m sure there is more clever way.
Perhaps this is an exercise for a new kata?



“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message
news:3ed243b6$0$97260$edfadb0f@dread12.news.tele.dk…

The documents need not be stored in-memory, only their vector
representation
(which can be compressed).
A dumb implementation would now scan all document vectors against the
query,
but I’m sure there is more clever way.
Perhaps this is an exercise for a new kata?

Which makes me think of multi-key indices.
Using space-filling curves (a kind of fractal), you can map a vector into a
single linear value.
The linear value is the position along the space-filling curve.
The vector is the position in space that the curve reaches at that distance.
A Peano curve is one such fractal.

The linear index can be stored in B-Trees or similar efficient lookup
datastructures.

The nice thing about space-filling curves is that many such curves exhibit a
kind of sensible distance measure.
That is, if the distance between two linear indices is small, then the
corresponding points are also reasonably close together in space.
This is not an exact science, but it does work.
Space-filling curves have been used to index images where each key is some
aspect of the image, such as how much yellow it contains.
Given a new image of the same person in the same shirt etc., you should have
a chance of retrieving an old similar image.

If we take this idea and apply it to document storage, we can now easily
find the best document matches by looking up the documents
stored with linear indices within a predetermined distance of the query. The
distance chosen may take into account the nature of the
space filling curve.

A problem is that it can be a bit slow to calculate the linear index.

Do we have a new google? :-)

Mauricio Fernández wrote:

Are you sure the given example (80000 words into 16KB) is right?
It just doesn’t seem to work according to the reference given in your
page:

16 * 8 * 1024 / 80000 = 1.6384 bits per word

=> max hashes k = 1. P(false positive) = .46!!

Hmm - you’re right, but now I’m seriously worried. I’ve carried those
numbers around in my head for 20 years, and I was sure they were
correct. Now my problem is trying to find a way to convert the old 8"
floppies in RT-11 format so I can work out why there’s an
inconsistency.

Thanks for pointing it out.

Cheers

Dave

MikkelFJ wrote:

For multi word searching I also found an article on using vector spaces.
Each document is given a vector and your query is also given a vector.
You enumerate all known words. Each word becomes a position in the vector.
The value in the vector is the number of word occurrences.
You locate the document with the smallest distance from the query vector
using some euclidian or other measure
of distance. This also handles inexact queries and is supposedly memory
efficient.
The query would now need to look up each word in some hashtable or similar
to locate the vector index.
The documents need not be stored in-memory, only their vector representation
(which can be compressed).
A dumb implementation would now scan all document vectors against the query,
but I’m sure there is more clever way.
Perhaps this is an exercise for a new kata?

Or for the original one… :-)

http://pragprog.com/pragdave/Practices/CodeKata

Cheers

Dave

Clifford Heath wrote:

Dave Thomas wrote:

…you should look at using a Bloom filter. You could pack a 90,000
word dictionary into 64kb.

As long as you don’t mind missing 1 in 64-ish spelling errors.

A Bloom filter reaches optimal performance when there are an equal
number of set and unset bits (each probe into the bitmap
therefore yields maximum selectivity). 64Kb is 524288 bits,
which means you must use around six hash values from each word
in your 90,000 word dictionary. That yields false hit
performance of 1:64, which IMO is insufficient for a useful
spell checker.

Actually, it’s worse than that. According to
Bloom Filters - the math, for an m/n
ratio of 6, the filter works best with 4 hashes, which gives an error
rate of about 1/18, which is clearly silly. Next time I’ll actually use
the tables rather than try to wing it in my head :-)

I learnt this stuff by experiment while building a spell-checker.
I used 16-18 probes for each word, yielding a false hit rate of
1 in 65000+. For that performance, your 90,000 word dictionary
needs a filter of 200Kb+, not 64Kb.

Interestingly, the same tables suggest that for a 200k hash, m/n will be
roughly 18, so k optimizes out at 12 or 13 hashes, for an error rate of
0.000176, or 1/5681, which probably still isn’t good enough for
real-world use. It looks like about 300k is probably a good starting
point. I’m off to try to generate some real statistics just to confirm all
this.
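
For anyone wanting to check such figures without the tables: the standard
approximations are k_opt = (m/n) * ln 2 and P(false positive) =
(1 - e^(-k*n/m))^k. A quick Ruby check of the numbers above:

include Math

# Approximate false-positive rate for m bits, n words, k probes per word
def false_positive_rate(m_bits, n_words, k_hashes)
  (1 - exp(-k_hashes * n_words.to_f / m_bits)) ** k_hashes
end

n = 90_000
false_positive_rate(64 * 1024 * 8, n, 4)     # => ~0.061, roughly 1 in 16
false_positive_rate(200 * 1024 * 8, n, 13)   # => ~1.6e-4, close to the 0.000176 above
(300 * 1024 * 8 / n.to_f) * log(2)           # => ~18.9, optimal probe count for 300k
false_positive_rate(300 * 1024 * 8, n, 19)   # => ~2e-6, about 1 in 500,000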

Cheers

Dave

“Xiangrong Fang” xrfang@hotmail.com wrote in message
news:20030527092309.73C4.XRFANG@hotmail.com

Hi Mikkel,

Thanks for the detailed explanations. I have some questions regarding
your explanation:

  1. You mentioned B-Tree and binary tree; it seems that they are different?

In principle they are identical, but they differ vastly in implementation.

A binary tree has two children; the left holds smaller values, the right
holds larger values.
A B-Tree has an array of children, such that all values in the leftmost
subtree are smaller than all values in the next subtree, and so on.
In effect a B-Tree is a compressed binary tree.
In both cases you have logarithmic search. But if you assume the time is
consumed by visiting new nodes and not by visiting values within a single
B-Tree node, you get a much faster operation. This assumption is true if you
read data from disk. There is a limit to B-Tree node sizes beyond which it
takes too long to visit each key within each node. For disk-based B-Trees you
usually have about 1000 keys per node (M=1000).
You can also use B-Trees in memory, but then the node size must reflect the
size of the CPU cache. When data is in cache, a very fast scan over the array
of keys is more efficient than visiting new nodes.
B-Trees are very complicated because a node can have between M/2 and M keys.
If you go above or below, you must reorganize the tree. This is efficient,
but the implementation has lots of special cases.

A B-Tree guarantees that you visit at most K nodes.
K is a small, limited value. If K is 3 and you have 1000 keys per
node (M=1000), you have M^K = 1000^3, about 1G possible keys.
That is about all you can address in a 32-bit system.
For in-memory operation (M=10), K is significantly larger,
but if you consider the realistic size of your dataset, it is still
small. The trick you can do to make in-memory performance better
is to organize your tree so that allocation of nodes happens close to
related nodes. This gives you benefits in lower-level caches and
virtual memory. Effectively you split a larger node (M=1000) into
several smaller (M=10) nodes. This gives you both good in-memory
and on-disk performance. In practice it is not that easy to manage,
but you can do something to make it work reasonably well.

Binary trees, on the other hand, do not have a small limit on the
number of nodes that you must visit. Therefore they truly have
logarithmic behavior. It’s not too bad, it is just not the most
efficient way to deal with large datasets.

A binary tree requires you to visit a new node for each comparison.
Binary trees must be balanced, otherwise all data could end up in one
subtree (becoming just a linked list).
There are several ways to balance binary trees (e.g. red-black trees and
splay trees).
In any case it solves the problem.
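
A sketch of just the lookup side in Ruby, keeping keys in the internal nodes
(balancing and insertion, where the real complexity lives, are left out):

# Each node holds a sorted array of keys; internal nodes have one more
# child than keys, leaves have children == nil.
Node = Struct.new(:keys, :children)

def btree_include?(node, key)
  i = 0
  i += 1 while i < node.keys.size && node.keys[i] < key   # linear scan in-node
  return true if i < node.keys.size && node.keys[i] == key
  return false if node.children.nil?                      # leaf: not found
  btree_include?(node.children[i], key)                   # descend one level
end

leaf1 = Node.new(%w[ant bee], nil)
leaf2 = Node.new(%w[cat dog], nil)
leaf3 = Node.new(%w[fox owl], nil)
root  = Node.new(%w[bird eel], [leaf1, leaf2, leaf3])
btree_include?(root, "dog")   # => true, visiting only two nodes
btree_include?(root, "emu")   # => false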

  2. I didn’t use any hash algorithm of my own; I just used Ruby’s hash,
    and used Ruby’s PStore to dump it to disk. The memory usage is an
    estimate from monitoring the task list in win2k. Do you have any
    comments about the efficiency of using Ruby’s hash and pstore?

My best guess is that PStore would not save more than necessary, so you
should expect the in-memory version to take up more space, as I already
suggested, but I don’t know what it does.
Your way of measuring is very crude, but probably fair, because this is how
the system is actually stressed, no matter the theory.
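
For reference, the PStore round trip presumably looks something like this
(file name and key are made up). PStore marshals the whole object graph to
one file, so the compact file size says little about what the live Hash costs
in memory:

require 'pstore'

store = PStore.new("bigrams.pstore")

counts = { "ruby" => 3, "hash" => 5 }          # stand-in for the real data
store.transaction { store[:counts] = counts }  # dump the hash to disk

store.transaction(true) do                     # read-only transaction
  counts = store[:counts]                      # the whole hash is loaded back
end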

  3. My application is related to using the N-gram method to cut Chinese text
    into words (a Chinese bigram is a 4-byte string). Do you have any comments
    on which algorithm is better?

I’d like to hear more about this, in private mail if you prefer.

It appears that you are counting “word” or symbol frequencies and can do
this by indexing a pair of symbols as a single 32-bit word.
This is ideal for the hashtable I have implemented, as it does not need to
visit any external keys. However, it is not available in Ruby. I can send you
the C source.
I already suggested that this source be made available to Ruby, but I don’t
know how well it fits with Ruby as is. It would be specialized for 32-bit
words, not general hashes.

I have been working a lot with B-Trees, and I think I would choose B-Trees
over hash tables because they scale better. B-Trees have a good trade-off.
They are easy to keep small and easy to make very, very large, and if they
are not the fastest solution, they always seem to be among the best
performers.

With B-Trees you can have your data on disk and do efficient lookups without
needing to load everything into memory.
It is possible to have on-disk hash tables, but in the end you need to do
some clever memory management that is going to make it look an awful lot like
B-Trees.

Also, because the key is 32-bit, the B-Tree is efficient because it does not
need to visit external keys. As soon as you visit external keys, the cache
benefit of B-Trees goes away. However, I do not recommend that you start
implementing B-Trees yourself. They are difficult to implement and difficult
to get right.

You may be able to use the data structures in the Berkeley DB database. I
don’t know how efficient it would be in your case (it’s fast, but it is a
database after all).
Berkeley DB handles both hashes and B-Trees. It can operate in in-memory-only
mode, and I think there is a Ruby interface.
Note that Berkeley DB is free only for non-commercial usage.

I am very interested in your comments
about SkipList, can you expand more?

I don’t recall exactly why skip-lists are better than binary trees, but they
are in any case easy to implement (except for efficient node allocation).

I can send you an implementation in C if you like.

This is only a rough outline from memory:

A skiplist is a sorted linked list of all values.
Since it is slow to find a value this way, some nodes point both to the next
node and to one or more nodes further down the list.
In fact a skiplist node has an array of next pointers.
All nodes have one next pointer; half the nodes have at least two next
pointers.
Only one node has log(N) next pointers (the head of the list).
Two nodes have log(N)-1 next pointers, etc.
The search happens by looking at the longest-jumping link first. If that was
too far, you try the next shorter-jumping node.
Eventually you find a node whose key matches or is smaller.
You now repeat the process, jumping from that node. This goes on until you
have a match (or conclude failure).
This is a bit simplified, as there are a few more cases to consider, but this
is generally the idea.
In reality the skiplist does not have a perfect distribution of nodes. The
number of nodes with a certain number of next links may not be optimal, and
the location of these nodes may also not be optimal. The distribution happens
statistically, but it works pretty well in practice.
A skiplist node thus has an array of pointers. This is not unlike B-Tree
nodes, except the concept is radically different. There is no complex
balancing or splitting of nodes, and the performance is decent. However, you
are jumping back and forth between a lot of nodes, so it is not friendly to
your CPU cache. Since nodes have vastly different sizes, it is difficult to
make a fast node allocation scheme (if you choose malloc / new, you have lost
the speed competition).
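
A minimal Ruby sketch of that structure - random node heights, no duplicate
handling, and none of the allocation tricks discussed above:

# Each node carries an array of "next" pointers; taller nodes jump further,
# and a search walks from the longest jumps down to level 0.
class SkipList
  MAX_LEVEL = 16
  Node = Struct.new(:key, :forward)

  def initialize
    @head = Node.new(nil, Array.new(MAX_LEVEL))
  end

  def random_level
    level = 1
    level += 1 while rand < 0.5 && level < MAX_LEVEL
    level
  end

  def insert(key)
    update = Array.new(MAX_LEVEL, @head)
    node = @head
    (MAX_LEVEL - 1).downto(0) do |i|            # remember where we descend
      node = node.forward[i] while node.forward[i] && node.forward[i].key < key
      update[i] = node
    end
    new_node = Node.new(key, Array.new(random_level))
    new_node.forward.each_index do |i|          # splice into each level
      new_node.forward[i] = update[i].forward[i]
      update[i].forward[i] = new_node
    end
  end

  def include?(key)
    node = @head
    (MAX_LEVEL - 1).downto(0) do |i|            # longest jumps first
      node = node.forward[i] while node.forward[i] && node.forward[i].key < key
    end
    node = node.forward[0]
    !node.nil? && node.key == key
  end
end

list = SkipList.new
%w[btree hash ruby skiplist].each { |w| list.insert(w) }
list.include?("ruby")     # => true
list.include?("pstore")   # => false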

Thanks a lot!

You are welcome.

Mikkel

It does free() the space. Whether or not it’s returned to the OS depends
on stdlib.

Maybe you could add some printf statements in obj_free (gc.c) and
st_free_table (st.c) to see how much memory was released, doing

theHash = some_method(*args)  # build the hash you want to measure
GC.start                      # collect temporaries created along the way
theHash = nil                 # drop the last reference to the hash
GC.start                      # and now take a look at the last output lines

···

On Mon, May 26, 2003 at 09:11:59PM +0900, Brian Candler wrote:

On Mon, May 26, 2003 at 06:29:23PM +0900, Xiangrong Fang wrote:

I have no good way to measure this. What I did is to watch the total
memory required by the program, with or without the hash. I don’t know
if there is a way to do what you mentioned. Maybe some ruby experts
know?

You are probably creating a lot of temporary garbage objects, so you’re
measuring the size of hash + garbage pool.

You could try ‘GC.start’ to do garbage collection, but I don’t know if Ruby
then un-mallocs the storage and returns it to the O/S.

