Ruby and Judy

You guys have got to change the subject. Mac mail keeps sorting
messages about Ruby and Judy into the porno mailbox. :-)

···

On Saturday, August 24, 2002, at 10:40 PM, Hal E. Fulton wrote:

This (Judy in general) is a very interesting concept to
me, partly because I find it hard to imagine this kind
of thing coded with the kind of efficiency we seem to be
talking about, and yet presumably in a platform-independent
manner.

Well, from my reading of the ten minute description, it’s not really
platform-independent. It’s going to be much faster on machines where
main memory access is much slower than cache memory access. That just
happens to be most of the machines in common use today. :-) But on
an old machine that doesn't use a cache, such as, say, an Apple II, it
wouldn't be much different in speed from other methods.

cjs

···

On Sun, 25 Aug 2002, Hal E. Fulton wrote:

This (Judy in general) is a very interesting concept to me, partly
because I find it hard to imagine this kind of thing coded with the
kind of efficiency we seem to be talking about, and yet presumably in a
platform-independent manner.


Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

This (Judy in general) is a very interesting concept to
me, partly because I find it hard to imagine this kind
of thing coded with the kind of efficiency we seem to be
talking about, and yet presumably in a platform-independent
manner.

I’m not knocking it, I’m just surprised that Judy exists.

I have been working with a similar concept using B-Trees: an efficient
in-memory tree with a reduced branching factor, compared to the typical
page-sized trees, does work quite efficiently. However, the collection
has to be very large before the B-Tree outperforms a hash of a 32-bit
key.

I implemented a variant called an I-Tree (I later learned they are also
known as ordered B-Trees, or OB-trees). An I-Tree uses the size of each
sub-tree for indexing. At leaf level there is no key; hence it is a
space-efficient dynamic array.
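
To make that concrete, here is a minimal sketch of the counting idea
(the names and node layout are hypothetical, not my actual
implementation): each child slot records the size of its sub-tree, so
the i-th element can be found without storing a single key.

/* Find the i-th element of an I-Tree (order-statistic B-Tree).
 * Sketch only: leaves have all child pointers NULL, and the caller
 * must ensure i is less than the total element count. */
#include <stddef.h>

#define FANOUT 8

typedef struct INode {
    int nchildren;                 /* used child slots          */
    size_t count[FANOUT];          /* elements under each child */
    struct INode *child[FANOUT];   /* NULL in leaf nodes        */
    void *value[FANOUT];           /* payload, used in leaves   */
} INode;

void *itree_index(INode *n, size_t i)
{
    while (n->child[0] != NULL) {  /* descend until a leaf */
        int c = 0;
        while (i >= n->count[c])   /* skip whole sub-trees */
            i -= n->count[c++];
        n = n->child[c];
    }
    return n->value[i];
}

Insertion then only has to bump the counts along one root-to-leaf path,
which is what makes the structure behave like a dynamic array.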

I was actually considering making Ruby run on this indexing principle,
but I also realize it is a lot of work.

I need to study Judy more, but it seems like it uses a trie where I use
B-Tree keys. I'm worried that Judy may consume too much memory in some
cases (especially smaller trees), which is why I chose B-Tree keys
instead of a trie lookup. If Judy does handle this efficiently, that's
interesting. I noticed the use of "widening", also known as path
compression. I've also been working with this in relation to statically
indexed trees, but it is not good for dynamic behavior. Judy seems to
have abandoned the use of key widening. The cost of not widening the
key is more levels in the tree.

The I-Tree is very space efficient. The B-Tree requires a key for each
entry, which effectively becomes 64 bits per entry, plus the inherent
space overhead of B-Trees from not fully using all nodes.

The downside of in-memory B-Trees is that entries must be moved during
insertion into a node. Up to a certain node size, a linear scan is
generally faster than a binary search of the node (and that crossover
size is already larger than cache considerations allow).
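
For what it's worth, the intra-node search being compared is roughly
this (a sketch; the crossover size has to be measured per machine):

/* Linear scan of a small, sorted B-Tree node. For nodes that fit in
 * one or two cache lines this tight loop typically beats a binary
 * search; the exact crossover size is machine-dependent. */
static int node_find_slot(const unsigned keys[], int n, unsigned key)
{
    int i = 0;
    while (i < n && keys[i] < key)   /* first slot with keys[i] >= key */
        i++;
    return i;
}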

Ordinary red-black trees (the STL map) are not very space efficient,
but for small collections they are fast and have little additional
overhead.

Also note the 3-Trie posted in DDJ a few years back. I implemented a
variant using multi-level B-Trees. It branches on 32 (or 64) bits
instead of Judy's 8. I haven't measured it, but I expect it to be a
very efficient suffix-tree string indexing method. I store runs of
strings to avoid an excessive number of sub-trees.

Mikkel

Is Ruby likely to be used on an Apple II? }:->

···

On Sun, Aug 25, 2002 at 06:36:54PM +0900, Curt Sampson wrote:

On Sun, 25 Aug 2002, Hal E. Fulton wrote:

This (Judy in general) is a very interesting concept to me, partly
because I find it hard to imagine this kind of thing coded with the
kind of efficiency we seem to be talking about, and yet presumably in a
platform-independent manner.

Well, from my reading of the ten minute description, it’s not really
platform-independent. It’s going to be much faster on machines where
main memory access is much slower than cache memory access. That just
happens to be most of the machines in common use today. :-) But on
an old machine that doesn't use a cache, such as, say, an Apple II, it
wouldn't be much different in speed from other methods.


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Sex dumps core
(Sex is a Simple editor for X11)
– Seen on debian bugtracking

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message news:3d68f556$0$166$edfadb0f@dspool01.news.tele.dk

This (Judy in general) is a very interesting concept to
me, partly because I find it hard to imagine this kind
of thing coded with the kind of efficiency we seem to be
talking about, and yet presumably in a platform-independent
manner.

I’m not knocking it, I’m just surprised that Judy exists.

[Mikkel]
I have been working with a similar concept using B-Trees: an efficient
in-memory tree with a reduced branching factor, compared to the typical
page-sized trees, does work quite efficiently. However, the collection
has to be very large before the B-Tree outperforms a hash of a 32-bit
key.

I implemented a variant called an I-Tree (I later learned they are also
known as ordered B-Trees, or OB-trees). An I-Tree uses the size of each
sub-tree for indexing. At leaf level there is no key; hence it is a
space-efficient dynamic array.

I was actually considering making Ruby run on this indexing principle,
but I also realize it is a lot of work.

I need to study Judy more, but it seems like it uses a trie where I use
B-Tree keys. I'm worried that Judy may consume too much memory in some
cases

This is the major reason people do not use digital trees (tries);
however, Judy was able to solve this problem. In fact, Judy normally
consumes less memory than other methods. The reason is that Judy takes
advantage of the fact that a digital tree decodes a portion of the key
at each level, so the whole key need not be stored in the tree.
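
To illustrate the principle only (a plain 256-way digital tree over a
32-bit key; Judy's real nodes are compressed in many ways this sketch
ignores): each level decodes one byte of the key, so the key is never
stored anywhere; its bits are the path.

#include <stdint.h>

typedef struct TrieNode {
    void *slot[256];   /* child pointers; values at the last level */
} TrieNode;

void *trie_get(TrieNode *root, uint32_t key)
{
    TrieNode *n = root;
    for (int shift = 24; shift >= 8; shift -= 8) {
        n = n->slot[(key >> shift) & 0xff];
        if (n == NULL)
            return NULL;            /* key is absent */
    }
    return n->slot[key & 0xff];     /* the stored value (or NULL) */
}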

(especially smaller trees), which is why I chose B-Tree keys instead of
a trie lookup. If Judy does handle this efficiently, that's interesting.
I noticed the use of "widening", also known as path compression. I've
also been working with this in relation to statically indexed trees,
but it is not good for dynamic behavior. Judy seems to have abandoned
the use of key widening. The cost of not widening the key is more
levels in the tree.

Judy does not use "path compression" (per Steffin) because that
opportunity is only available high in the tree. This part of the tree
is normally in the data cache, so the very wide branch does not help
performance. Very wide branches make a very dynamic tree slow to add to
and delete from. Judy will use branches wider than 256 when 1TB
machines are commonplace.

The I-Tree is very space efficient. The B-Tree requires a key for each
entry, which effectively becomes 64 bits per entry, plus the inherent
space overhead of B-Trees from not fully using all nodes.

The downside of in-memory B-Trees is that entries must be moved during
insertion into a node. Up to a certain node size, a linear scan is
generally faster than a binary search of the node (and that crossover
size is already larger than cache considerations allow).

Judy did not use a B-tree internally because of the number of
cache-line fills necessary to search a node. However, we came close.
B-trees have a lot of good properties if done well. I think there is
some B-tree code in the apps/benchmark section of the Judy source
download.

Ordinary red-black trees (the STL map) are not very space efficient,
but for small collections they are fast and have little additional
overhead.

Also note the 3-Trie posted in DDJ a few years back. I implemented a
variant using multi-level B-Trees. It branches on 32 (or 64) bits
instead of Judy's 8. I haven't measured it, but I expect it to be a
very efficient suffix-tree string indexing method. I store runs of
strings to avoid an excessive number of sub-trees.

Mikkel

Have you got some code to compare? The SLcompare.c benchmark at
<www.sourcejudy.com> compares JudySL with the best 3 methods for which
I have been able to get code. I do not write the code for known methods
myself because I am too tainted.

doug@sourcejudy.com

I need to study Judy more, but it seems like it uses a trie where I use
B-Tree keys. I'm worried that Judy may consume too much memory in some
cases

This is the major reason people do not use digital trees (tries);
however, Judy was able to solve this problem. In fact, Judy normally
consumes less memory than other methods. The reason is that Judy takes
advantage of the fact that a digital tree decodes a portion of the key
at each level, so the whole key need not be stored in the tree.

I admit that B-Trees are probably more memory-expensive than Judy
trees. I imagine the wasted space can improve insertion times.

The static trie I've been working with was based on ideas from the
article "Fast IP Routing with LC-Tries" in Dr. Dobb's Journal (search
their archives). I modified it in several ways, including adding a
substring-compare node to avoid a subsequent comparison to check
whether a match was truly a match. The data structure uses a fill
factor to decide the size of the jump table at a given bit offset in
the index. The fill factor is the percentage of empty slots allowed in
a jump table. The index bit-length of a given node is decided by the
longest bit sequence that does not violate the fill factor and does not
cross a machine-word boundary in the index. The bit string is shifted
into place and used to jump to the next node.
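
A sketch of that lookup step (illustrative names and layout, not the
code from the article or from my trie):

#include <stdint.h>

typedef struct StNode {
    uint8_t skip;           /* first key bit of this node's index field */
    uint8_t bits;           /* field width; table has 1 << bits slots   */
    struct StNode **table;  /* jump table; a NULL slot means no match   */
} StNode;

/* Extract 'bits' bits of 'key' starting at bit 'skip' (MSB first). */
static uint32_t extract(uint32_t key, unsigned skip, unsigned bits)
{
    return (key << skip) >> (32 - bits);
}

/* One step: shift the node's bit field into place and jump. */
StNode *st_step(const StNode *n, uint32_t key)
{
    return n->table[extract(key, n->skip, n->bits)];
}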

I have considered having a combined static trie and a dynamic B-Tree
(a B-Trie), where new data is added to the B-Tree and the static trie
is recompiled occasionally, much like garbage collection.

Judy does not use "path compression" (per Steffin) because that
opportunity is only available high in the tree. This part of the tree
is normally in the data cache, so the very wide branch does not help
performance. Very wide branches make a very dynamic tree slow to add to
and delete from. Judy will use branches wider than 256 when 1TB
machines are commonplace.

I agree that you should limit the maximum fanout, as I also did in my
static trie, because you want to keep the jump table within cache
boundaries and because you do not want to waste too much memory. I do
not agree that you only get a wide fanout close to the root; that
depends on the data source. For example, "comp.lang." could be a common
prefix, where you get the diversity closer to the leaves.

Judy did not use a B-tree internally because of the number of
cache-line fills necessary to search a node. However, we came close.
B-trees have a lot of good properties if done well. I think there is
some B-tree code in the apps/benchmark section of the Judy source
download.

I found that having between 7 and 14 entries per B-Tree node proved
quite efficient. A tight linear-scan loop over a small node should run
entirely within the data and instruction caches. It is not a problem
that the node is small: you are simply trading tree height against
local node-search performance. You tune the node size until you get
maximum performance. For this to work optimally, related nodes should
preferably be stored close to each other. A further benefit of small
nodes is that you need to move fewer entries during insertion and
removal.

The latest concept I have been working on is allocating small B-Tree
nodes from larger pages: try to allocate the next node from the same
page, as sketched below. From a disk-cache point of view the page works
like a large B-Tree node, which in turn is divided into smaller,
CPU-cache-sized nodes. I expect this to work well with memory-mapped
files as well as with large indices where some nodes eventually get
paged out. You also get a good trade-off between avoiding CPU-expensive
intra-node binary search and still having good logarithmic behavior.
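
A minimal sketch of that allocation policy (a bump allocator with
hypothetical names; a real version would also handle freeing):

#include <stdlib.h>

#define PAGE_SIZE 4096
#define NODE_SIZE 128            /* small, CPU-cache-sized node */

typedef struct Page {
    char   mem[PAGE_SIZE];
    size_t used;
} Page;

/* Hand out the next node from the current page; start a fresh page
 * only when the current one is full, so related nodes tend to end up
 * on the same disk/VM page. */
void *node_alloc(Page **current)
{
    Page *pg = *current;
    if (pg == NULL || pg->used + NODE_SIZE > PAGE_SIZE) {
        pg = calloc(1, sizeof(Page));
        if (pg == NULL)
            return NULL;         /* out of memory */
        *current = pg;
    }
    void *node = pg->mem + pg->used;
    pg->used += NODE_SIZE;
    return node;
}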

Have you got some code to compare? The SLcompare.c benchmark at
<www.sourcejudy.com> compares JudySL with the best 3 methods for which
I have been able to get code. I do not write the code for known methods
myself because I am too tainted.

I have some fast 32-bit indexing data structures, including a fairly
fast B-Tree. The B-Trie for variable-length keys is very recently
implemented and uses code that is not yet optimized (needs inlining). I
am curious about the performance though, so we might set something up,
but it would take some amount of work (mail me if interested; remove
"-anti-spam" from the mail address).

You could also take a look at the ternary search trie by Sedgewick:
http://www.ddj.com/documents/s=921/ddj9804a/9804a.htm

It uses too much memory and does not have sufficient data locality, but
the code is very small and simple, and it executes in a tight loop. It
could be enhanced with a substring-compare node, as I did with the
B-Trie, to avoid an excessive number of nodes with only one branch.
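
The core really is just one loop (minimal search shown; insertion and
the substring-compare enhancement are left out):

#include <stddef.h>

/* Bentley/Sedgewick ternary search trie lookup. */
typedef struct Tst {
    char splitchar;               /* one character of some stored key */
    struct Tst *lo, *eq, *hi;     /* less / equal / greater branches  */
} Tst;

Tst *tst_search(Tst *p, const char *s)
{
    while (p != NULL) {
        if (*s < p->splitchar)
            p = p->lo;
        else if (*s > p->splitchar)
            p = p->hi;
        else {
            if (*s == '\0')
                return p;         /* matched the terminator: found */
            s++;
            p = p->eq;
        }
    }
    return NULL;                  /* not in the trie */
}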

Mikkel

“Doug Baskins” doug@sourcejudy.com wrote in message
news:34b67111.0208290825.45942f59@posting.google.com

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message
news:3d68f556$0$166$edfadb0f@dspool01.news.tele.dk

This (Judy in general) is a very interesting concept to
me, partly because I find it hard to imagine this kind
of thing coded with the kind of efficiency we seem to be
talking about, and yet presumably in a platform-independent
manner.

I’m not knocking it, I’m just surprised that Judy exists.

[Mikkel]
I have been working with a similar concept using B-Trees: an efficient
in-memory tree with a reduced branching factor, compared to the typical
page-sized trees, does work quite efficiently. However, the collection
has to be very large before the B-Tree outperforms a hash of a 32-bit
key.

Have you got some code to compare? The SLcompare.c benchmark at
<www.sourcejudy.com> compares JudySL with the best 3 methods for which
I have been able to get code. I do not write the code for known methods
myself because I am too tainted.

I couldn't run the test as-is since I'm running Windows, so I rewrote
it under Windows on a 750 MHz laptop. Given the comparable machine
configuration and the comparable getline speeds, I assume the results
can reasonably be compared to the data posted in the SLcompare.c file.
I compared against Hash and Splay, but not Judy, as I didn't want to
port Judy to NT. Still, I think the numbers are comparable.

Note that I modified the test slightly to terminate each line with
padding up to a 32-bit boundary. I also changed the test so that the
line length was available to the insert and find functions.

I turned the optimization up as much as I could to compensate for the
lack of inlining in the B-Trie code. The preliminary results show that
the B-Trie uses 3 times more memory than the other methods and 4 times
the raw data (I haven't space-optimized for small nodes yet, so I never
allocate less than 128 bytes). It also showed that for small data sets
(8000 lines, 300KB, i.e. Hamlet) the B-Trie is actually faster than the
hash. For larger sets it is slower than the hash but consistently
faster than the splay tree. For comparison, I also noted that a strdup
and a hash-key operation each take about 0.5 uS, so all methods are
quite fast.
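
For a sense of scale, a cheap string hash is only a few operations per
byte; 32-bit FNV-1a is a representative example (not necessarily the
hash this test actually uses):

#include <stdint.h>

uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;        /* FNV offset basis */
    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h *= 16777619u;              /* FNV prime */
    }
    return h;
}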

Note: memory is measured as requested alloc size, not counting internal
allocator overhead.

I didn't test the full Google repos.00 as I didn't have enough RAM,
not with the memory consumed by the B-Trie anyway.

// test machine Windows 2000, IBM X21 laptop, 392MB RAM, 750 MHz PIII.

// test data set: 338.528 bytes
// http://the-tech.mit.edu/Shakespeare/hamlet/full.html
// this test set requires a hi-res timer; clock() can't separate hash from btrie.

hamlet.html

lines    avg_linelen   getline    dup_lines   RAMused/line   store/line   lookup/line   ADT
7728     43.8 bytes    0.640 uS   2770        181.9 bytes    2.036 uS     1.606 uS      BTrie
7728     43.8 bytes    0.668 uS   2770         68.0 bytes    2.543 uS     1.943 uS      SPLAY
7728     43.8 bytes    0.643 uS   2770         60.0 bytes    2.500 uS     1.687 uS      HASH

// test data set: 9.907.357 bytes
// first 200000 lines of google repos.00 testset

rp200.txt

lines    avg_linelen   getline    dup_lines   RAMused/line   store/line   lookup/line   ADT
199998   49.5 bytes    0.591 uS   83703       234.9 bytes    2.973 uS     2.433 uS      BTrie
199998   49.5 bytes    0.591 uS   83703        85.5 bytes    3.732 uS     3.150 uS      SPLAY
199998   49.5 bytes    0.596 uS   83703        77.5 bytes    2.353 uS     1.827 uS      HASH

// test data set: 50.045.886 bytes
// first 1.000.000 lines of the google repos.00 testset

rp1000.txt

lines    avg_linelen   getline    dup_lines   RAMused/line   store/line   lookup/line   ADT
997621   50.2 bytes    0.606 uS   452649      234.3 bytes    3.450 uS     2.940 uS      BTrie
997621   50.2 bytes    0.605 uS   452649       87.7 bytes    4.523 uS     4.001 uS      SPLAY
997621   50.2 bytes    0.607 uS   452649       79.7 bytes    2.384 uS     1.909 uS      HASH

Mikkel

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message news:3d7531f5$0$59288$edfadb0f@dspool01.news.tele.dk

[Doug Baskins]
Have you got some code to compare? The SLcompare.c benchmark at
<www.sourcejudy.com> compares JudySL with the best 3 methods for which
I have been able to get code. I do not write the code for known methods
myself because I am too tainted.

Still looking forward to adding your code to the benchmark.

…del

[Mikkel]
It also showed that for small data sets (8000 lines, 300KB, i.e.
Hamlet) the B-Trie is actually faster than the hash. For larger sets it
is slower than the hash but consistently faster than the splay tree.
For comparison, I also noted that a strdup and a hash-key operation
each take about 0.5 uS, so all methods are quite fast.

Note: memory is measured as requested alloc size, not counting internal
allocator overhead.

I didn't test the full Google repos.00 as I didn't have enough RAM,
not with the memory consumed by the B-Trie anyway.

// test machine Windows 2000, IBM X21 laptop, 392MB RAM, 750 MHz PIII.

// test data set: 338.528 bytes
// http://the-tech.mit.edu/Shakespeare/hamlet/full.html
// this test set requires a hi-res timer; clock() can't separate hash from btrie.

hamlet.html

lines    avg_linelen   getline    dup_lines   RAMused/line   store/line   lookup/line   ADT
7728     43.8 bytes    0.640 uS   2770        181.9 bytes    2.036 uS     1.606 uS      BTrie
7728     43.8 bytes    0.668 uS   2770         68.0 bytes    2.543 uS     1.943 uS      SPLAY
7728     43.8 bytes    0.643 uS   2770         60.0 bytes    2.500 uS     1.687 uS      HASH

Thought I would throw in my 2 cents. My hash table was 2^20 entries in
size, which is why I think the insert time for Hash is high. I wouldn't
put much stock in benchmarks that run mostly in the data cache; the
results might not be representative of typical use.

wc Hamlet.html
8882 39581 349041 Hamlet.html

This test deletes blank lines: 8882 - 7729 = 1153 lines.

I shortened the number of columns of output to fit:
getline = 0.670 uS for all

Judy Hamlet.html

#lines avg_linelen Uqlines RAMused/line store lookup/line ADT
7729 45.2 bytes 4959 60.2 bytes 2.506 uS 1.591 uS JUDY
7729 45.2 bytes 4959 77.5 bytes 4.654 uS 1.622 uS HASH
7729 45.2 bytes 4959 85.6 bytes 2.487 uS 2.048 uS SPLAY
7729 45.2 bytes 4959 85.7 bytes 2.974 uS 2.243 uS REDBL

// test data set: 9.907.357 bytes
// first 200000 lines of google repos.00 testset

rp200.txt

lines    avg_linelen   getline    dup_lines   RAMused/line   store/line   lookup/line   ADT
199998   49.5 bytes    0.591 uS   83703       234.9 bytes    2.973 uS     2.433 uS      BTrie
199998   49.5 bytes    0.591 uS   83703        85.5 bytes    3.732 uS     3.150 uS      SPLAY
199998   49.5 bytes    0.596 uS   83703        77.5 bytes    2.353 uS     1.827 uS      HASH

// test data set: 50.045.886 bytes
// first 1.000.000 lines of the google repos.00 testset

rp1000.txt

lines    avg_linelen   getline    dup_lines   RAMused/line   store/line   lookup/line   ADT
997621   50.2 bytes    0.606 uS   452649      234.3 bytes    3.450 uS     2.940 uS      BTrie
997621   50.2 bytes    0.605 uS   452649       87.7 bytes    4.523 uS     4.001 uS      SPLAY
997621   50.2 bytes    0.607 uS   452649       79.7 bytes    2.384 uS     1.909 uS      HASH

Mikkel

Doug Baskins doug@sourcejudy.com

“Doug Baskins” doug@sourcejudy.com wrote in message
news:34b67111.0209051528.3757601d@posting.google.com

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message
news:3d7531f5$0$59288$edfadb0f@dspool01.news.tele.dk

[Doug Baskins]
Have you got some code to compare? The SLcompare.c benchmark at
<www.sourcejudy.com> compares JudySL with the best 3 methods for which
I have been able to get code. I do not write the code for known methods
myself because I am too tainted.

Still looking forward to adding your code to the benchmark.

by email

lines    avg_linelen   getline    dup_lines   RAMused/line   store/line   lookup/line   ADT
7728     43.8 bytes    0.640 uS   2770        181.9 bytes    2.036 uS     1.606 uS      BTrie
7728     43.8 bytes    0.668 uS   2770         68.0 bytes    2.543 uS     1.943 uS      SPLAY
7728     43.8 bytes    0.643 uS   2770         60.0 bytes    2.500 uS     1.687 uS      HASH

Thought I would throw in my 2 cents. My hash table was 2^20 entries in
size, which is why I think the insert time for Hash is high. I wouldn't
put much stock in benchmarks that run mostly in the data cache; the
results might not be representative of typical use.

This is correct, and this is also my problem with hash tables: how
large should they be initially, and what is the cost of growing them?
(There is an article in DDJ from a few months back on growing arrays.)
Yet I must concede that hash tables actually work quite well in
practice, and most data structures break down in virtual memory anyway,
no matter the theory. When I chose small-fanout B-Trees over hashing,
it was because I expected them to better handle small collections while
still scaling well. A comparison to a dynamically growing hash table
would be interesting. I also chose B-Trees because they are friendlier
to memory management: how do you find a large linear space for hash
table no. 700 in a memory-mapped file, assuming you also delete data
and you do not want an infinitely large file?
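
For reference, the usual answer to the growth question is geometric
doubling: an O(n) rehash now and then, but amortized O(1) insertion
overall. A minimal chained-table sketch (hypothetical names, error
handling trimmed):

#include <stdlib.h>

typedef struct Entry {
    struct Entry *next;
    const char   *key;
} Entry;

typedef struct {
    Entry **slot;
    size_t  nslots;    /* always a power of two */
    size_t  nused;
} Table;

static size_t hash_str(const char *s)           /* FNV-1a */
{
    size_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

static void grow(Table *t)                      /* double and rehash */
{
    size_t  n = t->nslots * 2;
    Entry **s = calloc(n, sizeof *s);
    for (size_t i = 0; i < t->nslots; i++)
        for (Entry *e = t->slot[i], *nx; e != NULL; e = nx) {
            nx = e->next;
            size_t j = hash_str(e->key) & (n - 1);
            e->next = s[j];
            s[j] = e;
        }
    free(t->slot);
    t->slot = s;
    t->nslots = n;
}

void table_insert(Table *t, const char *key)
{
    if (t->nused >= t->nslots)                  /* load factor 1.0 */
        grow(t);
    size_t j = hash_str(key) & (t->nslots - 1);
    Entry *e = malloc(sizeof *e);
    e->key  = key;
    e->next = t->slot[j];
    t->slot[j] = e;
    t->nused++;
}

Note that this does not answer the memory-mapped-file objection: the
doubling step still needs one new contiguous allocation of twice the
old size.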

Judy Hamlet.html

#lines avg_linelen Uqlines RAMused/line store lookup/line ADT
7729 45.2 bytes 4959 60.2 bytes 2.506 uS 1.591 uS JUDY
7729 45.2 bytes 4959 77.5 bytes 4.654 uS 1.622 uS HASH
7729 45.2 bytes 4959 85.6 bytes 2.487 uS 2.048 uS SPLAY
7729 45.2 bytes 4959 85.7 bytes 2.974 uS 2.243 uS REDBL

Standard balanced binary trees and the like work quite well up to a
certain size. Small-fanout B-Trees and Judy work better because of
their improved cache-line utilization. For many applications there is
no reason to go beyond std::map in C++.

Mikkel