Time to retire st_* and replace it with Judy?

Doug2 · 2 September 2002 02:40

After a posting I did yesterday, I decided to take a closer look
at st_* library code. Please correct me if I am wrong in the
following discussion:

The st_* library is a collection of "string" to "word" mapping
routines.  BUT, only the POINTER to the string is stored in the
internal st_* structures.  This implies that the comparison of the
pointer is enough to determine if the string is a match.  Now we are
in JudyL's territory (pointer to pointer mapping) instead of JudySL
(string to pointer mapping).  JudyL has been tuned!!!  I wrote the
st_* routines using JudyL and wrote a benchmark to compare with the
ruby-1.6.7 version of "st.c".  Since, JudyL does not have
collisions, there is no need for a collision structure.

Then I wrote a benchmark simulating st_* with JudyL inside:
<www.sourcejudy.com/benchmarks/st_compare.tar.gz>.

The results:

This is the output of 2 versions of st_compare when run on a 933Mhz P3,
one using “stock” st.c in ruby, the other using the simulation with
JudyL inside. “domnames.txt” is a text file of most/all of the
internet domain names.

wc domnames.txt

28826508 28826508 488227095 domnames.txt

That’s right 28.8 million names, let’s see if it is current:

grep -i sourcejudy domnames.txt

SOURCEJUDY.COM
SOURCEJUDY.ORG

Yep, not more than 2 months old.

Hash_st domnames.txt

lines avg_linelen RAMused/line insert/line lookup/line lookup_random

28826508 16.9 bytes 24.0 bytes 1.674 uS 0.570 uS 1.056 uS

Judy_st domnames.txt

lines avg_linelen RAMused/line insert/line lookup/line lookup_random

28826508 16.9 bytes 6.8 bytes 0.472 uS 0.154 uS 0.706 uS

Notice the Judy_st version used 1/2Gb less memory.
The performance improvement reminds me of the snake oil salesman.

There are still a lot of performance opportunities if st_* were retired.
My take on the st_* routines is they are based on a machine designed
a dozen years ago. They are not tuned for a modern machine.

Enough fun, back to getting Judy to work with ‘autoconf’ et.al.

Doug Baskins doug@sourcejudy.com

Mauricio_Fernndez · 2 September 2002 09:01

[snip]

IMHO the st_* routines are meant to be more general than simple string->word
mapping The key comparison function might be more complex than
bit-by-bit…

···

On Mon, Sep 02, 2002 at 11:40:43AM +0900, Doug Baskins wrote:

After a posting I did yesterday, I decided to take a closer look
at st_* library code. Please correct me if I am wrong in the
following discussion:

The st_* library is a collection of "string" to "word" mapping
routines.  BUT, only the POINTER to the string is stored in the
internal st_* structures.  This implies that the comparison of the
pointer is enough to determine if the string is a match.  Now we are
in JudyL's territory (pointer to pointer mapping) instead of JudySL
(string to pointer mapping).  JudyL has been tuned!!!  I wrote the
st_* routines using JudyL and wrote a benchmark to compare with the
ruby-1.6.7 version of "st.c".  Since, JudyL does not have
collisions, there is no need for a collision structure.

–
_ _

__ __ | | ___ _ __ ___ __ _ _ __
'_ \ / | __/ __| '_ _ \ / ` | ’ \
) | (| | |__ \ | | | | | (| | | | |
.__/ _,|_|/| || ||_,|| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Never trust an operating system you don’t have sources for.
– Unknown source

Doug2 · 5 September 2002 23:49

Mauricio Fernández batsman.geo@yahoo.com wrote in message news:20020902071849.GA20445@kodos…

···

On Mon, Sep 02, 2002 at 11:40:43AM +0900, Doug Baskins wrote:
After a posting I did yesterday, I decided to take a closer look
at st_* library code. Please correct me if I am wrong in the
following discussion:
The st_* library is a collection of "string" to "word" mapping
routines.  BUT, only the POINTER to the string is stored in the
internal st_* structures.  This implies that the comparison of the
pointer is enough to determine if the string is a match.  Now we are
in JudyL's territory (pointer to pointer mapping) instead of JudySL
(string to pointer mapping).  JudyL has been tuned!!!  I wrote the
st_* routines using JudyL and wrote a benchmark to compare with the
ruby-1.6.7 version of "st.c".  Since, JudyL does not have
collisions, there is no need for a collision structure.
[snip]

IMHO the st_* routines are meant to be more general than simple string->word
mapping The key comparison function might be more complex than
bit-by-bit…

Ok, what could a compare routine do besides compare bit-by-bit in a hashed
environment?

Doug Baskins doug@sourcejudy.com

Mauricio_Fernndez · 6 September 2002 11:11

In Ruby any object can be the key; it is represented in C as a VALUE*
pointer (this would be the key). I suspect Ruby would call the <=>
method, which might be user-defined. I don’t have the source handy so
I can’t check this, but it’s a reasonable belief IMO.

Plus the user might define strange semantics for its comparison
functions, for example for cache data and the like.

···

— Doug Baskins doug@sourcejudy.com wrote:

Mauricio Fernández batsman.geo@yahoo.com wrote in message
news:20020902071849.GA20445@kodos…
On Mon, Sep 02, 2002 at 11:40:43AM +0900, Doug Baskins wrote:
After a posting I did yesterday, I decided to take a closer
look
at st_* library code. Please correct me if I am wrong in the
following discussion:
The st_* library is a collection of "string" to "word"
mapping
routines.  BUT, only the POINTER to the string is stored in
the
internal st_* structures.  This implies that the comparison
of the
pointer is enough to determine if the string is a match. 
Now we are
in JudyL's territory (pointer to pointer mapping) instead
of JudySL
(string to pointer mapping).  JudyL has been tuned!!!  I
wrote the
st_* routines using JudyL and wrote a benchmark to compare
with the
ruby-1.6.7 version of "st.c".  Since, JudyL does not have
collisions, there is no need for a collision structure.
[snip]

IMHO the st_* routines are meant to be more general than simple
string->word
mapping The key comparison function might be more complex
than
bit-by-bit…
Ok, what could a compare routine do besides compare bit-by-bit in a
hashed
environment?

Do You Yahoo!?
Yahoo! Finance - Get real-time stock quotes
http://finance.yahoo.com

Lyle_Johnson6 · 6 September 2002 13:51

Doug Baskins wrote:

IMHO the st_* routines are meant to be more general than simple string->word
mapping The key comparison function might be more complex than
bit-by-bit…

Ok, what could a compare routine do besides compare bit-by-bit in a hashed
environment?

The st_* routines are used in different ways. There are two special
cases: (1) where all the keys are assumed to be integers and (2) where
all the keys are assumed to be strings. Both of these methods are used
internally in different places and I would expect Judy Arrays to help
improve those cases (but I haven’t tried this).

Ruby’s Hash class, however, uses the third and most general case. For
this case, no assumption is made about the types of the keys (i.e. they
can be mixed). The keys are compared by calling the objects’ eql?
method. For example, during a lookup of the key “a”, Ruby will:

determine the hash value for a by invoking a’s “hash” method
look-up its bucket in the hash table
if there’s a collision, walk through the list of collisions
looking for the key by calling a.eql?(key).

So the short answer is, the compare routine (eql?) could do anything it
wants to. That’s not to say that Judy Arrays couldn’t also speed up this
more general Hash table too; I’m just trying to explain the extra work
it’s doing over the more special-purpose cases.

Hope this helps,

Lyle

Topic		Replies	Views
Judy inside st_* ruby-talk	0	66	1 September 2002
Ruby and Judy ruby-talk	2	113	26 August 2002
Ruby and Judy ruby-talk	28	184	6 September 2002
ANN: RJudy-0.1 - Judy Arrays for Ruby ruby-talk	16	175	2 September 2002
Making Ruby faster with Judy ruby-talk	1	106	27 February 2007

Time to retire st_* and replace it with Judy?

wc domnames.txt

grep -i sourcejudy domnames.txt

Hash_st domnames.txt

lines avg_linelen RAMused/line insert/line lookup/line lookup_random

Judy_st domnames.txt

lines avg_linelen RAMused/line insert/line lookup/line lookup_random

Related topics