Time to retire st_* and replace it with Judy?

After a posting I did yesterday, I decided to take a closer look
at st_* library code. Please correct me if I am wrong in the
following discussion:

The st_* library is a collection of "string" to "word" mapping
routines.  BUT, only the POINTER to the string is stored in the
internal st_* structures.  This implies that the comparison of the
pointer is enough to determine if the string is a match.  Now we are
in JudyL's territory (pointer to pointer mapping) instead of JudySL
(string to pointer mapping).  JudyL has been tuned!!!  I wrote the
st_* routines using JudyL and wrote a benchmark to compare with the
ruby-1.6.7 version of "st.c".  Since, JudyL does not have
collisions, there is no need for a collision structure.

Then I wrote a benchmark simulating st_* with JudyL inside:
<www.sourcejudy.com/benchmarks/st_compare.tar.gz>.

The results:

This is the output of 2 versions of st_compare when run on a 933Mhz P3,
one using “stock” st.c in ruby, the other using the simulation with
JudyL inside. “domnames.txt” is a text file of most/all of the
internet domain names.

wc domnames.txt

28826508 28826508 488227095 domnames.txt

That’s right 28.8 million names, let’s see if it is current:

grep -i sourcejudy domnames.txt

SOURCEJUDY.COM
SOURCEJUDY.ORG

Yep, not more than 2 months old.

Hash_st domnames.txt

lines avg_linelen RAMused/line insert/line lookup/line lookup_random

28826508 16.9 bytes 24.0 bytes 1.674 uS 0.570 uS 1.056 uS

Judy_st domnames.txt

lines avg_linelen RAMused/line insert/line lookup/line lookup_random

28826508 16.9 bytes 6.8 bytes 0.472 uS 0.154 uS 0.706 uS

Notice the Judy_st version used 1/2Gb less memory.
The performance improvement reminds me of the snake oil salesman.

There are still a lot of performance opportunities if st_* were retired.
My take on the st_* routines is they are based on a machine designed
a dozen years ago. They are not tuned for a modern machine.

Enough fun, back to getting Judy to work with ‘autoconf’ et.al.

Doug Baskins doug@sourcejudy.com

[snip]

IMHO the st_* routines are meant to be more general than simple string->word
mapping :frowning: The key comparison function might be more complex than
bit-by-bit…

···

On Mon, Sep 02, 2002 at 11:40:43AM +0900, Doug Baskins wrote:

After a posting I did yesterday, I decided to take a closer look
at st_* library code. Please correct me if I am wrong in the
following discussion:

The st_* library is a collection of "string" to "word" mapping
routines.  BUT, only the POINTER to the string is stored in the
internal st_* structures.  This implies that the comparison of the
pointer is enough to determine if the string is a match.  Now we are
in JudyL's territory (pointer to pointer mapping) instead of JudySL
(string to pointer mapping).  JudyL has been tuned!!!  I wrote the
st_* routines using JudyL and wrote a benchmark to compare with the
ruby-1.6.7 version of "st.c".  Since, JudyL does not have
collisions, there is no need for a collision structure.


_ _

__ __ | | ___ _ __ ___ __ _ _ __
'_ \ / | __/ __| '_ _ \ / ` | ’ \
) | (| | |
__ \ | | | | | (| | | | |
.__/ _,
|_|/| || ||_,|| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Never trust an operating system you don’t have sources for. :wink:
– Unknown source

Mauricio Fernández batsman.geo@yahoo.com wrote in message news:20020902071849.GA20445@kodos

···

On Mon, Sep 02, 2002 at 11:40:43AM +0900, Doug Baskins wrote:

After a posting I did yesterday, I decided to take a closer look
at st_* library code. Please correct me if I am wrong in the
following discussion:

The st_* library is a collection of "string" to "word" mapping
routines.  BUT, only the POINTER to the string is stored in the
internal st_* structures.  This implies that the comparison of the
pointer is enough to determine if the string is a match.  Now we are
in JudyL's territory (pointer to pointer mapping) instead of JudySL
(string to pointer mapping).  JudyL has been tuned!!!  I wrote the
st_* routines using JudyL and wrote a benchmark to compare with the
ruby-1.6.7 version of "st.c".  Since, JudyL does not have
collisions, there is no need for a collision structure.

[snip]

IMHO the st_* routines are meant to be more general than simple string->word
mapping :frowning: The key comparison function might be more complex than
bit-by-bit…

Ok, what could a compare routine do besides compare bit-by-bit in a hashed
environment?

Doug Baskins doug@sourcejudy.com

In Ruby any object can be the key; it is represented in C as a VALUE*
pointer (this would be the key). I suspect Ruby would call the <=>
method, which might be user-defined. I don’t have the source handy so
I can’t check this, but it’s a reasonable belief IMO.

Plus the user might define strange semantics for its comparison
functions, for example for cache data and the like.

···

— Doug Baskins doug@sourcejudy.com wrote:

Mauricio Fernández batsman.geo@yahoo.com wrote in message
news:20020902071849.GA20445@kodos

On Mon, Sep 02, 2002 at 11:40:43AM +0900, Doug Baskins wrote:

After a posting I did yesterday, I decided to take a closer
look
at st_* library code. Please correct me if I am wrong in the
following discussion:

The st_* library is a collection of "string" to "word"

mapping

routines.  BUT, only the POINTER to the string is stored in

the

internal st_* structures.  This implies that the comparison

of the

pointer is enough to determine if the string is a match. 

Now we are

in JudyL's territory (pointer to pointer mapping) instead

of JudySL

(string to pointer mapping).  JudyL has been tuned!!!  I

wrote the

st_* routines using JudyL and wrote a benchmark to compare

with the

ruby-1.6.7 version of "st.c".  Since, JudyL does not have
collisions, there is no need for a collision structure.

[snip]

IMHO the st_* routines are meant to be more general than simple
string->word
mapping :frowning: The key comparison function might be more complex
than
bit-by-bit…

Ok, what could a compare routine do besides compare bit-by-bit in a
hashed
environment?


Do You Yahoo!?
Yahoo! Finance - Get real-time stock quotes
http://finance.yahoo.com

Doug Baskins wrote:

IMHO the st_* routines are meant to be more general than simple string->word
mapping :frowning: The key comparison function might be more complex than
bit-by-bit…

Ok, what could a compare routine do besides compare bit-by-bit in a hashed
environment?

The st_* routines are used in different ways. There are two special
cases: (1) where all the keys are assumed to be integers and (2) where
all the keys are assumed to be strings. Both of these methods are used
internally in different places and I would expect Judy Arrays to help
improve those cases (but I haven’t tried this).

Ruby’s Hash class, however, uses the third and most general case. For
this case, no assumption is made about the types of the keys (i.e. they
can be mixed). The keys are compared by calling the objects’ eql?
method. For example, during a lookup of the key “a”, Ruby will:

  • determine the hash value for a by invoking a’s “hash” method
  • look-up its bucket in the hash table
  • if there’s a collision, walk through the list of collisions
    looking for the key by calling a.eql?(key).

So the short answer is, the compare routine (eql?) could do anything it
wants to. That’s not to say that Judy Arrays couldn’t also speed up this
more general Hash table too; I’m just trying to explain the extra work
it’s doing over the more special-purpose cases.

Hope this helps,

Lyle