Ruby Compile-time optimization

Hi,

···

In message “Re: Is Ruby slower?” on 03/02/28, Travis Whitton whitton@atlantic.net writes:

All it would take would be converting it over to fit the variables in st.c and
adding another preprocessor conditional such as #elif HASH_JENKINS so as not to
disturb the 3 other strhash implementations. Anyway, just an idea. You can read
about Bob Jenkins hashing function at
A Hash Function for Hash Table Lookup

I will examine it when I have time to do so. Probably we need to remove
the prime modulo from st.c first.

						matz.

I just did it, and got the GCLS test to be only 7% faster.
Even if I disable the GC, it is only ~25% faster. Perl is about 3 times
faster than Ruby :-|

···

On Fri, Feb 28, 2003 at 03:20:34AM +0900, Travis Whitton wrote:

Of course, to be fair, in its more traditional strong areas (regexp matching, hash
access, etc.), Perl shows an equal or better advantage.

All it would take would be converting it over to fit the variables in st.c and
adding another preprocessor conditional such as #elif HASH_JENKINS so as not to
disturb the 3 other strhash implementations. Anyway, just an idea. You can read
about Bob Jenkins hashing function at
A Hash Function for Hash Table Lookup


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Q: What’s the big deal about rm, I have been deleting stuff for years? And
never lost anything… oops!
A: …
– From the Frequently Unasked Questions

The below doesn’t appear to be the Jenkins hash. On Mr. Jenkins’ page,
he lists the below as the ‘One-at-a-Time’ hash. But it appears to be
rather swift as well.

···

Travis Whitton (whitton@atlantic.net) wrote:

It wouldn’t take much to use Bob Jenkins’ algorithm in Ruby. The current Perl
implementation looks like this:

#define PERL_HASH(hash,str,len) \
    STMT_START { \
        register const char *s_PeRlHaSh_tmp = str; \
        register const unsigned char *s_PeRlHaSh = (const unsigned char *)s_PeRlHaSh_tmp; \
        register I32 i_PeRlHaSh = len; \
        register U32 hash_PeRlHaSh = 0; \
        while (i_PeRlHaSh--) { \
            hash_PeRlHaSh += *s_PeRlHaSh++; \
            hash_PeRlHaSh += (hash_PeRlHaSh << 10); \
            hash_PeRlHaSh ^= (hash_PeRlHaSh >> 6); \
        } \
        hash_PeRlHaSh += (hash_PeRlHaSh << 3); \
        hash_PeRlHaSh ^= (hash_PeRlHaSh >> 11); \
        (hash) = (hash_PeRlHaSh + (hash_PeRlHaSh << 15)); \
    } STMT_END
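
For reference, the same one-at-a-time loop recast as a standalone C string hash in the style of st.c's strhash might look roughly like this (the name and signature below are illustrative, not the actual patch):

/* Sketch only: one-at-a-time over a NUL-terminated string, st.c-style.
 * The function name and signature are made up for illustration. */
static unsigned long
one_at_a_time_strhash(const char *string)
{
    register unsigned long hash = 0;

    while (*string) {
        hash += (unsigned char)*string++;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}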

“Yukihiro Matsumoto” matz@ruby-lang.org wrote in message
news:1046421624.833733.22618.nullmailer@picachu.netlab.jp…

Hi,

All it would take would be converting it over to fit the variables in st.c and
adding another preprocessor conditional such as #elif HASH_JENKINS so as not to
disturb the 3 other strhash implementations. Anyway, just an idea. You can read
about Bob Jenkins hashing function at
A Hash Function for Hash Table Lookup

I will examine it when I have time to do so. Probably we need to remove
the prime modulo from st.c first.

I have been working with Jenkins hash - I can’t say if it is the best, but it
appears to be pretty good.
I did an optimization on the hash function to operate on exactly 32 bits in
order to get a fast integer map that grows and shrinks dynamically and which
remembers the order of insertion - so it is an indexed list. I’m not sure if
it is any good in Ruby, but it is available upon request. It’s only really
faster than red/black trees (STL) when there are many elements (more than 1000)
and is possibly slightly slower with fewer than 100 elements.

Mikkel

···

In message “Re: Is Ruby slower?” > on 03/02/28, Travis Whitton whitton@atlantic.net writes:

I’m posting to ruby-core my tentative modification.
It seems to be faster than what we have currently.

···

On Fri, Feb 28, 2003 at 05:40:25PM +0900, Yukihiro Matsumoto wrote:

Hi,

In message “Re: Is Ruby slower?” > on 03/02/28, Travis Whitton whitton@atlantic.net writes:

All it would take would be converting it over to fit the variables in st.c and
adding another preprocessor conditional such as #elif HASH_JENKINS so as not to
disturb the 3 other strhash implementations. Anyway, just an idea. You can read
about Bob Jenkins hashing function at
A Hash Function for Hash Table Lookup

I will examine it when I have time to do so. Probably we need to remove
the prime modulo from st.c first.


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

This is a scsi driver, scares the shit out of me, therefore I tapdanced
and wrote a unix clone around it (C) by linus
– Somewhere in the kernel tree

Spoke too soon!

batsman@tux-chan:~/src/cvs/ruby-1.6.8.jenkins$ cat /tmp/hash3.rb
#!/usr/local/bin/ruby
# -*- mode: ruby -*-
# $Id: hash2.ruby,v 1.2 2001/05/16 16:17:08 doug Exp $
# http://www.bagley.org/~doug/shootout/

n = Integer(ARGV.shift || 1)

hash1 = {}
for i in 0 .. 9999
    hash1["foo_" << i.to_s] = i
end

hash2 = Hash.new(0)
n.times do
    for k in hash1.keys
        hash2[k] += hash1[k]
    end
end

printf "%d %d %d %d\n",
    hash1["foo_1"], hash1["foo_9999"], hash2["foo_1"], hash2["foo_9999"]

batsman@tux-chan:~/src/cvs/ruby-1.6.8.jenkins$ ruby -v
ruby 1.6.8 (2002-12-24) [i386-linux]
batsman@tux-chan:~/src/cvs/ruby-1.6.8.jenkins$ time ruby /tmp/hash3.rb 10
1 9999 10 99990

real 0m0.417s
user 0m0.390s
sys 0m0.010s
batsman@tux-chan:~/src/cvs/ruby-1.6.8.jenkins$ time ruby /tmp/hash3.rb 100
1 9999 100 999900

real 0m3.383s
user 0m3.380s
sys 0m0.000s
batsman@tux-chan:~/src/cvs/ruby-1.6.8.jenkins$ time ./ruby /tmp/hash3.rb 10
1 9999 10 99990

real 0m0.299s
user 0m0.290s
sys 0m0.000s
batsman@tux-chan:~/src/cvs/ruby-1.6.8.jenkins$ time ./ruby /tmp/hash3.rb 100
1 9999 100 999900

real 0m2.302s
user 0m2.300s
sys 0m0.010s

The results indicate consistently about 25-30% speed increase!
Mapping that to the results in
http://www.bagley.org/~doug/shootout/bench/hash2/
Ruby would go up 4 ranks and outperform Python, to place itself just
behind Perl.

Is there any other easy optimization left?

···

On Sat, Mar 01, 2003 at 06:23:18AM +0900, Mauricio Fernández wrote:

On Fri, Feb 28, 2003 at 03:20:34AM +0900, Travis Whitton wrote:

Of course, to be fair, in its more traditional strong areas (regexp matching, hash
access, etc.), Perl shows an equal or better advantage.

All it would take would be converting it over to fit the variables in st.c and
adding another preprocessor conditional such as #elif HASH_JENKINS so as not to
disturb the 3 other strhash implementations. Anyway, just an idea. You can read
about Bob Jenkins hashing function at
A Hash Function for Hash Table Lookup

I just did it, and got the GCLS test to be only 7% faster.
Even if I disable the GC, it is only ~25% faster. Perl is about 3 times
faster than Ruby :-|


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Turn right here. No! NO! The OTHER right!

I have been working with Jenkins hash - I can’t say if it is the best, but it
appears to be pretty good.
I did an optimization on the hash function to operate on exactly 32 bits in
order to get a fast integer map that grows and shrinks dynamically and which
remembers the order of insertion - so it is an indexed list.

We can use it to make sparse arrays if it’s faster than st.c…

···

On Fri, Feb 28, 2003 at 07:02:58PM +0900, MikkelFJ wrote:

I’m not sure if
it is any good in Ruby, but it is available upon request. It’s only really
faster than red/black trees (STL) when there are many elements (more than 1000)
and is possibly slightly slower with fewer than 100 elements.


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

‘Ooohh… “FreeBSD is faster over loopback, when compared to Linux
over the wire”. Film at 11.’
– Linus Torvalds

Cool!
I’d love to see this implemented. Going up 4 rankings would help dispel the
impression that Ruby is slow.
Go Ruby!

···

On Sat, Mar 01, 2003 at 06:59:35AM +0900, Mauricio Fernández wrote:

The results indicate consistently about 25-30% speed increase!
Mapping that to the results in
http://www.bagley.org/~doug/shootout/bench/hash2/
Ruby would go up 4 ranks and outperform Python, to place itself just
behind Perl.


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

looking at
http://bulknews.net/lib/doc-ja/perldelta.html#Performance_Enhancements

it seems to me that the adoption of one-at-a-time hashes doesn’t
affect Perl’s performance much.
Could it be that your results are tainted by something external?

or
the rest of the Perl interpreter is taking away all the advantages of the
change?

or
I didn’t get a clue about what’s happening and about the things I’m
reading? (my bet is on this)

···

On Sat, 1 Mar 2003 06:59:35 +0900, Mauricio Fernández batsman.geo@yahoo.com wrote:

The results indicate consistently about 25-30% speed increase!
Mapping that to the results in
http://www.bagley.org/~doug/shootout/bench/hash2/
Ruby would go up 4 ranks and outperform Python, to place itself just
behind Perl.

Is there any other easy optimization left?

Give FNV hashing a shot too
(http://www.isthe.com/chongo/tech/comp/fnv/). It looks like it’s
faster than Jenkins’ hash on Intel for small keys. (It’s slower on a
Sun.) What I’m still curious about is the effects of collisions in
FNV vs Jenkins’ hash.
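
For comparison, 32-bit FNV-1a is tiny; a sketch using the offset basis and prime published on that page (an illustration only, not a drop-in st.c replacement):

/* 32-bit FNV-1a over a NUL-terminated string.
 * 2166136261 is the FNV offset basis, 16777619 the FNV prime. */
static unsigned long
fnv1a_strhash(const char *string)
{
    unsigned long hash = 2166136261UL;

    while (*string) {
        hash ^= (unsigned char)*string++;
        hash *= 16777619UL;
    }
    return hash;
}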

hi,

i’m thinking of using ruby as a scripting
language for a multimedia system that i’m
writing in C++. not to go into too much detail,
the general idea is to have sound and image synthesis
very closely tied, sharing buffers and what not. it’s still very very much
in the design stages.

i would like it to eventually be capable of real-time and non-real-time
operation. my question is about real time useage, because for
non real time stuff ruby would be perfect i’m sure.

for real time use i would need millisecond accurate timing,
and i’m concerned about unwanted dropouts and timing
variations imposed by memory management and
threading in the ruby interpreter.

i’ve also considered three implementation options and am leaning
strongly towards the last:

  • embedding the language interpreter into the app
  • extending the interpreter, using ruby’s bindings to C
  • having the ruby interpreter as a network client to a media server written entirely in C++

just wondering if anybody could give any advice
on which route they would recommend, and also give some
general advice on ruby’s gc and threading, all in the light
of my real-time requirements.

anybody done anything remotely similar in ruby?

thanks much,
c

I double-checked them and… you’re right :(

I was comparing my modified version against the system’s, which, on
Debian, is compiled for i386 and is much slower.

Working a little bit more on this, it turns out (read ruby-core for more
info) that:

  • one-at-a-time is slower than the hash we’re using now
  • Jenkins should be even worse as the additive constant to its
    complexity is higher
  • however, a perf. increase of 1-3% can be achieved by setting the
    number of bins in the hash to be a power of 2, so that we can later
    use bitwise AND instead of the modulus operator (a sketch follows after this list) [I assume our hash is good enough not to increase the number of collisions a lot]
  • some additional 1-3% could be gained by unrolling the loop, either
    using -funroll-loops (w/ gcc) or by applying the technique described
    in [ruby-core:898]
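
Roughly, the bin-indexing change from the third point amounts to something like this (num_bins and hash_val are made-up names, not the actual st.c identifiers):

/* prime-sized table: one integer division/modulo per lookup */
unsigned int bin_pos_prime(unsigned int hash_val, unsigned int num_bins)
{
    return hash_val % num_bins;       /* num_bins is prime, e.g. 11, 19, 37, ... */
}

/* power-of-two table: a single AND, provided the hash mixes its low bits well */
unsigned int bin_pos_pow2(unsigned int hash_val, unsigned int num_bins)
{
    return hash_val & (num_bins - 1); /* num_bins must be a power of 2 */
}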
···

On Sun, Mar 02, 2003 at 10:48:44AM +0900, gabriele renzi wrote:

On Sat, 1 Mar 2003 06:59:35 +0900, Mauricio Fernández batsman.geo@yahoo.com wrote:

The results indicate consistently about 25-30% speed increase!
Mapping that to the results in
http://www.bagley.org/~doug/shootout/bench/hash2/
Ruby would go up 4 ranks and outperform Python, to place itself just
behind Perl.

Is there any other easy optimization left?

looking at
http://bulknews.net/lib/doc-ja/perldelta.html#Performance_Enhancements

it seems to me that the adoption of one-at-a-time hashes doesn’t
affect Perl’s performance much.
Could it be that your results are tainted by something external?


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

We come to bury DOS, not to praise it.
– Paul Vojta, vojta@math.berkeley.edu

“Mauricio Fernández” batsman.geo@yahoo.com wrote in message
news:20030302083833.GA8115@student.ei.uni-stuttgart.de

it seems to me that the adoption of one-at-a-time hashes doesn’t

What does “one at a time” mean? One character at a time, as opposed to Jenkins’
4 characters at a time?

  • Jenkins should be even worse as the additive constant to its
    complexity is higher
  • however, a perf. increase of 1-3% can be achieved by setting the
    number of bins in the hash to be a power of 2, so that we can later
    use bitwise AND instead of the modulus operator [I assume our hash is good enough not to increase the num. of collisions a lot]

There is some work going on in Jenkins hash, but it happens at the best
possible location: inside the processor and at least it attempts to exploit
parallelism. Modern processors are much faster than memory. If the hash avoids
an additional memory access every so often, it may pay off decently. I would
estimate the benefits of any good hash will only really show up in large
hash tables. This is because a cache-miss is expensive (many hundred
instruction cycles). An extra memory access in a small hash table is likely
to happen in memory that is already conveniently in cache.

If the bucket size is large you get fewer collisions, but you also increase
the risk of cache-misses.

This must, however, be held against the cache-misses of reading and
comparing the actual keys if not stored inside the bucket itself.

If the full 32-bit (or whatever) hash is compared first (i.e. store the full
hash key in the bucket), the bucket count does not affect the number of
external keys that must be accessed - only the quality of the hash itself.
This technique also makes it cheaper to perform a rehash operation when
expanding the bucket table.
(I sent a copy of my hashtable in private mail). It only operates on 32 bit
keys (e.g. storing pointers to already internalized strings), but it uses
the above mentioned principle of storing the hash and could be enhanced to
operate on longer keys without much effort.

The avoidance of modulus prime is good both for avoiding the calculation,
and because it makes it much easier to scale the bucket size on demand -
thus avoiding large initial bucket tables - which in turn makes conflict
resolution cheaper, although more frequent.

For small hashtables it may not pay off with a good, but more expensive hash
function. When frequently accessed, everything, including keys, will be
in-cache, and the statistical spread over buckets is probably unpredictable.

Mikkel

Hi,

  • however, a perf. increase of 1-3% can be achieved by setting the
    number of bins in the hash to be a power of 2, so that we can later
    use bitwise AND instead of the modulus operator [I assume our hash is good enough not to increase the num. of collisions a lot]

Sorry if this contribution is “newbie-like” as I’m not an expert
in any kind of mathematical/analytical field, but… Just wanted
to mention that my Sedgewick algorithms book goes into a bit of
detail on why prime numbers should generate better distributed
hash values using the modulo operator than powers-of-two…
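
As a tiny standalone illustration of that point (the hash values below are made up so that they differ only above the low 6 bits):

#include <stdio.h>

int main(void)
{
    /* with a power-of-two table, hash % 64 keeps only the low 6 bits,
     * so these all land in the same bin; a prime modulus such as 61
     * also folds in the higher bits */
    unsigned int hashes[] = { 0x12340, 0x56740, 0xabc40 };
    int i;

    for (i = 0; i < 3; i++)
        printf("0x%05x %% 64 = %2u    0x%05x %% 61 = %2u\n",
               hashes[i], hashes[i] % 64, hashes[i], hashes[i] % 61);
    return 0;
}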

HTH,

Bill

···

From: “Mauricio Fernández” batsman.geo@yahoo.com

“Mauricio Fernández” batsman.geo@yahoo.com wrote in message
news:20030302083833.GA8115@student.ei.uni-stuttgart.de

it seems to me that the adoption of one-at-a-time hashes doesn’t

What does “one at a time” mean? One character at a time, as opposed to Jenkins’
4 characters at a time?

That’s the name given to Perl’s current hash function on Bob Jenkins’
webpage.

  • Jenkins should be even worse as the additive constant to its
    complexity is higher
  • however, a perf. increase of 1-3% can be achieved by setting the
    number of bins in the hash to be a power of 2, so that we can later
    use bitwise AND instead of the modulus operator [I assume our hash is good enough not to increase the num. of collisions a lot]

There is some work going on in Jenkins hash, but it happens at the best
possible location: inside the processor and at least it attempts to exploit
parallelism. Modern processors are much faster than memory. If the hash avoids
an additional memory access every so often, it may pay off decently. I would
estimate the benefits of any good hash will only really show up in large
hash tables. This is because a cache-miss is expensive (many hundred
instruction cycles). An extra memory access in a small hash table is likely
to happen in memory that is already conveniently in cache.

This (as several other things) should perhaps be adjusted at run-time
with a built-in class

Tuning.use_hash = Tuning::HASH_JENKINS
Tuning.gc_growth_const = 1.9

etc

We cannot optimize for both small (the most usual) and big (where speed
matters the most) hashes at the same time.

It probably makes more sense for the GC than for the hash, as you can
always use RJudy or whatever instead of the built-in hashes…

If the bucket size is large you get fewer collisions, but you also increase
the risk of cache-misses.

This must, however, be held against the cache-misses of reading and
comparing the actual keys if not stored inside the bucket itself.

If the full 32-bit (or whatever) hash is compared first (i.e. store the full
hash key in the bucket), the bucket count does not affect the number of
external keys that must be accessed - only the quality of the hash itself.
This technique also makes it cheaper to perform a rehash operation when
expanding the bucket table.

This is what happened when I modified st.c to use Judy; then the full
hash value was used and there just were no collisions. I thought it
would be faster than the current implementation for the reason you give
but it wasn’t… I’ll give it a try again with Jenkins’ hash some time
to see if there’s any change.

···

On Mon, Mar 03, 2003 at 01:10:08AM +0900, MikkelFJ wrote:

(I sent a copy of my hashtable in private mail). It only operates on 32 bit
keys (e.g. storing pointers to already internalized strings), but it uses
the above mentioned principle of storing the hash and could be enhanced to
operate on longer keys without much effort.

The avoidance of modulus prime is good both for avoiding the calculation,
and because it makes it much easier to scale the bucket size on demand -
thus avoiding large initial bucket tables - which in turn makes conflict
resolution cheaper, although more frequent.

For small hashtables it may not pay off with a good, but more expensive hash
function. When frequently accessed, everything, including keys, will be
in-cache, and the statistical spread over buckets is probably unpredictable.


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

But what can you do with it?
– ubiquitous cry from Linux-user partner

“Mauricio Fernández” batsman.geo@yahoo.com wrote in message
news:20030302083833.GA8115@student.ei.uni-stuttgart.de

it seems to me that the adoption of one-at-a-time hashes doesn’t

What does “one at a time” mean? One character at a time, as opposed to Jenkins’
4 characters at a time?

Yes, I suppose.

There is some work going on in Jenkins hash, but it happens at the best
possible location: inside the processor and at least it attempts to exploit
parallelism. Modern processors are much faster than memory. If the hash avoids
an additional memory access every so often, it may pay off decently. I would
estimate the benefits of any good hash will only really show up in large
hash tables.
This is because a cache-miss is expensive (many hundred
instruction cycles). An extra memory access in a small hash table is likely
to happen in memory that is already conveniently in cache.

I don’t grok what you mean about the cache :(
It seems to me that 1-at-a-time doesn’t depend on accessing memory more
than Jenkins.
Using a big hash would stress the caching system, but then we wouldn’t be
measuring the hash performance.

Probably I misunderstood something; would you please explain to me what
problem this algorithm could have with the cache stuff?

I agree that maybe Jenkins could exploit a modern CPU’s superscalar arch,
but we would miss the ability to inline the code, which reduces the number of
operations from 9n to 5n.
And
(I didn’t pass my CPU-related exam so well, so maybe I’m
wrong) working on 1 char at a time won’t damage the
parallelism; the CPU could work on more than one char at a time
anyway.

It seems that 1-at-a-time scales well, and that it is actually better
than what we have in current Ruby; at least it gets fewer collisions.

And well, the Perl guys wouldn’t have put it in 5.8 if it wasn’t somehow
better than the old one :)

···

On Sun, 2 Mar 2003 17:12:28 +0100, “MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote:

If the bucket size is large you get fewer collisions, but you also increase
the risk of cache-misses.

This must, however, be held against the cache-misses of reading and
comparing the actual keys if not stored inside the bucket itself.

If the full 32-bit (or whatever) hash is compared first (i.e. store the full
hash key in the bucket), the bucket count does not affect the number of
external keys that must be accessed - only the quality of the hash itself.
This technique also makes it cheaper to perform a rehash operation when
expanding the bucket table.
(I sent a copy of my hashtable in private mail). It only operates on 32 bit
keys (e.g. storing pointers to already internalized strings), but it uses
the above mentioned principle of storing the hash and could be enhanced to
operate on longer keys without much effort.

The avoidance of modulus prime is good both for avoiding the calculation,
and because it makes it much easier to scale the bucket size on demand -
thus avoiding large initial bucket tables - which in turn makes conflict
resolution cheaper, although more frequent.

For small hashtables it may not pay off with a good, but more expensive hash
function. When frequently accessed, everything, including keys, will be
in-cache, and the statistical spread over buckets is probably unpredictable.

Mikkel

Hi,

From: “Mauricio Fernández” batsman.geo@yahoo.com

  • however, a perf. increase of 1-3% can be achieved by setting the
    number of bins in the hash to be a power of 2, so that we can later
    use bitwise AND instead of the modulus operator [I assume our hash is good enough not to increase the num. of collisions a lot]

Sorry if this contribution is “newbie-like” as I’m not an expert
in any kind of mathematical/analytical field, but… Just wanted
to mention that my Sedgewick algorithms book goes into a bit of
detail on why prime numbers should generate better distributed
hash values using the modulo operator than powers-of-two…

Mathematically, that is true. However, if you analyze the computational
aspects, the impact is not obvious, as you are decreasing collisions at the
cost of a more expensive hash generation algorithm. I’m not saying that it
definitely wouldn’t pay off, but the answer is not obvious, and my belief is
that there would be a small net loss.

···

On Sunday 02 March 2003 05:26 pm, Bill Kelly wrote:

HTH,

Bill


Seth Kurtzberg
M. I. S. Corp.
480-661-1849
seth@cql.com

It seems that 1-at-a-time scales well, and that it is actually better
than what we have in current Ruby; at least it gets fewer collisions.

Well, I don't agree with you.

With the example given, the OZ sdbm hash:

mesange% ./ruby b.rb 100
1 9999 100 999900
mesange% cat /tmp/col
collision: 3187721
mesange%

With one-at-a-time

mesange% ./ruby b.rb 100
1 9999 100 999900
mesange% cat /tmp/col
collision: 3195197
mesange%

OZ has fewer collisions when it works with a table based on prime numbers.
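
For anyone who wants to play with this outside the interpreter, a standalone sketch along these lines counts insertion collisions for the two hash functions over the benchmark's foo_0 .. foo_9999 keys (this is not the instrumentation used for the numbers above, and the table size is made up):

#include <stdio.h>

static unsigned long sdbm_hash(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = (unsigned char)*s++ + (h << 6) + (h << 16) - h;
    return h;
}

static unsigned long one_at_a_time(const char *s)
{
    unsigned long h = 0;
    while (*s) {
        h += (unsigned char)*s++;
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

int main(void)
{
    enum { NBINS = 16381, NKEYS = 10000 };   /* 16381 is prime */
    static int bins[2][NBINS];               /* zero-initialized */
    long collisions[2] = { 0, 0 };
    char key[32];
    int i;

    for (i = 0; i < NKEYS; i++) {
        sprintf(key, "foo_%d", i);
        collisions[0] += bins[0][sdbm_hash(key) % NBINS]++ ? 1 : 0;
        collisions[1] += bins[1][one_at_a_time(key) % NBINS]++ ? 1 : 0;
    }
    printf("sdbm: %ld  one-at-a-time: %ld\n", collisions[0], collisions[1]);
    return 0;
}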

And well, the Perl guys wouldn't have put it in 5.8 if it wasn't somehow
better than the old one :)

Who wants to trust P languages?

Guy Decoux

“gabriele renzi” surrender_it@remove.yahoo.it wrote in message
news:9ed46v0r2nn4r7ehchcl6s4ci8t7ccl23h@4ax.com

This is because a cache-miss is expensive (many hundred
instruction cycles). An extra memory access in a small hash table is
likely
to happen in memory that is already conveniently in cache.

I don’t grok what you mean about the cache :(

I’ll try to explain although I’m not sure what part you are missing. You are
probably aware of most of what I cover:

  1. how expensive memory access can be

  2. when memory access happens.

  3. I’m no expert on cache systems, but there are several layers of cache
    starting from virtual on-disk memory through TLB hits/misses (the virtual
    memory addressing system) to 2nd level cache, 1st level cache and actual CPU
    registers. The further down the pipeline, the more costly the access. I
    believe an in-memory cache miss can range from hundreds to thousands of
    missed instructions, and paged memory is off the scale. In comparison, a
    major cost of process context switching seems to be the flush of the TLB
    cache. Memory in 1st level cache may be as fast as CPU registers.

  4. There are two major ways of avoiding cache penalties: a) keeping your data
    smaller such that more fits into a cache closer to the CPU. b) locality of
    reference - i.e. the next access is cheap if close to a previous access and
    therefore will be loaded into fast cache along with the first access.

There are different hash table implementation strategies, but a hash works by
spreading data all over the place.

First a bucket is located in a bucket table. This is a once-per-lookup
operation. If the bucket table is small and there are many lookups, there is
a good chance that the next lookup will find a bucket in fast cache and
avoid the many hundreds of instruction cycles lost on a failed cache access.

Once the bucket is located, there is a list of possible matches, all with the
same hash key. The better the hash, the shorter the list. The list is
typically searched from head to tail. Each visit risks a cache miss. The
good thing is that all entries can be stored close to each other in an
array, so the risk of a cache miss is proportional to the actual data
stored. This is contrary to the bucket table, where the risk is proportional
to the total number of buckets. Fortunately the bucket table can be as small
as a single pointer. It therefore makes sense to use about as much memory on
buckets as on stored hash entries. If the entry table is allocated in one
block, it will on average be 50%-75% full, depending on the expansion scheme.
Thus an easy option is to set the bucket table to 50% of the entry table, but
this can only be tuned by careful performance measurements as there are many
factors. A bucket count equal to the average load of the entry table will on
average give 1 entry per bucket and put an equal load on the cache as the
entry table - a larger bucket table may just consume too much memory and
reduce cache efficiency - but this also depends on the quality of the hash key.

The next cache problem is the number of key comparison operations. If the
bucket size is chosen as mentioned above, we will only expect slightly more
than one entry per bucket on average, although collisions will occur -
especially with a bad hash. For the purpose of illustration, assume a
collision list on average is 4 entries long, then on average 2 keys must be
examined for each lookup. If the key is small enough to be stored in the
hash entry itself, this comparison is cheap. If the key is stored in
external data, each key may potentially be a cache miss as well. An external
key may also be slow to compare for other reasons than a potential cache miss:
long string comparisons, function calls to a compare function, etc., so
collisions should be avoided.

One technique to avoid accessing external keys more often than necessary is
to compare the full hash key before visiting the key: typically a hash key
is 32 bits and is then crunched to the bucket table size, either using modulo
prime or, in the case of Jenkins, the much faster modulo power of two. By
storing the full hash key in the hash entry, the collision list is effectively
made much shorter, because the risk of collisions in the full hash is much
smaller than the risk of collisions in the reduced bucket index. It is very
cheap to compare the full hash if it is the original output of the hash
operation and fits a machine word (or two). We still may need to visit all
entries in the collision list, but probably only compare a single external key.
Storing the hash key in the hash entry makes the entry larger and therefore
increases the risk of cache misses. Storing the hash key has another
benefit: growing the hash table can avoid recalculating the hash keys. A
typical hash table entry would therefore be:

<hash-key, key or link to key, data, next-collision link>

or the minimum: <link to key and data, next-collision link>
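
As a rough C sketch of that layout, together with a lookup that compares the stored full hash before touching the external key (all names here are illustrative, not taken from st.c or any other particular implementation):

#include <string.h>

struct hash_entry {
    unsigned long      hash;   /* full 32-bit hash stored in the entry */
    const char        *key;    /* link to the (external) key */
    void              *data;
    struct hash_entry *next;   /* next-collision link */
};

static void *
lookup(struct hash_entry **bins, unsigned long nbins_pow2,
       const char *key, unsigned long hash)
{
    struct hash_entry *e = bins[hash & (nbins_pow2 - 1)];

    for (; e != NULL; e = e->next) {
        /* cheap machine-word compare first; only touch the external
         * key when the full hashes already match */
        if (e->hash == hash && strcmp(e->key, key) == 0)
            return e->data;
    }
    return NULL;
}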

Or, to cut a long story short: the CPU is much faster than memory, so any
performance optimization should first and foremost minimize the number of
memory accesses that are far apart. If the typical case is access to a
single location, or access to data in the same cache, a complex algorithm
reducing memory accesses may be too expensive. Notably a tight loop is
faster than most things. Code is also subject to cache misses in the
instruction pipeline so an algorithm can get too complex.

It seems to me that 1-at-a-time doesn’t depend on accessing memory more
than Jenkins.

This is true. As soon as the first bit of the key is peeked at, the entire
key is probably already in fast cache. This is an example of locality of
reference. Therefore it probably doesn’t really matter that much. You are
looping more frequently, but then there is predictive branching and Jenkins
has other overhead. Loop unrolling is difficult on variable-length keys, but
the compiler might unroll one-at-a-time to handle an entire word per loop.
Thus the result is blurred. As long as the end result is a high quality hash
and the hash isn’t too slow, it may be more important how many collisions
can be avoided.

You simply have to performance test.

Using a big hash would stress the caching system, but then we wouldn’t be
measuring the hash performance.

Not in a major way, perhaps you are missing the point?
I’m talking about using the 32-bit (machine word) value that typically results
from a hash function, not a full MD5 signature (although in some cases that might
be relevant). It is true that storing the hash key takes up more memory and
it may not necessarily pay off as discussed above.

Probably I misunderstood something; would you please explain to me what
problem this algorithm could have with the cache stuff?

The quality of the hash key affects the number of memory accesses and
therefore the risk of cache misses.

I agree that maybe Jenkins could exploit a modern CPU’s superscalar arch,
but we would miss the ability to inline the code, which reduces the number of
operations from 9n to 5n.

Perhaps - but you could specialize the hash for short keys, as I did for 32-bit
keys. A lot of keywords are short. Jenkins did create the core Mix operation
as a macro. But your argument may be valid and performance tests are, as
usual, in order. Indeed I have had a very fast data structure lose 50% or
more of its performance on a lost inline opportunity in the lookup method
(new processors may change this - IA64 has fast function calls using a
rolling stack).

One thing that a typical performance test will not capture is the number of
cache misses in real-life operations, because the test is constantly banging
against the same data structure and therefore downplays the risk of cache
misses. This will give a worse but faster hash key an advantage, but it is
equally likely to overestimate the need for large hash tables where the
quality probably matters the most.

And
(I didn’t pass my CPU-related exam so well, so maybe I’m
wrong) working on 1 char at a time won’t damage the
parallelism; the CPU could work on more than one char at a time
anyway.

I’m not up to speed on the latest CPU architectures, so I can only produce a
qualified guess. I believe your point is not an unrealistic assumption for
modern architectures. Some older processors perform much worse on unaligned
memory access (and some do not allow it, forcing the compiler to insert
shift and mask instructions in the code). And some operate only on 8 bit
anyway (microcontrollers, still going strong). The faster the processor, the
less important the parallelism is (and hash function performance generally)
because it is going to wait on the memory.

By the way, speaking of architectures, someone made the point that DDR RAM
sometimes results in poor performance because it supposedly isn’t very
effective at meeting random access requests. That is, streamed
access as in video streaming is fast, but hash table lookups may be poor.

Also, don’t forget to add the cost of taking modulo prime on the final hash
result in the non-Jenkins case. The modulo prime can be expensive. If the
search key is short and the lookup hits perfect cache conditions, the
modulo may take significant time - but then the latest processors may
already have changed that.

It seems that 1-at-a-time scales well, and that it is actually better
than what we have in current Ruby; at least it gets fewer collisions.

I thought Jenkins carefully chose his function to produce fewer collisions,
but that may be a tradeoff with avoiding the use of modulo prime.

And well, the Perl guys wouldn’t have put it in 5.8 if it wasn’t somehow
better than the old one :)

Again, the hash key calculation cost is probably only a significant factor
for small, frequently used hash tables.

In conclusion, there are so many different operating conditions that it is
difficult to predict what hash function will be best, or even if a hash
table is the best overall.

I have been working a lot with variants of B-Trees because they are more
than adequately fast in most scenarios and scale pretty well when cache
constraints start to be significant, even (or especially) on-disk if
allocation is done appropriately. A lot of people claim that the much
simpler skip-list data structure performs better than B-Trees. It doesn’t.
Someone made the counter-argument that B-Trees are faster due to the poor
cache behavior of skip-lists and backed it up with tests. I reproduced the
tests by implementing a skip-list and comparing it to a B-Tree implementation
I had already made. The scan through a buffer of a small-fanned B-Tree node is
fast both in terms of CPU and memory cache. Therefore I have great respect
for cache behavior in data structures.

Mikkel
