String Hashing Algorithms

Summary
=============================================================
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.

Background
=============================================================

At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.

As a result, I wanted to do a preliminary investigation into various
string hashing algorithms, to see how limiting that would be. (How
often would we need to rename a bunch of our strings just to get around
collisions?) Given the nature of hashing algorithms, it was important
that this be based on ~real world strings.

Methodology
=============================================================

I wrote a Ruby script to scan all our (C++) source code and find ANY
words that were valid C++ identifiers. This also meant that I grabbed
all such words from inside comment blocks, as well as C++ keywords, but
that was OK for this back-of-the-envelope testing. I ended up with
almost 43,000 unique identifiers.

I then ran through the list of words and used 7 different string
hashing algorithms (which I ported from the C code I found on a web
page[1]) to compute different hash keys. I used a Set (per hashing
algorithm) to store these keys, and at the end compared the number of
keys in the Set with the number of identifiers fed in.
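The counting loop itself is simple. Here's a minimal sketch of that harness (the identifier scanning is omitted, and the method name and toy word list are my own, not from the real script):

```ruby
require 'set'

# Count collisions for one hashing function over a word list.
# Returns [collisions, coverage_percent].
def measure(words)
  keys = Set.new
  words.each { |w| keys << yield(w) }
  collisions = words.size - keys.size
  [collisions, 100.0 * keys.size / words.size]
end

# Toy stand-in for the real identifier list, with a deliberately weak hash:
words = %w[foo bar baz foo_bar quux]
collisions, coverage = measure(words) { |w| w.sum % 7 }
puts format("%.4f percent coverage (%d collisions out of %d)",
            coverage, collisions, words.size)
```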

The Results
=============================================================

Three of the hashing algorithms ended up with no collisions whatsoever,
and one of these (SDBM) is really simple and fast (in C).

Here is the output of my test:
djb  ::  99.8601 percent coverage (60 collisions out of 42884)
elf  ::  99.5430 percent coverage (196 collisions out of 42884)
sdbm :: 100.0000 percent coverage (0 collisions out of 42884)
js   ::  99.9044 percent coverage (41 collisions out of 42884)
ap   :: 100.0000 percent coverage (0 collisions out of 42884)
bkdr ::  99.9953 percent coverage (2 collisions out of 42884)
rs   :: 100.0000 percent coverage (0 collisions out of 42884)
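For context: if the 31-bit keys were drawn uniformly at random, the birthday approximation predicts well under one collision for ~43,000 strings, so the weaker algorithms above are doing noticeably worse than chance. A quick sketch of that estimate (my own back-of-envelope, not part of the original test):

```ruby
# Expected number of colliding pairs when n keys are drawn uniformly
# at random from m possible values (birthday approximation n(n-1)/2m).
def expected_collisions(n, m)
  n * (n - 1) / (2.0 * m)
end

n = 42_884 # unique identifiers tested
m = 2**31  # keys are masked down to 31 bits
puts expected_collisions(n, m) # roughly 0.43
```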

The Algorithms
=============================================================

module HashEm
  # Masks hashes to a non-negative 31-bit value (despite the name, this
  # is not a short). Note: str[i] returns the byte value under Ruby 1.8.
  SIGNEDSHORT = 0x7FFFFFFF

  def self.rs( str, len=str.length )
    a,b = 63689,378551
    hash = 0
    len.times{ |i|
      hash = hash*a + str[i]
      a *= b
    }
    hash & SIGNEDSHORT
  end

  def self.js( str, len=str.length )
    hash = 1315423911
    len.times{ |i|
      hash ^= ( ( hash << 5 ) + str[i] + ( hash >> 2 ) )
    }
    hash & SIGNEDSHORT
  end

  def self.elf( str, len=str.length )
    hash = 0
    x = 0
    len.times{ |i|
      hash = (hash << 4) + str[i]
      if (x = hash & 0xF0000000) != 0
        hash ^= (x >> 24)
        hash &= ~x
      end
    }
    hash & SIGNEDSHORT
  end

  def self.bkdr( str, len=str.length )
    seed = 131 # 31 131 1313 13131 131313 etc..
    hash = 0
    len.times{ |i|
      hash = ( hash*seed )+str[i]
    }
    hash & SIGNEDSHORT
  end

  def self.sdbm( str, len=str.length )
    hash = 0
    len.times{ |i|
      hash = str[i] + ( hash << 6 ) + ( hash << 16 ) - hash
    }
    hash & SIGNEDSHORT
  end

  def self.djb( str, len=str.length )
    hash = 5381
    len.times{ |i|
      hash = ((hash << 5) + hash) + str[i]
    }
    hash & SIGNEDSHORT
  end

  def self.ap( str, len=str.length )
    hash = 0
    len.times{ |i|
      if (i & 1) == 0
        hash ^= (hash << 7) ^ str[i] ^ (hash >> 3)
      else
        hash ^= ~( (hash << 11) ^ str[i] ^ (hash >> 5) )
      end
    }
    hash & SIGNEDSHORT
  end
end

[1] http://www.code-source.org/algorithm.php?id=62

Phrogz wrote:

Summary

A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.

Is ruby's own String#hash one of the ones you tested? If not, how does it compare?

Just out of curiosity, but have you considered some other
structures/algorithms which might be alternatives depending on your
usage? Off the top of my head I can think of:

* Trie -- Should find other strings really fast, but gets pretty big the
more strings you need to store. There's a C library for this at
http://www.octavian.org/cs/software.html
* PATRICIA -- Basically a compacted Trie which takes less space.
Couldn't find a library for this one.
* Suffix Array -- I have a Ruby binding for one of the faster C
libraries which I use in FastCST. The big advantage of a Suffix Array
is that you can store it so that you only need to calculate the suffix
array once. A suffix array is really more useful for matching and
exclusion.
* Suffix Tree -- There's really not much of a reason to use suffix trees
these days since newer suffix array construction algorithms are faster.
The main advantage of a suffix tree is that searching for a result can
be faster. The main disadvantages are that they are fat memory pigs.
* Bloom Filter -- These are not as accurate, but they can be fast as all
hell if you just want to match strings with some probability.

Anyway, just thought I'd throw out some alternatives to hash tables for
certain situations. I'd say if you need to match C++ keywords to a
stored index then take a look at the libtst Trie implementation. It's
quite fast for that application. If you just need to see if a certain
word is "included" or "excluded" then a suffix array could do that
really fast. Trick there is to build the suffix array by joining the
strings with a | character (or something else) between them. Nice thing
about a suffix array is that you can build it offline and then just load
it directly for the searching.

Zed A. Shaw
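The trie idea above is easy to sketch in plain Ruby (a toy illustration of the structure, not the libtst library Zed mentions):

```ruby
# A toy trie: each node is a Hash from character to child node,
# with :end marking the end of a complete stored word.
class Trie
  def initialize
    @root = {}
  end

  def insert(word)
    node = @root
    word.each_char { |c| node = (node[c] ||= {}) }
    node[:end] = true
  end

  def include?(word)
    node = @root
    word.each_char do |c|
      node = node[c]
      return false unless node # dead end: no such word
    end
    node.key?(:end)
  end
end

t = Trie.new
%w[int integer if else].each { |w| t.insert(w) }
puts t.include?("int")  # a stored word
puts t.include?("inte") # a prefix of one, but not itself stored
```

Lookup cost is proportional to the word's length, independent of how many words are stored, which is what makes tries attractive for keyword matching.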

···

On Wed, 2005-05-11 at 05:00 +0900, Phrogz wrote:

Background

At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.

Phrogz wrote:

Summary

Good summary and test, Phrogz.

Don't forget that in many hashing algorithms, the cost of computing
the hash function far outweighs the occasional cost of traversing a
chain. If you need guaranteed lookup time, you need a "perfect hash"
(try Googling on that) but otherwise you just need a fast hash function
with an excellent spread and a low load factor. Neither of those is
hard to achieve.

We use linear hashing (note, this is not the same as linear open
addressing, a much older technique), which achieves excellent
performance with any size hash table. Unfortunately I have no
implementation I can publish.

Clifford Heath.

At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.

If you know all possible strings in advance, you might want to generate a perfect hash:
http://burtleburtle.net/bob/hash/perfect.html
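The core idea can even be brute-forced for a small fixed word list: search over the seed of a simple multiplicative hash until no two words collide. This is only a toy sketch of the concept, not the algorithm the linked page actually uses:

```ruby
# bkdr-style hash with an adjustable seed.
def seeded_hash(str, seed, mask)
  h = 0
  str.each_byte { |b| h = (h * seed + b) & mask }
  h
end

# Find the first seed under which no two words share a hash key.
def find_collision_free_seed(words, mask = 0xFFFF)
  (1..10_000).find do |seed|
    words.map { |w| seeded_hash(w, seed, mask) }.uniq.size == words.size
  end
end

keywords = %w[if else while for return int char void]
puts "collision-free seed: #{find_collision_free_seed(keywords)}"
```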

martinus

You might be interested in the following page, which compares several
different hash-algorithms:

http://www.fantasy-coders.de/projects/gh/html/x435.html

But it's written in German ;-)

Regards,

  Michael

···

On Tuesday, 10 May 2005 at 22:00, Phrogz wrote:

Summary

A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.

Joel VanderWerf wrote:

Phrogz wrote:
> A brief analysis of Ruby versions of a few string hashing algorithms,
> and my conclusions about their efficacy at preventing collisions.

Is ruby's own String#hash one of the ones you tested? If not, how
does it compare?

Because I was interested in C algorithms, I didn't consider it. But, I
just ran it again, and ruby came out with the best of them:

ruby :: 100.0000 percent coverage (0 collisions out of 42884)

Zed A. Shaw wrote:

> * PATRICIA -- Basically a compacted Trie which takes less space.
> Couldn't find a library for this one.
> [snip]
>
> Zed A. Shaw
> http://www.zedshaw.com/

You may be interested in this project:
http://www.rubyforge.org/projects/judy

Has an implementation of something like what you describe as a PATRICIA
above and a sparse hash.

-Charlie

···

On Wed, 2005-05-11 at 05:00 +0900, Phrogz wrote:

Ooh, that's quite helpful, thanks!

···

On May 11, 2005, at 12:05 AM, Martin Ankerl wrote:

If you know all possible strings in advance, you might want to generate a perfect hash:
Perfect Hashing

Thank you for all the links. I've passed them on to the engineers in charge of this particular refactor. (Despite my love of Ruby, I'm not actually a programmer on the engineering team, so I only got involved in overhearing them talking and wanting to investigate how often hashing algorithms collide.)

···

On May 10, 2005, at 4:38 PM, Zed A. Shaw wrote:

Just out of curiosity, but have you considered some other
structures/algorithms which might be alternatives depending on your
usage?

That's not something I knew, so it's hard to unforget it. Thanks :-)

I *think* that the proposed application within my company is a one-time optimization. (Hash the nice human/programmer-readable strings and from then on store and use the results only as shorts.)

I have a BS comp-sci degree, but apparently I need to go back to school and learn myself some in-depth algorithm and data structure stuff.

Thanks also for the note about "linear hashing". Looks interesting.

···

On May 10, 2005, at 5:30 PM, Clifford Heath wrote:

Don't forget that in many hashing algorithms, the cost of computing
the hash function far outweighs the occasional cost of traversing a
chain.

Phrogz wrote:
> Because I was interested in C algorithms, I didn't consider it. But, I
> just ran it again, and ruby came out with the best of them:
>
> ruby :: 100.0000 percent coverage (0 collisions out of 42884)

It is coded in C, and looking at rb_str_hash() in string.c, I see that it is one of HASH_ELFHASH, HASH_PERL, or a third (very simple) one.

Gavin Kistner wrote:

the cost of computing
the hash function far outweighs the occasional cost of traversing a
chain.

That's not something I knew, so it's hard to unforget it. Thanks :-)

It's probably a little more difficult and woolly now that CPUs
are so very much faster than RAM, and caches are big and wide to
compensate. But in the old days, compilers hashed identifiers using
techniques like hashing the first few characters, the last few, and,
if the identifier was long, some of the middle ones as well, ignoring
any other characters. This was done to avoid spending CPU cycles
integrating the characters. It also works better with the typical
character spread of identifiers rather than non-specific data.
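That old sampling trick looks something like this (the exact characters sampled are my own choice; real compilers varied):

```ruby
# Hash only the first few and last few characters, plus one middle
# character for long identifiers, to save cycles on long names.
def sampled_hash(str, mask = 0x7FFFFFFF)
  bytes = str.bytes
  sample = if bytes.size <= 8
             bytes # short identifier: hash everything
           else
             bytes.first(4) + bytes.last(4) + [bytes[bytes.size / 2]]
           end
  h = 0
  sample.each { |b| h = (h * 131 + b) & mask }
  h
end

# The trade-off: identifiers differing only in ignored characters collide.
puts sampled_hash("update_AXA_counter") == sampled_hash("update_AYA_counter")
# true: they differ only at an unsampled position
```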

Nowadays, if you have to fill a cache line, you might as well hash
all the bytes you loaded - which will often be 8 or 16 bytes. The
extra CPU cycles required will disappear in wait states anyhow...

If the data is ASCII, most of the entropy is in the bottom four
bits. If you treat the characters in groups of four, you shouldn't
just XOR all the groups together, because though you get a 32-bit
result, 16 of those bits have very little entropy. I usually do a
5-bit rotate of the accumulating hash value between each word,
which is often enough. Not quite as good statistical spread as a
CRC, which you can match by randomising each step with a multiply
and add. Here's the function I use, which works pretty well, though
I won't claim it's optimal - these things always have tradeoffs.
For example, this doesn't assume that cp is passed in word-aligned,
though that will often be the case.

typedef unsigned int HashValT; /* assumed to be 32 bits */

HashValT
HashBlock(
         const char* cp,
         size_t bytes
)
{
         HashValT val = 0;
         union {
                 char four[4];
                 HashValT one;
         } u; /* named: block-scope anonymous unions aren't standard C */

         for (bytes += 4; bytes > 4;) // Avoid unsigned woes...
         {
                 u.one = 0; // Initialise all 4 at once
                 switch (bytes -= 4) // Duff's device:
                 {
                 default: u.four[3] = *cp++; // Fall through
                 case 3:  u.four[2] = *cp++; // Fall through
                 case 2:  u.four[1] = *cp++; // Fall through
                 case 1:  u.four[0] = *cp++;
                 }
                 val ^= u.one;
                 val = val*1103515245L + 12345; // For good measure...
                 val = ((val<<5) ^ (val>>(32-5))) ^ u.one; // Rotate 5
         }
         return val;
}

I *think* that the proposed application within my company is a one-time optimization.

If you google on "perfect hashing", that will lead you to a nice tool
for this job. It finds a guaranteed 1:1 hash with close to minimum
computation on a predefined set of strings.

Clifford Heath.

How can I figure out which hash it's using? It's not the ELF hash,
since that was the worst performer of the lot. Searching the source for
HASH_PERL I don't see it #defined anywhere, but my C is really, really
rusty. Does this mean that it's using the 'fallback' hash algorithm?
Does it vary by build per platform?

Hi,

···

In message "Re: String Hashing Algorithms" on Wed, 11 May 2005 07:20:26 +0900, "Phrogz" <gavin@refinery.com> writes:

How can I figure out which hash it's using?

We are using the last one, unless specifically defined.

              matz.

Thanks. FWIW, though I know fine-tuning the performance of Ruby is not a high priority, that last algorithm uses multiplication, while some of the other hashes I posted do not. I don't know how often String#hash is used, how much of a difference a change of implementation would make in speed, or how the current algorithm fares compared to something like SDBM in real-world tests...but if an alternative did compare favorably, was much faster, and String#hash was used in tons of places in the code, it might be a nice little speed boost.

···

On May 10, 2005, at 5:58 PM, Yukihiro Matsumoto wrote:

How can I figure out which hash it's using?

We are using the last one, unless specifically defined.