Summary
···
=============================================================
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.
Background
At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.
As a result, I wanted to do a preliminary investigation into various
string hashing algorithms, to see how limiting that would be. (How
often would we need to rename a bunch of our strings just to get around
collisions?) Given the nature of hashing algorithms, it was important
that this be based on ~real world strings.
Methodology
I wrote a ruby script to scan all our (C++) source code and find ANY
words that were valid C++ identifiers. This also meant that I grabbed
all such words from inside comment blocks, as well as C++ keywords, but
that was OK for this back-of-the-envelope testing. I ended up with
almost 43,000 unique identifiers.
I then ran through the list of words and used 7 different string
hashing algorithms (which I ported from the C code I found on a web
page[1]) to compute different hash keys. I used a Set (per hashing
algorithm) to store these keys, and at the end compared the number of
keys in the Set with the number of identifiers fed in.
The Results
Three of the hashing algorithms ended up with no collisions whatsoever,
and one of these (SDBM) is really simple and fast (in C).
Here is the output of my test:
djb :: 99.8601 percent coverage (60 collisions out of 42884)
elf :: 99.5430 percent coverage (196 collisions out of 42884)
sdbm :: 100.0000 percent coverage (0 collisions out of 42884)
js :: 99.9044 percent coverage (41 collisions out of 42884)
ap :: 100.0000 percent coverage (0 collisions out of 42884)
bkdr :: 99.9953 percent coverage (2 collisions out of 42884)
rs :: 100.0000 percent coverage (0 collisions out of 42884)
The Algorithms
module HashEm
SIGNEDSHORT = 0x7FFFFFFF
def self.rs( str, len=str.length )
a,b = 63689,378551
hash = 0
len.times{ |i|
hash = hash*a + str[i]
a *= b
}
hash & SIGNEDSHORT
end
def self.js( str, len=str.length )
hash = 1315423911
len.times{ |i|
hash ^= ( ( hash << 5 ) + str[i] + ( hash >> 2 ) )
}
hash & SIGNEDSHORT
end
def self.elf( str, len=str.length )
hash = 0
x = 0
len.times{ |i|
hash = (hash << 4) + str[i]
if (x = hash & 0xF0000000) != 0
hash ^= (x >> 24)
hash &= ~x
end
}
hash & SIGNEDSHORT
end
def self.bkdr( str, len=str.length )
seed = 131 # 31 131 1313 13131 131313 etc..
hash = 0
len.times{ |i|
hash = ( hash*seed )+str[i]
}
hash & SIGNEDSHORT
end
def self.sdbm( str, len=str.length )
hash = 0
len.times{ |i|
hash = str[i] + ( hash << 6 ) + ( hash << 16 ) - hash
}
hash & SIGNEDSHORT
end
def self.djb( str, len=str.length )
hash = 5381
len.times{ |i|
hash = ((hash << 5) + hash) + str[i]
}
hash & SIGNEDSHORT
end
def self.ap( str, len=str.length )
hash = 0
len.times{ |i|
if (i & 1) == 0
hash ^= (hash << 7) ^ str[i] ^ (hash >> 3)
else
hash ^= ~( (hash << 11) ^ str[i] ^ (hash >> 5) )
end
}
hash & SIGNEDSHORT
end
end