Hashes sensitive to simularity


(Thomas Hurst) #1

I’m after something like nilsimsa[1]; a hashing algorithm that allow you to
determine how similar two strings are. Ideally similar strings would
have identical hashes so it’s a simple hash lookup, otherwise I might
have to do some work, and we can’t have that :wink:

Before I try to make use of nilsimsa, can anyone point out similar
algorithms that might be of use, especially if they’re already available
as Ruby modules :slight_smile:

[1] http://lexx.shinn.net/cmeclax/nilsimsa.html

···


Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

TRANSACTION CANCELLED - FARECARD RETURNED


(Ned Konz) #2

Perhaps Levenshtein distance?
http://www.merriampark.com/ld.htm

···

On Thursday 06 June 2002 08:51 am, Thomas Hurst wrote:

I’m after something like nilsimsa[1]; a hashing algorithm that
allow you to determine how similar two strings are. Ideally
similar strings would have identical hashes so it’s a simple hash
lookup, otherwise I might have to do some work, and we can’t have
that :wink:

Before I try to make use of nilsimsa, can anyone point out similar
algorithms that might be of use, especially if they’re already
available as Ruby modules :slight_smile:


Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE


(Paul Brannan) #3

Since I don’t know your application, I’m not sure how closely this
applies, but a 1998 DDJ article suggests using a ternary search tree:

http://www.ddj.com/documents/s=921/ddj9804a/

Your “database” could be comprised of entries in the tree, and you can
use the described “near-neighbor searching” to find your string.

Paul


(Niklas Frykholm) #4

[Thomas Hurst]:

Ideally similar strings would have identical hashes so it’s a simple hash
lookup, otherwise I might have to do some work, and we can’t have that :wink:

That is not possible in the general case, consider for example

s1 = 'AAAAA'
s2 = 'BAAAA'
s3 = 'BBAAA'
s4 = 'BBBAA'
s5 = 'BBBBA'
s6 = 'BBBBB'

Since s1 is “similar” to s2, they should have the same hash value.
Since s2 is “similar” to s3, they should have the same hash value.

This means that s1 and s6 will have to have the same hash value even
though they are not similar at all.

// Niklas


(Thomas Hurst) #5

Mmmn, the thing is, I really need a key I can use for a database and a
hash. Maybe something like an overlong soundex, or a more generic
hashing algorithm that produces identical hashes up to a certain
threshold.

I suppose I might be able to use a less accurate algorithm and use
a simularity matching algorithm like Levenshtein distance on the smaller
set.

···

Perhaps Levenshtein distance?
http://www.merriampark.com/ld.htm


Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

QOTD:
“You’re so dumb you don’t even have wisdom teeth.”


(Rick Bradley) #6

[Thomas Hurst]:

Ideally similar strings would have identical hashes so it’s a simple hash
lookup, otherwise I might have to do some work, and we can’t have that :wink:

That is not possible in the general case, consider for example

s1 = 'AAAAA’
s2 = 'BAAAA’
s3 = 'BBAAA’
s4 = 'BBBAA’
s5 = 'BBBBA’
s6 = ‘BBBBB’

Since s1 is “similar” to s2, they should have the same hash value.
Since s2 is “similar” to s3, they should have the same hash value.

On the contrary, soundex meets the poster’s criteria, it just may not be
the most flexible or natural function for disparate data sets.

The definition of “similar” is critical to the solution.

Rick

···


http://www.rickbradley.com MUPRN: 444 (88F/95F)
> stay free If you hope
random email haiku | to keep building GDP.
> He is right, said St.