Hashes sensitive to simularity

I’m after something like nilsimsa[1]; a hashing algorithm that allow you to
determine how similar two strings are. Ideally similar strings would
have identical hashes so it’s a simple hash lookup, otherwise I might
have to do some work, and we can’t have that :wink:

Before I try to make use of nilsimsa, can anyone point out similar
algorithms that might be of use, especially if they’re already available
as Ruby modules :slight_smile:

[1] http://lexx.shinn.net/cmeclax/nilsimsa.html

···


Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

TRANSACTION CANCELLED - FARECARD RETURNED

Perhaps Levenshtein distance?

···

On Thursday 06 June 2002 08:51 am, Thomas Hurst wrote:

I’m after something like nilsimsa[1]; a hashing algorithm that
allow you to determine how similar two strings are. Ideally
similar strings would have identical hashes so it’s a simple hash
lookup, otherwise I might have to do some work, and we can’t have
that :wink:

Before I try to make use of nilsimsa, can anyone point out similar
algorithms that might be of use, especially if they’re already
available as Ruby modules :slight_smile:


Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Since I don’t know your application, I’m not sure how closely this
applies, but a 1998 DDJ article suggests using a ternary search tree:

http://www.ddj.com/documents/s=921/ddj9804a/

Your “database” could be comprised of entries in the tree, and you can
use the described “near-neighbor searching” to find your string.

Paul

[Thomas Hurst]:

Ideally similar strings would have identical hashes so it’s a simple hash
lookup, otherwise I might have to do some work, and we can’t have that :wink:

That is not possible in the general case, consider for example

s1 = 'AAAAA'
s2 = 'BAAAA'
s3 = 'BBAAA'
s4 = 'BBBAA'
s5 = 'BBBBA'
s6 = 'BBBBB'

Since s1 is “similar” to s2, they should have the same hash value.
Since s2 is “similar” to s3, they should have the same hash value.

This means that s1 and s6 will have to have the same hash value even
though they are not similar at all.

// Niklas

Mmmn, the thing is, I really need a key I can use for a database and a
hash. Maybe something like an overlong soundex, or a more generic
hashing algorithm that produces identical hashes up to a certain
threshold.

I suppose I might be able to use a less accurate algorithm and use
a simularity matching algorithm like Levenshtein distance on the smaller
set.

···

Perhaps Levenshtein distance?
merriampark.com


Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

QOTD:
“You’re so dumb you don’t even have wisdom teeth.”

  • Niklas Frykholm (r2d2@acc.umu.se) [020606 16:56]:

[Thomas Hurst]:

Ideally similar strings would have identical hashes so it’s a simple hash
lookup, otherwise I might have to do some work, and we can’t have that :wink:

That is not possible in the general case, consider for example

s1 = ‘AAAAA’
s2 = ‘BAAAA’
s3 = ‘BBAAA’
s4 = ‘BBBAA’
s5 = ‘BBBBA’
s6 = ‘BBBBB’

Since s1 is “similar” to s2, they should have the same hash value.
Since s2 is “similar” to s3, they should have the same hash value.

On the contrary, soundex meets the poster’s criteria, it just may not be
the most flexible or natural function for disparate data sets.

The definition of “similar” is critical to the solution.

Rick

···


http://www.rickbradley.com MUPRN: 444 (88F/95F)
> stay free If you hope
random email haiku | to keep building GDP.
> He is right, said St.