Hashes sensitive to simularity

Thomas_Hurst · 6 June 2002 15:51

I’m after something like nilsimsa[1]; a hashing algorithm that allow you to
determine how similar two strings are. Ideally similar strings would
have identical hashes so it’s a simple hash lookup, otherwise I might
have to do some work, and we can’t have that

Before I try to make use of nilsimsa, can anyone point out similar
algorithms that might be of use, especially if they’re already available
as Ruby modules

[1] http://lexx.shinn.net/cmeclax/nilsimsa.html

···

–
Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

TRANSACTION CANCELLED - FARECARD RETURNED

Ned_Konz · 6 June 2002 16:10

Perhaps Levenshtein distance?

···

On Thursday 06 June 2002 08:51 am, Thomas Hurst wrote:

I’m after something like nilsimsa[1]; a hashing algorithm that
allow you to determine how similar two strings are. Ideally
similar strings would have identical hashes so it’s a simple hash
lookup, otherwise I might have to do some work, and we can’t have
that

Before I try to make use of nilsimsa, can anyone point out similar
algorithms that might be of use, especially if they’re already
available as Ruby modules

–
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Paul_Brannan · 6 June 2002 17:24

Since I don’t know your application, I’m not sure how closely this
applies, but a 1998 DDJ article suggests using a ternary search tree:

http://www.ddj.com/documents/s=921/ddj9804a/

Your “database” could be comprised of entries in the tree, and you can
use the described “near-neighbor searching” to find your string.

Paul

Niklas_Frykholm2 · 6 June 2002 21:53

[Thomas Hurst]:

Ideally similar strings would have identical hashes so it’s a simple hash
lookup, otherwise I might have to do some work, and we can’t have that

That is not possible in the general case, consider for example

s1 = 'AAAAA'
s2 = 'BAAAA'
s3 = 'BBAAA'
s4 = 'BBBAA'
s5 = 'BBBBA'
s6 = 'BBBBB'

Since s1 is “similar” to s2, they should have the same hash value.
Since s2 is “similar” to s3, they should have the same hash value.
…

This means that s1 and s6 will have to have the same hash value even
though they are not similar at all.

// Niklas

Thomas_Hurst · 6 June 2002 16:42

Mmmn, the thing is, I really need a key I can use for a database and a
hash. Maybe something like an overlong soundex, or a more generic
hashing algorithm that produces identical hashes up to a certain
threshold.

I suppose I might be able to use a less accurate algorithm and use
a simularity matching algorithm like Levenshtein distance on the smaller
set.

···

Ned Konz (ned@bike-nomad.com) wrote:

Perhaps Levenshtein distance?
merriampark.com

–
Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

QOTD:
“You’re so dumb you don’t even have wisdom teeth.”

Rick_Bradley1 · 6 June 2002 22:26

Niklas Frykholm (r2d2@acc.umu.se) [020606 16:56]:

[Thomas Hurst]:

Ideally similar strings would have identical hashes so it’s a simple hash
lookup, otherwise I might have to do some work, and we can’t have that

That is not possible in the general case, consider for example

s1 = ‘AAAAA’
s2 = ‘BAAAA’
s3 = ‘BBAAA’
s4 = ‘BBBAA’
s5 = ‘BBBBA’
s6 = ‘BBBBB’

Since s1 is “similar” to s2, they should have the same hash value.
Since s2 is “similar” to s3, they should have the same hash value.

On the contrary, soundex meets the poster’s criteria, it just may not be
the most flexible or natural function for disparate data sets.

The definition of “similar” is critical to the solution.

Rick

···

–
http://www.rickbradley.com MUPRN: 444 (88F/95F)
> stay free If you hope
random email haiku | to keep building GDP.
> He is right, said St.

Topic		Replies	Views
Hashes sensitive to simularity ruby-talk	0	73	6 June 2002
String Hashing Algorithms ruby-talk	15	88	12 May 2005
Mini quiz (Was: Detecting similar strings) ruby-talk	3	112	2 August 2006
Detecting similar strings ruby-talk	0	91	1 August 2006
Mini quiz (Was: Detecting similar strings) ruby-talk	1	108	1 August 2006

Hashes sensitive to simularity

– Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

– Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

Related topics

–
Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

–
Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/