Detecting similar strings

Is there a way to take two strings, and decide if they are "similar."
I'm creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.

Is there a way to check if 2 strings are "identical" up to a certain
percentage, such as only having 1 or 2 characters different?

···

--
Posted via http://www.ruby-forum.com/.

Here's a Perl module that does something similar. You might try
porting it over to Ruby.

Farrel

···

On 01/08/06, Dylan Markow <dylan@dylanmarkow.com> wrote:

Is there a way to take two strings, and decide if they are "similar."
I'm creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.

Is there a way to check if 2 strings are "identical" up to a certain
percentage, such as only having 1 or 2 characters different?

sender: "Dylan Markow" date: "Tue, Aug 01, 2006 at 03:25:59PM +0900" <<<EOQ

Is there a way to take two strings, and decide if they are "similar."
I'm creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.

Is there a way to check if 2 strings are "identical" up to a certain
percentage, such as only having 1 or 2 characters different?

Looks like there is a soundex implementation for Ruby:
http://raa.ruby-lang.org/search.rhtml?search=soundex

If by any chance you are using MySQL you could use the soundex function
builtin into it as well.

Good luck,
Alex

Dylan Markow wrote:

Is there a way to take two strings, and decide if they are "similar."
I'm creating a contact system in Rails, and am having a large problem

Interesting your name is almost "Markov" ;-}

Anyway, besides algorithms mentioned here, I have notes on Double
Metaphone, NYSIIS, Phonex. For sequence analysis in general, google
McIlroy-Hunt, Ratcliff/Obershelp:

http://docs.python.org/lib/module-difflib.htmlhttp://docs.python.org/lib/module-difflib.html

And here is a nice description of the Levenstein Distance:

Farrel

···

On 01/08/06, Farrel Lifson <farrel.lifson@gmail.com> wrote:

On 01/08/06, Dylan Markow <dylan@dylanmarkow.com> wrote:
> Is there a way to take two strings, and decide if they are "similar."
> I'm creating a contact system in Rails, and am having a large problem
> with my users punching in duplicate entries with the last names spelled
> slightly different.
>
> Is there a way to check if 2 strings are "identical" up to a certain
> percentage, such as only having 1 or 2 characters different?

Here's a Perl module that does something similar. You might try
porting it over to Ruby.
String::Similarity - calculate the similarity of two strings - metacpan.org

Farrel

sender: "Dylan Markow" date: "Tue, Aug 01, 2006 at 03:25:59PM +0900"
<<<EOQ

Is there a way to take two strings, and decide if they are "similar."
I'm creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names
spelled
slightly different.

Is there a way to check if 2 strings are "identical" up to a certain
percentage, such as only having 1 or 2 characters different?

Looks like there is a soundex implementation for Ruby:
http://raa.ruby-lang.org/search.rhtml?search=soundex

If by any chance you are using MySQL you could use the soundex function
builtin into it as well.

If you check on Google under "ruby soundex module", you'll find a few
likely hits:
  ruby soundex module - Google Search

I remember a friend saying that there are superior algorithms to soundex
that are no more complex, so you might want to explore about a little.

Gene Tani wrote:

Dylan Markow wrote:
> Is there a way to take two strings, and decide if they are "similar."

http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf

You know, there are lots of implementations there, but the Ruby one
seems to be missing :slight_smile: [There's no reason to restrict it to working on
strings. If you duck, it'll work just as nicely on arrays of what have
you.]

···

On 01/08/06, Farrel Lifson <farrel.lifson@gmail.com> wrote:

And here is a nice description of the Levenstein Distance:
Levenshtein distance - Wikipedia

If the names might be from different languages I would rather use
Levenstein than soundex. Levenstein is probably good at describing
typos as "very close" but soundex might be somewhat language specific.

But I haven't tried so I might be wrong.

Thanks

Michal

···

On 8/1/06, benjohn@fysh.org <benjohn@fysh.org> wrote:

>>>> sender: "Dylan Markow" date: "Tue, Aug 01, 2006 at 03:25:59PM +0900"
>>>> <<<EOQ
>> Is there a way to take two strings, and decide if they are "similar."
>> I'm creating a contact system in Rails, and am having a large problem
>> with my users punching in duplicate entries with the last names
>> spelled
>> slightly different.
>>
>> Is there a way to check if 2 strings are "identical" up to a certain
>> percentage, such as only having 1 or 2 characters different?
> Looks like there is a soundex implementation for Ruby:
> http://raa.ruby-lang.org/search.rhtml?search=soundex
>
> If by any chance you are using MySQL you could use the soundex function
> builtin into it as well.

If you check on Google under "ruby soundex module", you'll find a few
likely hits:
  ruby soundex module - Google Search

I remember a friend saying that there are superior algorithms to soundex
that are no more complex, so you might want to explore about a little.

Here is an implementation of levenstein distance. It seems to me that there is another in ruby gems somewhere but I don't remember where.

# Determine Levenshtein distance of two strings

def Ld(s,t)

    #s string #1
    #t string #2
       n = s.size
    m = t.size
    a = Array.new

    if n != 0 && m != 0
           #2 create array
        r = Array.new
        rz = Array.new
           0.upto(m) {|x| r.push(0)}
           0.upto(n) {|x|a.push(r.dup)}
        a.each_index {|x| a[x][0] = x}
        0.upto(m) {|x| a[0][x] = x}
       #a.each {|x| p x}
           cost = 0
        1.upto(n) do |i|
            1.upto(m) do |j|
                if s[i] == t[j]
                    cost =0
                else
                    cost = 1
                end
                a[i][j] = [a[ i- 1][j] +1,a[i][j - 1] + 1,a[i - 1][j - 1] + cost].min
            end
        end
        a[n][m]
#a.each {|x| p x}
    else
        0
    end
end