Edit Distance at Wikipedia

Minkoo · 16 October 2006 10:15

Hi list.

I read an article posted at Wikipedia about Levenshtein distance (aka edit
distance).
The location of the document is

In the document, a sample Ruby code goes like the following:

class String
   def levenshtein(comparator)
     a, b = self.unpack('U*'), comparator.unpack('U*')
     n, m = a.length, b.length
     a, b, n, m = b, a, m, n if n > m
     current = [*0..n]
     1.upto(m) do |i|
       previous, current = current, [i]+[0]*n
       1.upto(n) do |j|
         add, delete = previous[j]+1, current[j-1]+1
         change = previous[j-1]
         change += 1 if a[j-1] != b[i-1]
         current[j] = [add, delete, change].min
       end
     end
     current[n]
   end
end

In the code, there's a parameter called comparator which seems to be used to
decode given parameter. But, I can't understand what exactly the comparator
is doing.

Does anybody know the detail?

Sincerely,
Minkoo Seo

Mauricio_Fernandez · 16 October 2006 10:25

It's simply the string you're comparing against; unpack('U*') just turns the
UTF-8 characters into unsigned integers:

    class String
      def levenshtein(comparator)
  a, b = self.unpack('U*'), comparator.unpack('U*')
  b # => [102, 111, 111, 98, 97, 114]
  n, m = a.length, b.length
  a, b, n, m = b, a, m, n if n > m
  current = [*0..n]
  1.upto(m) do |i|
    previous, current = current, [i]+[0]*n
    1.upto(n) do |j|
      add, delete = previous[j]+1, current[j-1]+1
      change = previous[j-1]
      change += 1 if a[j-1] != b[i-1]
      current[j] = [add, delete, change].min
    end
  end
  current[n]
      end
    end

"foo".levenshtein("foobar") # => 3

···

On Mon, Oct 16, 2006 at 07:15:52PM +0900, Minkoo wrote:

Hi list.

I read an article posted at Wikipedia about Levenshtein distance (aka edit
distance).
The location of the document is
Levenshtein distance - Wikipedia

In the document, a sample Ruby code goes like the following:

class String
  def levenshtein(comparator)
    a, b = self.unpack('U*'), comparator.unpack('U*')
    n, m = a.length, b.length
    a, b, n, m = b, a, m, n if n > m
    current = [*0..n]
    1.upto(m) do |i|
      previous, current = current, [i]+[0]*n
      1.upto(n) do |j|
        add, delete = previous[j]+1, current[j-1]+1
        change = previous[j-1]
        change += 1 if a[j-1] != b[i-1]
        current[j] = [add, delete, change].min
      end
    end
    current[n]
  end
end

In the code, there's a parameter called comparator which seems to be used to
decode given parameter. But, I can't understand what exactly the comparator
is doing.

Does anybody know the detail?

--
Mauricio Fernandez - http://eigenclass.org - singular Ruby

Carlos · 16 October 2006 10:40

Minkoo wrote:

Hi list.

I read an article posted at Wikipedia about Levenshtein distance (aka edit
distance).
The location of the document is
Levenshtein distance - Wikipedia

In the document, a sample Ruby code goes like the following:

class String
def levenshtein(comparator)

[...]

In the code, there's a parameter called comparator which seems to be used to
decode given parameter. But, I can't understand what exactly the comparator
is doing.

I didn't examine the code, but I'd guess... you want the L. distance between two strings. One is self, and the other is the parameter named "comparator".

Good luck.

···

--

Minkoo · 16 October 2006 13:11

I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by
default?

In other words, suppose that I've launched irb and fired
"foo".levenshtein("foobar").
In that case, is the string "foo" encoded as utf-8? Do I always have to
unpack the string like
the code shown above?

Sincerely,
Minkoo Seo

···

On 10/16/06, Mauricio Fernandez <mfp@acm.org> wrote:

On Mon, Oct 16, 2006 at 07:15:52PM +0900, Minkoo wrote:
> Hi list.
>
> I read an article posted at Wikipedia about Levenshtein distance (aka
edit
> distance).
> The location of the document is
> Levenshtein distance - Wikipedia
>
> In the document, a sample Ruby code goes like the following:
>
> class String
> def levenshtein(comparator)
> a, b = self.unpack('U*'), comparator.unpack('U*')
> n, m = a.length, b.length
> a, b, n, m = b, a, m, n if n > m
> current = [*0..n]
> 1.upto(m) do |i|
> previous, current = current, [i]+[0]*n
> 1.upto(n) do |j|
> add, delete = previous[j]+1, current[j-1]+1
> change = previous[j-1]
> change += 1 if a[j-1] != b[i-1]
> current[j] = [add, delete, change].min
> end
> end
> current[n]
> end
> end
>
> In the code, there's a parameter called comparator which seems to be
used to
> decode given parameter. But, I can't understand what exactly the
comparator
> is doing.
>
> Does anybody know the detail?

It's simply the string you're comparing against; unpack('U*') just turns
the
UTF-8 characters into unsigned integers:

    class String
      def levenshtein(comparator)
        a, b = self.unpack('U*'), comparator.unpack('U*')
        b # => [102, 111,
111, 98, 97, 114]
        n, m = a.length, b.length
        a, b, n, m = b, a, m, n if n > m
        current = [*0..n]
        1.upto(m) do |i|
          previous, current = current, [i]+[0]*n
          1.upto(n) do |j|
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            change += 1 if a[j-1] != b[i-1]
            current[j] = [add, delete, change].min
          end
        end
        current[n]
      end
    end

    "foo".levenshtein("foobar") # => 3

--
Mauricio Fernandez - http://eigenclass.org - singular Ruby

Ken_Bloom · 16 October 2006 13:45

I have now renamed the variable to "other" for clarity.

···

On Mon, 16 Oct 2006 19:40:07 +0900, Carlos wrote:

Minkoo wrote:

Hi list.

I read an article posted at Wikipedia about Levenshtein distance (aka edit
distance).
The location of the document is
Levenshtein distance - Wikipedia

In the document, a sample Ruby code goes like the following:

class String
def levenshtein(comparator)

[...]

In the code, there's a parameter called comparator which seems to be
used to
decode given parameter. But, I can't understand what exactly the comparator
is doing.

I didn't examine the code, but I'd guess... you want the L. distance
between two strings. One is self, and the other is the parameter named
"comparator".

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/
I've added a signing subkey to my GPG key. Please update your keyring.

Mauricio_Fernandez · 16 October 2006 21:04

>It's simply the string you're comparing against; unpack('U*') just turns
>the UTF-8 characters into unsigned integers:
>
> class String
> def levenshtein(comparator)
> a, b = self.unpack('U*'), comparator.unpack('U*')
> b # => [102, 111,
>111, 98, 97, 114]

[...]

>
> "foo".levenshtein("foobar") # => 3
>
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by
default?

Ruby's Strings as of 1.8 are just a sequence of bytes. If a String happens to
hold a UTF-8 sequence, some operations will work properly when you set
$KCODE="u", e.g.

s = "\303\241" # "á" == acute
s.scan(/\w/) # =>
$KCODE="u"
s.scan(/\w/) # => ["á"]

Some other methods are covered by jcode.rb.

In other words, suppose that I've launched irb and fired
"foo".levenshtein("foobar").
In that case, is the string "foo" encoded as utf-8?

This depends on your locale; Ruby's Strings do not know about encodings in 1.8
(they will very soon in 1.9 according to matz' latest messages in ruby-core;
search the ruby-talk archives for extensive discussions of the upcoming M17N).

Do I always have to unpack the string like the code shown above?

In this particular case, it depends on two things:
* the encoding you're using
* whether the Levenshtein should be relative to characters or bytes

In general, it's up to you to keep track of the encodings and handle Strings
appropriately.

···

On Mon, Oct 16, 2006 at 10:11:51PM +0900, Minkoo wrote:

On 10/16/06, Mauricio Fernandez <mfp@acm.org> wrote:

--
Mauricio Fernandez - http://eigenclass.org - singular Ruby

Dido_Sevilla · 17 October 2006 06:44

I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by
default?

As of Ruby 1.8, it doesn't, but see the other responses.

In other words, suppose that I've launched irb and fired
"foo".levenshtein("foobar").
In that case, is the string "foo" encoded as utf-8?

The code makes that assumption, correct or not. If you're one of those
ignorant yokels who believes that characters are bytes, it's also a
safe assumption. Now, if you're one of those people who does i18n,
l10n, and m17n before breakfast, you'll have alarm bells going off
inside your head immediately, as such assumptions are in general very
dangerous to make. A cardinal rule in this kind of programming is that
strings are meaningless without an attached encoding.

Do I always have to
unpack the string like
the code shown above?

It would be better, at least that keeps your options open when you
attempt to internationalize your program. You wind up supporting
Unicode at the very least, and that makes the transition to
internationalization a bit easier, as Unicode is a reasonable choice
as a character set if one wishes to do internationalization.

···

On 10/16/06, Minkoo <minkoo.seo@gmail.com> wrote:

Paul_Battley · 17 October 2006 08:58

The version of Levenshtein in the Text project
(http://rubyforge.org/projects/text\) will work on either UTF-8
codepoints or bytes depending on the value of $KCODE. It's the best we
can do for now; in the future, of course, we will be able to decide
based on the strings' encodings themselves.

</blatant plug>

Paul.

···

On 17/10/06, Dido Sevilla <dido.sevilla@gmail.com> wrote:

The code makes that assumption, correct or not. If you're one of those
ignorant yokels who believes that characters are bytes, it's also a
safe assumption. Now, if you're one of those people who does i18n,
l10n, and m17n before breakfast, you'll have alarm bells going off
inside your head immediately, as such assumptions are in general very
dangerous to make. A cardinal rule in this kind of programming is that
strings are meaningless without an attached encoding.

Topic		Replies	Views
Levenshtein 0.1.0 ruby-talk	2	71	24 May 2008
A place for an "edit_distance" method ruby-talk	2	92	19 May 2007
Mini quiz (Was: Detecting similar strings) ruby-talk	3	112	2 August 2006
Levenshtein_distance and recreate the string ruby-talk	11	179	1 May 2008
40 million levenshtein distances for two long strings ruby-talk	11	137	25 May 2008

Edit Distance at Wikipedia

Related topics