I read an article posted at Wikipedia about Levenshtein distance (aka edit
distance).
The location of the document is
In the document, a sample Ruby code goes like the following:
class String
def levenshtein(comparator)
a, b = self.unpack('U*'), comparator.unpack('U*')
n, m = a.length, b.length
a, b, n, m = b, a, m, n if n > m
current = [*0..n]
1.upto(m) do |i|
previous, current = current, [i]+[0]*n
1.upto(n) do |j|
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
change += 1 if a[j-1] != b[i-1]
current[j] = [add, delete, change].min
end
end
current[n]
end
end
In the code, there's a parameter called comparator which seems to be used to
decode given parameter. But, I can't understand what exactly the comparator
is doing.
It's simply the string you're comparing against; unpack('U*') just turns the
UTF-8 characters into unsigned integers:
class String
def levenshtein(comparator)
a, b = self.unpack('U*'), comparator.unpack('U*')
b # => [102, 111, 111, 98, 97, 114]
n, m = a.length, b.length
a, b, n, m = b, a, m, n if n > m
current = [*0..n]
1.upto(m) do |i|
previous, current = current, [i]+[0]*n
1.upto(n) do |j|
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
change += 1 if a[j-1] != b[i-1]
current[j] = [add, delete, change].min
end
end
current[n]
end
end
"foo".levenshtein("foobar") # => 3
···
On Mon, Oct 16, 2006 at 07:15:52PM +0900, Minkoo wrote:
Hi list.
I read an article posted at Wikipedia about Levenshtein distance (aka edit
distance).
The location of the document is Levenshtein distance - Wikipedia
In the document, a sample Ruby code goes like the following:
class String
def levenshtein(comparator)
a, b = self.unpack('U*'), comparator.unpack('U*')
n, m = a.length, b.length
a, b, n, m = b, a, m, n if n > m
current = [*0..n]
1.upto(m) do |i|
previous, current = current, [i]+[0]*n
1.upto(n) do |j|
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
change += 1 if a[j-1] != b[i-1]
current[j] = [add, delete, change].min
end
end
current[n]
end
end
In the code, there's a parameter called comparator which seems to be used to
decode given parameter. But, I can't understand what exactly the comparator
is doing.
I read an article posted at Wikipedia about Levenshtein distance (aka edit
distance).
The location of the document is Levenshtein distance - Wikipedia
In the document, a sample Ruby code goes like the following:
class String
def levenshtein(comparator)
[...]
In the code, there's a parameter called comparator which seems to be used to
decode given parameter. But, I can't understand what exactly the comparator
is doing.
I didn't examine the code, but I'd guess... you want the L. distance between two strings. One is self, and the other is the parameter named "comparator".
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by
default?
In other words, suppose that I've launched irb and fired
"foo".levenshtein("foobar").
In that case, is the string "foo" encoded as utf-8? Do I always have to
unpack the string like
the code shown above?
Sincerely,
Minkoo Seo
···
On 10/16/06, Mauricio Fernandez <mfp@acm.org> wrote:
On Mon, Oct 16, 2006 at 07:15:52PM +0900, Minkoo wrote:
> Hi list.
>
> I read an article posted at Wikipedia about Levenshtein distance (aka
edit
> distance).
> The location of the document is
> Levenshtein distance - Wikipedia
>
> In the document, a sample Ruby code goes like the following:
>
> class String
> def levenshtein(comparator)
> a, b = self.unpack('U*'), comparator.unpack('U*')
> n, m = a.length, b.length
> a, b, n, m = b, a, m, n if n > m
> current = [*0..n]
> 1.upto(m) do |i|
> previous, current = current, [i]+[0]*n
> 1.upto(n) do |j|
> add, delete = previous[j]+1, current[j-1]+1
> change = previous[j-1]
> change += 1 if a[j-1] != b[i-1]
> current[j] = [add, delete, change].min
> end
> end
> current[n]
> end
> end
>
> In the code, there's a parameter called comparator which seems to be
used to
> decode given parameter. But, I can't understand what exactly the
comparator
> is doing.
>
> Does anybody know the detail?
It's simply the string you're comparing against; unpack('U*') just turns
the
UTF-8 characters into unsigned integers:
class String
def levenshtein(comparator)
a, b = self.unpack('U*'), comparator.unpack('U*')
b # => [102, 111,
111, 98, 97, 114]
n, m = a.length, b.length
a, b, n, m = b, a, m, n if n > m
current = [*0..n]
1.upto(m) do |i|
previous, current = current, [i]+[0]*n
1.upto(n) do |j|
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
change += 1 if a[j-1] != b[i-1]
current[j] = [add, delete, change].min
end
end
current[n]
end
end
I have now renamed the variable to "other" for clarity.
···
On Mon, 16 Oct 2006 19:40:07 +0900, Carlos wrote:
Minkoo wrote:
Hi list.
I read an article posted at Wikipedia about Levenshtein distance (aka edit
distance).
The location of the document is Levenshtein distance - Wikipedia
In the document, a sample Ruby code goes like the following:
class String
def levenshtein(comparator)
[...]
In the code, there's a parameter called comparator which seems to be
used to
decode given parameter. But, I can't understand what exactly the comparator
is doing.
I didn't examine the code, but I'd guess... you want the L. distance
between two strings. One is self, and the other is the parameter named
"comparator".
--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology. http://www.iit.edu/~kbloom1/
I've added a signing subkey to my GPG key. Please update your keyring.
>It's simply the string you're comparing against; unpack('U*') just turns
>the UTF-8 characters into unsigned integers:
>
> class String
> def levenshtein(comparator)
> a, b = self.unpack('U*'), comparator.unpack('U*')
> b # => [102, 111,
>111, 98, 97, 114]
[...]
>
> "foo".levenshtein("foobar") # => 3
>
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by
default?
Ruby's Strings as of 1.8 are just a sequence of bytes. If a String happens to
hold a UTF-8 sequence, some operations will work properly when you set
$KCODE="u", e.g.
In other words, suppose that I've launched irb and fired
"foo".levenshtein("foobar").
In that case, is the string "foo" encoded as utf-8?
This depends on your locale; Ruby's Strings do not know about encodings in 1.8
(they will very soon in 1.9 according to matz' latest messages in ruby-core;
search the ruby-talk archives for extensive discussions of the upcoming M17N).
Do I always have to unpack the string like the code shown above?
In this particular case, it depends on two things:
* the encoding you're using
* whether the Levenshtein should be relative to characters or bytes
In general, it's up to you to keep track of the encodings and handle Strings
appropriately.
···
On Mon, Oct 16, 2006 at 10:11:51PM +0900, Minkoo wrote:
On 10/16/06, Mauricio Fernandez <mfp@acm.org> wrote:
I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by
default?
As of Ruby 1.8, it doesn't, but see the other responses.
In other words, suppose that I've launched irb and fired
"foo".levenshtein("foobar").
In that case, is the string "foo" encoded as utf-8?
The code makes that assumption, correct or not. If you're one of those
ignorant yokels who believes that characters are bytes, it's also a
safe assumption. Now, if you're one of those people who does i18n,
l10n, and m17n before breakfast, you'll have alarm bells going off
inside your head immediately, as such assumptions are in general very
dangerous to make. A cardinal rule in this kind of programming is that
strings are meaningless without an attached encoding.
Do I always have to
unpack the string like
the code shown above?
It would be better, at least that keeps your options open when you
attempt to internationalize your program. You wind up supporting
Unicode at the very least, and that makes the transition to
internationalization a bit easier, as Unicode is a reasonable choice
as a character set if one wishes to do internationalization.
The version of Levenshtein in the Text project
(http://rubyforge.org/projects/text\) will work on either UTF-8
codepoints or bytes depending on the value of $KCODE. It's the best we
can do for now; in the future, of course, we will be able to decide
based on the strings' encodings themselves.
</blatant plug>
Paul.
···
On 17/10/06, Dido Sevilla <dido.sevilla@gmail.com> wrote:
The code makes that assumption, correct or not. If you're one of those
ignorant yokels who believes that characters are bytes, it's also a
safe assumption. Now, if you're one of those people who does i18n,
l10n, and m17n before breakfast, you'll have alarm bells going off
inside your head immediately, as such assumptions are in general very
dangerous to make. A cardinal rule in this kind of programming is that
strings are meaningless without an attached encoding.