Damerau-Levenshtein_distance

Jeremy2 · 18 May 2007 01:35

I wrote the method below by copying the algorithm from http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance (and matrix is a really simple 2d array implementation). But the problem is that it slows wat down as the string size gets bigger. At string length of about 150 it takes 1s, at 500 10s. Is there any way to recode this to get better performance without rewriting it in C, and would rewrting it in C even help or is this just a slow algorithm?

  def self.distance(string1, string2)
    string1 = string1.unpack('C*')
    string2 = string2.unpack('C*')
    s1n = string1.length
    s2n = string2.length
        m = DLDiff::Matrix.new(s1n+1, s2n+1)
    cost = 0
            (0..s1n).each {|i| m[i,0] = i}
    (1..s2n).each {|j| m[0,j] = j}
        (1..s1n).each do |i|
      (1..s2n).each do |j|
        cost = string1[i] == string2[j] ? 0 : 1
        m[i, j] = [ m[i-1, j] + 1,
                    m[i, j-1] + 1,
                    m[i-1, j-1] + cost ].min
        m[i, j] = [ m[i,j], m[i-2,j-2] + cost].min if(i > 1 && j > 1 && string1[i] == string2[j-1] && string1[i-1] == string2[j])
      end
    end
        m[s1n, s2n]
  end

  class Matrix
    def initialize(columns, rows)
      ac = Array.new(columns, 0)
      @am = Array.new(rows, 0)
      @am = @am.collect{|r| ac.dup}
    end
       def [](c, r)
      @am[r][c]
    end
       def []=(c,r,value)
      @am[r][c] = value
    end
       def inspect
      @am.collect{|a| a.inspect}.join("\n")
    end
       def to_s
      @am.to_s
    end
  end

Daniel_Martin · 18 May 2007 05:39

Jeremy <jwells@servalsystems.co.uk> writes:

I wrote the method below by copying the algorithm from
Damerau–Levenshtein distance - Wikipedia (and matrix
is a really simple 2d array implementation). But the problem is that
it slows wat down as the string size gets bigger. At string length of
about 150 it takes 1s, at 500 10s. Is there any way to recode this to
get better performance without rewriting it in C, and would rewrting
it in C even help or is this just a slow algorithm?

Well, the algorithm itself is O(n*m), where n and m are the size of
the strings involved, so on large strings it's going to get slow.

I was able to shave about 40% off the time for your method, and fix a
bug.

The bug was caused because the wikipedia article indexes the strings
starting from 1, but indexes the array starting form 0. In ruby both
start at 0, of course. To fix this, I added a fake element at the
start of both strings.

Anyway, here's my slightly faster version:

    def self.distance(string1, string2)
        string1 = string1.unpack('C*')
        string2 = string2.unpack('C*')
        s1n = string1.length
        s2n = string2.length
        string1.unshift -1
        string2.unshift -1
        m = Array.new(s1n+1){Array.new(s2n+1,0)}
        cost = 0
        a = nil; b = nil
        (0..s1n).each {|i| m[i][0] = i}
        (1..s2n).each {|j| m[0][j] = j}
        (1..s1n).each do |i|
            (1..s2n).each do |j|
                cost = string1[i] == string2[j] ? 0 : 1

                a = m[i-1][j] + 1
                a = b if ((b = m[i][j-1]+1) < a)
                a = b if ((b = m[i-1][j-1]+cost) < a)

                if(i > 1 && j > 1 &&
                   string1[i] == string2[j-1] &&
                   string1[i-1] == string2[j]) then
                    a = b if ((b = m[i-2][j-2] + 1) < a)
                end
                m[i][j] = a
            end
        end
        m[s1n][s2n]
    end

···

--
s=%q( Daniel Martin -- martin@snowplow.org
puts "s=%q(#{s})",s.to_a.last )
puts "s=%q(#{s})",s.to_a.last

Jeremy2 · 18 May 2007 08:40

Daniel Martin wrote:

Well, the algorithm itself is O(n*m), where n and m are the size of
the strings involved, so on large strings it's going to get slow.

I was able to shave about 40% off the time for your method, and fix a
bug.

Thanks, thats really useful. I guess that now the data i'm comparing has grown it would be better to use a diff type method than this.

Topic		Replies	Views
Levenshtein 0.1.0 ruby-talk	2	71	24 May 2008
Mini quiz (Was: Detecting similar strings) ruby-talk	3	112	2 August 2006
Problem with arrays ruby-talk	1	75	8 November 2007
Levenshtein_distance and recreate the string ruby-talk	11	179	1 May 2008
40 million levenshtein distances for two long strings ruby-talk	11	137	25 May 2008

Damerau-Levenshtein_distance

Related topics