[SUMMARY] Short But Unique (#83)

This is a pretty real world problem many applications struggle with. I dare say
it's hard to get right, which is probably why so many applications don't even
try. We had a few adventurous solvers though and they got some workable
results. Let's look into Daniel Martin's solution.

I felt Daniel got some great output, by favoring the word boundaries of the
terms to shorten. Here's a sample of the code being run on the quiz problem
set:

  users...er
  use...test
  account...
  acc...test
  bacon

Notice how the compression favors keeping word intact, when possible. That
seems to give fairly good results overall.

Alright, let's see how Daniel does it:

  # Returns the length of the longest common
  # substring of "a" and "b"
  def string_similarity(a, b)
    retval = 0
    (0 ... b.length).each { |offset|
      len = 0
      (0 ... b.length - offset).each { |aind|
        if (a[aind] and b[aind+offset] == a[aind])
          len += 1
          retval = len if retval < len
        else
          len = 0
        end
      }
    }
    (1 ... a.length).each { |offset|
      len = 0
      (0 ... a.length - offset).each { |bind|
        if (b[bind] and a[bind+offset] == b[bind])
          len += 1
          retval = len if retval < len
        else
          len = 0
        end
      }
    }
    retval
  end
  
  # ...

The comment above this method tells you exactly what it does, which is to return
the numerical length of the longest common substring for the parameters. It
works by walking each possible substring of b and checking that the string
occurs in a somewhere. It then reverses and repeats the process because one
string might be longer than the other. The highest count seen overall is
returned.

Daniel mentioned that he has been out of Ruby for a bit and his Ruby might be a
little rusty. The above code works just fine, of course, but we can shorten it
up a bit, if we want to. Here's another way to write the above:

  $KCODE = "u" # make jcode happy (silence a warning)
  require "jcode" # for String#each_char
  require "enumerator" # for Enumerable#each_cons
  
  def string_similarity(str1, str2)
    long, short = [str1, str2].sort_by { |str| str.length }
    long.length.downto(0) do |len|
      long.enum_for(:each_char).each_cons(len) do |substr|
        return substr.length if short.include? substr.join
      end
    end
    return 0
  end

Again, this works the same. I try all possible substrings of the longer string,
from longest to shortest, for inclusion in the shorter string. Since the code
works with the longest strings first, we can return the first answer we find.
I'm not saying either method is better, but hopefully you can figure out how
they work, between the two of them.

Let's get back to Daniel's code:

  # ...
  
  def score_compression(target, start, len, alltargets)
    score = target.length - len
    score += 3 if len == 0
    score += 3 if (target[start,1] =~ %r(_|\W) or
                   target[start-1,2] =~ %r([a-z0-9][A-Z]))
    score += 3 if (target[start+len-1,1] =~ %r(_|\W) or
                   target[start+len-1,2] =~ %r([a-z0-9][A-Z]))
    prebit = target[0,start]
    postbit = target[start+len,target.length]
    scoreminus = 0
    alltargets.each{|s|
      scoreminus += string_similarity(s,prebit)
      scoreminus += string_similarity(s,postbit)
    }
    score - (1.0 / alltargets.length) * scoreminus
  end
  
  # ...

The above code just rates a possible replacement. The target variable holds the
original string, start and len mark the area to be replaced with the repeat
string, and alltargets holds the other strings that need compressing.

The most interesting part starts on the third line of the method, where a couple
of checks are used to score substitutions at word boundaries higher. Note that
the checks include punctuation boundaries and changes in case as a boundary.
This, in my opinion, is the source of the quite readable end result.

Note that the last section of that method compares the pieces of the string that
will be left after the replacement with other strings in the result set using
the substring count method we examined earlier. This code is there to prevent
two strings from being compressed to the same representation (though I doubt
this weighted scoring system is perfect).

Finally, here's the method that does the actual work:

  # ...
  
  class Array
    def compress(n, repstr = '...')
      retval = []
      self.each { |s|
        short_specs =
        (s.length - n + repstr.length ..
         s.length - n + repstr.length + 2).inject([]) { |sp,l|
          sp + (0 .. s.length - l).map {|st| [st,l]}
        }.select { |a| a[1] > repstr.length}
        if (s.length <= n) then short_specs.unshift([0,0]) end
        retval.push(
          short_specs.inject([-999,""]) { |record, spec|
            candidate = s.dup
            candidate[spec[0],spec[1]] = repstr if spec[1] > 0
            if retval.include?(candidate)
              record
            else
              score = score_compression(s,spec[0],spec[1],self)
              if score >= record[0]
                [score, candidate]
              else
                record
              end
            end
          }[1]
        )
      }
      retval
    end
  end
  
  # ...

This code was a little tricky for me to break down, so let me see if I can
explain it in manageable chunks. This method works over each string in the
Array, building a compressed replacement for each one as it goes. To build
those replacements it first constructs an Array of indices and lengths where it
could replace content with the replacement string. It then scores each of those
replacements and selects the highest candidate as the result.

Here's how one might invoke the above:

  # ...
  
  if __FILE__ == $0
    ARGV[1,ARGV.length].compress(ARGV[0].to_i).each {|s| puts s}
  end

Now, since Unicode has been such a hot topic lately, let's see how this code
handles using a real ellipsis character. If I change the above code to:

  # ...
  
  if __FILE__ == $0
    puts ARGV[1..-1].compress(ARGV[0].to_i, "…")
  end

and run the code again, here's the new output:

  users…er
  use…test
  account…
  acc…test
  bacon

Oops, I asked for 10 characters but got less than that. The reason is that the
code we just examined is counting the byte length of our replacement string, in
this bit of code right here:

  # ...
  
  class Array
    def compress(n, repstr = '...')
      retval = []
      self.each { |s|
        short_specs =
        (s.length - n + repstr.length ..
         s.length - n + repstr.length + 2).inject([]) { |sp,l|
          sp + (0 .. s.length - l).map {|st| [st,l]}
        }.select { |a| a[1] > repstr.length}
        
        # ...

As a fix, I'm going to set the $KCODE variable for Unicode support, load the
jcode library for its helper methods, and change all those repstr.length calls
to repstr.jlength. Here's a look at those changes:

  $KCODE = "u"
  require "jcode"
  
  # ...
  
  class Array
    def compress(n, repstr = '...')
      retval = []
      self.each { |s|
        short_specs =
        (s.length - n + repstr.jlength ..
         s.length - n + repstr.jlength + 2).inject([]) { |sp,l|
          sp + (0 .. s.length - l).map {|st| [st,l]}
        }.select { |a| a[1] > repstr.jlength}
        
        # ...

Does that fix us up? Let's see:

  users…ller
  users…test
  account…er
  account…st
  bacon

Bingo. Now we get the extra characters from using the shorter string. Don't
let people tell you Ruby can't handle Unicode today.

My thanks to all who gave this quiz a shot. I expect you all to email your
solutions as patches all the applications out there that botch the display of
tabs like this.

Tomorrow we have an easy problem for those working through Learn to Program...

Ruby Quiz <james@grayproductions.net> writes:

Note that the last section of that method compares the pieces of the
string that will be left after the replacement with other strings in
the result set using the substring count method we examined earlier.
This code is there to prevent two strings from being compressed to
the same representation (though I doubt this weighted scoring system
is perfect).

Actually, that's not why that code is there, or what it's doing. What
it's doing is favoring unique strings over non-unique strings. For
example:

irb(main):002:0> %w(apple_juice orange_juice grape_juice
                    prune_juice).compress(10)
=> ["apple...", "orange...", "grape...", "prune..."]
irb(main):003:0> %w(orange_juice orange_marmalade orange_flavor
                    orange_pulp).compress(10)
=> ["...juice", "...rmalade", "...flavor", "o...pulp"]

The idea is that stuff common to all or almost-all of the original
strings doesn't really help you when staring at a bunch of tabs.
Abbreviating "orange_juice" as "orange..." makes sense when you have a
bunch of juices, but not when you have a bunch of orange things.

Avoiding previously used abbreviations is done with the bit:

            if retval.include?(candidate)
              record

That is, if the to-be-returned list of abbreviations already includes
this string, don't even score it and just assume that whatever else
you had was better. This leads to bad results when all the possible
abbreviations are already taken.

Also, if you're going to allow for unicode in the output, you really
should allow it in the input, and change all those .length to .jlength
calls, but I'm still holding out for true transparent unicode support
in ruby 2.0...

Ruby Quiz <james@grayproductions.net> writes:

Note that the last section of that method compares the pieces of the
string that will be left after the replacement with other strings in
the result set using the substring count method we examined earlier.
This code is there to prevent two strings from being compressed to
the same representation (though I doubt this weighted scoring system
is perfect).

Actually, that's not why that code is there, or what it's doing. What
it's doing is favoring unique strings over non-unique strings. For
example:

irb(main):002:0> %w(apple_juice orange_juice grape_juice
                    prune_juice).compress(10)
=> ["apple...", "orange...", "grape...", "prune..."]
irb(main):003:0> %w(orange_juice orange_marmalade orange_flavor
                    orange_pulp).compress(10)
=> ["...juice", "...rmalade", "...flavor", "o...pulp"]

The idea is that stuff common to all or almost-all of the original
strings doesn't really help you when staring at a bunch of tabs.
Abbreviating "orange_juice" as "orange..." makes sense when you have a
bunch of juices, but not when you have a bunch of orange things.

Avoiding previously used abbreviations is done with the bit:

            if retval.include?(candidate)
              record

That is, if the to-be-returned list of abbreviations already includes
this string, don't even score it and just assume that whatever else
you had was better. This leads to bad results when all the possible
abbreviations are already taken.

Thanks for setting me straight Daniel.

Also, if you're going to allow for unicode in the output, you really
should allow it in the input, and change all those .length to .jlength
calls, but I'm still holding out for true transparent unicode support
in ruby 2.0...

Good point!

James Edward Gray II

···

On Jun 23, 2006, at 9:47 PM, Daniel Martin wrote: