[QUIZ] DictionaryMatcher (#103)

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

···

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Ken Bloom

From time to time someone asks on ruby-talk how they can write a regexp of the
form:

  /alligator|crocodile|bear|dinosaur|...|seven-thousandth-word/

It's not hard to write such a regexp, but Ruby has in internal limit on how big
the regular expression can be, so users find they can't do this matching
function easily.

Implement a class DictionaryMatcher that determines whether any of the strings
added to it are substrings of a string S. This should function as almost a
drop-in replacement for a Regexp, therefore your implementation should support
the following operations:

  # creates a new empty matcher
  dm=DictionaryMatcher.new
  
  # adds strings to the matcher
  dm << "string"
  dm << "Ruby"
  
  # determines whether a given word was one of those added to the matcher
  dm.include?("Ruby") # => true
  dm.include?("missing") # => false
  dm.include?("stringing you along") # => false
  
  # Regexp-like substing search
  dm =~ "long string" # => 5
  dm =~ "rub you the wrong way" # => nil
  
  # will automatically work as a result of implementing
  # DictionaryMatcher#=~ (see String#=~)
  "long string" =~ dm # => true
  
  # implement the rest of the interface implemented by Regexps (well, almost)
  class DictionaryMatcher
    alias_method :===, :=~
    alias_method :match, :=~
  end

If you can add additional features, like a case insensitivity option when
creating a new DictionaryMatcher this is also very useful.

Hi --

by Ken Bloom

From time to time someone asks on ruby-talk how they can write a regexp of the

form:

  /alligator|crocodile|bear|dinosaur|...|seven-thousandth-word/

It's not hard to write such a regexp, but Ruby has in internal limit on how big
the regular expression can be, so users find they can't do this matching
function easily.

Implement a class DictionaryMatcher that determines whether any of the strings
added to it are substrings of a string S. This should function as almost a
drop-in replacement for a Regexp, therefore your implementation should support
the following operations:

  # creates a new empty matcher
  dm=DictionaryMatcher.new

  # adds strings to the matcher
  dm << "string"
  dm << "Ruby"

  # determines whether a given word was one of those added to the matcher
  dm.include?("Ruby") # => true
  dm.include?("missing") # => false
  dm.include?("stringing you along") # => false

  # Regexp-like substing search
  dm =~ "long string" # => 5
  dm =~ "rub you the wrong way" # => nil

  # will automatically work as a result of implementing
  # DictionaryMatcher#=~ (see String#=~)
  "long string" =~ dm # => true

  # implement the rest of the interface implemented by Regexps (well, almost)
  class DictionaryMatcher
    alias_method :===, :=~
    alias_method :match, :=~
  end

If you can add additional features, like a case insensitivity option when
creating a new DictionaryMatcher this is also very useful.

What's the ruling on priority and "greediness"? In other words,
given:

   dm << "hi"
   dm << "child"

what would:

   dm =~ "children"

give? Would it be different if you added "child" first? Or is there
a rule about finding the longest match?

David

···

On Sat, 25 Nov 2006, Ruby Quiz wrote:

--
                   David A. Black | dblack@wobblini.net
Author of "Ruby for Rails" [1] | Ruby/Rails training & consultancy [3]
DABlog (DAB's Weblog) [2] | Co-director, Ruby Central, Inc. [4]
[1] Ruby for Rails | [3] http://www.rubypowerandlight.com
[2] http://dablog.rubypal.com | [4] http://www.rubycentral.org

On my machine (ruby 1.8.4), I don't get this result. For me, 'long
string' =~ /string/ returns the same as /string/ =~ 'long string',
which is 5, not true.

Just a heads-up for anyone else using the provided code as the basis
for a test case.

- Jamie

···

On 11/24/06, Ruby Quiz <james@grayproductions.net> wrote:

        # will automatically work as a result of implementing
        # DictionaryMatcher#=~ (see String#=~)
        "long string" =~ dm # => true

This DictionaryMatcher implementation implements a trie (prefix tree)
combined with Knuth-Morris-Pratt string matching on O(n) time. The funny
thing is I gave this solution as an answer to the same question on my
algorithms final last semester, and it was marked wrong. The professor
just didn't understand what I was doing.

I violated the interface a little since knowing which words matched can
also be pretty important, and added the #scan method, because the
questioner who inspired me to write this quiz actually told me that he
wanted to count how many matches there were in the string.

require 'enumerator'

# The DictionaryMatcher class holds a collection of strings. It allows lookups to
# determine which strings are included, and it can be used similarly
# to a +Regexp+ in substring matching.
class DictionaryMatcher
   #Create a DictionaryMatcher with no words in it
   def initialize
      @internal=Node.new
   end

   #Add a word to the DictionaryMatcher
   def add string
      array=@internal.add string
      parent_indexes=compute_failure_function string
      array.zip(parent_indexes).each do |node,parentindex|
   node.failure=array[parentindex] if parentindex
      end
      nil
   end
   alias_method :<<, :add

   #Determine whether +string+ was previously <tt>add</tt>ed to the
   #DictionaryMatcher.
   def include? string
      @internal.include? string
   end

   #Determine whether one of the words in the DictionaryMatcher is a substring of
   #+string+. Returns a DictionaryMatcher::MatchData object if found, +nil+ if not
   #found.
   def match string
      internal_match(string){|md| return md}
      return nil
   end

   #Scans +string+ for all occurrances of strings in the DictionaryMatcher.
   #Overlapping matches are skipped (only the first one is yielded), and
   #when some strings in the
   #DictionaryMatcher are substrings of others, only the shortest match at a given
   #position is found.
   def scan string
      internal_match(string){|matchdata| yield matchdata}
      nil
   end

   #Case equality. Similar to =~, but returns true or false.
   def === string
      not match(string).nil?
   end

   #Determines whether one of the words in the DictionaryMatcher is a substring of
   #+string+. Returns the index of the match if found, +nil+ if not
   #found.
   def =~ string
      x=match(string)
      x && x.index
   end

   #Contains the index matched, and the word matched
   MatchData=Struct.new(:index,:match)

   private
   #Doing this globally for the whole word feels kludgy, but I didn't
   #want to figure out how to do this as a per-node function.
   #Basically copied from Cormen, Leiserson, Rivest and Stein
   #"Introduction to Algorithms" 2nd ed.
   def compute_failure_function p
      m=p.size
      pi=[0,0]
      k=0
      2.upto m do |q|
   k=pi[k] while k>0 and p[k] != p[q-1]
   k=k+1 if p[k]==p[q-1]
   pi[q]=k
      end
      pi
   end

   def internal_match string
      node=@internal
      e=Enumerable::Enumerator.new(string,:each_byte)
      e.each_with_index do |b,index|
   advance=false
   until advance
      nextnode=node.transitions[b]
      if not nextnode
         advance=true if node==node.failure #loops happen at the root
         node=node.failure
      elsif nextnode.endword?
         yield MatchData.new(index-node.depth,string[index-node.depth..index])
         advance=true
         node=@internal
      else
         advance=true
         node=nextnode
      end
   end
      end
   end

   class Node #:nodoc:
      attr_accessor :transitions, :failure, :endword, :depth
      alias_method :endword?, :endword
      def initialize depth=0
   @depth=depth
   @transitions={}
   @endword=false
      end
      def add string,offset=0, array=
   first=string[offset]
   if offset==string.size
      @endword=true
      array << self
      return array
   end
   array << self
   node=(@transitions[first] ||= Node.new(@depth+1))
   return node.add(string,offset+1,array)
      end
      def include? string, offset=0
   first=string[offset]
   if offset==string.size
      return @endword
   elsif not @transitions.include?(first)
      return false
   else
      return @transitions[first].include?(string,offset+1)
   end
      end
      def inspect; "x"; end
   end

end

···

On Sat, 25 Nov 2006 08:21:24 +0900, Ruby Quiz wrote:

by Ken Bloom

From time to time someone asks on ruby-talk how they can write a regexp of the
form:

  /alligator|crocodile|bear|dinosaur|...|seven-thousandth-word/

It's not hard to write such a regexp, but Ruby has in internal limit on how big
the regular expression can be, so users find they can't do this matching
function easily.

Implement a class DictionaryMatcher that determines whether any of the strings
added to it are substrings of a string S. This should function as almost a
drop-in replacement for a Regexp, therefore your implementation should support
the following operations:

  # creates a new empty matcher
  dm=DictionaryMatcher.new
  
  # adds strings to the matcher
  dm << "string"
  dm << "Ruby"
  
  # determines whether a given word was one of those added to the matcher
  dm.include?("Ruby") # => true
  dm.include?("missing") # => false
  dm.include?("stringing you along") # => false
  
  # Regexp-like substing search
  dm =~ "long string" # => 5
  dm =~ "rub you the wrong way" # => nil
  
  # will automatically work as a result of implementing
  # DictionaryMatcher#=~ (see String#=~)
  "long string" =~ dm # => true
  
  # implement the rest of the interface implemented by Regexps (well, almost)
  class DictionaryMatcher
    alias_method :===, :=~
    alias_method :match, :=~
  end

If you can add additional features, like a case insensitivity option when
creating a new DictionaryMatcher this is also very useful.

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

# Author: Lou Scoras <louis.j.scoras@gmail.com>
# Date: Sun Nov 26 10:43:34 EST 2006

···

#
# q103.rb - Solution to Rubyquiz 103 (DictionaryMatcher)
#
# Implements DictionaryMatcher using a Trie. This version of
# DictionaryMatcher only matches complete words, but it wouldn't
# be too hard to modify to match any substring.

class Trie
  def initialize
    @children = Hash.new {|h,k| h[k] = Trie.new}
  end

  attr_accessor :value

  def []=(key, value)
    insert(key, 0, value)
    key
  end

  def [](key)
    get(key,0)
  end

  def each &blk
    _each &blk
  end

  include Enumerable

  def keys
    inject([]) {|keys,(k,v)| keys << k; keys}
  end

  def values
    inject([]) {|vals,(k,v)| vals << v; vals}
  end

  def each_key &blk
    keys.each &blk
  end

  def each_value &blk
    values.each &blk
  end

  def inspect(indent=0)
    buff = ''
    i = ' ' * indent
    buff << i + "value: #{value}\n" if value
    return buff unless @children.size > 0
    @children.each {|k,c|
      buff << "#{i}#{k} =>\n"
      buff << c.inspect(indent+2)
    }
    buff
  end

  protected

  def _each(key='', &blk)
    blk.call(key,value) if key != '' and value
    @children.keys.sort.each do |k|
      @children[k]._each(key + k,&blk)
    end
  end

  def insert(key,offset,value)
    if offset == key.length - 1
      @children[key[offset,1]].value = value
    else
      @children[key[offset,1]].insert(key,offset+1,value)
    end
  end

  def get(key,offset)
    if offset == key.length - 1
      @children[key[offset,1]].value
    else
      return nil unless @children.has_key?(key[offset,1])
      @children[key[offset,1]].get(key,offset+1)
    end
  end
end

class DictionaryMatcher
  def initialize(opts={})
    @ignore_case = opts[:ignore_case]
    @trie = Trie.new
  end

  def ignores_case?
    @ignore_case
  end

  def <<(word)
    word = word.downcase if ignores_case?
    @trie[word] = true
  end

  def words
    @trie.keys
  end

  def include?(word)
    !@trie[(ignores_case?? word.downcase : word)].nil?
  end

  def =~(string)
    words = string.split
    positions = words.inject({}) { |h,w|
      h[w] = string.index(w) unless h[w]; h
    }
    words.each do |word|
      return positions[word] if
   include?(ignores_case?? word.downcase : word)
    end
    return nil
  end

  alias_method :===, :=~
  alias_method :match, :=~
end

Well, I'm not sure I can add anything not covered by Adam's solution
(forwarded in by James), but here goes.

I started with the naive solution:

class DictionaryMatcher < Array
  def =~(string)
    self.map{|e| Regexp.new(e) =~ string }.compact.min
  end
end

I then fleshed it out with a case insensitive option, and more tests.

- Jamie

class DictionaryMatcher < Array
  alias_method :===, :=~
  alias_method :match, :=~

  def initialize(default = [], options = nil)
    super(default)

    unless options.nil? or options.is_a? Fixnum
      options = Regexp::IGNORECASE
    end
    @regexp_options = options
  end

  def =~(string)
    self.map{|e| Regexp.new(e, @regexp_options) =~ string }.compact.min
  end
end

class DictionaryMatcherTest < Test::Unit::TestCase
  def test_acceptance
    # creates a new empty matcher
    dm=DictionaryMatcher.new

    # adds strings to the matcher
    dm << "string"
    dm << "Ruby"

    # determines whether a given word was one of those added to the matcher
    assert_equal true, dm.include?("Ruby") # => true
    assert_equal false, dm.include?("missing") # => false
    assert_equal false, dm.include?("stringing you along") # => false

    # Regexp-like substing search
    assert_equal 5, dm =~ "long string" # => 5
    assert_equal nil, dm =~ "rub you the wrong way" # => nil

    # will automatically work as a result of implementing
    # DictionaryMatcher#=~ (see String#=~)
    assert_equal 5, "long string" =~ dm # => true
  end

  def test_include_eh
    dm = DictionaryMatcher.new(['string', 'ruby', 'foo'])
    assert_equal true, dm.include?('string' )
    assert_equal true, dm.include?('ruby' )
    assert_equal true, dm.include?('foo' )
    assert_equal false, dm.include?('stringa')
    assert_equal false, dm.include?('astring')
  end

  def test_equals_tilde
    dm = DictionaryMatcher.new(['string', 'ruby', 'foo'])
    assert_equal 0, dm =~ 'string'
    assert_equal 0, dm =~ 'string two'
    assert_equal 6, dm =~ 'three string'
    assert_equal 5, dm =~ 'four string five'
    assert_equal nil, dm =~ 'strng'

    assert_equal 0, 'string' =~ dm
    assert_equal 2, 'a string b' =~ dm
    assert_equal nil, 'strng' =~ dm
  end

  def test_case_sensitivity
    dm = DictionaryMatcher.new(['Foo','bar'])
    assert_equal 0, dm =~ 'Foo'
    assert_equal 0, dm =~ 'bar'
    assert_equal nil, dm =~ 'foo'
    assert_equal nil, dm =~ 'Bar'
  end

  def test_case_insensitivity
    dm = DictionaryMatcher.new(['Foo','bar'], true)
    assert_equal 0, dm =~ 'Foo'
    assert_equal 0, dm =~ 'bar'
    assert_equal 0, dm =~ 'foo'
    assert_equal 0, dm =~ 'Bar'
  end

  def test_greediness
    dm = DictionaryMatcher.new(['hi','child'])
    r = /hi|child/
    assert_equal r =~ 'children', dm =~ 'children'

    dm = DictionaryMatcher.new(['child','hi'])
    r = /child|hi/
    assert_equal r =~ 'children', dm =~ 'children'
  end

end

Here's my solution. Theres a bit more speed to be had out of it, but
I'll be in trouble if I spend any more time on it, so I'll just post it
up instead :wink:

It's slower than some of the other solutions, but does seem to work. I
ran the attached benchmark to gauge performance against a regexp (using
Oniguruma under 1.9) and it's slow to build, but fast to match (I'm
lazy, using hashes for my tree):

$ ruby9 bench.rb
### CREATION (x10) ###
      user system total real
regexp 1.430000 0.020000 1.450000 ( 1.518385)
tree matcher 15.290000 0.070000 15.360000 ( 15.727852)
### MATCHING (x500) ###
      user system total real
regexp 0.830000 0.000000 0.830000 ( 0.865269)
tree matcher 0.010000 0.000000 0.010000 ( 0.016855)

I also ran Ken's posted benchmark, under both 1.8 and (for fun) 1.9. My
solution suffers slightly, but check out the speed increase in Ken's own
solution:

$ ruby -v kbbench.rb
ruby 1.8.5 (2006-08-25) [i686-linux]
                          user system total real
Edwin Fine -- fill 0.480000 0.020000 0.500000 ( 0.536963)
Edwin Fine -- test 3.310000 0.010000 3.320000 ( 3.383446)
Ken Bloom -- fill 2.460000 0.070000 2.530000 ( 2.569271)
Ken Bloom -- test 13.300000 0.030000 13.330000 ( 13.830023)
Ross Bamford -- fill 1.570000 0.020000 1.590000 ( 1.635048)
Ross Bamford -- test 5.470000 0.030000 5.500000 ( 5.595892)

$ ruby9 -v kbbench.rb
ruby 1.9.0 (2006-11-26 patchlevel 0) [i686-linux]
                          user system total real
Edwin Fine -- fill 0.460000 0.010000 0.470000 ( 0.530037)
Edwin Fine -- test 2.990000 0.000000 2.990000 ( 3.947945)
Ken Bloom -- fill 3.070000 0.060000 3.130000 ( 3.273754)
Ken Bloom -- test 3.110000 0.010000 3.120000 ( 3.184959)
Ross Bamford -- fill 1.270000 0.000000 1.270000 ( 1.314763)
Ross Bamford -- test 6.070000 0.010000 6.080000 ( 6.164216)

Thanks for a fun quiz. Now I'd better get some work done... :slight_smile:

bench.rb (763 Bytes)

dm.rb (6.64 KB)

···

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

I'm doing the same thing a regexp does.

In this case, both /hi|child/ =~ 'children' and /child|hi/ =~
'children' return 0, so I'm returning the index of the first character
in the string that matches the uber-regexp we're standing in for.

- Jamie

···

On 11/24/06, dblack@wobblini.net <dblack@wobblini.net> wrote:

What's the ruling on priority and "greediness"? In other words,
given:

   dm << "hi"
   dm << "child"

what would:

   dm =~ "children"

give? Would it be different if you added "child" first? Or is there
a rule about finding the longest match?

<dblack <at> wobblini.net> writes:

Hi,

What's the ruling on priority and "greediness"? In other words,
given:

   dm << "hi"
   dm << "child"

what would:

   dm =~ "children"

give?

I'd say that given the example:

> dm << "string"
> dm.include?("stringing you along") # => false

your example would return nil, as the purpose of DictionaryMatcher seems to be
to match whole words, and not derivatives

Jamie Macey wrote:

> # will automatically work as a result of implementing
> # DictionaryMatcher#=~ (see String#=~)
> "long string" =~ dm # => true

On my machine (ruby 1.8.4), I don't get this result. For me, 'long
string' =~ /string/ returns the same as /string/ =~ 'long string',
which is 5, not true.

Just a heads-up for anyone else using the provided code as the basis
for a test case.

One of the unwritten rules of Ruby Quiz is to do whatever seems
appropriate with the interface, given the real interface.

You'll notice, for example, that a Regexp returns the offset from =~, a
boolean from === (despite the documentation's claims to the contrary),
and a MatchData object from #match.
The alias_method part of the interface simplifies this aspect of the
interface to DictionaryMatcher, although it defintiely makes more sense
to invent a MatchData object of some sort that can tell you what the
matched word was, and use that as the return value to match.

--Ken

···

On 11/24/06, Ruby Quiz <james@grayproductions.net> wrote:

This solution uses a digital trie to encode the strings to search.
A good background to this data structure can be found here:
http://www.ddj.com/184410528,

which also discusses a better solution that I did not explore (ternary
search trees).

A trie has a search speed that is O(m), where m is the number of
characters in the string to find in the dictionary. A hash table is
O(1), which is theoretically faster than a trie, but in practice the
trie is faster because the has table has to hash all characters in the
string to create the hash key, whereas the trie can reject a string by
matching as little as 1 character.

The solution trades memory usage for speed. Every string in the
dictionary is organized into a hierarchy of character codes.
Essentially, the dictionary encodes a DFA (deterministic finite
automaton) where each character code is a transition from one node to
the next one. This provides very fast search speeds, especially when a
non-matching string is encountered. In other words, the trie can reject
incorrect strings very quickly (as soon as a non-matching prefix to a
valid string is seen).

To store a string in the dictionary, we start at the root (which is just
a hash
or array of the character codes of the first character of every string
in the
dictionary).

trie = root
for each character code in the string to store
  trie[character code] ||= {} # or [], if arrays used
  trie = trie[character code] # drop down 1 level
end
trie[0] = true # Mark end of word (code 0 will never be used)

It is necessary to mark the end of a word because you can get words that
are prefixes of other words (for example, can, canna, and cannabis).
This allows you to decide if you want to use a least-prefix or greedy
search. This code uses a least-prefix search that returns the shortest
matching string in the dictionary. This is due to the requirements of
the quiz. However, it should be easy enough to code a greedy search.

The search algorithm is surprisingly simple. The search space is the
text that
we want to search for matching strings. The code starts at the first
character of the search space and tries to find a match in the
dictionary anchored at that position. If it does not detect a match, it
moves the anchor to the next character and starts again.

trie = root
for each character code in the search space
  break unless character code is in trie
  trie = trie[character code]
end
found a valid string if trie[0] exists

Each character in the dictionary to be searched is an index into a hash.

Informal benchmarks on my system (Intel Core 2 Duo E6600 on Win XP 32
bit)
shows a memory usage of about 15MB when using a hash as the basic trie
storage, and 30+Mb using an array, when storing a dictionary of 24,001
words.
The speed seems to be about the same either way, so the hash solution is
the best bet.

The test code finds all words in the 24,001 word dictionary in a text of
about
600KB. On my system, this takes around 2 seconds. This does include a
possibly
incorrect optimization: once a word is matched, the anchor point is
moved to
after the word so that substrings within the word are not matched. This
may
or may not be what is desired, but removing this optimization almost
doubles
the run time.

The code itself is fairly short:

class DictionaryMatcher
  attr_reader :word_count

  def initialize
    @trie = {}
    @word_count = 0
  end

  def add_word(word)
    @word_count += 1
    container = @trie

    word.each_byte do |b|
      container[b] = {} unless container.has_key? b
      container = container[b]
    end

    container[0] = true # Mark end of word
  end

  def include?(word)
    container = @trie
    word.each_byte do |b|
      break unless container.has_key? b
      container = container[b]
    end
    container[0]
  end

  def =~(text)
    text_end = text.length - 1
    pos = 0

    while pos <= text_end do
      container = @trie

      pos.upto(text_end) do |i|
        b = text[i]
        break unless container.has_key? b
        container = container[b]
      end

      return pos if container[0] # Match
      pos += 1
    end

    nil
  end

  # Return container of matches in text [[pos, len], ...]
  # or call block if provided (returns [])
  def find_all_matching(text, &block)
    matches = []
    block = lambda { |pos, len| matches << [pos, len] } unless block
    pos = 0
    text_end = text.length - 1

    while pos <= text_end do
      container = @trie
      len = 0

      pos.upto(text_end) do |i|
        b = text[i]
        break unless container.has_key?(b)
        container = container[b]
        len += 1
      end

      if container[0] # Match
        block.call(pos, len)
        pos += len # Skip over word
      else
        pos += 1
      end
    end

    matches
  end

  # implement much of the rest of the interface implemented by Regexps
  alias_method :===, :=~
  alias_method :match, :=~
  alias_method :<<, :add_word

  # Add words from a file
  def add_words(words_file)
    IO.foreach(words_file) do |line|
      add_word line.chomp
    end
  end
end

I have posted the code, Test::Unit tests, the test file, and dictionary
file on my web site here:

http://finecomputerconsultants.com/ruby_quiz/dictionary_matcher-0.1.tgz

I hope I have not violated any copyrights/lefts by supplying the text
files; if so, let me know and I will remedy this.

···

--
Posted via http://www.ruby-forum.com/.

Hi --

What's the ruling on priority and "greediness"? In other words,
given:

   dm << "hi"
   dm << "child"

what would:

   dm =~ "children"

give? Would it be different if you added "child" first? Or is there
a rule about finding the longest match?

I'm doing the same thing a regexp does.

In this case, both /hi|child/ =~ 'children' and /child|hi/ =~
'children' return 0, so I'm returning the index of the first character
in the string that matches the uber-regexp we're standing in for.

Actually, a better example of what I was pondering would be:

   dm << "t"
   dm << "ten"

   dm =~ "tenth"

However, I guess as long as it's just the starting offset that's
needed, and not the ending offset (i.e., we don't have to know how
long the match is), it won't be an issue. My instinct was to want to
know which string did the matching, but for starting offset purposes
it doesn't matter.

David

···

On Sat, 25 Nov 2006, Jamie Macey wrote:

On 11/24/06, dblack@wobblini.net <dblack@wobblini.net> wrote:

--
                   David A. Black | dblack@wobblini.net
Author of "Ruby for Rails" [1] | Ruby/Rails training & consultancy [3]
DABlog (DAB's Weblog) [2] | Co-director, Ruby Central, Inc. [4]
[1] Ruby for Rails | [3] http://www.rubypowerandlight.com
[2] http://dablog.rubypal.com | [4] http://www.rubycentral.org

This solution uses a digital trie to encode the strings to search.
A good background to this data structure can be found here:
Ternary Search Trees | Dr Dobb's,

which also discusses a better solution that I did not explore (ternary
search trees).

A trie has a search speed that is O(m), where m is the number of
characters in the string to find in the dictionary. A hash table is
O(1), which is theoretically faster than a trie, but in practice the
trie is faster because the has table has to hash all characters in the
string to create the hash key, whereas the trie can reject a string by
matching as little as 1 character.

The solution trades memory usage for speed. Every string in the
dictionary is organized into a hierarchy of character codes.
Essentially, the dictionary encodes a DFA (deterministic finite
automaton) where each character code is a transition from one node to
the next one. This provides very fast search speeds, especially when a
non-matching string is encountered. In other words, the trie can reject
incorrect strings very quickly (as soon as a non-matching prefix to a
valid string is seen).

To store a string in the dictionary, we start at the root (which is just
a hash
or array of the character codes of the first character of every string
in the
dictionary).

trie = root
for each character code in the string to store
  trie[character code] ||= {} # or , if arrays used
  trie = trie[character code] # drop down 1 level
end
trie[0] = true # Mark end of word (code 0 will never be used)

It is necessary to mark the end of a word because you can get words that
are prefixes of other words (for example, can, canna, and cannabis).
This allows you to decide if you want to use a least-prefix or greedy
search. This code uses a least-prefix search that returns the shortest
matching string in the dictionary. This is due to the requirements of
the quiz. However, it should be easy enough to code a greedy search.

The search algorithm is surprisingly simple. The search space is the
text that
we want to search for matching strings. The code starts at the first
character of the search space and tries to find a match in the
dictionary anchored at that position. If it does not detect a match, it
moves the anchor to the next character and starts again.

trie = root
for each character code in the search space
  break unless character code is in trie
  trie = trie[character code]
end
found a valid string if trie[0] exists

Each character in the dictionary to be searched is an index into a hash.

Informal benchmarks on my system (Intel Core 2 Duo E6600 on Win XP 32
bit)
shows a memory usage of about 15MB when using a hash as the basic trie
storage, and 30+Mb using an array, when storing a dictionary of 24,001
words.
The speed seems to be about the same either way, so the hash solution is
the best bet.

The test code finds all words in the 24,001 word dictionary in a text of
about
600KB. On my system, this takes around 2 seconds. This does include a
possibly
incorrect optimization: once a word is matched, the anchor point is
moved to
after the word so that substrings within the word are not matched. This
may
or may not be what is desired, but removing this optimization almost
doubles
the run time.

[CODE SNIPPED]

I have posted the code, Test::Unit tests, the test file, and dictionary
file on my web site here:

http://finecomputerconsultants.com/ruby_quiz/dictionary_matcher-0.1.tgz

I hope I have not violated any copyrights/lefts by supplying the text
files; if so, let me know and I will remedy this.

Wow. My solution's a real slowpoke compared to yours, even thought mine's
asymptotically faster: O(string.length) versus
O(string.length* max(keys.length))

I wonder what's slowing mine down so much, since these are in many ways
the same.

                        user system total real
Edwin Fine -- fill 0.810000 0.130000 0.940000 ( 0.939537)
Edwin Fine -- test 5.580000 0.630000 6.210000 ( 6.244736)
Ken Bloom -- fill 4.550000 0.460000 5.010000 ( 5.010708)
Ken Bloom -- test 25.210000 0.960000 26.170000 ( 27.519233)

(tested using the files you provided -- I modified your interface just a
little to alias your #find_all_matches method to #scan)

I'd be happy to benchmark anybody else's solution who implements #scan.

#!/usr/bin/env ruby
open("practical-file-system-design.txt") do |f|
   FILEDATA=f.read
end

open("words_en.txt") do |f|
   DICTIONARY=f.readlines.map{|x| x.chomp}
end

require 'benchmark'
include Benchmark
#the following files contain various implementations, renamed so as not to
#conflict with each other
require 'trie'
require 'finedm'

TESTCLASSES={"Ken Bloom" => KenDictionaryMatcher,
  "Edwin Fine" => FineDictionaryMatcher}

bm(TESTCLASSES.keys.collect{|x| x.length}.max + 8) do |benchmarker|
   matcher=nil
   TESTCLASSES.each do |name,klass|
      benchmarker.report("#{name} -- fill") do
   matcher=klass.new
   DICTIONARY.each {|x| matcher << x}
      end
      benchmarker.report("#{name} -- test") do
   matcher.scan(FILEDATA){}
      end
   end
end
__END__

···

On Mon, 27 Nov 2006 11:47:38 +0900, Edwin Fine wrote:

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

This solution uses a digital trie to encode the strings to search.
A good background to this data structure can be found here:
Ternary Search Trees | Dr Dobb's,

which also discusses a better solution that I did not explore (ternary
search trees).

A trie has a search speed that is O(m), where m is the number of
characters in the string to find in the dictionary. A hash table is
O(1), which is theoretically faster than a trie, but in practice the
trie is faster because the has table has to hash all characters in the
string to create the hash key, whereas the trie can reject a string by
matching as little as 1 character.

The solution trades memory usage for speed. Every string in the
dictionary is organized into a hierarchy of character codes.
Essentially, the dictionary encodes a DFA (deterministic finite
automaton) where each character code is a transition from one node to
the next one. This provides very fast search speeds, especially when a
non-matching string is encountered. In other words, the trie can reject
incorrect strings very quickly (as soon as a non-matching prefix to a
valid string is seen).

To store a string in the dictionary, we start at the root (which is just
a hash
or array of the character codes of the first character of every string
in the
dictionary).

trie = root
for each character code in the string to store
  trie[character code] ||= {} # or , if arrays used
  trie = trie[character code] # drop down 1 level
end
trie[0] = true # Mark end of word (code 0 will never be used)

It is necessary to mark the end of a word because you can get words that
are prefixes of other words (for example, can, canna, and cannabis).
This allows you to decide if you want to use a least-prefix or greedy
search. This code uses a least-prefix search that returns the shortest
matching string in the dictionary. This is due to the requirements of
the quiz. However, it should be easy enough to code a greedy search.

The search algorithm is surprisingly simple. The search space is the
text that
we want to search for matching strings. The code starts at the first
character of the search space and tries to find a match in the
dictionary anchored at that position. If it does not detect a match, it
moves the anchor to the next character and starts again.

trie = root
for each character code in the search space
  break unless character code is in trie
  trie = trie[character code]
end
found a valid string if trie[0] exists

Each character in the dictionary to be searched is an index into a hash.

Informal benchmarks on my system (Intel Core 2 Duo E6600 on Win XP 32
bit)
shows a memory usage of about 15MB when using a hash as the basic trie
storage, and 30+Mb using an array, when storing a dictionary of 24,001
words.
The speed seems to be about the same either way, so the hash solution is
the best bet.

The test code finds all words in the 24,001 word dictionary in a text of
about
600KB. On my system, this takes around 2 seconds. This does include a
possibly
incorrect optimization: once a word is matched, the anchor point is
moved to
after the word so that substrings within the word are not matched. This
may
or may not be what is desired, but removing this optimization almost
doubles
the run time.

[CODE SNIPPED]

I have posted the code, Test::Unit tests, the test file, and dictionary
file on my web site here:

http://finecomputerconsultants.com/ruby_quiz/dictionary_matcher-0.1.tgz

I hope I have not violated any copyrights/lefts by supplying the text
files; if so, let me know and I will remedy this.

Wow. My solution's a real slowpoke compared to yours, even thought mine's
asymptotically faster: O(string.length) versus
O(string.length* max(keys.length))

I wonder what's slowing mine down so much, since these are in many ways
the same.

                        user system total real
Edwin Fine -- fill 0.810000 0.130000 0.940000 ( 0.939537)
Edwin Fine -- test 5.580000 0.630000 6.210000 ( 6.244736)
Ken Bloom -- fill 4.550000 0.460000 5.010000 ( 5.010708)
Ken Bloom -- test 25.210000 0.960000 26.170000 ( 27.519233)

(tested using the files you provided -- I modified your interface just a
little to alias your #find_all_matches method to #scan)

I'd be happy to benchmark anybody else's solution who implements #scan.

#!/usr/bin/env ruby
open("practical-file-system-design.txt") do |f|
   FILEDATA=f.read
end

open("words_en.txt") do |f|
   DICTIONARY=f.readlines.map{|x| x.chomp}
end

require 'benchmark'
include Benchmark
#the following files contain various implementations, renamed so as not to
#conflict with each other
require 'trie'
require 'finedm'

TESTCLASSES={"Ken Bloom" => KenDictionaryMatcher,
  "Edwin Fine" => FineDictionaryMatcher}

bm(TESTCLASSES.keys.collect{|x| x.length}.max + 8) do |benchmarker|
   matcher=nil
   TESTCLASSES.each do |name,klass|
      benchmarker.report("#{name} -- fill") do
   matcher=klass.new
   DICTIONARY.each {|x| matcher << x}
      end
      benchmarker.report("#{name} -- test") do
   matcher.scan(FILEDATA){}
      end
   end
end
__END__

···

On Mon, 27 Nov 2006 11:47:38 +0900, Edwin Fine wrote:

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

Wow. My solution's a real slowpoke compared to yours, even thought
mine's
asymptotically faster: O(string.length) versus
O(string.length* max(keys.length))

I wonder what's slowing mine down so much, since these are in many ways
the same.

Well, do we get the same results from the scan of the big file? Recall
that in the scan I skip over strings once I see they are in the
dictionary, which cuts out a lot of work and may yield far fewer
results.

For example, if the text to be scanned is

abacusqwertyark

and the dictionary contains the words abacus and ark, the index into the
string will move as follows:

abacusqwertyark # Begin
^

abacusqwertyark # Found abacus, skip it
      ^
abacusqwertyark # Found ark
               ^

A solution that did not skip the words would find additional substrings,
for example:

abacusqwertyark # Begin
^

abacusqwertyark # Found abacus, start again on next character
^
... omit some steps ...

abacusqwertyark # Found "us"
    ^

Maybe I'm cheating or there's some other bug in my code that omits
results that you find :slight_smile:

···

--
Posted via http://www.ruby-forum.com/\.

Yes, I also skip over the remainder of a word, so we should get the same
results. I can check that later though.

--Ken

···

On Mon, 27 Nov 2006 15:20:32 +0900, Edwin Fine wrote:

Well, do we get the same results from the scan of the big file? Recall
that in the scan I skip over strings once I see they are in the
dictionary, which cuts out a lot of work and may yield far fewer
results.

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

Apparently, part of my problem was the use of the Enumerator object.
Eliminating that and keeping track of the index manually saves about 5
seconds off the benchmark.

I also took my solution and reimplemented it in your code --
instead of having Node objects, I just use your hashes now and for the
special fields I need, I use hash[:failure] and hash[:depth]. This saved
about 15 seconds, so my new solution (based on your code) is only about a
second slower than your solution, which is probably acceptable given the
length of the words in the dictionary file. If the entries in the
dictionary were longer, then my new solution should start to win. And by
the way, both of these solutions win out over fairly large regular
expressions pretty easily, since regular expressions have to backtrack to
try each branch.

                        user system total real
Edwin Fine -- fill 0.840000 0.090000 0.930000 ( 0.933338)
Edwin Fine -- test 5.650000 0.540000 6.190000 ( 6.194274)
Ken Bloom -- fill 3.800000 0.300000 4.100000 ( 4.104002)
Ken Bloom -- test 6.800000 0.370000 7.170000 ( 7.167073)

#based on Edwin Fine's solution, this reimplements my solution with
#less overhead all around.
class DictionaryMatcher
  attr_reader :word_count

  def initialize
    @trie = {}
    @word_count = 0
  end

  def add_word(word)
    @word_count += 1
    container = @trie
    containers=

    i=0
    word.each_byte do |b|
      container[b] = {} unless container.has_key? b
      container[:depth]=i
      containers << container
      container = container[b]
      i+=1
    end
    containers << container

    container[0] = true # Mark end of word
     
    ff=compute_failure_function word
    ff.zip(containers).each do |pointto,container|
      container[:failure]=containers[pointto] if pointto
    end

  end

  def compute_failure_function p
    m=p.size
    pi=[nil,0]
    k=0
    2.upto m do |q|
      k=pi[k] while k>0 and p[k] != p[q-1]
      k=k+1 if p[k]==p[q-1]
      pi[q]=k
    end
    pi
  end
  private :compute_failure_function

  def include?(word)
    container = @trie
    word.each_byte do |b|
      break unless container.has_key? b
      container = container[b]
    end
    container[0]
  end

  def =~ text
    internal_match text {|pos,len| return pos}
    nil
  end

  def internal_match string
      node=@trie
      pos=0
      string.each_byte do |b|
   advance=false
   until advance
      nextnode=node[b]
      if not nextnode
         if node[:failure]
      node=node[:failure]
         else
      advance=true
         end
      elsif nextnode[0]
         yield pos, nextnode[:depth]
         advance=true
         node=@trie
      else
         advance=true
         node=nextnode
      end
      pos+=1
   end
      end
  end
  private :internal_match

  def find_all_matching(text, &block)
    matches=
    block= lambda{ |pos,len| matches << [pos,len] } unless block
    internal_match(text,&block)
    matches
  end

  alias_method :scan, :find_all_matching

  # implement much of the rest of the interface implemented by Regexps
  alias_method :===, :=~
  alias_method :match, :=~
  alias_method :<<, :add_word

  # Add words from a file
  def add_words(words_file)
    IO.foreach(words_file) do |line|
      add_word line.chomp
    end
  end
end

···

On Mon, 27 Nov 2006 15:20:32 +0900, Edwin Fine wrote:

Wow. My solution's a real slowpoke compared to yours, even thought
mine's
asymptotically faster: O(string.length) versus
O(string.length* max(keys.length))

I wonder what's slowing mine down so much, since these are in many ways
the same.

Well, do we get the same results from the scan of the big file? Recall
that in the scan I skip over strings once I see they are in the
dictionary, which cuts out a lot of work and may yield far fewer
results.

For example, if the text to be scanned is

abacusqwertyark

and the dictionary contains the words abacus and ark, the index into the
string will move as follows:

abacusqwertyark # Begin
^

abacusqwertyark # Found abacus, skip it
      ^
abacusqwertyark # Found ark
               ^

A solution that did not skip the words would find additional substrings,
for example:

abacusqwertyark # Begin
^

abacusqwertyark # Found abacus, start again on next character
^
.. omit some steps ...

abacusqwertyark # Found "us"
    ^

Maybe I'm cheating or there's some other bug in my code that omits
results that you find :slight_smile:

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

Apparently, part of my problem was the use of the Enumerator object.
Eliminating that and keeping track of the index itself saves about 5
seconds off the benchmark.

I also took my solution and reimplemented it in your code --
instead of having Node objects, I just use hashes now and for the special
fields I need, I use hash[:failure] and hash[:depth]. This saved about 15
seconds, so my new solution (based on your code) is only about a second
slower than your solution, which is probably acceptable given the length
of the words in the dictionary file. If the entries in the dictionary
were longer, then my new solution should start to win. And by the way, both
of these solutions win out over fairly large regular expressions pretty
easily, since regular expressions have to backtrack to try each branch.

                        user system total real
Edwin Fine -- fill 0.840000 0.090000 0.930000 ( 0.933338)
Edwin Fine -- test 5.650000 0.540000 6.190000 ( 6.194274)
Ken Bloom -- fill 3.800000 0.300000 4.100000 ( 4.104002)
Ken Bloom -- test 6.800000 0.370000 7.170000 ( 7.167073)

#based on Edwin Fine's solution, this reimplements my solution with
#less overhead all around.
class DictionaryMatcher
  attr_reader :word_count

  def initialize
    @trie = {}
    @word_count = 0
  end

  def add_word(word)
    @word_count += 1
    container = @trie
    containers=

    i=0
    word.each_byte do |b|
      container[b] = {} unless container.has_key? b
      container[:depth]=i
      containers << container
      container = container[b]
      i+=1
    end
    containers << container

    container[0] = true # Mark end of word
     
    ff=compute_failure_function word
    ff.zip(containers).each do |pointto,container|
      container[:failure]=containers[pointto] if pointto
    end

  end

  def compute_failure_function p
    m=p.size
    pi=[nil,0]
    k=0
    2.upto m do |q|
      k=pi[k] while k>0 and p[k] != p[q-1]
      k=k+1 if p[k]==p[q-1]
      pi[q]=k
    end
    pi
  end
  private :compute_failure_function

  def include?(word)
    container = @trie
    word.each_byte do |b|
      break unless container.has_key? b
      container = container[b]
    end
    container[0]
  end

  def =~ text
    internal_match text {|pos,len| return pos}
    nil
  end

  def internal_match string
      node=@trie
      pos=0
      string.each_byte do |b|
   advance=false
   until advance
      nextnode=node[b]
      if not nextnode
         if node[:failure]
      node=node[:failure]
         else
      advance=true
         end
      elsif nextnode[0]
         yield pos, nextnode[:depth]
         advance=true
         node=@trie
      else
         advance=true
         node=nextnode
      end
      pos+=1
   end
      end
  end
  private :internal_match

  def find_all_matching(text, &block)
    matches=
    block= lambda{ |pos,len| matches << [pos,len] } unless block
    internal_match(text,&block)
    matches
  end

  alias_method :scan, :find_all_matching

  # implement much of the rest of the interface implemented by Regexps
  alias_method :===, :=~
  alias_method :match, :=~
  alias_method :<<, :add_word

  # Add words from a file
  def add_words(words_file)
    IO.foreach(words_file) do |line|
      add_word line.chomp
    end
  end
end

···

On Mon, 27 Nov 2006 15:20:32 +0900, Edwin Fine wrote:

Wow. My solution's a real slowpoke compared to yours, even thought
mine's
asymptotically faster: O(string.length) versus
O(string.length* max(keys.length))

I wonder what's slowing mine down so much, since these are in many ways
the same.

Well, do we get the same results from the scan of the big file? Recall
that in the scan I skip over strings once I see they are in the
dictionary, which cuts out a lot of work and may yield far fewer
results.

For example, if the text to be scanned is

abacusqwertyark

and the dictionary contains the words abacus and ark, the index into the
string will move as follows:

abacusqwertyark # Begin
^

abacusqwertyark # Found abacus, skip it
      ^
abacusqwertyark # Found ark
               ^

A solution that did not skip the words would find additional substrings,
for example:

abacusqwertyark # Begin
^

abacusqwertyark # Found abacus, start again on next character
^
.. omit some steps ...

abacusqwertyark # Found "us"
    ^

Maybe I'm cheating or there's some other bug in my code that omits
results that you find :slight_smile:

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/