How to build an index of phrases in a phrase/sentence?

I am trying to build an indexing structure on some phrases. Most phrases will have 2 - 5 parts (words). The resulting array will be dumped into an index to find the matching phrases. I don't want to do wildcard searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches for "is some" or "some text", I want it to find this phrase. I don't want a search for "is text" to find this phrase though.

My solution so far can find all but the middle elements. In this case, "is some". But when the original phrase has more parts, then more middle parts are not added to the array.

text = "This is some text"
#=> "This is some text"
ws = ''; text.split(/\W/).collect{|w| ws = (ws+' '+w).strip; ws}
#=> ["This", "This is", "This is some", "This is some text"]
ws = ''; text.split(/\W/).reverse.collect{|w| ws = (w+' '+ws).strip; ws}
#=> ["text", "some text", "is some text", "This is some text"]
text.split(/\W/).collect{|w| w}
=> ["This", "is", "some", "text"]

Is there an better Ruby way to do this? Or is there a better data structure for retrieving a word or an exact phrase within a phrase/sentence without wild-carding the search.

Thanks,

Dan

Perhaps I'm not understanding, but how does the following fail to meet your needs?

"This is some text".match(/is some/i)

James Edward Gray II

···

On May 27, 2005, at 1:39 PM, Dan Fitzpatrick wrote:

I am trying to build an indexing structure on some phrases. Most phrases will have 2 - 5 parts (words). The resulting array will be dumped into an index to find the matching phrases. I don't want to do wildcard searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches for "is some" or "some text", I want it to find this phrase. I don't want a search for "is text" to find this phrase though.

maybe

   harp:~ > cat a.rb
   class String
     def subphrases delim = %r/\s+/
       words = strip.split delim
       return if words.empty?
       words.inject(){|a,w| a << [a[-1],w].compact.join(' ')} +
         words[1..-1].join(' ').subphrases
     end
   end

   p "This is some text".subphrases
   p "foo".subphrases
   p "".subphrases

   harp:~ > ruby a.rb
   ["This", "This is", "This is some", "This is some text", "is", "is some", "is some text", "some", "some text", "text"]
   ["foo"]
   

you may want to check out glimpse.

cheers.

-a

···

On Sat, 28 May 2005, Dan Fitzpatrick wrote:

I am trying to build an indexing structure on some phrases. Most phrases will have 2 - 5 parts (words). The resulting array will be dumped into an index to find the matching phrases. I don't want to do wildcard searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches for "is some" or "some text", I want it to find this phrase. I don't want a search for "is text" to find this phrase though.

My solution so far can find all but the middle elements. In this case, "is some". But when the original phrase has more parts, then more middle parts are not added to the array.

text = "This is some text"
#=> "This is some text"
ws = ''; text.split(/\W/).collect{|w| ws = (ws+' '+w).strip; ws}
#=> ["This", "This is", "This is some", "This is some text"]
ws = ''; text.split(/\W/).reverse.collect{|w| ws = (w+' '+ws).strip; ws}
#=> ["text", "some text", "is some text", "This is some text"]
text.split(/\W/).collect{|w| w}
=> ["This", "is", "some", "text"]

Is there an better Ruby way to do this? Or is there a better data structure for retrieving a word or an exact phrase within a phrase/sentence without wild-carding the search.

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
My religion is very simple. My religion is kindness.
--Tenzin Gyatso

===============================================================================

Dan Fitzpatrick wrote:

I am trying to build an indexing structure on some phrases. Most phrases
will have 2 - 5 parts (words). The resulting array will be dumped into
an index to find the matching phrases. I don't want to do wildcard
searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches
for "is some" or "some text", I want it to find this phrase. I don't
want a search for "is text" to find this phrase though.

My solution so far can find all but the middle elements. In this case,
"is some". But when the original phrase has more parts, then more middle
parts are not added to the array.

text = "This is some text"
#=> "This is some text"
ws = ''; text.split(/\W/).collect{|w| ws = (ws+' '+w).strip; ws}
#=> ["This", "This is", "This is some", "This is some text"]
ws = ''; text.split(/\W/).reverse.collect{|w| ws = (w+' '+ws).strip; ws}
#=> ["text", "some text", "is some text", "This is some text"]
text.split(/\W/).collect{|w| w}
=> ["This", "is", "some", "text"]

Is there an better Ruby way to do this? Or is there a better data
structure for retrieving a word or an exact phrase within a
phrase/sentence without wild-carding the search.

Thanks,

Dan

I think what you want is a suffix tree.

http://www.google.com/search?q=suffix+tree&ie=UTF-8&oe=UTF-8

Luca

Oops, I sent that last reply without performance results.

I loaded in a file with 496 words and calculated all the sub-phrases, and measured the time needed to 'parse' the information, and how much memory was used. I then timed how long it took to match a sub-phrase in the middle of the file, and also to 'find' a phrase that didn't exist.

                      user system total real
create array: 12.910000 0.950000 13.860000 ( 15.403606)
run 100: 15.570000 0.220000 15.790000 ( 17.475724)
array - 158MB of VM

                      user system total real
create set: 16.040000 1.100000 17.140000 ( 21.742738)
run 100k: 0.910000 0.000000 0.910000 ( 1.088728)
set - 159MB of VM

                      user system total real
create matcher: 85.430000 1.340000 86.770000 ( 96.524512)
run 100k: 10.050000 0.160000 10.210000 ( 11.245722)
matcher - 68MB of VM

Note that the array was measuring only 100 iterations of the 2-phrase match, while the other two performed 100 *thousand* iterations.

The array technique is thus over 10,000 slower to find a match than the technique using the Set. and about 1,000 slower than the Trie version, but does setup the data structure more quickly than either.

The Trie method's memory use should also increase at a slower rate than the others as the source string increases, but I don't know how to use Ruby to measure VM of a process, so I can't make a simple automated test to graph this.

···

--
"Despite the surge of power you feel upon learning Ruby,
resist the urge to trip others or slap them in the bald head.
DO NOT LORD YOUR RUBYNESS OVER OTHERS!"
- Why the Lucky Stiff

Direct Answer

···

On May 27, 2005, at 12:39 PM, Dan Fitzpatrick wrote:

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

-------------------------------------------------
def phrases( string )
     pieces = string.split( /\s/ )
     out =
     pieces.each_index{ |start_index|
         (start_index+1).upto( pieces.length ){ |end_index|
             out << pieces[start_index...end_index].join(' ')
         }
     }
     out
end

all = phrases( "It's the end of the world as we know it." )
p all
#=> ["It's", "It's the", "It's the end", "It's the end of", "It's the end of the", "It's the end of the world", "It's the end of the world as", "It's the end of the world as we", "It's the end of the world as we know", "It's the end of the world as we know it.", "the", "the end", "the end of", "the end of the", "the end of the world", "the end of the world as", "the end of the world as we", "the end of the world as we know", "the end of the world as we know it.", "end", "end of", "end of the", "end of the world", "end of the world as", "end of the world as we", "end of the world as we know", "end of the world as we know it.", "of", "of the", "of the world", "of the world as", "of the world as we", "of the world as we know", "of the world as we know it.", "the", "the world", "the world as", "the world as we", "the world as we know", "the world as we know it.", "world", "world as", "world as we", "world as we know", "world as we know it.", "as", "as we", "as we know", "as we know it.", "we", "we know", "we know it.", "know", "know it.", "it."]

p all.include?( "end of the world" )
#=> true

p all.include?( "end the world" )
#=> false

Better/Faster
-------------------------------------------------
def phrase_matches( string )
     require 'set'
     pieces = string.split( /\s/ )
     out = Set.new
     pieces.each_index{ |start_index|
         (start_index+1).upto( pieces.length ){ |end_index|
             out.add( pieces[start_index...end_index].join(' ') )
         }
     }
     out
end

all = phrase_matches( "It's the end of the world as we know it." )
p all
p all.include?( "end of the world" )
p all.include?( "end the world" )

Complex-But-Memory-Efficient Answer
-------------------------------------------------
class TrieNode
     attr_accessor :children

     def initialize
         @children = {}
     end

     def add_path( array )
         node = self
         array.each{ |item| node = node.children[ item ] ||= TrieNode.new }
     end

     def includes_path?( array )
         node = self
         array.each{ |item| return false unless node = node.children[ item ] }
         true
     end

     def to_hier( depth=0 )
         tabs = "-"*depth
         out = ''
         @children.each{ |char,node|
             out << "#{tabs}#{char}\n"
             out << node.to_hier( depth+1 )
         }
         out
     end
end

class PhraseMatcher
     MATCH_WORDS = /[a-z']+/

     def initialize( string )
         @root = TrieNode.new
         pieces = string.downcase.scan( MATCH_WORDS )
         pieces.each_index{ |start_index|
             (start_index+1).upto( pieces.length ){ |end_index|
                 @root.add_path( pieces[start_index...end_index] )
             }
         }
     end

     def includes_phrase?( string )
         @root.includes_path?( string.scan( MATCH_WORDS) )
     end
end

sub_phrases = PhraseMatcher.new( "It's the end of the world as we know it, and I feel fine." )

p sub_phrases.includes_phrase?( "end of the world" )
#=> true

p sub_phrases.includes_phrase?( "end the world" )
#=> false

sub_phrases.instance_eval{ puts @root.to_hier }
#=> it
#=> -and
#=> --i
#=> ---feel
#=> ----fine
#=> world
#=> -as
#=> --we
#=> ---know
#=> ----it
#=> -----and
#=> ------i
#=> -------feel
#=> --------fine
#=> and
#=> -i
#=> --feel
#=> ---fine
#=> of
#=> -the
#=> --world
#=> ---as
#=> ----we
#=> -----know
#=> ------it
#=> -------and
#=> --------i
#=> ---------feel
#=> ----------fine
#=> it's
#=> -the
#=> --end
#=> ---of
#=> ----the
#=> -----world
#=> ------as
#=> -------we
#=> --------know
#=> ---------it
#=> ----------and
#=> -----------i
#=> ------------feel
#=> -------------fine
#=> we
#=> -know
#=> --it
#=> ---and
#=> ----i
#=> -----feel
#=> ------fine
#=> know
#=> -it
#=> --and
#=> ---i
#=> ----feel
#=> -----fine
#=> end
#=> -of
#=> --the
#=> ---world
#=> ----as
#=> -----we
#=> ------know
#=> -------it
#=> --------and
#=> ---------i
#=> ----------feel
#=> -----------fine
#=> the
#=> -world
#=> --as
#=> ---we
#=> ----know
#=> -----it
#=> ------and
#=> -------i
#=> --------feel
#=> ---------fine
#=> -end
#=> --of
#=> ---the
#=> ----world
#=> -----as
#=> ------we
#=> -------know
#=> --------it
#=> ---------and
#=> ----------i
#=> -----------feel
#=> ------------fine
#=> as
#=> -we
#=> --know
#=> ---it
#=> ----and
#=> -----i
#=> ------feel
#=> -------fine
#=> fine
#=> i
#=> -feel
#=> --fine
#=> feel
#=> -fine

One last followup (sorry, I'm bored onboard a plane) :slight_smile:

I did one manual test of RAM comparing the VM used by the Set storage versus the Trie storage, comparing the previously-measured 496 word document with a document that had 1007 words. The results were as I expected:

469 words:
     create set: 16.040000 1.100000 17.140000 ( 21.742738)
     159MB of VM

     create matcher: 85.430000 1.340000 86.770000 ( 96.524512)
     68MB of VM

1007 words:
     create set: 137.470000 9.400000 146.870000 (166.828737)
     ~1GB of VM

     create matcher: 746.690000 11.050000 757.740000 (806.450292)
     149MB of VM

Conclusion: if you have the RAM to spare, the Set-based approach is quite speedy, but it gets greedy as your full phrase base grows. If you need to save some memory and can spare the time, go with the Trie based approach.

Now, having done all this work...if all you want is sub-phrase matching, why not use a regexp?

469 words:
                               user system total real
     create clean string: 0.010000 0.010000 0.020000 ( 0.003050)
     run 100k matches: 10.750000 0.140000 10.890000 ( 15.839430)
     28MB of VM

1007 words:
                               user system total real
     create clean string: 0.010000 0.010000 0.020000 ( 0.432572)
     run 100k matches: 19.350000 0.200000 19.550000 ( 27.612700)
     28MB of VM

[Slim:~/Desktop/Match Phrases] gavinkis% cat regexp.rb
require 'benchmark'

cleaned = nil
matcher = Regexp.new( "\\b#{ARGV[1]}\\b" )

Benchmark.bm( 20 ){ |x|
         x.report( "create clean string:" ){
                 cleaned = IO.read( ARGV[0] ).downcase.scan( /[a-z']+/ ).join( ' ' )
         }
         x.report( "run 100k matches:"){
                 100_000.times{
                         cleaned =~ matcher
                         cleaned =~ /the brown fox/
                 }
         }
}

exponential run time vs. O(1) runtime once the index is built maybe?

-a

···

On Sat, 28 May 2005, James Edward Gray II wrote:

On May 27, 2005, at 1:39 PM, Dan Fitzpatrick wrote:

I am trying to build an indexing structure on some phrases. Most phrases will have 2 - 5 parts (words). The resulting array will be dumped into an index to find the matching phrases. I don't want to do wildcard searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches for "is some" or "some text", I want it to find this phrase. I don't want a search for "is text" to find this phrase though.

Perhaps I'm not understanding, but how does the following fail to meet your needs?

"This is some text".match(/is some/i)

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
My religion is very simple. My religion is kindness.
--Tenzin Gyatso

===============================================================================

I have a few million phrases to index so I don't want to loop through every phrase and test it with a regular expression.

Below is my working solution. Sometimes just describing the problem helps the brain to take a new approach:

text = "This is some text"
parts = text.split(/\W/)
phrases =
parts.each_index do |s|
   s.upto(parts.size - 1){|e| phrases.push(parts.slice(s..e))}
end
phrases.each{|p| puts p.join(' ')}

Dan

James Edward Gray II wrote:

···

On May 27, 2005, at 1:39 PM, Dan Fitzpatrick wrote:

I am trying to build an indexing structure on some phrases. Most phrases will have 2 - 5 parts (words). The resulting array will be dumped into an index to find the matching phrases. I don't want to do wildcard searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches for "is some" or "some text", I want it to find this phrase. I don't want a search for "is text" to find this phrase though.

Perhaps I'm not understanding, but how does the following fail to meet your needs?

"This is some text".match(/is some/i)

James Edward Gray II

Besides glimpse, there's a bunch of indexers in RAA and elsewhere,
Gonzui, RSI, Odeum, cli:

http://rsi.rubyforge.org/files/lib/rsi/rsi_intro_rb.html
http://gonzui.sourceforge.net/index.html.en
http://clabs.org/ruby.htm
http://www.zedshaw.com/projects/ruby_odeum/

mayb also look at the python indexers and Bayesian classifiers at
www.divmod.org
You didn't talk about stemming, discarding particles like "a",
"the",etc, issues like that but this should get you started

Full Inverted Index maybe?

Dan Fitzpatrick wrote:

···

I have a few million phrases to index so I don't want to loop through every phrase and test it with a regular expression.

Below is my working solution. Sometimes just describing the problem helps the brain to take a new approach:

text = "This is some text"
parts = text.split(/\W/)
phrases =
parts.each_index do |s|
  s.upto(parts.size - 1){|e| phrases.push(parts.slice(s..e))}
end
phrases.each{|p| puts p.join(' ')}

Dan

James Edward Gray II wrote:

On May 27, 2005, at 1:39 PM, Dan Fitzpatrick wrote:

I am trying to build an indexing structure on some phrases. Most phrases will have 2 - 5 parts (words). The resulting array will be dumped into an index to find the matching phrases. I don't want to do wildcard searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches for "is some" or "some text", I want it to find this phrase. I don't want a search for "is text" to find this phrase though.

Perhaps I'm not understanding, but how does the following fail to meet your needs?

"This is some text".match(/is some/i)

James Edward Gray II