Hey,
So as I am reviewing Ruby I came across the old find strings with xyz
pattern.
The code is as follows:
class WordIndex
  def initialize
    @index = {}
  end

  def add_to_index(obj, *phrases)
    phrases.each do |phrase|
      phrase.scan(/\w[\w']+/) do |word| # extract each word
        word.downcase!
        @index[word] = [] if @index[word].nil?
        @index[word].push(obj)
      end
    end
  end

  def lookup(word)
    @index[word.downcase]
  end
end
So here is my understanding:
phrases contains the phrases/patterns to look for. |phrase| is replaced
by each of the phrases to check for.
What it seems to me is that the code is a way to classify objects based on each
word of a set of phrases, which is somehow related to that object. For example,
it could be an html page and each phrase is a part of the text present
on the page,
a file and its contents, etc.
In the method, *phrases means that phrases will be an array containing
all the remaining parameters passed to the method after obj. For example:
irb(main):001:0> def a(obj, *phrases)
irb(main):002:1> puts phrases.class
irb(main):003:1> p phrases
irb(main):004:1> end
=> nil
irb(main):005:0> a("an obj", "the first phrase", "the second one", "3rd one")
Array
["the first phrase", "the second one", "3rd one"]
=> nil
Then, the each method will call the block (do ... end) passing each of
the phrases as the block parameter (phrase).
After that, I become a bit confused. I know they are looking for the
phrase in each of the words (/\w[\w']+/) but how does that work?
Scanning will iterate through the phrase searching for the pattern,
and then pass each result to the block. The result is the section of
the string that matches. And what sections match this pattern? A word
character, followed by one or more word characters or apostrophes (').
For example:
irb(main):019:0> re = /\w[\w']+/
=> /\w[\w']+/
irb(main):020:0> "this is a normal sentence".scan(re) {|x| p x}
"this"
"is"
"normal"
"sentence"
As you can see, the "a" has been skipped, since the regexp is asking
for at least two consecutive word letters, or a word letter and a '.
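To see how the apostrophe part of the pattern behaves, here is a small sketch (the sentence and the WORD_RE constant name are just made up for illustration). Without a block, scan returns the matches as an array:

```ruby
# Same pattern as in add_to_index: one word character, then one or
# more word characters or apostrophes.
WORD_RE = /\w[\w']+/

# Called without a block, scan collects the matches into an array.
words = "I don't love a cat".scan(WORD_RE)
p words   # => ["don't", "love", "cat"]
```

"I" and "a" are skipped because they are a single character, and "don't" stays in one piece because ' is allowed after the first character.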
How is each string broken down into "words"?
Scanning the sentence with that regexp does the splitting.
Why is that done anyway (why not just find the pattern and move on?).
Because what the code is doing is retrieving every word of two or more
characters to index the object through that word. For each new word, it
creates an array, and then it pushes the object onto that word's array.
This means that at the end you have a hash with each word in every
sentence as a key, and whose value is an array containing all objs
related to that word.
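As a rough sketch of what the hash ends up holding, here is the class again with a couple of made-up objects (the symbols :page_one and :page_two are only placeholders, any object would do):

```ruby
class WordIndex
  def initialize
    @index = {}
  end

  def add_to_index(obj, *phrases)
    phrases.each do |phrase|
      phrase.scan(/\w[\w']+/) do |word|
        word.downcase!
        @index[word] = [] if @index[word].nil?  # first time we see this word
        @index[word].push(obj)
      end
    end
  end

  def lookup(word)
    @index[word.downcase]
  end
end

index = WordIndex.new
index.add_to_index(:page_one, "I have a dog", "I have a cat")
index.add_to_index(:page_two, "I love my cat")

p index.lookup("cat")   # => [:page_one, :page_two]
p index.lookup("have")  # => [:page_one, :page_one]
p index.lookup("bird")  # => nil
```

Note that an object is pushed once per occurrence of a word, so :page_one shows up twice under "have", and a word that was never indexed gives nil.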
Also, what is obj exactly (other than an object)? How is it being
formed so it can be pushed into the stack?
Well, let's imagine an example of how this method could be used:
I want to read all txt files in a folder and index the name of the file
based on each word in the file.
index = WordIndex.new
Dir["*.txt"].each do |file|
  lines = File.open(file) {|f| f.readlines.map {|l| l.chomp}}
  index.add_to_index(file, *lines)
end
Now we can locate all files that contain a word:
%w{cat dog}.each {|word| puts "Files that contain '#{word}': #{index.lookup(word).join(',')}"}
For example I have three files:
one.txt:
I have a dog
I have a cat
two.txt:
I have a dog, and nothing else.
I do have a car too.
three.txt:
I love my cat
I don't love anything else
$ ruby wordindex.rb
Files that contain 'cat': one.txt,three.txt
Files that contain 'dog': one.txt,two.txt
So, you have indexed the contents of the files per word.
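One detail in the call add_to_index(file, *lines): the * at the call site does the opposite of the * in the method definition, it spreads the array into separate arguments. A minimal sketch, with a hypothetical helper called show that just returns what it received:

```ruby
# Hypothetical helper with the same parameter list as add_to_index.
def show(obj, *phrases)
  [obj, phrases]
end

lines = ["I have a dog", "I have a cat"]

with_splat    = show("one.txt", *lines)  # each line is its own argument
without_splat = show("one.txt", lines)   # the whole array is one argument

p with_splat     # => ["one.txt", ["I have a dog", "I have a cat"]]
p without_splat  # => ["one.txt", [["I have a dog", "I have a cat"]]]
```

Without the *, phrases would hold a single element that is itself an array, and add_to_index would fail when it tries to call scan on it.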
Finally, what does index[word] represent? I am guessing a hash...
@index is a hash. @index[word] is an array.
And btw, you can initialize a hash like this and remove a line:
@index = Hash.new {|h,k| h[k] = []}
and remove this line:
@index[word] = [] if @index[word].nil?
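A quick sketch of how a hash with a default block behaves (the keys and file names here are only sample data):

```ruby
# The block runs whenever a missing key is read; it stores a fresh
# empty array under that key and returns it.
index = Hash.new { |hash, key| hash[key] = [] }

index["cat"].push("one.txt")    # no nil check needed
index["cat"].push("three.txt")
p index["cat"]                  # => ["one.txt", "three.txt"]

p index.key?("dog")             # => false
index["dog"]                    # just reading a missing key runs the block
p index.key?("dog")             # => true
```

One side effect to be aware of: with this version, lookup on a word that was never indexed returns an empty array instead of nil, and it adds that word as a key.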
Hope this helps,
Jesus.
On Fri, Sep 19, 2008 at 9:25 PM, Dave Lenhardt <davidinanhui@gmail.com> wrote: