[QUIZ] Index and Query (#54)

Hey Guys,
I've written up the benchmark results. Don't take them too seriously.
I didn't exactly run them very scientifically. It just gives you a
ball park figure for how well each indexer performs against another.
Hopefully you'll be able to use it to help improve the performance of
your ruby code in future.

http://www.davebalmain.com/articles/2005/11/15/ruby-quiz-54-results

There were certainly a few surprises in there for me. Like how did Bob
and Dale manage to index so quickly without using the StringScanner
class. And why is Bob's search so fast, even though he didn't use an
inverted index. Even for much larger document sets his search runs
almost the same speed as Simple Ferret. He used exactly the same
algorithm as is used in Simple Search, yet his searches are much
quicker. I was also amazed at how well these simple solutions compared
to Lucene, although I realize I'm comparing apples with oranges.

Cheers,
Dave

David,

After reading your results I thought I would try and make a couple of
simple changes. I attempted to cleanup the 'insert' routine since that
is where most of the processing time seemed to be spent. I also added
the ability to perform multi-term searching (individual terms or single
string). This will worsen the look-up times, but it might be a good
change.

If possible, could you run this version through your test to see how it
does?

class IndexHash
  def initialize( documents=nil )
    @index = Hash.new( [] )
    input( documents ) if documents
  end

  def input( documents )
    documents.each_pair do |symbol, contents|
      contents.split.each { |word| insert( symbol, word) }
    end
  end

  def insert( document_symbol, word )
    w = word.downcase
    @index[w] += [ document_symbol ] unless @index[w].include?(
document_symbol )
  end

  def find( *strings )
    result = []
    strings.each do |string|
      string.split.each do |word|
        result += @index[ word.downcase ]
      end
    end
    result.uniq
  end

  def words
    @index.keys.sort
  end
end

class IndexBitmap
  def initialize( documents=nil )
    @index = []
    @documents = Hash.new( 0 )
    input( documents ) if documents
  end

  def input( documents )
    documents.each_pair do |symbol, contents|
      contents.split.each { |word| insert( symbol, word) }
    end
  end

  def insert( document_symbol, word )
    w = word.downcase
    @index.push( w ) unless @index.include?( w )
    @documents[ document_symbol ] |= (1<<@index.index( w ))
  end

  def find( *strings )
    result = []
    mask = 0

    strings.each do |string|
      string.split.each do |word|
        w = word.downcase
        mask |= (1<<@index.index(w)) if @index.index(w)
      end
    end

    @documents.each_pair do |symbol, value|
      result.push( symbol ) if value & mask
    end
    result
  end
  
  def words
    @index.sort
  end
end

David Balmain wrote:

And why is Bob's search so fast, even though he didn't use an
inverted index.

Blind dumb luck, I'm sure :slight_smile:

For kicks, I've created a version of my solution that uses an inverted index:

   http://users.adelphia.net/~showaltb/rubyquiz/54/indexer2.rb

David,

After reading your results I thought I would try and make a couple of
simple changes. I attempted to cleanup the 'insert' routine since that
is where most of the processing time seemed to be spent. I also added
the ability to perform multi-term searching (individual terms or single
string). This will worsen the look-up times, but it might be a good
change.

If possible, could you run this version through your test to see how it
does?

Hi Dale,
I've updated the results. You may notice a number of the indexing
times have changed. I modified my test to make it a little fairer.
Anyway, you earned yourself some green realestate on the bottom chart.
Somethings up with the search results from your first bitmap index.
I'm just running that one again. By the way, I had to change the
second last line of your find method to;

  result.push( symbol ) if value & mask > 0

Cheers,
Dave

···

On 11/16/05, Dale Martenson <dale.martenson@gmail.com> wrote:

class IndexHash
        def initialize( documents=nil )
                @index = Hash.new( )
                input( documents ) if documents
        end

        def input( documents )
                documents.each_pair do |symbol, contents|
                        contents.split.each { |word| insert( symbol, word) }
                end
        end

        def insert( document_symbol, word )
                w = word.downcase
                @index[w] += [ document_symbol ] unless @index[w].include?(
document_symbol )
        end

        def find( *strings )
                result =
                strings.each do |string|
                        string.split.each do |word|
                                result += @index[ word.downcase ]
                        end
                end
                result.uniq
        end

        def words
                @index.keys.sort
        end
end

class IndexBitmap
        def initialize( documents=nil )
                @index =
                @documents = Hash.new( 0 )
                input( documents ) if documents
        end

        def input( documents )
                documents.each_pair do |symbol, contents|
                        contents.split.each { |word| insert( symbol, word) }
                end
        end

        def insert( document_symbol, word )
                w = word.downcase
                @index.push( w ) unless @index.include?( w )
                @documents[ document_symbol ] |= (1<<@index.index( w ))
        end

        def find( *strings )
                result =
                mask = 0

                strings.each do |string|
                        string.split.each do |word|
                                w = word.downcase
                                mask |= (1<<@index.index(w)) if @index.index(w)
                        end
                end

                @documents.each_pair do |symbol, value|
                        result.push( symbol ) if value & mask
                end
                result
        end

        def words
                @index.sort
        end
end

Nice. Search is twice as fast as your original version. You've
restored my faith in smart algorithms. :wink:

Dave

···

On 11/17/05, Bob Showalter <bob_showalter@taylorwhite.com> wrote:

David Balmain wrote:
> And why is Bob's search so fast, even though he didn't use an
> inverted index.

Blind dumb luck, I'm sure :slight_smile:

For kicks, I've created a version of my solution that uses an inverted
index:

   http://users.adelphia.net/~showaltb/rubyquiz/54/indexer2.rb

Hi Dave,

Thanks for updating the results.

My first version of the IndexBitmap class has a bug in the 'insert'
routine. It added values rather than ORed values together. Oops. None
of my tests originally covered this case so it slipped by me.

@documents[ document_symbol ] += (1<<@index.index( word.downcase ))

should be

@documents[ document_symbol ] |= (1<<@index.index( word.downcase ))

Thanks for catching my error where I wasn't checking the mask correctly
in all cases. Again, I lacked a test for this case. Shame. Shame.

Thank you very much for your testing and analysis.

--Dale