Quicker finding strings? Alternative for array, hash, set?

I've been searching for this information already, but cannot really find
the answer I'm looking for.

I have to find values in a list of some 40,000 strings. Putting all
these in an array is not likely to be the quickest way.... and it isn't.
A set seems to be faster already. Maybe using a map where I only use the
keys is also faster. But I'm wondering: is there no such thing as a tree
to store and retrieve items in a fast way?

If I remember well, for n elements in a list it should take log(n) steps
then, where it takes n/2 just going through an array if I only want to
pick up the element. In my situation it is much closer to n because most
of the times the value is NOT in the list.

Any ideas?

···

--
Posted via http://www.ruby-forum.com/.

If you are only concerned with lookup time, a hash has amortized O(1)
lookup time:

require 'benchmark'

words = ("aaaa".."zzzz").to_a # => 456,976 strings

hash = {}
words.each {|w| hash[w] = true}

TIMES = 10
Benchmark.bmbm do |x|
  x.report("Array#find existing") do
    TIMES.times do
      words.find {|w| w == "mmmm"}
    end
  end
  x.report("Array#find non-existing") do
    TIMES.times do
        words.find {|w| w == "mmmmm"}
  end
end
  x.report("Hash# existing") do
    TIMES.times do
      hash["mmmm"]
    end
  end
  x.report("Hash# non-existing") do
    TIMES.times do
        hash["mmmmm"]
  end
end
end

jesus@jesus-laptop:~/personal/programacion/ruby/Benchmarks/lib$ ruby
hash_array2.rb
Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

                              user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000032)

Jesus.

···

On Mon, Feb 2, 2009 at 12:00 PM, Patrick Put <patrick.put@gmail.com> wrote:

I've been searching for this information already, but cannot really find
the answer I'm looking for.

I have to find values in a list of some 40,000 strings. Putting all
these in an array is not likely to be the quickest way.... and it isn't.
A set seems to be faster already. Maybe using a map where I only use the
keys is also faster. But I'm wondering: is there no such thing as a tree
to store and retrieve items in a fast way?

If I remember well, for n elements in a list it should take log(n) steps
then, where it takes n/2 just going through an array if I only want to
pick up the element. In my situation it is much closer to n because most
of the times the value is NOT in the list.

Any ideas?

Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

                              user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. :wink: Thanks for showing me this!

I ran a program over some 1.8 billion lines of data this weekend. Good
thing it was more slowed down because of IO than my Ruby programming. It
would have been faster of course, but the damage is not that big...

···

--
Posted via http://www.ruby-forum.com/\.

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

Cheers

robert

···

2009/2/2 Patrick Put <patrick.put@gmail.com>:

Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

                              user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. :wink: Thanks for showing me this!

--
remember.guy do |as, often| as.you_can - without end

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

Is the implementation "under the hood" the same? That would mean my
choice to use a Set was not so bad after all.

···

--
Posted via http://www.ruby-forum.com/\.

A trie might also be a good idea. Some tries were proposed as solutions
to Ruby Quiz #103 (http://rubyquiz.com/quiz103.html\)

···

On Mon, 02 Feb 2009 08:21:48 -0500, Robert Klemme wrote:

2009/2/2 Patrick Put <patrick.put@gmail.com>:

Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

                              user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. :wink: Thanks for showing me this!

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

Cheers

robert

--
Chanoch (Ken) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

If you check here:

http://www.ruby-doc.org/core/classes/Set.html

you can see the source code for each method. For example this is the method new:

# File lib/set.rb, line 64
  def initialize(enum = nil, &block) # :yields: o
    @hash ||= Hash.new

    enum.nil? and return

    if block
      enum.each { |o| add(block[o]) }
    else
      merge(enum)
    end
  end

So you can see, it uses a Hash underneath. Let's take a look at the add method:

# File lib/set.rb, line 195
  def add(o)
    @hash[o] = true
    self
  end

So, it's using a hash with true as the values (the same I did in my
example). Of course, if the true doesn't mean anything to you (which
shouldn't as you are not using the values), using a Set is a bit more
clear, and has convenience methods for Set "arithmetic".

Jesus.

···

On Mon, Feb 2, 2009 at 3:37 PM, Patrick Put <patrick.put@gmail.com> wrote:

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

Is the implementation "under the hood" the same? That would mean my
choice to use a Set was not so bad after all.

Set uses a Hash as storage.
See set.rb:
--------------8<-------------
   # The equality of each couple of elements is determined according to
   # Object#eql? and Object#hash, since Set uses Hash as storage.
-------------->8-------------

regards, Sandor Szücs

···

On 02.02.2009, at 15:37, Patrick Put wrote:

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

Is the implementation "under the hood" the same?

--

Good idea. But this is only a realistic option if the trie is
implemented in C I guess - otherwise the standard String#hash and
Set#include? are faster.

Kind regards

robert

···

2009/2/3 Ken Bloom <kbloom@gmail.com>:

On Mon, 02 Feb 2009 08:21:48 -0500, Robert Klemme wrote:

2009/2/2 Patrick Put <patrick.put@gmail.com>:

Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

                              user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash# existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash# non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. :wink: Thanks for showing me this!

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

A trie might also be a good idea. Some tries were proposed as solutions
to Ruby Quiz #103 (http://rubyquiz.com/quiz103.html\)

--
remember.guy do |as, often| as.you_can - without end

Very clear. I should/could have checked this myself. I'm only using Ruby
for a few months now, and loving it more and more.

Thanks guys!

···

--
Posted via http://www.ruby-forum.com/.