Indexing a hash with longer strings is faster?

Hi, I have some weird results measuring Ruby's hash operations,
particularly indexing a hash with a two-member string array. Have a look:

require 'benchmark'
=> false

puts(Benchmark.measure do
  h = {}
  1000000.times do |i|
    i1 = rand(100)
    i2 = rand(100)
    a = [i1.to_s, i2.to_s]
    h[a] ||= 0
    h[a] += 1
  end
  puts h.size
end)
10000
 12.900000   0.050000  12.950000 ( 13.088721)
=> nil

puts(Benchmark.measure do
  h = {}
  1000000.times do |i|
    i1 = rand(100)
    i2 = rand(100)
    a = [i1.to_s + '00000', i2.to_s + '00000']
    h[a] ||= 0
    h[a] += 1
  end
  puts h.size
end)
10000
  7.700000   0.040000   7.740000 (  7.858381)
=> nil

In other words, I just made the array members slightly longer, and
working with the hash became considerably faster.

Why's that happening?

BTW,

jablan-mbp:~ $ ruby -v
ruby 1.8.6 (2008-08-11 patchlevel 287) [universal-darwin9.0]

Benchmarking is tricky business. I am not sure my benchmarks are worth
much, but they do not confirm your results. BTW I am using: ruby 1.9.1p243
(2009-07-16 revision 24175) [i686-linux]

First, you really should not benchmark inside irb. Second, you should
use bmbm for rehearsals, but even with that you could get bitten by
tons of things happening on your machine.
Nevertheless, for what it's worth - and because I adore benchmarking ;-)
- here is my code and results:

require 'benchmark'

N = 100_000
5.times do
  Benchmark.bmbm do |bm|
    h = Hash::new( 0 )
    bm.report( "2 chars" ) do
      N.times do
        h[ [rand(100).to_s, rand(100).to_s] ] += 1
      end
    end
    h = Hash::new( 0 )
    bm.report( "7 chars" ) do
      N.times do
        h[ [rand(100).to_s + "00000", rand(100).to_s + "00000"] ] += 1
      end
    end
  end
end
robert@siena:~/log/ruby/ML 19:50:13
519/25 > ruby bm1.rb
Rehearsal -------------------------------------------
2 chars   0.700000   0.010000   0.710000 (  0.953483)
7 chars   0.960000   0.010000   0.970000 (  1.367117)
---------------------------------- total: 1.680000sec

              user     system      total        real
2 chars   0.740000   0.020000   0.760000 (  1.031102)
7 chars   1.010000   0.020000   1.030000 (  1.323072)
Rehearsal -------------------------------------------
2 chars   0.720000   0.000000   0.720000 (  0.955180)
7 chars   0.980000   0.010000   0.990000 (  1.320693)
---------------------------------- total: 1.710000sec

              user     system      total        real
2 chars   0.710000   0.000000   0.710000 (  0.923019)
7 chars   0.900000   0.010000   0.910000 (  1.262338)
Rehearsal -------------------------------------------
2 chars   0.700000   0.000000   0.700000 (  0.926117)
7 chars   0.990000   0.020000   1.010000 (  1.356475)
---------------------------------- total: 1.710000sec

              user     system      total        real
2 chars   0.710000   0.000000   0.710000 (  0.940901)
7 chars   0.880000   0.020000   0.900000 (  1.269465)
Rehearsal -------------------------------------------
2 chars   0.700000   0.010000   0.710000 (  1.016537)
7 chars   0.990000   0.010000   1.000000 (  1.360516)
---------------------------------- total: 1.710000sec

              user     system      total        real
2 chars   0.720000   0.010000   0.730000 (  0.920533)
7 chars   0.920000   0.000000   0.920000 (  1.198681)
Rehearsal -------------------------------------------
2 chars   0.720000   0.010000   0.730000 (  0.946056)
7 chars   1.020000   0.010000   1.030000 (  1.343575)
---------------------------------- total: 1.760000sec

              user     system      total        real
2 chars   0.720000   0.010000   0.730000 (  1.023185)
7 chars   0.930000   0.000000   0.930000 (  1.211184)


--
Toutes les grandes personnes ont d’abord été des enfants, mais peu
d’entre elles s’en souviennent.

All adults have been children first, but not many remember.

[Antoine de Saint-Exupéry]

First, thanks a lot for your benchmarking tips; I'm new to this
business. And thanks for your BM-friendly rewrite of my code.

However, I still keep getting unexpected results:

jablan-mbp:dev $ ruby bm1.rb
Rehearsal -------------------------------------------
2 chars   0.840000   0.000000   0.840000 (  0.863001)
7 chars   0.590000   0.010000   0.600000 (  0.597103)
---------------------------------- total: 1.440000sec

              user     system      total        real
2 chars   0.900000   0.000000   0.900000 (  0.909053)
7 chars   0.620000   0.000000   0.620000 (  0.644989)
Rehearsal -------------------------------------------
2 chars   0.910000   0.000000   0.910000 (  0.924786)
7 chars   0.620000   0.010000   0.630000 (  0.631793)
---------------------------------- total: 1.540000sec

              user     system      total        real
2 chars   0.900000   0.000000   0.900000 (  0.909935)
7 chars   0.570000   0.000000   0.570000 (  0.578445)
Rehearsal -------------------------------------------
2 chars   0.880000   0.010000   0.890000 (  0.892464)
7 chars   0.650000   0.000000   0.650000 (  0.672226)
---------------------------------- total: 1.540000sec

              user     system      total        real
2 chars   0.910000   0.010000   0.920000 (  0.947815)
7 chars   0.580000   0.000000   0.580000 (  0.593960)
Rehearsal -------------------------------------------
2 chars   0.890000   0.010000   0.900000 (  0.896618)
7 chars   0.650000   0.000000   0.650000 (  0.660125)
---------------------------------- total: 1.550000sec

              user     system      total        real
2 chars   0.900000   0.000000   0.900000 (  0.919042)
7 chars   0.590000   0.010000   0.600000 (  0.595202)
Rehearsal -------------------------------------------
2 chars   0.900000   0.000000   0.900000 (  0.913919)
7 chars   0.660000   0.010000   0.670000 (  0.682484)
---------------------------------- total: 1.570000sec

              user     system      total        real
2 chars   0.920000   0.000000   0.920000 (  0.950286)
7 chars   0.590000   0.010000   0.600000 (  0.601025)

I will try to run the code on other platforms and/or Ruby versions and
will post the findings... The bad thing is that I'm not doing this
just for kicks: I need to optimise a nasty script... :(

Is this 1.8?
Given your results it is not completely unthinkable that the hash
values for "00".."99" have more collisions, and if collision handling
is done with linked lists, that might slow things down.
A priori I would not believe that, but why not check?
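
To convince yourself that collision chains alone can produce this kind
of slowdown, here is a toy sketch (the BadKey class is made up purely
for illustration): it forces every key into the same bucket via a
constant #hash, so each lookup degrades to a linear walk of the chain.

require 'benchmark'

class BadKey
  attr_reader :n
  def initialize(n)
    @n = n
  end
  def hash
    42                # every BadKey lands in the same bucket
  end
  def eql?(other)     # Hash uses eql? to resolve collisions
    n == other.n
  end
end

good = {}
bad  = {}
1000.times do |n|
  good[n] = n
  bad[BadKey.new(n)] = n
end

probe = BadKey.new(999)
Benchmark.bm(15) do |bm|
  bm.report("no collisions") { 100_000.times { good[999] } }
  bm.report("all collide")   { 100_000.times { bad[probe] } }
end

The "all collide" report should come out dramatically slower on any
version, since each lookup has to walk the whole chain.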

In 1.9 I get the expected
[*"00".."99"].map( &:hash ).uniq.size
=> 100

can you try this in 1.8
  ...map{ |x| x.hash }.uniq.size

?

HTH
Robert


Sorry, forgot two things:
you did not produce "00", "01", ..., thus

[*0..99].map( &:to_s ).map( &:hash ).uniq.size

and secondly,
a negative result does not say anything about collisions; only a
positive one (result < 100) would. In that case we would need to look
into the source (I do not think profiling would show internal hash
behavior).
Cheers
Robert


You pointed in the right direction here!

Although your example gives me 100 as the result (all the hashes are
unique), here's what suggests that hash collisions indeed are the
reason for the slowness:

[*0..99].map{|i| [*0..99].map{|j| [i.to_s,j.to_s].hash}}.flatten.uniq.size

=> 1756

[*0..99].map{|i| [*0..99].map{|j| [i.to_s + '00000',j.to_s + '00000'].hash}}.flatten.uniq.size

=> 10000
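
To see how clustered the 1.8 distribution really is, one can also tally
how many of the 10000 array keys share each hash value (a quick sketch
along the same lines; the exact maximum will depend on the build):

hashes = [*0..99].map { |i| [*0..99].map { |j| [i.to_s, j.to_s].hash } }.flatten

tally = Hash.new(0)
hashes.each { |h| tally[h] += 1 }

p tally.size        # distinct hash values (1756 above)
p tally.values.max  # keys sharing the most popular hash value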

Can you please just post the ruby 1.9 results here?

Thanks!


OMG, my code was sloppy, but you got it right. Here is the 1.9 code and output:
506/18 > cat hashes.rb && ruby -v hashes.rb
#!/usr/local/bin/ruby -w
# encoding: utf-8
# file: /home/robert/log/ruby/ML/hashes.rb
p [*0..99].map{|i| [*0..99].map{|j| [i.to_s,j.to_s].hash}}.flatten.uniq.size
p [*0..99].map{|i| [*0..99].map{|j| [i.to_s + '00000',j.to_s + '00000'].hash}}.flatten.uniq.size

# vim: sts=2 sw=2 ft=ruby expandtab nu :
ruby 1.9.1p243 (2009-07-16 revision 24175) [i686-linux]
10000
10000

as expected :).
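
By the way, if you need the speed back on 1.8 in the meantime, one
possible workaround - only a sketch, not measured in this thread - is
to key the hash on a single joined string rather than a two-element
array; your check showed that the element strings hash without
collisions, and String#hash, which mixes every character, should
distribute the joined keys much better.

require 'benchmark'

N = 100_000
Benchmark.bmbm do |bm|
  bm.report("array key") do
    h = Hash.new(0)
    N.times { h[[rand(100).to_s, rand(100).to_s]] += 1 }
  end
  bm.report("string key") do
    h = Hash.new(0)
    # one key like "12,34" instead of ["12", "34"]
    N.times { h["#{rand(100)},#{rand(100)}"] += 1 }
  end
end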
Cheers
Robert
