Indexing a hash with longer strings is faster?

Hi, I have some weird results measuring Ruby's hash operations,
particularly indexing a hash with a two-member string array. Have a look:

require 'benchmark'
=> false

puts(Benchmark.measure do
  h = {}
  1000000.times do |i|
    i1 = rand(100)
    i2 = rand(100)
    a = [i1.to_s, i2.to_s]
    h[a] ||= 0
    h[a] += 1
  end
  puts h.size
end)
10000
 12.900000   0.050000  12.950000 ( 13.088721)
=> nil

puts(Benchmark.measure do
  h = {}
  1000000.times do |i|
    i1 = rand(100)
    i2 = rand(100)
    a = [i1.to_s + '00000', i2.to_s + '00000']
    h[a] ||= 0
    h[a] += 1
  end
  puts h.size
end)
10000
  7.700000   0.040000   7.740000 (  7.858381)
=> nil

In other words, I just made the array members slightly longer, and
working with the hash became considerably faster.

Why's that happening?

BTW,

jablan-mbp:~ $ ruby -v
ruby 1.8.6 (2008-08-11 patchlevel 287) [universal-darwin9.0]

Benchmarking is tricky business. I am not sure my benchmarks are worth
much, but they do not confirm your results. BTW I am using: ruby 1.9.1p243
(2009-07-16 revision 24175) [i686-linux]

First, you really should not benchmark inside irb. Second, you should
use bmbm for rehearsals, but even with that you could get bitten by
tons of things happening on your machine.
Nevertheless, for what it's worth - and because I adore benchmarking ;-)
- here is my code and results:

require 'benchmark'

N = 100_000
5.times do
  Benchmark.bmbm do |bm|
    h = Hash::new( 0 )
    bm.report( "2 chars" ) do
      N.times do
        h[ [rand(100).to_s, rand(100).to_s] ] += 1
      end
    end
    h = Hash::new( 0 )
    bm.report( "7 chars" ) do
      N.times do
        h[ [rand(100).to_s + "00000", rand(100).to_s + "00000"] ] += 1
      end
    end
  end
end
robert@siena:~/log/ruby/ML 19:50:13
519/25 > ruby bm1.rb
Rehearsal -------------------------------------------
2 chars   0.700000   0.010000   0.710000 (  0.953483)
7 chars   0.960000   0.010000   0.970000 (  1.367117)
---------------------------------- total: 1.680000sec

              user     system      total        real
2 chars   0.740000   0.020000   0.760000 (  1.031102)
7 chars   1.010000   0.020000   1.030000 (  1.323072)
Rehearsal -------------------------------------------
2 chars   0.720000   0.000000   0.720000 (  0.955180)
7 chars   0.980000   0.010000   0.990000 (  1.320693)
---------------------------------- total: 1.710000sec

              user     system      total        real
2 chars   0.710000   0.000000   0.710000 (  0.923019)
7 chars   0.900000   0.010000   0.910000 (  1.262338)
Rehearsal -------------------------------------------
2 chars   0.700000   0.000000   0.700000 (  0.926117)
7 chars   0.990000   0.020000   1.010000 (  1.356475)
---------------------------------- total: 1.710000sec

              user     system      total        real
2 chars   0.710000   0.000000   0.710000 (  0.940901)
7 chars   0.880000   0.020000   0.900000 (  1.269465)
Rehearsal -------------------------------------------
2 chars   0.700000   0.010000   0.710000 (  1.016537)
7 chars   0.990000   0.010000   1.000000 (  1.360516)
---------------------------------- total: 1.710000sec

              user     system      total        real
2 chars   0.720000   0.010000   0.730000 (  0.920533)
7 chars   0.920000   0.000000   0.920000 (  1.198681)
Rehearsal -------------------------------------------
2 chars   0.720000   0.010000   0.730000 (  0.946056)
7 chars   1.020000   0.010000   1.030000 (  1.343575)
---------------------------------- total: 1.760000sec

              user     system      total        real
2 chars   0.720000   0.010000   0.730000 (  1.023185)
7 chars   0.930000   0.000000   0.930000 (  1.211184)


--
Toutes les grandes personnes ont d’abord été des enfants, mais peu
d’entre elles s’en souviennent.

All adults have been children first, but not many remember.

[Antoine de Saint-Exupéry]

First, thanks a lot for your benchmarking tips; I'm new to this
business. And thanks for your BM-friendly rewrite of my code.

However, I still keep getting unexpected results:

jablan-mbp:dev $ ruby bm1.rb
Rehearsal -------------------------------------------
2 chars   0.840000   0.000000   0.840000 (  0.863001)
7 chars   0.590000   0.010000   0.600000 (  0.597103)
---------------------------------- total: 1.440000sec

              user     system      total        real
2 chars   0.900000   0.000000   0.900000 (  0.909053)
7 chars   0.620000   0.000000   0.620000 (  0.644989)
Rehearsal -------------------------------------------
2 chars   0.910000   0.000000   0.910000 (  0.924786)
7 chars   0.620000   0.010000   0.630000 (  0.631793)
---------------------------------- total: 1.540000sec

              user     system      total        real
2 chars   0.900000   0.000000   0.900000 (  0.909935)
7 chars   0.570000   0.000000   0.570000 (  0.578445)
Rehearsal -------------------------------------------
2 chars   0.880000   0.010000   0.890000 (  0.892464)
7 chars   0.650000   0.000000   0.650000 (  0.672226)
---------------------------------- total: 1.540000sec

              user     system      total        real
2 chars   0.910000   0.010000   0.920000 (  0.947815)
7 chars   0.580000   0.000000   0.580000 (  0.593960)
Rehearsal -------------------------------------------
2 chars   0.890000   0.010000   0.900000 (  0.896618)
7 chars   0.650000   0.000000   0.650000 (  0.660125)
---------------------------------- total: 1.550000sec

              user     system      total        real
2 chars   0.900000   0.000000   0.900000 (  0.919042)
7 chars   0.590000   0.010000   0.600000 (  0.595202)
Rehearsal -------------------------------------------
2 chars   0.900000   0.000000   0.900000 (  0.913919)
7 chars   0.660000   0.010000   0.670000 (  0.682484)
---------------------------------- total: 1.570000sec

              user     system      total        real
2 chars   0.920000   0.000000   0.920000 (  0.950286)
7 chars   0.590000   0.010000   0.600000 (  0.601025)

I will try to run the code on other platforms and/or Ruby versions and
will post the findings... The bad thing is that I'm not doing this
just for kicks: I need to optimise a nasty script... :(

Is this 1.8?
Given your results it is not completely unthinkable that the hash
values for "00".."99" have more collisions, and if collision handling
is done with linked lists, that might slow things down.
A priori I would not believe that, but why not check?
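
To convince yourself that collision chains alone can produce this kind
of slowdown, here is a toy sketch (the BadKey class is made up purely
for illustration): it forces every key into the same bucket via a
constant #hash, so each lookup degrades to a linear walk of the chain.

require 'benchmark'

class BadKey
  attr_reader :n
  def initialize(n)
    @n = n
  end
  def hash
    42                # every BadKey lands in the same bucket
  end
  def eql?(other)     # Hash uses eql? to resolve collisions
    n == other.n
  end
end

good = {}
bad  = {}
1000.times do |n|
  good[n] = n
  bad[BadKey.new(n)] = n
end

probe = BadKey.new(999)
Benchmark.bm(15) do |bm|
  bm.report("no collisions") { 100_000.times { good[999] } }
  bm.report("all collide")   { 100_000.times { bad[probe] } }
end

The "all collide" report should come out dramatically slower on any
version, since each lookup has to walk the whole chain.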

In 1.9 I get the expected
[*"00".."99"].map( &:hash ).uniq.size
=> 100

can you try this in 1.8
  ...map{ |x| x.hash }.uniq.size

?

HTH
Robert


Sorry, forgot two things:
you did not produce "00", "01", ..., thus

[*0..99].map( &:to_s ).map( &:hash ).uniq.size

and secondly,
a negative result does not say anything about collisions; only a
positive one (result < 100) would. In that case we would need to look
into the source (I do not think profiling would show internal hash
behavior).
Cheers
Robert


You pointed in the right direction here!

Although your example gives me 100 as the result (all the hashes are
unique), here's what suggests that hash collisions indeed are the
reason for the slowness:

[*0..99].map{|i| [*0..99].map{|j| [i.to_s,j.to_s].hash}}.flatten.uniq.size

=> 1756

[*0..99].map{|i| [*0..99].map{|j| [i.to_s + '00000',j.to_s + '00000'].hash}}.flatten.uniq.size

=> 10000
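
To see how clustered the 1.8 distribution really is, one can also tally
how many of the 10000 array keys share each hash value (a quick sketch
along the same lines; the exact maximum will depend on the build):

hashes = [*0..99].map { |i| [*0..99].map { |j| [i.to_s, j.to_s].hash } }.flatten

tally = Hash.new(0)
hashes.each { |h| tally[h] += 1 }

p tally.size        # distinct hash values (1756 above)
p tally.values.max  # keys sharing the most popular hash value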

Can you please just post the ruby 1.9 results here?

Thanks!


OMG, my code was sloppy, but you got it right. Here is the 1.9 code and output:
506/18 > cat hashes.rb && ruby -v hashes.rb
#!/usr/local/bin/ruby -w
# encoding: utf-8
# file: /home/robert/log/ruby/ML/hashes.rb
p [*0..99].map{|i| [*0..99].map{|j| [i.to_s,j.to_s].hash}}.flatten.uniq.size
p [*0..99].map{|i| [*0..99].map{|j| [i.to_s + '00000',j.to_s + '00000'].hash}}.flatten.uniq.size

# vim: sts=2 sw=2 ft=ruby expandtab nu :
ruby 1.9.1p243 (2009-07-16 revision 24175) [i686-linux]
10000
10000

as expected :).
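
By the way, if you need the speed back on 1.8 in the meantime, one
possible workaround - only a sketch, not measured in this thread - is
to key the hash on a single joined string rather than a two-element
array; your check showed that the element strings hash without
collisions, and String#hash, which mixes every character, should
distribute the joined keys much better.

require 'benchmark'

N = 100_000
Benchmark.bmbm do |bm|
  bm.report("array key") do
    h = Hash.new(0)
    N.times { h[[rand(100).to_s, rand(100).to_s]] += 1 }
  end
  bm.report("string key") do
    h = Hash.new(0)
    # one key like "12,34" instead of ["12", "34"]
    N.times { h["#{rand(100)},#{rand(100)}"] += 1 }
  end
end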
Cheers
Robert
