Hash optimization question

Are there any libraries that overcome this problem:

(slow example)

Benchmark.measure { a = {:abc => :def}; 100_000.times { a.each{}}}

=> #<... @real=0.113940954208374, @utime=0.0999999999999999,
@cstime=0.0>

(fast example)

Benchmark.measure { a = {:abc => :def}.to_a; 100_000.times { a.each{}}}

=> #<...@real=0.0591599941253662, @utime=0.0600000000000001,
@cstime=0.0>

I.e. is there a library that optimizes hashes so that their .each_xxx
methods run as fast as arrays'?
My gut instinct is that this is a problem that could be overcome with
some hacking in the core, but I thought I'd ask.

-=R

···


No idea, you would have to look at the sources.

A few thoughts nevertheless: if you have to do the array conversion
every time prior to iterating, the numbers suffer dramatically:

irb(main):003:0> Benchmark.measure { a = {:abc => :def}.to_a;
1_000_000.times { a.each{}}}
=> #<Benchmark::Tms:0x7fed1744 @label="", @real=0.641000032424927,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.641, @total=0.641>

irb(main):004:0> Benchmark.measure { a = {:abc => :def};
1_000_000.times { a.each{}}}
=> #<Benchmark::Tms:0x10013be0 @label="", @real=2.58999991416931,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=2.578, @total=2.578>

irb(main):005:0> Benchmark.measure { a = {:abc => :def};
1_000_000.times { a.to_a.each{}}}
=> #<Benchmark::Tms:0x7fee2184 @label="", @real=4.41000008583069,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=4.422, @total=4.422>

In other words: if the speedup relies on building an Array-like
structure at iteration time, iterating becomes dramatically slower -
unless you are willing to spend the added space of a permanent Array
representation and add the logic to keep it in sync with changes to
the Hash.

I haven't looked into Ruby's Hash iteration in detail, but it might be
slower because of a general property of hash tables: they are usually
bigger (i.e. have more buckets) than the number of elements in the
hash, to allow for efficient hashing. Now if there are no links
between hash entries (whose maintenance I believe would increase
insertion overhead), iteration needs to visit every hash bucket, even
those that hold no data. That might account for the slower runtimes.
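
To make that concrete, here is a toy model of the bucket-scanning cost
(pure illustration with made-up numbers, ignoring collisions - not
MRI's actual implementation):

NUM_BUCKETS = 11
buckets = Array.new(NUM_BUCKETS)        # mostly nil

# insert two entries into their buckets
[[:abc, 1], [:def, 2]].each do |k, v|
  buckets[k.hash % NUM_BUCKETS] = [k, v]
end

# iteration has to visit every bucket, empty or not
buckets.each do |entry|
  next if entry.nil?
  key, value = entry
  # yield key, value to the block here
end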

I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Kind regards

robert

···


Long time not having been on the same thread, Robert :)
Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.
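
For reference, #assoc matches on the first element of each pair and
#rassoc on the second:

pairs = [[:abc, 1], [:def, 2]]
pairs.assoc(:def)    # => [:def, 2]
pairs.rassoc(1)      # => [:abc, 1]
pairs.assoc(:miss)   # => nil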
Cheers
Robert

···


I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Yeah I guess it's hard to have the best of both worlds. Hmm. I suppose
that for small hashes which call .each a fix would definitely help; for
large hashes that call .each "a lot" it would probably also be quicker;
and for any hashes which don't call .each it wouldn't be helpful. I
wonder if the slowdown is small enough to make it worth it anyway
[though it would use more RAM too--then again, Ruby isn't known for its
low memory usage] :)

In the example I was thinking of, I was using a hash to parse lines in
a file, a la

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}

string.each_line do |l|
  identifiers.each do |reg, description|
    if l =~ reg
      # do something
    end
  end
end

So... in my particular example I'm only using a hash because of the
clean syntactic look :) [not for the functionality at all].

Rails uses it quite a bit, too, hence my enquiring about it.

I think that for now I'll just write a Hash monkey patcher a la

a = {}
a.fast_each_ify
a[:abc] = :def  # elements from here on are now wrapped internally
                # to allow for a quicker .each
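
A minimal sketch of how such a patch might work (FastEach, fast_each
and the extend-based wiring are invented names for illustration; note
that insertion becomes O(n) here, which is exactly the trade-off
discussed above):

module FastEach
  def self.extended(hash)
    # keep a parallel array of pairs in sync with the hash
    hash.instance_variable_set(:@pairs, hash.to_a)
  end

  def []=(key, value)
    @pairs.delete_if { |k, _| k == key }   # this makes insertion O(n)
    @pairs << [key, value]
    super
  end

  def delete(key)
    @pairs.delete_if { |k, _| k == key }
    super
  end

  def fast_each(&block)
    @pairs.each(&block)                    # iterates like an Array
  end
end

a = {}
a.extend(FastEach)
a[:abc] = :def
a.fast_each { |k, v| p [k, v] }   # prints [:abc, :def]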

Thanks!
-=R

···


I haven't looked into Ruby's Hash iteration in detail, but it might be
slower because of a general property of hash tables: they are usually
bigger (i.e. have more buckets) than the number of elements in the
hash, to allow for efficient hashing. Now if there are no links
between hash entries (whose maintenance I believe would increase
insertion overhead), iteration needs to visit every hash bucket, even
those that hold no data. That might account for the slower runtimes.

Aside: ruby-1.9 keeps links between hash entries, exactly for the
purpose of faster iteration. It also has the potentially useful
side-effect that hashes iterate in the same order that the entries were
inserted.

(However, depending on this feature is a good way of writing code which
won't run properly in 1.8)
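
For example:

# 1.9:
h = {}
h[:b] = 1
h[:a] = 2
h[:c] = 3
h.keys   # => [:b, :a, :c] - insertion order
         # (on 1.8 the order depends on the hash function instead)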

···


I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Long time not having been on the same thread, Robert :)

:)

Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.

Binary search might be an additional option for small to medium-sized
collections if retrieval speed is not paramount but iteration speed is
important. (Btw, do we have binary search in the std lib by now?
This should probably go into Array...)
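
As far as I know there is no binary search in the 1.8 std lib, so a
hand-rolled version over an array of pairs sorted by key might look
like this (bsearch_pair is just a made-up helper name):

def bsearch_pair(pairs, key)
  lo, hi = 0, pairs.size - 1
  while lo <= hi
    mid = (lo + hi) / 2
    cmp = key <=> pairs[mid][0]
    if cmp == 0
      return pairs[mid]
    elsif cmp < 0
      hi = mid - 1
    else
      lo = mid + 1
    end
  end
  nil
end

sorted = [["abc", 1], ["def", 2], ["ghi", 3]]   # sorted by key
bsearch_pair(sorted, "def")   # => ["def", 2]
bsearch_pair(sorted, "xyz")   # => nil

Lookup is O(log n), and iteration is a plain Array#each over the pairs.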

Kind regards

robert

···


I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Yeah I guess it's hard to have the best of both worlds. Hmm. I suppose
that for small hashes which call .each a fix would definitely help; for
large hashes that call .each "a lot" it would probably also be quicker;
and for any hashes which don't call .each it wouldn't be helpful. I
wonder if the slowdown is small enough to make it worth it anyway
[though it would use more RAM too--then again, Ruby isn't known for its
low memory usage] :)

The critical issue is not (only) the size of the Hash but rather the
frequency of change. Personally, since a Hash is primarily for
efficient lookups, I would not want to invest in any iteration speedup
if there was a chance of slowing down lookup and insertion. If you
are iterating through a Hash all the time then you are probably not
using the proper data structure.

In the example I was thinking of, I was using a hash to parse lines in
a file, a la

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}

string.each_line do |l|
  identifiers.each do |reg, description|
    if l =~ reg
      # do something
    end
  end
end

So... in my particular example I'm only using a hash because of the
clean syntactic look :) [not for the functionality at all].

Ah!

I think that for now I'll just write a Hash monkey patcher a la

Why do that? Your identifiers is constant anyway - as far as I can
see. Why not just do

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}.to_a
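
i.e. build the pair array once, outside the loop, and iterate over
that (puts here is just a stand-in for the real per-line action):

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}.to_a

string.each_line do |l|
  identifiers.each do |reg, description|
    puts description if l =~ reg
  end
end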

Cheers

robert

···


Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.

So our goal is to see whether Array#assoc is "about as fast as
hash-based lookup":

a = {1 => 2, 2 => 3 }
b = [[1, 2], [2, 3]]

Benchmark.measure { 1000000.times { a[2] }}

=> #<Benchmark::Tms:0x5d93dc @total=0.26, @cutime=0.0, @label="",
@stime=0.0, @real=0.270273208618164, @utime=0.26, @cstime=0.0>

Benchmark.measure { 1000000.times { b.assoc 1 }}

=> #<Benchmark::Tms:0x5b2e1c @total=0.26, @cutime=0.0, @label="",
@stime=0.0, @real=0.264580965042114, @utime=0.26, @cstime=0.0>

Marginally faster than the hash :)

Benchmark.measure { 1000000.times { b.assoc 2 }}

=> #<Benchmark::Tms:0x5e4408 @total=0.35, @cutime=0.0, @label="",
@stime=0.01, @real=0.35266900062561, @utime=0.34, @cstime=0.0>

Quite a bit slower.

So with only 2 entries #assoc is already slower once it has to scan
past the first pair.
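
If anyone wants to probe where the crossover sits, a sketch along
these lines might do (pair counts picked arbitrarily; results will
vary by machine and Ruby version):

require 'benchmark'

[2, 5, 10, 20, 50].each do |n|
  pairs = (1..n).map { |i| [i, i + 1] }
  hash  = Hash[*pairs.flatten]
  key   = n                      # worst case for #assoc: the last pair
  Benchmark.bm(12) do |bm|
    bm.report("hash  n=#{n}")  { 100_000.times { hash[key] } }
    bm.report("assoc n=#{n}")  { 100_000.times { pairs.assoc(key) } }
  end
end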

Oh well, we do what we can.
-=R

···


Aside: ruby-1.9 keeps links between hash entries, exactly for the
purpose of faster iteration. It also has the potentially useful
side-effect that hashes iterate in the same order that the entries were
inserted.

Interesting.
It appears that whatever the internal linking mechanism is, it still
differs from Array#each speed-wise:

1.8.6 Hash:

Benchmark.measure { a = {:a => :b, :c => :d}; 100_000.times { a.each {} }}

=> #<Benchmark::Tms:0x26c2f8 @total=0.17, @cutime=0.0, @label="",
@stime=0.0, @real=0.173249959945679, @utime=0.17, @cstime=0.0>

1.8.6 Array:

Benchmark.measure { a = [[:a, :b], [:c, :d]]; 100_000.times { a.each {} }}

=> #<Benchmark::Tms:0x5de7c4 @total=0.07, @cutime=0.0, @label="",
@stime=0.0, @real=0.0752890110015869, @utime=0.07, @cstime=0.0>

so 0.17 s vs 0.07 s

and on 1.9 Hash:

Benchmark.measure { a = {:a => :b, :c => :d}; 100_000.times { a.each {} }}

=> #<Benchmark::Tms:0x130efc @label="", @real=0.142009973526001,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.14, @total=0.14>

1.9 Array:

Benchmark.measure { a = [[:a, :b], [:c, :d]]; 100_000.times { a.each {} }}

=> #<Benchmark::Tms:0x13379c @label="", @real=0.0471560955047607,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.05, @total=0.05>

so 0.14 s vs 0.05 s

The Hash-to-Array ratio looks much the same in both versions.

-=R

···


Why do that? Your identifiers is constant anyway - as far as I can
see. Why not just do

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}.to_a

Yeah--works like a charm.

This makes me wish that there were a data structure that has
object-based O(1) lookup/insertion/deletion and O(n) .each calls.
Apparently Hash isn't quite it :)

My theory is that some people use hashes just for the convenience of
object-based lookup [ex: scope[:include] in Rails] and not for the real
strength of hashes, which, as you mentioned, is "efficient
lookup/insertion/deletion".

Thanks!

-=R

···


Why do that? Your identifiers is constant anyway - as far as I can
see. Why not just do

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}.to_a

Yeah--works like a charm.

Great!

This makes me wish that there were a data structure that has
object-based O(1) lookup/insertion/deletion and O(n) .each calls.
Apparently Hash isn't quite it :)

Not so fast! Keep in mind that these figures are *asymptotic*: there
are always constants involved, and it might well be that for the N's
we deal with, the results look very different. That's why we do all
the benchmarking when we want to find out which approach is best. :)

My theory is that some people use hashes just for the convenience of
object-based lookup [ex: scope[:include] in Rails] and not for the real
strength of hashes, which, as you mentioned, is "efficient
lookup/insertion/deletion".

I am not sure what you mean by "object-based lookup", especially how
it is not "efficient lookup/insertion/deletion" or differs from that.
I mean, people do use these lookups *because* they are efficient.

Thanks!

You're welcome.

Cheers

robert

···
