Inserting hash value slows down as table gets larger

People,

In response to people's suggestions about speeding up my script by replacing output to many small files with output to one large file I have implemented a hash table I can write out with YAML. However, I find as the hash table gets larger, the script slows down . . but when I try and work out what is happening by producing a small test script that does more or less the same thing, I can't reproduce the problem . .

The test script is:

#!/usr/bin/ruby

h1 = Hash.new( 0 )
srand = 0

# seeds = [ '01', '01', '01', '22' ]
# seeds = [ '01', '01', '20', '22' ]
# seeds = [ '01', '32', '20', '22' ]
seeds = [ '50', '32', '20', '22' ]

for a in '01' .. seeds[0]
   start = Time.now
# puts a
   for b in '01' .. seeds[1]
# puts b
     for c in '01' .. seeds[2]
# puts c
       for d in '01' .. seeds[3]
# print "#{d} "
         h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }
       end
# puts
     end
# puts
   end
# puts
   stop = Time.now
   puts stop - start
end

The script is faster with the hash insertion commented out of course and the time between iterations of the outer loop are constant in both scripts - but they are longer and STILL constant with the insertion not commented out in the test script. In my actual script, when the insertion is not commented out the time between iterations in the outer loop gets longer and longer eg 36 sec -> a few minutes before I kill it about half way through . .

Can anyone suggest a way of working out why the production hash insertion behaves differently and somewhat unexpectedly?

Thanks,

Phil.

···

--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

I suspect is it the memory related to object creation for Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

With this line as originally
       h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }
I get 20.8 seconds

set to
      h1[ "#{a}.#{b}.#{c}.#{d}" ] = 3
I get 2.54 seconds

set to
       Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

I get 3.75 seconds

set to
       # h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

I get 0.47 seconds

Dave.

···

On 17 Mar 2011, at 12:56, Philip Rhoades wrote:

People,

In response to people's suggestions about speeding up my script by replacing output to many small files with output to one large file I have implemented a hash table I can write out with YAML. However, I find as the hash table gets larger, the script slows down . . but when I try and work out what is happening by producing a small test script that does more or less the same thing, I can't reproduce the problem . .

The test script is:

#!/usr/bin/ruby

h1 = Hash.new( 0 )
srand = 0

# seeds = [ '01', '01', '01', '22' ]
# seeds = [ '01', '01', '20', '22' ]
# seeds = [ '01', '32', '20', '22' ]
seeds = [ '50', '32', '20', '22' ]

for a in '01' .. seeds[0]
start = Time.now
# puts a
for b in '01' .. seeds[1]
# puts b
   for c in '01' .. seeds[2]
# puts c
     for d in '01' .. seeds[3]
# print "#{d} "
       h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }
     end
# puts
   end
# puts
end
# puts
stop = Time.now
puts stop - start
end

The script is faster with the hash insertion commented out of course and the time between iterations of the outer loop are constant in both scripts - but they are longer and STILL constant with the insertion not commented out in the test script. In my actual script, when the insertion is not commented out the time between iterations in the outer loop gets longer and longer eg 36 sec -> a few minutes before I kill it about half way through . .

Can anyone suggest a way of working out why the production hash insertion behaves differently and somewhat unexpectedly?

Thanks,

Phil.
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

"#{a}.#{b}.#{c}.#{d}".to_i.freeze

I don't think that does what you expect. Try it in irb.

···

--
Posted via http://www.ruby-forum.com/.

Dave Baldwin wrote in post #987924:

set to
       Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

I get 3.75 seconds

set to
       # h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(1) ) } }

I get 0.47 seconds

Try these too:

    h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = 3

    h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new etc

It's a foible of Ruby that if you use a String as a hash key, it is
dup'd and frozen - unless it was frozen already, which is what the above
does.

The other thing to beware of is YAML - try using Marshal instead as a
comparison, because YAML might not be very efficient when dealing with
very large data structures. I think you said you needed a human-editable
form, but this would rule out YAML as the source of the problem.

If it is, then you can look at alternative YAML libraries, or at JSON.

···

--
Posted via http://www.ruby-forum.com/.

Brian,

"#{a}.#{b}.#{c}.#{d}".to_i.freeze

I don't think that does what you expect. Try it in irb.

Oops! . . forgot to take the dots out . .

In any case, in the actual live script (as opposed to the test script) I did the same comparison of:

$recs[ "#{$seed_pre}.#{$seed_post}.#{$seed}.#{sgen}".freeze ] = $iarr
$recs[ "#{$seed_pre}#{$seed_post}#{$seed}#{sgen}".to_i.freeze ] = $iarr

and unfortunately there was no real difference because of the amount of other processing that happens inside the loops . .

Thanks,

Phil.

···

On 2011-03-21 00:21, Brian Candler wrote:
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

Brian,

Dave Baldwin wrote in post #987924:

set to
        Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

I get 3.75 seconds

set to
        # h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(1) ) } }

I get 0.47 seconds

Try these too:

     h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = 3

     h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new etc

It's a foible of Ruby that if you use a String as a hash key, it is
dup'd and frozen - unless it was frozen already, which is what the above
does.

Adding "freeze" improves the situation by about a second on my machine:

  18.3 -> 17.3

The other thing to beware of is YAML - try using Marshal instead as a
comparison, because YAML might not be very efficient when dealing with
very large data structures. I think you said you needed a human-editable
form, but this would rule out YAML as the source of the problem.

Yes that is very slow too but that is another issue - I can live with that for the convenience of having human readable data . .

If it is, then you can look at alternative YAML libraries, or at JSON.

The other thought I had was putting the data into a sqlite3 db - I will try it and see what happens but I don't imagine it would be faster than a memory based hash table?

Thanks,

Phil.

···

On 2011-03-18 04:04, Brian Candler wrote:
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

The other thought I had was putting the data into a sqlite3 db - I will
try it and see what happens but I don't imagine it would be faster than
a memory based hash table?

Probably not, but at least it persists, and it may be cheaper to make
updates to a large external data structure than rebuild the entire
structure in memory each time.

In your application, do you really need to use Strings?

Your sample code looks like it's handling numeric-style data (although I
realise this is just a test case for the problems you're having).
Integers in the range -2^30..+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.

Or, if you're handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn't allocate any memory; it just points to the entry in the symbol
table.

Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id

Although that's more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.

···

--
Posted via http://www.ruby-forum.com/.

Brian,

The other thought I had was putting the data into a sqlite3 db - I will
try it and see what happens but I don't imagine it would be faster than
a memory based hash table?

Probably not, but at least it persists, and it may be cheaper to make
updates to a large external data structure than rebuild the entire
structure in memory each time.

Yes, using a db was much slower

In your application, do you really need to use Strings?

I guess not but I preferred a tag of:

  01.01.01.01

to:

  1010101

Your sample code looks like it's handling numeric-style data (although I
realise this is just a test case for the problems you're having).
Integers in the range -2^30..+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.

Are you talking about the hash key or the hash values? - the values in the real script will all be floats . .

Or, if you're handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn't allocate any memory; it just points to the entry in the symbol
table.

Not sure what you mean - example?

Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id

I ran that code but I don't understand how it helps . .

Although that's more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.

Still not clear - examples?

The other thing that occurred to me was that on my 64-bit machine maybe I could run 2-3 threads for inserting into the hash table?

BTW, I did end up changing to JSON - a YAML dump on a 32000x22 hash table was deadly . .

Thanks,

Phil.

···

On 2011-03-19 00:42, Brian Candler wrote:
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

Philip Rhoades wrote in post #988264:

Your sample code looks like it's handling numeric-style data (although I
realise this is just a test case for the problems you're having).
Integers in the range -2^30..+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.

Are you talking about the hash key or the hash values?

Either.

- the values in
the real script will all be floats . .

Then they will be allocated on the heap, just like strings. I presume
you're aware of the inherent inaccuracy of floats (in any language), and
are OK with this.

1.0/2.0 == 1.0 + 1.0/2.0 - 1.0

=> true

1.0/10.0 == 1.0 + 1.0/10.0 - 1.0

=> false

Or, if you're handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn't allocate any memory; it just points to the entry in the symbol
table.

Not sure what you mean - example?

a = []
a[0] = :foo
a[1] = :foo
a[2] = :foo

puts a[0].object_id
puts a[1].object_id
puts a[2].object_id

Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id

I ran that code but I don't understand how it helps . .

It uses less memory if you have (say) millions of identical strings. It
may help garbage collection performance, but not much else.

Although that's more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.

Still not clear - examples?

Suppose the strings "foo" and "bar" comprise 80% of your hash keys or
values. Then mapping them to the same frozen string means that you only
have one instance of string "foo" and one instance of string "bar" in
the system, instead of (say) millions of distinct strings. You can still
use individual strings for the other 20%.

This is really an edge optimisation though, you really shouldn't need to
be worrying about these things - if they are significant, then perhaps
ruby is the wrong language for the problem in hand.

The other thing that occurred to me was that on my 64-bit machine maybe
I could run 2-3 threads for inserting into the hash table?

Noooo..... even in ruby 1.9, there is a global interpreter lock.
Multiple threads gain you nothing really, except for threads which are
blocked on I/O.

Even if there were not, having multiple threads contending on the same
hash (and controlling access via, say, a mutex) would be pretty much
guaranteed to make performance worse not better.

Regards,

Brian.

···

--
Posted via http://www.ruby-forum.com/.

Brian,

Philip Rhoades wrote in post #988264:

Your sample code looks like it's handling numeric-style data (although I
realise this is just a test case for the problems you're having).
Integers in the range -2^30..+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.

Are you talking about the hash key or the hash values?

Either.

Right - for the following in my test script:

h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i.freeze ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(100) ) } }

I get the following times:

18.350s
18.113s
  4.724s
  4.896s

So I guess I should live with the slight decrease of readability when searching for particular results in the JSON output file by using ints instead of strings for the hash keys.

- the values in
the real script will all be floats . .

Then they will be allocated on the heap, just like strings. I presume
you're aware of the inherent inaccuracy of floats (in any language), and
are OK with this.

1.0/2.0 == 1.0 + 1.0/2.0 - 1.0

=> true

1.0/10.0 == 1.0 + 1.0/10.0 - 1.0

=> false

I suppose I could convert them all to six or eight digit ints . . they are measures of biological diversity and changing them backwards and forwards is a bit of a hassle but maybe it is worth doing for the speed advantage? - I will try my test script with floats and see what happens.

Or, if you're handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn't allocate any memory; it just points to the entry in the symbol
table.

Not sure what you mean - example?

a = []
a[0] = :foo
a[1] = :foo
a[2] = :foo

puts a[0].object_id
puts a[1].object_id
puts a[2].object_id

The reason the numbers are called "seeds" is that they correspond to the seed for the random dumber generator in the C/C++ simulation program - so they are all unique for each of the 32,000 simulations.

Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id

I ran that code but I don't understand how it helps . .

It uses less memory if you have (say) millions of identical strings.

Not the case for the keys and unlikely for the values.

It
may help garbage collection performance, but not much else

Although that's more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.

Still not clear - examples?

Suppose the strings "foo" and "bar" comprise 80% of your hash keys or
values. Then mapping them to the same frozen string means that you only
have one instance of string "foo" and one instance of string "bar" in
the system, instead of (say) millions of distinct strings. You can still
use individual strings for the other 20%.

Unfortunately this doesn't correspond to my case . .

This is really an edge optimisation though, you really shouldn't need to
be worrying about these things - if they are significant, then perhaps
ruby is the wrong language for the problem in hand.

The other thing that occurred to me was that on my 64-bit machine maybe
I could run 2-3 threads for inserting into the hash table?

Noooo..... even in ruby 1.9, there is a global interpreter lock.
Multiple threads gain you nothing really, except for threads which are
blocked on I/O.

Right.

Even if there were not, having multiple threads contending on the same
hash (and controlling access via, say, a mutex) would be pretty much
guaranteed to make performance worse not better.

OK - oh well it was worth a thought!

Many thanks!

Regards,

Phil.

···

On 2011-03-20 08:31, Brian Candler wrote:
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

People,

To add to my last post:

Brian,

Philip Rhoades wrote in post #988264:

Your sample code looks like it's handling numeric-style data
(although I
realise this is just a test case for the problems you're having).
Integers in the range -2^30..+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.

Are you talking about the hash key or the hash values?

Either.

Right - for the following in my test script:

h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20,
rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i.freeze ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }

I get the following times:

18.350s
18.113s
4.724s
4.896s

So I guess I should live with the slight decrease of readability when
searching for particular results in the JSON output file by using ints
instead of strings for the hash keys.

- the values in
the real script will all be floats . .

Then they will be allocated on the heap, just like strings. I presume
you're aware of the inherent inaccuracy of floats (in any language), and
are OK with this.

1.0/2.0 == 1.0 + 1.0/2.0 - 1.0

=> true

1.0/10.0 == 1.0 + 1.0/10.0 - 1.0

=> false

I suppose I could convert them all to six or eight digit ints . . they
are measures of biological diversity and changing them backwards and
forwards is a bit of a hassle but maybe it is worth doing for the speed
advantage? - I will try my test script with floats and see what happens.

Or, if you're handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn't allocate any memory; it just points to the entry in the symbol
table.

Not sure what you mean - example?

a = []
a[0] = :foo
a[1] = :foo
a[2] = :foo

puts a[0].object_id
puts a[1].object_id
puts a[2].object_id

The reason the numbers are called "seeds" is that they correspond to the
seed for the random dumber generator in the C/C++ simulation program -
so they are all unique for each of the 32,000 simulations.

Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id

I ran that code but I don't understand how it helps . .

It uses less memory if you have (say) millions of identical strings.

Not the case for the keys and unlikely for the values.

It
may help garbage collection performance, but not much else

Although that's more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values
you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.

Still not clear - examples?

Suppose the strings "foo" and "bar" comprise 80% of your hash keys or
values. Then mapping them to the same frozen string means that you only
have one instance of string "foo" and one instance of string "bar" in
the system, instead of (say) millions of distinct strings. You can still
use individual strings for the other 20%.

Unfortunately this doesn't correspond to my case . .

This is really an edge optimisation though, you really shouldn't need to
be worrying about these things - if they are significant, then perhaps
ruby is the wrong language for the problem in hand.

The other thing that occurred to me was that on my 64-bit machine maybe
I could run 2-3 threads for inserting into the hash table?

Noooo..... even in ruby 1.9, there is a global interpreter lock.
Multiple threads gain you nothing really, except for threads which are
blocked on I/O.

Right.

Even if there were not, having multiple threads contending on the same
hash (and controlling access via, say, a mutex) would be pretty much
guaranteed to make performance worse not better.

OK - oh well it was worth a thought!

I ran the normal number of the inner two loops and only one of the outermost loop (ie 1/50th of the total) and got the following with different versions of Ruby:

ruby-1.8.7-p334 [ x86_64 ] = 31.41
ruby-1.9.2-p180 [ x86_64 ] = 19.53
rbx-head [ ] = 128.42

Many thanks!

Regards,

Phil.

···

On 2011-03-20 16:44, Philip Rhoades wrote:

On 2011-03-20 08:31, Brian Candler wrote:

--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au