Why does Ruby do 'str_replace' when storing values in an hash?

Dear list,

some days ago I wrote a script that stored values in an Hashtable with
different buckets. It was some string-parsing grunt work.

I noticed that Ruby became pretty slow when the hashtable grew to ~10 Mio
entries. To make sure it is actually Ruby that is to blame, I wrote comparison
script snippets in Perl and Python, and yes, Ruby was just plain slower.

By reducing the amount of code step-by-step I was able to narrow the problem
down. Currently, I can reproduce the behaviour with the following simple
scriptlet:

---%<---
tbl = { }

10_000_000.times do |i|
  tbl['last'] = i*i*i
end
--->%---

Execution times are as follows:

+ ruby loop0r.rb
real 0m10.741s
user 0m10.596s
sys 0m0.107s

+ perl loop0r.pl
real 0m4.503s
user 0m4.420s
sys 0m0.048s

+ python loop0r.py
real 0m6.704s
user 0m6.367s
sys 0m0.274s

Replacing the String I use as Hash key with a symbol, i.e., :last, lowers the
execution time of Ruby dramatically so that it becomes fast than Python (and
still slower than Perl, but that's ok).

I was baffled and ran callgrind against the interpreter. It came up with this:

---%<---
183,248,191 /usr/local/rvm/src/ruby-2.0.0-p247/vm.inc:vm_exec_core'2
155,033,405 /usr/local/rvm/src/ruby-2.0.0-p247/siphash.c:ruby_sip_hash24
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
112,110,714 /usr/local/rvm/src/ruby-2.0.0-p247/insns.def:vm_exec_core'2
93,170,146 /usr/local/rvm/src/ruby-2.0.0-p247/string.c:str_replace
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
78,086,663 /usr/local/rvm/src/ruby-2.0.0-p247/st.c:st_update
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
74,991,856 /usr/local/rvm/src/ruby-2.0.0-p247/gc.c:slot_sweep
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
73,137,809 /usr/local/rvm/src/ruby-2.0.0-
p247/vm_insnhelper.c:vm_call_cfunc_with_frame'2
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
63,043,562 /usr/local/rvm/src/ruby-2.0.0-p247/vm_insnhelper.c:vm_push_frame
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
61,029,008 /usr/local/rvm/src/ruby-2.0.0-p247/vm.c:rb_yield
--->%---

Now: Why does Ruby call string.c:str_replace? There is obviously no operation
that modifies a string here. In fact, I even use the same String as hash key
over and over again.

Of course, I can also freeze the String, which amounts to the same as using a
Symbol (performance-wise). Still, I don't understand why Ruby is calling
str_*replace*.

What is happening here? And how can I circumvent it?

Thanks alot for any replies!

      --- Eric

str_replace is (probably, I'm not an expert here) the function that
replaces a string object's value, not to be confused with PHP's str_replace
(which in ruby is called gsub).

Objects created in ruby blocks (including literals) are created each time
the block is run. Have you tried this?

---%<---
tbl = { }
key = 'last'

10_000_000.times do |i|
  tbl[key] = i*i*i
end
--->%---

I imagine that will have the same effect as either the symbol (which is a
singleton), or the frozen string (which is effectively a singleton). No new
string literal = no new string object = no replacing the existing one.

···

On 8 April 2014 23:11, Eric MSP Veith <eveith@wwweb-library.net> wrote:

Dear list,

some days ago I wrote a script that stored values in an Hashtable with
different buckets. It was some string-parsing grunt work.

I noticed that Ruby became pretty slow when the hashtable grew to ~10 Mio
entries. To make sure it is actually Ruby that is to blame, I wrote
comparison
script snippets in Perl and Python, and yes, Ruby was just plain slower.

By reducing the amount of code step-by-step I was able to narrow the
problem
down. Currently, I can reproduce the behaviour with the following simple
scriptlet:

---%<---
tbl = { }

10_000_000.times do |i|
  tbl['last'] = i*i*i
end
--->%---

Execution times are as follows:

+ ruby loop0r.rb
real 0m10.741s
user 0m10.596s
sys 0m0.107s

+ perl loop0r.pl
real 0m4.503s
user 0m4.420s
sys 0m0.048s

+ python loop0r.py
real 0m6.704s
user 0m6.367s
sys 0m0.274s

Replacing the String I use as Hash key with a symbol, i.e., :last, lowers
the
execution time of Ruby dramatically so that it becomes fast than Python
(and
still slower than Perl, but that's ok).

I was baffled and ran callgrind against the interpreter. It came up with
this:

---%<---
183,248,191 /usr/local/rvm/src/ruby-2.0.0-p247/vm.inc:vm_exec_core'2
155,033,405 /usr/local/rvm/src/ruby-2.0.0-p247/siphash.c:ruby_sip_hash24
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
112,110,714 /usr/local/rvm/src/ruby-2.0.0-p247/insns.def:vm_exec_core'2
93,170,146 /usr/local/rvm/src/ruby-2.0.0-p247/string.c:str_replace
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
78,086,663 /usr/local/rvm/src/ruby-2.0.0-p247/st.c:st_update
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
74,991,856 /usr/local/rvm/src/ruby-2.0.0-p247/gc.c:slot_sweep
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
73,137,809 /usr/local/rvm/src/ruby-2.0.0-
p247/vm_insnhelper.c:vm_call_cfunc_with_frame'2
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
63,043,562
/usr/local/rvm/src/ruby-2.0.0-p247/vm_insnhelper.c:vm_push_frame
[/home/eveith/.rvm/rubies/ruby-2.0.0-p247/lib/libruby.so.2.0.0]
61,029,008 /usr/local/rvm/src/ruby-2.0.0-p247/vm.c:rb_yield
--->%---

Now: Why does Ruby call string.c:str_replace? There is obviously no
operation
that modifies a string here. In fact, I even use the same String as hash
key
over and over again.

Of course, I can also freeze the String, which amounts to the same as
using a
Symbol (performance-wise). Still, I don't understand why Ruby is calling
str_*replace*.

What is happening here? And how can I circumvent it?

Thanks alot for any replies!

                        --- Eric

--
  Matthew Kerwin
  http://matthew.kerwin.net.au/

> str_replace is (probably, I'm not an expert here) the function that
> replaces a string object's value, not to be confused with PHP's str_replace
> (which in ruby is called gsub).

Oh yes, you are right of course. No wonder I was mistaken with my own
analysis. Everything makes perfect sense now. :slight_smile:

> Objects created in ruby blocks (including literals) are created each time
> the block is run. Have you tried this? [...]

Yep, that helps and puts ruby down to 3s.

Btw, if you do know the strings beforehand, ruby-trunk will prefreeze
and deduplicate string literals when doing hash assignments.

However, I'm wondering how this will help me with the actual string-parsing
grunt work. I don't know the hash keys beforehand, so its still strings that
are used. Is there a more efficient way to do this instead of using the Hash
class?

I'll try to make dedupe+prefreeze for dynamic strings possible again in
2.2, but no guarantees. We tried it right before 2.1, but there was a
regression in some cases so we reverted it.

···

Eric MSP Veith <eveith@wwweb-library.net> wrote:

On Wednesday 09 April 2014, 00:41:58, Matthew Kerwin wrote:

str_replace is (probably, I'm not an expert here) the function that
replaces a string object's value, not to be confused with PHP's str_replace
(which in ruby is called gsub).

Oh yes, you are right of course. No wonder I was mistaken with my own
analysis. Everything makes perfect sense now. :slight_smile:

Objects created in ruby blocks (including literals) are created each time
the block is run. Have you tried this? [...]

Yep, that helps and puts ruby down to 3s.

However, I'm wondering how this will help me with the actual string-parsing
grunt work. I don't know the hash keys beforehand, so its still strings that
are used. Is there a more efficient way to do this instead of using the Hash
class?

Thanks alot!

      --- Eric

···

On Wednesday 09 April 2014, 00:41:58, Matthew Kerwin wrote:

However, I'm wondering how this will help me with the actual

string-parsing

grunt work. I don't know the hash keys beforehand, so its still strings

that

are used. Is there a more efficient way to do this instead of using the

Hash

class?

I think it's not the hash, exactly; it's keying the hash on a lot of string
objects.

OTOH you might be able to use a second hash, something like:

´´´
keys = Hash.new{|h,k|h[k]=h.length}
data = {}

#...big loop
data[ keys[my_str] ] = my_val
´´´

Caveat: written on my phone, not verified at all.

The goal is to convert the string to a Fixnum in the simplest way possible.
(Fixnums also being singleton.) Since ´keys´ only updates to add a new
string, it shouldn't suffer the same slowdown.

···

On Apr 9, 2014 4:40 AM, "Eric MSP Veith" <eveith@wwweb-library.net> wrote:

Ruby will copy a String instance when used as a Hash key to avoid
aliasing effects. There is an easy trick to avoid this copying: freeze
them before inserting

irb(main):001:0> s = "foo"
=> "foo"
irb(main):002:0> h = {}
=> {}
irb(main):003:0> h[s]=1
=> 1
irb(main):004:0> h.keys.map {|k| s.equal? k}
=> [false]

irb(main):005:0> s.freeze
=> "foo"
irb(main):006:0> h = {}
=> {}
irb(main):007:0> h[s]=2
=> 2
irb(main):008:0> h.keys.map {|k| s.equal? k}
=> [true]

Note: #equal? tests for identity - not equivalence.

Kind regards

robert

···

On Tue, Apr 8, 2014 at 8:39 PM, Eric MSP Veith <eveith@wwweb-library.net> wrote:

On Wednesday 09 April 2014, 00:41:58, Matthew Kerwin wrote:

str_replace is (probably, I'm not an expert here) the function that
replaces a string object's value, not to be confused with PHP's str_replace
(which in ruby is called gsub).

Oh yes, you are right of course. No wonder I was mistaken with my own
analysis. Everything makes perfect sense now. :slight_smile:

Objects created in ruby blocks (including literals) are created each time
the block is run. Have you tried this? [...]

Yep, that helps and puts ruby down to 3s.

However, I'm wondering how this will help me with the actual string-parsing
grunt work. I don't know the hash keys beforehand, so its still strings that
are used. Is there a more efficient way to do this instead of using the Hash
class?

--
[guy, jim].each {|him| remember.him do |as, often| as.you_can - without end}
http://blog.rubybestpractices.com/