Uniq with count; better way?

Please do not top post.

It seems not quite accurate to me because of the block. inject uses the convention that the last statement, as in a method, is the return value.

??? Inject uses the convention that the block passes on the "aggregator"
which it received as first argument. This does not have anything to do
with "return".

The nature of inject is to assign the last value to the memo, which has not been used at all in your case.

What "memo"?

Therefore it's more natural to use the short inject forms: either a.inject(5, :+) or 5 + a.inject(:+). If returning the memo from the proc were unnatural, inject wouldn't pass it to the proc explicitly.

???

On the other hand, I'm not a proponent of the crazy injects that can barely be understood. I think in this case inject could be used just as easily as the other solutions provided.

Your example is not accurate though:

5 + [1, 2, 3, 4].reduce(&:+)

Can you please again explain what is not accurate about Adam's piece
of code? Are you aware of the two different behaviors of #inject?

irb(main):001:0> [[1,2,3],[1],[]].each do |a|
irb(main):002:1* p a, a.inject(&:+), a.inject(0, &:+)
irb(main):003:1> end
[1, 2, 3]
6
6
[1]
1
1
[]
nil
0
=> [[1, 2, 3], [1], []]

Both have their use and it depends on the use case which you pick.

Kind regards

robert

···

On Mon, Jan 16, 2012 at 7:08 PM, Sigurd <cu9ypd@gmail.com> wrote:

On Jan 16, 2012, at 6:14 PM, Adam Prescott wrote:

On Jan 16, 2012 4:09 PM, "Sigurd" <cu9ypd@gmail.com> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Abinoam,

Monday, January 16, 2012, 6:05:33 PM, you wrote:

···

On Mon, Jan 16, 2012 at 9:22 PM, Abinoam Jr. <abinoam@gmail.com> wrote:

On Mon, Jan 16, 2012 at 1:48 PM, Magnus Holm <judofyr@gmail.com> wrote:

On Mon, Jan 16, 2012 at 17:04, Adam Prescott <adam@aprescott.com> wrote:

On Mon, Jan 16, 2012 at 16:00, Sigurd <cu9ypd@gmail.com> wrote:

[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res }

I think this is a misuse of inject, personally, every time I see it. It's
harder to read and it doesn't give the feeling of actually "reducing"
(inject's alias) the array down to one thing. The required `; res` is a
sign of that. Compare:

[1, 2, 3, 4].inject(5) { |a, b| a + b }

There's always each_with_object, although it's a little long:

[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 }

I think Magnus Holm's is the clearest (IMHO, yes, it's just taste and a
humble opinion).

[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) {|num, hsh| hsh[num] += 1}

Another way (not better) I remember is...

Hash[ [4,5,6,4,5,6,6,7].sort.chunk {|n| n}.map {|ix, els| [ix, els.size] } ]

See: Module: Enumerable (Ruby 1.9.3)

It also can be... clearer?!?

Hash[ [4,5,6,4,5,6,6,7].group_by {|n| n}.map {|ix, els| [ix, els.size] } ]

Perhaps something like this (same as Magnus Holm), just hiding the
complexity inside the method.

class Array
  def totalize_to_hash
    hsh = Hash.new(0)
    self.each do |n|
      hsh[n] += 1
    end
    hsh
  end
end

[4,5,6,4,5,6,6,7].totalize_to_hash

Abinoam Jr.

Some benchmark results...

require 'benchmark'

a = [4,5,6,4,5,6,6,7]
n = 100_000
Benchmark.bm(15) do |b|
  b.report("Ralph Shneiver:") { n.times { a = [4,5,6,4,5,6,6,7]; result = Hash.new(0); a.each { |x| result[x] += 1 }; result } }
  b.report("Sigurd:") { n.times { [4,5,6,4,5,6,6,7].inject(Hash.new(0)) { |res, x| res[x] += 1; res } } }
  b.report("Keinich #1") { n.times { Hash[a.group_by { |n| n }.map { |k, v| [k, v.size] }] } }
  b.report("Keinich #2") { n.times { Hash.new(0).tap { |h| a.each { |n| h[n] += 1 } } } }
  b.report("Magnus Holm:") { n.times { [4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 } } }
  b.report("Abinoam #1:") { n.times { Hash[[4,5,6,4,5,6,6,7].sort.chunk { |n| n }.map { |ix, els| [ix, els.size] }] } }
end

                     user system total real
Ralph Shneiver: 0.290000 0.000000 0.290000 ( 0.259640)
Sigurd: 0.320000 0.000000 0.320000 ( 0.289873)
Keinich #1 0.560000 0.000000 0.560000 ( 0.497736)
Keinich #2 0.280000 0.000000 0.280000 ( 0.250843)
Magnus Holm: 0.310000 0.000000 0.310000 ( 0.283344)
Abinoam #1: 1.140000 0.000000 1.140000 ( 1.042744)

Abinoam Jr.

Wow ... thank you!

Sorry for the mess... (some error in the bench code + horrible text wrapping)

Apologizing with the gist...

Abinoam Jr.

···


Kenichi, let's add the word "histogram" to the search.

Perhaps naming the method something "histogram" related would be a good choice.

Abinoam Jr.

···

On Tue, Jan 17, 2012 at 11:27 AM, Kenichi Kamiya <kachick1@gmail.com> wrote:

I found the discussion below in Rails (via Google):
https://github.com/rails/rails/pull/1932

It has the same name and aims at the same goal.
And it was judged "too specific".

I think so too, but this name doesn't suggest any other use case to me.
uhh...

2012/1/17 Abinoam Jr. <abinoam@gmail.com>:

On Tue, Jan 17, 2012 at 7:58 AM, Kenichi Kamiya <kachick1@gmail.com> wrote:

# if Enumerable has method for this case
an approach to http://www.ruby-forum.com/topic/3446541 · GitHub

I would like to have it.

We can discuss it better here at ruby talk to see the pros and cons.
If somebody is able to write the C code for it...
Perhaps we could issue a feature request.

Abinoam Jr.

--
Kenichi Kamiya
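
For what it's worth, a pure-Ruby sketch of such an Enumerable method (count_occurrences is a made-up name here, just to make the idea concrete, not an accepted API):

module Enumerable
  # Count how often each element occurs; unseen bins default to 0.
  def count_occurrences
    each_with_object(Hash.new(0)) { |e, counts| counts[e] += 1 }
  end
end

[4,5,6,4,5,6,6,7].count_occurrences # => {4=>2, 5=>2, 6=>3, 7=>1}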

Please do not top post.

It seems not quite accurate to me because of the block. inject uses the convention that the last statement, as in a method, is the return value.

??? Inject uses the convention that the block passes on the "aggregator"
which it received as first argument. This does not have anything to do
with "return".

The nature of inject is to assign the last value to the memo, which has not been used at all in your case.

What "memo"?

Memo is the name of the first parameter that is passed to the block, or at least that's how it's named in the docs and how I'm used to referring to it.
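
E.g., spelling out the doc's naming:

# "memo" is the accumulator the block receives first and should pass on.
[1, 2, 3].inject(0) { |memo, element| memo + element } # => 6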

Therefore it's more natural to use the short inject forms: either a.inject(5, :+) or 5 + a.inject(:+). If returning the memo from the proc were unnatural, inject wouldn't pass it to the proc explicitly.

???

The argument was about the explicit memo returns, so on the given sample I tried to show that it does not necessarily need a block at all. But I see no problem with explicitly returning the value if it was passed to the block. Yes, sometimes it looks ugly; sometimes, especially with tap, it looks pretty nice.

On the other hand, I'm not a proponent of the crazy injects that can barely be understood. I think in this case inject could be used just as easily as the other solutions provided.

Your example is not accurate though:

5 + [1, 2, 3, 4].reduce(&:+)

Can you please again explain what is not accurate about Adam's piece
of code? Are you aware of the two different behaviors of #inject?

Technically it's good. My argument was not that the code is not functional, or does something the wrong way, or that the syntax is unacceptable. My point is that Adam's code is not the best example to illustrate that an explicit inject block return is a sign of incorrect inject use, or let's call it a "code smell". I'm not against that particular piece of code; it's just that someone is wrong on the internet again :)

irb(main):001:0> [[1,2,3],[1],[]].each do |a|
irb(main):002:1* p a, a.inject(&:+), a.inject(0, &:+)
irb(main):003:1> end
[1, 2, 3]
6
6
[1]
1
1
[]
nil
0
=> [[1, 2, 3], [1], []]

Both have their use and it depends on the use case which you pick.

I don't know whether this can be called different behavior. Inject uses the first value of the collection if the initial value is not set explicitly, so it's like a particular case of inject, but I'm OK with the two different cases as well.

···

On Jan 16, 2012, at 10:15 PM, Robert Klemme wrote:

On Mon, Jan 16, 2012 at 7:08 PM, Sigurd <cu9ypd@gmail.com> wrote:

On Jan 16, 2012, at 6:14 PM, Adam Prescott wrote:

On Jan 16, 2012 4:09 PM, "Sigurd" <cu9ypd@gmail.com> wrote:

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

> user system total real
> Ralph Shneiver: 0.290000 0.000000 0.290000 ( 0.259640)
> Sigurd: 0.320000 0.000000 0.320000 ( 0.289873)
> Keinich #1 0.560000 0.000000 0.560000 ( 0.497736)
> Keinich #2 0.280000 0.000000 0.280000 ( 0.250843)
> Magnus Holm: 0.310000 0.000000 0.310000 ( 0.283344)
> Abinoam #1: 1.140000 0.000000 1.140000 ( 1.042744)
>
> Abinoam Jr.
>

Sorry for the mess... (some error in the bench code + horrible text
wrapping)

Apologizing with the gist...
1624016’s gists · GitHub

Thanks. Very interesting.

I assumed (incorrectly, it turns out, for Ruby 1.9.3) that for large data
sets there would be a significant performance difference. Because, upon
determining the "bins" of uniq values, that is essentially a form of
"sorting", which can be O(n^2) if not careful.

Turns out I was wrong for ruby 1.9.3 (but right for some other rubies).

I rewrote the code to create large datasets (array up to 10_000_000), but
with the data inside a set of 0...100.

require 'benchmark'

n = 1
#ar = [4,5,6,4,5,6,6,7]
ar = [].tap { |a| 10_000_000.times { a << rand(100) } } # 1_000_000, 2_000_000, ...
puts "SAMPLE of ar"
puts ar[0...20]
puts "SIZE"
puts ar.size
Benchmark.bm(15) do |b|
  b.report("Ralph Shneiver:") { n.times { result = Hash.new(0); ar.each { |x| result[x] += 1 }; result } }
  b.report("Sigurd:") { n.times { ar.inject(Hash.new(0)) { |res, x| res[x] += 1; res } } }
  b.report("Keinich #1") { n.times { Hash[ar.group_by { |n| n }.map { |k, v| [k, v.size] }] } }
  b.report("Keinich #2") { n.times { Hash.new(0).tap { |h| ar.each { |n| h[n] += 1 } } } }
  b.report("Magnus Holm:") { n.times { ar.each_with_object(Hash.new(0)) { |x, res| res[x] += 1 } } }
  b.report("Abinoam #1:") { n.times { Hash[ar.sort.chunk { |n| n }.map { |ix, els| [ix, els.size] }] } }
end

RUBY 1.9.3:

···

On Tue, Jan 17, 2012 at 2:44 AM, Abinoam Jr. <abinoam@gmail.com> wrote:

On Mon, Jan 16, 2012 at 10:05 PM, Abinoam Jr. <abinoam@gmail.com> wrote:

==========

For ruby 1.9.3-p0 all solutions perform approx. the same (at least the
same order) and seem to stay quasi-linear.

SIZE
1000000 #[1_000_000 that is ; n = 1]
                      user system total real
Ralph Shneiver: 0.180000 0.000000 0.180000 ( 0.174283)
Sigurd: 0.200000 0.000000 0.200000 ( 0.203652)
Keinich #1 0.140000 0.000000 0.140000 ( 0.142833)
Keinich #2 0.180000 0.000000 0.180000 ( 0.177456)
Magnus Holm: 0.200000 0.000000 0.200000 ( 0.205895)
Abinoam #1: 0.260000 0.000000 0.260000 ( 0.254554)

SIZE
2000000 #[2_000_000 that is ; n = 1]
                      user system total real
Ralph Shneiver: 0.340000 0.010000 0.350000 ( 0.350032)
Sigurd: 0.410000 0.000000 0.410000 ( 0.406483)
Keinich #1 0.280000 0.000000 0.280000 ( 0.285213)
Keinich #2 0.350000 0.010000 0.360000 ( 0.354640)
Magnus Holm: 0.410000 0.000000 0.410000 ( 0.411010)
Abinoam #1: 0.470000 0.030000 0.500000 ( 0.498782)

SIZE
10000000 #[10_000_000 that is; n = 1 here]

                      user system total real
Ralph Shneiver: 1.710000 0.040000 1.750000 ( 1.748137)
Sigurd: 2.000000 0.010000 2.010000 ( 2.012496)
Keinich #1 1.380000 0.030000 1.410000 ( 1.409462)
Keinich #2 1.750000 0.010000 1.760000 ( 1.760997)
Magnus Holm: 1.990000 0.020000 2.010000 ( 2.014282)
Abinoam #1: 2.500000 0.060000 2.560000 ( 2.562646)

All solutions in Ruby 1.9.3 are relatively "linear".

Keinich #1 is the fastest.

I believe it is because I limited the dataset to 100 different values: the
number of bins is fixed and order O(n) can be achieved (not the
O(n.log(n)) that I had, incorrectly, expected).

RUBY 1.8.7:

For ruby 1.8.7-p I could not test the last 2 solutions (methods not
implemented).
For the first 3, _significant_ differences arise.

SIZE
1000000 #[1_000_000 that is ; n = 1]

                     user system total real
Ralph Shneiver: 0.370000 0.010000 0.380000 ( 0.369785)
Sigurd: 2.320000 0.030000 2.350000 ( 2.360403)
Keinich #1 1.520000 0.040000 1.560000 ( 1.562627)
Keinich #2 16.520000 0.100000 16.620000 ( 16.623032)

SIZE
2000000 #[2_000_000 that is ; n = 1]
                     user system total real
Ralph Shneiver: 0.720000 0.010000 0.730000 ( 0.737673)
Sigurd: 8.040000 0.110000 8.150000 ( 8.142827)
Keinich #1 5.670000 0.040000 5.710000 ( 5.716364)
Keinich #2 52.600000 0.480000 53.080000 ( 53.100823)

SIZE
10000000 #[10_000_000 that is ; n = 1]

                     user system total real
Ralph Shneiver: 3.680000 0.000000 3.680000 ( 3.680230)

Striking: Only the solution of Ralph Shneiver remains linear in MRI 1.8.7

JRUBY (1.6.5.1)

$ ruby -v
jruby 1.6.5.1 (ruby-1.8.7-p330) (2011-12-27 1bf37c2) (OpenJDK Server VM
1.6.0_23) [linux-i386-java]

I also tested JRuby out of curiosity.

SIZE
1000000 #[1_000_000 that is ; n = 1]
                     user system total real
Ralph Shneiver: 0.244000 0.000000 0.244000 ( 0.220000)
Sigurd: 0.264000 0.000000 0.264000 ( 0.264000)
Keinich #1 0.112000 0.000000 0.112000 ( 0.112000)
Keinich #2 11.256000 0.000000 11.256000 ( 11.256000)

SIZE
2000000 #[2_000_000 that is ; n = 1]
                     user system total real
Ralph Shneiver: 0.392000 0.000000 0.392000 ( 0.368000)
Sigurd: 0.439000 0.000000 0.439000 ( 0.438000)
Keinich #1 0.196000 0.000000 0.196000 ( 0.196000)
Keinich #2 14.462000 0.000000 14.462000 ( 14.462000)

SIZE
10000000 #[10_000_000 that is ; n = 1]

                     user system total real
Ralph Shneiver: 1.604000 0.000000 1.604000 ( 1.581000)
Sigurd: 1.769000 0.000000 1.769000 ( 1.769000)
Keinich #1 0.967000 0.000000 0.967000 ( 0.967000)
Keinich #2 8.978000 0.000000 8.978000 ( 8.978000)

Interesting again.

Those 2 solutions were not yet available on JRuby 1.6.5.1:

Magnus Holm: NoMethodError: undefined method `each_with_object' for
#<Array:0x18eb7b8>
Abinoam #1: NoMethodError: undefined method `chunk' for
#<Array:0x1719f30>

Ralph Shneiver remains linear again.

But then again ... Keinich #1 is significantly faster on 10_000_000 than
the others.
Keinich #2 seems to be not very predictable?

JRUBY HEAD (in rvm):

$ ruby -v
jruby 1.7.0.dev (ruby-1.8.7-p357) (2012-01-17 7de254f) (Java HotSpot(TM)
Server VM 1.6.0_26) [linux-i386-java]

1000000 #[1_000_000 that is ; n = 1]

                     user system total real
Ralph Shneiver: 0.238000 0.000000 0.238000 ( 0.223000)
Sigurd: 0.286000 0.000000 0.286000 ( 0.286000)
Keinich #1 0.120000 0.000000 0.120000 ( 0.120000)
Keinich #2 7.841000 0.000000 7.841000 ( 7.841000)

SIZE
2000000 #[2_000_000 that is ; n = 1]

                     user system total real
Ralph Shneiver: 0.426000 0.000000 0.426000 ( 0.410000)
Sigurd: 0.478000 0.000000 0.478000 ( 0.478000)
Keinich #1 0.213000 0.000000 0.213000 ( 0.213000)
Keinich #2 21.928000 0.000000 21.928000 ( 21.928000)

SIZE
10000000 #[10_000_000 that is ; n = 1]

                     user system total real
Ralph Shneiver: 1.550000 0.000000 1.550000 ( 1.535000)
Sigurd: 1.795000 0.000000 1.795000 ( 1.795000)
Keinich #1 1.060000 0.000000 1.060000 ( 1.060000)
Keinich #2 116.826000 0.000000 116.826000 (116.826000)

Similar results to JRuby 1.6.5.1

Caveat:

I did not check the correctness of the result, only the timing.

Questions:

Why would ruby 1.9.3 be so much better at this than ruby 1.8.7?
Could it be because the Hash is now "ordered" so it can do an efficient
algorithm when adding an entry to a bin? Or is it object creation?

Why are certain methods leading to non-linear behavior?

My conclusions (?):

* be careful about performance with large data sets

* Fastest overall: Keinich #1 on JRuby 1.6.5.1

* CRuby 1.9.3 seems to be the only one that remains linear for all
  solutions on large data sets.

* "Ralph Shneiver" is the only solution that remains linear for
  all tested rubies (MRI 1.9.3, MRI 1.8.7, JRuby 1.6.5.1, JRuby head)

HTH,

Peter

A histogram is a diagram, visually representing frequencies. A hash is not
a histogram.

···

On Tue, Jan 17, 2012 at 20:02, Abinoam Jr. <abinoam@gmail.com> wrote:

Kenichi, let's add the word "histogram" to the search.

Perhaps naming the method something "histogram" related would be a good
choice.

Please do not top post.

It seems not quite accurate to me because of the block. inject uses the convention that the last statement, as in a method, is the return value.

Therefore it's more natural to use the short inject forms: either a.inject(5, :+) or 5 + a.inject(:+). If returning the memo from the proc were unnatural, inject wouldn't pass it to the proc explicitly.

???

The argument was about the explicit memo returns, so on the given sample I tried to show that it does not necessarily need a block at all. But I see no problem with explicitly returning the value if it was passed to the block. Yes, sometimes it looks ugly; sometimes, especially with tap, it looks pretty nice.

Funny thing is, when you do x.inject(&:+), #inject actually sees a
block. This is just a shorthand notation for doing x.inject {|a,b|
a+b}. See the doc of #to_proc and here:

irb(main):028:0> o = Object.new
=> #<Object:0x9f8966c>
irb(main):029:0> def o.to_proc; lambda {|*a| printf "block %p\n", a} end
=> nil
irb(main):030:0> def f;if block_given? then yield else puts "no block" end end
=> nil
irb(main):031:0> f() { puts 123 }
123
=> nil
irb(main):032:0> f
no block
=> nil
irb(main):033:0> f &o
block []
=> nil
irb(main):034:0> f(&o)
block []
=> nil
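
So the two spellings below are equivalent; a small check with the intermediate Proc made explicit:

add = :+.to_proc                  # &:+ builds this Proc for you
add.call(1, 2)                    # => 3
[1, 2, 3].inject(&:+)             # => 6
[1, 2, 3].inject { |a, b| a + b } # => 6, the spelled-out form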

On the other hand, I'm not a proponent of the crazy injects that can barely be understood. I think in this case inject could be used just as easily as the other solutions provided.

Your example is not accurate though:

5 + [1, 2, 3, 4].reduce(&:+)

Can you please again explain what is not accurate about Adam's piece
of code? Are you aware of the two different behaviors of #inject?

Technically it's good. My argument was not that the code is not functional, or does something the wrong way, or that the syntax is unacceptable. My point is that Adam's code is not the best example to illustrate that an explicit inject block return is a sign of incorrect inject use, or let's call it a "code smell". I'm not against that particular piece of code; it's just that someone is wrong on the internet again :)

Aha. I would not have called that "not accurate".

irb(main):001:0> [[1,2,3],[1],[]].each do |a|
irb(main):002:1* p a, a.inject(&:+), a.inject(0, &:+)
irb(main):003:1> end
[1, 2, 3]
6
6
[1]
1
1
[]
nil
0
=> [[1, 2, 3], [1], []]

Both have their use and it depends on the use case which you pick.

I don't know whether this can be called different behavior. Inject uses the first value of the collection if the initial value is not set explicitly, so it's like a particular case of inject, but I'm OK with the two different cases as well.

It is different behavior because the block is invoked a different number
of times. And the behavior is controlled by whether an argument is passed
to #inject or not:

irb(main):016:0> [[1,2,3],[1],[]].each do |a|
irb(main):017:1* p a
irb(main):018:1> p a.inject {|x,y| printf "block 1: %p, %p\n", x, y;x+y},
irb(main):019:1* a.inject(0) {|x,y| printf "block 2: %p, %p\n", x, y;x+y},
irb(main):020:1* a.inject(nil) {|x,y| printf "block 3: %p, %p\n", x, y;x.to_i+y}
irb(main):021:1> end
[1, 2, 3]
block 1: 1, 2
block 1: 3, 3
block 2: 0, 1
block 2: 1, 2
block 2: 3, 3
block 3: nil, 1
block 3: 1, 2
block 3: 3, 3
6
6
6
[1]
block 2: 0, 1
block 3: nil, 1
1
1
1
[]
nil
0
nil
=> [[1, 2, 3], [1], []]

Cheers

robert

···

On Mon, Jan 16, 2012 at 10:03 PM, Sigurd <cu9ypd@gmail.com> wrote:

On Jan 16, 2012, at 10:15 PM, Robert Klemme wrote:

On Mon, Jan 16, 2012 at 7:08 PM, Sigurd <cu9ypd@gmail.com> wrote:

On Jan 16, 2012, at 6:14 PM, Adam Prescott wrote:

On Jan 16, 2012 4:09 PM, "Sigurd" <cu9ypd@gmail.com> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

It's a reasonable suggestion.

http://en.wikipedia.org/wiki/Histogram:

"""
In a more general mathematical sense, a histogram is a function m_i that counts the number of observations that fall into each of the disjoint categories (known as bins), whereas the graph of a histogram is merely one way to represent a histogram.
"""

···

On 01/17/2012 12:07 PM, Adam Prescott wrote:

On Tue, Jan 17, 2012 at 20:02, Abinoam Jr. <abinoam@gmail.com> wrote:

Kenichi, let's add the word "histogram" to the search.

Perhaps naming the method something "histogram" related would be a good
choice.

A histogram is a diagram, visually representing frequencies. A hash is not
a histogram.

I assumed (incorrectly turns out for Ruby 1.9.3) that for large data sets
there
would be a significant performance difference. Because, upon determining the
"bins" of uniq values, that is essentially a form of "sorting", which can
be O(n^2)
if not careful.

There is no sorting going on. Some tests explicitly use a Hash, others
use Enumerable#group_by, which also uses a Hash (you can see it from
the return value).
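
For instance:

# group_by already buckets into a Hash; no ordering is involved.
[4,5,6,4,5,6,6,7].group_by { |n| n }
# => {4=>[4, 4], 5=>[5, 5], 6=>[6, 6, 6], 7=>[7]}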

Questions:

Why would ruby 1.9.3 be so much better at this than ruby 1.8.7?
Could it be because the Hash is now "ordered" so it can do an efficient
algorithm when adding an entry to a bin?

No. Hashes in 1.9.* only maintain insertion order - nothing else.

Or is it object creation?

The difference is more likely to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.

Why are certain methods leading to non-linear behavior?

I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.

Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.
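
For example (a minimal sketch, not the actual tweaked script):

require 'benchmark'

ar = Array.new(1_000_000) { rand(100) }
# bmbm runs a "Rehearsal" pass first, so allocations and GC from earlier
# tests influence the measured pass less than with plain bm.
Benchmark.bmbm(15) do |b|
  b.report("tap:") { Hash.new(0).tap { |h| ar.each { |n| h[n] += 1 } } }
end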

I tried to tweak a bit but not much gain (~1% at most):

Cheers

robert

···

On Tue, Jan 17, 2012 at 12:08 PM, Peter Vandenabeele <peter@vandenabeele.com> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Hm! I haven't heard that use before. Still, would that be well-understood?
If most people think of it as the visual representation of the frequency,
the method name might seem off. But thanks for pointing that one out. :)

···

On Tue, Jan 17, 2012 at 21:29, Joel VanderWerf <joelvanderwerf@gmail.com> wrote:

It's a reasonable suggestion.

http://en.wikipedia.org/wiki/Histogram:

> I assumed (incorrectly, it turns out, for Ruby 1.9.3) that for large data
> sets there would be a significant performance difference. Because, upon
> determining the "bins" of uniq values, that is essentially a form of
> "sorting", which can be O(n^2) if not careful.

There is no sorting going on.

I beg to differ ...

I would assume the only task with an order higher than O(n) is the
"binning".

That is
* finding the hash key for which the value needs to be incremented
* or finding the set to which the entry needs to be added

I would expect those searches to go dramatically faster if the "keys"
are indexed, so the algorithm can find the key in O(log(n)). So there
the fundamentals of a sorting algorithm are used (where to insert the
new element, or add it to the bin if it already exists).

But maybe, because I chose a _constant_ and low number of only
100 bins, this effect does not come out clearly ...

I changed the number of "bins" from 100 to 1_000_000 and for certain
algorithms it has a larger impact than others (but none are dramatic
in Ruby 1.9.3).

  Binning 1_000_000 numbers over different numbers of bins · GitHub

The initial "Ralph Shneiver" algorithm is least affected.

Some tests explicitly use a Hash, others use Enumerable#group_by, which
also uses a Hash (you can see it from the return value).

> Questions:
> ========
>
> Why would ruby 1.9.3 be so much better at this than ruby 1.8.7?
> Could it be because the Hash is now "ordered" so it can do an efficient
> algorithm when adding an entry to a bin?

No. Hashes in 1.9.* only maintain insertion order - nothing else.

Do they have a sort of b_tree index on the keys internally?

> Or is it object creation?

The difference is more likely to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.

OK, understood.

> Why are certain methods leading to non-linear behavior?

I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.

"Ralph Shneiver:" => { result = Hash.new(0); ar.each { |x| result += 1 }

<speculation>
Could it be that the reason that the "Ralph Shneiver" method remains
stable and linear in all tests, is that it only assigns max 100 entries in
the
hash and then does nothing more than _increase a number_ (an "in place"
action that does not assign new memory and triggers no GC).
</speculation>

Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.

Thanks.

I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub

If I understand correctly, you tested for MRI Ruby 1.9.3 and found
similar results?

Thanks,

Peter

···

On Wed, Jan 18, 2012 at 2:57 PM, Robert Klemme <shortcutter@googlemail.com> wrote:

On Tue, Jan 17, 2012 at 12:08 PM, Peter Vandenabeele <peter@vandenabeele.com> wrote:

Peter,

Wednesday, January 18, 2012, 11:22:49 AM, you wrote:

> I assumed (incorrectly, it turns out, for Ruby 1.9.3) that for large data
> sets there would be a significant performance difference. Because, upon
> determining the "bins" of uniq values, that is essentially a form of
> "sorting", which can be O(n^2) if not careful.

There is no sorting going on.

I beg to differ ...

I agree with you. Obviously there is sorting going on.

I would assume the only task with an order higher than O(n) is the
"binning".

That is
* finding the hash key for which the value needs to be incremented
* or finding the set to which the entry needs to be added

I would expect those searches to go dramatically faster if the "keys"
are indexed, so the algorithm can find the key in O(log(n)). So there
the fundamentals of a sorting algorithm are used (where to insert the
new element, or add it to the bin if it already exists).

I'm looking at the C code for uniq. I think I understand what it is doing.

The C code makes a call to ary_make_hash. I can guess what it does. I am wondering if there is an equivalent Ruby call.

uniq and what we are doing here are closely related.

The basic idea for uniq (in the case where there is no block given) appears to be
1) Create a hash from an array (i.e. ary_make_hash). I'm guessing that duplicates are ignored by the nature of creating a hash.
2) For each element of the original array, attempt to delete that key from the created hash. If you are able to do it, add that key to the uniq array.

Clever. I'm guessing doing it that way is faster than manipulating the output from Hash#to_a.
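
In Ruby the same idea might look roughly like this (a sketch of the algorithm as described above, not the actual C):

def uniq_sketch(ary)
  hash = {}
  ary.each { |e| hash[e] = true }   # step 1: duplicate keys collapse
  ary.select { |e| hash.delete(e) } # step 2: only the first delete succeeds
end

uniq_sketch([4,5,6,4,5,6,6,7]) # => [4, 5, 6, 7]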

But maybe, because I chose a _constant_ and low number of only
100 bins, this effect does not come out clearly ...

I changed the number of "bins" from 100 to 1_000_000 and for certain
algorithms it has a larger impact than others (but none are dramatic
in Ruby 1.9.3).

  Binning 1_000_000 numbers over different numbers of bins · GitHub

The initial "Ralph Shneiver" algorithm is least affected.

It's Shenlvar not Shnelver! :)

Some tests explicitly use a Hash, others use Enumerable#group_by, which
also uses a Hash (you can see it from the return value).

> Questions:
> ========
>
> Why would ruby 1.9.3 be so much better at this than ruby 1.8.7?
> Could it be because the Hash is now "ordered" so it can do an efficient
> algorithm when adding an entry to a bin?

No. Hashes in 1.9.* only maintain insertion order - nothing else.

Do they have a sort of b_tree index on the keys internally?

> Or is it object creation?

The difference is more likely to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.

OK, understood.

> Why are certain methods leading to non-linear behavior?

I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.

"Ralph Shneiver:" =>> { result = Hash.new(0); ar.each { |x| result += 1 }

<speculation>
Could it be that the reason that the "Ralph Shneiver" method remains
stable and linear in all tests, is that it only assigns max 100 entries in
the
hash and then does nothing more than _increase a number_ (an "in place"
action that does not assign new memory and triggers no GC).
</speculation>

I suspect your speculation is quite correct.

I am going to paste in stuff from another post for the reader's convenience. I am quoting AJ

require 'benchmark'

a = [4,5,6,4,5,6,6,7]
n = 100_000
Benchmark.bm(15) do |b|
  b.report("Ralph Shneiver:") { n.times { a = [4,5,6,4,5,6,6,7]; result = Hash.new(0); a.each { |x| result[x] += 1 }; result } }
  b.report("Sigurd:") { n.times { [4,5,6,4,5,6,6,7].inject(Hash.new(0)) { |res, x| res[x] += 1; res } } }
  b.report("Keinich #1") { n.times { Hash[a.group_by { |n| n }.map { |k, v| [k, v.size] }] } }
  b.report("Keinich #2") { n.times { Hash.new(0).tap { |h| a.each { |n| h[n] += 1 } } } }
  b.report("Magnus Holm:") { n.times { [4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 } } }
  b.report("Abinoam #1:") { n.times { Hash[[4,5,6,4,5,6,6,7].sort.chunk { |n| n }.map { |ix, els| [ix, els.size] }] } }
end

"Sigurd" creates a new object, "res", for each iteration of each

"Keinich #1" appends stuff to a hash table value every time a new value is detected. This has got to be more work.

"Keinich #2" is quite close to "Shnelvar" and the timings show this.

"Abinoam #1" seems to be doing a lot of work. At least two sorts. The first an explicit one and the second an implied one in the creation of a hash table.

···

On Wed, Jan 18, 2012 at 2:57 PM, Robert Klemme <shortcutter@googlemail.com> wrote:

On Tue, Jan 17, 2012 at 12:08 PM, Peter Vandenabeele <peter@vandenabeele.com> wrote:

Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.

Thanks.

I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub

If I understand correctly, you tested for MRI Ruby 1.9.3 and found
similar results?

Thanks,

Peter

--
Best regards,
Ralph mailto:ralphs@dos32.com

Have you had a chance to look at Yura Sokolov's Hash optimization
feature request?

  https://bugs.ruby-lang.org/issues/5903

Jon

···

---
Fail fast. Fail often. Fail publicly. Learn. Adapt. Repeat.
http://thecodeshop.github.com | http://jonforums.github.com/
twitter: @jonforums

--
Posted via http://www.ruby-forum.com/.

The nice thing about open source is you guys don't need to sit around and blindly speculate...

% cd RUBY19
/Users/ryan/Links/RUBY19
% grep -i sort hash.c st.c
%

Not conclusive, I know.

You're more than welcome to read the two files. They're less than 5k loc in sum.

···

On Jan 18, 2012, at 12:26 , Ralph Shnelvar wrote:

I agree with you. Obviously there is sorting going on.

> The initial "Ralph Shneiver" algorithm is least affected.

It's Shenlvar not Shnelver! :)

Very sorry for that ... This is the second time on short notice
that I err with names :-/

I am going to paste in stuff from another post for the reader's

convenience. I am quoting AJ

> n = 100_000
> Benchmark.bm(15) do |b|
> b.report("Ralph Shneiver:") { n.times { a = [4,5,6,4,5,6,6,7];

I am afraid the misreading of your name started here ...

Peter

···

On Wed, Jan 18, 2012 at 9:26 PM, Ralph Shnelvar <ralphs@dos32.com> wrote:

I stand corrected ... (should have looked instead of speculated).

Do I understand correctly from `st_lookup` in `st.c` that:

* if (table->entries_packed) there is a linear search in bins
* else FIND_ENTRY uses a hashing mechanism to keep the cost of
  look-up essentially linear.

I had speculated that a b-tree with sorting was used to keep the
look-up cost low (but I presume now that is not the case)!

Thanks !

Peter

···

On Wed, Jan 18, 2012 at 10:24 PM, Ryan Davis <ryand-ruby@zenspider.com> wrote:

On Jan 18, 2012, at 12:26 , Ralph Shnelvar wrote:

> I agree with you. Obviously there is sorting going on.

The nice thing about open source is you guys don't need to sit around and
blindly speculate...

% cd RUBY19
/Users/ryan/Links/RUBY19
% grep -i sort hash.c st.c
%

Not conclusive, I know.

You're more than welcome to read the two files. They're less than 5k loc
in sum.

Sorry Ralph Shenlvar!

···

On Wed, Jan 18, 2012 at 6:25 PM, Peter Vandenabeele <peter@vandenabeele.com> wrote:


> I agree with you. Obviously there is sorting going on.

...

I stand corrected ... (should have looked instead of speculated).

Do I understand correctly from `st_lookup` in `st.c` that:

* if (table->entries_packed) there is a linear search in bins
* else FIND_ENTRY uses a hashing mechanism to keep the cost of
look-up essentially linear.

I had speculated that a b-tree with sorting was used to keep the
look-up cost low (but I presume now that is not the case)!

Peter, a Hash uses _hashing_ to keep lookup cost low. This has complexity O(1).
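
A rough way to see it (timings are illustrative only):

require 'benchmark'

# Lookup time stays roughly flat while the Hash grows 100x: O(1).
[10_000, 100_000, 1_000_000].each do |size|
  h = {}
  size.times { |i| h[i] = true }
  t = Benchmark.realtime { 100_000.times { h[rand(size)] } }
  printf "%9d keys: %.4fs for 100_000 lookups\n", size, t
end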

Cheers

robert

···

On Wed, Jan 18, 2012 at 10:50 PM, Peter Vandenabeele <peter@vandenabeele.com> wrote:

On Wed, Jan 18, 2012 at 10:24 PM, Ryan Davis <ryand-ruby@zenspider.com> wrote:

On Jan 18, 2012, at 12:26 , Ralph Shnelvar wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Robert,

Peter, a Hash uses _hashing_ to keep lookup cost low. This has complexity O(1).

Oh god, I feel so stupid.

Thanks for pointing that out.

···

http://en.wikipedia.org/wiki/Hash_table