a = [4,5,6,4,5,6,6,7]
result = Hash.new(0)
a.each { |x| result[x] += 1 }
p result
The result I am getting
{4=>2, 5=>2, 6=>3, 7=>1}
is what I want.
Is there a better way; perhaps using uniq?
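Since the question mentions uniq, here is a sketch of what a uniq-based version could look like (not from the thread); note it re-scans the array once per distinct value:

```ruby
a = [4, 5, 6, 4, 5, 6, 6, 7]

# Map each distinct value to its count. This is O(n * k) for k distinct
# values, so it is slower than a single pass for large inputs.
result = Hash[a.uniq.map { |x| [x, a.count(x)] }]
p result  # => {4=>2, 5=>2, 6=>3, 7=>1}
```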
The first that came to my mind.
[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res }
Is it good enough for you?
On Jan 16, 2012, at 5:51 PM, Ralph Shnelvar wrote:
a = [4,5,6,4,5,6,6,7]
result = Hash.new(0)
a.each { |x| result[x] += 1 }
p result
The result I am getting
{4=>2, 5=>2, 6=>3, 7=>1}
is what I want.
Is there a better way; perhaps using uniq?
I like this
a = [4,5,6,4,5,6,6,7]
# 1
p Hash[a.group_by{|n|n}.map{|k, v|[k, v.size]}]
# 2
p Hash.new(0).tap{|h|a.each{|n|h[n] += 1}}
2012/1/17 Ralph Shnelvar <ralphs@dos32.com>:
a = [4,5,6,4,5,6,6,7]
result = Hash.new(0)
a.each { |x| result[x] += 1 }
p result
The result I am getting
{4=>2, 5=>2, 6=>3, 7=>1}
is what I want.
Is there a better way; perhaps using uniq?
If your data items are integers, and from a rather small range (compared
to computer memory...), then you can use an array instead of a hash:
maxval = 10
result = Array.new(maxval+1, 0)
ar.each{ |x| result[x] += 1 }
This returns an array and not a hash.
[0, 0, 0, 0, 2, 2, 3, 1, 0, 0, 0]
To make a histogram, that data structure is even better. Otherwise you
need to transform it to a hash again. But for large data sets I still
expect it to be faster:
Your cpu does not need to calculate a hash key of every single data
item, because the data item is already a perfect key for the array. Also
no hash key collisions can occur.
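The back-conversion mentioned above ("transform it to a hash again") might look like this sketch, skipping the empty slots:

```ruby
# Positional counts for the sample data [4,5,6,4,5,6,6,7] with maxval = 10.
hist = [0, 0, 0, 0, 2, 2, 3, 1, 0, 0, 0]

# The index is the value; keep only non-zero counters.
result = {}
hist.each_with_index { |count, value| result[value] = count if count > 0 }
p result  # => {4=>2, 5=>2, 6=>3, 7=>1}
```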
Regards
Karsten Meier
--
Posted via http://www.ruby-forum.com/.
Ok, I tried some benchmarks. We have now even more variables, as they
also depend on "maxval" from the dataset.
maxval = 1000
ar = [].tap{|a| 1_000_000.times {a << rand(maxval)}}
b.report("Meier:") {
n.times {
hist = Array.new(maxval+1, 0)
ar.each{|x| hist[x] += 1;}
result = Hash.new(0)
0.upto(maxval){|i| result[i] = hist[i] unless hist[i] == 0}
result
}
}
On my jruby and my Windows MRI 1.8.7 my algorithm was fastest for
maxval of 10, 100 or 10000, for example:
SIZE
1000000
MAXVAL
10000
user system total real
Ralph Shneiver: 0.533000 0.000000 0.533000 ( 0.518000)
Meier: 0.312000 0.000000 0.312000 ( 0.312000)
Keinich #1 0.814000 0.000000 0.814000 ( 0.814000)
(I have no 1.9.3 yet on my windows PC, so it may be different there)
But here are two observations:
1) The speed-up is not as big as I expected. In C, I expect array lookup to
be factors better than hash calculation (followed by an array lookup in
the hash table...). In Ruby it seems to be not much faster. But the
speedup gets bigger for bigger values of maxval.
2) My algorithm sometimes runs much, much slower when #kennich1 had run
before mine.
It seems to get worse with big values of maxval, but not
with the jruby --1.9 option. It is not the array allocation itself that is the
problem.
Is it possible that group_by changes the internal array structure, so I
get a non-contiguous array?
Regards
Karsten Meier
I think this is a misuse of inject, personally, every time I see it. It's
harder to read and it doesn't give the feeling of actually "reducing"
(inject's alias) the array down to one thing. The required `; res` is a
sign of that. Compare:
[1, 2, 3, 4].inject(5) { |a, b| a + b }
On Mon, Jan 16, 2012 at 16:00, Sigurd <cu9ypd@gmail.com> wrote:
[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res }
Kenichi,
Monday, January 16, 2012, 9:21:51 AM, you wrote:
2012/1/17 Ralph Shnelvar <ralphs@dos32.com>:
a = [4,5,6,4,5,6,6,7]
result = Hash.new(0)
a.each { |x| result[x] += 1 }
p result
The result I am getting
{4=>2, 5=>2, 6=>3, 7=>1}
is what I want.
Is there a better way; perhaps using uniq?
I like this
a = [4,5,6,4,5,6,6,7]
# 1
p Hash[a.group_by{|n|n}.map{|k, v|[k, v.size]}]
# 2
p Hash.new(0).tap{|h|a.each{|n|h[n] += 1}}
I like #2. I can understand it. I'm still having trouble wrapping my head around #1.
Having said that, is your #2 better than mine in any dimension (comprehensibility and/or speed of execution)?
If your data items are integers, and from a rather small range (compared
to computer memory...), then you can use an array instead of a hash:
maxval = 10
result = Array.new(maxval+1, 0)
ar.each{ |x| result[x] += 1 }
This returns an array and not a hash.
[0, 0, 0, 0, 2, 2, 3, 1, 0, 0, 0]
To make a histogram, that data structure is even better. Otherwise you
need to transform it to a hash again. But for large data sets I still
expect it to be faster:
Don't expect, measure. There's Benchmark...
Your cpu does not need to calculate a hash key of every single data
item, because the data item is already a perfect key for the array. Also
no hash key collisions can occur.
But if only few of the numbers in the range are used you waste a
potentially large Array for just a few entries.
Kind regards
robert
On Mon, Jan 23, 2012 at 11:15 AM, Karsten Meier <developer@handylearn-projects.de> wrote:
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
Ok, I tried some benchmarks. We have now even more variables, as they
also depend on "maxval" from the dataset.
maxval = 1000
ar = [].tap{|a| 1_000_000.times {a << rand(maxval)}}
b.report("Meier:") {
n.times {
hist = Array.new(maxval+1, 0)
ar.each{|x| hist[x] += 1;}
result = Hash.new(0)
0.upto(maxval){|i| result[i] = hist[i] unless hist[i] == 0}
You could also do
hist.each_with_index {|c,i| result[i] = c if c.nonzero?}
result
}
}
That's just part of the testing code, isn't it? Why not share the
complete code?
On my jruby and my Windows MRI 1.8.7 my algorithm was fastest for
maxval of 10, 100 or 10000, for example:
SIZE
1000000
MAXVAL
10000
user system total real
Ralph Shneiver: 0.533000 0.000000 0.533000 ( 0.518000)
Meier: 0.312000 0.000000 0.312000 ( 0.312000)
Keinich #1 0.814000 0.000000 0.814000 ( 0.814000)
(I have no 1.9.3 yet on my windows PC, so it may be different there)
But here are two observations:
1) The speed-up is not as big as I expected. In C, I expect array lookup to
be factors better than hash calculation (followed by an array lookup in
the hash table...). In Ruby it seems to be not much faster. But the
speedup gets bigger for bigger values of maxval.
2) My algorithm sometimes runs much, much slower when #kennich1 had run
before mine. It seems to get worse with big values of maxval, but not
with the jruby --1.9 option. It is not the array allocation itself that is the
problem.
Is it possible that group_by changes the internal array structure, so I
get a non-contiguous array?
No. It's more likely that you are hit by GC I'd say. You could also
try Benchmark.bmbm for warm up before the test.
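A minimal Benchmark.bmbm sketch of the comparison (sizes shrunk for illustration; the labels and sizes are my own, not from the thread). The rehearsal pass lets allocation and GC warm up before the timed run:

```ruby
require 'benchmark'

maxval = 1_000
ar = Array.new(100_000) { rand(maxval) }

Benchmark.bmbm(12) do |b|
  # Hash-based tally, as in the original post.
  b.report("hash:") do
    h = Hash.new(0)
    ar.each { |x| h[x] += 1 }
  end
  # Array-based histogram, as in Karsten's variant.
  b.report("array:") do
    hist = Array.new(maxval + 1, 0)
    ar.each { |x| hist[x] += 1 }
  end
end
```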
Kind regards
robert
On Mon, Jan 23, 2012 at 3:08 PM, Karsten Meier <developer@handylearn-projects.de> wrote:
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
Interesting.
I added your algorithm to the list and tested on ruby 1.9.3
$ ruby -v
ruby 1.9.3p0 (2011-10-30 revision 33570) [i686-linux]
SIZE
1000000
MAXVAL
1000
user system total real
Ralph Shneiver: 0.370000 0.000000 0.370000 ( 0.369229)
Sigurd: 0.420000 0.000000 0.420000 ( 0.418634)
Meier: 0.270000 0.000000 0.270000 ( 0.274136)
Keinich #1 0.320000 0.000000 0.320000 ( 0.320962)
Keinich #2 0.380000 0.000000 0.380000 ( 0.372422)
Magnus Holm: 0.420000 0.000000 0.420000 ( 0.423316)
Abinoam #1: 0.600000 0.000000 0.600000 ( 0.597028)
And I also retested in the latest jruby-head (1.7.0.dev)
$ ruby -v
jruby 1.7.0.dev (ruby-1.8.7-p357) (2012-01-23 f80ab05) (Java HotSpot(TM)
Server VM 1.6.0_26) [linux-i386-java]
SIZE
1000000
MAXVAL
1000
user system total real
Ralph Shneiver: 0.492000 0.000000 0.492000 ( 0.476000)
Sigurd: 0.473000 0.000000 0.473000 ( 0.473000)
Meier: 0.287000 0.000000 0.287000 ( 0.287000)
Keinich #1 0.308000 0.000000 0.308000 ( 0.308000)
Keinich #2 7.374000 0.000000 7.374000 ( 7.374000)
Magnus Holm: NoMethodError: undefined method `each_with_object' for
#<Array:0x19c6163>
__file__ at sb.rb:30
times at org/jruby/RubyFixnum.java:261
...
So, at least for these 2 cases, your algorithm seems somewhat
faster.
As long as the array is not "sparsely" populated, this
approach certainly makes sense.
HTH,
Peter
On Mon, Jan 23, 2012 at 3:08 PM, Karsten Meier < developer@handylearn-projects.de> wrote:
Ok, I tried some benchmarks. We have now even more variables, as they
also depend on "maxval" from the dataset.
maxval = 1000
ar = [].tap{|a| 1_000_000.times {a << rand(maxval)}}
b.report("Meier:") {
n.times {
hist = Array.new(maxval+1, 0)
ar.each{|x| hist[x] += 1;}
result = Hash.new(0)
0.upto(maxval){|i| result[i] = hist[i] unless hist[i] == 0}
result
}
}
On my jruby and my Windows MRI 1.8.7 my algorithm was fastest for
maxval of 10, 100 or 10000, for example:
SIZE
1000000
MAXVAL
10000
user system total real
Ralph Shneiver: 0.533000 0.000000 0.533000 ( 0.518000)
Meier: 0.312000 0.000000 0.312000 ( 0.312000)
Keinich #1 0.814000 0.000000 0.814000 ( 0.814000)
(I have no 1.9.3 yet on my windows PC, so it may be different there)
Well it's one of the possible solutions.
Your example is not accurate though:
5 + [1, 2, 3, 4].reduce(&:+)
On Jan 16, 2012, at 6:04 PM, Adam Prescott wrote:
On Mon, Jan 16, 2012 at 16:00, Sigurd <cu9ypd@gmail.com> wrote:
[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res }
I think this is a misuse of inject, personally, every time I see it. It's
harder to read and it doesn't give the feeling of actually "reducing"
(inject's alias) the array down to one thing. The required `; res` is a
sign of that. Compare:
[1, 2, 3, 4].inject(5) { |a, b| a + b }
There's always each_with_object, although it's a little long:
[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 }
On Mon, Jan 16, 2012 at 17:04, Adam Prescott <adam@aprescott.com> wrote:
On Mon, Jan 16, 2012 at 16:00, Sigurd <cu9ypd@gmail.com> wrote:
[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res }
I think this is a misuse of inject, personally, every time I see it. It's
harder to read and it doesn't give the feeling of actually "reducing"
(inject's alias) the array down to one thing. The required `; res` is a
sign of that. Compare:
[1, 2, 3, 4].inject(5) { |a, b| a + b }
Abinoam,
thank you for running the benchmarks.
I guess the speed difference comes from object creation.
Object#tap and Enumerable#each_with_object create only a minimal number of objects.
-------------------------------------------------------------------------------
Ralph,
I like #1 for its comprehensibility:
# reads as "construct a Hash instance from the inner expression"
Hash
# reads as "collect the values, grouped by themselves"
a.group_by{|n|n}
# counting
map{|k, v|[k, v.size]}
# (this would read even better if Hash had its own map)
-------------------------------------------------------------------------------
I like #2 for comprehensibility, a clean namespace, and speed:
# comprehensibility
# reads like a "list comprehension"
tap{|h|a.each{|n|h[n] += 1}}
# clean namespace
it creates no variables outside the block
# speed
see Abinoam's benchmarks
# (this would read even better if Enumerable had a method for this case)
2012/1/17 Ralph Shnelvar <ralphs@dos32.com>:
Kenichi,
Monday, January 16, 2012, 9:21:51 AM, you wrote:
> 2012/1/17 Ralph Shnelvar <ralphs@dos32.com>:
a = [4,5,6,4,5,6,6,7]
result = Hash.new(0)
a.each { |x| result[x] += 1 }
p result
The result I am getting
{4=>2, 5=>2, 6=>3, 7=>1}
is what I want.
Is there a better way; perhaps using uniq?
> I like this
> a = [4,5,6,4,5,6,6,7]
> # 1
> p Hash[a.group_by{|n|n}.map{|k, v|[k, v.size]}]
> # 2
> p Hash.new(0).tap{|h|a.each{|n|h[n] += 1}}
I like #2. I can understand it. I'm still having trouble wrapping my head around #1.
Having said that, is your #2 better than mine in any dimension (comprehensibility and/or speed of execution)?
--
Kenichi Kamiya
The full code for my recent tests is here:
https://gist.github.com/1663455
Peter
On Mon, Jan 23, 2012 at 3:38 PM, Peter Vandenabeele <peter@vandenabeele.com>wrote:
I added your algorithm to the list and tested on ruby 1.9.3
In what sense is that more "accurate"?
On Jan 16, 2012 4:09 PM, "Sigurd" <cu9ypd@gmail.com> wrote:
Your example is not accurate though:
5 + [1, 2, 3, 4].reduce(&:+)
I think Magnus Holm's is the clearest (IMHO, yes, it's just taste and
humble opinion).
[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) {|num, hsh| hsh[num] += 1}
Another way (not better) I remember is...
Hash[ [4,5,6,4,5,6,6,7].sort.chunk {|n| n}.map {|ix, els| [ix, els.size] } ]
See: Module: Enumerable (Ruby 1.9.3)
It also can be... clearer?!?
Hash[ [4,5,6,4,5,6,6,7].group_by {|n| n}.map {|ix, els| [ix, els.size] } ]
Perhaps something like this (same as Magnus Holm) just hiding the
complexity in the method.
class Array
def totalize_to_hash
hsh = Hash.new(0)
self.each do |n|
hsh[n] += 1
end
hsh
end
end
[4,5,6,4,5,6,6,7].totalize_to_hash
Abinoam Jr.
On Mon, Jan 16, 2012 at 1:48 PM, Magnus Holm <judofyr@gmail.com> wrote:
On Mon, Jan 16, 2012 at 17:04, Adam Prescott <adam@aprescott.com> wrote:
On Mon, Jan 16, 2012 at 16:00, Sigurd <cu9ypd@gmail.com> wrote:
[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res }
I think this is a misuse of inject, personally, every time I see it. It's
harder to read and it doesn't give the feeling of actually "reducing"
(inject's alias) the array down to one thing. The required `; res` is a
sign of that. Compare:
[1, 2, 3, 4].inject(5) { |a, b| a + b }
There's always each_with_object, although it's a little long:
[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 }
I would like to have it.
We can discuss it better here at ruby talk to see the pros and cons.
If somebody is able to do the C code of it...
Perhaps we could issue a feature request.
Abinoam Jr.
On Tue, Jan 17, 2012 at 7:58 AM, Kenichi Kamiya <kachick1@gmail.com> wrote:
# if Enumerable has method for this case
an aproach to http://www.ruby-forum.com/topic/3446541 · GitHub
Well,
it seems not quite accurate to me because of the block. inject uses the convention that the last statement in the block is its return value. The nature of inject is to assign that last value to the memo, which is never actually used in your case. Therefore it's more natural to use the short inject forms: either a.inject(5, :+) or 5 + a.inject(:+). If returning the memo from the block were unnatural, inject would not pass it to the block explicitly.
On the other side, I'm not a proponent of crazy injects that can barely be understood. I think in this case inject can be used as easily as the other solutions provided.
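The two short forms mentioned give the same result as the block version:

```ruby
a = [1, 2, 3, 4]

# Block form, symbol form with an initial value, and adding afterwards
# are all equivalent here.
p a.inject(5) { |memo, x| memo + x }  # => 15
p a.inject(5, :+)                     # => 15
p 5 + a.inject(:+)                    # => 15
```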
On Jan 16, 2012, at 6:14 PM, Adam Prescott wrote:
On Jan 16, 2012 4:09 PM, "Sigurd" <cu9ypd@gmail.com> wrote:
Your example is not accurate though:
5 + [1, 2, 3, 4].reduce(&:+)
In what sense is that more "accurate"?
Some benchmark results...
n = 100_000
Benchmark.bm(15) do |b|
b.report("Ralph Shneiver:") { n.times { a = [4,5,6,4,5,6,6,7];
result = Hash.new(0); a.each { |x| result[x] += 1 }; result} }
b.report("Sigurd:") { n.times {
[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res } } }
b.report("Keinich #1") { n.times { Hash[a.group_by{|n|n}.map{|k,
v|[k, v.size]}] } }
b.report("Keinich #2") { n.times {
Hash.new(0).tap{|h|a.each{|n|h[n] += 1}} } }
b.report("Magnus Holm:") { n.times {
[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1
} } }
b.report("Abinoam #1:") { n.times { Hash[
[4,5,6,4,5,6,6,7].sort.chunk {|n| n}.map {|ix, els| [ix, els.size] } ]
} }
end
user system total real
Ralph Shneiver: 0.290000 0.000000 0.290000 ( 0.259640)
Sigurd: 0.320000 0.000000 0.320000 ( 0.289873)
Keinich #1 0.560000 0.000000 0.560000 ( 0.497736)
Keinich #2 0.280000 0.000000 0.280000 ( 0.250843)
Magnus Holm: 0.310000 0.000000 0.310000 ( 0.283344)
Abinoam #1: 1.140000 0.000000 1.140000 ( 1.042744)
Abinoam Jr.
On Mon, Jan 16, 2012 at 9:22 PM, Abinoam Jr. <abinoam@gmail.com> wrote:
On Mon, Jan 16, 2012 at 1:48 PM, Magnus Holm <judofyr@gmail.com> wrote:
On Mon, Jan 16, 2012 at 17:04, Adam Prescott <adam@aprescott.com> wrote:
On Mon, Jan 16, 2012 at 16:00, Sigurd <cu9ypd@gmail.com> wrote:
[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res }
I think this is a misuse of inject, personally, every time I see it. It's
harder to read and it doesn't give the feeling of actually "reducing"
(inject's alias) the array down to one thing. The required `; res` is a
sign of that. Compare:
[1, 2, 3, 4].inject(5) { |a, b| a + b }
There's always each_with_object, although it's a little long:
[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 }
I think Magnus Holm's is the clearest (IMHO, yes, it's just taste and
humble opinion).
[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) {|num, hsh| hsh[num] += 1}
Another way (not better) I remember is...
Hash[ [4,5,6,4,5,6,6,7].sort.chunk {|n| n}.map {|ix, els| [ix, els.size] } ]
See: Module: Enumerable (Ruby 1.9.3)
It also can be... clearer?!?
Hash[ [4,5,6,4,5,6,6,7].group_by {|n| n}.map {|ix, els| [ix, els.size] } ]
Perhaps something like this (same as Magnus Holm) just hiding the
complexity in the method.
class Array
def totalize_to_hash
hsh = Hash.new(0)
self.each do |n|
hsh[n] += 1
end
hsh
end
end
[4,5,6,4,5,6,6,7].totalize_to_hash
Abinoam Jr.
I found the discussion below in Rails (via Google).
It has the same name and aims at the same goal.
And it was judged "too specific".
I think so too, but this name doesn't suggest any other use case to me.
uhh...
2012/1/17 Abinoam Jr. <abinoam@gmail.com>:
On Tue, Jan 17, 2012 at 7:58 AM, Kenichi Kamiya <kachick1@gmail.com> wrote:
# if Enumerable has method for this case
an aproach to http://www.ruby-forum.com/topic/3446541 · GitHub
I would like to have it.
We can discuss it better here at ruby talk to see the pros and cons.
If somebody is able to do the C code of it...
Perhaps we could issue a feature request.
Abinoam Jr.
--
Kenichi Kamiya
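For what it's worth, the Enumerable method requested in this thread did later land in Ruby core: Enumerable#tally, available since Ruby 2.7, does exactly this:

```ruby
# Requires Ruby 2.7 or later.
p [4, 5, 6, 4, 5, 6, 6, 7].tally  # => {4=>2, 5=>2, 6=>3, 7=>1}
```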