Remove duplicates from an array of objects based on an attribute

Hi all,

How do I remove duplicates from an array of objects based on an
attribute of the objects? For example, I have an array of Ruby beans
named diagnoses, and I want to remove duplicates from it based on the
diagnosis id. Assume each diagnosis has the attributes id and
weightage. For two diagnoses with the same id and different weightage,
the one with the lower weightage should be removed.

Can anyone help me?

···


    module Enumerable
      # Groups elements by the block's result; returns an array of groups.
      def group_by &b
        h = Hash.new{|h,k| h[k] = []}
        each{|x| h[x.instance_eval(&b)] << x}
        h.values
      end
    end

    old_diagnoses = [
      {:id => 1, :w => 30},
      {:id => 2, :w => 20},
      {:id => 3, :w => 10},
      {:id => 1, :w => 10},
      {:id => 1, :w => 40},
      {:id => 2, :w => 50},
      {:id => 4, :w => 60},
      {:id => 4, :w => 30},
      {:id => 2, :w => 20},
      {:id => 3, :w => 10}
    ]
    new_diagnoses = []

    groups = old_diagnoses.group_by{ |d| d[:id] }

    groups.each do |group|
      new_diagnoses << group.sort_by{ |g| g[:w] }.last
    end

    p old_diagnoses
    p new_diagnoses

[{:w=>30, :id=>1}, {:w=>20, :id=>2}, {:w=>10, :id=>3}, {:w=>10, :id=>1},
{:w=>40, :id=>1}, {:w=>50, :id=>2}, {:w=>60, :id=>4}, {:w=>30, :id=>4},
{:w=>20, :id=>2}, {:w=>10, :id=>3}]

[{:w=>40, :id=>1}, {:w=>50, :id=>2}, {:w=>10, :id=>3}, {:w=>60, :id=>4}]
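
A note for readers on Ruby 1.9 or later: Enumerable#group_by is built in there and returns a Hash keyed by the block's result, so the monkey patch above is not needed (and would override the built-in). Here is a minimal sketch of the same idea using the built-in together with max_by, assuming the same hash-shaped data. The method name heaviest_per_id is made up for illustration.

```ruby
# Sketch only: keep the entry with the highest :w for each :id.
# Relies on the built-in Enumerable#group_by (returns a Hash) and max_by.
def heaviest_per_id(diagnoses)
  groups = diagnoses.group_by { |d| d[:id] }   # {1 => [...], 2 => [...], ...}
  groups.values.map { |group| group.max_by { |d| d[:w] } }
end

old_diagnoses = [
  {:id => 1, :w => 30},
  {:id => 2, :w => 20},
  {:id => 1, :w => 40},
  {:id => 2, :w => 50},
  {:id => 3, :w => 10}
]

p heaviest_per_id(old_diagnoses)
```

Using max_by instead of sort_by{...}.last avoids sorting each group just to read one element.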

···

On 3/6/07, senthil <senthilkumar@srishtisoft.com> wrote:

           i am having an array of ruby beans named diagnoses . i want
remove duplicates from the based on the diagnoses id. assume diagnoses
have attributes id and weightage .So for two diagnoses with same id and
different weightage , the diagnoses with lower weightage should be
removed.
           Can anyone help me??

From: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/228538

Here's my best shot at it:

require 'set'
class Array
  def uniq_by
    seen = Set.new
    select{ |x| seen.add?( yield( x ) ) }
  end
end

a = [ {:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3} ]
p a, a.uniq, a.uniq_by{ |h| h[:a] }
#=> [{:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3}]
#=> [{:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3}]
#=> [{:a=>1, :d=>1}, {:b=>2}]

(Note how :b=>2 and :c=>3 have the same value for :a (nil), so only
one is included.)

Here's another (assumedly slower) version that doesn't rely on Set:

class Array
  def uniq_by
    seen = {}
    select{ |x|
      v = yield(x)
      !seen[v] && (seen[v]=true)
    }
  end
end
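
For readers on Ruby 1.9.2 or later: the built-in Array#uniq accepts a block and keeps the first element seen for each block result, which is the same behaviour as the uniq_by versions above, including the way elements without the key collapse under nil. A short sketch with the same sample data:

```ruby
# Array#uniq with a block (Ruby 1.9.2+): the first element seen for each
# block result is kept, later ones with the same result are dropped.
a = [ {:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3} ]

by_a = a.uniq { |h| h[:a] }
p by_a
```

Like uniq_by, this keeps the first match and does not look at any other attribute, so it does not by itself pick the entry with the highest weightage.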

···

On Mar 6, 7:03 am, senthil <senthilku...@srishtisoft.com> wrote:

hi all,
          how to remove duplicates of an array of objects based a
attribute of the object. For ex
           i am having an array of ruby beans named diagnoses . i want
remove duplicates from the based on the diagnoses id. assume diagnoses
have attributes id and weightage .So for two diagnoses with same id and
different weightage , the diagnoses with lower weightage should be
removed.

Senthil, please don't take this personally. Your question is OK, but the following sounds so very wrong:

           i am having an array of ruby beans (...)

All we have in Ruby are objects. No beans, POROs, ERBs, and all this cruft.

Regards,
Pit

And here's the inevitable one-liner... :}

(But I do prefer the group_by version...)

gegroet,
Erik V. - http://www.erikveen.dds.nl/

···

----------------------------------------------------------------

################################################################

arr = [
   {:id => 1, :w => 30},
   {:id => 2, :w => 20},
   {:id => 3, :w => 10},
   {:id => 1, :w => 10},
   {:id => 1, :w => 40},
   {:id => 2, :w => 50},
   {:id => 4, :w => 60},
   {:id => 4, :w => 30},
   {:id => 2, :w => 20},
   {:id => 3, :w => 10}
]

################################################################

res1=arr.inject({}){|h,o|(h[o[:id]]||=[])<<o;h}.values.map{|a|
a.sort_by{|o|o[:w]}.pop}

################################################################

res2 =
arr.inject({}) do |h,o|
   (h[o[:id]] ||= []) << o ; h
end.values.collect do |a|
   a.sort_by do |o|
     o[:w]
   end.pop
end

################################################################

module Enumerable
   def hash_by(&block)
     inject({}){|h, o| (h[block.call(o)] ||= []) << o ; h}
   end

   def group_by(&block)
     hash_by(&block).sort.transpose.pop
   end
end

res3 =
arr.group_by do |o|
   o[:id]
end.collect do |a|
   a.sort_by do |o|
     o[:w]
   end.pop
end

################################################################

p res1
p res2
p res3

################################################################

----------------------------------------------------------------
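
All three variants above group the entries and then sort each group. If only the heaviest entry per id is needed, a single pass that keeps a running best per id avoids the sort. Here is a rough sketch, assuming the diagnoses are plain objects with id and weightage readers as in the original question. The Diagnosis struct and the method name dedupe_by_id are invented for illustration.

```ruby
# Hypothetical stand-in for the poster's diagnosis objects.
Diagnosis = Struct.new(:id, :weightage)

# Keep, for each id, the diagnosis with the highest weightage (one pass).
def dedupe_by_id(diagnoses)
  best = {}
  diagnoses.each do |d|
    current = best[d.id]
    best[d.id] = d if current.nil? || d.weightage > current.weightage
  end
  best.values
end

diagnoses = [
  Diagnosis.new(1, 30), Diagnosis.new(2, 20), Diagnosis.new(1, 40),
  Diagnosis.new(2, 50), Diagnosis.new(3, 10), Diagnosis.new(3, 10)
]

dedupe_by_id(diagnoses).each { |d| puts "#{d.id}: #{d.weightage}" }
```

When two diagnoses tie on weightage, this sketch keeps the one seen first.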

Huh...actually, the hash-based one seems faster than the Set-based
one:

  require 'set'
  class Array
    def uniq_by1
      seen = Set.new
      select{ |x| seen.add?( yield( x ) ) }
    end
    def uniq_by2
      seen = {}
      select{ |x| !seen[v=yield(x)] && (seen[v]=true) }
    end
  end

  require 'benchmark'
  a = [ {:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3},
        {:a=>2, :e=>7}, {:a=>3, :b=>2}, {:a=>1}, {:a=>4}, {:f=>6} ]
  N = 10_000
  Benchmark.bmbm{ |x|
    x.report( 'with_set' ){
      N.times{
        a.uniq_by1{ |h| h[:a] }
        a.uniq_by1{ |h| h[:b] }
      }
    }
    x.report( 'with_hash' ){
      N.times{
        a.uniq_by2{ |h| h[:a] }
        a.uniq_by2{ |h| h[:b] }
      }
    }
  }

  #=> Rehearsal ---------------------------------------------
  #=> with_set    1.840000   0.030000   1.870000 (  2.401238)
  #=> with_hash   1.270000   0.030000   1.300000 (  1.701307)
  #=> ------------------------------------ total: 3.170000sec
  #=>
  #=>                 user     system      total        real
  #=> with_set    1.820000   0.020000   1.840000 (  2.187477)
  #=> with_hash   1.250000   0.020000   1.270000 (  1.555490)

(Yes, my laptop is rather old and slow.)

···

On Mar 6, 7:27 am, "Phrogz" <g...@refinery.com> wrote:

Here's another (assumedly slower) version that doesn't rely on Set:

Erik Veenstra wrote:

And here's the inevitable one-liner... :}

Not that we're golfing, but I like this one better in terms of one-
linedness:
  Hash[ *map{ |o| [ o[:id], o ] }.flatten ].values

Oops, I meant:
  Hash[ *a.map{ |o| [ o[:id], o ] }.flatten ].values
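
Worth noting about this one-liner: when two entries share an :id, the later pair overwrites the earlier one, so the last entry seen wins regardless of :w. Here is a small sketch of that behaviour. Passing the pairs directly to Hash[] instead of splatting a flattened list is a substitution here (it avoids surprises when the elements are themselves arrays); it is not what the line above does.

```ruby
a = [
  {:id => 1, :w => 30},
  {:id => 2, :w => 50},
  {:id => 1, :w => 10},
  {:id => 2, :w => 20}
]

# Hash[] accepts an array of [key, value] pairs; a later pair with the
# same key overwrites the earlier one.
last_seen = Hash[ a.map{ |o| [ o[:id], o ] } ].values
p last_seen
```

Here the entries with :w 10 and :w 20 survive, even though heavier entries for both ids came earlier.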

···

On Mar 6, 1:47 pm, "Phrogz" <g...@refinery.com> wrote:

Erik Veenstra wrote:
> And here's the inevitable one-liner... :}

Not that we're golfing, but I like this one better in terms of one-
linedness:
  Hash[ *map{ |o| [ o[:id], o ] }.flatten ].values

And faster still, by a hair, is a last-in approach. Upon reflection,
all these techniques rely only on methods already in Enumerable, so
they can be put there instead of being Array-specific.

  module Enumerable
    require 'set'
    def uniq_by1
      seen = Set.new
      select{ |x| seen.add?( yield( x ) ) }
    end
    def uniq_by2
      seen = {}
      select{ |x| !seen[v=yield(x)] && (seen[v]=true) }
    end
    def uniq_by3
      Hash[ *map{ |x| [ yield(x), x ] }.flatten ].values
    end

    def uniq_by4
      # fastest, preserves last-seen value for a key
      h = {}
      each{ |x| h[yield(x)] = x }
      h.values
    end

    def uniq_by5
      # near-fastest, preserves first-seen value for a key
      h = {}
      each{ |x| v=yield(x); h[v]=x unless h.include?(v) }
      h.values
    end
  end

  a = [ {:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3},
        {:a=>2, :e=>7}, {:a=>3, :b=>2}, {:a=>1}, {:a=>4}, {:f=>6} ]

  require 'benchmark'
  N = 20_000
  Benchmark.bmbm{ |x|
    x.report( 'with set' ){
      N.times{
        a.uniq_by1{ |h| h[:a] }
        a.uniq_by1{ |h| h[:b] }
      }
    }
    x.report( 'with hash' ){
      N.times{
        a.uniq_by2{ |h| h[:a] }
        a.uniq_by2{ |h| h[:b] }
      }
    }
    x.report( 'Hash..values' ){
      N.times{
        a.uniq_by3{ |h| h[:a] }
        a.uniq_by3{ |h| h[:b] }
      }
    }
    x.report( '#values (last in)' ){
      N.times{
        a.uniq_by4{ |h| h[:a] }
        a.uniq_by4{ |h| h[:b] }
      }
    }
    x.report( '#values (first in)' ){
      N.times{
        a.uniq_by5{ |h| h[:a] }
        a.uniq_by5{ |h| h[:b] }
      }
    }
  }

  #=> Rehearsal ------------------------------------------------------
  #=> with set             2.500000   0.016000   2.516000 (  2.547000)
  #=> with hash            1.312000   0.000000   1.312000 (  1.313000)
  #=> Hash..values         2.453000   0.000000   2.453000 (  2.453000)
  #=> #values (last in)    1.110000   0.000000   1.110000 (  1.109000)
  #=> #values (first in)   1.296000   0.000000   1.296000 (  1.297000)
  #=> --------------------------------------------- total: 8.687000sec
  #=>
  #=>                          user     system      total        real
  #=> with set             2.000000   0.000000   2.000000 (  1.999000)
  #=> with hash            1.297000   0.000000   1.297000 (  1.297000)
  #=> Hash..values         2.531000   0.000000   2.531000 (  2.532000)
  #=> #values (last in)    1.125000   0.015000   1.140000 (  1.140000)
  #=> #values (first in)   1.344000   0.000000   1.344000 (  1.344000)
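
The last-in and first-in variants are not interchangeable: they return different elements whenever a key repeats. Here is a small sketch of the two strategies written out inline, on a shortened sample array chosen for illustration:

```ruby
a = [ {:a=>1, :d=>1}, {:a=>2}, {:a=>1, :d=>3} ]

# last-in: each later element overwrites the stored one for its key
last_in = {}
a.each{ |x| last_in[x[:a]] = x }

# first-in: store an element only if its key has not been seen yet
first_in = {}
a.each{ |x| k = x[:a]; first_in[k] = x unless first_in.include?(k) }

p last_in.values
p first_in.values
```

For the key 1, the last-in hash ends up holding the element with :d=>3 and the first-in hash the one with :d=>1, so the choice between them should follow from which duplicate is wanted, not only from the timings.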

···

On Mar 6, 7:40 am, "Phrogz" <g...@refinery.com> wrote:

On Mar 6, 7:27 am, "Phrogz" <g...@refinery.com> wrote:

> Here's another (assumedly slower) version that doesn't rely on Set:

Huh...actually, the hash-based one seems faster than the Set-based
one:

Hash[ *a.map{ |o| [ o[:id], o ] }.flatten ].values

Not bad...

How does this ensure that the maximum :w is used?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

Hash[ *a.map{ |o| [ o[:id], o ] }.flatten ].values
=> [{:id=>1, :w=>40}, {:id=>2, :w=>20}, {:id=>3, :w=>10}, {:id=>4, :w=>30}]

Hash[*(a.sort_by{|z|z[:id]}).map{|o|[o[:id],o]}.flatten].values
=> [{:id=>1, :w=>40}, {:id=>2, :w=>50}, {:id=>3, :w=>10}, {:id=>4, :w=>60}]
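
One caution on the second expression: sorting by :id only brings equal ids together. It does not order them by :w, and sort_by is not guaranteed to be stable, so the maximum is not guaranteed in general. Sorting by :w first does make the last pair stored for each id the one with the largest :w. Here is a sketch under that assumption, using the sample data from earlier in the thread and passing the pairs straight to Hash[] (a substitution for the flatten-and-splat form):

```ruby
a = [
  {:id => 1, :w => 30}, {:id => 2, :w => 20}, {:id => 3, :w => 10},
  {:id => 1, :w => 10}, {:id => 1, :w => 40}, {:id => 2, :w => 50},
  {:id => 4, :w => 60}, {:id => 4, :w => 30}, {:id => 2, :w => 20},
  {:id => 3, :w => 10}
]

# Ascending by :w, so for every id the heaviest entry is stored last
# and overwrites the lighter ones.
by_max_w = Hash[ a.sort_by{ |o| o[:w] }.map{ |o| [ o[:id], o ] } ].values

# Sorted by :id only to make the printed order predictable.
p by_max_w.sort_by{ |o| o[:id] }
```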

···

On 3/6/07, Erik Veenstra <erikveen@gmail.com> wrote:

> Hash[ *a.map{ |o| [ o[:id], o ] }.flatten ].values

Not bad...

How does this ensure that the maximum :w is used?