[Q] removing array duplicates where a subset is unique

I need to remove duplicates from an array of arrays. I can't use Array#uniq because some fields are different and not part of the "key." Here's an example where the first 3 elements of each sub array are the "key" and determine uniqueness. I want to keep only the first one I get.

>> a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]
=> [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

The return value of deduplicating this array should be: [[1, 2, 3, 4, 5]]

Here is my first attempt at solving the problem:

>> def dedup ary
>> ary.map do |line|
?> dupes = ary.select { |row| row[0..2] == line[0..2] }
?> dupes.first
>> end.uniq
>> end
=> nil
>>
?> dedup a
=> [[1, 2, 3, 4, 5]]

This works. However, it is *super slow* when operating on my dataset. My arrays contain hundreds of thousands of sub arrays. The unique key for each sub array is the first 12 (of 18) elements. It is taking many seconds to produce each intermediate array ("dupes" in the example above), so deduping the entire thing would likely take days.

Anyone have a superior and faster solution?

cr

Chuck Remes wrote:

I need to remove duplicates from an array of arrays. I can't use Array#uniq because some fields are different and not part of the "key." Here's an example where the first 3 elements of each sub array are the "key" and determine uniqueness. I want to keep only the first one I get.

>> a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]
=> [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

The return value of deduplicating this array should be: [[1, 2, 3, 4, 5]]

Might be faster if your intermediate is a hash, so instead of N**2 time it's N.

a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

h = {}
a.each do |row|
   h[row[0..2]] ||= row # record the first match
end

p h.values # ==> [[1, 2, 3, 4, 5]]

Note that the output may come out in a different order. Does that matter?

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Hi --

I need to remove duplicates from an array of arrays. I can't use Array#uniq because some fields are different and not part of the "key." Here's an example where the first 3 elements of each sub array are the "key" and determine uniqueness. I want to keep only the first one I get.

a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

=> [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

The return value of deduplicating this array should be: [[1, 2, 3, 4, 5]]

Here is my first attempt at solving the problem:

def dedup ary
  ary.map do |line|

?> dupes = ary.select { |row| row[0..2] == line[0..2] }
?> dupes.first

  end.uniq
end

=> nil

?> dedup a
=> [[1, 2, 3, 4, 5]]

This works. However, it is *super slow* when operating on my dataset. My arrays contain hundreds of thousands of sub arrays. The unique key for each sub array is the first 12 (of 18) elements. It is taking many seconds to produce each intermediate array ("dupes" in the example above), so deduping the entire thing would likely take days.

Anyone have a superior and faster solution?

See if this speeds it up meaningfully (and make sure I've got the
logic right):

   def dedup(ary)
     uniq = {}
     ary.each do |line|
       uniq[ary[0..2]] ||= line
     end
     uniq.values
   end

David

···

On Sat, 18 Jul 2009, Chuck Remes wrote:

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2\)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN\)

David A. Black wrote:

   def dedup(ary)
     uniq = {}
     ary.each do |line|
       uniq[ary[0..2]] ||= line
     end
     uniq.values
   end

Sweet! (I love Ruby; thanks Matz.)

Regards,

···

--
Bil

David and Joel,

you both provided the same solution. I will test this to see what kind of performance I get. It will be hell on memory, but I assumed any solution likely would be. (And Joel, I have presorted the array prior to removing the dupes so I have already taken care of the ordering issue.)

Thanks for your suggestions.

cr

···

On Jul 17, 2009, at 5:39 PM, David A. Black wrote:

Hi --

On Sat, 18 Jul 2009, Chuck Remes wrote:

I need to remove duplicates from an array of arrays. I can't use Array#uniq because some fields are different and not part of the "key." Here's an example where the first 3 elements of each sub array are the "key" and determine uniqueness. I want to keep only the first one I get.

a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

=> [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

The return value of deduplicating this array should be: [[1, 2, 3, 4, 5]]

Here is my first attempt at solving the problem:

def dedup ary
ary.map do |line|

?> dupes = ary.select { |row| row[0..2] == line[0..2] }
?> dupes.first

end.uniq
end

=> nil
?> dedup a
=> [[1, 2, 3, 4, 5]]

This works. However, it is *super slow* when operating on my dataset. My arrays contain hundreds of thousands of sub arrays. The unique key for each sub array is the first 12 (of 18) elements. It is taking many seconds to produce each intermediate array ("dupes" in the example above), so deduping the entire thing would likely take days.

Anyone have a superior and faster solution?

See if this speeds it up meaningfully (and make sure I've got the
logic right):

def dedup(ary)
   uniq = {}
   ary.each do |line|
     uniq[ary[0..2]] ||= line
   end
   uniq.values
end

I think what Joel was referring to was that in Ruby 1.8 a Hash doesn't
maintain insertion order when traversed (Ruby 1.9 does maintain
insertion order):

ruby 1.8.2 (2004-12-25) [powerpc-darwin8.0]:
irb(main):001:0> h = {}
=> {}
irb(main):002:0> 5.times{|n| h[n] = n}
=> 5
irb(main):003:0> h
=> {0=>0, 1=>1, 2=>2, 3=>3, 4=>4}
irb(main):004:0> h["sadf"] = 3
=> 3
irb(main):005:0> h
=> {0=>0, 1=>1, "sadf"=>3, 2=>2, 3=>3, 4=>4}

···

On Fri, Jul 17, 2009 at 7:51 PM, Chuck Remes<cremes.devlist@mac.com> wrote:

(And Joel, I have presorted the array prior to removing the
dupes so I have already taken care of the ordering issue.)

Hi --

Hi --

I need to remove duplicates from an array of arrays. I can't use Array#uniq because some fields are different and not part of the "key." Here's an example where the first 3 elements of each sub array are the "key" and determine uniqueness. I want to keep only the first one I get.

a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

=> [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

The return value of deduplicating this array should be: [[1, 2, 3, 4, 5]]

Here is my first attempt at solving the problem:

def dedup ary
ary.map do |line|

?> dupes = ary.select { |row| row[0..2] == line[0..2] }
?> dupes.first

end.uniq
end

=> nil
?> dedup a
=> [[1, 2, 3, 4, 5]]

This works. However, it is *super slow* when operating on my dataset. My arrays contain hundreds of thousands of sub arrays. The unique key for each sub array is the first 12 (of 18) elements. It is taking many seconds to produce each intermediate array ("dupes" in the example above), so deduping the entire thing would likely take days.

Anyone have a superior and faster solution?

See if this speeds it up meaningfully (and make sure I've got the
logic right):

def dedup(ary)
  uniq = {}
  ary.each do |line|
    uniq[ary[0..2]] ||= line
  end
  uniq.values
end

David and Joel,

you both provided the same solution. I will test this to see what kind of performance I get. It will be hell on memory, but I assumed any solution likely would be. (And Joel, I have presorted the array prior to removing the dupes so I have already taken care of the ordering issue.)

I believe the version you had originally, where you do a mapping of
the whole array, will typically use much more memory than the hash
version. Let's say your original array has 1000 inner arrays, with 10
that are considered unique. The mapping will be a new array, also of
1000 elements. The hash will have 10 key/value pairs -- thus much
smaller.

David

···

On Sat, 18 Jul 2009, Chuck Remes wrote:

On Jul 17, 2009, at 5:39 PM, David A. Black wrote:

On Sat, 18 Jul 2009, Chuck Remes wrote:

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2\)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN\)

Hi --

(And Joel, I have presorted the array prior to removing the
dupes so I have already taken care of the ordering issue.)

I think what Joel was referring to was that in Ruby 1.8 a Hash doesn't
maintain insertion order when traversed (Ruby 1.9 does maintain
insertion order):

ruby 1.8.2 (2004-12-25) [powerpc-darwin8.0]:
irb(main):001:0> h = {}
=> {}
irb(main):002:0> 5.times{|n| h[n] = n}
=> 5
irb(main):003:0> h
=> {0=>0, 1=>1, 2=>2, 3=>3, 4=>4}
irb(main):004:0> h["sadf"] = 3
=> 3
irb(main):005:0> h
=> {0=>0, 1=>1, "sadf"=>3, 2=>2, 3=>3, 4=>4}

If you wanted to maintain order you could (for some but probably not
much performance penalty) do something like:

   def dedup(ary)
     uniq = {}
     res =
     ary.each do |line|
       key = line[0..2]
       next if uniq[key]
       res << (uniq[key] = line)
     end
     res
   end

David

···

On Sat, 18 Jul 2009, brabuhr@gmail.com wrote:

On Fri, Jul 17, 2009 at 7:51 PM, Chuck Remes<cremes.devlist@mac.com> wrote:

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2\)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN\)

Oh yes, my version had terrible execution performance and memory performance. I was trying to figure out how to use a hash but did not make the leap to the ||= construction on my own. I knew I was missing something obvious... all of your rapid responses proved it.

FYI, the dedup code you provided performs quite admirably. I'll take a look at its memory footprint when I get in the office Monday and report back.

cr

···

On Jul 17, 2009, at 7:55 PM, David A. Black wrote:

Hi --

On Sat, 18 Jul 2009, Chuck Remes wrote:

David and Joel,

you both provided the same solution. I will test this to see what kind of performance I get. It will be hell on memory, but I assumed any solution likely would be. (And Joel, I have presorted the array prior to removing the dupes so I have already taken care of the ordering issue.)

I believe the version you had originally, where you do a mapping of
the whole array, will typically use much more memory than the hash
version. Let's say your original array has 1000 inner arrays, with 10
that are considered unique. The mapping will be a new array, also of
1000 elements. The hash will have 10 key/value pairs -- thus much
smaller.

Hi---

Hi --

(And Joel, I have presorted the array prior to removing the
dupes so I have already taken care of the ordering issue.)

I think what Joel was referring to was that in Ruby 1.8 a Hash doesn't
maintain insertion order when traversed (Ruby 1.9 does maintain
insertion order):

ruby 1.8.2 (2004-12-25) [powerpc-darwin8.0]:
irb(main):001:0> h = {}
=> {}
irb(main):002:0> 5.times{|n| h[n] = n}
=> 5
irb(main):003:0> h
=> {0=>0, 1=>1, 2=>2, 3=>3, 4=>4}
irb(main):004:0> h["sadf"] = 3
=> 3
irb(main):005:0> h
=> {0=>0, 1=>1, "sadf"=>3, 2=>2, 3=>3, 4=>4}

If you wanted to maintain order you could (for some but probably not
much performance penalty) do something like:

def dedup(ary)
   uniq = {}
   res =
   ary.each do |line|
     key = line[0..2]
     next if uniq[key]
     res << (uniq[key] = line)
   end
   res
end

David

I missed the beginning of this thread, but here is an implementation I've used successfully:

def uniq_by(subject, &block)
   h = {}
   a =
   subject.each do |s|
     comparator = yield(s)
     unless h[comparator]
       a.push(s)
       h[comparator] = s
     end
   end
   a
end

Usage:

u = uniq_by(ary|{ |item| item.element }

Basically, what this allows you to do is specify what exactly about an array item must be unique. It also preserves the original array order, with a "first entry wins" approach to duplicate elimination.

Hope this is useful.

···

On Jul 17, 2009, at 6:14 PM, David A. Black wrote:

On Sat, 18 Jul 2009, brabuhr@gmail.com wrote:

On Fri, Jul 17, 2009 at 7:51 PM, Chuck >> Remes<cremes.devlist@mac.com> wrote:

Chuck Remes wrote:

(And Joel, I have presorted the array prior
to removing the dupes so I have already taken care of the ordering
issue.)

That's well and good, but in the process of using a hash to remove the
duplicates, the result will be out of order. See example below.

I was trying to figure out how to use a hash but did not
make the leap to the ||= construction on my own.

A simple if statement will achieve the same result:

a = [
  [1, 2, 3, 10],
  [1, 2, 3, 20],
  [1, 2, 3, 30],
  [2, 2, 3, 40]
]

h = {}

a.each do |suba|
  key = suba.slice(0, 3)

  if h[key]
    next
  else
    h[key] = suba
  end

end

p h.values

--output:--
[[2, 2, 3, 40], [1, 2, 3, 10]]

···

--
Posted via http://www.ruby-forum.com/\.

Hi --

···

On Mon, 20 Jul 2009, 7stud -- wrote:

Chuck Remes wrote:

(And Joel, I have presorted the array prior
to removing the dupes so I have already taken care of the ordering
issue.)

That's well and good, but in the process of using a hash to remove the
duplicates, the result will be out of order. See example below.

I was trying to figure out how to use a hash but did not
make the leap to the ||= construction on my own.

A simple if statement will achieve the same result:

a = [
[1, 2, 3, 10],
[1, 2, 3, 20],
[1, 2, 3, 30],
[2, 2, 3, 40]
]

h = {}

a.each do |suba|
key = suba.slice(0, 3)

if h[key]
   next
else
   h[key] = suba
end

end

p h.values

--output:--
[[2, 2, 3, 40], [1, 2, 3, 10]]

I'm not sure what that buys you, though. The ||= idiom should work
fine, unless you need to do something extra during the if statement.

David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2\)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN\)

True, but in my case the order only matters for *picking* the first of the duplicates. I understand hashes in 1.8 don't maintain order, but that isn't relevant since I don't care about output order.

I suppose I could have written a longer explanation of what I needed, but I favored brevity over those other details.

cr

···

On Jul 19, 2009, at 3:43 PM, 7stud -- wrote:

Chuck Remes wrote:

(And Joel, I have presorted the array prior
to removing the dupes so I have already taken care of the ordering
issue.)

That's well and good, but in the process of using a hash to remove the
duplicates, the result will be out of order. See example below.

David A. Black wrote:

Hi --

[2, 2, 3, 40]
   h[key] = suba
end

end

p h.values

--output:--
[[2, 2, 3, 40], [1, 2, 3, 10]]

I'm not sure what that buys you, though. The ||= idiom should work
fine, unless you need to do something extra during the if statement.

David

Facets has enumerable#uniq_by implemented like this:

def uniq_by #:yield:
    h = {}; inject() {|a,x| h[yield(x)] ||= a << x}
end

So you can do:

  require 'facets'
  a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]
  res = a.uniq_by{|el| el[0..2]}

Don't know about performance.

hth,

Siep

···

On Mon, 20 Jul 2009, 7stud -- wrote:

--
Posted via http://www.ruby-forum.com/\.

David A. Black wrote:

I'm not sure what that buys you, though.

Simplicity of comprehension over brevity.

The ||= idiom should work
fine,

Of course.

···

--
Posted via http://www.ruby-forum.com/\.

Hi --

···

On Mon, 20 Jul 2009, 7stud -- wrote:

David A. Black wrote:

I'm not sure what that buys you, though.

Simplicity of comprehension over brevity.

I guess I like the ||= idiom because it has both. But the if statement
will definitely work, of course.

David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2\)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN\)

Siep Korteling wrote:

Don't know about performance.

Just remember this:

inject() = shite

···

--
Posted via http://www.ruby-forum.com/\.

Ooh, I like this one too. I'll try it out and see what kind of performance I get. I understand that #inject performance is weak in MRI 1.8, but I primarily use JRuby where it doesn't share the same performance issues as the C version.

cr

···

On Jul 19, 2009, at 5:07 PM, Siep Korteling wrote:

David A. Black wrote:

Hi --

On Mon, 20 Jul 2009, 7stud -- wrote:

[2, 2, 3, 40]
  h[key] = suba
end

end

p h.values

--output:--
[[2, 2, 3, 40], [1, 2, 3, 10]]

I'm not sure what that buys you, though. The ||= idiom should work
fine, unless you need to do something extra during the if statement.

David

Facets has enumerable#uniq_by implemented like this:

def uniq_by #:yield:
   h = {}; inject() {|a,x| h[yield(x)] ||= a << x}
end

David A. Black wrote:

Hi --

David A. Black wrote:

I'm not sure what that buys you, though.

Simplicity of comprehension over brevity.

I guess I like the ||= idiom because it has both.

For you, yes. But apparently not for the op:

I was trying to figure out how to use a hash but did not
make the leap to the ||= construction on my own.

The ||= idiom should work
fine

..until it doesn't (I think you know what I'm refering to).

···

On Mon, 20 Jul 2009, 7stud -- wrote:

--
Posted via http://www.ruby-forum.com/\.

Hi --

David A. Black wrote:

Hi --

David A. Black wrote:

I'm not sure what that buys you, though.

Simplicity of comprehension over brevity.

I guess I like the ||= idiom because it has both.

For you, yes. But apparently not for the op:

I was trying to figure out how to use a hash but did not
make the leap to the ||= construction on my own.

I wouldn't recommend freezing one's knowledge or leap abilities at a
particular point, though :slight_smile: I'm certainly in sympathy with being
suspicious of punctuation-heavy stuff, but in the case of ||= it's
such a common idiom, and so easy to learn, that it seems like a bit of
an artificial hardship not to learn it. Still, if will work too :slight_smile:

The ||= idiom should work
fine

..until it doesn't (I think you know what I'm refering to).

The hash default thing? I don't think that comes into play here, does
it?

David

···

On Mon, 20 Jul 2009, 7stud -- wrote:

On Mon, 20 Jul 2009, 7stud -- wrote:

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2\)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN\)