List of removed items using Array#uniq

Claus_Spitzer · 20 July 2005 05:06

Greetings! I have a small dilemma, and was wondering if I could gain
some insight as to how I could work around it the Ruby Way (TM). Here
is the situation:
I am working on a large array where all of the elements are arrays
themselves. The subarrays each have three strings as their elements.
Below is an example of the contents:

[["join", "as", "director"],
["retain", "for", "period"],
["pour", "into", "fund"],
["treat", "like", "royalty"],
["join", "for", "evening"],
["haul", "for", "race"],
["stop", "because", "dispute"],
["increase", "from", "period"],
["keep", "with", "magazine"],
["offer", "in", "years"],
["award", "on", "advertising"],
["gain", "in", "years"],
["leave", "as", "bidder"],
... ] (1)

Now, some of the entries are duplicates. I am removing the duplicates
using Array#uniq, but for the purpose of my work I need to know which
entries were duplicated to begin with (knowing how many duplicates
there were would be a bonus, but is not critical). I figured that I
could do something like
duplicates = tuples - tuples.uniq
but that does not seem to work: the duplicates get removed by uniq
but do not show up in the duplicates array. My guess is that
Array#uniq uses the Array#eql? method to do comparisons, whereas
array - array uses the == method, and since each subarray is a
different object they do not match with ==. Is that correct? Either
way, I was wondering if anyone had suggestions as to how I could make
this work. I would much rather use a good Ruby intrinsic than create
my own kludge to count the duplicates. If possible I'd also like to
keep this as close to O(n) as possible, since I'm handling arrays
with thousands of entries.
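
For reference, a small sketch of why the subtraction comes back empty. As far as I know, arrays with equal contents are equal under both == and eql?, so the comparison method is probably not the culprit. Array#- removes every element that occurs anywhere in the right-hand array, and tuples.uniq contains every distinct entry, so nothing is left over. The data below is made up for illustration.

```ruby
# Illustrative data only: one entry appears twice.
tuples = [["join", "as", "director"],
          ["retain", "for", "period"],
          ["join", "as", "director"]]

# Two separate array objects with the same contents.
a = ["join", "as", "director"]
b = ["join", "as", "director"]
same_by_double_equals = (a == b)      # true: contents match
same_by_eql           = a.eql?(b)     # true: contents match
different_objects     = !a.equal?(b)  # true: equal? tests object identity

# Array#- drops every element that also occurs in the right-hand array,
# so subtracting the uniq'd list removes all copies of everything.
duplicates = tuples - tuples.uniq
p duplicates   # prints []
```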

Many regards...
-CWS

(1) For the curious ones out there, I am working on prepositional
phrase attachment disambiguation... This array is what I use to train
the algorithm.

Charles_Mills1 · 20 July 2005 05:40

Here is an ugly O(n*log(n)) version that only works if the elements are
comparable (which yours are):
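s = a.sort # a == your array
u = []; d = []; prev = nil # u: first copy of each entry, d: the extra copies
s.each do |elem|
  # after sorting, equal entries are adjacent, so compare with the previous one
  if elem == prev then d else u end << elem
  prev = elem
end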
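
To illustrate what the loop above leaves in u and d, here is a minimal run on made-up data (the sample entries and the name dup_entries are only for illustration). Note that d receives every extra copy, so an entry that occurs three times shows up in d twice; d.uniq gives each duplicated entry once.

```ruby
# Illustrative data only: one entry occurs three times.
a = [["retain", "for", "period"],
     ["join", "as", "director"],
     ["retain", "for", "period"],
     ["gain", "in", "years"],
     ["retain", "for", "period"]]

s = a.sort                     # equal entries become adjacent
u = []; d = []; prev = nil
s.each do |elem|
  (elem == prev ? d : u) << elem   # repeat of the previous entry goes to d
  prev = elem
end

dup_entries = d.uniq           # each duplicated entry listed once
p u            # the three distinct entries, in sorted order
p d.size       # prints 2 (the two extra copies)
p dup_entries  # prints [["retain", "for", "period"]]
```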

-Charlie

Matthew_Desmarais · 20 July 2005 05:50

Hi Claus,

I may be way off, but maybe this'll help...

all_phrases = [["join", "as", "director"],
["retain", "for", "period"],
["pour", "into", "fund"],
["treat", "like", "royalty"],
["join", "for", "evening"],
["haul", "for", "race"],
["retain", "for", "period"],
["stop", "because", "dispute"],
["increase", "from", "period"],
["keep", "with", "magazine"],
["offer", "in", "years"],
["award", "on", "advertising"],
["retain", "for", "period"],
["gain", "in", "years"],
["leave", "as", "bidder"]]

phrase_count = Hash.new(0)
all_phrases.each {|phrase| phrase_count[phrase] += 1}

So you should end up with a Hash; the keys are your phrase list uniq'd and the values are a count of the number of times they've appeared...
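
Building on the Hash above, a short sketch of how the duplicated entries and the number of removed copies might be read back out. The shortened sample list and the names duplicated and removed_copies are assumptions for illustration. This makes one pass over the data, so it should stay close to O(n) as long as hashing the small subarrays is cheap.

```ruby
# Shortened, illustrative version of the phrase list.
all_phrases = [["join", "as", "director"],
               ["retain", "for", "period"],
               ["pour", "into", "fund"],
               ["retain", "for", "period"],
               ["retain", "for", "period"]]

phrase_count = Hash.new(0)
all_phrases.each { |phrase| phrase_count[phrase] += 1 }

# Entries that occurred more than once.
duplicated = phrase_count.keys.select { |phrase| phrase_count[phrase] > 1 }

# How many copies uniq would have dropped for each duplicated entry.
removed_copies = {}
duplicated.each { |phrase| removed_copies[phrase] = phrase_count[phrase] - 1 }

p duplicated                                   # prints [["retain", "for", "period"]]
p removed_copies[["retain", "for", "period"]]  # prints 2
```

One caveat: the subarrays are used as Hash keys, so they should not be modified after counting, otherwise lookups can miss until the Hash is rehashed.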

Matthew

Mill_Mill · 21 July 2005 04:02

all = [["join", "as", "director"],
["retain", "for", "period"],
["pour", "into", "fund"],
["treat", "like", "royalty"],
["join", "for", "evening"],
["haul", "for", "race"],
["retain", "for", "period"],
["stop", "because", "dispute"],
["increase", "from", "period"],
["keep", "with", "magazine"],
["offer", "in", "years"],
["award", "on", "advertising"],
["retain", "for", "period"],
["gain", "in", "years"],
["gain", "in", "years"],
["leave", "as", "bidder"]]

duplicates = all.uniq.select{|x|all.select{|y|y==x}.size>1}
duplicate_with_count = all.inject(Hash.new(0)){|n,x|n[x]+=1;n}.select{|k,v|v>1}

hope this helps.

-mill

Claus_Spitzer · 21 July 2005 16:29

Thanks to all who sent in suggestions. They are much appreciated!
Regards...
-CWS

Mill_Mill · 21 July 2005 09:39

Mill Mill <koyakam <at> gmail.com> writes:

duplicates = all.uniq.select{|x|all.select{|y|y==x}.size>1}
duplicate_with_count = all.inject(Hash.new(0)){|n,x|n[x]+=1;n}.select{|k,v|v>1}

hope this helps.

-mill

and this may be better too:
duplicates_with_count=Hash.new(1)
s=all.sort
# skip i == 0: s[-1] is the last element, not a previous one
s.each_index{|i| duplicates_with_count[s[i]]+=1 if i>0 && s[i]==s[i-1]}
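
For readers on a recent Ruby, the same counts can be sketched with Enumerable#tally. This is an assumption about the environment: tally needs Ruby 2.7 or newer, which did not exist when this thread was written. The shortened sample list is for illustration only.

```ruby
# Shortened, illustrative version of the list above.
all = [["join", "as", "director"],
       ["retain", "for", "period"],
       ["retain", "for", "period"],
       ["gain", "in", "years"],
       ["retain", "for", "period"],
       ["gain", "in", "years"]]

# tally counts occurrences of each element in a single pass.
counts = all.tally

# Keep only the entries seen more than once (a Hash of entry => count).
duplicates_with_count = counts.select { |_entry, count| count > 1 }

p duplicates_with_count.keys
# prints [["retain", "for", "period"], ["gain", "in", "years"]]
p duplicates_with_count.values   # prints [3, 2]
```

Unlike the sort-based version, this does not need the entries to be comparable and avoids the O(n*log(n)) sort.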
