List of removed items using Array#uniq

Greetings! I have a small dilemma, and was wondering if I could gain
some insight as to how I could work around it the Ruby Way (TM). Here
is the situation:
I am working on a large array where all of the elements are arrays
themselves. The subarrays each have three strings as their elements.
Below is an example of the contents:

[["join", "as", "director"],
["retain", "for", "period"],
["pour", "into", "fund"],
["treat", "like", "royalty"],
["join", "for", "evening"],
["haul", "for", "race"],
["stop", "because", "dispute"],
["increase", "from", "period"],
["keep", "with", "magazine"],
["offer", "in", "years"],
["award", "on", "advertising"],
["gain", "in", "years"],
["leave", "as", "bidder"],
... ] (1)

Now, some of the entries are duplicates. I am removing the duplicates
using Array#uniq, but for the purpose of my work I need to know which
entries were duplicated to begin with (knowing how many duplicates
there were would be a bonus, but is not critical). I figured that I
could do something like
   duplicates = tuples - tuples.uniq
but that does not seem to work: the duplicates get
removed by uniq but do not show up in the duplicates array. My
guess is that Array#uniq utilizes the Array#eql? method to do
comparisons, whereas array - array uses the == method, and since each
subarray is a different object they do not match with == . Is that
correct? Either way, I was wondering if anyone had suggestions as to
how I could make this work. I would much rather use some good Ruby
intrinsic than creating my own kludge to count the duplicates. If
possible I'd also like to keep this as close to O(n) as possible,
since I'm handling arrays with thousands of entries.

Many regards...
   -CWS

(1) For the curious ones out there, I am working on prepositional
phrase attachment disambiguation... This array is what I use to train
the algorithm.
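
A note on the subtraction idea in the question, as a minimal sketch with made-up sample data. Arrays with equal contents compare equal under both == and eql?, so object identity is probably not the cause. Array#- removes every element of the receiver that also occurs in the argument, and every entry of tuples occurs in tuples.uniq, so the difference comes back empty. The helper name leftover_after_uniq is hypothetical and only there for illustration.

```ruby
# Hypothetical helper: what is left after subtracting the uniq'd list.
def leftover_after_uniq(list)
  list - list.uniq
end

# Made-up sample in the shape described in the post.
tuples = [["join", "as", "director"],
          ["retain", "for", "period"],
          ["retain", "for", "period"]]

# Two distinct array objects with the same contents.
a = ["retain", "for", "period"]
b = ["retain", "for", "period"]
p a == b        # true
p a.eql?(b)     # true
p a.equal?(b)   # false (they are different objects)

# Every entry of tuples also appears in tuples.uniq, so nothing survives.
p leftover_after_uniq(tuples)   # []
```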

Claus Spitzer wrote:

[...]

Here is an ugly O(n*log(n)) version that only works if the elements are
comparable (which yours are):
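s = a.sort # a == your array
# u collects the first occurrence of each entry, d collects the repeats
u = []; d = []; prev = nil
s.each do |elem|
  # after sorting, equal entries sit next to each other
  if elem == prev then d else u end << elem
  prev = elem
end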

-Charlie
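
For anyone who wants to try the sort-based idea on a small input, here is a sketch wrapped in a method. The name split_by_sort and the sample data are made up for illustration. Note that the second list holds every extra occurrence, so an entry that appears three times shows up there twice.

```ruby
# Hypothetical helper: returns [first_occurrences, repeats].
def split_by_sort(list)
  uniques = []
  repeats = []
  list.sort.each_with_index do |elem, i|
    # After sorting, equal entries are adjacent; the first element has no predecessor.
    if i > 0 && elem == uniques.last
      repeats << elem
    else
      uniques << elem
    end
  end
  [uniques, repeats]
end

sample = [["join", "as", "director"],
          ["retain", "for", "period"],
          ["join", "as", "director"],
          ["retain", "for", "period"],
          ["retain", "for", "period"]]

uniques, repeats = split_by_sort(sample)
p uniques        # [["join", "as", "director"], ["retain", "for", "period"]]
p repeats.uniq   # [["join", "as", "director"], ["retain", "for", "period"]]
p repeats.size   # 3
```

The sort makes this O(n*log(n)); the pass over the sorted array is linear.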

Claus Spitzer wrote:

[...]

Hi Claus,

I may be way off, but maybe this'll help...

all_phrases = [["join", "as", "director"],
["retain", "for", "period"],
["pour", "into", "fund"],
["treat", "like", "royalty"],
["join", "for", "evening"],
["haul", "for", "race"],
["retain", "for", "period"],
["stop", "because", "dispute"],
["increase", "from", "period"],
["keep", "with", "magazine"],
["offer", "in", "years"],
["award", "on", "advertising"],
["retain", "for", "period"],
["gain", "in", "years"],
["leave", "as", "bidder"]]

phrase_count = Hash.new(0)
all_phrases.each {|phrase| phrase_count[phrase] += 1}

So you should end up with a Hash; the keys are your phrase list uniq'd and the values are a count of the number of times they've appeared...

Matthew
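
Building on the counting idea above, a small sketch of how the entries that occur more than once could be pulled out of the tally in a second linear pass. The method names count_phrases and duplicated_phrases are made up for illustration, as is the sample data.

```ruby
# Hypothetical helper: tally how often each phrase occurs.
def count_phrases(phrases)
  counts = Hash.new(0)          # missing keys start at 0
  phrases.each { |phrase| counts[phrase] += 1 }
  counts
end

# Hypothetical helper: keep only the phrases seen more than once.
def duplicated_phrases(counts)
  dups = {}
  counts.each { |phrase, n| dups[phrase] = n if n > 1 }
  dups
end

sample = [["join", "as", "director"],
          ["retain", "for", "period"],
          ["pour", "into", "fund"],
          ["retain", "for", "period"],
          ["retain", "for", "period"]]

dups = duplicated_phrases(count_phrases(sample))
p dups.keys                             # [["retain", "for", "period"]]
p dups[["retain", "for", "period"]]     # 3
```

Both passes are single loops with hash lookups, so the whole thing should stay close to O(n). One caveat when using arrays as hash keys: if a subarray is modified after it has been counted, the hash can no longer find it until rehash is called.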

all = [["join", "as", "director"],
["retain", "for", "period"],
["pour", "into", "fund"],
["treat", "like", "royalty"],
["join", "for", "evening"],
["haul", "for", "race"],
["retain", "for", "period"],
["stop", "because", "dispute"],
["increase", "from", "period"],
["keep", "with", "magazine"],
["offer", "in", "years"],
["award", "on", "advertising"],
["retain", "for", "period"],
["gain", "in", "years"],
["gain", "in", "years"],
["leave", "as", "bidder"]]

duplicates = all.uniq.select{|x|all.select{|y|y==x}.size>1}
duplicate_with_count = all.inject(Hash.new(0)){|n,x|n[x]+=1;n}.select{|k,v|v>1}

hope this helps.

-mill

Thanks to all who sent in suggestions. They are much appreciated!
Regards...
-CWS

Mill Mill <koyakam <at> gmail.com> writes:

duplicates = all.uniq.select{|x|all.select{|y|y==x}.size>1}
duplicate_with_count = all.inject(Hash.new(0)){|n,x|n[x]+=1;n}.select{|k,v|v>1}

hope this helps.

-mill

and this may be better too:
duplicates_with_count=Hash.new(1)
s=all.sort
# skip i == 0: s[-1] is the last element, not a predecessor
s.each_index{|i| duplicates_with_count[s[i]]+=1 if i>0 && s[i]==s[i-1]}
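
One thing to watch when comparing s[i] against s[i-1] by index: at i = 0, s[i-1] is s[-1], which is the last element of the array rather than a predecessor. A sketch of the same idea that walks adjacent pairs with each_cons(2) and so never wraps around; the method name duplicates_by_sorting and the sample data are made up for illustration.

```ruby
# Hypothetical helper: count entries that occur more than once,
# by sorting and comparing each entry with its left neighbour.
def duplicates_by_sorting(list)
  # Default of 1: an entry only lands in the hash once a second copy is seen,
  # at which point its count becomes 2.
  counts = Hash.new(1)
  list.sort.each_cons(2) do |left, right|
    counts[right] += 1 if left == right
  end
  counts
end

sample = [["join", "as", "director"],
          ["retain", "for", "period"],
          ["join", "as", "director"],
          ["retain", "for", "period"],
          ["retain", "for", "period"],
          ["pour", "into", "fund"]]

result = duplicates_by_sorting(sample)
p result[["join", "as", "director"]]    # 2
p result[["retain", "for", "period"]]   # 3
p result.key?(["pour", "into", "fund"]) # false
```

Because the hash has a default of 1, looking up an entry that was never duplicated returns 1 without adding it, so key? is the safer way to ask whether something was a duplicate.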