The issue here is that we could have duplicate entries. A duplicate entry
is when both the ID & the INSTANCE from one record are found on another
record. My task is to identify all dups and create a file with them. A
dup can occur at any time during the 24-hour period.
My idea was to take each ID and do a grep against the file. That will
certainly find a match because, at the very minimum, it will find itself.
That's no good because I only want to output the record when I find two or
more instances of the ID/INSTANCE pair. I also thought about extracting just
the ID/INSTANCE and creating an array such as this:
The idea was to take each element of the first column and walk the
entire file looking for dups. I'm also stuck finding a good way to do
this. The other problem is whether I will be able to create an array that
large. Remember that there could potentially be 2 million records.
I'd create a DB table with fields for ID and INSTANCE and put a
unique constraint on them. Then insert your records, rescuing the
ActiveRecord::RecordNotUnique errors and writing those entries
to your dup file (or another DB table).
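A minimal sketch of that approach, assuming a hypothetical Entry model backed by an in-memory SQLite database, and assuming ID and INSTANCE are the first two whitespace-separated fields of each record (the real file layout isn't shown in this thread; the file names are placeholders):

  require "active_record"

  # Hypothetical connection; in practice point this at a real database.
  ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")

  ActiveRecord::Schema.define do
    create_table :entries do |t|
      t.string :record_id    # named record_id to avoid clashing with the primary key
      t.string :instance
    end
    # The unique index is what turns a duplicate insert into an error we can rescue.
    add_index :entries, [:record_id, :instance], unique: true
  end

  class Entry < ActiveRecord::Base; end

  File.open("dups.txt", "w") do |dups|
    File.foreach("records.txt") do |line|
      id, instance = line.split
      begin
        Entry.create!(record_id: id, instance: instance)
      rescue ActiveRecord::RecordNotUnique
        dups.puts line    # second and later occurrences end up in the dup file
      end
    end
  end

One INSERT per record is slow for two million rows, but it keeps the bookkeeping in the database rather than in the Ruby process.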
FWIW,
···
On Thu, Oct 24, 2013 at 11:17 AM, Ruby Student <ruby.student@gmail.com> wrote:
The issue here is that we could have duplicate entries. A duplicate entry is
when both the ID & the INSTANCE from one record are found on another record.
My task is to identify all dups and create a file with them.
I think you are close, but what you are considering is searching the entire file for each [ID, Instance] pair, in essence doing n*n comparisons. This is not good, especially when n = 1000000.
Instead, go through the file once and maintain a Set containing each [ID, Instance] pair. For each record, check if the pair is already in the Set. If it is, you have a duplicate. If not, add it to the Set and move on.
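A rough sketch of that single pass, with placeholder file names and assuming ID and INSTANCE are the first two whitespace-separated fields of each record:

  require "set"

  seen = Set.new
  File.open("dups.txt", "w") do |dups|
    File.foreach("records.txt") do |line|
      pair = line.split.first(2)    # [ID, INSTANCE]
      if seen.include?(pair)
        dups.puts line              # already seen: this record is a duplicate
      else
        seen << pair                # first occurrence: remember it and move on
      end
    end
  end

Memory grows with the number of distinct pairs rather than with n*n comparisons, so two million records should stay manageable.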
-Justin
···
On 10/24/2013 11:17 AM, Ruby Student wrote:
Hello Team,
I have a file with over a million records in chronological order. A
piece of the file looks like this:
The issue here is that we could have duplicate entries. A duplicate
entry is when both the ID & the INSTANCE from one record are found on
another record. My task is to identify all dups and create a file
with them. A dup can occur at any time during the 24-hour period.
My idea was to take each ID and do a grep against the file. That will
certainly find a match because, at the very minimum, it will find itself.
That's no good because I only want to output the record when I find two or
more instances of the ID/INSTANCE pair. I also thought about extracting just
the ID/INSTANCE and creating an array such as this:
The idea was to take each element of the first column and walk the
entire file looking for dups. I'm also stuck finding a good way to
do this. The other problem is whether I will be able to create an array
that large. Remember that there could potentially be 2 million records.
My task is to identify all dups and create a file with them.
Your program just writes the second, third, and any further occurrences.
Btw. you can simplify lines 21 to 25 to
puts line unless entries.add? entry
alternatively
entries.add? entry or puts line
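Guessing at the surrounding loop, since the original program isn't reproduced in this thread, the Set#add? form reads roughly like:

  require "set"

  entries = Set.new
  File.foreach("records.txt") do |line|    # placeholder file name
    entry = line.split.first(2)            # assumes [ID, INSTANCE] are the first two fields
    # Set#add? returns nil when the element was already present,
    # so only the second and later occurrences get printed.
    puts line unless entries.add? entry
  end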
Agree that you could save a lot of memory by considering a day at a time
instead of all entries.
It's not about saving memory. With large files it may not even be
possible to run a program that holds all the file content in memory,
because it dies from memory exhaustion. If OTOH there is enough
memory in the machine, then chances are that the second run through the
file is done from cached file content and will be much cheaper than
the first one.
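For what it's worth, a sketch of the day-at-a-time variant, under the assumptions that the file is chronological, that duplicates only need to be caught within the same day, and that the date is the first whitespace-separated field (none of which the thread confirms):

  require "set"

  seen = Set.new
  current_day = nil

  File.open("dups.txt", "w") do |dups|
    File.foreach("records.txt") do |line|
      day, id, instance = line.split    # hypothetical layout: DATE ID INSTANCE ...
      if day != current_day
        seen.clear                      # new day: earlier pairs can be forgotten
        current_day = day
      end
      dups.puts line unless seen.add?([id, instance])
    end
  end

This keeps only one day's worth of pairs in memory at a time.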
Cheers
robert
···
On Fri, Oct 25, 2013 at 6:12 PM, Justin Collins <justincollins@ucla.edu> wrote:
On 10/25/2013 01:59 AM, Robert Klemme wrote:
On Thu, Oct 24, 2013 at 11:00 PM, Reid Thompson <Reid.Thompson@ateb.com> wrote: