Team,
Every week I get a large file: over 50 million records, each 150 characters
long. These files can exceed 15 GB.
I need to take the new file and compare it against the one from the
previous week.
Reading the files into two arrays would make the process easier, but the
files are too large; when I try using arrays, the process crashes with
out-of-storage messages.
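For reference, this is roughly what I was attempting, simplified (the file
names here are just placeholders):

  # Slurping both files into arrays -- with 50+ million 150-char records
  # this runs out of memory.
  old_recs = File.readlines("last_week.txt")
  new_recs = File.readlines("this_week.txt")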
I am looking for suggestions on how to efficiently perform the following
process:
1. Compare each record from this week's file against last week's file
2. If a record is identical in both files, do nothing or just indicate so: *SAM*
3. If there are any duplicate records in the new file, output the record to a file of dups
4. If there are any new records (found in the new file but not in last week's file), output *INS* followed by the record
5. If a record is found in last week's file (old file) but not in this week's file, output *DEL* followed by the record
6. If a record with the same key (the first 13 chars) appears in both files but the rest of the record is different, output *UPD* followed by the record
I can do all of the above by reading each record from both files and doing
different kinds of comparison/matching, but I was wondering whether there is
a more efficient way to do this. Any suggestions are welcome.
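To make the question concrete, below is a rough sketch of the record-by-record
match I have in mind. It assumes both files are (or can first be) sorted by the
13-char key, treats consecutive new-file records that share a key as duplicates,
and uses made-up file names -- none of that is set in stone:

KEY_LEN = 13

# Read one record, or nil at end of file.
def read_rec(io)
  line = io.gets
  line && line.chomp
end

def key(rec)
  rec[0, KEY_LEN]
end

File.open("last_week.txt") do |old_f|
  File.open("this_week.txt") do |new_f|
    File.open("dups.txt", "w") do |dups|
      old_rec = read_rec(old_f)
      new_rec = read_rec(new_f)
      prev_new_key = nil

      while old_rec || new_rec
        # Consecutive new-file records sharing a key are treated as dups.
        if new_rec && prev_new_key == key(new_rec)
          dups.puts(new_rec)
          new_rec = read_rec(new_f)
          next
        end

        if old_rec.nil?                          # only new records remain
          puts "*INS* #{new_rec}"
          prev_new_key = key(new_rec)
          new_rec = read_rec(new_f)
        elsif new_rec.nil?                       # only old records remain
          puts "*DEL* #{old_rec}"
          old_rec = read_rec(old_f)
        elsif key(old_rec) == key(new_rec)
          if old_rec == new_rec
            puts "*SAM* #{new_rec}"              # unchanged record
          else
            puts "*UPD* #{new_rec}"              # same key, data changed
          end
          prev_new_key = key(new_rec)
          old_rec = read_rec(old_f)
          new_rec = read_rec(new_f)
        elsif key(old_rec) < key(new_rec)
          puts "*DEL* #{old_rec}"                # only in last week's file
          old_rec = read_rec(old_f)
        else
          puts "*INS* #{new_rec}"                # only in this week's file
          prev_new_key = key(new_rec)
          new_rec = read_rec(new_f)
        end
      end
    end
  end
end

Only two records are ever in memory at a time, but the whole thing stands or
falls on the sort assumption, which is why I am asking if there is a better way.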
Thank you
--
Ruby Student