Looking for suggestions processing and comparing 2 very large files

Team,

Every week I get a large file: over 50 million records, each 150 characters
long. These files can be over 15 GB.

I need to take the new file and compare it against the one from the
previous week.
Reading the files into two arrays would make the process a bit easier, but
the files are too large; when I try using arrays, the process crashes with
"out of storage" messages.

I am looking for suggestions on how to efficiently perform the following
process:

   1. Compare each record from this week's file against last week's file
   2. If a record is the same in both files, do nothing or just indicate
   so: *SAM*
   3. If there are any duplicate records in the new file, output the record
   to a file of dups
   4. If there are any new records (records found in the new file but not
   in last week's file), output *INS* followed by the record
   5. If a record is found in last week's file (old file) but not in this
   week's file, output *DEL* followed by the record
   6. If a record with the same key (the first 13 chars) exists in both
   files but the rest of the record is different, output *UPD* followed
   by the record

Hey, I can do all of the above by reading each record from both files and
doing different kinds of comparisons/matches, but I was wondering whether
there is a more efficient way to do this. I am looking for suggestions.

Thank you

···

--
Ruby Student

This is almost exactly what diff on Unix/Linux/Mac OS X does. There are also
diff utilities for Windows, so instead of reinventing the wheel, you may want
to look at the diff utilities already out there.

Wayne

The big question is... are these files SORTED, preferably on some
UNIQUE key, or at least some order that will remain the same from week
to week? If yes, then you can use the same sort of techniques as in
the "diff" utility found on every Unix-derived system and many others.
(Windows has something similar but the name escapes me at the moment.
IIRC, in an ironic twist, this is one of those cases where the
Windows command has a *more* cryptic name than its Unix cognate.) How
to make a "diff" type program has been covered in gazillions of
blog/magazine articles, textbooks, etc., so I won't go into detail.
If you're lucky, you might even be able to just use the ones existing
on your system, with some shell scripting for glue.

On the other claw, if the records are in random order, then you've got
a much more serious problem. In that case, ASSUMING that the keys,
and number of updated/duplicated records, are both quite small, off
the top of my head I think I'd:

- Extract the keys from last week's file (a minimal sketch of this step
  appears after this list)
- Ditto for this week's
- Sort those, assuming the keys are sufficiently small that this is reasonable
- Diff them.
- Extract the actual records from both weeks for any matching keys.
- Sort and diff, under the same assumption.
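
For the key-extraction step, something like this (a minimal sketch, assuming
line-based records whose first 13 characters are the key; the file names are
made up):

  # Write just the 13-char key of every record to a side file, one per line.
  File.open("last_week.keys", "w") do |out|
    File.foreach("last_week.dat") do |rec|
      out.puts rec[0, 13]
    end
  end

The key files can then be sorted with the system sort (which spills to disk
for inputs larger than RAM) and compared with diff or comm.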

Or, if the above data sets are not small enough to make sorting
reasonable, but the potential dups might at least fit in RAM (a rough
Ruby sketch follows this list):

- Extract last week's keys into a Set
- Initialize a "Needs Further Inspection" (NFI) Set
- Iterate over this week's records:
  = Try to find the key in last week's Set of keys
  = If seen, remove it from last week's Set and add it to the NFI Set
  = Else process as an Insertion
- Anything left in last week's Set is a Removal
- (You can now get rid of last week's Set of keys)
- Extract last week's full records matching NFI keys,
  putting them in a hash keyed by the key
- Extract this week's records matching NFI keys,
  looking them up in the hash
- Compare the entire records, processing as either
  Duplicate or Update as needed
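
In Ruby that might look roughly like this (a sketch only: it assumes
line-based records keyed on the first 13 chars, that the Set of keys and the
Hash of NFI records fit in RAM, and it uses made-up file names; the markers
are the ones from the original post):

  require 'set'

  KEY_LEN = 13
  OLD, NEW = "last_week.dat", "this_week.dat"

  old_keys = Set.new
  File.foreach(OLD) { |rec| old_keys << rec[0, KEY_LEN] }

  nfi = Set.new                              # "Needs Further Inspection"
  out = File.open("changes.out", "w")

  File.foreach(NEW) do |rec|
    key = rec[0, KEY_LEN]
    if old_keys.delete?(key)
      nfi << key                             # key exists in both weeks
    else
      out.puts "*INS* #{rec.chomp}"          # only in this week's file
    end
  end

  # Second pass over the old file: keys still in old_keys were removed;
  # full records for NFI keys are kept for the comparison below.
  old_recs = {}
  File.foreach(OLD) do |rec|
    key = rec[0, KEY_LEN]
    out.puts "*DEL* #{rec.chomp}" if old_keys.include?(key)
    old_recs[key] = rec           if nfi.include?(key)
  end

  # Third pass over the new file: compare full records for the NFI keys.
  File.foreach(NEW) do |rec|
    old = old_recs[rec[0, KEY_LEN]]
    next unless old
    out.puts(rec == old ? "*SAM* #{rec.chomp}" : "*UPD* #{rec.chomp}")
  end

  out.close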

-Dave

···

On Mon, Oct 22, 2012 at 2:21 PM, Ruby Student <ruby.student@gmail.com> wrote:

Every week I get a large file, over 50 million records

--
Dave Aronson, the T. Rex of Codosaurus LLC,
secret-cleared freelance software developer
taking contracts in or near NoVa or remote.
See information at http://www.Codosaur.us/.

If they are sorted you can split them into pieces, e.g. create a, b, c
files and compare a(old) with a(new) and b(old) with b(new) only.

If the files are still too big, use two chars: aa, ab or the like.
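
Something along these lines, say (just a sketch that buckets on the first
character of the key; the file names are invented):

  # Split a sorted record file into one bucket file per leading key character.
  buckets = Hash.new { |h, c| h[c] = File.open("this_week.#{c}", "w") }
  File.foreach("this_week.dat") { |rec| buckets[rec[0]] << rec }
  buckets.each_value(&:close)

Each old/new bucket pair can then be compared independently (or in parallel).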

Marc Weber

I forgot to mention that the files are sorted!

···

On Mon, Oct 22, 2012 at 2:57 PM, Dave Aronson <rubytalk2dave@davearonson.com> wrote:

On Mon, Oct 22, 2012 at 2:21 PM, Ruby Student <ruby.student@gmail.com> wrote:

> Every week I get a large file, over 50 million records


--
Ruby Student

One could as well read records from both files at the same time and
output diffs while doing that. If the data is line based, there are
tools already for comparing two sorted files:

$ comm --help
Usage: comm [OPTION]... FILE1 FILE2
Compare sorted files FILE1 and FILE2 line by line.

With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2,
and column three contains lines common to both files.

  -1                      suppress column 1 (lines unique to FILE1)
  -2                      suppress column 2 (lines unique to FILE2)
  -3                      suppress column 3 (lines that appear in both files)
      --check-order       check that the input is correctly sorted, even
                            if all input lines are pairable
      --nocheck-order     do not check that the input is correctly sorted
      --output-delimiter=STR  separate columns with STR
      --help              display this help and exit
      --version           output version information and exit

Note, comparisons honor the rules specified by `LC_COLLATE'.

Examples:
  comm -12 file1 file2 Print only lines present in both file1 and file2.
  comm -3 file1 file2 Print lines in file1 not in file2, and vice versa.

Report comm bugs to bug-coreutils@gnu.org
GNU coreutils home page: <http://www.gnu.org/software/coreutils/>
General help using GNU software: <http://www.gnu.org/gethelp/>
For complete documentation, run: info coreutils 'comm invocation'

If the files are not line based, it may be possible to convert them to a
line-based format and then use comm like this:

$ comm <(convert old) <(convert new)
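
With sorted line-based files, the two "unique" columns map fairly directly
onto the insert and delete cases (records whose key exists in both files but
whose body changed will show up in both columns, so those still need a
key-based post-processing step). For example, with placeholder file names:

$ comm -13 last_week.dat this_week.dat   # only in the new file: *INS* or *UPD* candidates
$ comm -23 last_week.dat this_week.dat   # only in the old file: *DEL* or *UPD* candidates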

Kind regards

robert

···

On Mon, Oct 22, 2012 at 10:05 PM, Marc Weber <marco-oweber@gmx.de> wrote:

If they are sorted you can split into pieces, eg create an
a,b,c file and compare a(old) with a(new) and b(old) with b(new) only.

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

PS: You can use this scheme if you want to implement it yourself; a rough
Ruby sketch of the merge-style pass follows.
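
Something like this, as a sketch only: it assumes line-based records sorted
by the 13-char key, uses made-up file names, and emits the markers from the
original post.

  KEY_LEN = 13
  key = ->(rec) { rec && rec[0, KEY_LEN] }

  old_f = File.open("last_week.dat")
  new_f = File.open("this_week.dat")
  out   = File.open("changes.out", "w")

  old_rec = old_f.gets
  new_rec = new_f.gets

  # Classic merge over two sorted inputs: advance whichever side has the
  # smaller key, or both sides when the keys match.
  while old_rec || new_rec
    if new_rec.nil? || (old_rec && key[old_rec] < key[new_rec])
      out.puts "*DEL* #{old_rec.chomp}"        # only in last week's file
      old_rec = old_f.gets
    elsif old_rec.nil? || key[new_rec] < key[old_rec]
      out.puts "*INS* #{new_rec.chomp}"        # only in this week's file
      new_rec = new_f.gets
    else                                       # same key on both sides
      marker = (old_rec == new_rec ? "*SAM*" : "*UPD*")
      out.puts "#{marker} #{new_rec.chomp}"
      old_rec = old_f.gets
      new_rec = new_f.gets
    end
  end

  [old_f, new_f, out].each(&:close)

Duplicate keys within the same file would need one extra check (e.g. compare
each record's key with the previous one before advancing) to route them to
the dups file.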

···

On Wed, Oct 24, 2012 at 9:50 AM, Robert Klemme <shortcutter@googlemail.com> wrote:

One could as well read records from both files at the same time and
output diffs while doing that. If the data is line based, there are
tools already for comparing two sorted files:

$ comm --help

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/