I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
I need to compare the keys from the new file to the keys on the file from
last week. If the rest of the records are the same, then I do nothing. If
the keys matches but the rest of the records are different, I then have an
update and I will output that record to a new file.
I was wondering if there is an efficient way to do this in Ruby. Either any
built-in method or an efficient algorithm which I can implement.
Generate an MD5 or stronger hash on the file. If anything changes, the hash
changes. I say 'or stronger' because MD5 has been shown to have (extremely
rare) overlaps.
···
On Mon, Nov 12, 2012 at 4:21 PM, Ruby Student <ruby.student@gmail.com>wrote:
Team,
I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
I need to compare the keys from the new file to the keys on the file from
last week. If the rest of the records are the same, then I do nothing. If
the keys matches but the rest of the records are different, I then have an
update and I will output that record to a new file.
I was wondering if there is an efficient way to do this in Ruby. Either
any built-in method or an efficient algorithm which I can implement.
You or someone else posted this exact problem just a few weeks ago. Please don't repost if that was you. Please don't post your homework regardless. We're happy to answer questions and give advice. We're not happy to directly contribute to your grade.
···
On Nov 12, 2012, at 14:21 , Ruby Student <ruby.student@gmail.com> wrote:
I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
You can get quite efficient using the important fact that the files are already sorted by your key value.
Your problem is a slight variation on the merge phase of an External Sort
Let's name the two files P (as in "previous") and N (as in "new").
At each step, you have the next record from P and from N.
One of three things will be true:
1) the keys match
Compare the rest of the records and
a) write the update if different
b) do nothing if all are the same
read the next P and the next N
2) Pkey < Nkey
Pkey was missing from N
Delete the record P (update to nothing?) if deletions can happen
read the next P (only; use the same N for the next iteration)
3) Pkey > Nkey
Nkey was missing from P
Insert a new record N
read the next N (only; use the same P for the next iteration)
If the end of file is reached for only one of the files, treat the key as always greater than the other from the not-yet-at-end file.
When both ends-of-file have been reached, you are done.
Close the files.
Of course, this depends on being able to read the files one record at a time (or having a buffer to do so). Since it seems like these might be line-oriented files, I suspect you're OK here.
-Rob
···
On Nov 12, 2012, at 5:21 PM, Ruby Student wrote:
Team,
I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
I need to compare the keys from the new file to the keys on the file from last week. If the rest of the records are the same, then I do nothing. If the keys matches but the rest of the records are different, I then have an update and I will output that record to a new file.
I was wondering if there is an efficient way to do this in Ruby. Either any built-in method or an efficient algorithm which I can implement.
I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
I need to compare the keys from the new file to the keys on the file from last week. If the rest of the records are the same, then I do nothing. If the keys matches but the rest of the records are different, I then have an update and I will output that record to a new file.
I was wondering if there is an efficient way to do this in Ruby. Either any built-in method or an efficient algorithm which I can implement.
Not sure if you dealing with hashes or arrays but take a look at the '-' method for arrays, it's blazing fast for comparing even large arrays.
On 12 Νοε 2012, at 23:21 , Ruby Student <ruby.student@gmail.com> wrote:
--
The wise man said: "Never argue with an idiot. They bring you down to their level and beat you with experience."
On Mon, Nov 12, 2012 at 11:57 PM, Ryan Davis <ryand-ruby@zenspider.com> wrote:
On Nov 12, 2012, at 14:21 , Ruby Student <ruby.student@gmail.com> wrote:
I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
You or someone else posted this exact problem just a few weeks ago. Please don't repost if that was you. Please don't post your homework regardless. We're happy to answer questions and give advice. We're not happy to directly contribute to your grade.
You stated: "IIRC I even posted a solution outline."
I went to your blog (http://blog.rubybestpractices.com/\) but could not find
the item you mentioned. Would you mind to post the link to it?
Thank you
···
On Wed, Nov 14, 2012 at 10:53 AM, Panagiotis Atmatzidis <atma@convalesco.org > wrote:
Hello,
On 12 Νοε 2012, at 23:21 , Ruby Student <ruby.student@gmail.com> wrote:
> Team,
>
> I have two large files, about 16 million records each.
> The files are sorted.
> The first 13 characters are used as a key.
> We get an updated file every week.
> We also keep the previous week file.
>
> I need to compare the keys from the new file to the keys on the file
from last week. If the rest of the records are the same, then I do nothing.
If the keys matches but the rest of the records are different, I then have
an update and I will output that record to a new file.
> I was wondering if there is an efficient way to do this in Ruby. Either
any built-in method or an efficient algorithm which I can implement.
Not sure if you dealing with hashes or arrays but take a look at the '-'
method for arrays, it's blazing fast for comparing even large arrays.
>
> Thank you
>
>
> --
> Ruby Student
Panagiotis (atmosx) Atmatzidis
email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0xE736C6A0
gpg --keyserver x-hkp://pgp.mit.edu --recv-keys 0xE736C6A0
--
The wise man said: "Never argue with an idiot. They bring you down to
their level and beat you with experience."
I assume he meant he posted a solution on this very list
when you(!) asked the same question in October.
Search the ruby-talk archives for
"Looking for suggestions processing and comparing 2 very large files".
As far as I remember there were several useful suggestions.
···
Am 16.11.2012 14:35, schrieb Ruby Student:
Robert,
You stated: "IIRC I even posted a solution outline."
I went to your blog (http://blog.rubybestpractices.com/\) but could not
find the item you mentioned. Would you mind to post the link to it?