Primary Key Hash help

I have a huge data file with rows like this:

AL Prattville 36066 01001 US 01-001-0870 01 001 1 3 0162328 1
AL Prattville 36067 01001 US 01-001-0870 01 001 1 3 0162328 1
AL Prattville 36068 01001 US 01-001-0870 01 001 1 3 0162328 1

The unique key for each row is all columns. I’d like to generate a single
column unique key that’s derived from all the data, so a simple sequential
number primary key won’t do. Any tips on how I could do this in Ruby?

Chris
http://clabs.org

Make a hash sum over the data:

require ‘md5’
unique_key = MD5.new(line)

-billy.

···

On Mon, Aug 12, 2002 at 11:41:29PM +0900, Chris Morris wrote:

I have a huge data file with rows like this:

AL Prattville 36066 01001 US 01-001-0870 01 001 1 3 0162328 1
AL Prattville 36067 01001 US 01-001-0870 01 001 1 3 0162328 1
AL Prattville 36068 01001 US 01-001-0870 01 001 1 3 0162328 1

The unique key for each row is all columns. I’d like to generate a single
column unique key that’s derived from all the data, so a simple sequential
number primary key won’t do. Any tips on how I could do this in Ruby?


Meisterbohne Söflinger Straße 100 Tel: +49-731-399 499-0
eLösungen 89077 Ulm Fax: +49-731-399 499-9

I have a huge data file with rows like this:

AL Prattville 36066 01001 US 01-001-0870 01 001 1 3 0162328 1
AL Prattville 36067 01001 US 01-001-0870 01 001 1 3 0162328 1
AL Prattville 36068 01001 US 01-001-0870 01 001 1 3 0162328 1

The unique key for each row is all columns. I’d like to generate a single
column unique key that’s derived from all the data, so a simple sequential
number primary key won’t do. Any tips on how I could do this in Ruby?

Given

row = line.split

a unique key could be

row.join

.

Just wondering: I saw the MD5 solution, but feel that I am missing
something. Why MD5 instead of:

key = row.hash

I am certaint that this is flawed, but fail to see why. :confused:

– Nikodemus

···

On Mon, 12 Aug 2002, Chris Morris wrote:

The unique key for each row is all columns. I’d like to generate a single
column unique key that’s derived from all the data, so a simple sequential
number primary key won’t do. Any tips on how I could do this in Ruby?

Cool - I had inklings of something like this in my head … but I couldn’t
put it together. I figured someone would be able to pop out an easy answer.

Is the MD5 hash always going to be the same length (it seems that way from
playing with it)? Will different implementations of MD5 yield the same
output for the same input?

···

----- Original Message -----
From: “Philipp Meier” meier@meisterbohne.de
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Monday, August 12, 2002 9:59 AM
Subject: Re: Primary Key Hash help

Make a hash sum over the data:

require 'md5’
unique_key = MD5.new(line)

-billy.

AL Prattville 36068 01001 US 01-001-0870 01 001 1 3 0162328 1

The unique key for each row is all columns. I’d like to generate a
single
column unique key that’s derived from all the data, so a simple
sequential
number primary key won’t do. Any tips on how I could do this in Ruby?

Given

row = line.split

a unique key could be

row.join

True - but the result would be of varying length and a bit cumbersome. I was
hoping for something along the lines of what Philipp replied with (MD5
hash). The MD5 hash is still a little cumbersome, but less than just joining
all the data together.

Thanks for the reply.

Chris
http://clabs.org

Just wondering: I saw the MD5 solution, but feel that I am missing
something. Why MD5 instead of:

  key = row.hash

pigeon% ruby -e 'p "Robertsons".hash == "Sutton".hash'
true
pigeon%

Guy Decoux

As a purely theoretical “nit” here, I don’t think any hash function
will have any guarantees of uniqueness of output for all unique
inputs, though the chance for a collision is probably too small for
you to be concerned about given your domain space.

···

— Chris Morris chrismo@clabs.org wrote:

Cool - I had inklings of something like this in my head … but I
couldn’t
put it together. I figured someone would be able to pop out an easy
answer.

Is the MD5 hash always going to be the same length (it seems that
way from
playing with it)? Will different implementations of MD5 yield the
same
output for the same input?

=====

Use your computer to help find a cure for cancer: http://members.ud.com/projects/cancer/

Yahoo IM: michael_s_campbell


Do You Yahoo!?
HotJobs - Search Thousands of New Jobs
http://www.hotjobs.com

Cool - I had inklings of something like this in my head … but I couldn’t
put it together. I figured someone would be able to pop out an easy answer.

Is the MD5 hash always going to be the same length (it seems that way from
playing with it)? Will different implementations of MD5 yield the same
output for the same input?

Yes and yes. MD5 is a standardized specification (thus it will always
have the same output in all implementations, assuming no bugs are
present), and it’s size in bits is fixed (forget off the top of my head
what size exactly it is).

···

On Mon, 2002-08-12 at 11:17, Chris Morris wrote:

----- Original Message -----
From: “Philipp Meier” meier@meisterbohne.de
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Monday, August 12, 2002 9:59 AM
Subject: Re: Primary Key Hash help

Make a hash sum over the data:

require ‘md5’
unique_key = MD5.new(line)

-billy.

“Chris Morris” chrismo@clabs.org writes:

True - but the result would be of varying length and a bit cumbersome. I was
hoping for something along the lines of what Philipp replied with (MD5
hash). The MD5 hash is still a little cumbersome, but less than just joining
all the data together.

Chris:

As it’s a hash, of course, it is not guaranteed to generate unique
values for unique rows. In practice I’d be very(!) surprised if you
got a clash, but it it was crucially important that keys were
one-to-one with records you might want to find an alternative.

Dave

Just wondering: I saw the MD5 solution, but feel that I am
missing
something. Why MD5 instead of:

key = row.hash

pigeon% ruby -e ‘p “Robertsons”.hash == “Sutton”.hash’
true
pigeon%

I think we addressed the non-uniqueness of any hash algorithm
initially and the original author was aware of it.

Are you simply saying here that Ruby’s hash algorithm is… bad?

···

=====

Use your computer to help find a cure for cancer: http://members.ud.com/projects/cancer/

Yahoo IM: michael_s_campbell


Do You Yahoo!?
HotJobs - Search Thousands of New Jobs
http://www.hotjobs.com

As it’s a hash, of course, it is not guaranteed to generate unique
values for unique rows. In practice I’d be very(!) surprised if you
got a clash, but it it was crucially important that keys were
one-to-one with records you might want to find an alternative.

Yeah, I did a little reading between there and here – it does seem quite
unlikely, but possible. Is there an alternative that’s in the same ballpark,
but not what angus suggested? I’m guessing not – unless there was something
inherent in my dataset that I could take advantage of.

I’ll probably go with MD5 – for my needs I’d be able to detect a duplicate
hash should it ever occur and deal with it then.

Chris

128 bits.
Another fail-proof way which guarantees no clashes would be join()'ing
the fields and expanding the resulting string to a fixed size.
Anyway, collisions would be very rare with MD5…
You can also make them even more uncommon this way:
1) get the MD5sum of “#{row}”
2) get the MD5sum of “#{row}an arbitrary string
3) concat the results from 1) and 2) and you’ve got a 256 bits
“extended” MD5

You can repeat 2) as many times as you like and rest assured that
an asteroid crashing on your computer is more probable than this
“extended” MD5 giving the same value for 2 different rows :slight_smile:
You can also compare it with the probability of a hardware failure (or
that of a bug in your code, if you’re really good at it :slight_smile:

···

On Tue, Aug 13, 2002 at 12:21:44AM +0900, Sean Middleditch wrote:

Yes and yes. MD5 is a standardized specification (thus it will always
have the same output in all implementations, assuming no bugs are
present), and it’s size in bits is fixed (forget off the top of my head
what size exactly it is).


_ _

__ __ | | ___ _ __ ___ __ _ _ __
'_ \ / | __/ __| '_ _ \ / ` | ’ \
) | (| | |
__ \ | | | | | (| | | | |
.__/ _,
|_|/| || ||_,|| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Beeping is cute, if you are in the office :wink:
– Alan Cox

The MD5 Message-Digest Algorithm:
http://www.ietf.org/rfc/rfc1321.txt

···

On Mon, 12 Aug 2002 15:25:30 GMT, Sean Middleditch elanthis@awesomeplay.com wrote:

On Mon, 2002-08-12 at 11:17, Chris Morris wrote:

Cool - I had inklings of something like this in my head … but I couldn’t
put it together. I figured someone would be able to pop out an easy answer.

Is the MD5 hash always going to be the same length (it seems that way from
playing with it)? Will different implementations of MD5 yield the same
output for the same input?

Yes and yes. MD5 is a standardized specification (thus it will always
have the same output in all implementations, assuming no bugs are
present), and it’s size in bits is fixed (forget off the top of my head
what size exactly it is).

Just wondering: I saw the MD5 solution, but feel that I am missing
something. Why MD5 instead of:
key = row.hash

pigeon% ruby -e ‘p “Robertsons”.hash == “Sutton”.hash’
true
pigeon%

I think we addressed the non-uniqueness of any hash algorithm
initially and the original author was aware of it.

But it seems collisions are frequent with String#hash.
It’s a fast function but it “fills” the image domain less uniformly than
MD5. Plus its width is that of a Fixnum (31 bits…), compared to MD5’s
128. MD5’s result space is 29 orders bigger than String#hash’s.
It’s no surprise collisions are far more probable :slight_smile:

Are you simply saying here that Ruby’s hash algorithm is… bad?

[pertinent source code below]
It seems Ruby could also use Perl’s hash algorithm.

Google gives this reference to Perl’s algorithm:
http://mail-index.netbsd.org/tech-perform/2001/11/28/0006.html
which discusses its performance vs other functions.

The first algorithm is PJ Weinberg’s hash function.
Ruby uses the last one, unless told otherwise (you’d have to change
config.h). I don’t know how well it compares to the others in terms of
speed and “one-way-ness”, but these algorithms should be pretty much
similar performance-wise.

register long len = RSTRING(str)->len;
register char *p = RSTRING(str)->ptr;
register int key = 0;

#ifdef HASH_ELFHASH
register unsigned int g;

while (len--) {
    key = (key << 4) + *p++;
    if (g = key & 0xF0000000)
        key ^= g >> 24;
    key &= ~g;
}

#elif HASH_PERL
while (len–) {
key = key*33 + p++;
}
key = key + (key>>5);
#else
while (len–) {
key = key
65599 + *p;
p++;
}
key = key + (key>>5);
#endif
return key;
}

···

On Wed, Aug 14, 2002 at 11:11:52PM +0900, Michael Campbell wrote:


_ _

__ __ | | ___ _ __ ___ __ _ _ __
'_ \ / | __/ __| '_ _ \ / ` | ’ \
) | (| | |
__ \ | | | | | (| | | | |
.__/ _,
|_|/| || ||_,|| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

The most important design issue… is the fact that Linux is supposed to
be fun…
– Linus Torvalds at the First Dutch International Symposium on Linux

If you find a duplicate in your small dataspace, you will be famous
overnight, not to mention the gigantic ripple that you will cause
in computer security. In fact, if you do discover a collision, I know some
people from the NSA that would pay you well to tell them first.
:slight_smile:

···

On Tue, Aug 13, 2002 at 12:37:51AM +0900, Chris Morris wrote:

As it’s a hash, of course, it is not guaranteed to generate unique
values for unique rows. In practice I’d be very(!) surprised if you
got a clash, but it it was crucially important that keys were
one-to-one with records you might want to find an alternative.

Yeah, I did a little reading between there and here – it does seem quite
unlikely, but possible. Is there an alternative that’s in the same ballpark,
but not what angus suggested? I’m guessing not – unless there was something
inherent in my dataset that I could take advantage of.

I’ll probably go with MD5 – for my needs I’d be able to detect a duplicate
hash should it ever occur and deal with it then.


Jim Freeze
If only I had something clever to say for my comment…
~

If you find a duplicate in your small dataspace, you will be famous
overnight, not to mention the gigantic ripple that you will cause
in computer security. In fact, if you do discover a collision, I know some
people from the NSA that would pay you well to tell them first.
:slight_smile:

Excellent. I’ll be in touch :wink:

Chris