On Mon, Aug 12, 2002 at 11:41:29PM +0900, Chris Morris wrote:
I have a huge data file with rows like this:
AL Prattville 36066 01001 US 01-001-0870 01 001 1 3 0162328 1
AL Prattville 36067 01001 US 01-001-0870 01 001 1 3 0162328 1
AL Prattville 36068 01001 US 01-001-0870 01 001 1 3 0162328 1
The unique key for each row is all columns. I’d like to generate a single
column unique key that’s derived from all the data, so a simple sequential
number primary key won’t do. Any tips on how I could do this in Ruby?
Just wondering: I saw the MD5 solution, but feel that I am missing
something. Why MD5 instead of:
key = row.hash
I am certain that this is flawed, but fail to see why.
– Nikodemus
···
On Mon, 12 Aug 2002, Chris Morris wrote:
Cool - I had inklings of something like this in my head … but I couldn’t
put it together. I figured someone would be able to pop out an easy answer.
Is the MD5 hash always going to be the same length (it seems that way from
playing with it)? Will different implementations of MD5 yield the same
output for the same input?
···
----- Original Message -----
From: “Philipp Meier” meier@meisterbohne.de
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Monday, August 12, 2002 9:59 AM
Subject: Re: Primary Key Hash help
Given

    row = line.split

a unique key could be

    row.join
True - but the result would be of varying length and a bit cumbersome. I was
hoping for something along the lines of what Philipp replied with (MD5
hash). The MD5 hash is still a little cumbersome, but less than just joining
all the data together.
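For instance, with the stdlib's digest/md5 (the NUL separator is my own choice, to keep different field boundaries from joining into the same string):

```ruby
require 'digest/md5'

line = "AL Prattville 36066 01001 US 01-001-0870 01 001 1 3 0162328 1"
# Join with a separator that can't occur in the fields, so two different
# rows can never collapse to the same joined string before hashing.
key = Digest::MD5.hexdigest(line.split.join("\0"))
key.length  # always 32 hex characters (128 bits), whatever the row
```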
As a purely theoretical “nit” here, I don’t think any hash function
will have any guarantees of uniqueness of output for all unique
inputs, though the chance for a collision is probably too small for
you to be concerned about given your domain space.
Yes and yes. MD5 is a standardized specification (thus it will always
have the same output in all implementations, assuming no bugs are
present), and its size in bits is fixed (I forget off the top of my head
exactly what size it is).
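Both are easy to check against the specification's published test vectors, e.g.:

```ruby
require 'digest/md5'

# RFC 1321's own test vector: the MD5 of the empty string.
Digest::MD5.hexdigest("")                    # => "d41d8cd98f00b204e9800998ecf8427e"

# The digest length never varies with the input length.
Digest::MD5.hexdigest("a" * 10_000).length   # => 32
```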
···
On Mon, 2002-08-12 at 11:17, Chris Morris wrote:
Chris:
As it’s a hash, of course, it is not guaranteed to generate unique
values for unique rows. In practice I’d be very(!) surprised if you
got a clash, but if it was crucially important that keys were
one-to-one with records you might want to find an alternative.
Yeah, I did a little reading between there and here – it does seem quite
unlikely, but possible. Is there an alternative that’s in the same ballpark,
but not what angus suggested? I’m guessing not – unless there was something
inherent in my dataset that I could take advantage of.
I’ll probably go with MD5 – for my needs I’d be able to detect a duplicate
hash should it ever occur and deal with it then.
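That detection is cheap to bolt on; a sketch, with the row list and variable names my own:

```ruby
require 'digest/md5'

seen = {}       # digest => row it came from
collisions = [] # digests that turned up twice for different rows
[
  "AL Prattville 36066 01001 US 01-001-0870 01 001 1 3 0162328 1",
  "AL Prattville 36067 01001 US 01-001-0870 01 001 1 3 0162328 1"
].each do |row|
  key = Digest::MD5.hexdigest(row)
  collisions << key if seen.key?(key) && seen[key] != row
  seen[key] = row
end
collisions  # => [] -- empty here; if a clash ever appears, deal with it at this point
```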
128 bits.
Another foolproof way, which guarantees no clashes, would be to join()
the fields and pad the resulting string to a fixed size.
Anyway, collisions would be very rare with MD5…
You can also make them even more uncommon this way:
1) get the MD5sum of “#{row}”
2) get the MD5sum of “#{row}an arbitrary string”
3) concat the results from 1) and 2) and you’ve got a 256-bit
“extended” MD5
You can repeat 2) as many times as you like and rest assured that
an asteroid crashing on your computer is more probable than this
“extended” MD5 giving the same value for 2 different rows.
You can also compare it with the probability of a hardware failure (or
that of a bug in your code, if you’re really good at it).
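A sketch of those three steps (the helper name and default salt are mine):

```ruby
require 'digest/md5'

# Hypothetical helper: plain MD5 of the row, concatenated with the MD5
# of the row plus an arbitrary salt string appended.
def extended_md5(row, salt = "an arbitrary string")
  Digest::MD5.hexdigest(row) + Digest::MD5.hexdigest(row + salt)
end

row = "AL Prattville 36066 01001 US 01-001-0870 01 001 1 3 0162328 1"
extended_md5(row).length  # => 64 hex characters: a 256-bit "extended" MD5
```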
···
On Tue, Aug 13, 2002 at 12:21:44AM +0900, Sean Middleditch wrote:
I think we addressed the non-uniqueness of any hash algorithm
initially and the original author was aware of it.
But it seems collisions are frequent with String#hash.
It’s a fast function but it “fills” the image domain less uniformly than
MD5. Plus its width is that of a Fixnum (31 bits…), compared to MD5’s
128. MD5’s result space is 29 orders of magnitude bigger than String#hash’s.
It’s no surprise collisions are far more probable.
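The birthday bound makes that concrete; a sketch, using the standard approximation 1 - e^(-n(n-1)/2d) for n keys drawn from a space of d values:

```ruby
# Approximate probability of at least one collision among n uniformly
# random keys drawn from a space of d values (birthday approximation).
def collision_prob(n, d)
  1.0 - Math.exp(-n * (n - 1) / (2.0 * d))
end

collision_prob(1_000_000, 2.0**31)   # a 31-bit hash: a clash is all but certain
collision_prob(1_000_000, 2.0**128)  # MD5's 128 bits: vanishingly small
```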
Are you simply saying here that Ruby’s hash algorithm is… bad?
[pertinent source code below]
It seems Ruby could also use Perl’s hash algorithm.
The first algorithm is P. J. Weinberger’s hash function.
Ruby uses the last one, unless told otherwise (you’d have to change
config.h). I don’t know how well it compares to the others in terms of
speed and “one-way-ness”, but these algorithms should be pretty much
similar performance-wise.
    register long len = RSTRING(str)->len;
    register char *p = RSTRING(str)->ptr;
    register int key = 0;

#ifdef HASH_ELFHASH
    register unsigned int g;

    while (len--) {
        key = (key << 4) + *p++;
        if (g = key & 0xF0000000)
            key ^= g >> 24;
        key &= ~g;
    }
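For comparison, the same ELF-style loop can be sketched in Ruby itself, masked to 32 bits since Ruby integers don't wrap the way C's unsigned int does:

```ruby
# Ruby port of the ELF-style hash loop above; matches the C version for
# 7-bit ASCII input (C's signed char would differ for bytes >= 128).
def elf_hash(str)
  key = 0
  str.each_byte do |byte|
    key = ((key << 4) + byte) & 0xFFFFFFFF  # emulate 32-bit overflow
    g = key & 0xF0000000
    key ^= g >> 24 unless g.zero?
    key &= ~g & 0xFFFFFFFF
  end
  key
end

elf_hash("a")  # => 97, just the byte value for a one-character string
```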
If you find a duplicate in your small dataspace, you will be famous
overnight, not to mention the gigantic ripple that you will cause
in computer security. In fact, if you do discover a collision, I know some
people from the NSA that would pay you well to tell them first.
···
On Tue, Aug 13, 2002 at 12:37:51AM +0900, Chris Morris wrote:
–
Jim Freeze
If only I had something clever to say for my comment…
~