The difficulty that you'll run into is in your need for the new, shorter value to be unique. Hashes are not, and cannot be, designed to be unique. It's all in the numbers. If you have a 100-character string of 8-bit characters (assuming ASCII, not Unicode),
then you have 800 bits of information. You could take advantage of the fact that not all 256 values of a byte are valid for your string
to reduce its size somewhat. If you limit yourself to 7-bit ASCII, then there's 1 bit per byte that could be "reclaimed". All of these factors are
taken into account in compression algorithms. So compression is the direction you need to look. Be careful, because many
compression algorithms give longer results than their input if the input is particularly short (I seem to recall that some have fall-back
approaches to account for this).
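You can see that expansion on short inputs with a quick sketch using Ruby's standard-library Zlib (the exact byte counts depend on the zlib version, but the direction holds):

```ruby
require 'zlib'

short = "hello"                 # tiny input: overhead dominates
long  = "hello world, " * 50    # 650 bytes of repetitive text: compresses well

compressed_short = Zlib::Deflate.deflate(short)
compressed_long  = Zlib::Deflate.deflate(long)

# The deflate header/checksum overhead makes the short result LARGER
# than its input, while the long repetitive input shrinks dramatically.
puts "short: #{short.bytesize} bytes -> #{compressed_short.bytesize} bytes"
puts "long:  #{long.bytesize} bytes -> #{compressed_long.bytesize} bytes"
```

And note that decompressing always round-trips, which is exactly the uniqueness guarantee a hash can't give you.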
Hashes (either the built-in hash() method you've already discovered the issues with, or cryptographic hashes like MD5 or SHA1) are
designed to statistically minimize the number of hash collisions. You can take a reasonably long input and the odds of any two different
strings producing the same value are VERY low, but it's not guaranteed. Two inputs producing the same hash is referred to as a hash collision. Cryptographic
hashes are designed to minimize collisions, but, since they are of a fixed size, there are only so many possible result values, and that won't be
enough to guarantee unique results for strings. If your value space (i.e. the number of strings you're trying to ensure uniqueness for) is
not in the millions or billions, and you can live with your compare basically being a statement that, if you get the same value,
there's a 1 in XXXXXXXXX chance of them actually being different strings, then a cryptographic hash might be sufficient for you. Just be aware
that two strings with the same hash might be VERY VERY likely to be the same string, but that it's, at least remotely, possible that they are two different
strings producing the same hash.
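For a concrete sketch, here's Ruby's standard-library Digest module (SHA-256 in this case; the JSON string is just a made-up example). Unlike the built-in hash(), the digest is stable across processes and Ruby versions, which is what you'd want for storing it in a database:

```ruby
require 'digest'

s = '{"user":"rod","fields":[1,2,3]}'  # hypothetical JSON payload

# SHA-256 is deterministic: the same input always yields the same
# 64-hex-character digest, unlike String#hash, which is seeded per process.
d1 = Digest::SHA256.hexdigest(s)
d2 = Digest::SHA256.hexdigest(s)

puts d1
puts d1 == d2   # true
```

You'd store the 64-character digest and compare new submissions against it, accepting the (astronomically small, but nonzero) collision risk described above.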
Look into compression methods first. Compression is what you've described. If your strings are sufficiently long, then off-the-shelf compression
could easily be your answer. If they're short and you have special knowledge of the allowed input values (e.g. you're using ASCII, and only allow
a-z, A-Z, space, comma, period, …) you may find that there are only, say, 100 valid values per character (or anything less than 128); then you could
compress them to 7/8ths of their original size (using very simplistic compression). Take a look at simple zip compression and others like it. Their purpose
is to do what you're asking… produce a shorter value which must be unique for every unique input value (since it must be possible to decompress it).
Using the theoretical 100-values-per-character scenario I just gave, the number of possible values of a string is 100^n (where n is the number of characters).
So, for 20 characters…
possible values = 100^20 => 1e40
number of bits = log2(possible values) => 132.8771
bytes = number of bits / 8 => 16.6096
So, in theory, you can get 20-character strings down to 17 bytes.
If you go up to 200 characters… 167 bytes.
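The same arithmetic in Ruby, for checking other lengths and alphabet sizes (min_bytes is just an illustrative helper name, not a library method):

```ruby
# Minimum whole bytes needed to represent any string of `length`
# characters drawn from an alphabet of `alphabet_size` distinct values.
def min_bytes(alphabet_size, length)
  bits = length * Math.log2(alphabet_size)  # information content in bits
  (bits / 8).ceil                           # round up to whole bytes
end

puts min_bytes(100, 20)    # => 17
puts min_bytes(100, 200)   # => 167
```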
Encryption, as you’ve seen, has no goal of producing shorter output than the input, so it’s not going to provide your solution.
(OK, I’ve started rambling.. probably more detail than you needed… look for compression routines)
···
On Jan 1, 2014, at 10:56 PM, Rodrigo Lueneberg <lists@ruby-forum.com> wrote:
I am trying to generate a unique number for a string and would like some
suggestions.
So far, I've researched the hash, but it seems not consistent on the
values generated. It is not reliable since it does not generate the same
value all the time.
My next idea was to use any easy encrypt method, but that would generate
a large string and would be resource expensive.
My next idea is convert the string to hex. Asp.net uses this approach a
lot. Is this is a good idea? What are the drawbacks?
The reason I need this unique "Hashcode" is because I want to save this
value on the database and compare it with new inserts in order to avoid
duplicate values. There are hundreds of Json objects that correspond to
fields in the db.
So I thought of instead of comparing values with each
field maybe the easier method was to find a way to get the unique Json
string value/code. With the hashcode I can check in the last 15 minutes
if the user is trying to submit the same data to db and prevent it from
being submitted again.
I also think that this method uses very little memory compared to using
session, but I don't want to discard the session idea yet. If there is a
good approach, I may use it.
Well, I hope I made it clear enough to get some feedback.
Thanks
Rod
--
Posted via http://www.ruby-forum.com/.