The difficulty that you'll run into is in your need for the new, shorter value to be unique. Hashes are not, and cannot be, designed to be unique. It's all in the numbers. If you have a 100-character string of 8-bit characters (assuming ASCII, not Unicode),
then you have 800 bits of information. You could take advantage of the fact that not all 256 values of a byte are valid for your string
to reduce its size somewhat. If you limit yourself to 7-bit ASCII, then there's 1 bit per byte that could be "reclaimed". All of these factors are
taken into account in compression algorithms. So compression is the direction you need to look. Be careful, because many
compression algorithms give longer results than their input if the input is particularly short (I seem to recall that some have fall-back
approaches to account for this).
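You can see that expansion on short inputs with a quick sketch using Ruby's standard-library Zlib (the exact byte counts depend on the zlib version, but the direction holds):

```ruby
require 'zlib'

short = "hello"                 # tiny input: overhead dominates
long  = "hello world, " * 50    # 650 bytes of repetitive text: compresses well

compressed_short = Zlib::Deflate.deflate(short)
compressed_long  = Zlib::Deflate.deflate(long)

# The deflate header/checksum overhead makes the short result LARGER
# than its input, while the long repetitive input shrinks dramatically.
puts "short: #{short.bytesize} bytes -> #{compressed_short.bytesize} bytes"
puts "long:  #{long.bytesize} bytes -> #{compressed_long.bytesize} bytes"
```

And note that decompressing always round-trips, which is exactly the uniqueness guarantee a hash can't give you.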
Hashes (either the built-in hash() method you've already discovered the issues with, or cryptographic hashes like MD5 or SHA1) are
designed to statistically minimize the number of hash collisions. You can take a reasonably long input and the odds of any two different
strings producing the same value are VERY low, but it's not guaranteed. Two inputs producing the same hash is referred to as a hash collision. Cryptographic
hashes are designed to minimize collisions, but, since they are of a fixed size, there are only so many possible result values, and that won't be
enough to guarantee unique results for strings. If your value space (i.e. the number of strings you're trying to ensure uniqueness for) is
not in the millions or billions, and you can live with your compare basically being a statement that, if you get the same value,
there's a 1 in XXXXXXXXX chance of them actually being different strings, then a cryptographic hash might be sufficient for you. Just be aware
that two strings with the same hash might be VERY VERY likely to be the same string, but that it's, at least remotely, possible that they are two different
strings producing the same hash.
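For a concrete sketch, here's Ruby's standard-library Digest module (SHA-256 in this case; the JSON string is just a made-up example). Unlike the built-in hash(), the digest is stable across processes and Ruby versions, which is what you'd want for storing it in a database:

```ruby
require 'digest'

s = '{"user":"rod","fields":[1,2,3]}'  # hypothetical JSON payload

# SHA-256 is deterministic: the same input always yields the same
# 64-hex-character digest, unlike String#hash, which is seeded per process.
d1 = Digest::SHA256.hexdigest(s)
d2 = Digest::SHA256.hexdigest(s)

puts d1
puts d1 == d2   # true
```

You'd store the 64-character digest and compare new submissions against it, accepting the (astronomically small, but nonzero) collision risk described above.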
Look into compression methods first. Compression is what you've described. If your strings are sufficiently long, then off-the-shelf compression
could easily be your answer. If they're short and you have special knowledge of the allowed input values (e.g. you're using ASCII, and only allow
a-z, A-Z, space, comma, period, …) you may find that there are only, say, 100 valid values per character (or anything less than 128); then you could
compress them to 7/8ths of their original size (using very simplistic compression). Take a look at simple zip compression and others like it. Their purpose
is to do what you're asking… produce a shorter value which must be unique for every unique input value (since it must be possible to decompress it).
Using the theoretical 100-values-per-character scenario I just gave, the number of possible values of a string is 100^n (where n is the number of characters).
So, for 20 characters…
possible values = 100^20 => 1e40
number of bits = log2(possible values) => 132.8771
bytes = number of bits / 8 => 16.6096
So, in theory, you can get 20-character strings down to 17 bytes.
If you go up to 200 characters… 167 bytes.
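The same arithmetic in Ruby, for checking other lengths and alphabet sizes (min_bytes is just an illustrative helper name, not a library method):

```ruby
# Minimum whole bytes needed to represent any string of `length`
# characters drawn from an alphabet of `alphabet_size` distinct values.
def min_bytes(alphabet_size, length)
  bits = length * Math.log2(alphabet_size)  # information content in bits
  (bits / 8).ceil                           # round up to whole bytes
end

puts min_bytes(100, 20)    # => 17
puts min_bytes(100, 200)   # => 167
```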
Encryption, as you’ve seen, has no goal of producing shorter output than the input, so it’s not going to provide your solution.
(OK, I’ve started rambling.. probably more detail than you needed… look for compression routines)
···
On Jan 1, 2014, at 10:56 PM, Rodrigo Lueneberg <lists@ruby-forum.com> wrote:
I am trying to generate a unique number for a string and would like some
suggestions.
So far, I've researched the hash, but it seems not consistent on the
values generated. It is not reliable since it does not generate the same
value all the time.
My next idea was to use any easy encrypt method, but that would generate
a large string and would be resource expensive.
My next idea is convert the string to hex. Asp.net uses this approach a
lot. Is this is a good idea? What are the drawbacks?
The reason I need this unique "Hashcode" is because I want to save this
value on the database and compare it with new inserts in order to avoid
duplicate values. There are hundreds of Json objects that correspond to
fields in the db.
So I thought of instead of comparing values with each
field maybe the easier method was to find a way to get the unique Json
string value/code. With the hashcode I can check in the last 15 minutes
if the user is trying to submit the same data to db and prevent it from
being submitted again.
I also think that this method uses very little memory compared to using
session, but I don't want to discard the session idea yet. If there is a
good approach, I may use it.
Well, I hope I made it clear enough to get some feedback.
Thanks
Rod
--
Posted via http://www.ruby-forum.com/.