Sets, uniqueness not unique

Robert · 16 September 2005 08:01

Hugh Sasse wrote:

Hugh Sasse wrote:

Hmm, that's interesting, but I don't get:

code << "def hash() " << fields.map {|f| "self.#{f}.hash" }.join(" ^
") << " end\n"

Shouldn't hash return a Fixnum?

Definitely!

[...]

The function above appears to return a string with numbers separated
by " ^ ".

Nope. The join appears during code generation and not during
evaluation of the method. You can easily verify this by printing
code after it's completed.

Oh, then it's exclusive or. I'm clearly being as sharp as a sponge
today.

I'll have to remember that phrase - I could use it myself from time to
time.

While my brain is behaving like cottage cheese, it's probably not
the time to ask how one might guarantee that you don't stomp on the
hashes of other ojects in the system. If you have an even number of
elements, all the same Fixnum, like [1,1,1,1] then they would hash
to 0, as would [2,2], I "think".
irb(main):004:0> [1,1].inject(0) { |a,b| a ^= b.hash}
=> 0
irb(main):005:0> [2,1,1,2].inject(0) { |a,b| a ^= b.hash}
=> 0

Btw, the assignment is superfluous. The result of a^b.hash is the next
iteration's a.

irb(main):006:0>

Yes. The algorithm can certainly be improved on. Typically you rather do
something similar to

(a.hash ^ (b.hash << 3) ^ (c.hash << 7)) & MAX_HASH

09:53:59 [~]: irbs

[2,1,1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

=> 2781

[1,2, 1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

=> 1885

i.e. by shifting you make sure that order matters etc.

Kind regards

robert

···

On Fri, 16 Sep 2005, Robert Klemme wrote:

Hugh_Sasse · 16 September 2005 11:21

Yes, good point, the result of the block....

------------------------------------------------------ Enumerable#inject
enum.inject(initial) {| memo, obj | block } => obj
enum.inject {| memo, obj | block } => obj

···

On Fri, 16 Sep 2005, Robert Klemme wrote:

Hugh Sasse wrote:

While my brain is behaving like cottage cheese, it's probably not
the time to ask how one might guarantee that you don't stomp on the
hashes of other ojects in the system. If you have an even number of
elements, all the same Fixnum, like [1,1,1,1] then they would hash
to 0, as would [2,2], I "think".
irb(main):004:0> [1,1].inject(0) { |a,b| a ^= b.hash}
=> 0
irb(main):005:0> [2,1,1,2].inject(0) { |a,b| a ^= b.hash}
=> 0

Btw, the assignment is superfluous. The result of a^b.hash is the next
iteration's a.

------------------------------------------------------------------------
      Combines the elements of _enum_ by applying the block to an
      accumulator value (_memo_) and each element in turn. At each step,
      _memo_ is set to the value returned by the block. The first form
      ================================================
         [...]

irb(main):006:0>

Yes. The algorithm can certainly be improved on. Typically you rather do
something similar to

(a.hash ^ (b.hash << 3) ^ (c.hash << 7)) & MAX_HASH

09:53:59 [~]: irbs

[2,1,1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

=> 2781

[1,2, 1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

=> 1885

Ah, so that's what MAX_HASH is, I couldn't remember how big Fixnums
were.

I was thinking about something like a linear congruential
random number generator like:
brains hgs 22 %> irb
irb(main):001:0> [2,1,1,2].inject(0) {|a,b| ((a * 31)+b.hash) % 4093082899}
=> 151936
irb(main):002:0> [2,2,1,1].inject(0) {|a,b| ((a * 31)+b.hash) % 4093082899}
=> 153856
irb(main):003:0> [2,1,2,1].inject(0) {|a,b| ((a * 31)+b.hash) % 4093082899}
=> 151996
irb(main):004:0>

like http://www.cs.bell-labs.com/cm/cs/pearls/markovhash.c

from "Programming Pearls".

with a largish prime grabbed from

being the biggest I could see less than 0XFFFF_FFFF (4294967295)

------------------------------------------------ Class: Fixnum < Integer
A +Fixnum+ holds +Integer+ values that can be represented in a
native machine word (minus 1 bit). If any operation on a +Fixnum+

Should that be 0x7FFF_FFFF? (2147483647) According to

this would seem to be a prime number, so could be used as the
modulus anyway.

i.e. by shifting you make sure that order matters etc.

Kind regards

robert

Thank you,
Hugh

Robert · 16 September 2005 11:56

Hugh Sasse wrote:

Hugh Sasse wrote:

While my brain is behaving like cottage cheese, it's probably not
the time to ask how one might guarantee that you don't stomp on the
hashes of other ojects in the system. If you have an even number of
elements, all the same Fixnum, like [1,1,1,1] then they would hash
to 0, as would [2,2], I "think".
irb(main):004:0> [1,1].inject(0) { |a,b| a ^= b.hash}
=> 0
irb(main):005:0> [2,1,1,2].inject(0) { |a,b| a ^= b.hash}
=> 0

Btw, the assignment is superfluous. The result of a^b.hash is the
next iteration's a.

Yes, good point, the result of the block....

Yep.

------------------------------------------------------
      Enumerable#inject enum.inject(initial) {| memo, obj | block }
      => obj enum.inject {| memo, obj | block } => obj
------------------------------------------------------------------------
      Combines the elements of _enum_ by applying the block to an
      accumulator value (_memo_) and each element in turn. At each
      step, _memo_ is set to the value returned by the block. The
      first form ================================================
         [...]

irb(main):006:0>

Yes. The algorithm can certainly be improved on. Typically you
rather do something similar to

(a.hash ^ (b.hash << 3) ^ (c.hash << 7)) & MAX_HASH

09:53:59 [~]: irbs

[2,1,1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

=> 2781

>> [1,2, 1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

=> 1885

Ah, so that's what MAX_HASH is, I couldn't remember how big Fixnums
were.

No, that's not really the limit - I just choose 32 bit as it's an often
used size.

13:41:59 [~]: ruby -e 'i=1;i<<=1 while Fixnum === i;i>>=1;puts i.to_s(16),
i.class'
20000000
Fixnum

I was thinking about something like a linear congruential
random number generator like:
brains hgs 22 %> irb
irb(main):001:0> [2,1,1,2].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151936
irb(main):002:0> [2,2,1,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 153856
irb(main):003:0> [2,1,2,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151996
irb(main):004:0>

like
http://www.cs.bell-labs.com/cm/cs/pearls/markovhash.c

from "Programming Pearls".

That code does only one modulus on the NHASH. Your code does it for each
iteration. But otherwise 31 seems a good number - Java containers use 31,
too.

From AbstractList:

    public int hashCode() {
        int hashCode = 1;
        Iterator i = iterator();
        while (i.hasNext()) {
            Object obj = i.next();
            hashCode = 31*hashCode + (obj==null ? 0 : obj.hashCode());
        }
        return hashCode;
    }

with a largish prime grabbed from
Small random primes (1 of 3)

being the biggest I could see less than 0XFFFF_FFFF (4294967295)

------------------------------------------------ Class: Fixnum <
Integer A +Fixnum+ holds +Integer+ values that can be
represented in a native machine word (minus 1 bit). If any
operation on a +Fixnum+

Should that be 0x7FFF_FFFF? (2147483647)
According to
Print Prime Numbers
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be slightly more
efficient.

Kind regards

robert

···

On Fri, 16 Sep 2005, Robert Klemme wrote:

Hugh_Sasse · 16 September 2005 12:40

Hugh Sasse wrote:

>> [1,2, 1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

=> 1885

Ah, so that's what MAX_HASH is, I couldn't remember how big Fixnums
were.

No, that's not really the limit - I just choose 32 bit as it's an often
used size.

13:41:59 [~]: ruby -e 'i=1;i<<=1 while Fixnum === i;i>>=1;puts i.to_s(16),
i.class'
20000000
Fixnum

I was thinking about something like a linear congruential
random number generator like:
brains hgs 22 %> irb
irb(main):001:0> [2,1,1,2].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151936
irb(main):002:0> [2,2,1,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 153856
irb(main):003:0> [2,1,2,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151996
irb(main):004:0>

like
http://www.cs.bell-labs.com/cm/cs/pearls/markovhash.c

from "Programming Pearls".

That code does only one modulus on the NHASH. Your code does it for each

I wasn't following closely enough.

iteration. But otherwise 31 seems a good number - Java containers use 31,
too.

From AbstractList:

[...]

------------------------------------------------ Class: Fixnum <
Integer A +Fixnum+ holds +Integer+ values that can be
represented in a native machine word (minus 1 bit). If any
operation on a +Fixnum+

Should that be 0x7FFF_FFFF? (2147483647)
According to
http://www.rsok.com/~jrm/printprimes.html
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be slightly more
efficient.

Agreed. Is the RCR yours? Maybe an RCRCR is appropriate, "change
the hashing algorithm in that code to use LCG".

Kind regards

robert

Hugh

···

On Fri, 16 Sep 2005, Robert Klemme wrote:

On Fri, 16 Sep 2005, Robert Klemme wrote:

Robert · 16 September 2005 13:06

Hugh Sasse wrote:

Hugh Sasse wrote:

>> [1,2, 1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

=> 1885

Ah, so that's what MAX_HASH is, I couldn't remember how big Fixnums
were.

No, that's not really the limit - I just choose 32 bit as it's an
often used size.

13:41:59 [~]: ruby -e 'i=1;i<<=1 while Fixnum === i;i>>=1;puts
i.to_s(16), i.class'
20000000
Fixnum

I was thinking about something like a linear congruential
random number generator like:
brains hgs 22 %> irb
irb(main):001:0> [2,1,1,2].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151936
irb(main):002:0> [2,2,1,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 153856
irb(main):003:0> [2,1,2,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151996
irb(main):004:0>

like
http://www.cs.bell-labs.com/cm/cs/pearls/markovhash.c

from "Programming Pearls".

That code does only one modulus on the NHASH. Your code does it for
each

I wasn't following closely enough.

Someone talked about his brain behaving like cottage cheese today... :-))

iteration. But otherwise 31 seems a good number - Java containers
use 31, too.

From AbstractList:

         [...]

------------------------------------------------ Class: Fixnum <
      Integer A +Fixnum+ holds +Integer+ values that can be
      represented in a native machine word (minus 1 bit). If any
operation on a +Fixnum+

Should that be 0x7FFF_FFFF? (2147483647)
According to
http://www.rsok.com/~jrm/printprimes.html
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be slightly
more efficient.

Agreed. Is the RCR yours?

Yes.

Maybe an RCRCR is appropriate, "change
the hashing algorithm in that code to use LCG".

Done.

Thanks for the inspiring discussion!

robert

···

On Fri, 16 Sep 2005, Robert Klemme wrote:

On Fri, 16 Sep 2005, Robert Klemme wrote:

Hugh_Sasse · 16 September 2005 13:25

That code does only one modulus on the NHASH. Your code does it for
each

I wasn't following closely enough.

Someone talked about his brain behaving like cottage cheese today... :-))

There is that.

[...]

Should that be 0x7FFF_FFFF? (2147483647)
According to
Print Prime Numbers
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be slightly
more efficient.

I thought about this over lunch. I don't think you can use bit AND.

Suppose we have a 4 bit machine[!] and MAX_HASH is 3

6 % 3 == 0, but 6 & 3 == 2
7 % 3 == 1, but 7 & 3 == 3

etc. The theory behind linear *congruential* random number
generators, (which is how we're stirring the hashes) is based on
primality in modulo arithmetic. I think if we destroy that
relationship things will be less random and we'll get more
collisions.

Testing for randomness is tricky enough, so I'd not like to have to
prove this. But I have a suspicion that given the amount of work by
Knuth and others on the topic, if & could be used instead of % it
would be the recommended way to do it, because it is cheaper/faster.

Agreed. Is the RCR yours?

Yes.

Maybe an RCRCR is appropriate, "change
the hashing algorithm in that code to use LCG".

Done.

Thanks for the inspiring discussion!

robert

Thank you,
Hugh

···

On Fri, 16 Sep 2005, Robert Klemme wrote:

Robert · 16 September 2005 16:11

Hugh Sasse wrote:

That code does only one modulus on the NHASH. Your code does it
for each

I wasn't following closely enough.

Someone talked about his brain behaving like cottage cheese today...
:-))

There is that.

         [...]

Should that be 0x7FFF_FFFF? (2147483647)
According to
Print Prime Numbers
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be
slightly more efficient.

I thought about this over lunch. I don't think you can use bit AND.

Suppose we have a 4 bit machine[!] and MAX_HASH is 3

   6 % 3 == 0, but 6 & 3 == 2
   7 % 3 == 1, but 7 & 3 == 3

Erm, if you want to use bit AND instead of modulo you have to use
different values (i.e. the modulo value us bit and value + 1):

8.times {|i| print i % 4, " ", i & 3, "\n"}

0 0
1 1
2 2
3 3
0 0
1 1
2 2
3 3

This in turn means that the modulo always must be a power of two and thus
is extremely unlikely to be prime. I would even go so far as to claim
that the likelyhood is smaller than every nonzero positive epsilon in the
real numbers that you can think of.

etc. The theory behind linear *congruential* random number
generators, (which is how we're stirring the hashes) is based on
primality in modulo arithmetic. I think if we destroy that
relationship things will be less random and we'll get more
collisions.

Well, sometimes you'll have to do tradeoffs. The Java implementation
doesn't even use and or modulo to limit the value (not needed with Java
int, they simply wrap). I'll trust the Sun guys to have ensured that the
way they do it is a reasonable tradeoff between randomness and speed.

Testing for randomness is tricky enough, so I'd not like to have to
prove this. But I have a suspicion that given the amount of work by
Knuth and others on the topic, if & could be used instead of % it
would be the recommended way to do it, because it is cheaper/faster.

Yes, if you want to use a prime you definitely have to use modulo (see
above). I was more on the Sun side, i.e. just limiting the value in order
to keep it a Fixnum.

Kind regards

robert

···

On Fri, 16 Sep 2005, Robert Klemme wrote:

Hugh_Sasse · 16 September 2005 17:11

Hugh Sasse wrote:

That code does only one modulus on the NHASH. Your code does it
for each

I wasn't following closely enough.

Someone talked about his brain behaving like cottage cheese today...

Looks like it still is then...

:-))

There is that.

         [...]

Should that be 0x7FFF_FFFF? (2147483647)
According to
Print Prime Numbers
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be
slightly more efficient.

I thought about this over lunch. I don't think you can use bit AND.

Suppose we have a 4 bit machine[!] and MAX_HASH is 3

   6 % 3 == 0, but 6 & 3 == 2
   7 % 3 == 1, but 7 & 3 == 3

Erm, if you want to use bit AND instead of modulo you have to use
different values (i.e. the modulo value us bit and value + 1):

8.times {|i| print i % 4, " ", i & 3, "\n"}

0 0
1 1
2 2
3 3
0 0
1 1
2 2
3 3

When I did that in my head I munged it then! (hence my comment
above).

This in turn means that the modulo always must be a power of two and thus
is extremely unlikely to be prime. I would even go so far as to claim
that the likelyhood is smaller than every nonzero positive epsilon in the
real numbers that you can think of.

But in this case (2 ** n) - 1 is prime, which is what we want. So I
think it should be 0x8000_0000 (2147483648) for the & version,
equivalent to 0x7FFF_FFFF for the % version.

etc. The theory behind linear *congruential* random number
generators, (which is how we're stirring the hashes) is based on
primality in modulo arithmetic. I think if we destroy that
relationship things will be less random and we'll get more
collisions.

Well, sometimes you'll have to do tradeoffs. The Java implementation
doesn't even use and or modulo to limit the value (not needed with Java
int, they simply wrap).

Ouch.

I'll trust the Sun guys to have ensured that the
way they do it is a reasonable tradeoff between randomness and speed.

Java was intended for small systems originally, then backwards
compatibility, ....

Testing for randomness is tricky enough, so I'd not like to have to
prove this. But I have a suspicion that given the amount of work by
Knuth and others on the topic, if & could be used instead of % it
would be the recommended way to do it, because it is cheaper/faster.

Yes, if you want to use a prime you definitely have to use modulo (see
above). I was more on the Sun side, i.e. just limiting the value in order
to keep it a Fixnum.

I suppose I may be aiming too much for perfection, but I wonder how
the & 0x8000_0000 version stacks up.

···

On Sat, 17 Sep 2005, Robert Klemme wrote:

On Fri, 16 Sep 2005, Robert Klemme wrote:

Kind regards

robert

Hugh_Sasse · 16 September 2005 17:39

Which, (having tested it) means my previous reply was complete
rubbish! I was confused again.

I don't think modulo need be that expensive, because it can be

    def modulo (mod)
      if self > mod
        modulo(self - mod)
      else
        self
      end
    end

when the number and modulus are positive. That may cost too much
chough.
Hugh

···

On Sat, 17 Sep 2005, Hugh Sasse wrote:

On Sat, 17 Sep 2005, Robert Klemme wrote:

Hugh Sasse wrote:

On Fri, 16 Sep 2005, Robert Klemme wrote:

Erm, if you want to use bit AND instead of modulo you have to use
different values (i.e. the modulo value us bit and value + 1):

When I did that in my head I munged it then! (hence my comment
above).

This in turn means that the modulo always must be a power of two and thus
is extremely unlikely to be prime. I would even go so far as to claim
that the likelyhood is smaller than every nonzero positive epsilon in the
real numbers that you can think of.

Topic		Replies	Views
Getting an array of unique objects of a class i wrote ruby-talk	5	147	16 September 2008
Duplicity in Set ruby-talk	2	82	11 April 2006
Newbie question about sets ruby-talk	7	79	16 January 2007
[?] problem using set with hash objects ruby-talk	2	109	18 September 2007
Set.new vs Set[] ruby-talk	0	72	17 October 2005

Sets, uniqueness not unique

Related topics