Sane #hash implementation?

I have a Set subclass, Block. I need two blocks to be considered the
same by Set iff they have the same elements. From what I gathered, Set
treats two objects as equal when they are eql? and have the same hash.

Implementing Block#eql? seems to be trivial, as Set#== is there to help:

class Block < Set
  def eql? other
    self == other
  end
end

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

What’s the proper way to go at it? Adding the elements’ hashes together
is obviously wrong, as four different elements might produce the same
sum when paired; also, the docs say, ‘any hash value that exceeds the
capacity of a Fixnum will be truncated before being used,’ so hash can’t
exceed the (word - 1 bit) size.

-- Shot

···

--
Hell hath no fury like a bureaucrat scorned. -- Milton Friedman

There's nothing about hash codes that requires that
  a.hash != b.hash if a != b
You want to avoid collisions as much as possible, but
that's an issue of performance, not correctness, so you'd be perfectly
correct in adding up the hash codes of the members to create the has code.

Nor is there anything that requires you to keep your hash code within the
capacity of a Fixnum when you generate it. That's why it's automatically
truncated for you. The only reason you need to know is so that your hash
function doesn't hash things in such a way that the truncated number has
lots of collisions. (Once again a performance issue).

You may want to look at .:: General Purpose Hash Function Algorithms - By Arash Partow ::.
which discusses some relatively important hash functions. You may have to
experiment to see what works best. Wikipedia may also have good
information.

--Ken

···

On Tue, 30 Jan 2007 07:38:20 +0900, Shot (Piotr Szotkowski) wrote:

I have a Set subclass, Block. I need two blocks to be considered the
same by Set iff they have the same elements. From what I gathered, Set
treats two objects as equal when they are eql? and have the same hash.

Implementing Block#eql? seems to be trivial, as Set#== is there to help:

class Block < Set
  def eql? other
    self == other
  end
end

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

What’s the proper way to go at it? Adding the elements’ hashes together
is obviously wrong, as four different elements might produce the same
sum when paired; also, the docs say, ‘any hash value that exceeds the
capacity of a Fixnum will be truncated before being used,’ so hash can’t
exceed the (word - 1 bit) size.

-- Shot

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

Try:

class Block
   def hash
     to_a.hash
   end
end

···

On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:

I have a Set subclass, Block. I need two blocks to be considered the
same by Set iff they have the same elements. From what I gathered, Set
treats two objects as equal when they are eql? and have the same hash.

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

Ken Bloom:

There's nothing about hash codes that requires that
  a.hash != b.hash if a != b
You want to avoid collisions as much as possible, but that's an issue
of performance, not correctness, so you'd be perfectly correct in
adding up the hash codes of the members to create the has code.

True, but I do want to be able to key Hashes with my Blocks, so I do
want to avoid collisions as much as possible. I should’ve tested the
add-up-the-members’-hashes approach before/instead of crossing it out,
though.

You may want to look at
.:: General Purpose Hash Function Algorithms - By Arash Partow ::.
which discusses some relatively important hash functions.
You may have to experiment to see what works best. Wikipedia
may also have good information.

Thanks for the pointers! I knew about hash functions before, I just
wasn’t sure whether there’s an idiomatic/popular/proper Ruby approach
to writing #hash.

-- Shot

···

--
I am a computer. I am dumber than any human and smarter than any administrator.

Eric Hodel:

···

On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

Try:

class Block
  def hash
    to_a.hash
  end
end

Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :slight_smile:

-- Shot
--
Of course I can see iso-8859-1 characters, I'm French. -- Christian Marillat

You may want to define #eql? in terms of #to_a as well...

···

On Jan 29, 2007, at 23:26, Shot (Piotr Szotkowski) wrote:

Eric Hodel:

On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

Try:

class Block
  def hash
    to_a.hash
  end
end

Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :slight_smile:

Shot (Piotr Szotkowski) schrieb:

Eric Hodel:

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

Try:

class Block
  def hash
    to_a.hash
  end
end

Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :slight_smile:

Shot, you should be aware that this works for your example given above, but not for the following:

   Block.new([1,12]).hash # => 57
   Block.new([12,1]).hash # => 23

If this is a problem, you have to change the implementation to

   class Block
     def hash
       sort.hash
     end
   end

Regards,
Pit

···

On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:

[I'm resending this since the ML server is dropping some of my messages to
this thread for some reason.]

Eric Hodel:

>> I'm stuck when it comes to Block#hash, though; I need these to be true:
>> Block.new.hash == Block.new.hash
>> Block.new([1,2]).hash == Block.new([1,2]).hash

> Try:

> class Block
> def hash
> to_a.hash
> end
> end

Thanks a lot, Eric! This is the d'oh! solution I was looking for. :slight_smile:

It's not a correct one though.

  require 'set'
  a = Set.new(0..1000)
  b = Set.new((0..1000).sort_by{rand})
  a == b # => true
  a.to_a.hash # => 31141761
  b.to_a.hash # => 374826672

(The order of the elements in Hash#to_a can change.)

Try this, it's O(n ln n) but actually works

  require 'set'
  class Block < Set
    def hash
      map{|x| x.hash}.sort.hash
    end
  end
  a = Block.new(0..1000)
  b = Block.new((0..1000).sort_by{rand})
  a == b # => true
  a.hash # => 944434529
  b.hash # => 944434529

···

On Tue, Jan 30, 2007 at 04:26:18PM +0900, Shot (Piotr Szotkowski) wrote:

> On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:

--
Mauricio Fernandez - http://eigenclass.org - singular Ruby

That would fail for the same reason as your Block#hash.

···

On Tue, Jan 30, 2007 at 05:10:51PM +0900, Eric Hodel wrote:

On Jan 29, 2007, at 23:26, Shot (Piotr Szotkowski) wrote:
>Eric Hodel:
>>On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:
>>>I’m stuck when it comes to Block#hash, though; I need these to be
>>>true:
>>>Block.new.hash == Block.new.hash
>>>Block.new([1,2]).hash == Block.new([1,2]).hash
>
>>Try:
>
>>class Block
>> def hash
>> to_a.hash
>> end
>>end
>
>Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for.
>:)

You may want to define #eql? in terms of #to_a as well...

--
Mauricio Fernandez - http://eigenclass.org - singular Ruby

Eric Hodel:

You may want to define #eql? in terms of #to_a as well...

Thanks, but (considering Mauricio Fernandez’s fix) what would be the
benefit of defining #eql? in terms of Set#sort instead of Set#==?

Te docs say on Set#==, ‘Returns true if two sets are equal. The equality
of each couple of elements is defined according to Object#eql?.’

-- Shot

···

--
One of the advantages of being disorderly is that one is
constantly making exciting discoveries. -- A. A. Milne

Mauricio Fernandez:

(The order of the elements in Hash#to_a can change.)

Thanks for pointing this out, I fixed my specs and my code.

Try this, it's O(n ln n) but actually works

Ok, I guess I have a question on what does exactly Benchmark results
mean. Doesn’t the below mean hash_map_sort is actually a bit slower than
hash_sort?

Also, to ask two basic, follow-up questions: which Benchmark time is
the most significant one and is it more-or-less independent from other
system load? (I guess the two runs wouldn’t differ as much any of the
times was other-load-independent…)

$ cat block_hash.rb
require 'benchmark'
require 'set'

class Block < Set
  def hash_sort
    sort.hash
  end
  def hash_map_sort
    map {|a| a.hash}.sort.hash
  end
end

srand 1979
array = (0..1_000_000).sort_by {rand}

Benchmark.bmbm do |b|
  b.report 'hash_sort' do
    Block.new(array).hash_sort
  end
  b.report 'hash_map_sort' do
    Block.new(array).hash_map_sort
  end
end

$ ruby block_hash.rb
Rehearsal -------------------------------------------------
hash_sort 8.000000 1.366667 9.366667 ( 6.208513)
hash_map_sort 9.750000 1.500000 11.250000 ( 7.179749)
--------------------------------------- total: 20.616667sec

                    user system total real
hash_sort 9.666667 1.100000 10.766667 ( 6.677047)
hash_map_sort 9.566667 1.550000 11.116667 ( 6.809388)

$ ruby block_hash.rb
Rehearsal -------------------------------------------------
hash_sort 8.133333 1.316667 9.450000 ( 5.746571)
hash_map_sort 9.650000 1.766667 11.416667 ( 7.065570)
--------------------------------------- total: 20.866667sec

                    user system total real
hash_sort 9.316667 1.100000 10.416667 ( 6.684088)
hash_map_sort 9.450000 1.566667 11.016667 ( 6.796299)

-- Shot

···

--
It is a good morning exercise for a research scientist
to discard a pet hypothesis every day before breakfast.
It keeps him young. -- Konrad Lorenz

"None". You cannot just
  def eql?(o); sort == o.sort end
either, since there's no guarantee there's a total order associated to the
set (it seems my reply to Pit where I pointed this out was dropped too):

  require 'set'
  Set.new(['a', 1]).sort # =>
  # ~> -:2:in `sort': comparison of String with 1 failed (ArgumentError)
  # ~> from -:2

This is why the hash implementation I gave you does
  map{|x| x.hash}.sort.hash # sorting hash values, i.e. Integers
instead of
  sort.map{|x| x.hash}.hash

Even when the elements can be ordered, sort would be O(n ln n) vs. #=='s O(n)

Of course, the former is a "fast NlnN", being written in C, so it would seem
that it can keep the speed advantage up to large values of N.

However, #sort will process the whole array before it can start comparing
values, and #== can begin right away. If there are many differing elements,
the probability of finding one of them in the first comparisons is very high.

Somewhat surprisingly, #== can beat #sort even in the most favorable case for
the latter (only one differing element, the first in the sorted array but not
in Hash#to_a, small sets):

  require 'benchmark'
  require 'set'
  size = 64
  elm = -10
  elm2 = -2
  s1 = Set.new((0..size).to_a << -1)
  s2 = Set.new((0..size).to_a << elm)

  GC.disable # don't let this interfere
  def bm(s1, s2, elm, iterations = 5000)
      puts "#== will stop after #{1 + s2.to_a.index(elm)} .include? tests"
      Benchmark.bmbm(10) do |bm|
        bm.report("Set#sort") { iterations.times{ s1.sort == s2.sort } }
        bm.report("Set#==") { iterations.times{ s1 == s2 } }
      end
  end

  bm(s1, s2, elm)
  puts "-" * 40
  bm(s1, Set.new((0..size).to_a << elm2), elm2)

  # >> #== will stop after 43 .include? tests
  # >> Rehearsal ---------------------------------------------
  # >> Set#sort 0.700000 0.050000 0.750000 ( 0.768818)
  # >> Set#== 0.370000 0.010000 0.380000 ( 0.381117)
  # >> ------------------------------------ total: 1.130000sec
  # >>
  # >> user system total real
  # >> Set#sort 0.680000 0.060000 0.740000 ( 0.736052)
  # >> Set#== 0.380000 0.010000 0.390000 ( 0.380995)
  # >> ----------------------------------------
  # >> #== will stop after 7 .include? tests
  # >> Rehearsal ---------------------------------------------
  # >> Set#sort 0.620000 0.010000 0.630000 ( 0.637358)
  # >> Set#== 0.170000 0.000000 0.170000 ( 0.175028)
  # >> ------------------------------------ total: 0.800000sec
  # >>
  # >> user system total real
  # >> Set#sort 0.750000 0.050000 0.800000 ( 0.807318)
  # >> Set#== 0.170000 0.010000 0.180000 ( 0.176633)

···

On Wed, Jan 31, 2007 at 06:37:14PM +0900, Shot (Piotr Szotkowski) wrote:

Eric Hodel:

> You may want to define #eql? in terms of #to_a as well...

Thanks, but (considering Mauricio Fernandez’s fix) what would be the
benefit of defining #eql? in terms of Set#sort instead of Set#==?

--
Mauricio Fernandez - http://eigenclass.org - singular Ruby

Shot (Piotr Szotkowski) schrieb:

Ok, I guess I have a question on what does exactly Benchmark results
mean. Doesn’t the below mean hash_map_sort is actually a bit slower than
hash_sort?

This is what I would expect. The advantage of Mauricio's solution is that you can use it even if the Set elements are not comparable to each other. My proposal was meant for your use case with integers as Set elements.

srand 1979
array = (0..1_000_000).sort_by {rand}

Benchmark.bmbm do |b|
  b.report 'hash_sort' do
    Block.new(array).hash_sort
  end
  b.report 'hash_map_sort' do
    Block.new(array).hash_map_sort
  end
end

With this benchmark, you also measure the time needed to construct the Block instances. If you want to compare the two #hash methods, you should change your benchmark to something like:

   block = Block.new(array)

   Benchmark.bmbm do |b|
     b.report 'hash_sort' do
       block.hash_sort
     end
     b.report 'hash_map_sort' do
       block.hash_map_sort
     end
   end

From your benchmark, you can see that it takes a lot of time to compute the hash values. Will your sets contain 1_000_000 elements? How often will they change? How are they created? Maybe you need to cache the hash values or implement a different hash algorithm.

Regards,
Pit

Mauricio Fernandez:

what would be the benefit of defining #eql?
in terms of Set#sort instead of Set#==?

"None". You cannot just
  def eql?(o); sort == o.sort end
either

But I can

class Block < Set
  def eql? other
    self == other
  end
end

can’t I? Set#== says that ‘the equality of each couple of elements is
defined according to Object#eql?’ and that, coupled with Block#hash,
is enough for me to create a set of blocks acting the way I need it.

Note: I need a set of blocks to consider two blocks the same if the
blocks contain the same integers, i.e., I need two such blocks to be
eql? and have the same hashes; the above seems to do it for me quite
nicely…

(Thaks a lot for your detailed explanation snipped here,
I appreciate it very much and learned a lot from it!)

  GC.disable # don't let this interfere

I’m learning new things every day, this one will come in very handy.

-- Shot

···

On Wed, Jan 31, 2007 at 06:37:14PM +0900, Shot (Piotr Szotkowski) wrote:

--
My favourite was some time ago, and involved a female customer thanking
"Mr. Daemon" for his effort trying to deliver her mail, and offering
him a "good time" if he ever visited Sydney. -- Matt McLeod

Pit Capitain:

The advantage of Mauricio's solution is that you can use it even if
the Set elements are not comparable to each other. My proposal was
meant for your use case with integers as Set elements.

Ah, right. Still, blocks will contain integers only, so
Array#sort-based Block#hash and Set#==-based Block#eql?
should suffice.

With this benchmark, you also measure the time needed to construct
the Block instances. If you want to compare the two #hash methods,
you should change your benchmark to something like:

D’oh! Why did I think moving the array creation
out was the most I can get to unbias the benchmark?

From your benchmark, you can see that it takes a lot of time to
compute the hash values. Will your sets contain 1_000_000 elements?
How often will they change? How are they created? Maybe you need to
cache the hash values or implement a different hash algorithm.

I don’t know, yet. I was just curious whether the elements’-hash-based
hash algorithm would be any faster than the just-sort-based one (given
that from a quick glance it seemed to incorporate additional work); now
I understand its advantage lies elsewhere.

Thanks a lot for your input!

-- Shot

···

--
In ObjectSpace, nobody can hear you quack. -- Luc Heinrich, ruby-talk

Hello,

is my news reader (Gnus v5.9.0) broken? I'm just trying to explain why
I see things like that:

"Shot (Piotr Szotkowski)" <shot@hot.pl> wrote/schrieb <20070131151238.GH3332@durance.shot.pl>:

--jE+K4o1MICf3EEet
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[..]
>> in terms of Set#sort instead of Set#=3D=3D?

                                          ^^^^^

> "None". You cannot just =20

                            ^^^
Regards
  Thomas

Thomas Hafner:

is my news reader (Gnus v5.9.0) broken? I'm just
trying to explain why I see things like that:

Content-Transfer-Encoding: quoted-printable

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

in terms of Set#sort instead of Set#=3D=3D?

                                        ^^^^^

"None". You cannot just =20

                          ^^^

It looks like Gnus 5.9 doesn’t handle quoted-printable.

A quick googling turned up
http://www.splode.com/~friedman/software/emacs-lisp/src/qpdecode.el

-- Shot

···

--
A funny symbol that I can't read has just been input. Continue,
and I'll forget that it ever happened. -- TeX

"Shot (Piotr Szotkowski)" <shot@hot.pl> wrote/schrieb <20070203234943.GX3332@durance.shot.pl>:

It looks like Gnus 5.9 doesn=E2=80=99t handle quoted-printable.

Strange that it does handle it most of the time, but not always. I
never observed the problem with e-mails (Gnus is a mail client, too),
e.g.:

Header =
  ...
  Content-Type: multipart/mixed; boundary="=-=-="

First part =

···

--=-=-=
  Content-Type: text/plain; charset=utf-8
  Content-Transfer-Encoding: quoted-printable
  ...

=> Ok.

Only some news articles have disturbed my Gnus so far.

Thanks
  Thomas