Sets are not extensional

Csaba_Henk4 · 5 August 2004 21:36

(FYI: the phenomena described in the sequel were seen produced by the following
ruby instances: a vanilla 1.8.1, a Debianish 1.8.2 and a cvs snapshot
from the 10th of May, identifying itself as 1.9.0.)

require 'set'
#the empty set:
Set[]
=> #<Set: {}>
#the singleton of the empty set:
Set[ [Set[]] ]
=> #<Set: {[#<Set: {}>]}>
#if sets were extensional as God ordered, this would give the same, but...
Set[ [Set[], Set[]] ]
=> #<Set: {[#<Set: {}>, #<Set: {}>]}>
#and also...
Set[Set[]].member? Set[]
=> false
[Set[], Set[]].uniq
=> [#<Set: {}>, #<Set: {}>]
Set[ [Set[]] ] - Set[ [Set[]] ]
=> #<Set: {[#<Set: {}>]}>
#anyway...
Set[] == Set[]
=> true
Set[ [Set[]] ] == Set[ [Set[]] ]
=> false

(Set.new behaves the same way as Set[])

Now I don't see what's the purpose of having a set class which is not
extensional (ie., two sets behave as the same one in all respect if their
elements are the same). But, given it's non-extensional, why isn't it
consequently non-extensional ("Set[] == Set[]" evaluates in an extensional
manner, that is, to "true")? And how does Array#uniq work? -- by what rule
does it dispose elements if not by getting "true" by a "==" comparison with
an other member of the array?

···

--
Csaba

"There's more to life, than books, you know but not much more..."
[The Smiths]
***
If you want to send me a mail, see it in the mailto link at
http://www.renyi.hu/~ekho/egyelore.html

Kristof_Bastiaensen1 · 6 August 2004 00:01

I think you raise a valid question. Set is implemented using hashtables,
and compares hashes instead of objects. A different hashtable gets a
different hash value, which is the reason for the strange(?) behaviour:

{}.hash == {}.hash
=> false
Set.hash == Set.hash
=> false
.hash == .hash
=> true

I have no idea for the reason why different hashes have different
hashvalues, while different arrays may get the same value (based on the
contents of the array).

KB

···

On Thu, 05 Aug 2004 21:32:55 +0000, Csaba Henk wrote:

Set[Set].member? Set
=> false
And how does Array#uniq work? -- by what rule does it dispose elements
if not by getting "true" by a "==" comparison with an other member of
the array?

Csaba_Henk4 · 6 August 2004 18:11

I've read the thread spawned by this post, and now I summarize that part of
what was written there which is relevant to my original problem, and wasn't
stated there explicitly.

"Set" can be forced to behave extensionally by doing so:

class Set
  def eql? other
    subset? other and other.subset? self
  end
  def hash
    to_a.hash
  end
end

-- as it seems to me, Array#uniq uses *both* of eql? and hash methods, and
to get rid of the (IMHO pathological) behaviour illustrated by the above
examples, both methods need to be overwritten. Concerning the above
re-implementation of these methods, what I did with "hash" is just a hack, of
course, I don't claim that it clean or right to do so, it just something
which does the job.

···

In article <slrnch59s7.77a.csaba@degas.ceu.hu>, Csaba Henk wrote:

require 'set'
#the empty set:
Set
=> #<Set: {}>
#the singleton of the empty set:
Set[ [Set] ]
=> #<Set: {[#<Set: {}>]}>
#if sets were extensional as God ordered, this would give the same, but...
Set[ [Set, Set] ]
=> #<Set: {[#<Set: {}>, #<Set: {}>]}>
#and also...
Set[Set].member? Set
=> false
[Set, Set].uniq
=> [#<Set: {}>, #<Set: {}>]
Set[ [Set] ] - Set[ [Set] ]
=> #<Set: {[#<Set: {}>]}>
#anyway...
Set == Set
=> true
Set[ [Set] ] == Set[ [Set] ]
=> false

(Set.new behaves the same way as Set)

Now I don't see what's the purpose of having a set class which is not
extensional (ie., two sets behave as the same one in all respect if their
elements are the same). But, given it's non-extensional, why isn't it
consequently non-extensional ("Set == Set" evaluates in an extensional
manner, that is, to "true")? And how does Array#uniq work? -- by what rule
does it dispose elements if not by getting "true" by a "==" comparison with
an other member of the array?

--
Csaba

"There's more to life, than books, you know but not much more..."
[The Smiths]
***
If you want to send me a mail, see it in the mailto link at
http://www.renyi.hu/~ekho/egyelore.html

George_Ogata1 · 6 August 2004 01:41

Kristof Bastiaensen <kristof@vleeuwen.org> writes:

{}.hash == {}.hash
=> false
Set.hash == Set.hash
=> false
.hash == .hash
=> true

I have no idea for the reason why different hashes have different
hashvalues, while different arrays may get the same value (based on the
contents of the array).

This seems like a blatant bug to me. `a.eql? b' must imply `a.hash ==
b.hash', yet:

irb(main):001:0> a = {}; b = {}
=> {}
irb(main):002:0> a.eql?(b)
=> true
irb(main):003:0> a.hash == b.hash
=> false

Kristof_Bastiaensen1 · 7 August 2004 13:11

Hi,

[snip]
I've read the thread spawned by this post, and now I summarize that part
of what was written there which is relevant to my original problem, and
wasn't stated there explicitly.

I think the thread was relevant to your post, since it described
what was causing the bug. The following should solve the problems
(but can be very slow for large Hashes/Sets):

class Array
  def hash
    to_a.hash
  end
end

Regards,
KB

···

On Fri, 06 Aug 2004 18:08:57 +0000, Csaba Henk wrote:

"Set" can be forced to behave extensionally by doing so:

class Set
  def eql? other
    subset? other and other.subset? self
  end
  def hash
    to_a.hash
  end
end

-- as it seems to me, Array#uniq uses *both* of eql? and hash methods, and
to get rid of the (IMHO pathological) behaviour illustrated by the above
examples, both methods need to be overwritten. Concerning the above
re-implementation of these methods, what I did with "hash" is just a hack,
of course, I don't claim that it clean or right to do so, it just
something which does the job.

Reimer_Behrends · 7 August 2004 16:51

[...]

"Set" can be forced to behave extensionally by doing so:

class Set
   def eql? other
     subset? other and other.subset? self
   end
   def hash
     to_a.hash
   end
end

Note that to_a.hash is effectively non-deterministic. If @hash == { 1 =>
true, 2 => true }, then to_a can be both [1, 2] and [2, 1], with
different hash values. To generate a correct hash function, you have to
use something like:

to_a.map{|el| el.hash}.sort.hash

or:

to_a.inject(0){|a,b| a ^ b.hash}

Reimer Behrends

···

Csaba Henk (csaba@phony_for_avoiding_spam.org) wrote:

Charles_Mills1 · 6 August 2004 02:34

Kristof Bastiaensen <kristof@vleeuwen.org> writes:

{}.hash == {}.hash
=> false
Set.hash == Set.hash
=> false
.hash == .hash
=> true

I have no idea for the reason why different hashes have different
hashvalues, while different arrays may get the same value (based on the
contents of the array).

This seems like a blatant bug to me. `a.eql? b' must imply `a.hash ==
b.hash', yet:

I agree:
> a = {} #=> {}
> a.hash #=> 1650134
> a.id #=> 1650134
> b = {} #=> {}
> b.hash #=> 1639814
# So hash on an empty hash returns the objects id....
> a = {"a"=>"b"} #=> {"a"=>"b"}
> a.hash #=> 1682094
> a.clear #=> {}
> a.hash #=> 1682094
# Looks like no two hashes hash to one another...
# Here is the reason
/* in object.c */
     rb_mKernel = rb_define_module("Kernel");
     rb_include_module(rb_cObject, rb_mKernel);
/* ..snip.. */
     rb_define_method(rb_mKernel, "object_id", rb_obj_id, 0);
/* in hash.c the "hash" method is never overridden. So rb_obj_id() is the "hash" method for hashes. */

-Charlie

···

On Aug 5, 2004, at 6:41 PM, George Ogata wrote:

Charles_Mills1 · 7 August 2004 15:16

Hi,

[snip]
I've read the thread spawned by this post, and now I summarize that part
of what was written there which is relevant to my original problem, and
wasn't stated there explicitly.

I think the thread was relevant to your post, since it described
what was causing the bug. The following should solve the problems
(but can be very slow for large Hashes/Sets):

class Array
  def hash
    to_a.hash
  end
end

I think you mean:

class Hash
   def hash
     to_a.hash
   end
end

-Charlie

···

On Aug 7, 2004, at 6:11 AM, Kristof Bastiaensen wrote:

On Fri, 06 Aug 2004 18:08:57 +0000, Csaba Henk wrote:

Regards,
KB

"Set" can be forced to behave extensionally by doing so:

class Set
  def eql? other
    subset? other and other.subset? self
  end
  def hash
    to_a.hash
  end
end

-- as it seems to me, Array#uniq uses *both* of eql? and hash methods, and
to get rid of the (IMHO pathological) behaviour illustrated by the above
examples, both methods need to be overwritten. Concerning the above
re-implementation of these methods, what I did with "hash" is just a hack,
of course, I don't claim that it clean or right to do so, it just
something which does the job.

Kristof_Bastiaensen1 · 7 August 2004 18:26

hi,

Note that to_a.hash is effectively non-deterministic. If @hash == { 1 =>
true, 2 => true }, then to_a can be both [1, 2] and [2, 1], with different
hash values.

As far as I know, hash.to_a always generates the same array.
The reason is that hashes are in fact unordered, so if you
give the same elements in a different order, it is still the
same Hash. That means that when hash1 == hash2, then
hash1.to_a == hash2.to_a. So fortunately there is no need to
use map or sort.

Regards,
KB

···

On Sat, 07 Aug 2004 16:45:22 +0000, Reimer Behrends wrote:

Robert · 6 August 2004 10:11

"Charles Mills" <cmills@freeshell.org> schrieb im Newsbeitrag
news:2B60D928-E751-11D8-AF77-000A95A27A10@freeshell.org...

> Kristof Bastiaensen <kristof@vleeuwen.org> writes:
>
>> {}.hash == {}.hash
>> => false
>> Set.hash == Set.hash
>> => false
>> .hash == .hash
>> => true
>>
>> I have no idea for the reason why different hashes have different
>> hashvalues, while different arrays may get the same value (based on
>> the
>> contents of the array).
>
> This seems like a blatant bug to me. `a.eql? b' must imply `a.hash ==
> b.hash', yet:

Yes, this issue has come up before IIRC. Maybe Matz did something about
this but it's not included in a release version.

I agree:
> a = {} #=> {}
> a.hash #=> 1650134
> a.id #=> 1650134
> b = {} #=> {}
> b.hash #=> 1639814
# So hash on an empty hash returns the objects id....
> a = {"a"=>"b"} #=> {"a"=>"b"}
> a.hash #=> 1682094
> a.clear #=> {}
> a.hash #=> 1682094
# Looks like no two hashes hash to one another...
# Here is the reason
/* in object.c */
     rb_mKernel = rb_define_module("Kernel");
     rb_include_module(rb_cObject, rb_mKernel);
/* ..snip.. */
     rb_define_method(rb_mKernel, "object_id", rb_obj_id, 0);
/* in hash.c the "hash" method is never overridden. So rb_obj_id() is
the "hash" method for hashes. */

Where do you take that conclusion from? From the source quotes you name,
there is no indication of a connection between #hash and #id. And you're
showing code for Kernel which is unrelated to this thread. The only
conclusion so far is, that Hash#hash is Object#hash (because it's not
overridden as you state) and thus can't take into account the contents of
the Hash. Can you please clarify this?

Kind regards

robert

···

On Aug 5, 2004, at 6:41 PM, George Ogata wrote:

Kristof_Bastiaensen1 · 7 August 2004 18:16

Hi,

[snip]
I've read the thread spawned by this post, and now I summarize that
part
of what was written there which is relevant to my original problem, and
wasn't stated there explicitly.

I think the thread was relevant to your post, since it described what
was causing the bug. The following should solve the problems (but can
be very slow for large Hashes/Sets):

class Array
  def hash
    to_a.hash
  end
end

I think you mean:

class Hash
   def hash
     to_a.hash
   end
end

-Charlie

I did. Thanks!

KB

···

On Sun, 08 Aug 2004 00:16:11 +0900, Charles Mills wrote:

On Aug 7, 2004, at 6:11 AM, Kristof Bastiaensen wrote:

On Fri, 06 Aug 2004 18:08:57 +0000, Csaba Henk wrote:

Regards,
KB

"Set" can be forced to behave extensionally by doing so:

class Set
  def eql? other
    subset? other and other.subset? self
  end
  def hash
    to_a.hash
  end
end

-- as it seems to me, Array#uniq uses *both* of eql? and hash methods,
and
to get rid of the (IMHO pathological) behaviour illustrated by the
above
examples, both methods need to be overwritten. Concerning the above
re-implementation of these methods, what I did with "hash" is just a
hack,
of course, I don't claim that it clean or right to do so, it just
something which does the job.

Reimer_Behrends · 8 August 2004 05:31

[...]

As far as I know, hash.to_a always generates the same array.

No. Hash#to_a will generate the same array for the same hash object,
but for different hash objects with equal keys and values, it will
not. Same for Hash#keys and Hash#values (Set#to_a is based on
Hash#keys).

$ cat hashtest.rb
h = { }; 20.times{|i| h[i] = true };p h.keys
h2 = { }; 20.times{|i| h2[19-i] = true };p h2.keys
puts h == h2

$ ruby -v hashtest.rb
ruby 1.8.1 (2003-12-25) [i686-linux]
[16, 5, 11, 0, 17, 6, 12, 1, 18, 7, 13, 2, 19, 8, 14, 3, 9, 15, 4, 10]
[5, 16, 0, 11, 6, 17, 1, 12, 7, 18, 2, 13, 8, 19, 3, 14, 9, 4, 15, 10]
true

Reimer Behrends

···

Kristof Bastiaensen (kristof@vleeuwen.org) wrote:

Florian_Gross · 6 August 2004 10:36

Moin!

The only
conclusion so far is, that Hash#hash is Object#hash (because it's not
overridden as you state) and thus can't take into account the contents of
the Hash. Can you please clarify this?

irb(main):001:0> {1 => 2}.method(:hash)
=> #<Method: Hash(Kernel)#hash>

So it is indeed not overwritten.

Kind regards
robert

More regards,
Florian Gross

Charles_Mills1 · 6 August 2004 14:08

"Charles Mills" <cmills@freeshell.org> schrieb im Newsbeitrag
news:2B60D928-E751-11D8-AF77-000A95A27A10@freeshell.org...

Kristof Bastiaensen <kristof@vleeuwen.org> writes:

{}.hash == {}.hash
=> false
Set.hash == Set.hash
=> false
.hash == .hash
=> true

I have no idea for the reason why different hashes have different
hashvalues, while different arrays may get the same value (based on
the
contents of the array).

This seems like a blatant bug to me. `a.eql? b' must imply `a.hash ==
b.hash', yet:

Yes, this issue has come up before IIRC. Maybe Matz did something about
this but it's not included in a release version.

I agree:

a = {} #=> {}
a.hash #=> 1650134
a.id #=> 1650134
b = {} #=> {}
b.hash #=> 1639814

# So hash on an empty hash returns the objects id....

a = {"a"=>"b"} #=> {"a"=>"b"}
a.hash #=> 1682094
a.clear #=> {}
a.hash #=> 1682094

# Looks like no two hashes hash to one another...
# Here is the reason
/* in object.c */
     rb_mKernel = rb_define_module("Kernel");
     rb_include_module(rb_cObject, rb_mKernel);
/* ..snip.. */
     rb_define_method(rb_mKernel, "object_id", rb_obj_id, 0);

/* I intended to show this line: */
rb_define_method(rb_mKernel, "hash", rb_obj_id, 0);

Anyway, so the code above shows class Object including module Kernel. Module Kernel defines "hash" as rb_obj_id(). So Object gets rb_obj_id() as its "hash" method.

/* in hash.c the "hash" method is never overridden. So rb_obj_id() is
the "hash" method for hashes. */

Where do you take that conclusion from? From the source quotes you name,
there is no indication of a connection between #hash and #id. And you're
showing code for Kernel which is unrelated to this thread. The only
conclusion so far is, that Hash#hash is Object#hash (because it's not
overridden as you state) and thus can't take into account the contents of
the Hash. Can you please clarify this?

Don't forget you can always look at the source yourself
-Charlie

···

On Aug 6, 2004, at 3:11 AM, Robert Klemme wrote:

On Aug 5, 2004, at 6:41 PM, George Ogata wrote:

Robert · 8 August 2004 08:36

"Kristof Bastiaensen" <kristof@vleeuwen.org> schrieb im Newsbeitrag
news:pan.2004.08.07.18.15.03.660477@vleeuwen.org...

>
>> Hi,
>>
>>> [snip]
>>> I've read the thread spawned by this post, and now I summarize that
>>> part
>>> of what was written there which is relevant to my original problem,

and

>>> wasn't stated there explicitly.
>>>
>>>
>> I think the thread was relevant to your post, since it described what
>> was causing the bug. The following should solve the problems (but can
>> be very slow for large Hashes/Sets):
>>
>> class Array
>> def hash
>> to_a.hash
>> end
>> end
>>
> I think you mean:
>
> class Hash
> def hash
> to_a.hash
> end
> end
>
> -Charlie
>

I did. Thanks!

IMHO a solution that recalculates the hash value on each change is more
efficient. I imagine something like this (incomplete):

class FixedHash < Hash
  def initialize(*a,&b)
    super
    calc_hash
  end

  def =(key,val)
    @hc ^= self[key].hash << 3 if has_key? key
    @hc ^= val.hash << 3
    super
  end

  def clear
    super
    calc_hash
  end

# all other modifying methods must be
# changed, too

def hash() @hc end

private

HASH_MASK = (1 << 32) - 1

  def calc_hash
    @hc = 0
    each {|k,v| @hc ^= (k.hash ^ (v.hash << 3)) }
    @hc &= HASH_MASK
  end
end

h = FixedHash.new

=> {}

h2 = FixedHash.new

=> {}

7.times{|i| h[i] = true; h2[6-i] = true }

=> 7

h

=> {5=>true, 0=>true, 6=>true, 1=>true, 2=>true, 3=>true, 4=>true}

h2

=> {5=>true, 0=>true, 6=>true, 1=>true, 2=>true, 3=>true, 4=>true}

h == h2

=> true

h.sort

=> [[0, true], [1, true], [2, true], [3, true], [4, true], [5, true], [6,
true]]

h2.sort

=> [[0, true], [1, true], [2, true], [3, true], [4, true], [5, true], [6,
true]]

h.hash

=> 16

h2.hash

=> 16

Kind regards

robert

···

On Sun, 08 Aug 2004 00:16:11 +0900, Charles Mills wrote:
> On Aug 7, 2004, at 6:11 AM, Kristof Bastiaensen wrote:
>> On Fri, 06 Aug 2004 18:08:57 +0000, Csaba Henk wrote:

Kristof_Bastiaensen1 · 8 August 2004 09:56

You are correct. I had tested it with small Hashes ({1 => 2, 2 => 1}),
but of course in large Hashes objects which fall in the same bucket can
appear on different positions as you said. Perhaps the best solution is
to recalculate the Hashvalue each time (as Robert Klemme proposed).

Thanks,
KB

···

On Sun, 08 Aug 2004 05:26:15 +0000, Reimer Behrends wrote:

Kristof Bastiaensen (kristof@vleeuwen.org) wrote: [...]

As far as I know, hash.to_a always generates the same array.

No. Hash#to_a will generate the same array for the same hash object, but
for different hash objects with equal keys and values, it will not. Same
for Hash#keys and Hash#values (Set#to_a is based on Hash#keys).

<snip>

Robert · 6 August 2004 12:51

"Florian Gross" <flgr@ccan.de> schrieb im Newsbeitrag
news:2nh57hFqhe5U1@uni-berlin.de...

Moin!

> The only
> conclusion so far is, that Hash#hash is Object#hash (because it's not
> overridden as you state) and thus can't take into account the contents

of

> the Hash. Can you please clarify this?

irb(main):001:0> {1 => 2}.method(:hash)
=> #<Method: Hash(Kernel)#hash>

So it is indeed not overwritten.

Thanks for confirming. But I was after something differnt: The statement
I was referring to is

/* in hash.c the "hash" method is never overridden. So rb_obj_id() is
the "hash" method for hashes. */

I can't see the connection between hash and id - at least not in the code
snippets posted.

Regards

robert

Robert · 6 August 2004 14:36

"Charles Mills" <cmills@freeshell.org> schrieb im Newsbeitrag
news:1A5A54EE-E7B2-11D8-9E07-000A95A27A10@freeshell.org...

>
> "Charles Mills" <cmills@freeshell.org> schrieb im Newsbeitrag
> news:2B60D928-E751-11D8-AF77-000A95A27A10@freeshell.org...
>>
>>
>>> Kristof Bastiaensen <kristof@vleeuwen.org> writes:
>>>
>>>> {}.hash == {}.hash
>>>> => false
>>>> Set.hash == Set.hash
>>>> => false
>>>> .hash == .hash
>>>> => true
>>>>
>>>> I have no idea for the reason why different hashes have different
>>>> hashvalues, while different arrays may get the same value (based on
>>>> the
>>>> contents of the array).
>>>
>>> This seems like a blatant bug to me. `a.eql? b' must imply `a.hash
>>> ==
>>> b.hash', yet:
>
> Yes, this issue has come up before IIRC. Maybe Matz did something
> about
> this but it's not included in a release version.
>
>> I agree:
>>> a = {} #=> {}
>>> a.hash #=> 1650134
>>> a.id #=> 1650134
>>> b = {} #=> {}
>>> b.hash #=> 1639814
>> # So hash on an empty hash returns the objects id....
>>> a = {"a"=>"b"} #=> {"a"=>"b"}
>>> a.hash #=> 1682094
>>> a.clear #=> {}
>>> a.hash #=> 1682094
>> # Looks like no two hashes hash to one another...
>> # Here is the reason
>> /* in object.c */
>> rb_mKernel = rb_define_module("Kernel");
>> rb_include_module(rb_cObject, rb_mKernel);
>> /* ..snip.. */
>> rb_define_method(rb_mKernel, "object_id", rb_obj_id, 0);

/* I intended to show this line: */
rb_define_method(rb_mKernel, "hash", rb_obj_id, 0);

Anyway, so the code above shows class Object including module Kernel.
Module Kernel defines "hash" as rb_obj_id(). So Object gets
rb_obj_id() as its "hash" method.

Ah, ok. A reasonable choice btw.

>> /* in hash.c the "hash" method is never overridden. So rb_obj_id()

is

>> the "hash" method for hashes. */
>
> Where do you take that conclusion from? From the source quotes you
> name,
> there is no indication of a connection between #hash and #id. And
> you're
> showing code for Kernel which is unrelated to this thread. The only
> conclusion so far is, that Hash#hash is Object#hash (because it's not
> overridden as you state) and thus can't take into account the contents
> of
> the Hash. Can you please clarify this?

Don't forget you can always look at the source yourself

Oh, really? I do not have them here and was lacking a bit of time and
enthusiams. So I thought I convice you to post this yourself.

Kind regards

robert

···

On Aug 6, 2004, at 3:11 AM, Robert Klemme wrote:
>> On Aug 5, 2004, at 6:41 PM, George Ogata wrote:

Topic		Replies	Views
Sane #hash implementation? ruby-talk	17	103	4 February 2007
Sets, uniqueness not unique ruby-talk	28	117	16 September 2005
Whats so different about a Hash? ruby-talk	14	136	5 May 2005
Set doesn't have [] instance method ruby-talk	14	111	30 July 2003
Set#== ain't ruby-talk	3	69	22 September 2005

Sets are not extensional

Related topics