A use case for an ordered hash

(thoran@thoran.com) #1

Hi,

I switched from 'my' thread because this seems more appropriate to this thread...

Previously "Re: Why Does Hash Apparently Reorder Its Internal Representation And Other Associated Ponderings"

Or, Is It Possible To Have One's Cake And Eat It Too?

First, a taxa of the different sorts of beasts of which we are speaking might help:

a. associative array
   i. unordered
   ii. insertion order-ordered
   iii. sort function-ordered
     1. manual (adhoc)
     2. automatic (upon insertion)

b. hash
   i. unordered
   ii. insertion order-ordered
   iii. sort function-ordered
     1. manual (adhoc)
     2. automatic (upon insertion)

Does this cover it or is there a redundancy or something missing? For instance are a.i. and a.ii. really the same thing by virtue of there not being a hashing function? Does a.i. make any sense and can be done away with from the get-go? And are a.ii. and b.ii, and a.iii. and b.iii. really the same thing also? Not if the hash still retained the hash as you would it expect it to.

And is there a term which could reasonably cover all of a. and b.? I might have used the term 'hash' to refer to anything which is a key-value pair, but this is apparently confusing or inaccurate.

I tried set, but that's wrong. I tried enumerable, but that's wrong too.

I propose that the term, list, be used to cover anything which would be able to mix in Enumerable and that the constituent parts of a list are called items, rather than an element, which is a set/array term and not appropriate for a hash or associative array. Still that doesn't give a name for associative lists. I love that when it just comes out like that... AssociativeList it is then!

And so are these three sorts of data structures (array, associative array, and hash) sufficiently different to warrant none subclassing from the other? If so, then it would also probably warrant a syntax addition and a core implementation (as discussed). And they are presumably sufficiently similar that it might make sense to subclass all of them from the same base class; list for instance, and then mixin Enumerable there instead. Although one may as well just put the methods in directly to List then however?

There have been concerns that a sort function-ordered or an insertion-ordered, or just ordered hash will cause slowdowns and memory bloat, so I was wondering if it was possible to adjust according to need. So, thank you to Charles, as I am reminded of what I did in compsci all that time ago!

Now given that it is possible or desirable to have a Hash redefine the number of buckets, or rehash, is there any reason why it can't also reassign its internal type at a deeper level depending on access?

Thought experiment time...

What if all Hashes were able to be ordered, but rehashable for storage type depending on how it is instantiated and how it is accessed? One could specify storage organisation at instantiation or later, or allow it to decide for itself what sort of hash it will be.

m = [:a => 'a'] # m is for map. I could go for this term too as it has brevity.
=> [:a => 'a']

m[0]
=> [:a => 'a'] # m remains as an associative array.

m[:a]
=> {:a => 'a'} # m becomes an explicit hash.

m[0]
=> {:a => 'a'} # m becomes an associative array.

m[:a]
=> {:a => 'a'} # m becomes an explicit hash.

m[[0]]
=> [:a => 'a'] # m becomes an associative array.

{} = m, or
m {}= m, or
m.to_h
=> {:a => 'a'}

This would be read as "make m equal to itself if it is already a hash, else make it so". Similarly, going the other way:

[]= m, or
m []= m, or
m.to_assoc
=> [:a => 'a']

Then perhaps one might like to also do something like this?...

n = [:b => 'b']
=> [:b => 'b']

o {}= n
=> {:b => 'b'}

Useful? Anyhow, that was an aside.

Now, why not take this further?... What about a generalized list? As in:

l = [:a => 'a', :b, {:c => 'c'}] # which is really, or at least appears to be as if it were: {0 => [:a => 'a'], 1 => :b, 2 => {:c => 'c'}}

The method of access determines how it is to be treated...

l{0}
=> {:a => 'a'} # or should that be {0 => [:a => 'a']}?

l[[0]]
=> 'a'

l[0]
=> [:a => 'a']

l{1}
=> {1 => :b} # or should that be {1 => {:b => nil}}?

l[1]
=> [1 => :b] # or should that simply be :b?

l[[1]]
=> :b

l[2]
=> {:c => 'c'}

l[[2]]
=> 'c'

l{2}
=> {2 => {:c => 'c'}} # or should that be {:c => 'c'}?

Having only lists might then do away with the literal instantiation of an empty AssociativeArray problem (assuming a double bracket notation):

a = [[]] # Is this a nested array or an associative array?
=> [[]]

list = []
array = [] # Really a list.
hash = {} # Explicit hash list sub-type declaration.

list = {:a => 'a'}
=> {:a => 'a'}

By assigning a non key-value pair to the hash it becomes a more general list internally and this is represented by the form of notation, from {} to [].

list[1] = 'b'
=> [:a => 'a', 'b']

In a list the numeric value of an item is always present, so an assignment which uses a numeric index will assume that it is a simple placement at the position given by the numeric index of whatever object is presented to the list.

An assignment which treats this as an associative array will work the same as if it were an assignment without:

list[[2]] = 'c'
=> [:a => 'a', 'b', 'c']

Another example might be good:

list[3] = {:d => 'd'}
=> [:a => 'a', 'b', 'c', {:d => 'd'}]

list[3]
=> {:d => 'd'}

list[[4]] = {:e => 'e'} # Allowable?
=> [:a => 'a', 'b', 'c', {:d => 'd'}, {:e => 'e'}]

list[4]
=> {:e => 'e'}

list[[4]]
=> 'e'

However,

list[0]
=> [:a => 'a']

list[[0]]
=> 'a'

This is all of the top of my head, so it probably needs some work...

And as for a dynamically altering Hash, List also could change its internal organisation on the fly depending on need. Such that if all elements are homogenous, then the internal organisation would be optimised for that homogeneity. If that changes to be heterogenous, then a more general and less optimised form of internal organisation can be chosen dynamically to accomodate.

I realise the overhead if types and methods of access changed all the time, but does it?

t

P.S. The irony of my apparently having no thoughts about language design is choking me.

(Matthew Johnson) #2

I read the post regarding a taxonomy of hash-like data structures with interest. It is something I have been thinking about a bit lately (as well as the more general subject of all collections). It may be appropriate to back up a step and consider the idea I mentioned in another thread regarding the naming of concepts. Doing that will allow more powerful duck typing in the design.

Ruby already defines Enumerable and Comparable. By depending upon #each Enumerable the requirements / concept that must be met by classes that include it. To keep things simple we can call the concpet Enumerable as well. Here are some other potential concepts that may fit nicely in Ruby (please focus more on the definition / duck typing than on the names):

Integral
  supports to_int
MutableEnumerable
  depends on each, add, remove
Hashable
  depends up hash
Associative
  depends on fetch, store, possibly remove, each key, and each value
Unique
  depends a single value per key, duplicate keys could be either rejected or overwritten
Multi Valued
  depends upon values being Enumerable
Sequencable
  depends on each, insert, remove, and possibly fetch and store
Sorted / Ordered
  see below

This allows us to define mixins which depend upon these operations or properties. For example, a MutableEnumerable mixin can be defined that implements set operations, compact, clear, remove_if, etc. An Sequencable mixin can be defined that implements all sorts of operations such as append, concat, splice, sort, etc. The dependency of a Sorted / Ordered mixin would be a bit different. It would depend upon a property of the a class mixing it in rather than operations. For example it could define merge and binary search operations.

A Pair concept depending upon first and second may also be desirable. It allows things such as implementing each for Associative that takes a single parameter.

It is clear that there are some relationships between the various concepts. For example, some classes implementing Associative behavior are likely to depend upon keys that implement Hashable behavior and Sequencable classes are also Associative if we assume Integral keys. Also, in some cases a user may know about the state of a particular object that would allow them to leverage a mixin that otherwise wouldn't apply to a class. For example, extending an instance of Array with Sorted if the array is known to be sorted.

Now returning to the thread at hand we can see that the difference between the associative array and hash hierarchies is that the hash hierarchy depends upon Hashable keys (please forgive me if this is stating the obvious). Of course they would also have different implementations of primitive methods. However, many higher level operations that depend upon those primitives would be the same and could indeed be implemented in mixins that expect these primitive operations. The concrete implementations would also be welcome to implement optimizations and additional parameter to higher level operations that are not possible without knowledge of the implementation of the data structure, but this is a bonus and is not strictly necessary. By clearly defining small groups of *naturally* cohesive primitive operations and naming the group (the concepts) we can gain the ability to create concrete implementations with a variety of properties that all support the rich higher level operations that can be supported by their primitives for free (almost for free - the mixins do each need to be implemented once, but will then be available to all concrete implementations support the operations the depend upon). We are now exploiting the power of combinations rather than letting it run away from us.

The properties I see being important are:
Associativity
  Integral (ex. Array), Hashable (ex. Hash), any object (ex. associative array)
Ordering
  insertion order, key order (by <=>), value order (by <=>), any sort function
Quantity
  any value, unique valued, multi-valued
Sequencability
Mutability
  probably less important in Ruby, but immutable variants may be desirable for various reasons

The taxonomy in terms of the concepts (please keep in mind that there are probably better names for the concepts than the names I am using):

a. associative array
  i. unordered

  Associative, MutableEnumerable, Sequencable?

  ii. insertion order-ordered

  Associative, MutableEnumerable, Sorted

  iii. sort function-ordered
    1. manual (adhoc)

  Associative, MutableEnumerable, Sequencable

    2. automatic (upon insertion)

  Associative, MutableEnumerable, Sorted

b. hash
  i. unordered

  Associative, MutableEnumerable, Hashable elements

  ii. insertion order-ordered

  Associative, MutableEnumerable, Sorted, Hashable elements

  iii. sort function-ordered
    1. manual (adhoc)

  Associative, MutableEnumerable, Sequencable, Hashable elements

    2. automatic (upon insertion)

  Associative, MutableEnumerable, Sorted, Hashable elements

As we can see, sequencability was not explicitly mentioned in the previous post, although it was implied by the ad-hoc sort. In order to be ad-hoc sortable by an arbitrary sort function the ability to re-order elements must be supported. Once we have this we actually have much more general capabilities of arbitrarily sequencing elements whether the sequence is sorted or not. We can also see that automatic ordering by key, value, and automatic ordering of arrays may also be useful. We can generate further types of collection structures based on a unique value per key or enumerable values in addition to the default of an arbitrary quantity of values per key.

For instance are a.i. and a.ii. really the same thing by virtue of there not being a hashing function?

I would consider a.i to be most likely insertion ordered by default (when created with literal syntax for example), but should proabably be Sequencable and thus the ordering may not always be maintained. Also, insert and other such operations may be supported thus allowing items to be inserted without being "insertion sorted" (in terms of the actual chronological insertion order).

  Does a.i. make any sense and can be done away with from the get-go?

It is definitely an independent type of data structure with its own use cases.

And are a.ii. and b.ii, and a.iii. and b.iii. really the same thing also? Not if the hash still retained the hash as you would it expect it to.

No. Using hashable elements is different and is likely to increase the performance of fetch operations.

And is there a term which could reasonably cover all of a. and b.? I might have used the term 'hash' to refer to anything which is a key-value pair, but this is apparently confusing or inaccurate.

Both a and b are associative.

I tried set, but that's wrong. I tried enumerable, but that's wrong too.

A set essentially an associative collection where the elements are objects. Conceptually a hash table, an array, and an associative array are associative based on key or index pairs.

And so are these three sorts of data structures (array, associative array, and hash) sufficiently different to warrant none subclassing from the other? If so, then it would also probably warrant a syntax addition and a core implementation (as discussed). And they are presumably sufficiently similar that it might make sense to subclass all of them from the same base class; list for instance, and then mixin Enumerable there instead. Although one may as well just put the methods in directly to List then however?

I hope I have been able to shed some light on the similarities and differences of them. The implementation of primitive operations are different leading to different potential operations and performance characteristics while many of the higher level operations are the same and can be shared via mixins.

There have been concerns that a sort function-ordered or an insertion-ordered, or just ordered hash will cause slowdowns and memory bloat, so I was wondering if it was possible to adjust according to need. So, thank you to Charles, as I am reminded of what I did in compsci all that time ago!

The ability to select the features you require for a given application would be very desirable indeed. Here's a syntax idea - let's borrow the idea of regex options where we can have /a regex/ix and use option characters to specify what concepts should be supported. {}s could be sequencable, {}i could be insertion ordered, {}k would be key ordered {}v would be value ordered, etc. The same could then be implemented for associative arrays [key => value]v for a value ordered associative array. We could even define ordering for regular arrays if desired [a, b, c]o would be ordered by <=> on the value of a, b, and c and maintain ordering when items are added. Some of these options may be of debatable value, but I think the notation is elegant, concise, and consistent with our current notation.

This post has discussed duck typed concepts in the context of collections, but the usefulness of duck typed concepts are certainly not limited to collections. Their essence is to identify and name a set *naturally* cohesive group of requirements on a type. These requirements can be operations, operations of an associated type (ex. Enumerable requiring elements to support <=> for sort operations to work),

Matthew