Brian Candler wrote:
Charles Hixson wrote in post #1084133:
Hashes are slow compared to direct indexing.
Says who?
This is a commonly true statement for a wide variety of languages. I am quite likely to be translating this work into another language. In the other language I am certain that it is a true statement. Unless Array access is significantly slower than Hash access (which I cannot believe) this is the way to proceed.
Push might be a good
answer, but not if it allocates storage item by item.
It doesn't. Look at the MRI C implementation if you really care: the
underlying array storage size is increased in chunks. Ditto for Strings.
That's ok. push was the wrong answer anyway. The right answer was that Arrays automatically extend to encompass the largest index that you use to store into them.
void
rb_ary_store(ary, idx, val)
...
if (idx>= RARRAY(ary)->aux.capa) {
long new_capa = RARRAY(ary)->aux.capa / 2;
if (new_capa< ARY_DEFAULT_SIZE) {
new_capa = ARY_DEFAULT_SIZE;
}
if (new_capa>= ARY_MAX_SIZE - idx) {
new_capa = (ARY_MAX_SIZE - idx) / 2;
}
new_capa += idx;
REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);
RARRAY(ary)->aux.capa = new_capa;
}
[from ruby-1.8.7-p352/array.c]
Sorry, my C isn't that good, and there are a lot of undefined terms in that snippet. I can't really say I understand it. I can usually guess what's going on, but undefined macros make me quite unsure...and I try to avoid pointers when writing code, because I want to understand it later. For that matter, I usually avoid macros, too. I realize that this is a common C coding style, but it's one reason I dislike C. I'd prefer D, FORTRAN, or even Ada or Eiffel. (Yaa...the code is what it is. But to me what it is is confusing.)
Allocation of memory is another expensive operation
Says who? Expensive compared to all the other things going on in your
program? Your program will probably be creating objects
left-right-and-centre, and having them garbage collected too.
Among system calls, allocation of memory is one of the more expensive. OTOH, the comment earlier about Array automatically extending itself answers this as there are
steps automatically taken to minimize the number of allocations of memory. I don't think that using push repeatedly is a good idea (and in any case, it doesn't fit with my logic, unless there is something about the system that demands that it be done that way, which there isn't).
I think you are approaching this the wrong way. I suggest you first
write some code that works, using the most simple and logical code.
*Then* optimise it if required: and optimise by measuring. Even for
experienced programmers, usually the hot-spot turns out not to be where
they guess it might be. And as you have already implied, you might need
to optimise for memory usage over speed.
If I design the basic framework of the application around a poor design, I'll have to rewrite the entire thing. This is something to get right at the start. It's not a premature optimization. There are lots of places where I'm taking to "do enough and patch it later" approach. The reason this isn't one, is because this needs to be handled now.
> From a speed point of view, I'm pretty confident in saying that the
implementation of Hash (in C) versus the implementation of Array (in C)
is unlikely to be the bottleneck in your program. An Array may use less
memory than a Hash - but since the Array or Hash holds only object
references (essentially pointers), both may be small compared to the
memory allocated to the objects you are holding within them.
OK. If it's not the bottleneck, that will be very good. Is there ANY reason that Hash is a better choice? I don't believe that there is, but I could be wrong. Certainly if they are the same speed, and use about the same amount of memory, I would prefer to use an Array. My expectation is that Array will be faster as well as using significantly less RAM, but were they the same, I would still prefer to use an Array, as it will match more nearly the preferred order of I/O (among other reasons). It is true that my reasons for preferring an Array are all minor, and if Array has some real drawback, then I would readily switch to Hash. But I haven't heard of one. I had some questions about how to use it, but those have, I believe, been answered.
P.S.: I understand that if I use a Hash, item will be retrieved in the order entered when I iterate through them. This isn't my desired result. I want to retrieve them in index # order. I'd rather not have to sort them.
P.P.S: A part of my design is driven by my desire to have a language independent form of serialization...and YAML won't work for my purposes. I may decide to have an isomorphic realization of this in a binary serialization that isn't language independent, but that WOULD be premature optimization.