Extending Array instances

I'm trying to figure out a good way to extend an Array, when the items may not be added in index order, and I don't know the final highest index. Does anyone have any suggestions?

What occurs to me is that I could double the length of the array whenever I want to add more than one element by doing something like:
a = a + Array.new(a.length)
Naturally, it would need to be a bit more complicated, as, e.g., I'd need to ensure that the resulting array actually *would* hold the proposed index. Still, something only a little more complicated should _work_, but is that the best approach?
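For what it's worth, a minimal sketch of that doubling idea in Ruby (the ensure_index helper name is mine, and as the replies below show, Ruby makes this unnecessary):

def ensure_index(a, idx)
  # Keep doubling the array until the proposed index actually fits.
  a.concat(Array.new([a.length, 1].max)) while a.length <= idx
  a
end

a = [:foo]
ensure_index(a, 10)
a[10] = :baz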

push?

I've used that when I didn't know (or care) about the length of the array.

Wayne

Charles Hixson wrote in post #1084111:

I'm trying to figure out a good way to extend an Array, when the items
may not be added in index order, and I don't know the final highest
index. Does anyone have any suggestions?

This happens automatically.

a = [:foo, :bar]

=> [:foo, :bar]

a[10] = :baz

=> :baz

a

=> [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

As others have said: using a Hash may be more efficient, if there are
large gaps - but then you may have to sort the keys if you want to
iterate over it in order. It depends on your use case.
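For example, a quick sketch of that sort-then-iterate step:

h = {}
h[10] = :baz
h[2] = :bar
h.keys.sort.each { |k| puts "#{k} => #{h[k]}" }  # prints 2 first, then 10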



I'd just use a hash.

-- Matma Rex

Brian Candler wrote:

Charles Hixson wrote in post #1084111:

I'm trying to figure out a good way to extend an Array, when the items
may not be added in index order, and I don't know the final highest
index. Does anyone have any suggestions?

This happens automatically.

a = [:foo, :bar]

=> [:foo, :bar]

a[10] = :baz

=> :baz

a

=> [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

As others have said: using a Hash may be more efficient, if there are
large gaps - but then you may have to sort the keys if you want to
iterate over it in order. It depends on your use case.

Hashes are slow compared to direct indexing. Push might be a good answer, but not if it allocates storage item by item; allocation of memory is another expensive operation. But I *am* considering using gdbm instead of an Array. That is slow like a hash, but it eliminates the need to persist the data separately later, and it conserves RAM. OTOH, it would be a lot faster to just loop through the array, marshaling it out item by item. Or perhaps just doing the entire thing in one blurt.

I'm expecting this array to eventually use over a GB of storage, so efficiency considerations are somewhat significant. gdbm solves some of these, by not dealing with the entire array at once, but only with the recently active instances. But it's slower, which I was trying to avoid by using an Array. However, I don't know just how many items the Array will need to hold, it could easily be in the millions, and will at least be in the hundreds of thousands. So pre-allocating isn't very desirable, but neither is adding instance by instance. (Besides, they won't necessarily come in order. It could well be that I'll get the 100,000th entry immediately after getting the 21st. No problem for a hash, but if I'm going to slow down to hash speed, I should probably just go directly to gdbm.)
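For reference, a minimal sketch of the gdbm route via Ruby's GDBM binding (keys and values must be Strings, so everything is marshaled in and out; some_item here is a placeholder):

require 'gdbm'

GDBM.open("items.db") do |db|
  db["100000"] = Marshal.dump(some_item)  # store an item under its index
  item = Marshal.load(db["100000"]) if db.has_key?("100000")
end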

Brian Candler wrote:

Charles Hixson wrote in post #1084111:

I'm trying to figure out a good way to extend an Array, when the items
may not be added in index order, and I don't know the final highest
index. Does anyone have any suggestions?

This happens automatically.

a = [:foo, :bar]

=> [:foo, :bar]

a[10] = :baz

=> :baz

a

=> [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

As others have said: using a Hash may be more efficient, if there are
large gaps - but then you may have to sort the keys if you want to
iterate over it in order. It depends on your use case.

Rereading your post, I see I missed a crucial part of it:

a[10] = :baz
=> a == [:foo, :bar, nil, nil, nil, nil, nil, nil, nil, nil, :baz]

And THAT was the part I really needed to get the first time.

Brian Candler wrote:

As others have said: using a Hash may be more efficient, if there are
large gaps - but then you may have to sort the keys if you want to
iterate over it in order. It depends on your use case.

Hashes are slow compared to direct indexing.

Did you measure it? If not, please do. It's easy to arrive at wrong
conclusions by assuming that what may be true in other contexts is true
in a new context as well.

I'm expecting this array to eventually use over a GB of storage, so

Are you talking about memory or number of instances? I am asking
because the two are not the same, and it may be more difficult to assess
memory consumption than you think.

efficiency considerations are somewhat significant. gdbm solves some of
these, by not dealing with the entire array at once, but only with the
recently active instances. But it's slower, which I was trying to avoid by
using an Array. However, I don't know just how many items the Array will
need to hold, it could easily be in the millions, and will at least be in
the hundreds of thousands.

So how do you know we are talking about "over a GB of storage"?

So pre-allocating isn't very desirable, but
neither is adding instance by instance. (Besides, they won't necessarily
come in order. It could well be that I'll get the 100,000th entry
immediately after getting the 21st. No problem for a hash, but if I'm going
to slow down to hash speed, I should probably just go directly to gdbm.)

Again: measure before you judge. It's really much better to base
decisions on facts than on speculation.
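A rough sketch of such a measurement using the standard benchmark library (absolute numbers will of course vary by machine and Ruby version):

require 'benchmark'

n = 1_000_000
a = Array.new(n)
h = {}
Benchmark.bm(5) do |x|
  x.report("array") { n.times { |i| a[i] = i } }
  x.report("hash")  { n.times { |i| h[i] = i } }
end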

Cheers

robert


On Mon, Nov 12, 2012 at 10:12 PM, Charles Hixson <charleshixsn@earthlink.net> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Charles Hixson wrote in post #1084133:

Hashes are slow compared to direct indexing.

Says who?

Push might be a good
answer, but not if it allocates storage item by item.

It doesn't. Look at the MRI C implementation if you really care: the
underlying array storage size is increased in chunks. Ditto for Strings.

void
rb_ary_store(ary, idx, val)
...
    if (idx >= RARRAY(ary)->aux.capa) {
        long new_capa = RARRAY(ary)->aux.capa / 2;

        if (new_capa < ARY_DEFAULT_SIZE) {
            new_capa = ARY_DEFAULT_SIZE;
        }
        if (new_capa >= ARY_MAX_SIZE - idx) {
            new_capa = (ARY_MAX_SIZE - idx) / 2;
        }
        new_capa += idx;
        REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);
        RARRAY(ary)->aux.capa = new_capa;
    }

[from ruby-1.8.7-p352/array.c]

Allocation of memory is another expensive operation

Says who? Expensive compared to all the other things going on in your
program? Your program will probably be creating objects
left-right-and-centre, and having them garbage collected too.

I think you are approaching this the wrong way. I suggest you first
write some code that works, using the most simple and logical code.
*Then* optimise it if required: and optimise by measuring. Even for
experienced programmers, usually the hot-spot turns out not to be where
they guess it might be. And as you have already implied, you might need
to optimise for memory usage over speed.

From a speed point of view, I'm pretty confident in saying that the
implementation of Hash (in C) versus the implementation of Array (in C)
is unlikely to be the bottleneck in your program. An Array may use less
memory than a Hash - but since the Array or Hash holds only object
references (essentially pointers), both may be small compared to the
memory allocated to the objects you are holding within them.



Robert Klemme wrote:

Brian Candler wrote:

As others have said: using a Hash may be more efficient, if there are
large gaps - but then you may have to sort the keys if you want to
iterate over it in order. It depends on your use case.

Hashes are slow compared to direct indexing.

Did you measure it? If not, please do. It's easy to arrive at wrong
conclusions by assuming what may be truth in other contexts in true in
a new context as well.

I'm expecting this array to eventually use over a GB of storage, so

Are you talking about memory or number of instances? I am asking
because both are not the same and it may be more difficult to asses
memory consumption than you think.

I'm talking about data size. The number of instances would probably be about 1/20 of the number of items. (Varies because the items themselves contain Arrays that vary in size.) But please note that this is a rough estimate, and could be an underestimate (though I've tried to make a generous estimate).

efficiency considerations are somewhat significant. gdbm solves some of
these, by not dealing with the entire array at once, but only with the
recently active instances. But it's slower, which I was trying to avoid by
using an Array. However, I don't know just how many items the Array will
need to hold, it could easily be in the millions, and will at least be in
the hundreds of thousands.
So how do you know we are talking about "over a GB of storage"?

It's my best estimate before I actually build the thing. I hope it's a generous estimate, as smaller amounts of data would be easier to handle, but I don't want to say that I haven't underestimated.

  So pre-allocating isn't very desirable, but
neither is adding instance by instance. (Besides, they won't necessarily
come in order. It could well be that I'll get the 100,000th entry
immediately after getting the 21st. No problem for a hash, but if I'm going
to slow down to hash speed, I should probably just go directly to gdbm.)

Again: measure before you judge. It's really much better to base
decisions on facts than on speculation.

Cheers

robert


I've measured it in several languages, and occasionally implemented the hash algorithm from scratch in C (two or three different ways of handling collisions...but Unicode means that I don't want to use C, though pointers were my original reason). I'll admit I haven't measured it yet in Ruby, but hash cannot be fast, though direct indexing can be slow. It's in the nature of the algorithm. It can be quite fast compared to a tree lookup, but not compared to a direct index.
FWIW, the answer I was looking for is that Arrays are automatically extended to the last index inserted into. This probably implies that Arrays use some sort of linked block index, though another possibility is that it just allocates a honking big chunk of memory, and does a copy whenever it needs to resize the array. This seems unlikely. Also silly. So I'll bet on a linked block index. (N.B.: Some languages DO just allocate large blocks of RAM for their arrays, but most of the ones that I know do that are compiled languages, and they've got reasonable grounds...big blocks are more efficient to access, and if you're a compiler the code you compile can be as efficient as you are. So the trade-offs are different. Even then they try to avoid copying.)

P.S.: Developing this in Ruby *IS* a measurement test. If I can do it in Ruby, then I've guessed the right size. But picking a bad algorithm to implement wouldn't prove much of anything. If it doesn't work, I'll need to fall back to a faster language. Probably D. That's one reason I want to avoid gdbm. Not only is RAM faster, but it's an easier conversion to another language if it turns out that I need a faster language.
Ruby's a lot faster to develop in, so it's not a bad decision even if Ruby turns out to be too slow (or not able to handle the required amount of memory), because then I'll have a pretty much written application that I can translate to a faster language. (One reason for D is that there are lots of places where I can execute loops in parallel easily. That would be harder in Ruby, Python, etc., though in most ways they are easier.)


On Mon, Nov 12, 2012 at 10:12 PM, Charles Hixson <charleshixsn@earthlink.net> wrote:

Brian Candler wrote:

Charles Hixson wrote in post #1084133:

Hashes are slow compared to direct indexing.

Says who?

Push might be a good
answer, but not if it allocates storage item by item.

It doesn't. Look at the MRI C implementation if you really care: the
underlying array storage size is increased in chunks. Ditto for Strings.

void
rb_ary_store(ary, idx, val)
...
    if (idx >= RARRAY(ary)->aux.capa) {
        long new_capa = RARRAY(ary)->aux.capa / 2;

        if (new_capa < ARY_DEFAULT_SIZE) {
            new_capa = ARY_DEFAULT_SIZE;
        }
        if (new_capa >= ARY_MAX_SIZE - idx) {
            new_capa = (ARY_MAX_SIZE - idx) / 2;
        }
        new_capa += idx;
        REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);
        RARRAY(ary)->aux.capa = new_capa;
    }

[from ruby-1.8.7-p352/array.c]

Allocation of memory is another expensive operation

Says who? Expensive compared to all the other things going on in your
program? Your program will probably be creating objects
left-right-and-centre, and having them garbage collected too.

I think you are approaching this the wrong way. I suggest you first
write some code that works, using the most simple and logical code.
*Then* optimise it if required: and optimise by measuring. Even for
experienced programmers, usually the hot-spot turns out not to be where
they guess it might be. And as you have already implied, you might need
to optimise for memory usage over speed.

From a speed point of view, I'm pretty confident in saying that the
implementation of Hash (in C) versus the implementation of Array (in C)
is unlikely to be the bottleneck in your program. An Array may use less
memory than a Hash - but since the Array or Hash holds only object
references (essentially pointers), both may be small compared to the
memory allocated to the objects you are holding within them.

Thank you.

I'm going to need to think about the implications of that. It's looking like I need to change something basic in my design. An Array was the simplest and most logical choice, but if Arrays are as slow as Hashes (or nearly so), something in my design is not appropriate for this project. Perhaps I WILL need to use a database.

Brian Candler wrote:

Charles Hixson wrote in post #1084133:

Hashes are slow compared to direct indexing.

Says who?

This is a commonly true statement for a wide variety of languages. I am quite likely to be translating this work into another language. In the other language I am certain that it is a true statement. Unless Array access is significantly slower than Hash access (which I cannot believe) this is the way to proceed.

Push might be a good
answer, but not if it allocates storage item by item.

It doesn't. Look at the MRI C implementation if you really care: the
underlying array storage size is increased in chunks. Ditto for Strings.

That's ok. push was the wrong answer anyway. The right answer was that Arrays automatically extend to encompass the largest index that you use to store into them.

void
rb_ary_store(ary, idx, val)
...
    if (idx >= RARRAY(ary)->aux.capa) {
        long new_capa = RARRAY(ary)->aux.capa / 2;

        if (new_capa < ARY_DEFAULT_SIZE) {
            new_capa = ARY_DEFAULT_SIZE;
        }
        if (new_capa >= ARY_MAX_SIZE - idx) {
            new_capa = (ARY_MAX_SIZE - idx) / 2;
        }
        new_capa += idx;
        REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);
        RARRAY(ary)->aux.capa = new_capa;
    }

[from ruby-1.8.7-p352/array.c]

Sorry, my C isn't that good, and there are a lot of undefined terms in that snippet. I can't really say I understand it. I can usually guess what's going on, but undefined macros make me quite unsure...and I try to avoid pointers when writing code, because I want to understand it later. For that matter, I usually avoid macros, too. I realize that this is a common C coding style, but it's one reason I dislike C. I'd prefer D, FORTRAN, or even Ada or Eiffel. (Yaa...the code is what it is. But to me what it is is confusing.)

Allocation of memory is another expensive operation

Says who? Expensive compared to all the other things going on in your
program? Your program will probably be creating objects
left-right-and-centre, and having them garbage collected too.

Among system calls, allocation of memory is one of the more expensive. OTOH, the earlier comment about Array automatically extending itself answers this: steps are already taken to minimize the number of memory allocations. I don't think that using push repeatedly is a good idea (and in any case, it doesn't fit with my logic, unless there is something about the system that demands it be done that way, which there isn't).

I think you are approaching this the wrong way. I suggest you first
write some code that works, using the most simple and logical code.
*Then* optimise it if required: and optimise by measuring. Even for
experienced programmers, usually the hot-spot turns out not to be where
they guess it might be. And as you have already implied, you might need
to optimise for memory usage over speed.

If I design the basic framework of the application around a poor design, I'll have to rewrite the entire thing. This is something to get right at the start. It's not a premature optimization. There are lots of places where I'm taking the "do enough and patch it later" approach. The reason this isn't one of them is that this needs to be handled now.

From a speed point of view, I'm pretty confident in saying that the
implementation of Hash (in C) versus the implementation of Array (in C)
is unlikely to be the bottleneck in your program. An Array may use less
memory than a Hash - but since the Array or Hash holds only object
references (essentially pointers), both may be small compared to the
memory allocated to the objects you are holding within them.

OK. If it's not the bottleneck, that will be very good. Is there ANY reason that Hash is a better choice? I don't believe that there is, but I could be wrong. Certainly if they are the same speed, and use about the same amount of memory, I would prefer to use an Array. My expectation is that Array will be faster as well as using significantly less RAM, but were they the same, I would still prefer to use an Array, as it will match more nearly the preferred order of I/O (among other reasons). It is true that my reasons for preferring an Array are all minor, and if Array has some real drawback, then I would readily switch to Hash. But I haven't heard of one. I had some questions about how to use it, but those have, I believe, been answered.
P.S.: I understand that if I use a Hash, items will be retrieved in the order entered when I iterate through them. This isn't my desired result. I want to retrieve them in index # order. I'd rather not have to sort them.

P.P.S: A part of my design is driven by my desire to have a language independent form of serialization...and YAML won't work for my purposes. I may decide to have an isomorphic realization of this in a binary serialization that isn't language independent, but that WOULD be premature optimization.

Welcome to silly... tho I'm not sure why you think that.


On Nov 13, 2012, at 21:30, Charles Hixson <charleshixsn@earthlink.net> wrote:

FWIW, the answer I was looking for is that Arrays are automatically extended to the last index inserted into. This probably implies that Arrays use some sort of linked block index, though another possibility is that it just allocates a honking big chunk of memory, and does a copy whenever it needs to resize the array. This seems unlikely. Also silly. So I'll bet on a linked block index.

I've measured it in several languages, and occasionally implemented the hash
algorithm from scratch in C (two or three different ways of handling
collisions...but Unicode means that I don't want to use C, though pointers
were my original reason). I'll admit I haven't measured it yet in Ruby, but
hash cannot be fast, though direct indexing can be slow. It's in the nature
of the algorithm. It can be quite fast compared to a tree lookup, but not
compared to a direct index.

That is still speculation.

FWIW, the answer I was looking for is that Arrays are automatically extended
to the last index inserted into. This probably implies that Arrays use some
sort of linked block index, though another possibility is that it just
allocates a honking big chunk of memory, and does a copy whenever it needs
to resize the array. This seems unlikely.

Please stop speculating and verify the facts (i.e. check Ruby source
code). You'll do yourself a big favor.

Note also that you can still implement a type with the same interface
as Hash or Array and drop it into your implementation later if you
find the std lib types to be too slow for your purposes.
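For instance, a hypothetical wrapper along these lines (the name and methods are mine) keeps the choice of backing store behind the two operators the rest of the code uses:

class SparseStore
  def initialize
    @h = {}  # swap in an Array, a gdbm handle, etc. later without touching callers
  end

  def [](i)
    @h[i]
  end

  def []=(i, v)
    @h[i] = v
  end

  def each_in_order  # iterate in index order despite sparse keys
    @h.keys.sort.each { |k| yield k, @h[k] }
  end
end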

Cheers

robert


On Wed, Nov 14, 2012 at 6:30 AM, Charles Hixson <charleshixsn@earthlink.net> wrote:


Charles Hixson wrote in post #1084366:

FWIW, the answer I was looking for is that Arrays are automatically
extended to the last index inserted into. This probably implies that
Arrays use some sort of linked block index, though another possibility
is that it just allocates a honking big chunk of memory, and does a copy
whenever it needs to resize the array. This seems unlikely. Also
silly. So I'll bet on a linked block index.

Then you lose the bet. I already posted the code.

The code says: when you set an element beyond the end, the underlying
array storage capacity (aux.capa) is increased to the new end index plus
half the current capacity; new storage is allocated and the old data
copied into the new (read the manpage for "realloc"). So the data is
copied, but the number of copy operations goes down as the array grows.

This is of course *way* more efficient for most purposes than a linked-list
implementation, because the memory is contiguous and all subsequent
get/set operations are then simple pointer offsets rather than walking
a list (even a list of chunks).
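To make the arithmetic concrete, here is the quoted growth policy transcribed into Ruby (ARY_DEFAULT_SIZE is assumed to be 16, and the ARY_MAX_SIZE clamp is ignored):

def new_capacity(capa, idx)
  new_capa = capa / 2
  new_capa = 16 if new_capa < 16  # ARY_DEFAULT_SIZE
  new_capa + idx
end

new_capacity(16, 100)               #=> 116 -- the slack amortizes future growth
new_capacity(1_000_000, 1_500_000)  #=> 2_000_000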

P.S.: Developing this in Ruby *IS* a measurement test. If I can do it
in Ruby, then I've guessed the right size. But picking a bad algorithm
to implement wouldn't prove much of anything.

You're talking about two different things. Picking a bad algorithm is of
course a bad idea, and will give you bad performance (relative to a good
algorithm) in any language.

But what you are discussing is the same algorithm, simply whether c[k]
is slower or more memory inefficient if collection c is an Array or a
Hash. This is almost certainly irrelevant in the context of your
application.

Code it using the obvious approach: if the keys are all non-negative
integers and not too sparse, use an Array. Otherwise use a Hash.
integers and not too sparse, use an Array. Otherwise use a Hash.

If it doesn't all fit into RAM then this is probably due to the space
used by the objects themselves, not the collection (I notice you haven't
asked how much RAM a String object uses). If it doesn't run fast enough
then the set/get operations of Array or Hash are almost certainly not
the culprit.



Like everyone else on this thread, I'm going to have to say:

would you PLEASE just fucking measure something already???

  require 'benchmark'

it's not hard.
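e.g., about the smallest possible measurement (a sketch; substitute the container under test):

  require 'benchmark'

  h = {}
  puts Benchmark.realtime { 1_000_000.times { |i| h[i] = i } }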


On Nov 14, 2012, at 11:06, Charles Hixson <charleshixsn@earthlink.net> wrote:

I'm going to need to think about the implications of that. It's looking like I need to change something basic in my design. An Array was the simplest and most logical choice, but if Arrays are as slow as Hashes (or nearly so), something in my design is not appropriate for this project. Perhaps I WILL need to use a database.

Sorry, my C isn't that good, and there are a lot of undefined terms in
that snippet. I can't really say I understand it. I can usually guess
what's going on, but undefined macros make me quite unsure...and I try to
avoid pointers when writing code, because I want to understand it later.
For that matter, I usually avoid macros, too. I realize that this is a
common C coding style, but it's one reason I dislike C. I'd prefer D,
FORTRAN, or even Ada or Eiffel. (Yaa...the code is what it is. But to me
what it is is confusing.)

This is going to sound elitist and snobby, but I'm pretty sure that's the
whole problem right there. If you avoid pointers in C because they're
confusing, then do not ever write C. Pointers are a fundamental part of the
language and how you get things done in it, and actually quite simple when
you're thinking at the C level. It's not a coding style, it's how C works.
And yes, without pointers preallocated arrays will often be faster than
everything else, because without pointers you're losing the ability to do
operations like memory copying and collection restructuring and object
comparisons by reference; reading and writing entire objects is always slow.

I don't know the Ruby C API at all, but my guess at interpreting the
snippet would be:

    if (idx >= RARRAY(ary)->aux.capa) {      /* if the index is beyond the RARRAY
                                                (Ruby Array)'s auxiliary capacity
                                                (i.e. total allocated space?) */
        long new_capa = RARRAY(ary)->aux.capa / 2;  /* new_capa = current capa / 2 */

        if (new_capa < ARY_DEFAULT_SIZE) {   /* clamp to some globally defined minimum */
            new_capa = ARY_DEFAULT_SIZE;     /* (e.g. don't extend by 3, when 512
                                                would be more sensible?) */
        }
        if (new_capa >= ARY_MAX_SIZE - idx) {    /* clamp to some globally understood maximum */
            new_capa = (ARY_MAX_SIZE - idx) / 2; /* i.e. half the available maximum space */
        }
        new_capa += idx;
        REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa); /* some realloc() macro --
                                                         looks exactly like realloc() to me */
        RARRAY(ary)->aux.capa = new_capa;    /* tell the array what its auxiliary
                                                capacity has become */
    }

If I design the basic framework of the application around a poor design, I'll have to rewrite the entire thing. This is something to get right at the start. It's not a premature optimization. There are lots of places where I'm taking the "do enough and patch it later" approach. The reason this isn't one of them is that this needs to be handled now.

Indeed. Write the algorithm. Optimise the algorithm. Make sure you're using
sensible algorithmic techniques. Which container you're using isn't part of
that algorithm.

Again: which container you're using isn't part of the algorithm, especially
in an OO context like Ruby. In the algorithm you say "add to the
container" or "look up thingy in the container." Then when you've
implemented it, if the algorithm is gravy but execution takes too long,
maybe think about alternative _containers_, based on how your (already
good) algorithm is using it. E.g. sparse storage suggests not using an
array, random inserts but sorted iteration implies some sort of heap or
something, etc. The main point is, the outer algorithm is the big deal;
you abstract the container with a little black box called "the container"
until you know more about how the whole system works.

Ruby is a high enough level language that you can abstract the container's
inner workings like this safely. It's also high enough that you can't
really predict the behaviour without _trying_ it (or without having written
dozens of other similar programs and getting a feel for how various objects
behave.)

So again, in summary: write the part of the algorithm you can control, and
make it as good as can be. Then use empirical testing to find the best
objects to fill in those black boxes.


On 15 November 2012 06:48, Charles Hixson <charleshixsn@earthlink.net> wrote:

--
  Matthew Kerwin, B.Sc (CompSci) (Hons)
  http://matthew.kerwin.net.au/
  ABN: 59-013-727-651

  "You'll never find a programming language that frees
  you from the burden of clarifying your ideas." - xkcd

Ryan Davis wrote:

FWIW, the answer I was looking for is that Arrays are automatically extended to the last index inserted into. This probably implies that Arrays use some sort of linked block index, though another possibility is that it just allocates a honking big chunk of memory, and does a copy whenever it needs to resize the array. This seems unlikely. Also silly. So I'll bet on a linked block index.

Welcome to silly... tho I'm not sure why you think that.

That's not what the code I saw showed...admittedly my C skills are very poor. If it DOES do massive copying, I'm going to need a very different approach, and I'm not sure what would work. But why wouldn't Ruby use memory mapping? Copying a GB of memory around doesn't sound like a feasible approach, even though it would be done at full C speed.

If that's true, then using a database is the only way I can think of that would work, and there are several reasons why I want to avoid that.


On Nov 13, 2012, at 21:30, Charles Hixson <charleshixsn@earthlink.net> wrote: