Premature Optimization

Hi everybody,

I'm writing a fairly open-ended question. I'm hoping for suggestions,
opinions, advice. Suppose I have n arrays, each of which has m
entries. m is a fairly large integer, on the order of 10,000. Each
entry is either 1 or 0.

The first task I need to accomplish is figuring out how many times a 1
occurs in the ith entry in an array. So for concreteness, if I had
arrays:

first = [1,0,0,0,0]
second = [1,1,0,0,0]
third = [0,0,0,1,0]

I would end up with
count = [2,1,0,1,0]

I'm just trying to give the general flavor of what I'm working on. I
know I can use some simple each_with_index loops to increment
count[index] (something along the lines of:)

count = Array.new(m,0)
[first, second, third].each do |array|
  array.each_with_index do |item, index|
    count[index] += item
  end
end
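
I could also write this more compactly with Array#transpose, though it
still walks every element in Ruby and allocates a full transposed copy
(a sketch of what I mean):

```ruby
first  = [1,0,0,0,0]
second = [1,1,0,0,0]
third  = [0,0,0,1,0]

# Transpose turns each column into a row, so summing a row gives that
# column's count. Entries are 0 or 1, so the sum is the number of ones.
count = [first, second, third].transpose.map { |col| col.sum }
p count  # => [2, 1, 0, 1, 0]
```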

There are going to be something like 3 * m * n object allocations and
method calls (times a couple of constant multipliers for the looping).
m is fairly large, and I have other, similar tasks to accomplish with
this data. The faster I can process it, the more data I can analyze in
a given amount of time, and the more accurate the analysis will be.

Is there a slick way to do this with unpacking and packing? Or some
other way to do this with strings? Any modules or libraries I should
look into? I'm fairly new to Ruby, though not to scientific
computation. I realize I won't be able to do this any faster than with
m * n * (a constant multiplier), but I need that constant multiplier to
be as small as possible. Any advice would be appreciated.
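
For concreteness, the closest thing I've sketched with strings packs each
row into one byte per entry; the summing still happens in Ruby, so I
doubt this is the big win I'm after:

```ruby
rows = [[1,0,0,0,0], [1,1,0,0,0], [0,0,0,1,0]]

# "C*" packs each entry as one unsigned byte, so a row of m entries
# becomes an m-byte string instead of m separate Integer objects.
packed = rows.map { |r| r.pack('C*') }

# Unpack each row and accumulate element-wise into the first row's array.
count = packed.map { |s| s.unpack('C*') }
              .inject { |acc, row| acc.each_index { |i| acc[i] += row[i] }; acc }
p count  # => [2, 1, 0, 1, 0]
```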

Thanks!
'cid 'ooh

poopdeville@gmail.com wrote:

Hi everybody,

I'm writing a fairly open-ended question. I'm hoping for suggestions,
opinions, advice. Suppose I have n arrays, each of which has m
entries. m is a fairly large integer, on the order of 10,000. Each
entry is either 1 or 0.

Is there a slick way to do this with unpacking and packing? Or some
other way to do this with strings? Any modules or libraries I should
look into? I'm fairly new to Ruby, though not to scientific
computation. I realize I won't be able to do this any faster than with
m * n * (a constant multiplier), but I need that constant multiplier to
be as small as possible. Any advice would be appreciated.

I would look for an implementation of a BitSet, as this data structure seems crucial for your application. You could even write one of your own - it's not too hard.
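
A minimal sketch of the idea, using Ruby's arbitrary-precision Integers as bit sets (the helper name to_bits is made up for illustration):

```ruby
# Pack a 0/1 array into one Integer used as a bit set (bit i = entry i).
def to_bits(arr)
  arr.each_with_index.inject(0) { |acc, (bit, i)| bit == 1 ? acc | (1 << i) : acc }
end

rows = [[1,0,0,0,0], [1,1,0,0,0], [0,0,0,1,0]].map { |a| to_bits(a) }

# Integer#[] reads bit i, so the count for column i is just the number
# of rows whose bit i is set.
count = (0...5).map { |i| rows.count { |b| b[i] == 1 } }
p count  # => [2, 1, 0, 1, 0]
```

Set operations on whole rows then come for free as Integer & and |.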

Kind regards

  robert

> Hi everybody,
>
> I'm writing a fairly open-ended question. I'm hoping for suggestions,
> opinions, advice. Suppose I have n arrays, each of which has m entries. m
> is a fairly large integer, on the order of 10,000. Each entry is either 1
> or 0.
>
> The first task I need to accomplish is figuring out how many times a 1
> occurs in the ith entry in an array. So for concreteness, if I had arrays:
>
> first = [1,0,0,0,0]
> second = [1,1,0,0,0]
> third = [0,0,0,1,0]
>
> I would end up with
> count = [2,1,0,1,0]

     harp:~ > cat a.rb
     require 'narray'

     first = NArray.to_na [1,0,0,0,0]
     second = NArray.to_na [1,1,0,0,0]
     third = NArray.to_na [0,0,0,1,0]

     count = first.eq(1) + second.eq(1) + third.eq(1)

     p count

     harp:~ > ruby a.rb
     NArray.byte(5):
     [ 2, 1, 0, 1, 0 ]

> I'm just trying to give the general flavor of what I'm working on. I know I
> can use some simple each_with_index loops to increment count[index]
> (something along the lines of:)
>
> count = Array.new(m,0)
> [first, second, third].each do |array|
>   array.each_with_index do |item, index|
>     count[index] += item
>   end
> end
>
> There are going to be m * 3n * (two constant multipliers for the
> looping) object allocations and method calls. m is fairly large, and I
> have other, similar, tasks to accomplish with this data. The faster I
> can process this data, the more data I can process in a given amount of
> time, and the more accurate the analysis will be.

i use narray on huge (> 1gb) datasets all the time. it's blindingly fast.

regards.

-a

On Thu, 26 Oct 2006 poopdeville@gmail.com wrote:
--
my religion is very simple. my religion is kindness. -- the dalai lama

Someone already mentioned the NArray way; if you treat the entire
thing as a set of matrix operations, you can make it insanely fast in
either NArray or GSL. But don't treat it as "N" arrays with "M"
entries -- treat it as a matrix, and then all of your operations can be
done in the external (very fast) code:

m = GSL::Matrix.new([1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 1, 0])

You can then either cheat, and "flatten" the matrix into a vector and
find the sum:

puts m.to_v.sum

(which works as a total count of the ones, since every entry is 0 or 1)

Or pretend you're in Matlab and do it as a set of matrix operations:

colones = GSL::Matrix.new(1, m.size[1])
colones[0].set_all(1)

rowones = GSL::Matrix.new(1, m.size[0])
rowones[0].set_all(1)

puts ((colones * m.transpose) * rowones.transpose)[0,0]
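
If GSL isn't handy, the same algebra can be checked with the stdlib
Matrix class; it is far too slow for real data, so treat this as a
sketch for verifying the math (names are mine):

```ruby
require 'matrix'

m = Matrix[[1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 1, 0]]

# Per-column counts: a 1 x rows vector of ones times the matrix.
rowones = Matrix[[1] * m.row_count]
counts  = rowones * m
p counts.row(0).to_a   # => [2, 1, 0, 1, 0]

# Grand total of ones, analogous to the flatten-and-sum trick.
p m.to_a.flatten.sum   # => 4
```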

  -Dave

poopdeville@gmail.com wrote:

Hi everybody,

I'm writing a fairly open-ended question. I'm hoping for suggestions,
opinions, advice.

<snip>

poopdeville@gmail.com wrote:

Hi everybody,

I'm writing a fairly open-ended question. I'm hoping for suggestions,
opinions, advice.

<snip>

Thanks for all your replies. I've worked with the GNU Scientific
Library (and ATLAS, too) before, but I didn't realize there were Ruby
bindings for it. I settled on Ruby/GSL with a custom-compiled ATLAS
backend (instead of the reference BLAS that GSL uses normally). It's
screaming fast.

'cid 'ooh

What's it going to take to get this into the stdlib? It is awesome.

On 10/25/06, ara.t.howard@noaa.gov <ara.t.howard@noaa.gov> wrote:

On Thu, 26 Oct 2006 poopdeville@gmail.com wrote:

> Hi everybody,
>
> I'm writing a fairly open-ended question. I'm hoping for suggestions,
> opinions, advice.

<snip>

i use narray on huge (> 1gb) datasets all the time. it's blindingly fast.

Tweaking a bit and benchmarking with 100 rows of 10,000 elements each:

                             user     system      total        real
Create Matrix            3.940000   0.090000   4.030000 (  4.037665)
Add Up Matrix            0.710000   0.000000   0.710000 (  0.711897)
Create GSL::Matrix       3.400000   0.050000   3.450000 (  3.496827)
Add GSL::Matrix          0.080000   0.030000   0.110000 (  0.111982)
Matrix Add GSL::Matrix   0.020000   0.000000   0.020000 (  0.017661)

With the "Matrix Add" done as:

(@m * colones.transpose).to_v.sum

(to avoid @m.transpose, which creates a copy of all of @m).

Have fun. Note that there are faster ways to initialize a GSL matrix
(or an NArray) from an external source, if you end up concerned about
that source of overhead.
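
For reproducing this kind of table without GSL installed, a minimal
pure-Ruby sketch using the stdlib Benchmark module (the numbers will
of course differ from the table above):

```ruby
require 'benchmark'

n, m = 100, 10_000
rows = Array.new(n) { Array.new(m) { rand(2) } }

loop_count = trans_count = nil
Benchmark.bm(18) do |bm|
  bm.report('each_with_index') do
    loop_count = Array.new(m, 0)
    rows.each { |row| row.each_with_index { |v, i| loop_count[i] += v } }
  end
  bm.report('transpose + sum') do
    trans_count = rows.transpose.map { |col| col.sum }
  end
end

# Sanity check: both strategies must agree on the per-column counts.
raise 'mismatch' unless loop_count == trans_count
```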

  -Dave

dga wrote:

Someone already mentioned the NArray way; if you treat the entire
thing as a set of matrix operations, you can make it insanely fast in
either NArray or GSL.

<snip>

it sure is. i've been campaigning for years to no avail... maybe it's time to
bug matz again? it even compiles on windows: both it and the gsl can be found
precompiled here:

   http://codeforpeople.com/lib/ruby/rb-gsl-win/

regards.

-a

On Thu, 26 Oct 2006, Wilson Bilkovich wrote:

What's it going to take to get this into the stdlib? It is awesome.

--
my religion is very simple. my religion is kindness. -- the dalai lama