Premature Optimization

Hi everybody,

I'm writing a fairly open-ended question. I'm hoping for suggestions,
opinions, advice. Suppose I have n arrays, each of which has m
entries. m is a fairly large integer, on the order of 10,000. Each
entry is either 1 or 0.

The first task I need to accomplish is figuring out how many times a 1
occurs in the ith entry in an array. So for concreteness, if I had
arrays:

first = [1,0,0,0,0]
second = [1,1,0,0,0]
third = [0,0,0,1,0]

I would end up with
count = [2,1,0,1,0]

I'm just trying to give the general flavor of what I'm working on. I
know I can use some simple each_with_index loops to increment
count[index] (something along the lines of:)

count = Array.new(m,0)
[first, second, third].each do |array|
  array.each_with_index do |item, index|
    count[index] += item
  end
end
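
I could also write this more compactly with Array#transpose, though it
still walks every element in Ruby and allocates a full transposed copy
(a sketch of what I mean):

```ruby
first  = [1,0,0,0,0]
second = [1,1,0,0,0]
third  = [0,0,0,1,0]

# Transpose turns each column into a row, so summing a row gives that
# column's count. Entries are 0 or 1, so the sum is the number of ones.
count = [first, second, third].transpose.map { |col| col.sum }
p count  # => [2, 1, 0, 1, 0]
```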

There are going to be something like 3 * m * n object allocations and
method calls (times a couple of constant multipliers for the looping).
m is fairly large, and I have other, similar tasks to accomplish with
this data. The faster I can process it, the more data I can analyze in
a given amount of time, and the more accurate the analysis will be.

Is there a slick way to do this with unpacking and packing? Or some
other way to do this with strings? Any modules or libraries I should
look into? I'm fairly new to Ruby, though not to scientific
computation. I realize I won't be able to do this any faster than with
m * n * (a constant multiplier), but I need that constant multiplier to
be as small as possible. Any advice would be appreciated.
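
For concreteness, the closest thing I've sketched with strings packs each
row into one byte per entry; the summing still happens in Ruby, so I
doubt this is the big win I'm after:

```ruby
rows = [[1,0,0,0,0], [1,1,0,0,0], [0,0,0,1,0]]

# "C*" packs each entry as one unsigned byte, so a row of m entries
# becomes an m-byte string instead of m separate Integer objects.
packed = rows.map { |r| r.pack('C*') }

# Unpack each row and accumulate element-wise into the first row's array.
count = packed.map { |s| s.unpack('C*') }
              .inject { |acc, row| acc.each_index { |i| acc[i] += row[i] }; acc }
p count  # => [2, 1, 0, 1, 0]
```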

Thanks!
'cid 'ooh

poopdeville@gmail.com wrote:

Hi everybody,

I'm writing a fairly open-ended question. I'm hoping for suggestions,
opinions, advice. Suppose I have n arrays, each of which has m
entries. m is a fairly large integer, on the order of 10,000. Each
entry is either 1 or 0.

Is there a slick way to do this with unpacking and packing? Or some
other way to do this with strings? Any modules or libraries I should
look into? I'm fairly new to Ruby, though not to scientific
computation. I realize I won't be able to do this any faster than with
m * n * (a constant multiplier), but I need that constant multiplier to
be as small as possible. Any advice would be appreciated.

I would look for an implementation of a BitSet, as this data structure seems crucial for your application. You could even write one of your own - it's not too hard.
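
A minimal sketch of the idea, using Ruby's arbitrary-precision Integers as bit sets (the helper name to_bits is made up for illustration):

```ruby
# Pack a 0/1 array into one Integer used as a bit set (bit i = entry i).
def to_bits(arr)
  arr.each_with_index.inject(0) { |acc, (bit, i)| bit == 1 ? acc | (1 << i) : acc }
end

rows = [[1,0,0,0,0], [1,1,0,0,0], [0,0,0,1,0]].map { |a| to_bits(a) }

# Integer#[] reads bit i, so the count for column i is just the number
# of rows whose bit i is set.
count = (0...5).map { |i| rows.count { |b| b[i] == 1 } }
p count  # => [2, 1, 0, 1, 0]
```

Set operations on whole rows then come for free as Integer & and |.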

Kind regards

  robert

> Hi everybody,
>
> I'm writing a fairly open-ended question. I'm hoping for suggestions,
> opinions, advice. Suppose I have n arrays, each of which has m entries. m
> is a fairly large integer, on the order of 10,000. Each entry is either 1
> or 0.
>
> The first task I need to accomplish is figuring out how many times a 1
> occurs in the ith entry in an array. So for concreteness, if I had arrays:
>
> first = [1,0,0,0,0]
> second = [1,1,0,0,0]
> third = [0,0,0,1,0]
>
> I would end up with
> count = [2,1,0,1,0]

     harp:~ > cat a.rb
     require 'narray'

     first = NArray.to_na [1,0,0,0,0]
     second = NArray.to_na [1,1,0,0,0]
     third = NArray.to_na [0,0,0,1,0]

     count = first.eq(1) + second.eq(1) + third.eq(1)

     p count

     harp:~ > ruby a.rb
     NArray.byte(5):
     [ 2, 1, 0, 1, 0 ]

> I'm just trying to give the general flavor of what I'm working on. I know I
> can use some simple each_with_index loops to increment count[index]
> (something along the lines of:)
>
> count = Array.new(m,0)
> [first, second, third].each do |array|
>   array.each_with_index do |item, index|
>     count[index] += item
>   end
> end
>
> There are going to be m * 3n * (two constant multipliers for the
> looping) object allocations and method calls. m is fairly large, and I
> have other, similar, tasks to accomplish with this data. The faster I
> can process this data, the more data I can process in a given amount of
> time, and the more accurate the analysis will be.

i use narray on huge (> 1gb) datasets all the time. it's blindingly fast.

regards.

-a

On Thu, 26 Oct 2006 poopdeville@gmail.com wrote:
--
my religion is very simple. my religion is kindness. -- the dalai lama

Someone already mentioned the NArray way; if you treat the entire
thing as a set of matrix operations, you can make it insanely fast in
either NArray or GSL. But don't treat it as "N" arrays with "M"
entries -- treat it as a matrix, and then all of your operations can be
done in the external (very fast) code:

m = GSL::Matrix.new([1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 1, 0])

You can then either cheat, and "flatten" the matrix into a vector and
find the sum:

puts m.to_v.sum

(which works as a total count of the ones, since every entry is 0 or 1)

Or pretend you're in Matlab and do it as a set of matrix operations:

colones = GSL::Matrix.new(1, m.size[1])
colones[0].set_all(1)

rowones = GSL::Matrix.new(1, m.size[0])
rowones[0].set_all(1)

puts ((colones * m.transpose) * rowones.transpose)[0,0]
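
If GSL isn't handy, the same algebra can be checked with the stdlib
Matrix class; it is far too slow for real data, so treat this as a
sketch for verifying the math (names are mine):

```ruby
require 'matrix'

m = Matrix[[1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 1, 0]]

# Per-column counts: a 1 x rows vector of ones times the matrix.
rowones = Matrix[[1] * m.row_count]
counts  = rowones * m
p counts.row(0).to_a   # => [2, 1, 0, 1, 0]

# Grand total of ones, analogous to the flatten-and-sum trick.
p m.to_a.flatten.sum   # => 4
```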

  -Dave

poopdeville@gmail.com wrote:

Hi everybody,

I'm writing a fairly open-ended question. I'm hoping for suggestions,
opinions, advice.

<snip>

poopdeville@gmail.com wrote:

Hi everybody,

I'm writing a fairly open-ended question. I'm hoping for suggestions,
opinions, advice.

<snip>

Thanks for all your replies. I've worked with the GNU Scientific
Library (and ATLAS, too) before, but I didn't realize there were Ruby
bindings for it. I settled on Ruby/GSL with a custom-compiled ATLAS
backend (instead of the reference BLAS that GSL uses normally). It's
screaming fast.

'cid 'ooh

What's it going to take to get this into the stdlib? It is awesome.

On 10/25/06, ara.t.howard@noaa.gov <ara.t.howard@noaa.gov> wrote:

On Thu, 26 Oct 2006 poopdeville@gmail.com wrote:

> Hi everybody,
>
> I'm writing a fairly open-ended question. I'm hoping for suggestions,
> opinions, advice.

<snip>

i use narray on huge (> 1gb) datasets all the time. it's blindingly fast.

Tweaking a bit and benchmarking with 100 rows of 10,000 elements each:

                             user     system      total        real
Create Matrix            3.940000   0.090000   4.030000 (  4.037665)
Add Up Matrix            0.710000   0.000000   0.710000 (  0.711897)
Create GSL::Matrix       3.400000   0.050000   3.450000 (  3.496827)
Add GSL::Matrix          0.080000   0.030000   0.110000 (  0.111982)
Matrix Add GSL::Matrix   0.020000   0.000000   0.020000 (  0.017661)

With the "Matrix Add" done as:

(@m * colones.transpose).to_v.sum

(to avoid @m.transpose, which creates a copy of all of @m).

Have fun. Note that there are faster ways to initialize a GSL matrix
(or an NArray) from an external source, if you end up concerned about
that source of overhead.
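
For reproducing this kind of table without GSL installed, a minimal
pure-Ruby sketch using the stdlib Benchmark module (the numbers will
of course differ from the table above):

```ruby
require 'benchmark'

n, m = 100, 10_000
rows = Array.new(n) { Array.new(m) { rand(2) } }

loop_count = trans_count = nil
Benchmark.bm(18) do |bm|
  bm.report('each_with_index') do
    loop_count = Array.new(m, 0)
    rows.each { |row| row.each_with_index { |v, i| loop_count[i] += v } }
  end
  bm.report('transpose + sum') do
    trans_count = rows.transpose.map { |col| col.sum }
  end
end

# Sanity check: both strategies must agree on the per-column counts.
raise 'mismatch' unless loop_count == trans_count
```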

  -Dave

dga wrote:

Someone already mentioned the NArray way; if you treat the entire
thing as a set of matrix operations, you can make it insanely fast in
either NArray or GSL.

<snip>

it sure is. i've been campaigning for years to no avail... maybe it's time to
bug matz again? it even compiles on windows: both it and the gsl can be found
precompiled here:

   http://codeforpeople.com/lib/ruby/rb-gsl-win/

regards.

-a

On Thu, 26 Oct 2006, Wilson Bilkovich wrote:

What's it going to take to get this into the stdlib? It is awesome.

--
my religion is very simple. my religion is kindness. -- the dalai lama