A Real World example for Ruby to "compiled" version discussion

People,

The "how can we make a ruby compiler" thread has been very interesting - I really like hearing from serious and competent programmers about the theoretical problems involved with this issue.

William Rutiser asked for an expansion on the details of my C/C++ population genetics simulation program as a specific example of how one might proceed depending on a particular situation. I am happy to elaborate - not least because it would be good to get input from experienced Ruby programmers before I just try to replicate the same program in Ruby - I'm sure there will be more sensible/efficient ways of doing things than what I would attempt first off and so comparisons between my C/C++ version and a dodgy Ruby program might be even more unfair . .

I will summarise the C/C++ program as it exists now (it has gone through a number of versions and has a lot of code that does not need replicating for the present comparison) with a general overview (I can add more detail later if people are interested):

- A population is represented by a number of sub-populations which occupy cells of a two dimensional array

- each cell has pointers to three ordered lists - representing the parental population, the offspring population and a temporary population and each element of a list represents an individual

- at each generation (of potentially hundreds), the whole array is iterated through, new offspring are produced, migrants move to adjacent cells, parents die off etc

As well as this main simulation program, I have already replaced all the original shell scripts and some of the statistical processing with Ruby scripts but the main simulation program is where I couldn't afford an order or two increase in running times by rewriting in Ruby.

The main problem I see with some of the Ruby conversions that I have looked at (eg RubyInline) is that the performance problem comes in in repeating the WHOLE simulation with different starting parameters which is done thousands of times - so it is not like you have a single recursive algorithm which is a bottleneck that can be optimised or rewritten in C or something. There are lots of little steps that happen millions of times . .

I hope that is sort of clear?

Regards,

Phil.

···

--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

Could you expand on this??

···

On Oct 6, 2010, at 09:55 , Philip Rhoades wrote:

The main problem I see with some of the Ruby conversions that I have looked at (eg RubyInline) is that the performance problem comes in in repeating the WHOLE simulation with different starting parameters which is done thousands of times - so it is not like you have a single recursive algorithm which is a bottleneck that can be optimised or rewritten in C or something. There are lots of little steps that happen millions of times . .

Philip Rhoades wrote:

People,

The "how can we make a ruby compiler" thread has been very interesting - I really like hearing from serious and competent programmers about the theoretical problems involved with this issue.

William Rutiser asked for an expansion on the details of my C/C++ population genetics simulation program as a specific example of how one might proceed depending on a particular situation. I am happy to elaborate - not least because it would be good to get input from experienced Ruby programmers before I just try to replicate the same program in Ruby - I'm sure there will be more sensible/efficient ways of doing things than what I would attempt first off and so comparisons between my C/C++ version and a dodgy Ruby program might be even more unfair . .

I will summarise the C/C++ program as it exists now (it has gone through a number of versions and has a lot of code that does not need replicating for the present comparison) with a general overview (I can add more detail later if people are interested):

- A population is represented by a number of sub-populations which occupy cells of a two dimensional array

- each cell has pointers to three ordered lists - representing the parental population, the offspring population and a temporary population and each element of a list represents an individual

- at each generation (of potentially hundreds), the whole array is iterated through, new offspring are produced, migrants move to adjacent cells, parents die off etc

As well as this main simulation program, I have already replaced all the original shell scripts and some of the statistical processing with Ruby scripts but the main simulation program is where I couldn't afford an order or two increase in running times by rewriting in Ruby.

The main problem I see with some of the Ruby conversions that I have looked at (eg RubyInline) is that the performance problem comes in in repeating the WHOLE simulation with different starting parameters which is done thousands of times - so it is not like you have a single recursive algorithm which is a bottleneck that can be optimised or rewritten in C or something. There are lots of little steps that happen millions of times . .

I hope that is sort of clear?

Regards,

Phil.

Some questions with my guesses at possible answers:

How much data is carried for each individual? a few alleles? a lot? the whole human genome?

What is done in the innermost loop? I imagine you select a pair of individuals, cross them, etc to produce more individuals. Or perhaps an individual dies or migrated to another population.

What do you do to run a new experiment? generate the initial population, specify the rules for combination, depth, migration, etc? gather data and process it at the end?

How much code do you need to write or rewrite for an experiment? How many parameters are involved?

Can you give some examples of aspects of the C++ program that you hope to improve by moving the sim to Ruby?

Can you say something about the context of your modeling? How is your simulator different from others used in the field?

-- Bill

Specifically

- What properties (conceptually) does an individual / a population have?

- What calculations are done for each individual / population?

- What kind of iteration is applied?

Kind regards

robert

···

On Thu, Oct 7, 2010 at 2:57 AM, Ryan Davis <ryand-ruby@zenspider.com> wrote:

On Oct 6, 2010, at 09:55 , Philip Rhoades wrote:

The main problem I see with some of the Ruby conversions that I have looked at (eg RubyInline) is that the performance problem comes in in repeating the WHOLE simulation with different starting parameters which is done thousands of times - so it is not like you have a single recursive algorithm which is a bottleneck that can be optimised or rewritten in C or something. There are lots of little steps that happen millions of times . .

Could you expand on this??

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

People,

I have accumulated the questions from different people into this response:

The main problem I see with some of the Ruby conversions that I
have looked at (eg RubyInline) is that the performance problem
comes in in repeating the WHOLE simulation with different
starting parameters which is done thousands of times - so it is
not like you have a single recursive algorithm which is a
bottleneck that can be optimised or rewritten in C or something.
There are lots of little steps that happen millions of times . .

Could you expand on this??

Specifically

- What properties (conceptually) does an individual / a population
have?

This is an Individual Based Model (IBM) as opposed to populations that can be modelled as a whole with equations etc - so the simulation reflects this. In the C/C++ version an individual organism is an object with a number of attributes: ID, parent ID(s), (microsatellite) allele repeat length, randomizer int etc

- What calculations are done for each individual / population?

The genetics of the population changes as the individuals change because of mutations which affect repeat length of microsatellites ie they are not selected against/for (neutral mutations).

- What kind of iteration is applied?

For this particular discussion:

for k in 1..K # generations
   for x in 1..X # array width
     for y in 1..Y # array height
       destroy parent ordered list
       destroy tmp ordered list
       offspring list becomes parent list
       for i in 1..I # organism in parent ordered list
         asexually reproduce offspring with possible mutation
         possibly migrate offspring to neighbouring array cell
       end
     end
   end
end

data is written to disk after each generation and genetics statistics are done later.

How much data is carried for each individual? a few alleles? a lot?
the whole human genome?

For this exercise, one loci and allele (asexual reproduction) - multiple loci can be modelled by multiple runs.

What is done in the innermost loop? I imagine you select a pair of
individuals, cross them, etc to produce more individuals. Or perhaps
an individual dies or migrated to another population.

Asexual reproduction in this case and see above.

What do you do to run a new experiment? generate the initial
population, specify the rules for combination, depth, migration, etc?
gather data and process it at the end?

A script calls the simulation program hundreds of times with different starting populations. These starting populations are the output of the same simulation program which has previously been run, starting with different random number seeds, until they reach "quasi-equilibrium" (typically 500 generations). Also see above.

How much code do you need to write or rewrite for an experiment? How
many parameters are involved?

The code doesn't change - a configuration file and the starting (initial) generation can be changed.

Can you give some examples of aspects of the C++ program that you
hope to improve by moving the sim to Ruby?

Not really - it was a while ago now that I did the actual coding (I don't need to touch the code much these days) and I just remember it was a major pain trying to learn about the eccentricities of OOP with C++ - I'm sure it would have been a much more pleasant experience doing it in Ruby . .

I actually don't need to change to Ruby now - I just need to write a thesis . . What would be nice would be to further develop the program, make it more generalised, more friendly for other users, be able to respond to requests for enhancements/changes more quickly etc.

Can you say something about the context of your modeling? How is your
simulator different from others used in the field?

There are a couple of particular issues we wanted to investigate and my supervisor thought that the existing stuff that was around then (about nine years ago now) didn't conveniently allow us to do what we wanted - maybe it is has changed now but I haven't had much exposure to other programs. Also, a lot of them then were written in some sort of graphical environment for Windows or Macs and I wanted to have a CLI interface on Linux and have more direct control over what was happening . .

Thanks,

Phil.

···

On 2010-10-07 23:12, Robert Klemme wrote:

On Thu, Oct 7, 2010 at 2:57 AM, Ryan Davis<ryand-ruby@zenspider.com> > wrote:

On Oct 6, 2010, at 09:55 , Philip Rhoades wrote:

--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

People,

I have accumulated the questions from different people into this response:

[snip /]

For this particular discussion:

for k in 1..K # generations
for x in 1..X # array width
for y in 1..Y # array height
destroy parent ordered list
destroy tmp ordered list
offspring list becomes parent list
for i in 1..I # organism in parent ordered list
asexually reproduce offspring with possible mutation
possibly migrate offspring to neighbouring array cell
end
end
end
end

[snip /]

The code doesn't change - a configuration file and the starting (initial)
generation can be changed.

[snip /]

There are a couple of particular issues we wanted to investigate and my
supervisor thought that the existing stuff that was around then (about nine
years ago now) didn't conveniently allow us to do what we wanted - maybe it
is has changed now but I haven't had much exposure to other programs. Also,
a lot of them then were written in some sort of graphical environment for
Windows or Macs and I wanted to have a CLI interface on Linux and have more
direct control over what was happening . .

This sounds like something JRuby would be good at it, since it can JIT
the loops, and the Java VM reaches nearly-native speeds these days.

So, yes, improved technology could make this faster these days.

···

On Fri, Oct 8, 2010 at 3:37 PM, Philip Rhoades <phil@pricom.com.au> wrote:

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.

Traffic simulation has an analogous distinction: microscopic (traffic as aggregate behavior of individuals) vs. macroscopic (traffic as wave).

I keep trying to see how to use narray to express the state of your model, but the variable length organism list probably prevents that. Narray would give you so much for free, especially if part of the timestep transformation is linear (matrix multiplication). But that's probably simplistic.

···

On 10/08/2010 06:37 AM, Philip Rhoades wrote:

This is an Individual Based Model (IBM) as opposed to populations that
can be modelled as a whole with equations etc - so the simulation
reflects this.

Phillip,

People,

I have accumulated the questions from different people into this response:

[snip /]

For this particular discussion:

for k in 1..K # generations
  for x in 1..X # array width
    for y in 1..Y # array height
      destroy parent ordered list
      destroy tmp ordered list
      offspring list becomes parent list
      for i in 1..I # organism in parent ordered list
        asexually reproduce offspring with possible mutation
        possibly migrate offspring to neighbouring array cell
      end
    end
  end
end

[snip /]

The code doesn't change - a configuration file and the starting (initial)
generation can be changed.

[snip /]

There are a couple of particular issues we wanted to investigate and my
supervisor thought that the existing stuff that was around then (about nine
years ago now) didn't conveniently allow us to do what we wanted - maybe it
is has changed now but I haven't had much exposure to other programs. Also,
a lot of them then were written in some sort of graphical environment for
Windows or Macs and I wanted to have a CLI interface on Linux and have more
direct control over what was happening . .

This sounds like something JRuby would be good at it, since it can JIT
the loops, and the Java VM reaches nearly-native speeds these days.

Native to what? Ruby, Java, C++ ?

So, yes, improved technology could make this faster these days.

I would prefer to be able to produce stand-alone executables if possible . .

Thanks,

Phil.

···

On 2010-10-09 02:57, Phillip Gawlowski wrote:

On Fri, Oct 8, 2010 at 3:37 PM, Philip Rhoades<phil@pricom.com.au> wrote:

--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

Joel.

This is an Individual Based Model (IBM) as opposed to populations that
can be modelled as a whole with equations etc - so the simulation
reflects this.

Traffic simulation has an analogous distinction: microscopic (traffic as
aggregate behavior of individuals) vs. macroscopic (traffic as wave).

Yep, IMBs get used for lots of things.

I keep trying to see how to use narray to express the state of your
model, but the variable length organism list probably prevents that.
Narray would give you so much for free, especially if part of the
timestep transformation is linear (matrix multiplication). But that's
probably simplistic.

I was expecting to have to change my ordered lists to one-dimensional ordered arrays anyway . . I will check out narrays . .

Thanks,

Phil.

···

On 2010-10-09 09:28, Joel VanderWerf wrote:

On 10/08/2010 06:37 AM, Philip Rhoades wrote:

--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au

Re JRuby: Eventually, yes...the current compiler is largely unchanged
from JRuby 1.1, however, and so we're not taking as good advantage of
the JVM as you might hope. Lots of opportunity to improve it, and
obviously we'd like these loops to be as fast as Java loops if
possible...

Soonish, perhaps! Help us fix bugs so I can spend more time on perf :slight_smile:

- Charlie

···

On Fri, Oct 8, 2010 at 10:57 AM, Phillip Gawlowski <cmdjackryan@googlemail.com> wrote:

This sounds like something JRuby would be good at it, since it can JIT
the loops, and the Java VM reaches nearly-native speeds these days.

So, yes, improved technology could make this faster these days.