Ruby and mustard

what do you do when ruby just won't quite cut the mustard?

here's a scenario from last week:

i'm doing some trivial satellite image processing: find some pixels in one huge
image which are essentially pointers (indexes) into another image. for each
pixel found to be a certain value set those indexes in a bunch of other
images. here's the catch: the images are all HUGE - 1-3gb - and i've got to
be processing sets of 6-8 of them at a time. most amazingly, a combination of
guy's mmap and strscan does a good job with this task: the mmap ensures good
memory management w/o blowing the top off system limits and strscan is nice
and fast. using this combination i'm able to process an image set in 2-4
minutes. if you think about it this is a really amazing thing to be doing with a
scripting language. however, we are going to be using this code on 10s of
thousands of image sets - so every second counts. i spent a day writing
equivalent code in c which runs in < 1 minute - nice. the problem is this:
it took a DAY! the ruby code took about 35 minutes to write. we move at an
insane pace around here and i hardly ever have a DAY to do anything. i spent
monday scanning the web to check out the latest developments in languages.
ocaml grabbed my eye. then i spent 4 hours writing my first program in it
that used Bigarray and its mmap facility to behave as my c program did. it
takes about 1.5 minutes to run and, i must admit, i wasn't really enjoying the
functional paradigm - suppose that might change though...
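
For illustration only, here is a minimal sketch of the strscan half of that combination. It is not the original code: it scans an ordinary in-memory binary string rather than an mmap'd file, and the method name `each_offset_of` is made up for this example.

```ruby
require 'strscan'

# Yield the byte offset of every occurrence of the byte `val` (0..255) in `data`.
# `data` stands in for the mmap'd image; here it is just a binary String.
def each_offset_of(data, val)
  pattern = Regexp.new(Regexp.escape(val.chr), Regexp::NOENCODING)
  scanner = StringScanner.new(data)
  # scan_until advances past the next match, so pos - 1 is the matched byte.
  yield scanner.pos - 1 while scanner.scan_until(pattern)
end

pixels = "\x00\x05\x07\x05".b
each_offset_of(pixels, 5) { |offset| puts offset }
```

Because each offset is handed to the block as soon as it is found, nothing larger than the scanner itself is allocated along the way.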

so here's my dilemma:

when you want to write something faster than ruby, but want basic (IMHO)
tools at your disposal like hashes, good string handling, exceptions, etc.
what is the way to go? as i see it there are a few options:

   * do it in another lang. i really don't like this because of the learning
     curve and adding extra dependencies (ocamlc for eg.).

   * c++ is simply out. :wink:

   * do it in pure c. have you used getopt_long lately - sheesh.

   * do it using a nice library for c. glib is good - lots of bells.
     extra dependency though...

these are the options i've been mulling over. lately, however, i'm starting
to favour this option:

   * just code it in c using ruby's builtin libs. gives you hashes, eval'ing
     code, lists, GC, etc. i'm not adding additional dependencies and i'm
     guaranteed (if my c stays pretty posix) that my code will run where ruby
     will. don't need autotools, no stl, etc. etc.

are there any other options i'm missing?

what do you do when it needs to be __really__ fast BUT you also have to
develop it __really__ fast and would __prefer__ not to add dependencies?

regards.

-a

···

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it; and a weed grows, even though we do
not love it. --Dogen

===============================================================================

It depends. For what you are doing, my preference is to use Ruby
to do all the admin work like parse command lines and manage
the non critical sections of code, and then use either C or Ruby C
to write the speed critical sections.
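
As a sketch of the Ruby side of that split, the standard `optparse` library can handle the command line before anything speed critical runs. The option names and the `parse_args` helper are hypothetical, chosen only for this example.

```ruby
require 'optparse'

# Parse argv into an options hash plus the remaining (non-option) arguments.
# The option names here are illustrative only.
def parse_args(argv)
  opts = { value: 0, verbose: false }
  parser = OptionParser.new do |o|
    o.banner = 'usage: scan [options] IMAGE...'
    o.on('-v', '--value N', Integer, 'pixel value to look for') { |n| opts[:value] = n }
    o.on('--verbose', 'print progress') { opts[:verbose] = true }
  end
  images = parser.parse(argv) # non-destructive; returns the leftover arguments
  [opts, images]
end

opts, images = parse_args(%w[--value 7 a.img b.img])
# At this point the speed-critical section (C or a Ruby C extension)
# would be called with opts[:value] and each entry of images.
```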

I too looked at ocaml just last week. The problem I had was that
I could not easily wrap its calls into Ruby for OS X. A method
exists for Linux, but I need BSD support and didn't want to
take the time to do this myself.

Now the question is to use Ruby C or just plain ole C. For
things that I can really abstract, I prefer C (sort of like
the OO style used in the examples in the Pickaxe book) and
then use Swig to wrap my libraries. That way, I can use the
library independent of Ruby if I (or anyone else) ever need to.

···

On Tuesday, 8 June 2004 at 23:53:42 +0900, Ara.T.Howard wrote:

what do you do when it needs to be __really__ fast BUT you also have to
develop it __really__ fast and would __prefer__ not to add dependencies?

--
Jim Freeze
"It's not Camelot, but it's not Cleveland, either."
    -- Kevin White, mayor of Boston

Hello,

when you want to write something faster than ruby, but want basic (IMHO)
tools at your disposal like hashes, good string handling, exceptions, etc.
what is the way to go? as i see it there are a few options:

Do you know about psyco.sf.net ? You might try writing it in Python and
get Psyco's speedup.

Or wait around for another few months until I come up with Psyco for
Ruby. (flame me now).

greetings,
kaspar

semantics & semiotics
code manufacture

www.tua.ch/ruby

If you've got a serious problem to solve, the dependency added by
glib surely pales as a concern. Heck, I don't know what it is, but if
*you* think it will help, use it!

Gavin

···

On Wednesday, June 9, 2004, 12:53:42 AM, Ara.T.Howard wrote:

   * do it using a nice library for c. glib is good - lots of bells.
     extra dependency though...

Can you just use a faster computer? If you're working for a company it's a
good excuse to bug your boss for better hardware, and if you're working at
a university you might be able to convince them to let you use their
high-performance computer. (every university has at least 1 supercomputer
right? :slight_smile: )

The nice thing about code is that your algorithm's complexity will be the
same no matter what language it's written in, so scaling hardware is
often a nice easy solution. (Unless you're already running it on a
top-of-the-line system).

···

On Tue, Jun 08, 2004 at 11:53:42PM +0900, Ara.T.Howard wrote:

what do you do when ruby just won't quite cut the mustard?

have you considered using stuff like narray? (I have this strange
vision of images=~matrices, I may be completely wrong, anyway, but
there is a NImage IIRC)

···

On Tue, 8 Jun 2004 08:47:52 -0600, "Ara.T.Howard" <Ara.T.Howard@noaa.gov> wrote:

Is it possible to somehow isolate the inner-loop functionality? As long
as you can stay away from raw iterations of every data point -- calling only
row or column operations, for example -- you have a pretty good chance of
being fast enough.

Don't forget you can call any function in any shared library with ruby/dl,
including memcpy and so forth. Make some trivial extension classes which
simply hold raw chunks of data in a C array, or just pack strings. You
can then pass this data to C lib functions or ruby extensions.
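
ruby/dl itself was later removed from Ruby's standard library and replaced by Fiddle. As a rough sketch of the same idea using Fiddle (an assumption about what a current setup would use, not what was available when this was written), the following calls libc's `strlen` and `memcpy` directly and keeps the copied bytes in a raw C buffer. It assumes a Unix-like system where opening the running process exposes the libc symbols.

```ruby
require 'fiddle'

# Open the running process itself; on most Unix-like systems this exposes libc.
libc = Fiddle.dlopen(nil)

# size_t strlen(const char *s)
strlen = Fiddle::Function.new(
  libc['strlen'],
  [Fiddle::TYPE_VOIDP],
  Fiddle::TYPE_SIZE_T
)

# void *memcpy(void *dest, const void *src, size_t n)
memcpy = Fiddle::Function.new(
  libc['memcpy'],
  [Fiddle::TYPE_VOIDP, Fiddle::TYPE_VOIDP, Fiddle::TYPE_SIZE_T],
  Fiddle::TYPE_VOIDP
)

src = 'mustard'
len = strlen.call(src)

# A raw chunk of C memory held from Ruby; RUBY_FREE lets the GC release it.
buf = Fiddle::Pointer.malloc(len, Fiddle::RUBY_FREE)
memcpy.call(buf, src, len)
copy = buf.to_s(len) # read len bytes back out as a Ruby String
```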

···

--- "Ara.T.Howard" <Ara.T.Howard@noaa.gov> wrote:

what do you do when it needs to be __really__ fast BUT you also have to
develop it __really__ fast and would __prefer__ not to add dependencies?

Have you taken a look at rubyinline? It lets you embed compiled c
code inline in the middle of a ruby script. That way if you need to
run a fast loop you can switch to C, but the code that is cheap to run and
takes a while to write in C is still doable in ruby. Can't remember the url
for it, but you should find it if you google. I want to say it's zenspider's work.

Charlie

···

On Tue, 8 Jun 2004, Ara.T.Howard wrote:

what do you do when it needs to be __really__ fast BUT you also have to
develop it __really__ fast and would __prefer__ not to add dependencies?

what do you do when ruby just won't quite cut the mustard?

...

when you want to write something faster than ruby, but want basic (IMHO)
tools at your disposal like hashes, good string handling, exceptions, etc.
what is the way to go?

It may not be a practical solution now, because it will take you some time
to learn, but you might want to look into using OCaml in the future. OCaml
is a powerful functional language that can produce native binary
executables on a number of platforms. OCaml execution speed is extremely
impressive. In just about every scenario, the speed of an OCaml program
exceeds that of a comparable c++ program, and there are some situations where
OCaml execution speed may even exceed that of a comparable c program.

OCaml is also a good choice, because it is object oriented, and it
provides a lot of really useful/powerful extras that vastly simplify code.
Hashes are available via the Hashtbl module, and type-homogeneous lists are
a native part of the language. OCaml has a map function, and handles lists
with incredible grace.

-- SegPhault

···

On Tue, 08 Jun 2004 08:47:52 -0600, Ara.T.Howard wrote:

Glib is the base library for GTK+ -- it's an object and signal toolkit
in C. Very nice API, for C anyway. It has a lot of ruby-ish features.
It's nice to use.

However, I'd use Ruby for the same: Same functionality, plus the
language to code (or even just prototype) additional parts in.

Ari

···

On Wed, 2004-06-09 at 00:55 +0900, Gavin Sinclair wrote:

On Wednesday, June 9, 2004, 12:53:42 AM, Ara.T.Howard wrote:

> * do it using a nice library for c. glib is good - lots of bells.
> extra dependency though...

If you've got a serious problem to solve, the dependency added by
glib surely pales as a concern. Heck, I don't know what it is, but if
*you* think it will help, use it!

Conan wrote:

Can you just use a faster computer? If you're working for a company it's a
good excuse to bug your boss for better hardware, and if you're working at
a university you might be able to convince them to let you use their
high-performance computer. (every university has at least 1 supercomputer
right? :slight_smile: )

Heh, look at his signature. "@noaa.gov". That may mean he has access to supercomputers, but it also means that if he doesn't, getting something faster may be incredibly painful.

Btw, Ara, I don't have any answers to your question, but I think the fact you're asking it is great. It's really amazing to show some real world Ruby applications, esp. when it involves serious number crunching.

Ben

~ > cat /proc/cpuinfo | grep GHz
model name : Intel(R) Xeon(TM) CPU 2.80GHz

yes - there ARE four - it's plenty __fast__. we simply need it to be faster.

:wink:

-a

···

On Wed, 9 Jun 2004, Conan wrote:

Can you just use a faster computer? If you're working for a company it's a
good excuse to bug your boss for better hardware, and if you're working at
a university you might be able to convince them to let you use their
high-performance computer. (every university has at least 1 supercomputer
right? :slight_smile: )

The nice thing about code is that your algorithm's complexity will be the
same no matter what language it's written in, so scaling hardware is
often a nice easy solution. (Unless you're already running it on a
top-of-the-line system).

===============================================================================

good points - esp. bits about abstraction. i have done that approach before
(swig wrapped generic c code) - perhaps i'll return to it...

-a

···

On Wed, 9 Jun 2004, Jim Freeze wrote:

On Tuesday, 8 June 2004 at 23:53:42 +0900, Ara.T.Howard wrote:

what do you do when it needs to be __really__ fast BUT you also have to
develop it __really__ fast and would __prefer__ not to add dependencies?

It depends. For what you are doing, my preference is to use Ruby
to do all the admin work like parse command lines and manage
the non critical sections of code, and then use either C or Ruby C
to write the speed critical sections.

I too looked at ocaml just last week. The problem I had was that
I could not easily wrap its calls into Ruby for OS X. A method
exists for Linux, but I need BSD support and didn't want to
take the time to do this myself.

Now the question is to use Ruby C or just plain ole C. For
things that I can really abstract, I prefer C (sort of like
the OO style used in the examples in the Pickaxe book) and
then use Swig to wrap my libraries. That way, I can use the
library independent of Ruby if I (or anyone else) ever need to.

===============================================================================

Is it possible to somehow isolate the inner-loop functionality? As long
as you can stay away from raw iterations of every data point -- calling only
row or column operations, for example -- you have a pretty good chance of
being fast enough.

not really. i was trying out a combo of mmap and narray:

   mmap = Mmap.new 'huge', 'rw', Mmap::MAP_SHARED
   na = NArray.to_na mmap.to_s, NArray::BYTE
   positions = (na.eq val).where

(how cool is it that this works!)

but this blows the top right off memory/swap. eg. if i could do this

   (na.eq val).where do |pos|
     ...
   end

eg. iff #where took a block i could do it this way. any sort of collection
will blow up since i'm looking for potentially 2 ** 30 positions and streaming
them on stdout (to another program) and each of those positions can occupy
(guessing) 30 bytes or so (how big is '1234567' in ruby?)... in other words i
really do have to handle them one at a time.
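
A sketch of what handling them one at a time could look like without #where taking a block: read the data in fixed-size chunks with plain `IO#read` and yield each matching offset as it is found, so no collection of positions is ever built. This swaps mmap and NArray for ordinary reads, and `each_position` and the chunk size are made-up names and values for this example.

```ruby
require 'stringio'

# Yield the absolute byte offset of every byte equal to `val` in `io`,
# reading `chunk_size` bytes at a time.
def each_position(io, val, chunk_size = 1 << 20)
  target = val.chr.b
  base = 0
  while (chunk = io.read(chunk_size))
    i = -1
    # String#index returns nil once there are no more matches in this chunk.
    while (i = chunk.index(target, i + 1))
      yield base + i
    end
    base += chunk.bytesize
  end
end

# StringIO stands in for a File opened in binary mode.
io = StringIO.new("\x00\x05\x07\x05".b)
each_position(io, 5, 3) { |pos| puts pos }
```

Memory use is bounded by the chunk size rather than by the number of matches, which is the property the block form of #where would have given.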

i can hear you thinking... why doesn't he mmap a piece at a time and use some
sort of buffering?

see

   124624 – mmap use causes kernel panic

for the reason why...

Don't forget you can call any function in any shared library with ruby/dl,
including memcpy and so forth. Make some trivial extension classes which
simply hold raw chunks of data in a C array, or just pack strings. You
can then pass this data to C lib functions or ruby extensions.

ah, ruby/dl - i've never used this. know of any good examples? this __is__
interesting.

-a

···

On Wed, 9 Jun 2004, Jeff Mitchell wrote:
===============================================================================

Conan wrote:

Can you just use a faster computer? If you're working for a company it's a
good excuse to bug your boss for better hardware, and if you're working at
a university you might be able to convince them to let you use their
high-performance computer. (every university has at least 1 supercomputer
right? :slight_smile: )

The nice thing about code is that your algorithm's complexity will be the
same no matter what language it's written in, so scaling hardware is
often a nice easy solution. (Unless you're already running it on a
top-of-the-line system).

I think the problem here is that if they could get it to process 100 pictures a second, that would be better. They want to squeeze as much performance per CPU as possible, since any wait time is more than ideal. So if you get heavier iron, then they are going to want to use it to process that many more pictures, and you're left with the same problem you started with.

Charles Comstock wrote:

Have you taken a look at rubyinline? It lets you embed compiled c
code inline in the middle of a ruby script. That way if you need to
run a fast loop you can switch to C, but the code that is cheap to run and
takes a while to write in C is still doable in ruby. Can't remember the url
for it, but you should find it if you google. I want to say it's zenspider's work.

Charlie

All of this discussion about C also reminds me that C code can itself be optimized considerably. If you really want to squeeze the last drop out of performance you can sometimes get even a twenty-fold improvement by doing architecture-specific optimizations. There are also plenty of platform-agnostic optimizations that can be done. A good book on this is /Computer Systems: A Programmer's Perspective/, Randal E. Bryant and David O'Hallaron.

Carl

···

On Tue, 8 Jun 2004, Ara.T.Howard wrote:

Last "Recent News" is September 2003. I wonder if that means that the
thing is not very active anymore?

Yours,

JeanHuguesRobert

···

At 14:38 12/06/2004 +0900, you wrote:

On Tue, 08 Jun 2004 08:47:52 -0600, Ara.T.Howard wrote:

what do you do when ruby just won't quite cut the mustard?

....

when you want to write something faster than ruby, but want basic (IMHO)
tools at your disposal like hashes, good string handling, exceptions, etc.
what is the way to go?

It may not be a practical solution now, because it will take you some time
to learn, but you might want to look into using OCaml in the future. OCaml
is a powerful functional language that can produce native binary
executables on a number of platforms. OCaml execution speed is extremely
impressive. In just about every scenario, the speed of an OCaml program
exceeds that of a comparable c++ program, and there are some situations where
OCaml execution speed may even exceed that of a comparable c program.

OCaml is also a good choice, because it is object oriented, and it
provides a lot of really useful/powerful extras that vastly simplify code.
Hashes are available via the Hashtbl module, and type-homogeneous lists are
a native part of the language. OCaml has a map function, and handles lists
with incredible grace.

-- SegPhault

-------------------------------------------------------------------------
Web: @jhr is virteal, virtually real
Phone: +33 (0) 4 92 27 74 17

That may mean he has access to supercomputers,

yes.

but it also means that if he doesn't, getting something faster may be
incredibly painful.

yes. emphasis on the 'incredibly'.

:wink:

Btw, Ara, I don't have any answers to your question, but I think the fact
you're asking it is great. It's really amazing to show some real
world Ruby applications, esp. when it involves serious number crunching.

Ben

definitely serious number crunching. my latest project was ruby(mine)/idl(not
mine) fire detection - which is being used by the indian government, check out

   http://dmsp.ngdc.noaa.gov/images/poster_world.jpg

red is ruby/idl found fires!

primary data source is

   http://dmsp.ngdc.noaa.gov/html/sensors/doc_ols.html

-a

···

On Wed, 9 Jun 2004, Ben Giddings wrote:
===============================================================================

this is about where i'm at... glib is really nice, but the ruby api is
sufficient for most purposes...

-a

···

On Wed, 9 Jun 2004, Aredridel wrote:

On Wed, 2004-06-09 at 00:55 +0900, Gavin Sinclair wrote:

On Wednesday, June 9, 2004, 12:53:42 AM, Ara.T.Howard wrote:

   * do it using a nice library for c. glib is good - lots of bells.
     extra dependency though...

If you've got a serious problem to solve, the dependency added by
glib surely pales as a concern. Heck, I don't know what it is, but if
*you* think it will help, use it!

Glib is the base library for GTK+ -- it's an object and signal toolkit
in C. Very nice API, for C anyway. It has a lot of ruby-ish features.
It's nice to use.

However, I'd use Ruby for the same: Same functionality, plus the
language to code (or even just prototype) additional parts in.

Ari

===============================================================================

Ben Giddings wrote:

Btw, Ara, I don't have any answers to your question, but I think the fact you're asking it is great. It's really amazing to show some real world Ruby applications, esp. when it involves serious number crunching.

Timely for me, too, as I've been pondering writing a (potentially) commercial app in Ruby, and speed may be an issue. My argument to skeptics is that any slow stuff can be replaced with C code; that when you code in Ruby you're essentially scripting a C app (full source available, too!), and extending that C code is relatively easy.

James