Ruby-specific performance heuristics?

I've been doing some stuff with CSV recently, having data in one
flat file that I'm trying to normalize into tables. Thus, I'm
trying to ignore things I've seen before so I don't create 2 entries
for them, and that sort of thing.

Jon Bentley points out that "Data Structures Programs" -- i.e. how the
data is "shaped" determines how the program will be designed and
perform. Often the best solution is not immediately obvious, as he
discusses. So, for example, one can test for "seen before" by

   array = []

   unless array.include? item
     array << item
   end

which is much slower for clumped data than

   unless array.include? item
     array.unshift item
   end

or one could use a hash, or maybe a set.
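For concreteness, here is a throwaway benchmark sketch of the three
approaches (the data shape is invented, so the numbers only illustrate
the idea, not any real workload):

```ruby
require 'set'
require 'benchmark'

# Invented sample data with many repeats
items = Array.new(20_000) { rand(1_000) }

Benchmark.bm(6) do |bm|
  bm.report('array') do
    seen = []
    items.each { |i| seen << i unless seen.include?(i) }  # linear scan per lookup
  end
  bm.report('set') do
    seen = Set.new
    items.each { |i| seen.add(i) }  # hashed lookup; re-adding a member is a no-op
  end
  bm.report('hash') do
    seen = {}
    items.each { |i| seen[i] = true }
  end
end
```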

So, my question is [a member of the set of "How long is a piece of
string?" type questions]:

   Are there any heuristics for performance of *Ruby* data structures
   available, which guide one to designing code to *often* have good
   performance?

The answer will be very much "it depends, you need to test it, it's
a function of data size, ...", but I suspect the
implementation, i.e. Ruby, puts useful bounds on the problem.

My searching has not turned up anything, though there are good
performance hints in the PickAxe. I'm wondering if the info
exists or if it would be worth trying to create it, or
if it is just too broad a problem to be worth it. However, since we
Rubyists often speak of the ability to put code together quickly,
having hints around on how to do so and yield efficient code seems
of some worth.

         Hugh

Hugh Sasse wrote:

I've been doing some stuff with CSV recently, having data in one
flat file that I'm trying to normalize into tables. Thus, I'm
trying to ignore things I've seen before so I don't create 2 entries
for them, and that sort of thing.

Jon Bentley points out that "Data Structures Programs" -- i.e. how the
data is "shaped" determines how the program will be designed and
perform. Often the best solution is not immediately obvious, as he
discusses. So, for example, one can test for "seen before" by

   array = []

   unless array.include? item
     array << item
   end

which is much slower for clumped data than

   unless array.include? item
     array.unshift item
   end

or one could use a hash, or maybe a set.

Yes, definitely.

So, my question is [a member of the set of "How long is a piece of
string?" type questions]:

   Are there any heuristics for performance of *Ruby* data structures
   available, which guide one to designing code to *often* have good
   performance?

I think the major differences performance wise are in the algorithms and
data structures as such and not the language: a hash is faster for lookups
in Ruby than an array - as in any other language that has it.

The only thing that comes to mind spontaneously is that it's a good idea
to freeze a string that you are going to use as a hash key, because a
String key will otherwise be duped on insertion to prevent errors caused
by aliasing.
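A quick check of that behaviour in MRI (the keys and values here are my
own examples):

```ruby
h = {}

k1 = "mutable"
h[k1] = 1
# MRI dups (and freezes) an unfrozen String key so later mutation of
# k1 can't corrupt the hash
p h.keys.first.equal?(k1)   # => false

k2 = "frozen".freeze
h[k2] = 2
# a frozen key is stored as-is, skipping the defensive copy
p h.keys.last.equal?(k2)    # => true
```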

The answer will be very much "it depends, you need to test it, it's
a function of data size, ...", but I suspect the
implementation, i.e. Ruby, puts useful bounds on the problem.

Yes, definitely.

My searching has not turned up anything, though there are good
performance hints in the PickAxe. I'm wondering if the info
exists or if it would be worth trying to create it, or
if it is just too broad a problem to be worth it. However, since we
Rubyists often speak of the ability to put code together quickly,
having hints around on how to do so and yield efficient code seems
of some worth.

Another general rule of thumb that is true with Java and also Ruby is that
object creation is comparatively expensive. I.e. "foo" + "bar" is more
expensive than "foo" << "bar" for example. There are only some exceptions
to this rule (Fixnums for example).
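For example (a small sketch; object_id just shows whether a new String
was allocated):

```ruby
s = "foo"
before = s.object_id

s << "bar"                 # appends in place, no new String allocated
p s.object_id == before    # => true
p s                        # => "foobar"

t = s + "baz"              # allocates a brand-new String
p t.object_id == before    # => false
```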

Kind regards

    robert

here's some things i think of:

   - use class variables of some sort : never store in an object what should be
     stored in a class

   - write lazy attribute getters : if your parser loads 15,000 fields from a
     header but normally only 4 or 5 are used in code, write the parser to parse
     the field on demand and cache the result

   - don't write explicit IO : use mmap and let the kernel worry about it

   - use processes not threads : ipc is so easy with ruby (using local drb
     objects) that it's ridiculous not to use multiple processes so
     multi-processor machines can actually take best advantage of your code.
     sometimes threads are better for a problem - but rarely - especially when
     you count debugging time :wink:

   - don't check things : assume an arg is an int - the stack trace is plenty
     good to tell you it wasn't and the check slows things down

   - use rbtree : it's amazing :wink:

   - use sqlite : it's amazing :wink:

   - ruby inline : it's amazing :wink:

   - use iterator methods for filesystem operations : prefer

       find('.'){|f| ... }

     over

       Dir['*/**'].each{|f| ... }

     the second kills you when a dir has 100,000 files in it.

   - compose things out of arrays, hashes, and strings whenever possible. these
     objects are pretty optimized in ruby and serialize great as yaml, meaning
     you never need to write hand parsers.

   - eliminate variable creation where possible : prefer

       n = nil
       obj.each{|elem| n = elem ** 2 and p n}

     over

       obj.each{|elem| n = elem ** 2 and p n}

   - anchor every single regular expression : performance degrades
     exponentially otherwise

   - compile regular expressions only once unless they require variable
     substitution

   - cache anything possible, for instance methods dynamically generated from
     method_missing : permanently add the method to the object instead of
     generating it every time
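as a sketch of that last point - class and field names here are invented -
one way to cache methods generated from method_missing is to define them
for real on first use, so the slow method_missing path is only ever taken
once per field:

```ruby
class Record
  def initialize(fields)
    @fields = fields   # e.g. { "name" => "ruby" }
  end

  def method_missing(name, *args)
    key = name.to_s
    return super unless @fields.key?(key)
    # define a real reader the first time it's asked for; subsequent
    # calls never hit method_missing at all
    self.class.class_eval do
      define_method(name) { @fields[key] }
    end
    send(name)
  end

  def respond_to_missing?(name, include_private = false)
    @fields.key?(name.to_s) || super
  end
end

r = Record.new("name" => "ruby")
p r.name                         # => "ruby"  (via method_missing, then cached)
p Record.method_defined?(:name)  # => true after the first call
```

note that defining on the class means every instance gains the reader,
which may or may not be what you want.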

now, i'm not saying these are 'right' - just that they are what i think of
when trying to make ruby fast.

cheers.

-a


On Fri, 2 Sep 2005, Hugh Sasse wrote:

I've been doing some stuff with CSV recently, having data in one flat file
that I'm trying to normalize into tables. Thus, I'm trying to ignore things
I've seen before so I don't create 2 entries for them, and that sort of
thing.

Jon Bentley points out that "Data Structures Programs" -- i.e. how the data
is "shaped" determines how the program will be designed and perform. Often
the best solution is not immediately obvious, as he discusses. So, for
example, one can test for "seen before" by

   array = []

   unless array.include? item
     array << item
   end

which is much slower for clumped data than

   unless array.include? item
     array.unshift item
   end

or one could use a hash, or maybe a set.

So, my question is [a member of the set of "How long is a piece of string?"
type questions]:

Are there any heuristics for performance of *Ruby* data structures
available, which guide one to designing code to *often* have good
performance?

The answer will be very much "it depends, you need to test it, it's a
function of data size, ...", but I suspect the implementation, i.e. Ruby,
puts useful bounds on the problem.

My searching has not turned up anything, though there are good performance
hints in the PickAxe. I'm wondering if the info exists or if it would be
worth trying to create it, or if it is just too broad a problem to be worth
it. However, since we Rubyists often speak of the ability to put code
together quickly, having hints around on how to do so and yield efficient
code seems of some worth.

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

Ara.T.Howard wrote:

here's some things i think of:

<lots of good stuff deleted>

   - anchor every single regular expression : performance degrades
     exponentially otherwise

That's not possible - especially with String#scan. Sometimes you want all
occurrences of something.

   - compile regular expressions only once unless they require
     variable substitution

I guess you're talking about flag /o or storage of a regexp object
somewhere which is really only needed if the regexp contains interpolation
that you want to do only once. Otherwise a regexp *is* compiled only once
and inline regexps are roughly as performant as a stored regexp and might
be easier to read (not always).
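A quick check of this in MRI (holding on to the Regexp objects so
identity can be compared):

```ruby
# A literal with no interpolation is compiled once at parse time,
# so every evaluation yields the very same Regexp object
plain = 2.times.map { /foo/ }
p plain[0].equal?(plain[1])    # => true

word = "foo"

# With interpolation the Regexp is rebuilt on every evaluation
interp = 2.times.map { /#{word}/ }
p interp[0].equal?(interp[1])  # => false

# /o interpolates and compiles only on the first evaluation
once = 2.times.map { /#{word}/o }
p once[0].equal?(once[1])      # => true
```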

   - cache anything possible, for instance methods dynamically
     generated from method_missing : permanently add the method to
     the object instead of generating it every time

now, i'm not saying these are 'right' - just that they are what i
think of when trying to make ruby fast.

In fact I agree with most of what you wrote. :slight_smile:

Kind regards

    robert

Ara.T.Howard wrote:

  - anchor every single regular expression : performance degrades
    exponentially otherwise

That's an overstatement. True, use anchors whenever possible, as they
are great hints to the engine and they certainly limit the number of
possible matches very often, but it's not like every regular expression
without anchors takes O(N^M),
        nikolai


--
Nikolai Weibull: now available free of charge at http://bitwi.se/!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

Wow, lots of good stuff already. I'm collecting these on the web
(hope that's OK, attributed, slightly edited for brevity, no email
addresses)... The page, which needs the formatting tweaking some more, is

http://www.eng.cse.dmu.ac.uk/~hgs/ruby/performance/

as yet not linked into my ruby page, but Google will probably find
it soon anyway...

Things I'd like to comment on....

- don't write explicit IO : use mmap and let the kernel worry about it

I'm not familiar enough with mmap to understand this one :slight_smile:

- compose things out of array, hashes, and strings whenever possible. these
   objects are pretty optimized in ruby and serialize great as yaml meaning
   you never need to write hand parsers.

Do you mean, as opposed to creating new classes and composing with
those?

- cache anything possible, for instance methods dynamically generated from
   method_missing : permanently add the method to the object instead of
   generating it every time

I've tried to do this and found performance has degraded. I'm not
sure why, but I think this post on Redhanded

http://redhanded.hobix.com/inspect/theFullyUpturnedBin.html

(unfortunately mauled by poker spammers, <seethe/>), referencing

http://whytheluckystiff.net/articles/theFullyUpturnedBin.html

has something to do with it, because of "Grab things in 8MB
chunks" in particular. One case that stood out was caching results
of floating point calculations involving roots: caching was slower.

now, i'm not saying these are 'right' - just that they are what i think of
when trying to make ruby fast.

cheers.

-a

         Thank you, All,
         Hugh


On Fri, 2 Sep 2005, Ara.T.Howard wrote:

here's some things i think of:

<lots of good stuff deleted>

   - anchor every single regular expression : performance degrades
     exponentially otherwise

That's not possible - especially with String#scan. Sometimes you want all
occurrences of something.

true. however there is nearly always __some__ anchor one can use:

   harp:~ > irb
   irb(main):001:0> "foo bar".scan %r/ \b\w+ /x
   => ["foo", "bar"]

a better thing would be to say "use an anchor unless you know exactly why you
shouldn't."

   - compile regular expressions only once unless they require
     variable substitution

I guess you're talking about flag /o or storage of a regexp object somewhere
which is really only needed if the regexp contains interpolation that you
want to do only once. Otherwise a regexp *is* compiled only once and inline
regexps are roughly as performant as a stored regexp and might be easier to
read (not always).

you are quite right - guy pointed this out to me once before - for some reason
my brain hasn't retained it though :wink:

cheers.

-a


On Sat, 3 Sep 2005, Robert Klemme wrote:

i don't think it is an overstatement. i'm not saying every regex w/o anchor
__is__ O(N^M) - but that they __degrade__ to that worst case much more easily
w/o them. a single misplaced '.*' can do this. one time i was examining a
processing machine that uses regexes to match keys of items that arrive over
the network and are inserted into a giant in-memory circular queue
(http://my.unidata.ucar.edu/content/software/ldm/archive/index.html) because
its cpu usage had shot up from around 5 to 50% for no apparent reason. the
keys were not large (10-100 bytes) and the regexes ldm is configured with tend
to be pretty simple, things like

   /F10.*OLS/
   /F12.*OLS/

and these are matched against strings like

   F10200008100242.OLS

but there are __many__ of these strings. in any case it took me about two
days to convince our data manager to let me re-write the regexes - they just
couldn't believe that this could kill a 3.3ghz machine, especially when the
code was written in c and the patterns and strings were so small/simple.
needless to say a few small tweaks to the expressions, anchors and minimal
modifiers mostly, and the machine was back to normal. it was hard for them to
accept even then - i had to draw an exponential graph and say "see, this is
where we were (close to zero on the x-axis) and this is where we are
(teeny-tiny bit to the right, but __way__ up on the y-axis)", explain big-O
notation to them, and explain that regexes are, in fact, a mini computer
program describing a state machine with looping constructs - and i still
don't think they got it because what they had before worked and the incoming
volume hadn't changed "that much." so, for someone like yourself that advice
is a bit strong, but for a lot of people it's better to assert it since it
generally ends up making them gain a better understanding of the input and
write a better expression anyhow, like

   %r/ ^ \s* foo /x vs %r/ foo /x

and it (using anchors/minimal modifiers) is seldom the wrong thing to do.

in summary that advice is probably mis-placed on this list - but i'll still
hand it out in general.
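for what it's worth, a tiny sketch of that kind of rewrite - the sample key
is the one above, and the \A...\z version assumes the key format really does
allow full anchoring:

```ruby
key = "F10200008100242.OLS"

loose    = /F10.*OLS/
anchored = /\AF10.*OLS\z/

p key =~ loose      # => 0
p key =~ anchored   # => 0

# on a key that doesn't match, the unanchored pattern restarts the
# attempt at every position; the \A version gives up after one try
miss = "X" * 100 + ".GIF"
p miss =~ loose     # => nil
p miss =~ anchored  # => nil
```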

cheers.

-a


On Sat, 3 Sep 2005, Nikolai Weibull wrote:

Ara.T.Howard wrote:

  - anchor every single regular expression : performance degrades
    exponentially otherwise

That's an overstatement. True, use anchors whenever possible, as they
are great hints to the engine and they certainly limit the number of
possible matches very often, but it's not like every regular expression
without anchors takes O(N^M),
       nikolai


I seem to be finding that using Struct::Thingy.new is slower than
Thingy.new where Thingy is a class with attr_accessor :this,
:that,...
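A minimal benchmark along these lines (class names invented; results
will vary with Ruby version) is one way to check:

```ruby
require 'benchmark'

StructThingy = Struct.new(:this, :that)

class PlainThingy
  attr_accessor :this, :that
  def initialize(this, that)
    @this, @that = this, that
  end
end

n = 200_000
Benchmark.bm(8) do |bm|
  bm.report('Struct') { n.times { StructThingy.new(1, 2) } }
  bm.report('plain')  { n.times { PlainThingy.new(1, 2) } }
end
```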

         Hugh

Wow, lots of good stuff already. I'm collecting these on the web
(hope that's OK, attributed, slightly edited for brevity, no email
addresses)... The page, which needs the formatting tweaking some more, is

http://www.eng.cse.dmu.ac.uk/~hgs/ruby/performance/

cool.

- don't write explicit IO : use mmap and let the kernel worry about it

I'm not familiar enough with mmap to understand this one :slight_smile:

guy's abstraction is powerful, you just use it like a string:

     harp:~ > cat a.rb
     require 'mmap'
     require 'tempfile'

     #
     # generate a test file
     #
     t = Tempfile::new Process::pid.to_s
     t.puts 'foobar'
     t.close

     #
     # modify it in place - letting the kernel do the IO
     #
     m = Mmap::new t.path, 'rw', Mmap::MAP_SHARED
     m.gsub! %r/foobar/, 'barfoo'
     m.msync
     m.munmap

     #
     # show it
     #
     p IO::read(t.path)

     harp:~ > ruby a.rb
     "barfoo\n"

it's extremely powerful when you are changing, say, a scanline of pixels in the
middle of a 40MB image but nothing else - the kernel will only read/write the
blocks you need.

- compose things out of array, hashes, and strings whenever possible. these
   objects are pretty optimized in ruby and serialize great as yaml meaning
   you never need to write hand parsers.

Do you mean, as opposed to creating new classes and composing with those?

sometimes :wink: i've just been surprised by the performance of ruby's built-ins
lately. for instance, rbtree used to be tons faster for big maps, but hashes
don't do too badly now. built-in strings share storage sometimes, etc.

- cache anything possible, for instance methods dynamically generated from
   method_missing : permanently add the method to the object instead of
   generating it every time

I've tried to do this and found performance has degraded. I'm not
sure why, but think this post on Redhanded

http://redhanded.hobix.com/inspect/theFullyUpturnedBin.html

(unfortunately mauled by poker spammers, <seethe/>), referencing

http://whytheluckystiff.net/articles/theFullyUpturnedBin.html

has something to do with it, and because of "Grab things in 8MB
chunks" in particular. One case that stood out was caching results
of floating point calculations involving roots: caching was slower.

sure. it's just a guideline - i've run up against similar things too. can't
avoid testing in the end :wink:

-a

Ara.T.Howard wrote:

here's some things i think of:

<lots of good stuff deleted>

   - anchor every single regular expression : performance degrades
     exponentially otherwise

That's not possible - especially with String#scan. Sometimes you
want all occurrences of something.

true. however there is nearly always __some__ anchor one can use:

Ah, yes, I overlooked that one.

   harp:~ > irb
   irb(main):001:0> "foo bar".scan %r/ \b\w+ /x
   => ["foo", "bar"]

But does this really make such a big difference? Or does this depend on
input sequence and rx? My quick hack test didn't show significant
differences.

17:29:48 [ruby]: ruby bm-rx-anchor.rb
Rehearsal -------------------------------------------------------
no anchor 4.469000 0.000000 4.469000 ( 4.465000)
1 anchor 4.515000 0.000000 4.515000 ( 4.525000)
2 anchors 4.500000 0.000000 4.500000 ( 4.520000)
--------------------------------------------- total: 13.484000sec

                          user system total real
no anchor 4.500000 0.000000 4.500000 ( 4.509000)
1 anchor 4.484000 0.000000 4.484000 ( 4.494000)
2 anchors 4.531000 0.000000 4.531000 ( 4.531000)

a better thing would be to say "use an anchor unless you know exactly
why you shouldn't."

Sounds good.

   - compile regular expressions only once unless they require
     variable substitution

I guess you're talking about flag /o or storage of a regexp object
somewhere which is really only needed if the regexp contains
interpolation that you want to do only once. Otherwise a regexp
*is* compiled only once and inline regexps are roughly as performant
as a stored regexp and might be easier to read (not always).

you are quite right - guy pointed this out to me once before - for
some reason my brain hasn't retained it though :wink:

Well, repetition helps. :-))

Kind regards

    robert

bm-rx-anchor.rb (315 Bytes)
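(The attachment itself isn't preserved in this archive; a purely
hypothetical reconstruction in the same spirit, matching the three
report labels above, might look like this:)

```ruby
require 'benchmark'

# hypothetical reconstruction -- the real bm-rx-anchor.rb isn't shown
str = "foo bar baz " * 10
n = 5_000

Benchmark.bmbm do |bm|
  bm.report('no anchor') { n.times { str.scan(/\w+/) } }
  bm.report('1 anchor')  { n.times { str.scan(/\b\w+/) } }
  bm.report('2 anchors') { n.times { str.scan(/\b\w+\b/) } }
end
```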


On Sat, 3 Sep 2005, Robert Klemme wrote:

Regarding performance, just using any anchor won't necessarily
help. The important ones are \A (beginning of
string) and \G (where the last match stopped). You should be
able to make just about any regexp not using these use one of
these and match the same thing. For example, this:

/<non-anchored regexp>/m

is equivalent to:

/\A.*?<non-anchored regexp>/m

I used "m" (multi-line) so that "." matched anything. One
should realize that using a non-anchored (\A or \G) regexp has
the above equivalence (.*?) when matching and the associated
performance penalty. Usually, you can simplify the .*? in your
application.

Regarding your scan case above, the \G anchored way to do it
would be:

"foo bar".scan(/\G\s*\w+/)

Of course this returns spaces in the result. You could group
the \w+, but then you get another Array level which degrades
performance.
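To illustrate with the example above:

```ruby
# \G resumes exactly where the previous match ended
p "foo bar".scan(/\G\s*\w+/)            # => ["foo", " bar"]

# grouping drops the whitespace but nests each hit in an Array...
p "foo bar".scan(/\G\s*(\w+)/)          # => [["foo"], ["bar"]]

# ...which flatten undoes, at the cost of the extra Arrays
p "foo bar".scan(/\G\s*(\w+)/).flatten  # => ["foo", "bar"]
```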

I think \A or \G anchoring is also better because you get
better syntax checking of your input - not ignoring what comes
before the thing you want.

It would be nice if there were some option in many of the regexp
matching methods to implicitly anchor your regexp. About the
only ones that do this are in StringScanner.


--- "Ara.T.Howard" <Ara.T.Howard@noaa.gov> wrote:

On Sat, 3 Sep 2005, Robert Klemme wrote:

>> here's some things i think of:
>
> <lots of good stuff deleted>
>
>> - anchor every single regular expression : performance degrades
>>   exponentially otherwise
>
> That's not possible - especially with String#scan. Sometimes you
> want all occurrences of something.

true. however there is nearly always __some__ anchor one can use:

   harp:~ > irb
   irb(main):001:0> "foo bar".scan %r/ \b\w+ /x
   => ["foo", "bar"]

a better thing would be to say "use an anchor unless you know exactly
why you shouldn't."


Ara.T.Howard wrote:

> Ara.T.Howard wrote:

> > - anchor every single regular expression : performance degrades
> > exponentially otherwise

> That's an overstatement. True, use anchors whenever possible, as
> they are great hints to the engine and they certainly limit the
> number of possible matches very often, but it's not like every
> regular expression without anchors takes O(N^M),

i don't think it is an overstatement. i'm not saying every regex w/o
anchor __is__ O(N^M) - but that they easily __degrade__ to that worst
case more easily w/o them. a single misplaced '.*' can do this. one
time i was examining a processing machine that uses regexes to match
keys of items that arrive over the network and are inserted into a
giant in memory circular queue
(http://my.unidata.ucar.edu/content/software/ldm/archive/index.html)
because its cpu usage had shot up from around 5 to 50% for no
apparent reason. the keys were not large (10-100 bytes) and the
regexes ldm is configured with tend to be pretty simple, things like

  /F10.*OLS/
  /F12.*OLS/

and these are matched against strings like

  F10200008100242.OLS

Well, what you said was that "performance degrades exponentially
otherwise".

I don't have your whole dataset, but it looks to me that you could be
using a wild-card matcher instead. Another good solution would be to
not use Ruby's built-in regular expression matcher, as it's implemented
using backtracking, which, you are correct, has a tendency to go haywire
far too often. That's the price you pay for having crap like
backreferencing and look-around I'm afraid.

(Sidenote: I'm releasing a regular expression matcher for Ruby in the near
future that is quite nice.)

Anyway, your advice is sound; the more information you give a
backtracking NFA implementation, the better,
        nikolai


On Sat, 3 Sep 2005, Nikolai Weibull wrote:


Wow, lots of good stuff already. I'm collecting these on the web
(hope that's OK, attributed, slightly edited for brevity, no email
addresses)... The page, which needs the formatting tweaking some more, is

http://www.eng.cse.dmu.ac.uk/~hgs/ruby/performance/

cool.

Thank you. I've still not tidied it up yet.

- don't write explicit IO : use mmap and let the kernel worry about it

I'm not familiar enough with mmap to understand this one :slight_smile:

guy's abstraction is powerful, you just use it like a string:

   harp:~ > cat a.rb
   require 'mmap'
   require 'tempfile'

         [...]

   #
   # modify it in place - letting the kernel do the IO
   #
   m = Mmap::new t.path, 'rw', Mmap::MAP_SHARED
   m.gsub! %r/foobar/, 'barfoo'
   m.msync
   m.munmap

Nice.
        [...]

   harp:~ > ruby a.rb
   "barfoo\n"

it's extremely powerful when you are changing, say, a scanline of pixels in the
middle of a 40MB image but nothing else - the kernel will only read/write the
blocks you need.

That's a great help. I'll have to look into that, not that I do
much image processing these days.

- compose things out of array, hashes, and strings whenever possible. these
   objects are pretty optimized in ruby and serialize great as yaml meaning
   you never need to write hand parsers.

Do you mean, as opposed to creating new classes and composing with those?

sometimes :wink: i've just been surprised by the performance of ruby's built-ins
lately. for instance, rbtree used to be tons faster for big maps, but hashes
don't do too badly now. built-in strings share storage sometimes, etc.

Yes, I'd like to write some test cases for this exploration at some
point, so as future ruby versions appear we can do regression tests
on what is claimed to be quickest.

- cache anything possible, for instance methods dynamically generated from
   method_missing : permanently add the method to the object instead of
   generating it every time

I've tried to do this and found performance has degraded. I'm not
sure why, but think this post on Redhanded

http://redhanded.hobix.com/inspect/theFullyUpturnedBin.html

(unfortunately mauled by poker spammers, <seethe/>), referencing

http://whytheluckystiff.net/articles/theFullyUpturnedBin.html

has something to do with it, and because of "Grab things in 8MB
chunks" in particular. One case that stood out was caching results
of floating point calculations involving roots: caching was slower.

sure. it's just a guideline - i've run up against similar things too. can't
avoid testing in the end :wink:

Agreed. I'm just wondering if the heuristic should be to cache
things of the order of megabytes, because having about 8 of them
will "trip the switch"? I've not read the code and don't know
enough about GC to say much that's meaningful.

-a

         Thank you,
         Hugh


On Sat, 3 Sep 2005, Ara.T.Howard wrote:

On Sat, 3 Sep 2005, Hugh Sasse wrote:

can i assume then, that yours is DFA?

cheers.

-a


On Sat, 3 Sep 2005, Nikolai Weibull wrote:

(Sidenote: I'm releasing a regular expression matcher for Ruby in the near
future that is quite nice.)

Anyway, your advice is sound; the more information you give a
backtracking NFA implementation, the better,


Ara.T.Howard wrote:


On Sat, 3 Sep 2005, Nikolai Weibull wrote:

> (Sidenote: I'm releasing a regular expression matcher for Ruby in a
> near future that is quite nice.)

> Anyway, your advice is sound; the more information you give a
> backtracking NFA implementation, the better,

can i assume then, that yours is DFA?

No, it's an NFA executed in parallel,
        nikolai
