Ruby, memory and speed

I have a script that aggregates data from multiple files, stores it all in a hash, and then emits a summary on standard output. The input files (text files) are fairly big: 4 of about 50 MB and 4 of about 350 MB. The hash grows to about 500,000 keys. The memory footprint of the Ruby process as reported by top is above 2 GB.

When the script starts, it processes the files at a speed of 10 KB/s or so. Not lightning fast, but it will get the job done. As time goes on, the speed drops to 100 bytes/s or less, while still taking 100% CPU time. Unbearable. The machine it is running on is pretty good: 4x AMD Opteron 64-bit, 32 GB memory, local SCSI RAID drive.

Does the performance of Ruby collapse past a certain memory usage? Like the GC kicks in all the time.

Any clue on how to speed this up? Any help appreciated.

Guillaume.

The code is as follows:

delta and snps are IO objects. reads is a hash. max is an integer (4 in my case).
It expects a line starting with '>' on delta. Then it reads some information from delta (and discards the rest) and some more information from snps (if present). All this is then recorded in the reads hash.
Each entry in the hash is an array of the 4 best matches found so far.

def delta_reorder(delta, snps, reads, max = nil)
  l = delta.gets or return
  snps_a = nil
  loop do                                  # one iteration per '>' header line
    l =~ /^>(\S+)\s+(\S+)/ or break
    contig_name, read_name = $1, $2
    read = (reads[read_name] ||= [])
    loop do                                # one iteration per match sub-header
      l = delta.gets or break
      l[0] == ?> and break                 # next '>' header: back to outer loop
      cs, ce, rs, re, er = l.scan(/\d+/)   # contig start/end, read start/end, errors
      er_i = er.to_i
      cs && ce && rs && re && er or break
      l = delta.gets while l && l != "0\n" # skip the offset lines up to the lone 0
      if snps
        snps_a = []
        er_i.times { l = snps.gets or break; snps_a << l.split[-1] }
      end
      score = (re.to_i - rs.to_i).abs - 6 * er_i
      if max
        # i = read.bsearch_upper_boundary { |x| score <=> x[1] }
        # read.insert(i, [contig_name, score, cs, ce, rs, re, er, snps_a])
        # read.slice!(max..-1) if read.size > max
        if read.size >= max                # keep only the max best matches
          min = read.min { |x, y| x[1] <=> y[1] }
          if score > min[1]
            min.replace([contig_name, score, cs, ce, rs, re, er, snps_a])
          end
        else
          read << [contig_name, score, cs, ce, rs, re, er, snps_a]
        end
      else
        if !read[0] || score > read[0][1]  # keep only the single best match
          read[0] = [contig_name, score, cs, ce, rs, re, er, snps_a]
        end
      end
    end
  end
end

Example of data (comments after # are mine, not present in file):

# read_name (hash key) is gnl|ti|379331986
>gi>56411835|ref|NC_004353.2| gnl|ti|379331986 1281640 769
246697 246940 722 479 22 22 0 # Keep this info. Collect 22 lines from snps IO
0 # Skip
440272 440723 156 617 41 41 0 # Keep this info. Collect 41 lines from snps IO
147 # Skip 'til 0
-22
-206
-1
0
441263 441492 384 152 17 17 0 # Keep. Collect lines from snps.
-44 # Skip 'til 0
-1
37
0
>gi>56411835|ref|NC_004353.2| gnl|ti|379331989 1281640 745 # and so forth...
453805 453934 130 1 8 8 0
0
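
For illustration only (not from the original post): after delta_reorder has processed the first header block above, the entry for gnl|ti|379331986 would hold the three matches kept so far, with scores computed as |re - rs| - 6 * errors and the snps arrays elided:

   reads["gnl|ti|379331986"]
   # => [["gi>56411835|ref|NC_004353.2|", 111, "246697", "246940", "722", "479", "22", [...]],
   #     ["gi>56411835|ref|NC_004353.2|", 215, "440272", "440723", "156", "617", "41", [...]],
   #     ["gi>56411835|ref|NC_004353.2|", 130, "441263", "441492", "384", "152", "17", [...]]]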

I think you got it.

Regards, Morton

···

On Aug 17, 2006, at 9:54 AM, Guillaume Marcais wrote:

Does the performance of Ruby collapse past a certain memory usage? Like the GC kicks in all the time.

Small optimization, which will help only if delta_reorder is called often:

     read = (reads[read_name.freeze] ||= [])

Background: a Hash will dup a non-frozen String key to avoid nasty effects if the original changes.
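
A quick way to see the effect (a standalone sketch, not part of the original script; Hash#[]= dups and freezes an unfrozen String key, so every insert with a fresh string pays for an extra copy):

   key = "gnl|ti|379331986"
   h = {}
   h[key] = 1
   p h.keys.first.equal?(key)      # => false: the Hash stored a frozen dup

   fkey = "gnl|ti|379331986".freeze
   h2 = {}
   h2[fkey] = 1
   p h2.keys.first.equal?(fkey)    # => true: the frozen key was stored as-is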

<snip/>

To make life easier for people who want to play with this, you could provide a complete test set (original script + data files).

I don't fully understand your processing, but maybe there's a way to improve this algorithm-wise.

Kind regards

  robert

···

On 17.08.2006 15:54, Guillaume Marcais wrote:

I have a script that aggregates data from multiple files, stores it all in a hash, and then emits a summary on standard output. The input files (text files) are fairly big: 4 of about 50 MB and 4 of about 350 MB. The hash grows to about 500,000 keys. The memory footprint of the Ruby process as reported by top is above 2 GB.

When the script starts, it processes the files at a speed of 10 KB/s or so. Not lightning fast, but it will get the job done. As time goes on, the speed drops to 100 bytes/s or less, while still taking 100% CPU time. Unbearable. The machine it is running on is pretty good: 4x AMD Opteron 64-bit, 32 GB memory, local SCSI RAID drive.

Does the performance of Ruby collapse past a certain memory usage? Like the GC kicks in all the time.

Any clue on how to speed this up? Any help appreciated.

Guillaume.

The code is as follows:

delta and snps are IO objects. reads is a hash. max is an integer (4 in my case).
It expects a line starting with '>' on delta. Then it reads some information from delta (and discards the rest) and some more information from snps (if present). All this is then recorded in the reads hash.
Each entry in the hash is an array of the 4 best matches found so far.

def delta_reorder(delta, snps, reads, max = nil)
  l = delta.gets or return
  snps_a = nil
  loop do
    l =~ /^>(\S+)\s+(\S+)/ or break
    contig_name, read_name = $1, $2

This is a perfect example of what I've noticed many times: Ruby's
performance is perfectly fast and acceptable until the working set gets a
certain (not terribly large) size, then it falls off a cliff. GC perhaps has
something to do with it, but I suspect that's only a small part of the
problem.

Before I spend a lot of time understanding your algorithm: is there a
specific reason why you need to keep the entire set in memory? You say
you're generating summary output at the end of the run, but can you
accumulate your summary in an object of fixed size?

Or does your summary depend on some kind of transformation on the entire set
(like a numeric sort)? If so, there are things you can do to improve the
algorithm so you're not keeping a large working set in memory.

···

On 8/17/06, Guillaume Marcais <guslist@free.fr> wrote:

I have a script that aggregates data from multiple files, stores it all
in a hash, and then emits a summary on standard output. The input files
(text files) are fairly big: 4 of about 50 MB and 4 of about 350 MB.
The hash grows to about 500,000 keys. The memory footprint of the
Ruby process as reported by top is above 2 GB.

[...]

When the script starts, it processes the files at a speed of 10 KB/s or
so. Not lightning fast, but it will get the job done. As time goes on,
the speed drops to 100 bytes/s or less, while still taking 100%
CPU time. Unbearable. The machine it is running on is pretty good:
4x AMD Opteron 64-bit, 32 GB memory, local SCSI RAID drive.

Does the performance of Ruby collapse past a certain memory usage? Like
the GC kicks in all the time.

Any clue on how to speed this up? Any help appreciated.

[...]

Have you tried to increment GC_MALLOC_LIMIT in gc.c?

···

On Thu, Aug 17, 2006 at 10:54:04PM +0900, Guillaume Marcais wrote:

--
Mauricio Fernandez - http://eigenclass.org - singular Ruby

This is a perfect example of what I've noticed many times: Ruby's
performance is perfectly fast and acceptable until the working set gets a
certain (not terribly large) size, then it falls off a cliff. GC perhaps has
something to do with it, but I suspect that's only a small part of the
problem.

Can anyone speculate as to what other subsystems or facets of Ruby's
architecture, besides GC, might possibly cause this?

Regards,

Bill

···

From: "Francis Cianfrocca" <garbagecat10@gmail.com>

I have a script that aggregates data from multiple files, stores it all in a hash, and then emits a summary on standard output. The input files (text files) are fairly big: 4 of about 50 MB and 4 of about 350 MB. The hash grows to about 500,000 keys. The memory footprint of the Ruby process as reported by top is above 2 GB.
When the script starts, it processes the files at a speed of 10 KB/s or so. Not lightning fast, but it will get the job done. As time goes on, the speed drops to 100 bytes/s or less, while still taking 100% CPU time. Unbearable. The machine it is running on is pretty good: 4x AMD Opteron 64-bit, 32 GB memory, local SCSI RAID drive.
Does the performance of Ruby collapse past a certain memory usage? Like the GC kicks in all the time.
Any clue on how to speed this up? Any help appreciated.
Guillaume.
The code is as follows:
delta and snps are IO objects. reads is a hash. max is an integer (4 in my case).
It expects a line starting with '>' on delta. Then it reads some information from delta (and discards the rest) and some more information from snps (if present). All this is then recorded in the reads hash.
Each entry in the hash is an array of the 4 best matches found so far.
def delta_reorder(delta, snps, reads, max = nil)
  l = delta.gets or return
  snps_a = nil
  loop do
    l =~ /^>(\S+)\s+(\S+)/ or break
    contig_name, read_name = $1, $2

Small optimization, which will help only if delta_reorder is called often:

    read = (reads[read_name.freeze] ||= [])

Background: a Hash will dup a non-frozen String key to avoid nasty effects if the original changes.

<snip/>

To make life easier for people who want to play with this, you could provide a complete test set (original script + data files).

Will do, when I get to my office.

Guillaume.

···

On Aug 17, 2006, at 10:45, Robert Klemme wrote:

On 17.08.2006 15:54, Guillaume Marcais wrote:

I don't fully understand your processing, but maybe there's a way to improve this algorithm-wise.

Kind regards

  robert

Hi,

···

In message "Re: Ruby, memory and speed" on Fri, 18 Aug 2006 00:16:33 +0900, "Francis Cianfrocca" <garbagecat10@gmail.com> writes:

This is a perfect example of what I've noticed many times: Ruby's
performance is perfectly fast and acceptable until the working set gets a
certain (not terribly large) size, then it falls off a cliff. GC perhaps has
something to do with it, but I suspect that's only a small part of the
problem.

This may be caused by the problem addressed in 1.9:

   Mon Jul 10 19:22:19 2006 Tanaka Akira <akr@fsij.org>

  * gc.c (gc_sweep): expand heap earlier.
    reported by MORITA Naoyuki. [ruby-dev:28960]

              matz.

I have a script that aggregates data from multiple files, stores it all
in a hash, and then emits a summary on standard output. The input files
(text files) are fairly big: 4 of about 50 MB and 4 of about 350 MB.
The hash grows to about 500,000 keys. The memory footprint of the
Ruby process as reported by top is above 2 GB.

This is a perfect example of what I've noticed many times: Ruby's
performance is perfectly fast and acceptable until the working set gets a
certain (not terribly large) size, then it falls off a cliff. GC perhaps has
something to do with it, but I suspect that's only a small part of the
problem.

Is there any tuning of the GC so it kicks in less frequently when the memory consumption is large? Or could it be that the Hash algorithm chokes when the number of keys gets large (like > 100_000)?

I guess I could re-implement in C or C++, but it saddens me because it is a quick script for (probably) a one time task and doing string processing is so much easier in Ruby. I'll try in Perl first.

Before I spend a lot of time understanding your algorithm: is there a
specific reason why you need to keep the entire set in memory? You say
you're generating summary output at the end of the run, but can you
accumulate your summary in an object of fixed size?

No, I can't (at least I don't see how). I have n reads split into ns files and m contigs split into ms files. Then all ns read files are matched against all ms contig files, which leads to ns * ms match (or delta) files. And each read may have one or many matches with any of the contigs.

I want to extract for each read the 4 "best" matches. (The score is defined as the length of the match minus 6 times the number of errors in this match; we are talking about DNA alignment, so a match can have a small number of mismatches due to sequencing errors, mutations, etc.) The snps here are just added information, not crucial, but they probably add to the problem because they increase the size of the data set. They give the exact position of all the single mismatches inside a match; the number of lines (one mismatch per line) is given by the sub-header in the delta file.
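
As a concrete check of that formula against the first kept line of the example data (rs = 722, re = 479, 22 errors):

   score = (479 - 722).abs - 6 * 22   # => 243 - 132 = 111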

So for every set of reads, I parse the corresponding ms delta files and extract the 4 best matches for each read. The routine shown before parses only one file; the same hash is used for all ms files. A possible driver loop is sketched below.
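
A hypothetical driver loop (the file names and the .delta/.snps pairing are made up for illustration; only delta_reorder comes from the script above):

   reads = {}
   Dir.glob("*.delta").sort.each do |delta_file|
     snps_file = delta_file.sub(/\.delta\z/, ".snps")
     File.open(delta_file) do |delta|
       if File.exist?(snps_file)
         File.open(snps_file) { |snps| delta_reorder(delta, snps, reads, 4) }
       else
         delta_reorder(delta, nil, reads, 4)
       end
     end
   end

   # summary: matches per read, highest score first
   reads.each do |read_name, matches|
     matches.sort_by { |m| -m[1] }.each do |m|
       puts [read_name, m[0], m[1]].join("\t")
     end
   end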

Or does your summary depend on some kind of transformation on the entire set
(like a numeric sort)? If so, there are things you can do to improve the
algorithm so you're not keeping a large working set in memory.

Maybe. I haven't seen it yet. And I went with the dumb algorithm first, as it is an ad-hoc script, not part of some great piece of software.

Thanks,
Guillaume.

···

On Aug 17, 2006, at 11:16, Francis Cianfrocca wrote:

On 8/17/06, Guillaume Marcais <guslist@free.fr> wrote:

Mauricio Fernandez wrote:

[...]
> When the script starts, it processes the files at a speed of 10 KB/s or
> so. Not lightning fast, but it will get the job done. As time goes on,
> the speed drops to 100 bytes/s or less, while still taking 100%
> CPU time. Unbearable. The machine it is running on is pretty good:
> 4x AMD Opteron 64-bit, 32 GB memory, local SCSI RAID drive.
>
> Does the performance of Ruby collapse past a certain memory usage? Like
> the GC kicks in all the time.
>
> Any clue on how to speed this up? Any help appreciated.
[...]

Have you tried to increment GC_MALLOC_LIMIT in gc.c?

No. Any hint on what value I can raise it to? Any upper limit that I
should not cross?

Guillaume.

···

On Thu, Aug 17, 2006 at 10:54:04PM +0900, Guillaume Marcais wrote:

--
Mauricio Fernandez - http://eigenclass.org - singular Ruby


Stole that for
   http://rubygarden.org:3000/Ruby/page/show/RubyOptimization

Suggestions for the Original Poster...

* Browse that Wiki page, it may have something for you. (Alternatively,
   once you solve your problem add the solution to that page!)

* If you are on Linux, use "vmstat 5" eg.
vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r b swpd free buff cache si so bi bo in cs us sy id wa
  0 0 127628 14308 4548 198096 1 1 28 7 27 20 8 1 86 5
  0 0 127628 14308 4548 198096 0 0 0 0 421 835 0 0 100 0

Watch the "si" and "so". (Swap In Swap Out) If you are swapping 2 or
more swaps every 5 seconds, then you don't have ruby GC problems, you
have memory problems. ie. Tweaking GC won't help. You have to store less
in ram full stop. Remember to set any dangling references that you won't
use again to nil, especially from class variables and globals.
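
A minimal illustration of that last point, reusing the reads hash from the original script:

   # once the summary has been written, drop the last reference to the big
   # hash so its contents become garbage (GC.start is just an optional hint)
   reads = nil
   GC.start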

John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : john.carter@tait.co.nz
New Zealand

Carter's Clarification of Murphy's Law.

"Things only ever go right so that they may go more spectacularly wrong later."

From this principle, all of life and physics may be deduced.

···

On Thu, 17 Aug 2006, Robert Klemme wrote:

Small optimization, which will help only if delta_reorder is called often:

   read = (reads[read_name.freeze] ||= [])

Background: a Hash will dup a non-frozen String key to avoid nasty effects if the original changes.

Total, rank speculation here, but I don't suspect GC because I
consistently see this kind of problem even when a lot of objects are
being created and not many are becoming garbage. My wild,
unsubstantiated guess is that Ruby creates data structures in memory
that are not designed to maximize processor and L2 cache
utilization. Perhaps data objects overflow cache lines, or they are
heavily pointer-driven (requiring more memory-bus cycles than is
optimal), or they exhibit poor locality of reference.

···

On 8/17/06, Bill Kelly <billk@cts.com> wrote:

From: "Francis Cianfrocca" <garbagecat10@gmail.com>
>
> This is a perfect example of what I've noticed many times: Ruby's
> performance is perfectly fast and acceptable until the working set gets a
> certain (not terribly large) size, then it falls off a cliff. GC perhaps has
> something to do with it, but I suspect that's only a small part of the
> problem.

Can anyone speculate as to what other subsystems or facets of Ruby's
architecture, besides GC, might possibly cause this?

Any plans to back port it to ruby_1_8 branch?

···

On 8/17/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

Hi,

In message "Re: Ruby, memory and speed" > on Fri, 18 Aug 2006 00:16:33 +0900, "Francis Cianfrocca" <garbagecat10@gmail.com> writes:

>This is a perfect example of what I've noticed many times: Ruby's
>performance is perfectly fast and acceptable until the working set gets a
>certain (not terribly large) size, then it falls off a cliff. GC perhaps has
>something to do with it, but I suspect that's only a small part of the
>problem.

This may be caused by the problem addressed in 1.9:

   Mon Jul 10 19:22:19 2006 Tanaka Akira <akr@fsij.org>

        * gc.c (gc_sweep): expand heap earlier.
          reported by MORITA Naoyuki. [ruby-dev:28960]

                                                        matz.

--
Kent
---

If I remember a past thread correctly, the Ruby GC is set to keep memory usage below 8 MB by default. This is determined by a #define in the interpreter source code, and I don't think it's really an adaptive algorithm. If GC is the performance problem, you could try changing that constant and rebuilding the interpreter.

As for the Hash problem, profiling could probably tell you that. Of course, profiling with such a huge dataset when you already know the code gets anomalously slow might take really, really long.
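
One minimal way to check where the time goes is the profiler from the standard library, run on a scaled-down input (the file name here is made up, and the profiler itself adds a lot of overhead, so keep the subset small):

   require 'profiler'

   reads = {}
   Profiler__::start_profile
   File.open("small_subset.delta") { |d| delta_reorder(d, nil, reads, 4) }
   Profiler__::stop_profile
   Profiler__::print_profile($stderr)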

David Vallner

···

On Thu, 17 Aug 2006 19:49:03 +0200, Guillaume Marcais <guslist@free.fr> wrote:

Is there any tuning of the GC so it kicks in less frequently when the memory consumption is large? Or could it be that the Hash algorithm chokes when the number of keys gets large (like > 100_000)?

Bill Kelly wrote:

From: "Francis Cianfrocca" <garbagecat10@gmail.com>

This is a perfect example of what I've noticed many times: Ruby's
performance is perfectly fast and acceptable until the working set gets a
certain (not terribly large) size, then it falls off a cliff. GC
perhaps has
something to do with it, but I suspect that's only a small part of the
problem.

Can anyone speculate as to what other subsystems or facets of Ruby's
architecture, besides GC, might possibly cause this?

Regards,

Bill

I've posted a script for building a "gprof-enabled" Ruby on GNU/Linux
systems at

http://rubyforge.org/cgi-bin/viewvc.cgi/MatrixBenchmark/?root=cougar

It's a pretty simple process to dig into at least the CPU portion of the
equation -- where is the Ruby interpreter spending cycles interpreting
your script? Some more sophisticated tools are required to profile
memory usage, but it's more or less impossible for a normally-designed
program to abuse memory without spending CPU cycles doing it. :-)

Another simple thing you can do is run "top" while your Ruby script is
executing. On Windows, the Task Manager will do the same thing in the
"Process" tab. Sort in descending order on CPU and if your script is at
the "top", it's spending CPU time. Sort in descending order on memory
size and if your script is at the top without a lot of CPU usage, it's a
memory management bottleneck.

Small optimization, which will help only if delta_reorder is called often:

   read = (reads[read_name.freeze] ||= [])

Background: a Hash will dup a non-frozen String key to avoid nasty effects if the original changes.

Stole that for
  http://rubygarden.org:3000/Ruby/page/show/RubyOptimization

Suggestions for the Original Poster...

* Browse that Wiki page, it may have something for you. (Alternatively,
  once you solve your problem add the solution to that page!)

Thanks for the pointer.

I also used Mmap#scan. It is pretty elegant compared to the usual:

io.each { |l|
   l.chomp!
   next unless l =~ /blah/
   ...
}

I would think it is faster too (no formal testing done).
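
Rough sketch of the Mmap variant, assuming the mmap extension (which exposes a String-like interface on the mapped file; the file name is made up):

   require 'mmap'

   map = Mmap.new("reads_vs_contigs.delta")
   map.scan(/^>.*$/) do |header|
     # header is one '>' line; parse it as in delta_reorder
   end
   map.munmap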

* If you are on Linux, use "vmstat 5" eg.
vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 127628 14308 4548 198096 1 1 28 7 27 20 8 1 86 5
0 0 127628 14308 4548 198096 0 0 0 0 421 835 0 0 100 0

Watch the "si" and "so". (Swap In Swap Out) If you are swapping 2 or
more swaps every 5 seconds, then you don't have ruby GC problems, you
have memory problems. ie. Tweaking GC won't help. You have to store less
in ram full stop. Remember to set any dangling references that you won't
use again to nil, especially from class variables and globals.

Checked that. But no, the machine has plenty of memory and si/so was always 0.

I finally went back to doing stream parsing. Instead of aggregating the information from many files in one big hash, I read and write one record at a time in a format suitable for sort (the UNIX command). Then I pipe it to sort. Finally, I merge all the relevant files together with 'sort -m'. This led to the result in a few hours only.
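
A minimal sketch of that stream-and-sort approach (the record layout and sort keys are assumptions, not the original script):

   # emit one tab-separated record per match: read name, score, contig, coordinates
   def delta_to_records(delta, out)
     contig_name = read_name = nil
     while (l = delta.gets)
       if l =~ /^>(\S+)\s+(\S+)/
         contig_name, read_name = $1, $2
       elsif (fields = l.scan(/\d+/)).size >= 5
         cs, ce, rs, re, er = fields
         score = (re.to_i - rs.to_i).abs - 6 * er.to_i
         out.puts [read_name, score, contig_name, cs, ce, rs, re, er].join("\t")
         l = delta.gets while l && l != "0\n"   # skip the per-match offset lines
       end
     end
   end

   # A small wrapper can call delta_to_records(ARGF, $stdout); then, per batch:
   #   ruby emit_records.rb batch1/*.delta | sort -k1,1 -k2,2rn > batch1.sorted
   # and finally 'sort -m batch*.sorted' with a last pass that keeps only the
   # first 4 records per read.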

Thank you all for your suggestions,
Guillaume.

···

On Aug 18, 2006, at 00:07, John Carter wrote:

On Thu, 17 Aug 2006, Robert Klemme wrote:

John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : john.carter@tait.co.nz
New Zealand

Carter's Clarification of Murphy's Law.

"Things only ever go right so that they may go more spectacularly wrong later."

From this principle, all of life and physics may be deduced.

You have so much RAM you could probably afford to set it to a few hundred
MBs... If malloc_limit grows quickly enough it won't really make a
difference, though. It'd be nice to know if it actually does in your case,
for future reference.

···

On Fri, Aug 18, 2006 at 09:50:08AM +0900, guillaume.marcais@gmail.com wrote:

Mauricio Fernandez wrote:
> On Thu, Aug 17, 2006 at 10:54:04PM +0900, Guillaume Marcais wrote:
> > Does the performance of Ruby collapse past a certain memory usage? Like
> > the GC kicks in all the time.
> >
> > Any clue on how to speed this up? Any help appreciated.
> [...]
>
> Have you tried to increment GC_MALLOC_LIMIT in gc.c?

No. Any hint on what value I can raise it to? Any upper limit that I
should not cross?

--
Mauricio Fernandez - http://eigenclass.org - singular Ruby

Hi,

···

In message "Re: Ruby, memory and speed" on Fri, 18 Aug 2006 02:52:33 +0900, "Kent Sibilev" <ksruby@gmail.com> writes:

This may be caused by the problem addressed in 1.9:

   Mon Jul 10 19:22:19 2006 Tanaka Akira <akr@fsij.org>

        * gc.c (gc_sweep): expand heap earlier.
          reported by MORITA Naoyuki. [ruby-dev:28960]

Any plans to back port it to ruby_1_8 branch?

Not for 1.8.5. Maybe later.

              matz.

David Vallner wrote:

> Is there any tuning of the GC so it kicks in less frequently when the
> memory consumption is large? Or could it be that the Hash algorithm
> chokes when the number of keys gets large (like > 100_000)?

If I remember a past thread correctly, the Ruby GC is set to keep memory
usage below 8 MB by default. This is determined by a #define
in the interpreter source code, and I don't think it's really an adaptive
algorithm. If GC is the performance problem, you could try changing that
constant and rebuilding the interpreter.

As for the Hash problem, profiling could probably tell you that. Of
course, profiling with such a huge dataset when you already know the code
gets anomalously slow might take really, really long.

Yeah, that's my problem. I did some profiling on a smaller subset,
and that's why I removed the binary search (see posted code). A binary
search on a sorted array of only 4 elements doesn't buy you anything; it was
even slower. But I never tried with a dataset that would push the
memory consumption, so I got no insight on that.

Guillaume.

···

On Thu, 17 Aug 2006 19:49:03 +0200, Guillaume Marcais <guslist@free.fr> > wrote:

David Vallner