Very strange GC behaviour

(Geert Fannes) #1

I implemented a sparse matrix class for loading a calling network (vertices are mobile phone users and edges are calls. The rows are held by a Hash and each row is an AVLTree, containing the information to where some customer called and how much (total duration). These AVLTrees are balanced trees that allow fast lookup and insertion, I used the C package "C AVL Tree generic package" from Walt Karas for this and converted it to a ruby class.

Anyway, as you can imagine, these calling networks are quite large (5 gig in memory, 10 million customers, 50 million calls). Unfortunately, ruby's GC hinders fast loading of these graphs: each time GC shoots in action, he has to check the ever increasing AVLTree objects for deletion and none can be deleted. Therefore, my plan was to disable GC and run it manually every 2 million lines. When doing this, ruby appears to consume way more memory and crashes before the calling graph is loaded from file (Errno::ENOMEM). When I leave the default GC untouched, I am capable to read the graph, but much slower ofcourse due to frequent GC. I cannot say GC.start is not working, since after each 2 million lines, the program keeps the same memory footprint for some time, before it starts increasing again. My guess is that it does not really free the memory, but keeps it in some pool and reuses it. Here is a sketch of the code:

class SparseMatrix
  attr :rows

  def SparseMatrix.load(fileName)
    matrix=SparseMatrix.new

    GC.disable

    File.open(fileName) do |fi|
      nrRead=0
      fi.each do |line|
        #do GC every 2 million lines
        if nrRead%2000000==0
          GC.enable
          GC.start
          GC.disable
        end

        #insert the info contained in this line into the matrix
        source,destination,duration=extractInfoFromLine(line)
        matrix[source]=AVLTree.new if !matrix[source]
        matrix[source].insert(destination,duration)

        nrRead+=1
      end
    end

    GC.enable

    matrix
  end
end

matrix=SparseMatrix.load(fromSomeFile)

Greetings,
Geert.

···

-----Original Message-----
From: Brian Schröder [mailto:ruby.brian@gmail.com]
Sent: Tuesday, August 23, 2005 4:36 PM
To: ruby-talk ML
Subject: Re: Very strange GC behaviour

On 23/08/05, Geert Fannes <Geert.Fannes@ikanconsulting.com> wrote:

Can these registers be cleared without doing much damage before doing
the garbage collection, thus providing a less conservative GC.start
function (perhaps with a parameter true or false for easy selection)? I
found a function rb_gc_start in gc.c. I'm guessing this can best be done
before the rb_gc call. Or are things more complicated as they seem...

Why do you need this. Maybe you are tackling the problem from the
wrong angle if you need to rely so much on the implementation of the
GC. If you describe your problem maybe someone here can find a less
dependant solution.

regards,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

(Stefan Kaes) #2

Geert Fannes wrote:

I implemented a sparse matrix class for loading a calling network (vertices are mobile phone users and edges are calls. The rows are held by a Hash and each row is an AVLTree, containing the information to where some customer called and how much (total duration). These AVLTrees are balanced trees that allow fast lookup and insertion, I used the C package "C AVL Tree generic package" from Walt Karas for this and converted it to a ruby class.

Anyway, as you can imagine, these calling networks are quite large (5 gig in memory, 10 million customers, 50 million calls). Unfortunately, ruby's GC hinders fast loading of these graphs: each time GC shoots in action, he has to check the ever increasing AVLTree objects for deletion and none can be deleted. Therefore, my plan was to disable GC and run it manually every 2 million lines. When doing this, ruby appears to consume way more memory and crashes before the calling graph is loaded from file (Errno::ENOMEM). When I leave the default GC untouched, I am capable to read the graph, but much slower ofcourse due to frequent GC. I cannot say GC.start is not working, since after each 2 million lines, the program keeps the same memory footprint for some time, before it starts increasing again. My guess is that it does not really free the memory, but keeps it in some pool and reuses it.

You can try my GC patch, which allows one to specify the initial heap size and several other parameters. You can download it from http://rubyforge.org/projects/railsbench.

Regards,

Stefan Kaes