BIG memory problem

Hi all,

I'm having a really odd memory problem with a small ruby program I've
written. It basically takes in lines from input files (which represent
router flows), deduplicates them (based on elements of the line) and
outputs the unique flows to file. The input file often contains over
300,000 lines of which about 25-30% are duplicates. The trouble I'm
having is that the program (which is intended to be long running) does
not seem to release any memory back to the system and in fact just
increases in memory footprint from iteration to iteration. It should
use about 150 MB by my estimates but sails through this and yesterday
slowed to a halt at about 1.6GB (due to the GC by my guess). This
doesn't make any sense to me as at times I am deleting data structures
that occupy at least 50MB of memory.

The codebase is slightly too big to pastie, but it is available
here: http://svn.tobyclemson.co.uk/public/trunk/flow_deduplicator .
There are actually only 2 classes of importance and 1 script but I
don't know if pastie can handle that.

Any help would be greatly appreciated as the alternative (pressures
from above) is to rewrite in Python (which involves me learning
Python)

Thanks in advance,
Toby

I _think_ I have found a problem. In the main loop (in bin/dedupe),
you use a single Timestamp instance, which is destructively
modified by calling advance.

Now this single Timestamp instance is used as a key for _all_
calls to checksum_buffer.add(). As a result, the @buffers hash
will always have only one entry and this single entry will hold _all_
flow.checksum/flow.timestamp pairs ever. Since the retention threshold
is 1, this single @buffers entry that holds _all_ data will never be
deleted.
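
To make the effect concrete, here is a minimal sketch of the situation described above. The Timestamp and ChecksumBuffer classes below are simplified stand-ins written for illustration (the real classes in the repository may look quite different), and the 60-second step and integer time are assumptions.

```ruby
# Hypothetical stand-ins for the classes discussed in the thread.
class Timestamp
  attr_reader :time

  def initialize(time)
    @time = time
  end

  # Destructive: changes this object in place.
  def advance
    @time += 60
    self
  end
end

class ChecksumBuffer
  attr_reader :buffers

  def initialize
    # One array of checksums per hash key.
    @buffers = Hash.new { |hash, key| hash[key] = [] }
  end

  def add(checksum, timestamp)
    @buffers[timestamp] << checksum
  end
end

buffer = ChecksumBuffer.new
timestamp = Timestamp.new(0)

3.times do |i|
  buffer.add("checksum-#{i}", timestamp)
  timestamp.advance # same object every time, only its @time changes
end

puts buffer.buffers.size              # 1
puts buffer.buffers.values.first.size # 3
```

Because the key is always the same object (and the default hash/eql? are based on object identity), every checksum lands in one ever-growing entry.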

The solution should be to make Timestamp#advance nondestructive
and change the line

  timestamp.advance

in the main loop to

  timestamp = timestamp.advance

Stefan

Sorry I don't quite understand the problem - I can see that it
probably is one but I think it's a matter of terminology. What do you
mean when you say destructively modified? I am modifying the value of
the timestamp in place? So that any reference to that timestamp will
be modified too? Should I be doing a duplication on the string that is
used to key the buffer in the buffers hash? I didn't think that the
actual object was passed in when an argument is supplied, I thought a
copy of it was passed in..

How would I make Timestamp#advance nondestructive?
If it is easier than pasting here I can give you commit privileges on
that repository?

Thanks very much for your help,
Toby

> Sorry I don't quite understand the problem - I can see that it
> probably is one but I think it's a matter of terminology. What do you
> mean when you say destructively modified? I am modifying the value of
> the timestamp in place? So that any reference to that timestamp will
> be modified too? Should I be doing a duplication on the string that is
> used to key the buffer in the buffers hash? I didn't think that the
> actual object was passed in when an argument is supplied, I thought a
> copy of it was passed in..
>
> How would I make Timestamp#advance nondestructive?
> If it is easier than pasting here I can give you commit privileges on
> that repository?

Arguments are passed by reference: not a reference to the variable,
but a reference to the object itself, so no copy is made. That's how
most OO languages work.
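
A small sketch of what that means in practice. The method names mutate and rebind are made up for this illustration.

```ruby
# Mutating the object a parameter refers to is visible to the caller,
# because caller and method share the same object.
def mutate(list)
  list << :added
end

# Rebinding the parameter only changes the method's own local variable;
# the caller's variable still refers to the original object.
def rebind(list)
  list = [:replaced]
  list
end

original = [:a]
mutate(original)
rebind(original)

p original # [:a, :added]
```

The same applies to a Timestamp passed to a method or stored as a hash key: nothing is copied unless you call dup (or similar) yourself.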

Regarding your program: Add an accessor for the :time to the
Timestamp class, then change the advance definition
to this:

    def advance
      ts = self.dup
      ts.time += 60
      ts
    end

Instead of modifying the instance, we create a new one with the
desired change.

Now in the main loop in dedupe change this line:

    timestamp.advance

to

    timestamp = timestamp.advance

This way ChecksumBuffer#add will actually get a different
timestamp object on each call.

Since you also use Enumerable#min on an array of Timestamp
objects, you need to add Timestamp#<=>:

    def <=>(other)
      self.time <=> other.time
    end

That should do it.
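
Put together, the changes might look roughly like the sketch below. The Timestamp class here is a hypothetical minimal version (an integer time, a 60-second step, and a plain Hash in place of ChecksumBuffer are all assumptions), not the class from the repository.

```ruby
# Hypothetical minimal Timestamp with a nondestructive advance.
class Timestamp
  attr_accessor :time

  def initialize(time)
    @time = time
  end

  # Returns a new Timestamp 60 seconds later; self is left untouched.
  def advance
    ts = dup
    ts.time += 60
    ts
  end

  # Lets Enumerable#min compare Timestamp objects.
  def <=>(other)
    time <=> other.time
  end
end

buffers = Hash.new { |hash, key| hash[key] = [] }
timestamp = Timestamp.new(0)

3.times do |i|
  buffers[timestamp] << "checksum-#{i}"
  timestamp = timestamp.advance # a different key object on each pass
end

oldest = buffers.keys.min
buffers.delete(oldest) # old entries can now be dropped individually

puts buffers.size # 2
```

Each pass now gets its own key, so the oldest entry can be found with min and deleted once it falls outside the retention threshold.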

> Thanks very much for your help,

You're welcome!

Stefan

Ok I've gone and had a little play and yes the memory problem was
completely my fault. I was passing in the timestamp to use as the key
for the buffer rather than the current value of the timestamp. By
changing the line checksum_buffer.add(flow, timestamp) to
checksum_buffer.add(flow, timestamp.current) the problems are solved!
It's just a shame it took me nearly a day of debugging and attempting
to learn Python and help from you guys to work that out!
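
For anyone finding this later, a rough sketch of why keying on the current value works. This Timestamp is a simplified stand-in: here current is assumed to return an Integer snapshot (in the real code it is probably a string), and a plain Hash stands in for the buffer.

```ruby
# Hypothetical stand-in; advance is still destructive here.
class Timestamp
  def initialize(time)
    @time = time
  end

  # Assumed to return a plain value snapshot of the current time.
  def current
    @time
  end

  def advance
    @time += 60
  end
end

buffers = Hash.new { |hash, key| hash[key] = [] }
timestamp = Timestamp.new(0)

3.times do |i|
  # The key is the value at this moment, not the mutable Timestamp object.
  buffers[timestamp.current] << "checksum-#{i}"
  timestamp.advance
end

buffers.delete(buffers.keys.min) # drop the oldest bucket

p buffers.keys # [60, 120]
```

If current returns a String, the result should be the same, since Hash stores a frozen copy of an unfrozen String key.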

Stefan, Edward, thanks again for your help. I would never have noticed
that bug without you, Stefan.
Thanks,
Toby
