Marshal efficiency

Folks,

As an intermediate step in a small software system that performs a large
amount of data gathering, I am using Marshal to store processed results,
so that other programs can use this information.

At first, this was an excellent solution: couldn’t be easier, and it
demonstrated good performance. However, as I am collecting more and more
data, the time required to serialise the data to disk is ballooning,
seemingly quadratically. The current size of my data is an array of
10000 moderately-sized objects. The size on disk is only around 900K,
yet it took at least 10 minutes to write.

Part of the problem is my approach:

  • read the marshalled data
  • gather some more data into the array
  • write the marshalled data

Each iteration produces a few thousand more data elements, but obviously
has to write all of the data back to disk.

Can anyone suggest a better approach to storing the data? I am open to
all suggestions, but I am hoping for a very easy solution. At work, where
the use of Ruby is not admired, I don’t want to expand the software
dependencies if I can avoid it. I don’t have much time, either, so I need
a simple solution.

This is using ruby 1.6.5 (I know I’m bad…) on Cygwin.

Thanks,
Gavin

Folks,

I wrote:

As an intermediate step in a small software system that performs a large
amount of data gathering, I am using Marshal to store processed results,
so that other programs can use this information.

At first, this was an excellent solution: couldn’t be easier, and it
demonstrated good performance. However, as I am collecting more and
more data, the time required to serialise the data to disk is
ballooning, seemingly quadratically. […]

Soon after I sent this, I decided to search the mailing list for existing
information (yes, I often get things backwards…). It turns out that
Marshal is known to be much faster in 1.7/1.8 than in 1.6.
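
For anyone who wants to check this on their own interpreter, here is a rough timing sketch. It is written for a current Ruby and has not been tried on 1.6. The Record struct and its fields are made up for illustration; they only stand in for the "moderately-sized objects" described above.

```ruby
require "benchmark"

# Hypothetical record type standing in for the real data objects.
Record = Struct.new(:id, :name, :values)

# Build 10000 small objects, roughly the array size mentioned above.
data = Array.new(10_000) { |i| Record.new(i, "item-#{i}", [i, i * 2, i * 3]) }

# Time a single in-memory dump of the whole array.
blob = nil
seconds = Benchmark.realtime { blob = Marshal.dump(data) }

puts "ruby #{RUBY_VERSION}: dumped #{data.size} objects " \
     "(#{blob.size} bytes) in #{format('%.3f', seconds)}s"
```

Running the same script under two interpreters should give a direct comparison. It times only Marshal.dump, not the disk write.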

The mailing list did not, however, lead me to a precompiled 1.8 for
Cygwin. So, for posterity, here it is, courtesy of Google:

http://mirrors.sunsite.dk/ruby/binaries/cygwin/1.8

Gavin

Gavin Sinclair wrote:

Can anyone suggest a better approach to storing the data? I am open to
all suggestions, but I am hoping for a very easy solution. At work, where
the use of Ruby is not admired, I don’t want to expand the software
dependencies if I can avoid it. I don’t have much time, either, so I need
a simple solution.

The simplest solutions I can think of (so you may have already tossed
'em aside) are:

  1. One file per iteration. Concatenate files later.

  2. One large file, but each iteration appends a marshalled array.
    Concatenate the arrays later.

This works because you can dump objects to a file in sequence, and load
them back in sequence:

irb(main):005:0> x = [0,1,2,3]
[0, 1, 2, 3]
irb(main):006:0> f.write(Marshal.dump(x[0]))
4
irb(main):007:0> f.write(Marshal.dump(x[1]))
4
irb(main):008:0> f.write(Marshal.dump(x[2]))
4
irb(main):009:0> f.write(Marshal.dump(x[3]))
4
irb(main):010:0> Marshal.dump(x[3]).size
4
irb(main):011:0> Marshal.dump(x).size
12
irb(main):012:0> f.rewind
0
irb(main):013:0> Marshal.load(f)
0
irb(main):014:0> Marshal.load(f)
1
irb(main):015:0> Marshal.load(f)
2
irb(main):016:0> Marshal.load(f)
3
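
As a fuller sketch of option 2, the script below appends one marshalled array per iteration and later reads every array back, concatenating them. It is written for a current Ruby rather than 1.6. The helper names append_batch and load_all, the sample records, and the temporary file are all made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: append one batch (an array of records) to the file.
# Only the new batch is written; earlier data is never rewritten.
def append_batch(path, batch)
  File.open(path, "ab") { |f| f.write(Marshal.dump(batch)) }
end

# Hypothetical helper: load each marshalled array in sequence until the
# end of the file, concatenating them into a single array.
def load_all(path)
  records = []
  File.open(path, "rb") do |f|
    records.concat(Marshal.load(f)) until f.eof?
  end
  records
end

tmp = Tempfile.new("results")   # stand-in for the real data file
tmp.close
path = tmp.path

append_batch(path, [1, 2, 3])                  # first iteration
append_batch(path, ["four", { "five" => 5 }])  # second iteration

all_records = load_all(path)
p all_records
```

Each iteration then costs time proportional to the new data only, and the consuming programs pay the concatenation cost once when they read.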

Joel VanderWerf vjoel@PATH.Berkeley.EDU wrote in message news:3F17759B.2030100@path.berkeley.edu

Gavin Sinclair wrote:

Folks,

As an intermediate step in a small software system that performs a large
[…]
a simple solution.

Try using ruby 1.8 and tell us the results!

Sorry, didn’t I mention already? Ruby 1.8 works fantastically! Marshal
is so fast I don’t need a different strategy.

Gavin