As an intermediate step in a small software system that performs a large
amount of data gathering, I am using Marshal to store processed results,
so that other programs can use this information.
At first, this was an excellent solution: it couldn’t be easier, and it
performed well. However, as I collect more and more data, the time
required to serialise the data to disk is ballooning, seemingly
quadratically. My data is currently an array of 10,000 moderately sized
objects. Its size on disk is only around 900K, yet it took at least 10
minutes to write.
Part of the problem is my approach:
read the marshalled data
gather some more data into the array
write the marshalled data
Each iteration produces only a few thousand more data elements, but it
has to write all of the accumulated data back to disk.
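In outline, each pass looks something like this (a sketch; "results.dat"
and gather_more_data are stand-ins for my real file name and collection
step):

  # Read back everything written so far.
  data = File.open("results.dat", "rb") { |f| Marshal.load(f) }

  # Add this iteration's few thousand new elements (placeholder method).
  data.concat(gather_more_data())

  # Rewrite the entire, ever-growing array from scratch.
  File.open("results.dat", "wb") { |f| Marshal.dump(data, f) }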
Can anyone suggest a better approach to storing the data? I am open to
all suggestions, but I am hoping for something very easy. At work, where
the use of Ruby is not admired, I don’t want to add software
dependencies if I can avoid it, and I don’t have much time either, so a
simple solution is best.
This is using ruby 1.6.5 (I know I’m bad…) on Cygwin.
As an intermediate step in a small software system that performs a large
amount of data gathering, I am using Marshal to store processed results,
so that other programs can use this information.
At first, this was an excellent solution: it couldn’t be easier, and it
performed well. However, as I collect more and more data, the time
required to serialise the data to disk is ballooning, seemingly
quadratically. […]
Soon after I sent this, I decided to search the mailing list for existing
information (yes, I often get things backwards…). It turns out that
Marshal is known to be much faster in 1.7/1.8 than in 1.6.
The mailing list did not, however, lead me to a precompiled 1.8 for
Cygwin. So, for posterity, here it is, courtesy of Google:
As an intermediate step in a small software system that performs a large
amount of data gathering, I am using Marshal to store processed results,
so that other programs can use this information.
[…]
Part of the problem is my approach:
read the marshalled data
gather some more data into the array
write the marshalled data
Each iteration produces only a few thousand more data elements, but it
has to write all of the accumulated data back to disk.
Can anyone suggest a better approach to storing the data? I am open to
all suggestions, but I am hoping for something very easy. At work, where
the use of Ruby is not admired, I don’t want to add software
dependencies if I can avoid it, and I don’t have much time either, so a
simple solution is best.
The simplest solutions I can think of (so you may have already tossed
'em aside) are:
One file per iteration. Concatenate files later.
One large file, but each iteration appends a marshalled array.
Concatenate the arrays later.
This works because you can dump several objects to a file in sequence
and load them back in the same sequence.
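A minimal sketch of the append variant (assuming a data file
"results.dat" and a new_items array holding each iteration's new
elements; both names are illustrative):

  # Each iteration appends one marshalled array to the end of the file.
  File.open("results.dat", "ab") { |f| Marshal.dump(new_items, f) }

  # A consumer later reads the dumps back one at a time, in order,
  # and concatenates them into a single array.
  all_items = []
  File.open("results.dat", "rb") do |f|
    all_items.concat(Marshal.load(f)) until f.eof?
  end

The same read loop should also cover the one-file-per-iteration variant:
each file holds a single self-contained dump, so concatenating the files
produces the same stream of dumps. Either way, each iteration writes
only its new elements, so the cost of a write no longer grows with the
total size of the data.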