[Q] Fast loading of BIG data structures

I have a (really) big data structure, which looks like:

Features = [
  ["feature", ["parent1", "parent2"], {:C1 => [0.0, (7 more)], :C2 => …}],
  …a LOT more (like 2500) lines…
]

I was initially using (shudder) XML for storage, but as the structure
developed, we literally had to choose between dumping XML or dumping
Ruby. I thought about other generic methods but ended up settling on
writing out the Ruby representation of the structure and loading it
with require. What I’m currently using is exactly like the above but
with the constant wrapped in a module.

My question is: Is there an even faster way to load a big structure
than this?

My first thought was to use Marshal, but I was surprised to find that
Marshal.load takes about twice as long as require:

$ ruby -v
ruby 1.8.0 (2003-08-04) [sparc-solaris2.8]
$ time ruby -e "require 'network'"
real 0m3.431s
user 0m3.080s
sys 0m0.200s
$ time ruby -e "File.open('network.dump') {|f| Marshal.load(f)}"
real 0m7.321s
user 0m6.880s
sys 0m0.170s

Steve

Try PStore.

http://www.rubycentral.com/book/lib_standard.html

It stores the data in binary format.

Tell me how it works out. I would be interested to know.

Cheers,
Daniel.

···

On Thu, Dec 11, 2003 at 10:27:02AM +0900, Steven Lumos wrote:

I have a (really) big data structure, which looks like:

Features = [
  ["feature", ["parent1", "parent2"], {:C1 => [0.0, (7 more)], :C2 => …}],
  …a LOT more (like 2500) lines…
]

I was initially using (shudder) XML for storage, but as the structure
developed, we literally had to choose between dumping XML or dumping
Ruby. I thought about other generic methods but ended up settling on
writing out the Ruby representation of the structure and loading it
with require. What I’m currently using is exactly like the above but
with the constant wrapped in a module.

My question is: Is there an even faster way to load a big structure
than this?

My first thought was to use Marshal, but I was surprised to find that
Marshal.load takes about twice as long as require:

$ ruby -v
ruby 1.8.0 (2003-08-04) [sparc-solaris2.8]
$ time ruby -e "require 'network'"
real 0m3.431s
user 0m3.080s
sys 0m0.200s
$ time ruby -e "File.open('network.dump') {|f| Marshal.load(f)}"
real 0m7.321s
user 0m6.880s
sys 0m0.170s

Steve


Daniel Carrera | “Software is like sex. It’s better when it’s free”.
PhD student. |
Math Dept. UMD | – Linus Torvalds

Curious. What kind of machine are you running this on? These times seem a bit
slow in general.

I'm also wondering how YAML would compare. Are you running Ruby 1.8+? If so,
try using to_yaml to dump the result to a file, and then reload it. A simple
example to help if you're not familiar with YAML:

Save in YAML format:

require 'yaml'
require 'network'
File.open('network.yaml', 'w') { |f| f << Features.to_yaml }

You might want to look at the YAML file at this point; it is rather readable.
Then try

Load the YAML file:

require 'yaml'
Features = YAML::load(File.open('network.yaml'))

And see what kind of times you get.
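To compare the two quickly, a rough benchmark harness along these lines would show the relative cost. The data is a stand-in for the real structure (string keys here so the YAML round-trips cleanly), and the file names are illustrative:

```ruby
require 'benchmark'
require 'yaml'

# Stand-in for the real structure: ~2500 rows of [name, parents, class => floats].
features = Array.new(2500) do |i|
  ["feature#{i}", ["parent1", "parent2"], { "C1" => [0.0] * 8, "C2" => [0.0] * 8 }]
end

# Write both serializations once, then time the loads.
File.open("network.dump", "wb") { |f| Marshal.dump(features, f) }
File.open("network.yaml", "w")  { |f| f << features.to_yaml }

Benchmark.bm(8) do |b|
  b.report("marshal") { Marshal.load(File.binread("network.dump")) }
  b.report("yaml")    { YAML.load(File.read("network.yaml")) }
end
```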

T.

···

On Thursday 11 December 2003 02:27 am, Steven Lumos wrote:

My question is: Is there an even faster way to load a big structure
than this?

My first thought was to use Marshal, but I was surprised to find that
Marshal.load takes about twice as long as require:

$ ruby -v
ruby 1.8.0 (2003-08-04) [sparc-solaris2.8]
$ time ruby -e "require 'network'"
real 0m3.431s
user 0m3.080s
sys 0m0.200s
$ time ruby -e "File.open('network.dump') {|f| Marshal.load(f)}"
real 0m7.321s
user 0m6.880s
sys 0m0.170s

Hi,

[…]

My question is: Is there an even faster way to load a big structure
than this?

My first thought was to use Marshal, but I was surprised to find that
Marshal.load takes about twice as long as require:

$ ruby -v
ruby 1.8.0 (2003-08-04) [sparc-solaris2.8]
$ time ruby -e "require 'network'"
real 0m3.431s
user 0m3.080s
sys 0m0.200s
$ time ruby -e "File.open('network.dump') {|f| Marshal.load(f)}"
real 0m7.321s
user 0m6.880s
sys 0m0.170s

On the off-chance it’s something about the way Marshal is doing
file IO (I’ve had it be slow under Windows), you might try for
comparison:

bin = File.open("network.dump", "rb") { |f| f.sysread(f.stat.size) }
dat = Marshal.load(bin)

…just in case it might be faster… (On Windows, with Ruby 1.6.8,
this made a considerable difference…)

HTH,

Bill

···

From: “Steven Lumos” slumos@yahoo.com

I suppose load should be a little faster than require :-)

BTW, you may like to take a look at [ruby-talk: 83802];
_why did some experiments on loading stuff faster using YAML over XML,
with some tricks.

More: are you working with REXML? Have you tried the libxml
bindings?
What about the Ruby version/platform? 1.6 on Windows was really slow;
1.8 is much faster.

···

On Thu, 11 Dec 2003 01:23:56 GMT, Steven Lumos slumos@yahoo.com wrote:

I have a (really) big data structure, which looks like:

Features = [
  ["feature", ["parent1", "parent2"], {:C1 => [0.0, (7 more)], :C2 => …}],
  …a LOT more (like 2500) lines…
]

I was initially using (shudder) XML for storage, but as the structure
developed, we literally had to choose between dumping XML or dumping
Ruby. I thought about other generic methods but ended up settling on
writing out the Ruby representation of the structure and loading it
with require. What I’m currently using is exactly like the above but
with the constant wrapped in a module.

My question is: Is there an even faster way to load a big structure
than this?

"Steven Lumos" slumos@yahoo.com wrote in the message
news:86wu94f8pk.fsf@bitty.lumos.us

I have a (really) big data structure, which looks like:

Features = [
  ["feature", ["parent1", "parent2"], {:C1 => [0.0, (7 more)], :C2 => …}],
  …a LOT more (like 2500) lines…
]

I was initially using (shudder) XML for storage, but as the structure
developed, we literally had to choose between dumping XML or dumping
Ruby. I thought about other generic methods but ended up settling on
writing out the Ruby representation of the structure and loading it
with require. What I’m currently using is exactly like the above but
with the constant wrapped in a module.

My question is: Is there an even faster way to load a big structure
than this?

My first thought was to use Marshal, but I was surprised to find that
Marshal.load takes about twice as long as require:

$ ruby -v
ruby 1.8.0 (2003-08-04) [sparc-solaris2.8]
$ time ruby -e "require 'network'"
real 0m3.431s
user 0m3.080s
sys 0m0.200s
$ time ruby -e "File.open('network.dump') {|f| Marshal.load(f)}"
real 0m7.321s
user 0m6.880s
sys 0m0.170s

Strange. Normally I would have suggested a combination of Marshal and
load: the dump is used if it is newer than the Ruby file; otherwise the
Ruby file is loaded and dumped. This should yield quite fast loading
speed while maintaining simple editability. (Is that an English word?
:-))
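That combination could be sketched like this; the `load_cached` name and the file names are hypothetical, and the block stands for whatever rebuilds the structure (e.g. requiring the Ruby file and reading Features):

```ruby
# Use the Marshal dump when it is at least as new as the Ruby source;
# otherwise rebuild the data via the block and refresh the dump.
# Method name and file names are illustrative.
def load_cached(source, dump)
  if File.exist?(dump) && File.mtime(dump) >= File.mtime(source)
    File.open(dump, "rb") { |f| Marshal.load(f) }
  else
    data = yield
    File.open(dump, "wb") { |f| Marshal.dump(data, f) }
    data
  end
end
```

On the first call the block runs and the dump is written; subsequent calls hit the dump until the source file changes again.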

However, you might want to reconsider your data structure. Maybe there is
a more efficient way of handling this. You could use path names as
feature keys into a single Hash for example:

Features = {
  "feature.parent1" => true,
  "feature.parent2" => true,

}

Of course this is just a guess since I don’t know the data at hand.

Kind regards

robert

Date: Thu, 11 Dec 2003 01:23:56 GMT
From: Steven Lumos slumos@yahoo.com
Newsgroups: comp.lang.ruby
Subject: [Q] Fast loading of BIG data structures

I have a (really) big data structure, which looks like:

Features = [
  ["feature", ["parent1", "parent2"], {:C1 => [0.0, (7 more)], :C2 => …}],
  …a LOT more (like 2500) lines…
]

I was initially using (shudder) XML for storage, but as the structure
developed, we literally had to choose between dumping XML or dumping
Ruby. I thought about other generic methods but ended up settling on
writing out the Ruby representation of the structure and loading it
with require. What I’m currently using is exactly like the above but
with the constant wrapped in a module.

That's a cool idea: a code-generation database.

My question is: Is there an even faster way to load a big structure than
this?

If you could simplify your structure a little it might be good to put it into
a BDB (Berkeley DB). You may not see a huge performance gain for one process,
but BDB uses memory pools, so you should see a big gain if more than one
process is accessing the data in a read-only way.

-a

···

On Thu, 11 Dec 2003, Steven Lumos wrote:


Hi!

  • Steven Lumos; 2003-12-11, 01:50 UTC:

I have a (really) big data structure, which looks like:

Features = [
  ["feature", ["parent1", "parent2"], {:C1 => [0.0, (7 more)], :C2 => …}],
  …a LOT more (like 2500) lines…
]

My question is: Is there an even faster way to load a big structure
than this?

A special-purpose C extension. Not that I recommend it in general
(let alone like it), but sometimes hand-optimized C code is the
best solution at hand.

Josef ‘Jupp’ SCHUGT

···


http://oss.erdfunkstelle.de/ruby/ - German comp.lang.ruby-FAQ
http://rubyforge.org/users/jupp/ - Ruby projects at Rubyforge

Daniel Carrera wrote:

Try PStore.

http://www.rubycentral.com/book/lib_standard.html

It stores the data in binary format.

… using Marshal, which the OP found less efficient :-(

“T. Onoma” transami@runbox.com writes:

My question is: Is there an even faster way to load a big structure
than this?

My first thought was to use Marshal, but I was surprised to find that
Marshal.load takes about twice as long as require:

$ ruby -v
ruby 1.8.0 (2003-08-04) [sparc-solaris2.8]
$ time ruby -e "require 'network'"
real 0m3.431s
user 0m3.080s
sys 0m0.200s
$ time ruby -e "File.open('network.dump') {|f| Marshal.load(f)}"
real 0m7.321s
user 0m6.880s
sys 0m0.170s

Curious. What kind of machine are you running this on? These times seem a bit
slow in general.

That was on a Blade 2000, but the timing for the require case is
basically the same on an Athlon 1600 running Windows 2000.

I'm also wondering how YAML would compare. Are you running Ruby 1.8+? If so,
try using to_yaml to dump the result to a file, and then reload it. A simple
example to help if you're not familiar with YAML:

I love YAML, but I didn't try it because it's already documented as
being slower than Marshal.

Steve

···

On Thursday 11 December 2003 02:27 am, Steven Lumos wrote: