short version: What’s your prefered way to handle large data sets used by
Ruby scripts?
rbtree (red-black) tree.
is has an interface like a hash but is always sorted. the sort method is
determined by the keys ‘<=>’ method. it has also allows lower_bound and
upper_bound searches. it marshals to disk quite fast. log(n) for insert,
delete, and search.
long version: Some time ago I had a problem where Rubys text munching
capabilities helped very much. Unfortunately, the data set was quite large,
an array of ~40000 entries, whereas every entry is itself an array of ~8
data (String, Int,… whatever) points. All in all the final solution ran
in ~2min time and iterated several times over the whole data set → a brute
force approach.
i use rbtree to store database ‘tables’ where
pk → tuple
of 80 elements and more. it is extremely fast.
What surprised me, a first “more efficient” implementation which took
“notes” in temp structures while munching, but did require less full
iterations, was much slower.
function call overhead is probably killing you. simply scanning memory is
fast.
Now I got to do such a thing again, but the data set will be in the upper
6-digit count.
like i said - i’ve done this.
Brute force again? Is the GC the significant speed problem with temp data?
Is the GC suited for xxxMb data at all? Is “mixing in” an external database
maybe faster?
berkley db might help - since is uses memeory pools. guy’s impl. is very
complete.
Is there a reference for non source divers how Rubys basic datastructures
(Array, Hash, String,…) perform with basic ops - insert, delete, append,
search, … Can be tested one for one, but if I know some things run at
order n^2 I won’t even try?
i think you need to test for you type of data. for instance, i had some data
for which is was faster to do custom marshaling rather than rely on rbtree’s
own. this was because the data was delimited text and doing a ‘join on
delimiter marhal as string - load from string, split on delimiter’ was faster
than going over the entire rbtree
Any experiences to tell? 
my system ‘caches’ the result of a database query in an rbtree stored within a
pstore in the webroot. queries are between 10,000 and 100,000 lines. the
only speed hit is sending large pages back to the client - no probs with data
speed.
here is the results for an rbtree of around 65000 entries:
~/eg/ruby/rbtree > ruby large.rb
insert:
min: 0.00000298
max: 0.02559197
avg: 0.00001013
search:
min: 0.00000298
max: 0.04867399
avg: 0.00001075
marshal:
min: 0.82507992
max: 0.82507992
avg: 0.82507992
load:
min: 1.04620802
max: 1.04620802
avg: 1.04620802
delete:
min: 0.00000703
max: 0.15283799
avg: 0.00000857
here is the code:
~/eg/ruby/rbtree > cat large.rb
require ‘rbtree’
n = 1 << 16
rb = RBTree.new
insert n entries
insert = Stat.new :insert
n.times do |k|
tuple = [k, Time.now.to_f, rand]
insert.time { rb[k] = tuple }
end
p insert
search for all n entries
search = Stat.new :search
n.times do |k|
search.time { tuple = rb[k] }
end
p search
dump all n entries
marshal = Stat.new :marshal
dumped = marshal.time { Marshal.dump rb }
p marshal
load them back up
mload = Stat.new :load
rb = mload.time { Marshal.load dumped }
p mload
delete all n entries
delete = Stat.new :delete
n.times do |k|
delete.time { rb.delete n }
end
p delete
BEGIN {
class Stat
def initialize name
@name = name
@min, @max, @avg = nil,nil,nil
end
def update! t
@min = t if @min.nil? or t < @min
@max = t if @max.nil? or t > @max
@avg = (@avg.nil? ? t : (@avg + t)/2)
end
def time
a = Time.now.to_f
r = yield
b = Time.now.to_f
update!(b - a)
r
end
def inspect
format “%s:\n\tmin: %8.8f\n\tmax: %8.8f\n\tavg: %8.8f\n”,
@name, @min, @max, @avg
end
end
}
cheers.
-a
···
On 8 Dec 2003, Martin Pirker wrote:
ATTN: please update your address books with address below!
===============================================================================
EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
STP :: Solar-Terrestrial Physics Data | NCEI
NGDC :: http://www.ngdc.noaa.gov/
NESDIS :: http://www.nesdis.noaa.gov/
NOAA :: http://www.noaa.gov/
US DOC :: http://www.commerce.gov/
The difference between art and science is that science is what we
understand well enough to explain to a computer.
Art is everything else.
– Donald Knuth, “Discover”
/bin/sh -c ‘for l in ruby perl;do $l -e “print "\x3a\x2d\x29\x0a"”;done’
===============================================================================