Large disc-based "hash"?

Hi,

I need a simple way to have a potentially very large hash mapping Strings to Ruby objects. It needs to scale to on the order of 1 million entries with up to 1kb of data per entry (even though it might end up with maybe 2-300 bytes per entry for starters). Access time is not critical but low-to-medium mem usage is. Deletion of (and insert of new) keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and a "front end" object which hides the details (of marshal/load of the objects and mapping of keys to the marshaled data etc) but maybe there is something simpler that might be of use?
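For illustration, a minimal sketch of what such a front end could look like (the class and method names here are made up, and the backing store is left open; it only needs to map Strings to Strings via [], []= and delete):

# Hides Marshal.dump/Marshal.load behind a Hash-like interface.
# `store` is any String-to-String mapping (an SDBM/GDBM handle,
# a directory-backed store, ...).
class MarshalFrontEnd
  def initialize(store)
    @store = store
  end
  def [](key)
    data = @store[key]
    data && Marshal.load(data)   # nil when the key is absent
  end
  def []=(key, value)
    @store[key] = Marshal.dump(value)
  end
  def delete(key)
    @store.delete(key)
  end
end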

SDBM does not seem to cut it; in a simple test with insertion of random strings the file size grows non-linearly, so it does not scale. Could sqlite or something similar handle this better?

However, I would rather go for something simple than having to bundle a "full" db.

If it simplifies things you can assume the keys and/or values are of fixed sizes, since that is very likely, but I'm not sure it will really be possible to keep that invariant, so more flexible solutions might be needed. Solution ideas that support caching are nice but not required (mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose this is of interest.

Thanks,

Robert

Berkeley DB is a good solution, and Guy Decoux has a great Ruby wrapper for
accessing it.

Bdb link:

http://moulon.inra.fr/ruby/bdbxml.html

Ruby library:

http://moulon.inra.fr/ruby/bdb.html
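Roughly, using it looks like this (a sketch only; the exact class names, open flags and method set should be checked against the bdb documentation for the installed version, and Ruby objects still need to be marshaled by hand):

require 'bdb'

# A BTree-backed store that behaves much like a Hash of Strings.
db = BDB::Btree.open("objects.db", nil, "a")   # "a": read/write, create if missing
db["some key"] = Marshal.dump(["any", "ruby", "object"])
obj = Marshal.load(db["some key"])
db.delete("some key")
db.close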

-rich

···

On 10/5/04 5:12 PM, "Robert Feldt" <feldt@ce.chalmers.se> wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

SDBM does not seem to cut it; in a simple test with insertion of random
strings the file size grows non-linearly, so it does not scale. Could
sqlite or something similar handle this better?

However, I would rather go for something simple than having to bundle a
"full" db.

If it simplifies things you can assume the keys and/or values are of fixed
sizes, since that is very likely, but I'm not sure it will really be
possible to keep that invariant, so more flexible solutions might be
needed. Solution ideas that support caching are nice but not required
(mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose
this is of interest.

Thanks,

Robert

if 'fairly frequent' is once per week, look at CDB too - it's blindingly fast
and updates (atomic replacements, actually) are surprisingly fast. reads are
absolutely insanely fast.

bdb is also great.

sqlite with adequate indexing may also work - but probably not.
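for completeness, the key/value table such an approach would use is only a few lines (a sketch assuming the sqlite3 gem; the sqlite bindings available in 2004 had a different API):

require 'sqlite3'

db = SQLite3::Database.new("objects.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

# Store a marshaled Ruby object under a string key (insert or overwrite).
db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
           ["some key", SQLite3::Blob.new(Marshal.dump([1, 2, 3]))])

rows = db.execute("SELECT v FROM kv WHERE k = ?", ["some key"])
obj = rows.empty? ? nil : Marshal.load(rows[0][0])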

-a

···

On Wed, 6 Oct 2004, Robert Feldt wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

Following the road less traveled a short way to see where it
leads... You can fairly easily make something that looks something like
a hash but has different internal structure:

class Hash_of_hashes
    def initialize(pc = Hash, cc = Hash)
        @parent_class, @child_class = pc, cc
        @storage = @parent_class.new
        @fanout = 100
    end
    # Find (or lazily create) the sub-hash this key falls into.
    def sub_level(k)
        @storage[k.hash.modulo(@fanout)] ||= @child_class.new
    end
    def [](k)
        sub_level(k)[k]
    end
    def []=(k, v)
        sub_level(k)[k] = v
    end
end

     It also seems like it would be fairly easy to make something that
looked sort of like a hash but used a directory as its storage.

class Hash_on_disk
    def initialize(path = '.')
        @path = path
        Dir.mkdir(@path) unless File.directory?(@path)
        @fanout = 100
    end
    # Bucket file name derived from the key's hash value.
    def sub_key(k)
        k.hash.modulo(@fanout).to_s
    end
    def [](k)
        File.open(File.join(@path, sub_key(k)), 'rb') { |f| Marshal.load(f) }
    end
    def []=(k, v)
        File.open(File.join(@path, sub_key(k)), 'wb') { |f| Marshal.dump(v, f) }
    end
end

     These are just sketches (you'd want to be able to delete key/value
pairs, for example), but they are pretty darned simple. Thus it seems
like you should (in about 50 lines of ruby or so) be able to implement
something that looks like a hash but stores the data in a directory tree
that "terminates" in files based on some criterion. You might, for
example, want to store each value in its own file, or you might want to
store the bottom-most hash in a file (or you may want to decide
dynamically). (The reason for going with the tree structure is to avoid
killing your OS.)
     The only gotchas I can see are 1) intra-value references and 2)
hash-key degeneracy. Both should be reasonably easy to deal with if you
are on the lookout for them.
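As one way to flesh the second sketch out (an illustration only; the class name and fanout are arbitrary, and Zlib.crc32 is used instead of String#hash so the bucket assignment stays stable across runs), each bucket file can hold a small Hash, which takes care of both colliding keys and deletion:

require 'fileutils'
require 'zlib'

class DirHash
  def initialize(path, fanout = 997)
    @path, @fanout = path, fanout
    FileUtils.mkdir_p(@path)
  end
  def [](key)
    bucket(key)[key]
  end
  def []=(key, value)
    update(key) { |b| b[key] = value }
  end
  def delete(key)
    update(key) { |b| b.delete(key) }
  end

  private

  # Colliding keys share one bucket file, so each file stores a small Hash.
  def file_for(key)
    File.join(@path, (Zlib.crc32(key) % @fanout).to_s)
  end
  def bucket(key)
    f = file_for(key)
    File.exist?(f) ? File.open(f, 'rb') { |io| Marshal.load(io) } : {}
  end
  def update(key)
    b = bucket(key)
    result = yield(b)
    File.open(file_for(key), 'wb') { |io| Marshal.dump(b, io) }
    result
  end
end

Intra-value references that span buckets would still need separate handling, as noted above.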

    -- Markus

···

On Tue, 2004-10-05 at 14:12, Robert Feldt wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

Berkeley DB is a good solution, and Guy Decoux has a great Ruby wrapper for
accessing it.

Bdb link:

Oops... http://www.sleepycat.com/

···

On 10/5/04 5:22 PM, "Richard Kilmer" <rich@infoether.com> wrote:

http://moulon.inra.fr/ruby/bdbxml.html

Ruby library:

http://moulon.inra.fr/ruby/bdb.html

-rich

On 10/5/04 5:12 PM, "Robert Feldt" <feldt@ce.chalmers.se> wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

SDBM does not seem to cut it; in a simple test with insertion of random
strings the file size grows non-linearly, so it does not scale. Could
sqlite or something similar handle this better?

However, I would rather go for something simple than having to bundle a
"full" db.

If it simplifies things you can assume the keys and/or values are of fixed
sizes, since that is very likely, but I'm not sure it will really be
possible to keep that invariant, so more flexible solutions might be
needed. Solution ideas that support caching are nice but not required
(mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose
this is of interest.

Thanks,

Robert

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

if 'fairly frequent' is once per week, look at CDB too - it's blindingly fast
and updates (atomic replacements, actually) are surprisingly fast. reads are
absolutely insanely fast.

Fairly frequent is more like "1-5% of keys per hour", but thanks for the pointer, very interesting. It seems to be Unix-only, but I need Windows as well, so maybe not.

bdb is also great.

Yes, it seems it can handle this load. I tried with a BTree and the ratio of db file size to total key+value bytes added seems to stay at about 2.5, at least up to 1e5 entries. I'm running a test with 1e6 added entries now; insertion time does not seem to scale linearly, but we'll see about file size.
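(For reference, a rough way to measure such a ratio; this is only a sketch, the helper name and value size are made up, and it assumes a hash-like db object backed by a single file at db_path:)

# Insert n entries into a hash-like store and report the file size
# versus the raw bytes added (key + value lengths).
def size_ratio(db, db_path, n, value_size = 300)
  bytes = 0
  n.times do |i|
    key   = "key-%08d" % i
    value = "x" * value_size        # stand-in for marshaled data
    db[key] = value
    bytes += key.size + value.size
  end
  # Flush/close the db first if the backend buffers writes.
  File.size(db_path).to_f / bytes
end

# e.g. ratio = size_ratio(db, "test.db", 100_000) for a store kept in test.db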

Thanks to both Rich and you for the pointers.

The downside of going with something like bdb is ease of install/support on multiple platforms (i.e. Windows, since everything works on Linux... :)), but I guess bdb should be well supported on Windows too, so maybe that's what it'll be.

sqlite with adequate indexing may also work - but probably not.

Ok, thanks for that info.

/R

···

Ara.T.Howard@noaa.gov wrote:

On Wed, 6 Oct 2004, Robert Feldt wrote:

OT: Can someone explain to me the difference between "Berkeley DB" and
("DBM", "GDBM", "SDBM")?

Gavin

···

On Wednesday, October 6, 2004, 7:22:18 AM, Richard wrote:

Berkeley DB is a good solution, and Guy Decoux has a great Ruby wrapper for
accessing it.

Bdb link:

http://moulon.inra.fr/ruby/bdbxml.html

Ruby library:

http://moulon.inra.fr/ruby/bdb.html

-rich

On 10/5/04 5:12 PM, "Robert Feldt" <feldt@ce.chalmers.se> wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

SDBM does not seem to cut it; in a simple test with insertion of random
strings the file size grows non-linearly, so it does not scale. Could
sqlite or something similar handle this better?

However, I would rather go for something simple than having to bundle a
"full" db.

If it simplifies things you can assume the keys and/or values are of fixed
sizes, since that is very likely, but I'm not sure it will really be
possible to keep that invariant, so more flexible solutions might be
needed. Solution ideas that support caching are nice but not required
(mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose
this is of interest.

Thanks,

Robert

check out fsdb on the RAA - it might actually be suitable for the OP's needs too.

-a

···

On Wed, 6 Oct 2004, Markus wrote:

    Following the road less traveled a short way to see where it
leads... You can fairly easily make something that looks something like
a hash but has different internal structure:

class Hash_of_hashes
   def initialize(pc = Hash, cc = Hash)
       @parent_class, @child_class = pc, cc
       @storage = @parent_class.new
       @fanout = 100
   end
   # Find (or lazily create) the sub-hash this key falls into.
   def sub_level(k)
       @storage[k.hash.modulo(@fanout)] ||= @child_class.new
   end
   def [](k)
       sub_level(k)[k]
   end
   def []=(k, v)
       sub_level(k)[k] = v
   end
end

    It also seems like it would be fairly easy to make something that
looked sort of like a hash but used a directory as its storage.

class Hash_on_disk
   def initialize(path = '.')
       @path = path
       Dir.mkdir(@path) unless File.directory?(@path)
       @fanout = 100
   end
   # Bucket file name derived from the key's hash value.
   def sub_key(k)
       k.hash.modulo(@fanout).to_s
   end
   def [](k)
       File.open(File.join(@path, sub_key(k)), 'rb') { |f| Marshal.load(f) }
   end
   def []=(k, v)
       File.open(File.join(@path, sub_key(k)), 'wb') { |f| Marshal.dump(v, f) }
   end
end

    These are just sketches (you'd want to be able to delete key/value
pairs, for example), but they are pretty darned simple. Thus it seems
like you should (in about 50 lines of ruby or so) be able to implement
something that looks like a hash but stores the data in a directory tree
that "terminates" in files based on some criterion. You might, for
example, want to store each value in its own file, or you might want to
store the bottom-most hash in a file (or you may want to decide
dynamically). (The reason for going with the tree structure is to avoid
killing your OS.)
    The only gotchas I can see are 1) intra-value references and 2)
hash-key degeneracy. Both should be reasonably easy to deal with if you
are on the lookout for them.

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

If you guys are talking about the same CDB format Bernstein created,
then I don't know that it would be a good match for this kind of task.

CDB has one major weakness: you can make updates in memory quickly, but
persisting them to disk requires time proportional to the total dataset
size. Note that's dataset size, not number of keys -- CDB requires you
to copy the entire old DB to a new file to persist changes.

It *is* ridiculously fast for lookups of mostly constant data (hence
the initial 'C'), but probably isn't a good match for any sort of
regularly-updated dataset.

I'd second the recommendation for BDB -- it's reasonably fast, very
scalable, and gets hammered on by all kinds of applications day in and
day out.

Berkeley DB is an embedded database system developed and supported by
Sleepycat Software (http://www.sleepycat.com/). It is designed for
applications which need transactional data storage, but don't need (or
want) the additional structure that relational databases provide.

It allows simple numeric or string-valued keys to be associated with a
chunk of data, much like a persistent version of Ruby's native hash
tables. However, unlike a hash, only a small amount of the data has to
be in RAM at once -- Berkeley DB storages can be scaled into the
many-GB range easily. Like most relational databases, it also supports
atomic transactions and data recovery after crashes, as well as
concurrent (more than one process or thread) access to a database.

DBM, GDBM, and SDBM are all database formats (and libraries to access
them) which, like Berkeley DB, allow simple key/value mappings to be
saved to disk, and updated incrementally. However, each has various
limitations as to data size per key, total database size, or
performance on large datasets which Berkeley DB does not. None of them
natively offer transaction support, or handle concurrent access to the
same database gracefully.

The advantage of the *DBM libraries is their simplicity; if your
application uses less than 100,000 total keys, and doesn't need to
store much data in each one, they may offer plenty of functionality,
without requiring you to link in a large, complex library.

More specifically to Ruby, it is worth noting that the *DBM libraries
all have bindings included in the standard Ruby 1.8 distribution, while
Berkeley DB requires you to install BDB, then compile and install the
bindings. This is fine for some applications, but for quick
"download-and-go" tools, it may mean more dependencies than you want to
introduce.
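For instance, the bundled bindings can be used directly (a sketch; dbm here could just as well be gdbm or sdbm, values have to be marshaled by hand, and the per-entry size limits mentioned above still apply):

require 'dbm'

DBM.open("objects") do |db|
  db["some key"] = Marshal.dump({ :any => "ruby object" })   # values must be Strings
  obj = Marshal.load(db["some key"])
  db.delete("some key")
end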

Robert Feldt <feldt@ce.chalmers.se> writes:

[...bdb...]

Yes, it seems it can handle this load. I tried with a BTree and
the ratio of db file size to total key+value bytes added seems to
stay at about 2.5, at least up to 1e5 entries. I'm running a test
with 1e6 added entries now; insertion time does not seem to scale
linearly, but we'll see about file size.

My experience with bdb has been very good. Tests with >500
million entries scaled linearly with regard to file size and
essentially linearly with regard to insert time. I don't have
too much direct experience with remove time, but update is good,
especially if the entries are not expanding.

The key for excellent performance with bdb is selecting fast
persistent storage for the log files.

~r

Robert Feldt wrote:

The downside of going with something like bdb is ease of install/support on multiple platforms (i.e. Windows, since everything works on Linux... :)), but I guess bdb should be well supported on Windows too, so maybe that's what it'll be.

If I'm not mistaken, bdb is also not free if you want to redistribute it with a non-open-source application. (I get the impression the licensing fees are pretty steep--maybe on the order of tens of thousands of USD?)

DBM, GDBM, and SDBM are all database formats (and libraries to access
them) which, like Berkeley DB, allow simple key/value mappings to be
saved to disk, and updated incrementally. However, each has various
limitations as to data size per key, total database size, or
performance on large datasets which Berkeley DB does not. None of them

dbm, sdbm and ndbm have rather low limits on the data size, IIRC;
gdbm has no such restrictions.

natively offer transaction support, or handle concurrent access to the
same database gracefully.

gdbm (at least) allows concurrent reader access to the db; however,
while a writer process has it open, no other simultaneous access is possible.
In practice, gdbm is often adequate if you don't need transactions and
your db is not too large. I've heard it is often faster than bdb due to
the lack of transactions.
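A sketch of what that looks like with Ruby's gdbm binding (the filename is arbitrary; values still need to be marshaled by hand):

require 'gdbm'

# A writer needs exclusive access while it has the db open
# (WRCREAT: open read/write, creating the file if needed).
GDBM.open("objects.gdbm", 0666, GDBM::WRCREAT) do |db|
  db["some key"] = Marshal.dump([1, 2, 3])
end

# Any number of processes can open it read-only at the same time.
GDBM.open("objects.gdbm", 0666, GDBM::READER) do |db|
  p Marshal.load(db["some key"])
end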

···

On Wed, Oct 06, 2004 at 11:19:49AM +0900, rcoder@gmail.com wrote:

--
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com