Large disc-based "hash"?

Hi,

I need a simple way to have a potentially very large hash mapping Strings to Ruby objects. It needs to scale to on the order of 1 million entries with up to 1kb of data per entry (even though it might end up with maybe 2-300 bytes per entry for starters). Access time is not critical but low-to-medium mem usage is. Deletion of (and insert of new) keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and a "front end" object which hides the details (of marshal/load of the objects and mapping of keys to the marshaled data etc) but maybe there is something simpler that might be of use?
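For illustration, a minimal sketch of what such a front end could look like (the class and method names here are made up, and the backing store is left open; it only needs to map Strings to Strings via [], []= and delete):

# Hides Marshal.dump/Marshal.load behind a Hash-like interface.
# `store` is any String-to-String mapping (an SDBM/GDBM handle,
# a directory-backed store, ...).
class MarshalFrontEnd
  def initialize(store)
    @store = store
  end
  def [](key)
    data = @store[key]
    data && Marshal.load(data)   # nil when the key is absent
  end
  def []=(key, value)
    @store[key] = Marshal.dump(value)
  end
  def delete(key)
    @store.delete(key)
  end
end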

SDBM does not seem to cut it; in a simple test with insertion of random strings the file size grows non-linearly, so it does not scale. Could sqlite or something similar handle this better?

However, I would rather go for something simple than having to bundle a "full" db.

If it simplifies things you can assume the keys and/or values are of fixed sizes, since that is very likely, but I'm not sure it will really be possible to keep that invariant, so more flexible solutions might be needed. Solution ideas that support caching are nice but not required (mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose this is of interest.

Thanks,

Robert

Berkeley DB is a good solution, and Guy Decoux has a great Ruby wrapper for
accessing it.

Bdb link:

http://moulon.inra.fr/ruby/bdbxml.html

Ruby library:

http://moulon.inra.fr/ruby/bdb.html
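Roughly, using it looks like this (a sketch only; the exact class names, open flags and method set should be checked against the bdb documentation for the installed version, and Ruby objects still need to be marshaled by hand):

require 'bdb'

# A BTree-backed store that behaves much like a Hash of Strings.
db = BDB::Btree.open("objects.db", nil, "a")   # "a": read/write, create if missing
db["some key"] = Marshal.dump(["any", "ruby", "object"])
obj = Marshal.load(db["some key"])
db.delete("some key")
db.close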

-rich

···

On 10/5/04 5:12 PM, "Robert Feldt" <feldt@ce.chalmers.se> wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

SDBM does not seem to cut it; in a simple test with insertion of random
strings the file size grows non-linearly, so it does not scale. Could
sqlite or something similar handle this better?

However, I would rather go for something simple than having to bundle a
"full" db.

If it simplifies things you can assume the keys and/or values are of fixed
sizes, since that is very likely, but I'm not sure it will really be
possible to keep that invariant, so more flexible solutions might be
needed. Solution ideas that support caching are nice but not required
(mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose
this is of interest.

Thanks,

Robert

if 'fairly frequent' is once per week, look at CDB too - it's blindingly fast
and updates (atomic replacements, actually) are surprisingly fast. reads are
absolutely insanely fast.

bdb is also great.

sqlite with adequate indexing may also work - but probably not.
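for completeness, the key/value table such an approach would use is only a few lines (a sketch assuming the sqlite3 gem; the sqlite bindings available in 2004 had a different API):

require 'sqlite3'

db = SQLite3::Database.new("objects.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

# Store a marshaled Ruby object under a string key (insert or overwrite).
db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
           ["some key", SQLite3::Blob.new(Marshal.dump([1, 2, 3]))])

rows = db.execute("SELECT v FROM kv WHERE k = ?", ["some key"])
obj = rows.empty? ? nil : Marshal.load(rows[0][0])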

-a

···

On Wed, 6 Oct 2004, Robert Feldt wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

Following the road less traveled a short way to see where it
leads... You can fairly easily make something that looks something like
a hash but has different internal structure:

class Hash_of_hashes
    def initialize(pc = Hash, cc = Hash)
        @parent_class, @child_class = pc, cc
        @storage = @parent_class.new
        @fanout = 100
    end
    # Find (or lazily create) the sub-hash this key falls into.
    def sub_level(k)
        @storage[k.hash.modulo(@fanout)] ||= @child_class.new
    end
    def [](k)
        sub_level(k)[k]
    end
    def []=(k, v)
        sub_level(k)[k] = v
    end
end

     It also seems like it would be fairly easy to make something that
looked sort of like a hash but used a directory as its storage.

class Hash_on_disk
    def initialize(path = '.')
        @path = path
        Dir.mkdir(@path) unless File.directory?(@path)
        @fanout = 100
    end
    # Bucket file name derived from the key's hash value.
    def sub_key(k)
        k.hash.modulo(@fanout).to_s
    end
    def [](k)
        File.open(File.join(@path, sub_key(k)), 'rb') { |f| Marshal.load(f) }
    end
    def []=(k, v)
        File.open(File.join(@path, sub_key(k)), 'wb') { |f| Marshal.dump(v, f) }
    end
end

     These are just sketches (you'd want to be able to delete key/value
pairs, for example), but they are pretty darned simple. Thus it seems
like you should (in about 50 lines of ruby or so) be able to implement
something that looks like a hash but stores the data in a directory tree
that "terminates" in files based on some criterion. You might, for
example, want to store each value in its own file, or you might want to
store the bottom-most hash in a file (or you may want to decide
dynamically). (The reason for going with the tree structure is to avoid
killing your OS.)
     The only gotchas I can see are 1) intra-value references and 2)
hash-key degeneracy. Both should be reasonably easy to deal with if you
are on the lookout for them.
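As one way to flesh the second sketch out (an illustration only; the class name and fanout are arbitrary, and Zlib.crc32 is used instead of String#hash so the bucket assignment stays stable across runs), each bucket file can hold a small Hash, which takes care of both colliding keys and deletion:

require 'fileutils'
require 'zlib'

class DirHash
  def initialize(path, fanout = 997)
    @path, @fanout = path, fanout
    FileUtils.mkdir_p(@path)
  end
  def [](key)
    bucket(key)[key]
  end
  def []=(key, value)
    update(key) { |b| b[key] = value }
  end
  def delete(key)
    update(key) { |b| b.delete(key) }
  end

  private

  # Colliding keys share one bucket file, so each file stores a small Hash.
  def file_for(key)
    File.join(@path, (Zlib.crc32(key) % @fanout).to_s)
  end
  def bucket(key)
    f = file_for(key)
    File.exist?(f) ? File.open(f, 'rb') { |io| Marshal.load(io) } : {}
  end
  def update(key)
    b = bucket(key)
    result = yield(b)
    File.open(file_for(key), 'wb') { |io| Marshal.dump(b, io) }
    result
  end
end

Intra-value references that span buckets would still need separate handling, as noted above.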

    -- Markus

···

On Tue, 2004-10-05 at 14:12, Robert Feldt wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

Berkeley DB is a good solution, and Guy Decoux has a great Ruby wrapper for
accessing it.

Bdb link:

Oops... http://www.sleepycat.com/

···

On 10/5/04 5:22 PM, "Richard Kilmer" <rich@infoether.com> wrote:

http://moulon.inra.fr/ruby/bdbxml.html

Ruby library:

http://moulon.inra.fr/ruby/bdb.html

-rich

On 10/5/04 5:12 PM, "Robert Feldt" <feldt@ce.chalmers.se> wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

SDBM does not seem to cut it; in a simple test with insertion of random
strings the file size grows non-linearly, so it does not scale. Could
sqlite or something similar handle this better?

However, I would rather go for something simple than having to bundle a
"full" db.

If it simplifies things you can assume the keys and/or values are of fixed
sizes, since that is very likely, but I'm not sure it will really be
possible to keep that invariant, so more flexible solutions might be
needed. Solution ideas that support caching are nice but not required
(mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose
this is of interest.

Thanks,

Robert

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

if 'fairly frequent' is once per week, look at CDB too - it's blindingly fast
and updates (atomic replacements, actually) are surprisingly fast. reads are
absolutely insanely fast.

Fairly frequent is more like "1-5% of keys per hour", but thanks for the pointer, very interesting. It seems to be Unix-only, but I need Windows as well, so maybe not.

bdb is also great.

Yes, it seems it can handle this load. I tried with a BTree and the ratio of db file size to total key+value bytes added seems to stay at about 2.5, at least up to 1e5 entries. I'm running a test with 1e6 added entries now; insertion time does not seem to scale linearly, but we'll see about file size.
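(For reference, a rough way to measure such a ratio; this is only a sketch, the helper name and value size are made up, and it assumes a hash-like db object backed by a single file at db_path:)

# Insert n entries into a hash-like store and report the file size
# versus the raw bytes added (key + value lengths).
def size_ratio(db, db_path, n, value_size = 300)
  bytes = 0
  n.times do |i|
    key   = "key-%08d" % i
    value = "x" * value_size        # stand-in for marshaled data
    db[key] = value
    bytes += key.size + value.size
  end
  # Flush/close the db first if the backend buffers writes.
  File.size(db_path).to_f / bytes
end

# e.g. ratio = size_ratio(db, "test.db", 100_000) for a store kept in test.db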

Thanks to both Rich and you for the pointers.

The downside of going with something like bdb is ease of install/support on multiple platforms (i.e. Windows, since everything works on Linux... :)), but I guess bdb should be well supported on Windows too, so maybe that's what it'll be.

sqlite with adequate indexing may also work - but probably not.

Ok, thanks for that info.

/R

···

Ara.T.Howard@noaa.gov wrote:

On Wed, 6 Oct 2004, Robert Feldt wrote:

OT: Can someone explain to me the difference between "Berkeley DB" and
("DBM", "GDBM", "SDBM")?

Gavin

···

On Wednesday, October 6, 2004, 7:22:18 AM, Richard wrote:

Berkeley DB is a good solution, and Guy Decoux has a great Ruby wrapper for
accessing it.

Bdb link:

http://moulon.inra.fr/ruby/bdbxml.html

Ruby library:

http://moulon.inra.fr/ruby/bdb.html

-rich

On 10/5/04 5:12 PM, "Robert Feldt" <feldt@ce.chalmers.se> wrote:

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

SDBM does not seem to cut it; in a simple test with insertion of random
strings the file size grows non-linearly, so it does not scale. Could
sqlite or something similar handle this better?

However, I would rather go for something simple than having to bundle a
"full" db.

If it simplifies things you can assume the keys and/or values are of fixed
sizes, since that is very likely, but I'm not sure it will really be
possible to keep that invariant, so more flexible solutions might be
needed. Solution ideas that support caching are nice but not required
(mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose
this is of interest.

Thanks,

Robert

check out fsdb on the RAA - it might actually be suitable for the OP's needs too.

-a

···

On Wed, 6 Oct 2004, Markus wrote:

    Following the road less traveled a short way to see where it
leads... You can fairly easily make something that looks something like
a hash but has different internal structure:

class Hash_of_hashes
   def initialize(pc = Hash, cc = Hash)
       @parent_class, @child_class = pc, cc
       @storage = @parent_class.new
       @fanout = 100
   end
   # Find (or lazily create) the sub-hash this key falls into.
   def sub_level(k)
       @storage[k.hash.modulo(@fanout)] ||= @child_class.new
   end
   def [](k)
       sub_level(k)[k]
   end
   def []=(k, v)
       sub_level(k)[k] = v
   end
end

    It also seems like it would be fairly easy to make something that
looked sort of like a hash but used a directory as its storage.

class Hash_on_disk
   def initialize(path = '.')
       @path = path
       Dir.mkdir(@path) unless File.directory?(@path)
       @fanout = 100
   end
   # Bucket file name derived from the key's hash value.
   def sub_key(k)
       k.hash.modulo(@fanout).to_s
   end
   def [](k)
       File.open(File.join(@path, sub_key(k)), 'rb') { |f| Marshal.load(f) }
   end
   def []=(k, v)
       File.open(File.join(@path, sub_key(k)), 'wb') { |f| Marshal.dump(v, f) }
   end
end

    These are just sketches (you'd want to be able to delete key/value
pairs, for example), but they are pretty darned simple. Thus it seems
like you should (in about 50 lines of ruby or so) be able to implement
something that looks like a hash but stores the data in a directory tree
that "terminates" in files based on some criterion. You might, for
example, want to store each value in its own file, or you might want to
store the bottom-most hash in a file (or you may want to decide
dynamically). (The reason for going with the tree structure is to avoid
killing your OS.)
    The only gotchas I can see are 1) intra-value references and 2)
hash-key degeneracy. Both should be reasonably easy to deal with if you
are on the lookout for them.

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

If you guys are talking about the same CDB format Bernstein created,
then I don't know that it would be a good match for this kind of task.

CDB has one major weakness: you can make updates in memory quickly, but
persisting them to disk requires time proportional to the total dataset
size. Note that's dataset size, not number of keys -- CDB requires you
to copy the entire old DB to a new file to persist changes.

It *is* ridiculously fast for lookups of mostly constant data (hence
the initial 'C'), but probably isn't a good match for any sort of
regularly-updated dataset.

I'd second the recommendation for BDB -- it's reasonably fast, very
scalable, and gets hammered on by all kinds of applications day in and
day out.

Berkeley DB is an embedded database system developed and supported by
Sleepycat Software (http://www.sleepycat.com/). It is designed for
applications which need transactional data storage, but don't need (or
want) the additional structure that relational databases provide.

It allows simple numeric or string-valued keys to be associated with a
chunk of data, much like a persistent version of Ruby's native hash
tables. However, unlike a hash, only a small amount of the data has to
be in RAM at once -- Berkeley DB storages can be scaled into the
many-GB range easily. Like most relational databases, it also supports
atomic transactions and data recovery after crashes, as well as
concurrent (more than one process or thread) access to a database.

DBM, GDBM, and SDBM are all database formats (and libraries to access
them) which, like Berkeley DB, allow simple key/value mappings to be
saved to disk, and updated incrementally. However, each has various
limitations as to data size per key, total database size, or
performance on large datasets which Berkeley DB does not. None of them
natively offer transaction support, or handle concurrent access to the
same database gracefully.

The advantage of the *DBM libraries is their simplicity; if your
application uses less than 100,000 total keys, and doesn't need to
store much data in each one, they may offer plenty of functionality,
without requiring you to link in a large, complex library.

More specifically to Ruby, it is worth noting that the *DBM libraries
all have bindings included in the standard Ruby 1.8 distribution, while
Berkeley DB requires you to install BDB, then compile and install the
bindings. This is fine for some applications, but for quick
"download-and-go" tools, it may mean more dependencies than you want to
introduce.
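For instance, the bundled bindings can be used directly (a sketch; dbm here could just as well be gdbm or sdbm, values have to be marshaled by hand, and the per-entry size limits mentioned above still apply):

require 'dbm'

DBM.open("objects") do |db|
  db["some key"] = Marshal.dump({ :any => "ruby object" })   # values must be Strings
  obj = Marshal.load(db["some key"])
  db.delete("some key")
end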

Robert Feldt <feldt@ce.chalmers.se> writes:

[...bdb...]

Yes, it seems it can handle this load. I tried with a BTree and
the ratio of db file size to total key+value bytes added seems to
stay at about 2.5, at least up to 1e5 entries. I'm running a test
with 1e6 added entries now; insertion time does not seem to scale
linearly, but we'll see about file size.

My experience with bdb has been very good. Tests with >500
million entries scaled linearly with regard to file size and
essentially linearly with regard to insert time. I don't have
too much direct experience with remove time, but update is good,
especially if the entries are not expanding.

The key for excellent performance with bdb is selecting fast
persistent storage for the log files.

~r

Robert Feldt wrote:

The downside of going with something like bdb is ease of install/support on multiple platforms (i.e. Windows, since everything works on Linux... :)), but I guess bdb should be well supported on Windows too, so maybe that's what it'll be.

If I'm not mistaken, bdb is also not free if you want to redistribute it with a non-open-source application. (I get the impression the licensing fees are pretty steep--maybe on the order of tens of thousands of USD?)

DBM, GDBM, and SDBM are all database formats (and libraries to access
them) which, like Berkeley DB, allow simple key/value mappings to be
saved to disk, and updated incrementally. However, each has various
limitations as to data size per key, total database size, or
performance on large datasets which Berkeley DB does not. None of them

dbm, sdbm and ndbm have rather low limits on the data size, IIRC;
gdbm has no such restrictions.

natively offer transaction support, or handle concurrent access to the
same database gracefully.

gdbm (at least) allows concurrent reader access to the db; however,
while a writer process has it open, no other simultaneous access is possible.
In practice, gdbm is often adequate if you don't need transactions and
your db is not too large. I've heard it is often faster than bdb due to
the lack of transactions.
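A sketch of what that looks like with Ruby's gdbm binding (the filename is arbitrary; values still need to be marshaled by hand):

require 'gdbm'

# A writer needs exclusive access while it has the db open
# (WRCREAT: open read/write, creating the file if needed).
GDBM.open("objects.gdbm", 0666, GDBM::WRCREAT) do |db|
  db["some key"] = Marshal.dump([1, 2, 3])
end

# Any number of processes can open it read-only at the same time.
GDBM.open("objects.gdbm", 0666, GDBM::READER) do |db|
  p Marshal.load(db["some key"])
end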

···

On Wed, Oct 06, 2004 at 11:19:49AM +0900, rcoder@gmail.com wrote:

--
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com