Ruby sdbm library reliable?

From: Jamey Cribbs [mailto:cribbsj@oakwood.org]
Sent: Tuesday, May 16, 2006 8:37 AM
To: ruby-talk ML
Subject: ruby sdbm library reliable?

Does anyone have any recent experience with using Ruby's sdbm
library?

The dbm/gdbm/sdbm libraries are toys, imho. I wouldn't trust them to
store more than a few thousand records, let alone a million. I've seen
them fall down horribly on Windows. Just ask Hal Fulton. As far as I
can tell, the dbm/gdbm/sdbm wrappers were included because that's what
Perl does.

I would be looking at storing up to a million records, with each record
being up to 200 bytes and each key being up to 7 bytes.

I'm very curious, Jamey - why aren't you using KirbyBase?

Anyway, if you want something more robust yet still simple, I recommend
sqlite.
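
For what it's worth, the sqlite route is only a few lines from Ruby via
the sqlite3-ruby bindings. A minimal sketch, with the file, table, and
key names made up for illustration:

  require 'sqlite3'

  db = SQLite3::Database.new('records.db')
  db.execute('CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, data TEXT)')

  # Parameterized statements sidestep any quoting problems in the data.
  db.execute('INSERT INTO records (key, data) VALUES (?, ?)',
             ['0000001', 'up to 200 bytes of row data'])
  puts db.get_first_value('SELECT data FROM records WHERE key = ?',
                          ['0000001'])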

Regards,

Dan

This communication is the property of Qwest and may contain confidential or
privileged information. Unauthorized use of this communication is strictly
prohibited and may be unlawful. If you have received this communication
in error, please immediately notify the sender by reply e-mail and destroy
all copies of the communication and any attachments.

···


Berger, Daniel wrote:

The dbm/gdbm/sdbm libraries are toys, imho. I wouldn't trust them to
store more than a few thousand records, let alone a million. I've seen
them fall down horribly on Windows. Just ask Hal Fulton. As far as I
can tell, the dbm/gdbm/sdbm wrappers were included because that's what
Perl does.

Thanks, Dan. That's the kind of info I was looking for before I spent a lot of time testing.

I'm very curious, Jamey - why aren't you using KirbyBase?

Well, here's the deal. :-)

I'm looking at trying to improve KirbyBase's performance, to make it scale better. One way I'm doing this is by using Skiplists for KirbyBase indexes rather than arrays. For a lot of queries, this has greatly sped up the part of the query that grabs all matching keys.
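
For anyone curious, here is a toy skip list in Ruby (not the code I've been playing with, just the textbook structure): each node carries an array of forward pointers, and coin flips decide how tall a new node is.

  class SkipList
    MAX_LEVEL = 16
    P = 0.5                       # probability of growing one level

    Node = Struct.new(:key, :value, :forward)

    def initialize
      @head  = Node.new(nil, nil, Array.new(MAX_LEVEL))
      @level = 1
    end

    # Flip coins to pick the height of a new node.
    def random_level
      lvl = 1
      lvl += 1 while rand < P && lvl < MAX_LEVEL
      lvl
    end

    def insert(key, value)
      update = Array.new(MAX_LEVEL)
      x = @head
      (@level - 1).downto(0) do |i|
        x = x.forward[i] while x.forward[i] && x.forward[i].key < key
        update[i] = x
      end
      x = x.forward[0]
      if x && x.key == key
        x.value = value           # key already present: overwrite
      else
        lvl = random_level
        if lvl > @level
          (@level...lvl).each { |i| update[i] = @head }
          @level = lvl
        end
        node = Node.new(key, value, Array.new(lvl))
        (0...lvl).each do |i|     # splice into each level's list
          node.forward[i] = update[i].forward[i]
          update[i].forward[i] = node
        end
      end
    end

    # Walk from the top level down, then check the bottom-level node.
    def [](key)
      x = @head
      (@level - 1).downto(0) do |i|
        x = x.forward[i] while x.forward[i] && x.forward[i].key < key
      end
      x = x.forward[0]
      x && x.key == key ? x.value : nil
    end
  end

The win over a plain array is O(log n) inserts while the keys stay in sorted order for range scans, which a hash can't give you.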

But I'm still running into a bottleneck: taking the keys returned by the query and using them to pull the matching records in from the flat text file where KirbyBase stores the actual data.

So, I've been looking at ways to get around this. I would basically look at replacing KirbyBase's back-end flat-file format with something faster. The user wouldn't see any difference in how they interacted with KirbyBase. One possibility is to store the data records in some kind of dbm-like file (e.g., gdbm or qdbm), and sdbm looked attractive because it is included with Ruby. The downside is that the data would no longer be kept in easily editable plain-text files.
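
For scale, the sdbm version of that kind of backend is only a few lines, since SDBM behaves like an on-disk Hash of Strings (file name made up; note that classic sdbm implementations also cap each key/value pair at roughly 1 KB, which 7-byte keys and 200-byte records fit under comfortably):

  require 'sdbm'

  # SDBM.open creates records.dir/records.pag in the current directory.
  # Keys and values must both be Strings.
  SDBM.open('records') do |db|
    db['0000001'] = 'up to 200 bytes of serialized row data'
    puts db['0000001']
  end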

Another option would be to keep the entire table in memory as a hash and use some kind of journal file to handle updates/deletes. This might be an issue on memory-constrained systems, but I'm thinking that is less and less of a concern on modern PCs.
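
A bare-bones sketch of that journal idea (a hypothetical class, nothing KirbyBase actually does; it assumes keys and values are single-line strings containing no tabs):

  class JournaledTable
    def initialize(path)
      @path  = path
      @table = {}
      replay                      # rebuild the hash from the journal
      @journal = File.open(@path, 'a')
      @journal.sync = true        # flush every entry straight out
    end

    def [](key)
      @table[key]
    end

    def []=(key, value)
      @journal.puts(['put', key, value].join("\t"))
      @table[key] = value
    end

    def delete(key)
      @journal.puts(['del', key].join("\t"))
      @table.delete(key)
    end

    private

    # Replay the journal in order to reconstruct the table.
    def replay
      return unless File.exist?(@path)
      File.foreach(@path) do |line|
        op, key, value = line.chomp.split("\t", 3)
        if op == 'put'
          @table[key] = value
        else
          @table.delete(key)
        end
      end
    end
  end

Compacting would just mean writing the hash out fresh and truncating the journal.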

I'm just kind of brainstorming right now. KirbyBase performs pretty well, but I would like to see it scale better.

Jamey

Confidentiality Notice: This email message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and/or privileged information. If you are not the intended recipient(s), you are hereby notified that any dissemination, unauthorized review, use, disclosure or distribution of this email and any materials contained in any attachments is prohibited. If you receive this message in error, or are not the intended recipient(s), please immediately notify the sender by email and destroy all copies of the original message, including attachments.

They also consistently fail tests on Gentoo GNU/Linux for me (and a few others).

Michal

This message is my property! Do not read it, and immediately destroy it!!!

···

On 5/16/06, Berger, Daniel <Daniel.Berger@qwest.com> wrote:

> The dbm/gdbm/sdbm libraries are toys, imho. I wouldn't trust them to
> store more than a few thousand records, let alone a million. [...]

I've been interested for a long time in doing a fast system for datasets up
to a few million rows with journalled updates. (I wrote the journal-arena
already; I use it for fast access to log entries.) If you like, I'd be happy
to help out also.

···

On 5/16/06, Jamey Cribbs <cribbsj@oakwood.org> wrote:

> I'm looking at trying to improve KirbyBase's performance, to make it
> scale better. [...] I'm just kind of brainstorming right now.
> KirbyBase performs pretty well, but I would like to see it scale
> better.

Francis Cianfrocca wrote:

I've been interested for a long time in doing a fast system for datasets up
to a few million rows with journalled updates. (I wrote the journal-arena
already; I use it for fast access to log entries.) If you like, I'd be happy
to help out also.

Sure thing!

I've been doing some thinking about Logan's idea of offering a choice of multiple backends. That sounds intriguing. I could see that idea, coupled with something like qdbm, providing different options for the user.

The default backend could remain the plain-text files that KirbyBase currently uses. But, if the user has qdbm (as an example) installed, KirbyBase could optionally use that as the backend, resulting in a noticeable (I would hope) performance increase.
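
Purely as a sketch of the shape that could take (hypothetical names, not KirbyBase's real internals): each backend exposes the same tiny read/write interface, and the table doesn't care which one it holds.

  require 'sdbm'

  class HashBackend               # stand-in for the flat-file default
    def initialize
      @store = {}
    end

    def write(key, value)
      @store[key] = value
    end

    def read(key)
      @store[key]
    end
  end

  class SDBMBackend               # optional dbm-style backend
    def initialize(path)
      @path = path
    end

    def write(key, value)
      SDBM.open(@path) { |db| db[key] = value }
    end

    def read(key)
      SDBM.open(@path) { |db| db[key] }
    end
  end

  class Table
    def initialize(backend)
      @backend = backend
    end

    def [](key)
      @backend.read(key)
    end

    def []=(key, value)
      @backend.write(key, value)
    end
  end

  t = Table.new(SDBMBackend.new('people'))  # or Table.new(HashBackend.new)
  t['0000001'] = 'a row'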

By the way, does anyone have experience using qdbm? It looks like the library is well maintained and there is plenty of documentation. I'm curious whether it is reliable and whether it seems to scale.

As far as the indexing goes, I've recently become infatuated with Skiplists. In some informal testing, my very crude skiplist implementation seemed to be a nice balance between a hash and an array for general lookups. I've also thought about trying out a B-Tree implementation for the indexing, with the added possibility of storing/accessing the indexes from disk as opposed to loading them completely into memory before use. I think Logan mentioned that he had some thoughts in that direction.

I would definitely be interested in seeing your work with journaling. I was thinking of possibly using journaling in conjunction with having the entire table in memory. Updates and deletes would be made directly to the table in memory and also written to a journal file, so that, when it came time to write the table back out to disk, I would only have to apply the journal file's contents to the table's disk file instead of writing out the whole memory-based table. Like I said, this is just brainstorming. It may not even make sense to try to keep the entire table in memory.

Jamey


So much depends on the application. And it's important to decide whether or
not we're willing to write fairly specialized implementations for particular
application classes. I've done a fair bit of that and I'm not afraid to do
it, but it's axiomatically an inefficient process. But if you go the
opposite direction, toward creating a general-purpose lightweight data
manager, you might as well just use sqlite. If we put our heads together,
we ought to be able to find a middle ground: what, for example, would be a
generic architecture for a lightweight database? What are the essential
pieces and the optional pieces, and precisely what interfaces do they
expose? Done right, that would enable optimizations for particular apps by
swapping out components.

The two big problems with journaling data storage are collecting the
garbage and rebuilding the indices when the run starts (because there's no
graceful place to put them in the journal). One of the apps I often deal
with is indexed searching on logfiles. A journalling arena implemented in
virtual memory works well here because garbage collection is automatic. But
fast index generation would be a huge plus. These apps generally index whole
words, Lucene-style.
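
Building that kind of whole-word index is the easy part; a toy version
in Ruby (log file name made up):

  # Map each word to the numbers of the lines it appears on.
  index = Hash.new { |h, k| h[k] = [] }

  lineno = 0
  File.foreach('app.log') do |line|
    line.downcase.scan(/\w+/).uniq.each { |word| index[word] << lineno }
    lineno += 1
  end

  index['error']   # => every line number containing the word "error"

The open problem is doing exactly this fast enough at startup, over
millions of lines.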

Another thing I do a lot is LDAP directories augmented with presence data.
Here, journalling is a tougher fit because most of the data is not garbage,
but some of it becomes garbage at very high rates. The indexing is also a
serious problem because LDAP permits partial-word searches.

Skiplists look really interesting, thanks for pointing them out.

···

On 5/17/06, Jamey Cribbs <cribbsj@oakwood.org> wrote:

> I've been doing some thinking about Logan's idea of offering a choice
> of multiple backends. [...] As far as the indexing goes, I've recently
> become infatuated with Skiplists. [...]

I wrote something a while back that might be helpful (or not). I wanted a
lightweight mechanism to persist data to disk via a hashlike interface. I
didn't want to use pstore because one pays a pretty high cost for each
transaction once the data store gets any significant amount of data in it,
and one has to go through extra effort to make pstore multiprocess safe. I
also wanted it to have flat RAM usage, regardless of the number of items in
the data store.

What I ended up with is something that can be used as an LRU cache or simply a
hash. It is completely disk based and is multithread/multiprocess safe. It
still needs some love, but it works.

It's slower than pstore (when starting a transaction, performing an action,
closing a transaction) for small numbers of items in the data store, and much
faster than pstore once the data store has a few thousand items.
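
Until I get it packaged up, here's a sketch of one way to do a
disk-based, multiprocess-safe hash with flat RAM usage (made-up names,
much simpler than the real thing): one file per key, guarded by flock.

  require 'fileutils'
  require 'digest/md5'

  class DiskHash
    def initialize(dir)
      @dir = dir
      FileUtils.mkdir_p(dir)
    end

    def [](key)
      path = path_for(key)
      return nil unless File.exist?(path)
      File.open(path, 'r') do |f|
        f.flock(File::LOCK_SH)    # shared lock: readers don't block readers
        f.read
      end
    end

    def []=(key, value)
      File.open(path_for(key), File::RDWR | File::CREAT, 0644) do |f|
        f.flock(File::LOCK_EX)    # exclusive lock: one writer at a time
        f.truncate(0)             # truncate only after the lock is held
        f.write(value)
      end
    end

    private

    # Hash the key so any string maps to a safe, fixed-length filename.
    def path_for(key)
      File.join(@dir, Digest::MD5.hexdigest(key))
    end
  end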

I'm planning on packaging it up as a separate micropackage in the IOWA area on
RubyForge. It's probably no faster than KirbyBase is now, but if it might be
useful in some capacity, I'll bump that up on my priority list.

Thanks

Kirk Haines

···

On Wednesday 17 May 2006 7:48 am, Jamey Cribbs wrote:

I've been doing some thinking about Logan's idea of offering a choice of
multiple backends. That sounds intriguing. I could see that idea,
coupled with something like qdbm, to provide different options for the
user.

Kirk Haines wrote:

···

> I wrote something a while back that might be helpful (or not). I
> wanted a lightweight mechanism to persist data to disk via a hashlike
> interface. [...] It is completely disk based and is
> multithread/multiprocess safe. [...]

Sounds interesting. I wrote FSDB with some of the same ideas in mind,
but we probably used different approaches. Looking forward to seeing it
when it comes out.

--
      vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Kirk Haines wrote:

> I wrote something a while back that might be helpful (or not). I
> wanted a lightweight mechanism to persist data to disk via a hashlike
> interface. [...]

This sounds really interesting. Thanks, Kirk!

Jamey
