Looking for a Fast Persistent Store

How fast is fast enough, and what kind of transaction support do you need?

I have a class, part of the suite of caches I provide in IOWA, that I have recently been working on.

It's called a DiskCache because it implements an LRU cache that operates exclusively through persistent storage, but the expiration of elements can be turned off, turning it into a simple persistent storage mechanism. Because it operates exclusively through persistent storage, it requires no in-memory indexes or other data structures. This was important to me because PStore's index reading/writing overhead becomes troublesome with many small transactions on any data set that isn't tiny (fewer than a thousand or so elements), and I didn't want to incur the RAM hit of an in-memory index for a very large data set, either.

It's relatively fast (much faster than PStore for non-trivial data sets, since PStore reads an index into memory for each transaction), platform independent (well, I have tested it on Linux and Win XP), and it is multiprocess- and thread-safe.
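
For reference, this is the kind of PStore usage I'm comparing against. PStore (from the standard library) re-reads its marshalled table from disk on every transaction, which is the overhead I mentioned above; the snippet is just an illustration, not anything from IOWA:

require 'pstore'

# Plain stdlib PStore: every transaction re-reads the marshalled store file,
# so many small transactions against a large data set get expensive.
store = PStore.new('example.pstore')

store.transaction do
  store[:a] = 1
  store[:b] = 2
end

store.transaction(true) do   # read-only transaction
  puts store[:a]
end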

It supports transactions as well as nested transactions.

For example:

@cache.transaction do
  @cache[:a] = 1
  @cache.transaction do
    @cache[:b] = 2
    @cache.transaction do
      @cache[:c] = 3
      rollback
    end
    commit
  end
  rollback
end

At the end of those transactions, @cache[:b] == 2, and the two other items have been rolled back.

I'm currently expanding the test cases for it and cleaning up some bugs, and will go ahead and release it as a standalone package this weekend. Maybe it will help you.

Kirk Haines

···

On Fri, 11 Aug 2006, Bob Hutchison wrote:

And he passes the proverbial buck! :-)

Austin contributed all the code in Net::LDAP for doing the Rubyforge
and Gmail dances, and it works very well, but if you have trouble with
it, it's probably because I munged it some after Austin checked it in.
So any problems will be my fault, not his.

There are some dependencies, like minitar.

···

On 8/14/06, Austin Ziegler <halostatue@gmail.com> wrote:

Actually, the Net::LDAP Rakefile is the best thing to look at right now. ;-)

-austin

Austin Ziegler wrote:

···

On 8/14/06, Francis Cianfrocca <garbagecat10@gmail.com> wrote:

On 8/14/06, Joel VanderWerf <vjoel@path.berkeley.edu> wrote:
> I apologize for the lack of fsdb stuff on the RubyForge page. I've never
> found a comfortable way to automate gem uploads, so I still use my old
> scripts for building this page:
Austin Ziegler has automated the "Rubyforge dance" - look at
PDF::Writer or post to him on this list.

Actually, the Net::LDAP Rakefile is the best thing to look at right now. ;-)

I will look at both before I start dancing. Thanks to both of you :-)

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Kirk, how did you do this? Are you storing immutable objects and
turning older versions into garbage somehow? If you have no indexing,
then how do you enforce a consistent view across processes without
incurring disk i/o (at least having to read the inode)? Or are you
managing an arena that is backed by shared memory and you convert the
key values deterministically into offsets?

Sorry for all the questions, but I've just had to fight with this
problem too many times and it would be great if you've come up with a
better mousetrap.

I'm currently expanding the test cases for it and cleaning up some bugs, and will go ahead and release it as a standalone package this weekend. Maybe it will help you.

Oh, this sounds very cool! Even if it doesn't help with this particular situation, I have a pretty good idea where it will come in handy.

Looking forward to it.

Cheers,
Bob

···

On Aug 11, 2006, at 2:03 PM, khaines@enigo.com wrote:

Kirk Haines

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>

Kirk Haines wrote:

How fast is fast enough, and what kind of transaction support do you need?

<snip>

I have a class that is part of the suite of caches I provide in IOWA that I have recently been working on.

<snip>

I'm currently expanding the test cases for it and cleaning up some bugs, and will go ahead and release it as a standalone package this weekend. Maybe it will help you.

Francis Cianfrocca wrote:

Kirk, how did you do this? Are you storing immutable objects and
turning older versions into garbage somehow? If you have no indexing,
then how do you enforce a consistent view across processes without
incurring disk i/o (at least having to read the inode)? Or are you
managing an arena that is backed by shared memory and you convert the
key values deterministically into offsets?

Sorry for all the questions, but I've just had to fight with this
problem too many times and it would be great if you've come up with a
better mousetrap.

I'm interested in this, too... please make an announcement on the list when you release it? :-)

-Justin

I don't think it is necessarily a better mousetrap. In this case, I'm just willing to make the filesystem do some of the work for me instead of keeping a separate index in memory.

I'm using a modified version of Ara Howard's lockfile.rb to handle locking between processes. It works over NFS, and I have modified it so that it also works on Windows by automatically falling back to a Windows-safe locking method.
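
The locking itself is just a critical section guarded by a lock file. Here is a minimal sketch of that pattern using File#flock as a stand-in; flock alone has none of lockfile.rb's NFS safety or Windows fallback, so treat this as an illustration of the shape of it, not the actual code:

# Serialize access between processes with an exclusive lock on a lock file.
# flock here is only a stand-in for the modified lockfile.rb described above.
def with_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0644) do |lock|
    lock.flock(File::LOCK_EX)   # block until we hold the exclusive lock
    begin
      yield
    ensure
      lock.flock(File::LOCK_UN)
    end
  end
end

with_lock('cache.lock') do
  # read, modify, and rewrite a cache entry here
end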

I create a hash (Digest::SHA512-based) out of the key, and use that hash as the filename for the data. There is also a second file written that contains some metadata (the structure on disk is actually a linked list, to more easily support some of the LRU aspects and element expiration). There are a couple of metadata files that also maintain some information on the data store as a whole.

The code automatically breaks the file storage into a tree of directories to help avoid any filesystem problems with having too many files in a single directory. The size of this tree is tunable, so a data store that may hold a million entries can have more nodes (directories) than one expected to hold only 10,000.

So, retrieving an element from the data store is reduced to hashing the key, finding the file, and reading it. It's disk I/O, but usually less than PStore generates.
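
Roughly, it works like the sketch below. This is only an illustration of the scheme, not the actual IOWA code; the helper names and the depth/width defaults are made up for the example, and the Marshal-per-file detail is mentioned later in this thread:

require 'digest/sha2'
require 'fileutils'

# Illustrative sketch of the hashed-filename scheme: hash the key, use
# leading slices of the hash to pick bucket directories, and name the data
# file after the full hash.
def bucket_path(root, key, depth = 2, width = 2)
  digest  = Digest::SHA512.hexdigest(key.to_s)
  buckets = (0...depth).map { |i| digest[i * width, width] }
  File.join(root, *buckets, digest)
end

def store(root, key, value)
  path = bucket_path(root, key)
  FileUtils.mkdir_p(File.dirname(path))
  File.binwrite(path, Marshal.dump(value))
end

def fetch(root, key)
  path = bucket_path(root, key)
  Marshal.load(File.binread(path)) if File.exist?(path)
end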

Since my need was for something that could do this while also being treated as an LRU cache, there is some extra work mixed in there to maintain the metadata and expire elements from the cache (it can also expire them based on age), and that's really where most of the performance hit comes from.

I'm working on creating a second version that strips out all of the LRU cache code to provide a simple, fast data persistence implementation without any of that extra overhead, and I'll soon be able to report just how much of a performance hit that overhead adds.

Thanks,

Kirk Haines

···

On Sat, 12 Aug 2006, Francis Cianfrocca wrote:

Kirk, how did you do this? Are you storing immutable objects and
turning older versions into garbage somehow? If you have no indexing,
then how do you enforce a consistent view across processes without
incurring disk i/o (at least having to read the inode)? Or are you
managing an arena that is backed by shared memory and you convert the
key values deterministically into offsets?

Sorry for all the questions, but I've just had to fight with this
problem too many times and it would be great if you've come up with a
better mousetrap.

With all the times I've reinvented this wheel, I've never tried
storing millions of data elements each in its own file. Do you have
performance metrics you can share? Did you tune it for a particular
filesystem?

···

On 8/11/06, khaines@enigo.com <khaines@enigo.com> wrote:

On Sat, 12 Aug 2006, Francis Cianfrocca wrote:

> Kirk, how did you do this? Are you storing immutable objects and
> turning older versions into garbage somehow? If you have no indexing,
> then how do you enforce a consistent view across processes without
> incurring disk i/o (at least having to read the inode)? Or are you
> managing an arena that is backed by shared memory and you convert the
> key values deterministically into offsets?
>
> Sorry for all the questions, but I've just had to fight with this
> problem too many times and it would be great if you've come up with a
> better mousetrap.

I don't think it is necessarily a better mousetrap. In this case, I'm
just willing to make the filesystem do some of the work for me instead of
keeping a separate index in memory.

I'm using a modified version of Ara Howard's lockfile.rb to handle locking
between processes. It works over NFS, and I have modified it so that it
also works on Windows by automatically falling back to a Windows safe
locking method.

I create a hash (Digest::SHA512 based) out of the key, and use that key as
the filename for the data. There is also a second file written that
contains some metadata (the structure on disk is actually a linked list to
more easily support some of the LRU aspects and element expiration).
There are a couple of metadata files that also maintain some information
on the data store as a whole.

The code automatically breaks the file storage into a tree of directories
to help avoid any filesystem problems with having too many files in a
single directory. The size of this tree is tuneable so that a data store
that may have a million entries may have more nodes (directories) than one
that is only expected to have 10000.

So, retrieving an element from the data store is reduced to hashing the
key, finding the file, and reading it. It's disk i/o, but less than
PStore generates most of the time.

Since my need was for something that could do this while also being
treated as an LRU cache, there is some extra work mixed into there to
maintain the meta data and expire elements from the cache (it can also
expire them based on age), and that's really where most of the performance
hit comes from.

I'm working on creating a second version that strips out all of the LRU
cache code to provide a simple, fast data persistence implementation
without any of that extra overhead, and will be better able to report on
just how much of a performance hit that overhead brings with it soon.

Thanks,

Kirk Haines

khaines@enigo.com writes:

I'm working on creating a second version that strips out all of the
LRU cache code to provide a simple, fast data persistence
implementation without any of that extra overhead, and will be better
able to report on just how much of a performance hit that overhead
brings with it soon.

(How) do you handle object references?

···

Thanks,

Kirk Haines

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Not really. I have not stored the results of any tests to date. The largest test, however, created a million-item cache, and I had no problems.

The main consideration, if one plans to store large numbers of elements, is making sure that the filesystem is built with large numbers of small files in mind.

However, a lot of files can fit into a surprisingly non-messy directory structure.

Consider a SHA512 hash of a key whose first few characters are:

dd232b224979c0e

If I am running a cache with a bucket depth of 2 and a bucket width of 2, the file for that key is going to go into the directory

dd/23/

Given an even distribution of hashes, with a million records one should have 256 directories at the top level, 256 under each of those at the second level, and 15 or 16 files in each second-level directory.

That's pretty easy to handle efficiently.

Kirk Haines

···

On Sat, 12 Aug 2006, Francis Cianfrocca wrote:

With all the times I've reinvented this wheel, I've never tried
storing millions of data elements each in its own file. Do you have
performance metrics you can share? Did you tune it for a particular
filesystem?

It uses Marshal.
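
So references shared *within* a single stored value survive the round trip, but values stored under separate keys come back as independent copies when loaded. A quick illustration, assuming each value is simply dumped and loaded with Marshal:

shared = [1, 2, 3]
graph  = { a: shared, b: shared }

copy = Marshal.load(Marshal.dump(graph))
copy[:a].equal?(copy[:b])   # => true  (sharing inside one value is preserved)
copy[:a].equal?(shared)     # => false (it is a copy, not the original object)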

Kirk Haines

···

On Sun, 13 Aug 2006, Christian Neukirchen wrote:

(How) do you handle object references?

I was so interested in this idea that I cobbled up a quick test on a
workstation machine with a medium-bandwidth IO channel. 100,000 1K files
brought the shared-memory filesystem to its knees almost immediately. On an
oxide-based journalling filesystem, it seemed to do 10 to 20 thousand files
a second pretty consistently. That's a decent but not huge fraction of the
available channel bandwidth. Haven't tried a million files. I'm still
interested though.

···

On 8/11/06, khaines@enigo.com <khaines@enigo.com> wrote:

On Sat, 12 Aug 2006, Francis Cianfrocca wrote:

> With all the times I've reinvented this wheel, I've never tried
> storing millions of data elements each in its own file. Do you have
> performance metrics you can share? Did you tune it for a particular
> filesystem?

Not really. I have not stored the results of any tests to date. The
largest test, however, created a million item cache, and I had no
problems.

The main consideration if one were planning on storing large numbers of
elements is making sure that the filesystem is built with large numbers of
small files taken into consideration.

However, a lot of files can fit into a surprisingly non-messy directory
structure.

Consider an SHA512 hash for a key that has the first few characters of:

dd232b224979c0e

If I am running a cache with a bucket depth of 2, and a bucket width of 2,
the file for that cache is going to go into the directory

dd/23/

Given an even distribution of hashes, with a million records, one should
have 256 directories at the top level, 256 at the second level, and 15 or
16 files under each of those.

That's pretty easy to handle efficiently.

Kirk Haines

In the Java thing we wrote we had hundreds of thousands of files in a directory, nothing shared though. No problems until you did an 'ls' by mistake :-) Backups are also a problem. External fragmentation can be a problem. I know we experimented with the kind of thing that I think Kirk was suggesting. By building a hierarchy you can keep the number of files in any one directory down, and this avoids the backup (and 'ls') problems.

Cheers,
Bob

···

On Aug 11, 2006, at 7:00 PM, Francis Cianfrocca wrote:

I was so interested in this idea that I cobbled up a quick test on a
workstation machine with a medium-bandwidth IO channel. 100,000 1K files
brought the shared-memory filesystem to its knees almost immediately. On an
oxide-based journalling filesystem, it seemed to do 10 to 20 thousand files
a second pretty consistently. That's a decent but not huge fraction of the
available channel bandwidth. Haven't tried a million files. I'm still
interested though.

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>

I was so interested in this idea that I cobbled up a quick test on a
workstation machine with a medium-bandwidth IO channel. 100,000 1K files
brought the shared-memory filesystem to its knees almost immediately. On an
oxide-based journalling filesystem, it seemed to do 10 to 20 thousand files
a second pretty consistently. That's a decent but not huge fraction of the
available channel bandwidth. Haven't tried a million files. I'm still
interested though.

Hi,

What is being measured? Access time for files that already exist?
Creation of new files? Scanning the directory structure for a list
of existing files?

At a prior gig, we used to split a couple hundred thousand encyclopedia articles up into a 12/34/56.xxx sort of format. It worked
adequately for our needs--our batch-oriented processing was expected
to run overnight anyway--but my impression was that as long as the
filename was known, accessing file 12/34/56.xxx seemed quick,
whereas directory scans to enumerate the existing filenames were
pretty slow.

(This was on SunOS/Solaris systems circa 2000... I don't know if
the filesystem was journalled or not.)

Regards,

Bill

···

From: "Francis Cianfrocca" <garbagecat10@gmail.com>

Clearly a directory hierarchy is the only way to make this work, and Kirk's
idea of partitioning the space by segmenting the hash seems as reasonable as
any. So far I'm very surprised to find that performance may be reasonable.

But what do you mean by "external fragmentation"? If you're thinking of NFS,
I'd rather cut myself than attempt something like this on NFS, even though
Kirk says it's supported.

···

On 8/11/06, Bob Hutchison <hutch@recursive.ca> wrote:

External fragmentation can be
a problem.

What about writing a thin C extension for interfacing with Perst?
Going back and forth from C to Java may be tricky, though... But,
then, what about any of the non-Java or -Ruby options? Just write a
quick C extension/wrapper for Ruby, or use SWIG to do it for you.

M.T.

I wanted to put a rough lower bound on the performance of the
approach, just to decide if it's worth pursuing at all. (Kirk
obviously thinks it is, but I don't know if the class of applications
that interests him is anything like the class that interests me.) So I
measured creation of 1K files and re-writing of 1K files. (The first
case involves creating an inode and data pages, the second involves
touching an inode and creating data pages.) I didn't test reads of the
files (which would involve touching an inode) because I didn't feel
like optimizing the test (by remounting my filesystem with the
inode-touch for last-access turned off). The test box was a
medium-powered Linux workstation with a 2.6 kernel, a single SATA
drive, and ext3 filesystems.
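
In rough form, the test was shaped like the following; this is a reconstruction for illustration only, since the actual script isn't shown here (the file count, file size, and directory name are just placeholders):

require 'benchmark'
require 'fileutils'

# Create N 1K files, then rewrite them in place, timing each pass separately.
N    = 10_000
DIR  = 'bench_files'
DATA = 'x' * 1024

FileUtils.mkdir_p(DIR)

create_time = Benchmark.realtime do
  N.times { |i| File.binwrite(File.join(DIR, "f#{i}"), DATA) }
end

rewrite_time = Benchmark.realtime do
  N.times { |i| File.binwrite(File.join(DIR, "f#{i}"), DATA) }
end

puts format('create: %.2fs  rewrite: %.2fs', create_time, rewrite_time)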

I'd expect under normal conditions to get maybe 15 megabytes per
second of disk-write bandwidth from this system, although with tuning
I could probably get a lot more. But again, I was going for a smell
test here. This whole approach is attractive because it's easy to
code, so I'd want to use it for light-duty applications on non-tuned
hardware. For an application with more stringent requirements, I'd
make a different tradeoff in development time and probably use a
different approach.

Anyway, I first tried it on /dev/shm. That worked really nicely for
10,000 files, took about 0.9 seconds consistently to create new files
and 0.6 seconds to re-write them. The same test with 100,000 files
totally de-stabilized the machine. I didn't want to reboot it so I
waited. Fifteen minutes later it was back. But what a strange journey
that must have been. Obviously this approach doesn't make a lot of
sense on a shm device anyway, but I had to know.

With a disk-based filesystem, things were a lot better. For 10,000
files, about 1.6 seconds to create them and 1.2 to rewrite them. Those
numbers were consistent across many trials. Similar results at 30,000
and 60,000 files, just scale upward. At 100,000 files things got
screwy. The create-time got variable, ranging from 5 seconds to almost
15 seconds from run to run. During all the runs, the machine didn't
appear to destabilize and remained responsive. Obviously processor
loads were very low. I didn't make a thorough study of page faults and
swapping activity though. But notice that the implied throughput is an
interestingly high fraction of my notional channel bandwidth of 15
megabytes/sec. And the journalling FS means that I don't even think
about the movement of the R/W head inside the disk drive anymore. (Of
course that may matter a great deal on Windows, but if I'm using
Windows then *everything* about the project costs vastly more anyway,
so who cares?)

So I'm claiming without further study (and without trying to explain
the results) that the lower bound on performance is in the region of
5000 writes a second. That's just at the fuzzy edge of being worth
doing. I usually think in terms of a "budget" for any operation that
must be repeated for a continuously-running server application. In
general, I want to be able to do a bare minimum of 1000 "useful
things" per second on a sustained basis, on untuned hardware. ("Useful
things" might be dynamic web-pages generated, or guaranteed-delivery
messages processed, etc.) So this approach uses nearly 20% of my
budget. It's a big number. (Just to show how I apply this kind of
analysis: I never worry about adding a SHA-1 hash calculation to any
critical code path, because I know I can do 250,000 of those per
second without breaking a sweat.)

···

On 8/12/06, Bill Kelly <billk@cts.com> wrote:

Hi,

What is being measured? Access time for files that already exist?
Creation of new files? Scanning the directory structure for a list
of existing files?

At a prior gig, we used to split a couple hundred thousand
encyclopedia articles up as 12/34/56.xxx sort of format. It worked
adequately for our needs--our batch-oriented processing was expected
to run overnight anyway--but my impression was that as long as the
filename was known, accessing file 12/34/56.xxx seemed quick,
whereas directory scans to enumerate the existing filenames were
pretty slow.

External fragmentation can be
a problem.

Clearly a directory hierarchy is the only way to make this work, and Kirk's
idea of partitioning the space by segmenting the hash seems as reasonable as
any. So far I'm very surprised to find that performance may be reasonable.

So was I, but it actually is quite good (I'll publish some unscientific benchmarks in the next couple of days).

But what do you mean by "external fragmentation"? If you're thinking of NFS,
I'd rather cut myself than attempt something like this on NFS, even though
Kirk says it's supported.

Sorry, jargon from three decades ago. External fragmentation, in this case, is the disk space wasted when small files are put into a larger segment (e.g. a 1k file in a 4k disk segment is a 3k loss of space to external fragmentation). Things like Perst, qdbm, Purple, or SQLite-as-persistent-store will avoid almost all external fragmentation, at the cost of internal fragmentation. In the case of the file system, there isn't any space wasted *within* the 1k file, but the persistent stores all have this problem (and you'll see them talking about free space management, best-fit, first-fit, garbage collection, and so on).
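
To put a rough number on it for the file-per-entry approach, assuming 4k filesystem blocks and roughly 1k entries (just the figures from the example above):

# Back-of-the-envelope estimate of the space lost to partly filled blocks
# when every entry lives in its own small file.
block   = 4 * 1024       # filesystem block size
entry   = 1 * 1024       # approximate size of one stored entry
entries = 1_000_000

blocks_per_entry = (entry.to_f / block).ceil
wasted = entries * (blocks_per_entry * block - entry)
puts '%.2f GB lost to fragmentation' % (wasted / (1024.0 ** 3))
# => roughly 2.86 GB for these figures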

Cheers,
Bob

···

On Aug 11, 2006, at 8:12 PM, Francis Cianfrocca wrote:

On 8/11/06, Bob Hutchison <hutch@recursive.ca> wrote:

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>

For that matter, you could just try using JRuby, which would let you call
Java code directly. www.jruby.org

···

On 8/12/06, Matt Todd <chiology@gmail.com> wrote:

What about writing a thin C extension for interfacing with Perst?
Going back and forth from C to Java may be tricky, though... But,
then, what about any of the non-Java or -Ruby options? Just write a
quick C extension/wrapper for Ruby, or use SWIG to do it for you.

M.T.

--
Contribute to RubySpec! @ headius.com
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn
JRuby Developer @ www.jruby.org
Application Architect @ www.ventera.com

That's a good point. I don't think I want to involve Java in this if I can help it. The fellow who wrote Perst, Konstantin Knizhnik <http://www.garret.ru/~knizhnik/databases.html>, has also written GOODS, FastDB, and GigaBASE. If I were going to write a C extension I'd go with one of those. Konstantin has also written something called DyBASE, which is specifically for dynamic languages like Ruby and Python, and comes with bindings for Ruby 1.6.x. I've asked Konstantin about the state of DyBASE and am trying to work out whether it is worth updating to Ruby 1.8.4.

Cheers,
Bob

···

On Aug 12, 2006, at 10:40 PM, Matt Todd wrote:

What about writing a thin C extension for interfacing with Perst?
Going back and forth from C to Java may be tricky, though... But,
then, what about any of the non-Java or -Ruby options? Just write a
quick C extension/wrapper for Ruby, or use SWIG to do it for you.

M.T.

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>