Looking for a Fast Persistent Store

You are rebuilding many of the features of git.
http://git.or.cz/index.html

Git is the SCM designed by Linus Torvalds to replace BitKeeper. The C
source code is freely available.

The git file store hashes files into an array of 256 directories and
initially keeps them as individual small files. Later on you can run a
pass that transparently converts them into single-file packs. A pack
may hold many thousands of files that can be accessed via their hash
names. Packs recover the space lost to block fragmentation. Git
directory (tree) objects are used to map human-readable names to hash
names.

When packs are created there is code that searches for similarities
between files in the pack. If similar files are found they are stored
as deltas from the original file. All objects stored in the system,
files or packs, are zlib compressed.

It wouldn't be a lot of trouble to take the indexing code out of
CLucene and adjust it to use the git file store.

Transactions are achieved by first adding all of your files to the
file store; then a single 'commit' makes them visible. If you never
do the commit, there are tools for pruning the dangling objects out of
the db. But the dangling objects were never visible, so that's just a
cleanup step.
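
As a rough illustration (a minimal Ruby sketch of the pattern, not
git's actual code or formats): objects are zlib-compressed and fanned
out across the hashed directories, and nothing becomes visible until a
single commit file is atomically updated to point at them.

  require "digest/sha1"
  require "zlib"
  require "fileutils"

  class FileStore
    def initialize(root)
      @root = root
      FileUtils.mkdir_p(@root)
    end

    # Write an object and return its hash name. Uncommitted objects are
    # just dangling data that a later cleanup pass can prune.
    def put(data)
      sha  = Digest::SHA1.hexdigest(data)
      dir  = File.join(@root, sha[0, 2])       # fan out across 256 dirs
      path = File.join(dir, sha[2..-1])
      unless File.exist?(path)
        FileUtils.mkdir_p(dir)
        File.open(path, "wb") { |f| f.write(Zlib::Deflate.deflate(data)) }
      end
      sha
    end

    def get(sha)
      path = File.join(@root, sha[0, 2], sha[2..-1])
      Zlib::Inflate.inflate(File.binread(path))
    end

    # The 'commit': atomically publish a root hash. Until this runs,
    # everything written by put() stays invisible.
    def commit(sha)
      tmp = File.join(@root, "HEAD.tmp")
      File.open(tmp, "w") { |f| f.write(sha) }
      File.rename(tmp, File.join(@root, "HEAD"))
    end
  end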

Git is an archival system: objects in the database cannot be changed,
only superseded by a later version. Cryptographic hashes will
immediately detect whether anything in the db has been altered.

I have multi-gigabyte git file stores containing millions of objects.
Performance is pretty constant until your working set won't fit into
memory. The killer feature of these file stores is support for
distributed replication with multiple people adding objects to their
distributed copies.

···

On 8/13/06, Francis Cianfrocca <garbagecat10@gmail.com> wrote:

On 8/12/06, Bill Kelly <billk@cts.com> wrote:
> Hi,
>
> What is being measured? Access time for files that already exist?
> Creation of new files? Scanning the directory structure for a list
> of existing files?
>
> At a prior gig, we used to split a couple hundred thousand
> encyclopedia articles up as 12/34/56.xxx sort of format. It worked
> adequately for our needs--our batch-oriented processing was expected
> to run overnight anyway--but my impression was that as long as the
> filename was known, accessing file 12/34/56.xxx seemed quick,
> whereas directory scans to enumerate the existing filenames were
> pretty slow.
>

I wanted to put a rough lower bound on the performance of the
approach, just to decide if it's worth pursuing at all. (Kirk
obviously thinks it is, but I don't know if the class of applications
that interests him is anything like the class that interests me.) So I
measured creation of 1k files and re-writing of 1k files. (The former
case involves creating an inode and data pages; the latter involves
touching an inode and creating data pages.) I didn't test reads of the
files (which would involve touching an inode) because I didn't feel
like optimizing the test (by remounting my filesystem with the
inode-touch for last-access turned off). The test box was a
medium-powered Linux workstation with a 2.6 kernel, a single SATA
drive, and ext3 filesystems.
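
For reference, a minimal Ruby sketch of that kind of measurement (the
file count and layout here are made up, not the actual test):

  require "benchmark"
  require "fileutils"

  N    = 10_000
  DATA = "x" * 1024            # 1k payload
  DIR  = "bench_files"
  FileUtils.mkdir_p(DIR)

  # Creation: new inode plus data pages for each file.
  create = Benchmark.realtime do
    N.times { |i| File.open("#{DIR}/#{i}", "wb") { |f| f.write(DATA) } }
  end

  # Re-writing: the files already exist, so only the data is replaced.
  rewrite = Benchmark.realtime do
    N.times { |i| File.open("#{DIR}/#{i}", "wb") { |f| f.write(DATA) } }
  end

  puts "create:  %d files/sec" % (N / create)
  puts "rewrite: %d files/sec" % (N / rewrite)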

--
Jon Smirl
jonsmirl@gmail.com

That's a pretty constant tradeoff, speed for space, and it shows up in many
different ways. I'm convinced that's why there's no single answer to this
problem- every application will need different optimizations. However, this
is somewhat less true than it once was, for two reasons: journaling
filesystems, and the fact that disk storage is still getting cheaper every
year at a rapid rate- it's the only element in the computing chain that
still is. So my inclination (subject to change depending on the particular
case) is to waste space if it will save time (development time and/or
runtime).

···

On 8/12/06, Bob Hutchison <hutch@recursive.ca> wrote:

Sorry, jargon from three decades ago. External fragmentation, in this
case, is the disk space wasted due to small files being put into a
larger segment (e.g. a 1k file in a 4k disk segment is a 3k loss of
space). Things like Perst, qdbm, Purple, and
SQLite-as-persistent-store will avoid almost all external
fragmentation, at the cost of internal fragmentation. In the case of
the file system there isn't any space wasted *within* the 1k file, but
the persistent stores all have this problem (and you'll see them
talking about free-space management, best-fit, first-fit, garbage
collection, and so on).
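
The arithmetic, as a tiny Ruby illustration (assuming a 4k block size):

  BLOCK = 4096

  # Space lost to block rounding for a single file of the given size.
  def slack(file_size, block = BLOCK)
    allocated = ((file_size + block - 1) / block) * block
    allocated - file_size
  end

  slack(1024)   # => 3072 (a 1k file wastes 3k of its 4k segment)
  slack(4096)   # => 0
  slack(5000)   # => 3192 (two blocks allocated)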

Cheers,
Bob

Write-once/read-many is the optimization profile for archival storage,
but not for a persistent session cache. You're proving yet again that
there are almost as many valid and useful approaches to this problem
as there are applications.

···

On 8/13/06, Jon Smirl <jonsmirl@gmail.com> wrote:

You are rebuilding many of the features of git.
http://git.or.cz/index.html

Bob Hutchison <hutch@recursive.ca> writes:

What about writing a thin C extension for interfacing with Perst?
Going back and forth from C to Java may be tricky, though... But,
then, what about any of the non-Java or -Ruby options? Just write a
quick C extension/wrapper for Ruby, or use SWIG to do it for you.

M.T.

That's a good point. I don't think I want to involve Java in this if
I can help it. The fellow who wrote Perst, Konstantin Knizhnik
<http://www.garret.ru/~knizhnik/databases.html>, has also written
GOODS, FastDB, and GigaBASE. If I were going to write a C extension
I'd go with one of those. Konstantin has also written something
called DyBASE, which is specifically for dynamic languages like Ruby
and Python, and comes with bindings for Ruby 1.6.x. I've asked
Konstantin about the state of DyBASE and am trying to work out
whether it is worth updating to Ruby 1.8.4.

I've just discovered RScheme's PStore,
http://www.rscheme.org/rs/a/2005/persistence/

  "a system that allows objects in persistent storage (i.e., on disk)
  to be mapped into memory for direct manipulation by an application
  program. The approach is based on the concept of pointer swizzling
  at page-fault time as described in Paul Wilson's Pointer Swizzling
  at Page-Fault Time."

Now, having that in Ruby would rock very much, but I've no idea how
to implement it. Any courageous Ruby hacker? :-)
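
A real implementation needs VM tricks (mapping pages and catching the
fault on first touch), but a rough, purely illustrative Ruby
approximation of the swizzling idea is a lazy proxy that turns a
persistent object id into a live object on first access (the store's
load method below is hypothetical):

  class LazyRef
    def initialize(store, oid)
      @store, @oid, @obj = store, oid, nil
    end

    # On first use, "fault" the object in from persistent storage and
    # delegate to it from then on.
    def method_missing(name, *args, &blk)
      @obj ||= @store.load(@oid)
      @obj.send(name, *args, &blk)
    end

    def respond_to_missing?(name, include_private = false)
      true
    end
  end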

···

On Aug 12, 2006, at 10:40 PM, Matt Todd wrote:

Cheers,
Bob

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Hi,

That's a pretty constant tradeoff, speed for space, and it shows up in many
different ways. I'm convinced that's why there's no single answer to this
problem- every application will need different optimizations. However, this
is somewhat less true than it once was, for two reasons: journaling
filesystems, and the fact that disk storage is still getting cheaper every
year at a rapid rate- it's the only element in the computing chain that
still is. So my inclination (subject to change depending on the particular
case) is to waste space if it will save time (development time and/or
runtime).

I absolutely agree, which is why I've implemented the filesystem route first. There is *no* performance problem with it. The problem is that I've got to write a bunch of files out, and if for some reason even one file failed to write I'd have a consistency problem. There are easy ways to mitigate these risks, but they can take an excruciatingly long time to run. Moving to a transactional store (like Perst in Java, or SQLite in Ruby) gives me the guarantees of consistency that I'm looking for, but SQLite in Ruby is 3-5 times slower than going direct to the filesystem on OS X with journaling on (unless there's something wrong in my testing... silly benchmarks forthcoming). In Java, Perst is *much* faster than the filesystem.

Now, I've thought of implementing the same kind of techniques that some persistent stores use to assure consistency, but directly in the filesystem. The only quick-enough techniques that I can think of would require a recovery step after failure, but that shouldn't be too big a problem (especially since the frequency of failure will be low, judging from my current experience, where I've not had any failures yet).
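
One such technique, sketched minimally in Ruby and assuming POSIX
rename semantics: write to a temp file, flush it, then rename it into
place; recovery after a crash is just discarding leftover temp files.

  def atomic_write(path, data)
    tmp = "#{path}.tmp"
    File.open(tmp, "wb") do |f|
      f.write(data)
      f.fsync                  # force the data to disk before publishing it
    end
    File.rename(tmp, path)     # atomic replace on POSIX filesystems
  end

  # Recovery after a failure: any leftover temp file is an incomplete write.
  def recover(dir)
    Dir.glob(File.join(dir, "*.tmp")).each { |t| File.delete(t) }
  end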

···

On Aug 12, 2006, at 8:27 AM, Francis Cianfrocca wrote:

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>

A persistent session cache was actually my original motivation when I started working on the code.

IOWA's default way of managing sessions is via a cache. For a busy server, this obviously imposes some memory constraints on just how many sessions one can keep around, and for just how long.

So I created a bilevel cache that uses an in-memory LRU cache for the L1 cache, and when an element is expired from the L1 cache, it is inserted into the L2 cache. If the L2 cache is a persistent store, then that session can be retrieved again when required.

By making that persistent store itself an LRU cache, I can still let it self-manage its ultimate size or the ultimate age of the elements within it, I maintain reasonably fast access to elements that have gone out to the persistent store, and I can potentially do things like have the server flush its L1 cache to L2 and shut down. When it's started again, all of the sessions that were available when it was shut down are still available. Etc....
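
A stripped-down sketch of that bilevel arrangement in Ruby (the L2
store's read/write interface below is hypothetical, not the actual
API):

  class BilevelCache
    def initialize(l2_store, l1_max = 1000)
      @l1     = {}            # insertion-ordered hash doubles as the LRU list
      @l1_max = l1_max
      @l2     = l2_store
    end

    def [](key)
      if @l1.key?(key)
        promote(key, @l1.delete(key))
      elsif (val = @l2.read(key))
        promote(key, val)     # fault the session back in from the L2 store
      end
    end

    def []=(key, val)
      @l1.delete(key)
      promote(key, val)
    end

    private

    def promote(key, val)
      @l1[key] = val
      if @l1.size > @l1_max
        old_key, old_val = @l1.shift    # expire the least recently used entry
        @l2.write(old_key, old_val)     # ...into the persistent L2 cache
      end
      val
    end
  end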

Kirk Haines

···

On Mon, 14 Aug 2006, Francis Cianfrocca wrote:

On 8/13/06, Jon Smirl <jonsmirl@gmail.com> wrote:

You are rebuilding many of the features of git.
http://git.or.cz/index.html

Write-once/read-many is the optimization profile for archival storage,
but not for a persistent session cache. You're proving yet again that

Active sessions would stay as files and generate a new key each time
they are changed. Once a day you would sweep them into a pack. You
will need to make some small modifications to the git code to do this,
but all of the hard code is already written.

With git's compression you will be able to store hundreds of millions
of session snapshots in a couple of gigs of disk. It would provide
interesting data for later analysis.
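
For what it's worth, a rough sketch of how that sweep might look if
you simply shelled out to git itself (the store path is made up; it
assumes a bare repository and the git binary on the PATH):

  require "open3"

  STORE = "/var/cache/sessions.git"    # hypothetical bare repository

  # Snapshot one session; content addressing means a changed session
  # gets a new key automatically.
  def snapshot(session_data)
    sha, status = Open3.capture2("git", "--git-dir", STORE,
                                 "hash-object", "-w", "--stdin",
                                 stdin_data: session_data)
    raise "git hash-object failed" unless status.success?
    sha.strip
  end

  # Read a snapshot back by its hash name.
  def fetch(sha)
    out, status = Open3.capture2("git", "--git-dir", STORE,
                                 "cat-file", "blob", sha)
    status.success? ? out : nil
  end

  # The once-a-day sweep: fold the loose snapshots into a single pack.
  # (Stock `git repack` only packs ref-reachable objects, so the keys
  # are fed to `git pack-objects` directly -- roughly the small
  # modification mentioned above.)
  def nightly_sweep(shas)
    pack_base = File.join(STORE, "objects", "pack", "pack")
    Open3.capture2("git", "--git-dir", STORE, "pack-objects", pack_base,
                   stdin_data: shas.join("\n") + "\n")
    system("git", "--git-dir", STORE, "prune-packed")
  end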

···

On 8/13/06, Francis Cianfrocca <garbagecat10@gmail.com> wrote:

On 8/13/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> You are rebuilding many of the features of git.
> http://git.or.cz/index.html

Write-once/read-many is the optimization profile for archival storage,
but not for a persistent session cache. You're proving yet again that
there are almost as many valid and useful approaches to this problem
as there are applications.

--
Jon Smirl
jonsmirl@gmail.com

The fastest thing I ever wrote was a transactional persistent store
for a guaranteed-delivery messaging system. It used a decidedly
space-wasting design, and profiled approximately 100 times faster than
the filesystem on all supported platforms (Windows, Linux, Solaris) at
the time it was written (about eight years ago), assuming enough I/O
and memory bandwidth was available to make the comparison meaningful.
Most of the speed came by taking advantage of the tunings in the
kernel's VMM.

But I can't imagine writing such a thing in Java. What does Perst do
to be so fast?

···

On 8/13/06, Bob Hutchison <hutch@recursive.ca> wrote:
> I absolutely agree which is why I've implemented the filesystem route
> first. There is *no* performance problem with it.

Ok, did some reading up on Perst- no real magic there. They simply put
together a lot of the good techniques you would expect to find in a
virtual arena manager, and they made some guesses about the handful of
tough questions you have to answer. (Allocation-grain size, page size,
etc.) I used the same kind of bitmap strategy for the persistent
GD-message system I mentioned upthread, but Perst makes a different
choice: they sequentially search the bitmap pages for a big-enough
free space when storing an object. Even though they use the obvious
optimization (storing a max-available-size on each page), I remember
being phobic about that approach and did something much more elaborate
(but reliably O(1)). But given your comments about Perst's speed, they
have obviously made the right choices overall!

I think something like Perst could be done in Ruby, possibly augmented
with a tiny C extension to access virtual-memory features.
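
To make that strategy concrete, a toy Ruby sketch (the grain and page
sizes are guesses): bitmap pages are scanned sequentially for a
long-enough run of free grains, and each page caches the size of its
largest free run so most pages can be skipped.

  class BitmapAllocator
    GRAIN         = 64            # bytes per allocation grain (a guess)
    BITS_PER_PAGE = 4096 * 8      # grains covered by one 4k bitmap page

    Page = Struct.new(:bits, :max_free_run)

    def initialize(pages)
      @pages = Array.new(pages) do
        Page.new(Array.new(BITS_PER_PAGE, false), BITS_PER_PAGE)
      end
    end

    # Returns the grain offset of the allocated block, or nil if nothing fits.
    def allocate(bytes)
      need = (bytes + GRAIN - 1) / GRAIN
      @pages.each_with_index do |page, pi|
        next if page.max_free_run < need   # the per-page hint skips this page
        start = find_run(page.bits, need)
        next unless start
        need.times { |i| page.bits[start + i] = true }
        page.max_free_run = longest_free_run(page.bits)
        return pi * BITS_PER_PAGE + start
      end
      nil
    end

    private

    # First-fit search for `need` consecutive free bits.
    def find_run(bits, need)
      run = 0
      bits.each_with_index do |used, i|
        run = used ? 0 : run + 1
        return i - need + 1 if run == need
      end
      nil
    end

    def longest_free_run(bits)
      best = run = 0
      bits.each { |used| run = used ? 0 : run + 1; best = run if run > best }
      best
    end
  end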

···

On 8/13/06, Bob Hutchison <hutch@recursive.ca> wrote:

something wrong in my testing... silly benchmarks forthcoming). In
Java, Perst is *much* faster than the filesystem.

Once again, you're solving a rather different problem (or perhaps
you're redefining the problem to fit your solution): permanently
storing ephemeral session caches for later analysis is interesting but
different from just having a session cache that survives process
crashes. Yes, the "hard work" is done, but if you didn't need to do
the hard work in the first place, it's wasteful. Again, this whole
subthread is about Kirk's filesystem approach, which is attractive for
some applications (not including yours) because it's dead easy and it
may be "fast enough."

···

On 8/13/06, Jon Smirl <jonsmirl@gmail.com> wrote:

Active sessions would stay as files and generate a new key each time
they are changed. Once a day you would sweep them into a pack. You
will need to make some small modifications to the git code to do this,
but all of the hard code is already written.

With git's compression you will be able to store hundreds of millions
of session snapshots in a couple of gigs of disk. It would provide
interesting data for later analysis.

Just FYI, I still haven't finished packing things up into a separate release, but I do have some comparative benchmarks.

On a very modest server (an AMD Athlon 2600 with generic IDE drives, ext3, on a Linux 2.4 kernel) that does have a small load on it, the DiskCache, which carries the overhead of maintaining a linked list of files on disk in order to support LRU semantics, averages around 500-600 writes per second, 700-800 updates per second, 850-950 reads per second, and 1000 deletes per second.

Removing all of the LRU overhead results in a dramatic speedup. Writes experience the least speedup, going to around 2000 per second, give or take a couple hundred depending on other activity. Updates move into the 3800-4200 range. Reads move into the 6000-6200 range, and deletes move to around 2000 per second.

Given the hardware, I found those numbers pleasingly high. And it clearly demonstrates that unless one really needs the LRU capabilities in order to limit the total number of entries to some arbitrary value, there is much to be gained by using the basic DiskStore, and just having an external job run periodically to flush old elements from the disk.

Kirk Haines