[ANN] Ruby/Odeum 0.2 (More Idiomatic, Better Source Layout)

Hello Everyone,

Just another announcement for the Ruby/Odeum project:

http://www.zedshaw.com/projects/ruby_odeum/

Ruby/Odeum is an extension that wraps Mikio Hirabayashi's QDBM Odeum
library for fast reverse indexing of documents. It is in the same
category as Lucene, although slightly lower level (no query language or
document specific lexers). It supports indexing documents by their
words, breaking the text into words, normalizing the words, searching,
and attaching meta-data to each document. It is very fast and the
Ruby/Odeum extension is nice and small with only two classes needed.
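
For readers new to this kind of indexing, here is a minimal toy sketch of the idea in plain Ruby: break text into words, normalize them, file each document under its words, and attach meta-data. This is NOT the Ruby/Odeum API. The ToyIndex class, its method names, and the news:// URIs are all made up for illustration, and everything lives in memory rather than in a QDBM file.

```ruby
# Toy inverted index: NOT the Ruby/Odeum API, just the underlying idea.
class ToyIndex
  def initialize
    @postings = Hash.new { |hash, word| hash[word] = [] } # word => [uri, ...]
    @meta = {}                                            # uri  => attributes
  end

  # Break text into words and normalize them (lowercase, drop punctuation).
  def self.normalize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end

  # Index a document under its URI, with optional meta-data.
  def put(uri, text, attributes = {})
    delete(uri) # re-indexing replaces the old entry
    @meta[uri] = attributes
    self.class.normalize(text).uniq.each { |word| @postings[word] << uri }
  end

  # Remove a document and all of its word entries.
  def delete(uri)
    @meta.delete(uri)
    @postings.each_value { |uris| uris.delete(uri) }
  end

  # Return the URIs of all documents containing the word.
  def search(word)
    key = self.class.normalize(word).first
    key ? @postings.fetch(key, []).dup : []
  end

  # Meta-data stored with a document, or nil if it is not indexed.
  def attributes(uri)
    @meta[uri]
  end
end

index = ToyIndex.new
index.put("news://1", "Ruby wraps QDBM Odeum.", "title" => "Announcement")
index.put("news://2", "Lucene has a query language.", "title" => "Comparison")
hits = index.search("ODEUM") # matches regardless of case
```

The real library does the same job in C on top of a disk-based store, which is where the speed comes from.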

This release features:

1. Much improved project layout with a Rake and setup.rb build.
2. Full RubyDoc documentation for every function.
3. A more complete example script, odeum_mgr, which uses Ruby/Odeum to
index some documents and lets you search them.
4. Some new functions to make the library even easier to use and to help
reduce the number of Ruby-to-C function calls.
5. Bug fixes related to memory management. There seems to be an issue
with the garbage collector not collecting Document objects inside an
each block, so there's a new Document.close function to do it manually.

Ruby/Odeum is licensed under the same license as QDBM (LGPL).

Feel free to download it and send me your comments. I'm already in the
process of removing all C99 code so that it compiles with the old GCC
2.95 (that way you people using old versions of *BSD can avoid
upgrading for one more year :-).

Zed A. Shaw

* Zed A. Shaw wrote:

> Hello Everyone,
>
> Just another announcement for the Ruby/Odeum project:
>
> http://www.zedshaw.com/projects/ruby_odeum/

As a long-time user of QDBM with Ruby, I'm very happy about that.

My problem: in the project that I would want to use Odeum for, a News
Archive, the texts are in a database, not in files. Would it be easy to
use the library with that? I don't have a clear idea where to start. The
only way I could think of is to make my own server application that taps
into the database and serves the texts, but I would be happy if there is
a solution with less overhead.

···

--
Oliver C.
45n31, 73w34
Temperature: 10.1°C (22 April 2005 12:00 PM EDT)

Hi Oliver,

Several people have asked the same question, so I'm going to include a
HOWTO doc in the next release explaining some common usage. For now, the
best thing is to read bin/odeum_mgr, which shows a typical set of
indexing and searching calls. The odeum_mgr file is pretty small, and
hopefully you can understand it.

In your case I'd say that you have a few options depending on how you
store/use your data:

1. Write a stand-alone process that periodically goes through the
database, compares modified times of articles, and updates the odeum
index on disk. It should also periodically cull (remove) articles which
don't exist anymore.

2. Add to your "store/edit/delete article" procedure a small side
operation that also indexes all of the text in the article with
ruby/odeum after it is put in the database. This makes the search
available immediately.

3. A kind of hybrid where you have a thread in your program that
listens to a queue. When an article is stored/updated/deleted you put
a little message on the queue saying "article 333444 deleted". The
thread then just reads stuff off the queue and does the index updating
based on what it's told.

The advantage of #1 is that it's easier to control when the index is
being updated and you don't need to worry as much about read/write
locking since Odeum will do it for you (in theory). This also has a
nice separation since your search feature only opens the database in
read mode, and the indexer is the only thing opening it in write mode.

The disadvantage of #1 is that your articles aren't immediately available
for searching.
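
A rough sketch of what one pass of the option #1 process might look like. Everything here is a stand-in: the Article struct plays the role of your database rows, a plain Hash plays the role of the Odeum index, and sync_index is a made-up name. In real code the two marked spots would call into Ruby/Odeum instead.

```ruby
# Option #1 sketch: one pass of a stand-alone sync process.
# Article stands in for a database row; `index` is a plain Hash standing in
# for the Odeum index. Both are hypothetical.
Article = Struct.new(:id, :text, :modified_at)

# Returns [ids that were (re)indexed, ids that were culled].
def sync_index(articles, index)
  current = {}
  articles.each { |article| current[article.id] = article }

  # Index anything new, or modified since it was last indexed.
  updated = current.values.select do |article|
    entry = index[article.id]
    entry.nil? || entry[:indexed_at] < article.modified_at
  end
  updated.each do |article|
    # Real code would store the document in the Odeum index here.
    index[article.id] = { words: article.text.downcase.scan(/\w+/),
                          indexed_at: article.modified_at }
  end

  # Cull index entries whose article no longer exists in the database.
  culled = index.keys - current.keys
  culled.each { |id| index.delete(id) } # real code: remove from Odeum here

  [updated.map(&:id), culled]
end
```

You would run this from cron, or in a loop with a sleep between passes. Because it is the only writer, the search side can keep opening the index read-only.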

Option #2 fixes the immediacy problem, but you'll get into some delays
if you have more than one attempt to update the index at the same time.
Odeum does a good job of read/write locking the index using OS level
thread locking (if it's supported), so your biggest risk is your program
crashing in the middle of the index update (which hoses the index
usually). Basically, #2 will drive you insane trying to manage the
writers fighting over the index.

Option #3 is kind of in-between the other two: it's a little easier to
implement and control than #2, not quite as easy as #1, but gives you
nearly the immediacy of #2.
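
Here is a minimal sketch of the option #3 idea using Ruby's standard Queue and Thread. The IndexWorker class and its method names are invented for this example, and the Hash is again only a stand-in for the Odeum index. The point is just that a single thread owns all the writes while the rest of the program drops messages on the queue.

```ruby
require 'thread' # provides Queue on older Rubies; harmless on newer ones

# Option #3 sketch: one worker thread owns the index; everyone else just
# queues messages. The Hash index is a hypothetical stand-in for Odeum.
class IndexWorker
  def initialize(index)
    @index = index
    @queue = Queue.new
    @thread = Thread.new { run }
  end

  # Called from the store/edit code path.
  def article_saved(id, text)
    @queue << [:save, id, text]
  end

  # Called from the delete code path.
  def article_deleted(id)
    @queue << [:delete, id]
  end

  # Process everything already queued, then stop the worker.
  def shutdown
    @queue << [:stop]
    @thread.join
  end

  private

  def run
    loop do
      action, id, text = @queue.pop # blocks until a message arrives
      case action
      when :save   then @index[id] = text.downcase.scan(/\w+/)
      when :delete then @index.delete(id)
      when :stop   then break
      end
    end
  end
end
```

Messages are handled in the order they were queued, so a save followed by a delete of the same article leaves the index without it. Since only the worker thread touches the index, the writers never fight each other.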

Now, searching is easy. Once you have the articles indexed and put into
the odeum storage, you simply need to open it and do a search. The
results are a series of Document objects with URIs for names, meta-data
attached, and words you can use to summarize the document. Pretty much
everything you need to find the article in the database and show the
user a summary. In the search results, just show the summary from
odeum, and wait to show them the full article until they click on the
link. That cuts down on database traffic since all of the relevant
words are stored right in the odeum index.
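
As a small illustration of building a results page from the index alone, the sketch below assumes each hit carries a URI, a hash of meta-data, and a list of words. The Hit struct, the "title" attribute, the article:// URI, and result_line are assumptions made for the example; the real Document class may expose these differently.

```ruby
# Hypothetical shape of a search hit; the real Document class may differ.
Hit = Struct.new(:uri, :attributes, :words)

# Build one line of a results page without touching the database.
def result_line(hit, max_words = 8)
  title = hit.attributes.fetch("title", hit.uri) # fall back to the URI
  summary = hit.words.first(max_words).join(" ")
  summary += " ..." if hit.words.length > max_words
  "#{title} <#{hit.uri}>: #{summary}"
end

hit = Hit.new("article://333444", { "title" => "Election Results" },
              %w[the votes were counted late on tuesday night in the capital])
line = result_line(hit)
```

The full article is only fetched from the database when the user follows the URI.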

Feel free to contact me offline if you want more advice. I'm going to
be making some changes to QDBM Odeum for Mikio, and also including some
more features into Ruby/Odeum in the next release.

Zed

···

On Sat, 2005-04-23 at 06:54 +0900, Oliver Cromm wrote:
