Q: what database would you suggest?

Hi,
I am doing some scientific work in Ruby. So far the data is stored
in a massive file (just Marshalled or YAMLed) and then, once the run
is over, it is loaded back into memory for some data analysis.
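
To give an idea, the current approach is essentially the following
(the file name and data layout here are just illustrative):

  require 'yaml'

  # During the run: accumulate everything in memory, dump it once at the end.
  results = { "run_42" => { "params"  => { "beta" => 0.5 },
                            "samples" => [[1, 0.13], [2, 0.27]] } }
  File.open("results.yml", "w") { |f| YAML.dump(results, f) }

  # During analysis: load the whole file back and slice it in memory.
  data     = YAML.load_file("results.yml")
  energies = data["run_42"]["samples"].map { |step, energy| energy }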

This not-very-elegant method has worked so far, but as I will need
to scale up to several GBs, the time lost loading and slicing the
data in memory is getting painful. Hence I thought I could take
advantage of the many DBs that exist for Ruby. Alas, there are so
many, and I've got no database experience, so I am not sure what
would be best.

The data can be thought of as hierarchical tables: i.e. tables in the
SQL sense, with *some* entries corresponding to entire new tables, and
the others either strings or numbers (integers and floats). I only
have access to one computer, though with multiple processors, so I
don't have any need to distribute the data. Finally, I'd like speed,
but given the database size I can't keep all the info in memory.

The kind of operations I need to do involve getting slices of tables
(all the rows but a subset of columns), often across all the
"subtables" I've got. I also need to add a few rows to each table
while the simulation is running (but not during analysis); however,
the rate is rather low, so I don't expect this to be much of a
constraint.

Could people with more DB experience give a few suggestions? I've
read about MongoDB, CouchDB, MySQL and PostgreSQL... and to be honest
I am pretty lost, so I am sending this in the hope that some fellow
Rubyist can help a poor DB newbie. :wink:

Thank you to all in advance.
Diego Virasoro

SQLite[0] might be worth a look, especially if you can split your data
across several instances of SQLite.

The key difference between SQLite and the likes of MySQL, PostgreSQL,
and others is that SQLite is not a server: it doesn't offer remote
access. You say you don't need that, so it is not an issue.

Additionally, you have no overhead of administering (or at least
setting up) a database server and its users, and SQLite is lighter on
your resources. :wink:

If you have data that doesn't lend itself well to storage within the
rather rigid confines of SQL, take a look at NoSQL DBs like CouchDB
or MongoDB. All of those (and the SQL engines) have good to great
Ruby bindings, so you can use them with relative ease.

[0] http://sqlite.org/
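
Just to sketch what that looks like from Ruby with the sqlite3 gem
(the table, columns, and file name are made up, not a prescription):

  require 'sqlite3'

  db = SQLite3::Database.new("simulation.db")

  db.execute <<-SQL
    CREATE TABLE IF NOT EXISTS measurements (
      run_id  INTEGER,
      step    INTEGER,
      energy  REAL,
      comment TEXT
    );
  SQL

  db.execute("INSERT INTO measurements VALUES (?, ?, ?, ?)", [1, 10, 0.13, "ok"])

  # A "slice": all rows, but only a subset of columns.
  db.execute("SELECT step, energy FROM measurements") do |step, energy|
    puts "#{step}: #{energy}"
  end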

···

On Sun, Dec 5, 2010 at 7:15 PM, Diego Virasoro <diego.virasoro@gmail.com> wrote:

The data can be thought of as hierarchical tables: i.e. tables in the
SQL sense, with *some* entries corresponding to entire new tables, and
the others either strings or numbers (integers and floats). I only
have access to one computer, though with multiple processors, so I
don't have any need to distribute the data. Finally, I'd like speed,
but given the database size I can't keep all the info in memory.

The kind of operations I need to do involve getting slices of tables
(all the rows but a subset of columns), often across all the
"subtables" I've got. I also need to add a few rows to each table
while the simulation is running (but not during analysis); however,
the rate is rather low, so I don't expect this to be much of a
constraint.

Could people with more DB experience give a few suggestions? I've
read about MongoDB, CouchDB, MySQL and PostgreSQL... and to be honest
I am pretty lost, so I am sending this in the hope that some fellow
Rubyist can help a poor DB newbie. :wink:

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.

SQLite sounds like a good start as Phillip Gawlowski says. It's widely
used, simple and straightforward. I doubt you'll have a problem with
size - see this:
http://www.sqlite.org/limits.html

···

--
Posted via http://www.ruby-forum.com/.

My recommendation would be PostgreSQL. If you don't want to deal with PostgreSQL initially, then SQLite3; migrating from SQLite3 to PostgreSQL should be pretty straightforward if you desire or need to do so later. If you go with an ORM, make sure that it lets YOU write SQL queries if you want to -- ORMs may not build the most appropriate SQL query for YOUR database schema, and this may be the difference between an acceptable and an unacceptable query response time.
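
As a rough illustration of that last point (connection URL and table
are made up), Sequel, for instance, lets you mix its query builder
with SQL you write yourself:

  require 'sequel'

  DB = Sequel.connect('postgres://localhost/simulations')  # or 'sqlite://simulations.db'

  # Let the library build the query...
  rows = DB[:measurements].select(:step, :energy).where(:run_id => 1).all

  # ...or write the SQL yourself when you know your schema better than it does.
  rows = DB['SELECT step, energy FROM measurements WHERE run_id = ?', 1].all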

···

On 12/5/2010 1:15 PM, Diego Virasoro wrote:

Could people with more DB experience give a few suggestions? I've
read about MongoDB, CouchDB, MySQL and PostgreSQL... and to be honest
I am pretty lost, so I am sending this in the hope that some fellow
Rubyist can help a poor DB newbie. :wink:

Berkeley DB, for sure.

···

--
Posted via http://www.ruby-forum.com/.

Oh, I didn't know that they had implemented an SQL API.

···

--
Posted via http://www.ruby-forum.com/.

Diego Virasoro <diego.virasoro@gmail.com> writes:

Hi,
I am doing some scientific work in Ruby. So far the data is stored
in a massive file (just Marshalled or YAMLed) and then, once the run
is over, it is loaded back into memory for some data analysis.

This not-very-elegant method has worked so far, but as I will need
to scale up to several GBs, the time lost loading and slicing the
data in memory is getting painful. Hence I thought I could take
advantage of the many DBs that exist for Ruby. Alas, there are so
many, and I've got no database experience, so I am not sure what
would be best.

The data can be thought of as hierarchical tables: i.e. tables in the
SQL sense, with *some* entries corresponding to entire new tables, and
the others either strings or numbers (integers and floats). I only
have access to one computer, though with multiple processors, so I
don't have any need to distribute the data. Finally, I'd like speed,
but given the database size I can't keep all the info in memory.

The kind of operations I need to do involve getting slices of tables
(all the rows but a subset of columns), often across all the
"subtables" I've got. I also need to add a few rows to each table
while the simulation is running (but not during analysis); however,
the rate is rather low, so I don't expect this to be much of a
constraint.

Could people with more DB experience give a few suggestions? I've
read about MongoDB, CouchDB, MySQL and PostgreSQL... and to be honest
I am pretty lost, so I am sending this in the hope that some fellow
Rubyist can help a poor DB newbie. :wink:

Thank you to all in advance.
Diego Virasoro

From what you've described, I recommend you spend some time looking into
MongoDB. I've been using relational databases for decades, and I've
recently developed a couple applications using MongoDB. It's fast,
handles hierarchical data well, has a powerful query language, good
indexing, etc.

One of my projects has over 40 million documents, and MongoDB is
handling it *much* better than Postgres or MySQL.

It does seem to use more disk space than Postgres (my typical RDBMS),
but I accept that penalty for the increase in speed and flexibility
in programming.
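
A rough sketch of what that looks like from Ruby with the mongo gem
(the 1.x driver; newer driver versions spell these calls a bit
differently, and the collection layout here is just an example):

  require 'mongo'

  db   = Mongo::Connection.new("localhost", 27017).db("simulations")
  runs = db.collection("runs")

  # Hierarchical data: a "subtable" is just an embedded array of documents.
  runs.insert("name"    => "run_42",
              "params"  => { "beta" => 0.5 },
              "samples" => [{ "step" => 1, "energy" => 0.13 },
                            { "step" => 2, "energy" => 0.27 }])

  # A column slice: only pull back the fields you need.
  runs.find({}, :fields => ["name", "samples.energy"]).each { |doc| p doc }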

···

--
Brian Adkins
http://lojic.com/

I don't have benchmarks handy, but for really large volumes I'd turn to PostgreSQL, especially if you need the ability to manipulate the data manually and to do backups, parallel accesses, and the like.

Kind regards

  robert

···

On 05.12.2010 19:52, Mike Stephens wrote:

SQLite sounds like a good start as Phillip Gawlowski says. It's widely
used, simple and straightforward. I doubt you'll have a problem with
size - see this:
http://www.sqlite.org/limits.html

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

SQLite[0] might be worth a look, especially if you can split your data
across several instances of SQLite.

Thanks, I'll start with this then.

Any suggestions on the Ruby wrapper? It seems that DataMapper is
pretty good and "hides" some of the complexities of SQL. But I've
heard of Sequel as well, and of some wrappers closer to the SQL
itself, such as ruby-sqlite.

Which one would you suggest to start with?

Thank you again
Diego

Interestingly enough, the latest, greatest Berkeley DB seems to offer a "SQLite-compatible SQL API":

http://www.oracle.com/us/corporate/press/063695

···

On Dec 7, 2010, at 12:04 AM, Eugeni Akmuradov wrote:

Berkeley DB, for sure.

For my two cents: I am fairly database-illiterate; the only thing I
have read a decent amount about is normalisation of data.

I am using Sequel and MySQL. I chose MySQL because it is popular,
with a wide range of documentation and community support, but mostly
because it has MySQL Workbench, which lets me see graphically what I
am doing. As a beginner whose only prior experience was Access 2000
and 2007, that was a plus. You can turn networking off in the MySQL
setup to set up a single-developer PC easily.

There is also a very helpful Sequel group if you're having any issues,
http://groups.google.com/group/sequel-talk, and again enough docs for
a beginner here: http://sequel.rubyforge.org/

I am a database know-nothing and Sequel and MySQL work for me. And
because it's fairly easy, I am learning more.

···

On Dec 7, 7:39 pm, Eugeni Akmuradov <e.akmura...@gmail.com> wrote:

Oh, I didn't know that they had implemented an SQL API.

--
Posted via http://www.ruby-forum.com/.

Without knowing your situation: I'd use sqlite3-ruby, or DBI (which
still allows you to access different DB engines without performing
too much magic behind the scenes). From what you wrote, it doesn't
look like an ORM (like DataMapper or ActiveRecord) would be a good
fit, and an ORM might make the things you want to do (like analysis
of your data) more difficult than necessary.

If you are really unsure: grab what you find interesting, and do a
little prototyping with a reduced, somewhat simplified version of the
dataset you'll be working with. Once you have something that you are
happy with, stick to that, instead of having to discard half your
work because you start hating the way your chosen tool behaves.
:slight_smile:

···

On Sun, Dec 5, 2010 at 10:30 PM, Diego Virasoro <diego.virasoro@gmail.com> wrote:

SQLite[0] might be worth a look, especially if you can split your data
across several instances of SQLite.

Thanks, I'll start with this then.

Any suggestions on the Ruby wrapper? It seems that DataMapper is
pretty good and "hides" some of the complexities of SQL. But I've
heard of Sequel as well, and of some wrappers closer to the SQL
itself, such as ruby-sqlite.

Which one would you suggest to start with?

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.

IMHO you should definitely have a look at the latest Rails 3 version
of Active Record, which uses Arel: http://github.com/rails/arel

Active Record can be used separately from Rails (it always could),
but Rails 3 makes it easier than ever to pick out particular parts of
Rails.
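
For what it's worth, Active Record outside Rails boils down to
roughly this (the adapter, table, and model are made up for the
example, and the table is assumed to exist already):

  require 'active_record'

  ActiveRecord::Base.establish_connection(
    :adapter  => 'sqlite3',
    :database => 'simulation.db'
  )

  class Measurement < ActiveRecord::Base
  end

  # Arel underneath: relations compose lazily, and you can ask for
  # just the columns you need.
  Measurement.where(:run_id => 1).select("step, energy").each do |m|
    puts "#{m.step}: #{m.energy}"
  end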

···

On Sun, Dec 5, 2010 at 4:30 PM, Diego Virasoro <diego.virasoro@gmail.com> wrote:

SQLite[0] might be worth a look, especially if you can split your data
across several instances of SQLite.

Thanks, I'll start with this then.

Any suggestions on the Ruby wrapper? It seems that DataMapper is
pretty good and "hides" some of the complexities of SQL. But I've
heard of Sequel as well, and of some wrappers closer to the SQL
itself, such as ruby-sqlite.

Which one would you suggest to start with?

--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Github: http://github.com/rubyredrick
Twitter: @RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

Whichever you like -- the important thing is to start.

Personally, I'd suggest DataMapper, mostly because it's likely to make your
application least bound to SQL at all (let alone SQLite) if you decide to
change in the future.

One thing to be aware of with SQLite is that it essentially locks the entire
database for any operations. This may affect you if your code is sufficiently
multithreaded (and you use a Ruby which supports this, like JRuby) or if you
intend to launch multiple worker processes. In either case, you may want to
swap SQLite out for something else eventually.

But especially if you're building on something like DM, you can easily swap
SQLite out for something else in the future.
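
To make the DataMapper route concrete, here's a minimal sketch (the
model, fields, and connection URI are made up; pointing the URI at
another adapter is how you'd move off SQLite later):

  require 'dm-core'
  require 'dm-migrations'

  DataMapper.setup(:default, "sqlite://#{Dir.pwd}/simulation.db")

  class Measurement
    include DataMapper::Resource
    property :id,     Serial
    property :run_id, Integer
    property :step,   Integer
    property :energy, Float
  end

  DataMapper.finalize
  DataMapper.auto_upgrade!   # create/extend the table to match the model

  Measurement.create(:run_id => 1, :step => 1, :energy => 0.13)

  # Fetch only the columns you care about.
  Measurement.all(:run_id => 1, :fields => [:step, :energy]).each do |m|
    puts "#{m.step}: #{m.energy}"
  end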

···

On Sunday, December 05, 2010 03:30:18 pm Diego Virasoro wrote:

> SQLite[0] might be worth a look, especially if you can split your data
> across several instances of SQLite.

Thanks, I'll start with this then.

Any suggestions on the Ruby wrapper? It seems that DataMapper is
pretty good and "hides" some of the complexities of SQL. But I've
heard of Sequel as well, and of some wrappers closer to the SQL
itself, such as ruby-sqlite.

Which one would you suggest to start with?

That's not very accurate. SQLite has a very sophisticated system of staged locks, and it's definitely possible to have multiple clients reading data at once.

James Edward Gray II

···

On Dec 5, 2010, at 5:48 PM, David Masover wrote:

One thing to be aware of with SQLite is that it essentially locks the entire database for any operations.

I learned that the hard way. I innocently tried to fork a process
that used SQLite, and the whole thing fell apart in not-so-obvious
ways. Many hours were wasted recovering from that one. So yes, safe
SQLite equals a single process.

Regards,
Ammar

···

On Mon, Dec 6, 2010 at 1:48 AM, David Masover <ninja@slaphack.com> wrote:

One thing to be aware of with SQLite is that it essentially locks the entire
database for any operations. This may affect you if your code is sufficiently
multithreaded (and you use a Ruby which supports this, like JRuby) or if you
intend to launch multiple worker processes. In either case, you may want to
swap SQLite out for something else eventually.

least bound to SQL at all (let alone SQLite) if you decide to
change in the future.

"Prepare Now For Possible Future Head Transplant"

One thing to be aware of with SQLite is that it essentially locks the entire
database for any operations.

Hmmm?!? Not quite.

http://www.sqlite.org/lockingv3.html
http://www.sqlite.org/wal.html

···

On Dec 6, 2010, at 12:48 AM, David Masover wrote:

> least bound to SQL at all (let alone SQLite) if you decide to
> change in the future.

"Prepare Now For Possible Future Head Transplant"
http://database-programmer.blogspot.com/2010/11/prepare-now-for-possible-future-head.html

I don't have time to do a point-by-point rebuttal, so for now:

"So what about that weird title involving head transplants? Obviously a head
transplant is impossible, making it also very unlikely, besides being silly
and non-sensical. It came to mind as a kind of aggregrate of all of the
bizarre and unrealistic ideas about abstracting data designs that I have heard
over the years."

Huh?

Having something that's at least abstract enough to deal with multiple
databases is something I have actually used. That is, as a matter of
convenience, I've actually developed applications using SQLite and deployed
them to MySQL.

And it's not always just SQL.

I've also migrated models from one application to another -- that is, I took a
model which used to be internal to both applications A and B, along with code
that made assumptions about that model, and made it live in application A,
with application B accessing it via REST.

So, if you're keeping track, that's the same model, used for both REST and
SQL. Far from impossible, silly, or unlikely, the fact that we were using
abstractions which allowed for this possibility increased the likelihood that
we'd actually do it.

It's also possible the article misses one of the main points of an ORM. Even
if I were writing a SQL-only application, I'd have to be insane to use raw SQL
to deal with it when I can have 90% of the work done for me by the ORM -- that
is, 90% of what any application does is trivial CRUD, and there's no reason I
should have to rewrite that every time.

> One thing to be aware of with SQLite is that it essentially locks the
> entire database for any operations.

Hmmm?!? Not quite.

http://www.sqlite.org/lockingv3.html
http://www.sqlite.org/wal.html

Yeah, reading the 3.0 summary confirms that all writes take an exclusive lock
over the entire database. My bad, that is different from locking for _all_
operations, but it's still something you want to be aware of during the design
phase, especially if someone follows your advice and doesn't abstract
sufficiently to be able to migrate away easily later.
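
For completeness, the write-ahead log mentioned above can be switched
on per database file from the sqlite3 gem (needs SQLite 3.7 or later;
the file name is made up):

  require 'sqlite3'

  db = SQLite3::Database.new("simulation.db")

  # In WAL mode a writer no longer blocks readers, which takes much of
  # the sting out of the whole-database write lock at a low write rate.
  puts db.get_first_value("PRAGMA journal_mode=WAL")   # => "wal" on success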

···

On Monday, December 06, 2010 01:48:18 pm Petite Abeille wrote:

On Dec 6, 2010, at 12:48 AM, David Masover wrote: