Should I use a database or a flat file?

I need to store some information with my Ruby program and I am not sure
what would be the best method. I'm mostly concerned about the most
efficient use of CPU resources.

Basically, I will have a list of names, each belonging to one of 5
categories. Sort of like this:

Cat1
-name1
-name2
-name3
-etc...

Cat2
-name4
-name5
-name6
-etc...

Cat3
-name7
-name8
-name9
-etc...

There will be hundreds of names, evenly divided among the categories.
But each name will go in only one category; there is no relation between
categories or anything like that. All the information will be
completely rewritten once a day and then read several times throughout
the day.

My choices for storage are an SQLite database (using ActiveRecord), a
flat text file of my own design, a YAML file, or an XML file.


--
Posted via http://www.ruby-forum.com/.

I need to store some information with my ruby program and I am not sure
on what would be the best method. I'm mostly concerned about what would
be the most efficient use of cpu resources.

Basically, I will have a list of names each belonging to one of 5
categories. Sort of like this:

Cat1
-name1
-name2
-name3
-etc...

Cat2
-name4
-name5
-name6
-etc...

Cat3
-name7
-name8
-name9
-etc...

There will be hundreds of names, evenly divided between the categories.

That's not much. I'd probably use XML - but that also depends on what
generates the data and what needs to be able to read it. You can
generate and read it efficiently (using a stream parser, for example,
though that seems unnecessary for only hundreds of names).

But ultimately it depends on what you want to do with the data. In
some cases a DB might be a better choice - for instance, if your
volume is going to increase dramatically.
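For anyone curious what the XML route looks like with just the standard library, here is a rough sketch using REXML's tree API (element and attribute names here are made up for illustration; as noted above, a stream parser would be overkill at this size):

```ruby
require 'rexml/document'

# Build a small document: one element per category, one child per name.
doc = REXML::Document.new
root = doc.add_element('categories')
{ 'Cat1' => %w[name1 name2], 'Cat2' => %w[name3 name4] }.each do |cat, names|
  cat_el = root.add_element('category', 'id' => cat)
  names.each { |n| cat_el.add_element('name').text = n }
end

xml = ''
doc.write(xml)   # serialize to a string (could be a File instead)

# Read it back: collect the names grouped by category.
parsed = REXML::Document.new(xml)
result = {}
parsed.elements.each('categories/category') do |cat_el|
  result[cat_el.attributes['id']] = cat_el.elements.to_a('name').map(&:text)
end
p result  # => {"Cat1"=>["name1", "name2"], "Cat2"=>["name3", "name4"]}
```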

But each name will go in only one category, there is no relation between
categories or anything like that. All the information will be
completely rewritten once a day and then read several times throughout
the day.

My choices for storage are an sqlite database (using ActiveRecord), a
flat text file of my own design, a YAML file, or an XML file.

YAML is another nice alternative because it is human-readable. And
you can use Marshal if the producer and consumer of the data are both
Ruby programs.
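A minimal round-trip sketch of both suggestions, using only the standard library (the data shape is just a guess at what the OP described):

```ruby
require 'yaml'
require 'tmpdir'

data = {
  'Cat1' => %w[name1 name2 name3],
  'Cat2' => %w[name4 name5 name6],
}

dir = Dir.mktmpdir

# YAML: human-readable text, can be inspected and edited by hand.
yaml_path = File.join(dir, 'names.yml')
File.write(yaml_path, YAML.dump(data))
from_yaml = YAML.load(File.read(yaml_path))

# Marshal: compact binary, Ruby-to-Ruby only, typically faster to parse.
marshal_path = File.join(dir, 'names.dump')
File.binwrite(marshal_path, Marshal.dump(data))
from_marshal = Marshal.load(File.binread(marshal_path))

p from_yaml == data     # => true
p from_marshal == data  # => true
```

Either round-trips plain hashes and arrays losslessly; the choice mostly comes down to whether a human ever needs to read the file.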

Kind regards

robert


2008/4/1, James Dinkel <jdinkel@gmail.com>:

--
use.inject do |as, often| as.you_can - without end

James Dinkel wrote:

I need to store some information with my ruby program and I am not sure
on what would be the best method. I'm mostly concerned about what would
be the most efficient use of cpu resources.

Basically, I will have a list of names each belonging to one of 5
categories. Sort of like this:

Cat1
-name1
-name2
-name3
-etc...

Cat2
-name4
-name5
-name6
-etc...

Cat3
-name7
-name8
-name9
-etc...

There will be hundreds of names, evenly divided between the categories.
But each name will go in only one category, there is no relation between
categories or anything like that. All the information will be
completely rewritten once a day and then read several times throughout
the day.

My choices for storage are an sqlite database (using ActiveRecord), a
flat text file of my own design, a YAML file, or an XML file.

IMHO, databases are best when you have concurrent access to data that is modified regularly and you want to enforce constraints during concurrent write accesses.

In your case, the data is mostly static and the constraints are easily handled outside the storage layer (you overwrite all the data with another consistent version in one pass). I'd advise using the simplest storage method, which is probably a YAML dump of an object holding all this data.

Marshal.dump/load is an option too. It may be faster than YAML if that matters to you (I've not benchmarked it, so you'd better do so if you need fast read/write). It's not human-readable, though, which can be a drawback when debugging.

That was the code/integration-complexity side of your problem.

For the performance side:

If you dump your data to a temporary file and then rename it over the final destination, you can use a neat trick for long-running processes that need fresh data: design a little cache that checks the mtime of the backing store (the final destination) on read accesses and reloads it when it changes.
mtime checks are cheap and simple to code, and if the need arises for really high throughput you can minimize them by adding TTL logic.
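A rough sketch of that scheme, with made-up class and file names: the writer dumps to a temp file and renames it over the destination, and the reader reparses only when the mtime changes:

```ruby
require 'yaml'
require 'tmpdir'

# Writer: dump to a temp file in the same directory, then rename it over
# the destination. rename(2) is atomic on POSIX filesystems, so a reader
# sees either the complete old file or the complete new one.
def publish(path, data)
  tmp = "#{path}.tmp.#{Process.pid}"
  File.write(tmp, YAML.dump(data))
  File.rename(tmp, path)
end

# Reader: cache the parsed object and reparse only when the backing
# file's mtime changes (the cheap check described above).
class FreshCache
  def initialize(path)
    @path = path
    @mtime = nil
    @data = nil
  end

  def data
    m = File.mtime(@path)
    if m != @mtime
      @data = YAML.load(File.read(@path))
      @mtime = m
    end
    @data
  end
end

path = File.join(Dir.mktmpdir, 'names.yml')
publish(path, 'Cat1' => %w[name1 name2])
cache = FreshCache.new(path)
p cache.data  # => {"Cat1"=>["name1", "name2"]}
```

One caveat: some filesystems have one-second mtime granularity, so two publishes within the same second can go unnoticed; a short TTL between stat calls is about batching the checks, not fixing that.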

Lionel

James Dinkel wrote:

I need to store some information with my ruby program and I am not sure
on what would be the best method. I'm mostly concerned about what would
be the most efficient use of cpu resources.

One option is FSDB[1] (file-system database), with one file per "category", and each file stored as YAML. This scales as well as your file system scales, is always human-readable, and should be fairly efficient. (It's thread and process safe too, not that it matters for your app.)

For example:

   require 'fsdb'
   require 'yaml'

   db = FSDB::Database.new "~/tmp/my_data"
   db.formats = [FSDB::YAML_FORMAT] + db.formats

   3.times do |i|
     db["Cat#{i}.yml"] = %w{
       name1
       name2
       name3
     }
   end

   path = "Cat1.yml"

   puts "Here's the object:"
   puts "=================="
   p db[path]
   puts "=================="
   puts

   puts "Here's the file:"
   puts "=================="
   puts File.read(File.join(db.dir, path))
   puts "=================="
   puts

and this is the output:

Here's the object:
==================
["name1", "name2", "name3"]
==================

Here's the file:
==================
---
- name1
- name2
- name3
==================

The dir structure looks like this:

[~/tmp] ls my_data
Cat0.yml Cat1.yml Cat2.yml

[1] http://redshift.sourceforge.net/fsdb

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

But ultimately it depends on what you want to do with the data.

yeah, it's kinda hard to describe without just posting my entire script,
which I doubt people will want to read.

The data will be accessed by one ruby script, running on one computer.
The data will be read in, then the file closed and done for a couple
hours. So no concurrent access, no relations, no keeping the connection
open for extended periods of time, which is why I thought a database
would probably be overkill and just add overhead.

But I didn't know if maybe reading a file into memory would take more
effort than reading entries from a database. Also, I was a little off
on the numbers; I meant to say that there are hundreds of names per
category, so the total could be over a thousand names. That size will
likely never change beyond +/- 100 at the most.

Thanks for the info. I'm really a newb at this, so any thoughts on
storing data with any of these methods are helpful.

James.


I was thinking that maybe the OP could use something like KirbyBase.
I've used it before, and it allows the code to stay very portable because
KirbyBase is just Ruby code.

You can locate it here:
http://rubyforge.org/projects/kirbybase

/Shawn


On Tue, Apr 1, 2008 at 11:11 AM, Joel VanderWerf <vjoel@path.berkeley.edu> wrote:


Seems like the type of problem that's perfect for YAML.


On Tue, Apr 1, 2008 at 11:32 AM, James Dinkel <jdinkel@gmail.com> wrote:


> But ultimately it depends on what you want to do with the data.

yeah, it's kinda hard to describe without just posting my entire script,
which I doubt people will want to read.

I found that plain English works best for anything that is longer than
a few lines. :-)

The data will be accessed by one ruby script, running on one computer.
The data will be read in, then the file closed and done for a couple
hours. So no concurrent access, no relations, no keeping the connection
open for extended periods of time, which is why I thought a database
would probably be overkill and just add overhead.

Yep.

But I didn't know if maybe reading a file into memory would take more
effort than reading entries from a database. Also, I was a little off
on the numbers, I meant to say that there are hundreds of names per
category,

You *did* say that.

so total names could be over a thousand. That size will
likely never ever change beyond +/- 100 at the most.

1000 is a really modest number. I did a quick test, also for
illustration (script attached).

17:51:23 /c/Temp
$ ./yam.rb
0.010 create
0.261 write
0.025 load
17:52:20 /c/Temp

Times are in seconds.

Thanks for the info. I'm really a newb at this, so any thoughts on
storing data using any of these methods is helpful.

You're welcome.

Kind regards

robert

yam.rb (658 Bytes)


2008/4/1, James Dinkel <jdinkel@gmail.com>:


I'm going to slightly disagree with Lionel -- and also Robert -- on
this one. First of all, a database is not necessarily just for
concurrency. It's for data integrity, and it lets you build reports
on that data that you can trust because of the strict nature of the
underlying data store (I'm talking about RDBMSs, but I've kept my
eyes open about OO databases as well; stay away from Pick, though!!).
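For illustration, the OP's data in relational form might look like the following sketch (table and column names are made up). Note the UNIQUE and REFERENCES constraints: they put the "each name belongs to exactly one category" rule in the storage layer rather than the application:

```sql
-- One row per category, one row per name.
CREATE TABLE categories (
  id   INTEGER PRIMARY KEY,
  name TEXT NOT NULL UNIQUE          -- e.g. 'Cat1' .. 'Cat5'
);

CREATE TABLE names (
  id          INTEGER PRIMARY KEY,
  name        TEXT NOT NULL UNIQUE,  -- a name may appear only once...
  category_id INTEGER NOT NULL
    REFERENCES categories(id)        -- ...and in exactly one category
);
```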

Here's the problem with relational databases (RDBMSs), though: it's
hard to model a hierarchy (something you can pull off, somewhat
clumsily, with XML).

If you are not going to do serious queries and inserts on the DB, and
your data isn't complex, then a flat-file approach might work. It
works, after all, for software builds. I strongly recommend against
it in higher-level languages, though, even for small apps. And, no, I
am not a database vendor.

I always tell people they should learn SQL, but nowadays I'm getting a
cold shoulder, especially from OO people :-)

The other important thing that I've noticed about data and storage is:
what do you want to do with it and how often? Store it, query it (and
how), add to it, move it around, archive it, etc. These are important
factors to consider.

Todd


On Tue, Apr 1, 2008 at 10:32 AM, James Dinkel <jdinkel@gmail.com> wrote:


Wow, this has been a very good discussion. Feel free to keep
discussing, but, being as I'm the OP, I just thought I would let you
know that I think I will go with YAML for this case.


Just a quick change of the script to make the volume more realistic.

robert

yam.rb (653 Bytes)

Todd Benson wrote:

I'm going to slightly disagree with Lionel -- and also Robert -- on
this one. First of all, a database is not necessarily just for
concurrency. It's for data integrity

Yes, I agree (as explained below, concurrency is what I consider the main problem to solve to enforce data integrity). That said, if you write your data in one pass as the OP does, you don't need data integrity in the storage layer... rename is atomic: either you renamed the temp file to its final position before the crash or you didn't.

The problems are partial updates, where you need to maintain consistency. Off the top of my head, the only problems with partial updates are:
- concurrent accesses (most common, counting both concurrent read and write accesses),
- crashes (fortunately less common, and these can even be addressed by backups in many cases).

These are why I disagree with people wanting to push all the consistency logic into the application layer of database-backed applications with concurrent access (as often advocated for Rails). It's simply not doable without recoding the database's whole concurrent-access manager and log-based/MVCC/... crash resistance in the application layer (good luck with that).

Lionel.

Don't forget: you could put the data into a hash and Marshal it to
disk. Not a DB, but better than a flat file!

Maybe we are talking about different things. By data integrity, I
mean you can be certain not just that the data was entered correctly,
but also that it coincides with the relationships present. In a
modified version of the OP's model, for example...

Cat1
-name1
-name2
-name3
-etc...

Cat2
-name4
-name5
-name6
-etc...

Cat3
-name1
-name2
-name3
etc...

Note the same category names, but in different categories.

Now, surely, you can say, "Well, the application logic will take care
of that ambiguity." But I say we should continue to separate
application logic from data logic.

I'm no CS guy, so I don't know the correct terms for this, but I do
see the potential pitfalls.

There certainly is a time and place for this, but I've generally not
found it that beneficial.

Todd


On Tue, Apr 1, 2008 at 11:55 AM, Lionel Bouton <lionel-subscription@bouton.name> wrote:


Oh wait, Lionel already suggested that.

Sorry Lionel; missed the OP's "But each name will go in only one
category". I do still think it wouldn't be that bad to use a DB.

Todd


On Tue, Apr 1, 2008 at 12:13 PM, Todd Benson <caduceass@gmail.com> wrote:


> Todd Benson wrote:
>
> > I'm going to slightly disagree with Lionel -- and also Robert -- on
> > this one. First of all, a database is not necessarily just for
> > concurrency. It's for data integrity
>
> Yes I agree (as explained below concurrency is what I consider the main
> problem to solve to enforce data integrity). That said if you write your
> data in one pass as the OP, you don't need data integrity in the storage
> layer... rename is atomic : you either renamed the temp file to its
> final position before a crash or not.

Exactly. With regard to all that we've learned about the issue at
hand a DB seems overkill here. KISS

> The problem are partial updates where you need to maintain consistancy.
> And on the top of my head the only problems with partial updates are :
> - concurrent accesses (most common, counting both concurrent read and
> write accesses),
> - crashes (fortunately less common and can even be adressed by backups
> in many cases).
>
> These are why I disagree with people wanting to push all the consistency
> logic into the applicaltion layer on database-backed applications with
> concurrent access (like often advocated for Rails). It's simply not
> doable without recoding the whole concurrent access manager and
> log-based/MVCC/... crash resistance of the database in the application
> layer (good luck with that).

Totally agree - but this is another story.

Maybe we are talking about different things. By data integrity, I
mean you can be certain not just that the data was entered correctly,
but also that it coincides with the relationships present. In a
modified version of the OP's model, for example...

Now, surely, you can say, "Well, the application logic will take care
of that ambiguity." But I say we should continue to separate
application logic from data logic.

But the consistency needs to be /somewhere/ and if no database is
needed then enforcing it in app logic is certainly ok.

I'm no CS guy, so I don't know the correct terms for this, but I do
see the potential pratfalls.

There certainly is a time and place for this, but I've found it's
usefulness generally not that beneficial.

What is "this" in this paragraph?

Generally I do not think we're far apart - if at all. Given the scale
of the problem and the apparent lack of future growth in size,
complexity, and concurrency, a simple solution suffices IMHO.
Of course it's good to know the options - that's why we discuss here.

Kind regards

robert

A modified version of the script, since the other posting did not seem
to make it into Usenet. This one has a consistency check, as originally
required:

#!/usr/bin/env ruby

require 'set'
require 'yaml'

class CatNames
  def self.load(file_name)
    # Note: on Ruby >= 3.1 YAML.load no longer restores arbitrary objects;
    # use YAML.unsafe_load there to deserialize a CatNames instance.
    File.open(file_name) {|io| YAML.load(io)}
  end

  def save(file_name)
    File.open(file_name, "w") {|io| YAML.dump(self, io)}
  end

  def initialize
    @cat = {}
    @all = {}
  end

  def add(cat, name)
    raise "Consistency Error" if @all[name]
    s = (@cat[cat] ||= Set.new)
    s << name
    @all[name] = s
  end

  def remove(cat, name)
    c = @cat[cat] and c.delete name
    @all.delete name
  end

  def clear
    @cat.clear
    @all.clear
  end

  def size
    @cat.inject(0) {|sum,(name,set)| sum + set.size}
  end
end

t = Time.now

d = CatNames.new

1000.times do |i|
  d.add("cat#{i % 10}", "name#{i}")
end

puts d.size

tt = Time.now
printf "%6.3f %s\n", tt-t, "create"
t = tt

d.save "test.yaml"

tt = Time.now
printf "%6.3f %s\n", tt-t, "write"
t = tt

d2 = CatNames.load "test.yaml"

tt = Time.now
printf "%6.3f %s\n", tt-t, "load"
t = tt

begin
  d2.add "foo", "name0"
rescue Exception => e
  puts e
end


2008/4/1, Todd Benson <caduceass@gmail.com>:



I admit, I tend to like using a sledgehammer to turn a machine screw,
but in that respect, I'm usually thinking of scalability and data
integrity.

When I said "there's a time and place for this", "this" was referring
to the various forms of flat file storage.

With this particular situation, I would probably go with YAML, and
migrate to a database if need be (which shouldn't be that hard,
depending on how deeply nested the data is).

Todd


On Tue, Apr 1, 2008 at 2:35 PM, Robert Klemme <shortcutter@googlemail.com> wrote:

Exactly. With regard to all that we've learned about the issue at
hand a DB seems overkill here. KISS

Todd Benson wrote:


On Tue, Apr 1, 2008 at 2:35 PM, Robert Klemme <shortcutter@googlemail.com> wrote:

Exactly. With regard to all that we've learned about the issue at
hand a DB seems overkill here. KISS

I admit, I tend to like using a sledgehammer to turn a machine screw,
but in that respect, I'm usually thinking of scalability and data
integrity.

Why use a sledgehammer, when you can use the surgeon's knife, SQLite?

That's the RDBMS I'd use, if I were using a SQL DB in this situation.

It doesn't always have to be Postgres or Oracle. :P

-- Phillip Gawlowski

Well, "sledgehammer" was for humor. A better analogy for my approach
would be this darn overly large Swiss Army knife that doesn't always
fit comfortably in my pocket, but I wear it anyway, just in case.

My only problem with SQLite is its lack of foreign key constraints.

cheers,
Todd


On Tue, Apr 1, 2008 at 4:58 PM, Phillip Gawlowski <cmdjackryan@googlemail.com> wrote:
