Importing flat-file data into a database with Ruby

All,

I'm up against a performance issue with some code I have - it's on
GitHub at
https://github.com/poggs/tsdbexplorer/blob/master/lib/tsdbexplorer/cif.rb if anyone wants to take a look.

The code takes a fixed 80-column-wide input file, reads the first two
bytes of each line and hands off further processing to the initialize
method of a class. An object is returned and pushed onto an Array.
When ~1000 of these records have been processed, they are INSERTed into
the database using activerecord-import.
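
Roughly, the loop looks like this (a simplified sketch - the record classes
and the dispatch table are stand-ins for the real ones in cif.rb, and I'm
assuming each class's initialize parses the raw 80-column line):

  require 'activerecord-import'

  # Hypothetical dispatch table: first two bytes of the line -> model class
  RECORD_TYPES = {
    'TI' => TiplocRecord,
    'BS' => BasicScheduleRecord
  }

  batch = []

  File.foreach('schedule.cif') do |line|
    klass = RECORD_TYPES[line[0, 2]]
    next unless klass

    batch << klass.new(line)   # initialize parses the fixed-width fields

    if batch.size >= 1000
      # import is a class method, so the mixed batch is partitioned per model
      batch.group_by(&:class).each { |model, records| model.import(records) }
      batch.clear
    end
  end

  # flush whatever is left at end of file
  batch.group_by(&:class).each { |model, records| model.import(records) } unless batch.empty?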

Without calling the import method on an ActiveRecord object, I can
process a 450MB source file in 6-8 minutes. When writing to the
database, this easily takes in excess of an hour.

Can anyone help me work out where I'm going wrong? Both MySQL and
PostgreSQL show this slowness, so I'm looking at the code first.

Kind regards,

Peter

···

--
Peter Hicks <peter.hicks@poggs.co.uk>

What does your benchmarking of another ORM (Sequel, DataMapper) vs.
'hand-made' transactional SQL say?
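
Something along these lines would give a first data point (a sketch only -
Tiploc and its two columns are made-up placeholders, and the numbers will
obviously depend on your schema and adapter):

  require 'benchmark'

  rows = Array.new(10_000) { |i| ["TIP#{i}", "Description #{i}"] }

  Benchmark.bm(22) do |x|
    x.report('activerecord-import:') do
      Tiploc.import([:tiploc_code, :description], rows)
    end

    x.report('hand-made SQL:') do
      conn = ActiveRecord::Base.connection
      ActiveRecord::Base.transaction do
        values = rows.map { |code, desc| "(#{conn.quote(code)}, #{conn.quote(desc)})" }
        conn.execute("INSERT INTO tiplocs (tiploc_code, description) VALUES #{values.join(', ')}")
      end
    end
  end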

···

On Tue, Aug 30, 2011 at 9:17 PM, Peter Hicks <peter.hicks@poggs.co.uk> wrote:

Without calling the import method on an ActiveRecord object, I can
process a 450MB source file in 6-8 minutes. When writing to the
database, this easily takes in excess of an hour.

--
Phillip Gawlowski

phgaw.posterous.com | twitter.com/phgaw | gplus.to/phgaw

A method of solution is perfect if we can foresee from the start,
and even prove, that following that method we shall attain our aim.
-- Leibniz

Generally, OR mappers are anything but ideal for mass data imports. I
don't know AR's import interface or how it is implemented internally,
but RDBMSs usually allow for batch commands. With these you can execute
multiple SQL statements with a single network roundtrip between client
and RDBMS. Looking at your code and the activerecord-import wiki on
GitHub, it seems you might be using the interface in a suboptimal way by
importing groups of objects per class. I haven't looked through the code
extensively, but it may be that you are not processing batches of ~1000
records but rather a much larger number of smaller batches, which makes
your batch size less effective.
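
If that suspicion is right, a cheap thing to try is buffering per model
class and flushing each buffer only once it alone reaches ~1000 rows, so
that every import call really carries a full batch. A sketch, with
record_class_for standing in for whatever dispatch you use:

  BATCH_SIZE = 1000
  buffers = Hash.new { |hash, klass| hash[klass] = [] }

  File.foreach('schedule.cif') do |line|
    klass = record_class_for(line[0, 2]) or next   # hypothetical dispatch helper
    buffer = buffers[klass]
    buffer << klass.new(line)

    if buffer.size >= BATCH_SIZE
      klass.import(buffer)
      buffer.clear
    end
  end

  # flush the remainders at end of file
  buffers.each { |klass, buffer| klass.import(buffer) unless buffer.empty? }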

Not all RDBMSs have a proper batch interface (or a Ruby implementation
that uses it), and it seems the code in
#import_without_validations_or_callbacks honors this ("if not
supports_import?") by issuing individual INSERT statements. Which means
that your batch size becomes 1. *shudder*
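
If that is what happens, wrapping each batch in an explicit transaction at
least saves a commit per row, although every row still costs its own
statement and roundtrip (batch here being an array of unsaved model
instances):

  # Note: a true multi-row INSERT would send all values in one statement,
  # e.g. INSERT INTO tiplocs (...) VALUES (...), (...), ...
  ActiveRecord::Base.transaction do
    batch.each(&:save!)
  end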

Usually the best results are obtained by using the RDBMS's bulk loader
(e.g. SQL*Loader with Oracle), which has some nifty tricks to speed up
insertion. It may even be more efficient to preformat the insert data
for the specific loader and either write it to a file or feed it to the
loader's stdin via an in-memory pipe.
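
For PostgreSQL the equivalent would be COPY; one way to feed it from Ruby
without a temporary file is to pipe CSV into psql (a sketch - database,
table, columns and field offsets are all made up, and connection details
are assumed to come from the usual PG* environment variables):

  require 'csv'

  IO.popen(['psql', 'tsdbexplorer',
            '-c', '\copy tiplocs (tiploc_code, description) FROM STDIN WITH CSV'],
           'w') do |pipe|
    File.foreach('schedule.cif') do |line|
      next unless line[0, 2] == 'TI'
      pipe.write(CSV.generate_line([line[2, 7].strip, line[18, 26].strip]))
    end
  end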

Kind regards

robert

···

On Tue, Aug 30, 2011 at 9:17 PM, Peter Hicks <peter.hicks@poggs.co.uk> wrote:

I'm up against a performance issue with some code I have - it's on
GitHub at
https://github.com/poggs/tsdbexplorer/blob/master/lib/tsdbexplorer/cif.rb if anyone wants to take a look.

The code takes a fixed 80-column-wide input file, reads the first two
bytes of each line and hands off further processing to the initialize
method of a class. An object is returned and pushed onto an Array.
When ~1000 of these records have been processed, they are INSERTed into
the database using activerecord-import.

Without calling the import method on an ActiveRecord object, I can
process a 450MB source file in 6-8 minutes. When writing to the
database, this easily takes in excess of an hour.

Can anyone help me work out where I'm going wrong? Both MySQL and
PostgreSQL show this slowness, so I'm looking at the code first.

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/