Splitting a CSV file into 40,000 line chunks

Also, there is a hidden assumption in your position -- that libraries, ipso facto, represent robust methods.

For the newbies, however, it might matter. They might think library contents differ from ordinary code.

I sure hope they think that! I know I do.

There's no faster way to find bugs than to bundle up some code and turn it loose on the world. That leads to more robust code. This is the reason open source development works so well.

If one of us patches a library, everyone benefits. It's like having a few hundred extra programmers on your staff.

Yes, I realize I'm overgeneralizing there. There will always be poorly supported or weak libraries, but someone eventually forks or replaces those.

On the other hand, if your data does not exploit this CSV trait (few
real-world CSV databases embed linefeeds)...

Really? How do they handle data with newlines in it?

Linefeeds are escaped as though in a normal quoted string. This is how I
have always dealt with embedded linefeeds, which is why I was ignorant of
the specification's language on this (an explanation, not an excuse).

So a linefeed is \n and then we need to escape the \ so that is \\, I assume. Interesting.

I would argue that is not CSV, but it's certainly debatable. My reasoning is that you either need to post-process the parsed CSV data to restore it or use a custom parser that understands CSV plus your escaping rules.
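To make that concrete, here is a rough sketch of the post-processing route. The escaping convention and the sample record are assumptions for illustration; parse_line is standard FasterCSV, the rest is plain String work:

  require "rubygems"
  require "faster_csv"

  # Assumed convention on the writing side: a linefeed was stored as the
  # two characters backslash-n, and a literal backslash was doubled.
  def unescape_field(text)
    text.gsub(/\\(.)/) { $1 == "n" ? "\n" : $1 }
  end

  row = FasterCSV.parse_line(%Q{1,"line one\\nline two"})
  row.map! { |field| field && unescape_field(field) }
  row.last  # => "line one\nline two" -- the linefeed is restored after parsing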

Which "CSV databases" are you referring to here?

MySQL, the database I am most familiar with, uses this method for import or
export of comma- or tab-separated plain-text data. Within MySQL's own
database protocol, linefeeds really are linefeeds, but an imported or
exported plain-text table has them escaped within fields.

Wild. I use MySQL every day. Guess I've never dumped a CSV of linefeed-containing data with it, though. (I generally walk the database myself with a Ruby script and dump with FasterCSV.)
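(If anyone is curious, that kind of dump script is nothing fancy. A stripped-down sketch, using the old ruby-mysql bindings, with made-up credentials, columns, and file names:)

  require "rubygems"
  require "mysql"        # ruby-mysql bindings
  require "faster_csv"

  db = Mysql.real_connect("localhost", "user", "secret", "mydb")

  FasterCSV.open("dump.csv", "w") do |csv|
    db.query("SELECT id, name, notes FROM people").each do |row|
      csv << row   # FasterCSV quotes any field containing commas or linefeeds
    end
  end

  db.close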

It just takes longer if all the database
handling (not just record parsing) must use the same state machine that
field parsing must use.

I don't understand this comment. Like most databases, MySQL does not use CSV internally.

It's very simple, really. Once you allow the record separator inside a
field, you give up any chance to parse records quickly.

Have you heard of the FasterCSV library? :wink: It's pretty zippy.

But parsing will necessarily be slow, character by character, the entire
database scan must use an intelligent parser (no splitting records on
linefeeds as I have been doing), and the state machine needs a few extra
states.

You don't really have to parse CSV character by character. FasterCSV does most of its parsing with a single highly optimized (to avoid backtracking) regular expression and a few tricks.

Basically you can read line by line and divide into fields. If you have an unclosed field at the end of the line, you hit an embedded linefeed. You then just pull and append the next line and continue eating fields.

The standard CSV library does not do this and that is one of two big reasons it is so slow.
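In rough outline, the idea looks like this (an untested sketch of the strategy, not FasterCSV's actual source):

  # Read physical lines; an odd number of quote characters means a quoted
  # field is still open, so the linefeed we just read was embedded data.
  def each_record(io)
    buffer = ""
    io.each_line do |line|
      buffer << line
      next if buffer.count('"').odd?   # field still open -- keep reading
      yield split_fields(buffer.chomp)
      buffer = ""
    end
  end

  # Split a complete record on commas that fall outside quotes, then strip
  # surrounding quotes and undouble escaped quote characters.
  def split_fields(record)
    record.split(/,(?=(?:[^"]*"[^"]*")*[^"]*\z)/, -1).map do |field|
      field.sub(/\A"(.*)"\z/m, '\1').gsub('""', '"')
    end
  end

  File.open("data.csv") do |f|
    each_record(f) { |fields| p fields }
  end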

James Edward Gray II

···

On Nov 30, 2006, at 2:45 PM, Paul Lutus wrote:

Paul Lutus wrote:

Nice, informative post. There are a lot of issues here, primarily the fact
that the database under discussion is too big to hold in memory, and it is
also too big to fit into Excel in one chunk, which appears to be its
destination.
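(The splitting step itself, at least, can be streamed so the whole file never sits in memory -- something along these lines with FasterCSV, untested and with made-up file names:)

  require "rubygems"
  require "faster_csv"

  chunk_size = 40_000
  chunk      = nil
  count      = 0

  # Stream the source row by row, starting a new output file every 40,000 records.
  FasterCSV.foreach("big_dump.csv") do |row|
    if count % chunk_size == 0
      chunk.close if chunk
      chunk = FasterCSV.open("chunk_%03d.csv" % (count / chunk_size), "w")
    end
    chunk << row
    count += 1
  end
  chunk.close if chunk

Sorting across chunks is a different matter, of course, which is where the database argument comes in.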

Most people have begun to drift toward suggesting a database approach,
rather than anything that involves direct manipulation of the database in
Ruby. Because of the size of the database and because sorting the records
is one goal, I have to agree.
  

I haven't "begun to drift" -- I'll flat out say, "Use a %^$&%^$( database!"

···

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.

James Edward Gray II wrote:

/ ...

So a linefeed is \n and then we need to escape the \ so that is \\, I assume. Interesting.

I would argue that is not CSV, but it's certainly debatable.

Not really debatable, IMHO. Embedded linefeeds are optional, and the two-character "\n" sequence consists of nothing but allowed characters. It's an application-specific option that happens to be fairly common.

Consider what would happen if someone embedded a bunch of HTML break tags in
CSV fields. Does that mean it isn't CSV? Same idea. It's just a common
convention to symbolize a linefeed.

My reasoning is that you either need to post-process the parsed CSV data to restore it or use a custom parser that understands CSV plus your escaping rules.

Or you can leave it alone and let the calling application deal with it. In a
case like this, it is often two cooperating applications, both of which
know the conventions in use. The CSV parser doesn't have to address this
convention directly.

/ ...

Wild. I use MySQL every day. Guess I've never dumped a CSV of linefeed-containing data with it, though. (I generally walk the database myself with a Ruby script and dump with FasterCSV.)

If you have a database with embedded linefeeds, and you dump it to the
console:

echo "Select * from database.table;" | mysql (authorization)

You will see the linefeeds escaped in the output (the fields are
tab-separated, not comma-separated, but the result is the same).

Oh, that reminds me. When tabs are used as field delimiters, literal tabs in the data are also often represented symbolically in the same general way, to avoid a conflict with the delimiter.

It just takes longer if all the database handling (not just record parsing) must use the same state machine that field parsing must use.

I don't understand this comment. Like most databases, MySQL does not use CSV internally.

Yes, I know. Here I am referring to the task of parsing CSV tables; it's no longer about MySQL.

It's very simple, really. Once you allow the record separator inside a
field, you give up any chance to parse records quickly.

Have you heard of the FasterCSV library? :wink: It's pretty zippy.

Yes, and whatever speed it achieves, it is proportionally slower to the
degree that it follows the requirement to allow linefeeds inside fields.

But parsing will necessarily be slow, character by character, the entire database scan must use an intelligent parser (no splitting records on linefeeds as I have been doing), and the state machine needs a few extra states.

You don't really have to parse CSV character by character. FasterCSV
does most of its parsing with a single highly optimized (to avoid
backtracking) regular expression and a few tricks.

Then the regex engine parses character by character, rather than the
explicit Ruby code. Same proportions as explained above, but probably much
faster overall (I assume Ruby's regex engine is written in native code).

Basically you can read line by line and divide into fields.

Actually you must do both at once. You can't simply parse record delimiters without paying attention to the field delimiters and field contents. I know you know this; I am just adding it for other readers.

If you
have an unclosed field at the end of the line, you hit an embedded
linefeed. You then just pull and append the next line and continue
eating fields.

It must be some regex scheme to simultaneously deal with embedded field
delimiters as well as the various quote issues and, finally, embedded
record delimiters.

The standard CSV library does not do this and that is one of two big
reasons it is so slow.

I'm glad to see there is healthy competition for quality code, even in
libraries.

···

--
Paul Lutus
http://www.arachnoid.com