Also, there is a hidden assumption in your position -- that libraries, ipso facto, represent robust methods.
For the newbies, however, it might matter. They might think library contents differ from ordinary code.
I sure hope they think that! I know I do.
There's no faster way to find bugs than to bundle up some code and turn it loose on the world. That leads to more robust code. This is the reason open source development works so well.
If one of us patches a library, everyone benefits. It's like having a few hundred extra programmers on your staff.
Yes, I realize I'm overgeneralizing there. There will always be poorly supported or weak libraries, but those eventually get forked or replaced.
On the other hand, if your data does not exploit this CSV trait (few real-world CSV databases embed linefeeds)...
Really? How do they handle data with newlines in it?
Linefeeds are escaped as though in a normal quoted string. This is how I
have always dealt with embedded linefeeds, which is why I was ignorant of
the specification's language on this (an explanation, not an excuse).
So a linefeed is \n, and then we need to escape the \ itself, so that becomes \\, I assume. Interesting.
I would argue that is not CSV, though it's certainly debatable. My reasoning is that you either need to post-process the parsed CSV data to restore the original values or use a custom parser that understands CSV plus your escaping rules.
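To make that concrete, here is roughly what the post-processing step would look like. The escape rules below (\n, \t, \\) are my assumption about what the exporting tool writes, not anything FasterCSV knows about:

  require "rubygems"
  require "faster_csv"

  # Undo backslash escapes after a normal CSV parse.  The rules here are
  # an assumption; a real script must match whatever the exporter does.
  def unescape(field)
    return nil if field.nil?
    field.gsub(/\\(.)/) do
      case $1
      when "n" then "\n"
      when "t" then "\t"
      else          $1   # \\ becomes \, anything else just loses the backslash
      end
    end
  end

  rows = FasterCSV.parse(File.read("dump.csv")).map do |row|
    row.map { |field| unescape(field) }
  end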
Which "CSV databases" are you referring to here?
MySQL, the database I am most familiar with, uses this method for import or
export of comma- or tab-separated plain-text data. Within MySQL's own
database protocol, linefeeds really are linefeeds, but an imported or
exported plain-text table has them escaped within fields.
Wild. I use MySQL every day. Guess I've never dumped a CSV of linefeed-containing data with it, though. (I generally walk the database myself with a Ruby script and dump with FasterCSV.)
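Something like this, if you're curious. The table name and credentials are made up, and the calls assume the plain mysql bindings:

  require "rubygems"
  require "mysql"
  require "faster_csv"

  # Walk a table and dump it as CSV.  Everything named here is invented.
  db = Mysql.real_connect("localhost", "user", "secret", "my_database")
  begin
    result = db.query("SELECT * FROM some_table")
    FasterCSV.open("some_table.csv", "w") do |csv|
      result.each { |row| csv << row }  # FasterCSV quotes embedded linefeeds
    end
  ensure
    db.close
  end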
It just takes longer if all the database
handling (not just record parsing) must use the same state machine that
field parsing must use.
I don't understand this comment. MySQL, like most databases, does not use CSV internally.
It's very simple, really. Once you allow the record separator inside a
field, you give up any chance to parse records quickly.
Have you heard of the FasterCSV library?
It's pretty zippy.
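A quick demo, with made-up data, showing it handle a quoted field with an embedded linefeed:

  require "rubygems"
  require "faster_csv"

  data = %Q{1,"first line\nsecond line"\n2,plain\n}
  FasterCSV.parse(data)
  # => [["1", "first line\nsecond line"], ["2", "plain"]]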
But parsing will necessarily be slow, character by character, the entire
database scan must use an intelligent parser (no splitting records on
linefeeds as I have been doing), and the state machine needs a few extra
states.
You don't really have to parse CSV character by character. FasterCSV does most of its parsing with a single, highly optimized regular expression (written to avoid backtracking) and a few tricks.
Basically, you can read line by line and divide into fields. If you have an unclosed field at the end of the line, you've hit an embedded linefeed. You then just pull in the next line, append it, and continue eating fields.
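Here's a stripped-down sketch of that idea. It is not FasterCSV's actual code, just the shape of the approach:

  require "strscan"

  # Hand the parser physical lines; whenever a line ends inside an open
  # quoted field (odd number of double quotes), pull the next physical
  # line and keep going.
  def each_row(io)
    while line = io.gets
      while line.count('"') % 2 == 1 and extra = io.gets
        line << extra   # the embedded linefeed stays put, inside the field
      end
      yield split_fields(line.chomp)
    end
  end

  def split_fields(line)
    fields  = []
    scanner = StringScanner.new(line)
    loop do
      if quoted = scanner.scan(/"(?:[^"]|"")*"/)
        fields << quoted[1..-2].gsub('""', '"')  # strip quotes, undo "" escapes
      else
        fields << scanner.scan(/[^,]*/)
      end
      break if scanner.eos?
      scanner.scan(/,/)   # step over the field separator
    end
    fields
  end

The quote counting is the trick: a complete row always contains an even number of double quotes (the "" escape adds two at a time), so you never have to re-parse anything; you just keep appending lines until the count evens out.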
The standard CSV library does not do this, and that is one of the two big reasons it is so slow.
James Edward Gray II
···
On Nov 30, 2006, at 2:45 PM, Paul Lutus wrote: