Why the CSV standard library is broken, broken, broken (and how to fix it)


(Gerald Bauer) #1

Hello,

   I’ve written up an article titled “Why the CSV standard library is
broken, broken, broken (and how to fix it)” [1].

   Why not let’s make ruby the best it can be for data wrangling -
starting with the humble comma-separated values (CSV) format - the
world’s most popular data interchange format in text :-).

Try it yourself. Example:

require 'csv'
require 'pp'

begin
  CSV.parse( %{1, "2"})
rescue CSV::MalformedCSVError => ex
  pp ex
end
# => #<CSV::MalformedCSVError: Illegal quoting in line 1.>

begin
CSV.parse( %{"3" , 4})
rescue CSV::MalformedCSVError => ex
  pp ex
end
# => #<CSV::MalformedCSVError: Unclosed quoted field on line 1.>

pp CSV.parse( %{"","",} )

# => ["", "", nil, nil]

  Is that what you expected? Questions and comments welcome. Cheers. Prost.

[1] https://github.com/csv11/docs/blob/master/why-the-csv-stdlib-is-broken.md


(Austin Ziegler) #2

I disagree that it’s broken. It’s implementing the RFC and gives you
tools that allow you to be less strict. “Fixing” CSV in this case
would perhaps be providing some default “human-made” CSVs. However,
what you are suggesting as possible defaults would force me to fork
the CSV library and use a fairly strict version of it because I am
dealing with machine-generated CSVs that conform to the RFC as much as
we can get it.

-a

···

On Wed, Aug 22, 2018 at 12:22 PM Gerald Bauer <gerald.bauer@gmail.com> wrote:

Hello,

   I’ve written up an article titled “Why the CSV standard library is
broken, broken, broken (and how to fix it)” [1].

   Why not let’s make ruby the best it can be for data wrangling -
starting with the humble comma-separated values (CSV) format - the
world’s most popular data interchange format in text :-).

Try it yourself. Example:

require 'csv'
require 'pp'

begin
  CSV.parse( %{1, "2"})
rescue CSV::MalformedCSVError => ex
  pp ex
end
# => #<CSV::MalformedCSVError: Illegal quoting in line 1.>

begin
CSV.parse( %{"3" , 4})
rescue CSV::MalformedCSVError => ex
  pp ex
end
# => #<CSV::MalformedCSVError: Unclosed quoted field on line 1.>

pp CSV.parse( %{"","",} )

# => ["", "", nil, nil]

  Is that what you expected? Questions and comments welcome. Cheers. Prost.

[1] https://github.com/csv11/docs/blob/master/why-the-csv-stdlib-is-broken.md

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>

--
Austin Ziegler • halostatue@gmail.com • austin@halostatue.ca
http://www.halostatue.ca/http://twitter.com/halostatue


(Eric Wong) #3

Did you try reporting to upstream to see about fixing it?
I'm no CSV expert, but this type of article seems click-baity.

···

Gerald Bauer <gerald.bauer@gmail.com> wrote:

   I’ve written up an article titled “Why the CSV standard library is
broken, broken, broken (and how to fix it)” [1].


(Gerald Bauer) #4

Hello,

I disagree that it’s broken. It’s implementing the RFC and gives you

tools that allow you to be less strict.

No, it doesn't. The heart of the matter and the joke is that if you
want to parse comma-separated values (csv) lines it is more
complicated than using line.split(",") and you need a purpose-built
parser for the (edge) cases and (special) escape rules, and, thus,
you're advised to use a csv library.

    After using the csv std library I'm getting all these parse errors
so I look at the source code and read-up what's going on and -
surprise, surprise - the joke is on me:

   parts = parse.split(@col_sep_split_separator, -1) [1]

   By definition it is impossible and unfixable unless you use your
own purpose built parser - sorry, there's no "ingenious" hack for a
supposed "faster" library and the excuse about parsing only very,
very, very strict RFC is getting old. What do all the other csv
libraries in the world do (see python, java, go, javascript, etc.)

    Anyways, here's how a parser looks like (it's not magic but
definitely more work - e.g. instead of 10-20 lines you will have 100
or 200 or more):

def parse_field( io, sep: ',' )
  value = ""
  skip_spaces( io ) ## strip leading spaces
  if (c=io.peek; c=="," || c==LF || c==CR || io.eof?) ## empty field
    ## return value; do nothing
  elsif io.peek == DOUBLE_QUOTE
    puts "start double_quote field - peek >#{io.peek}< (#{io.peek.ord})"
    io.getc ## eat-up double_quote
    loop do
       while (c=io.peek; !(c==DOUBLE_QUOTE || io.eof?))
         value << io.getc ## eat-up everything unit quote (")
       end
      break if io.eof?
      io.getc ## eat-up double_quote
      if io.peek == DOUBLE_QUOTE ## doubled up quote?
         value << io.getc ## add doube quote and continue!!!!
      else
      ....

   and so on and so forth [2]. See the difference?

    Anyways, thanks for the suggestion about reporting upstream.
  Unfortunately, the csv library is an orphan abadoned by its original
author as a 1000 line single-file code bomb and would need some love
and care.

    There are so many other major flaws e.g. why not just return a
hash if the csv file has a header.
   To conclude, the csv library might have been once
"state-of-the-art" ten years ago - now in 2020 it's unfortunately a
dead horse and cannot handle the (rich) diversity / dialects of csv
formats.

     Anyways, not really complaining, see the csv11 org [3] - really
trying hard to make ruby better for data wrangling with csv (one
commit at a time with alternative csv libraries / gems). What have you
done lately? Who do you think is going to fix it?

   Cheers. Prost.

[1] https://github.com/ruby/csv/blob/master/lib/csv.rb#L1248
[2] https://github.com/csv11/csvreader/blob/master/lib/csvreader/parser.rb
[3] https://github.com/csv11

···

El mié., 22 ago. 2018 a las 19:56, Eric Wong (<e@80x24.org>) escribió:

Gerald Bauer <gerald.bauer@gmail.com> wrote:
> I’ve written up an article titled “Why the CSV standard library is
> broken, broken, broken (and how to fix it)” [1].

Did you try reporting to upstream to see about fixing it?
I'm no CSV expert, but this type of article seems click-baity.

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>


(Matthew Kerwin) #5

Hello,

> I disagree that it’s broken. It’s implementing the RFC and gives you
tools that allow you to be less strict.

[snipping long response]

   By definition it is impossible and unfixable unless you use your
own purpose built parser - sorry, there's no "ingenious" hack for a
supposed "faster" library and the excuse about parsing only very,
very, very strict RFC is getting old.

Have you read https://tools.ietf.org/html/rfc4180 ? The quoting rules
are in there.

    Anyways, thanks for the suggestion about reporting upstream.
  Unfortunately, the csv library is an orphan abadoned by its original
author as a 1000 line single-file code bomb and would need some love
and care.

Uh... it was last updated in July..?

Lots of libraries are being pulled out of stdlib and moved to separate
gems; maybe you just don't have the gemified version installed? Mine's
current -- 3.0.0 -- and does this:

···

On Thu, 23 Aug 2018 at 16:34, Gerald Bauer <gerald.bauer@gmail.com> wrote:

~~~
irb(main):001:0> require 'csv'
=> true
irb(main):002:0> CSV.parse %Q[foo,"b\nr","b""z"]
=> [["foo", "b\nr", "b\"z"]]
~~~

It did the same with 1.0.0, too.

There are also other CSV gems, if you look for them:
https://rubygems.org/gems/smarter_csv

Cheers
--
  Matthew Kerwin
  https://matthew.kerwin.net.au/
On Thu, 23 Aug 2018 at 16:34, Gerald Bauer <gerald.bauer@gmail.com> wrote:

Hello,

> I disagree that it’s broken. It’s implementing the RFC and gives you
tools that allow you to be less strict.

No, it doesn't. The heart of the matter and the joke is that if you
want to parse comma-separated values (csv) lines it is more
complicated than using line.split(",") and you need a purpose-built
parser for the (edge) cases and (special) escape rules, and, thus,
you're advised to use a csv library.

    After using the csv std library I'm getting all these parse errors
so I look at the source code and read-up what's going on and -
surprise, surprise - the joke is on me:

   parts = parse.split(@col_sep_split_separator, -1) [1]

   By definition it is impossible and unfixable unless you use your
own purpose built parser - sorry, there's no "ingenious" hack for a
supposed "faster" library and the excuse about parsing only very,
very, very strict RFC is getting old. What do all the other csv
libraries in the world do (see python, java, go, javascript, etc.)

    Anyways, here's how a parser looks like (it's not magic but
definitely more work - e.g. instead of 10-20 lines you will have 100
or 200 or more):

def parse_field( io, sep: ',' )
  value = ""
  skip_spaces( io ) ## strip leading spaces
  if (c=io.peek; c=="," || c==LF || c==CR || io.eof?) ## empty field
    ## return value; do nothing
  elsif io.peek == DOUBLE_QUOTE
    puts "start double_quote field - peek >#{io.peek}< (#{io.peek.ord})"
    io.getc ## eat-up double_quote
    loop do
       while (c=io.peek; !(c==DOUBLE_QUOTE || io.eof?))
         value << io.getc ## eat-up everything unit quote (")
       end
      break if io.eof?
      io.getc ## eat-up double_quote
      if io.peek == DOUBLE_QUOTE ## doubled up quote?
         value << io.getc ## add doube quote and continue!!!!
      else
      ....

   and so on and so forth [2]. See the difference?

    Anyways, thanks for the suggestion about reporting upstream.
  Unfortunately, the csv library is an orphan abadoned by its original
author as a 1000 line single-file code bomb and would need some love
and care.

    There are so many other major flaws e.g. why not just return a
hash if the csv file has a header.
   To conclude, the csv library might have been once
"state-of-the-art" ten years ago - now in 2020 it's unfortunately a
dead horse and cannot handle the (rich) diversity / dialects of csv
formats.

     Anyways, not really complaining, see the csv11 org [3] - really
trying hard to make ruby better for data wrangling with csv (one
commit at a time with alternative csv libraries / gems). What have you
done lately? Who do you think is going to fix it?

   Cheers. Prost.

[1] https://github.com/ruby/csv/blob/master/lib/csv.rb#L1248
[2] https://github.com/csv11/csvreader/blob/master/lib/csvreader/parser.rb
[3] https://github.com/csv11

El mié., 22 ago. 2018 a las 19:56, Eric Wong (<e@80x24.org>) escribió:
>
> Gerald Bauer <gerald.bauer@gmail.com> wrote:
> > I’ve written up an article titled “Why the CSV standard library is
> > broken, broken, broken (and how to fix it)” [1].
>
> Did you try reporting to upstream to see about fixing it?
> I'm no CSV expert, but this type of article seems click-baity.
>
> Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
> <http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>

--
  Matthew Kerwin
  https://matthew.kerwin.net.au/


(Gerald Bauer) #6

Hello,

Have you read https://tools.ietf.org/html/rfc4180 ?
The quoting rules are in there.

  Have you read it? :slight_smile: Let's start at the beginning (together):

This memo provides information for the Internet community. It does
not specify an Internet standard of any kind.

   Welcome to the wonderful world of csv formats. Why not accept the
world (and humans) how it is (how they are)? Are you asking for
perfection in your csv format? Good luck in convincing the world to
follow your nit-picky rules :slight_smile:

   Thanks for highlighting other csv library alternatives. Yes, moving
more libraries to "regular" gems is a fantastic move and hopefully
helps in getting more updates / fixes.

   Cheers. Prost.


(Andy Jones) #7

   I’ve written up an article titled “Why the CSV standard library is
broken, broken, broken (and how to fix it)” [1].

I agree with what is apparently your premise: that the CSV "standard" is weakly adhered to in the wild, and that we need a tool to read all the CSVs, not just the ones that follow the RFC.

But.

This "broken, broken, broken" language just makes it seem as if you want to start a flame war, not fix a problem. Personally, I'm not going to take anyone seriously if they behave like that. Sorry.

Click here to view Company Information and Confidentiality Notice.<http://www.jameshall.co.uk/index.php/small-print/email-disclaimer>

Please note that we have updated our privacy policy in line with new data protection regulations. Please refer to our website to view the ways in which we handle your data.


(Who Are You) #8

Just wanted to mention briefly - I used the CSV stdlib in a recent project.
It was somewhat unflexible when it came to simple CRUD, compared to SQL statements.
Had no problems reading or appending data, but updating and deleting specific records looked less failsafe.
I ended up creating a set of append only CSV files for long-term storage, which entries are duplicated to from a set of temporary CSV files which accumulate during client-server interactions.
After saving to long-term CSV files, the temporary CSV files are then overwritten by a set of CSV header-only 'skel' files.
Definitely a hash would've been preferred instead of the object it returns when headers are enabled, after you have selected a row.
Had originally considered converting a different project to CSV, but decided against it after above.

···

On Thu, Aug 23, 2018 at 08:32:29AM +0200, Gerald Bauer wrote:

    There are so many other major flaws e.g. why not just return a
hash if the csv file has a header.
   To conclude, the csv library might have been once
"state-of-the-art" ten years ago - now in 2020 it's unfortunately a
dead horse and cannot handle the (rich) diversity / dialects of csv
formats.


(Peter Hickman) #9

Is this sarcasm? I can't tell anymore :frowning:

···

On Thu, 23 Aug 2018 at 10:50, <who.are.you@posteo.no> wrote:

On Thu, Aug 23, 2018 at 08:32:29AM +0200, Gerald Bauer wrote:
>
> There are so many other major flaws e.g. why not just return a
> hash if the csv file has a header.
> To conclude, the csv library might have been once
> "state-of-the-art" ten years ago - now in 2020 it's unfortunately a
> dead horse and cannot handle the (rich) diversity / dialects of csv
> formats.
>

Just wanted to mention briefly - I used the CSV stdlib in a recent project.
It was somewhat unflexible when it came to simple CRUD, compared to SQL
statements.
Had no problems reading or appending data, but updating and deleting
specific records looked less failsafe.
I ended up creating a set of append only CSV files for long-term storage,
which entries are duplicated to from a set of temporary CSV files which
accumulate during client-server interactions.
After saving to long-term CSV files, the temporary CSV files are then
overwritten by a set of CSV header-only 'skel' files.
Definitely a hash would've been preferred instead of the object it returns
when headers are enabled, after you have selected a row.
Had originally considered converting a different project to CSV, but
decided against it after above.

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>


(Matthew Kerwin) #10

Hello,

> Have you read https://tools.ietf.org/html/rfc4180 ?
> The quoting rules are in there.

  Have you read it? :slight_smile: Let's start at the beginning (together):

> This memo provides information for the Internet community. It does
> not specify an Internet standard of any kind.

Exactly, it's not prescribing a standard protocol or format, it's
documenting what is known to already exist. That's what that sort of
informational RFC does.

I don't know what tools you're using that generate stuff with
whitespace outside quotes and the like, I've certainly never come
across it. Even when Postelism was a thing, the rule was "be [...]
conservative in what you send". If you swim upstream, you aren't
really entitled to complain if the current pushes against you.

   Welcome to the wonderful world of csv formats. Why not accept the
world (and humans) how it is (how they are)? Are you asking for
perfection in your csv format? Good luck in convincing the world to
follow your nit-picky rules :slight_smile:

Your patronising tone doesn't do anyone any favours.

We have a stable RFC that *documents existing, well-known behaviour*.
It was written *13 years ago*, meaning the format was stable and
well-understood even before that time. Somehow you are generating
documents (hand-crafting?) that don't conform to decades of precedent.
I'm not surprised the tools don't accept them, formal standard or no.

Incidentally, "fail fast" is generally considered better than "garbage
in, garbage out" -- i.e. if you can detect that something is malformed
it's better to raise it than bully your way through. In the absence
of a prescriptive spec, it makes sense for libraries and tools to fail
hard on malformed data. By specifying a "graceful recovery" mechanism
for these cases, you've effectively created a new protocol/format; one
which doesn't interoperate particularly well with the existing
ecosystem, at least in one direction. The least you could do is
document it, and give it a new name.

···

On Thu, 23 Aug 2018 at 17:49, Gerald Bauer <gerald.bauer@gmail.com> wrote:

   Thanks for highlighting other csv library alternatives. Yes, moving
more libraries to "regular" gems is a fantastic move and hopefully
helps in getting more updates / fixes.

   Cheers. Prost.

--
  Matthew Kerwin
  https://matthew.kerwin.net.au/


(Gerald Bauer) #11

Hello,
   FYI: I've also posted a link to "Why the ruby CSV standard library
is broken, broken, broken (and how to fix it)" on the reddit ruby. If
interested find more comments:

   https://www.reddit.com/r/ruby/comments/993lkm/csvreader_read_tabular_data_in_the_commaseparated/

   Anyways, I'm offline and leave you with a quote from reddit ruby:
   The main point that got lost is that the "parser" in the csv
standard library is broken, broken, broken. It's just a line.split
kludge - it's not working by definition for different formats because
you will need a purpose built parser. See python :slight_smile: as an example
(parser is in c)
https://github.com/python/cpython/blob/master/Modules/_csv.c#L587.
Anyways, I don't really care - use what works for you - was just
trying to point out the obvious (if you look at the csv code itself in
the ruby stdlib - it's an unfixable hack unless it gets replaced with
a proper ye good olde' parser - learn from python :slight_smile: if you want to
make it faster, use c :slight_smile: not some hack-y kludges).

   All the best. Enjoy the summer. Cheers. Prost.

PS: Funny aside - the Python CSV standard library sets the strict
flag/option to false by default. See
https://docs.python.org/3/library/csv.html#csv.Dialect.strict


(Kirk Haines) #12

We have a stable RFC that *documents existing, well-known behaviour*.

It was written *13 years ago*, meaning the format was stable and
well-understood even before that time. Somehow you are generating
documents (hand-crafting?) that don't conform to decades of precedent.
I'm not surprised the tools don't accept them, formal standard or no.

While not speaking to any of the specifics of Gerald's article, in my
experience it his highly probable that he's getting CSV files from external
sources which are unparseable with the Ruby CSV library.

I have a long history of CSV files from financial institutions over the
last 15+ years, and they sometimes historically did strange and sometimes
inconsistent things with quotes, whitespace, or header lines, requiring me
to sometimes write preprocessing code to transform these CSV files before I
could do anything with them via the Ruby CSV library. The format of these
files was completely outside of my control.

Kirk Haines

···

On Thu, Aug 23, 2018 at 5:25 AM Matthew Kerwin <matthew@kerwin.net.au> wrote:


(Gerald Bauer) #13

Hello,

  Little errata for the record - the correct reddit link is:

  If interested find more comments:

  All the best. Enjoy the summer. Cheers. Prost.


(Andy Jones) #14

I have a long history of CSV files from financial institutions over the last 15+ years, and they sometimes historically did strange and sometimes inconsistent things with quotes, whitespace, or header lines, requiring me to sometimes write preprocessing code to transform these CSV files before I could do anything with them via the Ruby CSV library. The format of these files was completely outside of my control.

This is absolutely my experience, also. Most people – even inside IT – think of CSV as nothing more precise than “a bunch of values, separated by (probably) commas, one record to a line”. And they are not exactly wrong, are they?

Click here to view Company Information and Confidentiality Notice.<http://www.jameshall.co.uk/index.php/small-print/email-disclaimer>

Please note that we have updated our privacy policy in line with new data protection regulations. Please refer to our website to view the ways in which we handle your data.