Why the CSV standard library is broken (and how to fix it), Part IV or Numerics a.k.a. Auto-Magic Type Inference for Strings and Numbers?


(Gerald Bauer) #1

Hello,

  I've written a new (and fourth) episode on why the CSV standard library is
broken, broken, broken (and how to fix it).

  Let's have a look at numerics a.k.a. auto-magic type inference for
strings and numbers [1].

  Here's the challenge for the standard csv library.
  Let's read data.csv:

    1,2,3
    "4","5","6"

The values follow these two popular rules (with a bonus for NaN, that is, not a number):

Rule 1: Use "un-quoted" values for float numbers e.g. 1,2,3 or 1.0,
2.0, 3.0 etc.

Rule 2: Use quoted values for "non-numeric" strings e.g. "4", "5", "6"
or "Hello, World!" etc.

In the new csv reader it works like this :-):

   records = Csv.numeric.read( 'data.csv' )
   pp records
   # => [[1.0, 2.0, 3.0],
   # ["4", "5", "6"]]

And with your own not-a-number (NaN) constants / configuration:

   records = Csv.numeric.parse( '1,2,NAN,#NA', nan: ['NAN', '#NA'] )
   pp records
   # => [[1.0, 2.0, NaN, NaN]]
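
For readers who want to see what the two rules boil down to, here is a minimal sketch in plain Ruby. It is not how the new csv reader is implemented and it does not use the standard csv library; the helper names `split_fields` and `parse_numeric` are made up for this illustration. It assumes comma separators, double quotes with `""` as the escape, and no line breaks inside quoted values.

```ruby
# Hypothetical helper (not part of any library): split one line into
# [text, quoted] pairs, remembering whether each value used quotes.
def split_fields(line)
  fields    = []
  text      = +''
  quoted    = false   # did the current value use quotes?
  in_quotes = false   # are we inside a quoted section right now?
  chars     = line.chomp.chars
  i = 0
  while i < chars.size
    c = chars[i]
    if in_quotes
      if c == '"' && chars[i + 1] == '"'
        text << '"'           # "" inside quotes is an escaped quote
        i += 1
      elsif c == '"'
        in_quotes = false     # closing quote
      else
        text << c
      end
    elsif c == '"'
      in_quotes = true        # opening quote
      quoted    = true
    elsif c == ','
      fields << [text, quoted]
      text   = +''
      quoted = false
    else
      text << c
    end
    i += 1
  end
  fields << [text, quoted]
end

# Hypothetical helper: apply Rule 1 and Rule 2 to every value.
def parse_numeric(str, nan: ['NaN'])
  str.each_line.reject { |line| line.strip.empty? }.map do |line|
    split_fields(line).map do |text, quoted|
      if quoted
        text                  # Rule 2: quoted values stay strings
      elsif nan.include?(text.strip)
        Float::NAN            # configured not-a-number constants
      else
        Float(text.strip)     # Rule 1: unquoted values become floats
      end
    end
  end
end

parse_numeric(%Q{1,2,3\n"4","5","6"\n})
# => [[1.0, 2.0, 3.0], ["4", "5", "6"]]
parse_numeric('1,2,NAN,#NA', nan: ['NAN', '#NA'])
# => [[1.0, 2.0, NaN, NaN]]
```

Note that `Float()` raises an `ArgumentError` for an unquoted value that is neither a number nor a configured NaN constant; whether to raise or to fall back to a string is a design choice this sketch leaves open.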

Let's quote an old quote from this mailing list:

> I disagree that it's broken.
> It's implementing the [strict] RFC [CSV format] and gives you the tools that allow you to be less strict.

Anyone? Can you show us how you handle reading the numerics
variant and not a number (NaN) with the standard csv library?

Questions and comments welcome. Cheers. Prost.

PS: If you want to see other (more) CSV formats / dialects pre-configured
and supported "out-of-the-box" in the new csv reader, please let me know.

[1] https://github.com/csv11/docs/blob/master/csv-numerics.md