Hello,
I've written a new (and fourth) episode on why the CSV standard library is
broken, broken, broken (and how to fix it).
Let's have a look at numerics a.k.a. auto-magic type inference for
strings and numbers [1].
Here's the challenge for the standard csv library.
Let's read data.csv:
1,2,3
"4","5","6"
The reader should apply these two popular rules (with a bonus for NaNs, that is, not a number):
Rule 1: Use "un-quoted" values for float numbers e.g. 1,2,3 or 1.0,
2.0, 3.0 etc.
Rule 2: Use quoted values for "non-numeric" strings e.g. "4", "5", "6"
or "Hello, World!" etc.
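To make the two rules concrete, here is a minimal sketch in plain Ruby of what they mean for a single line. It is not how the new csv reader is implemented; parse_numeric_line is a made-up helper name, and the sketch assumes single-line values with "" as the only escape inside quotes.

```ruby
require 'strscan'

# Hypothetical helper (an assumption for illustration, not a library API):
# split one CSV line into values and apply the two rules.
def parse_numeric_line(line)
  scanner = StringScanner.new(line.chomp)
  values  = []
  loop do
    if scanner.scan(/"((?:[^"]|"")*)"/)
      # Rule 2: a quoted value stays a string ("" is an escaped quote)
      values << scanner[1].gsub('""', '"')
    else
      # Rule 1: an un-quoted value is read as a float;
      # Float() raises ArgumentError for anything that is not a number
      values << Float(scanner.scan(/[^,]*/))
    end
    break unless scanner.scan(/,/)   # no more commas, no more values
  end
  values
end

pp parse_numeric_line('1,2,3')
pp parse_numeric_line('"4","5","6"')
```

The point of the sketch is that the parser has to remember whether a value was quoted; once the quotes are stripped, "4" and 4 look the same.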
In the new csv reader it works like this :-):
records = Csv.numeric.read( 'data.csv' )
pp records
# => [[1.0, 2.0, 3.0],
# ["4", "5", "6"]]
And with your own not-a-number (NaN) constants / configuration:
records = Csv.numeric.parse( '1,2,NAN,#NA', nan: ['NAN', '#NA'] )
pp records
# => [[1.0, 2.0, NaN, NaN]]
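For the curious, the NaN mapping can be sketched in a few lines of plain Ruby. Again, this is only an illustration under my own assumptions and not the reader's actual code: to_float is a made-up helper, its default nan list is an arbitrary choice, and the split on commas is only safe here because the line has no quoted values.

```ruby
# Hypothetical helper (an assumption for illustration): convert one
# un-quoted value to a float, mapping the configured constants to NaN.
def to_float(value, nan: ['NaN'])
  text = value.strip
  nan.include?(text) ? Float::NAN : Float(text)
end

values = '1,2,NAN,#NA'.split(',').map { |v| to_float(v, nan: ['NAN', '#NA']) }

# NaN never equals itself, so check with nan? instead of ==
values.each { |v| puts(v.nan? ? 'not a number' : v) }
```

A value that is neither a number nor one of the configured constants still raises ArgumentError, so typos in the data do not silently turn into NaN.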
Let's quote an old quote from this mailing list:
> I disagree that it's broken.
> It's implementing the [strict] RFC [CSV format] and gives you the tools that allow you to be less strict.
Anyone? Can you show us how you handle reading the numerics
variant and not-a-number (NaN) values with the standard csv library?
Questions and comments welcome. Cheers. Prost.
PS: If you want to see other (more) CSV formats / dialects pre-configured
and supported "out-of-the-box" in the new csv reader, please let me know.
[1] https://github.com/csv11/docs/blob/master/csv-numerics.md