FasterCSV RCR?

I'm considering submitting my first RCR to add FasterCSV to the Standard Library.

It's a pretty mature library now, has a CSV compatibility mode, is very feature rich (including many CSV lacks), and is wicked fast in comparison. I see it recommended regularly and get lots of positive feedback.

What do others think? Worth adding?

James Edward Gray II

I'm considering submitting my first RCR to add FasterCSV to the
Standard Library.

Sweet.

It's a pretty mature library now, has a CSV compatibility mode, is
very feature rich (including many CSV lacks), and is wicked fast in
comparison. I see it recommended regularly and get lots of positive
feedback.

What do others think? Worth adding?

Yes, absolutely. I brought it in house here, and we used it pretty widely
until someone made an issue out of the fact that it's not in the standard
library.

Since we run through fairly large CSVs multiple times a day, I enjoy the
speed FasterCSV gives us and I really don't want to have to go back.

On 5/26/06, James Edward Gray II <james@grayproductions.net> wrote:

James Edward Gray II

--
thanks,
-pate
-------------------------

James Edward Gray II wrote:

I'm considering submitting my first RCR to add FasterCSV to the Standard Library.

It should *replace* the current CSV library. :)

Regards,

Dan

I bugged you about doing this off list, which may be why you posted
this, but just so people know, I use FasterCSV a lot in my work (and
in Ruport) and it has been very pleasant to work with! :)

On 5/26/06, James Edward Gray II <james@grayproductions.net> wrote:

I'm considering submitting my first RCR to add FasterCSV to the
Standard Library.

I assume we have to keep CSV for backwards compatibility. We still have ftools, even though fileutils is preferred. runit too.

James Edward Gray II

On May 26, 2006, at 6:46 PM, Daniel Berger wrote:

James Edward Gray II wrote:

I'm considering submitting my first RCR to add FasterCSV to the Standard Library.

It should *replace* the current CSV library. :)

pat eyler wrote:

On 5/26/06, James Edward Gray II <james@grayproductions.net> wrote:

It's a pretty mature library now, has a CSV compatibility mode, is
very feature rich (including many CSV lacks), and is wicked fast in
comparison. I see it recommended regularly and get lots of positive
feedback.

What do others think? Worth adding?

I'd suggest changing the name to CSV. And possibly defaulting to
compat-mode or perhaps issuing a warning if it's detected that
the user is trying to use the Old Library.

Hal

FasterCSV loses much of its speed in compatibility mode. I think we want to encourage people to use the new interface, especially since I think it's superior. ;)

James Edward Gray II

On May 27, 2006, at 12:31 AM, Hal Fulton wrote:

pat eyler wrote:

On 5/26/06, James Edward Gray II <james@grayproductions.net> wrote:

It's a pretty mature library now, has a CSV compatibility mode, is
very feature rich (including many CSV lacks), and is wicked fast in
comparison. I see it recommended regularly and get lots of positive
feedback.

What do others think? Worth adding?

I'd suggest changing the name to CSV. And possibly defaulting to
compat-mode or perhaps issuing a warning if it's detected that
the user is trying to use the Old Library.

Hi,

In message "Re: FasterCSV RCR?" on Sat, 27 May 2006 14:31:50 +0900, Hal Fulton <hal9000@hypermetrics.com> writes:

I'd suggest changing the name to CSV. And possibly defaulting to
compat-mode or perhaps issuing a warning if it's detected that
the user is trying to use the Old Library.

I agree. I don't want to have two independent CSV readers in the
distribution. It's OK that compatible mode is slow, or gives
obsoletion warning. But we have to discuss about when it should
happen - during 1.8.x or for 1.9.

              matz.

Hi,

>I'd suggest changing the name to CSV. And possibly defaulting to
>compat-mode or perhaps issuing a warning if it's detected that
>the user is trying to use the Old Library.

I agree. I don't want to have two independent CSV readers in the
distribution. It's OK that compatible mode is slow, or gives
obsoletion warning.

Alright, let me take another crack at the compatibility mode then. I can probably speed it up since I know it's about to gain importance.

But we have to discuss about when it should
happen - during 1.8.x or for 1.9.

I trust your judgement on what is best.

I guess I should warn you that the compatibility mode is not a 100% CSV replacement. It works for the majority of applications using just the CSV.* methods, but I don't even try to support all the reader and writer objects. I've never seen code that uses those, but it could exist.

James Edward Gray II

On May 27, 2006, at 11:20 AM, Yukihiro Matsumoto wrote:

In message "Re: FasterCSV RCR?" on Sat, 27 May 2006 14:31:50 +0900, Hal Fulton <hal9000@hypermetrics.com> writes:

Alright, I've thought a lot about this and there is really one big issue here: CSV and FasterCSV are not 100% compatible. If it was just the method arguments, we could get pretty close to perfect, but CSV does some odd things like confuse open() with foreach() that I chose to avoid in FasterCSV. Because of that, I can't always be sure what to do when user code calls a given method.

That leaves two options, in my opinion:

1. FasterCSV's compatibility mode handles most of the issues very well and I'm pretty sure I can remove most of the speed penalty. If we go with that, we have a pretty workable solution right now with one big gotcha: you can require a file named csv.rb and use CSV just fine, but the good stuff will actually be hiding under FasterCSV (in the same file). I have to keep them separate, because of the compatibility issues mentioned above. This, to me, is the only sane way to go if we want to target the 1.8.x branch. It would still break some software, if it uses the unusual features of CSV, but I suspect this is quite rare.
2. We could drop compatibility and rename FasterCSV to CSV. This way people get all the good stuff where they expect it. However, this would break a lot of CSV software (most of it, in fact), so it only seems reasonable when targeting 1.9.x and up.

My thought is that the second option seems preferable. If we train people to use FasterCSV, then we just have to switch them again down the road if we want to revert to CSV. We don't really gain many big advantages for the switch either (speed, if I can eliminate the penalty, but not header parsing or the other good FasterCSV features). That doesn't sound like it's worth breaking software over.

In summary, I recommend targeting 1.9.x with no compatibility mode and renaming FasterCSV to CSV. Am I making sense here?

James Edward Gray II

On May 27, 2006, at 11:20 AM, Yukihiro Matsumoto wrote:

Hi,

In message "Re: FasterCSV RCR?" on Sat, 27 May 2006 14:31:50 +0900, Hal Fulton <hal9000@hypermetrics.com> writes:

>I'd suggest changing the name to CSV. And possibly defaulting to
>compat-mode or perhaps issuing a warning if it's detected that
>the user is trying to use the Old Library.

I agree. I don't want to have two independent CSV readers in the
distribution. It's OK that compatible mode is slow, or gives
obsoletion warning. But we have to discuss about when it should
happen - during 1.8.x or for 1.9.

Hi all,

Long time no post.

James Edward Gray II wrote:

I agree. I don't want to have two independent CSV readers in the
distribution. It's OK that compatible mode is slow, or gives
obsoletion warning.

Alright, let me take another crack at the compatibility mode then. I
can probably speed it up since I know it's about to gain importance.

Please do not waste any more of your time. (Sorry for writing this. I know
you spend a lot of time supporting people who use CSV in Ruby.)
The cracks come from differences in our standpoints on CSV, so it cannot be
100% compatible. Just replace csv.rb with faster_csv.rb.

Replacement (in my opinion):
  On 1.9: replace csv.rb with faster_csv.rb.

  On 1.8: Never mind. replace csv.rb with faster_csv.rb, with no
          compatible mode.

As a bundled library (in my opinion):

  One thing I don't like about faster_csv.rb is String#parse_csv and
  Array#to_csv. Please do not bring pollution to standard classes.

  Kernel.CSV should be discussed well before introducing it. Needed?
  (We already have Kernel.URI though...)

Regards,
// NaHi

i know matz is against it, but i really think we should have both. we have

   ftools and fileutils

   date and date2

   getoptlong, getopts, parsearg, and optparse

   monitor, mutex, and sync

   runit and test/unit

and so on.

i have quite a bit of code that does things like

   CSV::Row

   CSV::Cell

etc. i'd be very upset if i had to re-write all of it. it does little to
help people love ruby when scripts that worked stop working after an upgrade.
that said, i'm 100% for having faster csv in the dist. i just don't see
what's wrong with a few extra pure ruby files in there - they are very tiny.

cheers.

-a

On Wed, 31 May 2006, James Edward Gray II wrote:

1. FasterCSV's compatibility mode handles most of the issues very well and I'm pretty sure I can remove most of the speed penalty. If we go with that, we have a pretty workable solution right now with one big gotcha: you can require a file named csv.rb and use CSV just fine, but the good stuff will actually be hiding under FasterCSV (in the same file). I have to keep them separate, because of the compatibility issues mentioned above. This, to me, is the only sane way to go if we want to target the 1.8.x branch. It would still break some software, if it uses the unusual features of CSV, but I suspect this is quite rare.
2. We could drop compatibility and rename FasterCSV to CSV. This way people get all the good stuff where they expect it. However, this would break a lot of CSV software (most of it, in fact), so it only seems reasonable when targeting 1.9.x and up.

My thought is that the second option seems preferable. If we train people to use FasterCSV, then we just have to switch them again down the road if we want to revert to CSV. We don't really gain many big advantages for the switch either (speed, if I can eliminate the penalty, but not header parsing or the other good FasterCSV features). That doesn't sound like it's worth breaking software over.

In summary, I recommend targeting 1.9.x with no compatibility mode and renaming FasterCSV to CSV. Am I making sense here?

James Edward Gray II

--
be kind whenever possible... it is always possible.
- h.h. the 14th dalai lama

I have created an RCR for this option:

http://www.rcrchive.net/rcr/show/338

Those in favor (or against) may wish to vote.

James Edward Gray II

On May 30, 2006, at 7:13 PM, James Edward Gray II wrote:

2. We could drop compatibility and rename FasterCSV to CSV. This way people get all the good stuff where they expect it. However, this would break a lot of CSV software (most of it, in fact), so it only seems reasonable when targeting 1.9.x and up.

Hi,

I started thinking that just csv.rb should be faster.

James Edward Gray II wrote:

method arguments, we could get pretty close to perfect, but CSV does
some odd things like confuse open() with foreach() that I chose to avoid
in FasterCSV. Because of that, I can't always be sure what to do when

Can you please explain what are "odd"? FasterCSV.build_csv_interface
seems to be a simple delegator.

2. We could drop compatibility and rename FasterCSV to CSV. This way
people get all the good stuff where they expect it. However, this would

Can you please explain what are "good"? I'll introduce those features
into csv.rb. Do those features depend on faster_csv.rb specific behavior?

Regards,
// NaHi

Hi all,

Long time no post.

James Edward Gray II wrote:

I agree. I don't want to have two independent CSV readers in the
distribution. It's OK that compatible mode is slow, or gives
obsoletion warning.

Alright, let me take another crack at the compatibility mode then. I
can probably speed it up since I know it's about to gain importance.

Please do not waste any more of your time. (Sorry for writing this. I know
you spend a lot of time supporting people who use CSV in Ruby.)
The cracks come from differences in our standpoints on CSV, so it cannot be
100% compatible.

Do we have different standpoints? I hope not too different. We're just using different parsing techniques, right?

Other than to_csv() and parse_csv(), are there things you don't like about FasterCSV? I'm open to suggestions.

Just replace csv.rb with faster_csv.rb.

I just don't want to break a lot of software. :(

As a bundled library (in my opinion):

  One thing I don't like about faster_csv.rb is String#parse_csv and
  Array#to_csv. Please do not bring pollution to standard classes.

  Kernel.CSV should be discussed well before introducing it. Needed?
  (We already have Kernel.URI though...)

Maybe I'm alone in this thinking, but I'm not bothered by conversion methods like this. It's also fairly common (to_set(), to_yaml(), etc.).
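To make the two styles under discussion concrete, here is a minimal sketch. It assumes a Ruby whose csv library follows the FasterCSV interface (1.9 and later), where String#parse_csv and Array#to_csv are loaded by require "csv"; the sample data is made up for illustration.

```ruby
require "csv"

line   = "1,2,3"
fields = [1, "b,c"]

# Conversion methods on the core classes (the style being debated):
p line.parse_csv   # ["1", "2", "3"]
p fields.to_csv    # "1,\"b,c\"\n"

# The same work done through the CSV namespace, with no core-class methods:
p CSV.parse_line(line)       # ["1", "2", "3"]
p CSV.generate_line(fields)  # "1,\"b,c\"\n"
```

Both pairs produce the same results, so the disagreement is only about where the methods live, not about what they do.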

James Edward Gray II

On May 27, 2006, at 8:25 PM, NAKAMURA, Hiroshi wrote:

Hi,

i know matz is against it, but i really think we should have both. we have

  ftools and fileutils

  date and date2

  getoptlong, getopts, parsearg, and optparse

  monitor, mutex, and sync

  runit and test/unit

and so on.

They are the mistakes that I try to avoid making again.

  ftools and fileutils
  getoptlong, getopts, parsearg, and optparse

They are unfortunate mistakes I (we) made.

  date and date2

date2 = date + extra libraries.

  monitor, mutex, and sync

They are (somewhat) different.

  runit and test/unit

runit is a compatibility library based on test/unit.

              matz.

In message "Re: FasterCSV RCR?" on Wed, 31 May 2006 09:34:32 +0900, ara.t.howard@noaa.gov writes:

James Edward Gray II wrote:

method arguments, we could get pretty close to perfect, but CSV does
some odd things like confuse open() with foreach() that I chose to avoid
in FasterCSV. Because of that, I can't always be sure what to do when

Can you please explain what are "odd"?

My biggest complaint with CSV is that open() behaves "oddly" and thus defeats all my normal expectations:

>> File.open("example.csv", "w") do |csv|
?> csv.puts "1,2,3"
>> csv.puts "a,b,c"
>> end
=> nil
>> require "csv"
=> true
>> # typical Ruby style reading...
?> File.open("example.csv") do |file|
?> file.each { |row| p row }
>> end
"1,2,3\n"
"a,b,c\n"
=> #<File:example.csv (closed)>
>> # or...
?> File.foreach("example.csv") do |row|
?> p row
>> end
"1,2,3\n"
"a,b,c\n"
=> nil
>> # CSV's "odd" open() method...
>> CSV.open("example.csv", "r") do |row| # "r" required
?> p row # we get rows, not the file object
>> end
["1", "2", "3"]
["a", "b", "c"]
=> nil

Of course, if you open in a writing mode, you do get a file like object. It's inconsistent.

I'm confused about why CSV does this, since it offers the foreach() method, which normally fills this role.
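For contrast, here is a minimal sketch of the split I would expect, with open() acting like File.open and foreach() acting like File.foreach. It assumes a Ruby whose csv library follows the FasterCSV interface (1.9 and later); the helper names rows_via_open and rows_via_foreach are hypothetical, made up for this example.

```ruby
require "csv"
require "tmpdir"

# File.open style: the block receives the CSV object and we iterate ourselves.
def rows_via_open(path)
  rows = []
  CSV.open(path, "r") { |csv| csv.each { |row| rows << row } }
  rows
end

# File.foreach style: the block receives each parsed row directly.
def rows_via_foreach(path)
  rows = []
  CSV.foreach(path) { |row| rows << row }
  rows
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "example.csv")
  File.write(path, "1,2,3\na,b,c\n")

  p rows_via_open(path)     # [["1", "2", "3"], ["a", "b", "c"]]
  p rows_via_foreach(path)  # [["1", "2", "3"], ["a", "b", "c"]]
end
```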

Other CSV oddities (my opinion):

* I always have to think, "Now do I want the *_line() method or the *_row() method here..."
* Most methods take a field separator and a row separator, but foreach() and readlines() only take the row separator.
* I have to set a field separator when I really just want to set a row separator.
* A method called "generate_line()" doesn't involve a line ending.
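As a rough illustration of the alternative to the last three points, here is a sketch using keyword-style options, so each separator can be set on its own. It assumes a Ruby whose csv library follows the FasterCSV interface (1.9 and later) with its :col_sep and :row_sep options; the helper names are hypothetical.

```ruby
require "csv"

# Set only the row separator; the field separator keeps its default (",").
def crlf_line(fields)
  CSV.generate_line(fields, row_sep: "\r\n")
end

# Set only the field separator; the row separator is left alone.
def semicolon_rows(text)
  CSV.parse(text, col_sep: ";")
end

p crlf_line(%w[a b c])                # "a,b,c\r\n"
p semicolon_rows("1;2;3\n4;5;6\n")    # [["1", "2", "3"], ["4", "5", "6"]]

# In this interface generate_line() does end the line it produces.
p CSV.generate_line(%w[a b c])        # "a,b,c\n"
```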

2. We could drop compatibility and rename FasterCSV to CSV. This way
people get all the good stuff where they expect it. However, this would

Can you please explain what are "good"? I'll introduce those features
into csv.rb.

Here's a selection of some features from my CHANGELOG that I am not aware of in CSV:

* Added built-in and custom data converters. Built-in handle numbers and dates.
* Added auto-discovery for <tt>:row_sep</tt> (now the default).
* Added FasterCSV::filter() for easy Unix-like CSV filters.
* Added support for accessing fields by headers.
   * Headers can have their own converters.
   * Headers can be skipped or returned as needed.
   * FasterCSV::Row allows index or header access while retaining order and
     allowing for duplicate headers.
* <tt>:headers</tt> can now be set to an Array of headers to use.
* <tt>:headers</tt> can now be set to an external CSV String of headers to use.
* Provided support for the serialization of custom Ruby objects using CSV.
* Added FasterCSV::instance and FasterCSV()/FCSV() shortcuts for easy output.
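A short sketch of the header and converter features from that list, assuming a Ruby whose csv library follows the FasterCSV interface (1.9 and later). The sample data and the helper names load_table and rows_with_given_headers are made up for illustration.

```ruby
require "csv"

SAMPLE = "name,qty,price\nwidget,3,1.5\ngadget,10,20\n"

# Treat the first line as headers and convert numeric-looking fields.
def load_table(text)
  CSV.parse(text, headers: true, converters: :numeric)
end

# Supply the headers as an Array when the data itself has no header line.
def rows_with_given_headers(text, headers)
  CSV.parse(text, headers: headers)
end

first = load_table(SAMPLE)[0]
p first["name"]   # "widget"  (access by header)
p first["qty"]    # 3         (converted to an Integer)
p first[2]        # 1.5       (access by index still works)

p rows_with_given_headers("1,2\n", %w[a b]).first.to_h  # {"a"=>"1", "b"=>"2"}
```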

James Edward Gray II

On Jun 4, 2006, at 5:30 AM, NAKAMURA, Hiroshi wrote:

Hi James,

James Edward Gray II wrote:

Please do not waste any more of your time. (Sorry for writing this. I know
you spend a lot of time supporting people who use CSV in Ruby.)
The cracks come from differences in our standpoints on CSV, so it cannot be
100% compatible.

Do we have different standpoints? I hope not too different. We're just
using different parsing techniques, right?

As you wrote in your document, the following come from our standpoints, I think:
* streaming
* record terminator handling

I don't think faster_csv is wrong. I just wrote csv.rb from a (slightly)
different viewpoint 6 years ago.

Other than to_csv() and parse_csv(), are there things you don't like
about FasterCSV? I'm open to suggestions.

No. That's all for now. (Sorry, I've not yet looked into the new CSV features.)

Just replace csv.rb with faster_csv.rb.

I just don't want to break a lot of software. :(

I understand that it's the price of the speed.

As a bundled library (in my opinion):

  One thing I don't like about faster_csv.rb is String#parse_csv and
  Array#to_csv. Please do not bring pollution to standard classes.

  Kernel.CSV should be discussed well before introducing it. Needed?
  (We already have Kernel.URI though...)

Maybe I'm alone in this thinking, but I'm not bothered by conversion
methods like this. It's also fairly common (to_set(), to_yaml(), etc.).

I don't like those either. We should wait for selector namespaces. (IMO)

Regards,
// NaHi

Hi,

James Edward Gray II wrote:

James Edward Gray II wrote:

method arguments, we could get pretty close to perfect, but CSV does
some odd things like confuse open() with foreach() that I chose to avoid
in FasterCSV. Because of that, I can't always be sure what to do when

Can you please explain what are "odd"?

My biggest complaint with CSV is that open() behaves "oddly" and thus
defeats all my normal expectations:

>> File.open("example.csv", "w") do |csv|
?> csv.puts "1,2,3"
>> csv.puts "a,b,c"
>> end
=> nil
>> require "csv"
=> true
>> # typical Ruby style reading...
?> File.open("example.csv") do |file|
?> file.each { |row| p row }
>> end
"1,2,3\n"
"a,b,c\n"
=> #<File:example.csv (closed)>
>> # or...
?> File.foreach("example.csv") do |row|
?> p row
>> end
"1,2,3\n"
"a,b,c\n"
=> nil
>> # CSV's "odd" open() method...
>> CSV.open("example.csv", "r") do |row| # "r" required
?> p row # we get rows, not the file object
>> end
["1", "2", "3"]
["a", "b", "c"]
=> nil

Of course, if you open in a writing mode, you do get a file like
object. It's inconsistent.

I can understand your frustration about this point. When I first wrote
csv.rb, I thought that if I defined a reader style, all CSV users would
end up writing the following.

  CSV.open("filename.csv", "r") do |reader|
    reader.each do |row|
      ...do something...
    end
  end

Why don't we just write it like this:

  CSV.open("filename.csv", "r") do |row|
    ...do something...
  end

I know you are considering that IO-ish methods are important. But I
don't think CSV object should handle IO methods like fcntl, fileno,
seek, tell, tty?, and so on. Would you please tell me typical and
pragmatic examples of reader style, except 'each'?

I'm confused about why CSV does this, since it offers the foreach()
method, which normally fills this role.

foreach and readlines were added recently, following IO. Now I think it
was a bad choice though...

Other CSV oddities (my opinion):

Thanks!

* I always have to think, "Now do I want the *_line() method or the
*_row() method here..."

Users don't need to use *_line and *_row methods I think. When do you
use generate_line?

* Most methods take a field separator and a row separator, but
foreach() and readlines() only take the row separator.

See IO.foreach and IO.readlines. But as I wrote above, CSV should not
have these methods...

* I have to set a field separator when I really just want to set a row
separator.

csv.rb in the svn repository supports a pseudo-keyword-like method argument
style. I'll merge it into Ruby's csv repository before the next release.
http://dev.ctor.org/csv/browser/trunk/lib/csv.rb

# I defined keywords :fs and :rs but it should be :col_sep and :row_sep
# in conformity with faster_csv.

* A method called "generate_line()" doesn't involve a line ending.

Do not use it. :) At least, users rarely use it, I think.

I hope that
  csv.rb's open + read + block does not work as you expected
is the only big frustration with csv.rb (...if csv.rb is
fast enough). :)

2. We could drop compatibility and rename FasterCSV to CSV. This way
people get all the good stuff where they expect it. However, this would

Can you please explain what are "good"? I'll introduce those features
into csv.rb.

Here's a selection of some features from my CHANGELOG that I am not
aware of in CSV:

Thanks. I'll look into this. I hope those features are pluggable into
csv.rb and other modules like DBI, spreadsheet related things, HTML
table formatters, etc. I think some of these features are table
specific, not CSV.

Regards,
// NaHi

Only if you go through that interface. The FasterCSV interface is still quite quick.

Let me rethink it a little. It was optimized for developer productivity when I built it. I might be able to do better looking at it from the idea of easy transitioning for the users.

Of course, I handle open() quite differently, so we're going to have problems merging both models of that method in the CSV class. Hmm...

James Edward Gray II

On May 27, 2006, at 9:09 PM, NAKAMURA, Hiroshi wrote:

Just replace csv.rb with faster_csv.rb.

I just don't want to break a lot of software. :(

I understand that it's the price of the speed.