Fastest CSV parsing?

This is the best I've come up with so far. It should handle any CSV record
(i.e., fields may contain commas, double quotes, and newlines).

class String
  def csv
    if include? '"'
      ary =
        "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
    else
      ary = chomp.split( /,/, -1)
      ## "".csv ought to be [""], not [], just as
      ## ",".csv is ["",""].
      if [] == ary
        [""]
      else
        ary
      end
    end
  end
end
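For concreteness, here is the method exercised on a few records (the method is repeated in full so the snippet runs standalone; the sample strings are invented):

```ruby
class String
  def csv
    if include? '"'
      # Quoted-field path: scan consecutive fields anchored with \G;
      # $' is empty only if the whole record was consumed.
      ary = "#{chomp},".scan(/\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/)
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map { |a| a[1] || a[0].gsub(/""/, '"') }
    else
      # No quotes: a plain split, keeping trailing empty fields.
      ary = chomp.split(/,/, -1)
      ary.empty? ? [""] : ary
    end
  end
end

p 'a,"b,""c""",d'.csv   # => ["a", "b,\"c\"", "d"]
p ",".csv               # => ["", ""]
p "".csv                # => [""]
```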

You are pretty much rewriting FasterCSV here. Why do that when we could just use it instead?

James Edward Gray II

···

On Aug 16, 2007, at 2:35 PM, William James wrote:

This is the best I've come up with so far. It should handle any CSV
record
(i.e., fields may contain commas, double quotes, and newlines).

class String
  def csv
    if include? '"'
      ary =
        "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
    else
      ary = chomp.split( /,/, -1)
      ## "".csv ought to be [""], not , just as
      ## ",".csv is ["",""].
      if == ary
        [""]
      else
        ary
      end
    end
  end
end

James Edward Gray II wrote:

> This is the best I've come up with so far. It should handle any CSV
> record
> (i.e., fields may contain commas, double quotes, and newlines).
>
> class String
>   def csv
>     if include? '"'
>       ary =
>         "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
>       raise "Bad csv record:\n#{self}" if $' != ""
>       ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
>     else
>       ary = chomp.split( /,/, -1)
>       ## "".csv ought to be [""], not [], just as
>       ## ",".csv is ["",""].
>       if [] == ary
>         [""]
>       else
>         ary
>       end
>     end
>   end
> end

You are pretty much rewriting FasterCSV here. Why do that when we
could just use it instead?

That is a dishonest comment.

What if someone had said to you when you released "FasterCSV":
"You are pretty much rewriting CSV here. Why do that when we
could just use it instead?"

Parsing CSV isn't very difficult.
"FasterCSV" is too slow and far too large. People don't need
to be installing it on their systems when a few lines of code
will do the job.

Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won't be paid any more money.

···

On Aug 16, 2007, at 2:35 PM, William James wrote:

William James wrote:

What if someone had said to you when you released "FasterCSV":
"You are pretty much rewriting CSV here. Why do that when we
could just use it instead?"

Point made. However...

Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won't be paid any more money.

From JEG2's own blog post:

"If your number one concern when working with CSV data in Ruby is raw
speed, you might want to know that FasterCSV is no longer the fastest
option."

Your code may or may not be faster -- you've offered no comparison.
Regardless, I doubt that JEG2 was trying to stifle your efforts; just
suggesting that you may want to avoid reinventing the wheel.

David

···

--
Posted via http://www.ruby-forum.com/.

William James wrote:

Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won't be paid any more money.
  

Hmmm...and here I was thinking that FasterCSV was free software. Have you identified some way that James is making money from it? Have you identified some way that using FasterCSV is hurtful?

William, there's no need to be so angry. We're all here to help each other.

···

--
RMagick OS X Installer [http://rubyforge.org/projects/rmagick/]
RMagick Hints & Tips [http://rubyforge.org/forum/forum.php?forum_id=1618]
RMagick Installation FAQ [http://rmagick.rubyforge.org/install-faq.html]

James Edward Gray II wrote:

This is the best I've come up with so far. It should handle any CSV
record
(i.e., fields may contain commas, double quotes, and newlines).

class String
  def csv
    if include? '"'
      ary =
        "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
    else
      ary = chomp.split( /,/, -1)
      ## "".csv ought to be [""], not , just as
      ## ",".csv is ["",""].
      if == ary
        [""]
      else
        ary
      end
    end
  end
end

You are pretty much rewriting FasterCSV here. Why do that when we
could just use it instead?

That is a dishonest comment.

Not honest? I guess I'm not sure how you meant that.

FasterCSV's parser uses a very similar regular expression. Quoting from the source:

     # prebuild Regexps for faster parsing
     @parsers = {
       :leading_fields =>
         /\A(?:#{Regexp.escape(@col_sep)})+/, # for empty leading fields
       :csv_row =>
         ### The Primary Parser ###
         / \G(?:^|#{Regexp.escape(@col_sep)}) # anchor the match
           (?: "((?>[^"]*)(?>""[^"]*)*)" # find quoted fields
                | # ... or ...
               ([^"#{Regexp.escape(@col_sep)}]*) # unquoted fields
               )/x,
         ### End Primary Parser ###
       :line_end =>
         /#{Regexp.escape(@row_sep)}\z/ # safer than chomp!()
     }

I felt they were similar enough to say you were recreating it. I can live with it if you don't agree though.

What if someone had said to you when you released "FasterCSV":
"You are pretty much rewriting CSV here. Why do that when we
could just use it instead?"

They did. I said it was too slow and I didn't care for the interface, though some do prefer it. Pretty much what you just said to me, so I look forward to using your EvenFasterCSV library on my next project.

Parsing CSV isn't very difficult.

Yeah, it's not too tough.

I'm a little bothered by how your solution makes me slurp the data into a String though. Today I was working with a CSV file with over 35,000 records in it, so I'm not too comfortable with that. You might consider adding a little code to ease that.

Also, I really prefer to work with CSV by headers, instead of column indices. That's easier and more robust, in my opinion. You might want to add some code for that too.
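(For reference: FasterCSV was later merged into Ruby's standard library as csv in 1.9, so the header-based style JEG2 describes looks roughly like this with the modern stdlib; the sample data is invented:)

```ruby
require 'csv'

# Invented sample data. headers: true makes each row addressable by
# column name instead of numeric index.
data = "name,age\nAlice,30\nBob,25\n"

CSV.parse(data, headers: true).each do |row|
  puts "#{row['name']} is #{row['age']}"
end
# prints:
#   Alice is 30
#   Bob is 25
```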

Of course, then we're just getting closer and closer to FasterCSV, so maybe not...

"FasterCSV" is too slow and far too large.

FasterCSV is mostly interface code to make the user experience as nice as possible. There's also a lot of documentation in there. The core parser is still way smaller than the standard library's parser.

James Edward Gray II

···

On Aug 16, 2007, at 7:44 PM, William James wrote:

On Aug 16, 2007, at 2:35 PM, William James wrote:

Hi,

James Edward Gray II wrote:
>
> > This is the best I've come up with so far. It should handle any CSV
> > record
> > (i.e., fields may contain commas, double quotes, and newlines).
>
> You are pretty much rewriting FasterCSV here. Why do that when we
> could just use it instead?

That is a dishonest comment.

Coding is a kind of sports to me. Besides that it is not my
decision what you do with your time.

I often recode things that are already written and
well-proofed. Sometimes my code is better, sometimes I learn
from comparing it. Sometimes it brings just new ideas to me.

So go on and do what you like; I think that's still the main
purpose of open source. Though it might not have been an
actually valuable contribution to this list I had fun
reading your solution.

Bertram

···

On Friday, 17 Aug 2007, 09:44:58 +0900, William James wrote:

> On Aug 16, 2007, at 2:35 PM, William James wrote:

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de

On Behalf Of David Mullet:
# http://blog.grayproductions.net/articles/2007/04/16/no-longer-the-fastest-game-in-town

and JEG2 gave a valuable hint on producing a fast scanner (be it for scanning CSV or anything else): use the humble and underestimated StringScanner...
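A minimal sketch of that StringScanner idea (this is not FasterCSV's actual code; the scan_csv name and the regexes are invented for illustration, and records with embedded newlines are not handled):

```ruby
require 'strscan'

def scan_csv(line)
  s = StringScanner.new(line.chomp + ",")  # trailing comma terminates the last field
  fields = []
  until s.eos?
    if s.scan(/"((?:[^"]|"")*)",/)         # quoted field; "" is an escaped quote
      fields << s[1].gsub('""', '"')
    elsif s.scan(/([^,"]*),/)              # plain field
      fields << s[1]
    else
      raise "Bad csv record:\n#{line}"
    end
  end
  fields
end

p scan_csv(%Q{a,"b,""c""",d})   # => ["a", "b,\"c\"", "d"]
```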

kind regards -botp

Just a pointer to yet another CSV parsing regex: http://snippets.dzone.com/posts/show/4430

Cheers,

b.

···

On 17 Aug., 04:38, Bertram Scharpf <li...@bertram-scharpf.de> wrote:
Hi,

On Friday, 17 Aug 2007, 09:44:58 +0900, William James wrote:

> James Edward Gray II wrote:
> > On Aug 16, 2007, at 2:35 PM, William James wrote:

> > > This is the best I've come up with so far. It should handle any CSV
> > > record
> > > (i.e., fields may contain commas, double quotes, and newlines).

> > You are pretty much rewriting FasterCSV here. Why do that when we
> > could just use it instead?

> That is a dishonest comment.

Coding is a kind of sports to me. Besides that it is not my
decision what you do with your time.

I often recode things that are already written and
well-proofed. Sometimes my code is better, sometimes I learn
from comparing it. Sometimes it brings just new ideas to me.

So go on and do what you like; I think that's still the main
purpose of open source. Though it might not have been an
actually valuable contribution to this list I had fun
reading your solution.

Bertram

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de