Faster CSV parsing

## Read, parse, and create csv records.

# The program conforms to the csv specification at this site:
# http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# The only extra is that you can change the field-separator.
# For a field-separator other than a comma, for example
# a semicolon:
# Csv.fs=";"

#
# After a record has been read and parsed,
# Csv.string contains the record in raw string format.
#

class Csv

  def Csv.unescape( array )
    array.map{|x| x.gsub( /""/, '"' ) }
  end

  @@fs = ","

  # Set regexp for parse.
  # @@fs is the field-separator, which must be
  # a single character.
  def Csv.make_regexp
    # Escape the separator so that characters such as ^ or ]
    # are safe inside the character classes below.
    fs = Regexp.escape( @@fs )

    @@regexp =
    ## Assumes embedded quotes are escaped as "".
      %r{
           \G ## Anchor at end of previous match.
           [ \t]* ## Leading spaces or tabs are discarded.
           (?:
               ## For a field in quotes.
               " ( [^"]* (?: "" [^"]* )* ) " |
               ## For a field not in quotes.
                 ( [^"\n#{fs}]*? )
           )
           [ \t]*
           [#{fs}]
        }mx

    ## When get_rec finds after reading a line that the record isn't
    ## complete, this regexp will be used to decide whether to read
    ## another line or to raise an exception.
    @@reading_regexp =
      %r{
          \A # Anchor at beginning of string.
          (?:
             [ \t]*
             (?:
                 " [^"]* (?: "" [^"]* )* " |
                    [^"\n#{fs}]*?
             )
             [ \t]*
             [#{fs}]
          )*
          [ \t]*
          " [^"]* (?: "" [^"]* )*
          \Z # Anchor at end of string.
        }mx

  end # def make_regexp

  Csv.make_regexp

  def Csv.parse( s )
    ary = (s + @@fs).scan( @@regexp )
    raise "\nBad csv record:\n#{s}\n" if $' != ""
    Csv.unescape( ary.flatten.compact )
  end

  @@string = nil

  def Csv.get_rec( file )
    return nil if file.eof?
    @@string = ""
    begin
      if @@string.size>0
        raise "\nBad record:\n#{@@string}\n" if
          @@string !~ @@reading_regexp
        raise "\nPremature end of csv file." if file.eof?
      end
      @@string += file.gets
    end until @@string.count( '"' ) % 2 == 0
    @@string.chomp!
    Csv.parse( @@string )
  end

  def Csv.string
    @@string
  end

  def Csv.fs=( s )
    raise "\nCsv.fs must be a single character.\n" if s.size != 1
    @@fs = s
    Csv.make_regexp
  end

  def Csv.fs
    @@fs
  end

  def Csv.to_csv( array )
    s = ''
    array.map { |item|
      str = item.to_s
      # Quote the string if it contains the field-separator or
      # a " or a newline or a carriage-return, or if it has leading or
      # trailing whitespace.
      if str.index(@@fs) or /^\s|["\r\n]|\s$/.match(str)
        str = '"' + str.gsub( /"/, '""' ) + '"'
      end
      str
    }.join(@@fs)
  end

end # class Csv
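
The core of `Csv.parse` above is a `\G`-anchored scan over the record with one extra separator appended, so that every field ends the same way. A minimal standalone sketch of that idea, assuming a fixed comma separator; the names `FIELD` and `split_record` are illustrative and not part of the posted class:

```ruby
# Same field pattern as in the post, written on one line with "," hard-coded.
# Capture 1 is the body of a quoted field, capture 2 an unquoted field.
FIELD = /\G[ \t]*(?:"([^"]*(?:""[^"]*)*)"|([^"\n,]*?))[ \t]*,/m

def split_record(line)
  padded = line + ","              # every field now ends with a separator
  pairs  = padded.scan(FIELD)      # [[quoted, bare], ...]; one side is nil
  # After scan, $' holds whatever follows the last successful match.
  leftover = pairs.empty? ? padded : $'
  raise ArgumentError, "Bad csv record: #{line}" unless leftover.empty?
  # Drop the nil halves, then turn doubled quotes back into single ones.
  pairs.flatten.compact.map { |field| field.gsub('""', '"') }
end

p split_record('a,"b ""c""",d')
p split_record('x, y ,z')
```

Because `\G` forces each match to start where the previous one ended, any text the pattern cannot consume is left over after the scan, which is how a malformed record gets detected.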

Thanks for sharing this. You claim 'faster' CSV parsing. Faster than what, and by how much? Got any benchmark results to share?

Gavin Kistner wrote:

> ## Read, parse, and create csv records.
>
> # The program conforms to the csv specification at this site:
> # http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
> # The only extra is that you can change the field-separator.
> # For a field-separator other than a comma, for example
> # a semicolon:
> # Csv.fs=";"
> #
> # After a record has been read and parsed,
> # Csv.string contains the record in raw string format.

> Thanks for sharing this. You claim 'faster' CSV parsing. Faster than
> what,

Than Ruby's standard-lib csv.rb.

> and by how much? Got any benchmark results to share?

For my test file of 1,964,211 bytes, it's about 6.4 times as fast.
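
The test file and timing script behind the 6.4x figure were not posted. For anyone wanting to run a comparison of their own, here is a rough timing harness; it is only a sketch, and the method name `time_parsers`, the sample data, and the two example parsers are assumptions made for illustration:

```ruby
# Times each parser (any object responding to #call) over the same lines
# and returns the best wall-clock time per parser, in seconds.
def time_parsers(parsers, lines, rounds: 3)
  parsers.transform_values do |parser|
    best = Float::INFINITY
    rounds.times do
      t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      lines.each { |line| parser.call(line) }
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
      best = elapsed if elapsed < best
    end
    best
  end
end

# Made-up sample records; a real comparison should use a real file.
lines = Array.new(10_000) { |i| %(#{i},"quoted ""text""",plain) }

field = /\G[ \t]*(?:"([^"]*(?:""[^"]*)*)"|([^"\n,]*?))[ \t]*,/m
parsers = {
  "naive split" => ->(l) { l.split(",") },
  "regex scan"  => ->(l) { (l + ",").scan(field) },
  # To compare against the standard library, add after require "csv":
  #   "csv.rb" => ->(l) { CSV.parse_line(l) }
}

time_parsers(parsers, lines).each do |name, secs|
  puts format("%-12s %.4fs", name, secs)
end
```

Results depend heavily on the Ruby version and on how many fields are quoted, so a single ratio from one file should be read with that in mind.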

···

On Oct 29, 2005, at 10:07 PM, William James wrote:

I didn't look too closely at it nor did I test it but your use of class variables seems not necessary here. I'd prefer to transform them into instance variables. Makes your code much more robust...

Kind regards

    robert
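
One way to read the robustness concern: with class variables there is a single field-separator for the whole program, so two callers that need different separators interfere with each other. A small sketch of the difference, using hypothetical `SharedSplitter` and `OwnSplitter` classes and plain `String#split` in place of real CSV parsing:

```ruby
# Class-level state: one separator shared by every caller.
class SharedSplitter
  @@fs = ","

  def self.fs=(s)
    @@fs = s
  end

  def self.split(line)
    line.split(@@fs)
  end
end

# Instance-level state: each parser keeps its own separator.
class OwnSplitter
  def initialize(fs = ",")
    @fs = fs
  end

  def split(line)
    line.split(@fs)
  end
end

SharedSplitter.fs = ";"            # set for a semicolon file somewhere else...
p SharedSplitter.split("a,b;c")    # ...and every other caller now splits on ";"

commas = OwnSplitter.new
semis  = OwnSplitter.new(";")
p commas.split("a,b;c")            # unaffected by the other instance
p semis.split("a,b;c")
```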

William James wrote:

>> Thanks for sharing this. You claim 'faster' CSV parsing. Faster than
>> what,
>
> Than Ruby's standard-lib csv.rb.
>
>> and by how much? Got any benchmark results to share?
>
> For my test file of 1,964,211 bytes, it's about 6.4 times as fast.

What is this missing compared with the standard csv.rb?

Also, why did you choose to put everything (methods, variables) at the class level instead of the instance level?

Robert Klemme wrote:

···

William James <w_a_x_man@yahoo.com> wrote:
> Gavin Kistner wrote:
>> On Oct 29, 2005, at 10:07 PM, William James wrote:
>>> ## Read, parse, and create csv records.
>>>
>>> # The program conforms to the csv specification at this site:
>>> # http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
>>> # The only extra is that you can change the field-separator.
>>> # For a field-separator other than a comma, for example
>>> # a semicolon:
>>> # Csv.fs=";"
>>> #
>>> # After a record has been read and parsed,
>>> # Csv.string contains the record in raw string format.
>>
>> Thanks for sharing this. You claim 'faster' CSV parsing. Faster than
>> what,
>
> Than Ruby's standard-lib csv.rb.
>
>> and by how much? Got any benchmark results to share?
>
> For my test file of 1,964,211 bytes, it's about 6.4 times as fast.

I didn't look too closely at it nor did I test it but your use of class
variables seems not necessary here. I'd prefer to transform them into
instance variables. Makes your code much more robust...

Kind regards

    robert

If anyone wants to make it more robust, he is free to do so.
I have little need for csv parsing, and I don't want to spend
much more time on this.

The people on Ruby Core who are trying to speed up CSV parsing
could use this as a starting point.

I'm sending this message a second time. Seems that my client
has messed up the first. Sorry for the noise.

William James wrote:

[...]

> For my test file of 1,964,211 bytes, it's about 6.4 times as
> fast.

what things is this missing wrt standard csv.rb?

Also, why you did choose to make all of the stuff (methods,
variables) at class level instead of instance ?

A more OO (and equally fast) version:

## Read, parse, and create csv records.

# The program conforms to the csv specification at this site:
# http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# The only extra is that you can change the field-separator.
# For a field-separator other than a comma, for example
# a semicolon:
# csv.fs=";" # csv is a FastCsv instance

#
# After a record has been read and parsed,
# csv.string contains the record in raw string format.
#

class FastCsv

    def self.foreach(filename)
        csv = self.new
        open filename do |file|
            while record = csv.get_rec(file)
                yield record
            end
        end
    end

    def initialize(fs = ",")
        self.fs = fs
        @string = nil
    end

    def fs=(s)
        raise "fs must be a single character." if s.size != 1
        @fs = s.dup
        make_regexp
    end

    def fs
        @fs.dup
    end

    def parse(s)
        ary = (s + @fs).scan(@regexp)
        raise "Bad csv record:\n#{s}\n" if $' != ""
        unescape(ary.flatten.compact)
    end

    def get_rec(file)
        return nil if file.eof?
        @string = ""
        begin
            if @string.size > 0
                raise "Bad record:\n#@string\n" if @string !~ @reading_regexp
                raise "Premature end of csv file." if file.eof?
            end
            @string += file.gets
        end until @string.count('"') % 2 == 0
        @string.chomp!
        parse(@string)
    end

    def string
        @string
    end

    def to_csv(array)
        s = ''
        array.map { |item|
            str = item.to_s
            # Quote the string if it contains the field-separator or
            # a " or a newline or a carriage-return, or if it has leading or
            # trailing whitespace.
            if str.index(@fs) or /^\s|["\r\n]|\s$/.match(str)
                str = '"' + str.gsub( /"/, '""' ) + '"'
            end
            str
        }.join(@fs)
    end

    private

    def unescape(array)
        array.map { |x| x.gsub(/""/, '"') }
    end

    # Set regexp for parse.
    # @fs is the field-separator, which must be
    # a single character.
    def make_regexp
        # Escape the separator so that characters such as ^ or ]
        # are safe inside the character classes below.
        fs = Regexp.escape(@fs)

        @regexp =
            ## Assumes embedded quotes are escaped as "".
            %r{
           \G ## Anchor at end of previous match.
           [ \t]* ## Leading spaces or tabs are discarded.
           (?:
               ## For a field in quotes.
               " ( [^"]* (?: "" [^"]* )* ) " |
               ## For a field not in quotes.
                 ( [^"\n#{fs}]*? )
           )
           [ \t]*
           [#{fs}]
        }mx

        ## When get_rec finds after reading a line that the record isn't
        ## complete, this regexp will be used to decide whether to read
        ## another line or to raise an exception.
        @reading_regexp =
            %r{
          \A # Anchor at beginning of string.
          (?:
             [ \t]*
             (?:
                 " [^"]* (?: "" [^"]* )* " |
                    [^"\n#{fs}]*?
             )
             [ \t]*
             [#{fs}]
          )*
          [ \t]*
          " [^"]* (?: "" [^"]* )*
          \Z # Anchor at end of string.
        }mx
    end # make_regexp

end # class FastCsv

if $0 == __FILE__
    FastCsv.foreach("example.csv") { |record| p record }
end
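
The quoting rule in `to_csv` can be tried in isolation. The sketch below restates it as two free-standing helpers, `quote_field` and `to_csv_line`; both names are made up for this illustration. It uses `\A` and `\z` rather than `^` and `$`, which gives the same result here because any field containing a newline is quoted anyway:

```ruby
# Quote a field if it contains the separator, a double quote, a CR or LF,
# or if it starts or ends with whitespace; double any embedded quotes.
def quote_field(item, fs = ",")
  str = item.to_s
  needs_quotes = str.include?(fs) || str.match?(/\A\s|["\r\n]|\s\z/)
  needs_quotes ? '"' + str.gsub('"', '""') + '"' : str
end

def to_csv_line(array, fs = ",")
  array.map { |item| quote_field(item, fs) }.join(fs)
end

puts to_csv_line(["plain", 'say "hi"', "a,b", " padded", 42])
# prints: plain,"say ""hi""","a,b"," padded",42
```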

Ara.T.Howard posted a test framework for a bunch of edge cases on Ruby Core, based on the CSV RFC (http://www.ietf.org/rfc/rfc4180.txt). Here's that framework modified to work with your library:

Neo:~/Desktop$ cat fast_csv_tests.rb
require 'pp'
require 'csv'
require 'fast_csv'

tests = [
   [
     %( a,b ),
     ["a", "b"]
   ],
   [
     %( a,"""b""" ),
     ["a", "\"b\""]
   ],
   [
     %( a,"""b" ),
     ["a", "\"b"]
   ],
   [
     %( a,"b""" ),
     ["a", "b\""]
   ],
   [
     %( a,"
b""" ),
     ["a", "\nb\""]
   ],
   [
     %( a,"""
b" ),
     ["a", "\"\nb"]
   ],
   [
     %( a,"""
b
""" ),
     ["a", "\"\nb\n\""]
   ],
   [
     %( a,"""
b
""",
c ),
     ["a", "\"\nb\n\"", nil]
   ],
   [
     %( a, ),
     ["a", nil, nil, nil]
   ],
   [
     %( , ),
     [nil, nil]
   ],
   [
     %( "","" ),
     ["", ""]
   ],
   [
     %( """" ),
     ["\""]
   ],
   [
     %( """","" ),
     ["\"",""]
   ],
   [
     %( ,"" ),
     [nil,""]
   ],
   [
     %( \r,"\r" ),
     [nil,"\r"]
   ],
   [
     %( "\r\n," ),
     ["\r\n,"]
   ],
   [
     %( "\r\n,", ),
     ["\r\n,", nil]
   ],
]

impls = CSV, FastCsv.new

tests.each_with_index do |test, idx|
   input, expected = test
   csv = nil
   impls.each do |impl|
     begin
       if impl == CSV
         csv = impl::parse_line input.strip
       else
         csv = impl.parse input.strip
       end
       raise "FAILED" unless csv == expected
     rescue => e
       puts "=" * 42
       puts "#{ impl }[#{ idx }] => #{ e.message } (#{ e.class })"
       puts "=" * 42
       puts "input:\n#{ PP::pp input.strip, '' }"
       puts "csv:\n#{ PP::pp csv, '' }"
       puts "expected:\n#{ PP::pp expected, '' }"
       puts "=" * 42
       puts
     end
   end
end

__END__
Neo:~/Desktop$ ruby fast_csv_tests.rb

==========================================
#<FastCsv:0x33d130>[7] => Bad csv record:
a,"""
b
""",
c
(RuntimeError)

input:
"a,\"\"\"\nb\n\"\"\",\nc"
csv:
["a", "\"\nb\n\"", nil]
expected:
["a", "\"\nb\n\"", nil]

==========================================
#<FastCsv:0x33d130>[8] => FAILED (RuntimeError)

input:
"a,"
csv:
["a", "", "", ""]
expected:
["a", nil, nil, nil]

==========================================
#<FastCsv:0x33d130>[9] => FAILED (RuntimeError)

input:
","
csv:
["", ""]
expected:
[nil, nil]

==========================================
#<FastCsv:0x33d130>[13] => FAILED (RuntimeError)

input:
",\"\""
csv:
["", ""]
expected:
[nil, ""]

==========================================
#<FastCsv:0x33d130>[14] => FAILED (RuntimeError)

input:
",\"\r\""
csv:
["", "\r"]
expected:
[nil, "\r"]

==========================================
#<FastCsv:0x33d130>[16] => FAILED (RuntimeError)

input:
"\"\r\n,\","
csv:
["\r\n,", ""]
expected:
["\r\n,", nil]

James Edward Gray II

My latest offering to Ruby Core has been:

require 'stringio'

def parse_csv( data )
   io = if data.is_a?(IO) then data else StringIO.new(data) end
   line = ""

   loop do
     line += io.gets
     parse = line.dup
     parse.chomp!

     csv = if parse.sub!(/\A,+/, "") then [nil] * $&.length else Array.new end
     parse.gsub!(/\G(?:^|,)(?:"((?>[^"]*)(?>""[^"]*)*)"|([^",]*))/) do
       csv << if $1.nil?
         if $2 == "" then nil else $2 end
       else
         $1.gsub('""', '"')
       end
       ""
     end

     break csv if parse.empty?
   end
end

Which is passing all the edge cases they have thrown at it so far and is very similar to the speed you achieved.

James Edward Gray II

···

On Oct 30, 2005, at 9:02 AM, William James wrote:

The people on Ruby Core who are trying to speed up CSV parsing
could use this as a starting point.

...

Doesn't look good :(
I just took William James' code and converted the class
variables/methods to instance variables/methods.

Regards,
  Stefan
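
Most of the FastCsv failures listed above (tests 8, 9, 13, 14 and 16) differ from the expected result only in that an empty unquoted field comes back as `""` instead of `nil`. Since the pattern already captures quoted and unquoted fields in separate groups, one possible adjustment is to map them differently. This is a sketch with a fixed comma separator and a hypothetical method name, `parse_with_nils`; it is not a change anyone in the thread posted:

```ruby
FIELD_RE = /\G[ \t]*(?:"([^"]*(?:""[^"]*)*)"|([^"\n,]*?))[ \t]*,/m

def parse_with_nils(line)
  rest     = line + ","
  fields   = []
  consumed = 0
  rest.scan(FIELD_RE) do |quoted, bare|
    consumed = Regexp.last_match.end(0)   # how far the scan has got
    fields <<
      if quoted
        quoted.gsub('""', '"')            # quoted: keep "", unescape quotes
      elsif bare.empty?
        nil                               # empty and unquoted: nil
      else
        bare
      end
  end
  raise ArgumentError, "Bad csv record: #{line}" unless consumed == rest.size
  fields
end

p parse_with_nils(',""')      # distinguishes nil from ""
p parse_with_nils('a,,')
```

Test 7, where a newline follows the closing quote and separator, fails for a different reason and is not addressed by this change.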

···

On Sunday 30 October 2005 17:05, James Edward Gray II wrote:

On Oct 30, 2005, at 8:55 AM, Stefan Lang wrote:
> On Sunday 30 October 2005 14:52, gabriele renzi wrote:
>> William James wrote:
>
> [...]
>
>>> For my test file of 1,964,211 bytes, it's about 6.4 times as
>>> fast.
>>
>> what things is this missing wrt standard csv.rb?
>>
>> Also, why you did choose to make all of the stuff (methods,
>> variables) at class level instead of instance ?
>
> A more OO (and equally fast) version:

Ara.T.Howard posted a test framework for a bunch of edge cases on
Ruby Core, based on the CSV RFC
(http://www.ietf.org/rfc/rfc4180.txt). Here's that framework modified
to work with your library: