Why is csv file processing so slow?

mepython · 28 January 2005 15:35

I want to process a csv file. Here is a small program in Python and Ruby:

[root@taamportable GMS]# cat x.py
import csv
reader = csv.reader(file('x.csv'))
header = reader.next()
count = 0
for data in reader:
    count += 1
print count

[root@taamportable GMS]# cat x.rb
require 'csv'
reader = CSV.open('x.csv', 'r')
header = reader.shift
count = 0
reader.each {|data|
  count += 1
}
p count

*******************************************************
Here is the processing time. As you can see, Ruby is way too slow. Is there
anything that can be done about the Ruby code?
*******************************************************
[root@taamportable GMS]# time python x.py
26907

real 0m0.311s
user 0m0.302s
sys 0m0.009s

[root@taamportable GMS]# time ruby x.rb
26907

real 1m48.296s
user 1m36.853s
sys 0m11.188s

Robert · 28 January 2005 15:50

First I'd try to figure out whether it's IO that's slow or CSV. Did you test
with something like this?

File.open('x.csv') do |reader|
  count = 0
  reader.each {|data| count += 1}
  p count
end

Does it make a huge difference?
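
A minimal sketch of that comparison, timing plain line iteration against the csv library on the same file. The sample file, its contents and the helper names (make_sample, count_lines, count_records) are made up for illustration. Note that the csv library shipped with current Ruby versions is not the 2005 csv.rb, so the timings in this thread should not be expected to reproduce.

```ruby
require 'benchmark'
require 'csv'
require 'tempfile'

# Build a throwaway CSV file so the sketch is self-contained
# (file name and row contents are assumptions, not the poster's data).
def make_sample(rows)
  file = Tempfile.new(['sample', '.csv'])
  file.puts 'a,b,c'
  rows.times { |i| file.puts "#{i},#{i + 1},#{i + 2}" }
  file.close
  file
end

# Count lines using plain IO only, with no parsing at all.
def count_lines(path)
  count = 0
  File.foreach(path) { count += 1 }
  count
end

# Count records using the standard csv library.
def count_records(path)
  count = 0
  CSV.foreach(path) { count += 1 }
  count
end

sample = make_sample(1_000)
io_time  = Benchmark.realtime { count_lines(sample.path) }
csv_time = Benchmark.realtime { count_records(sample.path) }
puts format('plain IO: %.4fs, csv: %.4fs', io_time, csv_time)
```

If the two times are close, the cost is in reading the file; if the csv time dominates, the cost is in parsing.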

Kind regards

robert


Andrew_Johnson · 28 January 2005 16:05

[snip]

> Here is processing time: As you can see ruby is way too slow. Is there
> anything to do about ruby code?

Well, the python library csv.py uses the underlying _csv module which
is written in C ... Ruby's standard-lib csv.rb is all Ruby. I don't
know of any csv extensions for Ruby.

regards,
andrew

--
Andrew L. Johnson http://www.siaris.net/
      It's kinda hard trying to remember Perl syntax *and* Occam's
      razor at the same time
          -- Graham Patterson

mepython · 28 January 2005 16:01

It is the csv module. Plain file reading in Ruby runs at about half the speed
of Python, so the real slowness is coming from csv.

count = 0
File.open('x.csv') do |reader|
  reader.each {|data| count += 1}
end
p count

[root@taamportable GMS]# time ruby x1.rb
26908

real 0m0.077s
user 0m0.060s
sys 0m0.016s

[root@taamportable GMS]# time python x1.py
26908

real 0m0.041s
user 0m0.032s
sys 0m0.010s

mepython · 28 January 2005 16:15

Thanks Andrew. I should have looked into the module before posting.

Robert · 28 January 2005 17:00

As a simple CSV replacement you could try this:

count = 0
File.open('x.csv') do |reader|
  reader.each {|line|
    count += 1
    data = line.split(/,/)
  }
end
p count

Depending on your data that might or might not be sufficient. Regexps can
be arbitrarily sophisticated. Here's another one:

data = []
line.scan( %r{
  "((?:[^\\"]|\\")*)" |
  '((?:[^\\']|\\')*)' |
  ([^,]+)
}x ){|m| data << m.find {|x|x}}
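
To make the "might or might not be sufficient" point concrete, here is a small sketch that runs both approaches on a line containing a quoted comma. The constant FIELD and the helper scan_fields are hypothetical names introduced only for this example; the pattern itself is the one shown above.

```ruby
# The pattern from above: a double-quoted, single-quoted, or bare field.
FIELD = %r{
  "((?:[^\\"]|\\")*)" |
  '((?:[^\\']|\\')*)' |
  ([^,]+)
}x

# scan_fields is a hypothetical helper, not part of any library.
# Each match has three capture groups; exactly one of them is non-nil.
def scan_fields(line)
  line.scan(FIELD).map { |groups| groups.find { |g| g } }
end

line = 'a,b,"foo, bar",c'

# A plain split cuts the quoted field in two.
p line.split(/,/)    # => ["a", "b", "\"foo", " bar\"", "c"]

# The scan keeps the quoted field in one piece and drops the quotes.
p scan_fields(line)  # => ["a", "b", "foo, bar", "c"]
```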

:-))

robert

W_James · 28 January 2005 22:25

Robert Klemme wrote:

> Depending on your data that might or might not be sufficient. Regexps can
> be arbitrarily sophisticated. Here's another one:
>
> data = []
> line.scan( %r{
>   "((?:[^\\"]|\\")*)" |
>   '((?:[^\\']|\\')*)' |
>   ([^,]+)
> }x ){|m| data << m.find {|x|x}}

I borrowed your regexp.

class String
  def parse_csv
    a = self.scan(
      %r{ "( (?: [^\\"] | \\")* )" |
          '( (?: [^\\'] | \\')* )' |
          ( [^,]+ )
        }x ).flatten
    a.delete(nil)
    a
  end
end

ARGF.each_line { | line |
  p line.chomp.parse_csv
}

With this input

a,b,"foo, bar",c
"foo isn't \"bar\"",a,b
a,'"just,my,luck"',b

the output is

["a", "b", "foo, bar", "c"]
["foo isn't \\\"bar\\\"", "a", "b"]
["a", "\"just,my,luck\"", "b"]

W_James · 29 January 2005 06:25

To test the method parse_csv, I created a 1 megabyte file consisting of
4228 copies of

a,b,"foo, bar",c
"foo isn't \"bar\"",a,b
a,'"just,my,luck"',b
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9

Processing it using parse_csv took about 7 seconds on my computer,
which has an 866 MHz Pentium processor.

Ruby's standard-lib csv.rb reported an error in the file's format.

So I made a file containing 26907 copies of

111,222,333,444,555,666,777,888,999

Ruby's standard-lib csv.rb took about 35 seconds to process it;
parse_csv, about 5 seconds.

mepython · 29 January 2005 13:45

I got a similar result with your parse_csv. This raises another question:
this method is also written in Ruby, so why is there such a huge overhead when
we use the csv module instead of this method?

How can we modify it so that we can pass the field separator and record
separator as arguments?

W_James · 29 January 2005 19:20

mepython wrote:

> How can we modify so that we can pass field separator and record
> separator as an argument?

This should do it. I found that not rebuilding the regular-expression
every time parse_csv is called made it even faster.

# Record separator.
RS = "\n"

# Set regexp for parse_csv.
# fs is the field-separator
def fs_is( fs )
  $csv_re = \
    %r{ "( (?: [^\\"] | \\")* )" |
        '( (?: [^\\'] | \\')* )' |
        ( [^#{fs}]+ )
      }x
end

class String
  def parse_csv
    raise "Method fs_is() wasn't called." if $csv_re.nil?
    a = self.scan( $csv_re ).flatten
    a.delete(nil)
    a
  end
end

fs_is( ',' )

# Set Ruby's input record-separator.
$/ = RS

ARGF.each_line { | line |
  p line.chomp.parse_csv
}

W_James · 29 January 2005 20:10

Improved version:

# Record separator.
RS = "\n"

class String
  # Set regexp for parse_csv.
  # self is the field-separator
  def is_fs
    $csv_re = \
      %r{ "( (?: [^\\"] | \\")* )" |
          '( (?: [^\\'] | \\')* )' |
          ( [^#{self}]+ )
        }x
  end
  def parse_csv
    raise "Method #is_fs wasn't called." if $csv_re.nil?
    self.scan( $csv_re ).flatten.compact
  end
end

','.is_fs

# Set Ruby's input record-separator.
$/ = RS

ARGF.each_line { | line |
  p line.chomp.parse_csv
}

mepython · 29 January 2005 23:45

I found an error in parse_csv: if a field is empty, it is ignored. For
example:
x,y,z
1,,2

The second line should return [1,nil,2]; instead it returns [1,2].

How hard is it to do the reverse: create a csv string from a list?

Thanks. I just started Ruby a couple of days ago, so I am learning
instead of implementing. Sorry.
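
A short sketch of why the empty field disappears, using deliberately simplified patterns with no quote handling. Both helper names are hypothetical. A bare-field pattern that requires one or more characters can never match an empty field; one possible way around it is to append a separator and match "zero or more characters followed by a separator".

```ruby
# Simplified for illustration: quoting is ignored entirely.

# "One or more non-separator characters" skips empty fields.
def fields_dropping_empty(line)
  line.scan(/([^,]+)/).flatten
end

# Appending a separator lets every field, including an empty one,
# be matched as "zero or more characters, then a separator".
def fields_keeping_empty(line)
  (line + ',').scan(/([^,]*),/).flatten
end

p fields_dropping_empty('1,,2')  # => ["1", "2"]
p fields_keeping_empty('1,,2')   # => ["1", "", "2"]
```

Note that the empty field comes back as an empty string rather than nil.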

W_James · 30 January 2005 05:30

mepython wrote:

I found an error in parse_csv if field is empty, it ignores it for
example:
x,y,z
1,2

Second line should return [1,nil,2] instead it returns [1,2].

How hard to do reverse: create csv string from list?

Thanks. I just started Ruby couple of days ago, so I am learning
instead of implementing, Sorry.

1,,2 now returns ["1", "", "2"].

Use arry.to_csv to create a csv string from an array.

## Record separator.
RS = "\n"

class String
  # Set regexp for parse_csv.
  # self is the field-separator, which must be
  # a single character.
  def is_fs
    $csv_fs = self
    if "^" == $csv_fs
      fs = "\\^"
    else
      fs = $csv_fs
    end
    $csv_re = \
      %r! (?:
            "( [^"\\]* (?: \\.[^"\\]* )* )" |
            ( [^#{fs}]* )
          )
          [#{fs}]
        !x
  end
  def parse_csv
    raise "Method #is_fs wasn't called." if $csv_re.nil?
    (self+$csv_fs).scan( $csv_re ).flatten.compact
  end
end

class Array
  def to_csv
    raise "Method #is_fs wasn't called." if $csv_fs.nil?
    s = ''
    self.each { |x|
      x = '"'+x+'"' if x.index( $csv_fs ) or x.index( '"' )
      s += x + $csv_fs
    }
    s[0 .. -2]
  end
end


",".is_fs

## Set Ruby's input record-separator.
$/ = RS

ARGF.each_line { | line |
  line.chomp!
  puts "-------------------"
  puts line
  ary = line.parse_csv
  p ary
  puts ary.to_csv
}

Tim_Sutherland1 · 30 January 2005 08:35

In article <1107042295.816570.61420@z14g2000cwz.googlegroups.com>, mepython
wrote:
[...]

>How hard to do reverse: create csv string from list?
>
>Thanks. I just started Ruby couple of days ago, so I am learning
>instead of implementing, Sorry.

This assumes the input is an array of lines. (Where each line is an array.)

class Array
    def to_csv
        map { |line|
            line.map { |cell|
                '"' + cell.gsub(/"/, '""') + '"'
            }.join(',')
        }.join("\n")
    end
end

Note that literal quotes " are replaced with "".
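
A quick usage sketch of the same logic. Here it is written as a standalone function, rows_to_csv (a hypothetical name), so the example does not reopen Array; the sample rows are made up.

```ruby
# rows_to_csv is a hypothetical standalone version of the method above.
# Every cell is quoted, and embedded quotes are doubled.
def rows_to_csv(rows)
  rows.map { |row|
    row.map { |cell| '"' + cell.gsub(/"/, '""') + '"' }.join(',')
  }.join("\n")
end

rows = [['a', 'say "hi"'], ['x,y', 'z']]
puts rows_to_csv(rows)
# Prints:
# "a","say ""hi"""
# "x,y","z"
```

Because every cell is quoted unconditionally, cells containing the separator need no special handling. Cells must already be strings, since gsub is called on them directly.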

Bertram_Scharpf · 30 January 2005 13:57

Hi,

How about this (untested):

    class Array
      def to_csv sep = ';'
        quo = '"'
        map { |line|
          line.map { |cell|
            c = cell.to_s
            if c.include? sep or c.include? quo
              quo + c.gsub( quo, quo*2) + quo
            else
              c
            end
          }.join sep
        }.join $/
      end
    end

Bertram

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de

W_James · 31 January 2005 01:25

Now assumes a quotation mark within a field is represented as ""
(previous versions assumed \" ).
Lacks one thing: cannot handle a newline within a field.

# Record separator.
RS = "\n"

class Array
  def to_csv
    raise "Method #is_fs wasn't called." if $csv_fs.nil?
    s = ''
    self.map { |item|
      str = item.to_s
      if str.index( $csv_fs ) or /^\s|"|\s$/.match(str)
        str = '"' + str.gsub( /"/, '""' ) + '"'
      end
      str
    }.join($csv_fs)
  end
  def unescape
    self.map!{|x| x.gsub( /""/, '"' ) }
  end
end

class String
  # Set regexp for parse_csv.
  # self is the field-separator, which must be
  # a single character.
  def is_fs
    $csv_fs = self
    if "^" == $csv_fs
      fs = "\\^"
    else
      fs = $csv_fs
    end
    ## Assumes embedded quotes are escaped as "".
    $csv_re = \
      %r! \s*
          (?:
            "( [^"]* (?: "" [^"]* )* )" |
            ( .*? )
          )
          \s*
          [#{fs}]
        !x
  end
  def parse_csv
    raise "Method #is_fs wasn't called." if $csv_re.nil?
    (self+$csv_fs).scan( $csv_re ).flatten.compact.unescape
  end
end

",".is_fs

# Set Ruby's input record-separator.
$/ = RS

ARGF.each_line { | line |
  line.chomp!
  puts line
  ary = line.parse_csv
  p ary
  puts ary.to_csv
}

Ralf_Muller · 1 February 2005 08:47

Found a parser in a German Ruby book by Röhrl, Schmiedl and Wyss. With a little improvement, it supports unquoted, '-quoted and "-quoted cells in any order:

#!/usr/bin/env ruby
class CSVParser
   include Enumerable

   QUOTED = /('|"){1,1}(.*?)\1{1,1}(,|\r?\n)/m
   UNQUOTED = /()(.*?)(,|\r?\n)/m

   def initialize(string)
      @string = string
   end

   # datafields of a line are provided as an array
   def each
      while @string != ''
         tokens = []
         while @string != ''
            case @string[0..0]
               # empty cell
               when ","
                  tokens << nil
                  @string.slice!(0..0)
                  next
               # last cell is empty
               when /\r?\n/
                  tokens << nil
                  @string.slice!(0...$&.size)
                  break
               # complex cell
               when /('|")/
                  pattern = QUOTED
                  dequote = true
               # simple cell
               else
                  pattern = UNQUOTED
                  dequote = false
            end
            # match the content
            md = pattern.match(@string)
            token = md[2]
# token.gsub('""','"') if dequote
            tokens << token
            @string.slice!(0...md[0].size)
            # last cell
            break if md[0][-1..-1] == "\n"
         end
         yield tokens
      end
   end
end

# =============================================================================
# MAIN ------------------------------------------------------------------------
csv = CSVParser.new($stdin.read)
Start = "'"
End = "'\n"
Sep = "','"
csv.each{|row|
  puts Start + row[0].to_s + Sep + row.join(Sep) + End if row[2].to_i <= 400000 and row.last != ''
}

regards
ralf

W_James · 1 February 2005 05:40

A small, fast, and (I think) complete csv parser.

Now handles newlines within fields.
A comma is now the default field-separator.

class Array
   def to_csv
     ",".is_fs if $csv_fs.nil?
     s = ''
     self.map { |item|
       str = item.to_s
       # Quote the string if it contains the field-separator or
       # a " or a newline, or if it has leading or trailing
       # whitespace.
       if str.index($csv_fs) or /^\s|"|\n|\s$/.match(str)
         str = '"' + str.gsub( /"/, '""' ) + '"'
       end
       str
     }.join($csv_fs)
   end
   def unescape
     self.map{|x| x.gsub( /""/, '"' ) }
   end
end

class String
   # Set regexp for parse_csv.
   # self is the field-separator, which must be
   # a single character.
   def is_fs
     $csv_fs = self
     if "^" == $csv_fs
       fs = "\\^"
     else
       fs = $csv_fs
     end
     $csv_re = \
       ## Assumes embedded quotes are escaped as "".
       %r{ \s*
            (?:
                "( [^"]* (?: "" [^"]* )* )" |
                 ( .*? )
            )
            \s*
            [#{fs}]
         }mx
   end

   def parse_string
     (self + $csv_fs).scan( $csv_re ).flatten.compact.unescape
   end
end

def get_rec( file )
   ",".is_fs if $csv_re.nil?
   $csv_s = ""
   begin
     if file.eof?
       raise "The csv file is malformed." if $csv_s.size>0
       return nil
     end
     $csv_s += file.gets
   end until $csv_s.count( '"' ) % 2 == 0
   $csv_s.chomp!
   $csv_s.parse_string
end

while rec = get_rec( ARGF )
   puts "----------------"
   puts $csv_s
   p rec
   puts rec.to_csv

end
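
The part of the program above that copes with newlines inside fields is get_rec: it keeps appending lines until the number of double quotes in the buffer is even, which means no quoted field is still open. Below is a minimal sketch of just that idea, without globals and reading from a StringIO. The name next_record and the sample data are assumptions made for illustration.

```ruby
require 'stringio'

# next_record is a hypothetical name; it mirrors the quote-counting
# idea of get_rec and returns the raw record text, or nil at end of input.
def next_record(io)
  buffer = +''
  until io.eof?
    buffer << io.gets
    # An even number of quotes means every quoted field has been closed.
    return buffer.chomp if buffer.count('"').even?
  end
  # Input ended while a quoted field was still open.
  raise 'The csv data is malformed.' unless buffer.empty?
  nil
end

# Two records; the first contains a newline inside a quoted field.
io = StringIO.new(%Q{a,"two\nlines",b\nc,d\n})
p next_record(io)  # first record, spanning two physical lines
p next_record(io)  # second record
p next_record(io)  # nil: no more input
```

This relies on embedded quotes being written as "", so that they always come in pairs and do not disturb the parity check.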

Ryan_Davis1 · 1 February 2005 06:16

There is test_csv.rb in the ruby tarball. Can you run your new code against it to make sure it is complete? With good profile numbers I doubt it'd be hard to get the slower code replaced.

W_James · 1 February 2005 10:25

Ryan Davis wrote:

> There is test_csv.rb in the ruby tarball. Can you run your new code
> against it to make sure it is complete? With good profile numbers I
> doubt it'd be hard to get the slower code replaced.

Wow. test_csv.rb is beyond my comprehension. I don't know how
to use it.

I did lift a very complex test string from it to use in testing
my program. One of the fields in that csv string is defective;
I don't know whether that was intentional or not:

"\r\n"\r\nNaHi,

The " in the field isn't doubled, and the field doesn't end
with a quote.

Incidentally, when my program converts that string to an array
and then back to a csv string, it's not the same as
the original string because ,"", is shortened to , .

I corrected a minor bug in my code by moving
",".is_fs if $csv_fs.nil?
to its proper location.

The program conforms to the csv specification at this site:

and it handles the sample csv records shown there.

All my program can do is read a text file containing csv records,
convert those records (strings) into arrays of strings, and
convert the arrays back into csv strings. I suppose that the
csv library that comes with Ruby may do more than that.

## Read, parse, and create csv records.
## Has a minor bug fix; discard previous versions.
## 2005-02-01.

class Array
  def to_csv
    ",".is_fs if $csv_fs.nil?
    s = ''
    self.map { |item|
      str = item.to_s
      # Quote the string if it contains the field-separator or
      # a " or a newline, or if it has leading or trailing
      # whitespace.
      if str.index($csv_fs) or /^\s|"|\n|\s$/.match(str)
        str = '"' + str.gsub( /"/, '""' ) + '"'
      end
      str
    }.join($csv_fs)
  end
  def unescape
    self.map{|x| x.gsub( /""/, '"' ) }
  end
end

class String
  # Set regexp for parse_csv.
  # self is the field-separator, which must be
  # a single character.
  def is_fs
    $csv_fs = self
    if "^" == $csv_fs
      fs = "\\^"
    else
      fs = $csv_fs
    end
    ## Assumes embedded quotes are escaped as "".
    $csv_re = \
      %r{ \s*
          (?:
            "( [^"]* (?: "" [^"]* )* )" |
            ( .*? )
          )
          \s*
          [#{fs}]
        }mx
  end

  def parse_string
    ",".is_fs if $csv_fs.nil?
    (self + $csv_fs).scan( $csv_re ).flatten.compact.unescape
  end
end

def get_rec( file )
  $csv_s = ""
  begin
    if file.eof?
      raise "The csv file is malformed." if $csv_s.size>0
      return nil
    end
    $csv_s += file.gets
  end until $csv_s.count( '"' ) % 2 == 0
  $csv_s.chomp!
  $csv_s.parse_string
end


# while rec = get_rec( ARGF )
#   puts "----------------"
#   puts $csv_s
#   p rec
#   puts rec.to_csv
# end

## Here is my breakdown of the test string from test_csv.rb.
# foo,
# """foo""",
# "foo,bar",
# """""",
# "",
# ,
# "\r",
# "\r\n""\r\nNaHi", <---<< Corrected.
# """Na""",
# "Na,Hi",
# "\r.\n",
# "\r\n\n",
# """",
# "\n",
# "\r\n"

# Original.
csvStr = ("foo,!!!foo!!!,!foo,bar!,!!!!!!,!!," +
  "!\r!,!\r\n!\r\nNaHi,!!!Na!!!,!Na,Hi!," +
  "!\r.\n!,!\r\n\n!,!!!!,!\n!,!\r\n!").gsub('!', '"')

# Corrected?
csvStr = ("foo,!!!foo!!!,!foo,bar!,!!!!!!,!!," +
  "!\r!,!\r\n!!\r\nNaHi!,!!!Na!!!,!Na,Hi!," +
  "!\r.\n!,!\r\n\n!,!!!!,!\n!,!\r\n!").gsub('!', '"')

p csvStr
arry = csvStr.parse_string
p arry
newCsvStr = arry.to_csv
p newCsvStr
arry2 = newCsvStr.parse_string
puts "Arrays match." if arry == arry2
