CSV Reader (and Type Inference and Data Conversion) Benchmarks (Faster, Fasterer, Fastest) - And the Winner is... String#split


(Gerald Bauer) #1

Hello,

   I've put together some basic csv reader / parser benchmarks [1].
The "Raw" Read Benchmark returns all values as strings, with no type
inference or data conversion (*). The Numerics Benchmark returns all
values as numbers, using simple type inference or data conversion.
The data is all numbers, all the time (except for the header row).

  Here's the result for the numerics benchmark using the weather station
data from the University of Waterloo, Ontario, Canada:

   n = 100
                       user     system      total         real
std:              20.781000   0.234000  21.015000 ( 21.039186)
split:             1.531000   0.063000   1.594000 (  1.582496)
split(table):      2.000000   0.015000   2.015000 (  2.016913)
reader:           63.500000   0.203000  63.703000 ( 63.691851)
reader(table):    37.407000   0.188000  37.595000 ( 37.601160)
reader(numeric):  40.421000   0.141000  40.562000 ( 40.595467)
reader(json):      1.125000   0.062000   1.187000 (  1.191145)
reader(yaml):     38.485000  15.672000  54.157000 ( 54.229705)
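
The user / system / total / real columns are the layout printed by Ruby's
standard Benchmark module. A minimal sketch of a harness that prints a table
in this shape might look like the following. The method name bench_readers,
the hash of labels to lambdas, and the default n are assumptions for
illustration and are not taken from the benchmark repo [1].

```ruby
require 'benchmark'

# Hypothetical harness (not from the benchmark repo): runs each reader
# n times and prints one row per label in the Benchmark.bm layout.
def bench_readers( readers, n: 100 )
  width = readers.keys.map( &:length ).max    # pad labels to the longest one
  Benchmark.bm( width ) do |x|
    readers.each do |label, reader|
      x.report( label ) { n.times { reader.call } }
    end
  end
end
```

Each hash value is any callable that reads the file once, for example a
lambda that calls the split-based reader on the data file. Benchmark.bm
returns one Benchmark::Tms object per row, so the numbers can also be
inspected in code.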

   And the winner is...

Of course - nothing is faster than "plain" String#split (with "simple csv",
that is, no escape rules and no edge cases):

   def read_faster_csv( path, sep: ',' )
     recs = []
     File.open( path, 'r:utf-8' ) do |f|
       f.each_line do |line|
         line = line.chomp( '' )      # strip all trailing newlines ("\n" or "\r\n")
         values = line.split( sep )   # plain split: no quoting or escape handling
         recs << values
       end
     end
     recs
   end
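
For the numerics case, a minimal sketch of split plus conversion could look
like this. The method name read_faster_csv_numeric and the choice of Float()
are assumptions for illustration, not the code used in the benchmark repo [1].
The sketch keeps the header row as strings and converts every other value.

```ruby
# Hypothetical variant (not from the benchmark repo): split each line and
# convert every value after the header row to a Float.
def read_faster_csv_numeric( path, sep: ',' )
  recs = []
  File.open( path, 'r:utf-8' ) do |f|
    f.each_line.with_index do |line, i|
      values = line.chomp( '' ).split( sep )
      # Row 0 is the header and stays as strings; Float() raises
      # ArgumentError on anything that is not a valid number.
      values = values.map { |value| Float( value ) }  unless i == 0
      recs << values
    end
  end
  recs
end
```

Float() is strict, so a stray non-numeric value fails loudly instead of
silently turning into 0.0 the way String#to_f would.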

(*) Note: YAML and JSON - of course - always use YAML and JSON
encoding (and data conversion) rules :-).
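
To illustrate what "JSON rules" mean for a line of values, here is a small
sketch that wraps a line in brackets and hands it to the standard JSON
parser. The helper name parse_line_with_json_rules is hypothetical, and
whether the benchmarked json reader works exactly this way is an assumption.

```ruby
require 'json'

# Hypothetical helper: treat one line of comma-separated values as the
# inside of a JSON array, so JSON's own rules decide the types.
def parse_line_with_json_rules( line )
  JSON.parse( "[#{line.chomp}]" )
end

parse_line_with_json_rules( '1, 2.5, "three"' )   # => [1, 2.5, "three"]
```

Numbers come back as Integer or Float and quoted values as String. An
unquoted string is not valid JSON and raises JSON::ParserError.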

  Happy data wrangling with ruby. Cheers. Prost.

[1] https://github.com/csvreader/benchmarks