Performance issues with large files -- ruby vs. python :)

Hi all -

I'm new to ruby after working with python for a while. My work is
performing data mining and doing some web dev for my company. We
recently started looking at rails, and I wanted to see if it's worth
migrating some of my code from python to ruby. So much for the intro.

I've re-written a script that extracts data from a very large csv file
(~8 million rows or so, almost 1 GB in size). It does so by iterating
through the rows and building a tree-like hash (dict in python) in
memory that will later be written to our DB. The structure is something
like this:

{date =>
  { company =>
     { product => [array of relevant info] } } }

This way I only get the data I need, sort it on the fly, and speed up
the process. The idea is to propagate down the keys if some field has
duplicate values -- I hope that makes some sense.. Anyway, I copied my
python code, and basically translated it to ruby.

This is where it got interesting -- after solving all the quirks and
getting it to run, it appeared to be super slow compared to python. I
mean nearly 30 minutes in ruby vs. under 2 minutes in python. It's
worth mentioning that I used the psyco module in python, and that I did
my testing on win-xp.

Since I'm a newbie and **really** don't want to spark a python/ruby
talk-back war (I actually like ruby a lot from what I've seen so far), I
was just wondering if there's something I might have missed, like a
psyco equivalent module for ruby or something else to narrow the gap.

I'd appreciate feedback - thanks!

···

--
Posted via http://www.ruby-forum.com/.

It would be nice if you could give us some hints on which libraries you are using and what
your code looks like. If we are talking about such intensive tasks, it might be that you are
getting something wrong about memory handling and such.

But, to take a shot: do you use the standard csv library or do you use fastercsv?

Regards,
Florian Gilcher

Hi,
I do have a very similar problem: large data files and csv-import with
fastercsv which is much slower than an implementation in c++.
I was wondering if you have some interesting insights about that meanwhile
which you would like to share :).
Thanks,
Monika

···

I can't really put the code here since it's on the company's intranet. I
use the fastercsv and mysql libraries. Basically I want to grab the
first and last record of every date/company/product combo, and store
its row info.

The core processing is done through if statements:

@main_hash = {}
csv = FasterCSV.open(file_path, "r", :headers => true)

#...code below is in loop: for row in csv...

if not @main_hash.keys.member?(date)
  @main_hash[date] = {}
  @main_hash[date][company] = {}
  @main_hash[date][company][prod] = {}
  @main_hash[date][company][prod] = row_values
else
  if not @main_hash[date].keys.member?(company)
    @main_hash[date][company] = {}
    @main_hash[date][company][prod] = {}
    @main_hash[date][company][prod] = row_values
  else
    if not @main_hash[date][company].keys.member?(prod)
      @main_hash[date][company][prod] = {}
      @main_hash[date][company][prod] = row_values
    end
  end
end

# loop ends

This is basically the part of the code that runs slow. I keep track of
progress in percentage (file position / file size) throughout the loop.
I should mention I extract the row values into loop variables, like
date/company/prod using the csv headers: date = row['Date'], etc.

The row_values variable is an array containing all the relevant
parameters from the row. The @main_hash variable obviously takes up some
memory. There are a couple of if-statements, but not much else. That's
pretty much all I can think of. Thanks!
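
For reference, the "first and last record of every date/company/product combo" goal mentioned above could be sketched roughly as follows. This is not the poster's actual code (which is not shown in full); the sample rows are invented, and a local main_hash stands in for @main_hash. The idea is to store a two-element pair per combination and overwrite the second slot on every row.

```ruby
# Invented sample rows: [date, company, product, row_values]
rows = [
  ['2008-11-27', 'Acme', 'Widget', [10]],
  ['2008-11-27', 'Acme', 'Widget', [12]],
  ['2008-11-27', 'Acme', 'Widget', [15]]
]

main_hash = {}

rows.each do |date, company, prod, row_values|
  # Create the date and company levels only when they are missing.
  products = (main_hash[date] ||= {})[company] ||= {}
  # pair[0] is the first row seen, pair[1] the most recent one.
  pair = (products[prod] ||= [row_values, row_values])
  pair[1] = row_values
end

p main_hash['2008-11-27']['Acme']['Widget']   # prints [[10], [15]]
```

Each row costs a handful of hash lookups and no key arrays are built, so the per-row work stays roughly constant as the hash grows.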

···

Also, which Ruby version are you using? In my experience, Ruby 1.9.1
is significantly faster than all 1.8 versions, and it also has better
memory usage characteristics.

Kind regards

robert

···

2008/11/27 Florian Gilcher <flo@andersground.net>:

It would be nice if you could give us some hints on which libraries you are
using and what your code looks like. If we are talking about such intensive
tasks, it might be that you are getting something wrong about memory handling
and such.

But, to take a shot: do you use the standard csv library or do you use
fastercsv?

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

I was wondering if you have some interesting insights about that
meanwhile which you would like to share :).
Thanks,
Monika

Hi Monika,

it's been a while since I had that problem, and I ended up sticking with
python for that particular issue. However, I found that reading the csv
file with simple file IO

File.open('path/to/file', 'r').each { |row| }  # do stuff with row inside the block

is much faster than using FasterCSV. It will not be as convenient as
fastercsv, but if speed is what you're going for it just might do.
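
A minimal sketch of that plain-IO approach, assuming the file has no quoted fields or embedded commas (naive splitting breaks on those) and that the columns come in a known order. The sample file and its column layout are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical sample data standing in for the real CSV export.
sample = Tempfile.new(['sample', '.csv'])
sample.write("Date,Company,Product,Price\n")
sample.write("2008-11-27,Acme,Widget,10\n")
sample.write("2008-11-27,Acme,Widget,12\n")
sample.write("2008-11-27,Acme,Gadget,7\n")
sample.close

main_hash = {}

# File.foreach reads one line at a time instead of loading the whole file.
File.foreach(sample.path).with_index do |line, index|
  next if index.zero?                      # skip the header row
  date, company, prod, *rest = line.chomp.split(',')
  # Keep only the first row seen for each date/company/product combination.
  ((main_hash[date] ||= {})[company] ||= {})[prod] ||= rest
end

p main_hash
sample.unlink
```

The trade-off is that all fields arrive as plain strings and header-based access such as row['Date'] is gone, so column positions have to be tracked by hand.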

I must add that I haven't tried ruby 1.9 yet, which is said to have a
much faster interpreter than 1.8 - though I don't know if it's
compatible with the fastercsv library. If you end up finding out about
it, please let me know how it works out.

Thanks!

···

Monika Moser wrote:

Hi,
I do have a very similar problem: large data files and csv-import with
fastercsv which is much slower than an implementation in c++.
I was wondering if you have some interesting insights about that meanwhile
which you would like to share :).

You may want to try JRuby; many people who have moved to JRuby have done so explicitly because the perf characteristics of large data sets are very good.

- Charlie

Well, I take it to be pretty obvious that FasterCSV (written in Ruby) isn't going to be as fast as a C/C++ parser. I believe Ruby has a C-based parser though, if you want to go that way:

http://rubyforge.org/projects/simplecsv

You might also try FasterCSV's latest code, not yet released but in version control. It has a new parser that can be faster for some things.

James Edward Gray II

···

On May 13, 2009, at 3:49 AM, Monika Moser wrote:

Hi,
I do have a very similar problem: large data files and csv-import with
fastercsv which is much slower than an implementation in c++.
I was wondering if you have some interesting insights about that meanwhile
which you would like to share :).

unless main_hash[date]
  main_hash[date] = {}
  main_hash[date][company] = {}
  main_hash[date][company][prod] = row_values
else
  unless main_hash[date][company]
    main_hash[date][company] = {}
    main_hash[date][company][prod] = row_values
  else
    unless main_hash[date][company][prod]
      main_hash[date][company][prod] = row_values
    end
  end
end

main_hash[date] ||= {}
main_hash[date][company] ||= {}
main_hash[date][company][prod] ||= row_values

d = main_hash[date] ||= {}
c = d[company] ||= {}
p = c[prod] ||= row_values

((@main_hash[date] ||= {})[company] ||= {})[prod] ||= row_values

···

sa 125 wrote:

if not @main_hash.keys.member?(date)

@main_hash.keys will create an array of all the keys, which will be
expensive when it's large, and member? will do a linear search, which is
also expensive. You can replace it with:

  if not @main_hash.has_key?(date)

or more simply

  if not @main_hash[date]
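
To see the difference, a rough micro-benchmark along these lines can help. The hash size and lookup count are arbitrary choices for illustration, and absolute timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

# Arbitrary sizes chosen for illustration only.
hash = {}
5_000.times { |i| hash["key#{i}"] = i }
probe = 'key4999'
lookups = 200

slow = Benchmark.realtime do
  # Builds a fresh array of all keys, then scans it linearly, on every call.
  lookups.times { hash.keys.member?(probe) }
end

fast = Benchmark.realtime do
  # A single hash lookup with no intermediate array.
  lookups.times { hash.has_key?(probe) }
end

puts "keys.member?: #{slow}s"
puts "has_key?:     #{fast}s"
```

Inside a loop over millions of rows, the keys.member? cost grows with the number of keys already stored, while has_key? stays roughly constant.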

  @main_hash[date] = {}
  @main_hash[date][company] = {}
  @main_hash[date][company][prod] = {}
  @main_hash[date][company][prod] = row_values

The third line does nothing, because it's replaced by the fourth line. I
think all you need is:

    @main_hash[date] = { company => { prod => row_values } }

else
  if not @main_hash[date].keys.member?(company)
    @main_hash[date][company] = {}
    @main_hash[date][company][prod] = {}
    @main_hash[date][company][prod] = row_values
  else
    if not @main_hash[date][company].keys.member?(prod)
      @main_hash[date][company][prod] = {}
      @main_hash[date][company][prod] = row_values
    end
  end
end

You can rewrite this as above too.

But looking at this, I think you can replace *all* this code with just
the following three lines:

  @main_hash[date] ||= {}
  @main_hash[date][company] ||= {}
  @main_hash[date][company][prod] ||= row_values

Note that a ||= b behaves like a = a || b (more precisely, a || (a = b)),
which will assign b to a only if a is nil or false.
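
A small sketch of what those three lines do over a few rows (the sample rows are invented, and a local main_hash stands in for @main_hash): the first row for each date/company/product combination is stored, and later rows for the same combination are ignored.

```ruby
# Invented sample rows: [date, company, product, row_values]
rows = [
  ['2008-11-27', 'Acme', 'Widget', [10, 'first']],
  ['2008-11-27', 'Acme', 'Widget', [12, 'second']],
  ['2008-11-28', 'Acme', 'Widget', [15, 'third']]
]

main_hash = {}

rows.each do |date, company, prod, row_values|
  main_hash[date] ||= {}
  main_hash[date][company] ||= {}
  # Assigns only when nothing is stored yet for this combination.
  main_hash[date][company][prod] ||= row_values
end

p main_hash['2008-11-27']['Acme']['Widget']   # prints [10, "first"]
p main_hash['2008-11-28']['Acme']['Widget']   # prints [15, "third"]
```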

This is basically the part of the code that runs slow. I keep track of
progress in percentage (file position / file size) throughout the loop.

There is also the ruby profiler, which you can turn on/off where needed,
or just run the whole lot with ruby -rprofile (beware: makes your code
run *much* slower)

Regards,

Brian.

···

Simple benchmark:
http://gist.github.com/29792

···

On Thu, Nov 27, 2008 at 10:27 AM, <brabuhr@gmail.com> wrote:

On Thu, Nov 27, 2008 at 9:39 AM, sa 125 <s_ayalon@hotmail.com> wrote:

if not @main_hash.keys.member?(date)
@main_hash[date] = {}

main_hash[date] ||= {}

brabuhr@gmail.com wrote:

unless main_hash[date]
  main_hash[date] = {}
  main_hash[date][company] = {}
  main_hash[date][company][prod] = row_values
else
  unless main_hash[date][company]
    main_hash[date][company] = {}
    main_hash[date][company][prod] = row_values
  else
    unless main_hash[date][company][prod]
      main_hash[date][company][prod] = row_values
    end
  end
end

DO NOT use unless...else. It's the most confusing conditional construct ever. Use if and flop the bodies, or just use if !condition.

- Charlie

I have never heard you shout before, *unless* I am mistaken ;).
I always felt that unless is just the human readable way of saying if
not, but accountants do not have any taste (or something like that ;).

Anyway, here I would rather apply Ruby's "on steroids" construct (quoting Dave
Thomas): "case".

case
   when not h....
         ....
   when not h....
        ....
       etc.etc
   else
       ....
end

I am not sure if faster code is generated but I would guess so.

All that said, I wonder if your code could not benefit from this kind of
initialization

@main_hash = Hash::new{ |h, k| h[k] = Hash::new{ |h, k| h[k] = {}

I let *you* close all those accolades (braces) :wink:
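
With the braces closed, that initialization might look like the sketch below (using a local main_hash and invented sample values). The block given to Hash.new runs whenever a missing key is accessed, so the intermediate levels are created on first use. One caveat: merely reading a missing key also creates it.

```ruby
# Two levels of default blocks: date => company => product.
main_hash = Hash.new do |dates, date|
  dates[date] = Hash.new { |companies, company| companies[company] = {} }
end

# No explicit initialization of the intermediate hashes is needed.
main_hash['2008-11-27']['Acme']['Widget'] ||= [10, 'first']
main_hash['2008-11-27']['Acme']['Widget'] ||= [12, 'second']  # ignored, already set

p main_hash['2008-11-27']['Acme']['Widget']   # prints [10, "first"]

# Caveat: reading a missing key creates an empty entry as a side effect.
main_hash['2008-12-01']['Other']
p main_hash.key?('2008-12-01')                # prints true
```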

HTH
Robert

···

On Thu, Nov 27, 2008 at 4:57 PM, Charles Oliver Nutter <charles.nutter@sun.com> wrote:

DO NOT use unless...else. It's the most confusing conditional construct
ever. Use if and flop the bodies, or just use if !condition.

--
Ne baisse jamais la tête, tu ne verrais plus les étoiles.

Robert Dober :wink:

I don't think the confusion issue was simply unless, but rather
pairing unless with else.

# not confusing
unless a == b
  #foo
end

# confusing
unless a == b
  #foo
else
  #bar
end

# not confusing
if a == b
  #bar
else
  #foo
end

# or
if a != b
  #foo
else
  #bar
end

···

On Thu, Nov 27, 2008 at 11:47 AM, Robert Dober <robert.dober@gmail.com> wrote:

I have never heard you shout before, *unless* I am mistaken ;).
I always felt that unless is just the human readable way of saying if
not, but accountants do not have any taste (or something like that ;).

Hmm, that indeed is a little more misleading; sorry for not getting
this, but the ellipsis led me astray ;).

···

On Thu, Nov 27, 2008 at 6:12 PM, <brabuhr@gmail.com> wrote:

I don't think the confusion issue was simply unless, but rather
pairing unless with else.

Not entirely on topic, but the biggest issue I've had with "unless...else" is that you can't do "unless...elsif...else".

I somewhat frequently find myself adding conditionals for extra corner cases, and in an "unless...else" block, that means re-writing the whole thing as an "if not...else" block anyway.

-Josh

···

On Nov 27, 2008, at 12:29 PM, Robert Dober wrote:

Hmm, that indeed is a little more misleading; sorry for not getting
this, but the ellipsis led me astray ;).