Optimizing a single slow method

Hi!
  I have a Ruby script (a small part of it is included below)
that has many methods, but one method is so incredibly slow that
I would like to write it in C++ (or any other faster language).

  The method is "Contract.parseFile(file)". I've tried the
latest SWIG, but am having a heck of a time trying to call the
two methods: "Contract.open()" and "Contract.addTick()" from within
the SWIG'd extension. (By the way, I also tried "rio(file).csv {|fields|"
but that had the same speed as the code below.)

  Any ideas on how I can rewrite 'Contract.parseFile()' for
speed?

  Thanks!
-- Glenn Lewis

class Contract
   @@contracts = Hash.new
   @@files = Hash.new
   DataDir = "c:/src/Data"
   MonthNames = %w(nil jan feb mar apr may jun jul aug sep oct nov dec)
   def Contract.open(contract)
     @@contracts.has_key?(contract) ? @@contracts[contract] : Contract.new(contract)
   end
   def Contract.all
     @@contracts.keys
   end
   def Contract.parseFile(file)
     return unless File.exists?(file)
     return if @@files.has_key?(file)
     @@files[file] = 1
     print "Parsing file #{file}..."
     File.open(file, "rb").each_line {|line|
       line.chomp!
       # puts line
       fields = line.split(/,/)
       next if fields.size < 8
       datestring = fields[1]
       year = datestring[0..1].to_i
       month = datestring[2..3].to_i
       day = datestring[4..5].to_i
       # puts "year=#{year}, month=#{month}, day=#{day}"
       # RUBY BUG?!? Can't use 'date' here... as that messes up findTick's 'date'...
       mydate = Date.new((year < 50 ? year+2000 : year+1900), month, day)
       contract = Contract.open(fields[0])
       tick = Tick.new(mydate, fields[2].to_f, fields[3].to_f,
          fields[4].to_f, fields[5].to_f,
          fields[6].to_i, fields[7].to_i)
       contract.addTick(tick)
     }
     puts "done"
   end
   def initialize(contract)
     @contract = contract
     @@contracts[contract] = self
     @ticks = Hash.new
   end
   def addTick(tick)
     @ticks[tick.date.to_s[0..9]] = tick
   end
end

First of all I'd try to determine exactly *why* it is slow and where it's spending its time. Since you're doing file IO, chances are that a version in a different language won't be much faster.
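
To get a first look at where the time goes, the standard benchmark library is enough. A rough sketch (the sample line and field layout are made up; absolute numbers will vary with your data and Ruby build):

```ruby
require 'benchmark'

# Hypothetical sample line in the 8-field format the script expects
line = "GC,050908,443.5,445.0,441.2,444.8,12345,67890"
n = 50_000

Benchmark.bm(12) do |b|
  # how much of the cost is the split itself...
  b.report("split only:") { n.times { line.split(/,/) } }
  # ...versus split plus the numeric conversions
  b.report("full parse:") do
    n.times do
      f = line.split(/,/)
      f[2].to_f; f[3].to_f; f[4].to_f; f[5].to_f
      f[6].to_i; f[7].to_i
    end
  end
end
```

Whatever dominates there is the part worth optimizing (or rewriting) first.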

Also, as a hint for a pure Ruby solution: did you try the library functions for date parsing? Date creation might make up a significant part of the time spent here. Another alternative is to use a regular expression to match the entire line instead of splitting it. That way you verify in one step that the current line matches your expectations (i.e. the regexp), and you get direct access to the field values.
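
A sketch of that whole-line regexp idea, assuming a hypothetical SYMBOL,YYMMDD,prices...,ints layout (adjust the pattern to your real fields):

```ruby
# One anchored pattern validates the line and captures all fields at once.
LINE_RE = /\A([A-Z]+),(\d{2})(\d{2})(\d{2}),([\d.]+),([\d.]+),([\d.]+),([\d.]+),(\d+),(\d+)\s*\z/

line = "GC,050908,443.5,445.0,441.2,444.8,12345,67890"
if m = LINE_RE.match(line)
  symbol = m[1]
  year, month, day = m[2].to_i, m[3].to_i, m[4].to_i
  open_px, high, low, close = m[5].to_f, m[6].to_f, m[7].to_f, m[8].to_f
  volume, oi = m[9].to_i, m[10].to_i
  # a malformed line simply fails the match and can be skipped
end
```
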

Also, I noticed that you don't close your file. I always prefer the block form:

File.open(file, "rb") do |io|
  io.each_line do |line|
    line.chomp!
    ...
  end
end

Kind regards

    robert

···



  def Contract.parseFile(file)
    return unless File.exists?(file)
    return if @@files.has_key?(file)
    @@files[file] = 1
    print "Parsing file #{file}..."
    File.open(file, "rb").each_line {|line|
      line.chomp!
      # puts line
      fields = line.split(/,/)
      next if fields.size < 8
      datestring = fields[1]
      year = datestring[0..1].to_i
      month = datestring[2..3].to_i
      day = datestring[4..5].to_i
      # puts "year=#{year}, month=#{month}, day=#{day}"
      # RUBY BUG?!? Can't use 'date' here... as that messes up findTick's 'date'...
      mydate = Date.new((year < 50 ? year+2000 : year+1900), month, day)

Dates are created using Rational, so they can use up a significant amount of processing time. Time is much better, provided your dates fit within its bounds.
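
A quick way to see the difference on your own machine (the ratio varies by Ruby version, but Time construction is typically much cheaper):

```ruby
require 'date'
require 'benchmark'

n = 20_000
date_t = Benchmark.realtime { n.times { Date.new(2005, 9, 8) } }
time_t = Benchmark.realtime { n.times { Time.local(2005, 9, 8) } }
printf "Date: %.3fs  Time: %.3fs\n", date_t, time_t
# Both represent the same calendar day, so they are interchangeable
# here as long as your dates fit within Time's supported range.
```
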

···

On 08 Sep 2005, at 20:46, Glenn M. Lewis wrote:


--
Eric Hodel - drbrain@segment7.net - http://segment7.net
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04

Glenn M. Lewis wrote:

    I have a Ruby script (a small part of it is included below)
that has many methods, but one method is so incredibly slow that
I would like to write it in C++ (or any other faster language).

[...]

      mydate = Date.new((year < 50 ? year+2000 : year+1900), month, day)

I predict that's your problem right there. The parsing and initialization code of the Date class is incredibly slow. I'd replace it with a less general-purpose implementation.

mathew

···

--
<URL:http://www.pobox.com/~meta/>
          WE HAVE TACOS

Hi!

         [...]

   Any ideas on how I can rewrite 'Contract.parseFile()' for
speed?

   Thanks!
-- Glenn Lewis

def Contract.parseFile(file)
   return unless File.exists?(file)
   return if @@files.has_key?(file)

I'd swap those two: test of a hash will be faster than test of a
filesystem, so may as well bail out quickly. Do repeat keys happen
often?

   @@files[file] = 1

Maybe it only needs to be a Set, not a Hash? Not sure how speeds
compare.
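
For reference, the Set variant would look like this. Set is itself implemented on top of Hash, so the speed should be nearly identical; the gain is mostly clarity:

```ruby
require 'set'

seen_files = Set.new
seen_files.add("c:/src/Data/gc.csv")

seen_files.include?("c:/src/Data/gc.csv")  # true
seen_files.add?("c:/src/Data/gc.csv")      # nil -- already present
seen_files.add?("c:/src/Data/sp.csv")      # returns the set -- newly added
```

`add?` conveniently combines the membership test and the insert in one call.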

   print "Parsing file #{file}..."
   File.open(file, "rb").each_line {|line|

I think:

     line.chomp!
     # puts line
     fields = line.split(/,/)

might be faster as
         fields = line.chomp.split(/,/)

or if you only chomp the last field afterwards (shorter string to
change)?

     next if fields.size < 8

Maybe line.count(",") first, and bail out quickly? Again, how often
does this happen?
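
The quick-bail idea with String#count, which scans the string without building an array (hypothetical lines shown):

```ruby
# At least 8 fields means at least 7 commas.
good = "GC,050908,443.5,445.0,441.2,444.8,12345,67890"
bad  = "\u001A"          # e.g. a trailing line holding only a ^Z

good.count(",") >= 7     # true  -- worth splitting
bad.count(",")  >= 7     # false -- skip without calling split
```
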

     datestring = fields[1]
     year = datestring[0..1].to_i
     month = datestring[2..3].to_i
     day = datestring[4..5].to_i

reduce array refs:

         year, month, day = datestring[0..5].scan(/../).collect do |s|
           s.to_i
         end

possibly

That's all I can think of just now.
         Hugh

···

On Sat, 10 Sep 2005, Eric Hodel wrote:

On 08 Sep 2005, at 20:46, Glenn M. Lewis wrote:

Thanks a bunch, Hugh and Eric! The combination of your
two suggestions sped it up quite a bit.

I don't agree with Robert, though... I have written many
parsers in C++ (and before that, C) that could soak up
all the data that I'm reading in less than a second whereas
this is taking approximately 9 minutes in Ruby. With the
recommendations of Hugh and Eric, it is now down to about
5 minutes, or almost a factor of 2 speedup.

I would really like an order of magnitude or more, but
I would definitely have to write it in a compiled language.
I've done this before with Ruby and C++ using SWIG, but
this particular one seemed really challenging when having
Ruby call C++ which would then call Ruby...

My last project with Ruby/C++/SWIG had Ruby calling C++
but C++ kept all the data structures internally without
ever calling Ruby, and this was *much* easier... but not
as flexible as I would like for this case.

I may have to rewrite this whole puppy in D if I'm going
to get parsing times under one second. Using C++ and STL
for its map containers is a royal nuisance, but D has
built-in associative arrays. Or maybe I should try Perl
or Python and see how their file parsing speeds compare.

Oh, and to answer Hugh's question, it is extremely rare
that a line would have less than 8 fields... sometimes
the last line of the file has only a ^Z on it.

Thanks again for your help! I appreciate it.
-- Glenn


How many files are you doing?

It'd seem to me that all that STDOUT would slow things down a WHOOOLE
lot if you're going through a bunch of files.

···


--
-Dan Nugent

can you send a sample data set (contact me offline if you wish) and expected
time to parse, and let us have a crack? those times sound distressing - is
your data HUGE?

cheers.

-a

···


--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

Great news!

  I discovered that I really only need to parse about 20% of the lines...
I put in one more optimization... specifically this one:
    next unless contract
and now the parsing time goes down to 26 seconds!!! YAHOO!!!
This is *very* tolerable!!! The final version is below...
Thanks so much for everybody's help!!!
(The data files total about 6Megs, by the way.)

-- Glenn

   def Contract.parseFile(file)
     return if @@files.has_key?(file)
     return unless File.exists?(file)
     @@files[file] = 1
     print "Parsing file #{file}..."
     File.open(file, "rb") do |io|
       io.each_line do |line|
         fields = line.chomp.split(/,/)
         next if fields.size < 8
         contract = Contract.open(fields[0])
         next unless contract
         datestring = fields[1]
         year, month, day = datestring[0..5].scan(/../).collect {|s| s.to_i }
         # RUBY BUG?!? Can't use 'date' here... as that messes up findTick's 'date'...
         mydate = Time.local((year < 50 ? year+2000 : year+1900), month, day)
         tick = Tick.new(mydate, fields[2].to_f, fields[3].to_f,
                         fields[4].to_f, fields[5].to_f,
                         fields[6].to_i, fields[7].to_i)
         contract.addTick(tick)
       end
     end
     puts "done"
   end


Great news!

I discovered that I really only need to parse about 20% of the
lines...

So you changed Contract.open()?

I put in one more optimization... specifically this one:
...>>> next unless contract
and now the parsing time goes down to 26 seconds!!! YAHOO!!!
This is *very* tollerable!!! The final version is below...
Thanks so much for everybody's help!!!
(The data files total about 6Megs, by the way.)

I don't know the condition for having Contract.open() return nil (or false) but if it's a certain pattern of field[0] that might be caught by the regexp approach I suggested.

Btw, which part of my posting did you object to? Did you try DateParse at all?

Kind regards

    robert

···


and now the parsing time goes down to 26 seconds!!! YAHOO!!!

Good.

def Contract.parseFile(file)
   return if @@files.has_key?(file)
   return unless File.exists?(file)
   @@files[file] = 1
   print "Parsing file #{file}..."

pre-create fields, contract, etc. so they don't get called
into being on each pass through the loop:

      fields, contract, mydate, tick = nil, nil, nil, nil
      year, month, day = nil, nil, nil

   File.open(file, "rb") do |io|
     io.each_line do |line|

Could do that in one, I think:
      File.open(file, 'rb').each_line do |line|
because you don't use io again.

  fields = line.chomp.split(/,/)
  next if fields.size < 8
  contract = Contract.open(fields[0])

I'm a bit concerned that I don't see a matching close for this, or a
block passed in instead.

  next unless contract
  datestring = fields[1]
  # year = datestring[0..1].to_i
  # month = datestring[2..3].to_i
  # day = datestring[4..5].to_i
  year, month, day = datestring[0..5].scan(/../).collect {|s| s.to_i }
  # puts "year=#{year}, month=#{month}, day=#{day}"
  # RUBY BUG?!? Can't use 'date' here... as that messes up findTick's 'date'...
  mydate = Time.local((year < 50 ? year+2000 : year+1900), month, day)
  tick = Tick.new(mydate, fields[2].to_f, fields[3].to_f,
      fields[4].to_f, fields[5].to_f,
      fields[6].to_i, fields[7].to_i)
  contract.addTick(tick)
     end
   end
   puts "done"
end

         Hugh

···

On Sat, 10 Sep 2005, Glenn M. Lewis wrote:

Hi Robert!

  Yes... I found out that if I create a hash of contract names
that I've already decided to ignore, then I can immediately test
field[0] to see if I can totally ignore the line, like you had suggested.

  Oh, sorry... the only thing I was referring to was not being able
to speed up the File I/O and parsing in another language... and I only
said that because I had just finished another Ruby project where the original
file parsing of about 80Megs of data would take over 10 minutes, and when
I rewrote the File I/O and parsing in C++, I could do all of it in less than
a second.

  But that is neither here nor there because the other optimizations
recommended brought down the load time to 26 seconds, which is now acceptable
and no longer worth the pain to create a C++ module that will communicate
with Ruby in both directions.

  No, I never tried the DateParse... as Eric recommended that I no
longer use Date, but switch over to Time... he was right... I got a significant
speedup with switching to the Time class.

  Thanks for your help!
-- Glenn


I used this form frequently in the past, but then someone explained to
me that it never closes the file handle. Because File.open doesn't
see an associated block, it doesn't handle the cleanup for you.
The "block inside block" version guarantees cleanup, even if an
exception is thrown.
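
Roughly, the block form expands to an explicit ensure, which is why the handle is guaranteed to get closed (a small demo, using a tempfile so it is self-contained):

```ruby
require 'tempfile'

tf = Tempfile.new("demo")
tf.write("a,b\nc,d\n")
tf.close

handle = nil
File.open(tf.path, "rb") do |io|
  handle = io
  io.each_line { |line| }
end                          # block form: close happens here, even on error

handle.closed?               # true

# The equivalent written by hand:
io = File.open(tf.path, "rb")
begin
  io.each_line { |line| }
ensure
  io.close                   # runs even if an exception was raised
end
```
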

Other than that, I think your suggestions are good.

--Wilson.

···

On 9/12/05, Hugh Sasse <hgs@dmu.ac.uk> wrote:

> File.open(file, "rb") do |io|
> io.each_line do |line|

Could do that in one, I think:
      File.open(file, 'rb').each_line do |line|
because you don't use io again.

Hi Robert!

Yes... I found out that if I create a hash of contract names
that I've already decided to ignore, then I can immediately test
field[0] to see if I can totally ignore the line, like you had
suggested.
Oh, sorry... the only thing I was referring to was not being able
to speed up the File I/O and parsing in another language...

I was probably not clear enough: I intended to convey that the pure IO part might not really speed up in another language (the overhead the language introduces is usually small compared to the actual IO).

and I only
said that because I had just finished another Ruby project where the
original file parsing of about 80Megs of data would take over 10
minutes, and when I rewrote the File I/O and parsing in C++, I could do all of it in
less than a second.

You're really making me curious... :-)

But that is neither here nor there because the other optimizations
recommended brought down the load time to 26 seconds, which is now
acceptable and no longer worth the pain to create a C++ module that
will communicate with Ruby in both directions.

Here's another suggestion that might have a small impact only: use the bang version of chomp:

File.open(file, "rb") do |io|
  io.each_line do |line|
    line.chomp!
    fields = line.split(/,/)
    # ...

This saves you one object creation per line read. If lines are short it might even make a noticeable difference.
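
One caveat with the bang version: String#chomp! returns nil when there was nothing to remove, so mutate first and use the variable afterwards rather than chaining:

```ruby
line = "a,b,c\n"
line.chomp!                  # modifies line in place, no new String created
fields = line.split(/,/)     # ["a", "b", "c"]

# Chaining would break on a line with no trailing newline:
"a,b,c".chomp!               # nil -- so nil.split would raise NoMethodError
```
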

No, I never tried the DateParse... as Eric recommended that I no
longer use Date, but switch over to Time... he was right... I got a
significant speedup with switching to the Time class.

Cool!

Thanks for your help!

You're welcome!

Kind regards

    robert

···

Glenn M. Lewis <noSpam@noSpam.com> wrote:

   File.open(file, "rb") do |io|
     io.each_line do |line|

Could do that in one, I think:
      File.open(file, 'rb').each_line do |line|
because you don't use io again.

I used this form frequently in the past, but then someone explained to
me that it never closes the file handle. Because File.open doesn't

Good point.

see an associated block, it doesn't handle the cleanup for you.

Yes, that's right. I wonder if it is worth the price in garbage
collection to explicitly close in this case? Probably not given
maintenance costs.

The "block inside block" version guarantees cleanup, even if an
exception is thrown.

Exceptions are another good point!

Other than that, I think your suggestions are good.

Did they help at all?

--Wilson.

         Hugh

···

On Wed, 14 Sep 2005, Wilson Bilkovich wrote:

On 9/12/05, Hugh Sasse <hgs@dmu.ac.uk> wrote:

Yes indeed! Thank you very much, Hugh, and all
others who have helped! The processing time is
now a flat 26 seconds, which sure beats the original
9 minutes!!!

To summarize, the three most significant speedups:
1) Only process 20% of the lines based on a quick check
    of relevance
2) Switch from using 'Date' class to 'Time' class
3) Minimize substring processing

Thanks again to all who have helped!
-- Glenn

Hugh Sasse wrote:

···

Did they help at all?
        Hugh

Hugh Sasse wrote:

see an associated block, it doesn't handle the cleanup for you.

Yes, that's right. I wonder if it is worth the price in garbage
collection to explicitly close in this case? Probably not given
maintenance costs.

The closing isn't that costly. And if you later change the script you
might run into errors when omitting proper closing of file descriptors. I
made it a habit to always use the block form (except where not applicable
for application reasons). That way you're always on the safe side.

Kind regards

    robert

Yes indeed! Thank you very much, Hugh, and all
others who have helped! The processing time is
now a flat 26 seconds, which sure beats the original
9 minutes!!!

I was hoping that pre-defining the vars outside the block would knock
another second off :-)

         Thanks,
         Hugh

···

On Wed, 14 Sep 2005, Glenn M. Lewis wrote:

Hugh Sasse wrote:

see an associated block, it doesn't handle the cleanup for you.

Yes, that's right. I wonder if it is worth the price in garbage
collection to explicitly close in this case? Probably not given
maintenance costs.

The closing isn't that costly.

The closing would allow you to avoid that block with the |io| in it.
That cost, the cost of the object, is what I meant.

And if you later change the script you
might run into errors when omitting proper closing of file descriptors. I

Hence my remark about maintenance costs.

made it a habit to always use the block form (except where not applicable
for application reasons). That way you're always on the safe side.

Agreed

Kind regards

   robert

         Hugh

···

On Wed, 14 Sep 2005, Robert Klemme wrote:

Just FYI for anyone following this thread...
I decided to go ahead and rewrite the file parsing
section of the code in C++ and interface it to Ruby
via SWIG (1.3.25), just because I thought the extra
effort would pay off in the long run with the ability
to perform a lot more experiments if it took less time
to run.

Now, with my current database of 16 files totalling
about 13Megs of information, it takes:

Ruby Only: 29 seconds
Ruby/C++ : <2 seconds

So by doing this, I got approximately a 15x speedup.

Please don't misunderstand... I'm *NOT* complaining!!!
I'm tickled pink that I can use Ruby as a front-end
for all the cool processing that I want to do on
the data after I load it in!!!

I only posted this info for those of you who are
wondering if you can benefit from putting heavy-duty
I/O in a compiled language... yes you can. :-)

-- Glenn
