Hi!
I have a Ruby script (a small part of it is included below)
that has many methods, but one method is so incredibly slow that
I would like to write it in C++ (or any other faster language).
The method is "Contract.parseFile(file)". I've tried the
latest SWIG, but am having a heck of a time trying to call the
two methods: "Contract.open()" and "Contract.addTick()" from within
the SWIG'd extension. (By the way, I also tried "rio(file).csv {|fields|"
but that had the same speed as the code below.)
Any ideas on how I can rewrite 'Contract.parseFile()' for
speed?
Thanks!
-- Glenn Lewis
require 'date'   # Date is used in parseFile below

class Contract
  @@contracts = Hash.new
  @@files = Hash.new
  DataDir = "c:/src/Data"
  MonthNames = %w(nil jan feb mar apr may jun jul aug sep oct nov dec)

  def Contract.open(contract)
    @@contracts.has_key?(contract) ? @@contracts[contract] : Contract.new(contract)
  end

  def Contract.all
    @@contracts.keys
  end

  def Contract.parseFile(file)
    return unless File.exists?(file)
    return if @@files.has_key?(file)
    @@files[file] = 1
    print "Parsing file #{file}..."
    File.open(file, "rb").each_line {|line|
      line.chomp!
      # puts line
      fields = line.split(/,/)
      next if fields.size < 8
      datestring = fields[1]
      year = datestring[0..1].to_i
      month = datestring[2..3].to_i
      day = datestring[4..5].to_i
      # puts "year=#{year}, month=#{month}, day=#{day}"
      # RUBY BUG?!? Can't use 'date' here... as that messes up findTick's 'date'...
      mydate = Date.new((year < 50 ? year+2000 : year+1900), month, day)
      contract = Contract.open(fields[0])
      tick = Tick.new(mydate, fields[2].to_f, fields[3].to_f,
                      fields[4].to_f, fields[5].to_f,
                      fields[6].to_i, fields[7].to_i)
      contract.addTick(tick)
    }
    puts "done"
  end

  def initialize(contract)
    @contract = contract
    @@contracts[contract] = self
    @ticks = Hash.new
  end

  def addTick(tick)
    @ticks[tick.date.to_s[0..9]] = tick
  end
end
First of all I'd try to determine exactly *why* it is slow and where it's spending its time. Since you're doing file IO, chances are that a version in a different language won't be much faster.
Also, as a hint for a pure Ruby solution: did you try the library functions for date parsing? That might make up a significant part of the time spent here. Also, an alternative is to use a regular expression to match the entire line instead of splitting it. That way you can verify that the current line matches your expectations (i.e. the regexp) and get direct access to the field values.
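A sketch of the regexp approach Robert describes; the field layout (symbol, YYMMDD date string, four floats, two ints) is inferred from Glenn's code, and `parse_line` and the hash keys are illustrative names, not part of the original script:

```ruby
# One regexp that both validates a line of the assumed 8-field CSV
# format and captures the pieces parseFile needs, in a single pass.
LINE_RE = /\A([^,]+),(\d{2})(\d{2})(\d{2})\d*,([\d.]+),([\d.]+),([\d.]+),([\d.]+),(\d+),(\d+)/

def parse_line(line)
  m = LINE_RE.match(line) or return nil  # malformed/short lines are skipped
  {
    symbol: m[1],
    year:   m[2].to_i, month: m[3].to_i, day: m[4].to_i,
    open:   m[5].to_f, high:  m[6].to_f,
    low:    m[7].to_f, close: m[8].to_f,
    vol:    m[9].to_i, oi:    m[10].to_i
  }
end
```

This replaces the `split` plus `fields.size < 8` check plus substring slicing with one match, and a non-matching line (such as a trailing ^Z line) simply returns nil.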
Also, I noticed that you don't close your file. I always prefer the block form:
File.open(file, "rb") do |io|
  io.each_line do |line|
    line.chomp!
    ...
  end
end
Kind regards
robert
···
Glenn M. Lewis <noSpam@noSpam.com> wrote:
[...]
[...]
# RUBY BUG?!? Can't use 'date' here... as that messes up findTick's 'date'...
mydate = Date.new((year < 50 ? year+2000 : year+1900), month, day)
Dates are created using Rational, so they can use up a significant amount of processing time. Time is much better, provided your dates fit within its bounds.
I predict that's your problem right there. The parsing and initialization code of the Date class is incredibly slow. I'd replace it with a less general-purpose implementation.
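A quick benchmark sketch of the point Eric and Robert are making (the date and iteration count are arbitrary; exact timings will vary by machine and Ruby version):

```ruby
require 'benchmark'
require 'date'

N = 10_000

# Construct the same calendar day N times each way and compare the cost.
Benchmark.bm(12) do |bm|
  bm.report("Date.new")   { N.times { Date.new(2005, 9, 10) } }
  bm.report("Time.local") { N.times { Time.local(2005, 9, 10) } }
end

# Both values represent the same calendar day; Time is just cheaper to build.
d = Date.new(2005, 9, 10)
t = Time.local(2005, 9, 10)
```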
Thanks a bunch, Hugh and Eric! The combination of your
two suggestions sped it up quite a bit.
I don't agree with Robert, though... I have written many
parsers in C++ (and before that, C) that could soak up
all the data that I'm reading in less than a second whereas
this is taking approximately 9 minutes in Ruby. With the
recommendations of Hugh and Eric, it is now down to about
5 minutes, or almost a factor of 2 speedup.
I would really like an order of magnitude or more, but
I would definitely have to write it in a compiled language.
I've done this before with Ruby and C++ using SWIG, but
this particular one seemed really challenging when having
Ruby call C++ which would then call Ruby...
My last project with Ruby/C++/SWIG had Ruby calling C++
but C++ kept all the data structures internally without
ever calling Ruby, and this was *much* easier... but not
as flexible as I would like for this case.
I may have to rewrite this whole puppy in D if I'm going
to get parsing times under one second. Using C++ and STL
for its map containers is a royal nuisance, but D has
built-in associative arrays. Or maybe I should try Perl
or Python and see how their file parsing speeds compare.
Oh, and to answer Hugh's question, it is extremely rare
that a line would have less than 8 fields... sometimes
the last line of the file has only a ^Z on it.
Thanks again for your help! I appreciate it.
-- Glenn
Hugh Sasse wrote:
···
On Sat, 10 Sep 2005, Eric Hodel wrote:
On 08 Sep 2005, at 20:46, Glenn M. Lewis wrote:
Hi!
[...]
Any ideas on how I can rewrite 'Contract.parseFile()' for
speed?
Thanks!
-- Glenn Lewis
def Contract.parseFile(file)
return unless File.exists?(file)
return if @@files.has_key?(file)
I'd swap those two: test of a hash will be faster than test of a
filesystem, so may as well bail out quickly. Do repeat keys happen
often?
@@files[file] = 1
Maybe it only needs to be a Set, not a Hash? Not sure how speeds
compare.
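A sketch of Hugh's Set idea; `seen` and `parse?` are illustrative names, not from Glenn's script:

```ruby
require 'set'

# Track already-parsed files with a Set instead of a Hash of dummy 1 values.
# Membership tests and inserts are both O(1), and the intent is clearer.
seen = Set.new

def parse?(seen, file)
  return false if seen.include?(file)  # cheap in-memory check first
  seen.add(file)
  true
end
```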
It'd seem to me that all that STDOUT would slow things down a WHOOOLE
lot if you're going through a bunch of files.
···
On 9/9/05, Glenn M. Lewis <noSpam@nospam.com> wrote:
[...]
Hugh Sasse wrote:
> On Sat, 10 Sep 2005, Eric Hodel wrote:
>
>> On 08 Sep 2005, at 20:46, Glenn M. Lewis wrote:
>>
>>> Hi!
>
> [...]
>
>>> Any ideas on how I can rewrite 'Contract.parseFile()' for
>>> speed?
>>>
>>> Thanks!
>>> -- Glenn Lewis
>>>
>>> def Contract.parseFile(file)
>>> return unless File.exists?(file)
>>> return if @@files.has_key?(file)
>
>
> I'd swap those two: test of a hash will be faster than test of a
> filesystem, so may as well bail out quickly. Do repeat keys happen
> often?
>
>>> @@files[file] = 1
>
>
> Maybe it only needs to be a Set, not a Hash? Not sure how speeds
> compare.
>
>>> print "Parsing file #{file}..."
>>> File.open(file, "rb").each_line {|line|
>
>
> I think:
>
>>> line.chomp!
>>> # puts line
>>> fields = line.split(/,/)
>
>
> might be faster as
> fields = line.chomp.split(/,/)
>
> or if you only chomp the last field afterwards (shorter string to
> change)?
>
>>> next if fields.size < 8
>
>
> Maybe line.count(",") first, and bail out quickly? Again, how often
> does this happen?
>
>>> datestring = fields[1]
>>> year = datestring[0..1].to_i
>>> month = datestring[2..3].to_i
>>> day = datestring[4..5].to_i
>
>
> reduce array refs:
>
> year, month, day = datestring[0..5].scan(/../).collect do |s|
> s.to_i
> end
>
> possibly
>
>
> That's all I can think of just now.
> Hugh
>
>
can you send a sample data set (contact me offline if you wish) and expected
time to parse and let us have a crack? those times sound distressing - is
your data HUGE?
cheers.
-a
···
On Sat, 10 Sep 2005, Glenn M. Lewis wrote:
[...]
--
email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna
I discovered that I really only need to parse about 20% of the lines...
I put in one more optimization... specifically this one:
    next unless contract
and now the parsing time goes down to 26 seconds!!! YAHOO!!!
This is *very* tolerable!!! The final version is below...
Thanks so much for everybody's help!!!
(The data files total about 6Megs, by the way.)
-- Glenn
def Contract.parseFile(file)
  return if @@files.has_key?(file)
  return unless File.exists?(file)
  @@files[file] = 1
  print "Parsing file #{file}..."
  File.open(file, "rb") do |io|
    io.each_line do |line|
      fields = line.chomp.split(/,/)
      next if fields.size < 8
      contract = Contract.open(fields[0])
      next unless contract
      datestring = fields[1]
      # year = datestring[0..1].to_i
      # month = datestring[2..3].to_i
      # day = datestring[4..5].to_i
      year, month, day = datestring[0..5].scan(/../).collect {|s| s.to_i }
      # puts "year=#{year}, month=#{month}, day=#{day}"
      # RUBY BUG?!? Can't use 'date' here... as that messes up findTick's 'date'...
      mydate = Time.local((year < 50 ? year+2000 : year+1900), month, day)
      tick = Tick.new(mydate, fields[2].to_f, fields[3].to_f,
                      fields[4].to_f, fields[5].to_f,
                      fields[6].to_i, fields[7].to_i)
      contract.addTick(tick)
    end
  end
  puts "done"
end
Ara.T.Howard wrote:
···
can you send a sample data set (contact me offline if you wish) and expected
time to parse and let us have a crack? those times sound distressing - is
your data HUGE?
I discovered that I really only need to parse about 20% of the
lines...
So you changed Contract.open()?
I put in one more optimization... specifically this one:
    next unless contract
and now the parsing time goes down to 26 seconds!!! YAHOO!!!
This is *very* tolerable!!! The final version is below...
Thanks so much for everybody's help!!!
(The data files total about 6Megs, by the way.)
I don't know the condition for having Contract.open() return nil (or false), but if it's a certain pattern of fields[0], that might be caught by the regexp approach I suggested.
Btw, which part of my posting did you object to? Did you try DateParse at all?
Kind regards
robert
···
Glenn M. Lewis <noSpam@noSpam.com> wrote:
[...]
Yes... I found out that if I create a hash of contract names
that I've already decided to ignore, then I can immediately test
field[0] to see if I can totally ignore the line, like you had suggested.
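A minimal sketch of the ignore-list check Glenn describes; the symbols and the helper name are made up for illustration (in the real script the check lives inside Contract.open):

```ruby
# Symbols we've already decided to skip. Testing fields[0] against this
# hash lets the parser bail out before doing any date or float conversion.
IGNORED = { "XX" => true, "YY" => true }  # example symbols, not real data

def open_contract(symbol)
  return nil if IGNORED[symbol]  # nil makes the caller's `next unless contract` fire
  symbol  # stand-in for the real Contract lookup/creation
end
```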
Oh, sorry... the only thing I was referring to was not being able
to speed up the File I/O and parsing in another language... and I only
said that because I had just finished another Ruby project where the original
file parsing of about 80Megs of data would take over 10 minutes, and when
I rewrote the File I/O and parsing in C++, I could do all of it in less than
a second.
But that is neither here nor there because the other optimizations
recommended brought down the load time to 26 seconds, which is now acceptable
and no longer worth the pain to create a C++ module that will communicate
with Ruby in both directions.
No, I never tried the DateParse... as Eric recommended that I no
longer use Date, but switch over to Time... he was right... I got a significant
speedup by switching to the Time class.
Thanks for your help!
-- Glenn
Robert Klemme wrote:
···
[...]
I used this form frequently in the past, but then someone explained to
me that it never closes the file handle. Because File.open doesn't
see an associated block, it doesn't handle the cleanup for you.
The "block inside block" version guarantees cleanup, even if an
exception is thrown.
Other than that, I think your suggestions are good.
--Wilson.
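A small demonstration of the guarantee Wilson describes (`Tempfile` is used here just to have a real file to open):

```ruby
require 'tempfile'

tmp = Tempfile.new("demo")
tmp.write("one\ntwo\n")
tmp.close

io_ref = nil
begin
  File.open(tmp.path, "rb") do |io|
    io_ref = io
    raise "boom"   # simulate a failure mid-parse
  end
rescue RuntimeError
  # the block form still closed the handle on the way out
end

# Without a block, nothing closes the handle until GC finalizes it.
io2 = File.open(tmp.path, "rb")
```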
···
On 9/12/05, Hugh Sasse <hgs@dmu.ac.uk> wrote:
> File.open(file, "rb") do |io|
> io.each_line do |line|
Could do that in one, I think:
File.open(file, 'rb').each_line do |line|
because you don't use io again.
Yes... I found out that if I create a hash of contract names
that I've already decided to ignore, then I can immediately test
field[0] to see if I can totally ignore the line, like you had
suggested.
Oh, sorry... the only thing I was referring to was not being able
to speed up the File I/O and parsing in another language...
I was probably not clear enough: I intended to convey that the pure IO part might not really speed up in another language (the overhead introduced by the language is usually small compared to the actual IO).
and I only said that because I had just finished another Ruby project where the
original file parsing of about 80Megs of data would take over 10 minutes, and
when I rewrote the File I/O and parsing in C++, I could do all of it in less
than a second.
You're really making me curious...
But that is neither here nor there because the other optimizations
recommended brought down the load time to 26 seconds, which is now
acceptable and no longer worth the pain to create a C++ module that
will communicate with Ruby in both directions.
Here's another suggestion that might have a small impact only: use the bang version of chomp:
File.open(file, "rb") do |io|
  io.each_line do |line|
    line.chomp!
    fields = line.split(/,/)
    # ....
  end
end
This saves you one object creation per line read. If lines are short it might even make a noticeable difference.
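One caveat worth knowing about the bang version: `chomp!` returns nil when there was nothing to remove, so its result must not be chained. A small sketch of the difference:

```ruby
# chomp returns a new String; chomp! modifies the receiver in place and
# returns nil when there was no trailing newline to strip.
line = "a,b,c\n"
copy = line.chomp      # new object, `line` still has its newline
line.chomp!            # mutates `line` itself: one fewer allocation per line
```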
No, I never tried the DateParse... as Eric recommended that I no
longer use Date, but switch over to Time... he was right... I got a
significant speedup with switching to the Time class.
Yes indeed! Thank you very much, Hugh, and all
others who have helped! The processing time is
now a flat 26 seconds, which sure beats the original
9 minutes!!!
To summarize, the three most significant speedups:
1) Only process 20% of the lines based on a quick check
of relevance
2) Switch from using 'Date' class to 'Time' class
3) Minimize substring processing
Because File.open doesn't see an associated block, it doesn't handle the cleanup for you.
Yes, that's right. I wonder if it is worth the price in garbage
collection to explicitly close in this case? Probably not given
maintenance costs.
The closing isn't that costly. And if you later change the script you
might run into errors when omitting proper closing of file descriptors. I
made it a habit to always use the block form (except where not applicable
for application reasons). That way you're always on the safe side.
Yes indeed! Thank you very much, Hugh, and all
others who have helped! The processing time is
now a flat 26 seconds, which sure beats the original
9 minutes!!!
I was hoping that pre-defining the vars outside the block would knock
another second off.
Just FYI for anyone following this thread...
I decided to go ahead and rewrite the file parsing
section of the code in C++ and interface it to Ruby
via SWIG (1.3.25), just because I thought the extra
effort would pay off in the long run with the ability
to perform a lot more experiments if it took less time
to run.
Now, with my current database of 16 files totalling
about 13Megs of information, it takes:
Ruby Only: 29 seconds
Ruby/C++ : <2 seconds
So by doing this, I got approximately a 15x speedup.
Please don't misunderstand... I'm *NOT* complaining!!!
I'm tickled pink that I can use Ruby as a front-end
for all the cool processing that I want to do on
the data after I load it in!!!
I only posted this info for those of you who are
wondering if you can benefit from putting heavy-duty
I/O in a compiled language... yes you can.
-- Glenn
Glenn M. Lewis wrote:
···
[...]