Optimizing a single slow method

Hugh Sasse wrote:

Hugh Sasse wrote:

see an associated block, it doesn't handle the cleanup for you.

Yes, that's right. I wonder if it is worth the price in garbage
collection to explicitly close in this case? Probably not given
maintenance costs.

The closing isn't that costly.

The closing would allow you to avoid that block with the |io| in it.
That cost, the cost of the object, is what I meant.

What object exactly do you mean? The IO instance is there anyway so you
probably are referring to the overhead incurred by the block. I don't
think that this is significant.
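For reference, the two idioms under discussion (a minimal sketch; the file name is illustrative):

  # explicit close: no block, the caller is responsible for cleanup
  io = File.open('data.txt')
  begin
    io.each_line { |line| }   # ... work with io ...
  ensure
    io.close
  end

  # block form: the file is closed automatically when the block exits
  File.open('data.txt') do |io|
    io.each_line { |line| }   # ... work with io ...
  end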

14:50:33 [ruby]: ruby block-overhead.rb
1000000
               user     system      total        real
block      3.563000   0.016000   3.579000 (  3.569000)
no block   1.859000   0.000000   1.859000 (  1.867000)

That's a difference of 1.704 seconds, i.e. about 1.7 microseconds per
invocation. Not really much compared to the overhead of opening the file
and all the other work.

Kind regards

    robert

block-overhead.rb (267 Bytes)
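The attachment itself isn't reproduced in the archive; a minimal sketch of the kind of measurement block-overhead.rb presumably makes (method names are illustrative, not necessarily Robert's):

  require 'benchmark'

  N = 1_000_000
  puts N

  def with_block   # a call that yields to an (empty) block
    yield
  end

  def no_block     # the same call without any block involved
  end

  Benchmark.bm 10 do |bm|
    bm.report("block")    { N.times { with_block {} } }
    bm.report("no block") { N.times { no_block } }
  end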

···

On Wed, 14 Sep 2005, Robert Klemme wrote:

Hi Glenn,

this sounds like a challenge! I'd love to get my hands on the input and the spec of what you want to do with the data, to see whether I can find an even faster Ruby implementation. If your data is not private, that is. Alternatively you could maybe anonymize it...

Kind regards

    robert

···

Glenn M. Lewis <noSpam@noSpam.com> wrote:

Just FYI for anyone following this thread...
I decided to go ahead and rewrite the file parsing
section of the code in C++ and interface it to Ruby
via SWIG (1.3.25), just because I thought the extra
effort would pay off in the long run with the ability
to perform a lot more experiments if it took less time
to run.

Now, with my current database of 16 files totalling
about 13 MB of information, it takes:

Ruby Only: 29 seconds
Ruby/C++ : <2 seconds

So by doing this, I got approximately a 15x speedup.

Please don't misunderstand... I'm *NOT* complaining!!!
I'm tickled pink that I can use Ruby as a front-end
for all the cool processing that I want to do on
the data after I load it in!!!

I only posted this info for those of you who are
wondering if you can benefit from putting heavy-duty
I/O in a compiled language... yes you can. :-)
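The Ruby side of such a split might look like this -- a hypothetical sketch, since the extension code itself isn't posted; 'Contract.parseFile' is the entry point Glenn describes later in the thread, and the require name is an assumption:

  require 'contract'    # hypothetical name of the SWIG-generated C++ extension
  Dir['Data/mrn*.txt'].each { |f| Contract.parseFile(f) }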

Robert Klemme wrote:

[...]
Hi Glenn,

this sounds like a challenge! I'd love to get my hands on the input and the spec of what you want to do with the data, to see whether I can find an even faster Ruby implementation. If your data is not private, that is. Alternatively you could maybe anonymize it...

Kind regards

   robert

second that!

cheers

Simon

Hi Robert!

  I wish I could point to it or upload it, but it is historical
commodities data from http://www.barchart.com/ and I just tried to access
the data without logging into my subscription, and I can't. If anyone
else does have access to it, though, it is the 'Format-N' version of
the data that I chose to process.

  Each line in the file consists of 8 fields, separated by commas
(essentially a CSV file):
contract,date,open,high,low,close,volume,openInterest
here is an example:
A62005U,050909,0.7726,0.7758,0.7703,0.7737,12366,0
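(For illustration, such a line pulls apart with a plain split -- a sketch, not code from the thread:)

  fields = "A62005U,050909,0.7726,0.7758,0.7703,0.7737,12366,0".split(',')
  contract, date = fields[0], fields[1]
  open, high, low, close = fields[2, 4].map { |v| v.to_f }
  volume, open_interest = fields[6].to_i, fields[7].to_i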

  Having said that though, there must be a free source of historical
information for commodities data... I just haven't found it yet.

-- Glenn

Robert Klemme wrote:

···

Hi Glenn,

this sounds like a challenge! I'd love to get my hands on the input and the spec of what you want to do with the data, to see whether I can find an even faster Ruby implementation. If your data is not private, that is. Alternatively you could maybe anonymize it...

Kind regards

   robert

Simon Kröger wrote:

Robert Klemme wrote:

> [...]

second that!

cheers

Simon

I'd like to give it a shot.

Glenn M. Lewis wrote:

Hi Robert!

I wish I could point to it or upload it, but it is historical
commodities data from http://www.barchart.com/ and I just tried to
access the data without logging into my subscription, and I can't.
If anyone else does have access to it, though, it is the 'Format-N'
version of the data that I chose to process.

Each line in the file consists of 8 fields, separated by commas
(essentially a CSV file):
contract,date,open,high,low,close,volume,openInterest
here is an example:
A62005U,050909,0.7726,0.7758,0.7703,0.7737,12366,0

Ok, at least we can generate a large data set with this kind of
information. You talked about repetitions etc. Are there multiple
entries per contract?

Plus, we need to know what exactly you want to do with the data.

Kind regards

    robert

William James wrote:

Simon Kröger wrote:

Robert Klemme wrote:

[...]

second that!

cheers

Simon

I'd like to give it a shot.

Hi,

As I was curious, I wrote some little test scripts.
First I create a test file (100,000 rows, 9 values per row: 5 ints, 4 strings). Then I use different methods to read it:

                       user     system      total        real
just read          0.062000   0.031000   0.093000 (  0.094000)
just readlines     0.219000   0.016000   0.235000 (  0.234000)
readlines-split    3.468000   0.078000   3.546000 (  3.562000)
read-scan         10.953000   0.047000  11.000000 ( 11.016000)
read-scan-block   11.485000   0.031000  11.516000 ( 11.563000)
read-split-whole   6.234000   0.047000   6.281000 (  6.312000)

"just read" and "just readlines" are only for reference.
The file is approx. 12 MB and I think 3.5s is a good starting point.

Here is the code:

···

----------------------------------------------------------------------
require 'benchmark'

s = ' ' * 21
# build a ~12 MB test file: 100000 rows of alternating ints and random strings
open('testfile.cvs', 'wb') do |file|
   100000.times do |l|
     line = l.to_s;
     4.times do
       21.times{|i|s[i] = ?A + rand(26)}
       line << ', ' << s << ', ' << rand(10000).to_s
     end
     file.puts(line)
   end
end

a1, a2, a3, a4 = nil
Benchmark.bm 20 do |bm|
   bm.report("just read") do
     a = IO.read('testfile.cvs')
   end

   bm.report("just readlines") do
     IO.readlines('testfile.cvs')
   end

   bm.report("readlines-split") do
     a3 = IO.readlines('testfile.cvs').map!{|l| l.split(', ')}
     a3.each{|b| b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i; b[6] = b[6].to_i; b[8] = b[8].to_i}
   end

   bm.report("read-scan") do
     a1 = IO.read('testfile.cvs').scan(/^(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?)$/)
     a1.each{|b| b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i; b[6] = b[6].to_i; b[8] = b[8].to_i}
   end

   bm.report("read-scan-block") do
     a2 =
IO.read('testfile.cvs').scan(/^(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?)$/) do |b|
       a2 << [b[0].to_i, b[1], b[2].to_i, b[3], b[4].to_i, b[5], b[6].to_i, b[7], b[8].to_i]
     end
   end

   bm.report("read-split-whole") do
     counter = 0;
     a4 = Array.new(100000) {Array.new(9)}
     IO.read('testfile.cvs').split(/\n|, /).each do |f|
       a4[counter / 9][counter % 9] = ((counter % 9) % 2).zero? ? f.to_i : f
       counter += 1
     end
   end
end

puts a1 == a2 && a2 == a3 && a3 == a4
----------------------------------------------------------------------

It would be nice if anyone could rewrite the scanf package in C...
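(For reference, the scanf package in question is the pure-Ruby scanf standard library, which is part of why it's slow -- in recent Rubies it's a separate gem. A small usage sketch:)

  require 'scanf'
  "050909, 0.7726, 12366".scanf("%d, %f, %d")   # => [50909, 0.7726, 12366]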

cheers

Simon

Yes, there are three types of files: daily, monthly, and
yearly. In processing the data, I typically take the last 120 days
or so and load in all the 'ticks' (each line I call a 'tick') for
all contracts (the first field is the contract name) that have a
tick in today's daily file. The first optimization is to ignore any
new contracts found in other files that weren't in today's daily
file. This alone knocks off a huge amount of processing time.
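(A sketch of that first optimization -- illustrative only, with file names taken from the listing below, not Glenn's actual code:)

  # contracts seen in today's daily file
  ref = {}
  IO.foreach('Data/mrn09225.txt') { |l| ref[l[/\A[^,]+/]] = true }

  # while parsing older files, skip ticks for unknown contracts
  ticks = []
  IO.foreach('Data/mrnaug05.txt') do |l|
    ticks << l.chomp.split(',') if ref[l[/\A[^,]+/]]
  end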

  Here's a dump of my data directory:
  Directory of C:\src\barchart\Data

09/07/2005 09:26 AM 79,633 mrn09015.txt
09/07/2005 09:26 AM 79,526 mrn09025.txt
09/07/2005 09:26 AM 79,282 mrn09065.txt
09/08/2005 02:20 PM 79,700 mrn09075.txt
09/08/2005 07:36 PM 80,014 mrn09085.txt
09/09/2005 05:24 PM 80,092 mrn09095.txt
09/12/2005 04:13 PM 78,405 mrn09125.txt
09/14/2005 12:33 AM 80,065 mrn09135.txt
09/14/2005 04:58 PM 80,804 mrn09145.txt
09/15/2005 09:24 PM 80,344 mrn09155.txt
09/16/2005 05:20 PM 80,318 mrn09165.txt
05/01/2003 11:03 AM 1,380,685 mrnapr03.txt
05/01/2004 11:07 AM 1,554,968 mrnapr04.txt
04/30/2005 07:26 PM 1,573,078 mrnapr05.txt
08/30/2003 10:48 AM 1,443,433 mrnaug03.txt
09/01/2004 11:19 AM 1,632,816 mrnaug04.txt
09/01/2005 06:23 PM 1,806,148 mrnaug05.txt
01/01/2004 10:55 AM 1,479,529 mrndec03.txt
01/01/2005 11:31 AM 1,643,217 mrndec04.txt
03/01/2003 06:14 PM 1,285,420 mrnfeb03.txt
02/28/2004 11:01 AM 1,435,698 mrnfeb04.txt
03/01/2005 10:19 AM 1,443,562 mrnfeb05.txt
02/01/2003 06:16 PM 1,405,893 mrnjan03.txt
01/31/2004 10:58 AM 1,466,198 mrnjan04.txt
02/01/2005 06:48 PM 1,475,062 mrnjan05.txt
08/01/2003 10:45 AM 1,533,833 mrnjul03.txt
07/31/2004 11:15 AM 1,577,137 mrnjul04.txt
07/30/2005 06:26 PM 1,599,177 mrnjul05.txt
07/01/2003 10:43 AM 1,464,763 mrnjun03.txt
07/01/2004 11:13 AM 1,586,983 mrnjun04.txt
07/01/2005 09:39 AM 1,645,860 mrnjun05.txt
04/01/2003 03:10 PM 1,405,070 mrnmar03.txt
04/01/2004 11:04 AM 1,719,759 mrnmar04.txt
04/01/2005 07:17 PM 1,650,906 mrnmar05.txt
05/31/2003 08:32 AM 1,409,874 mrnmay03.txt
06/01/2004 11:09 AM 1,492,088 mrnmay04.txt
06/01/2005 10:55 AM 1,575,500 mrnmay05.txt
11/29/2003 10:53 AM 1,328,361 mrnnov03.txt
12/01/2004 11:28 AM 1,570,140 mrnnov04.txt
11/01/2003 03:16 PM 1,576,191 mrnoct03.txt
10/30/2004 11:25 AM 1,558,689 mrnoct04.txt
10/01/2003 10:51 AM 1,479,491 mrnsep03.txt
10/01/2004 11:22 AM 1,599,579 mrnsep04.txt
12/30/2000 11:40 AM 12,909,715 newmrn00.txt
01/01/2002 12:04 PM 15,083,716 newmrn01.txt
01/01/2003 12:33 PM 16,715,817 newmrn02.txt
01/01/1991 09:09 AM 5,815,310 newmrn90.txt
01/01/1992 09:19 AM 6,618,404 newmrn91.txt
01/01/1993 09:31 AM 7,191,765 newmrn92.txt
01/01/1994 09:43 AM 7,731,938 newmrn93.txt
12/31/1994 09:57 AM 8,467,874 newmrn94.txt
12/30/1995 10:11 AM 8,769,054 newmrn95.txt
01/01/1997 10:27 AM 9,489,616 newmrn96.txt
01/01/1998 10:43 AM 10,032,432 newmrn97.txt
01/01/1999 11:00 AM 10,717,629 newmrn98.txt
01/01/2000 11:19 AM 11,632,064 newmrn99.txt
               56 File(s) 180,852,625 bytes

  Once all the data has been loaded for the contracts that exist
in today's daily file, up to a total of about 120 ticks for each
contract, the file processing is done, and I can start playing
with the numbers.

  I have a Ruby class called 'Contract' that has a
singleton method called 'parseFile(name)', written in C++, to parse
the files. I also have a C++ routine called 'getTicks(date=nil,days=1)'
that returns an array of 'days' ticks in order, starting at 'date'
(or today's date if nil).

  Then I've got a bunch more routines to play with the data
that all call 'getTicks'. Here's a small sample of the routines:
   def getOpen(date=nil, days=1)
     getTicks(date, days).collect {|t| t.open }
   end
   def getHigh(date=nil, days=1)
     getTicks(date, days).collect {|t| t.high }
   end
   def getLow(date=nil, days=1)
     getTicks(date, days).collect {|t| t.low }
   end
   def getClose(date=nil, days=1)
     getTicks(date, days).collect {|t| t.close }
   end
   def getChange(date=nil, days=1)
     # day-over-day close differences: fetch days+1 closes, then
     # compute diff[i] = close[i+1] - close[i]
     close = getClose(date, days+1)
     diff = close.dup
     diff.shift
     diff.each_with_index {|val, i|
       diff[i] = val - close[i]
     }
     diff
   end

  And that's about all there is to it. Of course, I
could have written all the data processing stuff in C++ too,
but keeping it in Ruby gives me a lot more flexibility in
the things I can do, and now that the data load times are
very tolerable because of C++, I can play around and change
things really quickly.

  I hope that helps.
-- Glenn

Robert Klemme wrote:

···

A62005U,050909,0.7726,0.7758,0.7703,0.7737,12366,0

Ok, at least we can generate a large data set with this kind of
information. You talked about repetitions etc. Are there multiple
entries per contract?

Plus, we need to know what exactly you want to do with the data.

Kind regards

    robert

Simon Kröger wrote:

William James wrote:

> Simon Kröger wrote:
>
>>Robert Klemme wrote:
>>
>>
>>>[...]
>>
>>second that!
>>
>>cheers
>>
>>Simon
>
>
> I'd like to give it a shot.

Hi,

[...]

   bm.report("readlines-split") do
     a3 = IO.readlines('testfile.cvs').map!{|l| l.split(', ')}
     a3.each{|b| b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i; b[6] = b[6].to_i; b[8] = b[8].to_i}
   end

[...]

On my computer, it's a tiny bit faster without the .map!.

  bm.report( "readlines-split" ) do
    a3 = IO.readlines('testfile.cvs').map{ |l|
      l.split(', ') }.map{ |b|
        b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i
        b[6] = b[6].to_i; b[8] = b[8].to_i
        b
      }
  end

Glenn,

here's my first shot. It doesn't seem very fast but then again there are still some uncertainties:

- How many contracts are typically in a reference day's set?

- How many contracts are there in total?

- What percentage of the reference contracts is present in an average file?

- How do dates relate to files? (I assumed a file per day plus I used synthetic dates; see the generator script)

For 50 files with 212998930 bytes this took 476.511s total on my machine (2.34s/MB). Maybe you just throw it at your data set and see how it works out.

Kind regards

    robert

generate_ticks.rb (895 Bytes)

parse_contracts.rb (2.13 KB)

Hi Robert!

  VERY IMPRESSIVE!!! After tweaking your two regexps from
(\w\d+\w) to (\w+\d+\w) (because 'AD2005U' is also a valid contract),
I got this:

    0.047s read ref
    0.062s read file c:/src/barchart/Data/mrn09215.txt
    0.047s read file c:/src/barchart/Data/mrn09205.txt
    0.063s read file c:/src/barchart/Data/mrn09195.txt
    0.078s read file c:/src/barchart/Data/mrn09165.txt
    0.062s read file c:/src/barchart/Data/mrn09155.txt
    0.047s read file c:/src/barchart/Data/mrn09145.txt
    0.047s read file c:/src/barchart/Data/mrn09135.txt
    0.078s read file c:/src/barchart/Data/mrn09125.txt
    0.063s read file c:/src/barchart/Data/mrn09095.txt
    0.172s read file c:/src/barchart/Data/mrn09085.txt
    0.109s read file c:/src/barchart/Data/mrn09075.txt
    0.094s read file c:/src/barchart/Data/mrn09065.txt
    0.047s read file c:/src/barchart/Data/mrn09025.txt
    0.062s read file c:/src/barchart/Data/mrn09015.txt
    1.547s read file c:/src/barchart/Data/mrnaug05.txt
    1.531s read file c:/src/barchart/Data/mrnjul05.txt
    1.141s read file c:/src/barchart/Data/mrnjun05.txt
    1.375s read file c:/src/barchart/Data/mrnmay05.txt
    1.734s read file c:/src/barchart/Data/mrnapr05.txt
    0.907s finished post processing
    9.313s total

1415 total contracts
136164 total ticks (averages out to 96 ticks per contract)
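(The difference the extra \w+ makes, in isolation -- a quick check:)

  /^(\w\d+\w),/  =~ "AD2005U,050909"   # => nil -- misses multi-letter roots
  /^(\w+\d+\w),/ =~ "AD2005U,050909"   # => 0, with $1 == "AD2005U"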

  I ought to point out that the 'ref' file is actually not
processed in this case (meaning that its ticks are not recorded),
but that would probably add on another 0.078s or so.

  Also, I'm expecting more average ticks than that, so
I would have to figure out why it is missing some ticks... but
it is probably just a minor regexp tweak.

  Another minor note is that volume and openInterest were
not recorded, but that is a very minor thing to add on.

  So now the score is:
Glenn's Ruby-Only: ~29 seconds
Robert's Ruby-Only: ~9 seconds
Glenn's Ruby/C++: ~2 seconds

  Great job, Robert! Now, to answer your questions below...

Robert Klemme wrote:

Glenn,

here's my first shot. It doesn't seem very fast but then again there are still some uncertainties:

- How many contracts are typically in a reference day's set?

C:\>wc \src\barchart\data\mrn09225.txt
    1712 1712 79305 \src\barchart\data\mrn09225.txt
About 1700, but only 1415 matched the (\w+\d+\w) regexp, and those
are the only ones I care about.

- How many contracts are there in total?

  I don't know... probably over 5000, depending on how far you
go back, because older contracts expire and newer contracts start up.
That last letter represents the month of the contract:
F=Jan,G=Feb,H=Mar,J=Apr,K=May,M=Jun,N=Jul,Q=Aug,U=Sep,V=Oct,X=Nov,Z=Dec
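(Transcribed as a Ruby lookup table; the last-character indexing is illustrative:)

  MONTH_CODE = {
    'F' => 1, 'G' => 2, 'H' => 3, 'J' => 4,  'K' => 5,  'M' => 6,
    'N' => 7, 'Q' => 8, 'U' => 9, 'V' => 10, 'X' => 11, 'Z' => 12
  }
  MONTH_CODE['A62005U'[-1, 1]]   # => 9 (September)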

- What percentage of the reference contracts is present in an average file?

  As you start out, nearly 100%... then as you go back to earlier
and earlier dates, the reference contracts start to die out, and it may
drop down to around 90-95% or so... but in the example above, I'm only
going back around 100 days.

- How do dates relate to files? (I assumed a file per day plus I used synthetic dates; see the generator script)

  Well, there are three types of files: daily updates, monthly updates,
and yearly updates. So far, I haven't needed to go back to any of the
yearly updates in any of the processing I've done. But suffice it to say
that a monthly update file is basically the concatenation ('cat') of all
the daily files for that month, and the yearly is the cat of all the
monthly files for that year.

For 50 files with 212998930 bytes this took 476.511s total on my machine (2.34s/MB). Maybe you just throw it at your data set and see how it works out.

Kind regards

   robert

  Nice job! Thanks, Robert!
-- Glenn

Hi Robert!

VERY IMPRESSIVE!!! After tweaking your two regexps from
(\w\d+\w) to (\w+\d+\w) (because 'AD2005U' is also a valid contract),
I got this:

[...]

I ought to point out that the 'ref' file is actually not
processed in this case (meaning that its ticks are not recorded),
but that would probably add on another 0.078s or so.

Guess so. I thought it was cleaner to do this separately, but if you actually need it, it's a minor change. You could integrate it into #parse_ticks and remember whether it was the first file. Untested but cryptic:

  def parse_ticks_2(io)
    create = @contracts.nil?
    @contracts = {} if create

    io.each_line do |line|
      if %r{^
             (\w+\d+\w), # contractid (Glenn's tweak: allow multi-letter roots)
             (\d{6}), # date
             (\d+(?:\.\d+)?), # open
             (\d+(?:\.\d+)?), # high
             (\d+(?:\.\d+)?), # low
             (\d+(?:\.\d+)?), # close
           }x =~ line
        cid = $1.freeze

        contract = (@contracts[cid] || ( create && (@contracts[cid] = Contract.new cid) ) ) and
          contract.add_tick $2, $3.to_f, $4.to_f, $5.to_f, $6.to_f
      end
    end
  end

Also, I'm expecting more average ticks than that, so
I would have to figure out why it is missing some ticks... but
it is probably just a minor regexp tweak.

Another minor note is that volume and openInterest were
not recorded, but that is a very minor thing to add on.

Certainly. Just add them to the struct and add_tick(). I didn't see them processed in your code, so I thought you didn't need/want them.

So now the score is:
Glenn's Ruby-Only: ~29 seconds
Robert's Ruby-Only: ~9 seconds
Glenn's Ruby/C++: ~2 seconds

Great job, Robert! Now, to answer your questions below...

Wow!! I didn't expect it to compete so well. Maybe I need to buy a new machine (it's a P4 with 1.8GHz and 1GB mem) or switch off the virus scanner. :-)

Now it would be interesting to see which difference in the code caused the performance difference. My guess is it's one or several of these (see the sketch after this list):

- I freeze hash keys. This saves a dup.freeze on the keys inserted into hashes (it's an internal implementation detail of Hash that unfrozen string keys are duplicated and frozen on insert, to avoid accidental aliasing effects through key strings changed afterwards).

- I didn't use split, thus avoiding unnecessary object creation when a record is not needed.

- I probably made the regexp more selective and thus more efficient.
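A quick check of the first point -- a frozen string key is stored as-is, while an unfrozen one is duplicated and frozen on insert:

  key = "A62005U".freeze
  h = {}
  h[key] = true
  h.keys.first.equal?(key)    # => true: the frozen key was not copied

  h2 = {}
  k2 = "A62005U"              # not frozen
  h2[k2] = true
  h2.keys.first.equal?(k2)    # => false: Hash stored a frozen duplicate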

<snip/>

- What percentage of the reference contracts is present in an
average file?

As you start out, nearly 100%... then as you go back to earlier
and earlier dates, the reference contracts start to die out, and it
may drop down to around 90-95% or so... but in the example above, I'm
only going back around 100 days.

Ah, yes. I didn't consider this in my generator script.

- How do dates relate to files? (I assumed a file per day plus I used
synthetic dates; see the generator script)

Well, there are three types of files: daily updates, monthly updates,
and yearly updates. So far, I haven't needed to go back to any of the
yearly updates in any of the processing I've done. But suffice it to say
that a monthly update file is basically the concatenation ('cat') of
all the daily files for that month, and the yearly is the cat of all
the monthly files for that year.

Uh, sounds like yearly logs are going to be huuuge.

Nice job! Thanks, Robert!

You're very welcome!

Kind regards

    robert

···

Glenn M. Lewis <noSpam@noSpam.com> wrote:

This is even more efficient as it pulls the boolean evaluation out of the loop *and* takes advantage of the fact that Hash and Proc both accept #[]:

  def parse_ticks_3(io)
    if @contracts
      cg = @contracts
    else
      @contracts = {}
      cg = lambda {|c_id| @contracts[c_id] ||= Contract.new c_id}
    end

    io.each_line do |line|
      if %r{^
             (\w+\d+\w), # contractid (Glenn's tweak: allow multi-letter roots)
             (\d{6}), # date
             (\d+(?:\.\d+)?), # open
             (\d+(?:\.\d+)?), # high
             (\d+(?:\.\d+)?), # low
             (\d+(?:\.\d+)?), # close
           }x =~ line
        cid = $1.freeze

        contract = cg[cid] and
          contract.add_tick $2, $3.to_f, $4.to_f, $5.to_f, $6.to_f
      end
    end
  end
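Both respond to #[], which is what lets cg be either a Hash or a lambda:

  {"a" => 1}["b"]                  # Hash#[]: => nil for an unknown key
  lambda { |k| k.upcase }["b"]     # Proc#[]: calls the lambda, => "B"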

Kind regards

    robert

parse_contracts.rb (3.32 KB)

···

Robert Klemme <bob.news@gmx.net> wrote:

Guess so. I thought it was cleaner to do this separately, but if you
actually need it, it's a minor change. [...]

I forgot the missing date conversion in this list. One solution for date calculations would then be to do the calculations for the date range with real dates and then convert the results back to strings for comparison. However, I found a conversion method that's quite efficient (see attached), although it slows down the post processing (sorting) a bit. :-)
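(The attached method isn't reproduced here; one way such a 'yymmdd' conversion could look -- a sketch, with an assumed pivot at 50 for the century:)

  require 'date'

  def tick_date(yymmdd)
    yy, mm, dd = yymmdd[0, 2].to_i, yymmdd[2, 2].to_i, yymmdd[4, 2].to_i
    Date.new(yy < 50 ? 2000 + yy : 1900 + yy, mm, dd)
  end

  tick_date('050909')   # => the Date for 2005-09-09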

There's also a more efficient access to individual values in there...

Btw, I left the interest and volume out to keep time measurements comparable.

Kind regards

    robert

parse_contracts.rb (4.03 KB)

···

Robert Klemme <bob.news@gmx.net> wrote:

Now it would be interesting to see which difference in the code
caused the performance difference. [...]