Yes, there are three types of files: daily, monthly, and
yearly. In processing the data, I typically take the last 120 days
or so and load all the 'ticks' (I call each line a 'tick') for
every contract (the first field is the contract name) that has
a tick in today's daily file. The first optimization is
to ignore any contracts found in the other files that aren't
in today's daily file. This alone knocks off a huge amount
of processing time.
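As a rough sketch of that first optimization in plain Ruby (not the actual C++ parser): collect the contract names from today's daily file into a set, then drop any line from the other files whose first field isn't in that set. The helper names 'contract_names' and 'filter_ticks' are made up for illustration, and the second sample line is invented data.

```ruby
require 'set'

# Hypothetical helper: the set of contract names (first comma-separated
# field) that appear in a list of tick lines.
def contract_names(lines)
  lines.map { |line| line.split(',', 2).first }.to_set
end

# Hypothetical helper: keep only the lines whose contract is in `wanted`.
def filter_ticks(lines, wanted)
  lines.select { |line| wanted.include?(line.split(',', 2).first) }
end

# Sample data: the first contract name comes from the thread,
# the 'ZZ2005Z' line is invented for the example.
today = ["A62005U,050916,0.7726,0.7758,0.7703,0.7737,12366,0"]
older = ["A62005U,050909,0.7726,0.7758,0.7703,0.7737,12366,0",
         "ZZ2005Z,050909,1.0,1.1,0.9,1.05,100,0"]

wanted = contract_names(today)
kept   = filter_ticks(older, wanted)
puts kept.length   # only the A62005U line survives
```

Doing the name check before splitting the rest of the line is where the savings would come from, since most lines in the big files get thrown away without being fully parsed.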
Here's a dump of my data directory:
Directory of C:\src\barchart\Data
09/07/2005 09:26 AM 79,633 mrn09015.txt
09/07/2005 09:26 AM 79,526 mrn09025.txt
09/07/2005 09:26 AM 79,282 mrn09065.txt
09/08/2005 02:20 PM 79,700 mrn09075.txt
09/08/2005 07:36 PM 80,014 mrn09085.txt
09/09/2005 05:24 PM 80,092 mrn09095.txt
09/12/2005 04:13 PM 78,405 mrn09125.txt
09/14/2005 12:33 AM 80,065 mrn09135.txt
09/14/2005 04:58 PM 80,804 mrn09145.txt
09/15/2005 09:24 PM 80,344 mrn09155.txt
09/16/2005 05:20 PM 80,318 mrn09165.txt
05/01/2003 11:03 AM 1,380,685 mrnapr03.txt
05/01/2004 11:07 AM 1,554,968 mrnapr04.txt
04/30/2005 07:26 PM 1,573,078 mrnapr05.txt
08/30/2003 10:48 AM 1,443,433 mrnaug03.txt
09/01/2004 11:19 AM 1,632,816 mrnaug04.txt
09/01/2005 06:23 PM 1,806,148 mrnaug05.txt
01/01/2004 10:55 AM 1,479,529 mrndec03.txt
01/01/2005 11:31 AM 1,643,217 mrndec04.txt
03/01/2003 06:14 PM 1,285,420 mrnfeb03.txt
02/28/2004 11:01 AM 1,435,698 mrnfeb04.txt
03/01/2005 10:19 AM 1,443,562 mrnfeb05.txt
02/01/2003 06:16 PM 1,405,893 mrnjan03.txt
01/31/2004 10:58 AM 1,466,198 mrnjan04.txt
02/01/2005 06:48 PM 1,475,062 mrnjan05.txt
08/01/2003 10:45 AM 1,533,833 mrnjul03.txt
07/31/2004 11:15 AM 1,577,137 mrnjul04.txt
07/30/2005 06:26 PM 1,599,177 mrnjul05.txt
07/01/2003 10:43 AM 1,464,763 mrnjun03.txt
07/01/2004 11:13 AM 1,586,983 mrnjun04.txt
07/01/2005 09:39 AM 1,645,860 mrnjun05.txt
04/01/2003 03:10 PM 1,405,070 mrnmar03.txt
04/01/2004 11:04 AM 1,719,759 mrnmar04.txt
04/01/2005 07:17 PM 1,650,906 mrnmar05.txt
05/31/2003 08:32 AM 1,409,874 mrnmay03.txt
06/01/2004 11:09 AM 1,492,088 mrnmay04.txt
06/01/2005 10:55 AM 1,575,500 mrnmay05.txt
11/29/2003 10:53 AM 1,328,361 mrnnov03.txt
12/01/2004 11:28 AM 1,570,140 mrnnov04.txt
11/01/2003 03:16 PM 1,576,191 mrnoct03.txt
10/30/2004 11:25 AM 1,558,689 mrnoct04.txt
10/01/2003 10:51 AM 1,479,491 mrnsep03.txt
10/01/2004 11:22 AM 1,599,579 mrnsep04.txt
12/30/2000 11:40 AM 12,909,715 newmrn00.txt
01/01/2002 12:04 PM 15,083,716 newmrn01.txt
01/01/2003 12:33 PM 16,715,817 newmrn02.txt
01/01/1991 09:09 AM 5,815,310 newmrn90.txt
01/01/1992 09:19 AM 6,618,404 newmrn91.txt
01/01/1993 09:31 AM 7,191,765 newmrn92.txt
01/01/1994 09:43 AM 7,731,938 newmrn93.txt
12/31/1994 09:57 AM 8,467,874 newmrn94.txt
12/30/1995 10:11 AM 8,769,054 newmrn95.txt
01/01/1997 10:27 AM 9,489,616 newmrn96.txt
01/01/1998 10:43 AM 10,032,432 newmrn97.txt
01/01/1999 11:00 AM 10,717,629 newmrn98.txt
01/01/2000 11:19 AM 11,632,064 newmrn99.txt
56 File(s) 180,852,625 bytes
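Going only by the names in that listing, the three file types could be told apart with something like the following. The patterns are my guess from the listing (daily looks like 'mrn' plus five digits, monthly like 'mrn' plus a three-letter month and a two-digit year, yearly like 'newmrn' plus a two-digit year), and 'file_kind' is a made-up name.

```ruby
# Hypothetical helper: classify a data file by its name.
# The patterns are assumptions inferred from the directory listing only.
def file_kind(name)
  case name
  when /\Anewmrn\d{2}\.txt\z/      then :yearly   # e.g. newmrn99.txt
  when /\Amrn[a-z]{3}\d{2}\.txt\z/ then :monthly  # e.g. mrnaug05.txt
  when /\Amrn\d{5}\.txt\z/         then :daily    # e.g. mrn09165.txt
  else :unknown
  end
end

%w[mrn09165.txt mrnaug05.txt newmrn99.txt readme.txt].each do |f|
  puts "#{f}: #{file_kind(f)}"
end
```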
Once up to about 120 ticks have been loaded for each contract
that appears in today's daily file, the file processing is done,
and I can start playing with the numbers.
I have a Ruby class called 'Contract' with a singleton
method, 'parseFile(name)', that is written in C++ to parse
the files. I also have a C++ routine called 'getTicks(date=nil,days=1)'
that returns an array of 'days' ticks in order, starting at 'date'
(or today's date if 'date' is nil).
Then I've got a bunch more routines to play with the data
that all call 'getTicks'. Here's a small sample of the routines:
  def getOpen(date=nil, days=1)
    getTicks(date, days).collect {|t| t.open }
  end
  def getHigh(date=nil, days=1)
    getTicks(date, days).collect {|t| t.high }
  end
  def getLow(date=nil, days=1)
    getTicks(date, days).collect {|t| t.low }
  end
  def getClose(date=nil, days=1)
    getTicks(date, days).collect {|t| t.close }
  end
  def getChange(date=nil, days=1)
    close = getClose(date, days+1)
    diff = close.dup
    diff.shift
    diff.each_with_index {|val, i|
      diff[i] = val - close[i]
    }
    diff
  end
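To show how 'getChange' behaves, here is a small self-contained sketch. 'ContractSketch' and the 'Tick' struct are stand-ins invented for the example, and this 'getTicks' is a pure-Ruby stub that ignores 'date' and just returns the last 'days' ticks it was given; the real one is the C++ routine. 'getChange' is written with 'each_cons' here, which should give the same result as the dup/shift version.

```ruby
# Hypothetical stand-in for a tick; the real objects come from the C++ parser.
Tick = Struct.new(:open, :high, :low, :close)

class ContractSketch
  def initialize(ticks)
    @ticks = ticks   # assumed to be in chronological order already
  end

  # Stub for the C++ routine: ignores `date`, returns the last `days` ticks.
  def getTicks(date = nil, days = 1)
    @ticks.last(days)
  end

  def getClose(date = nil, days = 1)
    getTicks(date, days).collect { |t| t.close }
  end

  # Difference between each close and the one before it.
  def getChange(date = nil, days = 1)
    getClose(date, days + 1).each_cons(2).collect { |prev, cur| cur - prev }
  end
end

# Made-up closes of 10, 12, 11 (integers to keep the arithmetic exact).
c = ContractSketch.new([Tick.new(1, 2, 0, 10),
                        Tick.new(1, 2, 0, 12),
                        Tick.new(1, 2, 0, 11)])
p c.getChange(nil, 2)   # => [2, -1]
```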
And that's about all there is to it. Of course, I
could have written all the data processing stuff in C++ too,
but keeping it in Ruby gives me a lot more flexibility in
the things I can do, and now that the data load times are
very tolerable because of C++, I can play around and change
things really quickly.
I hope that helps.
-- Glenn
Robert Klemme wrote:
···
A62005U,050909,0.7726,0.7758,0.7703,0.7737,12366,0
Ok, at least we can generate a large data set with this kind of
information. You talked about repetitions etc. Are there multiple
entries per contract?
Plus, we need to know what exactly you want to do with the data.
Kind regards
robert