Ellie,
Eleanor McHugh wrote:
A few things:
- you left a line in the loop:
File.open( output_filename, 'w' ) do |fout|
which should be deleted
Paste in haste, repent at leisure
I've corrected it to read the
way it appeared in my head when I was looking at it:
http://pastie.org/222765
I originally used:
stats = File.readlines(input_filename)
but found that reading the whole file (8871 lines) and then
processing the array was inefficient, so I got rid of the array, using:
stats << stats06
If you buffer it as a single read and then work through the file in memory, you guarantee that you minimise the IO costs of reading. I am, of course, assuming that even at 8871 lines your file is much smaller than your available RAM.
When I did the profile, the array processing was the biggest hit - when I got rid of the array, I almost halved the time! Ruby arrays are pretty cool but I think you pay for the convenience . .
and the file writing output of:
File.open(output_filename, "a") do |file|
  file.sync = false
  file.puts *stats
  file.fsync
end
looks interesting - why should that be faster?
Doing the file write this way offloads making it efficient to the
Ruby runtime. The file.fsync call will cost you in terms of runtime
performance, but it ensures that the data is flushed to disk before
moving on to the next file, which for a large data-processing job is
often desirable.
See my other note but it didn't make much difference.
Personally I wouldn't store the results in separate
files but combine them into a single file (possibly even a database),
however I don't know how that would fit with your use case.
There is more post-processing using R, and for casual inspection it is convenient to be able to find data according to its file name. It might still be possible to have fewer, larger files - I might ask another question about that (basically I have to paste the single-column output of this stuff into 32-column arrays). I have tried DBs for storing output from the main simulation program when it was all in C/C++ and it was quite slow, so I went back to text files . .
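For what it's worth, combining many single-column output files into one multi-column CSV (which R can then read in a single pass) could be sketched like this - a hypothetical helper, with the file naming and CSV layout purely assumed:

```ruby
require "csv"

# Merge several single-column text files into one CSV, one source
# file per column, using the file names as the header row.
def combine_columns(column_paths, out_path)
  columns = column_paths.map { |p| File.readlines(p, chomp: true) }
  CSV.open(out_path, "w") do |csv|
    # Header row taken from the source file names.
    csv << column_paths.map { |p| File.basename(p, ".*") }
    columns.first.each_index do |i|
      csv << columns.map { |col| col[i] }
    end
  end
end
```

Whether this fits the casual-inspection workflow is another question, but it would cut the file count dramatically.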
As to the file.puts *stats, there's no guarantee this approach will
be efficient but compared to doing something like:

File.open(output_filename, "a") do |file|
  stats.each { |stat| file.puts stat }
end

it feels more natural to the problem domain.
Yes, it was good to find out about this alternative.
Another alternative would be:
File.open(output_filename, "a") do |file|
  file.puts stats.join("\n")
end
but that's likely to use more memory as first an in-memory string
will be created, then this will be passed to Ruby's IO code. For the
size of file you're working with, that's not likely to be a problem.
I've a suspicion that your overall algorithm can also be greatly improved.
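The three write styles mentioned above can be timed against each other with Ruby's Benchmark library - a rough sketch, with relative results that will vary by platform, Ruby version, and file size:

```ruby
require "benchmark"
require "tempfile"

# The three write strategies from the thread, wrapped as methods.
def write_splat(path, stats)
  File.open(path, "a") { |f| f.puts(*stats) }
end

def write_each(path, stats)
  File.open(path, "a") { |f| stats.each { |s| f.puts s } }
end

def write_join(path, stats)
  # Builds one large string first, so peak memory is roughly doubled.
  File.open(path, "a") { |f| f.puts stats.join("\n") }
end

stats = Array.new(10_000) { |i| "stat #{i}" }

Benchmark.bm(12) do |bm|
  bm.report("puts *stats") { Tempfile.create("s") { |f| write_splat(f.path, stats) } }
  bm.report("each + puts") { Tempfile.create("e") { |f| write_each(f.path, stats) } }
  bm.report("join")        { Tempfile.create("j") { |f| write_join(f.path, stats) } }
end
```

All three produce identical output; the differences are in how many calls cross into the IO layer and how much intermediate memory gets built up.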
I'm sure you are right about that!
In particular the fact that you're forming a cubic array and then
manipulating it raises warning bells and suggests you'll have data sparsity issues which could be handled in a different way, but that would require a deeper understanding of your data.
The cubic array was just a direct translation of the C pointer setup I had - basically it is a rectangular grid of sub-populations, each with an array of allele lengths.
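If the grid turns out to be sparsely occupied, one alternative along the lines Eleanor suggests would be a Hash keyed on grid coordinates - a speculative sketch, with the class and method names invented here:

```ruby
# Store allele lengths only for sub-populations that actually exist,
# instead of allocating a dense rows x cols x alleles array.
class SparseGrid
  def initialize
    @cells = {}  # [row, col] => array of allele lengths
  end

  def add(row, col, allele_length)
    (@cells[[row, col]] ||= []) << allele_length
  end

  def alleles(row, col)
    # fetch with a default avoids creating empty cells on lookup.
    @cells.fetch([row, col], [])
  end

  def occupied
    @cells.size
  end
end
```

Only occupied cells consume memory and lookup stays O(1); for a densely occupied grid, the original array layout is probably still the better fit.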
Thanks again,
Regards,
Phil.
On 26 Jun 2008, at 20:47, Philip Rhoades wrote:
--
Philip Rhoades
Pricom Pty Limited (ACN 003 252 275 ABN 91 003 252 275)
GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil@pricom.com.au