I realise I'm doing this in a Perlish way, but my question is: can this
operation be done in Ruby in a time closer to what the Perl version gets?
(Perl takes about 4 seconds; my Ruby code takes about 17 seconds over the
same data set, which is far smaller than the production data set.)
Basically, we have CSV files with a timestamp like 31-DEC-03 23:59:59 as
the first field (rows are always in chronological order), and the task is
to collect into an array, for later processing, just the part of each file
that falls on or after a given date.
The main slow bit seems to be the string concatenation and comparison
(...+$4+$5+$6 >= start_date).
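
To make the comparison concrete, here is a minimal sketch of the key that each line gets turned into and of the "keep everything from the first match onwards" rule. The names sortable_key and lines_from are hypothetical helpers made up for illustration; they appear in neither script below, and this is not meant as a faster version, only as a statement of the intended behaviour.

```ruby
# Month abbreviation => two-digit month number ('01'..'12').
MONTH_NUMBER = {}
%w(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC).each_with_index do |name, i|
  MONTH_NUMBER[name] = format('%02d', i + 1)
end

DATE_REGEX = /^(\d\d)-(\w\w\w)-(\d\d) (\d\d):(\d\d):(\d\d)/

# Hypothetical helper: turn a leading "dd-MON-yy hh:mm:ss" field into a
# "yyyymmddhhmmss" string, so plain string comparison orders by time.
# Returns nil for lines that do not start with a recognisable timestamp.
def sortable_key(line)
  m = DATE_REGEX.match(line)
  return nil unless m
  month = MONTH_NUMBER[m[2]]
  return nil unless month
  century = m[3] >= '87' ? '19' : '20' # two-digit years 87..99 are 19xx
  "#{century}#{m[3]}#{month}#{m[1]}#{m[4]}#{m[5]}#{m[6]}"
end

# Hypothetical helper: keep the first line at or after start_date and
# everything that follows it (the lines are assumed to be in date order).
def lines_from(lines, start_date)
  first = lines.index { |l| (key = sortable_key(l)) && key >= start_date }
  first ? lines[first..-1] : []
end
```

For example, sortable_key("31-DEC-03 23:59:59,x") gives "20031231235959", which compares below a start_date of "20040000000000", so that line would be skipped.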
################################################################
#!perl
$start_date = '20040000000000'; # "yyyymmddhhmmss"
$dir = "data";
@months = qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);
%mm = ();
for ($i = 0; $i < 12; $i++) {
    $mm{$months[$i]} = sprintf('%02d', $i + 1); # JAN => '01' ... DEC => '12'
}
undef @months;
@a = ();
$t = time; # start timing the read
opendir DIR, $dir;
while (defined($_ = readdir DIR)) {
    next if /^\./; # skip dotfiles
    open IN, "$dir/$_";
    while (<IN>) {
        # skip lines that do not start with a timestamp
        next unless /^(\d\d)-(\w\w\w)-(\d\d) (\d\d):(\d\d):(\d\d)/;
        $cc = ($3 ge '87' ? '19' : '20');
        if ("$cc$3$mm{$2}$1$4$5$6" ge $start_date) {
            push @a, $_, <IN>; # this line plus the rest of the file
            last;
        }
    }
    close IN;
}
closedir DIR;
$t = time - $t;
print "Read " . scalar(@a) . " lines in $t seconds$/"; # 4 seconds
$t = time;
open OUT, ">perl.out";
print OUT @a;
close OUT; # flush before stopping the timer
$t = time - $t;
print "Wrote in $t seconds$/"; # 3 seconds
################################################################
#!ruby
mm = Hash.new
i = '00'
%w(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC).each do |mmm|
  mm[mmm] = i = i.succ
end
start_date = '20040000000000' # "yyyymmddhhmmss"
dir = "data"
date_regex = /^(\d\d)-(\w\w\w)-(\d\d) (\d\d):(\d\d):(\d\d)/
a = []
t = Time.new
reading = false
Dir.open(dir).each do |file|
  next if file[0] == ?. # skip dotfiles
  reading = false
  File.foreach(dir + '/' + file) do |line|
    reading ||= (date_regex =~ line &&
      (($3 >= '87' ? '19' : '20') + $3 + mm[$2] + $1 + $4 + $5 + $6 >= start_date))
    a << line if reading
  end
end
t = Time.new - t
puts "Read #{a.size} lines in #{t} seconds" # 17 seconds
t = Time.new
File.open('ruby.out', 'w') do |f|
  f.print a.join
end
t = Time.new - t
puts "Wrote #{a.size} lines in #{t} seconds" # 3 seconds