1.9 significantly slower than 1.8 on Mac

Hmm. Simplifying my test script further, I am not sure that Regexp is the problem at all!
With the each_line block, my script takes more than TWICE as long in 1.9 as in 1.8.
But with the each_line block removed and the Regexp kept, 1.9 is more than 10% FASTER.

So unless there is some internal optimisation that occurs when the block is removed, it looks like each_line is the problem, not Regexp???

require 'benchmark'
include Benchmark

# A single log line with no trailing newline, so each_line yields exactly once.
logfile = "23:59:16 drop 10.14.241.252 >eth2c1 rule: 1015; rule_uid: {6AADF426-0D0C-4C20-A027-06A6DC8C6CE2}; src: 172.25.20.79; dst: 10.14.65.137; proto: tcp; product: VPN-1 & FireWall-1; service: lotus; s_port: 57150;"

bm(12) do |test|
  # Regexp match inside an each_line block
  test.report('WITH each_line:') do
    500000.times do
      logfile.each_line do |line|
        line.match(/src: (.*?);/)
      end
    end
  end
  # Same Regexp match, applied to the string directly
  test.report('WITHOUT each_line:') do
    500000.times do
      logfile.match(/src: (.*?);/)
    end
  end
end

$ ruby logreport3.rb

                       user     system      total        real
WITH each_line:    1.710000   0.000000   1.710000 (  1.717034)
WITHOUT each_line: 1.080000   0.000000   1.080000 (  1.077098)

$ ruby19 logreport3.rb

                       user     system      total        real
WITH each_line:    3.680000   0.000000   3.680000 (  3.680009)
WITHOUT each_line: 0.890000   0.000000   0.890000 (  0.893182)

So unless there is some internal optimisation that occurs when the
block is removed, it looks like each_line is the problem, not Regexp???

Well, here is part of #each_line in 1.8.6:

    for (s = p, p += rslen; p < pend; p++) {
        if (rslen == 0 && *p == '\n') {
            if (*++p != '\n') continue;
            while (*p == '\n') p++;
        }
        
Easy: increment p and test.

The same part in 1.9:

    while (p < pend) {
        int c = rb_enc_codepoint(p, pend, enc);
        int n = rb_enc_codelen(c, enc);

        if (rslen == 0 && c == newline) {
            while (p < pend && rb_enc_codepoint(p, pend, enc) == newline) {
                p += n;
            }
            p -= n;
        }

A little more complex. For every character it has to:

  retrieve the code point
  retrieve its length
  etc.
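To make the difference concrete at the Ruby level, here is a rough sketch that counts newlines in two ways. It is only an analogy for the two C loops above, not the actual implementation: the method names are made up for illustration, and each_byte and each_codepoint stand in for "increment p and test" and "decode a code point, then test".

```ruby
# Byte-wise scan, in the spirit of the 1.8.6 loop:
# look at each byte and compare it with "\n" (byte value 10).
def count_newlines_bytewise(str)
  count = 0
  str.each_byte { |b| count += 1 if b == 10 }
  count
end

# Code-point-wise scan, in the spirit of the 1.9 loop:
# each character is decoded according to the string's encoding first.
def count_newlines_by_codepoint(str)
  count = 0
  str.each_codepoint { |c| count += 1 if c == 10 }
  count
end

sample = "first line\nsecond line\nthird line"
puts count_newlines_bytewise(sample)
puts count_newlines_by_codepoint(sample)
```

Both methods give the same count for ASCII and UTF-8 text; the second one simply does more work per character to get there.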

Guy Decoux

Derek Chesterfield wrote:

Hmm. Simplifying my test script further, I am not sure that Regexp is the problem at all!
With the each_line block, my script takes more than TWICE as long in 1.9 as in 1.8.
But with the each_line block removed and the Regexp kept, 1.9 is more than 10% FASTER.

Oops, it seems you're right. Just split the original logfile into lines up front and use each instead of each_line, and it gets a whole lot faster (rb_str_each_line is encoding-aware). Anyway, that doesn't change the fact that Oniguruma could be optimized here as well.
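Something along these lines, as a minimal sketch of the workaround rather than a tuned benchmark. The log line is a shortened copy of the one from the original script, and the method names, the three-line sample and the iteration count are all made up for illustration:

```ruby
require 'benchmark'

# Shortened copy of the log line from the original script.
LOG_LINE = '23:59:16 drop 10.14.241.252 >eth2c1 rule: 1015; src: 172.25.20.79; dst: 10.14.65.137; proto: tcp;'
SRC = /src: (.*?);/

# Walk the text with each_line, which scans for line ends on every call.
def sources_via_each_line(text)
  found = []
  text.each_line { |line| m = SRC.match(line); found << m[1] if m }
  found
end

# Walk an Array of lines that was split beforehand.
def sources_via_each(lines)
  found = []
  lines.each { |line| m = SRC.match(line); found << m[1] if m }
  found
end

logfile = ([LOG_LINE] * 3).join("\n")
lines   = logfile.split("\n")   # pay the line-splitting cost once

n = 20_000                      # arbitrary iteration count
Benchmark.bm(19) do |test|
  test.report('each_line:')         { n.times { sources_via_each_line(logfile) } }
  test.report('split once + each:') { n.times { sources_via_each(lines) } }
end
```

The point of the design is only that the line splitting happens once, outside the loop; how much that helps will depend on the Ruby version and the input.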

lopex