Curious regexp behavior

Derek_Lewis · 16 February 2005 01:16

On a whim, I just decided to try an experiment with regexps, to see how
they perform in two slightly different cases. I wanted to see how using
a single regexp object for many many evaluations performed compared to
using the regexp within the loop.

The scripts I wrote searched through a words file that is 234937 lines
long.

Here are the scripts I wrote, to clarify:
First one:

total = 0
File.open( 'words', 'r' ) { |file|
  file.each_line { |line|
    word = line.chomp
    total += 1 if word =~ /[a-df-h][aeiou]{2}/
  }
}
puts total

Second one:

rexp = /[a-df-h][aeiou]{2}/
total = 0
File.open( 'words', 'r' ) { |file|
  file.each_line { |line|
    word = line.chomp
    total += 1 if word =~ rexp
  }
}
puts total

I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistent.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that?
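
For anyone who wants to repeat the comparison with less eyeballing, here is a minimal sketch using Ruby's standard Benchmark module. The in-memory word list is only a stand-in for the words file, and WORDS, count_inline and count_variable are names made up for this example, not part of the original scripts. No particular timing is implied; which variant wins may well depend on the Ruby version.

```ruby
require 'benchmark'

# Stand-in for the 'words' file: a small sample repeated many times.
# (Hypothetical data; with a real file, File.readlines('words').map(&:chomp)
# could be used instead.)
WORDS = (%w[beauty cloud zebra faint gauge hello dream] * 20_000).freeze

# Variant 1: the regexp literal is written inline inside the loop.
def count_inline(words)
  total = 0
  words.each { |word| total += 1 if word =~ /[a-df-h][aeiou]{2}/ }
  total
end

# Variant 2: the regexp is stored in a local variable before the loop.
def count_variable(words)
  rexp = /[a-df-h][aeiou]{2}/
  total = 0
  words.each { |word| total += 1 if word =~ rexp }
  total
end

# Time both variants over the same data.
Benchmark.bm(10) do |bm|
  bm.report('inline:')   { count_inline(WORDS) }
  bm.report('variable:') { count_variable(WORDS) }
end
```

Both methods should return the same count; only the time taken is of interest.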

--
Derek Lewis

===================================================================
Java Web-Application Developer

      Email : email@lewisd.com
      Cellular : 778.898.5825
      Website : http://www.lewisd.com

"If you've got a 5000-line JSP page that has "all in one" support
for three input forms and four follow-up screens, all controlled
by "if" statements in scriptlets, well ... please don't show it
to me :-). Its almost dinner time, and I don't want to lose my
appetite :-)."
- Craig R. McClanahan

Charles_Mills1 · 16 February 2005 02:39

Derek Lewis wrote:

On a whim, I just decided to try an experiment with regexps, to see how
they perform in two slightly different cases. I wanted to see how using
a single regexp object for many many evaluations performed compared to
using the regexp within the loop.

The scripts I wrote searched through a words file that is 234937 lines
long.

Here are the scripts I wrote, to clarify:
First one:

total = 0
File.open( 'words', 'r' ) { |file|
  file.each_line { |line|
    word = line.chomp
    total += 1 if word =~ /[a-df-h][aeiou]{2}/
  }
}
puts total

Second one:

rexp = /[a-df-h][aeiou]{2}/
total = 0
File.open( 'words', 'r' ) { |file|
  file.each_line { |line|
    word = line.chomp
    total += 1 if word =~ rexp
  }
}
puts total

I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistent.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that?

I'll wager a guess. In the first version Ruby knows that
'/[a-df-h][aeiou]{2}/' is a regexp. In the second one Ruby doesn't
know whether 'rexp' is a variable or a method, so it has to do one,
maybe two, lookups on every iteration before it dispatches String#=~.
Also, regexps are immutable, so Ruby doesn't allocate a new regexp on
every iteration, and storing the regexp in a variable has no effect in
that regard.
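
The allocation point can be checked directly. A minimal sketch, assuming MRI: a regexp literal with no interpolation evaluates to the same object on every pass through the loop, whereas a literal containing #{} interpolation is rebuilt each time (unless the /o flag is given). The method names literal_ids and interpolated_ids are made up for this example.

```ruby
# Evaluate a plain regexp literal n times and return the distinct object ids.
# The objects are kept in an array first so that ids cannot be reused.
def literal_ids(n)
  (1..n).map { /[a-df-h][aeiou]{2}/ }.map(&:object_id).uniq
end

# Same, but the literal contains interpolation, so it is built at run time.
def interpolated_ids(n)
  vowels = 'aeiou'
  (1..n).map { /[a-df-h][#{vowels}]{2}/ }.map(&:object_id).uniq
end

puts literal_ids(5).size       # number of distinct objects for the plain literal
puts interpolated_ids(5).size  # number of distinct objects with interpolation
```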

-Charlie

Eric_Hodel1 · 16 February 2005 05:29

First one:

total = 0
File.open( 'words', 'r' ) { |file|
   file.each_line { |line|
    word = line.chomp
    total +=1 if word =~ /[a-df-h][aeiou]{2}/

^^^^ inline regexp (part of the AST)

  }
}
puts total

Second one:

rexp = /[a-df-h][aeiou]{2}/
total = 0
File.open( 'words', 'r' ) { |file|
   file.each_line { |line|
    word = line.chomp
    total +=1 if word =~ rexp

^^^^ variable lookup

  }
}
puts total

I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistent.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that?

An inline regexp is much faster than looking up a variable and then calling methods on the Regexp object.

···

On 15 Feb 2005, at 17:16, Derek Lewis wrote:

--
Eric Hodel - drbrain@segment7.net - http://segment7.net
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04

Ryan_Davis1 · 16 February 2005 08:52

Use ParseTree and you can see why!!!

<576> echo "a=/blah/; 's' =~ a" | parse_tree_show -f
(cut for readability)
      [:lasgn, :a, [:lit, /blah/]],
      [:call, [:str, "s"], :=~, [:array, [:lvar, :a]]]]]]]]
<577> echo "'s' =~ /blah/" | parse_tree_show -f
(cut for readability)
      [:match3, [:lit, /blah/], [:str, "s"]]]]]]]

Basically, the inline regex avoids the lvar lookup and the call and shoots straight into a match3 node. The lvar is probably not _that_ expensive, but method dispatch is not terribly cheap.

···

On Feb 15, 2005, at 5:16 PM, Derek Lewis wrote:

I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistent.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that?

--
ryand-ruby@zenspider.com - http://blog.zenspider.com/
http://rubyforge.org/projects/ruby2c/
http://rubyforge.org/projects/parsetree/
Seattle.rb | Home

Robert · 16 February 2005 09:14

"Derek Lewis" <lewisd@f00f.net> wrote in message
news:20050216012200.GP23232@f00f.net...

On a whim, I just decided to try an experiment with regexps, to see how
they perform in two slightly different cases. I wanted to see how using
a single regexp object for many many evaluations performed compared to
using the regexp within the loop.

The scripts I wrote searched through a words file that is 234937 lines
long.

Here are the scripts I wrote, to clarify:
First one:

total = 0
File.open( 'words', 'r' ) { |file|
  file.each_line { |line|
    word = line.chomp
    total += 1 if word =~ /[a-df-h][aeiou]{2}/
  }
}
puts total

Second one:

rexp = /[a-df-h][aeiou]{2}/
total = 0
File.open( 'words', 'r' ) { |file|
  file.each_line { |line|
    word = line.chomp
    total += 1 if word =~ rexp
  }
}
puts total

I expected the second one to be slightly faster, but was surprised to
see that it was actually slightly slower. I ran each one about 10-15
times, and eyeballed an average. The results from each run after the
first were pretty consistent.

It's just a curiosity, but does anyone know what might cause them to be
'backwards' like that?

Did you try the same with the matching reversed, i.e., "rexp =~ word"
instead of "word =~ rexp"? Did it make a difference?
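
For reference, the two orderings give the same result; they differ only in which method receives the call (String#=~ versus Regexp#=~). A small sketch, with a made-up word list (SAMPLE) and made-up helper names:

```ruby
SAMPLE = %w[beauty cloud faint hello].freeze

# word =~ rexp: the call goes to String#=~, which hands off to the regexp.
def count_string_first(words, rexp)
  words.count { |word| word =~ rexp }
end

# rexp =~ word: the call goes to Regexp#=~ directly.
def count_regexp_first(words, rexp)
  words.count { |word| rexp =~ word }
end

rexp = /[a-df-h][aeiou]{2}/
puts count_string_first(SAMPLE, rexp)
puts count_regexp_first(SAMPLE, rexp)
```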

Kind regards

robert

William_Morgan · 16 February 2005 13:48

Excerpts from Ryan Davis's mail of 16 Feb 2005 (EST):

Use ParseTree and you can see why!!!

<576> echo "a=/blah/; 's' =~ a" | parse_tree_show -f
(cut for readability)
     [:lasgn, :a, [:lit, /blah/]],
     [:call, [:str, "s"], :=~, [:array, [:lvar, :a]]]]]]]]
<577> echo "'s' =~ /blah/" | parse_tree_show -f
(cut for readability)
     [:match3, [:lit, /blah/], [:str, "s"]]]]]]]

Very nice answer.

Like the original poster, I found the behavior counterintuitive. Perhaps
this is because our assumptions come from the C model of the universe,
where using more local variables is typically faster, and method dispatch
is not a problem.

I wonder what the merits of collecting equivalences like these to form
some kind of post-hoc parse-tree optimization would be. Probably not
great, but it might be fun.

--
William <wmorgan-ruby-talk@masanjin.net>

Derek_Lewis · 16 February 2005 17:35

I did, actually, and it was very slightly faster. Still slower than an
inline regexp, however.

Thanks for the insightful answers, everyone. It's quite interesting to
find out how your favorite programming language works inside.

···

On Wed, Feb 16, 2005 at 06:14:52PM +0900, Robert Klemme wrote:

Did you try the same with the matching reversed, i.e., "rexp =~ word"
instead of "word =~ rexp"? Did it make a difference?

Kind regards

robert

--
Derek Lewis
