Regex speed

i am noticing ruby is much slower than perl when matching patterns with
two or more quantifier (+ or *). consider this string:

$ ruby -0 -ne ‘print “A”*3,"\n"*10000,“B”*10,"\n"*10000’ > str

and matching this string:

$ perl -0 -ne’print 1 if /A.+\n+B/s’ str # 0m1.682s
$ ruby -0 -ne’print 1 if /A.+\n+B/m’ str # 0m12.468s
$ perl -0 -ne’print 1 if /A.+\n.+\nB/s’ str # 0m1.368s
$ ruby -0 -ne’print 1 if /A.+\n.+\nB/m’ str # 0m11.427s

php (pcre) speed is somewhere in between.

of course, with patterns like /A.+\n+.+B/ or those with more quantifiers
both become very slow, but i suspect perl is still faster then ruby
especially when there are fewer “.+” and more “SINGLECHAR+” subpatterns.
i even have one case where perl takes less than a second and ruby
takes more than five minutes (i don’t know for sure, i interrupted the
process so ruby never finished).

note that i’m not implying anything though, because i understand perl’s
regex engine has undergone a long period of tweaking and optimization.

···


dave

ps: using ruby 1.6.7 vs perl 5.8.0 on linux

So, I’m using ruby 1.6.7 and perl 5.6.0 on mac os x and I found
something strange. Perl flies until you add parenthesis. Then they
roughly tie in performance.

<531> time perl -0 -ne 'print 1 if /A.+\n+B/s' str1     # real    

0m01.568s
<532> time perl -0 -ne ‘print 1 if /A(.+)(\n+)B/s’ str # real
0m10.763s
<534> time ruby -0 -ne ‘print 1 if /A.+\n+B/m’ str # real
0m07.856s
<535> time ruby -0 -ne ‘print 1 if /A(.+)(\n+)B/m’ str # real
0m10.532s

The reason why I was doing this was to see that they were indeed
matching on the same results and not performing different amounts of
backtracking and whatnot (see below). They are indeed the same. It
looks like perl has some extra enhancements in place for when there are
no parenthesis.

Many times in regexen like these, you don’t need or want greedy
matches. Non-greedy is much much faster in many cases

<537> time ruby -0 -ne'puts "#{$1.length}, #{$2.length}" if 

/A(.+?)(\n+?)B/m’ str
2, 10000

real    0m0.119s
user    0m0.050s
sys     0m0.020s
<538> time perl -0 -ne  'if (/A(.+?)(\n+?)B/s) { $a = length($1); $b = 

length($2); print “$a, $b\n”}’ str
2, 10000

real    0m0.020s
user    0m0.000s
sys     0m0.010s

Perl still beats us, but at .1 seconds, who cares? :slight_smile:

···

On Sunday, October 6, 2002, at 08:45 PM, David Garamond wrote:

$ perl -0 -ne’print 1 if /A.+\n+B/s’ str # 0m1.682s
$ ruby -0 -ne’print 1 if /A.+\n+B/m’ str # 0m12.468s
$ perl -0 -ne’print 1 if /A.+\n.+\nB/s’ str # 0m1.368s
$ ruby -0 -ne’print 1 if /A.+\n.+\nB/m’ str # 0m11.427s
ps: using ruby 1.6.7 vs perl 5.8.0 on linux

Here’s my take:

RH7.3 box, Perl 5.6.1, Ruby 1.7.3 (2002-09-13)

$ perl -0 -ne’print 1 if /A.+\n+B/s’ str # 0m0.517s
$ ruby -0 -ne’print 1 if /A.+\n+B/m’ str # 0m5.715s
$ perl -0 -ne’print 1 if /A.+\n.+\nB/s’ str # 0m5.854s
$ ruby -0 -ne’print 1 if /A.+\n.+\nB/m’ str # 0m3.574s

Heh. So Ruby 1.7.3 overtakes Perl for complex regexes??!! :wink:

This is a Xeon 1.8GHz/512MB box.

David Garamond wrote:

···

i am noticing ruby is much slower than perl when matching patterns with
two or more quantifier (+ or *). consider this string:

$ ruby -0 -ne ‘print “A”*3,“\n”*10000,“B”*10,“\n”*10000’ > str

and matching this string:

$ perl -0 -ne’print 1 if /A.+\n+B/s’ str # 0m1.682s
$ ruby -0 -ne’print 1 if /A.+\n+B/m’ str # 0m12.468s
$ perl -0 -ne’print 1 if /A.+\n.+\nB/s’ str # 0m1.368s
$ ruby -0 -ne’print 1 if /A.+\n.+\nB/m’ str # 0m11.427s

php (pcre) speed is somewhere in between.

of course, with patterns like /A.+\n+.+B/ or those with more quantifiers
both become very slow, but i suspect perl is still faster then ruby
especially when there are fewer “.+” and more “SINGLECHAR+” subpatterns.
i even have one case where perl takes less than a second and ruby takes
more than five minutes (i don’t know for sure, i interrupted the process
so ruby never finished).

note that i’m not implying anything though, because i understand perl’s
regex engine has undergone a long period of tweaking and optimization.


Wai-Sun “Squidster” Chia waisun.chia@ac-ac.org
Sr. OpenSource Consultant, M.Sc., RHCE
Advanced Computing & Application Center
Universiti Teknologi Malaysia

i am noticing ruby is much slower than perl when matching patterns
with two or more quantifier (+ or *). consider this string:

$ ruby -0 -ne ‘print “A”*3,“\n”*10000,“B”*10,“\n”*10000’ > str

Um, this hangs on my machine. Fine when I execute it from a file
though.

and matching this string:

$ perl -0 -ne’print 1 if /A.+\n+B/s’ str # 0m1.682s

5.005: Real: 0:01.82 CPU: 99.4% (0.007/1.816) Mem: 712
5.6.1: Real: 0:01.85 CPU: 100.0% (0.000/1.850) Mem: 642

$ ruby -0 -ne’print 1 if /A.+\n+B/m’ str # 0m12.468s

Normal: Real: 0:15.35 CPU: 99.6% (0.023/15.272) Mem: 1298
Oni Guruma: Real: 0:15.54 CPU: 99.6% (0.007/15.485) Mem: 1172

$ perl -0 -ne’print 1 if /A.+\n.+\nB/s’ str # 0m1.368s

5.005: Real: 0:12.75 CPU: 99.2% (0.023/12.645) Mem: 642
5.6.1: Real: 0:13.46 CPU: 99.0% (0.023/13.313) Mem: 712

$ ruby -0 -ne’print 1 if /A.+\n.+\nB/m’ str # 0m11.427s

Normal: Real: 0:15.74 CPU: 99.6% (0.039/15.651) Mem: 1298
Oni Guruma: Real: 0:09.83 CPU: 99.3% (0.015/9.760) Mem: 1172

-% perl -v
This is perl, v5.6.1 built for i386-freebsd

-% /usr/bin/perl5.00503 -v
This is perl, version 5.005_03 built for i386-freebsd

-% ruby -v
ruby 1.6.7 (2002-09-12) [i386-freebsd4]

Interesting that perl 5.005 is actually faster than 5.6.1 on these
matches. I’d also imagine you’re using 5.8 given how quick your version
of perl performed the second match.

i even have one case where perl takes less than a second and ruby
takes more than five minutes (i don’t know for sure, i interrupted the
process so ruby never finished).

Yes; regexp engines can encounter cases where they take a very long time
(easily measured in days) to perform a match. Perl has gone to great
lengths to avoid these situations, but they’re still there.

note that i’m not implying anything though, because i understand
perl’s regex engine has undergone a long period of tweaking and
optimization.

And ruby’s has undergone a rewrite :slight_smile:

···


Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

Maj. Bloodnok: Seagoon, you’re a coward!
Seagoon: Only in the holiday season.
Maj. Bloodnok: Ah, another Noel Coward!

Ryan Davis wrote:

Many times in regexen like these, you don’t need or want greedy matches.
Non-greedy is much much faster in many cases

<537> time ruby -0 -ne'puts "#{$1.length}, #{$2.length}" if 

/A(.+?)(\n+?)B/m’ str
2, 10000

thanks for the tip (strange, i had always thought/heard that non-greedy
matches are slow… but hey, what do i know?)

i had also fixed my regex in my previous case by avoiding .+ and using
[^\015\012]+. ruby now processes my data in less than a second too :slight_smile:

···


dave

Thomas Hurst wrote:

$ ruby -0 -ne ‘print “A”*3,“\n”*10000,“B”*10,“\n”*10000’ > str
Um, this hangs on my machine. Fine when I execute it from a file
though.

sorry, the code was obviously miscopy-pasted. it should have been ‘ruby
-e’, not ‘ruby -ne’.

Interesting that perl 5.005 is actually faster than 5.6.1 on these
matches. I’d also imagine you’re using 5.8 given how quick your version
of perl performed the second match.

based on my experience, perl 5.8.0 is generally [pretty significantly]
slower than 5.6 at handling text, say like 10-40% slower. probably
because the unicode handling stuff. unfortunately i don’t have access to
a machine with 5.6.x right now. every single of them has been upgraded
to 5.8.0.

···


dave