Slow ruby regexes

Hello, I've been reading this article, which has a few benchmarks
(including a Ruby-related one) and many theoretical explanations of regex
matching algorithms:

http://swtch.com/~rsc/regexp/regexp1.html

I became a little worried, since I'm making heavy use of regexes in a
program of mine that I'll shortly have to run on each of thousands of
text files.

* Does anyone know if what is said applies to Ruby 1.8.6?
* If so, is it going to change?

Thank you!

···

--
Posted via http://www.ruby-forum.com/.

Ruby's regexp implementation is a backtracking matcher similar to Perl's -- while the article's criticism is valid, you should be fine if you write your regular expressions to minimize the amount of backtracking required.

(The author's point is that such care is unnecessary with an NFA-based regexp implementation.)
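
A minimal sketch of that kind of care: replacing a bare dot-star (which must repeatedly give characters back) with an explicit character class avoids most of the backtracking, and is often more correct as well:

line = 'name = "value" # trailing comment with a "quoted" word'

loose = /"(.*)"/     # greedy .* runs to the end, then backtracks to the last quote
tight = /"([^"]*)"/  # [^"]* can never overshoot, so there is nothing to retry

p line[loose, 1]  # => "value\" # trailing comment with a \"quoted"  (overshoots)
p line[tight, 1]  # => "value"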

-mental

···

On Wed, 11 Apr 2007 02:12:09 +0900, Emmanuel <emmanuel@lijasdoblea.com> wrote:

I became a little worried, since I'm making heavy use of regexes in a
program of mine that I'll shortly have to run on each of thousands of
text files.

Read the Wikipedia article on regexes. It explains better than I can why
one approach is used over the other (more power).

For example, a regex was shown on this list lately that tells you whether
a number is prime. That one needs backtracking; a sketch follows below.
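
A minimal sketch of such a regex, assuming it is the well-known unary prime test (a classic originally written in Perl), which hinges on a backreference, a non-regular feature that forces a backtracking implementation:

# We write n as a string of n ones; the backreference \1 hunts for a
# repeated factor, so a composite length matches and a prime one does not.
def prime?(n)
  ("1" * n) !~ /\A1?\z|\A(11+?)\1+\z/
end

(2..15).each { |n| puts "#{n} prime? #{prime?(n)}" }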

Aur

···

On 4/10/07, Emmanuel <emmanuel@lijasdoblea.com> wrote:

Hello, I've been reading this article, which has a few benchmarks
(including a Ruby-related one) and many theoretical explanations of regex
matching algorithms:

Regular Expression Matching Can Be Simple And Fast

I became a little worried, since I'm making heavy use of regexes in a
program of mine that I'll shortly have to run on each of thousands of
text files.

* Does anyone know if what is said applies to Ruby 1.8.6?
* If so, is it going to change?

Thank you!

--
Posted via http://www.ruby-forum.com/.

Emmanuel <emmanuel@lijasdoblea.com> writes:

Regular Expression Matching Can Be Simple And Fast

I became a little worried, since I'm making heavy use of regexes in a
program of mine that I'll shortly have to run on each of thousands of
text files.

I don't know about proposed plans for Ruby's regexp engine, but I would
say not to be overly concerned at this stage.

From reading that article, one thing I noted was that even the author
acknowledges that the performance of regular expressions is significantly
affected by how the regexps are written. If you keep the basic concepts in mind
and construct your regexps accordingly, you will likely see performance
differences that outweigh the differences between the two approaches outlined
in the article - or, putting it another way, a poorly specified RE will perform
badly regardless of the algorithm used. What is important is to do things like
anchoring, being as precise in your specification as possible, and taking
advantage of any knowledge you have about the data you are processing. (A
sketch follows below.)
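
A minimal sketch of the difference precise specification makes, assuming a backtracking engine without memoization (like the 1.8-era engine discussed in this thread; newer Rubies optimize some of these cases away):

require 'benchmark'

input = "a" * 24 + "!"   # a near-miss input: the worst case for backtracking

sloppy  = /^(a+)+$/      # nested quantifiers invite exponential backtracking
precise = /\Aa+\z/       # anchored, single quantifier: nothing to retry

Benchmark.bm(8) do |bm|
  bm.report("sloppy")  { input =~ sloppy }   # can take seconds on old engines
  bm.report("precise") { input =~ precise }  # effectively instant
end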

I've not done large (i.e. gigabytes of data) processing with REs under Ruby, but
I have done so with Perl, and the performance was quite acceptable.

There is no point worrying about optimisation until you know there is a
performance issue. For all you know, using the ruby RE engine for your task may
fall well within your performance requirements.

The other way to look at this is to consider what you would use as an
alternative. I've also used a lot of Tcl, and according to that article, Tcl
uses the faster algorithm, yet I've never really noticed any great performance
difference between Perl and Tcl. So, your choice is to continue and deal with a
problem if/when it occurs, or to jump now and start writing your program in
Tcl or awk, or using grep (maybe even calling grep from within Ruby, though I
suspect any performance gains would be lost in passing the data between Ruby
and the grep process).

I've seen many people make a technology choice because they read somewhere
that x is faster than y. Often, I've then seen something created which is
flaky, takes 10x as long to develop, or simply doesn't work, when in reality
the aspect they were concerned about wasn't even relevant to the situation they
were working in. Recently, I had an argument with one of our sysadmins who
wanted to use ReiserFS rather than ext3 as the file system on a new server. His
argument was that ReiserFS had better performance characteristics and the
system would perform better. My argument was that file system performance was
not a significant bottleneck for the server and we would be better off sticking
with a file system that had a better track record, more robust tools, and that
more sysadmins were familiar with. I lost the argument, initially. The server
was configured with ReiserFS and after a few months, we came in one morning to
find massive disk corruption problems. The server was changed to ext3 and has
not missed a beat since. More to the point, the performance using ext3 is still
well within acceptable performance metrics. My point isn't that ReiserFS may
not be a good file system - it probably is and possibly is even "better" than
ext3. My point is that speed is not the only issue to consider.

Something which the article doesn't address is complexity and correctness. The
algorithm used by Perl et al. may not be fast compared to the alternative, but
it is relatively simple. Just as important as performance is correctness: an
extremely fast RE engine is not a better solution if it is only correct 95% of
the time, or is so complex that it is difficult to maintain without bugs
creeping in after version updates.

Tim

···

--
tcross (at) rapttech dot com dot au

It's unfortunate, though, that an NFA-based approach isn't selectively used for those expressions which don't use features like backreferences (i.e. the majority of regular expressions people write). The performance is so much better it's not funny.
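
A minimal sketch of that selective strategy; the backreference check is crude, and thompson_nfa_match is a hypothetical NFA-based matcher, not a real Ruby API:

# Dispatch "pure" patterns to a fast NFA engine; fall back to the real
# backtracking engine only when the pattern uses backreferences.
def backreferences?(pattern)
  # Crude check: \1..\9 anywhere in the source (ignores escaping subtleties).
  pattern.source =~ /\\[1-9]/ ? true : false
end

def match_best(pattern, text)
  if backreferences?(pattern)
    pattern.match(text)                # the real backtracking engine
  else
    thompson_nfa_match(pattern, text)  # hypothetical NFA-based matcher
  end
end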

-mental

···

On Wed, 11 Apr 2007 02:59:29 +0900, SonOfLilit <sonoflilit@gmail.com> wrote:

Read the Wikipedia article on regexes. It explains better than I can why
one approach is used over the other (more power).

There is no point worrying about optimisation until you know there is a
performance issue. For all you know, using the ruby RE engine for your
task may fall well within your performance requirements.

This is very, very true -- don't optimize until you measure. For the sake of the original poster, I want to underscore this point. Chances are the Ruby regexp engine is way more than adequate for his purposes.

The algorithm used by Perl et al. may not be fast compared to the
alternative, but it is relatively simple. Just as important as performance is correctness.

Perl's implementation is not very simple, however...

-mental

···

On Thu, 12 Apr 2007 19:55:09 +0900, Tim X <timx@nospam.dev.null> wrote:

> Read the Wikipedia article on regexes. It explains better than I can why
> one approach is used over the other (more power).

It's unfortunate, though, that an NFA-based approach isn't selectively used for those expressions which don't use features like backreferences (i.e. the majority of regular expressions people write). The performance is so much better it's not funny.

Well, the good news is that Ruby 2 regexes will perform much better;
I do not know if they will be NFA-based, though.
The bad news is that the power of regexes will just not let you escape
bad design all the time, so knowing a lot about regexes is necessary
to use them well.

But you are right, of course, that Ruby's current regex implementation
is not state of the art; it is nevertheless very useful.

Cheers
Robert

···

On 4/10/07, MenTaLguY <mental@rydia.net> wrote:

On Wed, 11 Apr 2007 02:59:29 +0900, SonOfLilit <sonoflilit@gmail.com> wrote:
-mental

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

MenTaLguY <mental@rydia.net> writes:

There is no point worrying about optimisation until you know there is a
performance issue. For all you know, using the ruby RE engine for your
task may fall well within your performance requirements.

This is very, very true -- don't optimize until you measure. For the sake of the original poster, I want to underscore this point. Chances are the Ruby regexp engine is way more than adequate for his purposes.

The algorithm used by Perl et al. may not be fast compared to the
alternative, but it is relatively simple. Just as important as performance is correctness.

Perl's implementation is not very simple, however...

True, but I think that's due to all the additional add-ons. The basic underlying
algorithm is/was fairly straightforward. However, I think even the Perl
developers have realised that Perl's RE support was getting out of hand.
Although I've not looked at it, I believe there are significant changes in Perl
6 in this area. I seem to remember reading something from Larry, where he said
that Perl's functionality in this area has extended to the point that, in
reality, you probably couldn't call it regexps anymore, but as that was the
terminology people were used to, that's what it would be known as. Perl has gone
past what was traditionally defined as regexps.

Tim

···

On Thu, 12 Apr 2007 19:55:09 +0900, Tim X <timx@nospam.dev.null> wrote:

--
tcross (at) rapttech dot com dot au

The bad news is that the power of regexes will just not let you escape
bad design all the time, so knowing a lot about regexes is necessary
to use them well.

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.

But you are right, of course, that Ruby's current regex implementation
is not state of the art; it is nevertheless very useful.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

-mental

···

On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

True, but I think that's due to all the additional add-ons. The basic
underlying algorithm is/was fairly straightforward.

I was thinking in terms of the optimizations applied to make the backtracking algorithm faster, though that's a factor too...

However, I think even the Perl developers have realised that Perl's RE support
was getting out of hand.
Although I've not looked at it, I believe there are significant changes in
Perl 6 in this area. I seem to remember reading something from Larry, where he
said that Perl's functionality in this area has extended to the point that, in
reality, you probably couldn't call it regexps anymore, but as that was
the terminology people were used to, that's what it would be known as. Perl has
gone past what was traditionally defined as regexps.

The Perl 6 solution was to promote grammars and rules into first-class abstractions in the language, on a level with classes and methods. So it's basically a general-purpose parsing framework now. I'm not sure if that counts as less out-of-hand or not...

-mental

···

On Fri, 13 Apr 2007 09:30:07 +0900, Tim X <timx@nospam.dev.null> wrote:

Well, the topic is rather interesting; I'll try to check it soon, and I'll try
to write a Ruby interface for the C library that the author of the article
references. I was looking for a good case to test embedding C into Ruby.
I'm not promising anything, though :)

Regexps are a great tool for text processing, but serious log analyzers (like
the one my company uses) are written in C and based on lots of low-level string
manipulation, raw binary access to the DB, et cetera; not to mention that Ruby
is much, much slower than C anyway.

Greets
Marcin Raczkowski

···

On Tuesday 10 April 2007 19:14, MenTaLguY wrote:

On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> The bad news is that the power of regexes will just not let you escape
> bad design all the time, so knowing a lot about regexes is necessary
> to use them well.

No, the bad news is that the bad design of most "regex" implementations
will not let you escape with using "regexes" as if they were true regular
expressions; knowing a lot about the implementation of the regex engine is
a necessary thing to use them well.

> But you are right, of course, that Ruby's current regex implementation
> is not state of the art; it is nevertheless very useful.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in
other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful
sometimes), but rather that NFAs should be used for evaluating the majority
of "regexes" which don't use non-regular features.

-mental

> The bad news is that the power of regexes will just not let you escape
> bad design all the time, so knowing a lot about regexes is necessary
> to use them well.

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.

> But you are right, of course, that Ruby's current regex implementation
> is not state of the art; it is nevertheless very useful.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

-mental

···

On 4/10/07, MenTaLguY <mental@rydia.net> wrote:

On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

I think you got NFA and DFA mixed up: DFAs are the ones that do not suffer from backtracking issues.

  robert

···

On 10.04.2007 21:14, MenTaLguY wrote:

On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

The bad news is that the power of regexes will just not let you escape
bad design all the time, so knowing a lot about regexes is necessary
to use them well.

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.

But you are right, of course, that Ruby's current regex implementation
is not state of the art; it is nevertheless very useful.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

oops wrong button here :(

>
> > The bad news is that the power of regexes will just not let you escape
> > bad design all the time, so knowing a lot about regexes is necessary
> > to use them well.
>
> No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.

That might be; I am not qualified to discuss this issue.
What I actually meant is that, given the need for backtracking
abilities, it will just not be possible to "guess" what the user meant
when he/she uses too general an expression, and I am talking about NFAs
here.
I just wanted to clear that point up.

>
> > But you are right, of course, that Ruby's current regex implementation
> > is not state of the art; it is nevertheless very useful.
><snip>
>

Cheers

···

On 4/11/07, Robert Dober <robert.dober@gmail.com> wrote:

On 4/10/07, MenTaLguY <mental@rydia.net> wrote:
> On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

I'm a little confused -- the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn't any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.

For "pure" regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking's worst-case exponential complexity.

-mental

···

On Wed, 11 Apr 2007 16:53:26 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

What I actually meant is that, given the need for backtracking
abilities, it will just not be possible to "guess" what the user meant
when he/she uses too general an expression, and I am talking about NFAs
here.

> What I actually meant is that, given the need for backtracking
> abilities, it will just not be possible to "guess" what the user meant
> when he/she uses too general an expression, and I am talking about NFAs
> here.

I'm a little confused -- the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn't any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.

For "pure" regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking's worst-case exponential complexity.

No, it must be me who is confused. I thought that backtracking cannot
be avoided in regexes like e.g. %r{[a]*.*a} when trying for a greedy
match. Let me see, the NFA should be something like:

X start second
a start term
. second second

How could backtracking be avoided when no 'a' is found in the <second> state?
The DFA of course could never backtrack, but is there a DFA for this regex?

Hope to learn a little bit more about what I seem to have forgotten ;)

Cheers
Robert

···

On 4/11/07, MenTaLguY <mental@rydia.net> wrote:

On Wed, 11 Apr 2007 16:53:26 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
-mental

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

<snip>

Stupid me, make this simply /a*.*a/, sorry.

···

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

No, it must be me who is confused. I thought that backtracking cannot
be avoided in regexes like e.g. %r{[a]*.*a} when trying for a greedy
match

Using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):

start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.

The DFA of course could never backtrack, but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

{start,a_star,any_star}: (
   'a' =>{a_star,any_star,final} |
   (any-'a') ->{any_star}
),

{any_star}: (
   'a' =>{any_star,final} |
   (any-'a') ->{any_star}
),

{a_star,any_star,final}: (
   'a' =>{a_star,any_star,final} |
   (any-'a') ->{any_star,final}
),

{any_star,final}: (
   'a' =>{any_star,final} |
   (any-'a') ->{any_star,final}
)

The resulting FA is deterministic, and {a_star,any_star,final} and {any_star,final} are both accepting states.

If we need to capture subexpressions, we can introduce additional counters and labeled transitions with priorities to resolve greedy/non-greedy ambiguities (as is often done in Ragel to resolve ambiguities). Prioritized transitions are also necessary for regexps which are not anchored at the beginning.

-mental

···

On Thu, 12 Apr 2007 03:27:04 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

"Robert Dober" <robert.dober@gmail.com> writes:

Subject: Re: Slow ruby regexes

> What I actually meant is that, given the need for backtracking
> abilities, it will just not be possible to "guess" what the user meant
> when he/she uses too general an expression, and I am talking about NFAs
> here.

I'm a little confused -- the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn't any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.

For "pure" regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking's worst-case exponential complexity.

Guys, I think you're possibly mixing up backtracking and backreferences. It is
backreferences (referencing an earlier match, often to use as part of the
definition of a later match) that an NFA is not able to handle; both approaches
use backtracking. The NFA approach is able to parallelise the choices, reducing
the overheads of backtracking. So, given n different possible paths to follow,
instead of, in the worst case, trying n-1 paths and having to backtrack after
each one, you can try all n paths at once. At least, I think that's what the
article was saying, from a very quick read and from what I can remember from my
computability and finite automata course (oh so many years ago now!).

Tim

···

On 4/11/07, MenTaLguY <mental@rydia.net> wrote:

On Wed, 11 Apr 2007 16:53:26 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

--
tcross (at) rapttech dot com dot au

Robert Dober wrote:

<snip>

Stupid me, make this simply /a*.*a/, sorry.

Actually, to make it non-backtracking, you just need to update all the possible states at the same time. It's quite easy; look up the Thompson NFA for a closer look at it.
The problem is that it gets problematic with backREFERENCES. That's the main problem that is currently solved with backtracking implementations.

What MenTaLguY said is basically that there should be better ways of implementing regex engines with backreferences. Since you don't use backreferences that often, you could use a Thompson NFA for the cases where it's fast, and switch to a backtracking engine only when you need it; a sketch of the state-set idea follows below.
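
A minimal sketch of that state-set idea, hard-coding a full-match NFA for the /a*.*a/ example discussed above (a toy, not a general engine; assumes a reasonably modern Ruby):

require 'set'

# Transition table with epsilon closures folded in. State 0 is the a* loop,
# state 1 is the .* loop, and state 2 means "just consumed the final a".
TRANSITIONS = {
  0 => lambda { |c| c == 'a' ? [0, 1] : [1] },  # stay in a*, or fall through to .*
  1 => lambda { |c| c == 'a' ? [1, 2] : [1] },  # stay in .*; an 'a' may be the last one
  2 => lambda { |c| [] }                        # nothing may follow the final 'a'
}

def nfa_full_match?(text)
  states = Set.new([0, 1])  # epsilon closure of the start state
  text.each_char do |c|
    # Advance every live state at once: one pass over the input, no backtracking.
    states = states.flat_map { |s| TRANSITIONS[s].call(c) }.to_set
  end
  states.include?(2)        # accept iff some path ended on the final 'a'
end

p nfa_full_match?("aaxya")  # => true
p nfa_full_match?("aaxy")   # => false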

Cheers

···

--
Ola Bini (http://ola-bini.blogspot.com) JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)

"Yields falsehood when quined" yields falsehood when quined.