Slow ruby regexes

While this is true in theory, there are a number of modern RX features that are extremely difficult or impossible to do with a DFA. So as it stands now, for practical purposes most modern RX engines are NFAs or hybrids, sometimes (Perl) with heavy optimizations to prevent certain ugly effects (see "Mastering Regular Expressions" for details).

Kind regards

  robert

···

On 11.04.2007 22:51, MenTaLguY wrote:

On Thu, 12 Apr 2007 03:27:04 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

No, it must be me who is confused; I thought that backtracking cannot
be avoided in regexes like e.g. %r{[a]*.*a} when trying for a greedy
match

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):
start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.

The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

Thanx for the explanation :), I guess I learnt this algorithm once
upon a time. :(
Anyway relearning is learning too.
Now there does not seem to be any practical need to do this
transformation though, as the Thompson NFA algorithm kinda does the
same thing at runtime, not creating states that are sets of states but
just keeping a set of states.
Thx again for your time and patience, and yes I have to change my
initial statement

there is no bad news :)) [ still ignoring backreferences ]

Cheers
Robert

···

On 4/11/07, MenTaLguY <mental@rydia.net> wrote:

On Thu, 12 Apr 2007 03:27:04 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> No, it must be me who is confused; I thought that backtracking cannot
> be avoided in regexes like e.g. %r{[a]*.*a} when trying for a greedy
> match

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):

start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.
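For illustration, here is a minimal Ruby sketch (my own encoding, not code from any real engine) that steps this NFA exactly as described: a set of live states advanced in lockstep, with the counter updated on the labeled transition.

# Simulate the NFA for /^a*.*a/ by advancing a *set* of states one
# input character at a time. The epsilon transitions from start are
# pre-expanded into the initial state set.
def greedy_match_end(input)
  states = [:a_star, :any_star]   # epsilon closure of start
  match_end = -1                  # the counter set by the => transition
  input.each_char.with_index do |ch, i|
    next_states = []
    states.each do |s|
      case s
      when :a_star
        next_states << :a_star if ch == 'a'
        next_states << :any_star           # 'any' includes every character
      when :any_star
        next_states << :any_star
        if ch == 'a'                       # the 'a' =>final transition
          match_end = i if i > match_end
          next_states << :final
        end
      when :final
        next_states << :final
      end
    end
    states = next_states.uniq              # dedupe: no exponential blowup
  end
  match_end  # position of the last character of the greedy match, or -1
end

greedy_match_end("aaxba")  #=> 4
greedy_match_end("xxx")    #=> -1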

> The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

{start,a_star,any_star}: (
   'a' =>{a_star,any_star,final} |
   (any-'a') ->{any_star}
),

{any_star}: (
   'a' =>{any_star,final} |
   (any-'a') ->{any_star}
),

{a_star,any_star,final}: (
   'a' =>{a_star,any_star,final} |
   (any-'a') ->{any_star,final}
),

{any_star,final}: (
   'a' =>{any_star,final} |
   (any-'a') ->{any_star,final}
)

The resulting FA is deterministic, and {a_star,any_star,final} and {any_star,final} are both accepting states.
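Rendered as a plain Ruby transition table (my shorthand state names; the capture counters are omitted), the determinism is easy to see: one successor per state and input class, so matching is a single left-to-right pass with no backtracking.

# The DFA above as a lookup table. :na stands for (any-'a').
TRANS = {
  s0: { a: :s2, na: :s1 },   # {start,a_star,any_star}
  s1: { a: :s3, na: :s1 },   # {any_star}
  s2: { a: :s2, na: :s3 },   # {a_star,any_star,final}
  s3: { a: :s3, na: :s3 }    # {any_star,final}
}
ACCEPTING = [:s2, :s3]       # the two accepting states

def dfa_match?(input)
  state = :s0
  input.each_char { |ch| state = TRANS[state][ch == 'a' ? :a : :na] }
  ACCEPTING.include?(state)
end

dfa_match?("aaxba")  #=> true
dfa_match?("xxx")    #=> false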

If we need to capture subexpressions, we can introduce additional counters and labeled transitions with priorities to resolve greedy/non-greedy ambiguities (as is often done in Ragel). Prioritized transitions are also necessary for regexps which are not anchored at the beginning.

-mental

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

"Robert Dober" <robert.dober@gmail.com> writes:

<snip>

Guys, I think you're possibly mixing up backtracking and backreferences. It is
backreferences (referencing an earlier match, often to use as part of the
definition of a later match) that an NFA is not able to do.

No Tim I was not mixing it up, I said that I was not interested in the
backreferencing history.
You see I am just forgetting the basic stuff but Ola and Mental kindly
gave me a refresher.

both approaches use
backtracking.

The DFA cannot use backtracking, surely I am missing something *again*!

The NFA approach is able to parallelise the choices, reducing the
overheads of backtracking. So, given n different possible paths to follow,
instead of, in the worst case, trying n-1 paths and having to backtrack after
each one, you can try all n paths at once. At least, I think that's what the
article was saying from a very quick read and from what I can remember from my
computability and finite automata course (oh so many years ago now!).

For me it is 25 years and for you?
Yeah I understood that in the same way, actually it is quite simple :)

Tim
--
tcross (at) rapttech dot com dot au

Cheers
Robert

···

On 4/12/07, Tim X <timx@nospam.dev.null> wrote:

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

I think this (and much of the rest of the discussion) rather misses
the point. The DFA algorithm performs poorly in some circumstances
because it may backtrack to the *same* state *many* times. The NFA
algorithm wins in such cases because it generates a list of all
feasible states before attempting to progress any of them, and it is
therefore easy to optimise by eliminating duplicates from the list.

Here's a concrete example. Suppose we are trying to match a regular
expression containing (a?)(a?)(a?)...(a?). If n is the number of
repetitions of (a?), then in the worst case where the match fails the
DFA explores 2^n possibilities before failure. But in the course of
doing that it only enters n(n+1)/2 distinct states. This is because
there are e.g. 3 ways of matching (a?)(a?)(a?) against "aa" (depending
on which of the three a's you consider to be optional) and the DFA
attempts to complete the match for *each* of those possibilities, even
though the calculation is identical in each case. The only thing that
matters is how many (a?)s you have matched so far, and how many "a"s
you have matched them against, but the DFA isn't smart enough to see
that it is repeating the same calculation 3 times.

The NFA wins because it does *not* search all 2^n combinations of
choices in the regexp, it searches the n(n+1)/2 possible states of the
regexp engine. The three *different* ways of matching (a?)(a?)(a?)
against "aa" all map to the *same* state of the regexp engine, so the
NFA checks this possibility only once.

So, the reason for the NFA's success is *not* that it "tries all
possibilities at once" or that it "avoids backtracking". Rather, it
reformulates the problem so that the search space is much smaller.
Equivalently, it implicitly caches the results of previous attempts so
that it can say "hmm, I've been here before and it didn't work then,
so I won't bother trying again". If you accept the maxim that madness
is trying the same thing in the hope of different results, the DFA
loses because it is insane! ;)
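A quick way to feel this blow-up is to hand the pattern to a backtracking engine; here is a hypothetical micro-benchmark (my example, not from the thread) against Ruby's own engine:

require 'benchmark'

# (a?)^n a^n matched against "a"*n: the match succeeds, but only on
# the last of ~2^n combinations a backtracking engine tries, because
# every a? is attempted greedily first.
[16, 20, 24].each do |n|
  re = Regexp.new("\\A" + "a?" * n + "a" * n + "\\z")
  secs = Benchmark.realtime { re =~ ("a" * n) }
  printf("n=%2d  %.4fs\n", n, secs)
end
# Times roughly double with each extra repetition on a backtracking
# engine; a Thompson-style engine stays linear in n.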

Regards,

Jeremy Henty

···

On 2007-04-12, Tim X <timx@nospam.dev.null> wrote:

The NFA approach is able to parallelise the choices, reducing the
overheads of backtracking. So, given n different possible paths to
follow, instead of, in the worst case, trying n-1 paths and having
to backtrack after each one, you can try all n paths at once.

Robert Dober wrote:
> <snip>
>
> stupid me make this simply /a*.*a/ sorry
Actually, to make it non-backtracking, you just need to update all the
possible states at the same time. It's quite easy. Just look up the
Thompson NFA for a closer look at it.
Things do get problematic with backREFERENCES, though. That's the
main problem that is currently solved with backtracking implementations.
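For a concrete sense of why backreferences resist the automaton treatment (my example, not from the thread): a single backreference already recognizes a non-regular language, which no DFA or plain NFA can match.

# /\A(.+)\1\z/ matches exactly the "squares" ww - a textbook
# non-regular language, so it is out of reach for finite automata.
square = /\A(.+)\1\z/
"abcabc" =~ square  #=> 0  (matches, with $1 == "abc")
"abcabd" =~ square  #=> nil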

Thx Ola, I'll look it up.

What Mentalguy said is basically that there should be better ways of
implementing regex engines with backreferences. Since you don't use
backreferences so often, you could have Thompson NFA for those cases
where it's fast, and switch to a backtracking engine only when you need it.
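As a sketch of that hybrid dispatch (the engine classes here are illustrative stubs, not real Ruby APIs):

# Hypothetical front end for a hybrid engine: Thompson NFA for
# regular patterns, backtracking only when backreferences force it.
ThompsonEngine     = Struct.new(:source)  # would build an NFA/lazy DFA
BacktrackingEngine = Struct.new(:source)  # would build a backtracker

def compile_pattern(src)
  # Crude backreference test; ignores escaping subtleties like "\\\\1".
  src =~ /\\[1-9]/ ? BacktrackingEngine.new(src) : ThompsonEngine.new(src)
end

compile_pattern('a*.*a').class    #=> ThompsonEngine
compile_pattern('(a+)\1').class   #=> BacktrackingEngine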

Yeah I was not concerned with the backreferences, just trying to
understand the basic stuff ;)

Cheers

--

Robert

···

On 4/11/07, Ola Bini <ola.bini@gmail.com> wrote:
--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

Robert Klemme <shortcutter@googlemail.com> writes:

No, it must be me who is confused; I thought that backtracking cannot
be avoided in regexes like e.g. %r{[a]*.*a} when trying for a greedy
match

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this
(after collapsing most of the epsilon transitions introduced by Thompson's
Algorithm):
start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.

The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

While this is true in theory, there are a number of modern RX features that are
extremely difficult or impossible to do with a DFA. So as it stands now, for
practical purposes most modern RX engines are NFAs or hybrids, sometimes (Perl)
with heavy optimizations to prevent certain ugly effects (see "Mastering
Regular Expressions" for details).

Kind regards

To me, this is possibly a much more important point than just being concerned
over raw performance. While NFA may be *a lot* better performing in the
majority of cases, if we want to add in the extra power of backreferences and
other 'nice' (possibly considered advanced) features, things begin to get a lot
more complicated. Unfortunately, with complexity, we usually find higher levels
of bugs in the final implementation.

I'm not saying that things should be kept super simple just to ensure
correctness at the cost of performance. However, we should consider complexity
and correctness/maintainability of implementation as well as just pure
performance. For me and possibly for many others, I find this particularly
important as performance of regular expressions has never been a critical issue
with any of the work I've had to use them for (to what extent this is just a
reflection of the types of problems I deal with I don't know). In fact, for
everyone I've ever helped with sorting out performance problems with their
regexps, it has turned out to be badly defined regexps that were the issue
rather than poor performance of the engine/underlying algorithm.

Maybe the real question we should be asking is "If ruby had a regexp engine
which was 10, 100, 1000, 1000000 times faster, would that allow us to solve
problems which we cannot solve now in any practical way?"

In principle, I think having a regexp engine which adopts the DFA/NFA
algorithm if backreferences are not required and uses a slower backtracking
approach when they are is a good approach, but only if we can ensure a stable
and correct implementation.

Tim

···

On 11.04.2007 22:51, MenTaLguY wrote:

On Thu, 12 Apr 2007 03:27:04 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

--
tcross (at) rapttech dot com dot au

"Robert Dober" <robert.dober@gmail.com> writes:

<snip>

Guys, I think you're possibly mixing up backtracking and backreferences. It is
backreferences (referencing an earlier match, often to use as part of the
definition of a later match) that an NFA is not able to do.

No Tim I was not mixing it up, I said that I was not interested in the
backreferencing history.
You see I am just forgetting the basic stuff but Ola and Mental kindly
gave me a refresher.

both approaches use
backtracking.

The DFA cannot use backtracking, surely I am missing something *again*!

Doh!

The NFA approach is able to parallelise the choices, reducing the
overheads of backtracking. So, given n different possible paths to follow,
instead of, in the worst case, trying n-1 paths and having to backtrack after
each one, you can try all n paths at once. At least, I think that's what the
article was saying from a very quick read and from what I can remember from my
computability and finite automata course (oh so many years ago now!).

For me it is 25 years and for you?

Nearly the same - computability course was 82/83, but I did go back and do some
postgrad in the early 90s, so my knowledge cache should have been fresher!
Guess it's that bloody LRU memory cache algorithm trying to make room again!

Yeah I understood that in the same way, actually it is quite simple :)

It's funny, back in 80..83, the courses I enjoyed the most were
computability, parallel processing, compilers and assembly. Now, when I think
about anything in those areas, I just have vague recollections and no real
facts. Since those days, work has been about databases, dull user interface
programming, 4GLs, web technology etc. Up until 4 months ago, it was an
increasing deluge of project management methodologies, bullshit bingo
management speak, HR, budgets and sales. Now I'm getting back to bare-bones
technology and it's scary how much I've forgotten! Still, at least I'm now
looking at Ruby (amongst many other things!).

Tim

···

On 4/12/07, Tim X <timx@nospam.dev.null> wrote:

--
tcross (at) rapttech dot com dot au

The Thompson algorithm essentially just does the NFA -> DFA conversion lazily.
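For illustration, a minimal Ruby sketch of that laziness (my encoding of the /^a*.*a/ NFA from upthread, not code from the thread): DFA states are sets of NFA states, built and memoized only when the input actually reaches them.

# Lazy determinization: each (state set, character) pair is computed
# at most once and cached, so only the DFA states the input touches
# ever get built.
NFA_STEP = {
  a_star:   ->(ch) { ch == 'a' ? [:a_star, :any_star] : [:any_star] },
  any_star: ->(ch) { ch == 'a' ? [:any_star, :final]  : [:any_star] },
  final:    ->(ch) { [:final] }
}

def lazy_dfa_match?(input)
  cache = {}                    # (state set, char) => next state set
  set = [:a_star, :any_star]    # epsilon closure of start
  input.each_char do |ch|
    set = (cache[[set, ch]] ||=
      set.flat_map { |s| NFA_STEP[s].call(ch) }.uniq.sort)
  end
  set.include?(:final)          # :final live => the input matched
end

lazy_dfa_match?("aaxba")  #=> true
lazy_dfa_match?("xxx")    #=> false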

-mental

···

On Thu, 12 Apr 2007 16:59:47 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

Now there does not seem to be any practical need to do this
transformation though, as the Thompson NFA algorithm kinda does the
same thing at runtime, not creating states that are sets of states but
just keeping a set of states.

Every NFA can be rewritten as a DFA. Each state in the resulting DFA
corresponds to a superposition of states in the original NFA

While this is true in theory, there are a number of modern RX features
that are extremely difficult or impossible to do with a DFA. So as it
stands now, for practical purposes most modern RX engines are NFAs or
hybrids,

It's important not to confuse NFAs with backtracking. For instance, Thompson's algorithm uses NFAs, but it evaluates by rewriting them on-the-fly as DFAs, not by performing a backtracking search. Again, NFAs and DFAs are equivalent, even if the naive algorithm (backtracking) for evaluating an NFA is less efficient than the naive algorithm for evaluating a DFA.

On the other hand, those modern features which really do require backtracking (surprisingly few) also exceed the capabilities of finite automata (i.e. those features correspond to non-regular languages), so at that point the regular expression engines cannot be said to be using FAs at all, but a more general type of automaton.

Again, though we must obviously keep backtracking in reserve for those patterns that need it, it is very wasteful not to use a pure FA approach for those patterns where non-regular features aren't being used.

sometimes (Perl) with heavy optimizations to prevent certain
ugly effects (see "Mastering Regular Expressions" for details).

"Mastering Regular Expressions" is very good for learning the pragmatics of working with contemporary regular expression implementations, but it is not a good source for theory.

-mental

···

On Thu, 12 Apr 2007 16:10:06 +0900, Robert Klemme <shortcutter@googlemail.com> wrote:

I think this (and much of the rest of the discussion) rather misses
the point. The DFA algorithm performs poorly in some circumstances
because it may backtrack to the *same* state *many* times.

DFA evaluation does not backtrack. Assuming epsilon transitions are collapsed when constructing the DFA, for a given input string there will be exactly n (or fewer, in the case of failure) transitions taken, where n is the number of characters in the input string.

Perhaps you meant that the same NFA state is potentially represented in exponentially many different DFA states? That would indeed be a problem, except...

This is because there are eg. 3 ways of matching (a?)(a?)(a?) against "aa"
(depending on which of the three a's you consider to be optional)

In the case of regular expressions, this ambiguity is resolved by the usual "greedy" versus "non-greedy" distinctions, which can be expressed by assigning weights to transitions in the NFA (thereby pruning the number of states in the resulting DFA). This is pretty much the same way you have to deal with ambiguities when constructing Ragel grammars.

-mental

···

On Fri, 13 Apr 2007 02:55:20 +0900, Jeremy Henty <onepoint@starurchin.org> wrote:

Done ;)
just wanted to share the link
http://swtch.com/~rsc/regexp/regexp1.html
Robert

···

On 4/11/07, Robert Dober <robert.dober@gmail.com> wrote:

On 4/11/07, Ola Bini <ola.bini@gmail.com> wrote:
Thx Ola, I'll look it up.

"Robert Dober" <robert.dober@gmail.com> writes:

<snip>

It's funny, back in 80..83, the courses I enjoyed the most were
computability, parallel processing, compilers and assembly. Now, when I think
about anything in those areas, I just have vague recollections and no real
facts. Since those days, work has been about databases, dull user interface
programming, 4GLs, web technology etc. Up until 4 months ago, it was an
increasing deluge of project management methodologies, bullshit bingo
management speak, HR, budgets and sales. Now I'm getting back to bare-bones
technology and it's scary how much I've forgotten! Still, at least I'm now
looking at Ruby (amongst many other things!).

Amen, but I think one should be optimistic ;)

Tim

--
tcross (at) rapttech dot com dot au

Robert

···

On 4/12/07, Tim X <timx@nospam.dev.null> wrote:

This would be true if the Thompson algorithm was caching the states as
it went. If I recall correctly (it's been a while since I read the
article) it doesn't cache the states and Russ Cox's code in the
article definitely doesn't. So basically, the difference between
Thompson's parallel NFA evaluation and the "naive" backtracking NFA
evaluation is breadth-first search vs depth-first search. The reason
breadth-first search is so much faster on this problem is that you
avoid repeatedly going down the same search path. Given an unlimited
amount of memory and using dynamic programming techniques you could
presumably make the backtracking NFA evaluation run as fast as
Thompson's algorithm. In fact I read somewhere that the Perl regex
engine has started to use dynamic programming, obviously without great
success.

So there are going to be some situations where a DFA will be a lot
faster than a Thompson NFA. But, having said that, I think it is
important to note (and I don't think anyone else has mentioned this)
that it is not always better to compile a regex into a DFA, even if it
is possible. In some instances it will take longer to compile the
regex into a DFA than it will take to evaluate the simple NFA
representation. I don't have any hard data to show this but I'm pretty
sure there will be some cases where Perl's and Ruby's regex engines
will easily beat the awk and grep engines.

By the way, I highly recommend people check out Ville Laurikari's
regex engine TRE;

    TRE — The free and portable approximate regex matching library.

While the predictable worst-case performance is nice, the really cool
thing that I like about this project is the approximate matching. This
is something I'd really like to have in Ruby.

···

On 4/13/07, MenTaLguY <mental@rydia.net> wrote:

On Thu, 12 Apr 2007 16:59:47 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> Now there does not seem to be any practical need to do this
> transformation though, as the Thompson NFA algorithm kinda does the
> same thing at runtime, not creating states that are sets of states but
> just keeping a set of states.

The Thompson algorithm essentially just does the NFA -> DFA conversion lazily.

--
Dave Balmain
http://www.davebalmain.com/

Every NFA can be rewritten as a DFA. Each state in the resulting DFA
corresponds to a superposition of states in the original NFA

While this is true in theory, there are a number of modern RX features
that are extremely difficult or impossible to do with a DFA. So as it
stands now, for practical purposes most modern RX engines are NFAs or
hybrids,

It's important not to confuse NFAs with backtracking. For instance, Thompson's algorithm uses NFAs, but it evaluates by rewriting them on-the-fly as DFAs, not by performing a backtracking search. Again, NFAs and DFAs are equivalent, even if the naive algorithm (backtracking) for evaluating an NFA is less efficient than the naive algorithm for evaluating a DFA.

On the other hand, those modern features which really do require backtracking (surprisingly few) also exceed the capabilities of finite automata (i.e. those features correspond to non-regular languages), so at that point the regular expression engines cannot be said to be using FAs at all, but a more general type of automaton.

Strictly speaking, yes. But AFAIK the basis is usually an NFA.

Btw, because of the nature of the engine there are some more differences between DFA and NFA; for example, the order of expressions in an alternation typically matters with an NFA engine. I often prefer that, because it allows you to control the way the RX engine matches to a certain extent, i.e. you can prioritize alternatives, which can be pretty handy at times.
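For example, with Ruby's engine:

# With a backtracking (NFA-style) engine, the leftmost alternative
# that can match wins, so the order of alternatives is significant:
"foobar"[/foo|foobar/]   #=> "foo"
"foobar"[/foobar|foo/]   #=> "foobar"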

Again, though we must obviously keep backtracking in reserve for those patterns that need it, it is very wasteful not to use a pure FA approach for those patterns where non-regular features aren't being used.

sometimes (Perl) with heavy optimizations to prevent certain
ugly effects (see "Mastering Regular Expressions" for details).

"Mastering Regular Expressions" is very good for learning the pragmatics of working with contemporary regular expression implementations, but it is not a good source for theory.

Since I had the theory at university, and at the moment I'm more interested in using them than writing them, this is good enough for me. :)

Kind regards

  robert

···

On 12.04.2007 18:31, MenTaLguY wrote:

On Thu, 12 Apr 2007 16:10:06 +0900, Robert Klemme <shortcutter@googlemail.com> wrote:

This would be true if the Thompson algorithm was caching the states as
it went. If I recall correctly (it's been a while since I read the
article) it doesn't cache the states and Russ Cox's code in the
article definitely doesn't.

There is a section in his article entitled "Caching the NFA to build a DFA" with code doing exactly that.

In fact I read somewhere that the Perl regex engine has started to use dynamic
programming, obviously without great success.

It's mentioned in his article as well.

In some instances it will take longer to compile the
regex into a DFA than it will take to evaluate the simple NFA
representation.

Namely, when most of the states in the NFA will not be visited for a particular input. That is why the DFA construction is done lazily.

By the way, I highly recommend people check out Ville Laurikari's
regex engine TRE;

    TRE — The free and portable approximate regex matching library.

While the predictable worst-case performance is nice, the really cool
thing that I like about this project is the approximate matching. This
is something I'd really like to have in Ruby.

That's really cool.

-mental

···

On Fri, 13 Apr 2007 02:25:15 +0900, "David Balmain" <dbalmain.ml@gmail.com> wrote:

Nope, you can effectively prioritize alternatives with DFAs, the same
way you can specify greedy/non-greedy matches. NFAs and DFAs _really
are_ equivalent in expressiveness.

-mental

···

On Fri, 2007-04-13 at 16:00 +0900, Robert Klemme wrote:

Btw, because of the nature of the engine there are some more differences
between DFA and NFA; for example, the order of expressions in an
alternation typically matters with an NFA engine. I often prefer that,
because it allows you to control the way the RX engine matches to a
certain extent, i.e. you can prioritize alternatives, which can be
pretty handy at times.

Sorry, I meant to say that it's when most of the states in the _DFA_ will not be visited for a particular input.

Generally speaking, you can visit all the states in the NFA without necessarily visiting all the states in the DFA.

-mental

···

On Fri, 13 Apr 2007 02:47:34 +0900, MenTaLguY <mental@rydia.net> wrote:

In some instances it will take longer to compile the
regex into a DFA than it will take to evaluate the simple NFA
representation.

Namely, when most of the states in the NFA will not be visited for a
particular input. That is why the DFA construction is done lazily.

> This would be true if the Thompson algorithm was caching the states as
> it went. If I recall correctly (it's been a while since I read the
> article) it doesn't cache the states and Russ Cox's code in the
> article definitely doesn't.

There is a section in his article entitled "Caching the NFA to build a DFA" with code doing exactly that.

> In fact I read somewhere that the Perl regex engine has started to use dynamic
> programming, obviously without great success.

It's mentioned in his article as well.

Right you are. Sorry, I read the article a couple of months back when
it was posted and I thought I remembered it well enough to comment
without reading it again. I guess not. :(

> In some instances it will take longer to compile the
> regex into a DFA than it will take to evaluate the simple NFA
> representation.

Namely, when most of the states in the NFA will not be visited for a particular input. That is why the DFA construction is done lazily.

True, lazy construction of the DFA definitely seems like a very good
solution in most cases. I guess what I was trying to say though is
that the seemingly naive backtracking NFA algorithm will actually be
the fastest solution in some cases, as there may be little or no
backtracking, and even simply caching the DFA states could be
unnecessary overhead. Therefore, there is no one solution that beats
them all. When matching a simple string like /supercalifraglistic/ for
instance, I believe the simple backtracking algorithm would be faster
than Thompson's algorithm. In fact, there are algorithms with
sub-linear expected time for matching simple strings like this. I'm
currently reading a good book about this called "Flexible Pattern
Matching in Strings" by Gonzalo Navarro and Mathieu Raffinot. Worth a
read if you can get your hands on it.
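As a taste of that family of algorithms, here is a minimal Boyer-Moore-Horspool sketch in Ruby (my example, not from the book): on a mismatch the search window skips ahead by up to the whole pattern length, which is what makes the expected time sub-linear.

# Horspool's simplification of Boyer-Moore: precompute, for each
# character, how far the window may safely slide when that character
# is the last one under the window.
def horspool_index(text, pat)
  m = pat.length
  return nil if m.zero? || text.length < m
  shift = Hash.new(m)                      # default skip: whole pattern
  pat[0..-2].each_char.with_index { |c, i| shift[c] = m - 1 - i }
  pos = 0
  while pos + m <= text.length
    return pos if text[pos, m] == pat      # compare current window
    pos += shift[text[pos + m - 1]]        # skip on the window's last char
  end
  nil
end

horspool_index("sing supercalifragilistic", "cali")  #=> 10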

···

On 4/13/07, MenTaLguY <mental@rydia.net> wrote:

On Fri, 13 Apr 2007 02:25:15 +0900, "David Balmain" <dbalmain.ml@gmail.com> wrote:

--
Dave Balmain
http://www.davebalmain.com/

How do you do that?

Kind regards

  robert

···

On 13.04.2007 22:02, MenTaLguY wrote:

On Fri, 2007-04-13 at 16:00 +0900, Robert Klemme wrote:

Btw, because of the nature of the engine there are some more differences between DFA and NFA; for example, the order of expressions in an alternation typically matters with an NFA engine. I often prefer that, because it allows you to control the way the RX engine matches to a certain extent, i.e. you can prioritize alternatives, which can be pretty handy at times.

Nope, you can effectively prioritize alternatives with DFAs, the same
way you can specify greedy/non-greedy matches. NFAs and DFAs _really
are_ equivalent in expressiveness.

Let's use a real simple example: /^(?:(ab)|(..))$/

Here, we want capturing subexpression 1 to have priority over capturing
subexpression 2. This time, I'll use a notation to annotate
transitions with capture boundaries and priorities.

First, the NFA:

start: ('' [start $1]->group_1 | '' [start $2]->group_2),
group_1: ('a' ->group_1b),
group_1b: ('b' [end $1,priority 1]->final),
group_2: (any ->group_2b),
group_2b: (any [end $2,priority 0]->final)

Now, the equivalent DFA:

{start}: ('' [start $1,start $2]->{group_1,group_2}),
{group_1,group_2}: (
   'a' ->{group_1b,group_2b} |
   (any-'a') ->{group_2b}
),
{group_1b,group_2b}: (
   'b' [end $1]->{final} |
   (any-'b') [end $2]->{final}
)
{group_2b}: (any [end $2]->{final})

See how that works?

More complex examples can be attacked using the same principles, though
it gets painful to work by hand pretty fast.
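The intended priority is easy to check against Ruby's own engine, which prefers the first alternative the same way:

m = /^(?:(ab)|(..))$/.match("ab")
[m[1], m[2]]   #=> ["ab", nil]    group 1 wins when it can match
m = /^(?:(ab)|(..))$/.match("xy")
[m[1], m[2]]   #=> [nil, "xy"]    group 2 only as the fallback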

-mental

···

On Sat, 2007-04-14 at 06:50 +0900, Robert Klemme wrote:

> Nope, you can effectively prioritize alternatives with DFAs, the same
> way you can specify greedy/non-greedy matches. NFAs and DFAs _really
> are_ equivalent in expressiveness.

How do you do that?