Ruby regex engine behavior question

I read this in a journal entry:

"[In the Ruby 1.6 regex engine] \G doesn't prohibit regex bump-along
(it's 'start of current match' rather than 'end of last match'), which
makes it relatively useless to write complex parsers with."

Can anyone comment on this? I'm not quite certain what he means. And
is it still the same in 1.8?

Regards,

Dan

"[In the Ruby 1.6 regex engine] \G doesn't prohibit regex bump-along

                                      ^^^^^^^

are you sure of this?

(it's 'start of current match' rather than 'end of last match'), which
makes it relatively useless to write complex parsers with."

Guy Decoux

Hi,

At Tue, 14 Sep 2004 01:04:58 +0900,
Daniel Berger wrote in [ruby-talk:112395]:

"[In the Ruby 1.6 regex engine] \G doesn't prohibit regex bump-along
(it's 'start of current match' rather than 'end of last match'), which
makes it relatively useless to write complex parsers with."

I don't understand what he means either. The 'start' and the 'end'
should be the same, since a global match starts matching from the
end of the last match.
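
To make that concrete, here is a rough sketch of a global-match loop. It
is a simplified model, not the engine's actual code, and the method name
global_match_positions is made up for illustration. Each search starts
where the previous match ended, except that an empty match forces the
start one character forward:

```ruby
# Simplified model of a global match loop (hypothetical helper, not
# taken from the Ruby sources).
def global_match_positions(str, re)
  positions = []
  pos = 0
  while pos <= str.length
    m = re.match(str, pos)            # search begins at character offset pos
    break unless m
    positions << [m.begin(0), m.end(0)]
    # Normally the next search starts at the end of this match. After an
    # empty match, step one character forward to avoid looping forever.
    pos = m.begin(0) == m.end(0) ? m.end(0) + 1 : m.end(0)
  end
  positions
end

p global_match_positions("aab", /a/)     # non-empty matches: start == previous end
p global_match_positions("abcde", /x?/)  # empty matches: start is bumped by one
```

In the second call every match is empty, so the start of each search is
one character past the end of the previous match. That is the only case
where 'start' and 'end' differ.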

--
Nobu Nakada

ts <decoux@moulon.inra.fr> wrote in message news:<200409131620.i8DGK1r18648@moulon.inra.fr>...

> "[In the Ruby 1.6 regex engine] \G doesn't prohibit regex bump-along
                                      ^^^^^^^

are you sure of this?

> (it's 'start of current match' rather than 'end of last match'), which
> makes it relatively useless to write complex parsers with."

Guy Decoux

No. That's why I'm asking. I'm merely quoting the entry I saw. Thoughts?

Dan

ts <decoux@moulon.inra.fr> wrote in message news:<200409131620.i8DGK1r18648@moulon.inra.fr>...

> "[In the Ruby 1.6 regex engine] \G doesn't prohibit regex bump-along
                                      ^^^^^^^

are you sure of this?

> (it's 'start of current match' rather than 'end of last match'), which
> makes it relatively useless to write complex parsers with."

Guy Decoux

The OP has further clarified. To quote:

When trying to match abcde with /\Gx?/g, the first match is
successful: no x is found, but the question mark allows zero
characters to be consumed. This match ends zero characters into
the string, that is, at start-of-string. To avoid infinite loops on
zero-length matches, the engine then retries the match one position
further along the string.

In Perl, \G means end-of-last-match, and since end-of-last-match was
at start-of-string, \G can't possibly match at one character into the
string:

    $ perl -le'$_="abcde"; s/\Gx?/!/g; print'
    !abcde

In Ruby (both 1.6 and 1.8, I found), \G merely means
start-of-current-match, which, of course, is satisfiable at that
point:

    $ ruby1.6 -e'puts "abcde".gsub(/\Gx?/,"!")'
    !a!b!c!d!e!
    $ ruby1.8 -e'puts "abcde".gsub(/\Gx?/,"!")'
    !a!b!c!d!e!

Perl's \G is a powerful tool for writing parsers because the regex
engine is prohibited from skipping characters to find a match. You can
work your way through a string by applying a multitude of patterns to
it in turn, using /c (to avoid resetting the end-of-last-match on match
failure), without the patterns sabotaging each other.

End quote.
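
For comparison, one way to get the same "no skipping" behaviour in Ruby
is StringScanner from the standard library: scan only matches at the
current position, and a failed scan leaves the position untouched, much
like Perl's /c. This is just a sketch. The token rules and the tokenize
method are invented for illustration and are not from the journal entry:

```ruby
require "strscan"

# Hypothetical token rules, tried in order at the current scan position.
TOKEN_RULES = [
  [:number, /\d+/],
  [:ident,  /[a-z_]\w*/i],
  [:op,     /[-+*\/=]/],
  [:space,  /\s+/]
].freeze

def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    # scan returns nil without moving the position when a rule does not
    # match here, so the rules cannot skip ahead or disturb each other.
    kind, _pattern = TOKEN_RULES.find { |_, pattern| scanner.scan(pattern) }
    raise ArgumentError, "unexpected input at offset #{scanner.pos}" unless kind
    tokens << [kind, scanner.matched] unless kind == :space
  end
  tokens
end

p tokenize("x = 42 + y1")
```

Input that no rule accepts raises an error at the offending offset
instead of being silently skipped.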

Thoughts?

Dan

> In Perl, \G means end-of-last-match, and since end-of-last-match was
> at start-of-string, \G can't possibly match at one character into the
> string:

That is one way to put it. Another is:

  * on a zero-length match, Perl prohibits a second zero-length match

  * on a zero-length match, Ruby moves its internal cursor

Guy Decoux
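
A small sketch of the difference between the two cases in Ruby, using
made-up test strings. With non-empty matches \G does stop the engine
from skipping ahead; only after an empty match does the cursor move, and
\G moves with it:

```ruby
# Non-empty matches: \G pins each match to where the previous one ended,
# so the "a" after the "b" is never reached.
anchored = "aaba".gsub(/\Ga/, "x")

# Without \G the engine is free to skip over the "b".
unanchored = "aaba".gsub(/a/, "x")

# Empty matches: gsub steps one character forward before searching again,
# and \G is satisfied at that new starting point.
empty = "abcde".gsub(/\Gx?/, "!")

p anchored    # "xxba"
p unanchored  # "xxbx"
p empty       # "!a!b!c!d!e!", the result reported earlier in the thread
```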