Ruby regexp backreferences

I’m doing something that required RLE, and the code that I
translated from Perl to do this included the following regexp:

/^(.?)((.)\2{2,127})(.?)$/ois
# Later code using $1 (.?), $2 (…), $3 (.?)

Now, you’d think that this simply translates as:

/^(.?)((.)\2{2,127})(.?)$/m
# Later code using $1 (.?), $2 (…), $3 (.?)

However, it doesn’t work because of the way that Ruby builds the
regexp backreferences. The backreferences built are $1, $2, $3, and
$4 – the (.) in the ((.)\2{2,127}) is treated as $3 – whereas in
Perl, it’s simply consumed and ignored. Obviously, I can’t simply
replace (.) with (?:.), because I need the backreference within the
regexp itself. Thus, the translated version is:

/^(.?)((.)\3{2,127})(.?)$/m

While what’s happening makes sense, I’m wondering if it’s correct –
how deep should backreferences be nested and considered part of the
process?

-austin
– Austin Ziegler, austin@halostatue.ca on 2003.04.29 at 22:59:16

I’m doing something that required RLE, and the code that I
translated from Perl to do this included the following regexp:

/^(.?)((.)\2{2,127})(.?)$/ois

austin - are you sure this works? it’s odd because

/^(.?)((.)\2{2,127})(.?)$/ois
^^^^^^^^^^^^^^
to me this seems to say, “the second match shall be composed of a single char
followed by 2 to 127 copies of the second match.” in other words, it would
seem to be recursive.

for instance, this does not work in perl or ruby:

~ > cat foo
'foo' =~ /^(.)((.)\2)/;
print $1,"\n";
print $2,"\n";
print $3,"\n";

perl foo

ruby foo
nil
nil
nil

but this does

~ > cat foo
'foo' =~ /^(.)((.)\3)/;
print $1,"\n";
print $2,"\n";
print $3,"\n";

perl foo
f
oo
o
ruby foo
f
oo
o

i guess what i am saying is that i don’t see how the original match ever had
valid semantics and that, if it worked, it would seem to imply a broken perl.

in any case - ruby’s behaviour seems correct.

While what’s happening makes sense, I’m wondering if it’s correct – how
deep should backreferences be nested and considered part of the process?

i think the only meaningful way is for each ‘(’ that is neither escaped nor
followed by ‘?:’ to begin the next numbered group (‘\n’ inside the regexp,
‘$n’ after the match), without limiting the depth.
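
The numbering rule described above can be checked with a short Ruby sketch. The helper name `captures_for` is made up for illustration; `Regexp#match` and `MatchData#captures` are standard Ruby.

```ruby
# Capture groups are numbered by the position of their opening parenthesis,
# however deeply they are nested; (?:...) groups are not numbered at all.
def captures_for(pattern, text)
  m = pattern.match(text)
  m ? m.captures : nil
end

# Three capturing groups: the inner (.) is group 3, so the backreference is \3.
nested = captures_for(/^(.)((.)\3)/, "foo")
p nested          # => ["f", "oo", "o"]

# Making the first group non-capturing shifts the numbers down by one,
# so the same inner (.) is now group 2.
non_capturing = captures_for(/^(?:.)((.)\2)/, "foo")
p non_capturing   # => ["oo", "o"]
```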

-a


On Wed, 30 Apr 2003, Austin Ziegler wrote:

Ara Howard
NOAA Forecast Systems Laboratory
Information and Technology Services
Data Systems Group
R/FST 325 Broadway
Boulder, CO 80305-3328
Email: ara.t.howard@fsl.noaa.gov
Phone: 303-497-7238
Fax: 303-497-7259
====================================

Austin,

I’m doing something that required RLE, and the code
that I translated from Perl to do this included the
following regexp:

/^(.?)((.)\2{2,127})(.?)$/ois
# Later code using $1 (.?), $2 (…), $3 (.?)

Thus, the translated version is:

/^(.?)((.)\3{2,127})(.?)$/m

While what’s happening makes sense, I’m wondering if
it’s correct – how deep should backreferences be
nested and considered part of the process?

Your basic problem is that the original regular expression did not work
in Perl. Perl and Ruby both number backreferences by the order of the left
parentheses in the expression (i.e. the first left parenthesis defines $1,
the second defines $2, etc.). Therefore the original regular expression
would never have matched, since the backreference \2 was not yet fully
defined where it appeared (i.e. it tried to define some sort of recursive
backreference):

%perl -e '"xyyyyyyyz" =~ /^(.?)((.)\2{2,127})(.?)$/; print "($1,$2,$3,$4)\n";'
(,)

Changing the backreference to \3 fixes it in Perl:

%perl -e '"xyyyyyyyz" =~ /^(.?)((.)\3{2,127})(.?)$/; print "($1,$2,$3,$4)\n";'
(x,yyyyyyy,y,z)

... as well as in Ruby:

%ruby -e '"xyyyyyyyz" =~ /^(.?)((.)\3{2,127})(.?)$/; print "(#{$1},#{$2},#{$3},#{$4})\n";'
(x,yyyyyyy,y,z)
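
One way to avoid counting parentheses altogether is a named backreference. This is only a sketch: it assumes Ruby 1.9 or later (named groups were not available in the Ruby of this thread), and the group names `before`, `run`, `char`, `after` and the helper `split_run` are invented for illustration.

```ruby
# With named groups the backreference names its target (\k<char>),
# so adding or removing other groups cannot silently change what it refers to.
RUN_PATTERN = /^(?<before>.?)(?<run>(?<char>.)\k<char>{2,127})(?<after>.?)$/m

# Returns [before, run, char, after], or nil when the text is not a single
# run of 3..128 identical characters with at most one character on each side.
def split_run(text)
  m = RUN_PATTERN.match(text)
  return nil unless m
  [m[:before], m[:run], m[:char], m[:after]]
end

p split_run("xyyyyyyyz")   # => ["x", "yyyyyyy", "y", "z"]
p split_run("xyz")         # => nil
```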

By the way, to match the original options ('ois') you would want 'oim'
in Ruby (Perl's 's' option, which lets '.' match a newline, is spelled 'm'
in Ruby). Since there are no embedded variables in the regular expression,
the ‘o’ option is meaningless. However, the ‘i’ option is ignore case,
which would seem to be a bad idea for RLE. My guess is that it is a
mistake.
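
A small Ruby sketch of both points about the options, using the corrected \3 pattern. The constant names and the sample strings are made up for illustration.

```ruby
# Ruby's /m lets '.' match a newline (the job Perl's /s does).
dot_plain     = /(.)\1{2}/.match("\n\n\n")    # nil: '.' refuses "\n"
dot_multiline = /(.)\1{2}/m.match("\n\n\n")   # matches the run of newlines

# With /i the backreference also ignores case, so a mixed-case sequence
# is treated as one run, which would corrupt run-length-encoded data.
CASE_SENSITIVE   = /^(.?)((.)\3{2,127})(.?)$/m
CASE_INSENSITIVE = /^(.?)((.)\3{2,127})(.?)$/mi

text      = "xaAaAz"
strict    = CASE_SENSITIVE.match(text)     # nil: no run of identical bytes
lenient   = CASE_INSENSITIVE.match(text)   # treats "aAaA" as a run
mixed_run = lenient && lenient[2]

p dot_plain, dot_multiline && dot_multiline[0]
p strict, mixed_run
```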

I hope this helps!

- Warren Brown

No, in fact, I wasn’t sure if it worked or not. It was what I found
in the Perl library I was using. I’m supposing that little, if any,
testing was done on the RunLength encoding filter provided, since
that’s what was in there. Oh well. Time to file a bug report (:

-austin
– Austin Ziegler, austin@halostatue.ca on 2003.04.30 at 10:29:07


On Wed, 30 Apr 2003 14:13:10 +0900, ahoward wrote:

On Wed, 30 Apr 2003, Austin Ziegler wrote:

I’m doing something that required RLE, and the code that I
translated from Perl to do this included the following regexp:

/^(.?)((.)\2{2,127})(.?)$/ois

austin - are you sure this works? it’s odd because

/^(.?)((.)\2{2,127})(.?)$/ois
^^^^^^^^^^^^^^
to me this seems to say, “the second match shall be composed of a
single char followed by 2 to 127 copies of the second match.” in other
words, it would seem to be recursive.