Regex negative look-behind bug?

Ruby_Nuby · 23 November 2010 14:36

irb, Ruby 1.9.1

What am I missing here?

"b T T W b".match(/(?<!t t|a b) w/i)
=> nil

#The second look-behind is now just a
"b T T W b".match(/(?<!t t|a) w/i)
=> #<MatchData " W">

#Regex stays the same, the T T are now in lower case
"b t t W b".match(/(?<!t t|a) w/i)
=> nil

#Look-behind only contains the t t condition now, and T T are back to upper case
"b T T W b".match(/(?<!t t) w/i)
=> nil

--
Posted via http://www.ruby-forum.com/.

Ammar_Ali · 23 November 2010 15:36

No bug here. It is doing exactly what you asked: only match a w if it is not
preceded by 't t'. In all cases the w is preceded by 't t', and in the case
that did match (?<!t t|a), the w was preceded by a 't t' but not an 'a', as
you asked, so it did match.

"b Y T W b".match( /(?<!t t) w/i )

=> #<MatchData " W">

Regards,
Ammar

Ruby_Nuby · 24 November 2010 16:37

Ammar, Robert,

Thank you both for your healthy discussions. I'm glad that I'm not crazy
and you guys agree that it's probably a bug or a very very special
feature

You guys understand the underlying issue and implications much better
than I do. I think it'd be better if one of you reported this instead of
me. Please don't fight over it.

Thanks again.

Robert_K1 · 23 November 2010 15:55

That was an alternative! If the RX in the lookbehind can match, the
negative lookbehind must fail IMHO.
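
To make the alternation rule concrete without involving /i at all, here is a minimal sketch (the test strings are made up for illustration). A negative lookbehind rejects a position as soon as any one of its alternatives matches the text just before it:

```ruby
# A negative lookbehind with alternation rejects a position when ANY
# alternative matches the text immediately before it (no /i involved).
rx = /(?<!t t|a) w/

p "b t t w b".match(rx)      # "t t" precedes " w", so no match: nil
p "b a w b".match(rx)        # "a" precedes " w", so no match: nil
p "b x y w b".match(rx)[0]   # neither alternative precedes " w": " w"
```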

There is a problem with the match though. I suspect there is an issue
with case sensitivity propagation:

irb(main):009:0> "b T T W b".match(/(?<!t t|a) w/i)
=> #<MatchData " W">
irb(main):010:0> "b T T W b".match(/(?i:<!t t|a) w/i)
=> nil

irb(main):013:0> RUBY_VERSION
=> "1.9.1"
irb(main):014:0> RUBY_PATCHLEVEL
=> 430

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Robert_K1 · 25 November 2010 08:41

> Thank you both for your healthy discussions. I'm glad that I'm not crazy
> and you guys agree that it's probably a bug or a very very special
> feature

You're welcome.

> You guys understand the underlying issue and implications much better
> than I do. I think it'd be better if one of you reported this instead of
> I. Please don't fight over it

Done.

http://redmine.ruby-lang.org/issues/show/4088

Cheers

robert

Ammar_Ali · 25 November 2010 12:44

You're welcome. I'm fascinated by Oniguruma (Ruby's regex engine) so this is
much fun for me.

Regards,
Ammar

Ammar_Ali · 23 November 2010 16:12

> That was an alternative! If the RX in the lookbehind can match, the
> negative lookbehind must fail IMHO.

The thing is what's in the lookbehind, and all assertions for that matter,
is not really a regular expression. It is a fixed length literal. The only
exception, AFAIK, is character sets because they are also fixed length. The
engine needs to know how many characters to step back and examine.

Also the first alternative that matches wins. Here it is in lower case and
without ignoring case:

"b t t w b".match( /(?<!t t|a) w/ )

=> nil

> There is a problem with the match though. I suspect there is an issue
> with case sensitivity propagation
>
> irb(main):009:0> "b T T W b".match(/(?<!t t|a) w/i)
> => #<MatchData " W">
> irb(main):010:0> "b T T W b".match(/(?i:<!t t|a) w/i)
> => nil

That's not a valid assertion any more, it is now an options specification.

"b <!t t w b".match( /(?i:<!t t|a) w/ )

=> #<MatchData "<!t t w">
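
A small sketch of the difference, with made-up test strings: `(?i:...)` is a non-capturing group that switches the i option on for its contents, so the `<!` inside it is ordinary literal text, whereas `(?<!...)` is the zero-width assertion.

```ruby
# (?i:...) is a non-capturing group with the i option switched on,
# so "<!t t" inside it is plain literal text, not a lookbehind.
option_group = /(?i:<!t t|a) w/
lookbehind   = /(?<!t t|a) w/

p option_group.match("b <!T T w b")[0]  # literal "<!T T" then " w": "<!T T w"
p option_group.match("b T T w b")       # no literal "<!t t" or "a" before " w": nil
p lookbehind.match("b x w b")[0]        # a real assertion consumes nothing: " w"
```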

> irb(main):013:0> RUBY_VERSION
> => "1.9.1"
> irb(main):014:0> RUBY_PATCHLEVEL
> => 430

I initially tried the cases with 1.9.2, but I tried the above with the
latest 1.9.1 on my system (a bit older).

RUBY_VERSION

=> "1.9.1"

RUBY_PATCHLEVEL

=> 378

Regards,
Ammar

Ammar_Ali · 25 November 2010 12:50

> Done.
>
> http://redmine.ruby-lang.org/issues/show/4088

Thanks. I was too busy to follow up yesterday.

I'm glad you reported it because I'm still on the fence about it being a
bug. I think the negative match (when it matches it doesn't) and alternation
are confusing in this case.

IMHO, the following examples prove that ignoring case works as expected, but
it's difficult to verify this when alternation is added to the mix.

Expected, not ignoring case

"abcd" =~ /(?<!bc)d/

nil

Expected, case differs, and it's not being ignored

"aBcd" =~ /(?<!bc)d/

3

Expected, case differs, but it's being ignored

"aBcd" =~ /(?<!bc)d/i

nil

Adding alternation is playing a part in either making it hard to tell which
part is matching, or not respecting the i option.
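
One way to tell which alternative is doing the blocking is to try each one on its own, since the single-alternative cases above behave as expected. This is only a sketch: `blocking_alternatives` is a hypothetical helper, not something from the thread, and it assumes `tail` actually occurs in the string (otherwise every alternative would be reported).

```ruby
# Hypothetical helper (not from the thread): try each lookbehind alternative
# on its own, with /i, and report which ones prevent `tail` from matching.
# Assumes `tail` occurs somewhere in `string`.
def blocking_alternatives(string, alternatives, tail)
  alternatives.select do |alt|
    rx = Regexp.new("(?<!#{Regexp.escape(alt)})#{tail}", Regexp::IGNORECASE)
    string.match(rx).nil?
  end
end

p blocking_alternatives("b T T W b", ["t t", "a"], " w")  # ["t t"]
p blocking_alternatives("b Y T W b", ["t t", "a"], " w")  # []
```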

Thanks again,
Ammar

Robert_K1 · 24 November 2010 10:57

>> That was an alternative! If the RX in the lookbehind can match, the
>> negative lookbehind must fail IMHO.
>
> The thing is what's in the lookbehind, and all assertions for that matter,
> is not really a regular expression. It is a fixed length literal. The only
> exception, AFAIK, is character sets because they are also fixed length. The
> engine needs to know how many characters to step back and examine.

Docs say that the regexp cannot be unlimited. But it is by far not
only a fixed length literal. "|" is certainly meta in an assertion -
the second line would not match if the lookbehind assertion was a
literal.

str = ["bc", "abc", "a|bc", "a\\|bc"]
rxs = [/(?<=ab)c/,/(?<=a|b)c/,/(?<=a\|b)c/]

str.each do |s|
  rxs.each do |r|
    printf "%-10s %-15p %p\n", s, r, s.scan(r)
  end
end

Docs even say "In negative-look-behind, captured group isn't allowed,
but shy group(?:) is allowed." So it's a regexp, albeit a limited one.

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt
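
For reference, a few individual cases in the spirit of the script above, written out as a sketch with the results one would expect from any Ruby that supports lookbehind. They illustrate that an unescaped "|" inside the assertion is alternation, while "\|" is a literal pipe:

```ruby
# "|" inside a lookbehind is alternation; "\|" is a literal pipe.
p "abc".scan(/(?<=ab)c/)     # "ab" precedes "c": ["c"]
p "abc".scan(/(?<=a|b)c/)    # "b" precedes "c": ["c"]
p "abc".scan(/(?<=a\|b)c/)   # no literal "a|b" before "c": []
p "a|bc".scan(/(?<=a\|b)c/)  # literal "a|b" precedes "c": ["c"]
p "a|bc".scan(/(?<=ab)c/)    # "|b" precedes "c", not "ab": []
```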

> Also the first alternative that matches wins. Here it is in lower case and
> without ignoring case:
>
> "b t t w b".match( /(?<!t t|a) w/ )
>
> => nil
>
>> There is a problem with the match though. I suspect there is an issue
>> with case sensitivity propagation
>>
>> irb(main):009:0> "b T T W b".match(/(?<!t t|a) w/i)
>> => #<MatchData " W">
>> irb(main):010:0> "b T T W b".match(/(?i:<!t t|a) w/i)
>> => nil
>
> That's not a valid assertion any more, it is now an options specification.
>
> "b <!t t w b".match( /(?i:<!t t|a) w/ )
>
> => #<MatchData "<!t t w">

Right, apparently we cannot have options in assertions.

>> irb(main):013:0> RUBY_VERSION
>> => "1.9.1"
>> irb(main):014:0> RUBY_PATCHLEVEL
>> => 430
>
> I initially tried the cases with 1.9.2, but I tried the above with the
> latest 1.9.1 on my system (a bit older).
>
> RUBY_VERSION
>
> => "1.9.1"
>
> RUBY_PATCHLEVEL
>
> => 378

The root issue still exists

irb(main):014:0> "a ac".scan /(?<!a a|b)c/i
=> []
irb(main):015:0> "A Ac".scan /(?<!a a|b)c/i
=> ["c"]
irb(main):016:0> "ac".scan /(?<!a|b)c/i
=> []
irb(main):017:0> "Ac".scan /(?<!a|b)c/i
=> []

Statement 15 should not yield any results, in the same way as 17 does not.
Apparently /i breaks if there is an alternative ("|") in
conjunction with more than one char in one alternative:

Fails (more than 1 char AND alternative)

irb(main):018:0> "aac".scan /(?<!aa|b)c/i
=> []
irb(main):019:0> "AAc".scan /(?<!aa|b)c/i
=> ["c"]
irb(main):020:0> "Aac".scan /(?<!aa|b)c/i
=> ["c"]
irb(main):021:0> "aAc".scan /(?<!aa|b)c/i
=> ["c"]

Works (more than 1 char OR alternative):

irb(main):022:0> "aac".scan /(?<!aa)c/i
=> []
irb(main):023:0> "aAc".scan /(?<!aa)c/i
=> []
irb(main):024:0> "Aac".scan /(?<!aa)c/i
=> []
irb(main):025:0> "AAc".scan /(?<!aa)c/i
=> []
irb(main):026:0> "ac".scan /(?<!a)c/i
=> []
irb(main):027:0> "Ac".scan /(?<!a)c/i
=> []
irb(main):028:0> "ac".scan /(?<!a|b)c/i
=> []
irb(main):029:0> "Ac".scan /(?<!a|b)c/i
=> []

IMHO this is a bug.
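
Until the behaviour is sorted out, one possible workaround is to avoid relying on /i inside the assertion and spell out both cases as character classes instead. This is a sketch only: `either_case` is a hypothetical helper, not part of Ruby or of the thread, and it only handles ASCII letters.

```ruby
# Hypothetical helper (not from the thread): spell out both cases of each
# ASCII letter as a character class, so the lookbehind does not depend on /i.
def either_case(text)
  text.each_char.map { |ch|
    ch =~ /[a-zA-Z]/ ? "[#{ch.downcase}#{ch.upcase}]" : Regexp.escape(ch)
  }.join
end

lookbehind = ["a a", "b"].map { |alt| either_case(alt) }.join("|")
rx = Regexp.new("(?<!#{lookbehind})#{either_case('c')}")

p "a ac".scan(rx)   # "a a" precedes "c": []
p "A Ac".scan(rx)   # "A A" precedes "c": []
p "x xc".scan(rx)   # neither alternative precedes "c": ["c"]
```

The pattern gets noisier, but each alternative stays fixed length, which is what the lookbehind needs.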

Kind regards

robert

Robert_K1 · 29 November 2010 08:22

It's fixed already.

http://redmine.ruby-lang.org/issues/show/4088

Cheers

robert

Ammar_Ali · 24 November 2010 12:32

>> The thing is what's in the lookbehind, and all assertions for that matter,
>> is not really a regular expression. It is a fixed length literal. The only
>> exception, AFAIK, is character sets because they are also fixed length. The
>> engine needs to know how many characters to step back and examine.
>
> Docs say that the regexp cannot be unlimited. But it is by far not
> only a fixed length literal. "|" is certainly meta in an assertion -
> the second line would not match if the lookbehind assertion was a
> literal.

Yes, please excuse the terseness of my last response. I wrote it as I was
rushing out the door.

What I meant, but did not properly clarify, is: the contents of assertions
are not *full* expressions. They can not contain quantifiers, they can not
contain captures, and they can not include backreferences or anything that
can complicate determining the length of the contents. Obviously alternation
is allowed, since that's what we were discussing. However, only as long as
the alternatives abide by the limitations.

Ruby's regular expression engine is quite flexible in this regard, as it
allows the alternatives to be of different lengths, unlike some other
engines that require them to be of the same length.
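
A quick way to probe these limits on whatever Ruby is at hand is to see which pattern sources compile. This is a sketch: `compiles?` is a hypothetical helper, and the last two probes are left without a claimed result because the answer depends on the engine version in use.

```ruby
# Hypothetical helper: does this pattern source compile on the running Ruby?
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p compiles?('(?<!t t|a) w')     # alternatives of different lengths: true
p compiles?('(?<!aa|b)c')       # true
p compiles?('(?<![tT] [tT])w')  # character classes are fixed length: true

# Patterns described above as off limits; the engine in use decides,
# so no result is claimed here.
p compiles?('(?<!a+)c')         # quantifier inside the lookbehind
p compiles?('(?<!(a))c')        # capture group inside a negative lookbehind
```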

----8<----

> The root issue still exists
>
> irb(main):014:0> "a ac".scan /(?<!a a|b)c/i
> => []
> irb(main):015:0> "A Ac".scan /(?<!a a|b)c/i
> => ["c"]
> irb(main):016:0> "ac".scan /(?<!a|b)c/i
> => []
> irb(main):017:0> "Ac".scan /(?<!a|b)c/i
> => []

----8<----

> IMHO this is a bug.

OK, now that we've eliminated the syntax and the double-negative confusion,
I see the issue clearly. Thank you for your patience

It might be a bug, but since the contents of assertions do not go through
the full eval/exec cycle of "regular" regular expressions, this could be
just another limitation of assertions. It might be difficult to figure out
the last options in effect because they can be inserted multiple times in an
expression, on their own (from here on) and they can be nested. Which of
these would be used? Maybe just use the top level options? That can
potentially introduce more confusion.
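
To illustrate the two ways of placing options mentioned here, a short sketch with made-up strings and no assertions involved: an option "on its own" applies from that point onward, while an option group applies only inside its parentheses.

```ruby
# An option "on its own" applies from that point to the end of the
# enclosing group; an option group applies only inside its parentheses.
p "aB" =~ /a(?i)b/      # 0   (i is on for "b" only)
p "AB" =~ /a(?i)b/      # nil (the leading "a" is still case sensitive)
p "aBc" =~ /a(?i:b)c/   # 0
p "aBC" =~ /a(?i:b)c/   # nil (i is off again after the group)
```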

Anyway, it's definitely worth reporting. Worst case, we'll find out it's a
limitation, and best case, it will end up being a feature request, if not a
bug.

Is the OP able/willing to report this?

http://redmine.ruby-lang.org/

Regards,
Ammar

Ammar_Ali · 29 November 2010 09:03

Hallelujah!

Cheers,
Ammar

Robert_K1 · 24 November 2010 13:56

> Yes, please excuse the terseness of my last response. I wrote it as I was
> rushing out the door.

Probably not the best thing to do. I know. It has happened to me as well.

> OK, now that we've eliminated the syntax and the double-negative confusion,
> I see the issue clearly. Thank you for your patience

YWC.

> It might be a bug, but since the contents of assertions do not go through
> the full eval/exec cycle of "regular" regular expressions, this could be
> just another limitation of assertions. It might be difficult to figure out
> the last options in effect because they can be inserted multiple times in an
> expression, on their own (from here on) and they can be nested. Which of
> these would be used? Maybe just use the top level options? That can
> potentially introduce more confusion.

I don't see any difference in finding out options to other grouping
constructs: the innermost surrounding flags should be used. Every
other rule would be utmost confusing.

irb(main):002:0> "aBc".scan /(?i:a(?:b)c)/
=> ["aBc"]

irb(main):005:0> "abCde".scan /(?-i:a(?i:b(?:c)d)e)/i
=> ["abCde"]
irb(main):008:0> "abCde".scan /a(?i:b(?:c)d)e/
=> ["abCde"]
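
The same rule can be checked in the other direction, where an inner group switches i off again. A minimal sketch with made-up strings, assuming nothing beyond ordinary option groups:

```ruby
# The innermost enclosing option wins, whether it turns i on or off.
p "ABc" =~ /a(?-i:b)c/i         # nil (b must be lower case inside (?-i:))
p "AbC" =~ /a(?-i:b)c/i         # 0
p "aBCd" =~ /a(?i:b(?-i:c)d)/   # nil (c must be lower case)
p "aBcD" =~ /a(?i:b(?-i:c)d)/   # 0
```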

> Anyway, it's definitely worth reporting. Worst case, we'll find out it's a
> limitation, and best case, it will end being a feature request, if not a
> bug.

I vote for "bug".

> Is the OP able/willing to report this?
>
> http://redmine.ruby-lang.org/

Please, do.

Cheers

robert
