Possible regular expression

Ruby's regular expression engine appears to act incorrectly when given a
non-greedy match-range of the form {m,n}?

Take this example:

"Age: 21" =~ /Age.{0,60}: ([\w]+)/

This returns 0, as expected, and $1 is set to "21".

However:

"Age: 21" =~ /Age.{0,60}?: ([\w]+)/

This returns nil and $1 is set to nil.

I believe the greedy and non-greedy cases should be equivalent in this
case, but are not.
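
A minimal sketch of why both forms are expected to match here, assuming an engine with ordinary backtracking semantics. The extra capture group around the range is added purely for illustration and is not part of the original pattern; it shows how much text the range consumed.

```ruby
s = "Age: 21"

# Same patterns as above, with the range wrapped in a group (illustrative only)
# so that the text it consumed can be inspected.
greedy = s.match(/Age(.{0,60}): (\w+)/)
lazy   = s.match(/Age(.{0,60}?): (\w+)/)

# The only way to reach ": " right after "Age" is for the range to consume
# nothing, so greedy and lazy should end up with the same match.
[["greedy", greedy], ["lazy", lazy]].each do |label, m|
  if m
    puts "#{label}: range consumed #{m[1].inspect}, value #{m[2].inspect}"
  else
    puts "#{label}: no match"
  end
end
```

On an unaffected engine both lines should report an empty string for the range and "21" for the value.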

I've included a tarball with two files, one written in perl and the
other in ruby, performing this match. The perl script acts as expected.

Apologies if this is a known bug that I have been unable to find on
RubyForge, or if this is expected behavior. If it is the former I would
appreciate anyone who could point me at the bug listing, and if it is
the latter, I would appreciate enlightenment on the reason for this
behavior. Otherwise, I would appreciate it if someone could
verify that this behavior is a bug, and I will file it.

Thanks

Attachments:
http://www.ruby-forum.com/attachment/1854/re.tar

···

--
Posted via http://www.ruby-forum.com/.

> Ruby's regular expression engine appears to act incorrectly when given a
> non-greedy match-range of the form {m,n}?
>
> Take this example:
>
> "Age: 21" =~ /Age.{0,60}: ([\w]+)/
>
> This returns 0, as expected and $1 is set to "21"
>
> However:
>
> "Age: 21" =~ /Age.{0,60}?: ([\w]+)/
>
> This returns nil and $1 is set to nil.

This seems like a bug, given:
s = "Age: 21"
s =~ /Age.*: (\w+)/ #=> 0
s =~ /Age.*?: (\w+)/ #=> 0
s =~ /Age.{0,60}: (\w+)/ #=> 0
s =~ /Age.{0,60}?: (\w+)/ #=> nil

(Perhaps you were paring down a real-world test case; did you know
that you can simply use \w+ instead of [\w]+ to match one or more word
characters? And that \d may be more appropriate, matching only digit
characters?)

My simple experiments make me believe this is an edge case that occurs
specifically when:
a) a range is non-greedy,
b) it matches any-char (.),
c) it has a lower limit of 0,
d) and it must match 0 times to succeed.

Here's my test data, with analysis following.

  s = "abbc"
  # Each row holds a greedy pattern followed by its non-greedy twin.
  %w|
    ab{1,9}c ab{1,9}?c
    abb{1,9}c abb{1,9}?c
    abbb{1,9}c abbb{1,9}?c
    ab{0,9}c ab{0,9}?c
    abb{0,9}c abb{0,9}?c
    abbb{0,9}c abbb{0,9}?c
    a.{1,9}c a.{1,9}?c
    ab.{1,9}c ab.{1,9}?c
    abb.{1,9}c abb.{1,9}?c
    a.{0,9}c a.{0,9}?c
    ab.{0,9}c ab.{0,9}?c
    abb.{0,9}c abb.{0,9}?c
  |.each_with_index{ |pattern,i|
    regex = Regexp.new( pattern )
    # Print the index, the pattern, and the match position (nil on failure).
    puts "%2i %-15s %s" % [
      i, regex.inspect, (s =~ regex).inspect
    ]
  }

  #=>  0 /ab{1,9}c/      0
  #=>  1 /ab{1,9}?c/     0
  #=>  2 /abb{1,9}c/     0
  #=>  3 /abb{1,9}?c/    0
  #=>  4 /abbb{1,9}c/    nil
  #=>  5 /abbb{1,9}?c/   nil
  #=>  6 /ab{0,9}c/      0
  #=>  7 /ab{0,9}?c/     0
  #=>  8 /abb{0,9}c/     0
  #=>  9 /abb{0,9}?c/    0
  #=> 10 /abbb{0,9}c/    0
  #=> 11 /abbb{0,9}?c/   0
  #=> 12 /a.{1,9}c/      0
  #=> 13 /a.{1,9}?c/     0
  #=> 14 /ab.{1,9}c/     0
  #=> 15 /ab.{1,9}?c/    0
  #=> 16 /abb.{1,9}c/    nil
  #=> 17 /abb.{1,9}?c/   nil
  #=> 18 /a.{0,9}c/      0
  #=> 19 /a.{0,9}?c/     0
  #=> 20 /ab.{0,9}c/     0
  #=> 21 /ab.{0,9}?c/    0
  #=> 22 /abb.{0,9}c/    0
  #=> 23 /abb.{0,9}?c/   nil

In the above, we would expect patterns 4, 5, 16 and 17 to fail, but
not 23.

Notable is that pattern #15 succeeds (showing that a non-greedy range
matching any-char can match exactly its lower-limit number of times)
and that pattern #11 succeeds (showing that a non-greedy range matching
a specific char can match zero times).
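
The pairwise comparison above can be automated. In a backtracking engine, making a range non-greedy should only change which match is preferred, not whether a match exists, so any pair that disagrees about success is suspect. A minimal sketch under that assumption; disagreeing_pairs is a hypothetical helper name, not part of any library.

```ruby
# Return the greedy pattern sources whose non-greedy twin disagrees about
# whether the subject matches at all.
def disagreeing_pairs(subject, greedy_sources)
  greedy_sources.select do |src|
    greedy = Regexp.new(src)
    # Turn the first {m,n} into {m,n}? to build the non-greedy twin.
    lazy = Regexp.new(src.sub("}", "}?"))
    (subject =~ greedy).nil? != (subject =~ lazy).nil?
  end
end

# The greedy half of each pair from the test data above.
sources = %w[
  ab{1,9}c  abb{1,9}c  abbb{1,9}c
  ab{0,9}c  abb{0,9}c  abbb{0,9}c
  a.{1,9}c  ab.{1,9}c  abb.{1,9}c
  a.{0,9}c  ab.{0,9}c  abb.{0,9}c
]

p disagreeing_pairs("abbc", sources)
```

An unaffected engine should print an empty array. Going by the results listed above, 1.8.6 would be expected to report only abb.{0,9}c.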

···

On May 5, 7:42 pm, James Sanders <james.sand...@colorado.edu> wrote:

I forgot to note, in my previous reply, that my test results are
against 1.8.6:
ruby 1.8.6 (2007-09-24 patchlevel 111) [i686-darwin9.1.0]

Ruby v1.9 (using a different regexp engine, "Oniguruma") does not
suffer from the same problem.
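
To compare interpreters quickly, a small probe along these lines could be run under each one. It is only a sketch: lazy_range_bug? is a hypothetical name, and the probe checks just the single pattern from the original report.

```ruby
# Hypothetical probe: true when the non-greedy {0,n}? range fails to match
# a string that it should match by consuming zero characters.
def lazy_range_bug?
  ("Age: 21" =~ /Age.{0,60}?: (\w+)/).nil?
end

puts "ruby #{RUBY_VERSION}: lazy {0,n}? range " +
     (lazy_range_bug? ? "fails (affected)" : "matches (not affected)")
```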

···

On May 5, 7:42 pm, James Sanders <james.sand...@colorado.edu> wrote:

> Ruby's regular expression engine appears to act incorrectly when given a
> non-greedy match-range of the form {m,n}?

Rubinius and JRuby don't seem to suffer from it either.

Chris

···

On May 5, 8:59 pm, Phrogz <phr...@mac.com> wrote:

> On May 5, 7:42 pm, James Sanders <james.sand...@colorado.edu> wrote:
>
> > Ruby's regular expression engine appears to act incorrectly when given a
> > non-greedy match-range of the form {m,n}?
>
> I forgot to note, in my previous reply, that my test results are
> against 1.8.6:
> ruby 1.8.6 (2007-09-24 patchlevel 111) [i686-darwin9.1.0]
>
> Ruby v1.9 (using a different regexp engine, "Oniguruma") does not
> suffer from the same problem.

Thank you, Gavin and Chris, for your verification. Gavin, you are right
that it is pared down from a real problem where a character class and
alphanumerics were necessary. Thank you for your much better examples.
I'll file a bug report against 1.8.6.

-James
