Oniguruma lookbehind question

Hi --

For some reason, lookbehind and alternation seem not to be playing
together in a little Oniguruma test. This is based on the string
splitting thread from a little while ago this evening, and uses a CVS
1.9.0 Ruby acquired about 1/2 an hour ago.

   str = %Q{abc def "ghi jkl" mno}

   # Look for "..." but just get the ... part:
   re1 = /(?<=")[^"]+(?=")/

   # Test that:
   p str.scan(re1) # => ["ghi jkl"]

   # Now, do the same thing *or* \S+. This should, I think,
   # pick up the abc, def, and mno substrings too.

   re2 = /((?<=")[^"]+(?="))|(\S+)/

   # But it doesn't; the part before the alternation never
   # matches, even though it did before (as shown by the
   # captures):

   p str.scan(re2)
   # => [[nil, "abc"], [nil, "def"], [nil, "\"ghi"], [nil, "jkl\""],
   # [nil, "mno"]]

I know that's all a bit cluttered, but the basic thing is that a
sub-pattern using lookbehind doesn't seem to match any more when
there's an alternation. Instead, only the second alternative ever
matches.

Does anyone know why?

David

···

--
David A. Black
dblack@wobblini.net

"Ruby for Rails", from Manning Publications, coming April 2006!

Does this pattern work for you?

str = %Q{abc def "ghi jkl" mno}
re3 = /((?<=")[^"]+(?="))|([\S&&[^"]]+)/
p str.scan(re3) #=> [[nil, "abc"], [nil, "def"], ["ghi jkl", nil], [nil, "mno"]]

···

--
K.Kosako

I'm not at all sure about this, but this is my take on it. Firstly, is this the behaviour you expected?

  str = %Q{abc def "ghi jkl" mno}
  re2 = /(?:(?<=")[^"]+(?="))|\S+/

  p str.scan(re2)
  # => ["abc", "def", "\"ghi", "jkl\"", "mno"]

?

If so, then I believe the problem is something to do with the fact that lookaround is atomic. When it's used with capturing groups and alternations you sometimes run into problems, because the regex engine immediately forgets the (zero-width, remember) lookaround match, so by the time it comes to that 'or' it no longer has the information to compare.

Generally, there are restrictions with lookaround (esp lookbehind) matching, and especially when matching regexps. So far my experiments with Oniguruma suggest it's fairly sophisticated in this respect, supporting stuff like varying-width alternations, fixed repetition and optional groups in lookbehind, but of course still no star and plus.
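One way to check claims like these against whatever engine a given Ruby was built with is simply to try compiling the patterns. This is only a minimal sketch: the helper name `compiles?` and the two sample patterns are made up for illustration, and the results may well differ between Oniguruma/Onigmo versions.

```ruby
# Hypothetical helper: true if the regex engine accepts the pattern source.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# Lookbehind with top-level alternatives of different widths.
alternation_ok = compiles?('(?<=a|bc)d')

# Lookbehind containing an unbounded repetition (plus).
plus_ok = compiles?('(?<=a+)d')

puts "alternation in lookbehind accepted: #{alternation_ok}"
puts "plus in lookbehind accepted:        #{plus_ok}"
```

Passing the pattern as a string to `Regexp.new` matters here: an invalid regexp literal would stop the whole script at parse time, whereas `Regexp.new` raises a `RegexpError` that can be rescued.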

Anyway, that's what I think. Hope it helps :-)

Cheers,

--
Ross Bamford - rosco@roscopeco.remove.co.uk

It shouldn't. Since pattern matching goes left to right, \S will match the quote before the first half of the regexp gets a chance, because the first half wants to start matching at the first character _after_ the quote.
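To watch this happen offset by offset, each alternative can be tried on its own, anchored with \G at a chosen position. This is just an illustrative sketch; the names `left`, `right` and `quote_pos` are not from the original code.

```ruby
str = %Q{abc def "ghi jkl" mno}

# Each alternative on its own, anchored with \G so that it must match
# exactly at the offset passed to String#match.
left  = /\G(?<=")[^"]+(?=")/
right = /\G\S+/

quote_pos = str.index('"')   # offset of the opening quote

# At the quote itself the lookbehind fails (the previous character is
# a space), so only the right-hand alternative can match there.
p str.match(left,  quote_pos)          # => nil
p str.match(right, quote_pos)[0]       # => "\"ghi"

# One character later the left-hand alternative would match, but scan
# never starts an attempt there: \S+ has already consumed that offset.
p str.match(left, quote_pos + 1)[0]    # => "ghi jkl"
```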

-- fxn

···

On Jan 7, 2006, at 3:56, dblack@wobblini.net wrote:

  # Now, do the same thing *or* \S+. This should, I think,
  # pick up the abc, def, and mno substrings too.

  re2 = /((?<=")[^"]+(?="))|(\S+)/

  # But it doesn't; the part before the alternation never
  # matches,

Oops, I mean nested regexps.

···

On Sat, 07 Jan 2006 03:36:34 -0000, Ross Bamford <rosco@roscopeco.remove.co.uk> wrote:

Generally, there are restrictions with lookaround (esp lookbehind) matching, and especially when matching regexps.

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Hi --

Does this pattern work for you?

str = %Q{abc def "ghi jkl" mno}
re3 = /((?<=")[^"]+(?="))|([\S&&[^"]]+)/
p str.scan(re3) #=> [[nil, "abc"], [nil, "def"], ["ghi jkl", nil], [nil, "mno"]]

No; I get this:

[[nil, "\""], ["ghi jkl", nil], [nil, "\""]]

This:

   /(?<=")[^"]+(?=")|[^\s"]+/

gives me the result you got for your re3. But it still seems to be
based on the right-hand alternate being checked first.
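For reference, here is a runnable version of that last pattern, together with an input where it behaves less well. The name `re4` and the second test string are only labels for this sketch.

```ruby
str = %Q{abc def "ghi jkl" mno}
re4 = /(?<=")[^"]+(?=")|[^\s"]+/

# No capture groups, so scan returns plain strings.
p str.scan(re4)
# => ["abc", "def", "ghi jkl", "mno"]

# Caveat: with more than one quoted section, the text *between* a
# closing quote and the next opening quote also sits between two quotes,
# so the left-hand alternative picks it up, spaces included.
p %q{"a" b "c"}.scan(re4)
# => ["a", " b ", "c"]
```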

David

···

On Sat, 7 Jan 2006, K.Kosako wrote:

dblack@wobblini.net wrote:

--
David A. Black
dblack@wobblini.net

Argh, I meant 'if not'. I'm too tired now, I've spent too long perfecting this quiz thing...
I was guessing it was dropping the match without the capture and so always matching the alternate but that's not right.

Sorry.

···

On Sat, 07 Jan 2006 03:36:34 -0000, Ross Bamford <rosco@roscopeco.remove.co.uk> wrote:

If so, then I believe the problem is something to do with the fact that lookaround is atomic

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Hi --

···

On Sat, 7 Jan 2006, Xavier Noria wrote:

On Jan 7, 2006, at 3:56, dblack@wobblini.net wrote:

# Now, do the same thing *or* \S+. This should, I think,
# pick up the abc, def, and mno substrings too.

re2 = /((?<=")[^"]+(?="))|(\S+)/

# But it doesn't; the part before the alternation never
# matches,

It shouldn't. Since pattern matching goes left to right, \S will match the quote before the first half of the regexp gets a chance, because the first half wants to start matching at the first character _after_ the quote.

OK, I see. I was somehow discounting the fact that the first "
*itself* doesn't match the left-hand alternate.

Thanks --

David

--
David A. Black
dblack@wobblini.net

Hi --

···

On Sat, 7 Jan 2006, K.Kosako wrote:

Does this pattern work for you?

str = %Q{abc def "ghi jkl" mno}
re3 = /((?<=")[^"]+(?="))|([\S&&[^"]]+)/
p str.scan(re3) #=> [[nil, "abc"], [nil, "def"], ["ghi jkl", nil], [nil, "mno"]]

Sorry -- I tested that accidentally with an old 1.9.0. Yes, with
today's I do get the same as you.

(And I also understand why it's choosing the right-hand alternate :-)
See later posts in thread.)

David

--
David A. Black
dblack@wobblini.net

Hi --

If so, then I believe the problem is something to do with the fact that lookaround is atomic

Argh, I meant 'if not'. I'm too tired now, I've spent too long perfecting this quiz thing...
I was guessing it was dropping the match without the capture and so always matching the alternate but that's not right.

See Xavier's post. My mistake was, essentially, expecting the first "
to "know" that it was supposed to match a zero-width condition
governing the state one character later. Instead, of course, it
asserts itself as a character in its own right; fails to match the
first alternate; and does match the second.

So [^\s"]+ is indeed probably the best thing. (Other than the
appropriate real libraries, of course :-)
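The post doesn't say which libraries it has in mind. One candidate from Ruby's standard library is `Shellwords`, which splits on whitespace while respecting shell-style quoting. A sketch under that assumption:

```ruby
require 'shellwords'

str = %Q{abc def "ghi jkl" mno}

# Shellwords.split follows POSIX-shell word-splitting rules, so the
# double-quoted section comes back as one word, without the quotes.
p Shellwords.split(str)
# => ["abc", "def", "ghi jkl", "mno"]
```

Note that shell rules also give meaning to single quotes and backslashes, which may or may not be what a given input format wants.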

David

···

On Sat, 7 Jan 2006, Ross Bamford wrote:

On Sat, 07 Jan 2006 03:36:34 -0000, Ross Bamford <rosco@roscopeco.remove.co.uk> wrote:

--
David A. Black
dblack@wobblini.net

Oh damn it, yeah, I see now. Wish I'd held my tongue now :-D

--
Ross Bamford - rosco@roscopeco.remove.co.uk