#scan with or'd (`|`) subexpressions

T.,

Does the new Ruby regexp engine do this?

irb(main):001:0> '1234'.scan(/(1)(2)|(3)(4)/)
=> [["1", "2", nil, nil], [nil, nil, "3", "4"]]
irb(main):002:0>

Why would all the subexpressions be listed when there
is an `|` (or) used?

    For collecting matches, Ruby simply looks at opening parentheses -
nothing else. The part of the string matched by the regular expression
delimited by the first open parenthesis and its matching close
parenthesis will be the first match, the second opening parenthesis and
its matching close parenthesis will define the next match, etc.
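
A minimal sketch of the numbering rule described above, assuming a Ruby where `String#scan` and `Regexp#match` behave as in the irb session quoted in this thread. The variable names (`re`, `rows`, `m`) are illustrative only.

```ruby
# Groups are numbered by the position of their opening parenthesis,
# regardless of which side of the `|` they sit on.
re = /(1)(2)|(3)(4)/

rows = '1234'.scan(re)
# Every row has one slot per group in the pattern (four here);
# groups in the alternative that did not take part are nil.
rows.each { |row| p row }

# MatchData uses the same numbering for a single match.
m = re.match('34')
p m[1]  # group 1 belongs to the first alternative
p m[3]  # group 3 belongs to the second alternative
```

Because the slot for each group is fixed by the pattern, the same index always refers to the same pair of parentheses, whichever alternative matched.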

    I have not yet had an opportunity to play with Oniguruma, so I can't
say definitively if it behaves the same way. However I would be very
surprised if it didn't, since virtually every other language behaves
this way.

    I hope this helps.

    - Warren Brown

T.,

> Does the new Ruby regexp engine do this?
>
> irb(main):001:0> '1234'.scan(/(1)(2)|(3)(4)/)
> => [["1", "2", nil, nil], [nil, nil, "3", "4"]]
> irb(main):002:0>
>
> Why would all the subexpressions be listed when there
> is an `|` (or) used?

    For collecting matches, Ruby simply looks at opening parentheses -
nothing else. The part of the string matched by the regular expression
delimited by the first open parenthesis and its matching close
parenthesis will be the first match, the second opening parenthesis and
its matching close parenthesis will define the next match, etc.

I see. Perhaps there is good reason for this. But I just don't see it. In
practice it causes me to have to strip out a whole lot of nils from the
results. Honestly, I can't see how it makes any sense. The regexp will match
on the first "or" that succeeds, right? So all the others are by necessity
nil. But perhaps I'm overlooking some possibility.

    I have not yet had an opportunity to play with Oniguruma, so I can't
say definitively if it behaves the same way. However I would be very
surprised if it didn't, since virtually every other language behaves
this way.

Neither have I. But I do hope Oniguruma is better than "every other". By the
way, have you read about Perl 6's new RE engine? I must say it looks pretty
sweet. These are definitely not your average everyday Regular Expressions.
It now allows you to create your own rules, encapsulate them, and reuse
them -- much more like a grammar parser.

Thanks,
T.

On Thursday 11 November 2004 10:54 am, Warren Brown wrote:

I see. Perhaps there is good reason for this. But I just don't see it. In
practice it causes me to have to strip out a whole lot of nils from the
results. Honestly, I can't see how it makes any sense. The regexp will match
on the first "or" that succeeds, right? So all the others are by necessity
nil. But perhaps I'm overlooking some possibility.

If those nils are stripped, you lose information about which "or" succeeded. In some cases that is not important, because it does not matter where the captures come from as long as they are interchangeable, but that is certainly not always so. It also makes interpretation of the captures very difficult when the different "ors" have a different number of captures:

   /(?:(1)(a)|(-))(?:(2)|(b)(\+))/

With nils stripped, this can return ['1','a','2'],['1','a','b','+'],['-','2'] or ['-','b','+']. If each capture needs a different treatment, there's no way to relate the correct treatment to the index in the array of captures.
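
To make the point above concrete, here is a small sketch that runs the pattern against four sample strings and prints the captures with and without the nils. The `+` is escaped as `\+` so that the pattern compiles; the sample strings and the helper name `show_captures` are made up for illustration.

```ruby
RE = /(?:(1)(a)|(-))(?:(2)|(b)(\+))/

# Returns [captures with nils, captures with nils stripped] for one string.
def show_captures(str)
  caps = RE.match(str).captures
  [caps, caps.compact]
end

%w[1a2 1ab+ -2 -b+].each do |s|
  full, stripped = show_captures(s)
  # With nils kept, index 2 is always the "-" group (or nil).
  # After compact, index 2 may hold "2", "b", "+", or nothing at all.
  puts "#{s.ljust(5)} full=#{full.inspect} compact=#{stripped.inspect}"
end
```

With the nils kept, a capture's index identifies its group; after `compact`, the same index can mean different things from one match to the next.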

Peter

Hi --

Well, I'm not sure how it helps. What I ended up doing was making sure all my
expressions did have the _same number_ of sub-expressions (in this case 7).
So then I could count the preceding nils and divide by 7 to find out which
match. But that's a hack IMHO.
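
The workaround described above could look roughly like the following sketch. It uses two groups per alternative instead of seven to keep it short, and it assumes the first group of every alternative always participates in the match; otherwise counting leading nils gives the wrong answer. The names `GROUPS_PER_ALT` and `which_alternative` are hypothetical.

```ruby
GROUPS_PER_ALT = 2  # hypothetical; the post above used 7
PATTERN = /(a)(b)|(c)(d)|(e)(f)/

# Given one row from String#scan, return the index of the alternative
# that matched and the slice of captures that belongs to it.
def which_alternative(row, groups_per_alt = GROUPS_PER_ALT)
  leading_nils = row.take_while(&:nil?).size
  alt = leading_nils / groups_per_alt
  [alt, row[alt * groups_per_alt, groups_per_alt]]
end

'cdabef'.scan(PATTERN).each do |row|
  alt, caps = which_alternative(row)
  puts "alternative #{alt}: #{caps.inspect}"
end
```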

Matz, yourself, and others in the past have all mentioned being able to figure out
which match, but how?

Thanks,
T.

On Thursday 11 November 2004 11:57 am, Peter wrote:

> I see. Perhaps there is good reason for this. But I just don't see it. In
> practice it causes me to have to strip out a whole lot of nils from the
> results. Honestly, I can't see how it makes any sense. The regexp will
> match on the first "or" that succeeds, right? So all the others are by
> necessity nil. But perhaps I'm overlooking some possibility.

If those nils are stripped, you lose information about which "or"
succeeded. In some cases that is not important in that it does not matter
where the captures come from as long as they are interchangeable, but that
is certainly not always so. Also it makes interpretation of the captures
very difficult when the different "ors" have a different number of
captures:

   /(?:(1)(a)|(-))(?:(2)|(b)(\+))/

With nils stripped, this can return
['1','a','2'],['1','a','b','+'],['-','2'] or ['-','b','+']. If each
capture needs a different treatment, there's no way to relate the correct
treatment to the index in the array of captures.

[snip]

Well, I'm not sure how it helps. What I ended up doing was making sure all
my expressions did have the _same number_ of sub-expressions (in this case
7). So then I could count the preceding nils and divide by 7 to find out
which match. But that's a hack IMHO.

Matz, yourself, and others in the past have all mentioned being able to figure out
which match, but how?

I don't know if this helps..

bash-2.05b$ ruby a.rb
"lab0" => [["0", nil, nil]]
"version1-beta" => [[nil, "1", nil]]
"go2ruby" => [[nil, nil, "2"]]
"1 goto 1" => [[nil, "1", nil], [nil, "1", nil]]
"2 1 0" => [[nil, nil, "2"], [nil, "1", nil], ["0", nil, nil]]
bash-2.05b$ expand -t2 a.rb
def s(str)
  m = str.scan(/(0)|(1)|(2)/)
  puts "#{str.inspect.ljust(15)} => #{m.inspect}"
end
s "lab0"
s "version1-beta"
s "go2ruby"
s "1 goto 1"
s "2 1 0"
bash-2.05b$
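
Building on the script above, one possible way to turn the nil positions into an answer to "which alternative matched" is to look up the index of the non-nil slot in each row. This is only a sketch: the `LABELS` table and the `tagged_scan` helper are made up for illustration, and it assumes one capture group per alternative, as in `a.rb`.

```ruby
LABELS = [:zero, :one, :two]  # hypothetical labels, one per alternative

# Scan with the same pattern as a.rb and tag each match with the label
# of the alternative whose group captured something.
def tagged_scan(str)
  str.scan(/(0)|(1)|(2)/).map do |row|
    i = row.index { |c| !c.nil? }  # position of the group that matched
    [LABELS[i], row[i]]
  end
end

p tagged_scan("2 1 0")
p tagged_scan("version1-beta")
```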

On Friday 12 November 2004 03:55, trans. (T. Onoma) wrote:

--
Simon Strandgaard