Another strange regexp case

Hi,

here is another regexp behaviour which surprises me.
There may be some logic behind it, but I fail to see it...

irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

irb(main):003:0> /(theone)?/.match("theone").to_a
=> ["theone", "theone"]

irb(main):005:0> / (theone)?/.match(" theone").to_a
=> [" theone", "theone"]

In the first case, it doesn't match "theone", but in
the second and third it does...

Could anyone explain this?

Kristof

irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

When the regexp engine tries to match `t' it fails, because the first
character is ` ', and the regexp still succeeds because `theone' is optional

irb(main):003:0> /(theone)?/.match("theone").to_a
=> ["theone", "theone"]

it can match `theone' on its first try

Guy Decoux

irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

? means 'zero or one'

we start at the beginning of ' theone' and instantly find a match: zero of
them.

irb(main):003:0> /(theone)?/.match("theone").to_a
=> ["theone", "theone"]

same here.

irb(main):005:0> / (theone)?/.match(" theone").to_a
=> [" theone", "theone"]

same here. :wink:

remember regexp engines work (well, some of them) by starting at a position and
consuming chars while the pattern matches. iff all the pattern was used we
have a positive match, otherwise not. so in all these cases we start like so

   ' theone'
   ^
   ptr

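and drive with the regexp asking "does the regexp match starting here? if so
how many chars did it consume?" if it does not match here, the ptr moves one
char to the right and we ask again. the consumed chars are returned in $&, and
the captured groups in $1, $2, etc. in all the cases above this explains the
matching.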

note that some regexp engines work in the reverse sense but the effect is
largely the same...
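
To make the "start at a position and consume" description concrete, here is a small sketch that reports where each match starts and what it consumed. The helper name describe is made up for illustration; it only uses MatchData#begin and MatchData#[].

```ruby
# Hypothetical helper: report where the match starts and what it consumed.
def describe(re, str)
  m = re.match(str)
  return nil unless m
  { start: m.begin(0), matched: m[0], group1: m[1] }
end

# The optional group matches zero times at index 0, so the engine never advances.
p describe(/(theone)?/, " theone")

# Here the group can match once, right at index 0.
p describe(/(theone)?/, "theone")

# The literal space is consumed first, then the group matches once.
p describe(/ (theone)?/, " theone")

# Without the ?, nothing matches at index 0, so the engine moves on to index 1.
p describe(/theone/, " theone")
```

In the first three calls the match starts at index 0; only the last pattern, which cannot match the empty string, forces the engine to move forward.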

In the first case, it doesn't match "theone", but in the second and third it
does...

so it matched in all cases -- sometimes zero times, sometimes one time. this
is what you asked the regexp to do. i try to follow these rules when
composing regexps:

   - always use anchors ^ and $
   - never use anything that can match 'zero' things

it's the 'zero' thing that surprised you. your first two regexps match even
the empty string!

obviously this is not always possible but i will maintain this:

   if you create a regexp without anchors and with portions that can match zero
   things and have not done so out of absolute need - your code has a bug.
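
As a rough illustration of the two rules above, compare a loose pattern with an anchored one. The names LOOSE and STRICT are invented for this sketch, and the accepted input (an optional leading space followed by "theone") is only an assumption about what might be wanted. Note that in Ruby ^ and $ anchor at line boundaries, so the sketch uses \A and \z to anchor the whole string.

```ruby
# Hypothetical patterns, for illustration only.
LOOSE  = /(theone)?/       # no anchors, and the only element is optional
STRICT = /\A ?theone\z/    # anchored at both ends; "theone" itself is required

# The loose pattern is satisfied by anything, including the empty string.
p LOOSE.match?("")               # true
p LOOSE.match?("no match here")  # true

# The strict pattern accepts only the intended inputs.
p STRICT.match?("theone")        # true
p STRICT.match?(" theone")       # true
p STRICT.match?("")              # false
p STRICT.match?("xx theone yy")  # false
```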

kind regards.

-a

On Tue, 29 Jun 2004, Kristof Bastiaensen wrote:
--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

Kristof Bastiaensen wrote:

Hi,

Moin!

here is another regexp behaviour which surprises me.
There may be some logic behind it, but I fail to see it...
irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

I think that this is about how greediness in Regexps works:

A Regexp will try to match as much as possible starting at the current position, but even a "bad" match at the current position will be better than a "good" match at a later position in the String.

Maybe it would be possible to do a version of .match that finds the "best" (== longest, if greedy) match in the whole string. I assume that it would be based on .scan in some kind of way.
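
A minimal sketch of that idea, assuming "best" simply means the longest of the matches that String#scan finds. The method name longest_match is made up here. Since scan only reports non-overlapping matches from left to right, this is an approximation and not a search over every possible match.

```ruby
# Hypothetical helper: return the MatchData of the longest match that
# String#scan finds in str, or nil if the regexp never matches.
def longest_match(re, str)
  best = nil
  str.scan(re) do
    m = Regexp.last_match  # MatchData for the match scan just found
    best = m if best.nil? || m[0].length > best[0].length
  end
  best
end

m = longest_match(/(theone)?/, " theone")
p m.to_a    # ["theone", "theone"]
p m.begin(0)  # 1
```

On ties the earliest match wins, because a later match only replaces the current best when it is strictly longer.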

Regards,
Florian Gross

<snip>

irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

? means 'zero or one'

we start at the beginning of ' theone' and instantly find a match: zero of
them.

<snip>

   if you create a regexp without anchors and with portions that can match zero
   things and have not done so out of absolute need - your code has a bug.

Thanks for the answer. I expected the pattern to expand greedily,
but I forgot it will return the first match, which is the empty
match. You are right, /(theone)?/ is a silly thing to write;
in the end I just needed a different regexp for my problem.

Thanks,
Kristof

On Tue, 29 Jun 2004 11:05:43 -0600, Ara.T.Howard wrote:

"Kristof Bastiaensen" <kristof@vleeuwen.org> wrote in message
news:pan.2004.06.29.18.06.03.518152@vleeuwen.org...

<snip>
>> irb(main):004:0> /(theone)?/.match(" theone").to_a
>> => ["", nil]
>
> ? means 'zero or one'
>
> we start at the beginning of ' theone' and instantly find a match: zero of
> them.
<snip>
>
> if you create a regexp without anchors and with portions that can match zero
> things and have not done so out of absolute need - your code has a bug.

Thanks for the answer. I expected the pattern to expand greedily,
but I forgot it will return the first match, which is the empty
match. You are right, /(theone)?/ is a silly thing to write;
in the end I just needed a different regexp for my problem.

This is a case of the simple general rule "Watch out for regular
expressions that match the empty string". All sorts of problems can arise
when using them and usually you don't want to match an empty string
anyway.
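
One quick way to apply this rule is to test a pattern against the empty string before using it. The helper name matches_empty? is made up for this sketch, and the check is only a rough one: a pattern using lookarounds could still produce empty matches inside a non-empty string.

```ruby
# Hypothetical helper: does this regexp accept the empty string?
def matches_empty?(re)
  re.match?("")
end

p matches_empty?(/(theone)?/)   # true:  the whole pattern is optional
p matches_empty?(/ (theone)?/)  # false: the leading space is required
p matches_empty?(/theone/)      # false

# One of the problems such patterns cause: scan reports empty matches
# at every position where the optional part is absent.
p " theone".scan(/(?:theone)?/)  # ["", "theone", ""]
```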

Kind regards

    robert
