Regular expression mismatch?

Han,

Why does the following:

s = "aaa aaa\n\n\nbbb bbb"
puts(s =~ /^\s+$/)

produce 8 (instead of nil)?

    Because /^/ matches the beginning of a line (not just the beginning
of the string), and /\s/ matches whitespace, which includes newlines (\n).
So the first place in the string where the beginning of a line is
followed by one or more whitespace characters and then the end of a
line is at position 8.

(If I put in only 2 newlines, it's fine).

    With only two newlines, the /$/ prevents the match, since the only
newline that starts a line is followed directly by "b"s rather than by
the end of a line.

    I hope this helps.

    - Warren Brown
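
For anyone who wants to check this, here is a minimal sketch of the explanation above. The constant names are made up for the illustration; the strings and the pattern are the ones from the question.

```ruby
# The pattern from the question: a line consisting only of whitespace.
BLANK_LINE = /^\s+$/

THREE_NEWLINES = "aaa aaa\n\n\nbbb bbb"
TWO_NEWLINES   = "aaa aaa\n\nbbb bbb"

# =~ returns the offset of the first match, or nil when nothing matches.
p(THREE_NEWLINES =~ BLANK_LINE)   # => 8
p(TWO_NEWLINES =~ BLANK_LINE)     # => nil

# The matched text is a single newline: ^ sits just after the first "\n",
# \s+ takes the second "\n", and $ sits just before the third.
p(THREE_NEWLINES[BLANK_LINE])     # => "\n"
```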

Han,

[ cut ]

    Because /^/ matches the beginning of a line (not the beginning of
the string), and /\s/ matches whitespace, which includes newlines (\n).
So the first place in the string where the beginning of a line is
followed by one or more whitespaces is at position 8.

Thanks to all for the reactions.

It's not that simple: ^ _also_ matches the beginning of the string.
Perl does _not_ produce a match, unless you suffix the regular
expression with m.

Cheers,

Han
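
Both readings of ^ can be checked from Ruby itself. The sketch below lists every offset at which /^/ matches in the string from the original question; the helper name line_start_offsets is invented for this illustration and is not part of Ruby.

```ruby
# Hypothetical helper: collect every offset at which /^/ matches in str.
def line_start_offsets(str)
  offsets = []
  pos = 0
  # String#index(regexp, pos) returns the first match offset at or after pos.
  while (found = str.index(/^/, pos))
    offsets << found
    pos = found + 1
  end
  offsets
end

p(line_start_offsets("aaa aaa\n\n\nbbb bbb"))   # => [0, 8, 9, 10]
```

Offset 0 is the beginning of the string; 8, 9 and 10 each follow a newline. So in Ruby ^ matches in both kinds of place, which is why the pattern from the question can match at 8.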

···

On Apr 6, 2005 4:58 PM, Warren Brown <warrenb@timevision.com> wrote:

And so it's worth pointing out that in Ruby you should write:

    str.untaint if str =~ /\A[a-z0-9]*\z/ # good

and not:

    str.untaint if str =~ /^[a-z0-9]*$/ # HIGHLY DANGEROUS

It means that these sorts of regexp are a bit less readable than Perl's.

Regards,

Brian.
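
To see why the second form is dangerous, here is a sketch with a made-up hostile input. As far as I know, String#untaint is gone from current Ruby (taint checking was deprecated in 2.7 and removed in 3.2), so the sketch only checks whether each pattern accepts the string. The constant names are invented for the illustration.

```ruby
LINE_ANCHORED   = /^[a-z0-9]*$/     # the "HIGHLY DANGEROUS" form
STRING_ANCHORED = /\A[a-z0-9]*\z/   # the "good" form

# Made-up input: a harmless first line followed by something nasty.
HOSTILE = "abc123\nrm -rf /"

# ^ and $ are satisfied by the first line alone, so the whole string passes.
p(LINE_ANCHORED.match?(HOSTILE))      # => true

# \A and \z force the pattern to cover the entire string.
p(STRING_ANCHORED.match?(HOSTILE))    # => false
p(STRING_ANCHORED.match?("abc123"))   # => true
```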

···

On Thu, Apr 07, 2005 at 03:47:15PM +0900, Han Holl wrote:

On Apr 6, 2005 4:58 PM, Warren Brown <warrenb@timevision.com> wrote:
> Han,
[ cut ]
>
> Because /^/ matches the beginning of a line (not the beginning of
> the string), and /\s/ matches whitespace, which includes newlines (\n).
> So the first place in the string where the beginning of a line is
> followed by one or more whitespaces is at position 8.
>
Thanks for the reactions to all.

It's not that simple: ^ _also_ matches the beginning of the string.
Perl does _not_ produce a match, unless you suffix the regular
expression with m.

Which leaves the question: what is the meaning of the m suffix in Ruby?
It would seem that multi-line mode is on by default, with no means to switch it off.

Ruby should not be different from the other RE engines with no good reason.

Cheers,

Han Holl

···

On Apr 7, 2005 10:05 AM, Brian Candler <B.Candler@pobox.com> wrote:

And so it's worth pointing out that in Ruby you should write:

    str.untaint if str =~ /\A[a-z0-9]*\z/ # good

and not:

    str.untaint if str =~ /^[a-z0-9]*$/ # HIGHLY DANGEROUS

It means that these sorts of regexp are a bit less readable than Perl's.

Regards,

Brian.

Well, too late now, since not breaking existing scripts is good reason to
keep the present behavior.

···

On Thu, 07 Apr 2005 19:29:42 +0900, Han Holl wrote:

Ruby should not be different from the other RE engines with no good reason.

--
Neil Stevens - neil@hakubi.us

'A republic, if you can keep it.' -- Benjamin Franklin

Hi --

···

On Thu, 7 Apr 2005, Han Holl wrote:


Which leaves the question: what is the meaning of the m suffix in Ruby?

The /m suffix means that \n is included in . (dot).

David

--
David A. Black
dblack@wobblini.net
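
A short sketch of that, using an invented two-line string: /m changes what the dot matches and leaves ^ alone.

```ruby
TWO_LINES = "first\nsecond"

# Without /m the dot refuses to match the newline between the two words.
p(TWO_LINES =~ /first.second/)    # => nil

# With /m the dot matches any character, newline included.
p(TWO_LINES =~ /first.second/m)   # => 0

# ^ matches at the start of the second line with or without /m.
p(TWO_LINES =~ /^second/)         # => 6
p(TWO_LINES =~ /^second/m)        # => 6
```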

This is from man perlre:
       m Treat string as multiple lines. That is, change "^" and "$"
    from matching the start or end of the string to matching the
    start or end of any line anywhere within the string.

This should go on the gotchas page I've seen somewhere.
Perl regular expressions are quite widespread, and when Ruby deviates
from them it's easy to trip up.

Cheers,
Han Holl

···

On Apr 7, 2005 12:34 PM, David A. Black wrote:
> The /m suffix means that \n is included in . (dot).

Yes, looked it up in the Pickaxe, and indeed that's what it says.

Hi --

···

On Thu, 7 Apr 2005, Han Holl wrote:

On Apr 7, 2005 12:34 PM, David A. Black wrote:
> The /m suffix means that \n is included in . (dot).

Yes, looked it up in the Pickaxe, and indeed that's what it says.

This is from man perlre:
      m Treat string as multiple lines. That is, change "^" and "$"
   from matching the start or end of the string to matching the
   start or end of any line anywhere within the string.

This should go on the page I've seen somewhere with gotchas.
Perl RE is quite widespread, and when ruby deviates from it it's
easy to trip up.

Not if you use Ruby more and more :-)

David

--
David A. Black
dblack@wobblini.net

This problem occurred while porting a nasty old Perl program
to shiny new Ruby. I had been relying on Ruby's regular expressions
being Perl-compatible.

Han

···

On Apr 7, 2005 3:09 PM, David A. Black <dblack@wobblini.net> wrote:

Not if you use Ruby more and more :-)

Hi --

···

On Thu, 7 Apr 2005, Han Holl wrote:

On Apr 7, 2005 3:09 PM, David A. Black <dblack@wobblini.net> wrote:

Not if you use Ruby more and more :-)

This problem occurred while porting nasty old Perl program
to shiny new Ruby. I used to rely on ruby's re to be Perl
compatible.

I don't think they ever have been, at least not in the treatment of
all this line-ending stuff (and maybe a few other things).

David

--
David A. Black
dblack@wobblini.net

And I'm sure people who have relied on Perl REs being compatible with its
predecessors have been bitten by problems, too.

Regular expressions never really have been regular enough to make that
assumption, though.

···

On Thu, 07 Apr 2005 23:38:17 +0900, Han Holl wrote:

On Apr 7, 2005 3:09 PM, David A. Black <dblack@wobblini.net> wrote:

Not if you use Ruby more and more :-)

This problem occurred while porting nasty old Perl program
to shiny new Ruby. I used to rely on ruby's re to be Perl
compatible.

--
Neil Stevens - neil@hakubi.us

'A republic, if you can keep it.' -- Benjamin Franklin

Kind of interesting, and mightily adding to the confusion: Pickaxe2
calls the m option 'multi-line mode', meaning that dot matches newline.
Jeffrey E. F. Friedl, in the table of contents of Mastering Regular
Expressions, calls that same behavior 'Dot-matches-all match mode
(a.k.a., "single-line mode")'.
He uses 'multi-line mode' for the different interpretation of ^ and $.

Does anyone know if there is, or has been, a reason why Ruby chooses
to be different from the rest (Perl, Python, PHP, Apache, to name a few)?

Cheers,

Han

···

On Apr 7, 2005 3:45 PM, David A. Black <dblack@wobblini.net> wrote:

I don't think they ever have been, at least not in the treatment of
all this line-ending stuff (and maybe a few other things).

David

He's using the Perl convention. From man perlre:

      /m Treat string as multiple lines. That is, change "^" and "$" from
           matching the start or end of the string to matching the start or
           end of any line anywhere within the string.

[Ruby has this mode always enabled; you have to use \A and \z to match just
start and end of string. Perl has these too, but they're rarely used]

      /s Treat string as single line. That is, change "." to match any
           character whatsoever, even a newline, which normally it would not
           match.

[That's the same as Ruby's /m modifier, just to make things confusing]

Regards,

Brian.
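
For anyone porting Perl patterns, the correspondence above can be sketched in Ruby roughly as follows. The sample string is invented. The note that Perl's default $ also allows one trailing newline (and so is closer to \Z than to \z) comes from perlre rather than from this thread, so treat the mapping as a starting point, not a complete porting guide.

```ruby
TEXT = "one\ntwo\n"

# Perl /^two$/m -> Ruby /^two$/ (line anchors are always on in Ruby).
p(TEXT =~ /^two$/)           # => 4

# Perl /^two$/ without /m anchors to the whole string; the closest Ruby
# spelling is \A ... \Z, where \Z also allows one trailing newline.
p(TEXT =~ /\Atwo\Z/)         # => nil
p(TEXT =~ /\Aone\ntwo\Z/)    # => 0

# \z is stricter than \Z: it matches only at the very end of the string.
p(TEXT =~ /\Aone\ntwo\z/)    # => nil

# Perl /one.two/s -> Ruby /one.two/m (dot also matches newline).
p(TEXT =~ /one.two/m)        # => 0
```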

···

On Fri, Apr 08, 2005 at 12:39:02AM +0900, Han Holl wrote:

On Apr 7, 2005 3:45 PM, David A. Black <dblack@wobblini.net> wrote:
> I don't think they ever have been, at least not in the treatment of
> all this line-ending stuff (and maybe a few other things).
>
> David
Kind of interesting, and mightily adding to the confusion: Pickaxe2
calls the m option: 'multi-line mode', dot matches newline.
Jeffrey E. F. Friedl, in the content page of the Mastering book:
Dot-matches-all match mode (a.k.a., "single-line mode").
He calls multi-line mode the different interpretation of ^ and $.