Regular expression mismatch?

Warren_Brown1 · 6 April 2005 14:58

Han,

> Why does the following:
>
> s = "aaa aaa\n\n\nbbb bbb"
> puts(s =~ /^\s+$/)
>
> produce: 8 (instead of nil) ?

Because /^/ matches the beginning of a line (not just the beginning of
the string), and /\s/ matches whitespace, which includes newlines (\n).
So the first place in the string where the beginning of a line is
followed by one or more whitespace characters is position 8, the second
newline: /\s+/ consumes that newline and /$/ matches just before the
third one.

> (If I put in only 2 newlines, it's fine).

With only two newlines, the /$/ prevents the match, since there are
"b"s following the newline.

I hope this helps.

- Warren Brown
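
Warren's explanation can be checked directly. The following is a minimal sketch; the two helper names are made up for this illustration and are not part of any library.

```ruby
# Hypothetical helpers, named only for this illustration.

# Line anchors: ^ and $ match at the start and end of every line.
def line_anchor_index(str)
  str =~ /^\s+$/
end

# String anchors: \A and \z match only at the very start and very end.
def string_anchor_index(str)
  str =~ /\A\s+\z/
end

three = "aaa aaa\n\n\nbbb bbb"
two   = "aaa aaa\n\nbbb bbb"

p line_anchor_index(three)     # => 8
p line_anchor_index(two)       # => nil
p string_anchor_index(three)   # => nil
p string_anchor_index(" \n ")  # => 0
```

With three newlines there is an empty line in the middle of the string, so the line-anchored pattern finds it; the string-anchored pattern only succeeds when the whole string is whitespace.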

Han_Holl · 7 April 2005 06:47

On Apr 6, 2005 4:58 PM, Warren Brown <warrenb@timevision.com> wrote:
> Han,
> [ cut ]
>
> Because /^/ matches the beginning of a line (not the beginning of
> the string), and /\s/ matches whitespace, which includes newlines (\n).
> So the first place in the string where the beginning of a line is
> followed by one or more whitespaces is at position 8.

Thanks to all for the reactions.

It's not that simple: ^ _also_ matches the beginning of the string.
Perl does _not_ produce a match unless you suffix the regular
expression with m.

Cheers,

Han

Brian_Candler · 7 April 2005 08:05

And so it's worth pointing out that in Ruby you should write:

str.untaint if str =~ /\A[a-z0-9]*\z/ # good

and not:

str.untaint if str =~ /^[a-z0-9]*$/ # HIGHLY DANGEROUS

It means that these sorts of regexp are a bit less readable than Perl's.
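
Brian's warning can be seen with a two-line input. This is a minimal sketch; `loose_valid?` and `strict_valid?` are hypothetical names used only here, and the payload string is an invented example.

```ruby
# Hypothetical validators, named only for this illustration.

# Line anchors: it is enough for ONE line of the input to be clean.
def loose_valid?(str)
  !(str =~ /^[a-z0-9]*$/).nil?
end

# String anchors: the WHOLE input must be clean.
def strict_valid?(str)
  !(str =~ /\A[a-z0-9]*\z/).nil?
end

payload = "abc\nrm -rf /"    # invented example: clean first line, then anything

p loose_valid?(payload)      # => true
p strict_valid?(payload)     # => false
p strict_valid?("abc123")    # => true
```

The line-anchored check accepts the payload because its first line matches, which is why it is unsafe as a gate for whole-string validation.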

Regards,

Brian.

Han_Holl · 7 April 2005 09:29

Which leaves the question: what is the meaning of the m suffix in Ruby?
It would seem that multi-line mode is on by default, with no means to switch it off.

Ruby should not differ from the other RE engines without good reason.

Cheers,

Han Holl

Neil_Stevens · 7 April 2005 09:40

Well, too late now, since not breaking existing scripts is good reason to
keep the present behavior.

On Thu, 07 Apr 2005 19:29:42 +0900, Han Holl wrote:

> Ruby should not be different from the other RE engines with no good reason.

--
Neil Stevens - neil@hakubi.us

'A republic, if you can keep it.' -- Benjamin Franklin

David_A_Black3 · 7 April 2005 10:34

Hi --

On Thu, 7 Apr 2005, Han Holl wrote:

> Which leaves the question: what is the meaning of the m suffix in Ruby?

The /m suffix means that \n is included in . (dot).
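
A short check of that behaviour, as a sketch; `spans_newline?` is a hypothetical helper name introduced only for this example.

```ruby
# Hypothetical helper: does the pattern match across the newline in "a\nb"?
def spans_newline?(pattern)
  !("a\nb" =~ pattern).nil?
end

p spans_newline?(/a.b/)    # => false (dot stops at "\n")
p spans_newline?(/a.b/m)   # => true  (dot also matches "\n")

# The flag behind /m is exposed as Regexp::MULTILINE.
p((/a.b/m.options & Regexp::MULTILINE) != 0)   # => true

# /m does not change ^ and $; they are line anchors with or without it.
p("x\ny" =~ /^y$/)         # => 2
p("x\ny" =~ /^y$/m)        # => 2
```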

David

--
David A. Black
dblack@wobblini.net

Han_Holl · 7 April 2005 12:59

On Apr 7, 2005 12:34 PM, David A. Black wrote:
> The /m suffix means that \n is included in . (dot).

Yes, looked it up in the Pickaxe, and indeed that's what it says.

This is from man perlre:

    m   Treat string as multiple lines. That is, change "^" and "$"
        from matching the start or end of the string to matching the
        start or end of any line anywhere within the string.

This should go on the gotchas page I've seen somewhere.
Perl RE is quite widespread, and when Ruby deviates from it, it's
easy to trip up.

Cheers,
Han Holl

David_A_Black3 · 7 April 2005 13:09

Hi --

On Thu, 7 Apr 2005, Han Holl wrote:

> This should go on the page I've seen somewhere with gotchas.
> Perl RE is quite widespread, and when ruby deviates from it it's
> easy to trip up.

Not if you use Ruby more and more

David

--
David A. Black
dblack@wobblini.net

Han_Holl · 7 April 2005 13:38

This problem occurred while porting a nasty old Perl program
to shiny new Ruby. I had been relying on Ruby's regular
expressions being Perl-compatible.

Han

···

On Apr 7, 2005 3:09 PM, David A. Black <dblack@wobblini.net> wrote:

Not if you use Ruby more and more

David_A_Black3 · 7 April 2005 13:45

Hi --

On Thu, 7 Apr 2005, Han Holl wrote:

> This problem occurred while porting a nasty old Perl program
> to shiny new Ruby. I had been relying on Ruby's regular
> expressions being Perl-compatible.

I don't think they ever have been, at least not in the treatment of
all this line-ending stuff (and maybe a few other things).

David

--
David A. Black
dblack@wobblini.net

Neil_Stevens · 7 April 2005 15:34

And I'm sure people who have relied on Perl REs being compatible with its
predecessors have been bitten by problems, too.

Regular expressions never really have been regular enough to make that
assumption, though.

On Thu, 07 Apr 2005 23:38:17 +0900, Han Holl wrote:

> This problem occurred while porting a nasty old Perl program
> to shiny new Ruby. I had been relying on Ruby's regular
> expressions being Perl-compatible.

--
Neil Stevens - neil@hakubi.us

'A republic, if you can keep it.' -- Benjamin Franklin

Han_Holl · 7 April 2005 15:39

Kind of interesting, and adding mightily to the confusion: Pickaxe2
calls the m option 'multi-line mode', meaning that dot matches newline.
Jeffrey E. F. Friedl, in the contents page of Mastering Regular
Expressions, calls that behaviour
'Dot-matches-all match mode (a.k.a., "single-line mode")'.
He uses 'multi-line mode' for the different interpretation of ^ and $.

Does anyone know if there is, or has been, a reason why Ruby chooses
to be different from the rest (Perl, Python, PHP, Apache, to name a few)?

Cheers,

Han

···

On Apr 7, 2005 3:45 PM, David A. Black <dblack@wobblini.net> wrote:

I don't think they ever have been, at least not in the treatment of
all this line-ending stuff (and maybe a few other things).

David

Brian_Candler · 8 April 2005 08:11

He's using the Perl convention. From man perlre:

      /m Treat string as multiple lines. That is, change "^" and "$" from
           matching the start or end of the string to matching the start or
           end of any line anywhere within the string.

[Ruby has this mode always enabled; you have to use \A and \z to match just
start and end of string. Perl has these too, but they're rarely used]

      /s Treat string as single line. That is, change "." to match any
           character whatsoever, even a newline, which normally it would not
           match.

[That's the same as Ruby's /m modifier, just to make things confusing]
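
For anyone porting Perl patterns, the mapping Brian describes can be sketched as follows. The helper name is hypothetical, and the use of \Z as the closest Ruby counterpart of Perl's default `$` (end of string, or just before a final newline) is how this example chooses to approximate it.

```ruby
# Hypothetical helper, named only for this illustration.
# Approximates what Perl's /^\s+$/ (without /m) asks for: the whole
# string, optionally followed by a single trailing newline.
def perl_default_style(str)
  str =~ /\A\s+\Z/
end

s = "aaa aaa\n\n\nbbb bbb"

p(s =~ /^\s+$/)            # => 8   (Ruby: ^ and $ are always line anchors)
p perl_default_style(s)    # => nil (string anchors, as Perl does by default)

# \Z tolerates one trailing newline; \z does not.
p("abc\n" =~ /\Aabc\Z/)    # => 0
p("abc\n" =~ /\Aabc\z/)    # => nil

# Perl's /s corresponds to Ruby's /m: dot also matches "\n".
p("a\nb" =~ /a.b/m)        # => 0
```

For validation, \z is the stricter choice, since \Z still lets a trailing newline through.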

Regards,

Brian.
