Regexp and $

Brian_Candler · 27 April 2003 11:41

I seem to remember some discussion about regexps recently, including Perl
regexps versus Ruby regexps.

I am actually in favour of Ruby regexps pretty much as they are, but I have
one gripe: it doesn’t seem to be possible to make “$” match “end of string”
rather than “before a newline”.

I wrote a regexp which I thought would match trailing \n’s in a string:

a.gsub!(/\n+$/,'')

This is pretty standard stuff in Perl:

perl -e '$a="abc\ndef\n"; $a =~ s/\n+$//; print "<$a>\n";'

But it doesn’t work in Ruby. Adding /m doesn’t help either. It took a fair
bit of head-scratching and digging around for me to discover that to match
the end of a string you must use ‘\z’.
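A minimal sketch of the ‘\z’ approach, for comparison with the Perl one-liner above. The helper name strip_trailing_newlines is made up for illustration. Because ‘\z’ anchors only at the very end of the string, the sketch should not depend on how a particular Ruby version treats ‘$’ next to a final newline.

```ruby
# Hypothetical helper: remove every newline at the very end of a string.
# \z matches only at the end of the string, never before an inner newline.
def strip_trailing_newlines(str)
  str.sub(/\n+\z/, "")
end

a = "abc\ndef\n\n"
p strip_trailing_newlines(a)   # => "abc\ndef"
```

The newline between "abc" and "def" is kept because ‘\z’ cannot match there.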

In Perl, I believe “^” and “$” match the start and end of the string only,
unless you have multiline mode enabled.

A bit of playing around shows:

           ^ matches          $ matches          . matches
           ---------          ---------          ---------
Ruby //    start of string    end of string      any char except \n
           or after \n        or before \n
                              [note]

Ruby //m   start of string    end of string      any char
           or after \n        or before \n

Perl //    start of string    end of string      any char except \n

Perl //m   start of string    end of string      any char except \n
           or after \n        or before \n

Perl //s   start of string    end of string      any char

Perl //ms  start of string    end of string      any char
           or after \n        or before \n

So it looks like Ruby’s “normal” mode is like Perl’s “multiline” mode, and
there is no way to get Perl’s “normal” mode in Ruby. Ruby’s “multiline” mode
is really Perl’s “/ms” mode.
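The two parts of that comparison which do not involve a trailing newline can be checked with a short sketch. The helper match_index is a made-up name; it just returns the index of the first match, or nil.

```ruby
# Illustrative helper (name is an assumption): index of first match, or nil.
def match_index(pattern, str)
  pattern =~ str
end

s = "abc\ndef"

# With no flag, $ already matches before the inner newline (Perl needs /m).
p match_index(/abc$/, s)    # => 0

# With no flag, . does not match a newline.
p match_index(/c.d/, s)     # => nil

# With /m, . matches a newline as well (Perl's /s).
p match_index(/c.d/m, s)    # => 2
```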

[note]
Actually, that’s not even true. Perl’s $ will match end-of-string after a
newline even in multiline mode:

$ perl -e '$a="abc\n"; $a =~ s/c\n$//m; print "<$a>\n";'

But Ruby can’t.

$ ruby -e 'a="abc\n"; a.gsub!(/c\n$/,""); puts "<#{a}>";'
<abc
>

This means the description “$ matches end of string or before \n” is wrong
for Ruby. In fact, I’m not sure what exactly $ matches. Can anybody give
me a description of what $ matches in Ruby?

I think this difference in behaviour is unfortunate, as I consider that “^”
and “$” are part of the lowest-common-denominator of regexp behaviour which
ought to be reasonably portable.

Regards,

Brian.

Brian_Candler · 27 April 2003 12:04

The closest description I can come up with is:

In Ruby, ‘$’ matches:

- before a newline
- at the end of the string, UNLESS the last character of the string is a
  newline, in which case it doesn’t match there.

irb(main):001:0> a = "abc\n\n"
=> "abc\n\n"
irb(main):002:0> a.gsub(/$/,'z')
=> "abcz\nz\n"
irb(main):003:0> a = "abc"
=> "abc"
irb(main):004:0> a.gsub(/$/,'z')
=> "abcz"

I wonder why this was chosen rather than just “‘$’ matches before a newline
and at the end of the string”?

I guess if you do gsub(/$/,‘foo’) to add text to the end of a \n-terminated
line then it kind of makes sense. But it does stop you matching /\n$/ if you
want to explicitly eat up newlines - and breaks compatibility with other
languages.

‘^’ doesn’t have this issue:

irb(main):005:0> b = "\nabc"
=> "\nabc"
irb(main):006:0> b.gsub(/^/,'z')
=> "z\nzabc"

So ‘^’ always matches at the start of the string and after a newline.
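The same observation as a small sketch, with ‘\A’ added for contrast. ‘\A’ is not mentioned in the thread; it is Ruby’s anchor for the start of the string only. Both helper names are made up for illustration.

```ruby
# Hypothetical helpers: insert a marker at every line start (^)
# or only at the start of the string (\A).
def mark_line_starts(str, marker = "z")
  str.gsub(/^/, marker)
end

def mark_string_start(str, marker = "z")
  str.sub(/\A/, marker)
end

b = "\nabc"
p mark_line_starts(b)    # => "z\nzabc"
p mark_string_start(b)   # => "z\nabc"
```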

Regards,

Brian.

On Sun, Apr 27, 2003 at 08:41:04PM +0900, I wrote:

> This means the description “$ matches end of string or before \n” is wrong
> for Ruby. In fact, I’m not sure what exactly $ matches. Can anybody give
> me a description of what $ matches in Ruby?

ts1 · 27 April 2003 12:05

> This means the description "$ matches end of string or before \n" is wrong
> for Ruby. In fact, I'm not sure *what* exactly $ matches. Can anybody give
> me a description of what $ matches in Ruby?

$ matches before \n, or at the end of the string if the last character is not \n

> I think this difference in behaviour is unfortunate, as I consider that "^"
> and "$" are part of the lowest-common-denominator of regexp behaviour which
> ought to be reasonably portable.

Why use the broken P implementation ?

Guy Decoux

Brian_Candler · 27 April 2003 12:17

I guess I am used to:

s/[\r\n\s]+$//;   # Strip trailing spaces and newlines

Somehow it seems right to me that $ matches “end of string”, perhaps because
a long time ago I did a course which had a more formal definition of regular
expressions, and $ was used to mean end of input stream.
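For comparison, a hedged Ruby counterpart of that Perl idiom, anchored with ‘\z’ instead of ‘$’. The helper name is made up; the character class is shortened to \s because \s already covers \r and \n.

```ruby
# Hypothetical Ruby counterpart of the Perl idiom s/[\r\n\s]+$//.
# \z pins the match to the very end of the string.
def strip_trailing_whitespace(str)
  str.sub(/\s+\z/, "")
end

p strip_trailing_whitespace("abc \r\n\n")   # => "abc"
p strip_trailing_whitespace("a b\nc")       # => "a b\nc"
```

The built-in String#rstrip covers much the same ground without a regexp.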

FWIW, awk agrees with Perl:

$ awk -- 'BEGIN { a="abc\n"; gsub("$","z",a); print a }' </dev/null
abc
z

So it seems to me this is an anomaly introduced by Ruby.

Regards,

Brian.

On Sun, Apr 27, 2003 at 09:05:29PM +0900, ts wrote:

> > I think this difference in behaviour is unfortunate, as I consider that “^”
> > and “$” are part of the lowest-common-denominator of regexp behaviour which
> > ought to be reasonably portable.
>
> Why use the broken P implementation ?

ts1 · 27 April 2003 12:23

> FWIW, awk agrees with Perl:

Retrieve the test for regexp from rubicon and run it against ruby, perl,
python, awk, sed, rx and you'll see how many different implementations of
regexp exist.

Guy Decoux

David_A_Black2 · 27 April 2003 12:31

Hi --

On Sun, 27 Apr 2003, Brian Candler wrote:

> FWIW, awk agrees with Perl:
>
> $ awk -- 'BEGIN { a="abc\n"; gsub("$","z",a); print a }' </dev/null
> abc
> z
>
> So it seems to me this is an anomaly introduced by Ruby.

sed treats $ line-wise:

$ echo -e "abc\nabc" | sed -e 's/abc$/def/'
def
def

David

--
David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Robert · 28 April 2003 06:39

Did you look at \z, \Z and option “m” (multiline mode)?
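For reference, a small sketch of how ‘\z’ and ‘\Z’ differ. The helper name and the hash keys are made up for illustration: ‘\z’ matches only at the very end of the string, while ‘\Z’ also matches just before a final newline.

```ruby
# Hypothetical helper: does a "c" sit at the end of the string
# according to \z (strict) and \Z (allows one final newline)?
def end_anchor_matches(str)
  {
    strict: !(str =~ /c\z/).nil?,    # \z: very end of string only
    lenient: !(str =~ /c\Z/).nil?    # \Z: end, or just before a final "\n"
  }
end

p end_anchor_matches("abc")     # strict is true,  lenient is true
p end_anchor_matches("abc\n")   # strict is false, lenient is true
```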

robert

“Brian Candler” B.Candler@pobox.com wrote in message
news:20030427121720.GA38491@uk.tiscali.com…

> On Sun, Apr 27, 2003 at 09:05:29PM +0900, ts wrote:
>
> > > I think this difference in behaviour is unfortunate, as I consider that “^”
> > > and “$” are part of the lowest-common-denominator of regexp behaviour which
> > > ought to be reasonably portable.
> >
> > Why use the broken P implementation ?
>
> I guess I am used to:
>
> s/[\r\n\s]+$//;   # Strip trailing spaces and newlines
>
> Somehow it seems right to me that $ matches “end of string”, perhaps because
> a long time ago I did a course which had a more formal definition of regular
> expressions, and $ was used to mean end of input stream.
>
> FWIW, awk agrees with Perl:
>
> $ awk -- 'BEGIN { a="abc\n"; gsub("$","z",a); print a }' </dev/null
> abc
> z
>
> So it seems to me this is an anomaly introduced by Ruby.
>
> Regards,
>
> Brian.

Brian_Candler · 28 April 2003 09:00

Err, yes. I think perhaps you missed my posting which started the thread, at
http://ruby-talk.org/70243

“Adding /m doesn’t help either. It took a fair
bit of head-scratching and digging around for me to discover that to
match the end of a string you must use ‘\z’.”

Regards,

Brian.

On Mon, Apr 28, 2003 at 03:39:58PM +0900, Robert wrote:

> Did you look at \z, \Z and option “m” (multiline mode)?

Robert · 28 April 2003 14:21

“Brian Candler” B.Candler@pobox.com wrote in message
news:20030428100050.B53856@linnet.org…

> On Mon, Apr 28, 2003 at 03:39:58PM +0900, Robert wrote:
>
> > Did you look at \z, \Z and option “m” (multiline mode)?
>
> Err, yes. I think perhaps you missed my posting which started the thread, at
> http://ruby-talk.org/70243
>
> “Adding /m doesn’t help either. It took a fair
> bit of head-scratching and digging around for me to discover that to
> match the end of a string you must use ‘\z’.”

Yes, I missed that part, sorry.

robert
