Regexp and $

I seem to remember some discussion about regexps recently, including Perl
regexps versus Ruby regexps.

I am actually in favour of Ruby regexps pretty much as they are, but I have
one gripe: it doesn’t seem to be possible to make “$” match “end of string”
rather than “before a newline”.

I wrote a regexp which I thought would match trailing \n’s in a string:

a.gsub!(/\n+$/,'')

This is pretty standard stuff in Perl:

perl -e '$a="abc\ndef\n"; $a =~ s/\n+$//; print "<$a>\n";'

But it doesn’t work in Ruby. Adding /m doesn’t help either. It took a fair
bit of head-scratching and digging around for me to discover that to match
the end of a string you must use ‘\z’.
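
For anyone hitting the same problem, here is a minimal sketch of the `\z` approach. The variable names are illustrative only; the point is that `\z` anchors on the absolute end of the string rather than on a line boundary.

```ruby
# Strip every trailing newline by anchoring on \z, which matches
# only at the very end of the string (never before an inner "\n").
text = "abc\ndef\n\n"
stripped = text.gsub(/\n+\z/, '')

p stripped   # => "abc\ndef"
```

The newline between "abc" and "def" is left alone because `\z` cannot match in the middle of the string.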

In Perl, I believe “^” and “$” match the start and end of the string only,
unless you have multiline mode enabled.

A bit of playing around shows:

           ^ matches         $ matches         . matches
           ---------         ---------         ---------

Ruby //    start of string   end of string     any char except \n
           or after \n       or before \n
                             [note]

Ruby //m   start of string   end of string     any char
           or after \n       or before \n

Perl //    start of string   end of string     any char except \n

Perl //m   start of string   end of string     any char except \n
           or after \n       or before \n

Perl //s   start of string   end of string     any char

Perl //ms  start of string   end of string     any char
           or after \n       or before \n

So it looks like Ruby’s “normal” mode is like Perl’s “multiline” mode, and
there is no way to get Perl’s “normal” mode in Ruby. Ruby’s “multiline” mode
is really Perl’s “/ms” mode.
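
The "." column of the table can be checked with a short sketch like the one below. It only illustrates the effect of /m on "."; the sample string and variable names are assumptions made for the example.

```ruby
# In Ruby, the /m flag changes only what "." matches;
# "^" and "$" stay line-oriented with or without it.
s = "a\nb"

without_m = (s =~ /a.b/)    # "." refuses to match "\n"
with_m    = (s =~ /a.b/m)   # "." now matches "\n" as well

p without_m   # => nil
p with_m      # => 0
```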

[note]
Actually, that’s not even true. Perl’s $ will match end-of-string after a
newline even in multiline mode:

$ perl -e '$a="abc\n"; $a =~ s/c\n$//m; print "<$a>\n";'

But Ruby can’t.

$ ruby -e 'a="abc\n"; a.gsub!(/c\n$/,""); puts "<#{a}>";'
<abc

This means the description “$ matches end of string or before \n” is wrong
for Ruby. In fact, I’m not sure what exactly $ matches. Can anybody give
me a description of what $ matches in Ruby?

I think this difference in behaviour is unfortunate, as I consider that “^”
and “$” are part of the lowest-common-denominator of regexp behaviour which
ought to be reasonably portable.

Regards,

Brian.

The closest description I can come up with is:

In Ruby, ‘$’ matches:

  • before a newline
  • at the end of the string, UNLESS the last character of the string is a
    newline, in which case it doesn’t match there.

irb(main):001:0> a = "abc\n\n"
=> "abc\n\n"
irb(main):002:0> a.gsub(/$/,'z')
=> "abcz\nz\n"
irb(main):003:0> a = "abc"
=> "abc"
irb(main):004:0> a.gsub(/$/,'z')
=> "abcz"

I wonder why this was chosen rather than just “‘$’ matches before a newline
and at the end of the string”?

I guess if you do gsub(/$/,‘foo’) to add text to the end of a \n-terminated
line then it kind of makes sense. But it does stop you matching /\n$/ if you
want to explicitly eat up newlines - and breaks compatibility with other
languages.

‘^’ doesn’t have this issue:

irb(main):005:0> b = "\nabc"
=> "\nabc"
irb(main):006:0> b.gsub(/^/,'z')
=> "z\nzabc"

So ‘^’ always matches at the start of the string and after a newline.
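
As a rough illustration of the line-oriented behaviour of ‘^’, the sketch below sets it beside `\A`, which matches only at the start of the string. The variable names are made up for the example.

```ruby
# "^" matches at every line start; "\A" matches only at the
# start of the whole string.
b = "\nabc"

line_starts  = b.gsub(/^/, 'z')    # start of string and after "\n"
string_start = b.gsub(/\A/, 'z')   # start of string only

p line_starts    # => "z\nzabc"
p string_start   # => "z\nabc"
```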

Regards,

Brian.

On Sun, Apr 27, 2003 at 08:41:04PM +0900, I wrote:

This means the description "$ matches end of string or before \n" is wrong
for Ruby. In fact, I'm not sure *what* exactly $ matches. Can anybody give
me a description of what $ matches in Ruby?

$ matches before \n, or at the end of the string if the string does not end with \n

I think this difference in behaviour is unfortunate, as I consider that "^"
and "$" are part of the lowest-common-denominator of regexp behaviour which
ought to be reasonably portable.

Why use the broken P implementation ?

Guy Decoux

I guess I am used to:

s/[\r\n\s]+$//;   # Strip trailing spaces and newlines

Somehow it seems right to me that $ matches “end of string”, perhaps because
a long time ago I did a course which had a more formal definition of regular
expressions, and $ was used to mean end of input stream.
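
For comparison, a Ruby counterpart of that Perl idiom might look like the sketch below. It is only one possible rendering: it swaps `$` for `\z` so the match is pinned to the end of the string, and the sample input is invented.

```ruby
# Strip trailing spaces, tabs, carriage returns and newlines.
# \s already covers "\r" and "\n", and \z pins the match to the
# very end of the string.
line = "some text \t\r\n\n"

via_regexp = line.sub(/\s+\z/, '')
via_rstrip = line.rstrip   # built-in; same result for this input

p via_regexp   # => "some text"
p via_rstrip   # => "some text"
```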

FWIW, awk agrees with Perl:

$ awk -- 'BEGIN { a="abc\n"; gsub("$","z",a); print a }' </dev/null
abc
z

So it seems to me this is an anomaly introduced by Ruby.

Regards,

Brian.

On Sun, Apr 27, 2003 at 09:05:29PM +0900, ts wrote:

I think this difference in behaviour is unfortunate, as I consider that “^”
and “$” are part of the lowest-common-denominator of regexp behaviour which
ought to be reasonably portable.

Why use the broken P implementation ?

FWIW, awk agrees with Perl:

Retrieve the test for regexp from rubicon and run it against ruby, perl,
python, awk, sed, rx and you'll see how many different implementations of
regexp exist.

Guy Decoux

Hi --

On Sun, 27 Apr 2003, Brian Candler wrote:

FWIW, awk agrees with Perl:

$ awk -- 'BEGIN { a="abc\n"; gsub("$","z",a); print a }' </dev/null
abc
z

So it seems to me this is an anomaly introduced by Ruby.

sed treats $ line-wise:

$ echo -e "abc\nabc" | sed -e 's/abc$/def/'
def
def

David


David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Did you look at \z, \Z and option “m” (multiline mode)?
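
For reference, the difference between the two anchors mentioned here can be sketched as follows. This is a small illustration with an invented sample string, not a full description of the regexp engine.

```ruby
# \Z matches at the end of the string, or just before a final "\n".
# \z matches only at the very end of the string.
s = "abc\n"

before_final_newline = (s =~ /c\Z/)     # "c" sits just before the final "\n"
very_end             = (s =~ /c\z/)     # "c" is not the last character
newline_then_end     = (s =~ /c\n\z/)   # consume the "\n" explicitly

p before_final_newline   # => 2
p very_end               # => nil
p newline_then_end       # => 2
```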

robert

“Brian Candler” B.Candler@pobox.com schrieb im Newsbeitrag
news:20030427121720.GA38491@uk.tiscali.com

I think this difference in behaviour is unfortunate, as I consider that “^”
and “$” are part of the lowest-common-denominator of regexp behaviour which
ought to be reasonably portable.

Why use the broken P implementation ?

I guess I am used to:

s/[\r\n\s]+$//;   # Strip trailing spaces and newlines

Somehow it seems right to me that $ matches “end of string”, perhaps because
a long time ago I did a course which had a more formal definition of regular
expressions, and $ was used to mean end of input stream.

FWIW, awk agrees with Perl:

$ awk -- 'BEGIN { a="abc\n"; gsub("$","z",a); print a }' </dev/null
abc
z

So it seems to me this is an anomaly introduced by Ruby.

Regards,

Brian.

Err, yes. I think perhaps you missed my posting which started the thread, at
http://ruby-talk.org/70243

“Adding /m doesn’t help either. It took a fair
bit of head-scratching and digging around for me to discover that to
match the end of a string you must use ‘\z’.”

Regards,

Brian.

On Mon, Apr 28, 2003 at 03:39:58PM +0900, Robert wrote:

Did you look at \z, \Z and option “m” (multiline mode)?

“Brian Candler” B.Candler@pobox.com schrieb im Newsbeitrag
news:20030428100050.B53856@linnet.org

Did you look at \z, \Z and option “m” (multiline mode)?

Err, yes. I think perhaps you missed my posting which started the thread,
at
http://ruby-talk.org/70243

“Adding /m doesn’t help either. It took a fair
bit of head-scratching and digging around for me to discover that to
match the end of a string you must use ‘\z’.”

Yes, I missed that part, sorry.

robert