Ruby in "Mastering Regular Expressions"

The latest edition of “Mastering Regular Expressions, 2nd Edition” refers to
Ruby (Yippy!!), but not always in a positive light (Bummer!!).

Here is a sampling:

page 91, table indicates that Ruby version 1.6.7 was used when testing
regular expressions.

page 128, Table 3-11: Line Anchors for Some Scripting Languages. This table
lists “Concerns” and how they are handled. Under the Ruby column, the
following “Concerns” are noted:

Concern: “^ matches after any newline”.
– Note: “Ruby’s $ and ^ match at embedded newlines, but its \A and \Z do
not”

Concern: “$ matches before any newline”
– Note: “Ruby’s $ and ^ match at embedded newlines, but its \A and \Z do
not”

Under the title “Enhanced line-anchor mode. . .”
– Note: “N/A”, indicating that Ruby does not have this feature, while every
other language listed does (Java, Perl, PHP, Python, Tcl, .NET).

Concern: “\A always matches like normal ^”
– Note: “Ruby’s \A, unlike its ^, matches only at the start of the string”

Concern: “\Z always matches like normal $”
– Note: “Ruby’s \Z, unlike its $, matches at the end of the string, or
before a string-ending newline”

Concern: “\z always matches only at end of string”
– Note: “N/A”.

page 131, “My testing has shown that java.util.regex and Ruby have \G match
at the start of the current match, while Perl and the .NET languages have it
match at the end of the previous match. (Sun tells me that the next release
of java.util.regex will have its \G behavior match the documentation.)”

page 132, Table 3-12: A Few Utilities and Their Word Boundary
Metacharacters. The table indicates that Ruby does not support
“Start-of-word” and “End-of-word” boundary characters [e.g. Perl: (?<!\w)
(?=\w) … (?<=\w) (?!\w) ].

page 133, "Ruby has a bug whereby sometimes (?i) doesn’t apply to
|-separated alternatives that are lowercase (but does if they’re
uppercase)."

I am not a Master at Regular Expressions so I would like comments on if
these things should change (or possibly already are changed) in Ruby.

Hello –

The latest edition of “Mastering Regular Expressions, 2nd Edition” refers to
Ruby (Yippy!!), but not always in a positive light (Bummer!!).

Take heart: some of the things you’ve listed are not negative, but
merely descriptive.

page 128, Table 3-11: Line Anchors for Some Scripting Languages. This table
lists “Concerns” and how they are handled. Under the Ruby column, the
following “Concerns” are noted:

I don’t think he’s using “concern” in a negative way here. It’s just
a chart of “concerns”, in the sense of “things that come into play
where line anchors are involved”, and how a variety of languages
handle them. If they were all negative points, he’d be condemning the
existence of line anchor handling in all of these languages :-)

Concern: “^ matches after any newline”.
– Note: “Ruby’s $ and ^ match at embedded newlines, but its \A and \Z do
not”

Concern: “$ matches before any newline”
– Note: “Ruby’s $ and ^ match at embedded newlines, but its \A and \Z do
not”

Under the title “Enhanced line-anchor mode. . .”
– Note: “N/A”, indicating that Ruby does not have this feature, while every
other language listed does (Java, Perl, PHP, Python, Tcl, .NET).

That’s because Ruby doesn’t need it :-)

In Ruby, $ and ^ always match starts and ends of lines (embedded or
otherwise), while \A and the \Z/\z pair match the beginning and end
of strings (\Z and \z differ as to whether they match before or
after a final newline, if any). Therefore, you don’t need a special
“mode” indicating that $ and ^ should temporarily change their
meanings. It’s very simple and very consistent.
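
A quick way to see the division of labor is to run each anchor against a
string that has an embedded newline and a trailing one. This is only a small
sketch; the sample string and variable names are made up for illustration.

```ruby
# Two lines, with a trailing newline at the very end of the string.
text = "first\nsecond\n"

# ^ and $ match at every line boundary, embedded or not.
line_starts = text.scan(/^\w+/)   # words that begin a line
line_ends   = text.scan(/\w+$/)   # words that end a line

# \A matches only at the start of the whole string.
string_start = text.scan(/\A\w+/)

# \Z matches at the end of the string, or just before a final newline.
loose_end = text =~ /second\Z/

# \z matches only at the very end, so the final newline gets in the way
# unless the pattern consumes it explicitly.
strict_end_without_newline = text =~ /second\z/
strict_end_with_newline    = text =~ /second\n\z/

p line_starts                 # ["first", "second"]
p line_ends                   # ["first", "second"]
p string_start                # ["first"]
p loose_end                   # 6
p strict_end_without_newline  # nil
p strict_end_with_newline     # 6
```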

Concern: “\A always matches like normal ^”
– Note: “Ruby’s \A, unlike its ^, matches only at the start of the string”

Concern: “\Z always matches like normal $”
– Note: “Ruby’s \Z, unlike its $, matches at the end of the string, or
before a string-ending newline”

Concern: “\z always matches only at end of string”
– Note: “N/A”.

Yes, all related to the same point: in Ruby, \A, \Z/\z, ^, and $ all do
their jobs without overlapping or needing behavior-altering switches.

page 131, “My testing has shown that java.util.regex and Ruby have \G match
at the start of the current match, while Perl and the .NET languages have it
match at the end of the previous match. (Sun tells me that the next release
of java.util.regex will have its \G behavior match the documentation.)”

Hmmm. I’m not sure about that one, or why there’s that difference.
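
For anyone who wants to poke at \G, here is a small sketch along the lines of
the example in the current Ruby Regexp documentation, as I understand it. It
runs on a present-day Ruby, so it says nothing definite about the 1.6.7
interpreter the book tested. Note also that with consecutive non-empty
matches, "start of the current match" and "end of the previous match" are the
same position, so this does not distinguish the two behaviors Friedl
describes; it only shows what \G does in an iterating method like gsub.

```ruby
# Four leading spaces, then single spaces between letters.
text = "    a b c"

# Without \G, every space is replaced.
every_space = text.gsub(/ /, "_")

# With \G, each match must begin where the previous one stopped,
# so only the unbroken run of leading spaces is replaced.
leading_only = text.gsub(/\G /, "_")

p every_space   # "____a_b_c"
p leading_only  # "____a b c"
```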

page 132, Table 3-12: A Few Utilities and Their Word Boundary
Metacharacters. The table indicates that Ruby does not support
“Start-of-word” and “End-of-word” boundary characters [e.g. Perl: (?<!\w)
(?=\w) … (?<=\w) (?!\w) ].

I’m puzzling through why one would need those as long as one has \b
and \B. It looks like Friedl is saying: here are different ways to
achieve this, either with \b/\B and/or with lookahead/lookbehind. I’m
not sure whether there are plans afoot to add lookbehind to Ruby
regexes, but in any case, I’d use \b and \B for word boundaries.

I notice this from the Ruby ChangeLog:

  * regex.c (re_compile_pattern): \< (wordbeg), \> (wordend)
        disabled.

So I guess it existed at some point and Matz decided we didn’t need
it. (You can see its remains in regex.c :-)
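
As a sketch of the \b approach: start-of-word and end-of-word can be spelled
with \b plus lookahead only, no lookbehind required. At a \b position, either
a word character follows (start of a word) or it doesn't (end of a word). The
constant names and the sample text are made up for illustration.

```ruby
START_OF_WORD = /\b(?=\w)/  # a boundary with a word character to its right
END_OF_WORD   = /\b(?!\w)/  # a boundary with no word character to its right

text = "cat concat cathode"

# "cat" at the start of a word: "cat" and "cathode".
starts_with_cat = text.scan(/#{START_OF_WORD}cat/).length

# "cat" at the end of a word: "cat" and "concat".
ends_with_cat = text.scan(/cat#{END_OF_WORD}/).length

# "cat" as a whole word: only the first one.
whole_word_cat = text.scan(/#{START_OF_WORD}cat#{END_OF_WORD}/).length

# Mark every start-of-word position to make the zero-width match visible.
marked = "foo bar".gsub(START_OF_WORD, "<")

p starts_with_cat  # 2
p ends_with_cat    # 2
p whole_word_cat   # 1
p marked           # "<foo <bar"
```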

page 133, "Ruby has a bug whereby sometimes (?i) doesn’t apply to
|-separated alternatives that are lowercase (but does if they’re
uppercase)."

I wish he’d provide an example…
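
Lacking his test case, one way to go looking for it is a little probe like the
one below. The patterns, the word list, and the helper name
alternation_case_report are my own guesses at what such a test might look
like, not anything taken from the book. It compares the /i flag, a scoped
(?i:...) group, and a bare leading (?i), and reports which spellings of each
alternative match.

```ruby
# Hypothetical helper: for each word, record whether the regex matches it.
def alternation_case_report(regex, words)
  words.each_with_object({}) do |word, report|
    report[word] = !!(regex =~ word)
  end
end

words = %w[foo FOO bar BAR]

# Case-insensitivity applied by the trailing /i flag.
flag_report = alternation_case_report(/\A(?:foo|bar)\z/i, words)

# Case-insensitivity applied by a scoped (?i:...) group.
scoped_report = alternation_case_report(/\A(?i:foo|bar)\z/, words)

# Case-insensitivity applied by a bare (?i) ahead of the alternation,
# which is the form the book's remark is about.
bare_report = alternation_case_report(/(?i)foo|bar/, words)

p flag_report
p scoped_report
p bare_report
```

If the bare (?i) report ever disagrees with the other two on a given
interpreter, that would be the kind of inconsistency the book describes.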

I am not a Master at Regular Expressions so I would like comments on if
these things should change (or possibly already are changed) in Ruby.

Any that are bugs should change :-) The line and string boundary
syntax in Ruby seems to me to be exemplary; I wouldn’t want to see
that regress.

David


On Fri, 11 Oct 2002, Dale Martenson wrote:
Thu Nov 4 17:41:18 1999 Yukihiro Matsumoto matz@netlab.co.jp


David Alan Black | Register for RubyConf 2002!
home: dblack@candle.superlink.net | November 1-3
work: blackdav@shu.edu | Seattle, WA, USA
Web: http://pirate.shu.edu/~blackdav | http://www.rubyconf.com