Ruby-dev summary 19944 - 19957

Hello,

This is a summary of the ruby-dev mailing list for the last week.

[ruby-dev:19953] Re: How do we parse Regular Expressions in our brain?

This thread started from [ruby-dev:19923] as a branch of the
Oniguruma thread. TAKAHASHI Masayoshi and Akira Tanaka are investigating
rules for escaping ‘[’, ‘]’ and ‘-’ characters in regexps for legibility.

They plan to change the regexp parsing rule in 1.8.0 as follows so far.
The plan is based on regexp coding conventions and requests from others
in ruby-dev:

  1. ‘[’ and ‘]’ must be escaped by ‘\’ when they appear as a literal
    in character class expressions. /[[]]/ should be coded as /[\[\]]/
    for example.

  2. A literal ‘-’ in a character class should be escaped too if the
    class has other ‘-’ literals for range representation. For instance,
    /[abcd-f-hijk]/ and /[--abc]/ should be written as
    /[abcd-f\-hijk]/ (or /[abcd-f-hijk]/) and /[\--abc]/.
    You can use ‘-’ without ‘\’ if the class has no ranges, like
    /[-abc]/ or /[^-]/.

  3. A literal ‘]’ outside of a character class will trigger a warning
    if it has no escape character, as in /a]/. It should be
    changed to /a\]/.

The regexp parser will warn about these in 1.8.0.
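
As a rough illustration of the escaped forms recommended in the three rules, here is a minimal Ruby sketch. The constant names, the `matching_chars` helper and the sample characters are made up for this example, and the matching behaviour shown is what a current Ruby does, not something the summary itself specifies:

```ruby
# Rule 1: both brackets are escaped inside the character class.
EITHER_BRACKET = /[\[\]]/

# Rule 2: 'd-f' is a range; the second dash is escaped, so it is a literal.
RANGE_AND_DASH = /[abcd-f\-hijk]/

# Rule 3: a literal ']' outside a character class is escaped.
A_THEN_BRACKET = /a\]/

# Hypothetical helper: which of the given one-character strings match?
def matching_chars(regexp, chars)
  chars.select { |c| c =~ regexp }
end

p matching_chars(EITHER_BRACKET, %w{[ ] a})   # => ["[", "]"]
p matching_chars(RANGE_AND_DASH, %w[e g - k]) # => ["e", "-", "k"]
p "a]" =~ A_THEN_BRACKET                      # => 0
```

Note that 'g' is rejected by the second pattern: the class lists a, b, c, the range d-f, a literal dash, and h through k individually.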

Kazuo Saito ksaito@uranus.dti.ne.jp

  2. A literal ‘-’ in a character class should be escaped too if the
    class has other ‘-’ literals for range representation. For
    instance,
    /[abcd-f-hijk]/ and /[--abc]/ should be written as
    /[abcd-f\-hijk]/ (or /[abcd-f-hijk]/) and /[\--abc]/.
    You can use ‘-’ without ‘\’ if the class has no ranges, like
    /[-abc]/ or /[^-]/.

Ick. Any reason why the standard “if ‘-’ occurs as the first or last
character in the character class, it matches itself” behavior isn’t
being used in Ruby?

Regex syntax is fractured enough now that I’m unsure of the wisdom of
adding yet another variation on it. Is there some compelling
advantage to this that I’m missing?

···


They plan to change the regexp parsing rule in 1.8.0 as follows so far.
The plan is based on regexp coding conventions and requests from others
in ruby-dev:

2) A literal ‘-’ in a character class should be escaped too if the
class has other ‘-’ literals for range representation. For instance,
/[abcd-f-hijk]/ and /[--abc]/ should be written as
/[abcd-f\-hijk]/ (or /[abcd-f-hijk]/) and /[\--abc]/.
You can use ‘-’ without ‘\’ if the class has no ranges, like
/[-abc]/ or /[^-]/.

The way I parse regexps in my brain is the way I have been conditioned
through what I have learned with grep, awk, perl and so on :)

I often use character classes like this:

 /[a-zA-Z0-9.-]/

which matches a-z, A-Z, 0-9, dot or dash. This is how it works everywhere
else. The rule is ‘if you want to match a dash, put it at the very beginning
or very end of your character class’
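
A small Ruby sketch of that convention, for readers who want to try it. The constant and method names are invented for this example, and the anchors \A and \z are added only so that each test string must be exactly one character from the class:

```ruby
# Dash at the very end of the class: treated as a literal '-'.
TRAILING_DASH = /\A[a-zA-Z0-9.-]\z/

# Dash at the very beginning of the class: also a literal '-'.
LEADING_DASH = /\A[-a-z]\z/

# Hypothetical helper: true if the single character is in the class.
def in_class?(regexp, char)
  !!(char =~ regexp)
end

%w[a Z 7 . - _].each do |c|
  puts "#{c.inspect}: #{in_class?(TRAILING_DASH, c)}"
end
```

Here only "_" should come out false; the dot and the dash are both matched literally.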

I can’t see any particular reason to outlaw this practice…

Regards,

Brian.

···

On Tue, Apr 08, 2003 at 12:20:40AM +0900, Kazuo Saito wrote:

Please, please reconsider this.

Although the current way could be viewed as being ugly, at the same
time it is idiomatic. I use regexps in many different languages and
environments, and (with the exception of good ol’ Emacs) they all work
the same way. Adding a new variant will just confuse my poor little
brain. In this case, readability comes from familiarity, not theory.

Cheers

Dave

···

On Monday, April 7, 2003, at 10:20 AM, Kazuo Saito wrote:

They plan to change the regexp parsing rule in 1.8.0 as follows so far.
The plan is based on regexp coding conventions and requests from others
in ruby-dev:

In article 3E91974E.7000903@uranus.dti.ne.jp,
Kazuo Saito ksaito@uranus.dti.ne.jp writes:

  1. ‘[’ and ‘]’ must be escaped by ‘\’ when they appear as a literal
    in character class expressions. /[[]]/ should be coded as /[\[\]]/
    for example.

No. What should be coded as /[]/ is //.

Or, /[]/ should be coded as /[]/

  2. A literal ‘-’ in a character class should be escaped too if the
    class has other ‘-’ literals for range representation. For instance,
    /[abcd-f-hijk]/ and /[--abc]/ should be written as
    /[abcd-f\-hijk]/ (or /[abcd-f-hijk]/) and /[\--abc]/.
    You can use ‘-’ without ‘\’ if the class has no ranges, like
    /[-abc]/ or /[^-]/.

No. A ‘-’ at the ‘top or bottom of a character class’ will not be warned about,
except in /[--abc]/ and /[abc--]/. However, the definition of ‘top’ is a bit
different between mine and matz’s.

So /[a-zA-Z0-9.-]/ will not be warned about.

···


Tanaka Akira

i’m with brian on this - those of us who’ve been converting old lex/perl
parsers to ruby have enough trouble saving the regexes people took hours to
write. backward compatibility in something as arcane as regexes is not just
nice - it’s essential.

-a

···

On Tue, 8 Apr 2003, Brian Candler wrote:

On Tue, Apr 08, 2003 at 12:20:40AM +0900, Kazuo Saito wrote:

They plan to change the regexp parsing rule in 1.8.0 as follows so far.
The plan is based on regexp coding conventions and requests from others
in ruby-dev:

2) A literal ‘-’ in a character class should be escaped too if the
class has other ‘-’ literals for range representation. For instance,
/[abcd-f-hijk]/ and /[--abc]/ should be written as
/[abcd-f\-hijk]/ (or /[abcd-f-hijk]/) and /[\--abc]/.
You can use ‘-’ without ‘\’ if the class has no ranges, like
/[-abc]/ or /[^-]/.

The way I parse regexps in my brain is the way I have been conditioned
through what I have learned with grep, awk, perl and so on :)

I often use character classes like this:

 /[a-zA-Z0-9.-]/

which matches a-z, A-Z, 0-9, dot or dash. This is how it works everywhere
else. The rule is ‘if you want to match a dash, put it at the very beginning
or very end of your character class’

I can’t see any particular reason to outlaw this practice…

Ara Howard
NOAA Forecast Systems Laboratory
Information and Technology Services
Data Systems Group
R/FST 325 Broadway
Boulder, CO 80305-3328
Email: ahoward@fsl.noaa.gov
Phone: 303-497-7238
Fax: 303-497-7259
====================================

Hi,

The way I parse regexps in my brain is the way I have been conditioned
through what I have learned with grep, awk, perl and so on :)

I often use character classes like this:

/[a-zA-Z0-9.-]/

which matches a-z, A-Z, 0-9, dot or dash. This is how it works everywhere
else. The rule is ‘if you want to match a dash, put it at the very beginning
or very end of your character class’

It does (and will) work that way in Ruby too. The dash in the character
class here is not confusing. Remember the examples in [ruby-talk:68760]:

/[abcd-f-hijk]/

The second dash is not part of a range; it should be escaped, and
gets a warning. Note that it’s not an error.

/[--abc]/

The first dash is the beginning of a character range; it should be
escaped, and gets a warning.

/[-abc]/

The first dash is clearly not part of a character range; not
confusing, no warning.

/[abc-]/

The dash at the end of the character class is clearly not part of
a character range; not confusing, no warning.

/[^-]/

The first dash is clearly not part of a character range; not
confusing, no warning.

/a]/

The closing bracket is not part of a character class; it should be
escaped, and gets a warning.
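
For anyone who wants to check these cases against their own interpreter, here is a hedged Ruby sketch. `warnings_for` is a hypothetical helper written for this example; it assumes that compile-time regexp warnings are written to $stderr, which is how current Ruby versions behave as far as I know, and the exact warning text may differ between versions:

```ruby
require "stringio"

# Hypothetical helper: compile the pattern source with Regexp.new and
# return whatever was written to $stderr while doing so.
def warnings_for(source)
  saved_stderr = $stderr
  saved_verbose = $VERBOSE
  $stderr = StringIO.new   # capture warnings instead of printing them
  $VERBOSE = true          # make sure warnings are not suppressed
  Regexp.new(source)
  $stderr.string
ensure
  $stderr = saved_stderr
  $VERBOSE = saved_verbose
end

# The six examples from the list above, as pattern sources.
["[abcd-f-hijk]", "[--abc]", "[-abc]", "[abc-]", "[^-]", "a]"].each do |src|
  message = warnings_for(src)
  status = message.empty? ? "no warning" : "warning: #{message.strip}"
  puts "/#{src}/  #{status}"
end
```

If the interpreter follows the scheme described above, the first, second and last patterns should report a warning and the other three should stay silent.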

I can’t see any particular reason to outlaw this practice…

We are not going to outlaw them. We want to discourage them by giving
a warning.

						matz.
···

In message “Re: ruby-dev summary 19944 - 19957” on 03/04/08, Brian Candler B.Candler@pobox.com writes:

Saluton!

I often use character classes like this:

 /[a-zA-Z0-9.-]/

which matches a-z, A-Z, 0-9, dot or dash. This is how it works
everywhere else. The rule is ‘if you want to match a dash, put it
at the very beginning or very end of your character class’

I can’t see any particular reason to outlaw this practice…

I suggest outlawing unescaped ‘-’… I don’t like opaque regex like
the following two ‘capacitors’:

/[--\--]/
/[-----\-----]/

Gis,

Josef ‘Jupp’ Schugt

In article Pine.LNX.4.53.0304071546350.1744@eli.fsl.noaa.gov,

···

ahoward ahoward@fsl.noaa.gov wrote:

On Tue, 8 Apr 2003, Brian Candler wrote:

On Tue, Apr 08, 2003 at 12:20:40AM +0900, Kazuo Saito wrote:

They plan to change the regexp parsing rule in 1.8.0 as follows so far.
The plan is based on regexp coding conventions and requests from others
in ruby-dev:

2) A literal ‘-’ in a character class should be escaped too if the
class has other ‘-’ literals for range representation. For instance,
/[abcd-f-hijk]/ and /[--abc]/ should be written as
/[abcd-f\-hijk]/ (or /[abcd-f-hijk]/) and /[\--abc]/.
You can use ‘-’ without ‘\’ if the class has no ranges, like
/[-abc]/ or /[^-]/.

The way I parse regexps in my brain is the way I have been conditioned
through what I have learned with grep, awk, perl and so on :)

I often use character classes like this:

 /[a-zA-Z0-9.-]/

which matches a-z, A-Z, 0-9, dot or dash. This is how it works everywhere
else. The rule is ‘if you want to match a dash, put it at the very beginning
or very end of your character class’

I can’t see any particular reason to outlaw this practice…

i’m with brian on this - those of us who’ve been converting old lex/perl
parsers to ruby have enough trouble saving the regexes people took hours to
write. backward compatibility in something as arcane as regexes is not just
nice - it’s essential.

Well said. I agree completely. Please do not break backward
compatibility of regexes - doing so could seriously scare people away from
Ruby since they would never know if their code will work with the next
release without having to be rewritten.

Phil

According to that post:

“2) A literal ‘-’ in a character class should be escaped too if the
class has other ‘-’ literals for range representation.

You can use ‘-’ without ‘\’ if the class has no ranges, like
/[-abc]/ or /[^-]/.”

But it didn’t say that a literal ‘-’ could be used without ‘\’ at the start
or end of a character class, e.g. /[a-z-]/

If you are going to allow that (without a warning) then I am happy :)

Regards,

Brian.

···

On Tue, Apr 08, 2003 at 11:44:54AM +0900, Yukihiro Matsumoto wrote:

It does (and will) work that way in Ruby too. The dash in the character
class here is not confusing. Remember the examples in [ruby-talk:68760]:

Saluton!

/a]/

The closing bracket is not part of character class; should be
escaped, warning.

Great. For me this has a nice advantage: I use a Dutch keyboard
layout where ‘]’ and ‘[’ are on one key; the former without shift,
the latter with shift, not the other way round. It therefore
sometimes happens that I enter

/]a-b[/

while it should read

/[a-b]/

Ruby will now warn me about that :)

Gis,

Josef ‘Jupp’ Schugt