Regular expressions

When I first learned regular expressions, they were no problem. It was in a
math class, and we were working over a finite alphabet ({a, b}, as I recall)
which certainly did not include any of the delimiters or special symbols
(+, *, ?, etc.). No problem.

Then came regexps in real life. Yuck. I’ve lost count of the number of
times I have relearned regexps, like cramming for a pop quiz, and forgotten it
all promptly after I figured it out. When I need a regexp, I open up irb
and just duke it out till I get it. Then I copy the regexp into my code and
make sure never to look at it again.

I’m not asking for new regexp syntax in Ruby or anything. I’m sure that
doing it the Perl way is the Right way in this case, and if I used regexps
any more often than I do, I would probably just learn the damn syntax, too,
and that would be the end of it. But, like looking at an old Perl program,
an old regexp is just…

Then I started thinking: We have Perl-style regexps, right? What other
style regexps are there? I’m sure people have come up with a number of
different ideas, but a quick Google didn’t turn up anything. Does anyone
know of any other ideas?

I’m just convinced that whoever came up with the current standard syntax
(was it Chomsky?) was not implementing them on a computer over the ASCII
alphabet.

Chris

It seems to me you might like this:
http://mywebpages.comcast.net/mdschneider/ruby/index.html#REGTREE

Example:

# string to match
url_string = "myWebPages.comcast.net/mdschneider"

# build regex
word = RegexTree.init.word_char.plus_times
domain_element = word.capture_array("domains")
domain = (domain_element | '.').plus_times
url = domain + '/' + word.cast("user")

# match the string
url_match = url.match(url_string)

# extract the values
domains = url_match.value("domains").compact
user = url_match.value("user")

# display
puts
puts "URL path extraction demo:"
puts
puts "url_string: #{url_string}"
puts "regex generated: #{url.render}" # ==> ((?:[0-9A-Z_a-z]+|.)+)/([0-9A-Z_a-z]+)
puts "domains found: #{domains.inspect}" # ==> ["myWebPages", "comcast", "net"]
puts "user found: #{user.inspect}" # ==> mdschneider

It all comes down to using some pseudo-EBNF notation. It is more
powerful than required for regexps, and far more verbose, but then
again, sometimes we just want that.
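
For comparison, here is a minimal sketch of the same extraction with a plain Ruby regexp and named groups. The pattern is my own guess at something equivalent, not what RegexTree renders, and the group names `domain` and `user` are made up for the example.

```ruby
url_string = "myWebPages.comcast.net/mdschneider"

# Word characters and dots, then a slash, then the user name.
# (The group names are assumptions for this sketch.)
url = %r{\A(?<domain>[\w.]+)/(?<user>\w+)\z}

url_match = url.match(url_string)
domains = url_match[:domain].split(".")
user    = url_match[:user]

puts "domains found: #{domains.inspect}"
puts "user found: #{user.inspect}"
```

Terser than the builder version, but it also shows the original complaint: the reader has to decode `[\w.]+` by eye.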

···

On Wed, Apr 16, 2003 at 07:18:43AM +0900, Chris Pine wrote:

Then I started thinking: We have Perl-style regexps, right? What other
style regexps are there? I’m sure people have come up with a number of
different ideas, but a quick Google didn’t turn up anything. Does anyone
know of any other ideas?


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Look, I’m about to buy me a double barreled sawed off shotgun and show
Linus what I think about backspace and delete not working.
– some anonymous .signature

Then I started thinking: We have Perl-style regexps, right? What other
style regexps are there? I’m sure people have come up with a number of
different ideas, but a quick Google didn’t turn up anything. Does anyone
know of any other ideas?

Well… there sort of are other styles, but not really…
All styles are very very similar.

I guess it all started with the shell’s “wildcards”, which were extended
by grep, awk and sed. Then came Perl which extended those again.

Each extension caused a slight change in syntax, and an increase in
functionality. Thus, you will find that they are all very similar, but
not quite identical (and each is a functional superset of its ancestor).

I’m just convinced that whoever came up with the current standard syntax
(was it Chomsky?) was not implementing them on a computer over the ASCII
alphabet.

If my theory of (RE) evolution is correct, it would suggest that as
functionality was added, the syntax was forced to migrate to more remote
regions of the keyboard.

– I think I’ll call this the “Carrera theory of evolution” :slight_smile:

Cheers,

···


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

“Chris Pine” nemo@hellotree.com wrote in message
news:08fc01c3039c$f95514b0$6401a8c0@MELONBALLER…

Then I started thinking: We have Perl-style regexps, right? What other
style regexps are there? I’m sure people have come up with a number of
different ideas, but a quick Google didn’t turn up anything. Does anyone
know of any other ideas?

I’d roughly classify according to these three styles:

grep, sed
  • () aren’t meta

POSIX / egrep
  • () are meta
  • character classes like [:digit:], [:space:] etc.

perl / ruby
  • () are meta
  • character classes like \d, \s etc.
  • positive and negative lookahead and lookbehind
  • white space for readability
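
A small sketch of how the last two groups look side by side in Ruby, which accepts both the POSIX bracket classes and the Perl-style shorthands. The sample strings and variable names are invented for illustration.

```ruby
# POSIX bracket class and Perl-style shorthand for the same job.
posix_digits = /[[:digit:]]+/
perl_digits  = /\d+/

sample = "order 66 shipped"
puts sample[posix_digits]
puts sample[perl_digits]

# Positive lookahead: digits only when " USD" follows (it is not consumed).
price = /\d+(?= USD)/
puts "costs 42 USD"[price]

# Negative lookbehind: "bar" only when it is not preceded by "foo".
lone_bar = /(?<!foo)bar/
puts lone_bar.match?("foobar")
puts lone_bar.match?("crowbar")
```

On ASCII input like this the two digit classes pick out the same text.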

Regards

robert

“Chris Pine” nemo@hellotree.com writes:

I’m not asking for new regexp syntax in Ruby or anything. I’m sure that
doing it the Perl way is the Right way in this case, and if I used regexps
any more often than I do, I would probably just learn the damn syntax, too,
and that would be the end of it. But, like looking at an old Perl program,
an old regexp is just…

To a great extent the Perl way with regexps is the Unix way - a terse
notation for serious, regular users. It’s not ideal for intermittent
users.

Then I started thinking: We have Perl-style regexps, right? What other
style regexps are there? I’m sure people have come up with a number of
different ideas, but a quick Google didn’t turn up anything. Does anyone
know of any other ideas?

Well, a long time ago, in a very different world, there was a pattern
matching language called Snobol (or rather SNOBOL as punched cards
didn’t have lower case :-). 30+ years ago I used to use it for the
sort of quick string manipulating hacks that Ruby and Perl often get
used for today. SNOBOL had “patterns” which weren’t exactly today’s
regexps but were very similar. I’ve often thought that a modern
variation of Snobol patterns would make a more user friendly regexp
syntax.

[5 minutes later and I’ve found my old SNOBOL4 language manual,
copyright 1971. This takes me back.]

Here are SNOBOL4’s basic patterns, in the order listed in the manual.
Patterns could be enclosed in parentheses, alternatives could be
separated by “|”, and juxtaposition of patterns meant concatenation
like in regexps. Using a variable in a pattern meant taking its
contents as a pattern. The string matched by a pattern could be
assigned to a variable by appending either $ variable or . variable.
In the former case the match was assigned as the matching was working
so could be used with “*” to act like \1, \2, etc in regexps. A $ var
match could be reassigned if backtracking caused another match to
happen. In the . var case the assignment only happened after the
entire match worked.

SNOBOL pattern     Perl/Ruby regexp

LEN(3)             .{3,3}
SPAN('aeiou')      [aeiou]+
BREAK('aeiou')     [^aeiou]+
ANY('aeiou')       [aeiou]
NOTANY('aeiou')    [^aeiou]
TAB(n)             No exact regexp equivalent. Match all (maybe 0) chars
                   from current position to just before the nth char. If
                   beyond the nth character, fail.
RTAB(n)            Like TAB, but from end of string rather than
                   beginning.
POS(n)             Match if and only if at nth character. POS(0) is
                   equivalent to regexp ^.
RPOS(n)            Like POS, but from the right. RPOS(0) = $
FAIL               Always fails to match, forcing alternatives to be
                   tried.
FENCE              Succeeds when first matched; if backtracking causes
                   the character position to try to back up through the
                   fence, the match fails.
ABORT              Aborts the entire match, including untried
                   alternatives.
*variable          Match the pattern (usually a simple string) in the
                   variable at the time the * pattern is tried for a
                   match. With a prior $ variable assignment this acts
                   the same as \1 etc in regexps.
ARB                .*? (i.e. minimum match of .*)
ARBNO(pattern)     (pattern)*
BAL                Matches any string containing balanced, possibly
                   nested, parentheses. Impossible for regexp I believe.
SUCCEED            Succeeds when first matched; if backtracking causes
                   the character position to try to back up through the
                   SUCCEED then it matches again and starts the match
                   going forward. Sort of a reflective FENCE. You don’t
                   want to know what evil tricks this could be used for.
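
A few of these rows can be tried out directly in Ruby. The sketch below is one reading of the table, not something taken from the SNOBOL4 manual. The BAL row needs a caveat: it is out of reach for classical regular expressions, but Ruby regexps (1.9 and later) allow a named group to call itself with `\g<name>`, which is enough for balanced parentheses. The group name `bal` and the sample strings are invented here.

```ruby
len3    = /\A.{3}\z/       # LEN(3), anchored with POS(0) and RPOS(0)
span    = /[aeiou]+/       # SPAN('aeiou')
backref = /(\w+) \1/       # "$ variable" followed later by "*variable"

# BAL-like pattern: one parenthesised group, possibly nested.
# \g<bal> re-enters the group named bal recursively.
bal = /\A(?<bal>\((?:[^()]|\g<bal>)*\))\z/

puts len3.match?("abc")
puts "queue"[span]
puts "hello hello world"[backref]
puts bal.match?("(a(b)c)")
puts bal.match?("(a(b)c")
```

The recursive call is a regexp-engine extension, so it says nothing about regular expressions in the math-class sense.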

Hope that wasn’t too much like trying to drink from a firehose. :slight_smile: It
might give someone an idea. Just don’t ask about Snobol’s control flow
mechanisms, you really don’t want to know.

···


Conspiracy theories exist to prevent you knowing what’s really going on.

It seems to me you might like this:
http://mywebpages.comcast.net/mdschneider/ruby/index.html#REGTREE

···

----- Original Message -----
From: “Mauricio Fernández” batsman.geo@yahoo.com


Hmmm… I really like what he has to say on his webpage, but that is just
way too verbose.

----- Original Message -----
From: “Daniel Carrera” dcarrera@math.umd.edu


If my theory of (RE) evolution is correct,

It seems that a mathematician (of course) invented regular expressions:
Kleene. Ken Thompson then built them into qed, then ed, then grep. From
there they took off. So, no, they were not intended to be used as they now
are. (I mean the current syntax… it really is natural and easy in a math
class!)

----- Original Message -----
From: “Arthur Chance” {spamtrap}@qeng-ho.org

Well, a long time ago, in a very different world, there was a pattern
matching language called Snobol

Interesting! I’ll look into it.

What is a regular expression? Basically, it’s a terse notation for a set of
(perhaps countably (infinitely) many) strings. So this has got me to
thinking about what we use regexps for (since PCREs do a lot of little
things you don’t get in old-school regular expressions). Given a string
'str' and a set of strings 'S', we want to know:

  • Is 'str' in 'S'?
  • Is some substring of 'str' in 'S'?
    — If so, which one? Where does it start?
    — And is there another one after the end of that one?

Sometimes we also want this information about the sub-regexps in a regexp
(the stuff in parentheses).
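
Each of those questions maps onto a core Ruby call. A minimal sketch, with a made-up string and a made-up set S (nonempty strings over a, b, c):

```ruby
str = "abba, then ccr"
s   = /[abc]+/

# Is the whole of str in S?
puts str.match?(/\A#{s}\z/)

# Is some substring of str in S? If so, which one, and where does it start?
first = str.match(s)
puts first[0]
puts first.begin(0)

# Is there another one after the end of that one?
second = str.match(s, first.end(0))
puts second[0]
puts second.begin(0)

# The same information about the sub-regexps (the stuff in parentheses).
m = "abbacc".match(/([ab]+)(c+)/)
puts m[1]
puts m.begin(2)
```

`String#match` takes an optional start position, which is what answers the "another one after that" question without slicing the string.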

Then there’s gsub… I’ll skip that for now.

It seems that the major problem is that regexps are used on strings, but
they are implemented as if they are strings, so you have to escape
everything. It’s like programming in a language where '(' is a reasonable
variable name! So here’s a first shot (and I’ll use '<' and '>' as delimiters,
so as not to suggest that I am trying to get this implemented over Ruby’s own
'/…/' regexps; obviously this wouldn’t work at all in Ruby; I’m just wondering
if this is easier to read):

<^ 'abc'* [^'!@#$']>

which would match ‘abba’ and ‘ccr’, but not ‘bab!’.

Perhaps if we changed the second '^' to '!' (more natural) and twiddle a
bit:

[^, 'abc'*, !'!@#$'.1]

Hmmm… I like the `.1’, but I don’t like the array-style and the commas.

Well, that’s a beginning, anyway. Thoughts?

Chris

It’s a little more complicated. First, you shouldn’t mix up wildcards and
regexps; they are entirely different things.

To do regexps with the chars available in ASCII you have to treat some chars
specially. So e.g. in grep the chars '.' and '*' (and some others) are
special and have special meanings (as we all know). If you want to include
one of these chars literally in a regexp, you escape it with '\': '\*'.
Other chars like '(' did not have a special meaning and therefore there
was no need to escape them.

Now the feature of grouping was added to grep, and '(' and ')' were
chosen to start and end a group. This would have meant changing the meaning
of '(' and ')' from literal to special chars, and this would break old
scripts using '(' and ')' literally. So the programmers of grep decided
to use '\(' and '\)' for grouping.

With each new feature added the situation got more and more fucked up: some
chars have special meaning without being escaped ('. * [ ] ^ $' ...) and have
to be escaped to be used literally ('\. \* \^ \$' ...); others don’t have
special meaning ('( ) < > |' ...) and gain special meaning by being escaped
('\( \) \< \> \|'), so you have to remember for a lot
of chars whether they have to be escaped or not (and these chars differ in
different tools, compare emacs regexps, egrep, grep, vi, awk, …).

So perl made a cut and decided to simplify (and ruby thankfully adopted it):
Any char within a-zA-Z0-9 (and maybe umlauts like äöüß) is taken literally,
so 'S' is a literal 'S' and may gain special meaning by being escaped
(like '\S' or '\1').
Any other char (# ' % & $ ^ ? = ) ( / and many more) may have special
meaning and has to be escaped with '\' if it is to be used literally.
So you just have to remember 2 character groups, which is much easier (and
even future extensions can be done much more easily).

The only thing you lose is compatibility with tools like grep.
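
The two-group rule is easy to check in Ruby, and `Regexp.escape` applies it to arbitrary input. A small sketch with invented sample strings:

```ruby
# Group 1: alphanumerics are literal until escaped.
puts "S".match?(/S/)     # a literal S
puts " ".match?(/\S/)    # \S means "non-whitespace", so a space does not match

# Group 2: punctuation may be special, and escaping it makes it literal.
puts "a+b".match?(/\Aa\+b\z/)

# Regexp.escape does the escaping for you.
literal = "1+1=2 (really?)"
escaped = Regexp.escape(literal)
puts escaped
puts Regexp.new("\\A#{escaped}\\z").match?(literal)
```
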

Stony
(42)

···

Daniel Carrera dcarrera@math.umd.edu wrote:

I guess it all started with the shell’s “wildcards”, which were extended
by grep, awk and sed. Then came Perl which extended those again.

Each extension caused a slight change in syntax, and an increase in
functionality. Thus, you will find that they are all very similar, but
not quite identical (and each is a functional superset than its ancestor).

======================================
The Answer is 42.
And I am the Answer.
Now I am looking for the Question.

Oops!

···

----- Original Message -----
From: “Chris Pine” nemo@hellotree.com

<^ 'abc'* [^'!@#$']>

which would match ‘abba’ and ‘ccr’, but not ‘bab!’.

Obviously not! It would match ‘&’, ‘abcd’, ‘abcabca’, etc.

(I should have waited until I was awake…)

Chris

Well, a long time ago, in a very different world, there was a pattern
matching language called Snobol (or rather SNOBOL as punched cards
didn’t have lower case :-). 30+ years ago I used to use it for the
sort of quick string manipulating hacks that Ruby and Perl often get
used for today. SNOBOL had “patterns” which weren’t exactly today’s
regexps but were very similar. I’ve often thought that a modern
variation of Snobol patterns would make a more user friendly regexp
syntax.

[5 minutes later and I’ve found my old SNOBOL4 language manual,
copyright 1971. This takes me back.]

[snip interesting SNOBOL stuff]

First of all, it does my heart good to know that I am not
actually the oldest person on this list. When you were
programming in SNOBOL, I was barely in sixth grade, and
the only computers I knew of were in Star Trek reruns.

More on-topic: I had an idea for “de-obfuscating” regexes
a year or more ago; and I may even have mentioned it on
the list. In fact, I think I did. Generally I wanted to
use a notation that was more like a programming language
(in a vein similar to the SNOBOL stuff you showed us)
which could (e.g.) be put into here-documents and compiled
into regular Ruby regexes.

But eventually I was convinced that the Rockit parser could
do everything I wanted to do (though in a somewhat different
fashion, no doubt). So I abandoned that idea.

I’ve never looked at Rockit much. I assume it’s in the RAA.
But I’d be curious to know what a SNOBOL hacker thinks
of it.

Thanks,
Hal

···

----- Original Message -----
From: “Arthur Chance” {spamtrap}@qeng-ho.org
Newsgroups: comp.lang.ruby
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Wednesday, April 16, 2003 8:12 AM
Subject: Re: regular expressions

Very interesting. Refer to the post I just made
replying to Arthur Chance.

This is in some ways similar to my idea. But I
agree, I find it a little verbose.

I would have done it something like this:

myreg = RegexLang.new(<<EOF)
  # This would be a full-featured regex
  # language "script" with keywords instead
  # of punctuation; comments allowed; etc.
  blah blah blah…
EOF

myreg.is_a? Regexp # true

And so on…

But again, someone said Rockit was a better
way to go. YMMV.

Cheers,
Hal

···

----- Original Message -----
From: “Chris Pine” nemo@hellotree.com
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Wednesday, April 16, 2003 10:14 AM
Subject: Re: regular expressions

It seems to me you might like this:
http://mywebpages.comcast.net/mdschneider/ruby/index.html#REGTREE

Hmmm… I really like what he has to say on his webpage, but that is just
way too verbose.

I would have done it something like this:

myreg = RegexLang.new(<<EOF)
  # This would be a full-featured regex
  # language "script" with keywords instead
  # of punctuation; comments allowed; etc.
  blah blah blah…
EOF

···

----- Original Message -----
From: “Hal E. Fulton” hal9000@hypermetrics.com


That’s very much like what I was thinking of, too! I was thinking that we
should also send a binding along, so we could use local variables if we
wanted to.

About Rockit:

It seems like Rockit might indeed be the way to implement this, but that
says nothing about the syntax/grammar to use in the heredoc.

Also, this means we could get rid of the delimiters altogether, which is
nice. The only thing I don’t like is that we wouldn’t be able to toss a
small one on one line, and we would have to precede them all with:

RegexLang.new(binding,<<EOF)

or something.

Chris

In article 041c01c3042f$2105ad20$0300a8c0@austin.rr.com,
“Hal E. Fulton” hal9000@hypermetrics.com writes:

I would have done it something like this:

myreg = RegexLang.new(<<EOF)
  # This would be a full-featured regex
  # language "script" with keywords instead
  # of punctuation; comments allowed; etc.
  blah blah blah…
EOF

myreg.is_a? Regexp # true

And so on…

I wrote such library: [RAA:abnf].

It can be used as

myreg = /\A#{ABNF.regexp <<'End'}\z/o
… ABNF(RFC2234) description …
End

myreg.is_a? Regexp # true

Of course, it handles only some (regular) subset of ABNF.

···


Tanaka Akira

“Hal E. Fulton” hal9000@hypermetrics.com writes:

From: “Arthur Chance” {spamtrap}@qeng-ho.org
Newsgroups: comp.lang.ruby
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Wednesday, April 16, 2003 8:12 AM
Subject: Re: regular expressions

[Me wittering about Snobol]

[snip interesting SNOBOL stuff]

First of all, it does my heart good to know that I am not
actually the oldest person on this list. When you were
programming in SNOBOL, I was barely in sixth grade, and
the only computers I knew of were in Star Trek reruns.

Excuse me while I go and quietly crumble in the corner. :slight_smile:

As you get older either the memory goes or the mental arithmetic
abilities, but I can’t remember which. However, a recalculation shows
me that it was only a mere 27-28 years ago that I started using
Snobol. Even with that correction it was still back when I had far
more hair and far less waist. :slight_smile:

More on-topic: I had an idea for “de-obfuscating” regexes
a year or more ago; and I may even have mentioned it on
the list. In fact, I think I did. Generally I wanted to
use a notation that was more like a programming language
(in a vein similar to the SNOBOL stuff you showed us)
which could (e.g.) be put into here-documents and compiled
into regular Ruby regexes.

But eventually I was convinced that the Rockit parser could
do everything I wanted to do (though in a somewhat different
fashion, no doubt). So I abandoned that idea.

I’ve never looked at Rockit much. I assume it’s in the RAA.
But I’d be curious to know what a SNOBOL hacker thinks
of it.

I wouldn’t exactly describe myself as a SNOBOL hacker (the main
languages I used at the time were Algol68 + various assemblers),
SNOBOL was just this great tool for mashing strings around, much like
Ruby and Perl today. (Admission of dirty secrets: I used it mainly to
take the fixed format output from Fortran programs and massage it into
entirely differently shaped fixed format input for other Fortran
programs. The people who I did it for thought it was miraculous, just
the same as suits used to Office do today when watching an ad-hoc
script in Ruby/Perl. Some things never change.)

Anyway, I hadn’t seen Rockit before you mentioned it, so this is an
off the top of the head response. I have seriously used lexical and
parser generators as well as hand writing lexers and parsers over the
years (part of my career has been in academic CS, part designing and
implementing proprietary languages for industry including one (Magik)
that’s remarkably like Ruby in feel. Matz and I obviously absorbed the
same influences.)

My general feeling is that regexps or similar ideas like Snobol
patterns are at a different level from parsing. Of course, the fact
that Rockit (like a few other parser generators) generates both lexer
and parser from the same input rather blurs the distinction, but I
tend to think lexing is like brick laying and parsing is like
architecture - you generally wouldn’t want one job done by the
practitioner of the other.

The line from the SourceForge website

rockit-generated parsers builds the AST; NO need to write “action
code” in the grammar. “Action code” separated from grammar

rather makes me wince. To be fair this may be because I grew up with
LL(1) and LALR(1) based parser generators, where one often has to warp
the grammar away from the ambiguous grammar that would produce a
“natural” AST, and GLR parsing may not suffer this problem, but when
working on compilers I want the AST I design, not one a program with
no idea of what I’m going to do with it afterwards generates. (This
could always be me just being an old fogey of course.)

Generally I tend to believe in using tools that work for the job in
hand but have never met one that does everything well. (I spent a
couple of years working in PL/I which tried to be all things to all
men. Shudder.) Rockit looks like it will be a useful tool for
rapidly writing small and useful parser systems in Ruby, especially as
it uses GLR, but it’s overkill for a job regexps can already do.

In truth the parsing problem was really solved for most practical
purposes 10-20 years ago and another parser generator, even a GLR one
is really a “me too” exercise. Now if someone can produce a Ruby
framework for prototyping provably correct AST transformation, plus
optimised code generation automatically derived from a machine
specification, and spit it out in a form that can be used by C code,
that would be wonderful. But that’s a long way from regexp notation
where we started. Maybe what’s needed is Ruby code that takes a regexp
denotation and explains it in English (or whatever human language is
preferred), just like the program that turns C declarations into
English.

···


Glad we’re thinking alike… maybe this idea
is not dead after all?

Passing in a binding is a good idea. But I
think it should be the second param and default
to nil.

I also think there should be a debugging option,
so we can print out submatches and such, if we
need to.

And I favor a way to bind submatches to Ruby
variables within the notation, so that we might
almost never need to call #match and do explicit
assignments.
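
Ruby itself (1.9 and later) already offers a limited form of this: when a regexp literal with named groups sits on the left of `=~`, a successful match assigns each group to a local variable of the same name. A minimal sketch, with the phone number and group names invented for the example:

```ruby
str = "(800) 555-1234"

# The literal must be on the left-hand side of =~ for the
# named groups to become local variables.
if /\((?<area_code>\d{3})\) (?<rest>\d{3}-\d{4})/ =~ str
  puts area_code
  puts rest
end
```

This only works with a literal regexp, not with one stored in a variable or built by interpolation, so it would not carry over directly to a compiled notation.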

If the notation is “Ruby-like” (i.e., with
statement terminators allowed but optional)
we could still do one-liners. So these two
would be equivalent:

pattern = RegexLang.new(<<EOF)
foo
bar
baz
EOF

pattern2 = RegexLang.new(“foo;bar;baz”)

Now, your thoughts? Or those of others?

Hal

···

----- Original Message -----
From: “Chris Pine” nemo@hellotree.com
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Wednesday, April 16, 2003 10:57 AM
Subject: Re: regular expressions

----- Original Message -----
From: “Hal E. Fulton” hal9000@hypermetrics.com

I would have done it something like this:

myreg = RegexLang.new(<<EOF)
  # This would be a full-featured regex
  # language "script" with keywords instead
  # of punctuation; comments allowed; etc.
  blah blah blah…
EOF

That’s very much like what I was thinking of, too! I was thinking that we
should also send a binding along, so we could use local variables if we
wanted to.

About Rockit:

It seems like Rockit might indeed be the way to implement this, but that
says nothing about the syntax/grammar to use in the heredoc.

Also, this means we could get rid of the delimiters altogether, which is
nice. The only thing I don’t like is that we wouldn’t be able to toss a
small one on one line, and we would have to precede them all with:

RegexLang.new(binding,<<EOF)

or something.

That is very interesting. I have never
heard of ABNF. Is that the same as EBNF?

I will look at it.

Hal

···

----- Original Message -----
From: “Tanaka Akira” akr@m17n.org
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Wednesday, April 16, 2003 12:40 PM
Subject: Re: regular expressions

I wrote such library: [RAA:abnf].

It can be used as

myreg = /\A#{ABNF.regexp <<'End'}\z/o
… ABNF(RFC2234) description …
End

myreg.is_a? Regexp # true

Of course, it handles only some (regular) subset of ABNF.

First of all, it does my heart good to know that I am not
actually the oldest person on this list. When you were
programming in SNOBOL, I was barely in sixth grade, and
the only computers I knew of were in Star Trek reruns.

Excuse me while I go and quietly crumble in the corner. :slight_smile:

As you get older either the memory goes or the mental arithmetic
abilities, but I can’t remember which. However, a recalculation shows
me that it was only a mere 27-28 years ago that I started using
Snobol. Even with that correction it was still back when I had far
more hair and far less waist. :slight_smile:

Ha… don’t take it personally. I’m constantly reminded that
some of these people are current undergraduates, and were born
in 1983 or so. In the perspective of many such people, I am
nothing but a dinosaur, I who learned Pascal in 1980. No worries.
We’ve all been young, and we will all either grow old or
die young. Eternity is very democratic.

[snippage]

Anyway, I hadn’t seen Rockit before you mentioned it, so this is an
off the top of the head response. I have seriously used lexical and
parser generators as well as hand writing lexers and parsers over the
years (part of my career has been in academic CS, part designing and
implementing proprietary languages for industry including one (Magik)
that’s remarkably like Ruby in feel. Matz and I obviously absorbed the
same influences.)

My general feeling is that regexps or similar ideas like Snobol
patterns are at a different level from parsing. Of course, the fact
that Rockit (like a few other parser generators) generates both lexer
and parser from the same input rather blurs the distinction, but I
tend to think lexing is like brick laying and parsing is like
architecture - you generally wouldn’t want one job done by the
practitioner of the other.

Well, frankly I can’t remember how I reached the
conclusion that Rockit was superior to what
I had conceived. Search the archives if you wish,
it’s in there. :slight_smile:

[snip comments on parsers]

I don’t know enough about parsers to fully grasp what
you’re saying. But it sounds OK to me. :slight_smile:

Hal

···

----- Original Message -----
From: “Arthur Chance” {spamtrap}@qeng-ho.org
Newsgroups: comp.lang.ruby
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Wednesday, April 16, 2003 1:14 PM
Subject: Re: regular expressions

Hi –

Glad we’re thinking alike… maybe this idea
is not dead after all?

Passing in a binding is a good idea. But I
think it should be the second param and default
to nil.

I also think there should be a debugging option,
so we can print out submatches and such, if we
need to.

And I favor a way to bind submatches to Ruby
variables within the notation, so that we might
almost never need to call #match and do explicit
assignments.

If the notation is “Ruby-like” (i.e., with
statement terminators allowed but optional)
we could still do one-liners. So these two
would be equivalent:

pattern = RegexLang.new(<<EOF)
foo
bar
baz
EOF

pattern2 = RegexLang.new(“foo;bar;baz”)

Now, your thoughts? Or those of others?

I have a few questions – sort of just “academic” interest, as I am
(as you know) one of those freaky people who think regular expressions
are incredibly cool and elegant and have no intention of abandoning
them :slight_smile:

The questions being… what exactly would be happening in the above?
In what sense is it a “language”? And at what point would you be
matching something? Could you show a mock-up example of a whole real
case? Do you know the way to San Jose?

David

···

On Thu, 17 Apr 2003, Hal E. Fulton wrote:


David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

In article 052a01c3045c$278907a0$0300a8c0@austin.rr.com,
“Hal E. Fulton” hal9000@hypermetrics.com writes:

That is very interesting. I have never
heard of ABNF. Is that the same as EBNF?

They have slightly different syntax but represent the same set.

···


Tanaka Akira

I have a few questions – sort of just “academic” interest, as I am
(as you know) one of those freaky people who think regular expressions
are incredibly cool and elegant and have no intention of abandoning
them :slight_smile:

I’ll give you my opinion, speaking for no one
else’s idea or project.

Regexes are fine, and I wouldn’t want to abandon them.

It’s the notation that makes my eyes hurt. It looks
like line noise. It’s dense, terse, compressed,
cryptic.

The questions being… what exactly would be happening in the above?
In what sense is it a “language”?

I’d like to write a complex regex in a more readable
form which could then be “compiled” to a Ruby regex.
(Now that I think about it, the analogy between regex
notation and machine language is not a bad one.)

Now, I’ll be the first to admit that in most cases,
it just doesn’t matter much. Most of the time I would
still use a “plain” regular expression.

But there is a complexity threshold where I would
rather have a more verbose, readable, multi-line
notation. Whether it’s considered a “language” is
debatable – in fact, the same is true for programming
“languages” in general. :wink:
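
Part of the way there already exists: the `x` (extended) flag lets a Ruby regexp span several lines with whitespace and comments. A small sketch of a phone-number pattern written that way; the group names are invented for the example.

```ruby
phone = /
  \( (?<area_code> \d{3} ) \)    # area code inside literal parentheses
  [ ]                            # one space; bare whitespace is ignored under x
  (?<rest>
    \d{3} - \d{4}                # local number
  )
/x

m = phone.match("(800) 555-1234")
puts m[:area_code]
puts m[:rest]
```

It is still the same punctuation underneath, just with room to breathe, so it does not replace a keyword-based notation.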

And at what point would you be
matching something? Could you show a mock-up example of a whole real
case?

Now you’re asking me to invent syntax on the spur of
the moment. I’m not good at that. :slight_smile:

Here’s a very rough first effort. (One too simple really.)

phone = RegexLang.new(<<EOF)
  string "("
  digits(3,:area_code,Fixnum)
  # Above: Grab three digits, store in area_code
  # as a Fixnum
  string ") "
  match(:rest) do   # Store this stuff in 'rest' as
    digits(3)       # a String
    string "-"
    digits(4)
  end
EOF

area_code = rest = nil
str = "(800) 555-1234"
phone.match(str)
puts area_code # 800
puts rest # 555-1234
area_code.is_a? Fixnum # true
puts phone.to_r # /\((\d{3})\) (\d{3}-\d{4})/

There are lots of problems here. I just tossed it
off the top of my head.
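
To see how far a few lines get, here is a minimal sketch of such a compiler. `TinyRegexLang` is a hypothetical name and this is not Hal's RegexLang: it covers only `string`, `digits` and `match`, uses named groups instead of assigning Ruby variables, and leaves out the type conversion.

```ruby
# Hypothetical sketch: evaluates a small keyword script and
# accumulates regexp source text in @parts.
class TinyRegexLang
  def initialize(script = nil, &block)
    @parts = []
    script ? instance_eval(script) : instance_eval(&block)
  end

  # Match a literal string, escaping any regexp metacharacters.
  def string(text)
    @parts << Regexp.escape(text)
  end

  # Match exactly count digits, optionally captured under name.
  def digits(count, name = nil)
    body = "\\d{#{count}}"
    @parts << (name ? "(?<#{name}>#{body})" : body)
  end

  # Capture everything built inside the block under name.
  def match(name)
    outer = @parts
    @parts = []
    yield
    inner = @parts.join
    @parts = outer
    @parts << "(?<#{name}>#{inner})"
  end

  # Compile the accumulated parts into an ordinary Regexp.
  def to_r
    Regexp.new(@parts.join)
  end
end

phone = TinyRegexLang.new(<<~'EOF')
  string "("
  digits 3, :area_code   # three digits, captured as area_code
  string ") "
  match(:rest) do
    digits 3
    string "-"
    digits 4
  end
EOF

m = phone.to_r.match("(800) 555-1234")
puts m[:area_code]
puts m[:rest]
```

Because the script is evaluated with `instance_eval`, it is really just Ruby, which gives comments and one-liners (`"foo;bar;baz"`) for free but also means the script can run arbitrary code.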

Do you know the way to San Jose?

Take a left turn at Albuquerque, and make sure
Bugs Bunny is following you.

Hal

···

----- Original Message -----
From: dblack@superlink.net
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Wednesday, April 16, 2003 12:33 PM
Subject: Re: regular expressions

I’d like to write a complex regex in a more readable
form which could then be “compiled” to a Ruby regex.
(Now that I think about it, the analogy between regex
notation and machine language is not a bad one.)

···

----- Original Message -----
From: “Hal E. Fulton” hal9000@hypermetrics.com


It’s important, however, not to confuse “the regular language described by a
regular expression” with “a language for building objects of the class
Regexp”. For example, the current regexp literal is a (yucky) language for
building Regexps, but 'a+' is *not* a member of the regular language it
describes, which only has words with 1 or more 'a's in it. You see what I’m
saying? (I’m not even sure I understood you right!)


Here’s a very rough first effort. (One too simple really.)

phone = RegexLang.new(<<EOF)
  string "("
  digits(3,:area_code,Fixnum)
  # Above: Grab three digits, store in area_code
  # as a Fixnum
  string ") "
  match(:rest) do   # Store this stuff in 'rest' as
    digits(3)       # a String
    string "-"
    digits(4)
  end
EOF


I was hoping for something clearer than regexps, but without giving up too
much in the way of concision (is that a word?):

phone = RegexLang.new(<<EOF)
  '(' area_code=d*3.to_i ') ' rest=(d*3 '-' d*4)
EOF

Feels like too much magic to me, though. Seems like we should be able to do
something closer… closer to pure Ruby…

I’m not seeing it yet, though. :frowning:

We’re trying to define in one fell swoop a (possibly infinite) set of
strings. Could we use some sort of… of… like this:

/a*b/

'a'*(0…) + 'b'

/(a|b){5,7}/

('a' | 'b') * 5…7

Those aren’t great examples, but it seems like it could be done with a fair
amount of overloading/defining in the base classes.

So…

local_number = Digit*3 + '-' + Digit*4
area_code = Digit*3 # .to_i here??
phone_number = '(' + area_code + ') ' + local_number

Oh, I don’t know. It’s a start. At least it looks like Ruby and not Perl!
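
The overloading idea can be sketched without touching the base classes by wrapping patterns in a small class of their own. `Pat` below is a hypothetical name, not an existing library, and the sketch always matches the whole string. It assumes Ruby 2.6 or later for the endless range `(0..)`.

```ruby
# Hypothetical sketch: a pattern object that carries regexp source text.
class Pat
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # A pattern matching text literally.
  def self.lit(text)
    new(Regexp.escape(text))
  end

  # Plain strings on the right-hand side are treated as literals.
  def self.coerce(obj)
    obj.is_a?(Pat) ? obj : lit(obj.to_s)
  end

  # Concatenation.
  def +(other)
    Pat.new(source + Pat.coerce(other).source)
  end

  # Alternation, wrapped in a non-capturing group.
  def |(other)
    Pat.new("(?:#{source}|#{Pat.coerce(other).source})")
  end

  # Repetition: Integer gives {n}, Range gives {min,max},
  # and an endless range gives {min,}.
  def *(count)
    quantifier =
      case count
      when Integer then "{#{count}}"
      when Range
        max = count.end
        max -= 1 if max && count.exclude_end?
        "{#{count.begin},#{max}}"
      else
        raise ArgumentError, "expected Integer or Range, got #{count.inspect}"
      end
    Pat.new("(?:#{source})#{quantifier}")
  end

  # Compile to a Regexp anchored at both ends of the string.
  def to_r
    Regexp.new("\\A(?:#{source})\\z")
  end
end

digit = Pat.new("\\d")

a_star_b      = Pat.lit("a") * (0..) + "b"          # like /a*b/
five_to_seven = (Pat.lit("a") | "b") * (5..7)       # like /(a|b){5,7}/
local_number  = digit * 3 + "-" + digit * 4
phone_number  = Pat.lit("(") + digit * 3 + ") " + local_number

puts a_star_b.to_r.match?("aaab")
puts five_to_seven.to_r.match?("abab")
puts phone_number.to_r.match?("(800) 555-1234")
```

Ruby's operator precedence helps here: `*` binds tighter than `+`, so `digit * 3 + "-"` reads the way a regexp does. The leftmost operand has to be a `Pat`, though, which is why the examples start with `Pat.lit`.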

Speaking of which, I went to read about the changes to regexps in Perl 6.
It was in one of Larry Wall’s apocalypses. It drove home several important
points for me, but really only one of them was fit for a public forum :slight_smile:

  • Just because you can see the problem, that need not imply you have any
    clue about the solution. (Obviously, this goes for myself as well as for
    LW!)

Take a left turn at Albuquerque

I did that once! On a road trip. It was cool…

Chris