About Regular Expressions

* James Edward Gray II <james@grayproductions.net> [Nov 18, 2004 19:40]:

> My opinion is this: Regexps are for CS-heads -- those who really
> love programming for its own sake. Regexps are about as terse
> and cryptic as one can get, and thus have a steep learning curve.

My experience has been the opposite.

At my wife's company, they spend a significant portion of every day
importing raw text reports from their database into Excel for various
uses. Some of the reports do not go into Excel well at all. A large
chunk of most employees' day is spent cleaning up these reports, by
hand. (I'm considering one of these reports for a future Ruby Quiz, if
that gives you any idea how wonky they can be.)

When my wife came to me for help and showed me the problem, the
solution turned out to be simple. I helped her download a simple text
editor with Regular Expression search and replace, over the phone.
That evening, I taught her a useful subset of Regular Expression and
together we built and printed a "cheat sheet" she could take to work
with her.

Yet, she came to you. Wouldn't it have been great if this kind of thing
had been so obvious to anyone that there wasn't a need for a programmer
to point it out to someone with a search and replace problem?

My wife is no programmer. She's a slightly above-average computer
user. She's great with Excel and can record macros, but she would
never think of editing one, by way of example. It literally took me
about three hours to get her doing useful things with Regular
Expression. Today, she's the wizard at her company they all come to
for help. She's easily the most productive employee they have, when it
comes to any kind of reporting work.

That's great! Hopefully they'll take the time to actually learn what
she's teaching them and spread the knowledge themselves.

I'm not trying to imply anything about you or your beliefs. I'm just
saying that some people don't seem to have too much trouble with them.
I know my wife still keeps that cheat sheet right next to her computer.
Perhaps a trick like that would be of some use to you or other folks
who struggle with Regular Expression.

The regular expression syntax we are using today is a potpourri of
various ideas, extensions, and cruft that's been collected over the
years. It was never designed, just as Unix was never designed. Yet,
they are both extremely useful. They are, however, not easy to use for
anyone unfamiliar with them. Regular expressions stem from mathematical
research and therefore most of the notation is very terse.
Mathematicians prefer terseness and so do many programmers, but there is
no need for terseness in something like regular expressions used for
search and replace. That's why EMACS and Vim use extended versions of
BRE not ERE for example (which is a point though as \ is a bitch to
type).

Remember, regular expressions match precisely the regular languages,
which are, according to Chomsky, the simplest of languages. Why, then,
do they have to be so hard to specify?
  nikolai

--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

James Edward Gray II wrote:

My opinion is this: Regexps are for CS-heads -- those who really love
programming for its own sake. Regexps are about as terse and cryptic as
one can get, and thus have a steep learning curve.

My experience has been the opposite.

At my wife's company, they spend a significant portion of every day importing raw text reports from their database into Excel for various uses. Some of the reports do not go into Excel well at all. A large chunk of most employees' day is spent cleaning up these reports, by hand. (I'm considering one of these reports for a future Ruby Quiz, if that gives you any idea how wonky they can be.)

...

My experience is that I use REs on a sporadic basis, and tend to learn and remember just enough to get a task done. But by the time I need to write another RE, if it isn't something trivial, I have to go looking for docs and such. And the docs often do not explain how to do certain things, or tell me if it is even possible.

(An example: Can I write an RE that tells me if a given string contains all substrings in a given set of substrings, irrespective of the order of the substrings in either the target string or the set of substrings? )

But, to be fair, some of these sorts of requests may be beyond a basic cookbook or newbie intro Web page. So, it's either get the O'Reilly book, post a question, or hack away.

In general, I don't really care if people are too lazy to look things up, if they don't make it a regular habit and overrun the list. I scan message headers, do a sort of triage for attention, and ignore many, many things. Pretty painless. So, stupid questions are welcome. (I'd hate to feel reluctant to ask a stupid question myself; I have too many of them.)

On a side note, I wonder what it would take to write a domain language, parsable by Ruby, that let one write REs in plain-ish English? Maybe take some of the mystery out of regular expressions for most of the common cases. (Or would that just get people dependent on too much hand-holding?)
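
To make the idea concrete, here is a rough sketch of what such a plain-ish English builder could look like. Everything in it (the ReadableRegexp class and its literal, digits, whitespace, and either methods) is invented for illustration; it is not the API of any existing library, and it only covers a handful of common cases.

```ruby
# Hypothetical mini-DSL for building regexps from readable method calls.
# All names here are made up for this sketch.
class ReadableRegexp
  def initialize
    @parts = []
  end

  # Match the given text literally, escaping any metacharacters.
  def literal(text)
    @parts << Regexp.escape(text)
    self
  end

  # Match exactly `count` digits, or one or more digits if no count is given.
  def digits(count = nil)
    @parts << (count ? "\\d{#{count}}" : "\\d+")
    self
  end

  # Match one or more whitespace characters.
  def whitespace
    @parts << "\\s+"
    self
  end

  # Match any one of the given literal alternatives.
  def either(*alternatives)
    escaped = alternatives.map { |a| Regexp.escape(a) }
    @parts << ("(?:" + escaped.join("|") + ")")
    self
  end

  # Assemble the collected parts into an ordinary Regexp.
  def to_regexp
    Regexp.new(@parts.join)
  end
end

# A date such as 2004-11-18, spelled out step by step.
date = ReadableRegexp.new.
  digits(4).literal("-").digits(2).literal("-").digits(2).
  to_regexp

puts "matched" if date =~ "Posted on 2004-11-18"
```

The result is an ordinary Regexp, so it can be used anywhere a hand-written one could. The trade-off is the one raised in the question: the intent is easier to read, but the pattern gets a good deal longer.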

James

On Nov 18, 2004, at 9:44 AM, trans. (T. Onoma) wrote:

Hi --

On Fri, 19 Nov 2004, James Britt wrote:

On a side note, I wonder what it would take to write a domain language,
parsable by Ruby, that let one write REs in plain-ish English? Maybe
take some of the mystery out of regular expressions for most of the
common cases. (Or would that just get people dependent on too much
hand-holding?)

Florian Gross has written such a thing. Paging flgr....

David

--
David A. Black
dblack@wobblini.net

On a side note, I wonder what it would take to write a domain language,
parsable by Ruby, that let one write REs in plain-ish English? Maybe
take some of the mystery out of regular expressions for most of the
common cases. (Or would that just get people dependent on too much
hand-holding?)

Take a look at Florian Groß' Regexp::English.
Short example: http://www.rubycookbook.org/cookbook/view/ReadableRegularExpressions

Also Simon Strandgaard's re might interest you; you can do things like

puts /ab|cd(?=x)|xxx/.tree

+-Alternation
  +-Sequence
  > +-Inside set="a"
  > +-Inside set="b"
  +-Sequence
  > +-Inside set="c"
  > +-Inside set="d"
  > +-Lookahead positive
  > +-Inside set="x"
  +-Sequence
    +-Inside set="x"
    +-Inside set="x"
    +-Inside set="x"
=> nil

On Fri, Nov 19, 2004 at 04:26:08AM +0900, James Britt wrote:

--
Hassle-free packages for Ruby?
RPA is available from http://www.rubyarchive.org/

That's the only thing I hate in emacs. I always have to look up the regular expression syntax, or do some trial and error until I get the emacs re escapings right. Why don't they simply use the syntax everyone else uses?

I do not see where the emacs syntax is simpler to read/write than ruby/perl/egrep syntax.

regards,

Brian

On Fri, 19 Nov 2004 04:20:29 +0900 Nikolai Weibull <mailing-lists.ruby-talk@rawuncut.elitemail.org> wrote:

[snip]
Mathematicians prefer terseness and so do many programmers, but there is
no need for terseness in something like regular expressions used for
search and replace. That's why EMACS and Vim use extended versions of
BRE not ERE for example (which is a point though as \ is a bitch to
type).

--
Brian Schröder
http://www.brian-schroeder.de/

"James Britt" <jamesUNDERBARb@neurogami.com> schrieb im Newsbeitrag
news:419CF779.2080806@neurogami.com...

all substrings in a given set of substrings, irrespective of the order
of the substrings in either the target string or the set of
substrings? )

Yes you can, but it's going to be uuuuugly and slooooow if you have more
than two substrings. A set of one regexp per substring is likely to be
more efficient:

substrings = %w{foo bar baz}
rxs = substrings.map {|s| Regexp.new(Regexp.escape(s))}

string = "klajsd askdjkahs bar asjdajsd asdbazagdjhagsdh f0dfdfuufoosds"
puts "Got it: #{string}" if rxs.all? {|rx| rx =~ string}
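
For comparison, here is a minimal sketch of the single-regexp variant, assuming one positive lookahead per substring is an acceptable way to express "contains all of these, in any order". The helper name all_substrings_regexp is made up for illustration and does not come from any library.

```ruby
# Build one regexp that requires every substring to occur somewhere in the
# target, in any order, using one positive lookahead per substring.
# (Hypothetical helper name, for illustration only.)
def all_substrings_regexp(substrings)
  lookaheads = substrings.map { |s| "(?=.*#{Regexp.escape(s)})" }.join
  # \A anchors the lookaheads at the start of the string; MULTILINE lets
  # "." cross newlines so substrings on later lines are still found.
  Regexp.new("\\A#{lookaheads}", Regexp::MULTILINE)
end

rx = all_substrings_regexp(%w{foo bar baz})

string = "klajsd askdjkahs bar asjdajsd asdbazagdjhagsdh f0dfdfuufoosds"
puts "Got it: #{string}" if rx =~ string
```

Each lookahead scans the string again from the start, and the pattern grows with every substring added, so the one-regexp-per-substring version above is usually the easier one to read and maintain.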

On a side note, I wonder what it would take to write a domain language,
parsable by Ruby, that let one write REs in plain-ish English? Maybe
take some of the mystery out of regular expressions for most of the
common cases. (Or would that just get people dependent on too much
hand-holding?)

Such a language with the same expressiveness as typical regexps would be
too verbose for me. But I think I remember someone did something similar
with Ruby. Can't quite remember who it was or a mail thread subject but I
do believe someone has attempted to do something in that direction.

Kind regards

    robert

[snip english (sorry)]

Also Simon Strandgaard's re might interest you; you can do things like

>> puts /ab|cd(?=x)|xxx/.tree
+-Alternation
  +-Sequence
  > +-Inside set="a"
  > +-Inside set="b"
  +-Sequence
  > +-Inside set="c"
  > +-Inside set="d"
  > +-Lookahead positive
  > +-Inside set="x"
  +-Sequence
    +-Inside set="x"
    +-Inside set="x"
    +-Inside set="x"
=> nil

Thanks Mauricio for mentioning this.

It can be obtained through raa (or via rpa)

http://raa.ruby-lang.org/project/regexp/

btw: does anyone use this package?
(got ideas for improvement?)

On Thursday 18 November 2004 21:19, Mauricio Fernández wrote:

--
Simon Strandgaard

* Brian Schröder <ruby@brian-schroeder.de> [Nov 18, 2004 23:50]:

> Mathematicians prefer terseness and so do many programmers, but
> there is no need for terseness in something like regular expressions
> used for search and replace. That's why EMACS and Vim use extended
> versions of BRE not ERE for example (which is a point though as \ is
> a bitch to type).

That's the only thing I hate in emacs. I always have to look up the
regular expression syntax, or do some trial and error until I get the
emacs re escapings right. Why don't they simply use the syntax
everyone else uses?

Good question. See below.

I do not see where the emacs syntax is simpler to read/write than
ruby/perl/egrep syntax.

It isn't simpler. It is less mathematical if you will. The idea is
that most often you want to search for a fixed string and thus /(/ should
match '(' and not be a metacharacter for grouping (and capturing if
that's required). This makes much more sense in Vi-based editors, as
there is only one search command (by default), namely '/'. In EMACS you
have a choice of searching for a regular expression and a fixed string,
by using different keybindings. Thus, for EMACS, using BRE over ERE is
just plain silly (I'm trying to use kind words from now on). Anyway,
the main problem with the current state of affairs is that there are so
many incompatible regular expression implementations, all with their own
quirks and special syntax. The biggest evil-doer in my opinion is
Perl5. There's too much going on. They have crammed context-free
language matching into regular expressions, while at the same time
making them NP-complete. Sure, look-around has its uses, and there's a
lot of nice stuff you can do. Is it necessary, though? Hardly. I
won't argue this point any further, but the bottom line is that regular
expressions have been abused and need some time to recuperate.
  nikolai
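
The fixed-string point can be illustrated from the Ruby side, where the syntax is the ERE/Perl-style one and "(" is a metacharacter. This is only a sketch; the search text "f(x)" is a made-up example.

```ruby
# In Ruby's regexp syntax "(" starts a group, so a literal search string
# has to be escaped before it is used as a pattern.
needle  = "f(x)"                               # hypothetical search text
pattern = Regexp.new(Regexp.escape(needle))    # escapes the parentheses

puts "found literal f(x)" if "y = f(x) + 1" =~ pattern

# Unescaped, the parentheses group instead of matching themselves:
grouping = /f(x)/
puts "unescaped pattern matches 'fx'" if "fx" =~ grouping
puts "unescaped pattern misses 'f(x)'" unless "f(x)" =~ grouping
```

In a BRE-style syntax the roles are swapped: a bare "(" matches itself and the backslashed form does the grouping, which is the behaviour described above for editors with a single search command.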

Mauricio Fernández wrote:

On Fri, Nov 19, 2004 at 04:26:08AM +0900, James Britt wrote:

On a side note, I wonder what it would take to write a domain language, parsable by Ruby, that let one write REs in plain-ish English? Maybe take some of the mystery out of regular expressions for most of the common cases. (Or would that just get people dependent on too much hand-holding?)

Take a look at Florian Groß' Regexp::English.
Short example: http://www.rubycookbook.org/cookbook/view/ReadableRegularExpressions

Also Simon Strandgaard's re might interest you; you can do things like

It truly is a magical Ruby world. Feels like Christmas already.

Thanks,

James

Robert Klemme wrote:

"James Britt" <jamesUNDERBARb@neurogami.com> schrieb im Newsbeitrag
news:419CF779.2080806@neurogami.com...

all substrings in a given set of substrings, irrespective of the order
of the substrings in either the target string or the set of
substrings? )

Yes you can, but it's going to be uuuuugly and slooooow if you have more
than two substrings. A set of one regexp per substring is likely to be
more efficient:

substrings = %w{foo bar baz}
rxs = substrings.map {|s| Regexp.new(Regexp.escape(s))}

string = "klajsd askdjkahs bar asjdajsd asdbazagdjhagsdh f0dfdfuufoosds"
puts "Got it: #{string}" if rxs.all? {|rx| rx =~ string}

Thank you. I would be searching on possibly any number (but figure 5 or 6 as an average) of such substrings. The all? syntax is nice and clear.
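
If the substrings are always literal text, the same check can probably be done without regexps at all. A small sketch, assuming plain String#include? is enough; the helper name missing_substrings is invented here, and it also reports which substrings were not found.

```ruby
# Hypothetical helper: returns the substrings that do NOT occur in string.
def missing_substrings(string, substrings)
  substrings.reject { |s| string.include?(s) }
end

substrings = %w{foo bar baz}
string = "klajsd askdjkahs bar asjdajsd asdbazagdjhagsdh f0dfdfuufoosds"

missing = missing_substrings(string, substrings)
if missing.empty?
  puts "Got it: #{string}"
else
  puts "Missing: #{missing.join(', ')}"
end
```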

On a side note, I wonder what it would take to write a domain language,
parsable by Ruby, that let one write REs in plain-ish English? Maybe
take some of the mystery out of regular expressions for most of the
common cases. (Or would that just get people dependent on too much
hand-holding?)

Such a language with the same expressiveness as typical regexps would be
too verbose for me. But I think I remember someone did something similar
with Ruby. Can't quite remember who it was or a mail thread subject but I
do believe someone has attempted to do something in that direction.

Yes, the details were posted here the other day. I thought it would be handy for folks who have a hard time remembering certain syntax, or want the intent of the regexp to be more clear.

Thanks,

James

Kind regards

    robert

Mauricio Fernández wrote:

On Fri, Nov 19, 2004 at 04:26:08AM +0900, James Britt wrote:

On a side note, I wonder what it would take to write a domain language, parsable by Ruby, that let one write REs in plain-ish English? Maybe take some of the mystery out of regular expressions for most of the common cases. (Or would that just get people dependent on too much hand-holding?)

Take a look at Florian Groß' Regexp::English.
Short example: http://www.rubycookbook.org/cookbook/view/ReadableRegularExpressions

This is available at http://noegnud.sourceforge.net/.flgr/re_english.zip (for now -- I can't guarantee it being there permanently. Sorry.)