Common regular expressions

Sorry if this is a stupid question, but I am new to Ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns? For example:

  • valid email addresses
  • valid http address
  • etc.

TIA

-Michael Garriss

Hi Michael,

Has anyone compiled a collection of ‘common’ regular expression
patterns.

Never heard of such a list. The O’Reilly book “Mastering Regular
Expressions” probably contains a whole lot.

  • valid email addresses

Short answer: /^\w+@[\w.]+\w+$/

Take an email address: “joe97_smith@some.domain.org”

  1. The email starts with “word” characters (letters, underscores and
    numbers):

    /^\w+/

    ^    Start
    \w   Word character

  2. Followed by an “@” symbol.

    /^\w+@/

  3. Followed by a collection of word characters and dots and ending in word
    characters.

    /^\w+@[\w.]+\w+$/

    [\w.]   New class containing either word characters or dots.
    $       End of the string.

Similarly, you can construct REs for other tasks.
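
As a rough illustration in Ruby, the pattern built up above could be tried out like this. The names APPROX_EMAIL and approx_email? are made up for this sketch. One Ruby-specific caveat: ^ and $ match at line boundaries, so the sketch swaps in \A and \z, which anchor to the whole string.

```ruby
# Approximate pattern from the walk-through above, with \A and \z swapped in
# for ^ and $ so that a multi-line string cannot slip past the anchors.
# APPROX_EMAIL and approx_email? are hypothetical names for this sketch.
APPROX_EMAIL = /\A\w+@[\w.]+\w+\z/

def approx_email?(str)
  APPROX_EMAIL.match?(str)
end

puts approx_email?("joe97_smith@some.domain.org")  # the walk-through example
puts approx_email?("joe97_smith@")                 # no domain part
puts approx_email?("first.last@example.org")       # dot in the local part is not covered
```

The pattern stays deliberately approximate: it only covers local parts made of word characters.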

···
  • valid http address
  • etc.

TIA

-Michael Garriss


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

You might look at the URI library at http://arika.org/ruby/uri. It goes a bit beyond
just a regexp for email and http though.
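
For the "valid http address" case, one option is to lean on a URI parser rather than a hand-written regexp. The sketch below assumes Ruby's bundled uri library, which may differ from the package at the link above, and http_url? is a made-up helper name.

```ruby
require "uri"

# Hypothetical helper: true when the string parses as an http or https URI
# with a non-empty host. URI::HTTPS is a subclass of URI::HTTP.
def http_url?(str)
  uri = URI.parse(str)
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

puts http_url?("http://www.ruby-lang.org/en/")
puts http_url?("mailto:joe@example.org")
```

This only checks syntax and scheme; it does not check that the host exists.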

···

On Mon, Jan 27, 2003 at 04:59:15AM +0900, Michael Garriss wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

  • valid email addresses
  • valid http address
  • etc.

TIA

-Michael Garriss


Alan Chen
Digikata Computing
http://digikata.com

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

  • valid email addresses

To be strictly pedantic, I don’t think there is such a thing. For the truly
adventurous, consider bang-paths, %'s and the like which have all been valid at
one point or another. There was a big writeup/FAQ in the perl domain about this
very subject.

The perl people have (well, Damian Conway has):

martin

···

Michael Garriss mgarriss@earthlink.net wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

  • valid email addresses
  • valid http address
  • etc.

The Perl Cookbook has a chapter on regular expressions, and concludes
with about three pages of nifty little ones. It also has a few items
devoted to things like email addresses.

You might be able to view these online at the PLEAC project:
http://pleac.sourceforge.net. The code from the book can also be
downloaded from the O’Reilly website somewhere.

Gavin

···

On Monday, January 27, 2003, 6:59:15 AM, Michael wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

  • valid email addresses
  • valid http address
  • etc.

Try this site:

http://www.regxlib.com/Default.aspx

later…

···

— Michael Garriss mgarriss@earthlink.net wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

  • valid email addresses
  • valid http address
  • etc.

TIA

-Michael Garriss

=====

Mike Thomas
http://www.samoht.com
It’s better backwards



Quoting dcarrera@math.umd.edu, on Mon, Jan 27, 2003 at 05:17:02AM +0900:

Hi Michael,

Has anyone compiled a collection of ‘common’ regular expression
patterns.

Never heard of such a list. The O’Reilly book “Mastering Regular
Expressions” probably contains a whole lot.

Including one for a valid rfc822 email address list, which takes about
a page, no spaces.

  • valid email addresses

Short answer: /^\w+@[\w.]+\w+$/

Take an email address: “joe97_smith@some.domain.org”

How about we take:

"hi y@"."ruby \",master!" ( ... a comment!!) @ u%me . u+me . u-me

It’s syntactically valid, too! Though admittedly unusual…

Cheers,
Sam

Yes, there is actually. Email is a strict protocol, just like FTP and
others. However, there’s so much flexibility in valid email addresses
that you probably want to stick to common email addresses.

···

On Mon, Jan 27, 2003 at 05:53:53AM +0900, Mike Campbell wrote:

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

  • valid email addresses

To be strictly pedantic, I don’t think there is such a thing.


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Perfect! Thank you.

Mike Thomas wrote:

···

Try this site:

regxlib.com

later…

— Michael Garriss mgarriss@earthlink.net wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

  • valid email addresses
  • valid http address
  • etc.

TIA

-Michael Garriss

=====

Mike Thomas
http://www.samoht.com
It’s better backwards



Wrote Mike Thomas mike_thomas@yahoo.com, on Tue, Jan 28, 2003 at 02:58:47AM +0900:

Try this site:

http://www.regxlib.com/Default.aspx

One of those doesn’t allow _ in the domain name,

admin@ensemble_independant.org

neither allows quoted local-parts,

"C=ca,CO=Certicom,CN=Sam Roberts"@inet-gateway.x500.org

and neither allows whitespace between tokens.

sroberts @ uniserve . com

Here’s the version from Mastering Regular Expressions. It’s Perl, so
you’ll have to do a little converting.

A single email address is usually an addr-spec, so if you don’t want to
match things like

Sam Roberts sroberts@uniserve.com

or lists of addresses, you can just look for “Item 2:” below, and expand
that RE. The variables below match the RFC 822 BNF pretty closely.

Cheers,
Sam

···


Sam Roberts sroberts@certicom.com

# (From http://public.yahoo.com/~jfriedl/regex/code.html)
#
# Program to build a regex to match an internet email address,
# from Chapter 7 of Mastering Regular Expressions (Friedl / O'Reilly)
# (http://www.ora.com/catalog/regexp/)
#
# Unoptimized version.
#
# Copyright 1997 O'Reilly & Associates, Inc.

# Some things for avoiding backslashitis later on.
$esc       = '\\\\';      $Period     = '\.';
$space     = '\040';      $tab        = '\t';
$OpenBR    = '\[';        $CloseBR    = '\]';
$OpenParen = '\(';        $CloseParen = '\)';
$NonASCII  = '\x80-\xff'; $ctrl       = '\000-\037';
$CRlist    = '\n\015';    # note: this should really be only \015.

# Items 19, 20, 21
$qtext = qq/[^$esc$NonASCII$CRlist\"]/;               # for within "..."
$dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/;  # for within [...]
$quoted_pair = qq< $esc [^$NonASCII] >;               # an escaped character

# Item 10: atom
$atom_char = qq/[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
$atom = qq<
  $atom_char+     # some number of atom characters...
  (?!$atom_char)  # ...not followed by something that could be part of an atom
>;

# Items 22 and 23, comment.
# Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
$ctext   = qq< [^$esc$NonASCII$CRlist()] >;
$Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
$comment = qq< $OpenParen
                 (?: $ctext | $quoted_pair | $Cnested )*
               $CloseParen >;

$X = qq< (?: [$space$tab] | $comment )* >;  # optional separator

# Item 11: doublequoted string, with escaped items allowed
$quoted_str = qq<
  \" (?:             # opening quote...
    $qtext           #   Anything except backslash and quote
    |                #    or
    $quoted_pair     #   Escaped something (something != CR)
  )* \"              # closing quote
>;

# Item 7: word is an atom or quoted string
$word = qq< (?: $atom | $quoted_str ) >;

# Item 12: domain-ref is just an atom
$domain_ref = $atom;

# Item 13: domain-literal is like a quoted string, but [...] instead of "..."
$domain_lit = qq< $OpenBR                       # [
                  (?: $dtext | $quoted_pair )*  #   stuff
                  $CloseBR                      # ]
>;

# Item 9: sub-domain is a domain-ref or domain-literal
$sub_domain = qq< (?: $domain_ref | $domain_lit ) >;

# Item 6: domain is a list of subdomains separated by dots.
$domain = qq< $sub_domain       # initial subdomain
              (?:               #
                $X $Period      #   if led by a period...
                $X $sub_domain  #   ...further okay
              )*
>;

# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon
$route = qq< \@ $X $domain
             (?: $X , $X \@ $X $domain )*  # further okay, if led by comma
             :                             # closing colon
>;

# Item 5: local-part is a bunch of $word separated by periods
$local_part = qq< $word                       # initial word
                  (?: $X $Period $X $word )*  # further okay, if led by a period
>;

# Item 2: addr-spec is local@domain
$addr_spec = qq< $local_part $X \@ $X $domain >;

# Item 4: route-addr is <route? addr-spec>
$route_addr = qq[ < $X                # leading <
                    (?: $route $X )?  #   optional route
                    $addr_spec        #   address spec
                  $X >                # trailing >
];

# Item 3: phrase
$phrase_ctrl = '\000-\010\012-\037';  # like ctrl, but without tab

# Like atom-char, but without listing space, and uses phrase_ctrl.
# Since the class is negated, this matches the same as atom-char plus space and tab
$phrase_char =
  qq/[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;

$phrase = qq< $word             # one word, optionally followed by...
              (?:
                $phrase_char |  # atom and space parts, or...
                $comment     |  # comments, or...
                $quoted_str     # quoted strings
              )*
>;

# Item #1: mailbox is an addr_spec or a phrase/route_addr
$mailbox = qq< $X                       # optional leading comment
               (?: $addr_spec           # address
                   |                    #   or
                   $phrase $route_addr  # name and address
               ) $X                     # optional trailing comment
>;

###########################################################################

# Here's a little snippet to test it.
# Addresses given on the commandline are described.

my $error = 0;
my $valid;
foreach $address (@ARGV) {
  $valid = $address =~ m/^$mailbox$/xo;
  printf "`$address' is syntactically %s.\n", $valid ? "valid" : "invalid";
  $error = 1 if not $valid;
}
exit $error;
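
Since the thread is about Ruby, here is a hand-ported sketch of just "Item 2: addr-spec" from the Perl program above, not a full translation. The constant and method names are invented for this sketch. It also makes one simplifying assumption: non-ASCII input is rejected up front with String#ascii_only? instead of excluding \x80-\xff inside every character class.

```ruby
# Hand-ported sketch of "Item 2: addr-spec" from the Perl program above.
# All names below are hypothetical; only the structure follows the Perl.

QTEXT       = /[^\\\n\r"]/                     # anything except backslash, CR/LF and quote
DTEXT       = /[^\\\n\r\[\]]/                  # anything except backslash, CR/LF and brackets
QUOTED_PAIR = /\\[\x00-\x7f]/                  # a backslash-escaped character
ATOM_CHAR   = /[^()<>@,;:".\\\[\]\x00-\x20]/   # no specials, space or control characters
ATOM        = /#{ATOM_CHAR}+ (?! #{ATOM_CHAR} )/x

# Comments, with at most one level of nesting, as in the Perl version.
CTEXT   = /[^\\\n\r()]/
CNESTED = /\( (?: #{CTEXT} | #{QUOTED_PAIR} )* \)/x
COMMENT = /\( (?: #{CTEXT} | #{QUOTED_PAIR} | #{CNESTED} )* \)/x
SEP     = /(?: [\x20\t] | #{COMMENT} )*/x      # optional separator, the Perl $X

QUOTED_STR = /" (?: #{QTEXT} | #{QUOTED_PAIR} )* "/x
WORD       = /(?: #{ATOM} | #{QUOTED_STR} )/x
DOMAIN_LIT = /\[ (?: #{DTEXT} | #{QUOTED_PAIR} )* \]/x
SUB_DOMAIN = /(?: #{ATOM} | #{DOMAIN_LIT} )/x
DOMAIN     = /#{SUB_DOMAIN} (?: #{SEP} \. #{SEP} #{SUB_DOMAIN} )*/x
LOCAL_PART = /#{WORD} (?: #{SEP} \. #{SEP} #{WORD} )*/x

# Whole-string match: optional separators around local-part @ domain.
ADDR_SPEC = /\A #{SEP} #{LOCAL_PART} #{SEP} @ #{SEP} #{DOMAIN} #{SEP} \z/x

def rfc822_addr_spec?(str)
  str.ascii_only? && ADDR_SPEC.match?(str)
end

[
  "sroberts @ uniserve . com",
  '"C=ca,CO=Certicom,CN=Sam Roberts"@inet-gateway.x500.org',
  "Sam Roberts <sroberts@uniserve.com>"  # a mailbox, not a bare addr-spec
].each do |address|
  verdict = rfc822_addr_spec?(address) ? "valid" : "invalid"
  puts "`#{address}' is syntactically #{verdict}."
end
```

Interpolating one Regexp into another keeps each piece's own flags, so the small pieces can stay unextended while the larger ones use /x for readability.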

Hi,

Isn’t this the very last example in the book “Mastering Regular
Expressions” by Friedl, i.e., the Appendix B: Email Regex Program? (At
least that is in the first edition.) Be careful though, because when the
regex is expanded into its plain form, it is 6,598 bytes long :-)

Regards,

Bill

···

Mike Campbell michael_s_campbell@yahoo.com wrote:

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

  • valid email addresses

To be strictly pedantic, I don’t think there is such a thing. For the truly
adventurous, consider bang-paths, %'s and the like which have all been valid at
one point or another. There was a big writeup/FAQ in the perl domain about this
very subject.

I only meant to write an “approximate” RE that would work most of the
time. Writing a truly comprehensive RE would be very difficult and
probably not even worth it.

For instance, did you know that the backspace character is technically
allowed? Your address could be:

@domain.com

But who’s really going to have a backspace in their email address? (good
luck getting any email there).
It’s better to just ignore this possibility.

Cheers,

···

On Mon, Jan 27, 2003 at 05:41:17AM +0900, Sam Roberts wrote:

How about we take:

"hi y@"."ruby \",master!" ( ... a comment!!) @ u%me . u+me . u-me

It’s syntactically valid, too! Though admittedly unusual…


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Quoting dcarrera@math.umd.edu, on Mon, Jan 27, 2003 at 06:03:31AM +0900:

> > - valid email addresses
> To be strictly pedantic, I don't think there is such a thing.

Yes, there is actually. Email is a strict protocol, just like FTP and
others. However, there's so much flexibility in valid email addresses
that you probably want to stick to common email addresses.

I think Mike's referring to a Perl conversation about validity,
in which there was some ambiguity in what people mean by valid.

The syntax is completely described, but some people mean by "valid
email address" an email address you can actually send mail to, which
involves doing stuff like making sure the domain name exists and
is reachable. That kind of validity is pretty much impossible to
check, without actually sending a mail and getting a reply!
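
A partial step toward that kind of check is to ask DNS whether the domain part is known at all. The sketch below assumes Ruby's bundled resolv library; domain_resolves? and its dns: keyword are made-up names. It says nothing about whether the mailbox itself exists.

```ruby
require "resolv"

# Hypothetical helper: true when the domain after the last "@" has at least
# one MX record, or failing that an A record. The dns: keyword is an
# assumption of this sketch so a different resolver object can be passed in.
def domain_resolves?(address, dns: Resolv::DNS.new)
  domain = address[/@([^@\s]+)\z/, 1]
  return false if domain.nil?

  mx = dns.getresources(domain, Resolv::DNS::Resource::IN::MX)
  return true unless mx.empty?

  !dns.getresources(domain, Resolv::DNS::Resource::IN::A).empty?
end
```

A call such as domain_resolves?("someone@example.org") needs network access, and a true result still only means the domain is listed in DNS.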

Cheers,
Sam

···

On Mon, Jan 27, 2003 at 05:53:53AM +0900, Mike Campbell wrote:

To be strictly pedantic, I don’t think there is such a thing.

Yes, there is actually. Email is a strict protocol, just like FTP and
others. However, there’s so much flexibility in valid email addresses
that you probably want to stick to common email addresses.

As another person noted, I wasn’t saying you couldn’t check for RFC-822
compliance, but rather that you can’t, via a regex, determine if a mail address
is valid; i.e., it’ll get there.

Here’s the Perl FAQ to which I was referring. This is a shortened version,
which is a shame, as the one I recall from years gone by gave examples using %'s
which could (IIRC) be legally parsed in more than one way, with either, both, or
neither being “valid”. That may have been pre-RFC-822 though, to be fair.

···

=============

How do I check a valid email address?

You can’t.

Remember that without sending mail to the address and seeing whether it
bounces (and even then you face the halting problem), you cannot
determine whether an email address is valid. Even if you apply
the email header standard, you can have problems, because there are deliverable
addresses that aren’t RFC-822 (the mail header standard) compliant,
and addresses that aren’t deliverable which are.

Many are tempted to try to eliminate many frequently-invalid email
addresses with a simple regex, such as /^[\w.-]+\@([\w.-]\.)+\w+$/.
However, this also throws out many valid ones, and says nothing
about potential deliverability, so is not suggested. Instead, see the
${CPAN}/authors/Tom_Christiansen/scripts/ckaddr.gz program, which actually
checks against the full RFC spec (well, modulo nested comments), looks
for addresses you may not wish to accept email to (say, Bill Clinton or
your postmaster), and then makes sure that the hostname given can be
looked up in DNS. It’s not fast, but it works.

Daniel Carrera dcarrera@math.umd.edu writes:

But who’s really going to have a backspace in their email address? (good
luck getting any email there).
It’s better to just ignore this possibility.

And who’s going to have crazy things like pluses or hyphens in their
email address? Please, if you’re going to do email address validation,
do it properly, or you won’t be getting any mail from me. Some might
consider that an unexpected bonus, of course.
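
To make the objection concrete, here is a sketch of how an approximate Ruby pattern might be loosened to accept pluses, hyphens, dots and the %-hack in the local part. LOOSER_EMAIL and looser_email? are made-up names. Requiring at least one dot in the domain is an assumption of this sketch. It is still nowhere near full RFC 822 syntax.

```ruby
# Hypothetical, still approximate pattern:
#   local part: word characters plus . + % -
#   domain:     labels of word characters and hyphens, at least one dot
LOOSER_EMAIL = /\A[\w.+%-]+@[\w-]+(?:\.[\w-]+)+\z/

def looser_email?(str)
  LOOSER_EMAIL.match?(str)
end

puts looser_email?("joe+ruby@some-domain.org")
puts looser_email?("sroberts @ uniserve . com")  # whitespace between tokens is still rejected
```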

···


Given an infinite amount of monkeys an infinite amount of time, an
infinite amount of drafting supplies, and an infinite amount of crack,
they’d come up with Downtown Chicago. – David Jacoby, in the monastery

Quoting dcarrera@math.umd.edu, on Mon, Jan 27, 2003 at 05:53:51AM +0900:

How about we take:

"hi y@"."ruby \",master!" ( ... a comment!!) @ u%me . u+me . u-me

It’s syntactically valid, too! Though admittedly unusual…

I only meant to write an “approximate” RE that would work most of the
time.

Oh, hey, I know that! I wasn’t trying to trash your regexp.

Writing a truly comprehensive RE would be very difficult and
probably not even worth it.

Yep, but + and - in domain names aren’t too uncommon, and you still see email
addresses with the %-hack for uucp in the local-part.

Anyhow, it’s actually not too hard to write a RE from the BNF in RFC822,
particularly if you ignore some stuff deprecated by RFC2822, but
luckily, we don’t have to, because you can find the RE in the
excellent book you recommended, Mastering Regular Expressions.

I picked it up because I couldn’t believe REs were complicated enough to
need a whole book, and then kept reading out of amazement.

Cheers,
Sam

···

On Mon, Jan 27, 2003 at 05:41:17AM +0900, Sam Roberts wrote: