Common regular expressions

Michael_Garriss1 · 26 January 2003 19:59

Sorry if this is a stupid question, but I am new to Ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns? For example:

valid email addresses
valid http address
etc.

TIA

-Michael Garriss

Daniel_Carrera · 26 January 2003 20:17

Hi Michael,

Has anyone compiled a collection of ‘common’ regular expression
patterns.

Never heard of such a list. The O’Reilly book “Mastering Regular
Expressions” probably contains a whole lot.

valid email addresses

Short answer: /^\w+@[\w.]+\w+$/

Take an email address: “joe97_smith@some.domain.org”

The email starts with “word” characters (letters, underscores and
numbers):

/^\w+/

Here ^ marks the start of the string and \w+ is one or more word characters.

Followed by an “@” symbol:

/^\w+@/

Followed by a collection of word characters and dots and ending in word
characters:

/^\w+@[\w.]+\w+$/

Here [\w.] is a new class containing either word characters or dots, and
$ marks the end of the string.

Similarly, you can construct REs for other tasks.
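
As a rough illustration, here is how the short pattern could be tried out in Ruby. The constant and method names are made up for this sketch. One Ruby-specific point: `^` and `$` anchor at line boundaries, so the second constant (an assumption about what is usually wanted) uses `\A` and `\z` to pin the match to the whole string.

```ruby
# The short pattern, exactly as posted above.
SHORT_EMAIL = /^\w+@[\w.]+\w+$/

# Same pattern anchored to the whole string instead of a single line.
# (Assumption: whole-string matching is what you want for validation.)
STRICT_EMAIL = /\A\w+@[\w.]+\w+\z/

# Hypothetical helper: true if the address matches the short pattern.
def short_email?(address)
  SHORT_EMAIL.match?(address)
end

puts short_email?("joe97_smith@some.domain.org")
puts short_email?("no-at-sign")

# A multi-line string shows the difference between the two anchor styles.
tricky = "line1\njoe@some.org"
puts SHORT_EMAIL.match?(tricky)
puts STRICT_EMAIL.match?(tricky)
```

The pattern is only an approximation of common addresses; it rejects anything with characters outside `\w` before the “@”.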

···

valid http address

etc.

TIA

-Michael Garriss

–
Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Alan_Chen2 · 26 January 2003 20:23

You might look at the URI at http://arika.org/ruby/uri. It goes a bit beyond
just a regexp for email and http though.
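
For the “valid http address” part of the question, a parser can stand in for a regexp. The sketch below uses the `uri` library bundled with current Ruby, which may differ from the version linked above; `http_url?` is a hypothetical helper name.

```ruby
require "uri"

# Hypothetical helper: true if the text parses as an http or https URL
# with a host. URI::HTTPS is a subclass of URI::HTTP, so both pass.
def http_url?(text)
  uri = URI.parse(text)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

puts http_url?("http://digikata.com")
puts http_url?("ftp://example.com/file.txt")
puts http_url?("not a url")
```

This only checks syntax and scheme; it says nothing about whether the host exists.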

···

On Mon, Jan 27, 2003 at 04:59:15AM +0900, Michael Garriss wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

valid email addresses

valid http address

etc.

TIA

-Michael Garriss

–
Alan Chen
Digikata Computing
http://digikata.com

Michael_Campbell1 · 26 January 2003 20:53

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

valid email addresses

To be strictly pedantic, I don’t think there is such a thing. For the truly
adventurous, consider bang-paths, %'s and the like which have all been valid at
one point or another. There was a big writeup/FAQ in the perl domain about this
very subject.

Martin_DeMello1 · 26 January 2003 20:54

The perl people have (well, Damian Conway has):

martin

···

Michael Garriss mgarriss@earthlink.net wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

valid email addresses

valid http address

etc.

Gavin_Sinclair · 27 January 2003 07:53

The Perl Cookbook has a chapter on regular expressions, and concludes
with about three pages of nifty little ones. It also has a few items
devoted to things like email addresses.

You might be able to view these online at the PLEAC project:
http://pleac.sourceforge.net. The code from the book can also be
downloaded from the O’Reilly website somewhere.

Gavin

···

On Monday, January 27, 2003, 6:59:15 AM, Michael wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

valid email addresses

valid http address

etc.

Mike_Thomas · 27 January 2003 17:58

Try this site:

regxlib.com

later…

···

— Michael Garriss mgarriss@earthlink.net wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

valid email addresses

valid http address

etc.

TIA

-Michael Garriss

=====

Mike Thomas
http://www.samoht.com
It’s better backwards


Sam_Roberts1 · 26 January 2003 20:41

Quoting dcarrera@math.umd.edu, on Mon, Jan 27, 2003 at 05:17:02AM +0900:

Hi Michael,

Has anyone compiled a collection of ‘common’ regular expression
patterns.

Never heard of such a list. The O’Reilly book “Mastering Regular
Expressions” probably contains a whole lot.

Including one for a valid rfc822 email address list, which takes about
a page, no spaces.

valid email addresses

Short answer: /^\w+@[\w.]+\w+$/

Take an email address: “joe97_smith@some.domain.org”

How about we take:

"hi y@"."ruby \",master!" ( ... a comment!!) @ u%me . u+me . u-me

it’s syntactically valid, too! Though admittedly unusual…

Cheers,
Sam

Daniel_Carrera · 26 January 2003 21:03

Yes, there is actually. Email is a strict protocol, just like FTP and
others. However, there’s so much flexibility in valid email addresses
that you probably want to stick to common email addresses.

···

On Mon, Jan 27, 2003 at 05:53:53AM +0900, Mike Campbell wrote:

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

valid email addresses

To be strictly pedantic, I don’t think there is such a thing.

–
Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Michael_Garriss1 · 27 January 2003 18:26

Perfect! Thank you.

Mike Thomas wrote:

···

Try this site:

regxlib.com

later…

— Michael Garriss mgarriss@earthlink.net wrote:

Sorry if this a stupid question but I am new to ruby AND regular
expressions.

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

valid email addresses

valid http address

etc.

TIA

-Michael Garriss

=====

Mike Thomas
http://www.samoht.com
It’s better backwards


Sam_Roberts · 28 January 2003 00:02

Wrote Mike Thomas mike_thomas@yahoo.com, on Tue, Jan 28, 2003 at 02:58:47AM +0900:

Try this site:

regxlib.com

One of those doesn’t allow _ in the domain name,

admin@ensemble_independant.org

neither allow quoted local-parts,

"C=ca,CO=Certicom,CN=Sam Roberts"@inet-gateway.x500.org

and neither allow whitespace between tokens.

sroberts @ uniserve . com

Here’s the version from Mastering Regular Expressions. It’s perl, so
you’ll have to do a little converting.

A single email address is usually an addr-spec, so if you don’t want to
match things like

Sam Roberts sroberts@uniserve.com

or lists of addresses, you can just look for “Item 2:” below, and expand
that RE. The variables below match the RFC 822 BNF pretty closely.

Cheers,
Sam

···

–
Sam Roberts sroberts@certicom.com

    # (From http://public.yahoo.com/~jfriedl/regex/code.html)
    #
    # Program to build a regex to match an internet email address,
    # from Chapter 7 of Mastering Regular Expressions (Friedl / O'Reilly)
    #
    # Unoptimized version.
    #
    # Copyright 1997 O'Reilly & Associates, Inc.

    # Some things for avoiding backslashitis later on.
    $esc        = '\\\\';       $Period     = '\.';
    $space      = '\040';       $tab        = '\t';
    $OpenBR     = '\[';         $CloseBR    = '\]';
    $OpenParen  = '\(';         $CloseParen = '\)';
    $NonASCII   = '\x80-\xff';  $ctrl       = '\000-\037';
    $CRlist     = '\n\015';  # note: this should really be only \015.

    # Items 19, 20, 21
    $qtext = qq/[^$esc$NonASCII$CRlist\"]/;               # for within "..."
    $dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/;  # for within [...]
    $quoted_pair = qq< $esc [^$NonASCII] >;               # an escaped character

    # Item 10: atom
    $atom_char = qq/[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
    $atom = qq<
      $atom_char+    # some number of atom characters...
      (?!$atom_char) # ..not followed by something that could be part of an atom
    >;

    # Items 22 and 23, comment.
    # Impossible to do properly with a regex, I make do by allowing at most
    # one level of nesting.
    $ctext   = qq< [^$esc$NonASCII$CRlist()] >;
    $Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
    $comment = qq< $OpenParen
                     (?: $ctext | $quoted_pair | $Cnested )*
                   $CloseParen >;

    $X = qq< (?: [$space$tab] | $comment )* >;  # optional separator

    # Item 11: doublequoted string, with escaped items allowed
    $quoted_str = qq<
      \" (?:              # opening quote...
        $qtext            #   Anything except backslash and quote
        |                 #    or
        $quoted_pair      #   Escaped something (something != CR)
      )* \"               # closing quote
    >;

    # Item 7: word is an atom or quoted string
    $word = qq< (?: $atom | $quoted_str ) >;

    # Item 12: domain-ref is just an atom
    $domain_ref = $atom;

    # Item 13: domain-literal is like a quoted string, but [...] instead of "..."
    $domain_lit = qq< $OpenBR                        # [
                      (?: $dtext | $quoted_pair )*   #    stuff
                      $CloseBR                       #           ]
    >;

    # Item 9: sub-domain is a domain-ref or domain-literal
    $sub_domain = qq< (?: $domain_ref | $domain_lit ) >;

    # Item 6: domain is a list of subdomains separated by dots.
    $domain = qq< $sub_domain          # initial subdomain
                  (?:                  #
                    $X $Period         # if led by a period...
                    $X $sub_domain     #   ...further okay
                  )*
    >;

    # Item 8: a route. A bunch of "@ $domain" separated by commas,
    # followed by a colon
    $route = qq< \@ $X $domain
                 (?: $X , $X \@ $X $domain )*  # further okay, if led by comma
                 :                             # closing colon
    >;

    # Item 5: local-part is a bunch of $word separated by periods
    $local_part = qq< $word                       # initial word
                      (?: $X $Period $X $word )*  # further okay, if led by a period
    >;

    # Item 2: addr-spec is local@domain
    $addr_spec = qq< $local_part $X \@ $X $domain >;

    # Item 4: route-addr is <route? addr-spec>
    $route_addr = qq[ < $X                # leading <
                      (?: $route $X )?    #   optional route
                      $addr_spec          #   address spec
                      $X >                # trailing >
    ];

    # Item 3: phrase
    $phrase_ctrl = '\000-\010\012-\037';  # like ctrl, but without tab

    # Like atom-char, but without listing space, and uses phrase_ctrl.
    # Since the class is negated, this matches the same as atom-char plus
    # space and tab
    $phrase_char =
      qq/[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;

    $phrase = qq< $word               # one word, optionally followed by...
                  (?:
                    $phrase_char |    # atom and space parts, or...
                    $comment     |    # comments, or...
                    $quoted_str       # quoted strings
                  )*
    >;

    # Item #1: mailbox is an addr_spec or a phrase/route_addr
    $mailbox = qq< $X                       # optional leading comment
                   (?: $addr_spec           # address
                     |                      #  or
                     $phrase $route_addr    # name and address
                   ) $X                     # optional trailing comment
    >;

    ###########################################################################
    # Here's a little snippet to test it.
    # Addresses given on the commandline are described.

    my $error = 0;
    my $valid;
    foreach $address (@ARGV) {
        $valid = $address =~ m/^$mailbox$/xo;
        printf "`$address' is syntactically %s.\n", $valid ? "valid" : "invalid";
        $error = 1 if not $valid;
    }
    exit $error;
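
The same build-it-from-pieces idea carries over to Ruby, where one Regexp can be interpolated into another. What follows is a cut-down sketch, not a port of the program above. It only covers addr-spec (Item 2), and it leaves out comments, domain literals, routes and phrases, so the separator is just spaces and tabs. The constant names mirror the Perl variables, and `addr_spec?` is a hypothetical helper.

```ruby
# Optional separator: spaces and tabs only (the Perl $X also allows comments).
X = /[ \t]*/

# Item 10: atom. Printable ASCII minus the RFC 822 special characters.
ATOM_CHAR = /[\x21-\x7e&&[^()<>@,;:".\\\[\]]]/
ATOM      = /#{ATOM_CHAR}+/

# Items 19 and 21: text inside "..." and an escaped character (ASCII only).
QTEXT       = /[\x00-\x7f&&[^\\"\r\n]]/
QUOTED_PAIR = /\\[\x00-\x7f]/

# Item 11: double-quoted string, with escaped items allowed.
QUOTED_STR = /"(?:#{QTEXT}|#{QUOTED_PAIR})*"/

# Item 7: word is an atom or a quoted string.
WORD = /(?:#{ATOM}|#{QUOTED_STR})/

# Item 6 (simplified): domain is a list of atoms separated by dots.
DOMAIN = /#{ATOM}(?:#{X}\.#{X}#{ATOM})*/

# Item 5: local-part is a list of words separated by dots.
LOCAL_PART = /#{WORD}(?:#{X}\.#{X}#{WORD})*/

# Item 2: addr-spec is local@domain, anchored to the whole string.
ADDR_SPEC = /\A#{X}#{LOCAL_PART}#{X}@#{X}#{DOMAIN}#{X}\z/

# Hypothetical helper: true if the text matches the simplified addr-spec.
def addr_spec?(text)
  ADDR_SPEC.match?(text)
end

[
  'admin@ensemble_independant.org',
  '"C=ca,CO=Certicom,CN=Sam Roberts"@inet-gateway.x500.org',
  'sroberts @ uniserve . com',
  'no-at-sign'
].each do |address|
  puts "#{address.inspect} is syntactically #{addr_spec?(address) ? 'valid' : 'invalid'}."
end
```

Because each interpolated Regexp keeps its own options, the pieces can be tested on their own before they are combined.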

William_Djaja_Tjokr1 · 28 January 2003 23:23

Hi,

Isn’t this the very last example in the book “Mastering Regular
Expressions” by Friedl, i.e., the Appendix B: Email Regex Program? (At
least that is in the first edition.) Be careful though, because when the
regex is expanded into its plain form, the regex size is 6,598 bytes long.

Regards,

Bill

···

Mike Campbell michael_s_campbell@yahoo.com wrote:

Has anyone compiled a collection of ‘common’ regular expression
patterns. For example:

valid email addresses

To be strictly pedantic, I don’t think there is such a thing. For the truly
adventurous, consider bang-paths, %'s and the like which have all been valid at
one point or another. There was a big writeup/FAQ in the perl domain about this
very subject.

Daniel_Carrera · 26 January 2003 20:53

I only meant to write an “approximate” RE that would work most of the
time. Writing a truly comprehensive RE would be very difficult and
probably not even worth it.

For instance, did you know that the backspace character is technically
allowed? Your address could be:

@domain.com

But who’s really going to have a backspace in their email address? (good
luck getting any email there).
It’s better to just ignore this possibility.

Cheers,

···

On Mon, Jan 27, 2003 at 05:41:17AM +0900, Sam Roberts wrote:

How about we take:

"hi y@"."ruby \",master!" ( ... a comment!!) @ u%me . u+me . u-me

it’s syntactically valid, too! Though admittedly unusual…

–
Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Sam_Roberts1 · 26 January 2003 23:23

Quoting dcarrera@math.umd.edu, on Mon, Jan 27, 2003 at 06:03:31AM +0900:

> > - valid email addresses
> To be strictly pedantic, I don't think there is such a thing.

Yes, there is actually. Email is a strict protocol, just like FTP and
others. However, there's so much flexibility in valid email addresses
that you probably want to stick to common email addresses.

I think Mike's referring to a Perl conversation about validity,
in which there was some ambiguity in what people mean by valid.

The syntax is completely described, but some people mean by "valid
email address" an email address you can actually send mail to, which
involves doing stuff like making sure the domain name exists and
is reachable. That kind of validity is pretty much impossible to
check, without actually sending a mail and getting a reply!
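
A partial step in that direction is to ask DNS whether the domain publishes any mail exchangers. The sketch below uses Ruby's bundled `resolv` library. `domain_of` and `mx_hosts` are hypothetical helper names, the lookup needs network access, and a successful lookup still does not prove the mailbox exists.

```ruby
require "resolv"

# Hypothetical helper: the text after the last "@", or nil if there is none.
def domain_of(address)
  at = address.rindex("@")
  return nil if at.nil? || at == address.length - 1
  address[(at + 1)..-1].strip
end

# Hypothetical helper: MX host names for a domain, best preference first.
# Performs a live DNS query, so it is not called at load time here.
def mx_hosts(domain)
  Resolv::DNS.open do |dns|
    dns.getresources(domain, Resolv::DNS::Resource::IN::MX)
       .sort_by(&:preference)
       .map { |mx| mx.exchange.to_s }
  end
end

puts domain_of("joe97_smith@some.domain.org")

# Example use (requires network access):
#   hosts = mx_hosts(domain_of("someone@example.org"))
#   puts hosts.empty? ? "no MX records found" : hosts.join(", ")
```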

Cheers,
Sam

···

On Mon, Jan 27, 2003 at 05:53:53AM +0900, Mike Campbell wrote:

Michael_Campbell1 · 27 January 2003 02:32

To be strictly pedantic, I don’t think there is such a thing.

Yes, there is actually. Email is a strict protocol, just like FTP and
others. However, there’s so much flexibility in valid email addresses
that you probably want to stick to common email addresses.

As another person noted, I wasn’t saying you couldn’t check for RFC-822
compliance, but rather that you can’t, via a regex, determine if a mail address
is valid; i.e., it’ll get there.

Here’s the Perl FAQ to which I was referring. This is a shortened version,
which is a shame, as the one I recall from years gone by gave examples using %'s
which could (IIRC) be legally parsed in more than 1 way, either, both, or
neither being “valid”. That may have been pre RFC-822 though, to be fair.

···

=============

How do I check a valid email address?

You can’t.

Remember that without sending mail to the address and seeing whether it
bounces (and even then you face the halting problem), you cannot
determine whether an email address is valid. Even if you apply
the email header standard, you can have problems, because there are deliverable
addresses that aren’t RFC-822 (the mail header standard) compliant,
and addresses that aren’t deliverable which are.

Many are tempted to try to eliminate many frequently-invalid email
addresses with a simple regex, such as /^[\w.-]+@([\w.-]\.)+\w+$/.
However, this also throws out many valid ones, and says nothing
about potential deliverability, so is not suggested. Instead, see the
${CPAN}/authors/Tom_Christiansen/scripts/ckaddr.gz program, which actually
checks against the full RFC spec (well, modulo nested comments), looks
for addresses you may not wish to accept email to (say, Bill Clinton or
your postmaster), and then makes sure that the hostname given can be
looked up in DNS. It’s not fast, but it works.

Simon_Cozens1 · 26 January 2003 21:15

Daniel Carrera dcarrera@math.umd.edu writes:

But who’s really going to have a backspace in their email address? (good
luck getting any email there).
It’s better to just ignore this possibility.

And who’s going to have crazy things like pluses or hyphens in their
email address? Please, if you’re going to do email address validation,
do it properly, or you won’t be getting any mail from me. Some might
consider that an unexpected bonus, of course.
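
To make the point concrete, the sketch below runs a few ordinary-looking addresses through the short pattern posted earlier in the thread and, for comparison, through `URI::MailTo::EMAIL_REGEXP` from Ruby's bundled `uri` library. The sample addresses are invented, and the stdlib pattern is itself a pragmatic approximation rather than a full RFC 822 check.

```ruby
require "uri"

# The short pattern posted earlier in the thread.
SHORT_EMAIL = /^\w+@[\w.]+\w+$/

# Invented sample addresses with a plus or a hyphen in them.
samples = [
  "user+tag@example.org",
  "first-last@example.org",
  "user@my-host.example.org"
]

samples.each do |address|
  short  = SHORT_EMAIL.match?(address)
  stdlib = URI::MailTo::EMAIL_REGEXP.match?(address)
  puts format("%-26s short=%-5s stdlib=%s", address, short, stdlib)
end
```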

···

–
Given an infinite amount of monkeys an infinite amount of time, an
infinite amount of drafting supplies, and an infinite amount of crack,
they’d come up with Downtown Chicago. – David Jacoby, in the monastery

Sam_Roberts1 · 26 January 2003 23:16

Quoting dcarrera@math.umd.edu, on Mon, Jan 27, 2003 at 05:53:51AM +0900:

How about we take:

"hi y@"."ruby \",master!" ( ... a comment!!) @ u%me . u+me . u-me

it’s syntactically valid, too! Though admittedly unusual…

I only meant to write an “approximate” RE that would work most of the
time.

Oh, hey, I know that! I wasn’t trying to trash your regexp.

Writing a truly comprehensive RE would be very difficult and
probably not even worth it.

Yep, but + and - in domain names isn’t too uncommon, and you still see email
addresses with the %-hack for uucp in the local-part.

Anyhow, it’s actually not too hard to write a RE from the BNF in RFC822,
particularly if you ignore some stuff deprecated by RFC2822, but
luckily, we don’t have to, because you can find the RE in the
excellent book you recommended, Mastering Regular Expressions.

I picked it up because I couldn’t believe REs were complicated enough to
need a whole book, and then kept reading out of amazement.

Cheers,
Sam

···

On Mon, Jan 27, 2003 at 05:41:17AM +0900, Sam Roberts wrote:
