Regexps and anchoring again

Brian_Candler · 22 May 2003 10:47

There was a discussion a few weeks back about Ruby’s handling of ^ and $ in
regexps, and I have realised what makes me so uncomfortable with it. I’m used
to matching strings on /^…$/ to mean “match exactly this”, and it doesn’t
work. In fact it could lead to very nasty security holes. Consider this
example:

   str = cgi['unsafe_item']
   str.untaint if str =~ /^[a-z0-9]+$/

Looks perfectly safe, doesn’t it? Errm, no.

   str = "rm -rf /*\nabcde\ndrop table master_db;"
   puts "oops!" if str =~ /^[a-z0-9]+$/   #>> "oops!"

For this to be safe, you actually have to write:

  str.untaint if str =~ /\A[a-z0-9]+\z/
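To make the difference concrete, here is a minimal sketch that runs both patterns against a multi-line input. The variable names are made up for illustration, and the taint calls are left out so the snippet only tests the regexps themselves.

```ruby
# Hypothetical hostile input: only the middle line satisfies [a-z0-9]+.
str = "rm -rf /*\nabcde\ndrop table master_db;"

line_anchored   = /^[a-z0-9]+$/    # ^ and $ match at every line boundary
string_anchored = /\A[a-z0-9]+\z/  # \A and \z match only at the ends of the string

line_match   = !!(str =~ line_anchored)        # true: the "abcde" line matches
string_match = !!(str =~ string_anchored)      # false: the whole string must match
clean_match  = !!("abcde" =~ string_anchored)  # true: a clean value still passes

puts "line-anchored, hostile input:   #{line_match}"
puts "string-anchored, hostile input: #{string_match}"
puts "string-anchored, clean input:   #{clean_match}"
```

The line-anchored pattern accepts the input because one line inside it happens to be well-formed; the string-anchored pattern has to account for every character.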

The asymmetry between \A and \z is annoying (I have to keep looking it up to
remember which one is capital and which is lower-case), and it leaves
regular expressions looking a lot less readable.

I guess this is fixed in concrete now, but I thought it was worth pointing
this out as potentially a very important “gotcha”.

Cheers,

Brian.

David_A_Black2 · 22 May 2003 11:09

Hi –

> There was a discussion a few weeks back about Ruby’s handling of ^
> and $ in regexps, and I have realised what makes me so uncomfortable
> with it. I’m used to matching strings on /^…$/ to mean “match
> exactly this”, and it doesn’t work. In fact it could lead to very
> nasty security holes. Consider this example:

But… but… it’s not like it’s being kept a secret. I guess
different regex systems do this differently. sed, for example, treats
^…$ linewise, not stringwise:

   $ echo -e 'abc\ndef' | sed -e 's/^def$/ghi/'
   abc
   ghi

whereas Perl requires the /m modifier. So there isn’t already one
universal syntax outside of Ruby; there’s always the need to adjust to
each language’s view of things. I refuse to cast Ruby as the villain
of the piece.
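As a rough Ruby counterpart to the sed command, the sketch below applies the same substitution with line anchors and with string anchors. The variable names are illustrative only. It also shows that Ruby's own /m flag affects how "." treats newlines rather than how ^ and $ behave.

```ruby
text = "abc\ndef"

# ^ and $ anchor at line boundaries, so the second line is replaced,
# just as in the sed example.
linewise = text.sub(/^def$/, "ghi")

# \A and \z anchor at the ends of the whole string, so nothing matches
# and sub returns an unchanged copy.
stringwise = text.sub(/\Adef\z/, "ghi")

# In Ruby, /m only lets "." match a newline; it does not alter ^ or $.
dot_plain = "a\nb".match?(/a.b/)
dot_m     = "a\nb".match?(/a.b/m)

p linewise
p stringwise
p dot_plain
p dot_m
```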

> […]
>   str.untaint if str =~ /\A[a-z0-9]+\z/
>
> The asymmetry between \A and \z is annoying (I have to keep looking
> it up to remember which one is capital and which is lower-case), and
> it leaves regular expressions looking a lot less readable.

You can probably use \Z in most cases; the only difference between \z
and \Z is that \Z anchors before a trailing newline, if there is one.
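A small sketch of that difference, using made-up variable names: the two patterns agree on a string without a trailing newline and disagree on one that has it.

```ruby
with_newline    = "abcde\n"
without_newline = "abcde"

upper_z = /\A[a-z0-9]+\Z/  # \Z: end of string, or just before a final "\n"
lower_z = /\A[a-z0-9]+\z/  # \z: the very end of the string, nothing after it

upper_with    = upper_z.match?(with_newline)     # true
lower_with    = lower_z.match?(with_newline)     # false
upper_without = upper_z.match?(without_newline)  # true
lower_without = lower_z.match?(without_newline)  # true

puts "\\Z with trailing newline:    #{upper_with}"
puts "\\z with trailing newline:    #{lower_with}"
puts "\\Z without trailing newline: #{upper_without}"
puts "\\z without trailing newline: #{lower_without}"
```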

David

On Thu, 22 May 2003, Brian Candler wrote:

--
David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Robert · 22 May 2003 12:21

“Brian Candler” B.Candler@pobox.com wrote in news post
news:20030522114658.A83352@linnet.org…

> There was a discussion a few weeks back about Ruby’s handling of ^ and $
> in regexps, and I have realised what makes me so uncomfortable with it.
> I’m used to matching strings on /^…$/ to mean “match exactly this”, and
> it doesn’t work. In fact it could lead to very nasty security holes.
> Consider this example:
>
>    str = cgi['unsafe_item']
>    str.untaint if str =~ /^[a-z0-9]+$/
>
> Looks perfectly safe, doesn’t it? Errm, no.
>
>    str = "rm -rf /*\nabcde\ndrop table master_db;"
>    puts "oops!" if str =~ /^[a-z0-9]+$/   #>> "oops!"
>
> For this to be safe, you actually have to write:
>
>   str.untaint if str =~ /\A[a-z0-9]+\z/
>
> The asymmetry between \A and \z is annoying (I have to keep looking it
> up to remember which one is capital and which is lower-case), and it
> leaves regular expressions looking a lot less readable.

I always use the uppercase forms (\A and \Z), because that’s a reasonable
choice if you process lines from a file, as in:

   while ( line = gets ) do
     case line
     when /\Abegin\Z/
       …
     end
   end

\A and \Z might be even more mnemonic than ^ and $ if you think a moment
about it - but then, we’re used to cryptic symbols.
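Why \Z suits line-by-line reading can be sketched as follows. The StringIO object is a hypothetical stand-in for a real file, and the array names are made up. Each line arrives with its trailing newline still attached, so \Z matches it directly while \z only matches after a chomp.

```ruby
require "stringio"

# Stand-in for a file; each_line keeps the trailing "\n" on every line.
input = StringIO.new("begin\nbeginning\nend\n")

upper_hits   = []
lower_hits   = []
chomped_hits = []

input.each_line do |line|
  upper_hits   << line       if line =~ /\Abegin\Z/        # tolerates the final "\n"
  lower_hits   << line       if line =~ /\Abegin\z/        # never true for a raw line here
  chomped_hits << line.chomp if line.chomp =~ /\Abegin\z/  # strict match after chomp
end

p upper_hits
p lower_hits
p chomped_hits
```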

> I guess this is fixed in concrete now, but I thought it was worth
> pointing this out as potentially a very important “gotcha”

Yes, it really is. But I would not blame regexp syntax. Applications that
do potentially dangerous things with input from the outside world should
be crafted carefully anyway.

Regards

robert

Brian_Candler · 22 May 2003 12:48

> > There was a discussion a few weeks back about Ruby’s handling of ^
> > and $ in regexps, and I have realised what makes me so uncomfortable
> > with it. I’m used to matching strings on /^…$/ to mean “match
> > exactly this”, and it doesn’t work. In fact it could lead to very
> > nasty security holes. Consider this example:
>
> But… but… it’s not like it’s being kept a secret

Well no, if you read the documentation in its entirety, and forget
everything you knew about regexps and Perl previously. But regexp handling
in Ruby cries out “Yes I’m like Perl! I have /regexp/ and =~ and $1,$2…”
and you have to read the small print - or in my case write broken programs -
to discover that something as fundamental as start and end anchoring doesn’t
work in the way that you expect.

The “way that I expect” comes not only from Perl, but also from things like
Exim (which embeds PCRE, Perl-compatible Regular Expressions).

> >   str.untaint if str =~ /\A[a-z0-9]+\z/
> >
> > The asymmetry between \A and \z is annoying (I have to keep looking
> > it up to remember which one is capital and which is lower-case), and
> > it leaves regular expressions looking a lot less readable.
>
> You can probably use \Z in most cases; the only difference between \z
> and \Z is that \Z anchors before a trailing newline, if there is one.

I want to say unambiguously “start of string” and “end of string”, with no
messing around. If I am validating a string which is going to be inserted
into another string later on, it’s important to me whether the provided
value has or does not have a trailing newline.
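Here is one way the trailing newline can matter once a validated value is embedded in a larger string. This is only a sketch: the `safe_token?` helper and the `user=…;role=guest` record format are hypothetical, invented for the illustration.

```ruby
# Hypothetical strict validator: the whole string must be [a-z0-9]+.
def safe_token?(value)
  value.match?(/\A[a-z0-9]+\z/)
end

candidate = "abcde\n"

loose_ok  = candidate.match?(/\A[a-z0-9]+\Z/)  # true: \Z ignores the final "\n"
strict_ok = safe_token?(candidate)             # false: \z does not

# If the loosely validated value is interpolated, the newline comes along
# and what was meant to be one record is now split over two lines.
record = "user=#{candidate};role=guest"

puts "accepted by \\Z: #{loose_ok}"
puts "accepted by \\z: #{strict_ok}"
puts "lines in record: #{record.lines.length}"
```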

Cheers,

Brian.

On Thu, May 22, 2003 at 08:09:04PM +0900, dblack@superlink.net wrote:
