Regexp Error?

What’s wrong here?

irb(main):022:0> RUBY_VERSION
=> "1.8.1"
irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"
irb(main):024:0> "123-456".gsub(/^.*/, 'X')
=> "X"
irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"
irb(main):026:0> "123-456".gsub(/^.*$/, 'X')
=> "X"

I’d have expected “X” as result of gsub in 23 and 25 because .* is greedy.

robert

irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"

  * first match : "123-456"
  * now it's at end
  * second match with the empty string

irb(main):024:0> "123-456".gsub(/^.*/, 'X')
=> "X"

  * first match : "123-456"
  * now it's at end
  * it can't match the empty string because there is ^ in the regexp

Guy Decoux
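
The two traces above can be checked by listing where each match starts and ends. This is a minimal sketch, assuming a current Ruby rather than the 1.8.1 used in the thread; `match_spans` is a hypothetical helper name, not something Ruby provides.

```ruby
# Hypothetical helper: collect the [start, end] offsets of every match found
# by a global scan, which walks the string the same way gsub does.
def match_spans(str, re)
  spans = []
  str.scan(re) { spans << Regexp.last_match.offset(0) }
  spans
end

# Two matches: the whole string, then the empty string at the very end.
p match_spans("123-456", /.*/)   # => [[0, 7], [7, 7]]

# One match: ^ cannot match at offset 7, so there is no second (empty) match.
p match_spans("123-456", /^.*/)  # => [[0, 7]]
```

Each pair corresponds to one replacement, which is why the first gsub yields "XX" and the second "X".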

“ts” decoux@moulon.inra.fr wrote in message
news:200405141118.i4EBIo422693@moulon.inra.fr

irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"

  • first match : "123-456"
  • now it’s at end
  • second match with the empty string

irb(main):024:0> "123-456".gsub(/^.*/, 'X')
=> "X"

  • first match : "123-456"
  • now it’s at end
  • it can’t match the empty string because there is ^ in the regexp

That explains it. I still have to say that I find this a bit surprising.
Moreover, this case strikes me as an error:

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that “$” is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return “XX”?

Regards

robert

On Fri, 14 May 2004 20:18:55 +0900, ts wrote:

irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"

  • first match : "123-456"
  • now it’s at end
  • second match with the empty string

I would say the empty string is included with “123-456”,
so it shouldn’t give another match:

echo "123-456" | sed 's/.*/X/g'
X

Kristof

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that "$" is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return "XX"?

This is the same explanation. Your regexp means match a string which is at
the end (where end can be \n or the end of string, in this case)

  * first match : "123-456"
  * now it's at end
  * it can match the empty string

$ is like ^ : it doesn't match a character but a position in the string
(sort of ...)

Guy Decoux
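
The point that `$` and `^` match positions rather than characters can be seen by substituting on the anchor alone. A small sketch, assuming a current Ruby; the sample string is the one used in the thread.

```ruby
# An anchor matches a position, not a character: the replacement is inserted
# and nothing from the original string is removed.
p "123-456".gsub(/$/, 'X')   # => "123-456X"
p "123-456".gsub(/^/, 'X')   # => "X123-456"

# The match itself is empty and sits at offset 7, the end of the string.
m = /$/.match("123-456")
p m[0]         # => ""
p m.offset(0)  # => [7, 7]
```

Because the position at offset 7 is still available after `.*` has consumed "123-456", `/.*$/` can match there once more with an empty string.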

On Fri, 14 May 2004, Robert Klemme wrote:

“ts” decoux@moulon.inra.fr wrote in message
news:200405141118.i4EBIo422693@moulon.inra.fr

irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"

  • first match : "123-456"
  • now it’s at end
  • second match with the empty string

irb(main):024:0> "123-456".gsub(/^.*/, 'X')
=> "X"

  • first match : "123-456"
  • now it’s at end
  • it can’t match the empty string because there is ^ in the regexp

That explains it. I still have to say that I find this a bit surprising.
Moreover, this case strikes me as an error:

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that “$” is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return “XX”?

Regards

robert

^ and $ are special and they consume no chars and so are not really ‘matched’
in the same way…

your regex says ‘zero or more chars before the end of a string’ so you get

^ 1 2 3 - 4 5 6 $
             ^

the first go, then scanning starts again - the problem is that it’s then
looking for something potentially zero-width followed by something
zero-width - which is always going to match (again). i guess the difference for
the second match is that it does not advance the scanner ptr and can therefore
know it’s done… it does seem odd, but without that behaviour it would be
hard to match empty strings, line boundaries, and other zero-width things…
for instance if you did this

"123-456".gsub(/.*|$/, 'X')

you would expect ‘XX’, where the second ‘X’ is inserted into a zero-width
position and ‘.*’ does not include ‘$’, and yet this is really the same exact
behaviour - scanning is done again from the non-space before the end of line,
allowing you to finally match ‘$’ which ‘.*’ did not consume.

regexes can be so tricky, i try to use these rules with them

  • always use both ^ and $ (this makes it a lot harder to write the expression
    too!)

  • never use .* (or * at all really)

the last is actually pretty important - we use a product here, ldm (local data
manager), that scans a huge memory-mapped queue full of data products, matching
a list of actions against the product tags. the list of actions use regexps
and all of ours had ‘.*’ in them. top showed the ldm process at around 30%
cpu - reworking the patterns to not include ‘.*’ dropped it off the radar.

-a

===============================================================================

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
URL :: Solar-Terrestrial Physics Data | NCEI
TRY :: for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done
===============================================================================
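
For readers who want a single replacement, the two rules of thumb above can be tried against the thread's sample string. This is only a sketch, assuming a current Ruby; the use of `\A` and `\z` (whole-string anchors) in the last line is an addition for illustration and does not appear in the thread.

```ruby
s = "123-456"

p s.gsub(/.*|$/, 'X')    # => "XX"  same two matches as /.*/
p s.gsub(/^.*$/, 'X')    # => "X"   anchored at both ends, as in the original post
p s.gsub(/.+/, 'X')      # => "X"   + needs at least one character
p s.sub(/.*/, 'X')       # => "X"   sub stops after the first match
p s.gsub(/\A.*\z/, 'X')  # => "X"   \A cannot match again at offset 7
```

Any of the last four avoids the trailing empty match; which one fits depends on whether the input can contain newlines or should be replaced more than once.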

On Fri, 14 May 2004, Kristof Bastiaensen wrote:

On Fri, 14 May 2004 20:18:55 +0900, ts wrote:

irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"

  • first match : "123-456"
  • now it’s at end
  • second match with the empty string

I would say the empty string is included with “123-456”,
so it shouldn’t give another match:

echo "123-456" | sed 's/.*/X/g'
X

Kristof

yes but

~ > echo "123-456" | perl -npe 's/.*$/X/g'
XX

and sed regexps are not the same as perl/ruby right?

-a


Perl and Javascript (MSIE) do it too, so I don’t propose changing it, but it
seems a strange (wrong) behaviour.

I would have thought that the second match you refer to should have been
included in the first; that is, that the greedy match should match (and then
replace) the whole string, including the 0 characters between the last
character and the end of the string.

But apparently it’s not like that.

“ts” decoux@moulon.inra.fr wrote in message
news:200405141224.i4ECOZo25483@moulon.inra.fr

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that “$” is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return “XX”?

This is the same explanation. Your regexp means match a string which is
at the end (where end can be \n or the end of string, in this case)

  • first match : "123-456"
  • now it’s at end
  • it can match the empty string

$ is like ^ : it doesn’t match a character but a position in the string
(sort of …)

Guy Decoux

On Fri, 14 May 2004 07:16:57 -0600, Ara.T.Howard wrote:

echo "123-456" | sed 's/.*/X/g'
X

Kristof

yes but

~ > echo "123-456" | perl -npe 's/.*$/X/g'
XX

gawk also returns X:

$ echo "123-456" | gawk '{ gsub(/.*/, "X"); print }'
X

and sed regexps are not the same as perl/ruby right?

-a

No, but I would expect the basic ones to behave the same
way. (sed and gawk have been there before perl/ruby/javascript).
Is that not an expectation that can be trusted?

Ara.T.Howard wrote:

On Fri, 14 May 2004, Kristof Bastiaensen wrote:

On Fri, 14 May 2004 20:18:55 +0900, ts wrote:

irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"

  • first match : "123-456"
  • now it’s at end
  • second match with the empty string

I would say the empty string is included with “123-456”,
so it shouldn’t give another match:

echo "123-456" | sed 's/.*/X/g'
X

Kristof

yes but

~ > echo "123-456" | perl -npe 's/.*$/X/g'
XX

and sed regexps are not the same as perl/ruby right?

This is a widespread problem with regexps: when dealing with the Kleene star, it's
tricky to determine when to stop looping. I have put a lot of effort into investigating
where to stop in my engine, so that the output is the most desirable one.

Unfortunately Ruby’s native regexp engines (GNU or Oniguruma) attempt to be
Perl compatible, and thus sometimes emulate undesired behavior.


Simon Strandgaard

“Ara.T.Howard” ahoward@fattire.ngdc.noaa.gov wrote in message
news:Pine.LNX.4.44.0405140634490.5586-100000@fattire.ngdc.noaa.gov

^ and $ are special and they consume no chars and so are not really
‘matched’ in the same way…

your regex says ‘zero or more chars before the end of a string’ so you
get

^ 1 2 3 - 4 5 6 $
             ^

the first go, then scanning starts again - the problem is that it’s then
looking for something potentially zero-width followed by something
zero-width - which is always going to match (again). i guess the
difference for the second match is that it does not advance the scanner
ptr and can therefore know it’s done… it does seem odd,

Definitely! What strikes me as odd is that the engine must know the start
and end of the match. So it could realize that the end is at the end.

but without that behaviour it would be hard to match empty strings, line
boundaries, and other zero-width things…

You mean because then it would immediately stop without matching anything.
Yeah, might be true.

The sed and awk examples show that apparently there’s disagreement on how
this should be handled. I just wonder why I didn’t step into this pitfall
earlier. Apparently I never felt the need for .* in a replacement context
before. :-)

Thx all!

Kind regards

robert

I would have thought, from that logic, that you could just as well expect an
infinite loop (“XXXXXX…”) rather than just “XX” - why does /.*/ not keep
matching that same 0-char gap at the end?


Definitely! What strikes me as odd is that the engine must know the start and
end of the match. So it could realize that the end is at the end.

Well you can have another explanation in 'man 7 regex' (linux) (or another
way to see it)

    Match lengths are measured in characters, not collating elements. A
    null string is considered longer than no match at all.

It's in the case of null string vs no match

Guy Decoux
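
The "null string vs no match" distinction from the man page can be illustrated in Ruby. Ruby's engine is not a POSIX one, so this is only a sketch of the difference between an empty match and no match, assuming a current Ruby; the second argument to `match` is the offset at which the search starts.

```ruby
# At offset 7 (the end of "123-456") nothing is left to consume.
# .* still succeeds there, with a null (empty) string:
p(/.*/.match("123-456", 7))   # => #<MatchData "">

# .+ needs at least one character, so there is no match at all:
p(/.+/.match("123-456", 7))   # => nil
```

gsub replaces every successful match, and the empty one counts as successful, which accounts for the second "X".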

“Dave Burt” burtdav@hotmail.com wrote in message
news:UJgpc.38870$TT.13050@news-server.bigpond.net.au…

I would have thought, from that logic, that you could just as well expect
an infinite loop (“XXXXXX…”) rather than just “XX” - why does /.*/ not keep
matching that same 0-char gap at the end?

I thought that for a moment, too. But he gave the answer already:

i guess the difference for the second match is that it does not advance
the scanner ptr and can therefore know it’s done…

Regards

robert
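
Why the loop ends after two matches can be sketched with a simplified global-substitution loop. This is not Ruby's actual implementation, only an illustration of the rule described above, assuming a current Ruby; `naive_gsub` is a hypothetical name and it handles literal replacement strings only (no backreferences).

```ruby
# Simplified sketch of a global-substitution loop (not Ruby's real code).
def naive_gsub(str, re, replacement)
  out = "".dup
  pos = 0
  while pos <= str.length
    m = re.match(str, pos)        # search from pos; anchors still see the whole string
    break unless m
    out << str[pos...m.begin(0)]  # copy the unmatched text before the match
    out << replacement
    if m.end(0) == m.begin(0)
      # Empty match: copy one character through and step past it, so the
      # same empty position is never matched twice.
      out << str[m.end(0), 1].to_s
      pos = m.end(0) + 1
    else
      pos = m.end(0)
    end
  end
  out << str[pos..-1] if pos <= str.length  # copy whatever is left over
  out
end

p naive_gsub("123-456", /.*/, "X")   # => "XX"
p naive_gsub("123-456", /^.*/, "X")  # => "X"
```

After the empty match at offset 7 the position is forced to 8, which is past the end of the string, so the loop stops instead of producing "XXXXXX…".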

“ts” decoux@moulon.inra.fr wrote in message
news:200405141521.i4EFLOj03962@moulon.inra.fr

Definitely! What strikes me as odd is that the engine must know the start
and end of the match. So it could realize that the end is at the end.

Well you can have another explanation in ‘man 7 regex’ (linux) (or
another way to see it)

    Match lengths are measured in characters, not collating elements. A
    null string is considered longer than no match at all.

It’s in the case of null string vs no match

This reminds me of the mathematician that found an epsilon so small, that -
if you divided it in halves - it was already negative. :-)

Cheers

robert

Robert Klemme wrote:

“ts” decoux@moulon.inra.fr wrote in message
news:200405141521.i4EFLOj03962@moulon.inra.fr

Definitely! What strikes me as odd is that the engine must know the start
and end of the match. So it could realize that the end is at the end.

Well you can have another explanation in ‘man 7 regex’ (linux) (or
another way to see it)

    Match lengths are measured in characters, not collating elements. A
    null string is considered longer than no match at all.

It’s in the case of null string vs no match

This reminds me of the mathematician that found an epsilon so small, that -
if you divided it in halves - it was already negative. :-)

Epsilon transitions are a very interesting feature of regexps… I like them.
However, variable-width lookbehind with subcaptures and backreferences is
even more amazing (that would be suitable for a small research project).

Simon Strandgaard