Regexp Error?

Robert · 14 May 2004 11:14

What’s wrong here?

irb(main):022:0> RUBY_VERSION
=> "1.8.1"
irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"
irb(main):024:0> "123-456".gsub(/^.*/, 'X')
=> "X"
irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"
irb(main):026:0> "123-456".gsub(/^.*$/, 'X')
=> "X"

I’d have expected “X” as the result of the gsub calls in 23 and 25, because .* is greedy.

robert

ts1 · 14 May 2004 11:18

irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"

  * first match : "123-456"
  * now it's at end
  * second match with the empty string

irb(main):024:0> "123-456".gsub(/^.*/, 'X')
=> "X"

  * first match : "123-456"
  * now it's at end
  * it can't match the empty string because there is ^ in the regexp
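One way to check the two matches is String#scan, which returns every match a global scan over the string finds. This is only a minimal sketch, assuming a reasonably current Ruby; the variable names are illustrative and not from the thread.

```ruby
s = "123-456"

# /.*/ matches the whole string first, then the empty string at the end.
unanchored = s.scan(/.*/)
p unanchored            # => ["123-456", ""]

# With ^ the empty string at the end is rejected, so one match remains.
anchored = s.scan(/^.*/)
p anchored              # => ["123-456"]

p s.gsub(/.*/, "X")     # => "XX"
p s.gsub(/^.*/, "X")    # => "X"
```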

Guy Decoux

Robert · 14 May 2004 12:18

“ts” decoux@moulon.inra.fr wrote in message
news:200405141118.i4EBIo422693@moulon.inra.fr…

irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"

  * first match : "123-456"
  * now it's at end
  * second match with the empty string

irb(main):024:0> "123-456".gsub(/^.*/, 'X')
=> "X"

  * first match : "123-456"
  * now it's at end
  * it can't match the empty string because there is ^ in the regexp

That explains it. I still have to say that I find this a bit surprising.
Moreover, this case strikes me as an error:

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that “$” is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return “XX”?

Regards

robert

Kristof_Bastiaensen1 · 14 May 2004 13:09

I would say the empty string is included with “123-456”,
so it shouldn’t give another match:

echo "123-456" | sed 's/.*/X/g'
X

Kristof


ts1 · 14 May 2004 12:24

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that "$" is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return "XX"?

This is the same explanation. Your regexp means match a string which is at
the end (where end can be \n or the end of string, in this case)

  * first match : "123-456"
  * now it's at end
  * it can match the empty string

$ is like ^ : it doesn't match a character but a position in the string
(sort of ...)
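To illustrate the "position, not character" point, here is a small sketch that asks Ruby what `$` and `^` actually matched on their own. The variable names are illustrative only.

```ruby
s = "123-456"

end_match = s.match(/$/)
p end_match[0]          # => ""   (no characters consumed)
p end_match.begin(0)    # => 7    (the position after the last character)

# Replacing a zero-width match inserts text instead of overwriting any.
at_end   = s.gsub(/$/, "X")
at_start = s.gsub(/^/, "X")
p at_end                # => "123-456X"
p at_start              # => "X123-456"
```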

Guy Decoux

Ara.T.Howard · 14 May 2004 13:13

^ and $ are special and they consume no chars and so are not really ‘matched’
in the same way…

your regex says ‘zero or more chars before the end of a string’ so you get

^ 1 2 3 - 4 5 6 $


the first go, then scanning starts again - the problem is that it’s then
looking for something potentially zero-width followed by something
zero-width - which is always going to match (again). i guess the difference for
the second match is that it does not advance the scanner ptr and can therefore
know it’s done… it does seem odd, but without that behaviour it would be
hard to match empty strings, line boundaries, and other zero-width things…
for instance if you did this

"123-456".gsub(/.*|$/, 'X')

you would expect ‘XX’, where the second ‘X’ is inserted into a zero-width
position and ‘.*’ does not include ‘$’, and yet this is really the same exact
behaviour - scanning is done again from the non-space before the end of line,
allowing you to finally match ‘$’, which ‘.*’ did not consume.
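The scanning described above can be sketched as a loop. This is a simplified emulation under my own assumptions, not Ruby's actual implementation: `gsub_trace` is a hypothetical helper that restarts matching where the previous match ended, and steps one character forward after an empty match so that it cannot loop forever.

```ruby
# Hypothetical helper: returns the substituted string plus the
# [start, end] offsets of every match that was replaced.
def gsub_trace(str, re, replacement)
  out = "".dup
  spans = []
  pos = 0
  while pos <= str.length && (m = re.match(str, pos))
    start, stop = m.begin(0), m.end(0)
    spans << [start, stop]
    out << str[pos...start] << replacement
    if start == stop
      # Empty match: copy one character and step past it, otherwise the
      # same zero-width position would match again and again.
      out << str[stop, 1]
      pos = stop + 1
    else
      pos = stop
    end
  end
  out << str[pos..-1].to_s # unmatched tail, if any
  [out, spans]
end

p gsub_trace("123-456", /.*$/, "X")   # => ["XX", [[0, 7], [7, 7]]]
p gsub_trace("123-456", /^.*$/, "X")  # => ["X", [[0, 7]]]
```

The second span, [7, 7], is the empty match at the end of the string; with ^ in the pattern it is rejected because position 7 is not the start of a line.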

regexes can be so tricky, i try to use these rules with them

  * always use both ^ and $ (this makes it a lot harder to write the expression
    too!)
  * never use .* (or * at all really)
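In the spirit of those rules, here are a few ways to get a single "X" out of the original example. This is only a sketch: \A and \z (start and end of the whole string) are anchors I have swapped in for ^ and $, and the variable names are illustrative.

```ruby
s = "123-456"

whole    = s.gsub(/\A.*\z/, "X")  # anchored at both ends of the string
once     = s.sub(/.*/, "X")       # sub replaces only the first match
nonempty = s.gsub(/.+/, "X")      # + cannot match the empty string

p whole     # => "X"
p once      # => "X"
p nonempty  # => "X"
```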

the last is actually pretty important - we use a product here, ldm (local data
manager), that scans a huge memory-mapped queue full of data products, matching
a list of actions against the product tags. the list of actions use regexps
and all of ours had ‘.*’ in them. top showed the ldm process at around 30%
cpu - reworking the patterns to not include ‘.*’ dropped it off the radar.

-a

===============================================================================

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
URL :: Solar-Terrestrial Physics Data | NCEI
TRY :: for l in ruby perl;do $l -e “print "\x3a\x2d\x29\x0a"”;done
===============================================================================

Ara.T.Howard · 14 May 2004 13:23

yes but

~ > echo "123-456" | perl -npe 's/.*$/X/g'
XX

and sed regexps are not the same as perl/ruby right?

-a



Dave_Burt · 14 May 2004 13:18

Perl and Javascript (MSIE) do it too, so I don’t propose changing it, but it
seems a strange (wrong) behaviour.

I would have thought that the second match you refer to should have been
included in the first; that is, that the greedy match should match (and then
replace) the whole string, including the 0 characters between the last
character and the end of the string.

But apparently it’s not like that.

“ts” decoux@moulon.inra.fr wrote in message
news:200405141224.i4ECOZo25483@moulon.inra.fr…

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that "$" is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return "XX"?

This is the same explanation. Your regexp means match a string which is at
the end (where end can be \n or the end of string, in this case)

  * first match : "123-456"
  * now it's at end
  * it can match the empty string

$ is like ^ : it doesn't match a character but a position in the string
(sort of ...)

Guy Decoux

Kristof_Bastiaensen1 · 14 May 2004 13:28

echo "123-456" | sed 's/.*/X/g'
X

Kristof

yes but

~ > echo "123-456" | perl -npe 's/.*$/X/g'
XX

gawk also returns X:

$ echo "123-456" | gawk '{ gsub(/.*/, "X"); print }'
X

and sed regexps are not the same as perl/ruby right?

-a

No, but I would expect the basic ones to behave the same
way. (sed and gawk have been there before perl/ruby/javascript).
Is that not an expectation that can be trusted?
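For comparison, here is a rough Ruby sketch of the other convention. It assumes the rule behind the sed and gawk output is "an empty match directly after the previous match is skipped"; that rule and the helper name `sed_like_gsub` are my assumptions, not something taken from either tool's source.

```ruby
# Hypothetical helper: global substitution that refuses an empty match
# glued to the end of the previous match.
def sed_like_gsub(str, re, replacement)
  out = "".dup
  pos = 0
  prev_end = nil
  while pos <= str.length && (m = re.match(str, pos))
    start, stop = m.begin(0), m.end(0)
    if start == stop && start == prev_end
      # Empty match right after the previous match: skip it and
      # keep the next character unchanged.
      out << str[pos, 1]
      pos += 1
      next
    end
    out << str[pos...start] << replacement
    prev_end = stop
    if start == stop
      out << str[stop, 1] # step over one character after an empty match
      pos = stop + 1
    else
      pos = stop
    end
  end
  out << str[pos..-1].to_s # unmatched tail, if any
end

p sed_like_gsub("123-456", /.*/, "X")  # => "X"
p "123-456".gsub(/.*/, "X")            # => "XX"
```

Under that assumption the only difference from Ruby's gsub is the skipped empty match at the end, which is exactly the second "X".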

···

On Fri, 14 May 2004 07:16:57 -0600, Ara.T.Howard wrote:

Simon_Strandgaard1 · 14 May 2004 13:36

Ara.T.Howard wrote:

On Fri, 14 May 2004, Kristof Bastiaensen wrote:

On Fri, 14 May 2004 20:18:55 +0900, ts wrote:

irb(main):023:0> "123-456".gsub(/.*/, 'X')
=> "XX"

  * first match : "123-456"
  * now it's at end
  * second match with the empty string

I would say the empty string is included with “123-456”,
so it shouldn’t give another match:

echo "123-456" | sed 's/.*/X/g'
X

Kristof

yes but

~ > echo "123-456" | perl -npe 's/.*$/X/g'
XX

and sed regexps are not the same as perl/ruby right?

This is a widespread problem with regexps: when dealing with the Kleene star,
it’s tricky to determine when to stop looping. I have put a lot of effort into
investigating where to stop in my engine, so that the output is the most
desirable one.

Unfortunately, Ruby’s native regexp engines (GNU or Oniguruma) attempt to be
Perl-compatible, and thus sometimes emulate undesired behavior.

–
Simon Strandgaard

Robert · 14 May 2004 15:13

“Ara.T.Howard” ahoward@fattire.ngdc.noaa.gov wrote in message
news:Pine.LNX.4.44.0405140634490.5586-100000@fattire.ngdc.noaa.gov…

^ and $ are special and they consume no chars and so are not really ‘matched’
in the same way…

your regex says ‘zero or more chars before the end of a string’ so you get

^ 1 2 3 - 4 5 6 $
             ^
the first go, then scanning starts again - the problem is that it’s then
looking for something potentially zero-width followed by something
zero-width - which is always going to match (again). i guess the difference for
the second match is that it does not advance the scanner ptr and can therefore
know it’s done… it does seem odd,

Definitely! What strikes me as odd is that the engine must know the start and
end of the match. So it could realize that the end is at the end.

but without that behaviour it would be hard to match empty strings, line
boundaries, and other zero-width things…

You mean because then it would immediately stop without matching anything.
Yeah, might be true.

The sed and awk examples show that apparently there’s disagreement on how
this should be handled. I just wonder why I didn’t step into this pitfall
earlier. Apparently I never felt the need for .* in a replacement context
before.

Thx all!

Kind regards

robert

Dave_Burt · 15 May 2004 04:13

I would have thought, from that logic, that you could just as well expect an
infinite loop (“XXXXXX…”) rather than just “XX” - why does /.*/ not keep
matching that same 0-char gap at the end?


ts1 · 14 May 2004 15:21

Definitely! What strikes me as odd is that the engine must know the start and
end of the match. So it could realize that the end is at the end.

Well you can have another explanation in 'man 7 regex' (linux) (or another
way to see it)

Match lengths are measured in characters, not collating elements. A
null string is considered longer than no match at all.

It's in the case of null string vs no match
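A small sketch of "null string vs no match" in Ruby terms: a pattern that may match zero characters succeeds with an empty match where a pattern that needs at least one character simply fails. The variable names are illustrative.

```ruby
s = "123-456"

star = /x*/.match(s)            # x* is allowed to match zero characters
plus = /x+/.match(s)            # x+ needs at least one "x"

p star[0]         # => ""   (a null-string match)
p star.begin(0)   # => 0
p plus            # => nil  (no match at all)

# The same thing happens at the end of the string, which is where the
# second "X" comes from.
tail = /.*/.match(s, s.length)
p tail[0]         # => ""
p tail.begin(0)   # => 7
```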

Guy Decoux

Robert · 15 May 2004 08:08

“Dave Burt” burtdav@hotmail.com wrote in message
news:UJgpc.38870$TT.13050@news-server.bigpond.net.au…

I would have thought, from that logic, that you could just as well expect an
infinite loop (“XXXXXX…”) rather than just “XX” - why does /.*/ not keep
matching that same 0-char gap at the end?

I thought that for a moment, too. But he gave the answer already:

i guess the difference for the second match is that it does not advance the
scanner ptr and can therefore know it’s done…
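A quick way to see that an empty match is taken at most once per position is a pattern that never matches any character at all. The names below are illustrative.

```ruby
# /x*/ can only match the empty string in "abc", yet gsub terminates:
# one replacement per position, then the scan moves on by one character.
result = "abc".gsub(/x*/, "-")
p result      # => "-a-b-c-"

# Record where each empty match was found.
positions = []
"abc".scan(/x*/) { positions << Regexp.last_match.begin(0) }
p positions   # => [0, 1, 2, 3]
```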

Regards

robert

Robert · 15 May 2004 08:13

“ts” decoux@moulon.inra.fr wrote in message
news:200405141521.i4EFLOj03962@moulon.inra.fr…

Definitely! What strikes me as odd is that the engine must know the start and
end of the match. So it could realize that the end is at the end.

Well you can have another explanation in ‘man 7 regex’ (linux) (or another
way to see it)

Match lengths are measured in characters, not collating elements. A
null string is considered longer than no match at all.

It’s in the case of null string vs no match

This reminds me of the mathematician that found an epsilon so small, that -
if you divided it in halves - it was already negative.

Cheers

robert

Simon_Strandgaard1 · 15 May 2004 08:23

Robert Klemme wrote:

“ts” decoux@moulon.inra.fr wrote in message
news:200405141521.i4EFLOj03962@moulon.inra.fr…

Definitely! What strikes me as odd is that the engine must know the start and
end of the match. So it could realize that the end is at the end.

Well you can have another explanation in ‘man 7 regex’ (linux) (or another
way to see it)

Match lengths are measured in characters, not collating elements. A
null string is considered longer than no match at all.

It’s in the case of null string vs no match

This reminds me of the mathematician that found an epsilon so small, that -
if you divided it in halves - it was already negative.

Epsilon transitions are a very interesting feature of regexps… I like them.
However, variable-width lookbehind with subcaptures and backreferences is
even more amazing (that would be suitable for a small research project).

–
Simon Strandgaard
