Applying your explanation, one would have to say that “$” is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return “XX”?
This is the same explanation. Your regexp means: match a string which is at
the end (where the end can be before a \n or the end of the string, as in this case)
* first match: "123-456"
* now it's at the end
* there it can still match the empty string
$ is like ^: it doesn't match a character but a position in the string
(sort of ...)
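To make the two matches visible, here is a minimal Ruby sketch. The variable names are only for illustration. It collects every match of /.*$/ together with its start and end offsets, so the empty match at the end position shows up explicitly.

```ruby
# Collect every match of /.*$/ with its start and end offsets.
str = "123-456"
matches = []
str.scan(/.*$/) do
  m = Regexp.last_match          # MatchData for the current match
  matches << [m[0], m.begin(0), m.end(0)]
end
p matches  # prints [["123-456", 0, 7], ["", 7, 7]]
```

The second entry is the zero-width match at offset 7, which is why gsub inserts a second "X".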
Regards
robert
Perl and Javascript (MSIE) do it too, so I don’t propose changing it, but it
seems a strange (wrong) behaviour.
I would have thought that the second match you refer to should have been
included in the first; that is, that the greedy match should match (and then
replace) the whole string, including the 0 characters between the last
character and the end of the string.
and sed regexps are not the same as perl/ruby right?
-a
No, but I would expect the basic ones to behave the same
way. (sed and gawk were there before perl/ruby/javascript.)
Is that not an expectation that can be trusted?
On Fri, 14 May 2004 07:16:57 -0600, Ara.T.Howard wrote:
I would say the empty string is included with “123-456”,
so it shouldn’t give another match:
echo "123-456" | sed 's/.*/X/g'
X
Kristof
yes but
~ > echo "123-456" | perl -npe 's/.*$/X/g'
XX
and sed regexps are not the same as perl/ruby right?
This is a widespread problem with regexps: when dealing with the Kleene star,
it's tricky to determine when to stop looping. I have put a lot of effort into
investigating where to stop in my engine, so that the output is the most desirable one.
Unfortunately Ruby's native regexp engines (GNU or Oniguruma) attempt to be
Perl-compatible, and thus sometimes emulate undesired behavior.
^ and $ are special and they consume no chars and so are not really
‘matched’ in the same way…
your regex says ‘zero or more chars before the end of a string’ so you get
^ 1 2 3 - 4 5 6 $
^
the first go, then scanning starts again - the problem is that it’s then
looking for something potentially zero-width followed by something
zero-width - which is always going to match (again). i guess the difference
for the second match is that it does not advance the scanner ptr and can
therefore know it’s done… it does seem odd,
Definitely! What strikes me as odd is that the engine must know the start and
end of the match. So it could realize that the end is at the end.
but without that behaviour it would be hard to match empty strings, line
boundaries, and other zero-width things…
You mean because then it would immediately stop without matching anything.
Yeah, might be true.
The sed and awk examples show that apparently there’s disagreement on how
this should be handled. I just wonder why I didn’t step into this pitfall
earlier. Apparently I never felt the need for .* in a replacement context
before.
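For anyone who wants a single replacement here, a few ways to sidestep the second, empty match can be sketched in Ruby. This is only an illustration with made-up variable names, not a recommendation from anyone in this thread.

```ruby
str = "123-456"

once      = str.sub(/.*$/, "X")      # sub replaces only the first match
non_empty = str.gsub(/.+$/, "X")     # .+ needs at least one char, so the empty tail cannot match
anchored  = str.gsub(/\A.*\z/, "X")  # \A only matches at offset 0, so no second match at the end

puts once, non_empty, anchored       # each line prints X
```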
I would have thought, from that logic, that you could just as well expect an
infinite loop (“XXXXXX…”) rather than just “XX” - why does /.*/ not keep
matching that same 0-char gap at the end?
^ and $ are special and they consume no chars and so are not really
‘matched’ in the same way…
your regex says ‘zero or more chars before the end of a string’ so you get
^ 1 2 3 - 4 5 6 $
^
the first go, then scanning starts again - the problem is that it’s then
looking for something potentially zero-width followed by something zero-width -
which is always going to match (again). i guess the difference for the second
match is that it does not advance the scanner ptr and can therefore know it’s
done… it does seem odd, but without that behaviour it would be hard to match
empty strings, line boundaries, and other zero-width things…
for instance if you did this
"123-456".gsub(/.*|$/, 'X')
you would expect ‘XX’, where the second ‘X’ is inserted into a zero-width
position and ‘.*’ does not include ‘$’, and yet this is really the same exact
behaviour - scanning is done again from the non-space before the end of line,
allowing you to finally match ‘$’, which ‘.*’ did not consume.
regexes can be so tricky, i try to use these rules with them
* always use both ^ and $ (this makes it a lot harder to write the expression
  too!)
* never use .* (or * at all really)
the last is actually pretty important - we use a product here, ldm (local data
manager), that scans a huge memory-mapped queue full of data products, matching
a list of actions against the product tags. the list of actions use regexps
and all of ours had ‘.*’ in them. top showed the ldm process at around 30%
cpu - reworking the patterns to not include ‘.*’ dropped it off the radar.
EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
URL :: Solar-Terrestrial Physics Data | NCEI
TRY :: for l in ruby perl;do $l -e “print "\x3a\x2d\x29\x0a"”;done
“Dave Burt” burtdav@hotmail.com wrote in message
news:UJgpc.38870$TT.13050@news-server.bigpond.net.au…
I would have thought, from that logic, that you could just as well expect an
infinite loop (“XXXXXX…”) rather than just “XX” - why does /.*/ not keep
matching that same 0-char gap at the end?
I thought that for a moment, too. But he gave the answer already:
i guess the difference for the second match is that it does not advance the
scanner ptr and can therefore know it’s done…
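The "does not advance the scanner ptr" point can be sketched as a hand-rolled global replace. The helper naive_gsub below is hypothetical and is not how Ruby actually implements gsub. It only shows one plausible way a scanner can accept an empty match once and then step past it, so that the same gap is never matched twice.

```ruby
# Hypothetical helper for illustration only; not Ruby's real implementation.
def naive_gsub(str, re, repl)
  out = String.new
  pos = 0
  while pos <= str.length
    m = re.match(str, pos)               # search from the current scanner position
    break unless m
    out << str[pos...m.begin(0)] << repl # copy skipped text, then the replacement
    if m.end(0) == m.begin(0)
      # Empty match: copy one character through and step past it,
      # otherwise the same zero-width gap would match forever.
      out << str[m.end(0), 1].to_s
      pos = m.end(0) + 1
    else
      pos = m.end(0)
    end
  end
  out << str[pos..-1].to_s               # copy whatever is left after the last match
  out
end

puts naive_gsub("123-456", /.*$/, "X")   # prints XX
```

Without the forced step after an empty match, the loop would keep matching the 0-char gap at offset 7 and never terminate, which is the "XXXXXX…" case asked about.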
Definitely! What strikes me as odd is that the engine must know the start and
end of the match. So it could realize that the end is at the end.
Well, you can find another explanation (or another way to see it) in
‘man 7 regex’ (linux):
Match lengths are measured in characters, not collating elements. A null
string is considered longer than no match at all.
This is the case of null string vs. no match.
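The man page quoted above describes POSIX regexps, not Ruby's engine, so the following is only an analogy: a small Ruby sketch showing that a pattern which can match the empty string yields an empty match rather than no match at all.

```ruby
empty = /x*/.match("abc")   # x* may match zero chars, so this succeeds
none  = /x+/.match("abc")   # x+ needs at least one "x", so this fails

p empty[0]        # prints ""
p empty.begin(0)  # prints 0
p none            # prints nil
```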
This reminds me of the mathematician that found an epsilon so small, that -
if you divided it in halves - it was already negative.
Epsilon transitions are a very interesting feature of regexps… I like them.
However, variable-width lookbehind with subcaptures and backreferences is
even more amazing (that would be suitable for a small research project).