Applying your explanation, one would have to say that “$” is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return “XX”?
This is the same explanation. Your regexp means: match a string
which is at
the end (where the end can be before \n or the end of the string, in this case).
First match: “123-456”
Now it’s at the end,
so it can match the empty string.
$ is like ^: it doesn’t match a character but a position in the
string
(sort of …)
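For reference, a minimal sketch of the behaviour described here, assuming a current Ruby with its standard regexp engine. `scan` is used only to list the matches that `gsub` would replace; the variable names are illustrative.

```ruby
str = '123-456'

# scan returns every successive match of the pattern, which is
# exactly the list of things gsub replaces.
matches = str.scan(/.*$/)
p(matches)   # the whole string, then an empty match at the end

# Each of the two matches is replaced by one 'X'.
replaced = str.gsub(/.*$/, 'X')
p(replaced)

# `$` is zero-width: it matches a position, not a character.
end_position = (str =~ /$/)
p(end_position)
```

The second match is the empty string sitting at the end position, which is why two replacements happen.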
Does ANY other language that you know of have this behavior? What
advantage does it give Ruby to have it?
(In other words, this seems to me like “it’s not a bug, it’s a
feature!” type issue.)
···
what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don’t have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it’s correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching. therefore we
all agree that
I wasn’t aware that GNU or Oniguruma represented the empty string as a token, but it
makes sense. I have made my engine slightly different, where I have to do loop
detection instead.
According to the left-most-longest rule… I would guess the output
should be [“abc”, “abc”]…
Then you have to make .* match twice.
If the regexp can’t match the empty string in the second .* (because it’s
included in the first), it must not give a result.
I don’t understand you here… (maybe your assumption is wrong) ?
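A small sketch of the example being argued about, assuming the pattern in question is /(.*).*/ matched against 'abc'. The second pattern, with two capture groups, is an addition for illustration only: it makes the trailing empty match visible.

```ruby
# Group 1 greedily takes all of 'abc'; the trailing .* then
# matches the empty string at the end, so the overall match succeeds.
m = /(.*).*/.match('abc')
whole_and_group = m.to_a
p(whole_and_group)

# With a second group, the empty match can be inspected directly.
m2 = /(.*)(.*)/.match('abc')
two_groups = m2.to_a
p(two_groups)
p(m2.begin(2))   # position where the empty second group sits
```

If the second .* were not allowed to match the empty string, neither pattern could match 'abc' with the first group taking everything.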
what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don’t have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it’s correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching. therefore we
all agree that
'123-456'.gsub /.*$/, 'X' => 'XX'
because if it didn’t, '(.*).*' could not match 'abc'
What Guy told me makes sense… your explanation also makes sense. The output ‘XX’ is
because GNU/Oniguruma uses the empty-string token.
I guess sed and awk use loop-detection instead, and therefore output ‘X’.
my engine uses loop-detection, and thus outputs more desirable results.
http://raa.ruby-lang.org/list.rhtml?name=regexp
For instance my engine outputs
/a(a|)*/ ~= ‘aaab’ → [‘aaa’, ‘a’]
/a(|ab)*b/ ~= ‘aaabbb’ → [‘aabb’, ‘ab’]
/x(y?)*z/ ~= ‘xyz’ → [‘xyz’, ‘y’]
/x(y{0,2})*z/ ~= ‘xyz’ → [‘xyz’, ‘y’]
If you try the same examples with GNU or Oniguruma, you will see the last elements
are empty. I guess this is because they use the empty-token concept.
what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don’t have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it’s correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching.
Agree until here.
therefore we
all agree that
'123-456'.gsub /.*$/, 'X' => 'XX'
because if it didn’t, '(.*).*' could not match 'abc'
-a
I don’t agree. The thing is, you can’t really consume ‘nothing’.
If I follow your reasoning, the above should give ‘XXX’, because
123-456 matches /(.*)(.*)(.*)/. How many empty spaces are there between
two characters? One, three, infinite? There have to be rules, and I would
find the following the most logical one:
when the empty space has already been matched before, don’t match it
again, unless explicitly against ? or *. (gawk behaviour?)
Kristof
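One way to read the rule proposed above is as a replacement loop that ignores an empty match starting exactly where the previous match ended. The sketch below is a hypothetical helper (the name gsub_skip_adjacent_empty is made up for this example); it is not how gawk, sed or Ruby are implemented, only an illustration of that reading.

```ruby
# Hypothetical helper: behaves like String#gsub with a plain string
# replacement, except that an empty match beginning exactly where the
# previous match ended is ignored.
def gsub_skip_adjacent_empty(str, pattern, replacement)
  result = ''.dup
  pos = 0          # where the next search starts
  last_end = nil   # end offset of the previous accepted match

  while pos <= str.length && (m = pattern.match(str, pos))
    start = m.begin(0)
    finish = m.end(0)

    if start == finish && start == last_end
      # Empty match glued to the previous match: skip it.
      break if start >= str.length
      result << str[start]
      pos = start + 1
      next
    end

    # Copy the unmatched text before the match, then the replacement.
    result << str[pos...start] << replacement
    last_end = finish

    if start == finish
      # Accepted empty match: step over one character to make progress.
      result << str[finish] if finish < str.length
      pos = finish + 1
    else
      pos = finish
    end
  end

  # Copy whatever is left after the last match.
  result << str[pos..-1] if pos < str.length
  result
end

builtin = '123-456'.gsub(/.*$/, 'X')
skipping = gsub_skip_adjacent_empty('123-456', /.*$/, 'X')
p(builtin)
p(skipping)
```

Under this reading the empty string at the end of '123-456' is not replaced a second time, because the preceding match already ended there.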
···
On Fri, 14 May 2004 08:54:22 -0600, Ara.T.Howard wrote:
[gus@comp tmp]$ cat regexp.rb
p /a(a|)*/.match('aaab').to_a
p /a(|ab)*b/.match('aaabbb').to_a
p /x(y?)*z/.match('xyz').to_a
p /x(y{0,2})*z/.match('xyz').to_a
[gus@comp tmp]$ ruby regexp.rb
["aaa", "a"]
["aabb", "ab"]
["xyz", "y"]
["xyz", "y"]
[gus@comp tmp]$ ruby -v
ruby 1.8.1 (2004-03-31) [i586-linux-gnu]
Guillaume.
···
On Fri, 2004-05-14 at 11:47, Simon Strandgaard wrote:
Huh? False advertisement?
I am sorry about the false statement… however, on Oniguruma (the future) it is true:
cat a.rb
p /a(a|)*/.match('aaab').to_a
p /a(|ab)*b/.match('aaabbb').to_a
p /x(y?)*z/.match('xyz').to_a
p /x(y{0,2})*z/.match('xyz').to_a
ruby -v a.rb
ruby 1.9.0 (2004-04-16) [i386-freebsd5.1]
a.rb:1: warning: ambiguous first argument; put parentheses or even spaces
a.rb:2: warning: ambiguous first argument; put parentheses or even spaces
a.rb:3: warning: ambiguous first argument; put parentheses or even spaces
a.rb:4: warning: ambiguous first argument; put parentheses or even spaces
["aaa", ""]
["aabb", ""]
["xyz", ""]
["xyz", ""]
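The same four cases can be rerun on whatever Ruby is at hand with a sketch like the one below. Passing the argument to `p` in parentheses avoids the "ambiguous first argument" warning seen above. The `cases` and `results` names are illustrative, and the value of the capture group is deliberately not stated here, since the two transcripts above show that it depends on the regexp engine.

```ruby
# Each pair is a pattern and the string it is matched against,
# taken from the examples in this thread.
cases = [
  [/a(a|)*/,      'aaab'],
  [/a(|ab)*b/,    'aaabbb'],
  [/x(y?)*z/,     'xyz'],
  [/x(y{0,2})*z/, 'xyz']
]

# to_a gives [whole match, group 1]; only group 1 differs by engine.
results = cases.map { |pattern, string| pattern.match(string).to_a }

results.each { |r| p(r) }
puts RUBY_VERSION
```

The whole match is the same in both transcripts; only the last iteration of the group, and so the captured value, is in dispute.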
···
On Sat, 15 May 2004 01:02:48 +0900 Guillaume Marcais guslist@free.fr wrote:
On Fri, 2004-05-14 at 11:47, Simon Strandgaard wrote:
I would say it does match, because the empty token wouldn’t have been
matched at that moment. Perhaps a better way to formulate it would be
this:
consume the empty token when matching greedily, but still allow it
to be matched against * and ?.
This way the empty token will be treated as part of any string (not a
separate entity).
I don’t want to say it is the only way, but it seems the most logical to
me.
(But maybe it is best to just avoid such a regexp.)
k
···
On Fri, 14 May 2004 10:41:20 -0600, Ara.T.Howard wrote:
On Fri, 14 May 2004, Kristof Bastiaensen wrote:
On Fri, 14 May 2004 08:54:22 -0600, Ara.T.Howard wrote: