Regexp Error?

ts wrote:

“R” == Robert Klemme bob.news@gmx.net writes:

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that "$" is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return "XX"?

This is the same explanation. Your regexp means: match a string which is at
the end (where the end can be before a \n or, as in this case, the end of the string)

  • first match : “123-456”
  • now it’s at end
  • it can match the empty string

$ is like ^ : it doesn't match a character but a position in the string
(sort of ...)
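
The two matches can be made visible by recording where each one starts and ends. This is only a minimal sketch, assuming a recent Ruby; the variable names `found` and `replaced` are made up for illustration.

```ruby
# Record the start offset, end offset and text of every match gsub finds.
text = "123-456"
found = []
replaced = text.gsub(/.*$/) do
  m = Regexp.last_match          # MatchData for the current match
  found << [m.begin(0), m.end(0), m[0]]
  "X"
end

p found     # one non-empty match, then an empty one at the end position
p replaced
```

The second entry has the same start and end offset: it is the empty string sitting at the end position, which `$` accepts without consuming anything.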

Does ANY other language that you know of have this behavior? What
advantage does it give Ruby to have it?

(In other words, this seems to me like “it’s not a bug, it’s a
feature!” type issue.)

···


Does *ANY* other language that you know of have this behavior? What
advantage does it give Ruby to have it?

My problem is to name a language that *doesn't* have this behaviour :-)

Not sure, but this must be the difference between "modern" style (posix, v8)
and "old" style (sed, awk) : but I know nothing about regexps

What do you expect with ?

  /(.*).*/ =~ "abc"

Guy Decoux

[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule… I would guess the output
should be [“abc”, “abc”]…

same output with Gnu and Oniguruma
irb(main):001:0> /(.*).*/.match("abc").to_a
=> ["abc", "abc"]
irb(main):002:0>

···

ts decoux@moulon.inra.fr wrote:


Simon Strandgaard

Then you have made .* match *twice* :-)

If the regexp can't match the empty string in the second .* (because it's
included in the first), it must not give a result

Guy Decoux

···

ts <decoux@moulon.inra.fr> wrote:
[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule.. I would guess the output
should be ["abc", "abc"]..

I don’t understand you here… (maybe your assumption is wrong) ?

Let's break it down

  1. The left-most .* will first eat to the end of input, and will now contain “abc”.
    Then it hits end of input.

  2. the left-most .* backtracks one time

  3. The right-most .* eats to end of input… it now contains “”

  4. the left-most .* backtracks one time

  5. Done.

···

Simon Strandgaard

[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule… I would guess the output
should be ["abc", "abc"]…

Then you have made .* match twice :-)

If the regexp can’t match the empty string in the second .* (because it’s
included in the first), it must not give a result

I don’t understand you here… (maybe your assumption is wrong) ?

Lets break it down

  1. The left-most .* will first eat to the end of input, and will now contain “abc”.
    Then it hits end of input.

  2. the left-most .* backtracks one time

  3. The right-most .* eats to end of input… it now contains “”

  4. the left-most .* backtracks one time

      ^^^^^^^^^
      right-most
···

Simon Strandgaard neoneye@adslhome.dk wrote:

On Fri, 14 May 2004 23:21:27 +0900 > ts decoux@moulon.inra.fr wrote:

ts decoux@moulon.inra.fr wrote:

  5. Done.


Simon Strandgaard

Let's break it down

svg% ruby -rjj -e '"abc".match(/(.*).*/)'
Regexp /(.*).*/
  0 start_memory $1
  1 anychar_repeat
  2 stop_memory $1
  3 anychar_repeat
  4 end
subexpressions : 1
Fastmap supplied : \000-\011\013-\377

String <<abc>> pos=0

  0 start_memory |abc |
  1 anychar_repeat |abc | >2[0] >2[1] >2[2] >2[3] F2[3]
  2 stop_memory abc| | $1=abc
  3 anychar_repeat abc| | >4[0] F4[0] SUCCESS
svg%

What I want to say is that if the second .* can't match the empty string
because it's included in the first match, then the result will be
different.

Guy Decoux
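
Guy's point can be checked by swapping the second `.*` for `.+`, which is not allowed to match the empty string. A small sketch, assuming a recent Ruby (the names `star` and `plus` are just for illustration):

```ruby
# Second quantifier may match the empty string: group 1 keeps everything.
star = /(.*).*/.match("abc")

# Second quantifier needs at least one character: group 1 must give one back.
plus = /(.*).+/.match("abc")

p star.to_a
p plus.to_a
```

With `.*` the first group keeps all of "abc" and the second `.*` succeeds on nothing; with `.+` the first group has to backtrack by one character, so the captures differ.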

what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don't have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it's correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching. therefore we
all agree that

'123-456'.gsub(/.*$/, 'X')  # => 'XX'

because if it didn't '(.*).*' could not match 'abc'

-a

···

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
URL :: Solar-Terrestrial Physics Data | NCEI
TRY :: for l in ruby perl;do $l -e “print "\x3a\x2d\x29\x0a"”;done
===============================================================================

I wasn't aware that GNU or Oniguruma represented the empty string as a token, but it
makes sense. I have made my engine slightly different, where I have to do loop
detection instead.

···


Simon Strandgaard

Ara.T.Howard wrote:

[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule… I would guess the output
should be ["abc", "abc"]…

Then you have made .* match twice :-)

If the regexp can’t match the empty string in the second .* (because it’s
included in the first), it must not give a result

I don’t understand you here… (maybe your assumption is wrong) ?

what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don't have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it's correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching. therefore we
all agree that

'123-456'.gsub(/.*$/, 'X')  # => 'XX'

because if it didn't '(.*).*' could not match 'abc'

What Guy told me makes sense… your explanation also makes sense. The output 'XX' is
because GNU/Oniguruma uses the empty-string token.

I guess sed and awk use loop detection instead, and therefore output 'X'.

My engine uses loop detection, so it produces more desirable results. http://raa.ruby-lang.org/list.rhtml?name=regexp

For instance my engine outputs
/a(a|)*/ ~= 'aaab' → ['aaa', 'a']
/a(|ab)*b/ ~= 'aaabbb' → ['aabb', 'ab']
/x(y?)*z/ ~= 'xyz' → ['xyz', 'y']
/x(y{0,2})*z/ ~= 'xyz' → ['xyz', 'y']
If you try the same examples with GNU or Oniguruma, you will see the last elements
are empty; I guess this is because they use the empty-token concept.

···

On Fri, 14 May 2004, Simon Strandgaard wrote:

On Fri, 14 May 2004 23:21:27 +0900 > > ts decoux@moulon.inra.fr wrote:

ts decoux@moulon.inra.fr wrote:


Simon Strandgaard

what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don't have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it's correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching.

Agree until here.

therefore we
all agree that

'123-456'.gsub(/.*$/, 'X')  # => 'XX'

because if it didn't '(.*).*' could not match 'abc'

-a

I don't agree. The thing is, you can't really consume 'nothing'.
If I follow your reasoning, the above should give 'XXX', because
123-456 matches /(.*)(.*)(.*)/. How many empty spaces are there between
two characters? One, three, infinite? There have to be rules, and I would
find the following the most logical one:

  • when the empty space is already matched before, don’t match
    again, unless explicitly against ? or *. (gawk behaviour?)

Kristof
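
Both halves of this argument can be tried out. The sketch below assumes a recent Ruby; it only shows what that engine does, not which rule is the "right" one.

```ruby
# Inside ONE match, several groups may each match the empty string
# at the same position.
m = /(.*)(.*)(.*)/.match("123-456")
p m.captures
p m.offset(2)   # [start, end] of the second group
p m.offset(3)   # [start, end] of the third group

# gsub, however, looks for successive matches of the WHOLE pattern,
# and finds one more match (the empty one) at the end position.
result = "123-456".gsub(/.*$/, "X")
p result
```

So groups two and three both sit empty at offset 7 within a single match, while the global substitution still stops after two matches rather than three.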

···

On Fri, 14 May 2004 08:54:22 -0600, Ara.T.Howard wrote:

Huh? False advertisement?

[gus@comp tmp]$ cat regexp.rb
p /a(a|)*/.match('aaab').to_a
p /a(|ab)*b/.match('aaabbb').to_a
p /x(y?)*z/.match('xyz').to_a
p /x(y{0,2})*z/.match('xyz').to_a
[gus@comp tmp]$ ruby regexp.rb
["aaa", "a"]
["aabb", "ab"]
["xyz", "y"]
["xyz", "y"]
[gus@comp tmp]$ ruby -v
ruby 1.8.1 (2004-03-31) [i586-linux-gnu]

Guillaume.

···


but then

/^$/ would not match ''

empty tokens must not consume and must stack

-a
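
For reference, this is what a recent Ruby does with that pattern on the empty string. A minimal sketch; `m` and `replaced_empty` are illustration names only.

```ruby
# Both ^ and $ are zero-width: they succeed at position 0 of ""
# without consuming anything, so the whole pattern matches.
m = /^$/.match("")
p m[0]          # the matched text
p m.offset(0)   # [start, end] of the match

# The empty match is also found by gsub.
replaced_empty = "".gsub(/^$/, "X")
p replaced_empty
```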

···


My engine uses loop detection, so it produces more desirable results. http://raa.ruby-lang.org/list.rhtml?name=regexp

For instance my engine outputs
/a(a|)*/ ~= 'aaab' → ['aaa', 'a']
/a(|ab)*b/ ~= 'aaabbb' → ['aabb', 'ab']
/x(y?)*z/ ~= 'xyz' → ['xyz', 'y']
/x(y{0,2})*z/ ~= 'xyz' → ['xyz', 'y']
If you try the same examples with GNU or Oniguruma, you will see the last elements
are empty; I guess this is because they use the empty-token concept.

Huh? False advertisement?

[gus@comp tmp]$ cat regexp.rb
p /a(a|)*/.match('aaab').to_a
p /a(|ab)*b/.match('aaabbb').to_a
p /x(y?)*z/.match('xyz').to_a
p /x(y{0,2})*z/.match('xyz').to_a
[gus@comp tmp]$ ruby regexp.rb
["aaa", "a"]
["aabb", "ab"]
["xyz", "y"]
["xyz", "y"]
[gus@comp tmp]$ ruby -v
ruby 1.8.1 (2004-03-31) [i586-linux-gnu]

I am sorry about the false statement… however on Oniguruma (the future) it is true

cat a.rb
p /a(a|)*/.match('aaab').to_a
p /a(|ab)*b/.match('aaabbb').to_a
p /x(y?)*z/.match('xyz').to_a
p /x(y{0,2})*z/.match('xyz').to_a
ruby -v a.rb
ruby 1.9.0 (2004-04-16) [i386-freebsd5.1]
a.rb:1: warning: ambiguous first argument; put parentheses or even spaces
a.rb:2: warning: ambiguous first argument; put parentheses or even spaces
a.rb:3: warning: ambiguous first argument; put parentheses or even spaces
a.rb:4: warning: ambiguous first argument; put parentheses or even spaces
["aaa", ""]
["aabb", ""]
["xyz", ""]
["xyz", ""]

···

On Sat, 15 May 2004 01:02:48 +0900 Guillaume Marcais guslist@free.fr wrote:

On Fri, 2004-05-14 at 11:47, Simon Strandgaard wrote:


Simon Strandgaard

I would say it does match, because the empty token wouldn’t have been
matched at that moment. Perhaps a better way to formulate it would be
this:

  • consume the empty token when matching greedily, but still allow it
    to be matched against * and ?.
    This way the empty token will be treated as part of any string (not a
    separate entity).
    I don't want to say it is the only way, but it seems the most logical to
    me.

(But maybe it is best to just avoid such a regexp :-)

k
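
One way to picture a rule of this kind is a global substitution that refuses an empty match when it touches the end of the previous match. The helper below is purely hypothetical (the name `gsub_skip_adjacent_empty` is invented here, it is not part of Ruby or of any engine mentioned in this thread), it treats the replacement as a literal string, and it is only a rough sketch of the idea rather than how sed or gawk are actually implemented.

```ruby
# Hypothetical helper: like gsub with a literal replacement, except that an
# empty match directly after the previous match is not replaced.
def gsub_skip_adjacent_empty(text, pattern, replacement)
  result = +""
  pos = 0           # where the next search starts
  last_end = nil    # end offset of the previous accepted match
  while pos <= text.length
    m = pattern.match(text, pos)
    break unless m
    if m[0].empty? && m.begin(0) == last_end
      # Empty match glued to the previous match: skip it and move on.
      break if pos >= text.length
      result << text[pos]
      pos += 1
      next
    end
    result << text[pos...m.begin(0)] << replacement
    last_end = m.end(0)
    if m[0].empty?
      # Copy one character so the scan makes progress after an empty match.
      result << text[m.end(0)] if m.end(0) < text.length
      pos = m.end(0) + 1
    else
      pos = m.end(0)
    end
  end
  result << text[pos..-1] if pos < text.length
  result
end

p gsub_skip_adjacent_empty("123-456", /.*$/, "X")
p "123-456".gsub(/.*$/, "X")
```

Under this rule the trailing empty match is dropped, so the helper gives a single replacement where the built-in `gsub` gives two, while an empty match that does not touch a previous match is still replaced.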

···

On Fri, 14 May 2004 10:41:20 -0600, Ara.T.Howard wrote:

but then

/^$/ would not match ''

empty tokens must not consume and must stack

-a