Regexp Error?

Michael_Campbell1 · 14 May 2004 13:34

ts wrote:

“R” == Robert Klemme bob.news@gmx.net writes:

irb(main):025:0> "123-456".gsub(/.*$/, 'X')
=> "XX"

Applying your explanation, one would have to say that “$” is matched
twice. IMHO this is not correct. Or is there another explanation why
this gsub happens to return “XX”?

This is the same explanation. Your regexp means: match a string which is at
the end (where the end can be a \n or the end of the string, as in this case).

First match: “123-456”.

Now it’s at the end, where it can match the empty string.

$ is like ^: it doesn’t match a character but a position in the string
(sort of …)
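A small sketch of the two matches described here, using only String#scan and String#gsub from core Ruby. The constant names are made up for illustration, and the `.+` variant is an addition, not something proposed in this thread.

```ruby
# Illustration only; the constant names are hypothetical.
INPUT = "123-456"

# scan lists every match that gsub would replace: the whole string
# first, then the empty string at the end-of-string position.
MATCHES = INPUT.scan(/.*$/)
p MATCHES                    # => ["123-456", ""]

# Two matches, hence two replacements.
p INPUT.gsub(/.*$/, 'X')     # => "XX"

# Requiring at least one character rules out the empty match.
p INPUT.gsub(/.+$/, 'X')     # => "X"
```

If a single replacement is what you want, `.+` is probably the simplest way to say so.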

Does ANY other language that you know of have this behavior? What
advantage does it give Ruby to have it?

(In other words, this seems to me like “it’s not a bug, it’s a
feature!” type issue.)

···


ts1 · 14 May 2004 14:05

Does *ANY* other language that you know of have this behavior? What
advantage does it give Ruby to have it?

My problem is to name a language that *doesn't* have this behaviour.

Not sure, but this must be the difference between "modern" style (posix, v8)
and "old" style (sed, awk): but I know nothing about regexps.

What do you expect with ?

/(.*).*/ =~ "abc"

Guy Decoux

Simon_Strandgaard1 · 14 May 2004 14:17

[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule… I would guess the output
should be [“abc”, “abc”]…

same output with Gnu and Oniguruma
irb(main):001:0> /(.*).*/.match("abc").to_a
=> ["abc", "abc"]
irb(main):002:0>

···

ts decoux@moulon.inra.fr wrote:

–
Simon Strandgaard

ts1 · 14 May 2004 14:21

Then you have make .* match *twice*

If the regexp can't match the empty string in the second .* (because it's
included in the first), it must not give a result

Guy Decoux

···

ts <decoux@moulon.inra.fr> wrote:
[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule.. I would guess the output
should be ["abc", "abc"]..

Simon_Strandgaard1 · 14 May 2004 14:34

I don’t understand you here… (maybe your assumption is wrong) ?

Lets break it down

The left-most .* will first eat to the end of input, and will now contain “abc”.
Then it hits end of input.
the left-most .* backtracks one time
The right-most .* eats to end of input… it now contains “”
the left-most .* backtracks one time
Done.

···

On Fri, 14 May 2004 23:21:27 +0900 ts decoux@moulon.inra.fr wrote:

ts decoux@moulon.inra.fr wrote:
[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule… I would guess the output
should be [“abc”, “abc”]…

Then you have make .* match twice

If the regexp can’t match the empty string in the second .* (because it’s
included in the first), it must not give a result

–
Simon Strandgaard

Simon_Strandgaard1 · 14 May 2004 14:36

[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule… I would guess the output
should be [“abc”, “abc”]…

Then you have make .* match twice

If the regexp can’t match the empty string in the second .* (because it’s
included in the first), it must not give a result

I don’t understand you here… (maybe your assumption is wrong) ?

Lets break it down

The left-most .* will first eat to the end of input, and will now contain “abc”.
Then it hits end of input.

the left-most .* backtracks one time

The right-most .* eats to end of input… it now contains “”

the left-most .* backtracks one time

      ^^^^^^^^^
      right-most

···

Simon Strandgaard neoneye@adslhome.dk wrote:

On Fri, 14 May 2004 23:21:27 +0900 > ts decoux@moulon.inra.fr wrote:

ts decoux@moulon.inra.fr wrote:

Done.

–
Simon Strandgaard

ts1 · 14 May 2004 14:40

Lets break it down

svg% ruby -rjj -e '"abc".match(/(.*).*/)'
Regexp /(.*).*/
  0 start_memory $1
  1 anychar_repeat
  2 stop_memory $1
  3 anychar_repeat
  4 end
subexpressions : 1
Fastmap supplied : \000-\011\013-\377

String <<abc>> pos=0

What I want to say is that if the second .* can't match the empty string
because it's included in the first match, then the result will be
different.

Guy Decoux

Ara.T.Howard · 14 May 2004 15:03

what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don’t have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it’s correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching. therefore we
all agree that

'123-456'.gsub /.*$/, 'X' => 'XX'

because if it didn’t '(.*).*' could not match 'abc'
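A minimal sketch of the zero-width point, assuming it helps to wrap the second `.*` in its own group so its position can be inspected. The constant name is made up for illustration.

```ruby
# Illustration only: same pattern as /(.*).*/, but with the second .*
# captured so that MatchData#offset can report where it matched.
ABC_MATCH = /(.*)(.*)/.match("abc")

p ABC_MATCH.to_a        # => ["abc", "abc", ""]
p ABC_MATCH.offset(1)   # => [0, 3]
p ABC_MATCH.offset(2)   # => [3, 3]  zero-width match at end of string
```

The second group succeeds without consuming anything, which is the same kind of match that gsub finds at the end of '123-456'.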

-a

···

On Fri, 14 May 2004, Simon Strandgaard wrote:

On Fri, 14 May 2004 23:21:27 +0900 > ts decoux@moulon.inra.fr wrote:

ts decoux@moulon.inra.fr wrote:
[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule… I would guess the output
should be [“abc”, “abc”]…

Then you have make .* match twice

If the regexp can’t match the empty string in the second .* (because it’s
included in the first), it must not give a result

I don’t understand you here… (maybe your assumption is wrong) ?

–

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
URL :: Solar-Terrestrial Physics Data | NCEI
TRY :: for l in ruby perl;do $l -e “print "\x3a\x2d\x29\x0a"”;done
===============================================================================

Simon_Strandgaard1 · 14 May 2004 14:49

I wasn’t aware that GNU or Oniguruma represented the empty string as a token, but it
makes sense. I have made my engine slightly different, where I have to do loop
detection instead.

···

ts decoux@moulon.inra.fr wrote:

Lets break it down

svg% ruby -rjj -e '"abc".match(/(.*).*/)'
Regexp /(.*).*/
  0 start_memory $1
  1 anychar_repeat
  2 stop_memory $1
  3 anychar_repeat
  4 end
subexpressions : 1
Fastmap supplied : \000-\011\013-\377

String <<abc>> pos=0

0 start_memory |abc |
1 anychar_repeat |abc | >2[0] >2[1] >2[2] >2[3] F2[3]
2 stop_memory abc| | $1=abc
3 anychar_repeat abc| | >4[0] F4[0] SUCCESS
svg%

What I want to say is that if the second .* can’t match the empty string
because it’s included in the first match, then the result will be
different.

–
Simon Strandgaard

Simon_Strandgaard1 · 14 May 2004 15:47

Ara.T.Howard wrote:

[snip]

What do you expect with ?

/(.*).*/ =~ "abc"

According to the left-most-longest rule… I would guess the output
should be [“abc”, “abc”]…

Then you have make .* match twice

If the regexp can’t match the empty string in the second .* (because it’s
included in the first), it must not give a result

I don’t understand you here… (maybe your assumption is wrong) ?

what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don’t have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it’s correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching. therefore we
all agree that

'123-456'.gsub /.*$/, 'X' => 'XX'

because if it didn’t '(.*).*' could not match 'abc'

What Guy told me makes sense… your explanation also makes sense. The output ‘XX’ is
because GNU/Oniguruma uses the empty-string token.

I guess sed and awk use loop-detection instead, and therefore output ‘X’.

my engine uses loop-detection, it thus outputs more desired results. http://raa.ruby-lang.org/list.rhtml?name=regexp

For instance my engine outputs
/a(a|)*/ ~= ‘aaab’ → [‘aaa’, ‘a’]
/a(|ab)*b/ ~= ‘aaabbb’ → [‘aabb’, ‘ab’]
/x(y?)*z/ ~= ‘xyz’ → [‘xyz’, ‘y’]
/x(y{0,2})*z/ ~= ‘xyz’ → [‘xyz’, ‘y’]
If you try the same examples with Gnu or Oniguruma, you will see the last elements
are empty, I guess this is because they use the empty-token concept.

···

On Fri, 14 May 2004, Simon Strandgaard wrote:

On Fri, 14 May 2004 23:21:27 +0900 > > ts decoux@moulon.inra.fr wrote:

ts decoux@moulon.inra.fr wrote:

–
Simon Strandgaard

Kristof_Bastiaensen1 · 14 May 2004 16:18

what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don’t have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it’s correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching.

Agree until here.

therefore we
all agree that

'123-456'.gsub /.*$/, 'X' => 'XX'

because if it didn’t '(.*).*' could not match 'abc'

-a

I don’t agree. The thing is, you can’t really consume ‘nothing’.
If I follow your reasoning, the above should give ‘XXX’, because
123-456 matches /(.*)(.*)(.*)/. How many empty spaces are there between
two characters? One, three, infinite? There have to be rules, and I would
find the following the most logical one:

when the empty space is already matched before, don’t match
again, unless explicitly against ? or *. (gawk behaviour?)
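One way to check the ‘XXX’ question in Ruby itself is sketched below. The constant names are made up for illustration, and this only shows what Ruby does; it says nothing about gawk.

```ruby
# Illustration only; the constant names are hypothetical.
DIGITS = "123-456"

# Three greedy groups: the first takes everything, the other two
# match the empty string, all inside ONE overall match.
TRIPLE = /(.*)(.*)(.*)/.match(DIGITS)
p TRIPLE.captures                      # => ["123-456", "", ""]

# gsub replaces once per overall match, not once per empty group,
# so the extra groups do not add any X.
p DIGITS.gsub(/(.*)(.*)(.*)$/, 'X')    # => "XX"
```

So the count of replacements seems to follow the number of start positions at which the whole pattern matches (here 0 and 7), not the number of empty pieces inside one match.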

Kristof

···

On Fri, 14 May 2004 08:54:22 -0600, Ara.T.Howard wrote:

Guillaume_Marcais1 · 14 May 2004 16:02

Huh? False advertisement?

[gus@comp tmp]$ cat regexp.rb
p /a(a|)*/.match('aaab').to_a
p /a(|ab)*b/.match('aaabbb').to_a
p /x(y?)*z/.match('xyz').to_a
p /x(y{0,2})*z/.match('xyz').to_a
[gus@comp tmp]$ ruby regexp.rb
["aaa", "a"]
["aabb", "ab"]
["xyz", "y"]
["xyz", "y"]
[gus@comp tmp]$ ruby -v
ruby 1.8.1 (2004-03-31) [i586-linux-gnu]

Guillaume.

···

On Fri, 2004-05-14 at 11:47, Simon Strandgaard wrote:

my engine uses loop-detection, it thus outputs more desired results. http://raa.ruby-lang.org/list.rhtml?name=regexp
For instance my engine outputs
/a(a|)*/ ~= ‘aaab’ → [‘aaa’, ‘a’]
/a(|ab)*b/ ~= ‘aaabbb’ → [‘aabb’, ‘ab’]
/x(y?)*z/ ~= ‘xyz’ → [‘xyz’, ‘y’]
/x(y{0,2})*z/ ~= ‘xyz’ → [‘xyz’, ‘y’]
If you try the same examples with Gnu or Oniguruma, you will see the last elements
are empty, I guess this is because they use the empty-token concept.

Ara.T.Howard3 · 14 May 2004 16:53

but then

/^$/ would not match ''

empty tokens must not consume and must stack
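A short sketch of this point, assuming "stack" means that several zero-width assertions can succeed at the same position. The method name is hypothetical.

```ruby
# Illustration only; the helper name is made up.
# Returns the index of the first match, or nil when there is none.
def match_index(re, str)
  re =~ str
end

# ^ and $ both succeed at position 0 of the empty string.
p match_index(/^$/, "")        # => 0

# Four anchors at the same position still match: none of them
# consumes anything, so they can all apply at once.
p match_index(/\A^$\z/, "")    # => 0

# gsub finds exactly one (empty) match in the empty string.
p "".gsub(/^$/, 'X')           # => "X"
```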

-a

···

On Fri, 14 May 2004, Kristof Bastiaensen wrote:

On Fri, 14 May 2004 08:54:22 -0600, Ara.T.Howard wrote:

what guy is saying here is that unless both '(.*)' and '.*' match in
the above - you don’t have a match. we all agree that the '(.*)' matches the
entire 'abc' and we all agree that the pattern '(.*).*' matches 'abc' -
therefore we all agree that it’s correct that certain patterns match zero width
positions in strings and consume no chars whilst still matching.

Agree until here.

therefore we
all agree that

'123-456'.gsub /.*$/, 'X' => 'XX'

because if it didn’t '(.*).*' could not match 'abc'

-a

I don’t agree. The thing is, you can’t really consume ‘nothing’.
If I follow your reasoning, the above should give ‘XXX’, because
123-456 matches /(.*)(.*)(.*)/. How many empty spaces are there between
two characters? One, three, infinite? There have to be rules, and I would
find the following the most logical one:

when the empty space is already matched before, don’t match
again, unless explicitly against ? or *. (gawk behaviour?)

Kristof

–

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
URL :: http://www.ngdc.noaa.gov/stp/
TRY :: for l in ruby perl;do $l -e “print "\x3a\x2d\x29\x0a"”;done
===============================================================================

Simon_Strandgaard1 · 14 May 2004 16:09

my engine uses loop-detection, it thus outputs more desired results. http://raa.ruby-lang.org/list.rhtml?name=regexp
For instance my engine outputs
/a(a|)*/ ~= ‘aaab’ → [‘aaa’, ‘a’]
/a(|ab)*b/ ~= ‘aaabbb’ → [‘aabb’, ‘ab’]
/x(y?)*z/ ~= ‘xyz’ → [‘xyz’, ‘y’]
/x(y{0,2})*z/ ~= ‘xyz’ → [‘xyz’, ‘y’]
If you try the same examples with Gnu or Oniguruma, you will see the last elements
are empty, I guess this is because they use the empty-token concept.

Huh? False advertisement?

[gus@comp tmp]$ cat regexp.rb
p /a(a|)*/.match('aaab').to_a
p /a(|ab)*b/.match('aaabbb').to_a
p /x(y?)*z/.match('xyz').to_a
p /x(y{0,2})*z/.match('xyz').to_a
[gus@comp tmp]$ ruby regexp.rb
["aaa", "a"]
["aabb", "ab"]
["xyz", "y"]
["xyz", "y"]
[gus@comp tmp]$ ruby -v
ruby 1.8.1 (2004-03-31) [i586-linux-gnu]

I am sorry about the false statement… however on Oniguruma (the future) it is true

cat a.rb
p /a(a|)*/.match('aaab').to_a
p /a(|ab)*b/.match('aaabbb').to_a
p /x(y?)*z/.match('xyz').to_a
p /x(y{0,2})*z/.match('xyz').to_a
ruby -v a.rb
ruby 1.9.0 (2004-04-16) [i386-freebsd5.1]
a.rb:1: warning: ambiguous first argument; put parentheses or even spaces
a.rb:2: warning: ambiguous first argument; put parentheses or even spaces
a.rb:3: warning: ambiguous first argument; put parentheses or even spaces
a.rb:4: warning: ambiguous first argument; put parentheses or even spaces
["aaa", ""]
["aabb", ""]
["xyz", ""]
["xyz", ""]
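The "ambiguous first argument" warnings come from `p /…/` being parsed ambiguously; wrapping the argument in parentheses avoids them. Below is a sketch of the same four cases written that way. The helper and constant names are made up for illustration. The second element of each pair depends on the regexp engine, as the two runs in this thread show, so no expected output is given for it.

```ruby
# Illustration only; helper and constant names are hypothetical.
# Returns [whole match, last value captured by group 1].
def full_and_group(re, str)
  m = re.match(str)
  [m[0], m[1]]
end

CASES = [
  [/a(a|)*/,      'aaab'],
  [/a(|ab)*b/,    'aaabbb'],
  [/x(y?)*z/,     'xyz'],
  [/x(y{0,2})*z/, 'xyz'],
]

CASES.each do |re, str|
  # Parentheses around the argument keep the parser from warning.
  p(full_and_group(re, str))
end
```

Comparing only the first element of each pair across Ruby versions should give the same answer everywhere; the disagreement in this thread is entirely about what group 1 holds after the last iteration of the `*`.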

···

On Sat, 15 May 2004 01:02:48 +0900 Guillaume Marcais guslist@free.fr wrote:

On Fri, 2004-05-14 at 11:47, Simon Strandgaard wrote:

–
Simon Strandgaard

Kristof_Bastiaensen1 · 14 May 2004 18:58

I would say it does match, because the empty token wouldn’t have been
matched at that moment. Perhaps a better way to formulate it would be
this:

consume the empty token when matching greedily, but still allow it
to be matched against * and ?.
This way the empty token will be treated as part of any string (not a
separate entity).
I don’t want to say it is the only way, but it seems the most logical to
me.

(But maybe it is best to just avoid such a regexp.)

k

···

On Fri, 14 May 2004 10:41:20 -0600, Ara.T.Howard wrote:

On Fri, 14 May 2004, Kristof Bastiaensen wrote:

On Fri, 14 May 2004 08:54:22 -0600, Ara.T.Howard wrote:

what guy is saying here is that unless both '(.*)' and '.*'
match in the above - you don’t have a match. we all agree that the
'(.*)' matches the entire 'abc' and we all agree that the pattern
'(.*).*' matches 'abc' - therefore we all agree that it’s correct
that certain patterns match zero width positions in strings and
consume no chars whilst still matching.

Agree until here.

therefore we
all agree that

'123-456'.gsub /.*$/, 'X' => 'XX'

because if it didn’t '(.*).*' could not match 'abc'

-a

I don’t agree. The thing is, you can’t really consume ‘nothing’. If I
follow your reasoning, the above should give ‘XXX’, because 123-456
matches /(.*)(.*)(.*)/. How many empty spaces are there between two
characters? One, three, infinite? There have to be rules, and I would
find the following the most logical one:

when the empty space is already matched before, don’t match again,
unless explicitly against ? or *. (gawk behaviour?)

Kristof

but then

/^$/ would not match ''

empty tokens must not consume and must stack

-a
