Strange regexp behaviour in gsub

Kristof_Bastiaensen1 · 11 May 2004 23:03

Hi,

I am wondering if this is the correct behaviour in gsub:

"bab".gsub(/(?!a)ab/, "cd")
=> "bab"

Shouldn’t that be "bcd"?

Also the following:
"\nab".gsub(/(?!\w)ab/, "cd")
=> "\nab"

This seems to work:
"bab".gsub(/(?!c)ab/, "cd")
=> "bcd"

Kristof

Joel_VanderWerf1 · 11 May 2004 23:15

Kristof Bastiaensen wrote:

> Hi,
>
> I am wondering if this is the correct behaviour in gsub:
>
> "bab".gsub(/(?!a)ab/, "cd")
> => "bab"
>
> shouldn’t that be "bcd"?

I think /(?!a)ab/ can’t match anything. It’s saying that the first
character after the beginning of the match must not be “a”, and the
first character of the match must be “a”. This is contradictory.

Kristof_Bastiaensen1 · 11 May 2004 23:38

Yes, that would clarify the situation, but is it the correct
behaviour? I would think that (?!a)a doesn’t mean the same
character, but consecutive ones. Because it doesn’t consume
the character, it effectively is the character ‘before’ the
match (if any). The other behaviour wouldn’t make sense,
because (?!a)b is then exactly the same as b.

Kristof

On Wed, 12 May 2004 08:15:28 +0900, Joel VanderWerf wrote:

> I think /(?!a)ab/ can’t match anything. It’s saying that the first
> character after the beginning of the match must not be “a”, and the
> first character of the match must be “a”. This is contradictory.

Florian_Gross · 12 May 2004 00:13

Kristof Bastiaensen wrote:

> Yes, that would clarify the situation, but is it the correct
> behaviour? I would think that (?!a)a doesn’t mean the same
> character, but consecutive ones. Because it doesn’t consume
> the character, it effectively is the character ‘before’ the
> match (if any). The other behaviour wouldn’t make sense,
> because (?!a)b is then exactly the same as b.

I think that it’s the intended behavior. Just use /(?!a).b/ if you want
to consume the character.

Thinking about this, it is indeed possible to implement fixed-width
look-behind – interesting.

Regards,
Florian Gross

David_A_Black3 · 12 May 2004 00:53

Hi –

Kristof Bastiaensen <kristof@vleeuwen.org> writes:

> On Wed, 12 May 2004 08:15:28 +0900, Joel VanderWerf wrote:
>
>> Kristof Bastiaensen wrote:
>>
>>> Hi,
>>>
>>> I am wondering if this is the correct behaviour in gsub:
>>>
>>> "bab".gsub(/(?!a)ab/, "cd")
>>> => "bab"
>>>
>>> shouldn’t that be "bcd"?
>>
>> I think /(?!a)ab/ can’t match anything. It’s saying that the first
>> character after the beginning of the match must not be “a”, and the
>> first character of the match must be “a”. This is contradictory.
>
> Yes, that would clarify the situation, but is it the correct
> behaviour? I would think that (?!a)a doesn’t mean the same
> character, but consecutive ones. Because it doesn’t consume the
> character, it effectively is the character ‘before’ the match (if
> any).

/(?!a)/ doesn’t match or consume any character; it refers to the state
of things between characters. The previous character (or start of
string) has come and gone; the assertion, now, is “what lies just
ahead is not ‘a’”.
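
To make the zero-width point concrete, here is a small sketch using the strings from the original question. It is an illustration added for this thread, not code from any of the posters, and the variable names are made up.

```ruby
# (?!a) consumes nothing: it only checks the character that comes next.
# In /(?!a)ab/ the assertion and the literal "a" look at the same
# character, so the pattern can never match.
impossible = "bab".gsub(/(?!a)ab/, "cd")   # => "bab"

# With (?!c) the assertion holds in front of the "a", so "ab" is replaced.
possible = "bab".gsub(/(?!c)ab/, "cd")     # => "bcd"

# On its own, the assertion matches the empty string at every position
# whose next character is not "a" (the end of the string included).
positions = []
"bab".scan(/(?!a)/) { positions << Regexp.last_match.begin(0) }

p impossible, possible, positions          # positions => [0, 2, 3]
```

Position 1 is missing from the list because the character just ahead of it is the "a".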

> The other behaviour wouldn’t make sense, because (?!a)b is then
> exactly the same as b.

Assertions like this always have the possibility of being redundant –
for example:

/(?=a)abc/ # same as /abc/

but there are a lot of cases where they aren’t, and that’s where they
become useful:

/David (?!Black)(\S+)/ # grab another David’s last name
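
A quick way to check both claims, as a sketch only: the sample strings and variable names below are invented for illustration and do not come from the thread.

```ruby
# A look-ahead that merely repeats what follows adds nothing:
# both patterns report the same match position for every input.
same = ["abc", "xabc", "ab"].all? do |s|
  (/(?=a)abc/ =~ s) == (/abc/ =~ s)
end

# A negative look-ahead that rules out one continuation does real work.
pattern = /David (?!Black)(\S+)/
other = pattern.match("David Smith wrote")   # captures "Smith"
black = pattern.match("David Black wrote")   # no match, so nil

p same, other && other[1], black
```

Note that (?!Black) also rejects a surname that merely begins with "Black", such as "Blackwood", since the assertion only inspects the next five characters.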

David

–
David A. Black
dblack@wobblini.net

Kristof_Bastiaensen1 · 12 May 2004 01:28

On Wed, 12 May 2004 02:12:19 +0200, Florian Gross wrote:

> Kristof Bastiaensen wrote:
>
>> Yes, that would clarify the situation, but is it the correct
>> behaviour? I would think that (?!a)a doesn’t mean the same
>> character, but consecutive ones. Because it doesn’t consume
>> the character, it effectively is the character ‘before’ the
>> match (if any). The other behaviour wouldn’t make sense,
>> because (?!a)b is then exactly the same as b.
>
> I think that it’s the intended behavior. Just use /(?!a).b/ if you want
> to consume the character.

Hi,
You are right, I looked it up in the manual, and there it was. The
term zero-width look-ahead pretty much says it all. I must have
gotten the definition all wrong.

> Thinking about this, it is indeed possible to implement fixed-width
> look-behind – interesting.

I was thinking more about something like variable-width look-between.
Meaning, for example, a(?^\w+)b would match any a(.*)b if (.*) is
not equal to (\w+).

> Regards,
> Florian Gross

Thanks,
Kristof

Robert · 12 May 2004 07:38

"Florian Gross" <flgr@ccan.de> wrote in message
news:2gd8f2F1gphmU1@uni-berlin.de…

> Kristof Bastiaensen wrote:
>
>> Yes, that would clarify the situation, but is it the correct
>> behaviour? I would think that (?!a)a doesn’t mean the same
>> character, but consecutive ones. Because it doesn’t consume
>> the character, it effectively is the character ‘before’ the
>> match (if any). The other behaviour wouldn’t make sense,
>> because (?!a)b is then exactly the same as b.
>
> I think that it’s the intended behavior. Just use /(?!a).b/ if you want
> to consume the character.

I’d use /[^a]b/ if I wanted to consume the character. No need for
negative lookahead here.
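
For comparison, a short sketch of the two spellings side by side. The inputs are made up for illustration; the one difference shown relies on "." not matching a newline unless the /m option is given.

```ruby
# Both patterns consume the character in front of "b" ...
dot_form   = "ab cb".scan(/(?!a).b/)   # => ["cb"]
class_form = "ab cb".scan(/[^a]b/)     # => ["cb"]

# ... but "." does not match a newline without /m,
# while a negated character class does.
dot_newline   = "\nb".scan(/(?!a).b/)  # => []
class_newline = "\nb".scan(/[^a]b/)    # => ["\nb"]

p dot_form, class_form, dot_newline, class_newline
```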

robert

Nobuyoshi_Nakada · 13 May 2004 11:56

Hi,

At Wed, 12 May 2004 09:13:51 +0900,
Florian Gross wrote in [ruby-talk:99884]:

>> Yes, that would clarify the situation, but is it the correct
>> behaviour? I would think that (?!a)a doesn’t mean the same
>> character, but consecutive ones. Because it doesn’t consume
>> the character, it effectively is the character ‘before’ the
>> match (if any). The other behaviour wouldn’t make sense,
>> because (?!a)b is then exactly the same as b.
>
> I think that it’s the intended behavior. Just use /(?!a).b/ if you want
> to consume the character.
>
> Thinking about this, it is indeed possible to implement fixed-width
> look-behind – interesting.

Ruby 1.9 (Oniguruma) has a look-behind feature.

$ ruby -v -e 'p "bab".gsub(/(?<!a)ab/, "cd")'
ruby 1.9.0 (2004-05-12) [i686-linux]
"bcd"

–
Nobu Nakada

David_A_Black3 · 12 May 2004 01:53

Hi –

Kristof Bastiaensen <kristof@vleeuwen.org> writes:

> I was thinking more about something like variable-width look-between
> Meaning for example a(?^\w+)b would match any a(.*)b if (.*) is
> not equal to (\w+)

For that particular case you can use \W (opposite of \w):

/a\W*b/ # a + zero or more non-\w + b

For more specific cases, you can use a negated character class:

/a[^123]*b/ # a + [zero or more of NOT 1,2,3] + b

David

–
David A. Black
dblack@wobblini.net

Robert · 12 May 2004 07:48

"Kristof Bastiaensen" <kristof@vleeuwen.org> wrote in message
news:pan.2004.05.12.01.24.42.706400@vleeuwen.org…

> On Wed, 12 May 2004 02:12:19 +0200, Florian Gross wrote:
>
>> Kristof Bastiaensen wrote:
>>
>>> Yes, that would clarify the situation, but is it the correct
>>> behaviour? I would think that (?!a)a doesn’t mean the same
>>> character, but consecutive ones. Because it doesn’t consume
>>> the character, it effectively is the character ‘before’ the
>>> match (if any). The other behaviour wouldn’t make sense,
>>> because (?!a)b is then exactly the same as b.
>>
>> I think that it’s the intended behavior. Just use /(?!a).b/ if you
>> want to consume the character.
>
> Hi,
> You are right, I looked it up in the manual, and there it was. The
> term zero-width-look-ahead pretty much says it all. I must have
> gotten the definition all wrong.
>
>> Thinking about this, it is indeed possible to implement fixed-width
>> look-behind – interesting.
>
> I was thinking more about something like variable-width look-between
> Meaning for example a(?^\w+)b would match any a(.*)b if (.*) is
> not equal to (\w+)

IMHO that’s not generally possible with regular expressions. You’ll
always have to define positively things that should match. Exclusion
character classes are just a means of convenience but this does not extend
to complete (sub) expressions.

For example: to match a.*a where the part in the middle does not consist
solely of b’s (i.e. is not matched in its entirety by /b+/) you can do:

/a(.*[^b].*)?a/

irb(main):004:0> rx=/a(.*[^b].*)?a/
=> /a(.*[^b].*)?a/
irb(main):005:0> rx === "aa"
=> true
irb(main):006:0> rx === "aba"
=> false
irb(main):007:0> rx === "acba"
=> true

Regards

robert

Kristof_Bastiaensen1 · 13 May 2004 12:53

Great!

That is exactly what I needed. And I saw it has negative
look-behind also: (?<!subexp)
I think this is especially useful in String#gsub, so you don’t
have to subgroup the context and replicate it in the
substitution.
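
The difference can be sketched as follows. This is an illustration only, with made-up inputs, and it assumes a Ruby with look-behind support (1.9 or later).

```ruby
# Without look-behind: capture the context and put it back by hand.
with_group = "bab".gsub(/(^|[^a])ab/, '\1cd')       # => "bcd"

# With negative look-behind: the context is only checked, never
# consumed, so the replacement string stays simple.
with_lookbehind = "bab".gsub(/(?<!a)ab/, "cd")      # => "bcd"

# The two are not always interchangeable. A consumed context character
# cannot also take part in the next match:
group_twice      = "xabab".gsub(/(^|[^a])ab/, '\1cd')  # => "xcdab"
lookbehind_twice = "xabab".gsub(/(?<!a)ab/, "cd")      # => "xcdcd"

p with_group, with_lookbehind, group_twice, lookbehind_twice
```

In the last pair, the capture-group version misses the second "ab" because the "b" that would serve as its context was already consumed by the first match.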

Kristof

On Thu, 13 May 2004 20:56:55 +0900, nobu.nokada wrote:

> Ruby 1.9 (Oniguruma) has look-behind feature.
>
> $ ruby -v -e 'p "bab".gsub(/(?<!a)ab/, "cd")'
> ruby 1.9.0 (2004-05-12) [i686-linux]
> "bcd"

Florian_Gross · 15 May 2004 11:08

nobu.nokada@softhome.net wrote:

> Hi,

Moin!

> Ruby 1.9 (Oniguruma) has look-behind feature.
>
> $ ruby -v -e 'p "bab".gsub(/(?<!a)ab/, "cd")'
> ruby 1.9.0 (2004-05-12) [i686-linux]
> "bcd"

Very nice. Does this work too?

ruby -v -e 'p "bab\nbcab".gsub(/^(?<!bc?)ab$/, "cd")'

(I would expect it to produce "bcd\nbccd".)

Regards,
Florian Gross

Joel_VanderWerf1 · 12 May 2004 02:42

David Alan Black wrote:

> Hi –
>
> Kristof Bastiaensen <kristof@vleeuwen.org> writes:
>
>> I was thinking more about something like variable-width look-between
>> Meaning for example a(?^\w+)b would match any a(.*)b if (.*) is
>> not equal to (\w+)
>
> For that particular case you can use \W (opposite of \w):
>
> /a\W*b/ # a + zero or more non-\w + b

Not quite the same:

/a\W*b/ =~ "a%xb"
# => nil

It sounded like the OP wanted an re that matched "a%xb", because "%x" is
"not equal to (\w+)". Sort of like:

/a(?!\w+b).*b/ =~ "a%xb"
# => 0
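
As a rough check of the two patterns against a few inputs (a sketch only; "axyzb" is an extra, made-up input added to show the rejected case):

```ruby
# \W* only allows non-word characters between "a" and "b",
# so the "x" in "a%xb" blocks the match.
nonword_only = (/a\W*b/ =~ "a%xb")          # => nil

# The look-ahead version rejects only the case where a run of word
# characters leads straight to a "b".
mixed     = (/a(?!\w+b).*b/ =~ "a%xb")      # => 0
all_words = (/a(?!\w+b).*b/ =~ "axyzb")     # => nil

p nonword_only, mixed, all_words
```

This is only "sort of like" the requested behaviour: the look-ahead tests a prefix, so an input such as "axb%b" is rejected too, even though the text between the first "a" and the last "b" is not made up of word characters alone.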

Simon_Strandgaard1 · 15 May 2004 11:22

Florian Gross wrote:

> "bab\nbcab".gsub(/^(?<!bc?)ab$/, "cd")
                          ^^^
                          ^^^
Oniguruma doesn’t like your question mark.

Oniguruma only supports fixed-width lookbehind, so quantifiers are not
possible; however, you can use alternation instead.

   (?<!b(?:c|))
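
A hedged sketch of the alternation idea follows. It assumes the rule described in the Oniguruma documentation that alternatives of different lengths are accepted at the top level of a look-behind, so it spells the condition as (?<!b|bc) rather than nesting the alternation in a group. The input string and variable names are made up.

```ruby
# A quantifier inside look-behind is expected to be rejected when the
# pattern is compiled; the exact error message may vary by version.
begin
  Regexp.new('(?<!bc?)ab')
  quantifier_ok = true
rescue RegexpError => e
  quantifier_ok = false
  puts "rejected: #{e.message}"
end

# Spelling out the alternatives at the top level of the look-behind:
# "ab" is replaced unless it is preceded by "b" or by "bc".
result = "xab bab bcab".gsub(/(?<!b|bc)ab/, "cd")

p quantifier_ok, result   # result => "xcd bab bcab"
```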

–
Simon Strandgaard

David_A_Black3 · 12 May 2004 11:18

Hi –

Joel VanderWerf <vjoel@PATH.Berkeley.EDU> writes:

> David Alan Black wrote:
>
>> Hi –
>>
>> Kristof Bastiaensen <kristof@vleeuwen.org> writes:
>>
>>> I was thinking more about something like variable-width look-between
>>> Meaning for example a(?^\w+)b would match any a(.*)b if (.*) is
>>> not equal to (\w+)
>>
>> For that particular case you can use \W (opposite of \w):
>>
>> /a\W*b/ # a + zero or more non-\w + b
>
> Not quite the same:
>
> /a\W*b/ =~ "a%xb"
> # => nil

Whoops

> It sounded like the OP wanted an re that matched "a%xb", because "%x" is
> "not equal to (\w+)". Sort of like:
>
> /a(?!\w+b).*b/ =~ "a%xb"
> # => 0

Yes, you’re right, though I’m driven to find something that doesn’t
involve repeating the ‘b’. Current iteration:

/a(.*\W.*)?b/.match("a%xb")
# => #<MatchData:0x4019d298>

(possibly with *?'s instead of *'s, depending on the OP’s needs)
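
To illustrate the remark about *? versus *, here is a small sketch. It assumes the pattern reads /a(.*\W.*)?b/, with the asterisks that the remark implies, and uses a made-up input with two possible end points.

```ruby
text = "a%xb-b"   # invented input: the match could end at either "b"

greedy = text[/a(.*\W.*)?b/]     # => "a%xb-b"
lazy   = text[/a(.*?\W.*?)?b/]   # => "a%xb"

# Because the group is optional, a bare "ab" matches as well.
bare = "ab"[/a(.*\W.*)?b/]       # => "ab"

p greedy, lazy, bare
```

The greedy form runs to the last "b" it can reach, while the lazy form stops at the first "b" that still leaves a non-word character in between.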

David

–
David A. Black
dblack@wobblini.net
