Strange regexp behaviour in gsub

Hi,

I am wondering if this is the correct behaviour in gsub:

"bab".gsub(/(?!a)ab/, "cd")
=> "bab"

shouldn’t that be “bcd”?

Also the following:
"\nab".gsub(/(?!\w)ab/, "cd")
=> "\nab"

This seems to work:
"bab".gsub(/(?!c)ab/, "cd")
=> "bcd"

Kristof

Kristof Bastiaensen wrote:

Hi,

I am wondering if this is the correct behaviour in gsub:

"bab".gsub(/(?!a)ab/, "cd")
=> "bab"

shouldn’t that be “bcd”?

I think /(?!a)ab/ can’t match anything. It’s saying that the first
character after the beginning of the match must not be “a”, and the
first character of the match must be “a”. This is contradictory.

Yes, that would clarify the situation, but is it the correct
behaviour? I would think that (?!a)a doesn’t mean the same
character, but consecutive ones. Because it doesn’t consume
the character, it effectively is the character ‘before’ the
match (if any). The other behaviour wouldn’t make sense,
because (?!a)b is then exactly the same as b.

Kristof


On Wed, 12 May 2004 08:15:28 +0900, Joel VanderWerf wrote:

Kristof Bastiaensen wrote:

Hi,

I am wondering if this is the correct behaviour in gsub:

"bab".gsub(/(?!a)ab/, "cd")
=> "bab"

shouldn’t that be “bcd”?

I think /(?!a)ab/ can’t match anything. It’s saying that the first
character after the beginning of the match must not be “a”, and the
first character of the match must be “a”. This is contradictory.

Kristof Bastiaensen wrote:

Yes, that would clarify the situation, but is it the correct
behaviour? I would think that (?!a)a doesn’t mean the same
character, but consecutive ones. Because it doesn’t consume
the character, it effectively is the character ‘before’ the
match (if any). The other behaviour wouldn’t make sense,
because (?!a)b is then exactly the same as b.

I think that it’s the intended behavior. Just use /(?!a).b/ if you want
to consume the character.

Thinking about this, it is indeed possible to implement fixed-width
look-behind – interesting.

Regards,
Florian Gross

Hi –

Kristof Bastiaensen kristof@vleeuwen.org writes:

Kristof Bastiaensen wrote:

Hi,

I am wondering if this is the correct behaviour in gsub:

"bab".gsub(/(?!a)ab/, "cd")
=> "bab"

shouldn’t that be “bcd”?

I think /(?!a)ab/ can’t match anything. It’s saying that the first
character after the beginning of the match must not be “a”, and the
first character of the match must be “a”. This is contradictory.

Yes, that would clarify the situation, but is it the correct
behaviour? I would think that (?!a)a doesn’t mean the same
character, but consecutive ones. Because it doesn’t consume the
character, it effectively is the character ‘before’ the match (if
any).

/(?!a)/ doesn’t match or consume any character; it refers to the state
of things between characters. The previous character (or start of
string) has come and gone; the assertion, now, is “what lies just
ahead is not ‘a’”.
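A minimal sketch of that point, using the strings from the original post. The constant names are only for illustration and are not from the thread; the last line is an extra case added to show where a bare (?!a) succeeds.

```ruby
# A look-ahead inspects the text at the current position without consuming it.
# Constant names are illustrative only.

# (?!a) and the literal "a" that follows are both tested at the same
# position, so this pattern can never match and gsub changes nothing.
NO_MATCH = "bab".gsub(/(?!a)ab/, "cd")

# (?!c) is always true where "ab" starts, so this behaves like /ab/.
REDUNDANT = "bab".gsub(/(?!c)ab/, "cd")

# On its own, (?!a) matches the empty string at the first position
# that is not immediately followed by "a".
FIRST_GAP = "ab" =~ /(?!a)/

p NO_MATCH   # "bab"
p REDUNDANT  # "bcd"
p FIRST_GAP  # 1
```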

The other behaviour wouldn’t make sense, because (?!a)b is then
exactly the same as b.

Assertions like this always have the possibility of being redundant –
for example:

/(?=a)abc/ # same as /abc/

but there are a lot of cases where they aren’t, and that’s where they
become useful:

/David (?!Black)(\S+)/ # grab another David’s last name
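To see that last pattern at work, here is a small sketch. The sample sentence and the constant names are made up for illustration; only the regexp comes from the line above.

```ruby
# Only DAVID_RE comes from the post; the text and names are hypothetical.
DAVID_RE = /David (?!Black)(\S+)/
TEXT = "David Black wrote to David Thomas"

# The look-ahead rejects the first "David " because "Black" follows it,
# so the engine moves on and captures the surname of the second David.
OTHER_LAST_NAME = TEXT[DAVID_RE, 1]

# With no other David in the string there is nothing to capture.
ONLY_BLACK = "David Black"[DAVID_RE, 1]

p OTHER_LAST_NAME  # "Thomas"
p ONLY_BLACK       # nil
```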

David


On Wed, 12 May 2004 08:15:28 +0900, Joel VanderWerf wrote:


David A. Black
dblack@wobblini.net

Kristof Bastiaensen wrote:

Yes, that would clarify the situation, but is it the correct
behaviour? I would think that (?!a)a doesn’t mean the same
character, but consecutive ones. Because it doesn’t consume
the character, it effectively is the character ‘before’ the
match (if any). The other behaviour wouldn’t make sense,
because (?!a)b is then exactly the same as b.

I think that it’s the intended behavior. Just use /(?!a).b/ if you want
to consume the character.

Hi,
You are right, I looked it up in the manual, and there it was. The
term zero-width-look-ahead pretty much says it all. I must have
gotten the definition all wrong.

Thinking about this, it is indeed possible to implement fixed-width
look-behind – interesting.

I was thinking more about something like variable-width look-between :)
Meaning for example a(?^\w+)b would match any a(.*)b if (.*) is
not equal to (\w+)

Regards,
Florian Gross

Thanks,
Kristof


On Wed, 12 May 2004 02:12:19 +0200, Florian Gross wrote:

"Florian Gross" <flgr@ccan.de> wrote in message
news:2gd8f2F1gphmU1@uni-berlin.de

Kristof Bastiaensen wrote:

Yes, that would clarify the situation, but is it the correct
behaviour? I would think that (?!a)a doesn’t mean the same
character, but consecutive ones. Because it doesn’t consume
the character, it effectively is the character ‘before’ the
match (if any). The other behaviour wouldn’t make sense,
because (?!a)b is then exactly the same as b.

I think that it’s the intended behavior. Just use /(?!a).b/ if you want
to consume the character.

I’d use /[^a]b/ if I wanted to consume the character. No need for
negative lookahead here.
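A quick comparison of the two consuming patterns, as a sketch. The test strings and constant names are invented for illustration. As far as I can tell the only difference is the newline: `.` does not match "\n" without the /m flag, while [^a] does.

```ruby
# Both patterns consume the character in front of "b".
# Test strings and constant names are illustrative only.
WITH_CLASS = "xcb".gsub(/[^a]b/, "cd")
WITH_AHEAD = "xcb".gsub(/(?!a).b/, "cd")

# Neither can turn "bab" into "bcd": the character before "b" is "a".
UNCHANGED = "bab".gsub(/[^a]b/, "cd")

# [^a] accepts a newline, but "." does not (no /m flag here).
CLASS_ON_NEWLINE = "\nb" =~ /[^a]b/
AHEAD_ON_NEWLINE = "\nb" =~ /(?!a).b/

p WITH_CLASS        # "xcd"
p WITH_AHEAD        # "xcd"
p UNCHANGED         # "bab"
p CLASS_ON_NEWLINE  # 0
p AHEAD_ON_NEWLINE  # nil
```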

robert

Hi,

At Wed, 12 May 2004 09:13:51 +0900,
Florian Gross wrote in [ruby-talk:99884]:

Yes, that would clarify the situation, but is it the correct
behaviour? I would think that (?!a)a doesn’t mean the same
character, but consecutive ones. Because it doesn’t consume
the character, it effectively is the character ‘before’ the
match (if any). The other behaviour wouldn’t make sense,
because (?!a)b is then exactly the same as b.

I think that it’s the intended behavior. Just use /(?!a).b/ if you want
to consume the character.

Thinking about this, it is indeed possible to implement fixed-width
look-behind – interesting.

Ruby 1.9 (Oniguruma) has a look-behind feature.

$ ruby -v -e 'p "bab".gsub(/(?<!a)ab/, "cd")'
ruby 1.9.0 (2004-05-12) [i686-linux]
"bcd"



Nobu Nakada

Hi –

Kristof Bastiaensen kristof@vleeuwen.org writes:

I was thinking more about something like variable-width look-between :)
Meaning for example a(?^\w+)b would match any a(.*)b if (.*) is
not equal to (\w+)

For that particular case you can use \W (opposite of \w):

/a\W*b/ # a + zero or more non-\w + b

For more specific cases, you can use a negated character class:

/a[^123]*b/ # a + [zero or more of NOT 1,2,3] + b

David



David A. Black
dblack@wobblini.net

"Kristof Bastiaensen" <kristof@vleeuwen.org> wrote in message
news:pan.2004.05.12.01.24.42.706400@vleeuwen.org

Kristof Bastiaensen wrote:

Yes, that would clarify the situation, but is it the correct
behaviour? I would think that (?!a)a doesn’t mean the same
character, but consecutive ones. Because it doesn’t consume
the character, it effectively is the character ‘before’ the
match (if any). The other behaviour wouldn’t make sense,
because (?!a)b is then exactly the same as b.

I think that it’s the intended behavior. Just use /(?!a).b/ if you want
to consume the character.

Hi,
You are right, I looked it up in the manual, and there it was. The
term zero-width-look-ahead pretty much says it all. I must have
gotten the definition all wrong.

Thinking about this, it is indeed possible to implement fixed-width
look-behind – interesting.

I was thinking more about something like variable-width look-between :)
Meaning for example a(?^\w+)b would match any a(.*)b if (.*) is
not equal to (\w+)

IMHO that’s not generally possible with regular expressions. You’ll
always have to define positively things that should match. Exclusion
character classes are just a means of convenience but this does not extend
to complete (sub) expressions.

For example: to match a.*a where the part in the middle does not consist
only of b’s (i.e. is not matched entirely by /b+/) you can do:

/a(.*[^b].*)?a/

irb(main):004:0> rx=/a(.*[^b].*)?a/
=> /a(.*[^b].*)?a/
irb(main):005:0> rx === "aa"
=> true
irb(main):006:0> rx === "aba"
=> false
irb(main):007:0> rx === "acba"
=> true

Regards

robert

On Wed, 12 May 2004 02:12:19 +0200, Florian Gross wrote:

Great!

That is exactly what I needed. And I saw it has negative
look-behind also: (?<!subexp).
I think this is especially useful in String#gsub, so you don’t
have to subgroup the context and replicate it in the
substitution.
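To make the comparison concrete, a rough sketch of both styles. The constant names and the extra test strings are made up for illustration, and the look-behind lines assume a Ruby with the Oniguruma engine (1.9 or later).

```ruby
# Without look-behind: capture the context and put it back via \1.
WITH_GROUP = "bab".gsub(/([^a])ab/, '\1cd')

# With negative look-behind: the context is checked but never consumed,
# so the replacement string does not need to repeat it.
WITH_LOOKBEHIND = "bab".gsub(/(?<!a)ab/, "cd")

# The two differ at the start of the string: the capture version needs
# a real preceding character, the look-behind version does not.
GROUP_AT_START = "ab".gsub(/([^a])ab/, '\1cd')
LOOKBEHIND_AT_START = "ab".gsub(/(?<!a)ab/, "cd")

# An "a" directly in front still blocks the look-behind version.
BLOCKED = "aab".gsub(/(?<!a)ab/, "cd")

p WITH_GROUP           # "bcd"
p WITH_LOOKBEHIND      # "bcd"
p GROUP_AT_START       # "ab"
p LOOKBEHIND_AT_START  # "cd"
p BLOCKED              # "aab"
```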

Kristof


On Thu, 13 May 2004 20:56:55 +0900, nobu.nokada wrote:

Hi,

At Wed, 12 May 2004 09:13:51 +0900,
Florian Gross wrote in [ruby-talk:99884]:

Yes, that would clarify the situation, but is it the correct
behaviour? I would think that (?!a)a doesn’t mean the same
character, but consecutive ones. Because it doesn’t consume
the character, it effectively is the character ‘before’ the
match (if any). The other behaviour wouldn’t make sense,
because (?!a)b is then exactly the same as b.

I think that it’s the intended behavior. Just use /(?!a).b/ if you want
to consume the character.

Thinking about this, it is indeed possible to implement fixed-width
look-behind – interesting.

Ruby 1.9 (Oniguruma) has a look-behind feature.

$ ruby -v -e 'p "bab".gsub(/(?<!a)ab/, "cd")'
ruby 1.9.0 (2004-05-12) [i686-linux]
"bcd"

Hi,

Moin!

Ruby 1.9 (Oniguruma) has a look-behind feature.

$ ruby -v -e 'p "bab".gsub(/(?<!a)ab/, "cd")'
ruby 1.9.0 (2004-05-12) [i686-linux]
"bcd"

Very nice. Does this work too?

ruby -v -e 'p "bab\nbcab".gsub(/^(?<!bc?)ab$/, "cd")'

(I would expect it to produce "bcd\nbccd".)

Regards,
Florian Gross


nobu.nokada@softhome.net wrote:

David Alan Black wrote:

Hi –

Kristof Bastiaensen kristof@vleeuwen.org writes:

I was thinking more about something like variable-width look-between :)
Meaning for example a(?^\w+)b would match any a(.*)b if (.*) is
not equal to (\w+)

For that particular case you can use \W (opposite of \w):

/a\W*b/ # a + zero or more non-\w + b

Not quite the same:

/a\W*b/ =~ "a%xb"

=> nil

It sounded like the OP wanted an re that matched “a%xb”, because “%x” is
“not equal to (\w+)”. Sort of like:

/a(?!\w+b).*b/ =~ "a%xb"

=> 0
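A short sketch contrasting the two patterns above. The constant names and the inputs "a%%b" and "axyb" are added for illustration; "a%xb" is the string from this post.

```ruby
# NON_WORD_RUN: everything between "a" and "b" must be non-word characters.
# NOT_WORD_MIDDLE: reject only when a run of word characters leads to a "b".
# Names and extra inputs are illustrative.
NON_WORD_RUN = /a\W*b/
NOT_WORD_MIDDLE = /a(?!\w+b).*b/

RUN_ALL_SYMBOLS = NON_WORD_RUN =~ "a%%b"     # middle is all non-word
RUN_MIXED       = NON_WORD_RUN =~ "a%xb"     # "x" is a word character
AHEAD_MIXED     = NOT_WORD_MIDDLE =~ "a%xb"  # middle "%x" is not \w+
AHEAD_ALL_WORD  = NOT_WORD_MIDDLE =~ "axyb"  # middle "xy" is \w+

p RUN_ALL_SYMBOLS  # 0
p RUN_MIXED        # nil
p AHEAD_MIXED      # 0
p AHEAD_ALL_WORD   # nil
```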

Florian Gross wrote:

"bab\nbcab".gsub(/^(?<!bc?)ab$/, "cd")
                       ^^^
                       ^^^
Oniguruma doesn’t like your question mark.

Oniguruma only supports fixed-width look-behind, so quantifiers are not
possible; however, you can use alternation instead.

   (?<!b(?:c|))


Simon Strandgaard

Hi –

Joel VanderWerf vjoel@PATH.Berkeley.EDU writes:

David Alan Black wrote:

Hi –

Kristof Bastiaensen kristof@vleeuwen.org writes:

I was thinking more about something like variable-width look-between :)
Meaning for example a(?^\w+)b would match any a(.*)b if (.*) is
not equal to (\w+)

For that particular case you can use \W (opposite of \w):

/a\W*b/ # a + zero or more non-\w + b

Not quite the same:

/a\W*b/ =~ "a%xb"

=> nil

Whoops :)

It sounded like the OP wanted an re that matched “a%xb”, because “%x” is
“not equal to (\w+)”. Sort of like:

/a(?!\w+b).*b/ =~ "a%xb"

=> 0

Yes, you’re right, though I’m driven to find something that doesn’t
involve repeating the ‘b’. Current iteration:

/a(.*\W.*)?b/.match("a%xb")

=> #<MatchData:0x4019d298>

(possibly with *?'s instead of *'s, depending on the OP’s needs)
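For what it's worth, a sketch of how this pattern behaves next to the look-ahead version, and what switching to *? changes. It assumes the pattern is /a(.*\W.*)?b/ with both stars present; the constant names and every input except "a%xb" are invented for illustration.

```ruby
# Assumed pattern: the optional middle must contain at least one
# non-word character. Names and most inputs are illustrative.
OPTIONAL_MIDDLE = /a(.*\W.*)?b/
LAZY_MIDDLE     = /a(.*?\W.*?)?b/
LOOKAHEAD       = /a(?!\w+b).*b/

MIXED    = OPTIONAL_MIDDLE.match("a%xb")[0]  # middle "%x" has a \W
EMPTY    = OPTIONAL_MIDDLE.match("ab")[0]    # the middle may be absent
ALL_WORD = OPTIONAL_MIDDLE.match("axyb")     # middle "xy" is pure \w+

# The two approaches disagree when an early word run already reaches a "b".
OPTIONAL_ON_LONG  = OPTIONAL_MIDDLE.match("axb%b")
LOOKAHEAD_ON_LONG = LOOKAHEAD.match("axb%b")

# Greedy stars run to the last "b", lazy ones stop at the first usable one.
GREEDY_SPAN = OPTIONAL_MIDDLE.match("a%xb and b")[0]
LAZY_SPAN   = LAZY_MIDDLE.match("a%xb and b")[0]

p MIXED, EMPTY, ALL_WORD
p OPTIONAL_ON_LONG && OPTIONAL_ON_LONG[0], LOOKAHEAD_ON_LONG
p GREEDY_SPAN, LAZY_SPAN
```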

David



David A. Black
dblack@wobblini.net