Regex

A_Berger · 24 June 2016 18:11

How can I search for string1 but not string2,

similar to what is possible with single characters:

/[ab]*[^de]/
=> /123|456(!789)/ # not possible
string 789 may not follow
How can I do that (in one regex)?

Thank you
Berg

Mike_Stok1 · 24 June 2016 19:39

I’m not exactly sure what you’re trying to do, but…

$ irb
irb(main):001:0> "123xxx" =~ /(123|456)(?!789)/
=> 0
irb(main):002:0> "234xxx" =~ /(123|456)(?!789)/
=> nil
irb(main):003:0> "456xxx" =~ /(123|456)(?!789)/
=> 0
irb(main):004:0> "456789" =~ /(123|456)(?!789)/
=> nil
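
For readers new to the construct, `(?!789)` is a negative lookahead: it checks that "789" does not come next, without consuming any input. The following is only a minimal sketch of the same idea as the irb session above; the names `NOT_FOLLOWED_BY_789` and `first_match` are made up for illustration and are not part of the original reply.

```ruby
# Illustrative sketch; NOT_FOLLOWED_BY_789 and first_match are hypothetical names.
# (?!789) asserts that "789" does not start at the current position.
# It is zero-width: it inspects the following text but does not consume it.
NOT_FOLLOWED_BY_789 = /(?:123|456)(?!789)/

# Returns the matched text, or nil when nothing matches.
def first_match(input)
  match = NOT_FOLLOWED_BY_789.match(input)
  match && match[0]
end

%w[123xxx 234xxx 456xxx 456789 456780].each do |input|
  puts "#{input}: #{first_match(input).inspect}"
end
```

Because the lookahead consumes nothing, the matched text is always just "123" or "456", whatever follows it.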

Hope this helps,

Mike

--

Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/

The "`Stok' disclaimers" apply.

Recursive_Madman · 24 June 2016 19:13

By using an optional group, for example:

/123|456(789)?/

or

/123|456(789|)/

or

/123|456(?:789|)/ # if you don't want to capture the result

.

Hope this helps.

recmm

A_Berger · 24 June 2016 20:58

I should have written
string2 _must_ not follow...

Robert_K1 · 24 June 2016 21:13

What exactly are you trying to achieve? Can you post sample strings
and what you want to match?

robert

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/

Recursive_Madman · 24 June 2016 21:16

Right, I misread your post. Mike Stok's answer does what you want though, as far as I can see.
Post examples and we will know.

Matthew_Kerwin · 24 June 2016 22:47

Yeah, negative lookahead is good.

Otherwise there's good old:

/123|456(?:[^7]?|7[^8]?|78[^9]?)$/

... Which should match '4567' and '456780' but not '456789', and can
be used in the middle of a longer regexp (if you change the $ part.)
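
A small sketch to check the claim above, assuming the pattern is used exactly as written, including the `$` anchor. The constant name `SPELLED_OUT` is made up for illustration.

```ruby
# Hypothetical constant name; the pattern is copied from the message above.
# After "456" it allows "not 7", or "7 then not 8", or "78 then not 9"
# (each negated character optional) and then requires the end of the line.
SPELLED_OUT = /123|456(?:[^7]?|7[^8]?|78[^9]?)$/

%w[4567 456780 456789].each do |input|
  # =~ returns the index of the match, or nil when there is none.
  puts "#{input}: #{(input =~ SPELLED_OUT).inspect}"
end
```

With these three inputs the first two report index 0 and the last reports nil, which agrees with the description above. The result depends on the `$` anchor being present.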

--
Matthew Kerwin
http://matthew.kerwin.net.au/

Robert_K1 · 25 June 2016 10:04

Note though that the version you presented will not properly work as
you put the $ anchor behind the expression and there is no .*
present: this will only ever match "123" and "456" followed by zero to
three characters _at the end of the input_.

irb(main):001:0> '456a456b456c'.scan /123|456(?:[^7]?|7[^8]?|78[^9]?)$/
=> ["456c"]

Also, precedence of | is such that the sequence beginning with the
opening bracket is only matched after "456":

irb(main):001:0> 'axbxx'.scan /a|bx+/
=> ["a", "bxx"]

You need to put the initial part in a group:

irb(main):002:0> 'axbxx'.scan /(?:a|b)x+/
=> ["ax", "bxx"]

So what you rather want is

/(?:123|456)(?:\z|[^7]?|7[^8]?|78[^9]?)/

But this expression still consumes characters after "123" and "456"
and includes them in the match:

irb(main):007:0> '456a456b456c'.scan /(?:123|456)(?:[^7]?|7[^8]?|78[^9]?)/
=> ["456a", "456b", "456c"]

This does not happen with negative lookahead which gives a simpler and
more targeted regexp:

/(?:123|456)(?!789)/

irb(main):006:0> '456a456b456c456'.scan /(?:123|456)(?!789)/
=> ["456", "456", "456", "456"]
irb(main):010:0> '456a456b456c4567d45678e456780'.scan /(?:123|456)(?!789)/
=> ["456", "456", "456", "456", "456", "456"]
irb(main):011:0> '456a456b456c4567d45678e456780456789'.scan /(?:123|456)(?!789)/
=> ["456", "456", "456", "456", "456", "456"]

Negative lookahead is really the best solution as it directly
expresses intent and does not need these tricks to deal with end of
input.
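
To make the difference concrete, here is a minimal sketch that runs a consuming alternation and the lookahead over the same inputs. The constant names and the sample strings are assumptions chosen for illustration.

```ruby
# Hypothetical names and sample data, for illustration only.
# CONSUMING must read at least one character after "123"/"456".
CONSUMING = /(?:123|456)(?:[^7]|7[^8]|78[^9])/
# LOOKAHEAD only inspects what follows and consumes nothing.
LOOKAHEAD = /(?:123|456)(?!789)/

['456a456789', '456'].each do |text|
  puts "#{text.inspect}: consuming=#{text.scan(CONSUMING).inspect} " \
       "lookahead=#{text.scan(LOOKAHEAD).inspect}"
end
```

For '456a456789' the consuming pattern returns "456a" while the lookahead returns just "456". For the bare string '456' the consuming pattern finds nothing, because there is no further character to read, whereas the lookahead still matches.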

Kind regards

robert

Robert_K1 · 25 June 2016 11:21

PS: I forgot one important thing: even if you fix the precedence issue
and remove the end-of-line anchor, your approach does not work.
According to my understanding of the task ("match 123 or 456 which is
not followed by 789"), you would expect "456789" to never match,
right? But see what happens:

irb(main):001:0> "456789".scan /(?:123|456)(?:[^7]?|7[^8]?|78[^9]?)/
=> ["456"]

Even if we rearrange a bit to simplify

irb(main):002:0> "456789".scan /(?:123|456)(?:[^7]|7[^8]|78[^9])?/
=> ["456"]

and experiment further

irb(main):003:0> "456789".scan /(?:123|456)(?:[^7]|7[^8]|78[^9])/
=> []
irb(main):004:0> "456789".scan /(?:123|456)(?:[^7]|7[^8]|78[^9]?)/
=> ["45678"]
irb(main):005:0> "456789".scan /(?:123|456)(?:78[^9]|7[^8]|[^7])/
=> []
irb(main):006:0> "456789".scan /(?:123|456)(?:78[^9]|7[^8]|[^7])?/
=> ["456"]

we notice that only the variants without any `?` work, i.e. those that
force the match to consume a character that breaks the "789" sequence
after the initial bit.

There is a difference between the statement "match 123 or 456 which is
not followed by 789" and "match 123 or 456 followed by a sequence
which is not 789". There is a reason why positive and negative
lookahead were implemented in regex engines.
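
As a minimal sketch of the two lookahead flavours side by side (the constant names and the sample string are made up for illustration):

```ruby
# Hypothetical names and sample data, for illustration only.
SAMPLE = '123789 456789 456000'

# Positive lookahead: match only when "789" comes next.
FOLLOWED_BY_789 = /(?:123|456)(?=789)/
# Negative lookahead: match only when "789" does not come next.
NOT_FOLLOWED_BY_789 = /(?:123|456)(?!789)/

p SAMPLE.scan(FOLLOWED_BY_789)
p SAMPLE.scan(NOT_FOLLOWED_BY_789)
```

The first scan returns the "123" and "456" that precede "789"; the second returns only the "456" in front of "000". In both cases the "789" itself is never part of the match.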

Cheers

robert

Matthew_Kerwin · 26 June 2016 00:34

> Note though that the version you presented will not properly work as
> you put the $ anchor behind the expression and there is no .*
> present: this will only ever match "123" and "456" followed by zero to
> three characters _at the end of the input_.

I know. That was a conscious choice. Given the lack of detail in OP,
you and I both made assumptions. If I knew the exact problem, and I
felt so inclined, I could write an exact solution. Instead I made an
example demonstrating that "not abc" can be written as "not a, or a
and not b, or ab and not c."

I added the $ to arbitrarily anchor the expression.

"will not properly work" is a strong statement based on assumptions.

> Also, precedence of | is such that the sequence beginning with the
> opening bracket is only matched after "456":

> You need to put the initial part in a group:

No, this was another conscious choice, to match OP's /123|456(!789)/

> But this expression still consumes characters after "123" and "456"
> and includes them in the match:

Again, a conscious assumption. If the purpose was only to use =~ to
find the initial index of the match, the result is the same.

> Negative lookahead is really the best solution as it directly
> expresses intent and does not need these tricks to deal with end of
> input.

That's as may be, but a problem may have many solutions. I've
illustrated an alternative approach, which may be useful in solving
some future problem.

My goal was not to solve a specific problem (otherwise I'd have asked
for more specific details.) As is often the case on this list, I
presented a snippet which approaches a solution from a particular
direction, in the hope of promoting self-learning and unconventional
thinking.

Cheers

Matthew_Kerwin · 26 June 2016 00:45

> PS: I forgot one important thing: even if you fix the precedence issue
> and remove the end of line anchor your approach does not work.

Of course, if you remove the anchor you need to work out another
approach; for example, removing all the '?'s.

"456789".scan /123|456(?:[^7]|7[^8]|78[^9])/

But again, this only works in certain contexts.

Anchors are important, be they (?!x) or $ or \b or...
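
One detail worth knowing when picking an anchor in Ruby: `$` matches at the end of every line, while `\z` matches only at the very end of the string. A minimal sketch, with a made-up sample string and constant name:

```ruby
# Hypothetical sample: "456" ends the first line, "456789" ends the string.
MULTILINE_INPUT = "456\n456789"

# $ anchors at the end of each line, so the first "456" qualifies.
p MULTILINE_INPUT.scan(/456$/)
# \z anchors only at the end of the whole string, so nothing qualifies.
p MULTILINE_INPUT.scan(/456\z/)
```

The first scan finds the "456" on the first line; the second finds nothing, since the string as a whole ends in "789".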

<snip>

> There is a difference between the statement "match 123 or 456 which is
> not followed by 789" and "match 123 or 456 followed by a sequence
> which is not 789".

Neither of which was stated in the OP, so either may be valid.

> There is a reason why positive and negative
> lookahead were implemented in regex engines.

Cheers

A_Berger · 26 June 2016 07:42

Thanks Robert and Matthew for these details!
What's the meaning of OP?
The exact problem could be
(abc|def).*(?!ghi)
- which doesn't do what I want (the result is what is to be expected,
but not what is wanted).
It should be
^/abc|def/ and ! (...... .* ghi)
Is that possible in one regex (ideally returning the matching string)?
Although it's likely never needed in real-world problems...
Thanks
Berg

Robert_K1 · 26 June 2016 09:27

> Thanks Robert and Matthew for these details!
> whats the meaning of OP?
> The exact problem could be

What do you mean "could"? Is there a problem you are trying to solve or not??

> (abc|def).*(?!ghi)
> - which doesnt do what one would expect (no, result is what to be expected,
> but not wanted
> it should be
> ^/abc|def/ and ! (...... .* ghi)
> is that possible in one regex (best giving the matching string)
> Although likely never needed in real-world problems...

I suggest you start by describing in plain English what you are
attempting to do and not throw up regex examples from which nobody can
derive what your intentions are.

robert

Matthew_Kerwin · 26 June 2016 09:42

> Thanks Robert and Matthew for these details!
> whats the meaning of OP?

OP stands for the "Original Post" or the original poster.

> The exact problem could be
> (abc|def).*(?!ghi)
> - which doesnt do what one would expect (no, result is what to be
> expected, but not wanted
> it should be
> ^/abc|def/ and ! (...... .* ghi)
> is that possible in one regex (best giving the matching string)

There are still a bunch of uncertainties with what you're asking. For
example, if you're matching *words* you need to deal with things like the
punctuation and spaces in between. In that case, personally I'd go with an
algorithmic approach. (Any regexp you write that ends up matching what you
want is going to become unwieldy and hard to maintain.)
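
As a rough sketch of what an algorithmic approach could look like, assuming the question means "the string starts with abc or def, and ghi does not occur anywhere after that prefix" (this reading is an assumption, as are the names `leading_keyword`, `ONE_REGEX` and `NAIVE`):

```ruby
# Assumed reading of the question: the string starts with "abc" or "def"
# and "ghi" does not occur anywhere after that prefix.
# All names here are hypothetical.

# Plain-Ruby version: returns the prefix, or nil when the rule is violated.
def leading_keyword(text)
  prefix = %w[abc def].find { |candidate| text.start_with?(candidate) }
  return nil unless prefix
  return nil if text[prefix.length..-1].include?('ghi')
  prefix
end

# Single-regex version: the lookahead scans the rest of the string for "ghi".
# The /m flag lets . match newlines as well.
ONE_REGEX = /\A(?:abc|def)(?!.*ghi)/m

# The pattern from the question: the greedy .* first runs to the end of the
# line, where (?!ghi) succeeds trivially, so "ghi" is never ruled out.
NAIVE = /(abc|def).*(?!ghi)/

['abc then text', 'def and ghi', 'xyz abc'].each do |text|
  regex_match = ONE_REGEX.match(text)
  puts "#{text.inspect}: method=#{leading_keyword(text).inspect} " \
       "regex=#{(regex_match && regex_match[0]).inspect} " \
       "naive=#{(text =~ NAIVE).inspect}"
end
```

The two variants agree on these inputs. Which one is easier to maintain depends on how much more the real rule has to cover, for instance word boundaries or punctuation.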

> Although likely never needed in real-world problems...

Are you saying this is a purely theoretical question? In that case, the
answers you've gotten so far should be enough to be getting on with.

> thanks
> Berg

Cheers

