(Maybe) a simple question about regex

Hello!

I think that I am missing a very simple concept about regex.

s = '0123456789'
s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]

Now I want to exclude "45".
How can I express it in the regex?
When it's only one character, I can use ^ in a character class (e.g. [^4]).
But for 2 characters, I don't think I can use it.

What I want is:

s = '0123456789'
s.scan(some_regex) #-> ["01", "23", "67", "89"]

What should some_regex be?

Can somebody help me?

Sam

[Sam Kong <sam.s.kong@gmail.com>, 2005-03-24 02.49 CET]

> Now I want to exclude "45".
> How can I express it in the regex?

You can use a "negative lookahead assertion":

s.scan(/(?!45)\d\d/)

This means that, at every position where the regex tries to match: if the
next two characters aren't "45", match \d\d.

HTH.

--

> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]
>
> What should some_regex be?

You can use a negative lookahead assertion to say you want to skip "45",
but the scan will then bump forward one character, and you will end up
with the last matches being "56" and "78":

s.scan(/(?!45)\d\d/)

=> ["01", "23", "56", "78"]
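To see where that bump happens, here is a small sketch that walks the
string by hand, the way scan does, and records the offset of each match.
The variable names (re, matches, pos) are only illustrative.

```ruby
s  = '0123456789'
re = /(?!45)\d\d/

# Emulate String#scan by hand, recording where each match starts.
matches = []
pos = 0
while (m = re.match(s, pos))
  matches << [m.begin(0), m[0]]
  pos = m.end(0)  # resume right after the previous match, as scan does
end

p matches  #=> [[0, "01"], [2, "23"], [5, "56"], [7, "78"]]
```

The lookahead only forbids a match that starts at offset 4, so the next
match starts at offset 5 and every later pair is shifted by one.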

So with a slightly uglier assertion, you can say:

s.scan(/(?!45|5)\d\d/)

=> ["01", "23", "67", "89"]

This gets what you specified. Although it works for your toy case, I
would be worried that it might not extrapolate well to your real goal.

HTH

Regards,
Jason
http://blog.casey-sweat.us/

On Thu, 24 Mar 2005 10:49:49 +0900, Sam Kong <sam.s.kong@gmail.com> wrote:

> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?

Negative lookahead:
s.scan /(?!4|5)\d\d/
Note the OR sign ('|') between the digits, otherwise it would produce:
["01", "23", "56", "78"]

You need to tune it to your exact domain.

Cheers,
Assaph

What they said. Also, if you can be more precise about your real
problem, we might be able to model a better solution. For example, you
might find it more flexible to first match the expression you want and
then scan within that match.

On Thu, 24 Mar 2005 11:09:51 +0900, Assaph Mehr <assaph@gmail.com> wrote:

> You need to tune it to your exact domain.

"Assaph Mehr" <assaph@gmail.com> wrote in message
news:1111629894.417238.111830@l41g2000cwc.googlegroups.com...

> Negative lookahead:
> s.scan /(?!4|5)\d\d/
> Note the OR sign ('|') between the digits, otherwise it would produce:
> ["01", "23", "56", "78"]

But:

s = '01234567894657'

=> "01234567894657"

s.scan /(?!4|5)\d\d/

=> ["01", "23", "67", "89", "65"]

s.scan /\d\d/

=> ["01", "23", "45", "67", "89", "46", "57"]

IOW, you lose "46" and "57".

I prefer a non-RE solution in these cases, as it's simpler:

s.scan(/\d\d/).reject {|x| "45" == x}

=> ["01", "23", "67", "89", "46", "57"]

Otherwise RE becomes really complex if you want to make it right - if it's
possible at all (see other postings).
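
For what it's worth, a regex-only version does seem possible if the
pairs are kept aligned with \G. This is only a sketch, assuming the input
is a plain run of digits that should be read as consecutive two-digit
pairs; the name "aligned" is made up for illustration.

```ruby
# \G anchors each attempt to where the previous match ended, so the scan
# can never slip out of two-digit alignment. (?:45)* swallows any aligned
# "45" pairs, and the capture group takes the next pair that is not "45".
aligned = /\G(?:45)*((?!45)\d\d)/

short = '0123456789'.scan(aligned).flatten
long  = '01234567894657'.scan(aligned).flatten

p short  #=> ["01", "23", "67", "89"]
p long   #=> ["01", "23", "67", "89", "46", "57"]
```

The reject version is still easier to read. Note that with \G the scan
stops at the first position where no aligned pair can be matched.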

Kind regards

    robert

Thank you, and the other posters, for the answers.
Actually s.scan(/(?!45)\d\d/) suffices for my real problem.

What I was trying to solve was...
To extract URLs from an HTML source which includes a list of sites.
They are formatted like <a href="something.html">.
But I wanted to exclude <a href="index.html"> from the list.
So (?!index.html) will do.
Actually my toy case was not well-defined (I realized this later) and
thus it required more complex solutions like your second case -
s.scan(/(?!45|5)\d\d/).
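
A minimal sketch of that extraction, assuming the links really all look
like <a href="something.html"> with double quotes and no other
attributes. The sample HTML is invented for illustration.

```ruby
# Hypothetical sample input, not taken from the real page.
html = <<~HTML
  <a href="index.html">Home</a>
  <a href="a.html">A</a>
  <a href="stuff.html">Stuff</a>
HTML

# (?!index\.html") rejects an href that is exactly index.html:
# the dot is escaped and the closing quote pins the end of the name.
links = html.scan(/<a href="(?!index\.html")([^"]+)"/).flatten

p links  #=> ["a.html", "stuff.html"]
```

Because every match here starts at a fixed prefix (<a href="), the
alignment problem from the digit example does not come up.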

I think a non-RE solution would be better, as Mr. Robert Klemme said.
But I wanted to learn some RE.

Thanks.
Sam

Does this help?

ary=%w(a.html index.html other.txt evil.html.exe stuff.html)
ary.select{|s| s =~ /\A(?!index).*\.html\z/ } #=> ["a.html", "stuff.html"]
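
One caveat, sketched below: (?!index) rejects every name that starts
with "index", not just index.html. If only the exact name should go, the
lookahead can be tightened. The extra entry index2.html and the variable
names are added here purely for illustration.

```ruby
ary = %w(a.html index.html index2.html other.txt evil.html.exe stuff.html)

# Original filter: the lookahead rejects anything that starts with "index".
by_prefix = ary.select { |s| s =~ /\A(?!index).*\.html\z/ }

# Stricter lookahead: reject only the exact name "index.html".
exact = ary.select { |s| s =~ /\A(?!index\.html\z).*\.html\z/ }

p by_prefix  #=> ["a.html", "stuff.html"]
p exact      #=> ["a.html", "index2.html", "stuff.html"]
```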

On Thu, 24 Mar 2005 18:09:50 +0900, Sam Kong <sam.s.kong@gmail.com> wrote:

> To extract URLs from an HTML source which includes a list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.

--
Simon Strandgaard

On 2005-03-24, Sam Kong <sam.s.kong@gmail.com> wrote:

> What I was trying to solve was...
> To extract URLs from an HTML source which includes a list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.

Why don't you use a dedicated HTML parser? E.g. there's htmltokenizer,
available at RubyForge, quite lightweight and very easy to use, but
there are others, of course.

> I think a non-RE solution would be better, as Mr. Robert Klemme said.
> But I wanted to learn some RE.

This thread was useful, I admit :)

Csaba