(Maybe) a simple question about regex

Sam_Kong · 24 March 2005 01:49

Hello!

I think that I am missing a very simple concept about regex.

s = '0123456789'
s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]

Now I want to exclude "45".
How can I express it in the regex?
When it's only one character, I can use ^.
But for 2 characters, I don't think I can use it.

What I want is:

s = '0123456789'
s.scan(some_regex) #-> ["01", "23", "67", "89"]

What should some_regex be?

Can somebody help me?

Sam

Carlos · 24 March 2005 02:08

[Sam Kong <sam.s.kong@gmail.com>, 2005-03-24 02.49 CET]

> Hello!
>
> I think that I am missing a very simple concept about regex.
>
> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.

You can use a "negative lookahead assertion":

s.scan(/(?!45)\d\d/)

This means, at every point the regex tries to match, "if the next two
characters aren't "45", match \d\d".
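A small sketch of what the assertion does at a single position may help. The helper name `pair_at_start` is made up for illustration, and `\A` is added only to pin the attempt to offset 0. The lookahead inspects the upcoming characters without consuming them, so `\d\d` then matches the very same two characters:

```ruby
# Hypothetical helper: try the pattern at offset 0 only.
# (?!45) looks at the next two characters but consumes nothing,
# so \d\d starts matching at the same position.
def pair_at_start(str)
  str[/\A(?!45)\d\d/]
end

pair_at_start('45')   # nil: the lookahead rejects this position
pair_at_start('46')   # "46": not "45", so \d\d matches
pair_at_start('145')  # "14": only the characters at the current position are checked
```

Note that this says nothing yet about where `scan` continues after a rejected position.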

HTH.


Jason_Sweat · 24 March 2005 02:09

> Hello!
>
> I think that I am missing a very simple concept about regex.
>
> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.
>
> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]
>
> What should some_regex be?

You can use a negative assertion to say you want to skip "45", but the
scan will then resume one character later, so the last matches end up
being "56" and "78":

s.scan(/(?!45)\d\d/)

=> ["01", "23", "56", "78"]

So with a little uglier assertion, you can say:

s.scan(/(?!45|5)\d\d/)

=> ["01", "23", "67", "89"]

This gives what you specified, but although it works for your toy case,
I would be worried that it might not extend well to your real goal.
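To make the "bump forward" visible, here is a rough sketch that records where each match starts. The helper `matches_with_offsets` is a made-up name, not anything built in; it relies only on `String#scan` with a block and `Regexp.last_match`:

```ruby
# Hypothetical helper: return [offset, text] for every match scan finds.
def matches_with_offsets(str, regex)
  found = []
  str.scan(regex) do
    m = Regexp.last_match          # MatchData for the current match
    found << [m.begin(0), m[0]]    # starting offset and matched text
  end
  found
end

s = '0123456789'

# Offset 4 is rejected, so the next attempt is at offset 5:
matches_with_offsets(s, /(?!45)\d\d/)
#=> [[0, "01"], [2, "23"], [5, "56"], [7, "78"]]

# Rejecting offset 5 as well puts the scan back on even offsets:
matches_with_offsets(s, /(?!45|5)\d\d/)
#=> [[0, "01"], [2, "23"], [6, "67"], [8, "89"]]
```

The second pattern only stays aligned because the character after "45" is not itself the start of a wanted pair beginning with 5.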

HTH

Regards,
Jason
http://blog.casey-sweat.us/

On Thu, 24 Mar 2005 10:49:49 +0900, Sam Kong <sam.s.kong@gmail.com> wrote:

Assaph_Mehr1 · 24 March 2005 02:09

> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.
>
> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]

Negative lookahead:
s.scan /(?!4|5)\d\d/
Note the OR sign ('|') between the digits, otherwise it would produce:
["01", "23", "56", "78"]

You need to tune it to your exact domain.

Cheers,
Assaph

Patrick_Hurley1 · 24 March 2005 02:50

What they said, but also if you can be more precise about your real
problem, we might be able to better model a solution. You might find
matching the expression you want and then scanning it to be more
flexible for example.

On Thu, 24 Mar 2005 11:09:51 +0900, Assaph Mehr <assaph@gmail.com> wrote:

> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.
>
> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]

> Negative lookahead:
> s.scan /(?!4|5)\d\d/
> Note the OR sign ('|') between the digits, otherwise it would produce:
> ["01", "23", "56", "78"]
>
> You need to tune it to your exact domain.
>
> Cheers,
> Assaph

Robert · 24 March 2005 08:09

"Assaph Mehr" <assaph@gmail.com> schrieb im Newsbeitrag
news:1111629894.417238.111830@l41g2000cwc.googlegroups.com...

> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.
>
> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]

> Negative lookahead:
> s.scan /(?!4|5)\d\d/
> Note the OR sign ('|') between the digits, otherwise it would produce:
> ["01", "23", "56", "78"]

But:

s = '01234567894657'

=> "01234567894657"

s.scan /(?!4|5)\d\d/

=> ["01", "23", "67", "89", "65"]

s.scan /\d\d/

=> ["01", "23", "45", "67", "89", "46", "57"]

IOW, you lose "46" and "57" (and pick up a misaligned "65").

I prefer a non-RE solution in these cases as it's simpler:

s.scan(/\d\d/).reject {|x| "45" == x}

=> ["01", "23", "67", "89", "46", "57"]

Otherwise RE becomes really complex if you want to make it right - if it's
possible at all (see other postings).
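For what it's worth, a regex-only version that keeps the pairs aligned does seem possible with the `\G` anchor, which only allows a match to begin where the previous one ended. This is a sketch under the assumption that the goal is "split into aligned two-digit pairs and drop exact "45" pairs"; the method name is made up, and it has only been reasoned through for strings like the ones in this thread:

```ruby
# Hypothetical helper: aligned two-digit pairs, minus any pair equal to "45".
# \G         each match must start where the previous one ended
# (?:45)*    skip over unwanted pairs without capturing them
# (?!45)     stop backtracking from handing a skipped "45" to the capture
# (\d\d)     the pair we keep
def aligned_pairs_without_45(str)
  str.scan(/\G(?:45)*((?!45)\d\d)/).flatten
end

aligned_pairs_without_45('0123456789')
#=> ["01", "23", "67", "89"]
aligned_pairs_without_45('01234567894657')
#=> ["01", "23", "67", "89", "46", "57"]
```

Whether this is clearer than `scan` followed by `reject` is debatable; the two-step version is much easier to read.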

Kind regards

robert

Sam_Kong · 24 March 2005 09:09

Thank you and other posters for the answers.
Actually s.scan(/(?!45)\d\d/) is sufficient for my real problem.

What I was trying to do was extract URLs from an HTML source that
includes a list of sites.
They are formatted like <a href="something.html">.
But I wanted to exclude <a href="index.html"> from the list.
So (?!index.html) will do.
Actually my toy case was not well-defined (I realized this later) and
thus it required more complex solutions like your second case -
s.scan(/(?!45|5)\d\d/) .
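As a rough sketch of that real case: the sample markup and the method name `extract_links` below are invented for illustration, and the pattern assumes every link is written exactly as `<a href="...">` with double quotes. The lookahead sits right after the opening quote, and it includes the closing quote so that only the exact name "index.html" is excluded:

```ruby
# Hypothetical helper: collect href values, skipping exactly "index.html".
def extract_links(html)
  # (?!index\.html") rejects the link only when the whole value is index.html
  html.scan(/<a href="((?!index\.html")[^"]+)"/).flatten
end

# Made-up sample markup
html = '<a href="index.html">Home</a> ' \
       '<a href="a.html">A</a> ' \
       '<a href="stuff.html">Stuff</a>'

extract_links(html)
#=> ["a.html", "stuff.html"]
```

Because the pattern starts with the literal `<a href="`, a rejected link cannot cause a shifted match the way the digit example did.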

I think a non-RE solution would be better, as Mr. Robert Klemme said.
But I wanted to learn some RE.

Thanks.
Sam

Simon_Strandgaard2 · 24 March 2005 11:00

does this help?

ary=%w(a.html index.html other.txt evil.html.exe stuff.html)
ary.select{|s| s =~ /\A(?!index).*\.html\z/ } #=> ["a.html", "stuff.html"]
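One caveat: `(?!index)` rejects every name that merely starts with "index". If only the exact file "index.html" should be dropped, a stricter variant could look like the sketch below (the method name and the extra "indexer.html" entry are made up for illustration):

```ruby
# Hypothetical helper: keep names ending in .html, except exactly "index.html".
def html_pages_except_index(names)
  names.select { |name| name.match?(/\A(?!index\.html\z).*\.html\z/) }
end

names = %w(a.html index.html indexer.html other.txt evil.html.exe stuff.html)
html_pages_except_index(names)
#=> ["a.html", "indexer.html", "stuff.html"]
```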

On Thu, 24 Mar 2005 18:09:50 +0900, Sam Kong <sam.s.kong@gmail.com> wrote:

> To extract url's from an html source which includes list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.

--
Simon Strandgaard

Csaba_Henk4 · 25 March 2005 13:34

> What I was trying to solve was...
> To extract url's from an html source which includes list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.
> Actually my toy case was not well-defined (I realized this later) and
> thus it required more complex solutions like your second case -
> s.scan(/(?!45|5)\d\d/) .

Why don't you use a dedicated HTML parser? E.g. there's htmltokenizer,
available at Rubyforge, quite lightweight and very easy to use, but
there are others, of course.

> I think non-RE solution would be better like Mr. Robert Klemme said.
> But I wanted to learn some RE.

This thread was useful, I admit

Csaba

On 2005-03-24, Sam Kong <sam.s.kong@gmail.com> wrote:
