Non-greediness in a regex - need some help verifying syntax

Wes_Gamble · 3 August 2006 21:31

All,

Need some medium-level regex help.

Here's my regex: /~\^LNK:[\t\r\n]+?\^~/m

I'm trying to find all substrings of my big string that sit between
~^LNK: and ^~ and that contain at least one tab, carriage return, or
newline character between those two delimiters. I use the multiline
option so that I can match on the newlines.

What I'm seeing is that the string consumed by this regex spans many,
many ~^LNK: ^~ pairs, so I am removing a bunch of tabs, newlines, etc.
that I don't want to remove.

I understand the concept of greediness in regexes, so I put the ? after
the [\t\r\n] sequence.

Why is the match spanning so many pairs of the delimiter sequences? Why
doesn't the regex engine stop attempting to match when it sees that
first ^~ after the ~^LNK:?

Any help is appreciated.

Thanks,
Wes

--
Posted via http://www.ruby-forum.com/.

Forum · 3 August 2006 21:43

All,

Need some medium - level regex help.

Here's my regex: /~\^LNK:[\t\r\n]+?\^~/m

Hmmm, I fail to reproduce the problem. Is there nothing missing between
"[\t\r\n]+?" and "\^"?
If so, the missing piece is probably what's consuming all your data.

I'm trying to find all occurrences of strings in my big string that are

between
~^LNK: and ^~ sequences of characters that have at least one tab, form
feed, or newline character between those two characters.

And something else, as I said above, no?

<snip>

Robert

--
Two things are infinite: the universe and human stupidity; and as for
the universe, I have not yet acquired absolute certainty.

- Albert Einstein

Wes_Gamble · 3 August 2006 22:09

I realized I made an error when I did the original post.

Now my problem is that it won't find any of these occurrences.

So @bigstring.scan(/~\^LNK:[\t\r\n]+?\^~/m) isn't returning anything.
My guess is that this is because there are no occurrences of a tab,
carriage return, or newline character _immediately_ after ~^LNK:.

If I do /~\^LNK:.*?[\t\r\n]+?.*?\^~/m - that should pick up what I want,
correct?

Wes
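
A quick way to check that guess is to run both patterns over small strings. This is only a sketch: the constant names and the sample strings are made up for illustration and are not from the original data.

```ruby
# Pattern from the post: only tab/CR/LF may sit between the delimiters.
STRICT_LNK = /~\^LNK:[\t\r\n]+?\^~/m
# Proposed replacement from the post: anything may surround the whitespace.
LOOSE_LNK  = /~\^LNK:.*?[\t\r\n]+?.*?\^~/m

# Hypothetical sample strings.
ONLY_WS   = "~^LNK:\n\t^~"
WITH_TEXT = "~^LNK:foo\nbar^~"

p ONLY_WS.scan(STRICT_LNK)    # => ["~^LNK:\n\t^~"]
p WITH_TEXT.scan(STRICT_LNK)  # => []
p WITH_TEXT.scan(LOOSE_LNK)   # => ["~^LNK:foo\nbar^~"]
```

The strict pattern matches only when nothing but whitespace separates the delimiters, which is consistent with scan returning nothing on real data.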


Morton_Goldberg · 3 August 2006 22:51

I'm not exactly an expert on regexes, to say the least, but I think .*? always matches an empty string and is therefore useless. I would try something like

@bigstring.scan(/~\^LNK:[^\t\r\n]*[\t\r\n]+?[^\t\r\n]*\^~/m)

I have not tested this.

Regards, Morton


Daniel_Martin · 4 August 2006 14:28

Wes Gamble <weyus@att.net> writes:

If I do /~\^LNK:.*?[\t\r\n]+?.*?\^~/m - that should pick up what I want,
correct?

Almost.

The problem is that with this text:

a = "~^LNK:foo^~\n\n~^LNK:bar^~"

You get a match of the whole text:

irb(main):009:0> a.scan(/~\^LNK:.*?[\t\r\n]+?.*?\^~/m)
=> ["~^LNK:foo^~\n\n~^LNK:bar^~"]

Where you obviously wanted to get no matches.

So, here's what I suggest:

/~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m

Read that as:

'~^LNK:' followed by zero or more of:
    some character that isn't \t, \r, \n, or '^', OR
    a '^' character that isn't followed by a '~'
then a \t, \r, or \n character,
then whatever minimum of other characters is necessary to get to ^~.
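
As a rough check of that reading, the pattern can be run against a few small strings. The constant names and sample strings below are hypothetical, chosen only to exercise each branch of the regex.

```ruby
# The suggested pattern, copied unchanged from the post.
LNK_WITH_WS = /~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m

# Hypothetical samples.
NO_WS_INSIDE = "~^LNK:foo^~\n\n~^LNK:bar^~"  # whitespace only between pairs
WS_INSIDE    = "~^LNK:foo^~ ~^LNK:ba\tr^~"   # tab inside the second pair
CARET_INSIDE = "~^LNK:a^b\nc^~"              # lone '^' before the newline

p NO_WS_INSIDE.scan(LNK_WITH_WS)  # => []
p WS_INSIDE.scan(LNK_WITH_WS)     # => ["~^LNK:ba\tr^~"]
p CARET_INSIDE.scan(LNK_WITH_WS)  # => ["~^LNK:a^b\nc^~"]
```

The first case is the one that the simpler .*? version got wrong: here the prefix part cannot step over a ^~, so the match cannot leak into the next pair.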

For these "containing at least one of" type problems, I often find it
useful to write the regular expression as:

begin sequence                                      ( ~\^LNK: )
zero or more characters with none of what we want   ( (?:[^\t\r\n^]|\^(?!~))* )
one of what we want                                 ( [\t\r\n] )
.*?                                                 ( .*? )
end sequence                                        ( \^~ )

For the related "at least n of" problem, (where n > 1), I do this:

begin sequence
(?:
    zero or more characters with none of what we want
    one of what we want
){n}
.*?
end sequence
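
One possible way to turn that template into Ruby, assuming the same delimiters and the same "wanted" characters as above. The method name lnk_with_at_least is hypothetical and not part of any library.

```ruby
# Builds a pattern that requires at least n tab/CR/LF characters
# between ~^LNK: and ^~ (hypothetical helper, for illustration only).
def lnk_with_at_least(n)
  none_of_wanted = /(?:[^\t\r\n^]|\^(?!~))*/  # cannot swallow the end sequence
  one_of_wanted  = /[\t\r\n]/
  /~\^LNK:(?:#{none_of_wanted}#{one_of_wanted}){#{n}}.*?\^~/m
end

two = lnk_with_at_least(2)
p "~^LNK:a\tb^~".scan(two)               # => []
p "~^LNK:a\tb\nc^~".scan(two)            # => ["~^LNK:a\tb\nc^~"]
p "~^LNK:a\tb^~ ~^LNK:c\nd^~".scan(two)  # => []
```

The last call has two pairs with one whitespace character each; neither pair qualifies on its own, and the match does not combine them.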

The only tricky part is inside the "none of what we want" chunk, where
you have to take care that the "none of what we want" chunk can't
swallow up your end sequence. (Depending on what you want and what
your end sequence is, you also need to be careful that the "one of
what we want" part can't swallow part of your end sequence)

Sometimes it's easier to just write a regular expression that gets
more matches than you want, and then throw away excess matches in
code:

lnk_regex = /~\^LNK:.*?\^~/m   # /m lets . match newlines
text.scan(lnk_regex) { |m|
  # With no capture groups, scan yields each match as a String.
  next unless m =~ /[\t\r\n]/
  ...
}

That can often be more readable too. Depending on your data, however,
it may be much, much slower than using a regular expression that finds
only what you need to begin with.

Collins_Justin · 3 August 2006 23:00

Morton Goldberg wrote:

I'm not exactly an expert on regexes, to say the least, but I think .*? always matches an empty string and is therefore useless. I would try something like

@bigstring.scan(/~\^LNK:[^\t\r\n]*[\t\r\n]+?[^\t\r\n]*\^~/m)

I have not tested this.

Regards, Morton

No, that's not true. .*? will match whatever it needs to in order to get to the next item:

irb(main):001:0> "asidjoaisdj".match(/.*?j/)[0]
=> "asidj"
irb(main):002:0> "asidjoaisdj".match(/.*?d/)[0]
=> "asid"
irb(main):003:0> "asidjoaisdj".match(/.*?sdj/)[0]
=> "asidjoaisdj"

Therefore, it should work as Wes expects.

-Justin
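
For comparison, here is a small sketch that puts the greedy and lazy forms side by side on the same string. The constant name is made up for illustration.

```ruby
SAMPLE = "asidjoaisdj"

p SAMPLE[/.*j/]   # greedy: runs to the last "j"   => "asidjoaisdj"
p SAMPLE[/.*?j/]  # lazy: stops at the first "j"   => "asidj"
p SAMPLE[/.*?/]   # lazy with nothing after it     => ""
```

The third line may be where the "always matches an empty string" idea comes from: .*? matches nothing only when no later part of the pattern forces it to grow.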

Wes_Gamble · 4 August 2006 15:07

Daniel,

Currently, I have this working using the .*? to match everything since I
am just passing the results into a block that then does a gsub on the
offending characters. Slightly inefficient, but as you pointed out,
much more readable.

Thanks for the thorough regex analysis though.

Wes
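
A minimal sketch of what that approach might look like, assuming "offending characters" means tabs, carriage returns, and newlines, and assuming they are simply deleted. The names LNK_SPAN and strip_ws_in_links are hypothetical.

```ruby
# Every ~^LNK: ... ^~ span; /m lets . cross newlines.
LNK_SPAN = /~\^LNK:.*?\^~/m

# Removes tab/CR/LF inside each span and leaves the rest of the text alone.
# (Assumption: deletion is the desired replacement.)
def strip_ws_in_links(text)
  text.gsub(LNK_SPAN) { |span| span.gsub(/[\t\r\n]/, "") }
end

p strip_ws_in_links("a\n~^LNK:fo\no\t^~\nb")  # => "a\n~^LNK:foo^~\nb"
```

Spans without any of those characters pass through the inner gsub unchanged, so no separate filtering step is needed.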
