Making my Regex less greedy?

Luke_Duncalfe · 4 September 2005 23:51

Hi,

I'm working on a regular expression that will chop a posted message in half,
but chop it on a new paragraph break. I've decided it should look for the
new paragraph break after 100 characters. I'd like the regular expression to
choose an earlier paragraph break rather than a later one, but at the
moment, if there is a message with a number of paragraphs, it chooses the
last possible one it can in order to make a match. I remember reading in the
Pickaxe about how regular expressions are 'greedy', and wonder if this is a
case of regex gluttony perhaps and what I can do to recommend to it a
lighter diet.

  # final act is to chop message in half
  if message =~ /\A(.{100,#{message.length}})<\/p>\s*<p>(.*)/m then
    first_half = $1
    second_half = "</p>\n<p>" + $2
  else
    first_half = message
  end

The logic I'd like the above regex to operate with is: "Starting 100
characters into the message, chop the message at the next paragraph break".

Thanks
Luke

Gavin_Kistner2 · 5 September 2005 00:26

Appending a question mark to a quantifier makes it non-greedy, so it matches as little as possible instead of as much as possible:
.+ versus .+?
.* versus .*?
.{a,b} versus .{a,b}?

For example:
txt = <<END
Twas brillig
and the slithy toves
did gyre and gimble
in the wabe
END

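# Split str at the first newline that occurs at least `length` characters in.
def truncate_after( str, length )
  # The trailing ? makes {n,} lazy, so it stops at the earliest possible \n.
  str =~ /\A(.{#{length},}?)\n(.+)/m
  return [ $1, $2 ]
end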

p truncate_after( txt, 0 )
#=>["Twas brillig", "and the slithy toves\ndid gyre and gimble\nin the wabe\n"]

p truncate_after( txt, 20 )
#=>["Twas brillig\nand the slithy toves", "did gyre and gimble\nin the wabe\n"]

p truncate_after( txt, 40 )
#=>["Twas brillig\nand the slithy toves\ndid gyre and gimble", "in the wabe\n"]
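
Applied to the original paragraph-splitting code, the same lazy quantifier might look like the sketch below. The helper name split_message and its default of 100 are assumptions made for illustration only; they do not appear in the posts above.

```ruby
# Hypothetical helper (name and default are assumptions for illustration):
# split an HTML message at the first paragraph break that occurs at least
# `min` characters into the message.
def split_message(message, min = 100)
  # .{min,}? is lazy: it grows one character at a time and stops at the
  # earliest position where </p>, optional whitespace, and <p> follow.
  if message =~ /\A(.{#{min},}?)<\/p>\s*<p>(.*)/m
    [$1, "</p>\n<p>" + $2]
  else
    # No suitable paragraph break: keep the whole message in the first half.
    [message, nil]
  end
end

message = "<p>" + "a" * 120 + "</p>\n<p>second</p>\n<p>third</p>"
first_half, second_half = split_message(message)
p first_half.length  #=> 123
p second_half        #=> "</p>\n<p>second</p>\n<p>third</p>"
```

Leaving the upper bound of the range empty avoids interpolating the message length, and the else branch returns nil for the second half so the caller can tell that no split happened.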

Luke_Duncalfe · 5 September 2005 04:11

Thanks very much, that works a treat. Always nice to have something
demonstrated in Lewis Carroll.

So, {100,#{m.length}}? is now effectively finding the first match, if any.
Out of curiosity, is it easy to express "find the 3rd match" (rather
than "find 3 matches")?

Luke

W_James · 5 September 2005 07:01

luke wrote:

... Out of curiosity, is it easy to express "find the 3rd match"?. (Rather
than saying "find 3 matches").

Third integer (array indices start at 0, so it is element 2):

"1. All 27 bells were rung 3 times.".scan(/\d+/)[2] #=> "3"

Gavin_Kistner2 · 5 September 2005 07:13

a) As noted in my example, you can leave the second 'argument' to the range quantifier empty, in which case it is unbounded.
a{3,5} <== find 3-5 'a' chars
a{3,} <== find at least 3 'a' chars, up to ... well, as many as you can

b) String#scan will take a regexp and return an array of all matches in the string. (Not as useful if you need the saved sub-expressions, however.)
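
If the saved sub-expressions of the nth match are needed, one possible approach is sketched below. When the pattern contains groups, scan returns one array of captures per match; the block form of scan also sets the last match on each iteration, so a full MatchData can be picked out. The helper name nth_match and the sample string are hypothetical.

```ruby
# Hypothetical helper: return the MatchData for the nth match (1-based)
# of `re` in `str`, or nil if there are fewer than n matches.
def nth_match(str, re, n)
  count = 0
  str.scan(re) do
    count += 1
    # Inside the block, Regexp.last_match refers to the current match.
    return Regexp.last_match if count == n
  end
  nil
end

str = "a=1, b=22, c=333"

# With groups in the pattern, scan returns one array of captures per match.
p str.scan(/(\w)=(\d+)/)[2]  #=> ["c", "333"]

m = nth_match(str, /(\w)=(\d+)/, 3)
p m[0]         #=> "c=333"
p m[2]         #=> "333"
p m.pre_match  #=> "a=1, b=22, "
```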

Dave_Burt2 · 5 September 2005 08:16

luke asked:

... Out of curiosity, is it easy to express "find the 3rd match"?. (Rather
than saying "find 3 matches").

scan will find all matches, so you can do:

a = "first second third".scan(/\w+/)
a[2] #=> "third"
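
For completeness, "the nth match" can also be expressed inside a single regex by skipping the earlier matches with a repeated group. This is only a sketch for the digits case, and nth_number is a hypothetical name. The group is atomic so that backtracking cannot split one number into two.

```ruby
# Hypothetical helper: return the nth run of digits (1-based) in `str`
# using a single regex, or nil if there are fewer than n runs.
def nth_number(str, n)
  # Skip n-1 earlier numbers, each preceded by any non-digits, then capture
  # the next one. (?>...) is an atomic group: once a whole number has been
  # consumed, the engine may not give part of it back.
  str[/\A(?>\D*\d+){#{n - 1}}\D*(\d+)/, 1]
end

s = "1. All 27 bells were rung 3 times."
p nth_number(s, 3)  #=> "3"
p nth_number(s, 1)  #=> "1"
p nth_number(s, 4)  #=> nil
```

For most uses the scan version reads more easily; the single-regex form mainly helps when the match has to be part of a larger pattern.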

Cheers,
Dave
