Surprising Regexp Behavior

James_Edward_Gray_II · 13 September 2005 18:31

I keep running into some surprising points with Ruby's Regexp engine today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

Here's my Ruby version:

$ ruby -v
ruby 1.8.2 (2004-12-25) [powerpc-darwin7.7.0]

Thanks for any wisdom you can impart.

James Edward Gray II

Pit · 13 September 2005 18:46

James Edward Gray II wrote:

I keep running into some surprising points with Ruby's Regexp engine today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

James, what did you expect? Both examples look perfectly valid to me.
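For readers following along, here is a small sketch of what each pattern actually matches. It assumes the sample string carried <p> tags (the archive appears to have stripped them), so the literals below are a reconstruction rather than a quote from the original post.

```ruby
# Assumed sample input: the archived post seems to have lost its <p> tags.
html = "<p>one</p>\n\n<p>two</p>"

# Without /m, "." stops at a newline, so (.*) captures nothing after "</p>".
first = html.match(/<p>(.*?)<\/p>(.*)/)
p first[0]   # the matched text: the first paragraph only
p first[2]   # the empty string

# Adding \Z forces the match to finish at the end of the string. No match
# starting at the first <p> can get there without crossing a newline, so the
# engine moves on and the leftmost *successful* match is the second paragraph.
anchored = html.match(/<p>(.*?)<\/p>(.*)\Z/)
p anchored.pre_match   # everything before the match
p anchored[1]
```

In other words, the leftmost match still wins; it is just that no match starting at the first paragraph can succeed once the anchor is added.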

Regards,
Pit

Ara.T.Howard6 · 13 September 2005 18:47

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"

irb(main):002:0> html[ %r|<p> .*? |x ]
=> "<p>"

irb(main):003:0> html[ %r|<p> .*? </p> |x ]
=> "<p>one</p>"

irb(main):004:0> html[ %r|<p> .*? </p> .* |x ]
=> "<p>one</p>"

hmm?

but if we use 'm' to make '.' match newline:

irb(main):005:0> html[ %r|<p> .*? </p> .* |xm ]
=> "<p>one</p>\n\n<p>two</p>"

alternatively we can name newline explicitly:

irb(main):006:0> html[ %r|<p> .*? </p> [.\n]* |x ]
=> "<p>one</p>\n\n"

probably 'm' is better for html though.
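A side note on the character class, offered as a hedged sketch: inside brackets a dot is a literal ".", so [.\n]* matches only dots and newlines rather than "any character including newline". The strings below are made up for illustration.

```ruby
# Inside a character class "." is literal, so [.\n] means "dot or newline".
dots_and_newlines = "a\nb"[/a[.\n]*/]   # stops before "b"
literal_dot       = "a.b"[/a[.\n]*/]    # consumes the literal dot only
with_m_flag       = "a\nb"[/a.*/m]      # /m lets "." cross the newline

p dots_and_newlines
p literal_dot
p with_m_flag
```

That would explain why the bracketed version stops right after the blank lines instead of running on to the second paragraph.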

irb(main):007:0> html =~ %r|<p> (.*?) </p> (.*) |xm and p [$1, $2]
["one", "\n\n<p>two</p>"]

cheers.

-a

···

On Wed, 14 Sep 2005, James Edward Gray II wrote:

I keep running into some surprising points with Ruby's Regexp engine today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells amoung the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

David_A_Black3 · 13 September 2005 18:50

Hi --

···

On Wed, 14 Sep 2005, James Edward Gray II wrote:

I keep running into some surprising points with Ruby's Regexp engine today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

In both cases, if you use the /m modifier, the dot will match \n, and
I think the behavior you want will happen.
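A minimal sketch of the difference, assuming the sample string originally contained <p> tags (they seem to have been stripped from the archived post). In Ruby, /m is the flag that lets "." match a newline.

```ruby
html = "<p>one</p>\n\n<p>two</p>"   # assumed input, tags reconstructed

plain     = html.match(/<p>(.*?)<\/p>(.*)/)
multiline = html.match(/<p>(.*?)<\/p>(.*)/m)

p plain[2]       # "." cannot cross "\n", so nothing is captured
p multiline[2]   # with /m, "." matches "\n" too, so the rest is captured
```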

David

--
David A. Black
dblack@wobblini.net

Robert · 14 September 2005 11:26

James Edward Gray II wrote:

I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Maybe I overlooked something but I didn't see anybody mention it: the
trailing (.*) seems quite superfluous to me. Why did you put it there?
That way you make the regexp engine match more than you need and if you
change sub! to gsub! at some time, you'll likely still have only one
replacement, because .* matches anything to the end.
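To illustrate the gsub! point, here is a rough sketch. It assumes the /m flag is in effect and that the input looked like the reconstructed string below; neither detail is spelled out at this point in the thread.

```ruby
html = "<p>one</p>\n\n<p>two</p>"   # assumed input, tags reconstructed

# With /m the trailing (.*) swallows the rest of the string, so gsub finds
# only one match and the second paragraph disappears.
with_tail = html.gsub(/<p>(.*?)<\/p>(.*)/m) { $1.strip }

# Without the trailing group each paragraph is matched separately.
without_tail = html.gsub(/<p>(.*?)<\/p>/m) { $1.strip }

p with_tail
p without_tail
```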

Kind regards

robert

James_Edward_Gray_II · 14 September 2005 13:03

So I could check to see if there was more content after the first paragraph that I trimmed. The code goes on to replace it with an ellipsis if there was.

James Edward Gray II

···

On Sep 14, 2005, at 6:26 AM, Robert Klemme wrote:

James Edward Gray II wrote:

I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Maybe I overlooked something but I didn't see anybody mention it: the
trailing (.*) seems quite superfluous to me. Why did you put it there?

Robert · 14 September 2005 13:21

James Edward Gray II wrote:

···

On Sep 14, 2005, at 6:26 AM, Robert Klemme wrote:

James Edward Gray II wrote:

I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Maybe I overlooked something but I didn't see anybody mention it: the
trailing (.*) seems quite superfluous to me. Why did you put it
there?

So I could check to see if there was more content after the first
paragraph that I trimmed. The code goes on to replace it with an
ellipsis if there was.

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }
html.sub!(/<p>(.*?)<\/p>(.*)/) { "<p>#{$1.strip}</p>" }

Kind regards

robert

James_Edward_Gray_II · 14 September 2005 13:32

The method takes a chunk of HTML and pulls the first paragraph out of it (minus the <p> and </p> tags). But I want to know if there was other content, so I can add an ellipsis if needed.

Here's the entire method, defined in a Rails helper module:

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  if $2 =~ /\S/
    "#{html} #{link_to '...', :action => :show, :id => id}"
  else
    html
  end
end

It works as expected now.
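For anyone who wants to try the idea outside Rails, here is a rough standalone sketch. The name first_paragraph and its return shape are hypothetical, not part of the helper above, and it uses an explicit MatchData instead of the $1/$2 globals. Unlike the sub! version, it returns only the paragraph text and drops anything before the first <p>.

```ruby
# Hypothetical helper: returns the stripped text of the first <p>...</p>
# plus a flag saying whether any non-whitespace content follows it.
def first_paragraph(html)
  match = /<p>(.*?)<\/p>(.*)\z/m.match(html)
  return [html, false] unless match   # no paragraph found: pass input through

  [match[1].strip, !!(match[2] =~ /\S/)]
end

p first_paragraph("<p> one </p>\n\n<p>two</p>")
p first_paragraph("<p>one</p>\n")
```

The caller can then decide whether to append a link or an ellipsis based on the flag.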

James Edward Gray II

···

On Sep 14, 2005, at 8:21 AM, Robert Klemme wrote:

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }
html.sub!(/<p>(.*?)<\/p>(.*)/) { "<p>#{$1.strip}</p>" }

Ara.T.Howard6 · 14 September 2005 14:21

it never occurred to me that regexes could be made to be context sensitive in
that way - that usage of the block, i think, makes them recognize more than
the regular languages, doesn't it? something like

string.sub(pat){ $1 =~ /foo/ ? 'bar' : 'baz' }

though i suppose you can only look backward using this unless the pattern was
made quite general to ensure capture forward....

-a

···

On Wed, 14 Sep 2005, Robert Klemme wrote:

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells amoung the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

Robert · 14 September 2005 13:51

James Edward Gray II wrote:

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }
html.sub!(/<p>(.*?)<\/p>(.*)/) { "<p>#{$1.strip}</p>" }

The method takes a chunk of HTML and pulls the first paragraph out of
it (minus the <p> and </p> tags). But I want to know if there was
other content, so I can add an ellipsis if needed.

Here's the entire method, defined in a Rails helper module:

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  if $2 =~ /\S/
    "#{html} #{link_to '...', :action => :show, :id => id}"
  else
    html
  end
end

It works as expected now.

This might be a bit more efficient (dunno how often you call it):

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2 =~ /\S/
  html
end

An alternative

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>.*(\S)?\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2
  html
end

Just an idea...
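One caveat about the alternative, as a hedged sketch: because .* is greedy and (\S)? is optional, the .* appears to consume everything up to the end of the string, the group never participates, and $2 stays nil. Moving the optional group in front of the greedy .* (after skipping whitespace) is one possible variant. The constant names, the second pattern, and the sample strings are illustrative and not from the thread.

```ruby
GREEDY_FIRST = /<p>(.*?)<\/p>.*(\S)?\Z/m       # pattern as posted, <p> assumed
GROUP_FIRST  = /<p>(.*?)<\/p>\s*(\S)?.*\Z/m    # hypothetical variant

more   = "<p>one</p>\n\n<p>two</p>"   # assumed input with trailing content
single = "<p>one</p>\n"               # assumed input with nothing after it

p GREEDY_FIRST.match(more)[2]    # group 2 never gets a chance to match
p GROUP_FIRST.match(more)[2]     # first non-whitespace character after </p>
p GROUP_FIRST.match(single)[2]   # nothing but whitespace follows
```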

Cheers

robert


Robert · 14 September 2005 14:36

Ara.T.Howard wrote:

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }

it never occurred to me that regexes could be made to be context
sensitive in that way - that usage of the block, i think, makes them
recognize more than the regular languages, doesn't it?

No. The block is just for the replacement. It doesn't change anything
for the match.

something like

string.sub(pat){ $1 =~ /foo/ ? 'bar' : 'baz' }

though i suppose you can only look backward using this unless the
pattern was made quite general to ensure capture forward....

I don't see how this is look forward or backward. The group actually has
to be matched to be able to use it as basis for some kind of conditional
replacement. There's no lookahead / lookbehind magic involved - or I
cannot see it.
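A small sketch of that point, with a made-up input string: the pattern alone determines what is matched, and the block only computes the replacement text for each match.

```ruby
words = "foo bar"   # made-up input

# The pattern alone decides what is matched ...
matched = words.scan(/\w+/)

# ... and the block only picks the replacement for each match.
replaced = words.gsub(/(\w+)/) { $1 =~ /foo/ ? 'bar' : 'baz' }

p matched
p replaced
```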

Kind regards

robert


James_Edward_Gray_II · 14 September 2005 14:04

That's not equivalent. You're missing a space between html's content and the ellipsis.

But thanks for the ideas.

James Edward Gray II

···

On Sep 14, 2005, at 8:51 AM, Robert Klemme wrote:

James Edward Gray II wrote:

Here's the entire method, defined in a Rails helper module:

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  if $2 =~ /\S/
    "#{html} #{link_to '...', :action => :show, :id => id}"
  else
    html
  end
end

It works as expected now.

This might be a bit more efficient (dunno how often you call it):

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2 =~ /\S/
  html
end

Robert · 14 September 2005 14:26

James Edward Gray II wrote:

James Edward Gray II wrote:

Here's the entire method, defined in a Rails helper module:

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  if $2 =~ /\S/
    "#{html} #{link_to '...', :action => :show, :id => id}"
  else
    html
  end
end

It works as expected now.

This might be a bit more efficient (dunno how often you call it):

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2 =~ /\S/
  html
end

That's not equivalent. You're missing a space between html's content
and the ellipsis.

Right. But hey, that's an easy change, isn't it?

But thanks for the ideas.

You're welcome!

robert

