Surprising Regexp Behavior

I keep running into some surprising points with Ruby's Regexp engine today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

Here's my Ruby version:

$ ruby -v
ruby 1.8.2 (2004-12-25) [powerpc-darwin7.7.0]

Thanks for any wisdom you can impart.

James Edward Gray II

James, what did you expect? Both examples look perfectly valid to me.

Regards,
Pit

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
   => "<p>one</p>\n\n<p>two</p>"

   irb(main):002:0> html[ %r| <p> .*? |x ]
   => "<p>"

   irb(main):003:0> html[ %r| <p> .*? </p> |x ]
   => "<p>one</p>"

   irb(main):004:0> html[ %r| <p> .*? </p> .* |x ]
   => "<p>one</p>"

hmm?

but if we use 'm' to make '.' match newline:

   irb(main):005:0> html[ %r| <p> .*? </p> .* |xm ]
   => "<p>one</p>\n\n<p>two</p>"

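alternatively we can name newline explicitly (though inside a character class
'.' is a literal dot, so this only picks up the two newlines, not the rest):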

   irb(main):006:0> html[ %r| <p> .*? </p> [.\n]* |x ]
   => "<p>one</p>\n\n"

probably 'm' is better for html though.
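
A side note on the character class above, as a minimal sketch: inside brackets '.' is a literal dot rather than a wildcard, so [.\n]* can only consume dots and newlines. The (?:.|\n) spelling is only an illustration added here, not something from the session above.

```ruby
html = "<p>one</p>\n\n<p>two</p>"

# Inside a character class, '.' is a literal dot, not "any character",
# so [.\n]* stops as soon as it meets the '<' of the second paragraph.
literal_dot = html[/<p>.*?<\/p>[.\n]*/]

# Alternation outside a class keeps the wildcard meaning of '.'.
any_char = html[/<p>.*?<\/p>(?:.|\n)*/]

p literal_dot  # "<p>one</p>\n\n"
p any_char     # "<p>one</p>\n\n<p>two</p>"
```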

   irb(main):007:0> html =~ %r| <p> (.*?) </p> (.*) |xm and p [$1, $2]
   ["one", "\n\n<p>two</p>"]

cheers.

-a

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

Hi --

In both cases, if you use the /m modifier, the dot will match \n, and
I think the behavior you want will happen.
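
For the anchored case, here is a sketch of what appears to be going on: the leftmost match still wins, but only among attempts that succeed. Without /m the attempt starting at the first <p> can never reach \Z, because (.*) cannot cross the newlines, so the engine moves on and succeeds at the second <p>. MatchData#begin is used here only to show where each match starts.

```ruby
html = "<p>one</p>\n\n<p>two</p>"

anchored       = html.match(/<p>(.*?)<\/p>(.*)\Z/)
anchored_multi = html.match(/<p>(.*?)<\/p>(.*)\Z/m)

p anchored.begin(0)        # 12, the second <p>
p anchored[1]              # "two"
p anchored_multi.begin(0)  # 0, the first <p>
p anchored_multi[2]        # "\n\n<p>two</p>"
```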

David

--
David A. Black
dblack@wobblini.net

James Edward Gray II wrote:

I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Maybe I overlooked something but I didn't see anybody mention it: the
trailing (.*) seems quite superfluous to me. Why did you put it there?
That way you make the regexp engine match more than you need and if you
change sub! to gsub! at some time, you'll likely still have only one
replacement, because .* matches anything to the end.
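
A minimal sketch of the gsub point, using a made-up one-line string (both paragraphs on one line, so '.' never has to cross a newline):

```ruby
line = "<p>one</p> <p>two</p>"

# The trailing (.*) swallows the rest of the line, so the first match
# consumes the second paragraph and gsub finds nothing left to replace.
with_tail = line.gsub(/<p>(.*?)<\/p>(.*)/) { $1.strip }

# Without it, each paragraph is matched and replaced separately.
without_tail = line.gsub(/<p>(.*?)<\/p>/) { $1.strip }

p with_tail     # "one"
p without_tail  # "one two"
```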

Kind regards

    robert

So I could check to see if there was more content after the first paragraph that I trimmed. The code goes on to replace it with an ellipsis if there was.

James Edward Gray II

James Edward Gray II wrote:

So I could check to see if there was more content after the first
paragraph that I trimmed. The code goes on to replace it with an
ellipsis if there was.

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }
html.sub!(/<p>(.*?)<\/p>(.*)/) { "<p>#{$1.strip}</p>" }

Kind regards

    robert

The method takes a chunk of HTML and pulls the first paragraph out of it (minus the <p> and </p> tags). But I want to know if there was other content, so I can add an ellipsis if needed.

Here's the entire method, defined in a Rails helper module:

     def excerpt( textile, id )
         html = sanitize(textilize(textile))
         html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
         if $2 =~ /\S/
             "#{html} #{link_to '...', :action => :show, :id => id}"
         else
             html
         end
     end

It works as expected now.
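
For anyone who wants to poke at the matching logic outside Rails, here is a rough plain-Ruby sketch. first_paragraph is a hypothetical name, it skips sanitize/textilize/link_to entirely, and unlike the sub! version it returns only the paragraph text plus a flag rather than editing the string in place.

```ruby
# Hypothetical helper, not the Rails method above: returns the text of the
# first paragraph plus a flag saying whether anything non-blank follows it.
def first_paragraph(html)
  m = html.match(/<p>(.*?)<\/p>(.*)\Z/m)
  return [html, false] unless m   # no paragraph found: hand the input back

  [m[1].strip, !!(m[2] =~ /\S/)]
end

p first_paragraph("<p> one </p>\n\n<p>two</p>")  # ["one", true]
p first_paragraph("<p>one</p>\n")                # ["one", false]
```

Using MatchData instead of $1/$2 keeps the result local to the method, which is a design preference here rather than a requirement.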

James Edward Gray II

···

On Sep 14, 2005, at 8:21 AM, Robert Klemme wrote:

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }
html.sub!(/<p>(.*?)<\/p>(.*)/) { "<p>#{$1.strip}</p>" }

it never occurred to me that regexes could be made to be context sensitive in
that way - that usage of the block, i think, makes them recognize more than
the regular languages, doesn't it? something like

   string.sub(pat){ $1 =~ /foo/ ? 'bar' : 'baz' }

though i suppose you can only look backward using this unless the pattern was
made quite general to ensure capture forward....

-a

James Edward Gray II wrote:

It works as expected now.

This might be a bit more efficient (dunno how often you call it):

     def excerpt( textile, id )
         html = sanitize(textilize(textile))
         html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
         html << link_to( '...', :action => :show, :id => id ) if $2 =~ /\S/
         html
     end

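An alternative:

     def excerpt( textile, id )
         html = sanitize(textilize(textile))
         # \s* goes before (\S)? so the group can capture; placed after a
         # greedy .* it would always match empty and leave $2 nil
         html.sub!(/<p>(.*?)<\/p>\s*(\S)?.*\Z/m) { $1.strip }
         html << link_to( '...', :action => :show, :id => id ) if $2
         html
     end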

Just an idea...
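
A small sketch of why the position of an optional group relative to a greedy .* matters when it is used as a "was there more?" flag. The string is just the tail of the sample HTML from earlier in the thread.

```ruby
rest = "\n\n<p>two</p>"

# Greedy .* first: it takes everything, and the optional group is
# satisfied by matching nothing, so the capture stays nil.
greedy_first = rest.match(/.*(\S)?\Z/m)

# Whitespace first: (\S) gets a chance at the first visible character
# before .* consumes the remainder.
group_first = rest.match(/\s*(\S)?.*\Z/m)

p greedy_first[1]  # nil
p group_first[1]   # "<"
```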

Cheers

    robert

Ara.T.Howard wrote:

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }

it never occurred to me that regexes could be made to be context
sensitive in that way - that usage of the block, i think, makes them
recognize more than the regular languages, doesn't it?

No. The block is just for the replacement. It doesn't change anything
for the match.

something like

   string.sub(pat){ $1 =~ /foo/ ? 'bar' : 'baz' }

though i suppose you can only look backward using this unless the
pattern was made quite general to ensure capture forward....

I don't see how this is look forward or backward. The group actually has
to be matched to be able to use it as basis for some kind of conditional
replacement. There's no lookahead / lookbehind magic involved - or I
cannot see it. :-)
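
To make that concrete, a minimal sketch with made-up data: the block only runs once a match has already been found, and all it decides is the replacement text.

```ruby
text = "foo=1 bar=2 foo=3"

calls = []
result = text.gsub(/(\w+)=(\d+)/) do
  # By the time the block runs, the match is already complete; $1 and $2
  # are just data for choosing the replacement text.
  calls << $~[0]
  $1 == "foo" ? "#{$1}=#{$2.to_i * 10}" : $~[0]
end

p calls   # ["foo=1", "bar=2", "foo=3"]
p result  # "foo=10 bar=2 foo=30"
```

The set of substrings matched is the same whatever the block returns, which is why it adds no matching power to the pattern itself.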

Kind regards

    robert

That's not equivalent. You're missing a space between html's content and the ellipsis.

But thanks for the ideas.

James Edward Gray II

James Edward Gray II wrote:

That's not equivalent. You're missing a space between html's content
and the ellipsis.

Right. But hey, that's an easy change, isn't it? :-)
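
Roughly like this, as a sketch that runs outside Rails. fake_link and excerpt_text are hypothetical stand-ins (the real method uses link_to, sanitize and textilize), and the dup is only there so the sample strings are not modified.

```ruby
# Stand-in for Rails' link_to so the sketch runs on its own (hypothetical).
def fake_link(text)
  "<a href=\"#\">#{text}</a>"
end

def excerpt_text(html)
  html = html.dup                                # don't mutate the caller's string
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  html << " " << fake_link("...") if $2 =~ /\S/  # the added " " restores the gap
  html
end

p excerpt_text("<p>one</p>\n\n<p>two</p>")  # "one <a href=\"#\">...</a>"
p excerpt_text("<p>one</p>")                # "one"
```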

But thanks for the ideas.

You're welcome!

    robert
