Reg exp comparison hanging irb

hi,

was just going trough reg exp
the lines below just hangs irb and cpu usage goes 100% on windows xp
can anyone explain me why?

irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
=> /^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
irb(main):002:0> "
http://groups-beta.google.com/group/rubyonrails-talk/browse_th
read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

i'll appreciate if anyone helps with this.

regards
gaurav...

the GNU regexp engine has a few oddities..
here is another example that triggers an endless loop.

r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
r.scan /<(?:[^">]+|"[^"]*")+>/

we have in common that our regexp has nested repeating patterns.

···

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:

was just going trough reg exp
the lines below just hangs irb and cpu usage goes 100% on windows xp
can anyone explain me why?

irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
=> /^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
irb(main):002:0> "
http://groups-beta.google.com/group/rubyonrails-talk/browse_th
read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

--
Simon Strandgaard
http://opcoders.com/

There's already a library for what you want to do:

require 'uri'

uri = URI.parse "http://groups-beta.google.com/group/rubyonrails-talk/\.\.\.&quot;

···

On Dec 26, 2006, at 21:20, gaurav bagga wrote:

hi,

was just going trough reg exp
the lines below just hangs irb and cpu usage goes 100% on windows xp
can anyone explain me why?

irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
=> /^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
irb(main):002:0> "
http://groups-beta.google.com/group/rubyonrails-talk/browse_th
read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

i'll appreciate if anyone helps with this.

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net

I LIT YOUR GEM ON FIRE!

hi simon,
  thanks for reply
  so whats the way out i am not used to reg exp that much
  some one helped me with reg exp that i posted
  as a single entry might stop the things to work....
regards
gaurav

the GNU regexp engine has a few oddities..
here is another example that triggers an endless loop.

r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
r.scan /<(?:[^">]+|"[^"]*")+>/

To clarify, it's probably not an endless loop, just may or
may not finish in our lifetimes. :slight_smile:

If you make the string shorter, like:

r='<M h="C-T c="t/h; c=i-8-1">'

...you'll see it finishes quickly.

Add a few characters:

r='<M h="C-T c="t/h; charset=i-8-1">'

...and there's a slight delay before it finishes.

I found that removing greediness from your outer one-or-more
match, sped it up a lot: (changed + to +?)

r.scan /<(?:[^">]+|"[^"]*")+?>/

Now the match on your full string finishes in a few seconds
on my system. Still slow... just a lot faster than the
greedy version.

Incidentally, it's the mismatched quotes in the attribute
value that are causing the backtracking.

If we allow the regex to fail "gracefully" on mismatched
quotes, we can prevent the backtracking:

r.scan /<(?:[^">]+|"[^"]*"|")+?>/

...i.e. the thinking is, if all else fails, just gobble a
single " and keep going.

Regards,

Bill

···

From: "Simon Strandgaard" <neoneye@gmail.com>

Simon Strandgaard wrote:

> was just going trough reg exp
> the lines below just hangs irb and cpu usage goes 100% on windows xp
> can anyone explain me why?
>
>
> irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
> => /^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
> irb(main):002:0> "
> http://groups-beta.google.com/group/rubyonrails-talk/browse_th
> read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

the GNU regexp engine has a few oddities..
here is another example that triggers an endless loop.

r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
r.scan /<(?:[^">]+|"[^"]*")+>/

we have in common that our regexp has nested repeating patterns.

Much faster:

r = %r{
   ^
   (https?://)?
   (?> [a-z0-9]+ )
   (?> [-._=&+/?]? [a-z0-9]+ )+
   $
  }xi
p ( "http://groups-beta.google.com/group/rubyonrails-talk/&quot; +
  "browse_thread/thread/8f085b191387d799/" +
  "e78a71cbd7354c0c#e78a71cbd7354c0c") =~ r

It doesn't match (because of the #).

···

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:

do you want to do a rough validation of the entire url?
or do you want to check only the beginning?

try this one
/^(https?:\/\/)?\w+([-._=&+\/?#]\w+)+$/i

···

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:

hi simon,
  thanks for reply
  so whats the way out i am not used to reg exp that much
  some one helped me with reg exp that i posted
  as a single entry might stop the things to work....

--
Simon Strandgaard
http://opcoders.com/

hi,
thank you all for help
regards
gaurav

···

Thanks for the clarification. I must have forgotten
a lot about regexp too.. iirc stress can cause amnesia.

···

On 12/27/06, Bill Kelly <billk@cts.com> wrote:

From: "Simon Strandgaard" <neoneye@gmail.com>
>
> the GNU regexp engine has a few oddities..
> here is another example that triggers an endless loop.
>
> r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
> r.scan /<(?:[^">]+|"[^"]*")+>/

To clarify, it's probably not an endless loop, just may or
may not finish in our lifetimes. :slight_smile:

--
Simon Strandgaard
http://opcoders.com/

hi,
  thanks for the reg exp :slight_smile:
  basically that suffices simple website url(
http://www.something.com/something\) or a blog url
regards
gaurav

···

On 12/27/06, Simon Strandgaard <neoneye@gmail.com> wrote:

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:
> hi simon,
> thanks for reply
> so whats the way out i am not used to reg exp that much
> some one helped me with reg exp that i posted
> as a single entry might stop the things to work....

do you want to do a rough validation of the entire url?
or do you want to check only the beginning?

try this one
/^(https?:\/\/)?\w+([-._=&+\/?#]\w+)+$/i

--
Simon Strandgaard
http://opcoders.com/

hi simon,

can you explain regexp has nested repeating patterns.

regards
gaurav

the pattern /a+/ is not nested.
putting ( )+ around it, then /(a+)+/ is nested 1 level.
putting ( )+ around it, then /((a+)+)+/ is nested 2 levels.

I suspect the GNU regexp engine sometimes go crazy when there
is a pattern that can match _nothing_ inside a nested pattern.

by matching _nothing_ i mean: ^ $ \b * ? lookahead lookbehind.. etc

···

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:

can you explain regexp has nested repeating patterns.

--
Simon Strandgaard
http://opcoders.com/