Reg exp comparison hanging irb

gaurav_bagga · 27 December 2006 05:20

hi,

was just going trough reg exp
the lines below just hangs irb and cpu usage goes 100% on windows xp
can anyone explain me why?

irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
=> /^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
irb(main):002:0> "
http://groups-beta.google.com/group/rubyonrails-talk/browse_th
read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

i'll appreciate if anyone helps with this.

regards
gaurav...

Simon_Strandgaard2 · 27 December 2006 08:29

the GNU regexp engine has a few oddities..
here is another example that triggers an endless loop.

r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
r.scan /<(?:[^">]+|"[^"]*")+>/

we have in common that our regexp has nested repeating patterns.

···

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:

was just going trough reg exp
the lines below just hangs irb and cpu usage goes 100% on windows xp
can anyone explain me why?

irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
=> /^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
irb(main):002:0> "
http://groups-beta.google.com/group/rubyonrails-talk/browse_th
read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

--
Simon Strandgaard
http://opcoders.com/

Eric_Hodel1 · 27 December 2006 22:13

There's already a library for what you want to do:

require 'uri'

uri = URI.parse "http://groups-beta.google.com/group/rubyonrails-talk/\.\.\."

···

On Dec 26, 2006, at 21:20, gaurav bagga wrote:

hi,

was just going trough reg exp
the lines below just hangs irb and cpu usage goes 100% on windows xp
can anyone explain me why?

irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
=> /^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
irb(main):002:0> "
http://groups-beta.google.com/group/rubyonrails-talk/browse_th
read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

i'll appreciate if anyone helps with this.

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net

I LIT YOUR GEM ON FIRE!

gaurav_bagga · 27 December 2006 08:41

hi simon,
  thanks for reply
  so whats the way out i am not used to reg exp that much
  some one helped me with reg exp that i posted
  as a single entry might stop the things to work....
regards
gaurav

Bill_Kelly · 27 December 2006 12:44

the GNU regexp engine has a few oddities..
here is another example that triggers an endless loop.

r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
r.scan /<(?:[^">]+|"[^"]*")+>/

To clarify, it's probably not an endless loop, just may or
may not finish in our lifetimes.

If you make the string shorter, like:

r='<M h="C-T c="t/h; c=i-8-1">'

...you'll see it finishes quickly.

Add a few characters:

r='<M h="C-T c="t/h; charset=i-8-1">'

...and there's a slight delay before it finishes.

I found that removing greediness from your outer one-or-more
match, sped it up a lot: (changed + to +?)

r.scan /<(?:[^">]+|"[^"]*")+?>/

Now the match on your full string finishes in a few seconds
on my system. Still slow... just a lot faster than the
greedy version.

Incidentally, it's the mismatched quotes in the attribute
value that are causing the backtracking.

If we allow the regex to fail "gracefully" on mismatched
quotes, we can prevent the backtracking:

r.scan /<(?:[^">]+|"[^"]*"|")+?>/

...i.e. the thinking is, if all else fails, just gobble a
single " and keep going.

Regards,

Bill

···

From: "Simon Strandgaard" <neoneye@gmail.com>

W_James · 27 December 2006 20:20

Simon Strandgaard wrote:

> was just going trough reg exp
> the lines below just hangs irb and cpu usage goes 100% on windows xp
> can anyone explain me why?
>
>
> irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
> => /^(https?:\/\/)?[a-z0-9]+([\.\-\_=&\+\/\?]?[a-z0-9]+)+$/i
> irb(main):002:0> "
> http://groups-beta.google.com/group/rubyonrails-talk/browse_th
> read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

the GNU regexp engine has a few oddities..
here is another example that triggers an endless loop.

r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
r.scan /<(?:[^">]+|"[^"]*")+>/

we have in common that our regexp has nested repeating patterns.

Much faster:

r = %r{
   ^
   (https?://)?
   (?> [a-z0-9]+ )
   (?> [-._=&+/?]? [a-z0-9]+ )+
   $
  }xi
p ( "http://groups-beta.google.com/group/rubyonrails-talk/" +
  "browse_thread/thread/8f085b191387d799/" +
  "e78a71cbd7354c0c#e78a71cbd7354c0c") =~ r

It doesn't match (because of the #).

···

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:

Simon_Strandgaard2 · 27 December 2006 10:27

do you want to do a rough validation of the entire url?
or do you want to check only the beginning?

try this one
/^(https?:\/\/)?\w+([-._=&+\/?#]\w+)+$/i

···

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:

hi simon,
  thanks for reply
  so whats the way out i am not used to reg exp that much
  some one helped me with reg exp that i posted
  as a single entry might stop the things to work....

--
Simon Strandgaard
http://opcoders.com/

gaurav_bagga · 27 December 2006 13:18

hi,
thank you all for help
regards
gaurav

···

Simon_Strandgaard2 · 27 December 2006 13:26

Thanks for the clarification. I must have forgotten
a lot about regexp too.. iirc stress can cause amnesia.

···

On 12/27/06, Bill Kelly <billk@cts.com> wrote:

From: "Simon Strandgaard" <neoneye@gmail.com>
>
> the GNU regexp engine has a few oddities..
> here is another example that triggers an endless loop.
>
> r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
> r.scan /<(?:[^">]+|"[^"]*")+>/

To clarify, it's probably not an endless loop, just may or
may not finish in our lifetimes.

--
Simon Strandgaard
http://opcoders.com/

gaurav_bagga · 27 December 2006 10:58

hi,
thanks for the reg exp
basically that suffices simple website url(
http://www.something.com/something\) or a blog url
regards
gaurav

···

On 12/27/06, Simon Strandgaard <neoneye@gmail.com> wrote:

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:
> hi simon,
> thanks for reply
> so whats the way out i am not used to reg exp that much
> some one helped me with reg exp that i posted
> as a single entry might stop the things to work....

do you want to do a rough validation of the entire url?
or do you want to check only the beginning?

try this one
/^(https?:\/\/)?\w+([-._=&+\/?#]\w+)+$/i

--
Simon Strandgaard
http://opcoders.com/

gaurav_bagga · 27 December 2006 11:03

hi simon,

can you explain regexp has nested repeating patterns.

regards
gaurav

Simon_Strandgaard2 · 27 December 2006 11:48

the pattern /a+/ is not nested.
putting ( )+ around it, then /(a+)+/ is nested 1 level.
putting ( )+ around it, then /((a+)+)+/ is nested 2 levels.

I suspect the GNU regexp engine sometimes go crazy when there
is a pattern that can match _nothing_ inside a nested pattern.

by matching _nothing_ i mean: ^ $ \b * ? lookahead lookbehind.. etc

···

On 12/27/06, gaurav bagga <gaurav.v.bagga@gmail.com> wrote:

can you explain regexp has nested repeating patterns.

--
Simon Strandgaard
http://opcoders.com/

Topic		Replies	Views
Anchored Regexp get stalled or hung ruby-talk	6	111	19 November 2004
Ruby script hangs on regex match ruby-talk	5	109	23 July 2005
Regexp Error? ruby-talk	15	102	15 May 2004
Regexp and string halts ruby ruby-talk	7	138	26 January 2007
Irb regexp strangeness ruby-talk	4	65	23 August 2003

Reg exp comparison hanging irb

Related topics