Ruby script hangs on regex match

Adrian,

I have a regex infinte loop kind of problem. I use
ruby 1.8.2. The regular expression I used was:

[tT]he\s+(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z]
)[\w\d\_]*((\s*(\,|and|or)\s*)*[\w\d\_]+(?:[a-zA-Z][\d
A-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*))\s+((\w+\s+){0,3}\s*
(proteins|genes|protein|gene))

I try to match this regex against the string, without
the quotes (the following should be a whole single
line):

"to this end , NK_CTL clones derived from four donors
( KK , GG , GF , and DP ) were tested for their
ability to lyse the TAP2_deficient RMA_S\HLA_E
cell_line incubated with serial_dilutions of the
VMAPRTLIL , VMAPRTLVL , VMAPRTLLL , and VMAPRALLL
peptides ."

    First off, let me point out that this is *not* an infinite loop, the
match will complete (returning nil) after an hour or so.

    Breaking your regular expression apart a little, you have:

01: [tT]he\s+
02: (
03: (
04: [\w\d\_]+
05: (?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])
06: [\w\d\_]*
07: (
08: (
09: \s*
10: (
11: \,|and|or
12: )
13: \s*
14: )*
15: [\w\d\_]+
16: (?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])
17: [\w\d\_]*
18: )*
19: )
20: )
21: \s+
22: (
23: (\w+\s+){0,3}
24: \s*
25: (proteins|genes|protein|gene)
26: )

    A few minor things you can do to simplify:

    * The backslash before the underscore in lines 4, 6, 15, and 17 is
unnecessary.
    * The backslash before the comma in line 11 is also unnecessary.
    * /[\w]/ includes /[\d_]/, so lines 4 and 15 can be simplified to
"\w+".
    * Likewise lines 6 and 17 can be simplified to "\w*".
    * Lines 3 and 19 are a redundant set of parenthesis and can be
removed.
    * Two uppercase letters match *both* sides of the "or" in lines 5
and 16. Rework these lines to remove the redundancy.

    Now, the relevant part of the string is "the VMAPRTLIL , VMAPRTLVL ,
VMAPRTLLL , and VMAPRALLL", and the main problem that the RE engine is
having is in lines 4 through 17. Since the relevant part of the string
doesn't have numbers, underscores, or lower case letters, let's change
all of the occurrences of "[\w\d\_]", "[a-zA-Z]", and "[\dA-Z]" to
simply "[A-Z]":

04: [A-Z]+
05: (?:[A-Z][A-Z]|[A-Z][A-Z])
06: [A-Z]*
07: (
08: (
09: \s*
10: (
11: \,|and|or
12: )
13: \s*
14: )*
15: [A-Z]+
16: (?:[A-Z][A-Z]|[A-Z][A-Z])
17: [A-Z]*
18: )*

    If we then remove lines 8 through 14 (which we can do since they can
match *zero* times), this all simplifies down to:

([A-Z]{3,})([A-Z]{3,})*

    Now, assuming we try to match this against the string "VMAPRTLIL",
we find that we could match it in six different ways:

["VMA", ["PRT", "LIL"] ]
["VMA", ["PRTLIL"] ]
["VMAP", ["RTLIL"] ]
["VMAPR", ["TLIL"] ]
["VMAPRT", ["LIL"] ]
["VMAPRTLIL"]

    With the original line 5, this number doubles, and with the original
line 16, the number doubles again. This means that there would be 24
ways to match the each nine-letter words. With four words in a row (and
matching separators), this makes 24^4, or 331,776 distinct paths through
the four words. I'm sure there are probably a few more paths in there
that I've missed, but you get the idea. When writing regular
expressions, be extremely careful of zero length matches - they almost
always get you into trouble.

    Now, if I understand what you were trying to achieve, you can change
the "*" in line 14 to a "+" instead, thus *requiring* a separator. This
will vastly reduce the number of distinct paths through lines 4 through
18, and "fixes" your given example.

    To summarize, you probably want something more along the lines of:

[tT]he\s+((\w+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])\w*((\s*(\,|and|or)\s*)
+\w+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])\w*)*))\s+((\w+\s+){0,3}\s*(prote
ins>genes>protein>gene))

    I hope this helps.

    - Warren Brown

Warren Brown wrote:

   I hope this helps.

It surely does, thank you very very much!