Ruby script hangs on regex match

Hello,

I have a regex infinte loop kind of problem. I use ruby 1.8.2. The regular expression I used was:

[tT]he\s+(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*((\s*(\,|and|or)\s*)*[\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*))\s+((\w+\s+){0,3}\s*(proteins|genes|protein|gene))

I try to match this regex against the string, without the quotes (the following should be a whole single line):

"to this end , NK_CTL clones derived from four donors ( KK , GG , GF , and DP ) were tested for their ability to lyse the TAP2_deficient RMA_S\HLA_E cell_line incubated with serial_dilutions of the VMAPRTLIL , VMAPRTLVL , VMAPRTLLL , and VMAPRALLL peptides ."

Normally the regex is not supposed to match against this particular string. What it does, it hangs with 100% CPU consumption, while for lots of other lines of text it works ok. Am I doing something wrong?

I tried doing the same regex in perl, it complained about the string having the \H unknown control sequence (it appears indeed in the text at some point). If I replace the "\H" by, let's say, "-H", the regular expression passes through without finding anything in Perl -- which is normal, as I said. Ruby hangs even if I change the "\" in "-".

Needless to say I would much appreciate some help on this one. Feel free to ask for explanation of that complicated regex if needed or any other information.

As attachment, the ruby script that hangs.

Best regards,
Adrian Dimulescu.

regex-hangs.rb (517 Bytes)

Hi Adrian,

I have a regex infinte loop kind of problem. I use ruby
1.8.2. The regular expression I used was:

[tT]he\s+(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])
[\w\d\_]*((\s*(\,|and|or)\s*)*[\w\d\_]+(?:[a-zA-Z][\dA-Z]
>[\dA-Z][a-zA-Z])[\w\d\_]*)*))\s+((\w+\s+){0,3}\s*
(proteins|genes|protein|gene))

I just wanted to point out that if you add the /x flag to
the regexp, you can insert whitespace and comments at will.
That can sometimes make the expression pseudo-readable.

···

--
Daniel Brockman <daniel@brockman.se>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.

snip >>>>>

..
((\s*(\,|and|or)\s*)*
..
..
..
(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*))
..

snap >>>>>

When I look at this parts of your Regex, I see constructs like (a*|b*)* - These constructs consume some
billion of years if applied to a string where no match is possible. Take a closer look to "the Friedl" for
avoiding situations like this.

Maybe I'm wrong, but there are severe problems to read the Regex.

Best Regards, Wolfgang Nadasi-Donner

Adrian Petru Dimulescu wrote:

Hello,

I have a regex infinte loop kind of problem. I use ruby 1.8.2. The
regular expression I used was:

[tT]he\s+(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*((\s*(\,|and|or)\s*)*[\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*))\s+((\w+\s+){0,3}\s*(proteins|genes|protein|gene))

I try to match this regex against the string, without the quotes (the
following should be a whole single line):

"to this end , NK_CTL clones derived from four donors ( KK , GG , GF ,
and DP ) were tested for their ability to lyse the TAP2_deficient
RMA_S\HLA_E cell_line incubated with serial_dilutions of the VMAPRTLIL ,
VMAPRTLVL , VMAPRTLLL , and VMAPRALLL peptides ."

regexp =
/ # "The " or "the "
  [Tt]he\s+
  # Protein consists of letters & digits. Must not be only digits.
  (?!\d+\s)
  [A-Za-z\d]+\s+
  (
    # ", "
    ,\s+
    # Optional "and "
    (and\s+)?
    # Protein consists of letters & digits. Must not be only digits.
    (?!\d+\s)
    [A-Za-z\d]+\s+
  )*
  # 0--3 adjectives
  ([A-Za-z]+\s+){0,3}
  (protein|gene)s?
/x

Daniel Brockman wrote:

I just wanted to point out that if you add the /x flag to
the regexp, you can insert whitespace and comments at will.
That can sometimes make the expression pseudo-readable.

Hello Daniel,

thanks for pointing that out, I didn't know it.

I will quickly explain this regex: it's supposed to match strings like:

"the hek2p , a2p , and b3p multidomain proteins".

A protein usually contains a digit inside or an uppercase.

Here is the whole regex:

[tT]he => the
\s+

(([\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]* => a protein
((\s*(\,|and|or)\s*)* => a comma-separator, "and", "or"
[\w\d\_]+(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z])[\w\d\_]*)*)) => another protein (this with the preceding make up a protein comma-separated sequence)

\s+

((\w+\s+){0,3}\s* => some adjectives (at most 3)

(proteins|genes|protein|gene))

I'll also detail the protein regex which occurs 2 times in the big regex:

([\w\d\_]+ => some letters or digits
(?:[a-zA-Z][\dA-Z]|[\dA-Z][a-zA-Z]) -> necessarily a digit or uppercase but with at least a letter before or after
[\w\d\_]* => some other letters or digits

the complexity of the second part of the protein regex is in order to avoid matching a simple number (only digits)

Here it is, I hope it's clearer. Please tell me if I should file a bug.

Best regards,
Adrian.

Wolfgang Nádasi-Donner wrote:

When I look at this parts of your Regex, I see constructs like (a*|b*)* - These constructs consume some
billion of years if applied to a string where no match is possible. Take a closer look to "the Friedl" for
avoiding situations like this.

Hello,

I am sure this regex is far from being optimized as I am really no
expert in regular expressions. It is also more like the kind of script
that only gets executed once.

But I do pass the regex onto thousands of lines and many of those do not
match. The not-matching happens very fast (hundreds of phrases treated
per second) - until that specific one.

Best regards,
Adrian.