So, the matching appears to be much slower than O(n).
Isn't the whole point of regular expressions to be
fast and O(n)?
No. The point behind regular expressions is to provide a succinct
mini-language for matching text that can be expressed using a Type-3
grammar[1] without the pain of having to write the code to implement the
necessary FSM. Speed is a secondary consideration, and down to the
person implementing the regex engine and the person writing the regex.
Especially the person writing the regex.
Whenever my script encounters a long string,
it grinds to a halt.
Why is this?
You're making the matcher work harder than it needs to by not giving it
enough information about what it has to match. The bigger the file is,
the more it has to backtrack to match each line. See my note on anchors
below.
The regex engine will do as much as it can match the
Did I make the regular expression correctly?
Is there some way to optimize it?
Is there a problem with the matcher?
No, the matcher is fine.
<snip>
There's a few problems I can see straight off:
1. No anchors: without them, your regex ends up doing an awful lot more
work because it will end up attempting to match the pattern in the
middle of lines rather than only matching from the start of lines.
By putting '^' at the start of your regex to match the beginning of
the line, you'll see a big increase in performance. That's probably
your single biggest problem.
It's also possibly worth matching the end of the line with '$'.
2. To quote Jamie Zawinski:
"Some people, when confronted with a problem, think I know, I'll
use regular expressions. Now they have two problems."
If you don't need a regex, don't use them. In this case, try
reading the file in line by line (or slurp it all into memory if
it's not too big), and break the line up along spaces using the
split() method.
3. Get Jeffrey Freidl's Mastering Regular Expression[2]. You'll thank me
later.
K.
[1] Chomsky hierarchy - Wikipedia
[2] http://regex.info/
···
On Tue, Jul 25, 2006 at 03:59:32AM +0900, S3 wrote:
--
Keith Gaughan - kmgaughan@eircom.net - http://talideon.com/
QUARK:
The sound made by a well bred duck.