Here is the puzzle. You have a string. You have a list of strings and/or
regular expressions. You want to find the match, if any, between your
string and an element in the list.
If the list is all just strings, you can create a hash with the strings as
keys then do a hash lookup to do the match. That's fast, even if the list
is very long.
However, what if most of the list is string literals, but you want to be
able to use regexps or some sort of wildcard matching for some of the
matches, too? Does anyone have any bright ideas about how to do this as
quickly as possible?
The only idea that I have come up with is to put the literal matches in a
hash, and then have the regular expressions in an array. If there isn't a
literal match, then one has to accept the time consuming process of
iterating through each regexp and checking it. Can anyone think of any
other approaches that might be faster?
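The hybrid described above might be sketched like this (class and method names are invented for illustration):

```ruby
# A hypothetical hybrid matcher: exact strings go in a hash for O(1)
# lookup; patterns sit in an array that is scanned only on a hash miss.
class HybridMatcher
  def initialize(literals, regexps)
    @literals = literals.each_with_object({}) { |s, h| h[s] = true }
    @regexps  = regexps
  end

  # Returns the matching literal or Regexp, or nil if nothing matches.
  def match(str)
    return str if @literals.key?(str)   # fast path: hash lookup
    @regexps.find { |re| re =~ str }    # slow path: linear regexp scan
  end
end

m = HybridMatcher.new(%w[foo bar], [/\Aba.+\z/, /qux/])
m.match("foo")     # literal hit
m.match("bazzz")   # falls through to the regexp scan
```

The worst case (no literal hit) still iterates every regexp, which is exactly the cost being asked about.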
If you only need to know that there is a match, rather than what
matched, how about combining all the regexps into one big regexp using
((re1)|(re2)|...)?
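That combination can be sketched in Ruby, where Regexp.union builds the big alternation (the pattern list here is made up):

```ruby
# Hypothetical pattern list; Regexp.union handles the combining.
patterns = [/\Afoo\d+/, /bar\z/, /qux/]
combined = Regexp.union(patterns)

combined.match?("xbar")   # a single fast yes/no test

# To learn *which* alternative matched, wrap each source in a group
# (note: this drops any per-pattern options such as /i):
grouped = Regexp.new(patterns.map { |re| "(#{re.source})" }.join("|"))
m = grouped.match("xbar")
winner = m && (1..patterns.size).find { |i| m[i] }   # 1-based group index
```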
martin
···
Kirk Haines <khaines@enigo.com> wrote:
The only idea that I have come up with is to put the literal matches in a
hash, and then have the regular expressions in an array. If there isn't a
literal match, then one has to accept the time consuming process of
iterating through each regexp and checking it. Can anyone think of any
other approaches that might be faster?
In general, this is an intractable problem. After all, a single regular expression can match any substring and, in the worst case, they all could. The best solution I know for this problem needs a more powerful facility for handling regular expressions than is normally provided by a programming language.
In what follows, I will talk about the finite-state automata (fsa) equivalent to regular expressions because that is what we need. So ...
1. Make an fsa E = e1 | e2 | ... | en for the union of the set you are
interested in.
2. From this create F = Sigma* E by creating a loop for every character
in the alphabet at the start state of E. Make F deterministic.
3. Create R as the reverse of E; that is, it recognizes strings that
match the regular expressions of interest, but read from right to
left. We need to annotate the final states of this automaton with a
list of the original expressions that are recognized when the
automaton enters that state. R also needs to be deterministic.
4. If we run F against the string, it will be in a final state just
in case the character it has just examined ends a substring that
matches (at least) one of the original expressions. Since F is
deterministic, it is therefore possible to find all these places in
linear time and, in fact, after examining each character exactly
once. The trouble is that you cannot tell which expressions match
at such a location, because, if you are k characters into the string,
there could be k different places at which a match began.
5. Starting at each location identified in step 4, run R backwards
through the string. Every time it enters a final state, it
identifies the starting point of a substring that ends
where this backwards match started and, thanks to its special
annotations, tells you which re(s) you have matched.
This is, of course, intractable in perverse cases, because step 5 can still start and finish everywhere. But it is close to linear in all but the very worst cases. Notice, for what it is worth, that for each string position at which you run step 5, at least one match is guaranteed. The problem is that you cannot stop when you find the first one, because there could be more.
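Ruby does not expose the automaton constructions described above, but the two-pass shape can be roughly approximated with the ordinary engine. This is only a sketch: the forward scan is not the linear-time deterministic pass, and the backward step recovers one start per (pattern, end) pair rather than all of them.

```ruby
# Rough approximation of the two-pass idea with Ruby's regexp engine.
# Pass 1 stands in for F = Sigma* E: collect positions where some
# pattern's match *ends*. Pass 2 stands in for the reverse automaton R:
# anchor each original pattern so it must end at that position, which
# recovers which pattern matched and where its match began.
def all_matches(str, patterns)
  forward = Regexp.union(patterns)
  ends = []
  pos = 0
  while (m = forward.match(str, pos))
    ends << m.end(0)
    pos = m.begin(0) + 1            # step past the start to allow overlaps
  end
  results = []
  ends.uniq.each do |e|
    patterns.each do |re|
      # \z on a sliced prefix forces the match to end exactly at e
      if (b = /#{re}\z/.match(str[0, e]))
        results << [re, b.begin(0), e]
      end
    end
  end
  results
end
```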
--Martin
···
On Aug 12, 2004, at 6:11 PM, Kirk Haines wrote:
> The only idea that I have come up with is to put the literal matches in a
> hash, and then have the regular expressions in an array. If there isn't a
> literal match, then one has to accept the time consuming process of
> iterating through each regexp and checking it. Can anyone think of any
> other approaches that might be faster?
If you only need to know that there is a match, rather than what
matched, how about combining all the regexps into one big regexp using
((re1)|(re2)|...)?
Hmmm. Interesting idea. I'll have to give it a try and see if there is a
performance difference when doing that.
Thanks,
Kirk Haines
···
On Fri, 13 Aug 2004 01:41:10 +0900, Martin DeMello wrote
"Kirk Haines" <khaines@enigo.com> schrieb im Newsbeitrag
news:20040812173551.M70191@enigo.com...
>
> > The only idea that I have come up with is to put the literal matches in a
> > hash, and then have the regular expressions in an array. If there isn't a
> > literal match, then one has to accept the time consuming process of
> > iterating through each regexp and checking it. Can anyone think of any
> > other approaches that might be faster?
>
> If you only need to know that there is a match, rather than what
> matched, how about combining all the regexps into one big regexp using
> ((re1)|(re2)|...)?
Hmmm. Interesting idea. I'll have to give it a try and see if there is a
performance difference when doing that.
I was going to suggest the same: a hash and a combined regexp. You could
even create a single regexp that covers all of them - if that doesn't blow
up the regexp engine. I once made a script that takes a list of strings and
creates a single regexp from them that is supposedly matched fast (i.e.
doesn't need backtracking). If you're interested I can check at work
tomorrow - I'm quite sure I still have it somewhere (please email me as a
reminder).
Kind regards
robert
···
On Fri, 13 Aug 2004 01:41:10 +0900, Martin DeMello wrote
> Kirk Haines <khaines@enigo.com> wrote:
I once made a script that takes a list of strings and creates a
single regexp from them that is supposedly matched fast (i.e. doesn't need
backtracking). If you're interested I can check at work tomorrow - I'm
quite sure I still have it somewhere (please email as reminder).
Is this a trie optimization? I've written something similar for Regexp::English but the optimization itself is quite slow and takes up lots of resources. (Because it needs to build the trie.)
It can do this:
irb(main):006:0> re
=> /foobar|foobark|foobard|foobarqux|food|fool|foolish/
irb(main):007:0> re.optimize
=> /foo(?:l(?:ish)?|bar(?:qux|[kd])?|d)/
It was able to create an optimized Regexp that matches the first few hundred English words from a words file -- I've attached it.
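The trie idea being discussed here can be sketched in a few lines (helper names are made up): insert each word into a character trie, then walk the trie and emit a regexp with the shared prefixes factored out of the alternation.

```ruby
# Build a character trie from the words, then emit it as a regexp
# with shared prefixes factored out.
def trie_regexp(words)
  trie = {}
  words.each do |w|
    node = w.chars.inject(trie) { |n, c| n[c] ||= {} }
    node[:end] = true                      # mark "a word stops here"
  end
  Regexp.new("\\A#{emit(trie)}\\z")
end

def emit(node)
  stop = node.key?(:end)
  alts = node.map { |c, child|
    next if c == :end
    Regexp.escape(c) + emit(child)
  }.compact
  return "" if alts.empty?                 # leaf: nothing follows
  body = alts.size > 1 || stop ? "(?:#{alts.join('|')})" : alts.first
  stop ? "#{body}?" : body                 # continuation optional at a stop
end

trie_regexp(%w[foobar foobark food fool foolish])
```

This sketch only handles whole-word matching (hence the \A...\z anchors); real regexp inputs, as opposed to plain words, would need the fancier merging Kirk describes.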
"Florian Gross" <flgr@ccan.de> schrieb im Newsbeitrag
news:2o235cF4pgrgU1@uni-berlin.de...
Robert Klemme wrote:
> I once made a script that takes a list of strings and creates a
> single regexp from them that is supposedly matched fast (i.e. doesn't need
> backtracking). If you're interested I can check at work tomorrow - I'm
> quite sure I still have it somewhere (please email as reminder).
Is this a trie optimization?
Yep.
I've written something similar for
Regexp::English but the optimization itself is quite slow and takes up
lots of resources. (Because it needs to build the trie.)
Well, it was ok in my case because the word list didn't change and I put
the generated regexp into code. Although I'm not sure that it was really
that slow. Lemme see... Theoretically it should be around O(n*m),
with n the number of words and m the average word length.
It can do this:
irb(main):006:0> re
=> /foobar|foobark|foobard|foobarqux|food|fool|foolish/
irb(main):007:0> re.optimize
=> /foo(?:l(?:ish)?|bar(?:qux|[kd])?|d)/
Exactly. Only that my solution took a set of words as input.
It was able to create an optimized Regexp that matches the first few
hundred English words from a words file -- I've attached it.
I would love to see your script, Robert. I was flirting in my head with the
thought that if one took all of the regexes, broke them into shared pieces,
and made a tree out of them, one could walk down the tree of pieces until
one found the complete regex that made the match. Neat to see my brain
wasn't totally off base with the idea. So I want to take the regexp your
script builds and benchmark it against the current state of affairs:
simple hash-based fixed-string matching, as well as the basic
match-the-fixed-strings-then-iterate-through-the-regexes approach, and see
how overall performance shakes out.
Thanks!
Kirk Haines
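The benchmark Kirk describes might be roughed out like this, using the stdlib Benchmark module (all data and names here are synthetic stand-ins):

```ruby
require 'benchmark'

# Synthetic stand-ins for the real data (every name is invented).
WORDS    = ("aa".."dz").map { |s| "item_#{s}" }
PATTERNS = [/\Awild_.+\z/, /\Aitem_z.\z/]
LOOKUP   = WORDS.each_with_object({}) { |w, h| h[w] = true }
COMBINED = Regexp.union(WORDS.map { |w| /\A#{Regexp.escape(w)}\z/ } + PATTERNS)
INPUTS   = WORDS.first(50) + ["wild_card", "no_match_here"] * 25

Benchmark.bm(16) do |b|
  b.report("hash + iterate") do
    1_000.times do
      INPUTS.each { |s| LOOKUP[s] || PATTERNS.find { |re| re =~ s } }
    end
  end
  b.report("single regexp") do
    1_000.times { INPUTS.each { |s| COMBINED =~ s } }
  end
end
```

The two strategies answer the same yes/no question, so their results can be cross-checked before trusting the timings.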
···
On Fri, 13 Aug 2004 17:36:11 +0900, Robert Klemme wrote
Yep.
> I've written something similar for
> Regexp::English but the optimization itself is quite slow and takes up
> lots of resources. (Because it needs to build the trie.)
Well, it was ok in my case because the word list didn't change and I
put the generated regexp into code. Although I'm not sure that it
was really that slow. Lemmesee... IMHO Theoretically it should be
around O(n*m) with n the number of words and m the average word length.
"Kirk Haines" <khaines@enigo.com> schrieb im Newsbeitrag
news:20040813142908.M67018@enigo.com...
> Yep.
>
> > I've written something similar for
> > Regexp::English but the optimization itself is quite slow and takes up
> > lots of resources. (Because it needs to build the trie.)
>
> Well, it was ok in my case because the word list didn't change and I
> put the generated regexp into code. Although I'm not sure that it
> was really that slow. Lemme see... Theoretically it should be
> around O(n*m), with n the number of words and m the average word length.
I would love to see your script, Robert. I was flirting in my head with the
thought that if one took all of the regexes, broke them into shared pieces,
and made a tree out of them, one could walk down the tree of pieces until
one found the complete regex that made the match. Neat to see my brain
wasn't totally off base with the idea. So I want to take the regexp your
script builds and benchmark it against the current state of affairs:
simple hash-based fixed-string matching, as well as the basic
match-the-fixed-strings-then-iterate-through-the-regexes approach, and see
how overall performance shakes out.
Ha, found it! Nothing gets lost on a well-sorted and well-sized hard disk.
:-))
(see attachments: one is the script, and it needs the tree implementation in
the other file)