Here is the puzzle. You have a string. You have a list of strings and/or
regular expressions. You want to find the match, if any, between your
string and an element in the list.
If the list is all just strings, you can create a hash with the strings as
keys then do a hash lookup to do the match. That's fast, even if the list
is very long.
However, what if most of the list is string literals, but you want to be
able to use regexps or some sort of wildcard matching for some of the
matches, too? Does anyone have any bright ideas about how to do this as
quickly as possible?
The only idea that I have come up with is to put the literal matches in a
hash, and then have the regular expressions in an array. If there isn't a
literal match, then one has to accept the time consuming process of
iterating through each regexp and checking it. Can anyone think of any
other approaches that might be faster?
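The hybrid described above might be sketched like this (class and method names are invented for illustration):

```ruby
# A hypothetical hybrid matcher: exact strings go in a hash for O(1)
# lookup; patterns sit in an array that is scanned only on a hash miss.
class HybridMatcher
  def initialize(literals, regexps)
    @literals = literals.each_with_object({}) { |s, h| h[s] = true }
    @regexps  = regexps
  end

  # Returns the matching literal or Regexp, or nil if nothing matches.
  def match(str)
    return str if @literals.key?(str)   # fast path: hash lookup
    @regexps.find { |re| re =~ str }    # slow path: linear regexp scan
  end
end

m = HybridMatcher.new(%w[foo bar], [/\Aba.+\z/, /qux/])
m.match("foo")     # literal hit
m.match("bazzz")   # falls through to the regexp scan
```

The worst case (no literal hit) still iterates every regexp, which is exactly the cost being asked about.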
If you only need to know that there is a match, rather than what
matched, how about combining all the regexps into one big regexp using
((re1)|(re2)|...)?
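That combination can be sketched in Ruby, where Regexp.union builds the big alternation (the pattern list here is made up):

```ruby
# Hypothetical pattern list; Regexp.union handles the combining.
patterns = [/\Afoo\d+/, /bar\z/, /qux/]
combined = Regexp.union(patterns)

combined.match?("xbar")   # a single fast yes/no test

# To learn *which* alternative matched, wrap each source in a group
# (note: this drops any per-pattern options such as /i):
grouped = Regexp.new(patterns.map { |re| "(#{re.source})" }.join("|"))
m = grouped.match("xbar")
winner = m && (1..patterns.size).find { |i| m[i] }   # 1-based group index
```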
martin
···
Kirk Haines <khaines@enigo.com> wrote:
The only idea that I have come up with is to put the literal matches in a
hash, and then have the regular expressions in an array. If there isn't a
literal match, then one has to accept the time consuming process of
iterating through each regexp and checking it. Can anyone think of any
other approaches that might be faster?
In general, this is an intractable problem. After all, a single regular expression can match any substring and, in the worst case, they all could. The best solution I know for this problem needs a more powerful facility for handling regular expressions than is normally provided by a programming language.
In what follows, I will talk about the finite-state automata (fsa) equivalent to regular expressions because that is what we need. So ...
1. Make an fsa E = e1 | e2 | ... | en for the union of the set you are
interested in.
2. From this create F = Sigma* E by creating a loop for every character
in the alphabet at the start state of E. Make F deterministic.
3. Create R as the reverse of E; that is, it recognizes strings that
match the regular expressions of interest, but read from right to
left. We need to annotate the final states of this automaton with a
list of the original expressions that are recognized when the
automaton enters that state. R also needs to be deterministic.
4. If we run F against the string, it will be in a final state just
in case the character it has just examined ends a substring that
matches (at least) one of the original expressions. Since F is
deterministic, it is therefore possible to find all these places in
linear time and, in fact, after examining each character exactly
once. The trouble is that you cannot tell which expressions match
at such a location, because, if you are k characters into the string,
there could be k different places at which a match began.
5. Starting at each location identified in step 4, run R backwards
through the string. Every time it enters a final state, it
identifies the starting point of a substring that ends
where this backwards match started and, thanks to its special
annotations, tells you which re(s) you have matched.
This is, of course, intractable in perverse cases, because step 5 can still start and finish everywhere. But it is close to linear in all but the very worst cases. Notice, for what it is worth, that for each string position at which you run step 5, at least one match is guaranteed. The problem is that you cannot stop when you find the first one, because there could be more.
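Ruby does not expose the automaton constructions described above, but the two-pass shape can be roughly approximated with the ordinary engine. This is only a sketch: the forward scan is not the linear-time deterministic pass, and the backward step recovers one start per (pattern, end) pair rather than all of them.

```ruby
# Rough approximation of the two-pass idea with Ruby's regexp engine.
# Pass 1 stands in for F = Sigma* E: collect positions where some
# pattern's match *ends*. Pass 2 stands in for the reverse automaton R:
# anchor each original pattern so it must end at that position, which
# recovers which pattern matched and where its match began.
def all_matches(str, patterns)
  forward = Regexp.union(patterns)
  ends = []
  pos = 0
  while (m = forward.match(str, pos))
    ends << m.end(0)
    pos = m.begin(0) + 1            # step past the start to allow overlaps
  end
  results = []
  ends.uniq.each do |e|
    patterns.each do |re|
      # \z on a sliced prefix forces the match to end exactly at e
      if (b = /#{re}\z/.match(str[0, e]))
        results << [re, b.begin(0), e]
      end
    end
  end
  results
end
```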
--Martin
···
On Aug 12, 2004, at 6:11 PM, Kirk Haines wrote:
> The only idea that I have come up with is to put the literal matches in a
> hash, and then have the regular expressions in an array. If there isn't a
> literal match, then one has to accept the time consuming process of
> iterating through each regexp and checking it. Can anyone think of any
> other approaches that might be faster?
If you only need to know that there is a match, rather than what
matched, how about combining all the regexps into one big regexp using
((re1)|(re2)|...)?
Hmmm. Interesting idea. I'll have to give it a try and see if there is a
performance difference when doing that.
Thanks,
Kirk Haines
···
On Fri, 13 Aug 2004 01:41:10 +0900, Martin DeMello wrote
"Kirk Haines" <khaines@enigo.com> schrieb im Newsbeitrag
news:20040812173551.M70191@enigo.com...
>
> > The only idea that I have come up with is to put the literal matches in a
> > hash, and then have the regular expressions in an array. If there isn't a
> > literal match, then one has to accept the time consuming process of
> > iterating through each regexp and checking it. Can anyone think of any
> > other approaches that might be faster?
>
> If you only need to know that there is a match, rather than what
> matched, how about combining all the regexps into one big regexp using
> ((re1)|(re2)|...)?
Hmmm. Interesting idea. I'll have to give it a try and see if there is a
performance difference when doing that.
I was going to suggest the same: a hash and a combined regexp. You could
even create a single regexp that covers all of them - if that doesn't blow
up the regexp engine. I once made a script that takes a list of strings and
creates a single regexp from them that is supposedly matched fast (i.e.
doesn't need backtracking). If you're interested I can check at work
tomorrow - I'm quite sure I still have it somewhere (please email me as a
reminder).
Kind regards
robert
···
On Fri, 13 Aug 2004 01:41:10 +0900, Martin DeMello wrote
> Kirk Haines <khaines@enigo.com> wrote:
I once made a script that takes a list of strings and creates a
single regexp from them that is supposedly matched fast (i.e. doesn't need
backtracking). If you're interested I can check at work tomorrow - I'm
quite sure I still have it somewhere (please email as reminder).
Is this a trie optimization? I've written something similar for Regexp::English but the optimization itself is quite slow and takes up lots of resources. (Because it needs to build the trie.)
It can do this:
irb(main):006:0> re
=> /foobar|foobark|foobard|foobarqux|food|fool|foolish/
irb(main):007:0> re.optimize
=> /foo(?:l(?:ish)?|bar(?:qux|[kd])?|d)/
It was able to create an optimized Regexp that matches the first few hundred English words from a words file -- I've attached it.
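The trie idea being discussed here can be sketched in a few lines (helper names are made up): insert each word into a character trie, then walk the trie and emit a regexp with the shared prefixes factored out of the alternation.

```ruby
# Build a character trie from the words, then emit it as a regexp
# with shared prefixes factored out.
def trie_regexp(words)
  trie = {}
  words.each do |w|
    node = w.chars.inject(trie) { |n, c| n[c] ||= {} }
    node[:end] = true                      # mark "a word stops here"
  end
  Regexp.new("\\A#{emit(trie)}\\z")
end

def emit(node)
  stop = node.key?(:end)
  alts = node.map { |c, child|
    next if c == :end
    Regexp.escape(c) + emit(child)
  }.compact
  return "" if alts.empty?                 # leaf: nothing follows
  body = alts.size > 1 || stop ? "(?:#{alts.join('|')})" : alts.first
  stop ? "#{body}?" : body                 # continuation optional at a stop
end

trie_regexp(%w[foobar foobark food fool foolish])
```

This sketch only handles whole-word matching (hence the \A...\z anchors); real regexp inputs, as opposed to plain words, would need the fancier merging Kirk describes.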
"Florian Gross" <flgr@ccan.de> schrieb im Newsbeitrag
news:2o235cF4pgrgU1@uni-berlin.de...
Robert Klemme wrote:
> I once made a script that takes a list of strings and creates a
> single regexp from them that is supposedly matched fast (i.e. doesn't need
> backtracking). If you're interested I can check at work tomorrow - I'm
> quite sure I still have it somewhere (please email as reminder).
Is this a trie optimization?
Yep.
I've written something similar for
Regexp::English but the optimization itself is quite slow and takes up
lots of resources. (Because it needs to build the trie.)
Well, it was ok in my case because the word list didn't change and I put
the generated regexp into code. Although I'm not sure that it was really
that slow. Lemme see... Theoretically it should be around O(n*m),
with n the number of words and m the average word length.
It can do this:
irb(main):006:0> re
=> /foobar|foobark|foobard|foobarqux|food|fool|foolish/
irb(main):007:0> re.optimize
=> /foo(?:l(?:ish)?|bar(?:qux|[kd])?|d)/
Exactly. Only that my solution took a set of words as input.
It was able to create an optimized Regexp that matches the first few
hundred English words from a words file -- I've attached it.
I would love to see your script, Robert. I was flirting in my head with the
thought that if one took all of the regexes, broke them into shared pieces,
and made a tree out of them, one could walk down the tree of pieces until
one found the complete regex that made the match. Neat to see my brain
wasn't totally off base with the idea. So I want to take the regexp your
script builds and benchmark it against the current state of affairs:
simple hash-based fixed-string matching, as well as the basic
match-the-fixed-strings-then-iterate-through-the-regexes approach, and see
how overall performance shakes out.
Thanks!
Kirk Haines
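The benchmark Kirk describes might be roughed out like this, using the stdlib Benchmark module (all data and names here are synthetic stand-ins):

```ruby
require 'benchmark'

# Synthetic stand-ins for the real data (every name is invented).
WORDS    = ("aa".."dz").map { |s| "item_#{s}" }
PATTERNS = [/\Awild_.+\z/, /\Aitem_z.\z/]
LOOKUP   = WORDS.each_with_object({}) { |w, h| h[w] = true }
COMBINED = Regexp.union(WORDS.map { |w| /\A#{Regexp.escape(w)}\z/ } + PATTERNS)
INPUTS   = WORDS.first(50) + ["wild_card", "no_match_here"] * 25

Benchmark.bm(16) do |b|
  b.report("hash + iterate") do
    1_000.times do
      INPUTS.each { |s| LOOKUP[s] || PATTERNS.find { |re| re =~ s } }
    end
  end
  b.report("single regexp") do
    1_000.times { INPUTS.each { |s| COMBINED =~ s } }
  end
end
```

The two strategies answer the same yes/no question, so their results can be cross-checked before trusting the timings.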
···
On Fri, 13 Aug 2004 17:36:11 +0900, Robert Klemme wrote
Yep.
> I've written something similar for
> Regexp::English but the optimization itself is quite slow and takes up
> lots of resources. (Because it needs to build the trie.)
Well, it was ok in my case because the word list didn't change and I
put the generated regexp into code. Although I'm not sure that it
was really that slow. Lemmesee... IMHO Theoretically it should be
around O(n*m) with n the number of words and m the average word length.
"Kirk Haines" <khaines@enigo.com> schrieb im Newsbeitrag
news:20040813142908.M67018@enigo.com...
> Yep.
>
> > I've written something similar for
> > Regexp::English but the optimization itself is quite slow and takes up
> > lots of resources. (Because it needs to build the trie.)
>
> Well, it was ok in my case because the word list didn't change and I
> put the generated regexp into code. Although I'm not sure that it
> was really that slow. Lemme see... Theoretically it should be
> around O(n*m), with n the number of words and m the average word length.
I would love to see your script, Robert. I was flirting in my head with the
thought that if one took all of the regexes, broke them into shared pieces,
and made a tree out of them, one could walk down the tree of pieces until
one found the complete regex that made the match. Neat to see my brain
wasn't totally off base with the idea. So I want to take the regexp your
script builds and benchmark it against the current state of affairs:
simple hash-based fixed-string matching, as well as the basic
match-the-fixed-strings-then-iterate-through-the-regexes approach, and see
how overall performance shakes out.
Ha, found it! Nothing gets lost on a well-sorted and well-sized hard disk.
:-))
(see attachments: one is the script, and it needs the tree implementation in
the other file)