Regular expression too big

Brabuhr · 17 November 2006 22:33

I built a local version of 1.8.5 with the oniguruma engine:
http://raa.ruby-lang.org/project/oniguruma/

And re-ran (a slight variation of) my test program:

[~]$ ruby foo
Using the <undefined> regex engine.
Converted a list of 1 words into a regex 8 bytes long.
Converted a list of 2 words into a regex 36 bytes long.
Converted a list of 4 words into a regex 48 bytes long.
Converted a list of 8 words into a regex 73 bytes long.
Converted a list of 16 words into a regex 173 bytes long.
Converted a list of 32 words into a regex 352 bytes long.
Converted a list of 64 words into a regex 718 bytes long.
Converted a list of 128 words into a regex 1415 bytes long.
Converted a list of 256 words into a regex 2656 bytes long.
Converted a list of 512 words into a regex 5210 bytes long.
Converted a list of 1024 words into a regex 10105 bytes long.
Converted a list of 2048 words into a regex 19432 bytes long.
Converted a list of 4096 words into a regex 37509 bytes long.
@_@

[~]$ /usr/local/bin/ruby foo
Using the Oniguruma regex engine.
Converted a list of 1 words into a regex 11 bytes long.
Converted a list of 2 words into a regex 16 bytes long.
Converted a list of 4 words into a regex 38 bytes long.
Converted a list of 8 words into a regex 97 bytes long.
Converted a list of 16 words into a regex 185 bytes long.
Converted a list of 32 words into a regex 359 bytes long.
Converted a list of 64 words into a regex 686 bytes long.
Converted a list of 128 words into a regex 1387 bytes long.
Converted a list of 256 words into a regex 2715 bytes long.
Converted a list of 512 words into a regex 5264 bytes long.
Converted a list of 1024 words into a regex 10074 bytes long.
Converted a list of 2048 words into a regex 19439 bytes long.
Converted a list of 4096 words into a regex 37452 bytes long.
Converted a list of 8192 words into a regex 71931 bytes long.
Converted a list of 16384 words into a regex 135572 bytes long.
Converted a list of 32768 words into a regex 253027 bytes long.
Converted a list of 65536 words into a regex 461607 bytes long.
Converted a list of 131072 words into a regex 808171 bytes long.
Converted a list of 262144 words into a regex 1326345 bytes long.
Converted a list of 479625 words into a regex 1873539 bytes long.

···

On 11/12/06, Ross Bamford <rosco@roscopeco.remove.co.uk> wrote:

On Sun, 12 Nov 2006 15:01:56 -0000, Peter Schrammel > <peter.schrammel@gmx.de> wrote:
> Why is there a limitation at all? I implemented the same thing in perl
> and it no complains ...
> Is the regexp engine of perl that much better?
>

Irrespective of whether regex the best solution for your needs, it seems
Oniguruma will improve the situation somewhat with respect to large
regular expressions.

Paul_Lutus · 13 November 2006 17:20

Ken Bloom wrote:

/ ...

I think this (e.g speed improvement) would have been true in Perl also.

His regex isn't anchored to the beginning and end of the string.

Yes, I see that now. From the appearance of the data, it looks as though he
is matching whole words, but that's only an assumption on my part.

This makes hashes useless for the kind of comparison he seems to want to
perform.

Yes, unless he is matching whole words, he is stuck with regexes. There is
very likely to be a refactoring for this problem, and it would have to
start with a clear statement of the problem to be solved.

···

--
Paul Lutus
http://www.arachnoid.com

Brabuhr · 17 November 2006 22:02

In a little bit of testing here, it goes to long after about 8,000
words.

> require 'makeregex'
>
> 20.times do |n|
> words = IO.readlines("/usr/share/dict/words")[0..(2 ** n)]

Of course, that isn't the best sample of words to test with.

Here is a revised version (fails with a smaller word count):

require 'makeregex'

class Array
  def sample n
    a = ; s = self.size
    n.times do
      a << self[rand(s)].chomp
    end
    a
  end
end

words = IO.readlines("/usr/share/dict/words")

20.times do |n|
  w = words.sample(2 ** n)

  start = Time.now

  r = Regexp.make(w)

  finish = Time.now

  puts "Took #{finish - start} seconds to convert #{w.size} words into a regex #{r.size} bytes long."

  "FOO".match(r)
end

Took 9.9e-05 seconds to convert 1 words into a regex 9 bytes long.
Took 0.000163 seconds to convert 2 words into a regex 18 bytes long.
Took 0.000169 seconds to convert 4 words into a regex 42 bytes long.
Took 0.000299 seconds to convert 8 words into a regex 80 bytes long.
Took 0.000618 seconds to convert 16 words into a regex 176 bytes long.
Took 0.001366 seconds to convert 32 words into a regex 324 bytes long.
Took 0.003021 seconds to convert 64 words into a regex 689 bytes long.
Took 0.007129 seconds to convert 128 words into a regex 1391 bytes long.
Took 0.019143 seconds to convert 256 words into a regex 2740 bytes long.
Took 0.149488 seconds to convert 512 words into a regex 5295 bytes long.
Took 0.304324 seconds to convert 1024 words into a regex 10236 bytes long.
Took 0.307832 seconds to convert 2048 words into a regex 19755 bytes long.
Took 0.404071 seconds to convert 4096 words into a regex 37308 bytes long.
foo:26:in `match': regular expression too big:
/(?:A(?:ME|c(?:arapis|kley|mispon)|...

Brabuhr · 23 November 2006 03:02

I thought I'd try running under jruby too:

$ /opt/ruby/v1.8.5-oniguruma/bin/ruby long_regex_test.rb
Took 0.000211 seconds to convert 1 words into a regex 5 bytes long.
Took 0.000334 seconds to convert 2 words into a regex 24 bytes long.
Took 0.000215 seconds to convert 4 words into a regex 52 bytes long.
Took 0.000836 seconds to convert 8 words into a regex 92 bytes long.
Took 0.000885 seconds to convert 16 words into a regex 173 bytes long.
Took 0.002779 seconds to convert 32 words into a regex 345 bytes long.
Took 0.004934 seconds to convert 64 words into a regex 725 bytes long.
Took 0.009765 seconds to convert 128 words into a regex 1369 bytes long.
Took 0.020761 seconds to convert 256 words into a regex 2737 bytes long.
Took 0.088759 seconds to convert 512 words into a regex 5408 bytes long.
Took 0.144276 seconds to convert 1024 words into a regex 10131 bytes long.
Took 0.246762 seconds to convert 2048 words into a regex 19531 bytes long.
Took 0.667575 seconds to convert 4096 words into a regex 37498 bytes long.
Took 1.677037 seconds to convert 8192 words into a regex 71352 bytes long.
Took 2.971277 seconds to convert 16384 words into a regex 133499 bytes long.
Took 6.078681 seconds to convert 32768 words into a regex 245318 bytes long.
Took 13.001538 seconds to convert 65536 words into a regex 433611 bytes long.
Took 26.791838 seconds to convert 131072 words into a regex 713229 bytes long.
Took 47.691109 seconds to convert 262144 words into a regex 1061186 bytes long.
Took 71.050324 seconds to convert 524288 words into a regex 1354567 bytes long.

$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home
$ ~/Desktop/jruby-0.9.1/bin/jruby long_regex_test.rb
Took 0.032 seconds to convert 1 words into a regex 9 bytes long.
Took 0.012 seconds to convert 2 words into a regex 18 bytes long.
Took 0.624 seconds to convert 4 words into a regex 40 bytes long.
Took 0.033 seconds to convert 8 words into a regex 95 bytes long.
Took 0.095 seconds to convert 16 words into a regex 156 bytes long.
Took 0.057 seconds to convert 32 words into a regex 358 bytes long.
Took 0.171 seconds to convert 64 words into a regex 743 bytes long.
Took 0.309 seconds to convert 128 words into a regex 1402 bytes long.
Took 0.40900000000000003 seconds to convert 256 words into a regex
2692 bytes long.
Took 1.863 seconds to convert 512 words into a regex 5341 bytes long.
Took 0.838 seconds to convert 1024 words into a regex 10328 bytes long.
Took 1.504 seconds to convert 2048 words into a regex 19733 bytes long.
Took 2.814 seconds to convert 4096 words into a regex 37334 bytes long.
Took 8.177 seconds to convert 8192 words into a regex 71593 bytes long.
Took 15.181000000000001 seconds to convert 16384 words into a regex
133779 bytes long.
Took 30.695 seconds to convert 32768 words into a regex 244280 bytes long.
Took 61.555 seconds to convert 65536 words into a regex 432751 bytes long.
Took 155.94400000000002 seconds to convert 131072 words into a regex
713573 bytes long.
Took 224.93 seconds to convert 262144 words into a regex 1060079 bytes long.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

···

On 11/17/06, brabuhr@gmail.com <brabuhr@gmail.com> wrote:

On 11/12/06, Ross Bamford <rosco@roscopeco.remove.co.uk> wrote:
> On Sun, 12 Nov 2006 15:01:56 -0000, Peter Schrammel > > <peter.schrammel@gmx.de> wrote:
> > Why is there a limitation at all? I implemented the same thing in perl
> > and it no complains ...
> > Is the regexp engine of perl that much better?
> >
>
> Irrespective of whether regex the best solution for your needs, it seems
> Oniguruma will improve the situation somewhat with respect to large
> regular expressions.

I built a local version of 1.8.5 with the oniguruma engine:
http://raa.ruby-lang.org/project/oniguruma/

And re-ran (a slight variation of) my test program:

Peter_Schrammel · 16 November 2006 23:05

Paul Lutus wrote:

Yes, unless he is matching whole words, he is stuck with regexes. There is
very likely to be a refactoring for this problem, and it would have to
start with a clear statement of the problem to be solved.

Sorry, my fault.

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.
The actual implementation is to concat the word with | and create a big
regexp. word1|word2|word3|word4 ...
This went well until I tested with some 100 words. Now I have the 'big
regexp problem' problem. The solution has to work with regexps as words
as well like: word1(the|a|an)word11|regexp2|...
So the currently working solutions are:
-use ruby 1.9 (the performance is far below perl, but I have to
benchmark this.)
-loop through the words/regexp and match each of them on the string
(don't ask for performance here).
-perhaps I'll realy pipe the problem into a
perl-implemented-matching-server (not that bad idea)

peter

Christopher_Aldridge · 23 November 2006 05:29

#would_never_do.rb
matched = `perl -e 'if ("Hello" =~ m/(Hello)/) {print "true";}'`

···

On 11/22/06, brabuhr@gmail.com <brabuhr@gmail.com> wrote:

On 11/17/06, brabuhr@gmail.com <brabuhr@gmail.com> wrote:
> On 11/12/06, Ross Bamford <rosco@roscopeco.remove.co.uk> wrote:
> > On Sun, 12 Nov 2006 15:01:56 -0000, Peter Schrammel > > > <peter.schrammel@gmx.de> wrote:
> > > Why is there a limitation at all? I implemented the same thing in perl
> > > and it no complains ...
> > > Is the regexp engine of perl that much better?
> > >
> >
> > Irrespective of whether regex the best solution for your needs, it seems
> > Oniguruma will improve the situation somewhat with respect to large
> > regular expressions.
>
> I built a local version of 1.8.5 with the oniguruma engine:
> http://raa.ruby-lang.org/project/oniguruma/
>
> And re-ran (a slight variation of) my test program:

I thought I'd try running under jruby too:

$ ruby long_regex_test.rb
Took 0.000153 seconds to convert 1 words into a regex 17 bytes long.
Took 0.000381 seconds to convert 2 words into a regex 20 bytes long.
Took 0.000393 seconds to convert 4 words into a regex 36 bytes long.
Took 0.000629 seconds to convert 8 words into a regex 93 bytes long.
Took 0.001359 seconds to convert 16 words into a regex 180 bytes long.
Took 0.002261 seconds to convert 32 words into a regex 360 bytes long.
Took 0.007304 seconds to convert 64 words into a regex 741 bytes long.
Took 0.013601 seconds to convert 128 words into a regex 1348 bytes long.
Took 0.028273 seconds to convert 256 words into a regex 2746 bytes long.
Took 0.066228 seconds to convert 512 words into a regex 5345 bytes long.
Took 0.177105 seconds to convert 1024 words into a regex 10017 bytes long.
Took 0.330573 seconds to convert 2048 words into a regex 19597 bytes long.
Took 1.390542 seconds to convert 4096 words into a regex 37345 bytes long.
long_regex_test.rb:26:in `match': regular expression too big:
/(?:A(?:cr(?:edula|opora)|d(?:ar|elochorda|ventis[mt])|frogaean|hepatokla|ileen|l(?:adinist|l(?:a(?:manda|sch)|otheria)|ticamelus)|m(?:bystomidae|ericanly|ioidei|phioxidae)|n(?:chisaurus|d(?:aman|romache)|olympiad|t(?:echinomys|h(?:ophila|ropozoic)))|patornis|r(?:ab|chelenis|istarch)|s(?:caridia|elli|hantee|ilidae|terias)|tropa|u(?:riculidae|stroasiatic))|B(?:a(?:cchus|eria|haism|iera|k(?:shaish|wiri)|re|sili(?:ca|scus))|e(?:atrice|l(?:g(?:ae|ic)|shazzaresque)|mbex|rn(?:inesque|oullian))|i(?:elid|lati|smarck|tis)|lackfoot|o(?:hemia|llandist|rrovian)|ra(?:m|nchiopulmonata)|u(?:nga|phthalmum)|yroni(?:cs|te))|C(?:a(?:ctales|l(?:edonia|li(?:carpa|stephus)|ochortaceae|vados|ycophorae)|m(?:bodian|orra)|ntabri|p(?:ito(?:line)?|sidae)|r(?:olan|tist)|s(?:sandra|tanospermum)|thari)|e(?:ntrarchidae|strian)|h(?:arontas|e(?:lura|makuan)|rist(?:ianomastix|li(?:keness|ness)|mas))|lathrus|o(?:bleskill

(RegexpError)
        from long_regex_test.rb:26
        from long_regex_test.rb:15:in `times'
        from long_regex_test.rb:15

$ /opt/ruby/v1.8.5-oniguruma/bin/ruby long_regex_test.rb
Took 0.000211 seconds to convert 1 words into a regex 5 bytes long.
Took 0.000334 seconds to convert 2 words into a regex 24 bytes long.
Took 0.000215 seconds to convert 4 words into a regex 52 bytes long.
Took 0.000836 seconds to convert 8 words into a regex 92 bytes long.
Took 0.000885 seconds to convert 16 words into a regex 173 bytes long.
Took 0.002779 seconds to convert 32 words into a regex 345 bytes long.
Took 0.004934 seconds to convert 64 words into a regex 725 bytes long.
Took 0.009765 seconds to convert 128 words into a regex 1369 bytes long.
Took 0.020761 seconds to convert 256 words into a regex 2737 bytes long.
Took 0.088759 seconds to convert 512 words into a regex 5408 bytes long.
Took 0.144276 seconds to convert 1024 words into a regex 10131 bytes long.
Took 0.246762 seconds to convert 2048 words into a regex 19531 bytes long.
Took 0.667575 seconds to convert 4096 words into a regex 37498 bytes long.
Took 1.677037 seconds to convert 8192 words into a regex 71352 bytes long.
Took 2.971277 seconds to convert 16384 words into a regex 133499 bytes long.
Took 6.078681 seconds to convert 32768 words into a regex 245318 bytes long.
Took 13.001538 seconds to convert 65536 words into a regex 433611 bytes long.
Took 26.791838 seconds to convert 131072 words into a regex 713229 bytes long.
Took 47.691109 seconds to convert 262144 words into a regex 1061186 bytes long.
Took 71.050324 seconds to convert 524288 words into a regex 1354567 bytes long.

$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home
$ ~/Desktop/jruby-0.9.1/bin/jruby long_regex_test.rb
Took 0.032 seconds to convert 1 words into a regex 9 bytes long.
Took 0.012 seconds to convert 2 words into a regex 18 bytes long.
Took 0.624 seconds to convert 4 words into a regex 40 bytes long.
Took 0.033 seconds to convert 8 words into a regex 95 bytes long.
Took 0.095 seconds to convert 16 words into a regex 156 bytes long.
Took 0.057 seconds to convert 32 words into a regex 358 bytes long.
Took 0.171 seconds to convert 64 words into a regex 743 bytes long.
Took 0.309 seconds to convert 128 words into a regex 1402 bytes long.
Took 0.40900000000000003 seconds to convert 256 words into a regex
2692 bytes long.
Took 1.863 seconds to convert 512 words into a regex 5341 bytes long.
Took 0.838 seconds to convert 1024 words into a regex 10328 bytes long.
Took 1.504 seconds to convert 2048 words into a regex 19733 bytes long.
Took 2.814 seconds to convert 4096 words into a regex 37334 bytes long.
Took 8.177 seconds to convert 8192 words into a regex 71593 bytes long.
Took 15.181000000000001 seconds to convert 16384 words into a regex
133779 bytes long.
Took 30.695 seconds to convert 32768 words into a regex 244280 bytes long.
Took 61.555 seconds to convert 65536 words into a regex 432751 bytes long.
Took 155.94400000000002 seconds to convert 131072 words into a regex
713573 bytes long.
Took 224.93 seconds to convert 262144 words into a regex 1060079 bytes long.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Simon_Strandgaard2 · 17 November 2006 01:35

Is it exact word matching?

if so then you can split the input-string on blankspace,
so you get an array of words. Then you can sort the array of words

irb(main):001:0> a = %w(a b c e f g)
=> ["a", "b", "c", "e", "f", "g"]
irb(main):002:0> b = %w(b d f)
=> ["b", "d", "f"]
irb(main):003:0> a & b
=> ["b", "f"]

···

On 11/17/06, Peter Schrammel <peter.schrammel@gmx.de> wrote:

Paul Lutus wrote:
> Yes, unless he is matching whole words, he is stuck with regexes. There is
> very likely to be a refactoring for this problem, and it would have to
> start with a clear statement of the problem to be solved.
>
Sorry, my fault.

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.
The actual implementation is to concat the word with | and create a big
regexp. word1|word2|word3|word4 ...

--
Simon Strandgaard

Paul_Lutus · 17 November 2006 05:45

Peter Schrammel wrote:

Paul Lutus wrote:

Yes, unless he is matching whole words, he is stuck with regexes. There
is very likely to be a refactoring for this problem, and it would have to
start with a clear statement of the problem to be solved.

Sorry, my fault.

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.

You are not being very clear. Do you mean to match entire words, letter for
letter, beginning to end, at least some of the time? If so, for those
cases, you can use a hash table that is preloaded with the words as keys.
That will be fast.

For the regexps, which one hopes are in the minority, the speed will of
course go down.

But separate the two classes of tests -- match whole words using a hash for
speed.

The actual implementation is to concat the word with | and create a big
regexp. word1|word2|word3|word4 ...

Yes, but you are moving ahead to implementation before stating the problem
to be solved.

This went well until I tested with some 100 words. Now I have the 'big
regexp problem' problem. The solution has to work with regexps as words
as well like: word1(the|a|an)word11|regexp2|...

Also precompile the regexes before use. You probably already know this, but
I thought I would mention it anyway.

Another option is C++, which has a readily available regexp library. The way
I would go about this is to design the entire thing in Ruby, then, if the
speed was not acceptable, recreate it in C++. This gives you the advantage
of speedy development in Ruby, followed by speedy execution.

If this is a full-on language analysis problem, you really should be using
Prolog or Lisp anyway. If the problem really is as complex as you are
hinting at, you may not be using the right language, or even class of
language.

···

--
Paul Lutus
http://www.arachnoid.com

Peter_Schrammel · 17 November 2006 18:15

Paul Lutus wrote:

Peter Schrammel wrote:

Paul Lutus wrote:

Yes, unless he is matching whole words, he is stuck with regexes. There
is very likely to be a refactoring for this problem, and it would have to
start with a clear statement of the problem to be solved.

Sorry, my fault.

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.

You are not being very clear. Do you mean to match entire words, letter for
letter, beginning to end, at least some of the time? If so, for those
cases, you can use a hash table that is preloaded with the words as keys.
That will be fast.

no its a substring match kind of /hello/.match("dear customer id like to
say hello to ...."). I have to find 70000 words in a large textfile:
if lotofwords.match(bigtext) .....
I need a trigger to classify the bitext depending on wether any of the
word (later regexps) are in it or not. Tries work well for substring
matching in this case but fail with regexps.

Also precompile the regexes before use. You probably already know this, but
I thought I would mention it anyway.

yes, done that.

Another option is C++, which has a readily available regexp library. The way
I would go about this is to design the entire thing in Ruby, then, if the
speed was not acceptable, recreate it in C++. This gives you the advantage
of speedy development in Ruby, followed by speedy execution.

If this is a full-on language analysis problem, you really should be using
Prolog or Lisp anyway. If the problem really is as complex as you are
hinting at, you may not be using the right language, or even class of
language.

yes, but I wanted to try it with "my favorit" language first before
doing something wild. It's a rails application so I'd have to call
lisp/perl/.... programms, interfacing them (calling on each test would
be a mess because reading somethousand lines every time is a performance
killer.)

Anyway, thanks for the replies I'll give a perl interface a try (due to
some other reasons like robust html parsing ...).

Giles_Bowkett · 17 November 2006 22:13

> If this is a full-on language analysis problem, you really should be using
> Prolog or Lisp anyway. If the problem really is as complex as you are
> hinting at, you may not be using the right language, or even class of
> language.
>
yes, but I wanted to try it with "my favorit" language first before
doing something wild. It's a rails application so I'd have to call
lisp/perl/.... programms, interfacing them (calling on each test would
be a mess because reading somethousand lines every time is a performance
killer.)

Anyway, thanks for the replies I'll give a perl interface a try (due to
some other reasons like robust html parsing ...).

there's got to be a better way to do this. I recently attempted to
implement a particle simulation algorithm for arranging nodes in a
graph in Rails, before I came to my senses.

solving this problem with regexes sounds like a nightmare on the
maintenance front. I think it's possible that you're using the wrong
language but more likely that you're just using the wrong technique.

···

--
Giles Bowkett
http://www.gilesgoatboy.org

Paul_Lutus · 18 November 2006 19:35

Peter Schrammel wrote:

/ ...

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.

You are not being very clear. Do you mean to match entire words, letter
for letter, beginning to end, at least some of the time? If so, for those
cases, you can use a hash table that is preloaded with the words as keys.
That will be fast.

no its a substring match kind of /hello/.match("dear customer id like to
say hello to ....").

Translation: "Yes, I am searching for matches with individual words or
groups of words, character by character, without always requiring regular
expressions." In this case, a hash approach is indicated for one tier of
your solution.

You need to realize that this:

/hello/.match("dear customer id like to say hello to ....")

Is fantastically, unbelievably, slower than this:

hash = [ "hello" => true,"goodbye" => true,"stay a while" => true ]

if hash[word] ...

All you have to do is tokenize your word list using "split" or another
similar approach, then apply the hash to it. You may have a need for
regular expressions, but it is clear that you can also use a hash and save
a load of run time.

I have to find 70000 words in a large textfile:
if lotofwords.match(bigtext) .....
I need a trigger to classify the bitext depending on wether any of the
word (later regexps) are in it or not. Tries work well for substring
matching in this case but fail with regexps.

At this point, to offer any useful advice, it would be nice to know what the
goal is, more specifically than we've heard up to now.

/ ...

Anyway, thanks for the replies I'll give a perl interface a try (due to
some other reasons like robust html parsing ...).

Oh, Ruby can parse HTML just fine, and in only a few lines.

···

--
Paul Lutus
http://www.arachnoid.com

Topic		Replies	Views
Slow regular expressions :( ruby-talk	28	135	28 July 2006
[QUIZ] DictionaryMatcher (#103) ruby-talk	17	75	29 November 2006
Matching against a zillion patterns ruby-talk	17	105	18 October 2009
Alternate Regular Expressions? ruby-talk	26	152	24 December 2009
Slow regular expressions ruby-talk	8	84	29 July 2006

Regular expression too big

Related topics