Regular expression too big

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Peter

Peter Schrammel wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

You could optimize the regex a little for size, e.g. by factoring out common prefixes:

  (b(l(a|ub)|ar)|foo)...

Of course, that will only help if the | alternatives have a reasonable amount of redundancy. Alternatively, you could just break the whole thing into multiple expressions. Instead of

  if /first_part|second_part/ =~ text

You could try:

  if /first_part/ =~ text or /second_part/ =~ text

Maybe a trie would be useful?
http://rubyforge.org/projects/trie/
(or there's another trie at http://kzk9.net/software/miscprograms/ruby/\)

--Ken

···

On Sun, 12 Nov 2006 15:25:49 +0100, Peter Schrammel wrote:

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Peter

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

Peter Schrammel wrote:

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

It appears you are trying to match a word against a list of words. Yes? From
an efficiency standpoint, if the expression is really as shown (no real
regex syntax), why not create a hash out of the data and see if the desired
word is present that way? That would likely be much faster than running the
(non-regex) regex against each word in the input data.

When you post an question like this, it is always a good idea to reveal the
purpose as well as the method.

···

------------------------------------------

#!/usr/bin/ruby -w

data = File.read("/usr/share/dict/words")

words = data.split(%r{\s+}m)

hash = {}

words.each do |word|
   hash[word] = true
end

while true
   print "Enter a word:"
   word = STDIN.readline.chomp
   if hash[word]
      puts "Valid word."
   else
      puts "Not present."
   end
end

------------------------------------------

Now you have a hash that can be used for testing a list of words.

--
Paul Lutus
http://www.arachnoid.com

if you have many words to check for then consider using a

···

On 11/12/06, Peter Schrammel <peter.schrammel@gmx.de> wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

--
Simon Strandgaard

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

I friend of mine told me to suggest you to system() a perl script which does the job :stuck_out_tongue:

However, since he was just kidding, I think I'll suggest you something like this:

   class String
     def include_at_least_one_of?(words)
       words.find { |w| include? w }
     end
   end

   my_words = %w(bla blub foo bar)

   "zump blub asd".include_at_least_one_of? my_words # => "blub"
   "nah none included".include_at_least_one_of? my_words # => nil

You probably don't want to hack core classes in order to do this, but you get the idea.

···

On 12/nov/06, at 15:30, Peter Schrammel wrote:

Thanks for any hint

Peter

--
Gabriele Marrone

Jeffrey Schwab wrote:

Peter Schrammel wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

You could optimize the regex a little for size, e.g. by factoring out
common prefixes:

    (b(l(a|ub)|ar)|foo)...

Thought of that.

Of course, that will only help if the | alternatives have a reasonable
amount of redundancy. Alternatively, you could just break the whole
thing into multiple expressions. Instead of

    if /first_part|second_part/ =~ text

You could try:

    if /first_part/ =~ text or /second_part/ =~ text

Yes, that was my next thought but where to split? Just count the bytes
and splitt near 1 <<16?

Why is there a limitation at all? I implemented the same thing in perl
and it no complains ...
Is the regexp engine of perl that much better?

Thanks for the reply

This doesn't really help with the actual question of getting past
regex size limits, but are you sure that regexen are correct solution
to this problem? Unless I'm mistaken, the above match is going to be
horribly, painfully slow; the issue you're running into is probably an
indication that you might want to look elsewhere.

I don't want to make any silly claims without doing any benchmarking
on your data, but I would imagine that even doing an efficient search
on a sorted array of your words would give you better performance than
the regex search. You can move up from there into hashes or other
data structures for this sort of thing.

···

--
Lou

On one more thing: to implement substring search using a trie, when
adding words to the trie, you should generate appropriate back links so
that you can implement =~ use the Knuth-Morris-Pratt algorithm for matching.

--Ken

···

On Sun, 12 Nov 2006 19:15:12 +0000, Ken Bloom wrote:

On Sun, 12 Nov 2006 15:25:49 +0100, Peter Schrammel wrote:

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Peter

Maybe a trie would be useful?
http://rubyforge.org/projects/trie/
(or there's another trie at http://kzk9.net/software/miscprograms/ruby/\)

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

When you post an question like this, it is always a good idea to

reveal the

purpose as well as the method.

First of all: Thanks for the replies, I think I have enough input to
chew on.

And to reveal the purpose:
I'd like to match a LOT of words/strings (words with spaces) and
sometimes regexes against a lot of small and medium size( < 10000 byte )
strings.

The comparison with Perl was just a comparison of the regexp engines and
not the language itself. It was just that until now I had never to think
much about the underlying hardware .... because ruby did that for me. So
I was surprised by the "too big exception".

Peter

The only reason I didn't suggest that is becuase it can have false
positives.

--Ken Bloom
      ^^^^^ :wink:

···

On Mon, 13 Nov 2006 05:46:30 +0900, Simon Strandgaard wrote:

On 11/12/06, Peter Schrammel <peter.schrammel@gmx.de> wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

if you have many words to check for then consider using a
Bloom filter - Wikipedia

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

There's even a library for doing Bloom filters:
http://raa.ruby-lang.org/project/rbloomfilter/

(Bloom filters were invented by Burton Bloom, who is not even remotely
related to me)

···

Simon Strandgaard <neoneye@gmail.com> wrote:

On 11/12/06, Peter Schrammel <peter.schrammel@gmx.de> wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

if you have many words to check for then consider using a
Bloom filter - Wikipedia

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

Just a tiny style related suggestion. I would use any?() instead of find() there.

James Edward Gray II

···

On Nov 13, 2006, at 11:45 AM, Gabriele Marrone wrote:

  class String
    def include_at_least_one_of?(words)
      words.find { |w| include? w }
    end
  end

Irrespective of whether regex the best solution for your needs, it seems Oniguruma will improve the situation somewhat with respect to large regular expressions.

$ wc -w acd-holmes-return.txt
112110 acd-holmes-return.txt

$ cat bigregexp.rb
txt = File.read('acd-holmes-return.txt').split(' ')
re = /^(#{txt.map { |e| "#{Regexp.escape(e)}" }.join('|')})$/

p re =~ "beautiful"
p $&
__END__

$ ruby -v bigregexp.rb
ruby 1.8.5 (2006-08-25) [i686-linux]
bigregexp.rb:2: regular expression too big: /...[snipped].../ (RegexpError)

$ ruby9 -v bigregexp.rb
ruby 1.9.0 (2006-09-07) [i686-linux]
0
"beautiful"

···

On Sun, 12 Nov 2006 15:01:56 -0000, Peter Schrammel <peter.schrammel@gmx.de> wrote:

Why is there a limitation at all? I implemented the same thing in perl
and it no complains ...
Is the regexp engine of perl that much better?

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Peter Schrammel wrote:

Jeffrey Schwab wrote:

Peter Schrammel wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

You could optimize the regex a little for size, e.g. by factoring out
common prefixes:

    (b(l(a|ub)|ar)|foo)...

Thought of that.

Good for you.

Of course, that will only help if the | alternatives have a reasonable
amount of redundancy. Alternatively, you could just break the whole
thing into multiple expressions. Instead of

    if /first_part|second_part/ =~ text

You could try:

    if /first_part/ =~ text or /second_part/ =~ text

Yes, that was my next thought but where to split? Just count the bytes
and splitt near 1 <<16?

Probably better not to construct the mega-regex in the first place. For the record, finding yourself on the edges of the language's capacity like this might be a sign that refactoring is in order.

Even if you're going to stick with your current technique, but work around the size limitation, it's probably better not to build a megex (TM) you'll have to split up. As you put the pattern together, only add alternations as long as the cumulative size will be < 0x10000 (or a well-commented static constant with that value).

Why is there a limitation at all? I implemented the same thing in perl
and it no complains ...
Is the regexp engine of perl that much better?

As Friedl notes, Perl is darned close to being the ideal regex language. :slight_smile:

Ruby regexes aren't necessarily meant to be the one hammer that can drive every nail. If you want to be able to view every problem through a regex lens, you'll probably have to dig a little deeper than categorizing one language's engine as simply "better" than another. Big, static sizes like 2**16 are often used to avoid dynamic allocation, or otherwise improve runtime efficiency.

Also for the record: I'm a big fan of regexes. Though a lot of people complain about complexity or efficiency issues, I've never had a problem with either. I would be interested to see a comparison of the relative merits and limitations of various engines, e.g. regex lengths, benchmarks, big-O complexity, and ability to handle null bytes.

Peter Schrammel wrote:

When you post an question like this, it is always a good idea to

reveal the

purpose as well as the method.

First of all: Thanks for the replies, I think I have enough input to
chew on.

And to reveal the purpose:
I'd like to match a LOT of words/strings (words with spaces) and
sometimes regexes against a lot of small and medium size( < 10000 byte )
strings.

If you are going to compare strings to strings as well as strings to
regexes, maybe a two-tiered scheme would be better. First tier, simple
textual comparison using a hash of strings. Second tier, regexes.

This removes the ambiguity that a particular data string might pass a string
comparison but fail any of the provided regexes, or the opposite. It will
also be much faster than a gigantic list of strings and regexes, all
presented to the regex engine as though they were all regular expressions,
even though some are simple string comparisons.

I think this (e.g speed improvement) would have been true in Perl also.

···

--
Paul Lutus
http://www.arachnoid.com

Peter Schrammel <peter.schrammel@gmx.de> writes:

Why is there a limitation at all? I implemented the same thing in perl
and it no complains ...
Is the regexp engine of perl that much better?

I'd be very interested if you could post some benchmarks comparing
that huge regexp both in Perl's and Ruby's regexp engines. I think a
pretty new Perl actually optimizes the regexp into a trie, so it
should be loads faster than using an naive implementation.

···

Thanks for the reply

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Peter Schrammel wrote:

>> got problem with big regexes:
>> I have a regex of about 70000+ words concated with '|' that I'd like to
>> match as a regex. /bla|blub|foo|bar|.....(70000)/
>>
>> But unfortunately ruby gives me a 'regular expression too big' if I'm
>> trying to build such a thing.
>> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
>> regexes. Is there a way around this (without going down to 2000 words) ?
>>
>> Thanks for any hint

Jeffrey Schwab wrote:

> You could optimize the regex a little for size, e.g. by factoring out
> common prefixes:
>
> (b(l(a|ub)|ar)|foo)...

Peter Schrammel wrote:

Thought of that.

Have you seen:
"Converts a list of words to a regular expression with minimum
backtracking by joining words with common prefixes. It is a port
of the Perl module MakeRegex.pm by Hakan Kjellerstrand with
some improvements."
    http://raa.ruby-lang.org/project/makeregex/

YMMV; I have never used it on anything like the scale you are.

His regex isn't anchored to the beginning and end of the string. This makes
hashes useless for the kind of comparison he seems to want to perform.

--Ken

···

On Mon, 13 Nov 2006 00:07:52 -0800, Paul Lutus wrote:

Peter Schrammel wrote:

When you post an question like this, it is always a good idea to

reveal the

purpose as well as the method.

First of all: Thanks for the replies, I think I have enough input to
chew on.

And to reveal the purpose:
I'd like to match a LOT of words/strings (words with spaces) and
sometimes regexes against a lot of small and medium size( < 10000 byte )
strings.

If you are going to compare strings to strings as well as strings to
regexes, maybe a two-tiered scheme would be better. First tier, simple
textual comparison using a hash of strings. Second tier, regexes.

This removes the ambiguity that a particular data string might pass a string
comparison but fail any of the provided regexes, or the opposite. It will
also be much faster than a gigantic list of strings and regexes, all
presented to the regex engine as though they were all regular expressions,
even though some are simple string comparisons.

I think this (e.g speed improvement) would have been true in Perl also.

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

Jeffrey Schwab wrote:
> > You could optimize the regex a little for size, e.g. by factoring out
> > common prefixes:
> >
> > (b(l(a|ub)|ar)|foo)...

Peter Schrammel wrote:
> Thought of that.

Have you seen:
"Converts a list of words to a regular expression with minimum
backtracking by joining words with common prefixes. It is a port
of the Perl module MakeRegex.pm by Hakan Kjellerstrand with
some improvements."
    http://raa.ruby-lang.org/project/makeregex/

YMMV; I have never used it on anything like the scale you are.

In a little bit of testing here, it goes to long after about 8,000
words.

require 'makeregex'

20.times do |n|
  words = IO.readlines("/usr/share/dict/words")[0..(2 ** n)]

  start = Time.now

  r = Regexp.make(words)

  finish = Time.now

  puts "Took #{finish - start} seconds to convert #{words.size} words into a regex #{r.size} bytes long."

  "FOO".match(r)
end

Took 0.000372 seconds to convert 2 words into a regex 20 bytes long.
Took 0.000285 seconds to convert 3 words into a regex 25 bytes long.
Took 0.000359 seconds to convert 5 words into a regex 51 bytes long.
Took 0.000493 seconds to convert 9 words into a regex 86 bytes long.
Took 0.000973 seconds to convert 17 words into a regex 157 bytes long.
Took 0.001773 seconds to convert 33 words into a regex 285 bytes long.
Took 0.005386 seconds to convert 65 words into a regex 491 bytes long.
Took 0.00823 seconds to convert 129 words into a regex 933 bytes long.
Took 0.019234 seconds to convert 257 words into a regex 1876 bytes long.
Took 0.042557 seconds to convert 513 words into a regex 3856 bytes long.
Took 0.09146 seconds to convert 1025 words into a regex 7807 bytes long.
Took 0.196851 seconds to convert 2049 words into a regex 15669 bytes long.
Took 0.399155 seconds to convert 4097 words into a regex 32325 bytes long.
Took 0.968776 seconds to convert 8193 words into a regex 64671 bytes long.
foo:14:in `match': regular expression too big:
/(?:1(?:0(?:80\n|\-point\n|th\n)|...