Merging regular expressions

Hello all,

I've got a tricky puzzle.

Imagine I have a bunch of n RegExps r[] - they could be any valid ruby Regexps.
Say I want to match each of them in turn to a certain something s to
make sure that no two Regexps in the list both match s.
Now say that I want a meta regular expression m that is all the
Regexps merged together so that the following pseudo code wroks,

m = MetaRegExp.new
for each r { [i] m.add(r[i]) } # add makes sure that no two RegExps
would match the same input

m =~ s # returns an int representing which of the n original RegExps
would have matched

Any ideas? Is this possible? If not I will have to code it in C and I
would have to code the RegExp stuff from scratch, which fills me with
fear...

Thanks in advance,
   Anthony

(I hope I have explained myself clearly)

Anthony Durity asked:

Imagine I have a bunch of n RegExps r - they could be any valid ruby
Regexps.
Say I want to match each of them in turn to a certain something s to
make sure that no two Regexps in the list both match s.
Now say that I want a meta regular expression m that is all the
Regexps merged together so that the following pseudo code wroks,

m = MetaRegExp.new
for each r { [i] m.add(r[i]) } # add makes sure that no two RegExps
would match the same input

m =~ s # returns an int representing which of the n original RegExps
would have matched

Why not just use the array of regexps like this?

matches = r.map{|i| r =~ s }
matches.index(matches.select{|i| i }[0]) # returns the int you're after

Cheers,
Dave

Dave Burt wrote:

Anthony Durity asked:

Imagine I have a bunch of n RegExps r - they could be any valid
ruby Regexps.
Say I want to match each of them in turn to a certain something s to
make sure that no two Regexps in the list both match s.
Now say that I want a meta regular expression m that is all the
Regexps merged together so that the following pseudo code wroks,

m = MetaRegExp.new
for each r { [i] m.add(r[i]) } # add makes sure that no two RegExps
would match the same input

m =~ s # returns an int representing which of the n original RegExps
would have matched

Why not just use the array of regexps like this?

matches = r.map{|i| r =~ s }
matches.index(matches.select{|i| i }[0]) # returns the int you're
after

I guess you rather meant

matches = r.map {|i| i =~ s}
error = matches.compact.size != 1

Kind regards

    robert

Thanks Robert, thanks Dave,

I see what ye both mean. I was more thinking along the lines of a
situation like this. To avoid iterating over the array, I would like
to compress the regexps into a large regexp while still preserving the
identity of each regexp within the meta-regexp, something akin to what
happens when the scanner of a lexer is created.
So, imagine this trivial situation...

r[0] = /a/
r[1] = /b/
m = Regexp.union(r0.source, r1.source)
# _but_ when I go to check s with
m =~ s
# if it matches, i want it to tell me which of the original r[i] would
have matched!

See what I mean? Will I hack the Ruby source? Can this be done in
Ruby? Will I have to do it in C?

Thanks a million in advance again...
    Anthony

···

On 2/6/06, Robert Klemme <bob.news@gmx.net> wrote:

Dave Burt wrote:
> Anthony Durity asked:
>> Imagine I have a bunch of n RegExps r - they could be any valid
>> ruby Regexps.
>> Say I want to match each of them in turn to a certain something s to
>> make sure that no two Regexps in the list both match s.
>> Now say that I want a meta regular expression m that is all the
>> Regexps merged together so that the following pseudo code wroks,
>>
>> m = MetaRegExp.new
>> for each r { [i] m.add(r[i]) } # add makes sure that no two RegExps
>> would match the same input
>>
>> m =~ s # returns an int representing which of the n original RegExps
>> would have matched
>
> Why not just use the array of regexps like this?
>
> matches = r.map{|i| r =~ s }
> matches.index(matches.select{|i| i }[0]) # returns the int you're
> after

I guess you rather meant

matches = r.map {|i| i =~ s}
error = matches.compact.size != 1

Kind regards

   robert

Anthony Durity wrote:

Thanks Robert, thanks Dave,

I see what ye both mean. I was more thinking along the lines of a
situation like this. To avoid iterating over the array, I would like
to compress the regexps into a large regexp while still preserving the
identity of each regexp within the meta-regexp, something akin to what
happens when the scanner of a lexer is created.
So, imagine this trivial situation...

r[0] = /a/
r[1] = /b/
m = Regexp.union(r0.source, r1.source)
# _but_ when I go to check s with
m =~ s
# if it matches, i want it to tell me which of the original r[i] would
have matched!

Didn't you originally state that you wanted to ensure that not two regexps
match at the same time? I mean this is something different. What you now
state can be done with an alternation - but you'll only see one of your
sub regexps match, i.e. /(foo)|(bar)/.

See what I mean? Will I hack the Ruby source? Can this be done in
Ruby? Will I have to do it in C?

The first question is: can this be done with regexps? First of all I'd
like to hear the exact requirements before we go on with this. I have the
feeling that they are not clear yet - or at least that they have not
become clear to me.

Cheers

    robert

How about:

  class Regexp
    class Union
      def initialize( *regexen )
        @regexen = regexen
      end

      def <<( regex )
        @regexen << regex
      end

      def =~( other )
        regexen = @regexen.select{ |regex| regex =~ other }
        raise "ambiguous input" if regexen.size > 1
        return nil if @regexen.empty?
        return @regexen.index(regexen[0])
      end

      def ( index )
        @regexen[index]
      end
    end

    def self.union( *regexen )
      Regexp::Union.new( *regexen )
    end
  end

Usage:

  r =
  r << /a/
  r << /b/
  m = Regexp.union( *r )
  m =~ "a test" # => 0
  m =~ "test b" # => 1
  m =~ "none" # => nil
  m =~ "ambiguous" # raises "ambiguous input"

Jacob Fugal

···

On 2/6/06, Anthony Durity <anthony.durity@gmail.com> wrote:

r[0] = /a/
r[1] = /b/
m = Regexp.union(r0.source, r1.source)
# _but_ when I go to check s with
m =~ s
# if it matches, i want it to tell me which of the original r[i] would have matched!

Yes!

This can be done with alternation (which is what RegExp.union does)
but I really really really want to know which sub-regexp matched - I
can (maybe) sort out the derivative problem of conflicting sub-regexps
when the time comes.

So, think of the first requirement as wanting to know which sub-regexp
in an array of regexps that have been unioned or alternated together
triggers a match - the object is to save time iterating over an array
of regexps.

hmmm,
    Anthony

(I think racc must implent this algorithm somewhere, I know not where)

···

On 2/6/06, Robert Klemme <bob.news@gmx.net> wrote:

Anthony Durity wrote:
> Thanks Robert, thanks Dave,
>
> I see what ye both mean. I was more thinking along the lines of a
> situation like this. To avoid iterating over the array, I would like
> to compress the regexps into a large regexp while still preserving the
> identity of each regexp within the meta-regexp, something akin to what
> happens when the scanner of a lexer is created.
> So, imagine this trivial situation...
>
> r[0] = /a/
> r[1] = /b/
> m = Regexp.union(r0.source, r1.source)
> # _but_ when I go to check s with
> m =~ s
> # if it matches, i want it to tell me which of the original r[i] would
> have matched!

Didn't you originally state that you wanted to ensure that not two regexps
match at the same time? I mean this is something different. What you now
state can be done with an alternation - but you'll only see one of your
sub regexps match, i.e. /(foo)|(bar)/.

> See what I mean? Will I hack the Ruby source? Can this be done in
> Ruby? Will I have to do it in C?

The first question is: can this be done with regexps? First of all I'd
like to hear the exact requirements before we go on with this. I have the
feeling that they are not clear yet - or at least that they have not
become clear to me.

Cheers

   robert

Jacob, Brilliant!

But... see my last post, i want to avoid iteration by merging all
regexps into one large regexp. However, this is absolutely beautiful
and correct. Taken with the replies I got earlier this is a great
start.

Question... What would the following code output?

r =
r << /aaa/
r << /[ab][ab][ab]/
m = Regexp.union( *r )
m =~ "aaa" # => ?

(I think I basically want to be able to twiddle with the finite
automata the regexps make when they are compiled/created, I dunno if
this is possible)

Later,
    Anthony

···

On 2/6/06, Jacob Fugal <lukfugl@gmail.com> wrote:

On 2/6/06, Anthony Durity <anthony.durity@gmail.com> wrote:
> r[0] = /a/
> r[1] = /b/
> m = Regexp.union(r0.source, r1.source)
> # _but_ when I go to check s with
> m =~ s
> # if it matches, i want it to tell me which of the original r[i] would have matched!

How about:

class Regexp
   class Union
     def initialize( *regexen )
       @regexen = regexen
     end

     def <<( regex )
       @regexen << regex
     end

     def =~( other )
       regexen = @regexen.select{ |regex| regex =~ other }
       raise "ambiguous input" if regexen.size > 1
       return nil if @regexen.empty?
       return @regexen.index(regexen[0])
     end

     def ( index )
       @regexen[index]
     end
   end

   def self.union( *regexen )
     Regexp::Union.new( *regexen )
   end
end

Usage:

r =
r << /a/
r << /b/
m = Regexp.union( *r )
m =~ "a test" # => 0
m =~ "test b" # => 1
m =~ "none" # => nil
m =~ "ambiguous" # raises "ambiguous input"

Jacob Fugal

But... see my last post, i want to avoid iteration by merging all
regexps into one large regexp. However, this is absolutely beautiful
and correct. Taken with the replies I got earlier this is a great
start.

<snip>

(I think I basically want to be able to twiddle with the finite
automata the regexps make when they are compiled/created, I dunno if
this is possible)

Ah. Well in that case, I don't think you're gonna be able to play in
pure-ruby-land, since you'll need access to the regex engine itself.
You might be able to do it with a combination of one regex for
alternation ("do any match") and another for exclusion ("does just one
match"), but at that point you'll probably be negating any speed
savings over the iteration (I assume that's what you're after).

Question... What would the following code output?

r =
r << /aaa/
r << /[ab][ab][ab]/
m = Regexp.union( *r )
m =~ "aaa" # => ?

That should raise the "ambiguous input" exception, since both /aaa/ =~
"aaa" and /[ab][ab][ab]/ =~ "aaa" are true.

Jacob Fugal

···

On 2/6/06, Anthony Durity <anthony.durity@gmail.com> wrote:

Jacob, Brilliant!

But... see my last post, i want to avoid iteration by merging all
regexps into one large regexp. However, this is absolutely beautiful
and correct. Taken with the replies I got earlier this is a great
start.

If at all possible, I would use Strings:

  re = ['foo', 'bar', 'baz', 'quux']
  regexp = re.map {|r| "(#{r})"}.join '|'
  result = 'baz baa foo guggeli quux'.scan regexp

  p result

You may need to tweak the regexp/builder a bit to
for example ensure there is whitespace on either
side or something.

Question... What would the following code output?

r =
r << /aaa/
r << /[ab][ab][ab]/
m = Regexp.union( *r )
m =~ "aaa" # => ?

(I think I basically want to be able to twiddle with the finite
automata the regexps make when they are compiled/created, I dunno if
this is possible)

Later,
    Anthony

> > r[0] = /a/
> > r[1] = /b/
> > m = Regexp.union(r0.source, r1.source)
> > # _but_ when I go to check s with
> > m =~ s
> > # if it matches, i want it to tell me which of the original r[i] would have matched!
>
> <skip union code />
>
> Jacob Fugal

E

···

On 2006.02.07 02:24, Anthony Durity wrote:

On 2/6/06, Jacob Fugal <lukfugl@gmail.com> wrote:
> On 2/6/06, Anthony Durity <anthony.durity@gmail.com> wrote:

> Jacob, Brilliant!
>
> But... see my last post, i want to avoid iteration by merging all
> regexps into one large regexp. However, this is absolutely beautiful
> and correct. Taken with the replies I got earlier this is a great
> start.

If at all possible, I would use Strings:

  re = ['foo', 'bar', 'baz', 'quux']
  regexp = re.map {|r| "(#{r})"}.join '|'
  result = 'baz baa foo guggeli quux'.scan regexp

Bah.
   
   result = 'baz baa foo guggeli quux'.scan /#{regexp}/
   

  p result

You may need to tweak the regexp/builder a bit to
for example ensure there is whitespace on either
side or something.

>
> Question... What would the following code output?
>
> r =
> r << /aaa/
> r << /[ab][ab][ab]/
> m = Regexp.union( *r )
> m =~ "aaa" # => ?
>
> (I think I basically want to be able to twiddle with the finite
> automata the regexps make when they are compiled/created, I dunno if
> this is possible)
>
> Later,
> Anthony
>
> > > r[0] = /a/
> > > r[1] = /b/
> > > m = Regexp.union(r0.source, r1.source)
> > > # _but_ when I go to check s with
> > > m =~ s
> > > # if it matches, i want it to tell me which of the original r[i] would have matched!
> >
> > <skip union code />
> >
> > Jacob Fugal

E

···

On 2006.02.07 04:26, Eero Saynatkari wrote:

On 2006.02.07 02:24, Anthony Durity wrote:
> On 2/6/06, Jacob Fugal <lukfugl@gmail.com> wrote:
> > On 2/6/06, Anthony Durity <anthony.durity@gmail.com> wrote: