Regexp equality

Here's an oddity I recently came across in a unit test. Identical regexps declared with /.../ will be equivilent to each other, but not to the same regexps created with %r{...}:

   $ ruby -ve 'p( /\/blah/ == /\/blah/ )'
   ruby 1.8.2 (2004-07-29) [i686-linux]
   true

   $ ruby -ve 'p( %r{/blah} == %r{/blah} )'
   ruby 1.8.2 (2004-07-29) [i686-linux]
   true

   $ ruby -ve 'p( /\/blah/ == %r{/blah} )'
   ruby 1.8.2 (2004-07-29) [i686-linux]
   false

This appears to only be the case when there is a forward slash in the regexp. In other words:

   $ ruby -ve 'p( /blah/ == %r{blah} )'
   ruby 1.8.2 (2004-07-29) [i686-linux]
   true

Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

- Jamis

···

--
Jamis Buck
jgb3@email.byu.edu
http://www.jamisbuck.org/jamis

Hi,

···

In message "Re: Regexp equality" on Wed, 20 Oct 2004 03:07:28 +0900, Jamis Buck <jgb3@email.byu.edu> writes:

Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

Regexp#== compares literal appearance of regex, e.g.

  \/blah vs /blah

I consider this as a feature. I'm open for the discussion (as always).

              matz.

Yukihiro Matsumoto wrote:

Hi,

>Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

Regexp#== compares literal appearance of regex, e.g.

  \/blah vs /blah

I consider this as a feature. I'm open for the discussion (as always).

Is there a way to test regexen for equality based on the strings they would match? Perhaps that's too general, since it depends on the strings they would be given, but I guess I expected /\/blah/ to be equal to %r{/blah}.

I suppose /\/blah/.to_s would equal %r{/blah}.to_s? Would that be a better way to compare regexen?

- Jamis

···

In message "Re: Regexp equality" > on Wed, 20 Oct 2004 03:07:28 +0900, Jamis Buck <jgb3@email.byu.edu> writes:

--
Jamis Buck
jgb3@email.byu.edu
http://www.jamisbuck.org/jamis

Wow. I bet that's a pretty tall order. Pulling from the current Ruby Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that way. Maybe some internal representation, but I would but super impressed to see that.

Not at all saying I don't like the idea, just that it's a tall order.

James Edward Gray II

···

On Oct 19, 2004, at 1:53 PM, Jamis Buck wrote:

Is there a way to test regexen for equality based on the strings they would match?

James Edward Gray II <james@grayproductions.net> writes:

Is there a way to test regexen for equality based on the strings they
would match?

Wow. I bet that's a pretty tall order. Pulling from the current Ruby
Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that way.
Maybe some internal representation, but I would but super impressed to
see that.

Not at all saying I don't like the idea, just that it's a tall order.

Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

If I'm not mistaken, I believe that this can be shown to be a reduction
of the Halting Problem.

In other words, short of feeding every possible character string to both
regex's and comparing the two sets of matches, I don't think that there
is an algorithmic way to measure equivalence of two regex's which don't
have the same external representation.

···

On Oct 19, 2004, at 1:53 PM, Jamis Buck wrote:

--
Lloyd Zusman
ljz@asfast.com
God bless you.

Lloyd Zusman wrote:

James Edward Gray II <james@grayproductions.net> writes:

Is there a way to test regexen for equality based on the strings they
would match?

Wow. I bet that's a pretty tall order. Pulling from the current Ruby
Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that way.
Maybe some internal representation, but I would but super impressed to
see that.

Not at all saying I don't like the idea, just that it's a tall order.

Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

If I'm not mistaken, I believe that this can be shown to be a reduction
of the Halting Problem.

In other words, short of feeding every possible character string to both
regex's and comparing the two sets of matches, I don't think that there
is an algorithmic way to measure equivalence of two regex's which don't
have the same external representation.

I realize that particular question is most likely impossible to answer for the general case. What I was wanting, though, was to be able to know whether /\/blah/ is equal to %r{/blah}. Surely that's not too hard?

As I said, though, converting them both to strings gives the same result, so the strings, at least, are comparable with expected results.

- Jamis

···

On Oct 19, 2004, at 1:53 PM, Jamis Buck wrote:

--
Jamis Buck
jgb3@email.byu.edu
http://www.jamisbuck.org/jamis

As far as I can tell: If you run Antimirov's algorithm for NFA
construction, you can compare each extracted partial derivative and
see if it is equal. If all extracted partial derivatives are equal,
your regular expressions are identical. If any differ - they're
different.

Eivind.

···

On Wed, 20 Oct 2004 09:06:10 +0900, Lloyd Zusman <ljz@asfast.com> wrote:

Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

--
Hazzle free packages for Ruby?
RPA is available from http://www.rubyarchive.org/

Jamis Buck <jgb3@email.byu.edu> writes:

[ ... ]

I realize that particular question is most likely impossible to answer
for the general case. What I was wanting, though, was to be able to know
whether /\/blah/ is equal to %r{/blah}. Surely that's not too hard?

As I said, though, converting them both to strings gives the same
result, so the strings, at least, are comparable with expected results.

OK. Now I better understand your question. I agree with you that
something seems kind of counter-intuitive here:

  irb(main):001:0> x = /\/blah/
  => /\/blah/
  irb(main):002:0> y = %r{/blah}
  => /\/blah/
  irb(main):003:0> z = %r{\/blah}
  => /\/blah/
  irb(main):004:0> x == y
  => false
  irb(main):005:0> x == z
  => true
  irb(main):006:0> x.source
  => "\\/blah"
  irb(main):007:0> y.source
  => "/blah"
  irb(main):008:0> z.source
  => "\\/blah"

In line 006, why is the source of /\/blah/ returned as "\\/blah"?

It seems to me that it's incorrect to return the leading backslash
doubled. Isn't the backslash in the initial definition of x just an
escape character for enabling the second forward slash to be treated as
part of the regexp? Why is that initial escape character treated as a
bona fide part of the regexp source?

Also, consider this:

  irb(main):001:0> x = /\/blah/
  => /\/blah/
  irb(main):002:0> y = %r{/blah}
  => /\/blah/
  irb(main):003:0> z = %r{\/blah}
  => /\/blah/
  irb(main):004:0> x.to_s
  => "(?-mix:\\/blah)"
  irb(main):005:0> y.to_s
  => "(?-mix:\\/blah)"
  irb(main):006:0> z.to_s
  => "(?-mix:\\/blah)"
  irb(main):007:0> x.to_s == y.to_s
  => true
  irb(main):008:0> x.to_s == z.to_s
  => true

In other words, the string representation of the 3 regexp's (as opposed
to their _source_ representations) are identical.

So this brings up two questions:

1. Why is the backslash in a statement like this /\/blah/ _not_ just
    treated as an escape character when determining the source?

2. Why does the `==' operator compare the source and not the internal
    representations? It seems to me that the internal representation is
    a much more meaningful piece of information for use in a comparison.

···

--
Lloyd Zusman
ljz@asfast.com
God bless you.

I should probably include some suitable references so somebody can run
off an implement this:

Subtyping with Antimirov's algorithm:
http://www.inf.uni-konstanz.de/dbis/teaching/current/pathfinder/download/hohenade-ausarbeitung.pdf

Partial derivatives of regular expressions and finite automata
http://citeseer.ist.psu.edu/antimirov95partial.html

Eivind.

···

On Thu, 21 Oct 2004 15:37:57 +0000, Eivind Eklund <eeklund@gmail.com> wrote:

On Wed, 20 Oct 2004 09:06:10 +0900, Lloyd Zusman <ljz@asfast.com> wrote:
> Not only is it a tall order to determine whether any two regex's will
> match exactly the same set of strings, but in the general case, I'm
> pretty sure that this question is undecidable.

As far as I can tell: If you run Antimirov's algorithm for NFA
construction, you can compare each extracted partial derivative and
see if it is equal. If all extracted partial derivatives are equal,
your regular expressions are identical. If any differ - they're
different.

--
Hazzle free packages for Ruby?
RPA is available from http://www.rubyarchive.org/