Regexp equality

Jamis_Buck4 · 19 October 2004 18:07

Here's an oddity I recently came across in a unit test. Identical regexps declared with /.../ will be equivilent to each other, but not to the same regexps created with %r{...}:

   $ ruby -ve 'p( /\/blah/ == /\/blah/ )'
   ruby 1.8.2 (2004-07-29) [i686-linux]
   true

   $ ruby -ve 'p( %r{/blah} == %r{/blah} )'
   ruby 1.8.2 (2004-07-29) [i686-linux]
   true

   $ ruby -ve 'p( /\/blah/ == %r{/blah} )'
   ruby 1.8.2 (2004-07-29) [i686-linux]
   false

This appears to only be the case when there is a forward slash in the regexp. In other words:

   $ ruby -ve 'p( /blah/ == %r{blah} )'
   ruby 1.8.2 (2004-07-29) [i686-linux]
   true

Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

- Jamis

···

--
Jamis Buck
jgb3@email.byu.edu
http://www.jamisbuck.org/jamis

Yukihiro_Matsumoto2 · 19 October 2004 18:44

Hi,

···

In message "Re: Regexp equality" on Wed, 20 Oct 2004 03:07:28 +0900, Jamis Buck <jgb3@email.byu.edu> writes:

Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

Regexp#== compares literal appearance of regex, e.g.

\/blah vs /blah

I consider this as a feature. I'm open for the discussion (as always).

matz.

Jamis_Buck4 · 19 October 2004 18:53

Yukihiro Matsumoto wrote:

Hi,

>Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

Regexp#== compares literal appearance of regex, e.g.

\/blah vs /blah

I consider this as a feature. I'm open for the discussion (as always).

Is there a way to test regexen for equality based on the strings they would match? Perhaps that's too general, since it depends on the strings they would be given, but I guess I expected /\/blah/ to be equal to %r{/blah}.

I suppose /\/blah/.to_s would equal %r{/blah}.to_s? Would that be a better way to compare regexen?

- Jamis

···

In message "Re: Regexp equality" > on Wed, 20 Oct 2004 03:07:28 +0900, Jamis Buck <jgb3@email.byu.edu> writes:

--
Jamis Buck
jgb3@email.byu.edu
http://www.jamisbuck.org/jamis

James_Edward_Gray_II · 19 October 2004 19:18

Wow. I bet that's a pretty tall order. Pulling from the current Ruby Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that way. Maybe some internal representation, but I would but super impressed to see that.

Not at all saying I don't like the idea, just that it's a tall order.

James Edward Gray II

···

On Oct 19, 2004, at 1:53 PM, Jamis Buck wrote:

Is there a way to test regexen for equality based on the strings they would match?

Lloyd_Zusman · 20 October 2004 00:06

James Edward Gray II <james@grayproductions.net> writes:

Is there a way to test regexen for equality based on the strings they
would match?

Wow. I bet that's a pretty tall order. Pulling from the current Ruby
Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that way.
Maybe some internal representation, but I would but super impressed to
see that.

Not at all saying I don't like the idea, just that it's a tall order.

Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

If I'm not mistaken, I believe that this can be shown to be a reduction
of the Halting Problem.

In other words, short of feeding every possible character string to both
regex's and comparing the two sets of matches, I don't think that there
is an algorithmic way to measure equivalence of two regex's which don't
have the same external representation.

···

On Oct 19, 2004, at 1:53 PM, Jamis Buck wrote:

--
Lloyd Zusman
ljz@asfast.com
God bless you.

Jamis_Buck4 · 20 October 2004 14:58

Lloyd Zusman wrote:

James Edward Gray II <james@grayproductions.net> writes:

Is there a way to test regexen for equality based on the strings they
would match?

Wow. I bet that's a pretty tall order. Pulling from the current Ruby
Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that way.
Maybe some internal representation, but I would but super impressed to
see that.

Not at all saying I don't like the idea, just that it's a tall order.

Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

If I'm not mistaken, I believe that this can be shown to be a reduction
of the Halting Problem.

In other words, short of feeding every possible character string to both
regex's and comparing the two sets of matches, I don't think that there
is an algorithmic way to measure equivalence of two regex's which don't
have the same external representation.

I realize that particular question is most likely impossible to answer for the general case. What I was wanting, though, was to be able to know whether /\/blah/ is equal to %r{/blah}. Surely that's not too hard?

As I said, though, converting them both to strings gives the same result, so the strings, at least, are comparable with expected results.

- Jamis

···

On Oct 19, 2004, at 1:53 PM, Jamis Buck wrote:

--
Jamis Buck
jgb3@email.byu.edu
http://www.jamisbuck.org/jamis

Eivind_Eklund · 21 October 2004 15:38

As far as I can tell: If you run Antimirov's algorithm for NFA
construction, you can compare each extracted partial derivative and
see if it is equal. If all extracted partial derivatives are equal,
your regular expressions are identical. If any differ - they're
different.

Eivind.

···

On Wed, 20 Oct 2004 09:06:10 +0900, Lloyd Zusman <ljz@asfast.com> wrote:

Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

--
Hazzle free packages for Ruby?
RPA is available from http://www.rubyarchive.org/

Lloyd_Zusman · 20 October 2004 22:46

Jamis Buck <jgb3@email.byu.edu> writes:

[ ... ]

I realize that particular question is most likely impossible to answer
for the general case. What I was wanting, though, was to be able to know
whether /\/blah/ is equal to %r{/blah}. Surely that's not too hard?

As I said, though, converting them both to strings gives the same
result, so the strings, at least, are comparable with expected results.

OK. Now I better understand your question. I agree with you that
something seems kind of counter-intuitive here:

  irb(main):001:0> x = /\/blah/
  => /\/blah/
  irb(main):002:0> y = %r{/blah}
  => /\/blah/
  irb(main):003:0> z = %r{\/blah}
  => /\/blah/
  irb(main):004:0> x == y
  => false
  irb(main):005:0> x == z
  => true
  irb(main):006:0> x.source
  => "\\/blah"
  irb(main):007:0> y.source
  => "/blah"
  irb(main):008:0> z.source
  => "\\/blah"

In line 006, why is the source of /\/blah/ returned as "\\/blah"?

It seems to me that it's incorrect to return the leading backslash
doubled. Isn't the backslash in the initial definition of x just an
escape character for enabling the second forward slash to be treated as
part of the regexp? Why is that initial escape character treated as a
bona fide part of the regexp source?

Also, consider this:

  irb(main):001:0> x = /\/blah/
  => /\/blah/
  irb(main):002:0> y = %r{/blah}
  => /\/blah/
  irb(main):003:0> z = %r{\/blah}
  => /\/blah/
  irb(main):004:0> x.to_s
  => "(?-mix:\\/blah)"
  irb(main):005:0> y.to_s
  => "(?-mix:\\/blah)"
  irb(main):006:0> z.to_s
  => "(?-mix:\\/blah)"
  irb(main):007:0> x.to_s == y.to_s
  => true
  irb(main):008:0> x.to_s == z.to_s
  => true

In other words, the string representation of the 3 regexp's (as opposed
to their _source_ representations) are identical.

So this brings up two questions:

1. Why is the backslash in a statement like this /\/blah/ _not_ just
treated as an escape character when determining the source?

2. Why does the `==' operator compare the source and not the internal
representations? It seems to me that the internal representation is
a much more meaningful piece of information for use in a comparison.

···

--
Lloyd Zusman
ljz@asfast.com
God bless you.

Eivind_Eklund · 21 October 2004 15:59

I should probably include some suitable references so somebody can run
off an implement this:

Subtyping with Antimirov's algorithm:
http://www.inf.uni-konstanz.de/dbis/teaching/current/pathfinder/download/hohenade-ausarbeitung.pdf

Partial derivatives of regular expressions and finite automata
http://citeseer.ist.psu.edu/antimirov95partial.html

Eivind.

···

On Thu, 21 Oct 2004 15:37:57 +0000, Eivind Eklund <eeklund@gmail.com> wrote:

On Wed, 20 Oct 2004 09:06:10 +0900, Lloyd Zusman <ljz@asfast.com> wrote:
> Not only is it a tall order to determine whether any two regex's will
> match exactly the same set of strings, but in the general case, I'm
> pretty sure that this question is undecidable.

As far as I can tell: If you run Antimirov's algorithm for NFA
construction, you can compare each extracted partial derivative and
see if it is equal. If all extracted partial derivatives are equal,
your regular expressions are identical. If any differ - they're
different.

--
Hazzle free packages for Ruby?
RPA is available from http://www.rubyarchive.org/

Topic		Replies	Views
Re.c/ri's Regexp.=== documentation has an error/typo ruby-talk	0	122	30 March 2004
Why is regexp "\Amax_repeats=(\d+)" not equal to "^max_repeats=([0-9]+)"? ruby-talk	1	113	1 May 2004
Regexes appear same, test != ruby-talk	2	106	24 September 2013
RCR: regex + regex ruby-talk	3	82	27 October 2009
Ruby 1.8.1. RegExp documentation? ruby-talk	2	119	12 April 2004

Regexp equality

Related topics