Regexp operators

Hi folks,

In doing some work with parsers the other day, I ended up with a
situation where I wanted to embed and combine regexps easily,
preserving any flags that they may have on them (such as
case-sensitivity, etc).

So I whipped this up really quick. It lets you combine regexps by adding
them (+), ORing them (|) or by including them /blah#{some_regexp}blah/
inside each other, while preserving all flags. (At least, all flags
that can be preserved–I can’t do anything about the encodings if they
mismatch… but I use all UTF-8 regexps, so I always use ‘u’). This
could potentially be extended to support other operations–but I
haven’t needed them yet on the project I’m tinkering with.

Maybe this is useful or interesting to somebody besides me. ;)

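class Regexp
  # Splits Regexp#inspect output ("/source/flags") into source and flags.
  @@inspect_regex = %r{/(.*)/((?:m|i|x)*)}

  # The flags as a string, e.g. "mi".
  def options
    return @@inspect_regex.match(self.inspect)[2]
  end

  # Wrap the source in a group that carries its own flags.
  def to_s
    self.inspect =~ @@inspect_regex
    if $2.length > 0
      "(?#{$2}:#{$1})"
    else
      "(?:#{$1})"
    end
  end

  # Alternation: match either regexp.
  def |(other)
    /#{self}|#{other}/u
  end

  # Concatenation: match self directly followed by other.
  def +(other)
    /#{self}#{other}/u
  end
end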

Super simple, but has been endlessly handy for me:

$ irb
irb(main):001:0> require 'RegexpOps'
=> true
irb(main):002:0> /bar/ + /foo/
=> /(?:bar)(?:foo)/u
irb(main):003:0> /bar/i + /foo/
=> /(?i:bar)(?:foo)/u
irb(main):004:0> /bar/i + /foo/m
=> /(?i:bar)(?m:foo)/u
irb(main):005:0> /bar/ix + /foo/m
=> /(?ix:bar)(?m:foo)/u
irb(main):006:0> baz = /bar/i + /foo/
=> /(?i:bar)(?:foo)/u
irb(main):007:0> %r!foo#{/bar/i}*#{baz}?!
=> /foo(?i:bar)*(?:(?i:bar)(?:foo))?/
irb(main):008:0> /test/mi.options
=> "mi"

And stuff like that. =)
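
For anyone trying this on a current Ruby: the built-in Regexp#options returns an integer bitmask rather than a string, so redefining it as above could surprise other code. Here is a minimal sketch of getting the same flag string without overriding anything. The names FLAG_CHARS and flag_string are hypothetical, not from the post.

```ruby
# Hypothetical helper (not part of the original RegexpOps code).
# Maps the built-in option bits to the letters used in regexp literals.
FLAG_CHARS = [
  [Regexp::MULTILINE,  "m"],
  [Regexp::IGNORECASE, "i"],
  [Regexp::EXTENDED,   "x"]
].freeze

# Returns the m/i/x flags of a regexp as a string, e.g. "mi".
def flag_string(re)
  FLAG_CHARS.select { |bit, _| (re.options & bit) != 0 }
            .map { |_, char| char }
            .join
end

puts flag_string(/test/mi)  # prints "mi"
puts flag_string(/test/ix)  # prints "ix"
```

The encoding flag is deliberately left out here, since it cannot be expressed inside a (?...) group anyway.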



Wesley J. Landaker - wjl@icecavern.net
OpenPGP FP: C99E DF40 54F6 B625 FD48 B509 A3DE 8D79 541F F830

Hi,


At Sat, 31 May 2003 03:26:53 +0900, Wesley J Landaker wrote:

So I whipped this out really quick. It lets you combine regexp by adding
them (+), ORing them (|) or by including them /blah#{some_regexp}blah/
inside each other, while preserving all flags. (At least, all flags
that can be preserved–I can’t do anything about the encodings if they
mismatch… but I use all UTF-8 regexps, so I always use ‘u’). This
could potentially be extended to support other operations–but I
haven’t needed them yet on the project I’m tinkering with.

Flags are preserved in 1.8.

$ ruby -v -e 'p(/#{/foo/m}/)'
ruby 1.8.0 (2003-05-31) [i686-linux]
/(?m-ix:foo)/


Nobu Nakada

I hadn’t tried this in 1.8 yet, so that’s cool to see that it has built
in a to_s similar to the one I posted. Nice! =)

Mine also does the + and | operators, as well, though; I’m not sure if
that’s universally useful.
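
Since the built-in to_s already wraps each regexp in a group with its flags, the two operators can probably be written without touching to_s at all. A rough sketch, assuming Ruby 1.8 or later; the module name RegexpCombine and its methods concat and alt are made up for illustration, and the result does not force the 'u' encoding the way my original code does:

```ruby
# Hypothetical helper module (names are illustrative only).
module RegexpCombine
  module_function

  # Concatenation: interpolation calls Regexp#to_s on each operand, which
  # yields a (?flags-flags:source) group, so each side keeps its own flags.
  def concat(a, b)
    /#{a}#{b}/
  end

  # Alternation: Regexp.union is built in and also preserves per-operand flags.
  def alt(a, b)
    Regexp.union(a, b)
  end
end

both = RegexpCombine.concat(/bar/i, /foo/)
puts both.source                 # prints "(?i-mx:bar)(?-mix:foo)"
puts((both =~ "BARfoo").inspect) # prints "0"   (bar is case-insensitive)
puts((both =~ "BARFOO").inspect) # prints "nil" (foo is still case-sensitive)

either = RegexpCombine.alt(/bar/i, /foo/)
puts either.source               # prints "(?i-mx:bar)|(?-mix:foo)"
```

Both results are plain Regexp objects, so they can be combined again or interpolated further.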

Looks like 1.8 still doesn’t catch the encoding flag in this case; there
doesn’t appear to be any ‘(?’ prefix that changes encodings, though,
which would be a prerequisite. (Personally, I’m happy with UTF-8. ;)

$ ruby -v -e 'p(/#{/foo/mu}/)'
ruby 1.8.0 (2003-05-31) [i686-linux]
/(?m-ix:foo)/


On Friday 30 May 2003 5:41 pm, nobu.nokada@softhome.net wrote:

Hi,

At Sat, 31 May 2003 03:26:53 +0900, Wesley J Landaker wrote:


Wesley J. Landaker - wjl@icecavern.net

Hi,

Mine also does the + and | operators, as well, though; I’m not sure if
that’s universally useful.

As for +, is it right to just concatenate them? Regexp#| is
provided in lib/eregex.rb. And you can see also
http://member.nifty.ne.jp/nokada/archive/reop.rb.

Looks like 1.8 still doesn’t catch encoding flag in this case; there
doesn’t appear to be any ‘(?’ prefix that changes encodings, though,
which would be a prerequisite. (Personally, I’m happy with UTF-8. ;)

Current regexp engine (and perhaps Oniguruma too) can not mix
encodings. Well, would it be better to preserve it and raise
an exception when it doesn’t match?


At Sat, 31 May 2003 08:59:45 +0900, Wesley J Landaker wrote:


Nobu Nakada

Hi,

Mine also does the + and | operators, as well, though; I’m not sure
if that’s universally useful.

As for +, is it right to just concatenate them? Regexp#| is
provided in lib/eregex.rb. And you can see also
http://member.nifty.ne.jp/nokada/archive/reop.rb.

Well, + meaning concatenation makes sense to me. What else would it
mean? Notice that I do put regexps in (?:) groups so that you don’t
have any ambiguity if you do something like:

/foo|bar/ + /.*/ # => /(?:foo|bar)(?:.*)/u

(vs. getting /foo|bar.*/ which would be, I think, not what you expected,
especially if the regexps were extremely complex)

I wasn’t aware that there were several other regexp-operator
packages. Must be a good idea if several different people have also
thought of it. ;)

One thing that’s missing from the packages you point at is that the
object you get back isn’t completely usable as a regexp. They could be
extended to have the missing methods, of course, but they don’t
currently support them. And if you’ve added or modified any methods in
regexp, these objects are of a different type (and aren’t class
descendants) so won’t have the changes applied to them (say, if I
redefine to_s or source or something like that)

i.e.:
irb(main):001:0> require 'eregex'
=> true
irb(main):002:0> x = /foo/ | /bar/
=> #<RegOr:0x401c0ba4 @re2=/bar/, @re1=/foo/>
irb(main):003:0> /test/.methods - x.methods
=> ["casefold?", "|", "source", "&", "~", "match", "kcode"]

Anyway, looks like eregex & is pretty handy; and your reop.rb looks even
better, but for me, I think mine is a lot more useful in that it is
totally transparent: when you do an operation on regexps, you get a
regexp back. It doesn’t create an object hierarchy as the other two you
cited do; I toyed with that idea, but I didn’t like it because I got
objects back that behaved differently than regexps and couldn’t be
easily redefined without having some intimate knowledge of the operator
package.

BTW, I never wrote ‘&’ because I didn’t really need it, but it could be
done with something like this:

In RegexpOps.rb:

# the other code I posted goes here

class Regexp
  def &(other)
    /(?=#{self})#{other}/u
  end
end

Then:
irb(main):001:0> require 'RegexpOps'
=> true
irb(main):002:0> /foo/ & /bar/
=> /(?=(?:foo))(?:bar)/u

Of course, that regexp will never match anything, but you get the idea.
;)

Looks like 1.8 still doesn’t catch encoding flag in this case;
there doesn’t appear to be any ‘(?’ prefix that changes encodings,
though, which would be a prerequisite. (Personally, I’m happy with
UTF-8. ;)

Current regexp engine (and perhaps Oniguruma too) can not mix
encodings. Well, would it be better to preserve it and raise
an exception when it doesn’t match?

For me, the encodings are not a problem, as I only use UTF-8; I do a lot
of multilingual stuff, and UTF-8 is the only way I can support English,
French, Spanish, German, and Japanese (strange mix, but those are the
languages I work with!) simultaneously in Ruby.

In general, though, it seems like it would be a good idea to catch
attempts at mixing encodings and throw an exception if they are
incompatible. I might add that to mine.
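
Something along these lines might work for that check on a current Ruby, where Regexp has encoding and fixed_encoding? methods (neither existed in 1.6 or 1.8, so this is only a sketch for later versions). The name checked_concat is hypothetical:

```ruby
# Hypothetical helper: concatenate two regexps, but refuse to combine them
# when both carry an explicit (fixed) encoding and those encodings differ.
def checked_concat(a, b)
  if a.fixed_encoding? && b.fixed_encoding? && a.encoding != b.encoding
    raise ArgumentError,
          "incompatible regexp encodings: #{a.encoding} and #{b.encoding}"
  end
  /#{a}#{b}/
end

# Two UTF-8 regexps combine as usual.
ok = checked_concat(/foo/u, /bar/u)
puts((ok =~ "foobar").inspect)   # prints "0"

# A UTF-8 regexp and an EUC-JP regexp are rejected.
begin
  checked_concat(/foo/u, /bar/e)
rescue ArgumentError => err
  puts err.message
end
```

Regexps without an explicit encoding flag are let through, on the assumption that they are compatible with either side.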


On Saturday 31 May 2003 1:28 am, nobu.nokada@softhome.net wrote:

At Sat, 31 May 2003 08:59:45 +0900, Wesley J Landaker wrote:


Wesley J. Landaker - wjl@icecavern.net

Hi,

Well, + meaning concatenation makes sense to me. What else would it
mean? Notice that I do put regexps in (?:) groups so that you don’t
have any ambiguity if you do something like:

/foo|bar/ + /.*/ # => /(?:foo|bar)(?:.*)/u

One possibility:
=> /(?:foo|bar).*(?:.*)/u

One thing that’s missing from the packages you point at is that the
object you get back isn’t completely usable as a regexp. They could be
extended to have the missing methods, of course, but they don’t
currently support them. And if you’ve added or modified any methods in
regexp, these objects are of a different type (and aren’t class
descendants) so won’t have the changes applied to them (say, if I
redefine to_s or source or something like that)

Yes, I know it’s a problem. Besides what you mentioned, some
String methods expect a Regexp instance.
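
To illustrate the point about String methods, here is a small sketch. It uses the built-in Regexp.union only as a stand-in for any operator that returns a real Regexp; nothing here comes from eregex or reop.rb.

```ruby
# A combined pattern that is itself a Regexp instance.
combined = Regexp.union(/foo/i, /bar/)

puts combined.is_a?(Regexp)                  # prints "true"

# Because it is a real Regexp, String methods accept it directly.
p "Foo bar baz".scan(combined)               # prints ["Foo", "bar"]
puts "Foo bar".gsub(combined, "X")           # prints "X X"
```

A wrapper object such as RegOr would need its own conversion step before being handed to methods like these.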

BTW, I never wrote ‘&’ because I didn’t really need it, but it could be
done with something like this:

In RegexpOps.rb:

# the other code I posted goes here

class Regexp
  def &(other)
    /(?=#{self})#{other}/u
  end
end

Seems nice.

Of course, that regexp will never match anything, but you get the idea.
;)

Maybe, /(?=.*(?:foo)).*(?:bar)/?
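
The difference between the two forms of & can be checked with a short sketch. The variable names strict and loose are only labels for this example:

```ruby
foo = /foo/
bar = /bar/

# Lookahead and match both start at the same position, so "foo" and "bar"
# would have to begin at the same character: this can never succeed.
strict = /(?=#{foo})#{bar}/

# With .* on both sides, each pattern may occur anywhere after the
# starting position, so both only have to appear somewhere in the string.
loose = /(?=.*#{foo}).*#{bar}/

p strict =~ "foo bar"   # prints nil
p loose  =~ "foo bar"   # prints 0
p loose  =~ "bar foo"   # prints 0
p loose  =~ "bar"       # prints nil (no "foo" anywhere)
```

Note that . does not match newlines by default, so the loose form only works this way within a single line unless the m flag is added.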


At Sun, 1 Jun 2003 00:10:15 +0900, Wesley J Landaker wrote:


Nobu Nakada

I tend to think of + for regexp to be like + for strings–just a simple
concatenation. However, I can see how the above would make sense, if
you think of “a + b” as meaning more like “match a, and then match b;
what’s in between doesn’t really matter”.

If you did that, you’d need some way to do the “a directly followed by
b” semantics. I suppose << would work for that, though. =)

/foo|bar/ << /.*/
=> /(?:foo|bar)(?:.*)/
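
To make the two readings concrete, here is a sketch with plain methods instead of operators. The names directly_followed_by and eventually_followed_by are invented for this example:

```ruby
# "a directly followed by b" (the simple concatenation reading).
def directly_followed_by(a, b)
  /#{a}#{b}/
end

# "match a, then match b, whatever is in between" (the other reading).
def eventually_followed_by(a, b)
  /#{a}.*#{b}/
end

tight = directly_followed_by(/foo|bar/, /baz/)
loose = eventually_followed_by(/foo|bar/, /baz/)

p tight =~ "barbaz"    # prints 0
p tight =~ "bar-baz"   # prints nil
p loose =~ "bar-baz"   # prints 0
p tight =~ "foobaz"    # prints 0 (the grouping keeps foo|bar together)
```

The last line is where the grouping matters: without it the pattern would read as foo|barbaz and match a bare "foo".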


On Saturday 31 May 2003 10:35 am, nobu.nokada@softhome.net wrote:

Hi,

At Sun, 1 Jun 2003 00:10:15 +0900, Wesley J Landaker wrote:



Wesley J. Landaker - wjl@icecavern.net