Recursive brace matching with Ruby regexp

I wanted to learn Ruby, so I picked a small task of trying to write a
command line script to parse PHP classes and shell out some unit test
cases. I have it working for the most part, but I ran across a
problem trying to use Ruby regexp to find a set of matching curly
braces.

Please forgive the intrusion of this PHP code onto the list, but I
wanted to give you the flavor of what I am attempting to do, that can
be easily done with recursive regular expression available in the Perl
compatiable regexp engine.

<php>
$test = <<<EOS
/* some stuff */
class foo {
        public \$var;
        public function __construct() {}
        public function bar() {
                if (false) {
                }
        }

}
// some other stuff
EOS;

$re = <<<EOS
~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
EOS;
preg_match($re, $test, $match);
echo "your class matched:\n", $match[1];

</php>

Now it appears the regexp engine in Ruby does not support recursion
(at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
working on, and with what I know how to test), thus the only
workaround I found was very ugly, model the nesting of braces to a
fixed depth, i.e.

open = '\{'
close = '\}'
other = '[^\{\}]*'
l1 = other+open+other+close+other
l2 = other+open+'('+l1 +')+'+other+close+other
l3 = other+open+'('+l2 +')+'+other+close+other
l4 = other+open+'('+l3 +')+'+other+close+other
l5 = other+open+'('+l4 +')+'+other+close+other
re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+')|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:'+other+
'))+'+close, 'ixm')

This code did work, but sometimes timed out on valid real classes.

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?

Thanks,

Regards,
Jason

···

--
http://blog.casey-sweat.us/

Ruby's regex engine doesn't yet support this, but will in future versions.

Hope that helps.

James Edward Gray II

···

On Nov 4, 2004, at 6:04 PM, Jason Sweat wrote:

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?

Hi,

I wanted to learn Ruby, so I picked a small task of trying to write a
command line script to parse PHP classes and shell out some unit test
cases. I have it working for the most part, but I ran across a
problem trying to use Ruby regexp to find a set of matching curly
braces.

Please forgive the intrusion of this PHP code onto the list, but I
wanted to give you the flavor of what I am attempting to do, that can
be easily done with recursive regular expression available in the Perl
compatiable regexp engine.

<php>
$test = <<<EOS
/* some stuff */
class foo {
        public \$var;
        public function __construct() {}
        public function bar() {
                if (false) {
                }
        }

}
// some other stuff
EOS;

$re = <<<EOS
~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
EOS;
preg_match($re, $test, $match);
echo "your class matched:\n", $match[1];

</php>

Now it appears the regexp engine in Ruby does not support recursion
(at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
working on, and with what I know how to test), thus the only
workaround I found was very ugly, model the nesting of braces to a
fixed depth, i.e.

open = '\{'
close = '\}'
other = '[^\{\}]*'
l1 = other+open+other+close+other
l2 = other+open+'('+l1 +')+'+other+close+other
l3 = other+open+'('+l2 +')+'+other+close+other
l4 = other+open+'('+l3 +')+'+other+close+other
l5 = other+open+'('+l4 +')+'+other+close+other
re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+')|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:'+other+
'))+'+close, 'ixm')

This code did work, but sometimes timed out on valid real classes.

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?

Regular expressions, by all standard definitions, aren't recursive.
Perl's regexen have been extended to allow it, but it really isn't
considered a standard regex feature. You might try using a simple
tokenizer... here's a quick attempt:

def parse(code)
  chunks =
  loop do
    chunks << text.slice!(/\A.*?(?=[{}])/m) # match start of string to
before next bracket
    bracket = text.slice! 0
    chunks << parse(text) if bracket == ?{
    return chunks if bracket == ?}
    return chunks if text.size == 0
  end
end

This returns a recursive array that holds all the text chunks around
the brackets. here's some sample code (can't remember exactly what php
looks like ATM) and sample output:

PHP code:

class Foo {
  def bar(){
    if true{
      do_stuff()
    }else{
      do_nothing()
    }
    clean_up()
  }
}

Then, after recursively stripping whitespace from strings, the
pretty-printed array:

["class Foo",
["def bar()",
  ["if true", ["do_stuff()"], "else", ["do_nothing()"], "clean_up()"],
  ""],
nil]

uh, yeah, I know it drops off the last piece of text (reads it as nil)
but I don't want to figure that out just yet. Dinner calls :slight_smile:

HTH,
Mark

···

On Fri, 5 Nov 2004 09:04:30 +0900, Jason Sweat <jason.sweat@gmail.com> wrote:

Thanks,

Regards,
Jason
--
http://blog.casey-sweat.us/

> I expect I am probably missing some facet of Ruby that eaily allows me
> to next regexp inside of the Ruby code in some fasion to achieve the
> result I am looking for, but how to do so eludes me. Can anyone
> provide some insight for me on this situation?

Ruby's regex engine doesn't yet support this, but will in future
versions.

Just for fun,

dat = <<-"END_TEXT"
/* some stuff */
class foo {
        public \$var;
        public function __construct() {}
        public function bar() {
                if (false) {
                }
        }

}
// some other stuff
END_TEXT

repl =
true while dat.gsub!(/\([^(){}]*\)|\{[^(){}]*\}/) do |str|
  repl.push(str)
  "\0#{repl.length - 1}\0"
end
dat.scan(/class\s+\w+\s+\0\d+\0/) do
  match = $&
  true while match.gsub!(/\0(\d+)\0/) { repl[$1.to_i] }
  puts "Your class matched:", match
end

The above matches arbitrarily deeply nested ( ) and { }
blocks... but all the "classes" it matches have to be
at toplevel, can't be inside a { } or ( ) themselves...

I don't think it would be hard to modify to handle nested
classes, but I haven't thought it through...

In case it's not obvious how it works... It replaces,
from innermost to outermost, () and {} spans with a
token, and saves the original span in an array.

When it finds a class match followed by one of these
tokens, it expands it back out to its original
representation.

Regards,

Bill

···

From: "James Edward Gray II" <james@grayproductions.net>

On Nov 4, 2004, at 6:04 PM, Jason Sweat wrote:

Of course, you are correct. However, recursive "Regular Expressions" are becoming fairly common place now. Ruby's next regex engine will support them as well.

Regular Expression has evolved considerably from the original definition, with recursive capabilities being just another change in a long line of added usefulness. Does that really means they cease to be Regular Expressions? Longhorn will still be Windows, right?

I think of it more in terms of supersets or, for a programming slant, subclassing. The spirit of Regular Expressions, patterns to locate/breakdown text, is intact, I believe. They're just even more handy now.

Just my two cents.

James Edward Gray II

P.S. Proving whether or not Perl 6 changes will still be "Regular Expression" is left as an exercise for the reader. :wink:

···

On Nov 4, 2004, at 8:04 PM, Mark Hubbart wrote:

Regular expressions, by all standard definitions, aren't recursive.
Perl's regexen have been extended to allow it, but it really isn't
considered a standard regex feature.

Mark Hubbart ha scritto:

Hi,

I wanted to learn Ruby, so I picked a small task of trying to write a
command line script to parse PHP classes and shell out some unit test
cases. I have it working for the most part, but I ran across a
problem trying to use Ruby regexp to find a set of matching curly
braces.

Please forgive the intrusion of this PHP code onto the list, but I
wanted to give you the flavor of what I am attempting to do, that can
be easily done with recursive regular expression available in the Perl
compatiable regexp engine.

<php>
$test = <<<EOS
/* some stuff */
class foo {
       public \$var;
       public function __construct() {}
       public function bar() {
               if (false) {
               }
       }

}
// some other stuff
EOS;

$re = <<<EOS
~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
EOS;
preg_match($re, $test, $match);
echo "your class matched:\n", $match[1];

</php>

Now it appears the regexp engine in Ruby does not support recursion
(at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
working on, and with what I know how to test), thus the only
workaround I found was very ugly, model the nesting of braces to a
fixed depth, i.e.

open = '\{'
close = '\}'
other = '[^\{\}]*'
l1 = other+open+other+close+other
l2 = other+open+'('+l1 +')+'+other+close+other
l3 = other+open+'('+l2 +')+'+other+close+other
l4 = other+open+'('+l3 +')+'+other+close+other
l5 = other+open+'('+l4 +')+'+other+close+other
re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+')|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:'+other+
'))+'+close, 'ixm')

This code did work, but sometimes timed out on valid real classes.

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?

Regular expressions, by all standard definitions, aren't recursive.
Perl's regexen have been extended to allow it, but it really isn't
considered a standard regex feature. You might try using a simple
tokenizer... here's a quick attempt:

I wonder if someone ever looked at the ABNF module on raa. It is a tool to generate a parser for a grammar in augmented backus naur form... the parser is a Regexp object, so it seem thgere is some general way to handle recursive parsing needs..

···

On Fri, 5 Nov 2004 09:04:30 +0900, Jason Sweat <jason.sweat@gmail.com> wrote:

Thanks Bill,

I think I see how that works. I will play with it and see if it
solves my problem. I knew there had to be a much cleaner way than
what I was doing :slight_smile:

Regards,
Jason

···

On Fri, 5 Nov 2004 10:39:11 +0900, Bill Kelly <billk@cts.com> wrote:

repl =
true while dat.gsub!(/\([^(){}]*\)|\{[^(){}]*\}/) do |str|
  repl.push(str)
  "\0#{repl.length - 1}\0"
end
dat.scan(/class\s+\w+\s+\0\d+\0/) do
  match = $&
  true while match.gsub!(/\0(\d+)\0/) { repl[$1.to_i] }
  puts "Your class matched:", match
end

The above matches arbitrarily deeply nested ( ) and { }
blocks... but all the "classes" it matches have to be
at toplevel, can't be inside a { } or ( ) themselves...

In case it's not obvious how it works... It replaces,
from innermost to outermost, () and {} spans with a
token, and saves the original span in an array.

--
http://blog.casey-sweat.us/

Hi,

> Regular expressions, by all standard definitions, aren't recursive.
> Perl's regexen have been extended to allow it, but it really isn't
> considered a standard regex feature.

Of course, you are correct. However, recursive "Regular Expressions"
are becoming fairly common place now. Ruby's next regex engine will
support them as well.

Are you saying Oniguruma *will* support it, or *does* support it? It
would be nice if it already does :smiley:

Regular Expression has evolved considerably from the original
definition, with recursive capabilities being just another change in a
long line of added usefulness. Does that really means they cease to be
Regular Expressions? Longhorn will still be Windows, right?

I think of it more in terms of supersets or, for a programming slant,
subclassing. The spirit of Regular Expressions, patterns to
locate/breakdown text, is intact, I believe. They're just even more
handy now.

My statement wasn't supposed to insinuate that adding recursion is
bad, or turns regular expressions into something else; I was just
pointing out that (last time I checked) you can't *expect* to be able
to use recursion in regexen in a particular language. Perl's regexen
are *more* than regexen; you can even embed perl code in them. Also,
I'm pretty sure the recursive matching wasn't in Perl when I last used
it, a couple years back.

Just my two cents.

James Edward Gray II

P.S. Proving whether or not Perl 6 changes will still be "Regular
Expression" is left as an exercise for the reader. :wink:

:slight_smile:

cheers,
Mark

···

On Fri, 5 Nov 2004 12:17:01 +0900, James Edward Gray II <james@grayproductions.net> wrote:

On Nov 4, 2004, at 8:04 PM, Mark Hubbart wrote:

Yes.

It very specifically mean that they stop being regular expressions,
because the "regular" actually has a specific meaning (coming from
regular sets/context free language theory), and has the nice property
of compiling to a Deterministic Finite Automation with only linear
increase in size of the DFA compared to the regular expression.

The true regular expressions consists of ^, $, characters, *, and (|)
(alternation). Most of the "original" extensions compile cleanly to
this, and can still be considered regular. When you start using
backrefs in matching or recursion, the expression is no longer regular
(which, among other things, will almost certainly force your poor
regexp engine into Non-deterministic Finite Automation mode - unless
you've got exponential amounts of RAM, of course :wink:

Regular expression - Wikipedia gives a bit more background.

Eivind.

···

On Fri, 5 Nov 2004 12:17:01 +0900, James Edward Gray II <james@grayproductions.net> wrote:

On Nov 4, 2004, at 8:04 PM, Mark Hubbart wrote:

> Regular expressions, by all standard definitions, aren't recursive.
> Perl's regexen have been extended to allow it, but it really isn't
> considered a standard regex feature.

Of course, you are correct. However, recursive "Regular Expressions"
are becoming fairly common place now. Ruby's next regex engine will
support them as well.

Regular Expression has evolved considerably from the original
definition, with recursive capabilities being just another change in a
long line of added usefulness. Does that really means they cease to be
Regular Expressions?

--
Hazzle free packages for Ruby?
RPA is available from http://www.rubyarchive.org/

Sure does.

James Edward Gray II

···

On Nov 4, 2004, at 9:59 PM, Mark Hubbart wrote:

Are you saying Oniguruma *will* support it, or *does* support it? It
would be nice if it already does :smiley:

Okay Eivind, what would you prefer we call Regular Expression from now on?

James Edward Gray II

P.S. You better copy Jeffrey E. F. Friedl in you answer since his "Mastering Regular Expressions" covers considerably more NFA than DFA.

···

On Nov 5, 2004, at 2:19 AM, Eivind Eklund wrote:

Yes.

It very specifically mean that they stop being regular expressions,

* Eivind Eklund <eeklund@gmail.com> [Nov 05, 2004 16:11]:

> > Regular expressions, by all standard definitions, aren't
> > recursive. Perl's regexen have been extended to allow it, but it
> > really isn't considered a standard regex feature.

> Of course, you are correct. However, recursive "Regular
> Expressions" are becoming fairly common place now. Ruby's next
> regex engine will support them as well.

> Regular Expression has evolved considerably from the original
> definition, with recursive capabilities being just another change in
> a long line of added usefulness. Does that really means they cease
> to be Regular Expressions?

It very specifically mean that they stop being regular expressions,
because the "regular" actually has a specific meaning (coming from
regular sets/context free language theory), and has the nice property
of compiling to a Deterministic Finite Automation with only linear
increase in size of the DFA compared to the regular expression.

Well, that's not quite true. They compile to NFAs that are linear in
size of the input regular expression. The DFA constructed from the
resulting NFA can be exponential in the size of that input. This is, of
course, only the case for pathological cases, but DFAs are considerably
larger than NFAs, even after minimizing them.

The true regular expressions consists of ^, $, characters, *, and (|)
(alternation).

Well, not really true either. True regular expressions don't include
zero-width assertions, such as ^ and $. They include characters from
some alphabet, * - Kleene closure, . - concatenation (which is implicit
in most syntaxes), and | - alternation.

Most of the "original" extensions compile cleanly to this, and can
still be considered regular.

Yes.

When you start using backrefs in matching or recursion, the expression
is no longer regular (which, among other things, will almost certainly
force your poor regexp engine into Non-deterministic Finite Automation
mode -

And backtracking mode as well. Regular Expressions With Backreferences
are an NP-complete problem and have none of the nice properties of
regular expressions.

unless you've got exponential amounts of RAM, of course :wink:

?

All in all, backreferencing is an addition that doesn't really add very
much yet ruins a lot of the whole deal with regular expressions. When
people now start to try to add recursion to regular expressions (which
would have been nice if they could have easily been included - but they
can't) one has to shiver a bit. If you wan't recursion, use something
suitable - like context-free languages; don't expect to solve everything
with only one tool.
  nikolai

···

--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

Indeed it does! I just tried it out. And named subexpressions work too
:slight_smile: many new features since I last checked it out:

http://www.geocities.jp/kosako1/oniguruma/doc/RE.txt

Thanks for pointing it out.

cheers,
Mark

···

On Fri, 5 Nov 2004 13:06:40 +0900, James Edward Gray II <james@grayproductions.net> wrote:

On Nov 4, 2004, at 9:59 PM, Mark Hubbart wrote:

> Are you saying Oniguruma *will* support it, or *does* support it? It
> would be nice if it already does :smiley:

Sure does.

Okay Eivind, what would you prefer we call Regular Expression from now on?

Ruby Automata? Irregular Expressions? Irregardless Expressions?
Syntax-Restricted-State-Machines (SRSM)? Susan/Jacob/Drew (depending
on whether your language makes them female, male, or neuter)?

I vote for calling it something like "Police Academy" -- Steve
Guttenberg really doesn't get enough credit for his impact on the
programming community.

~Me!

···

--
There's no word in the English language for what you do to a dead
thing to make it stop chasing you.

but both of those recognize only regular languages (those one can describe
with a regular expression). it seems this new type must be called

   'context-free expressions'

which would be the next class i think - that is unless this new type does
considerably more.

-a

···

On Fri, 5 Nov 2004, James Edward Gray II wrote:

On Nov 5, 2004, at 2:19 AM, Eivind Eklund wrote:

Yes.

It very specifically mean that they stop being regular expressions,

Okay Eivind, what would you prefer we call Regular Expression from now
on?

James Edward Gray II

P.S. You better copy Jeffrey E. F. Friedl in you answer since his
"Mastering Regular Expressions" covers considerably more NFA than DFA.

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
When you do something, you should burn yourself completely, like a good
bonfire, leaving no trace of yourself. --Shunryu Suzuki

===============================================================================

* James Edward Gray II <james@grayproductions.net> [Nov 05, 2004 16:11]:

> It very specifically mean that they stop being regular expressions,

Okay Eivind, what would you prefer we call Regular Expression from now
on?

Call them regular expressions and stop using them for context-free
language matching.
  nikolai

···

--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

A-1. Syntax depend options

   + ONIG_SYNTAX_RUBY
     (?m): dot(.) match newline

   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
     (?s): dot(.) match newline
     (?m): ^ match after newline, $ match before newline

Any idea why Ruby was chosen to behave differently from Perl here?

···

On Fri, Nov 05, 2004 at 02:05:00PM +0900, Mark Hubbart wrote:

http://www.geocities.jp/kosako1/oniguruma/doc/RE.txt

--
Jos Backus _/ _/_/_/ Sunnyvale, CA
                                _/ _/ _/
                               _/ _/_/_/
                          _/ _/ _/ _/
jos at catnook.com _/_/ _/_/_/ require 'std/disclaimer'

Great!! Just comiled ruby from cvs (ruby 1.9.0 (2004-11-04)
[i686-linux]) and was able to match using:

dat = <<-"END_TEXT"
/* some stuff */
class foo {
       public $var;
       public function __construct() {}
       public function bar() {
               if (false) {
               }
       }

}
// some other stuff
END_TEXT

dat.match(/(class\s+\w+\s*(\{(?:[^\{\}]*|\g<2>)*\}))/)
print $1

Thank you very much for your help!!

Regards,
Jason

···

On Fri, 5 Nov 2004 14:05:00 +0900, Mark Hubbart <discordantus@gmail.com> wrote:

> > Are you saying Oniguruma *will* support it, or *does* support it? It
> > would be nice if it already does :smiley:
>
> Sure does.

Indeed it does! I just tried it out. And named subexpressions work too
:slight_smile: many new features since I last checked it out:

http://www.geocities.jp/kosako1/oniguruma/doc/RE.txt

Thanks for pointing it out.

--
http://blog.casey-sweat.us/

Matt Maycock ha scritto:

Okay Eivind, what would you prefer we call Regular Expression from now on?

Ruby Automata? Irregular Expressions? Irregardless Expressions? Syntax-Restricted-State-Machines (SRSM)? Susan/Jacob/Drew (depending
on whether your language makes them female, male, or neuter)?

pattern matching :slight_smile: