Named groups in regexp matches?

Hi,

Does Ruby support regexps that assign names to specific matched groups? In Python, for instance, if you write a regexp like this,

TEMP_RE = re.compile(r"""^(?P<temp>(M|-)?\d+|//|XX|MM)/
                           (?P<dewpt>(M|-)?\d+|//|XX|MM)?\s+""",
                           re.VERBOSE)

the match object will provide a hash with keys 'temp' and 'dewpt', containing the values matched by the corresponding '(?P<temp>...)' and '(?P<dewpt>...)' groups. This is another solution to the problem of creating regexps where you want to use parentheses both for grouping and for substring capture. I know Perl and Ruby support the '(?:...)' syntax to let you use parens for specifying alternatives without capturing that group, but the Python scheme for labeling capture groups produces more readable code, and I've used it heavily in some code I was hoping to port to Ruby. I was hoping that perhaps I'd just overlooked something in the Ruby docs.

Thanks,

Tom

On Tue, Jan 30, 2007 at 03:24:40AM +0900, Tom Pollard wrote:

> Does Ruby support regexps that assign names to specific matched
> groups?
> [...]

Take a look at this thread:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/80270

--Greg


Tom Pollard wrote:

> Does Ruby support regexps that assign names to specific matched groups?

In Ruby 1.9 it works. I wrote some articles with many examples on
http://www.ruby-mine.de (the site may be down for maintenance for the next two
days), especially the following one: http://www.ruby-mine.de?p=130. Unfortunately
it is only available in German at the moment, but the examples are Ruby code and
irb sessions, so they should be understandable without reading the German text.

But Ruby 1.9 is still under development, so details may change in the future.

Some examples:

irb(main):001:0> md="abba".match(/(?<a1>.)(?<a2>.)\k<a2>\k<a1>/)
=> #<MatchData:0x2bf0488>
irb(main):002:0> md[0]
=> "abba"
irb(main):003:0> md[1]
=> "a"
irb(main):004:0> md[2]
=> "b"
irb(main):005:0> md[:a1]
=> "a"
irb(main):006:0> md[:a2]
=> "b"
irb(main):007:0> md['a1']
=> "a"
irb(main):008:0> md['a2']
=> "b"

As this shows, the contents of a matched group are accessible by number, by
name as a symbol, and by name as a string. However, a numbered backreference
such as \2 cannot be used in a regular expression that also contains named
groups:

irb(main):001:0> "abba".match(/(?<a1>.)(.)\2\k<a1>/)
SyntaxError: compile error
(irb):1: numbered backref/call is not allowed. (use name): /(?<a1>.)(.)\2\k<a1>/
         from (irb):1:in `Kernel#binding'
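
For what it's worth, in released Ruby versions (1.9 and later, as far as I know) plain parentheses are still accepted next to named groups; they are simply treated as non-capturing, and it is only the numbered backreference that is rejected. A minimal sketch, assuming Ruby 2.4 or later for "named_captures" (the pattern and the group name "first" are made up for illustration):

```ruby
# Plain parentheses next to a named group: accepted, but not captured.
md = /(?<first>\w)(\w)/.match("ab")

p md.captures        # only the named group shows up here
p md.names           # the group names, in order of appearance
p md.named_captures  # a Hash, similar in spirit to Python's groupdict()
```

This means that labeling the groups you do want is enough; the remaining parentheses do not need to be rewritten as "(?:...)".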

-----

When using "sub", "gsub", "sub!", or "gsub!" without a block, the groups can
only be accessed by name in the replacement string; positional references
return the empty string:

irb(main):001:0> puts 'axbx'.sub(/(?<r>.)x(?<s>.)x/, '\k<s>\k<r>')
ba
=> nil
irb(main):002:0> puts 'axbx'.sub(/(?<r>.)x(?<s>.)x/, '\2\1')

=> nil
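
The same "\k<name>" form is handy for reordering fields. A small sketch (the date string, the constant DATE_RE and the group names are made up for illustration); note the single-quoted replacement string, so the backslashes reach "sub" unchanged:

```ruby
# Reorder an ISO date into day.month.year using named references
# in the replacement string.
DATE_RE = /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/

puts "2007-01-29".sub(DATE_RE, '\k<day>.\k<month>.\k<year>')
# prints 29.01.2007
```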

-----

Inside a block, direct access to the group names is not possible; at least I
have not found a way to do it directly. The positional variables "$1" etc. can
be used. Another possibility is to use the MatchData object "$~" inside the
block, which offers the same kinds of access as described for "match":

irb(main):001:0> 'axbx'.sub(/(?<i>.)x(?<j>.)x/){|k|p k;p $1;p $2;'u'+$2}
"axbx"
"a"
"b"
=> "ub"
irb(main):002:0> 'axbxcxdx'.gsub(/(?<i>.)x(?<j>.)x/){|k|p k;p $1;p $2;'u'+$2}
"axbx"
"a"
"b"
"cxdx"
"c"
"d"
=> "ubud"

and using the MatchData object:

irb(main):003:0> 'axbxcxdx'.gsub(/(?<i>.)x(?<j>.)x/){|k|p $~[:i]}
"a"
"c"
=> ""
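
Building on this, the block can return a replacement assembled from the named groups via "$~" (or the equivalent "Regexp.last_match"). A minimal sketch using the same pattern as above; the variable names are only for illustration:

```ruby
# Swap the two named groups in every match, reading them from $~.
swapped = 'axbxcxdx'.gsub(/(?<i>.)x(?<j>.)x/) { $~[:j] + $~[:i] }
puts swapped
# prints badc

# Regexp.last_match is the same MatchData object under a longer name.
same = 'axbxcxdx'.gsub(/(?<i>.)x(?<j>.)x/) do
  m = Regexp.last_match
  m[:j] + m[:i]
end
puts same == swapped
# prints true
```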

------

In some special situations, Oniguruma in Ruby 1.9 allows solutions that are
not as simple to express in Ruby 1.8.

Ruby 1.8:

irb(main):001:0> "rasbuavb".scan(/(.)a|(.)b/){|i|p i}
["r", nil]
[nil, "s"]
["u", nil]
[nil, "v"]
=> "rasbuavb"

Ruby 1.9:

irb(main):002:0> "rasbuavb".scan(/(.){0}\g<1>a|\g<1>b/){|i|p i}
["r"]
["s"]
["u"]
["v"]
=> "rasbuavb"

The key here is not a named group but the ability to call a subexpression.
It is a very powerful feature that allows recursive constructs. In the article
I built a pocket calculator as an example, but it may also be useful for
checking complex input fields in a GUI, or even later on in Rails:

pattern = / (?<e>\g<t>\+\g<e>|\g<t>-\g<e>|\g<t>){0}
             (?<t>|\g<f>\*\g<t>|\g<f>\/\g<t>|\g<f>){0}
             (?<f>[-+]?\g<id>|\(\g<e>\)){0}
             (?<id>\g<n>|\g<v>){0}
             (?<n>[a-zA-Z_]\w*){0}
             (?<v>\d+(\.\d+)?){0}
             ^((?<var>\g<n>)=)?(?<expr>\g<e>)$
           /x

vars = Hash.new(0)
basbind = binding

# print 'input> ' # for interactive usage
while (!(inp = DATA.gets).chomp.match(/^quit$/i))
   if (md = inp.chomp.gsub(/\s+/,'').match(pattern))
     expr = md[:expr].gsub(/([a-zA-Z_]\w*)/, 'vars["\1"]')
     erg = eval(expr, basbind)
     vars[md[:var]] = erg if md[:var]
     puts "#{inp.chomp}, result> #{(md[:var])?(md[:var]+'='):''}#{erg}"
   else
     puts "+++++ incorrect input: '#{inp.chomp}'"
   end
# print 'input> ' # for interactive usage
end
puts '***** variables *****'
vars.keys.sort.each{|v|puts "#{v}=#{vars[v]}"}
puts '******* End ********'
__END__
30+12
a = 30 + 12
b = 2*a
c = -(a*a+5)
d = (6+5*a)*c
quit

results in:

30+12, result> 42
a = 30 + 12, result> a=42
b = 2*a, result> b=84
c = -(a*a+5), result> c=-1769
d = (6+5*a)*c, result> d=-382104
***** variables *****
a=42
b=84
c=-1769
d=-382104
******* End ********
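
The subexpression call on its own can be seen in a much smaller case than the calculator. A sketch, assuming Ruby 2.4 or later for "match?"; the constant BALANCED and the group name "paren" are made up for illustration. The group calls itself with "\g<paren>", so the pattern accepts arbitrarily deep nesting:

```ruby
# Matches a string that is one balanced, possibly nested, parenthesized group.
BALANCED = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

puts BALANCED.match?("(a(b)c)")   # prints true
puts BALANCED.match?("(a(b)c")    # prints false
puts BALANCED.match?("((()))")    # prints true
```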

-----

Summary: in the near future you will have a lot of powerful new features in
Ruby's pattern matching facilities.

Wolfgang Nádasi-Donner

On Jan 29, 2007, at 1:28 PM, Gregory Seidman wrote:

> Take a look at this thread:
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/80270

Thanks very much for the quick response. It sounds like the answer is that Ruby does not support named captures, but that the Oniguruma library supplies this feature. I think it would be a nice feature to add in 1.9. My experience is that this is very useful (if not necessary) for composing non-trivial regexps. Without them, it's just too easy to mess up the capture-group numbers when adding or removing parenthesized subexpressions in your regexp.

Tom

On Jan 29, 2007, at 5:25 PM, Wolfgang Nádasi-Donner wrote:

> Summary - in the near future you will have a lot of powerful new features in
> Ruby's pattern matching facilities.

Thanks very much for that report! Now I'll just have to decide whether to wait until 1.9 rolls out, or find some other way to port my code in the meantime.

Tom

On Jan 29, 10:14 pm, Wolfgang Nádasi-Donner <won...@donnerweb.de> wrote:

> irb(main):001:0> md="abba".match(/(?<a1>.)(?<a2>.)\k<a2>\k<a1>/)
> [...]

Hi Wolfgang,
I was going to ask why you did not use the (?P<name>...) syntax
used in Python, but found that, according to
http://www.amk.ca/python/howto/regex/regex.html#SECTION000530000000000000000,
the P is for Python extensions.

But then, if the Ruby extension is much like the Python one but is not
seen as the canonical implementation of named groups, maybe it should
be (?R<name>...), showing that this is a Ruby-specific extension?

Thanks,
- Paddy.

Paddy3118 wrote:

> I was going to ask why you did not use the syntax of (?P<name>...) as
> used in Python, but found that, according to
> http://www.amk.ca/python/howto/regex/regex.html#SECTION000530000000000000000,
> the P is for Python extensions.

It's not that easy. The regular expression engine used in Ruby 1.9 is not an integral part of Ruby. It is a stand-alone regular expression engine called "Oniguruma".

Oniguruma currently exists in three variants: "2.x.y" can be used with Ruby 1.6 and 1.8, although it is not the standard engine of Ruby 1.6/1.8; "4.x.y" will be used in Ruby 1.9 and later; and "5.x.y" is not related to Ruby.

The syntax of the regular expressions is not defined by Ruby; it is defined by Oniguruma.

Wolfgang Nádasi-Donner

What I do might be enough for your purpose

irb(main):001:0> S=Struct.new(:key, :value)
=> S
irb(main):002:0> r=%r{(\w+)\s*=\s*(.*)}
=> /(\w+)\s*=\s*(.*)/
irb(main):003:0> m= r.match("name = Tom Pollard")
=> #<MatchData:0xb7dfbb5c>
irb(main):004:0> s=S.new(*m.captures)
=> #<struct S key="name", value="Tom Pollard">
irb(main):005:0> s.key
=> "name"
irb(main):006:0> s.value
=> "Tom Pollard"

this could easily be wrapped into a class BTW.
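
One possible shape for such a wrapper, as a rough sketch only: the class name LabeledRegexp and its interface are hypothetical, not something from Ruby's standard library. It pairs a plain regexp with labels for its capture groups and returns a Struct per match, so it also works without named groups in the engine:

```ruby
# Hypothetical helper: pairs a plain regexp with labels for its capture groups.
class LabeledRegexp
  def initialize(regexp, *labels)
    @regexp = regexp
    @result = Struct.new(*labels)   # one member per capture group
  end

  # Returns a Struct with one member per label, or nil when there is no match.
  def match(string)
    md = @regexp.match(string)
    md && @result.new(*md.captures)
  end
end

KEY_VALUE = LabeledRegexp.new(/(\w+)\s*=\s*(.*)/, :key, :value)
pair = KEY_VALUE.match("name = Tom Pollard")
puts pair.key    # prints name
puts pair.value  # prints Tom Pollard
```

The labels still have to be kept in step with the capture groups by hand, so grouping-only parentheses in the regexp need to be written as "(?:...)".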

HTH
Robert

--
We have not succeeded in answering all of our questions.
In fact, in some ways, we are more confused than ever.
But we feel we are confused on a higher level and about more important
things.
-Anonymous

Thanks. That's not a bad idea, but it only addresses half of my problem, because I still need to be careful to use non-capturing groups for the things I don't want to capture. In Python, I can ignore that - labeling the groups I /do/ want to capture is enough. Here are a few examples from my Python code:

WIND_RE = re.compile(r"""^(?P<dir>[\dO]{3}|[0O]|///|MMM|VRB)
                           (?P<speed>P?[\dO]{2,3}|[0O]+|[/M]{2,3})
                         (G(?P<gust>P?(\d{1,3}|[/M]{1,3})))?
                           (?P<units>KTS?|LT|K|T|KMH|MPS)?
                       (\s+(?P<varfrom>\d\d\d)V
                           (?P<varto>\d\d\d))?\s+""",
                           re.VERBOSE)
VISIBILITY_RE = re.compile(r"""^(?P<vis>(?P<dist>M?(\d\s+)?\d/\d\d?|M?\d+)
                                      ( \s*(?P<units>SM|KM|M|U) |
                                           (?P<dir>[NSEW][EW]?) )? |
                                    CAVOK )\s+""",
                                    re.VERBOSE)
RUNWAY_RE = re.compile(r"""^(RVRNO |
                              R(?P<name>\d\d(RR?|LL?|C)?)/
                               (?P<low>(M|P)?\d\d\d\d)
                             (V(?P<high>(M|P)?\d\d\d\d))?
                               (?P<unit>FT)?[/NDU]*)\s+""",
                               re.VERBOSE)
TEMP_RE = re.compile(r"""^(?P<temp>(M|-)?\d+|//|XX|MM)/
                           (?P<dewpt>(M|-)?\d+|//|XX|MM)?\s+""",
                           re.VERBOSE)

To port these to pre-1.9 Ruby, I'll need to remove the '?P<name>' labels and change the other groups from '(...)' to '(?:...)'. Once I've done that I can worry about assigning labels to the captured groups, after they're matched, or just using the index captures. There's nothing about that that's not straightforward; I'm mostly struggling with my motivation for going through this effort at all, simply to port a working, well-debugged and fairly fast Python module to Ruby, especially since I'm fairly sure the resulting Ruby module will be much slower and harder to maintain.

Tom
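
For illustration, here is how the TEMP_RE port described above might look both ways. This is only a sketch: the sample input "M05/M10 " and the constant name TEMP_RE_18 are made up, and the second pattern relies on released Ruby 1.9+ treating plain parentheses as non-capturing once a pattern contains named groups.

```ruby
# Pre-1.9 style: grouping-only parens become (?:...), and the captures
# are read by position.
TEMP_RE_18 = %r{^((?:M|-)?\d+|//|XX|MM)/
                 ((?:M|-)?\d+|//|XX|MM)?\s+}x

temp, dewpt = TEMP_RE_18.match("M05/M10 ").captures
puts "#{temp} #{dewpt}"             # prints M05 M10

# Ruby 1.9+ style: drop the "P" from (?P<name>...) and keep the rest.
TEMP_RE = %r{^(?<temp>(M|-)?\d+|//|XX|MM)/
              (?<dewpt>(M|-)?\d+|//|XX|MM)?\s+}x

md = TEMP_RE.match("M05/M10 ")
puts "#{md[:temp]} #{md[:dewpt]}"   # prints M05 M10
```

The "%r{...}x" form keeps the slashes unescaped and ignores the layout whitespace, much like re.VERBOSE on the Python side.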
