Regular expressions, capture repeated groups

I'm trying to emulate something I did in .NET many moons ago: capture a named group, not just once, but with all its repetitions, and then be able to see all those repetitions. I think they call them GroupCollections in C#. This is the kind of code I'm trying to emulate with Ruby (1.9.1):

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main ()
    {

        // Define a regular expression for repeated words.
        Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
          RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Define a test string.
        string text = "The the quick brown fox  fox jumped over the lazy dog dog.";

        // Find matches.
        MatchCollection matches = rx.Matches(text);

        // Report the number of matches found.
        Console.WriteLine("{0} matches found in:\n {1}",
                          matches.Count,
                          text);

        // Report on each match.
        foreach (Match match in matches)
        {
            GroupCollection groups = match.Groups;
            Console.WriteLine("'{0}' repeated at positions {1} and {2}",
                              groups["word"].Value,
                              groups[0].Index,
                              groups[1].Index);
        }

    }
  
}
// The example produces the following output to the console:
// 3 matches found in:
// The the quick brown fox fox jumped over the lazy dog dog.
// 'The' repeated at positions 0 and 4
// 'fox' repeated at positions 20 and 25
// 'dog' repeated at positions 50 and 54

For example, if I had the string "11 12" I could have a regex like
/
(?<first> \d+ ) \s \g<first>
/x
that captured "11" and then the repetition "12" and put them in an array (or some kind of collection) referenced by the name.

I think my attempts to get this to work are better explanations. What I want is the result
#<MatchData "11 12" first:["11", "12"]> or something like it. At the moment all my attempts end with the named capture only keeping the last match it made i.e. 12 with no mention of 11.

I know I could do this a different way, perhaps with split or something, but I'd like to know if it's possible with just regex. I understand the Oniguruma engine is used now but I can't find any good docs for it.

These are my attempts, $ is my prompt.

$ md1 = /
    (?<first> \d+ )
    \s \g<first>
      /x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"

$ md1 = /
    (?<first> \d+ )
    (?: \s \g<first> )?
  /x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"

$ md1 = /
    (?<first> \d+ )
    (?: \s
      (?<second> \g<first> )
    )?
  /x.match( "11 12" )
#<MatchData "11 12" first:"12" second:"12">

$ md1[:first]
"12"

$ md1[:second]
"12"

$ md1 = /
        (?: (?<first> \d+ )\s* )+
      /x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"
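Every attempt above shows the same thing: a named group that matches more than once keeps only its last capture. One possible workaround is a two-pass approach, sketched here under the assumption that the repeated piece can be described by its own small pattern. The names `number`, `list`, and `repetitions` are made up for illustration; only `Regexp#match` and `String#scan` come from Ruby itself.

```ruby
# Sketch only: `number`, `list`, and `repetitions` are made-up names.
number = /\d+/

# One anchored pattern checks the overall shape (digits, then more digits
# separated by whitespace) and captures the whole run under a single name.
list = /\A(?<all>#{number}(?:\s+#{number})*)\z/

# A second pass over the captured text recovers every repetition,
# since the match itself keeps only the last capture of a repeated group.
repetitions = lambda do |text|
  md = list.match(text)
  md ? md[:all].scan(number) : []
end

result = repetitions.call("11 12")
p result  # → ["11", "12"]
```

This is not a single-regex answer, but it keeps the validation and the extraction separate, which may be enough in practice.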

Iain

"The the quick brown fox  fox jumped over the lazy dog dog.".
scan(/((\w+) +\2)/i){|x| puts "#{ x[0] } #{ $~.offset(0)[0]}"}
The the 0
fox  fox 20
dog dog 50
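The one-liner above works because `$~` (also available as `Regexp.last_match`) holds the MatchData for the current match inside the `scan` block. Here is a rough sketch of the same idea with named groups, reporting both positions the way the C# example does. The group name `again` and the shortened test string are assumptions made for the example.

```ruby
text = "The the quick brown fox fox"

# Same idea as the C# pattern; the second occurrence gets its own name
# (`again`, a made-up name) so its position can be read back.
repeated = /\b(?<word>\w+)\s+(?<again>\k<word>)\b/i

hits = []
text.scan(repeated) do
  md = Regexp.last_match  # the MatchData for the current match, same as $~
  hits << [md[:word], md.begin(:word), md.begin(:again)]
end

hits.each do |word, first, second|
  puts "'#{word}' repeated at positions #{first} and #{second}"
end
# 'The' repeated at positions 0 and 4
# 'fox' repeated at positions 20 and 24
```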

···

Thanks for that. That would certainly work to a degree, much better than my current alternative, but it nullifies the usefulness of named captures. For example, I can't call

$ md1[:first]

and get back all the matches for the (?<first> ) grouping, which would be phenomenally useful, because scan returns arrays of strings and not MatchData.
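If the sticking point is that scan hands back strings rather than MatchData, one option is to skip scan and walk the string with `Regexp#match(str, pos)`, which accepts a start position. The following is only a sketch; `all_matches` is a hypothetical helper, not something Ruby provides.

```ruby
# Hypothetical helper: returns every MatchData for `regex` in `text`.
all_matches = lambda do |text, regex|
  found = []
  pos = 0
  while (md = regex.match(text, pos))
    found << md
    # Continue after this match; step one character further on an
    # empty match so the loop always makes progress.
    pos = md.end(0) == md.begin(0) ? md.end(0) + 1 : md.end(0)
  end
  found
end

mds = all_matches.call("11 12 13", /(?<first>\d+)/)
names  = mds.map { |md| md[:first] }        # every capture of the named group
starts = mds.map { |md| md.begin(:first) }  # and where each one begins
p names   # → ["11", "12", "13"]
p starts  # → [0, 3, 6]
```

Each element is a full MatchData, so named access such as `md[:first]` and offsets stay available for every repetition.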

Iain

···

Thanks for that. That would certainly work to a degree, much better than my current alternative, but it nullifies the usefulness of named captures. For example, I can't call

$ md1[:first]

wait till you call the 21st ;)

and get back all the matches for the (?<first> ) grouping, which would be phenomenally useful, because scan returns arrays of strings and not matchdata.

waxman hinted at $~

try eg,

s
#=> "The the quick brown fox fox jumped over the lazy dog dog."
m = []
#=> []
s.scan(/((\w+) +\2)/i){|x| m << $~}
#=> "The the quick brown fox fox jumped over the lazy dog dog."
m.size
#=> 3
m[0]
#=> #<MatchData "The the" 1:"The the" 2:"The">
m[0].offset 0
#=> [0, 7]
m[0].offset

... and so forth.
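Building on the collected `$~` objects, the result shape asked for earlier in the thread (something like first:["11", "12"]) can be assembled by grouping the captures by name. This is a sketch, not a built-in feature; `collected` and `by_name` are made-up variable names, while `MatchData#names` is core Ruby.

```ruby
text = "11 12"
pattern = /(?<first>\d+)/

# Collect one MatchData per match, as in the session above.
collected = []
text.scan(pattern) { collected << $~ }

# Group every capture under its group name, so each name maps to
# an array holding all of its repetitions in order.
by_name = Hash.new { |hash, key| hash[key] = [] }
collected.each do |md|
  md.names.each { |name| by_name[name.to_sym] << md[name] }
end

p by_name[:first]  # → ["11", "12"]
```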

best regards -botp

···

On Fri, Jul 9, 2010 at 12:38 AM, Iain Barnett <iainspeed@gmail.com> wrote:

Ok, I get it now. Thanks for the extra nudge (bang on the head:)

Iain
