Clearly the a+ group must match twice to match the string from ^ to $,
but only a single match is returned.
But the regular expression you're passing is anchored, so the entire
regexp is only matched once, and it only contains one capturing group.
Perhaps this is clearer:
"abcd".scan /^a(b)(c)d$/
=> [["b", "c"]]
"abcd".scan /^a(?:(b|c)+)d$/
=> [["c"]]
In both cases the result is an array containing a single element,
because the regexp was matched exactly once.
The first gives [$1,$2] because there are two capture groups in its
regexp.
The second gives only [$1] because there is a single capture group. It
happens to have matched multiple times, but you get only the last value
for $1.
If multiple values were inserted into the result, then you wouldn't know
if ["foo","bar","baz"] came from [$1,$2,$3] or [$1,$1,$2] or [$1,$1,$1]
or [$1,$2,$2].
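A short script makes the difference visible. This is only a sketch: the
unanchored pattern /(b|c)/ is made up for illustration and does not
appear in the original question.

```ruby
# Anchored pattern: the whole regexp matches once, so scan yields one row.
anchored = "abcd".scan(/^a(?:(b|c)+)d$/)
p anchored    # one row, holding only the last value of $1

# Unanchored pattern: scan restarts after each match, one row per match.
unanchored = "abcd".scan(/(b|c)/)
p unanchored

# MatchData tells the same story: one slot per group, not per iteration.
md = /^a(?:(b|c)+)d$/.match("abcd")
p md.captures
```

The number of rows follows the number of times the whole regexp matched;
the width of each row follows the number of capture groups.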
Actually they are allowed, otherwise I would not get a match at all.
Note also that I have manually unnested them in the example. The
problem is that repeated matches of the group are not returned.
Thanks
Michal
···
On 6 February 2010 19:57, Ralf Mueller <ralf.mueller@zmaw.de> wrote:
Michal Suchanek wrote:
Hello
I tried scanning for multiple occurrences of a group in a string and
match/scan would return only one.
As you can see from this 1.9.1 test, it is the *last* match. I cannot
provide an official rationale for this, but one likely reason is that the
memory overhead for storing an arbitrary number of matches per group can
be significant. Also, the number of groups is known when a regular
expression is compiled, while the number of matches of each group is only
known at match time. Storing a single capture per group is therefore
easier, because the memory can be allocated when the regular expression
is compiled. Please also note that all regular expression engines I know
handle it that way, i.e. you get at most one capture per group.
In those cases I usually employ a two level approach:
irb(main):015:0> s = "ajabcaabck"
=> "ajabcaabck"
irb(main):016:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
irb(main):017:1> $1.scan(/b*(a+)b+c*/){|m| p m, $1}
irb(main):018:1> end
["a"]
"a"
["aa"]
"aa"
=> "abcaabc"
irb(main):019:0>
Because of the way #scan works, we can do:
irb(main):022:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
irb(main):023:1> $1.scan(/b*(a+)b+c*/){|m| p m}
irb(main):024:1> end
["a"]
["aa"]
=> "abcaabc"
irb(main):025:0>
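The same two-level idea can be wrapped in a method. This is a minimal
sketch, not code from the thread: the name a_runs is hypothetical, and
the two patterns are the ones from the irb session above.

```ruby
# Hypothetical helper: returns every run of "a" found inside the repeated
# part of the string, or nil when the outer (anchored) pattern fails.
def a_runs(s)
  # Level one: the anchored pattern validates the whole string and
  # captures the repeated section as a single group.
  outer = /^a*j((?:b*a+b+c*)+)k$/.match(s)
  return nil unless outer

  # Level two: an unanchored pattern with one group. scan returns one
  # single-element array per match, so flatten gives a flat list.
  outer[1].scan(/b*(a+)b+c*/).flatten
end

p a_runs("ajabcaabck")
p a_runs("nomatch")
```

Using MatchData instead of $1 keeps the helper free of global state, which
matters if the block passed to scan runs further matches of its own.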
Even so, I still think that there is a bug in your regex. I can't
find it, but I tried the same regular expression in Perl and in Reggy,
a regex tool for OS X (http://reggyapp.com/). In both cases it only
matched the one a.
Ben
···
On Sat, Feb 6, 2010 at 11:47 AM, Michal Suchanek <hramrach@centrum.cz> wrote:
Actually they are allowed, otherwise I would not get a match at all.
Note also that I have manually unnested them in the example. The
problem is that repeated matches of the group are not returned.
Thanks for the explanations. As mentioned on the page and also
explained in Brian's reply, this is a design limitation of the return
value of the match method. It could return the additional matches, but
then the return value would have to be structured differently than it
is now for the result to make sense. As scan most likely uses match
internally, or at least returns results consistent with match, it shares
the limitation.
So something like split has to be used to slice the string into pieces
where either a shorter non-anchored regex can match repeatedly or only
one match can be found.
The case which causes problems and is not actually well captured by
the example is something like
ab=cd,ef, ...
where the regexes for 'ab', 'cd' and the rest are slightly different,
and so is the interpretation.
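A rough sketch of the split approach for a line of that shape follows.
The thread does not give the real patterns, so KEY_RE, FIRST_RE, REST_RE
and parse_line are all hypothetical placeholders.

```ruby
# Hypothetical patterns; the real ones are not given in the thread.
KEY_RE   = /\A[a-z]+\z/
FIRST_RE = /\A[a-z]+\z/
REST_RE  = /\A[a-z0-9]+\z/

# Returns [key, first, rest] or nil if any piece fails its own pattern.
def parse_line(line)
  # Slice at the first "=" only, so later "=" signs stay in the value part.
  key, values = line.split("=", 2)
  return nil if values.nil?

  # Slice the value part on commas; each piece is then matched exactly once.
  first, *rest = values.split(",").map(&:strip)
  return nil unless KEY_RE.match?(key) && FIRST_RE.match?(first.to_s)
  return nil unless rest.all? { |v| REST_RE.match?(v) }

  [key, first, rest]
end

p parse_line("ab=cd,ef, gh1")
p parse_line("no separator here")
```

Each piece is checked by a short anchored regexp that has to match only
once, so no repeated capture group is needed.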
Thanks
Michal
···
On 6 February 2010 21:47, Rick DeNatale <rick.denatale@gmail.com> wrote:
On Sat, Feb 6, 2010 at 3:23 PM, Brian Candler <b.candler@pobox.com> wrote:
Sorry, I mixed up grouping and capturing. Concerning grouping, regexp acts like a language, but not concerning capturing, and for this reason you have to use that two-level trick. Nested capturing would lead to a tree of results with bad performance, I guess.
Actually, nested capturing is also supported as you can see from the
examples here. What is not supported is returning multiple matches for
a group that matches multiple times.
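To illustrate the distinction, here is a small sketch. The patterns and
strings are made up for this example and are not the ones from the
original question.

```ruby
# Nested groups are numbered by the position of their opening parenthesis,
# and each one gets its own slot in the result.
md = /^(a(b(c)))d$/.match("abcd")
p md.captures     # outer, middle, inner

# A repeated group still has a single slot: each iteration overwrites it,
# so only the values from the last iteration survive.
rep = /^(?:(\w)(\d))+$/.match("a1b2c3")
p rep.captures
```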
Thanks
Michal
···
On 9 February 2010 11:49, Ralf Mueller <ralf.mueller@zmaw.de> wrote:
Robert Klemme wrote:
On 02/06/2010 07:57 PM, Ralf Mueller wrote:
Michal Suchanek wrote:
Hello
I tried scanning for multiple occurrences of a group in a string and
match/scan would return only one.
Sorry, I mixed up grouping and capturing. Concerning grouping, regexp acts
like a language, but not concerning capturing, and for this reason you have
to use that two-level trick. Nested capturing would lead to a tree of results
with bad performance, I guess.
Are you sure it matches multiple times? As I mentioned earlier in the
thread, I can't get it to do so.
Ben
···
On Tue, Feb 9, 2010 at 9:36 AM, Michal Suchanek <hramrach@centrum.cz> wrote:
Actually, nested capturing is also supported as you can see from the
examples here. What is not supported is returning multiple matches for
a group that matches multiple times.
On 9 February 2010 19:27, Ben Bleything <ben@bleything.net> wrote:
Are you sure it matches multiple times? As I mentioned earlier in the
thread, I can't get it to do so.