String#scan strangeness

Hi there,

[linux.gfbs:281]gfb> ruby -v
ruby 1.6.8 (2003-10-15) [i686-linux]
[linux.gfbs:282]gfb> irb
irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{((\S+\s+){2,2})}
=> [["a b ", "b "], ["c d ", "d "]]
irb(main):003:0>

I am just wondering why String#scan "loses" a group in every match. I would expect the following result:

=> [["a b ", "a ", "b "], ["c d ", "c ", "d "]]

or even

=> [["a b ", ["a ", "b "]], ["c d ", ["c ", "d "]]]

Where am I wrong in my expectations?

Thank you,
Gennady.

P.S.
   It works the same way in Ruby 1.8.0 as well.

When the pattern contains sub-captures (groups), #scan returns an array of the sub-captures for each match.
This does not include capture[0], which is the full match.

"abcd".scan(/(.)(.)/)
#=> [["a", "b"], ["c", "d"]]

When the pattern contains no sub-captures at all, #scan returns only the full matches.

"abcd".scan(/../)
#=> ["ab", "cd"]
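To make the two modes concrete, here is a minimal sketch. It also shows one way to get the full match together with the groups, by reading the MatchData of the current match inside the block form of #scan. The variable names are only illustrative.

```ruby
# With groups: one array of group captures per match; the full match is omitted.
with_groups = "abcd".scan(/(.)(.)/)

# Without groups: one string (the full match) per match.
without_groups = "abcd".scan(/../)

# Block form: Regexp.last_match holds the MatchData for the current match,
# so m[0] (the full match) is available next to the captures.
full_and_groups = []
"abcd".scan(/(.)(.)/) do
  m = Regexp.last_match
  full_and_groups << [m[0], *m.captures]
end

p with_groups      # [["a", "b"], ["c", "d"]]
p without_groups   # ["ab", "cd"]
p full_and_groups  # [["ab", "a", "b"], ["cd", "c", "d"]]
```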

···

On Thursday 10 June 2004 18:49, Gennady wrote:

I am just wondering why String#scan "loses" a group in every match. I
would expect the following result:

--
Simon Strandgaard

Simon Strandgaard wrote:

I am just wondering why String#scan "loses" a group in every match. I
would expect the following result:

when using sub-captures, then #scan returns an array of sub-captures.
This does not include capture[0].. which is the full-match.

"abcd".scan(/(.)(.)/)
#=> [["a", "b"], ["c", "d"]]

when not using sub-captures at all, then #scan returns only full-matches.

"abcd".scan(/../)
#=> ["ab", "cd"]

--
Simon Strandgaard

In my original irb session I do have sub-captures; moreover, they are nested:

[linux.gfbs:281]gfb> ruby -v
ruby 1.6.8 (2003-10-15) [i686-linux]
[linux.gfbs:282]gfb> irb
irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{((\S+\s+){2,2})}
ACTUAL => [["a b ", "b "], ["c d ", "d "]]
I EXPECT => [["a b ", "a ", "b "], ["c d ", "c ", "d "]]

irb(main):003:0>

···

On Thursday 10 June 2004 18:49, Gennady wrote:

I EXPECT => [["a b ", "a ", "b "], ["c d ", "c ", "d "]]
                      ^^^^              ^^^^
                      these are not sub-captures
                      and are thus not being captured.
                      You need parentheses in order to capture them.

···

On Friday 11 June 2004 00:43, Gennady wrote:

irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{((\S+\s+){2,2})}
ACTUAL => [["a b ", "b "], ["c d ", "d "]]
I EXPECT => [["a b ", "a ", "b "], ["c d ", "c ", "d "]]

--
Simon Strandgaard

Hi --

Simon Strandgaard wrote:

>
>>I am just wondering why String#scan "loses" a group in every match. I
>>would expect the following result:
>
>
> when using sub-captures, then #scan returns an array of sub-captures.
> This does not include capture[0].. which is the full-match.
>
> "abcd".scan(/(.)(.)/)
> #=> [["a", "b"], ["c", "d"]]
>
> when not using sub-captures at all, then #scan returns only full-matches.
>
> "abcd".scan(/../)
> #=> ["ab", "cd"]
>
> --
> Simon Strandgaard
>

In my original irb session capture I have sub-captures, moreover they
are nested:

[linux.gfbs:281]gfb> ruby -v
ruby 1.6.8 (2003-10-15) [i686-linux]
[linux.gfbs:282]gfb> irb
irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{((\S+\s+){2,2})}
ACTUAL => [["a b ", "b "], ["c d ", "d "]]
I EXPECT => [["a b ", "a ", "b "], ["c d ", "c ", "d "]]

My understanding is: you've only got two sets of parentheses, so you
can have at most two captures; in other words, (){2} != ()() :-) It's
purely positional: whatever is in the nth set of parentheses from the
left when the matching stops is the nth capture.

It's as if each () is a window which can move through the string but
can only hold one substring. So the second set of () sort of moves
from left to right:

   (("a ")....)
   ("a "("b ")) # match completed

   Result: $1 == "a b "
            $2 == "b "

David
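The "one window per set of parentheses" rule can be checked directly with Regexp#match. This is only a small sketch on the string from the thread: a repeated group keeps just its last iteration, while writing the group out twice gives each repetition its own capture.

```ruby
# Two sets of parentheses, so exactly two captures, even though the inner
# group matched twice. The inner group reports only its last iteration.
m = /((\S+\s+){2})/.match("a b c d ")
outer = m[1]
inner = m[2]
count = m.captures.length

p outer  # "a b "
p inner  # "b "
p count  # 2

# Three sets of parentheses: each repetition now has its own slot.
spelled_out = "a b c d ".scan(/((\S+\s+)(\S+\s+))/)
p spelled_out  # [["a b ", "a ", "b "], ["c d ", "c ", "d "]]
```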

···

On Fri, 11 Jun 2004, Gennady wrote:

> On Thursday 10 June 2004 18:49, Gennady wrote:

--
David A. Black
dblack@wobblini.net

Simon Strandgaard wrote:

irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{((\S+\s+){2,2})}

                                ^^^^^^^
                               ^^^^^^^^^^^^^^
                               These are sub-captures

ACTUAL => [["a b ", "b "], ["c d ", "d "]]
I EXPECT => [["a b ", "a ", "b "], ["c d ", "c ", "d "]]

               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               And this is scan's result presented by irb

···

On Friday 11 June 2004 00:43, Gennady wrote:

                        ^^^^ ^^^^
                        these are not subcaptures
                        and are thus not being captured.
                        you need parentheses in order to capture them

--
Simon Strandgaard

David A. Black wrote:

Hi --

Simon Strandgaard wrote:

I am just wondering why String#scan "loses" a group in every match. I
would expect the following result:

when using sub-captures, then #scan returns an array of sub-captures.
This does not include capture[0].. which is the full-match.

"abcd".scan(/(.)(.)/)
#=> [["a", "b"], ["c", "d"]]

when not using sub-captures at all, then #scan returns only full-matches.

"abcd".scan(/../)
#=> ["ab", "cd"]

--
Simon Strandgaard

In my original irb session capture I have sub-captures, moreover they are nested:

[linux.gfbs:281]gfb> ruby -v
ruby 1.6.8 (2003-10-15) [i686-linux]
[linux.gfbs:282]gfb> irb
irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{((\S+\s+){2,2})}
ACTUAL => [["a b ", "b "], ["c d ", "d "]]
I EXPECT => [["a b ", "a ", "b "], ["c d ", "c ", "d "]]

My understanding is: you've only got two sets of parentheses, so you
can have at most two captures; in other words, (){2} != ()() :-) It's
purely positional: whatever is in the nth set of parentheses from the
left when the matching stops is the nth capture.

It's as if each () is a window which can move through the string but
can only hold one substring. So the second set of () sort of moves
from left to right:

   (("a ")....)
   ("a "("b ")) # match completed

   Result: $1 == "a b "
            $2 == "b "

David

Thanks, David. It looks like this is the case. Actually, I solved my problem by using the following regexp instead:

[linux.gfbs:71]gfb-ems-session_1> irb
irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{#{'(\S+\s+)' * 2}}
=> [["a ", "b "], ["c ", "d "]]
irb(main):003:0>

(My actual regexp is much bigger, I just used a simplified form for an example)

Gennady.
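For reference, the workaround above can be written out as a short script. The second half is an alternative sketch, not something proposed in the thread: it matches each pair with a non-capturing group and then splits the pair with a second #scan, which yields the nested shape from the original question. The names pattern, flat and nested are only illustrative.

```ruby
a = "a b c d "

# Repeat the group's source text so that each word gets its own group.
pattern = Regexp.new('(\S+\s+)' * 2)
flat = a.scan(pattern)

# Alternative: match each pair without capturing, then split every pair
# into its pieces with a second scan.
nested = a.scan(/(?:\S+\s+){2}/).map { |pair| [pair, pair.scan(/\S+\s+/)] }

p flat    # [["a ", "b "], ["c ", "d "]]
p nested  # [["a b ", ["a ", "b "]], ["c d ", ["c ", "d "]]]
```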
