String#split and groups in the field separator RE

Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

mortee

mortee wrote:

Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

It was unexpected behavior for me when I ran into it using python's
regex split() function a few months ago. Since it works the same way in
both languages, I would guess it might be a universal regex trait.

--
Posted via http://www.ruby-forum.com/.

Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"]

Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing it into the split
array, use a non-capturing group - ie, /(?=,)/ rather than /(,)/

It is curious that it's not in the api doc... I must have learnt it from
somewhere...

Dan.

mortee wrote:

Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

I guess I should mention that the rule I jotted down in the margin of my
book is: if the split() pattern has parenthesized sub groupings, the
result array will include the match for each subgroup--but not the whole
match.

Applying that rule to your examples:

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?

The subgroup (:) matches a single colon, so those matches are included
in the results.

irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???

The subgroup (:) matches one colon and those results are included. The
subgroup ((:)+) matches two, three, and four colons as it traverses the
string and those results are included. Because groups are numbered by
their leftmost parentheses, the outer grouping comes first in the list.
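
A quick way to check the numbering is to look at the match for the first separator on its own. This is only a sketch; the variable names are mine and not from the original post:

```ruby
s = 'a::b:::c::::d'

# First separator match: group 1 is the outer ((:)+), group 2 the inner (:).
m = s.match(/((:)+)/)
outer = m[1]   # the whole run of colons
inner = m[2]   # the last colon captured by the repeated inner group

# split adds the groups in that same order after each field.
parts = s.split(/((:)+)/)

p outer
p inner
p parts.first(3)
```

The first three elements of the split result should line up with the field, then group 1, then group 2.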

irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

The subgroup (:+) matches two, three, and four colons as it traverses
the string, and those matches are included in the results.

And, here is an example of my own that shows that the whole match is not
included in the results--only the parenthesized sub groupings are
included:

str = 'a_::_b_:::_c_::::_d'
pattern = /_(:+)_/

results = str.split(pattern)
p results

--output:--
["a", "::", "b", ":::", "c", "::::", "d"]
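
If you do want the whole separator back, one option that follows from the same rule is to wrap the entire pattern in a group. A small sketch with the same string (the names without and with_whole are just for illustration):

```ruby
str = 'a_::_b_:::_c_::::_d'

# Only the inner group is captured, so the underscores are dropped.
without = str.split(/_(:+)_/)

# The whole pattern is one group, so the full separator comes back.
with_whole = str.split(/(_:+_)/)

p without
p with_whole
```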


mortee wrote:

Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

$ ri String#split
...
...
     if _pattern_ is a +Regexp+, _str_ is divided where the pattern
     matches. Whenever the pattern matches a zero-length string, _str_
     is split into individual characters.
...

pickaxe2, p. 619 adds a line to the end of that description:

...
If pattern includes groups, these groups will be included in the
returned values.


Daniel Sheppard wrote:

Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing it into the split
array, use a non-capturing group - ie, /(?=,)/ rather than /(,)/

It is curious that it's not in the api doc... I must have learnt it from
somewhere...

7stud -- wrote:

I guess I should mention that the rule I jotted down in the margin of my
book is: if the split() pattern has parenthesized sub groupings, the
result array will include the match for each subgroup--but not the whole
match.

Applying that rule to your examples:

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?

The subgroup (:) matches a single colon, so those matches are included
in the results.

[...]

Thanks, that clarifies it, and the results make sense based on the rule.
However, I find it quite confusing to have parts of what I intend to be
part of the "separator" among the list of results. To say the least.

mortee

Daniel Sheppard wrote:

Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"]

Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing it into the split
array, use a non-capturing group - ie, /(?=,)/ rather than /(,)/

/(?=,)/ is a lookahead match. I'm sure you really meant /(?:,)/
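
To make the difference concrete, here is a small sketch using the string from the original post instead of commas. The three variable names are mine; the patterns are the capturing form, the non-capturing (?:...) form, and the lookahead (?=...) form:

```ruby
s = 'a::b:::c::::d'

capturing     = s.split(/(:)+/)    # group text is added to the result
non_capturing = s.split(/(?::)+/)  # grouping only, nothing extra is added
lookahead     = s.split(/(?=:)/)   # zero-width: matches before a colon, consumes nothing

p capturing
p non_capturing
p lookahead
```

Since the lookahead never consumes the colons, they stay attached to the pieces, which is probably not what you want for a field separator.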