String.split

Tom_Danielsen · 13 July 2004 23:19

While it works as documented in Pickaxe ( If pattern is omitted, the
value of $; is used. If $; is nil (which is the default), str is split
on whitespace as if ` ' were specified. ) I do find this behaviour
somewhat surprising:

    irb(main):004:0> "a b".split(" ")
    => ["a", "b"]
    irb(main):005:0> "a\tb".split(" ")
    => ["a", "b"]
    irb(main):006:0> "a b".split(/ /)
    => ["a", "b"]
    irb(main):007:0> "a\tb".split(/ /)
    => ["a\tb"]
    irb(main):008:0>

I think "a\tb".split(" ") => ["a", "b"] is quite counterintuitive...

% ruby -v
ruby 1.8.1 (2003-12-25) [i386-freebsd5.1]
%

regards,
Tom

Lloyd_Zusman · 13 July 2004 23:41

Tom Danielsen <tom@mnemonic.no> writes:

> While it works as documented in Pickaxe ( If pattern is omitted, the
> value of $; is used. If $; is nil (which is the default), str is split
> on whitespace as if ` ' were specified. ) I do find this behaviour
> somewhat surprising:
>
>     irb(main):004:0> "a b".split(" ")
>     => ["a", "b"]
>     irb(main):005:0> "a\tb".split(" ")
>     => ["a", "b"]
>     irb(main):006:0> "a b".split(/ /)
>     => ["a", "b"]
>     irb(main):007:0> "a\tb".split(/ /)
>     => ["a\tb"]
>     irb(main):008:0>
>
> I think "a\tb".split(" ") => ["a", "b"] is quite counterintuitive...

This case follows the convention in Perl, where a split pattern of " "
(one explicit space, not in the form of a regexp) is a special case
which means to split on any occurrence of one or more whitespace
characters, ignoring any leading whitespace.

We also have this:

  irb(main):001:0> " a b".split(" ")
  => ["a", "b"]
  irb(main):002:0> " a b".split(/ /)
  => ["", "a", "b"]
  irb(main):003:0> "\ta\tb".split(" ")
  => ["a", "b"]
  irb(main):004:0> "\ta\tb".split(/\t/)
  => ["", "a", "b"]

It's a common occurrence to want to split lines that have fields
separated by arbitrary whitespace characters, and to ignore any leading
whitespace. This usage of split() does that quite nicely.

This convention was almost certainly adopted deliberately, in order to
be consistent with some of the semantics of Perl's split() function.
Although it may seem counter-intuitive to people without prior Perl
experience, it's a very familiar construct for those who have been
working in Perl for a long time.

Just think of split(" ") as a special case which performs a very useful
function.
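
A typical use of this special case is pulling fields out of column-aligned text. The sketch below uses made-up sample lines (the `lines` array and its contents are purely illustrative, not taken from the thread) and relies only on `String#split` with a single-space string.

```ruby
# Hypothetical sample data: fields separated by runs of spaces and tabs,
# some lines with leading whitespace.
lines = [
  "  alice   42\tadmin",
  "bob\t7   guest",
  "\tcarol 13 staff"
]

# split(" ") splits on runs of whitespace and ignores leading whitespace,
# so every sample line yields exactly three fields.
rows = lines.map { |line| line.split(" ") }

rows.each { |name, count, role| puts "#{name}: #{count} (#{role})" }
```

With `split(/ /)` or `split(/\s/)` the same lines would produce empty fields wherever two separators are adjacent, so the field positions would no longer line up.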

--
Lloyd Zusman
ljz@asfast.com
God bless you.

Gavin_Kistner2 · 14 July 2004 03:16

Ick.

Not at your summary, Lloyd, but at this situation. This is...stupid.
I don't know what else to call it.

It's a non-sensical idiom, sure to bite more than a few people. It's like Ruby implemented the behavior of a bug that Perl people have gotten used to relying on.

What possible benefit is there to typing split(" ") vs. split(/\s/)? One saved character (but two shift key presses!)?

It is counter-intuitive to people without prior Perl experience. Now that Ruby is taking off in its own right, does Ruby need to continue supporting gross global $ vars, this, and other ugly Perl-isms just to try and make Ruby feel more like Perl?

On Jul 13, 2004, at 5:41 PM, Lloyd Zusman wrote:

> This convention was almost certainly adopted deliberately, in order to
> be consistent with some of the semantics of Perl's split() function.
> Although it may seem counter-intuitive to people without prior Perl
> experience, it's a very familiar construct for those who have been
> working in Perl for a long time.
>
> Just think of split(" ") as a special case which performs a very useful
> function.

--
(-, /\ \/ / /\/

Cameron_McBride · 14 July 2004 03:57

> What possible benefit is there to typing split(" ") vs. split(/\s/)?
> One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> ["this", "is", "fun", "no?"]
irb(main):003:0> s.split(" ")
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]

> It is counter-intuitive to people without prior Perl experience. Now
> that Ruby is taking off in its own right, does Ruby need to continue
> supporting gross global $ vars, this, and other ugly Perl-isms just to
> try and make Ruby feel more like Perl?

Well, things are the way they are. Ruby has over 10 years behind it.
I, for one, would like to see fewer sweeping changes that cause
breakage, not more.

Cameron

Bill_Kelly · 14 July 2004 04:03

Hi,

>> Just think of split(" ") as a special case which performs a very useful
>> function.
>
> Ick.
>
> Not at your summary, Lloyd, but at this situation. This is...stupid.
> I don't know what else to call it.
>
> It's a non-sensical idiom, sure to bite more than a few people. It's
> like Ruby implemented the behavior of a bug that Perl people have
> gotten used to relying on.
>
> What possible benefit is there to typing split(" ") vs. split(/\s/)?
> One saved character (but two shift key presses!)?

They aren't the same. I agree that having a special case
feels funky... But split(" ") embodies functionality that's
not as easy to duplicate as /\s/. For instance:

    "   a    b    c  ".split(" ")
    => ["a", "b", "c"]

    "   a    b    c  ".split(/\s/)
    => ["", "", "", "a", "", "", "", "b", "", "", "", "c"]

    "   a    b    c  ".split(/\s+/)
    => ["", "a", "b", "c"]

Even with /\s+/ we're getting a leading empty field that
the " " special case eliminates for us.

I've never been sure how to write a regexp for split that
does what " " does. I keep thinking it'd need a variable-
width negative lookbehind assertion... which I don't think
even Perl's regex engine supports... Something like:

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...
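
For what it's worth, a lookbehind may not be needed. Two spellings that appear to give the same fields as split(" "), at least for ASCII whitespace, are stripping the string before splitting on /\s+/ and collecting the non-whitespace runs with scan(/\S+/). The sample strings below are illustrative assumptions, and this is a quick spot check rather than a proof of equivalence.

```ruby
# Hypothetical inputs covering leading, trailing, repeated and missing whitespace.
samples = ["  a b  c d  ", "a\tb", "\ta\tb", "no_whitespace", "   ", ""]

results = samples.map do |s|
  {
    input:    s,
    awk:      s.split(" "),          # the whitespace special case
    stripped: s.strip.split(/\s+/),  # strip first, then split on whitespace runs
    scanned:  s.scan(/\S+/)          # collect the runs of non-whitespace instead
  }
end

all_same = results.all? { |r| r[:awk] == r[:stripped] && r[:awk] == r[:scanned] }
puts "all three agree: #{all_same}"
```

Neither alternative is a single pattern passed to split, which is the thing being asked for here; they only reach the same fields by a different route.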

> It is counter-intuitive to people without prior Perl experience. Now
> that Ruby is taking off in its own right, does Ruby need to continue
> supporting gross global $ vars, this, and other ugly Perl-isms just to
> try and make Ruby feel more like Perl?

Some are Perl-isms, some are Shell-isms. They're fantastic
for one-liners... If Ruby was neutered to be lousy for one-
liners, I'd be thoroughly bummed . . .

Regards,

Bill

Cameron_McBride · 14 July 2004 03:59

Stupid web interface. Paste got mangled. Apologies.

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

Cameron

Chris_Dutton1 · 14 July 2004 05:57

Bill Kelly wrote:

> I've never been sure how to write a regexp for split that
> does what " " does. I keep thinking it'd need a variable-
> width negative lookbehind assertion... which I don't think
> even Perl's regex engine supports... Something like:
>
> /(?<!^\s+)\s+/ ...uh....
>
> ...Maybe there's another way to do it... If anybody knows
> I'd like to learn...

Not that I dislike the behavior of split(" "), but it shouldn't be much harder than:

" a b c d ".strip.split(/\s+/)

Mark_Hubbart · 14 July 2004 06:26

as of now, this works:

irb(main):001:0> " spaces of doom ".split(nil)
=> ["spaces", "of", "doom"]

Why shouldn't nil be the only special case? If the $variable is set to nil, it uses this kind of split anyway.

And it's no more characters to type than " "

Mark

On Jul 13, 2004, at 9:03 PM, Bill Kelly wrote:

> Even with /\s+/ we're getting a leading empty field that
> the " " special case eliminates for us.
>
> I've never been sure how to write a regexp for split that
> does what " " does. I keep thinking it'd need a variable-
> width negative lookbehind assertion... which I don't think
> even Perl's regex engine supports... Something like:
>
> /(?<!^\s+)\s+/ ...uh....
>
> ...Maybe there's another way to do it... If anybody knows
> I'd like to learn...

Robert · 14 July 2004 09:17

"Cameron McBride" <cameron.mcbride@gmail.com> wrote in message
news:dcedf5e204071320562e0ad096@mail.gmail.com...

>> What possible benefit is there to typing split(" ") vs. split(/\s/)?
>> One saved character (but two shift key presses!)?
>
> they are not the same:
>
> irb(main):001:0> s = "this is\tfun \tno?"
> => ["this", "is", "fun", "no?"]
> irb(main):003:0> s.split(" ")
> => "this is\tfun \tno?"
> irb(main):002:0> s.split(/\s/)
> => ["this", "is", "fun", "", "no?"]

I'd rather compare split(" ") to split(/\s+/), which is what I use when I
need this functionality. IMHO regular expressions are better suited to
this task anyway. And they are faster:

def test1(s) s.split ' ' end

def test2(s) s.split /\s+/ end

foo = (1..100).to_a.join " "

1000.times { test1 foo }
1000.times { test2 foo }

Yields

11:12:47 [source]: /c/temp/split-perf.rb
     % cumulative      self               self     total
  time    seconds   seconds    calls   ms/call   ms/call  name
 61.58       0.62      0.62     2000      0.31      0.31  String#split
 15.37       0.78      0.16     1000      0.16      0.39  Object#test1
 13.79       0.92      0.14        2     70.00    507.50  Integer#times
  9.26       1.02      0.09     1000      0.09      0.48  Object#test2
  3.05       1.05      0.03        1     31.00     31.00  Profiler__.start_profile
  0.00       1.05      0.00        2      0.00      0.00  Module#method_added
  0.00       1.05      0.00      100      0.00      0.00  Fixnum#to_s
  0.00       1.05      0.00        1      0.00      0.00  Enumerable.to_a
  0.00       1.05      0.00        1      0.00   1015.00  #toplevel
  0.00       1.05      0.00        1      0.00      0.00  Array#join
  0.00       1.05      0.00        1      0.00      0.00  Range#each

Which shows that the regexp version is faster. I assume the string is
converted into a regexp internally and that this is done on each
invocation, while there are definitely optimizations for recurring regexp
usage.

Regards

robert

Lloyd_Zusman · 14 July 2004 09:51

Mark Hubbart <discord@mac.com> writes:

> On Jul 13, 2004, at 9:03 PM, Bill Kelly wrote:
>
>> [ ... ]
>>
>> /(?<!^\s+)\s+/ ...uh....
>>
>> ...Maybe there's another way to do it... If anybody knows
>> I'd like to learn...
>
> as of now, this works:
>
> irb(main):001:0> " spaces of doom ".split(nil)
> => ["spaces", "of", "doom"]
>
> Why shouldn't nil be the only special case? If the $variable is set to
> nil, it uses this kind of split anyway.
>
> And it's no more characters to type than " "
>
> Mark

... and the following has even fewer characters to type:

irb(main):001:0> " spaces of doom ".split()
=> ["spaces", "of", "doom"]

--
Lloyd Zusman
ljz@asfast.com
God bless you.

Lloyd_Zusman · 14 July 2004 10:00

"Robert Klemme" <bob.news@gmx.net> writes:

> "Cameron McBride" <cameron.mcbride@gmail.com> wrote in message
> news:dcedf5e204071320562e0ad096@mail.gmail.com...
>
>>> What possible benefit is there to typing split(" ") vs. split(/\s/)?
>>> One saved character (but two shift key presses!)?
>>
>> they are not the same:
>>
>> irb(main):001:0> s = "this is\tfun \tno?"
>> => ["this", "is", "fun", "no?"]
>> irb(main):003:0> s.split(" ")
>> => "this is\tfun \tno?"
>> irb(main):002:0> s.split(/\s/)
>> => ["this", "is", "fun", "", "no?"]
>
> I'd rather compare split(" ") to split(/\s+/), which is what I use when I
> need this functionality. [ ... ]

However, the two cases are not equivalent:

  irb(main):001:0> " spaces of doom ".split(/\s+/)
  => ["", "spaces", "of", "doom"]
  irb(main):002:0> " spaces of doom ".split(" ")
  => ["spaces", "of", "doom"]

You'd have to compare split(" ") with strip.split(/\s+/). I'll do that
later this morning, when I have more time, and I'll then post my
results.

--
Lloyd Zusman
ljz@asfast.com
God bless you.

Robert · 14 July 2004 10:32

"Lloyd Zusman" <ljz@asfast.com> wrote in message
news:m3vfgq6hg6.fsf@asfast.com...

> "Robert Klemme" <bob.news@gmx.net> writes:
>
>> "Cameron McBride" <cameron.mcbride@gmail.com> wrote in message
>> news:dcedf5e204071320562e0ad096@mail.gmail.com...
>>
>>>> What possible benefit is there to typing split(" ") vs. split(/\s/)?
>>>> One saved character (but two shift key presses!)?
>>>
>>> they are not the same:
>>>
>>> irb(main):001:0> s = "this is\tfun \tno?"
>>> => ["this", "is", "fun", "no?"]
>>> irb(main):003:0> s.split(" ")
>>> => "this is\tfun \tno?"
>>> irb(main):002:0> s.split(/\s/)
>>> => ["this", "is", "fun", "", "no?"]
>>
>> I'd rather compare split(" ") to split(/\s+/), which is what I use when I
>> need this functionality. [ ... ]
>
> However, the two cases are not equivalent:
>
>   irb(main):001:0> " spaces of doom ".split(/\s+/)
>   => ["", "spaces", "of", "doom"]
>   irb(main):002:0> " spaces of doom ".split(" ")
>   => ["spaces", "of", "doom"]
>
> You'd have to compare split(" ") with strip.split(/\s+/). I'll do that
> later this morning, when I have more time, and I'll then post my
> results.

You're right, the strip makes

def test1(s) s.split ' ' end
def test2(s) s.split /\s+/ end
def test3(s) s.strip.split /\s+/ end
def test4(s) s.sub(/^\s+/, '').split /\s+/ end

foo = (1..100).to_a.join " "

1000.times { test1 foo }
1000.times { test2 foo }
1000.times { test3 foo }
1000.times { test4 foo }

12:26:30 [ruby]: ./split-perf.rb
     % cumulative      self               self     total
  time    seconds   seconds    calls   ms/call   ms/call  name
 56.61       1.36      1.36     4000      0.34      0.34  String#split
 10.56       1.62      0.25        4     63.50    597.75  Integer#times
  8.89       1.83      0.21     1000      0.21      0.21  String#sub
  8.40       2.03      0.20     1000      0.20      0.55  Object#test2
  7.15       2.20      0.17     1000      0.17      0.65  Object#test4
  5.15       2.33      0.12     1000      0.12      0.39  Object#test1
  2.62       2.39      0.06     1000      0.06      0.55  Object#test3
  1.29       2.42      0.03        1     31.00     31.00  Profiler__.start_profile
  0.00       2.42      0.00        1      0.00   2406.00  #toplevel
  0.00       2.42      0.00      100      0.00      0.00  Fixnum#to_s
  0.00       2.42      0.00     1000      0.00      0.00  String#strip
  0.00       2.42      0.00        4      0.00      0.00  Module#method_added
  0.00       2.42      0.00        1      0.00      0.00  Range#each
  0.00       2.42      0.00        1      0.00      0.00  Array#join
  0.00       2.42      0.00        1      0.00      0.00  Enumerable.to_a

def test1(s) s.split ' ' end
def test2(s) s.split /\s+/ end
def test3(s) s.strip.split /\s+/ end
def test4(s) s.sub(/^\s+/, '').split /\s+/ end

foo = " " + (1..100).to_a.join( " " )

1000.times { test1 foo }
1000.times { test2 foo }
1000.times { test3 foo }
1000.times { test4 foo }

12:27:36 [ruby]: ./split-perf.rb
     % cumulative      self               self     total
  time    seconds   seconds    calls   ms/call   ms/call  name
 51.03       1.26      1.26     4000      0.32      0.32  String#split
 12.84       1.58      0.32     1000      0.32      0.65  Object#test3
 10.13       1.83      0.25        4     62.50    613.50  Integer#times
  8.30       2.03      0.20     1000      0.20      0.39  Object#test1
  7.09       2.21      0.17     1000      0.17      0.64  Object#test4
  6.97       2.38      0.17     1000      0.17      0.52  Object#test2
  1.82       2.42      0.05     1000      0.05      0.05  String#sub
  1.26       2.46      0.03        1     31.00     31.00  Profiler__.start_profile
  1.22       2.49      0.03     1000      0.03      0.03  String#strip
  0.61       2.50      0.01        1     15.00     15.00  Enumerable.to_a
  0.00       2.50      0.00        1      0.00      0.00  String#+
  0.00       2.50      0.00      100      0.00      0.00  Fixnum#to_s
  0.00       2.50      0.00        1      0.00   2469.00  #toplevel
  0.00       2.50      0.00        1      0.00      0.00  Range#each
  0.00       2.50      0.00        1      0.00      0.00  Array#join
  0.00       2.50      0.00        4      0.00      0.00  Module#method_added

Performance ranking depends on whether there are leading spaces or not.

robert

Mark_Hubbart · 14 July 2004 17:21

... which is the same as:

irb(main):001:0> " spaces of doom ".split
=> ["spaces", "of", "doom"]

However, I was mistakenly thinking that #split(nil) would be exactly the same as #split(" ")... but it isn't. I tried setting $; to "." and it no longer worked. It seems that it should, though: when you want the default behavior of #split, you set $; to nil, so it seems rather logical that #split(nil) should split using that default behavior. Oh well.
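
The difference described above can be reproduced with a short sketch. It assumes a Ruby that still supports assigning to `$;` (newer versions may print a deprecation warning when it is set to a non-nil value), and the sample string is made up for illustration.

```ruby
text = " spaces.of doom "

begin
  $; = "."                      # change the default field separator
  with_nil   = text.split(nil)  # nil means "use $;", so this splits on "."
  with_none  = text.split       # no argument: also uses $;
  with_space = text.split(" ")  # explicit " " keeps the whitespace special case
ensure
  $; = nil                      # restore the default for any later code
end

p with_nil
p with_none
p with_space
```

So `split(nil)` behaves like `split` with no argument, not like `split(" ")`: it only gives whitespace splitting while `$;` is nil.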

cheers,
Mark

On Jul 14, 2004, at 2:51 AM, Lloyd Zusman wrote:

> Mark Hubbart <discord@mac.com> writes:
>
>> On Jul 13, 2004, at 9:03 PM, Bill Kelly wrote:
>>
>>> [ ... ]
>>>
>>>   /(?<!^\s+)\s+/ ...uh....
>>>
>>> ...Maybe there's another way to do it... If anybody knows
>>> I'd like to learn...
>>
>> as of now, this works:
>>
>> irb(main):001:0> " spaces of doom ".split(nil)
>> => ["spaces", "of", "doom"]
>>
>> Why shouldn't nil be the only special case? If the $variable is set to
>> nil, it uses this kind of split anyway.
>>
>> And it's no more characters to type than " "
>>
>> Mark
>
> ... and the following has even fewer characters to type:
>
>   irb(main):001:0> " spaces of doom ".split()
>   => ["spaces", "of", "doom"]

Lloyd_Zusman · 14 July 2004 12:16

"Robert Klemme" <bob.news@gmx.net> writes:

> "Lloyd Zusman" <ljz@asfast.com> wrote in message
> news:m3vfgq6hg6.fsf@asfast.com...
>
>> "Robert Klemme" <bob.news@gmx.net> writes:
>>
>>> "Cameron McBride" <cameron.mcbride@gmail.com> wrote in message
>>> news:dcedf5e204071320562e0ad096@mail.gmail.com...
>>>
>>>>> What possible benefit is there to typing split(" ") vs. split(/\s/)?
>>>>> One saved character (but two shift key presses!)?
>>>>
>>>> they are not the same:
>>>>
>>>> irb(main):001:0> s = "this is\tfun \tno?"
>>>> => ["this", "is", "fun", "no?"]
>>>> irb(main):003:0> s.split(" ")
>>>> => "this is\tfun \tno?"
>>>> irb(main):002:0> s.split(/\s/)
>>>> => ["this", "is", "fun", "", "no?"]
>>>
>>> I'd rather compare split(" ") to split(/\s+/), which is what I use when I
>>> need this functionality. [ ... ]
>>
>> However, the two cases are not equivalent:
>>
>>   irb(main):001:0> " spaces of doom ".split(/\s+/)
>>   => ["", "spaces", "of", "doom"]
>>   irb(main):002:0> " spaces of doom ".split(" ")
>>   => ["spaces", "of", "doom"]
>>
>> You'd have to compare split(" ") with strip.split(/\s+/). I'll do that
>> later this morning, when I have more time, and I'll then post my
>> results.
>
> You're right, the strip makes
>
> [ ... etc. ... ]

Well, you saved me some time by running these yourself. Thanks.

Hmm ... if you know for sure ahead of time whether or not there's
leading whitespace, split(' ') is not the best.

However, without this knowledge about the existence of leading
whitespace or lack thereof, I believe that the best bet is still
split(' ') and its cousins split(nil) and split().

Using a random number of spaces between the items and a random amount of
leading whitespace (including none), I got the following results. Note
that the split(' ')/split(nil)/split() cases are the fastest ones when
you leave out the split(/\s+/) case. That one should really be left out
of these random whitespace tests, because it doesn't give the same
results as the others.

  testArray = []

  # Build 1000 strings of the numbers 1..100, each number preceded by 0-2 spaces.
  1000.times {
    string = ''
    (1..100).each { |x| string += ((" " * rand(3)) + x.to_s) }
    testArray << string
  }

  # 'profile' shipped with Ruby 1.8; on recent Rubies it is a separate gem.
  require 'profile'

  def test1(s) s.split(' ') end
  def test2(s) s.split(nil) end
  def test3(s) s.split() end
  def test4(s) s.split(/\s+/) end
  def test5(s) s.strip.split(/\s+/) end
  def test6(s) s.sub(/^\s+/, '').split(/\s+/) end

  testArray.each { |x| test1(x) }
  testArray.each { |x| test2(x) }
  testArray.each { |x| test3(x) }
  testArray.each { |x| test4(x) }
  testArray.each { |x| test5(x) }
  testArray.each { |x| test6(x) }

     % cumulative      self               self     total
  time    seconds   seconds    calls   ms/call   ms/call  name
 33.17       3.80      3.80     6000      0.63      0.63  String#split
 31.54       7.42      3.62        1   3617.19   3617.19  Profiler__.start_profile
 24.59      10.24      2.82        6    470.05   1911.46  Array#each
  8.17      11.18      0.94     1000      0.94      1.66  Object#test5
  6.68      11.95      0.77     1000      0.77      1.93  Object#test6
  6.34      12.67      0.73     1000      0.73      1.20  Object#test1
  6.27      13.39      0.72     1000      0.72      1.60  Object#test4
  6.27      14.11      0.72     1000      0.72      1.13  Object#test3
  5.18      14.70      0.59     1000      0.59      1.12  Object#test2
  2.18      14.95      0.25     1000      0.25      0.25  String#sub
  1.16      15.09      0.13     1000      0.13      0.13  String#strip
  0.00      15.09      0.00        6      0.00      0.00  Module#method_added
  0.00      15.09      0.00        1      0.00  11468.75  #toplevel
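
For anyone wanting to repeat a comparison like the one above without the profiler, here is a rough sketch using the standard `benchmark` library instead. The input generator, the `variants` table and the repeat count are assumptions made for illustration, not the script that produced the figures above. The bare split(/\s+/) case is left out because it returns a leading empty field, and `\A` is used in place of `^` so that only the start of the string is trimmed.

```ruby
require "benchmark"

srand(42) # fixed seed so the generated input is repeatable

# Hypothetical input: the numbers 1..100 with 0-2 leading spaces and
# 1-3 spaces after every number.
test_strings = Array.new(200) do
  body = (1..100).map { |x| x.to_s + (" " * (1 + rand(3))) }.join
  (" " * rand(3)) + body
end

variants = {
  space:        ->(s) { s.split(" ") },
  default:      ->(s) { s.split },
  strip_regexp: ->(s) { s.strip.split(/\s+/) },
  sub_regexp:   ->(s) { s.sub(/\A\s+/, "").split(/\s+/) }
}

# Check that every variant returns the same fields before timing anything.
expected  = (1..100).map(&:to_s)
all_agree = variants.values.all? do |f|
  test_strings.all? { |s| f.call(s) == expected }
end
puts "variants agree: #{all_agree}"

Benchmark.bm(14) do |bm|
  variants.each do |label, f|
    bm.report(label.to_s) { 50.times { test_strings.each { |s| f.call(s) } } }
  end
end
```

Timings will differ by Ruby version and machine, so the ranking seen on 1.8.1 should not be assumed to carry over.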

--
Lloyd Zusman
ljz@asfast.com
God bless you.
