String#split(' ') and whitespace (perl user's surprise)

I have to confess that I use a lot of Perl, and some of its idioms are
deeply embedded in my mind.

In the course of parsing some data in Ruby I used a fragment of code
like

rules.each_line do |rule|
  sku, price, special = rule.chomp.split(' ', 3)
  # [...]
end

where rules was composed of lines of the form

  A     50       3 for 130

Coming from a Perl background, I would have expected 'A', '50' and '3 for 130'
to have been assigned to sku, price and special. As it turned out,
special had leading whitespace.
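
For anyone who wants to reproduce this, here is a self-contained version of the fragment. It is only a sketch: the contents of `rules` are made-up sample data in the format described above, and the `lstrip` call is a defensive addition that is not in the original code, so the result should not depend on how a given interpreter treats the last field.

```ruby
# Hypothetical sample data in the "sku price special" format described above.
rules = <<~RULES
  A     50       3 for 130
  B     30       2 for 45
  C     20
RULES

parsed = rules.each_line.map do |rule|
  sku, price, special = rule.chomp.split(' ', 3)
  # lstrip is a defensive addition: where the last field keeps leading
  # whitespace it removes it, otherwise it changes nothing.
  # special is nil for lines that have no third field.
  [sku, Integer(price), special&.lstrip]
end

parsed.each { |row| p row }
```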

Maybe this is the difference Hal alludes to in From Perl to Ruby in his
book The Ruby Way: “Also, note that split also behaves slightly
differently.”

Is this intentional behaviour? I think that the Perl behaviour is more
useful (that is, trimming the leading whitespace off the limit-th element
returned).

In the Perl debugger:

DB<1> $s = ' A 50 3 for 130'

DB<2> @l = split(' ', $s, 3)

DB<3> x @l
0 'A'
1 50
2 '3 for 130'

In irb:

' A 50 3 for 130'.split(' ', 3)
=> ["A", "50", " 3 for 130"]

What I'd like Ruby to do:

' A 50 3 for 130'.split(' ', 3)
=> ["A", "50", "3 for 130"]

Mike

mike@stok.co.uk | The “`Stok’ disclaimers” apply.
http://www.stok.co.uk/~mike/ | GPG PGP Key 1024D/059913DA
mike@exegenix.com | Fingerprint 0570 71CD 6790 7C28 3D60
http://www.exegenix.com/ | 75D2 9EC4 C1C0 0599 13DA

I’d agree, FWIW. Ruby even appears internally inconsistent: why does
it trim the leading \s off of 'A' and '50', but not the last element?
Just because it’s the last one? Bug? Oversight? By design?


I don’t know if it’s intentional or not, but I do see the following at
http://www.rubycentral.com/book/ref_c_string.html#String.split

split
str.split( pattern=$;, [ limit ] ) → anArray

Divides str into substrings based on a delimiter, returning an array
of these substrings.

If pattern is a String, then its contents are used as the delimiter
when splitting str. If pattern is a single space, str is split on
whitespace, with leading whitespace and runs of contiguous whitespace
characters ignored.

And they have an example:

" now's the time".split(' ') » ["now's", "the", "time"]

This seems to imply that split(' ', 3) ought to work like you were
expecting. A little testing with irb seems to show that this behaves
differently any time the “limit” argument is given. In fact, when a
regex matching whitespace is given, you get an extra empty entry at the
front; this is another possible problem, but maybe is correct since
it’s not the “magic” single space parameter which is said to trim
leading whitespace. See below:

$ cat test.rb
str = " A 50 3 for 130"

p str.split
p str.split(' ')
p str.split(' ',3)
p str.split(/\s+/)
p str.split(/\s+/,3)

$ ./ruby -v test.rb
ruby 1.8.0 (2003-05-31) [i686-linux]
["A", "50", "3", "for", "130"]
["A", "50", "3", "for", "130"]
["A", "50", " 3 for 130"] ← this differs from what is documented
["", "A", "50", "3", "for", "130"]
["", "A", "50 3 for 130"]

This seems like a Ruby bug: limit seems to be counting whitespace
characters and not whitespace “runs” when ' ' is given as the pattern.

But maybe it’s a documentation bug. Who knows?
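
The difference between the “magic” single-space pattern and an explicit whitespace regexp can be seen even without a limit. A small sketch (the variable names are mine, and the runs of spaces in the sample string are an assumption about what the original data looked like):

```ruby
str = " A     50       3 for 130"

# A single-space string pattern is the special awk-style case:
# leading whitespace and runs of whitespace are ignored.
awk_style = str.split(' ')

# A regexp gets no special treatment, so the leading whitespace
# produces an empty first field.
regex_style = str.split(/\s+/)

p awk_style    # ["A", "50", "3", "for", "130"]
p regex_style  # ["", "A", "50", "3", "for", "130"]
```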


Wesley J. Landaker - wjl@icecavern.net
OpenPGP FP: C99E DF40 54F6 B625 FD48 B509 A3DE 8D79 541F F830

I should have mentioned that as a workaround, you can do this:

' A 50 3 for 130'.split(' ', 3).map { |x| x.strip }

I agree you shouldn’t have to do that, but it will get you going with
minimal changes.


> This seems like a ruby bug: limit seems to be counting whitespace
> characters and not whitespace "runs" when ' ' is given as the pattern.
>
> But maybe it's a documentation bug. Who knows?

Well, to see what it does:

svg% ruby -e 'p "a b".split(" ", 2); p "a  b".split(" ", 2);'
["a", "b"]
["a", " b"]
svg%

It removes just the first space when it has a limit parameter.

Guy Decoux

“Michael Campbell” michael_s_campbell@yahoo.com wrote in message
news:20030626132342.22980.qmail@web12408.mail.yahoo.com

> I’d agree, FWIW. Ruby even appears internally inconsistent; why does
> it trim the leading \s off of 'A' and '50', but not the last element.
> Just because it’s the last one? Bug? Oversight? By design?

I guess it’s because the last one receives the rest of whatever is there.
Personally I prefer to use regexps for splitting like this:

s.strip.split( /\s+/o, 3 )

I think the behavior of split is quite complex and one has to
carefully read the documentation. If you read it you’ll find out that
splitting with a single space is a special case which leads exactly to
what you have observed. At least doc and impl match. :-)
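
To make the strip-then-regexp approach concrete, here is a short sketch. The sample string is an assumption; I have left off the /o flag because it only affects regexps that contain interpolation.

```ruby
s = "  A     50       3 for 130  "

# Stripping first avoids the empty leading field that /\s+/ would
# otherwise produce. Note that it also drops trailing whitespace.
fields = s.strip.split(/\s+/, 3)

p fields  # ["A", "50", "3 for 130"]
```

The trade-off is that strip touches both ends of the string, so any trailing whitespace on the last field is gone as well.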

robert

In article 200306260741.40418.wjl@icecavern.net,
Wesley J Landaker wjl@icecavern.net wrote:

> I should have mentioned that as a workaround, you can do this:
>
> ' A 50 3 for 130'.split(' ', 3).map { |x| x.strip }
>
> I agree you shouldn’t have to do that, but it will get you going with
> minimal changes.

Thanks. A minor nit is that this will destroy trailing spaces on the
third field (not relevant in this case, and maybe useful in many cases).
:-)

I think that my real point was that if there’s a special case for
splitting a string on ' ' then it should behave the same way as Perl’s
special case.
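
If the trailing spaces on the last field matter, one possible variation on the workaround is to remove only leading whitespace. This is just a sketch: `split_keep_trailing` is a hypothetical helper name of mine, not anything from Ruby itself.

```ruby
# Hypothetical helper: split on whitespace with a limit, then remove only
# *leading* whitespace from each field, so that trailing spaces on the
# last field survive.
def split_keep_trailing(str, limit)
  str.split(' ', limit).map(&:lstrip)
end

line = " A     50       3 for 130   "

stripped = line.split(' ', 3).map(&:strip)  # trailing spaces on the last field are lost
kept     = split_keep_trailing(line, 3)     # trailing spaces on the last field are kept

p stripped
p kept
```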

Mike


Does that sound like a bug to you?

Bug? Maybe, maybe not…

Anyway, just to let you know:
(1) split(/\s+/o, 3) does not work as we want.
(2) You could use squeeze, but you still have the trailing-whitespace problem…
(3) That is how Ruby behaves right now (you may need to do more work on
the last item).

irb(main):001:0> ' A 50 3 for 130'.split(/\s+/o, 3)
["", "A", "50 3 for 130"]
irb(main):002:0> ' A 50 3 for 130'.squeeze.split(' ', 3)
["A", "50", "3 for 130"]
irb(main):003:0> ' A 50 3 for 130 '.squeeze.split(' ', 3)
["A", "50", "3 for 130 "]

Dave


I think you want strip instead of squeeze:

irb(main):002:0> " I like broccolli ".squeeze
=> " I like brocoli "
irb(main):003:0> " I like broccolli ".strip
=> "I like broccolli"

I got those mixed up in a script once before and it took me forever to
debug. ;-)
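
For comparison, here are the three calls side by side. It is a small sketch; the runs of spaces in the sample string are my assumption about what the input above originally looked like.

```ruby
s = "  I   like   broccolli  "

squeezed_all    = s.squeeze       # every run of a repeated character collapses, letters too
squeezed_spaces = s.squeeze(' ')  # only runs of spaces collapse
stripped        = s.strip         # only leading and trailing whitespace is removed

p squeezed_all     # " I like brocoli "
p squeezed_spaces  # " I like broccolli "
p stripped         # "I   like   broccolli"
```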


Hi,

At Thu, 26 Jun 2003 22:45:37 +0900, Michael Campbell wrote:

> Does that sound like a bug to you?

Yes, to me.

Index: string.c
RCS file: /cvs/ruby/src/ruby/string.c,v
retrieving revision 1.161
diff -u -2 -p -r1.161 string.c
--- string.c 26 Jun 2003 18:24:58 -0000 1.161
+++ string.c 26 Jun 2003 18:26:27 -0000
@@ -2582,4 +2582,5 @@ rb_str_split_m(argc, argv, str)
             end = beg+1;
             skip = 0;
+            if (!NIL_P(limit) && lim <= i) break;
         }
     }
@@ -2589,5 +2590,5 @@ rb_str_split_m(argc, argv, str)
             skip = 1;
             beg = end + 1;
-            if (!NIL_P(limit) && lim <= ++i) break;
+            if (!NIL_P(limit)) ++i;
         }
         else {


Nobu Nakada

> I think you want strip instead of squeeze:
>
> irb(main):002:0> " I like broccolli ".squeeze
> => " I like brocoli "
> irb(main):003:0> " I like broccolli ".strip
> => "I like broccolli"

Well, in this case squeeze and strip are almost the same,
because later you will split on the space character
(so squeezing the space between the first and second items is not a problem at all).

The only problem with squeeze is on the last item:
' A B C D E' after squeeze.split(' ', 3) ==> ['A', 'B', 'C D E ']
(in the last item, the middle and the tail are squeezed but still there!)

An advantage of squeeze over strip is, for example:
irb(main):002:0> 'A,B,C,D,E,'.squeeze(',').split(',',3)
["A", "B", "C,D,E,"]

Can strip do that?! :-)

Anyway, if you need more control, you still need to work on the last item after split.

Dave


To me, as well. Commit the fix, please.

						matz.
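
A quick way to see which behaviour a given interpreter has is a one-line probe like the one below. It is only a sketch: the method name is hypothetical, and it assumes the fixed behaviour is the one asked for at the top of this thread, with leading whitespace trimmed from the last field.

```ruby
# Hypothetical probe: returns true when a limited single-space split
# strips the leading whitespace from the final field.
def limited_split_trims_last_field?
  "a  b".split(' ', 2) == ["a", "b"]
end

puts "ruby #{RUBY_VERSION}: trims last field? #{limited_split_trims_last_field?}"
```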

In message "Re: String#split(' ') and whitespace (perl user's surprise)" on 03/06/27, nobu.nokada@softhome.net writes:

> At Thu, 26 Jun 2003 22:45:37 +0900, Michael Campbell wrote:
>
> > Does that sound like a bug to you?
>
> Yes, to me.

> Well, in this case squeeze and strip are almost the same,
> because later you will split on the space character
> (so squeezing the space between the first and second items is not a
> problem at all).
>
> The only problem with squeeze is on the last item:
> ' A B C D E' after squeeze.split(' ', 3) ==> ['A', 'B', 'C D E ']
> (in the last item, the middle and the tail are squeezed but still there!)

I was referring to the [sometimes unintended] effect of squeeze by
default squeezing all characters, not just whitespace.

' AAA BB CCCCCC DDDDDDDDDDDDDD EEE'.squeeze.split(' ', 3)
=> ["A", "B", "C D E"]

Maybe that’s what you want, but I bet that could take you by surprise if
you were expecting it to just squeeze whitespace. =)

> An advantage of squeeze over strip is, for example:
> irb(main):002:0> 'A,B,C,D,E,'.squeeze(',').split(',',3)
> ["A", "B", "C,D,E,"]
>
> Can strip do that?! :-)

Heh heh, I didn’t say squeeze wasn’t useful. ;-) In my above example you
can squeeze on ' ' and it works like you said:

' AAA BB CCCCCC DDDDDDDDDDDDDD EEE'.squeeze(' ').split(' ', 3)
=> ["AAA", "BB", "CCCCCC DDDDDDDDDDDDDD EEE"]


> Maybe that’s what you want, but I bet that could take you by surprise
> if you were expecting it to just squeeze whitespace. =)

Thanks, I learned one more thing.
Next time I should be careful with squeeze.
