[rcr] String#split behaves odd

//should be a 1:1 correlation between join and split. Namely, anything
//split (with a constant string/pattern) should be able to be
//joined back
//to the original using that constant string.

second that.
I hope it is possible.

kind regards -botp

···

Ryan Davis [mailto:ryand-ruby@zenspider.com] wrote:

Hi,

···

In message "Re: [rcr] String#split behaves odd" on Wed, 8 Dec 2004 09:33:39 +0900, "Peña, Botp" <botp@delmonte-phil.com> writes:

Ryan Davis [mailto:ryand-ruby@zenspider.com] wrote:

//should be a 1:1 correlation between join and split. Namely, anything
//split (with a constant string/pattern) should be able to be
//joined back
//to the original using that constant string.

second that.
I hope it is possible.

I feel it might be difficult. Define the following:

  "abc\n".split("\n").join("\n")

              matz.

"abc\n".split("\n",-1).join("\n")
=> "abc\n"

Make default behavior -1.

T.
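
A minimal sketch of the difference under discussion, assuming a fixed string delimiter. The helper name `round_trips?` is made up for illustration and is not part of Ruby:

```ruby
# Hypothetical helper: does split followed by join give the input back?
# A limit of 0 behaves like omitting the limit.
def round_trips?(str, sep, limit = 0)
  str.split(sep, limit).join(sep) == str
end

p "abc\n".split("\n")              # => ["abc"]      (trailing empty field dropped)
p "abc\n".split("\n", -1)          # => ["abc", ""]  (trailing empty field kept)
p round_trips?("abc\n", "\n")      # => false
p round_trips?("abc\n", "\n", -1)  # => true
```

Note that a single-space separator is a special case in Ruby (awk-style whitespace splitting), so this sketch does not apply to it.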

···

On Wednesday 08 December 2004 12:00 am, Yukihiro Matsumoto wrote:

I feel it might be difficult. Define the following:

  "abc\n".split("\n").join("\n")

No. Most of the time, I don't actually *want* all the extra crap when
I split. This change would break a lot of code silently.

-austin

···

On Wed, 8 Dec 2004 15:56:01 +0900, trans. (T. Onoma) <transami@runbox.com> wrote:

On Wednesday 08 December 2004 12:00 am, Yukihiro Matsumoto wrote:
> I feel it might be difficult. Define the following:
> "abc\n".split("\n").join("\n")
"abc\n".split("\n",-1).join("\n")
=> "abc\n"

Make default behavior -1.

--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca

And now:

"abc\n\ndef\nghi".split(/\n+/)

···

On 2004-12-08 15:56:01 +0900, trans. (T. Onoma) wrote:

Make default behavior -1.

--
Florian Frank

Sorry, what are you pointing out here?

T.

···

On Wednesday 08 December 2004 10:00 am, Florian Frank wrote:

On 2004-12-08 15:56:01 +0900, trans. (T. Onoma) wrote:
> Make default behavior -1.

And now:

"abc\n\ndef\nghi".split(/\n+/)

Really? In all my code I either had to put the -1 in b/c I was getting
unexpected results (wasting many hours, btw!); or it didn't really matter
either way --empty strings usually end up in no-effect results.

Might it break code? Sure. But a lot? I doubt it. In fact I suspect some of
those same programs might have edge cases that would break them for the lack
of the -1.

T.

···

On Wednesday 08 December 2004 07:42 am, Austin Ziegler wrote:

On Wed, 8 Dec 2004 15:56:01 +0900, trans. (T. Onoma) <transami@runbox.com> wrote:
> On Wednesday 08 December 2004 12:00 am, Yukihiro Matsumoto wrote:
> > I feel it might be difficult. Define the following:
> > "abc\n".split("\n").join("\n")
>
> "abc\n".split("\n",-1).join("\n")
> => "abc\n"
>
> Make default behavior -1.

No. Most of the time, I don't actually *want* all the extra crap when
I split. This change would break a lot of code silently.

trans. (T. Onoma) wrote:

···

On Wednesday 08 December 2004 10:00 am, Florian Frank wrote:
>
> "abc\n\ndef\nghi".split(/\n+/)

Sorry, what are you pointing out here?

It is impossible to have perfect forwards-backwards symmetry with split and join. When the split pattern matches more than one possible string of characters, there is no way to join the results of the split and recover the original input.
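
A short sketch of that point, using Florian's string. The method names are illustrative only:

```ruby
# With a variable-width pattern, the array no longer records how many
# newlines separated each pair of fields.
def rejoin_variable(text)
  text.split(/\n+/).join("\n")
end

# With a fixed delimiter, the empty field between the two newlines survives.
def rejoin_fixed(text)
  text.split("\n").join("\n")
end

text = "abc\n\ndef\nghi"
p text.split(/\n+/)              # => ["abc", "def", "ghi"]
p rejoin_variable(text) == text  # => false
p text.split("\n")               # => ["abc", "", "def", "ghi"]
p rejoin_fixed(text) == text     # => true
```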

--
Glenn Parker | glenn.parker-AT-comcast.net | <http://www.tetrafoil.com/>

I feel it might be difficult. Define the following:
  "abc\n".split("\n").join("\n")

"abc\n".split("\n",-1).join("\n")
=> "abc\n"

Make default behavior -1.

No. Most of the time, I don't actually *want* all the extra crap when I
split. This change would break a lot of code silently.

Really? In all my code I either had to put the -1 in b/c I was getting
unexpected results (wasting many hours, btw!); or it didn't really matter
either way --empty strings usually end up in no-effect results.

Really. I just did a quick audit of my publicly released code -- no checks to
see what would be broken if I add -1 -- but NONE of my code uses split with -1,
and I have ~45 split calls in said code. Are YOU going to volunteer to test all
of my code (and everyone else's?) because you don't like adding the -1?

Might it break code? Sure. But a lot? I doubt it. In fact I suspect some of
those same programs might have edge cases that would break them for the lack
of the -1.

*shrug*

My code is written without the -1. If you want this default behaviour changed
-- that's been in Ruby for quite a while -- then you take on the responsibility
for testing all of the code out there. It is not clear that the existing
behaviour is broken.

-austin

···

On Fri, 10 Dec 2004 03:10:20 +0900, trans. (T. Onoma) <transami@runbox.com> wrote:

On Wednesday 08 December 2004 07:42 am, Austin Ziegler wrote:

On Wed, 8 Dec 2004 15:56:01 +0900, trans. (T. Onoma) <transami@runbox.com> wrote:

On Wednesday 08 December 2004 12:00 am, Yukihiro Matsumoto wrote:

--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca

Hi,

> On Wed, 8 Dec 2004 15:56:01 +0900, trans. (T. Onoma)
>
> > Make default behavior -1.
>
> No. Most of the time, I don't actually *want* all the extra crap when
> I split. This change would break a lot of code silently.

Really? In all my code I either had to put the -1 in b/c I was getting
unexpected results (wasting many hours, btw!); or it didn't really matter
either way --empty strings usually end up in no-effect results.

Might it break code? Sure. But a lot? I doubt it. In fact I suspect some of
those same programs might have edge cases that would break them for the lack
of the -1.

I'm pretty sure it would break a lot of my code. I'm used to #split working
as it does since I came from Perl, but my point in saying that is not to bless
the way Perl did it, but that I am very careful about whether I want trailing
fields or not.

I haven't surveyed my code, but I'm pretty sure I rely on the precise behavior
of split. . .

FWIW,

Regards,

Bill

···

From: "trans. (T. Onoma)" <transami@runbox.com>

On Wednesday 08 December 2004 07:42 am, Austin Ziegler wrote:

That is obvious, but it is no argument against wanting split and join to be
symmetric for fixed split patterns. (On the other hand, one could even argue
that we "need" variable join patterns to make it symmetric again. ;)

What is important is that the split('x', -1) trick does not seem to be as well
known as it should be. I for one forgot about it, even though I now remember
there was a similar thread some time ago.

Regards,

Brian

···

On Thu, 9 Dec 2004 22:19:55 +0900 Glenn Parker <glenn.parker@comcast.net> wrote:

trans. (T. Onoma) wrote:
> On Wednesday 08 December 2004 10:00 am, Florian Frank wrote:
> >
> > "abc\n\ndef\nghi".split(/\n+/)
>
> Sorry, what are you pointing out here?

It is impossible to have perfect forwards-backwards symmetry with split
and join. When the split pattern matches more than one possible string
of characters, there is no way to join the results of the split and
recover the original input.

--
Brian Schröder
http://www.brian-schroeder.de/

I see. Well, the -1 parameter was only half the story. The other half of this
argument was to not let empty strings drop, so the above could produce:

  ["abc","","def","ghi"]

And in that case it certainly is reversible. So I would think that would make
sense as default, then add method "modes" for the desired variations.

That reminds me, do we have a simple way to delete all empty strings from an
array, like compact does for nil?

T.

···

On Thursday 09 December 2004 08:19 am, Glenn Parker wrote:

trans. (T. Onoma) wrote:
> On Wednesday 08 December 2004 10:00 am, Florian Frank wrote:
> > "abc\n\ndef\nghi".split(/\n+/)
>
> Sorry, what are you pointing out here?

It is impossible to have perfect forwards-backwards symmetry with split
and join. When the split pattern matches more than one possible string
of characters, there is no way to join the results of the split and
recover the original input.

>> split. This change would break a lot of code silently.
>
> Really? In all my code I either had to put the -1 in b/c I was getting
> unexpected results (wasting many hours, btw!); or it didn't really matter
> either way --empty strings usually end up in no-effect results.

Really. I just did a quick audit of my publicly released code -- no checks
to see what would be broken if I add -1 -- but NONE of my code uses split
with -1, and I have ~45 split calls in said code. Are YOU going to
volunteer to test all of my code (and everyone else's?) because you don't
like adding the -1?

Could you send me said code? I would like to see how it affects things.

Also understand that I never paid _any_ attention until one day my program
wouldn't work and I could not figure out why. Well, nearly a day of debugging
later I traced it to a split call and finally learned about the -1. I was not
amused.

> Might it break code? Sure. But a lot? I doubt it. In fact I suspect some
> of those same programs might have edge cases that would break them for
> the lack of the -1.

*shrug*

My code is written without the -1. If you want this default behaviour
changed -- that's been in Ruby for quite a while -- then you take on the
responsibility for testing all of the code out there.

In the same amount of time I spent working out that first bug, I probably
could have adjusted the vast majority of programs that would be affected.

It is not clear that the existing behaviour is broken.

Broken is not the argument. Obviously it is usable. The question is, is it
well designed?

     If the _limit_ parameter is omitted, trailing null fields are
     suppressed. If _limit_ is a positive number, at most that number of
     fields will be returned (if _limit_ is +1+, the entire string is
     returned as the only entry in an array). If negative, there is no
     limit to the number of fields returned, and trailing null fields
     are not suppressed.
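
The quoted documentation can be illustrated roughly as follows. This is a sketch on a made-up input line, and the helper name `fields` is hypothetical:

```ruby
# Hypothetical helper: split a comma-separated line with an optional limit.
def fields(line, *limit)
  line.split(",", *limit)
end

line = "a,b,,c,,"
p fields(line)      # => ["a", "b", "", "c"]          limit omitted: trailing empties dropped
p fields(line, 2)   # => ["a", "b,,c,,"]              at most two fields
p fields(line, 1)   # => ["a,b,,c,,"]                 whole string as the only entry
p fields(line, -1)  # => ["a", "b", "", "c", "", ""]  no limit, trailing empties kept
```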

Just reading that is enough to know, but also given that the issue has come up
a number of independent times, it is quite obvious that it is not well
designed.

T.

···

On Thursday 09 December 2004 01:49 pm, Austin Ziegler wrote:

"Brian Schröder" <ruby@brian-schroeder.de> schrieb im Newsbeitrag
news:20041209143808.412a5f7e@black.wg...

> trans. (T. Onoma) wrote:
> > >
> > > "abc\n\ndef\nghi".split(/\n+/)
> >
> > Sorry, what are you pointing out here?
>
> It is impossible to have perfect forwards-backwards symmetry with split
> and join. When the split pattern matches more than one possible string
> of characters, there is no way to join the results of the split and
> recover the original input.
>

That is obvious, but no argument against wanting split and join to be symmetric
for fixed split patterns. (On the other hand one could even argue, that we
"need" variable join patterns to make it symmetric again. ;)

Just one note: you can use regexp grouping to get the delimiters as well
as the content. That way you can join with "" and get the original back:

"abababab".split(/b/)

=> ["a", "a", "a", "a"]

"abababab".split(/(b)/)

=> ["a", "b", "a", "b", "a", "b", "a", "b"]

There might be certain border cases though.
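
A small sketch of the grouping idea, assuming the whole pattern is wrapped in a single capture group. The method name `rebuild` is invented for this example:

```ruby
# Split with the delimiter captured, then glue everything back together.
# Dropped trailing empty fields do not matter here, because empty strings
# contribute nothing when joining with "".
def rebuild(str, pattern)
  str.split(pattern).join("")
end

p "abababab".split(/(b)/)                  # => ["a", "b", "a", "b", "a", "b", "a", "b"]
p rebuild("abababab", /(b)/)               # => "abababab"
p "abc\n\ndef\nghi".split(/(\n+)/)         # => ["abc", "\n\n", "def", "\n", "ghi"]
p rebuild("abc\n\ndef\nghi", /(\n+)/) == "abc\n\ndef\nghi"  # => true
```

This only shows that these particular inputs come back unchanged; it is not a claim about every pattern.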

Kind regards

    robert

···

On Thu, 9 Dec 2004 22:19:55 +0900, Glenn Parker <glenn.parker@comcast.net> wrote:
> > On Wednesday 08 December 2004 10:00 am, Florian Frank wrote:

++ trans. (T. Onoma) [ruby-talk] [10/12/04 00:43 +0900]:

That reminds me, do we have simple way to delete all empty strings from an
array, like compact is for nil?

what's wrong with using Array#delete ?

      --ibz.

···

--

split. This change would break a lot of code silently.

Really? In all my code I either had to put the -1 in b/c I was
getting unexpected results (wasting many hours, btw!); or it
didn't really matter either way --empty strings usually end up
in no-effect results.

Really. I just did a quick audit of my publicly released code --
no checks to see what would be broken if I add -1 -- but NONE of
my code uses split with -1, and I have ~45 split calls in said
code. Are YOU going to volunteer to test all of my code (and
everyone else's?) because you don't like adding the -1?

Could you send me said code? I would like to see how it affects
things.

No; it's all publicly available, as I said. TeX::Hyphen,
Text::Format, Ruwiki, Diff::LCS, Archive::Tar::Minitar, etc. Of all
of these, I think that I saw one of them that would probably benefit
from using -1 (and that's a command-line program in Diff::LCS).

Also understand that I never paid _any_ attention until one day my
program wouldn't work and I could not figure out why. Well, nearly
a day of debugging later I traced it to a split call and finally
learned about the -1. I was not amused.

Again, *shrug*. I think that I've had to use the -1 form of #split
exactly once in any code that I've written for Ruby.

Might it break code? Sure. But a lot? I doubt it. In fact I
suspect some of those same programs might have edge cases that
would break them for the lack of the -1.

*shrug*

My code is written without the -1. If you want this default
behaviour changed -- that's been in Ruby for quite a while --
then you take on the responsibility for testing all of the code
out there.

In the same amount of time I spent working out that first bug, I
probably could have adjusted the vast majority of programs that
would be affected.

Probably not. Remember that when you're talking about changing a
default, you're talking about a lot of programs -- and a lot of
legacy programs. If you instead suggest:

  class String
    def splitall(re)
      self.split(re, -1)
    end
  end

I'll support you. I will NOT support you changing default behaviour
on something that is as widely used as split.
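
For illustration, the suggested method might be exercised like this. `splitall` is the name proposed in this message, not something core Ruby provides:

```ruby
# The helper proposed above; reopening String is the approach the message
# suggests, not a built-in method.
class String
  def splitall(re)
    split(re, -1)
  end
end

p "a,b,,".split(",")                          # => ["a", "b"]
p "a,b,,".splitall(",")                       # => ["a", "b", "", ""]
p "a,b,,".splitall(",").join(",") == "a,b,,"  # => true
```

The default `split` is left untouched, so existing callers keep their current behaviour.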

It is not clear that the existing behaviour is broken.

Broken is not the argument. Obviously it is usable. The question
is, is it well designed?

IMO, it's a little late to be asking that.

     If the _limit_ parameter is omitted, trailing null fields are
     suppressed. If _limit_ is a positive number, at most that
     number of fields will be returned (if _limit_ is +1+, the
     entire string is returned as the only entry in an array). If
     negative, there is no limit to the number of fields returned,
     and trailing null fields are not suppressed.

Just reading that is enough to know, but also given that the issue
has come up a number of independent times, it is quite obvious
that it is not well designed.

I don't think it's come up all that often, and I don't think it's an
issue all that often.

-austin

···

On Fri, 10 Dec 2004 05:20:17 +0900, trans. (T. Onoma) <transami@runbox.com> wrote:

On Thursday 09 December 2004 01:49 pm, Austin Ziegler wrote:

--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca

Robert Klemme wrote:

Just one note: you can use regexp grouping to get the delimiters as well
as the content. That way you can join with "" and get the original back:

I had the same thought, based on the similar Perl feature.

"abababab".split(/b/)

=> ["a", "a", "a", "a"]

"abababab".split(/(b)/)

=> ["a", "b", "a", "b", "a", "b", "a", "b"]

There might be certain border cases though.

a = "aba".split(/(a)/)
=> ["", "a", "b", "a"]

"abcdabcd".split(/(a|b)/)
=> ["", "a", "", "b", "cd", "a", "", "b", "cd"]

The leading "" looks like a bug at first, but it seems the intention is to always put delimited text at even array indices, leaving delimiters at odd indices. This is good to know, but not well documented for Ruby.
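
A sketch of how that even/odd layout could be used, assuming the pattern consists of exactly one capture group. The method name `fields_and_delimiters` is hypothetical:

```ruby
# Separate delimited text (even indices) from delimiters (odd indices).
# Assumes the pattern is a single capture group, so each match adds
# exactly one delimiter entry to the array.
def fields_and_delimiters(str, pattern)
  parts = str.split(pattern, -1)
  pairs = parts.each_slice(2).to_a
  fields = pairs.map { |pair| pair[0] }
  delimiters = pairs.map { |pair| pair[1] }.compact  # last pair has no delimiter
  [fields, delimiters]
end

fields, delimiters = fields_and_delimiters("abcdabcd", /(a|b)/)
p fields      # => ["", "", "cd", "", "cd"]
p delimiters  # => ["a", "b", "a", "b"]
```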

The Perl doc goes into this in some depth. If Perl was the source of inspiration, then here is a crack.

"abcdabcd".split(/(a)|(b)/)
=> ["", "a", "", "b", "cd", "a", "", "b", "cd"]

The (admittedly obscure) example above is handled differently in Perl.

@a = split(/(a)|(b)/, "abcdabcd");
map { print "\"$_\"", ' ' } @a;

produces:

"" "a" "" "" "" "b" "cd" "a" "" "" "" "b" "cd"

Sadly, it can be difficult to distinguish between undef and "" in Perl, but this output can be better understood as:

""
"a" undef
""
undef "b"
"cd"
"a" undef
""
undef "b"
"cd"

I won't even try to explain, much less justify, this behavior in Perl.

···

--
Glenn Parker | glenn.parker-AT-comcast.net | <http://www.tetrafoil.com/>

Right. I was thinking #delete only removed the first match it came to. But no,
it does remove them all. So simple enough. Thanks.

BTW, I mis-tested Florian's example so disregard my last post on the subject
-- I see the point now.

To bring us back to the focal points, can we summarize the pros and cons for
the two ideas being presented:

  1. "" return value
  2. -1 default behavior

Thanks,
T.

···

On Thursday 09 December 2004 12:29 pm, Ibraheem Umaru-Mohammed wrote:

++ trans. (T. Onoma) [ruby-talk] [10/12/04 00:43 +0900]:
> That reminds me, do we have simple way to delete all empty strings from
> an array, like compact is for nil?

what's wrong with using Array#delete ?

     --ibz.

No; it's all publicly available, as I said. TeX::Hyphen,
Text::Format, Ruwiki, Diff::LCS, Archive::Tar::Minitar, etc. Of all
of these, I think that I saw one of them that would probably benefit
from using -1 (and that's a command-line program in Diff::LCS).

Okay, well that's good enough. I'll grab a couple and check it out.

> Also understand that I never paid _any_ attention until one day my
> program wouldn't work and I could not figure out why. Well, nearly
> a day of debugging later I traced it to a split call and finally
> learned about the -1. I was not amused.

Again, *shrug*. I think that I've had to use the -1 form of #split
exactly once in any code that I've written for Ruby.

Once is all it took for me to pull my hair out for a day ;)

>>> Might it break code? Sure. But a lot? I doubt it. In fact I
>>> suspect some of those same programs might have edge cases that
>>> would break them for the lack of the -1.
>>
>> *shrug*
>>
>> My code is written without the -1. If you want this default
>> behaviour changed -- that's been in Ruby for quite a while --
>> then you take on the responsibility for testing all of the code
>> out there.
>
> In the same amount of time I spent working out that first bug, I
> probably could have adjusted the vast majority of programs that
> would be affected.

Probably not. Remember that when you're talking about changing a
default, you're talking about a lot of programs -- and a lot of
legacy programs. If you instead suggest:

Well, not by hand. I would use a supporting script. Nonetheless, it's not like
I will actually be doing this.

  class String
    def splitall(re)
      self.split(re, -1)
    end
  end

Well, that is a fair compromise I think.

I'll support you. I will NOT support you changing default behaviour
on something that is as widely used as split.

What if it turns out, for example, that experimentation shows approx. 95%+ of
code would continue working unaffected? Obviously we don't know either way.
But I wouldn't be so quick to rule out the possibility.

>> It is not clear that the existing behaviour is broken.
>
> Broken is not the argument. Obviously it is usable. The question
> is, is it well designed?

IMO, it's a little late to be asking that.

> If the _limit_ parameter is omitted, trailing null fields are
> suppressed. If _limit_ is a positive number, at most that
> number of fields will be returned (if _limit_ is +1+, the
> entire string is returned as the only entry in an array). If
> negative, there is no limit to the number of fields returned,
> and trailing null fields are not suppressed.
>
> Just reading that is enough to know, but also given that the issue
> has come up a number of independent times, it is quite obvious
> that it is not well designed.

I don't think it's come up all that often, and I don't think it's an
issue all that often.

On the whole, perhaps, not that often. But it has now come up twice in the
last month. And if we don't work on improving things at times such as these,
when might we ever?

T.

···

On Thursday 09 December 2004 04:11 pm, Austin Ziegler wrote:

Hi --

···

On Fri, 10 Dec 2004, trans. (T. Onoma) wrote:

On Thursday 09 December 2004 12:29 pm, Ibraheem Umaru-Mohammed wrote:
> ++ trans. (T. Onoma) [ruby-talk] [10/12/04 00:43 +0900]:
> > That reminds me, do we have simple way to delete all empty strings from
> > an array, like compact is for nil?
>
> what's wrong with using Array#delete ?
>
> --ibz.

Right. I was thinking #delete only removed the first match it came to. But no,
it does remove them all. So simple enough. Thanks.

BTW, I mis-tested Florian's example so disregard my last post on the subject
-- I see the point now.

Small caveat: remember that Array#delete returns the thing deleted,
not the array, so you can't chain it as you would #compact.
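
A brief sketch of that caveat, with one chainable alternative. The helper name `without_empties` is made up here:

```ruby
# Array#delete mutates the receiver and returns the deleted object
# (or nil when nothing matched), so it cannot be chained like #compact.
parts = "a,,b,,".split(",", -1)
p parts              # => ["a", "", "b", "", ""]
p parts.delete("")   # => ""
p parts              # => ["a", "b"]

# Hypothetical chainable helper: returns a new array without empty strings.
def without_empties(strings)
  strings.reject { |s| s.empty? }
end

p without_empties("a,,b,,".split(",", -1)).join("-")  # => "a-b"
```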

David

--
David A. Black
dblack@wobblini.net