Strings vs arrays

Hi --

Daniel Brockman wrote:

Whatever String#[] and all the other String methods index, of course.

Depending on the parameter you pass, #[] can return a String or an Integer.

There is no clear notion of an "Element" in a String.

If this is true, then we have a serious problem. Before doing much
anything about its API, we need to decide whether String is a byte
array or a character array. (Presumably, matz & co. already have.)

My understanding (sneaking in a reply to Daniel's post in this reply
:-) was that the conceptual and design decision was that Strings are
not arrays, and are therefore not obliged or constrained to have an
Array-like API (any more than arrays are obliged to have a String-like
API).

There's no such thing as a character in Ruby. (See any discussion on Unicode, etc. in Ruby.) Strings are Objects (stored in C as char*s, I'd guess). Call the right methods on them, and you can get an Integer representing the byte value at a given position ("Hello"[0]), or another String object representing some manipulation of the String ("Hello"[0..0]). Those are your only means of inspection. That was just a long-winded way of saying "this is true."
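The two kinds of return value described above can be seen side by side. This is only a rough sketch, and it assumes Ruby 1.9 or later, where `"Hello"[0]` returns a one-character String instead of the Integer this thread describes; `getbyte` is used to obtain the Integer byte value on those versions. The helper name `first_element_views` is made up for illustration.

```ruby
# Hypothetical helper: collects the different "first element" views of a string.
def first_element_views(str)
  {
    byte:  str.getbyte(0), # Integer byte value (what str[0] gave in Ruby 1.8)
    slice: str[0..0],      # new String produced by slicing with a Range
    index: str[0]          # a one-character String on Ruby 1.9 and later
  }
end

views = first_element_views("Hello")
puts views[:byte]   # the byte value of "H"
puts views[:slice]
puts views[:index]
```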

Anything I didn't reply to, I probably agree with. Since the String methods don't have a consistent notion of an "element," it doesn't seem it would hurt to choose whichever notion we want for a potential #shift method.

That's true only if there's an imperative to have a String instance
method called "shift". I don't think there is. Maybe a left
chop/chomp operation would be a good idea, but I think it should be
called lchop, which would be consistent with other string method
naming (rather than with array method naming).

David

On Sun, 10 Jul 2005, Devin Mullins wrote:

--
David A. Black
dblack@wobblini.net

Wow, some mixup --- Daniel, David, Devin and Levin. :-)

  > There is no clear notion of an "Element" in a String.

  > If this is true, then we have a serious problem. Before
  > doing much anything about its API, we need to decide whether
  > String is a byte array or a character array. (Presumably,
  > matz & co. already have.)

  > My understanding was that the conceptual and design decision
  > was that Strings are not arrays, and are therefore not
  > obliged or constrained to have an Array-like API (any more
  > than arrays are obliged to have a String-like API).

To me, the term _array_ means ``sequence of elements stored in a
contiguous chunk of memory,'' just like the term _list_ means
``sequence of elements stored in a linked list of cells.''

As long as strings are implemented as arrays, I reserve the right to
refer to them as such, regardless of whether String < Array.
(Of course, I won't do it needlessly since it is confusing.)

In Haskell, strings are not arrays, but lists. In many languages,
strings are immutable, which deemphasises their nature as arrays.
In Ruby, however, strings are mutable arrays (of bytes?). So far I
have not been able to understand the desire to obfuscate this fact.

What's the harm of admitting that strings are arrays? What's the harm
of making them at least _quack_ alike?

  > Since the String methods don't have a consistent notion of an
  > "element," it doesn't seem it would hurt to choose whichever
  > notion we want for a potential #shift method.

  > That's true only if there's an imperative to have a String
  > instance method called "shift". I don't think there is.

Why not? This thread started with an example of the need for one.
Of course you can use string.slice!(0), but the same goes for arrays.

The terms `shift' and `unshift' are general and well-understood.
I can't see any reason why they shouldn't be applied to strings.
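For what it's worth, a shift/unshift pair built on `slice!` and `insert` might look roughly like the sketch below. The module `StringShift` and its method names are hypothetical, not part of Ruby's String API, and the sketch assumes that an "element" is a single character, which is exactly the point under debate.

```ruby
# Hypothetical helpers; Ruby's String has no shift or unshift.
module StringShift
  # Remove and return the first character of str, mutating str.
  # Returns nil for an empty string, mirroring Array#shift.
  def self.shift(str)
    str.slice!(0, 1) unless str.empty?
  end

  # Prepend prefix to str in place, mirroring Array#unshift.
  def self.unshift(str, prefix)
    str.insert(0, prefix)
  end
end

s = "hello".dup               # dup so the literal is safely mutable
StringShift.shift(s)          # removes the leading "h"
StringShift.unshift(s, "j")   # prepends "j"
puts s
```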

  > Maybe a left chop/chomp operation would be a good idea, but I
  > think it should be called lchop, which would be consistent
  > with other string method naming (rather than with array
  > method naming).

I fail to see the point in that. String#chop is meant to chop off
end-of-line characters; String#lchop wouldn't be. So using that name
would be _inconsistent_ with other string method naming.

--
Daniel Brockman <daniel@brockman.se>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.

Hi --

> Maybe a left chop/chomp operation would be a good idea, but I
> think it should be called lchop, which would be consistent
> with other string method naming (rather than with array
> method naming).

I fail to see the point in that. String#chop is meant to chop off
end-of-line characters; String#lchop wouldn't be. So using that name
would be _inconsistent_ with other string method naming.

String#chop chops off the rightmost character:

   irb(main):001:0> "abc".chop
   => "ab"

You may be thinking of "chomp", which is a specialized "chop"
operating only on newline characters.

So the idea of lchop would be to serve as a left-hand equivalent of
chop.
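A minimal sketch of what such a left-hand chop could do, assuming it simply drops one leading character and returns a new string. The name `lchop` comes from this thread; Ruby itself defines no such method.

```ruby
# Hypothetical helper; there is no String#lchop in Ruby.
def lchop(str)
  # str[1..-1] is nil for an empty string, so fall back to "".
  str[1..-1] || ""
end

puts lchop("abc")   # left-hand counterpart
puts "abc".chop     # built-in right-hand chop
```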

David

On Sun, 10 Jul 2005, Daniel Brockman wrote:

--
David A. Black
dblack@wobblini.net

"David A. Black" <dblack@wobblini.net> writes:

> String#chop chops off the rightmost character:
>
>    irb(main):001:0> "abc".chop
>    => "ab"

Except if the string ends with a CRLF pair:

   "abc\r\n".chop #=> "abc"

> You may be thinking of "chomp", which is a specialized "chop"
> operating only on newline characters.

If you read the docstrings, you get the impression that String#chop
is more-or-less deprecated in favor of the ``safer'' String#chomp:

   +String#chomp+ is often a safer alternative, as it leaves
   the string unchanged if it doesn't end in a record separator.

> So the idea of lchop would be to serve as a left-hand equivalent
> of chop.

So I suppose if the string starts with a CRLF pair, String#lchop would
chop off two characters from the left?

Why not go all the way and let all string methods treat CRLF pairs as
single characters?

I think it's a problem that strings are the only way to go for raw
byte arrays in Ruby, yet

  * strings lack a few random useful array methods

  * the string methods are not binary safe.

--
Daniel Brockman <daniel@brockman.se>

Hi --

"David A. Black" <dblack@wobblini.net> writes:

>> String#chop chops off the rightmost character:
>>
>>    irb(main):001:0> "abc".chop
>>    => "ab"
>
> Except if the string ends with a CRLF pair:
>
>   "abc\r\n".chop #=> "abc"
>
>> You may be thinking of "chomp", which is a specialized "chop"
>> operating only on newline characters.
>
> If you read the docstrings, you get the impression that String#chop
> is more-or-less deprecated in favor of the ``safer'' String#chomp:
>
>   +String#chomp+ is often a safer alternative, as it leaves
>   the string unchanged if it doesn't end in a record separator.

I believe that means safer in the sense that if you're going through,
say, lines in a file, and for some reason there's no \n at the end of
the last line, you won't accidentally cut off a non-\n character.

In the general case, #chop can't be deprecated in favor of #chomp,
because #chomp doesn't offer the same functionality (chopping off the
last character).
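The difference is easy to see on a list of lines whose last entry has no trailing newline. The sample data here is made up for illustration.

```ruby
lines = ["first\n", "second\r\n", "last"]   # final line lacks a newline

# chomp strips only a trailing record separator, so "last" survives intact.
chomped = lines.map { |l| l.chomp }

# chop always removes something: a "\r\n" pair, or else one character.
chopped = lines.map { |l| l.chop }

p chomped
p chopped
```

So #chomp is the one to reach for when cleaning up input lines, while #chop remains the only one of the two that removes an arbitrary final character.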

>> So the idea of lchop would be to serve as a left-hand equivalent
>> of chop.
>
> So I suppose if the string starts with a CRLF pair, String#lchop would
> chop off two characters from the left?

That's a good question. One could argue that the only reason they are
treated together in the first place is that they represent the more
abstract concept "newline" -- and that where they aren't representing
that concept, they should be treated separately. Or one could go for
the complete symmetry approach. I guess I'd tend to favor the former
notion, since the idea of left-end/right-end is already irreducibly
asymmetrical in a left-to-right writing system. (Though then there's
the matter of what would happen given a right-to-left writing system,
etc.)
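To make the two options concrete, here is a rough sketch of both. Both helper names are invented for this illustration, and neither behavior exists in Ruby.

```ruby
# Both helpers are hypothetical; neither exists in Ruby.

# Option 1: always remove exactly one leading character.
def lchop_one(str)
  str[1..-1] || ""
end

# Option 2: mirror String#chop by treating a leading "\r\n" as one unit.
def lchop_symmetric(str)
  str.start_with?("\r\n") ? str[2..-1] : (str[1..-1] || "")
end

p lchop_one("\r\nabc")        # leaves the "\n" behind
p lchop_symmetric("\r\nabc")  # removes the whole pair
```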

> Why not go all the way and let all string methods treat CRLF pairs as
> single characters?

See above -- there's no magic association between those two
characters, just the historical fact of their serving the newline
role, and the practical need to acknowledge that role. I don't think
there would be any advantage to, say, having String#count combine
them, etc. (though of course there would have been an advantage to
global agreement several decades ago on how to represent newline on
various platforms :-)

> I think it's a problem that strings are the only way to go for raw
> byte arrays in Ruby, yet
>
> * strings lack a few random useful array methods
>
> * the string methods are not binary safe.

String, like Hash, raises interesting questions about the relation
between itself, Array, and Enumerable. It's interesting that
String#to_a breaks the string into lines as opposed to characters or
bytes. That's certainly a behavior one would not expect if the
"arrayness" of strings resided strictly in their status as ordered
collections of bytes. On the other hand, they are ordered collections
of characters or bytes :-) I still find myself expecting String#each
to go bytewise. But the fact that these different objects don't map
exactly on to each other is, I think, one of the points of having a
higher and separate abstraction like Enumerable. It decouples them,
while still not making it impossible to assimilate them to each other
when necessary.
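As a point of comparison, and assuming Ruby 1.9 or later (where String#each and String#to_a no longer exist), the three ways of carving up a string each have an explicitly named iterator, so the choice of "element" is left to the caller. A small sketch:

```ruby
s = "ab\ncd\n"

by_line = s.each_line.to_a   # line-oriented view, as String#to_a gave here
by_byte = s.each_byte.to_a   # Integer byte values
by_char = s.each_char.to_a   # one-character Strings

p by_line
p by_byte
p by_char
```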

David

On Mon, 11 Jul 2005, Daniel Brockman wrote:

--
David A. Black
dblack@wobblini.net