Austin Ziegler wrote in post #1061436:
> This is *not* a Ruby problem, this is a *data* problem.
Leaving aside the point that not all data is text, you still need a
clear conceptual model to be able to reason about your program.
In Python 3, there is a clear distinction between "characters" and "a
sequence of bytes which encode those characters". They are two
completely different classes and cannot be combined (e.g. a+b will
always fail if a is str and b is bytes). It's also symmetrical: you
convert from bytes to characters as text enters your program, and from
characters to bytes as text leaves it.
(Aside: I know that Python only supports Unicode characters, but this is
just an implementation limitation. There could be a third class
"gb2312str" if desired, and additional classes for other character sets
which are not subsets of Unicode)
Ruby muddles these concepts by having all strings be a sequence of bytes
plus the encoding, which in turn muddles the concepts of "character set"
and "a method of encoding that character set".
Now, you could argue that Ruby is actually implementing the Python 3
approach but in a "lazy" way: by not explicitly converting bytes to
characters until required, it avoids potentially unnecessary work. But
if so, it's half-baked. For example, you cannot combine a UTF-16LE
string with a UTF-16BE string, even though they are the same character
set (Unicode). What's worse is that a UTF-16LE string will sort
differently from a UTF-16BE string (because ruby 1.9 sorts by byte
ordering, which happens to work for UTF-8 but not for all other
encodings of Unicode). So it kind of behaves like a string of
characters, except that it doesn't.
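Here's what that looks like in irb (my session, not Austin's):

  le = "AB".encode("UTF-16LE")
  be = "AB".encode("UTF-16BE")

  le + be
  # Encoding::CompatibilityError: incompatible character encodings:
  #   UTF-16LE and UTF-16BE

  le.bytes.to_a                  #=> [65, 0, 66, 0]
  be.bytes.to_a                  #=> [0, 65, 0, 66]

Same characters, same character set, yet byte-wise comparison (and
hence sorting) sees two unrelated strings.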
Furthermore, ruby sometimes lets you combine objects representing
"characters" and "bytes", or "characters with encoding A" and
"characters with encoding B". Whether it is allowed or not depends on
the run-time contents of those objects.
If a = b + c *always* crashed when b and c had different encodings, I
would really not have a problem with any of this. Your test case would
immediately catch it, you fix it, problem solved.
However ruby 1.9's insidious behaviour means that b + c may *or may not*
crash, depending not only on the encodings but on the actual content of
the strings at that instant. One perfectly reasonable set of tests may
pass; actual application data may fail.
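Concretely, something like this (my own example; the accented literal
assumes a UTF-8 source file):

  utf8   = "résumé"
  ascii  = "from a socket".force_encoding("ASCII-8BIT")
  binary = "\xFF\xFE plus more bytes".force_encoding("ASCII-8BIT")

  utf8 + ascii                   # fine: this binary string happens to be
                                 # 7-bit clean, so it is "compatible"
  utf8 + binary
  # Encoding::CompatibilityError: incompatible character encodings:
  #   UTF-8 and ASCII-8BIT

Same code, same pair of encodings; only the data differs.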
Finally, ruby is asymmetrical. On input, encodings are tagged; on
output, they are ignored (by default). From files, the environment
encoding is used; from sockets, the ASCII-8BIT encoding is used. With
regexps, invalid strings cause an exception; with String# they do not.
It is an utter dog's breakfast of arbitrary rules which you have no
choice but to learn.
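To pick on just one of those rules, here's the kind of thing I mean (my
own example):

  bad = "abc\xFFdef".force_encoding("UTF-8")
  bad.valid_encoding?            #=> false

  bad.length                     #=> 7   the bad byte quietly counts as a character
  bad.upcase                     # no exception; the invalid byte just survives

  bad =~ /def/
  # ArgumentError: invalid byte sequence in UTF-8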
Some people see ruby 1.9's highly complex encoding implementation as a
triumph of engineering; I see it as a design smell.
> Matz and others have worked very hard to make sure that Ruby 1.9 works
> well if you follow certain rules regarding your inputs and outputs.
... which one has to absorb by osmosis. Certainly the core API docs
don't give these rules; in fact they give precious little about the
encoding semantics of String. And you can't get much more of a core part
of the language than String.
Want to find out what String# does when given a string which contains
invalid characters in its declared encoding? The docs won't help you.
Try it and see. Or go to the C source code.
Of course, because every String is now two-dimensional (x = sequence of
bytes, y = Encoding) there is a much higher requirement to document
every method which acts on a string or returns a string, because there
is a much larger variety of inputs and outputs to consider.
Take strings with invalid characters, for example, or the fact that
every returned string also has an encoding and you need to document how
it is chosen. (For example Net::HTTP: does it return strings with
encoding from the Content-Type header? You tell me)
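You can at least go and poke at it yourself; something like this (my
own quick probe, with a placeholder URL) tells you what you actually
got:

  require "net/http"
  require "uri"

  res = Net::HTTP.get_response(URI("http://example.com/"))
  puts res["Content-Type"]       # whatever charset the server claims
  puts res.body.encoding         # what the returned String is tagged with

But that's "try it and see" again, not documentation.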
Incidentally, strings with invalid characters are not an edge case or
only for erroneous input. Ruby encourages you to do things like:
  txt = sock.read(4096)    # txt likely to contain a split character at the end
This could be dealt with if you explicitly converted bytes to characters
at some point (you'd buffer the extra bit; there's a sketch of that
below). By not having this explicit conversion, you are quite likely to
have byte patterns which don't represent *any* character. Yes, you can
do the buffering yourself; I'm just saying that all methods need to
*document* whether they accept strings with invalid bytes, and how they
handle them.
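For what it's worth, the explicit conversion I have in mind looks
roughly like this -- a minimal sketch of my own, assuming UTF-8 on the
wire and input that is merely split at read boundaries rather than
genuinely corrupt (sock is any IO-ish object):

  # Read raw bytes, hand back only whole UTF-8 characters, and keep any
  # trailing partial character in `pending` for the next call.
  def read_text(sock, pending)
    pending << sock.readpartial(4096)        # pending is an ASCII-8BIT buffer
    text = pending.dup.force_encoding("UTF-8")
    3.times do                               # a UTF-8 char is at most 4 bytes,
      break if text.valid_encoding?          # so at most 3 trailing bytes can
      text = text.byteslice(0, text.bytesize - 1)  # belong to a split character
    end
    pending.replace(pending.byteslice(text.bytesize..-1))
    text
  end

  buf = "".force_encoding("ASCII-8BIT")
  chunk = read_text(sock, buf)               # buf carries over the split tail

A dozen lines, but every caller of read/readpartial that wants
characters needs something like it -- or at least needs the library to
tell them whether it has already been done.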
> If you don't respect your encodings, they will bite you. They may not
> bite you up front (as they do with Ruby, because it exposes these
> things which are painful), but they *will* bite you.
Certainly you need to know about character sets and how they are
encoded. This does not imply that ruby does it in a sane way. And as I
said before, if Ruby were to bite you consistently, it would be much
better.
> Ruby got it right, because it acknowledges that (a) this is hard and
> (b) gives you the tools you need in order to make this less painful.
> It also doesn't (c) incorrectly assume that everything is or can be
> expressed safely in Unicode. (Shift-JIS will not roundtrip to Unicode
> and back for some characters.)
That's kind of irrelevant, since ruby 1.9 doesn't really handle
Shift-JIS either, except to transcode it.