Curt Sampson cjs@cynic.net wrote in message news:Pine.NEB.4.44.0208121234260.2317-100000@angelic.cynic.net…
This is one of the big advantages of UTF-16 over UTF-8; you can do
simple operations the simple way and still produce valid UTF-16 output.
(There’s no explicit rule, as far as I know at least, that states that
UTF-8 parsers must ignore broken characters, as there is with UTF-16.)
UTF-8 parsers must ignore “broken” characters because, as I pointed
out in a previous message, “broken” characters are never valid UTF-8,
due to the UTF-8 design. The standard now only allows parsing of
valid characters (the loopholes that existed in unicode version 3.0
were eliminated by updates in versions 3.1 and 3.2). The unicode
standard expressly forbids the interpretation of illegal UTF-8
sequences.
There are also advantages to a fixed-width encoding, such as the
recently introduced UTF-32, which can often outweigh the endianness
issues. (Encodings which are not byte-grained, such as UTF-32 and
UTF-16, need two variants, big-endian and little-endian.)
UTF-16 was not thought through very well. It is an encoding
following the mental line of least resistence – encode the
character points by their numbers. There was no reason the encoding
should have included 0-bytes, thus sabotaging byte-grained string
processing by C programs. And of course it was thought that all
characters of interest to any significant community could fit in
the two-byte “Basic Multilingual Plane”. This is not attainable
even with current unicode unless you consider Chinese, Japanese,
mathematicians, and musicians to be insignificant communities.
Also, important further expansion outside the BMP is inevitable.
But UTF-16 in both big-endian and little-endian variants is sure to
be one of those technical blunders which far outlives its
excusability, due to inertia and corporate politics, so Ruby should
probably provide direct support. Failing that, Ruby could provide
indirect support via invisible translation to some other unicode
encoding.
Some people use UTF-16 as a disk storage format and expand to
UTF-32 in memory. This allows one to directly access characters
by index for unicode strings in memory, while avoiding crass
inefficiency in disk usage. But for general multilingual processing,
UTF-8 seems more efficient and handier as a disk storage format.
The unicode consortium has recently promulgated yet another
encoding form, CESU-8, intended only for internal use by programs,
and not for data transfer between applications. CESU-8 is byte-
grained and similar to UTF-8, but CESU-8 has been designed so CESU-8
text will have the same binary collation as equivalent UTF-16 text.
I don’t know if there is a reason for RUBY to support this.
Though notoriously unwise myself, I'd like to make a plea for
some wisdom. Many people here have a great deal of experience with
internationalization, and rightly consider themselves experts. But
expertise comes in many flavors, and one should think twice before
making assertions about what other people need. The need for
internationalization, M17n, and so forth by a maker of corporate web
sites is different from the need of a mathematician, musician, or
someone trying to computerize Akkadian tablets. We should avoid the
parochial thought that our interests are the only important or
“practical” ones.
Regards, Bret
http://www.rexx.com/~oinkoink/
oinkoink
at
rexx
dot
com