but instead of ending it this way, can anyone sum up the overall
conclusions of this lengthy discussion? that’s what i’d like to hear.
Unicode covers a lot of what people want to do, but not everything.
Therefore there will be specialized situations in which you cannot use
Unicode.
Unicode has some variable length issues with surrogates, combining
pairs, and various encodings. In all encodings, combining pairs (a base
character followed by one or more combining characters) are multiple
Unicode code points representing a single glyph on screen.
UTF-32: 4-byte chars, except for combining pairs.
UTF-16: 2-byte chars, except for combining pairs and surrogates.
UTF-8: 1-byte chars, except for non-ASCII (2 to 4 bytes each) and
combining pairs.
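To make the sizes above concrete, here is a small sketch that counts the bytes each encoding needs for a handful of sample characters. The sample characters and the helper name `encoded_sizes` are my own choices for illustration, and the byte order mark is left out by using the explicit little-endian codecs.

```python
# Bytes needed for one on-screen character in each encoding.
# The samples and the helper name are illustrative assumptions.
samples = {
    "ASCII letter": "A",
    "precomposed e-acute": "\u00e9",
    "CJK ideograph": "\u65e5",
    "supplementary char": "\U0001D11E",   # outside the 16-bit range
    "combining pair": "e\u0301",          # e + combining acute accent
}

def encoded_sizes(text):
    """Return (utf8, utf16, utf32) byte counts, with no byte order mark."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(text.encode(enc)) for enc in encodings)

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

The last two samples are the interesting ones: the supplementary character takes two 16-bit units in UTF-16, and the combining pair takes more than one code point in every encoding, UTF-32 included.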
Note well that all Unicode encodings have variable length issues, due
to the combining chars. Surrogate pairs are dealt with very, very simply
if you are not actually interpreting them (you can easily tell from
looking at a single 16-bit code unit whether it’s a high or low surrogate,
and the only rule is not to split them). Combining chars are rather more
difficult.
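As a rough illustration of how little is needed to recognise a surrogate, the sketch below classifies UTF-16 code units by range alone. The function names are made up for this example; the ranges D800-DBFF (high) and DC00-DFFF (low) are the ones Unicode reserves for surrogates.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_high_surrogate(unit):
    # First half of a pair.
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    # Second half of a pair.
    return 0xDC00 <= unit <= 0xDFFF

for unit in utf16_units("a\U0001D11E"):
    if is_high_surrogate(unit):
        kind = "high surrogate"
    elif is_low_surrogate(unit):
        kind = "low surrogate"
    else:
        kind = "ordinary unit"
    print(hex(unit), kind)
```

No context is needed: each unit can be classified on its own, which is what makes the "don't split them" rule cheap to follow.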
The processing you do on strings with surrogate pairs can be divided
into three categories.
1. No special processing required, because your processing
cannot break anything. (E.g., web page form information to and
from a database).
2. Little special processing required to avoid any breakage (i.e.,
don't split pairs); minimal damage if you do split pairs.
3. Extensive processing required because you're interpreting
the pairs.
3 is necessary only for things interacting with a user that have
to display proper glyphs and accept input. Most such things are
language-specific, because it’s impossibly complex to have one way
of doing things that covers everything. (For a start, just try to
think up an input method that works for Chinese, Japanese and Korean
simultaneously. And then also handles Hebrew.) And most people don’t
need that anyway.
2 is pretty trivial to implement above the string layer, where
necessary, and often needs to be combined with other stuff anyway.
(E.g., line wrapping algorithms, which are language-specific.)
1 is a surprisingly common situation.
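For category 2, a minimal sketch of "don't split pairs" might look like the following. It assumes the string is held as UTF-16 little-endian bytes, and `truncate_utf16` is a hypothetical helper name, not something from any particular string library.

```python
def truncate_utf16(data, max_units):
    """Cut UTF-16-LE bytes to at most max_units code units.

    If the cut would land between the two halves of a surrogate
    pair, back up one unit so the pair is dropped whole.
    """
    total_units = len(data) // 2
    if max_units >= total_units:
        return data
    cut = max_units
    if cut > 0:
        last = int.from_bytes(data[2 * (cut - 1):2 * cut], "little")
        if 0xD800 <= last <= 0xDBFF:   # last kept unit is a high surrogate
            cut -= 1
    return data[:2 * cut]

data = "ab\U0001D11Ec".encode("utf-16-le")
print(truncate_utf16(data, 3).decode("utf-16-le"))
```

The only extra work over a plain slice is one range check on the last unit kept, which is why this sort of thing sits comfortably above the string layer.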
There is no equivalent for combining pairs; you have to do some
real processing there. Pretty much everybody has just ignored the
problem from the beginning, and it’s not been that big a deal.
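To give a feel for the "real processing", here is a deliberately simplified sketch that groups each base character with the combining marks that follow it, using Python's standard `unicodedata` module. The name `split_clusters` is my own, and this is only an approximation: proper grapheme segmentation has more rules than "non-zero combining class attaches to the previous character".

```python
import unicodedata

def split_clusters(text):
    """Group each base character with its trailing combining marks.

    Simplified on purpose; not a full grapheme cluster algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch        # attach the mark to the previous base
        else:
            clusters.append(ch)       # start a new cluster
    return clusters

word = "e\u0301a"                     # e + combining acute, then a
print(len(word), len(split_clusters(word)))

# Normalization can fold some pairs into one precomposed code point,
# but not every combination has a precomposed form.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")
```

Unlike the surrogate case, you need a character property table just to find the boundaries, and normalization only helps for the pairs that happen to have a precomposed equivalent.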
And one more thing a lot of people have missed: UTF-8 is the most
efficient storage format only for text the majority of which is ASCII;
if it’s not, UTF-16 is about as efficient or, in the case of Asian
languages, more efficient.
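The storage claim is easy to check for yourself. The sketch below compares byte counts for three short sample strings; the samples and the helper name `storage_bytes` are assumptions picked for illustration, and real text with markup or lots of ASCII punctuation will shift the numbers toward UTF-8.

```python
def storage_bytes(text):
    """Return byte counts for UTF-8 and UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

samples = {
    "English": "Unicode covers a lot.",
    "Russian": "Привет",
    "Japanese": "日本語のテキスト",
}

for language, text in samples.items():
    print(language, storage_bytes(text))
```

For the ASCII sample UTF-8 is half the size, for the Cyrillic sample the two come out the same, and for the Japanese sample UTF-16 is the smaller one.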
A rather more disputed fact is that surrogate pairs and combining
characters are pretty darn rare. But two things to remember are that a)
people have been mostly ignoring combining pairs all along, with very
little fuss from anyone, and b) only within the last year or two have
there even been any characters assigned to the surrogate pair area.
i.e. does any encoding scheme out there do the job, the whole job, and
nothing but the job? or are they all flawed and somebody someday needs
to sit down and figure the problem out and fix it for good?
They are all flawed in one way or another. Figuring out something
that covers everything would probably not be much more difficult
than Reagan’s Star Wars project, however.
One should also keep in mind that the general public has been dealing
extensively with systems using flawed encodings for the past thirty
years, in the case of ASCII and the English speaking world, and ten
years or so in the case of Asian standards and the Asian world, and
there have not been major complaints regarding single-language support.
(Unicode basically just combines all the popular character sets of the
world into one big one to solve the multiple-language support problem.
It does not attempt–much, at any rate–to solve other problems that
have been present all along that people have brought up here. I opine
that that’s because had those problems really needed a solution, the
solution would have been created and standardized.)
cjs
On Tue, 6 Aug 2002, Tom Sawyer wrote:
Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC