Unicode in Ruby now?

Curt Sampson wrote:

These statements are both very wrong.

I was deliberately being “reinterpretive”, but what I said is the effective
truth.

What part of “UTF-32 is restricted in values to the range 0…10FFFF₁₆,
which precisely matches the range of characters defined in the
Unicode Standard (and other standards such as XML), and those
representable by UTF-8 and UTF-16.” (Unicode Standard Annex #19)
don’t you understand?

Also note that UTF-32 is still “variable length” in some senses,
in that it can still have combining characters that you need to
interpret.

If you want to do Unicode-3 correctly and simply, then a 32-bit
internal representation is the right one…

You have addressed none of the points I made in my previous post, where
I said that one can do Unicode 3 correctly and simply in UTF-16.
Please address them.

  • surrogates simply recreate
    exactly the same problems of variable-length encoding that plagued
    earlier efforts at providing a simple way (for the programmer) to code
    correctly.

If you consider variable length a real problem, UTF-32 doesn’t fix
it, since it still has combining characters.
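To make that concrete, here is a sketch in Ruby (assuming a
Unicode-aware String with a codepoints method, as in Ruby 1.9 and
later; the glyph chosen is just an example): the same user-perceived
character can be one code point or two, so even fixed-width UTF-32
gives you no fixed width per glyph.

    # One user-perceived character, two different code point sequences.
    precomposed = "\u00E9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    combining   = "e\u0301"   # U+0065 plus U+0301 COMBINING ACUTE ACCENT

    p precomposed.codepoints  # => [233]
    p combining.codepoints    # => [101, 769]
    # In UTF-32 the second form still occupies two 32-bit units, so
    # "one word per character" does not hold at the glyph level.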

Externally, a more compact encoding is needed (UTF-8 or UTF-16
are valid choices), but internally, UTF-16 is bogus in the extreme.

This is completely wrong.

You seem to be so wedded to the Java/Unicode model that you can’t see
out of the hole into which you’ve dug yourself.

No. I’m going by stuff out of the Unicode 3 standard here, not just
the Java model.

If you’d actually work through some typical cases of string use and
see what happens when they encounter surrogate pairs, you’d see
that your analysis of the problem is not at all correct.

cjs

···

On Mon, 5 Aug 2002, Clifford Heath wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Curt Sampson wrote:

What part of “UTF-32 is restricted in values to the range 0…10FFFF₁₆,
which precisely matches the range of characters defined in the
Unicode Standard (and other standards such as XML), and those
representable by UTF-8 and UTF-16.” (Unicode Standard Annex #19)
don’t you understand?

I see you have built a new 21-bit computer which is going to conquer
the world, have you? If not, what size memory word do you store that
21-bit value in? So you see that 32 bits is effectively the minimum
that gets used to store a single UTF-32 codepoint value.

The limit of 0…10FFFF₁₆ is to protect the broken UTF-16 encoding. If
you must deal with a variable-length encoding, at least UTF-8 can be
extended to encompass all the world’s character sets.

Also note that UTF-32 is still “variable length” in some senses,
in that it can still have combining characters that you need to
interpret.

Yes, and aren’t they a bugger! But unavoidable, since rooted in real
languages. The only saving grace is that many applications don’t have
to worry about them.

You have addressed none of the points I made in my previous post, where
I said that one can do Unicode 3 correctly and simply in UTF-16.

You’re quite right, and I never had an issue with that. I do have an
issue with the way the standard shuts out a large number of characters
in order to protect a broken encoding. Unicode 3 doesn’t go far enough
not because it panders to the majority but because it shuts out the
minority who need those characters.

If you’d actually work through some typical cases of string use

I have. Optimising for typical string use shouldn’t come at the expense
of representational completeness. How would you feel if some European
body decided that the letters J and Y were to be dropped from English
because they didn’t fit in some six-bit encoding? (I once worked on a
CDC computer which used six-bit characters - no lower case!)

···


Clifford Heath

I see you have built a new 21-bit computer which is going to conquer
the world, have you? If not, what size memory word do you store that
21-bit value in?

If I want to store 21-bit values, I store them in a 32-bit word, wasting
11 bits. That does not make the range of the values 32 bits.

But as far as Unicode goes, I store that in 16-bit words, for reasons I
have gone into in detail before.

Also note that UTF-32 is still “variable length” in some senses,
in that it can still have combining characters that you need to
interpret.

Yes, and aren’t they a bugger! But unavoidable, since rooted in real
languages. The only saving grace is that many applications don’t have
to worry about them.

Many applications need not worry about surrogate pairs, either.
Also, surrogate pairs are much, much easier to deal with than
combining characters. So removing surrogate pairs does little to
solve this problem.

I do have an
issue with the way the standard shuts out a large number of characters
in order to protect a broken encoding.

It doesn’t shut them out in the slightest; surrogate pairs work
just fine. Please re-read my previous posts.

Unicode 3 doesn’t go far enough
not because it panders to the majority but because it shuts out the
minority who need those characters.

Hello! It does not shut them out! What part of “surrogate pairs
work fine without special processing” don’t you understand? Why
don’t you point out how they break, if you think that they do,
rather than making unsubstantiated claims?

Also, I would like to know just who is using these characters, and
for what.

In Japan, I have never met anyone, outside of those testing Unicode or
Unicode-based applications, who has ever used a surrogate character.
Aside from a few kanji specialists, Japanese people would be hard
pressed even to identify the readings and meanings of more than a
handful of these surrogate characters if you showed them the whole lot.
The Japanese have demonstrated that they have no pressing need for these
characters in their computer character sets by sticking for years to
standards (EUC-JP, Shift-JIS, ISO-2022-JP) that do not contain any of
these characters. (Unicode handles the characters from all of these
character sets without using surrogate pairs.)

I have. Optimising for typical string use shouldn’t come at the expense
of representational completeness.

Of course not. But UTF-16 is representationally complete. There is
no Unicode character you can represent in UTF-32 that you cannot
represent in UTF-16.
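The mapping is mechanical. Here is a sketch of the arithmetic from the
UTF-16 definition, in Ruby (the code point used is an arbitrary
supplementary-plane example):

    # Encode a code point in U+10000..U+10FFFF as a UTF-16 surrogate pair.
    def to_surrogate_pair(cp)
      v = cp - 0x10000             # 20 significant bits remain
      high = 0xD800 + (v >> 10)    # top 10 bits -> high surrogate
      low  = 0xDC00 + (v & 0x3FF)  # bottom 10 bits -> low surrogate
      [high, low]
    end

    p to_surrogate_pair(0x10384).map { |u| format("%04X", u) }
    # => ["D800", "DF84"]

Every code point up to U+10FFFF is reachable this way, which is exactly
why the standard stops at that value.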

How would you feel if some European
body decided that the letters J and Y were to be dropped from English
because they didn’t fit in some six-bit encoding? (I once worked on a
CDC computer which used six-bit characters - no lower case!)

NOTHING IS BEING DROPPED! ALL CHARACTERS ARE REPRESENTABLE IN UTF-16! HELLO!

The proper parallel here is, “how would you feel if some European
body decided that the thorn character was, only under certain
circumstances, going to be marginally harder to deal with.”

BTW, when was the last time you used thorn? That’s the equivalent of the
kanji that are in the surrogate range.

cjs

···

On Tue, 6 Aug 2002, Clifford Heath wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Curt Sampson wrote:

NOTHING IS BEING DROPPED! ALL CHARACTERS ARE REPRESENTABLE IN UTF-16! HELLO!

Hey, no need to shout. Why is there a Mojikyo standard if no-one’s being
shut out? I can only assume that they want to represent characters that
are excluded from Unicode but could have been easily included via UCS-4
and UTF-8. Am I wrong, and are the Mojikyo people just pig-headed?

Aside from a few kanji specialists,…

Those few also need to use computers - linguistic study is a valid use for
computers - and there’s no need to adopt a standard that forces them to
use an encoding other than that used by the rest of the world.

···


Clifford Heath

In article Pine.NEB.4.44.0208060937500.429-100000@angelic.cynic.net,
Curt Sampson cjs@cynic.net writes:

In Japan, I have never met anyone, outside of those testing Unicode or
Unicode-based applications, who has ever used a surrogate character.

Because Unicode has all JIS X 0208 characters in the BMP. However, the new
standard JIS X 0213:2000 requires the surrogate area to represent it.

Since Unicode has only assigned characters to the surrogate area since
Unicode 3.1, which was released in 2001, they are not widely used yet.
I agree with you, for now. But I think they will be deployed in the
near future.

···


Tanaka Akira

Curt Sampson wrote:

NOTHING IS BEING DROPPED! ALL CHARACTERS ARE REPRESENTABLE IN UTF-16! HELLO!

Hey, no need to shout. Why is there a Mojikyo standard if no-one’s being
shut out?

Because certain very specialized applications, mainly to do with
academic research into Kanji, need to represent more kanji than Unicode
cares to represent.

If you need it, you need it. If you don’t, you don’t want to get
near it because it’s extremely unwieldy and expensive to use. Also,
note that it’s only useful for kanji, not for other writing systems.

I can only assume that they want to represent characters that
are excluded from Unicode but could have been easily included via UCS-4
and UTF-8.

One doesn’t “include” things in Unicode via UCS-4 and UTF-8.
Those are merely encoding schemes for Unicode; different ways of
representing it. All Unicode characters can be represented in UTF-8,
UTF-16 and UTF-32.

You can encode other character sets in these encodings as well, but then
(tautology here) you’re not encoding Unicode. From the spec, section 3.8:

The definition of UTF-8 in Amendment 2 to ISO/IEC 10646 also
allows for the use of five- and six-byte sequences to encode
characters that are outside the range of the Unicode character
set; those five- and six-byte sequences are illegal for the
use of UTF-8 as a transformation of Unicode characters.
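For reference, here is a sketch of how the sequence length falls out of
the lead byte in the classic UTF-8 definition (the five- and six-byte
forms are the ones the spec text above rules out for Unicode):

    # UTF-8 sequence length, determined by the lead byte alone.
    def utf8_sequence_length(lead)
      case lead
      when 0x00..0x7F then 1   # plain ASCII
      when 0xC0..0xDF then 2
      when 0xE0..0xEF then 3
      when 0xF0..0xF7 then 4   # enough to reach U+10FFFF
      when 0xF8..0xFB then 5   # ISO/IEC 10646 only; illegal for Unicode
      when 0xFC..0xFD then 6   # ISO/IEC 10646 only; illegal for Unicode
      else nil                 # continuation byte or invalid lead
      end
    end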

Aside from a few kanji specialists,…

Those few also need to use computers - linguistic study is a valid use for
computers - and there’s no need to adopt a standard that forces them to
use an encoding other than that used by the rest of the world.

Yes there is. They have completely different requirements that are
basically impossible to meet with something also intended for
general-purpose use by the rest of the world. (One of these is the ability
to change the standard very quickly.) And they’re not interested
in supporting many of the things required by other languages
(right-to-left text, combining characters, etc.)

Would you make the entire world stop using regular filesystems and
start using Lotus Notes just because a few people need to have
full-text search and additional indexed fields in some applications?

Anyway, it appears to me that you do not have a very good understanding
of Unicode, so I am going to drop this argument until you go read
the specification carefully and can point out which parts of it
you want to argue with. Nothing I’ve been saying here is my own
original idea; it’s all in the Unicode specification itself.

cjs

···

On Tue, 6 Aug 2002, Clifford Heath wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

In article Pine.NEB.4.44.0208060937500.429-100000@angelic.cynic.net,
Curt Sampson cjs@cynic.net writes:

In Japan, I have never met anyone, outside of those testing Unicode or
Unicode-based applications, who has ever used a surrogate character.

Because Unicode has all JIS X 0208 characters in the BMP.

Yes. And also all JIS X 0212 characters.

However, the new
standard JIS X 0213:2000 requires the surrogate area to represent it.

Right.

Since Unicode has only assigned characters to the surrogate area since
Unicode 3.1, which was released in 2001, they are not widely used yet.
I agree with you, for now. But I think they will be deployed in the
near future.

Do you really think so? I think it says a lot that the vast majority
of Japanese text processing and web pages are done in Shift_JIS, which
encodes only JIS X 0208. It doesn’t encode JIS X 0212, much less JIS X
0213. If most Japanese people during the past decade have not even felt
the need to add JIS X 0212 to their character repertoire, why would they
need to add JIS X 0213?

JIS X 0213 is two years old now; what products use it?

cjs

···

On Tue, 6 Aug 2002, Tanaka Akira wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

I intend to drop the discussion too, since you seem to want to
discuss what is or isn’t allowed in the Unicode spec (with which
I’m quite familiar, thank you), and I don’t. I was hoping for some
intelligent discussion about what it perhaps should and could
allow, not what it does allow. It seems I’m doomed to
disappointment, since you only seem to want to punch noses.

···


Clifford Heath

In article Pine.NEB.4.44.0208071641320.1214-100000@angelic.cynic.net,
Curt Sampson cjs@cynic.net writes:

Do you really think so? I think it says a lot that the vast majority
of Japanese text processing and web pages are done in Shift_JIS, which
encodes only JIS X 0208. It doesn’t encode JIS X 0212, much less JIS X
0213. If most Japanese people during the past decade have not even felt
the need to add JIS X 0212 to their character repertoire, why would they
need to add JIS X 0213?

For Shift_JIS people, JIS X 0213 provides a Shift_JIS-compatible encoding.

JIS X 0213 is two years old now; what products use it?

MacOS X supports JIS X 0213.

···


Tanaka Akira

it’s a fight! :-)

okay, guys, the UNICODE thread has really gone on and on. i unfortunately
don’t know diddly about any of these encodings and from the sounds of it
i don’t want to. ;-)

but instead of ending it this way, can anyone sum up the overall
conclusions of this lengthy discussion? that’s what i’d like to hear.
i.e. does any encoding scheme out there do the job, the whole job, and
nothing but the job? or are they all flawed and somebody someday needs
to sit down and figure the problem out and fix it for good?

~transami

···

On Mon, 2002-08-05 at 22:40, Clifford Heath wrote:

I intend to drop the discussion too, since you seem to want to
discuss what is or isn’t allowed in the Unicode spec (with which
I’m quite familiar, thank you), and I don’t. I was hoping for some
intelligent discussion about what it perhaps should and could
allow, not what it does allow. It seems I’m doomed to
disappointment, since you only seem to want to punch noses.


Clifford Heath


~transami

I wish you’d said that at the beginning. I would never have started
the argument, since I’m interested only in how to deal with multiple
languages and charsets in a way compatible with the rest of the world.

This of course involves using Unicode, rather than dreaming up new
schemes for character representation.

cjs

···

On Tue, 6 Aug 2002, Clifford Heath wrote:

I was hoping for some intelligent discussion about what it perhaps
should and could allow, not what it does allow.


Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

but instead of ending it this way, can anyone sum up the overall
conclusions of this lengthy discussion? that’s what i’d like to hear.

Unicode covers a lot of what people want to do, but not everything.
Therefore there will be specialized situations in which you cannot use
Unicode.

Unicode has some variable-length issues with surrogates, combining
characters, and various encodings. In all encodings, a combining
sequence is multiple Unicode code points representing a single glyph
on screen.

UTF-32: 4-byte chars, except for combining sequences.

UTF-16: 2-byte chars, except for combining sequences and surrogates.

UTF-8: 1-byte chars, except for non-ASCII characters, combining
sequences, and surrogates.

Note well that all Unicode encodings have variable-length issues, due
to combining characters. Surrogate pairs are dealt with very, very
simply if you are not actually interpreting them (you can easily tell
from looking at a single 16-bit unit whether it’s a high or low
surrogate, and the only rule is not to split them). Combining characters
are rather more difficult.
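In code, the test is a single range check per 16-bit unit (a sketch in
Ruby, assuming the string is available as an array of UTF-16 code
units):

    # Classify a 16-bit UTF-16 code unit.
    HIGH_SURROGATES = 0xD800..0xDBFF   # first unit of a pair
    LOW_SURROGATES  = 0xDC00..0xDFFF   # second unit of a pair

    def high_surrogate?(unit) HIGH_SURROGATES.include?(unit) end
    def low_surrogate?(unit)  LOW_SURROGATES.include?(unit)  end

    p high_surrogate?(0xD800)   # => true
    p low_surrogate?(0x0041)    # => false, an ordinary character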

The processing you do on strings with surrogate pairs can be divided
into three categories.

1. No special processing required, because your processing
cannot break anything. (E.g., web page form information to and
from a database).

2. Little special processing required to avoid any breakage (i.e.,
don't split pairs); minimal damage if you do split pairs.

3. Extensive processing required because you're interpreting
the pairs.

3 is necessary only for things interacting with a user that have
to display proper glyphs and accept input. Most such things are
language-specific, because it’s impossibly complex to have one way
of doing things that covers everything. (For a start, just try to
think up an input method that works for Chinese, Japanese and Korean
simultaneously. And then also handles Hebrew.) And most people don’t
need that anyway.

2 is pretty trivial to implement above the string layer, where
necessary, and often needs to be combined with other stuff anyway
(e.g., line-wrapping algorithms, which are language-specific).
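For example, a truncation that respects the no-splitting rule takes
only a few lines (a sketch; safe_truncate is a hypothetical helper
operating on an array of 16-bit code units):

    # Cut a UTF-16 code unit array to at most n units without
    # splitting a surrogate pair.
    def safe_truncate(units, n)
      return units if units.length <= n
      # If the last unit kept would open a pair, back off by one.
      n -= 1 if n > 0 && (0xD800..0xDBFF).include?(units[n - 1])
      units[0, n]
    end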

1 is a surprisingly common situation.

There is no equivalent shortcut for combining characters; you have to
do some real processing there. Pretty much everybody has just ignored
the problem from the beginning, and it’s not been that big a deal.

And one more thing a lot of people have missed: UTF-8 is the most
efficient storage format only for text the majority of which is ASCII;
if it’s not, UTF-16 is about as efficient or, in the case of Asian
languages, more efficient.
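This is easy to check (a sketch assuming a String#encode that can
transcode to UTF-16, as in Ruby 1.9 and later; the strings are just
examples):

    ascii    = "hello world"          # 11 characters
    japanese = "\u65E5\u672C\u8A9E"   # "Japanese" written as three kanji

    p ascii.bytesize                         # => 11 (UTF-8: 1 byte each)
    p ascii.encode("UTF-16BE").bytesize      # => 22 (UTF-16: 2 bytes each)

    p japanese.bytesize                      # => 9 (UTF-8: 3 bytes per kanji)
    p japanese.encode("UTF-16BE").bytesize   # => 6 (UTF-16: 2 bytes per kanji)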

A rather more disputed point is that surrogate pairs and combining
characters are pretty darn rare. But one thing to remember is that a)
people have been mostly ignoring combining characters all along, with
very little fuss from anyone, and b) only within the last year or two
have there even been any characters assigned to the surrogate pair area.

i.e. does any encoding scheme out there do the job, the whole job, and
nothing but the job? or are they all flawed and somebody someday needs
to sit down and figure the problem out and fix it for good?

They are all flawed in one way or another. Figuring out something
that covers everything would probably not be much more difficult
than Reagan’s Star Wars project, however.

One should also keep in mind that the general public has been dealing
extensively with systems using flawed encodings for the past thirty
years, in the case of ASCII and the English speaking world, and ten
years or so in the case of Asian standards and the Asian world, and
there have not been major complaints regarding single-language support.
(Unicode basically just combines all the popular character sets of the
world into one big one to solve the multiple-language support problem.
It does not attempt–much, at any rate–to solve other problems that
have been present all along that people have brought up here. I opine
that that’s because had those problems really needed a solution, the
solution would have been created and standardized.)

cjs

···

On Tue, 6 Aug 2002, Tom Sawyer wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC