Unicode in Ruby now?

Wed, 7 Aug 2002 16:41:18 +0900, Curt Sampson cjs@cynic.net writes:

UTF-8 is much less compact than UTF-16 for Asian text.

Well, not that much: at most 3/2 times larger.

And in UTF-16, surrogate pairs are encoded with 4 bytes, whereas
they take 6 bytes in UTF-8.

No, there are no surrogates in UTF-8. Characters above U+FFFF are
encoded in 4 bytes each. Surrogates exist only in UTF-16.

Anyway, if variable width is not a problem (and you say it isn’t if
you defend UTF-16), I would almost always choose UTF-8 as the default.
Yes, up to 3/2 larger for Asian text, but twice as compact for ASCII,
free of endianness issues, and ASCII-compatible, which is very important.

···


__("< Marcin Kowalczyk
__/ qrczak@knm.org.pl
^^ Blog of a good-natured man.

UTF-8 is 50% larger than UTF-16 only for text consisting entirely of
Asian characters. A typical Asian document contains both Asian and
ASCII characters, and in the case of markup, like HTML, ASCII strongly
outnumbers Asian characters.

Example:
http://www.ruby-lang.org/ja/whats.html (JIS)    4074 bytes
iconved to EUC                                  3756
iconved to UTF-8                                4304
iconved to UTF-16                               6418
iconved to UTF-32                              12836

If we assume that ASCII characters take 1 byte in both EUC and UTF-8,
and that Asian characters take 2 bytes in EUC and 3 in UTF-8, then each
Asian character costs exactly one more byte in UTF-8 than in EUC, so
the difference 4304 - 3756 = 548 is the Asian character count. The
numbers we get are:

ASCII characters: 2660
Asian characters:  548
Proportion:       4.85 : 1
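
The same arithmetic as a quick sketch in modern Ruby (written long
after this thread; the sizes are those from the iconv output above):

    # Each Asian character costs one more byte in UTF-8 than in EUC,
    # so the size difference is the Asian character count.
    euc, utf8 = 3756, 4304
    asian = utf8 - euc          # => 548
    ascii = euc - 2 * asian     # => 2660
    ascii.to_f / asian          # => 4.85...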

···

On Thu, Aug 08, 2002 at 06:28:34AM +0900, Marcin ‘Qrczak’ Kowalczyk wrote:


No, there are no surrogates in UTF-8. Characters above U+FFFF are
encoded in 4 bytes each. Surrogates exist only in UTF-16.

Sorry; you’re right.

Anyway, if variable width is not a problem (and you say it isn’t if
you defend UTF-16),

Well, actually the point with UTF-16 is that you can, in general, safely
ignore the variable width stuff. I don’t think you can do that so easily
in UTF-8. If I chop off a UTF-8 sequence in the middle, are applications
that read it required to ignore that, as they are with surrogates in
UTF-16? Or is it likely that they will break, instead?

Anyway, I’m open to various arguments on the use of UTF-8 vs. UTF-16. I
suspect that UTF-16 would be rather easier to use, but I’ve not actually
done a thorough analysis.

cjs

···

On Thu, 8 Aug 2002, Marcin ‘Qrczak’ Kowalczyk wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Curt Sampson cjs@cynic.net wrote in message news:Pine.NEB.4.44.0208081139480.17422-100000@angelic.cynic.net

Well, actually the point with UTF-16 is that you can, in general, safely
ignore the variable width stuff. I don’t think you can do that so easily
in UTF-8. If I chop off a UTF-8 sequence in the middle, are applications
that read it required to ignore that, as they are with surrogates in
UTF-16? Or is it likely that they will break, instead?

UTF-8 is designed so that you always know if you are in the
middle of a character (provided that you know you are reading UTF-8).
I.e., if you break a string of bytes in the middle of a character,
the resulting sequence of bytes will not be valid UTF-8. The mapping
from unicode code points goes like this (using hexadecimal):

Unicode code point   UTF-8 byte sequence
00…7F                (00…7F)                           (1 byte; ASCII)
80…7FF               (C2…DF) (80…BF)                   (2 bytes)
800…FFF              (E0) (A0…BF) (80…BF)              (3 bytes)
1000…FFFF            (E1…EF) (80…BF) (80…BF)           (3 bytes)
10000…3FFFF          (F0) (90…BF) (80…BF) (80…BF)      (4 bytes)
40000…FFFFF          (F1…F3) (80…BF) (80…BF) (80…BF)   (4 bytes)
100000…10FFFF        (F4) (80…8F) (80…BF) (80…BF)      (4 bytes)

Suppose for example that you truncate the character F1 87 B0 B1,
losing the last byte and getting F1 87 B0. This is a putative
3-byte character, so it belongs in the third or fourth rows of
the table above. But such a character cannot start with the
byte F1…
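
In modern Ruby (which postdates this thread), that check is a single
method call; the byte strings below are exactly the example above:

    "\xF1\x87\xB0\xB1".valid_encoding?  # => true  (U+47C31, 4 bytes)
    "\xF1\x87\xB0".valid_encoding?      # => false (truncated mid-character)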

Unicode is no longer something that can be squeezed into two
bytes, even for practical purposes. There are over 40,000 CJK
characters outside the “BMP” which require surrogates in UTF-16.
Mathematical alphanumeric symbols and musical symbols also live
outside the BMP. A lot of growth is still necessary if unicode
is to fulfill its mission. For example, the scandalous situation
where many Chinese and Japanese cannot write their names in unicode
will have to be fixed eventually, and this will be done outside
the BMP. More technical notation (such as Fregean notation in
logic, for which I personally feel a need) will have to be introduced,
and it won’t be in the BMP. Certain mistakes in unicode, such as
the bungled treatment of IPA, will have to be fixed, and they will
be fixed outside the BMP. It is clear that some of the “unification”
that has occurred was driven mainly by an unrealistic desire to
cram all the world’s characters into two bytes. The misguided
unifications will certainly be rectified outside the BMP.

But UTF-16 was a mistake from the beginning. It is no longer fixed-
width, and it is sure to grow much less fixed-width in practice, so
it lacks that merit. Yet it is just long enough to introduce an
endianness nightmare. The UTF-16 folks try to fix this with a kluge,
the byte-order mark, but the kluge is an abomination. It is non-local,
and hence screws string processing. It breaks unix’s critical shebang
hack. No wonder Microsoft loves it! It disrupts life on unix and life
on big-endian machines.

All things considered, the unicode people have done a wonderful job.
But the job isn’t done yet, and maybe unicode will never be right for
everybody, so I think Ruby should support other character sets as well,
including some which are not compatible with unicode.

                             Regards, Bret

http://www.rexx.com/~oinkoink
oinkoink at rexx dot DON’T_SPAM_ME_PLEASE com

“Bret Jolly” oinkoink+unet@rexx.com wrote in message
news:7e7131a1.0208091559.7f59a71c@posting.google.com

Curt Sampson cjs@cynic.net wrote in message
news:Pine.NEB.4.44.0208081139480.17422-100000@angelic.cynic.net

Well, actually the point with UTF-16 is that you can, in general, safely
ignore the variable width stuff. I don’t think you can do that so easily
in UTF-8. If I chop off a UTF-8 sequence in the middle, are applications
that read it required to ignore that, as they are with surrogates in
UTF-16? Or is it likely that they will break, instead?

UTF-8 is designed so that you always know if you are in the
middle of a character (provided that you know you are reading UTF-8).

I have nothing against UTF-8, but this isn’t really a strong point.
When would you ever break a UTF-8 string in this way, except to recover
from a disk crash or a broken transmission?

You can’t look up the 10th character and say, oops, I’m in the middle
of a character, so I must backtrack, because where would you have
gotten that index from in the first place?

UTF-8 is only useful for forward or backward scanning applications (I
might just have answered my own question there). But such applications
are also very common, as with regular expressions: an 8-bit regular
expression works well in many UTF-8 scenarios. And since input from the
network or disk can be expected to be UTF-8, this is a strong point for
UTF-8.
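
For instance, a byte-oriented split on an ASCII delimiter is safe on
UTF-8 data, because bytes 00…7F never occur inside a multibyte sequence
(a minimal sketch in modern Ruby):

    utf8 = "日本語,テスト,abc"
    utf8.b.split(",").size  # => 3; the comma byte 0x2C can never be
                            #    a continuation byte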

It is far less common to index strings by character count. But when you
do, UCS-4 is better. A typical application is text formatting, where
you want the nearest line break to a given width of, say, 80
characters. UCS-2 is probably not good enough, because a combining
sequence (or a surrogate pair, or whatever) takes several code units
yet only one display unit. But then you probably need to take
proportional spacing into account anyway.

One reason for character-indexed lookup is to store a position in a
string as an integer once you have scanned to that position. But you
can represent the position by other means (objects internally pointing
at the location). Therefore, what matters is good iterators and
position objects; it doesn’t really matter what the internal
representation is.
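
A minimal sketch of that idea (the class and method names are made up,
and the byte-level calls are modern Ruby):

    # An opaque position object over a UTF-8 string: callers iterate
    # and save positions, but never compute character indices.
    class Cursor
      attr_reader :pos                  # a byte offset, opaque to callers

      def initialize(str, pos = 0)
        @str, @pos = str, pos
      end

      def next_char
        return nil if @pos >= @str.bytesize
        b   = @str.getbyte(@pos)        # the lead byte gives the length
        len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4
        ch  = @str.byteslice(@pos, len)
        @pos += len
        ch
      end
    end

    c = Cursor.new("żółw")
    c.next_char   # => "ż"
    mark = c.pos  # a saved position, with no character counting involved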

I would certainly like to be able to handle UTF-8 without spending time
on needless conversions. Yet I believe a fixed-width format is also a
requirement.
I therefore believe a UTF-8 string and a fixed-width string are both
relevant. The fixed-width format can store whatever coding it likes (as
matz plans to); I see no reason to limit it to Unicode. For all I care,
someone could be storing Martian genetic sequences or file block
allocation bitmaps in it.
I just need to be able to query its encoding at all times, and to be
sure that I don’t accidentally mix my Korean homework assignment with
my disk defragmentation data.

It does, however, appear that Unix has chosen UCS-4 where Microsoft
started out with UCS-2. As such, it may be more future-proof to choose
the bloated 32-bit version, which would also solve my 80-character
width formatting problem for the foreseeable future.

So I propose UCS-4 with the encoding stored (this is also important for
serialization), and UTF-8 for the more daily household string handling.
If we really want to beef things up, we should also have a UCS-2 string
in a BSTR-compatible format, as this is the format you use to
communicate with the Windows API and with COM objects. A BSTR stores a
32-bit length immediately before the actual 16-bit wide characters,
terminated by a null character to be compatible with C’s wchar_t
strings. The format is not good for editing, but it can be passed
directly as an argument to C and Windows APIs.
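
A sketch of that layout (the helper name is made up; String#encode and
Array#pack are modern Ruby):

    # BSTR: a 32-bit byte length (excluding the terminator), then the
    # UTF-16LE code units, then a two-byte NUL for wchar_t compatibility.
    def to_bstr(str)
      w = str.encode("UTF-16LE")
      [w.bytesize].pack("V") + w.b + "\0\0"
    end

    to_bstr("hi").unpack1("H*")  # => "04000000680069000000"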

I hope this point of view is from a more practical perspective than
arguing over which format does the most damage to which minorities.
There are three widely deployed real-life encodings in APIs now.

Mikkel

Unicode is no longer something that can be squeezed into two
bytes, even for practical purposes. There are over 40,000 CJK
characters outside the “BMP” which require surrogates in UTF-16.

I agree. What we may not agree on is that surrogates work very well
in UTF-16, and require minimal or no processing in many cases.

For example, the scandalous situation where many Chinese and
Japanese cannot write their names in unicode will have to be fixed
eventually…

How many people is this, really? I note that Japanese people have been
putting up with this “scandalous” situation for years now, and will
continue to do so for a long time, as the Shift_JIS and EUC-JP
encodings of JIS X 0208 and JIS X 0212 are showing no signs of
declining in use, and they are both fully present in the Unicode BMP.

But UTF-16 was a mistake from the beginning. It is no longer fixed-
width, and it is sure to grow much less fixed-width in practice…

UTF-32 is not fixed width, either, and never has been. Nothing can
be fixed width in Unicode due to combining characters.

The only “extra” problem that UTF-16 presents over UTF-32 is dealing
with surrogates, and that is a very easy problem to deal with.
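
To make “very easy” concrete, here is the whole of surrogate decoding,
sketched over an array of UTF-16 code units (the function name is made
up):

    # Decode one code point starting at index i; returns the code
    # point and the number of units consumed.
    def decode_utf16(units, i)
      u = units[i]
      if u.between?(0xD800, 0xDBFF)   # high surrogate: use the next unit too
        [0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00), 2]
      else
        [u, 1]                        # a BMP character: one unit
      end
    end

    decode_utf16([0xD834, 0xDD1E], 0) # => [0x1D11E, 2], MUSICAL SYMBOL G CLEF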

Yet it is just long enough to introduce an
endianness nightmare. The UTF-16 folks try to fix this with a kluge,
the byte-order mark, but the kluge is an abomination. It is non-local,
and hence screws string processing. It breaks unix’s critical shebang
hack.

Actually, that would not be hard to fix. It’s pretty trivial to
modify the kernel to deal with that.

…and maybe unicode will never be right for
everybody, so I think Ruby should support other character sets as well,
including some which are not compatible with unicode.

I certainly agree with that!

cjs

···

On Sat, 10 Aug 2002, Bret Jolly wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Well, in a lot of cases it’s no big deal, because you just want to
limit the length of a string. For example, I may want to truncate
a display field to twenty characters, so it doesn’t overflow. With
UTF-16, I can safely just truncate. If I break a surrogate, no
problem; it doesn’t display. If I break a combining character, it’s
a bit more of a problem (because only part of it displays), but
nothing most people can’t live with.

This is one of the big advantages of UTF-16 over UTF-8; you can do
simple operations the simple way and still produce valid UTF-16 output.
(There’s no explicit rule, as far as I know at least, that states that
UTF-8 parsers must ignore broken characters, as there is with UTF-16.)
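
And when a half-displayed character does matter, the surrogate-aware
truncation costs one extra line (a sketch over an array of UTF-16 code
units; the helper name is made up):

    # Keep at most n code units, never splitting a surrogate pair.
    def truncate_utf16(units, n)
      n -= 1 if n.between?(1, units.size - 1) &&
                units[n - 1].between?(0xD800, 0xDBFF)
      units.first(n)
    end

    truncate_utf16([0x41, 0xD834, 0xDD1E], 2)  # => [0x41], pair intact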

cjs

···

On Sat, 10 Aug 2002, MikkelFJ wrote:

It is far less common to index strings by character count. But when you
do, UCS-4 is better. A typical application is text formatting, where
you want the nearest line break to a given width of, say, 80
characters. UCS-2 is probably not good enough, because a combining
sequence (or a surrogate pair, or whatever) takes several code units
yet only one display unit. But then you probably need to take
proportional spacing into account anyway.


Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Hi,

How many people is this, really? I note that Japanese people have been
putting up with this “scandalous” situation for years now, and will
continue to do so for a long time, as the Shift_JIS and EUC-JP
encodings of JIS X 0208 and JIS X 0212 are showing no signs of
declining in use, and they are both fully present in the Unicode BMP.

They are both fully present in the Unicode BMP, but the mapping is not
completely defined, and cannot possibly be defined for these encodings
without problems.

I’m not sure why you don’t mention the “conversion hell” between legacy
encodings and Unicode. I’m sure you’ve noticed the “yen sign problem”
and such.
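
(To make the yen sign problem concrete: Shift_JIS byte 0x5C is drawn as
a yen sign by Japanese fonts, but conversion tables disagree on whether
it means U+005C or U+00A5. Modern Ruby’s converter, for example, picks
the backslash:)

    "\x5C".force_encoding("Shift_JIS").encode("UTF-8")  # => "\\", not "¥"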

						matz.
···

In message “Re: Unicode in Ruby now?” on 02/08/12, Curt Sampson cjs@cynic.net writes:

Curt Sampson cjs@cynic.net wrote in message news:Pine.NEB.4.44.0208121238230.2317-100000@angelic.cynic.net

For example, the scandalous situation where many Chinese and
Japanese cannot write their names in unicode will have to be fixed
eventually…

How many people is this, really? I note that Japanese people have been
putting up with this “scandalous” situation for years now, and will
continue to do so for a long time, as the Shift_JIS and EUC-JP
encodings of JIS X 0208 and JIS X 0212 are showing no signs of
declining in use, and they are both fully present in the Unicode BMP.

What is in use is determined not only by what character sets and
encodings are available, but also on how much software is available
to use them, and how aware the users are of the availability of this
software. The software and the awareness are continually expanding.
If people know they can write their names correctly, they will.

This is not just an issue of vanity. (Nor is it personal: my Chinese
name can be written in the BMP. :-)) It is not just the person whose
name cannot be written who is affected, but anyone who wants to talk
about him. For example, I understand that one of the top 5 politicians
in China has a name which does not appear in unicode. This makes life
hard for journalists and political scholars writing in Chinese.

                          Regards, Bret

http://www.rexx.com/~oinkoink
oinkoink
at
rexx
dot
com

···

On Sat, 10 Aug 2002, Bret Jolly wrote:

Curt Sampson cjs@cynic.net wrote in message news:Pine.NEB.4.44.0208121234260.2317-100000@angelic.cynic.net

This is one of the big advantages of UTF-16 over UTF-8; you can do
simple operations the simple way and still produce valid UTF-16 output.
(There’s no explicit rule, as far as I know at least, that states that
UTF-8 parsers must ignore broken characters, as there is with UTF-16.)

UTF-8 parsers must ignore “broken” characters because, as I pointed
out in a previous message, “broken” characters are never valid UTF-8,
due to the UTF-8 design. The standard now only allows parsing of
valid characters (the loopholes that existed in unicode version 3.0
were eliminated by updates in versions 3.1 and 3.2). The unicode
standard expressly forbids the interpretation of illegal UTF-8
sequences.

There are also advantages to a fixed-width encoding, such as the
recently introduced UTF-32, which can often outweigh the endianness
issues. (Encodings which are not byte-grained, such as UTF-32 and
UTF-16, need two variants, big-endian and little-endian.)

UTF-16 was not thought through very well. It is an encoding
following the mental line of least resistance – encode the
character points by their numbers. There was no reason the encoding
should have included 0-bytes, thus sabotaging byte-grained string
processing by C programs. And of course it was thought that all
characters of interest to any significant community could fit in
the two-byte “Basic Multilingual Plane”. This is not attainable
even with current unicode unless you consider Chinese, Japanese,
mathematicians, and musicians to be insignificant communities.
Also, important further expansion outside the BMP is inevitable.

But UTF-16 in both big-endian and little-endian variants is sure to
be one of those technical blunders which far outlives its
excusability, due to inertia and corporate politics, so Ruby should
probably provide direct support. Failing that, Ruby could provide
indirect support via invisible translation to some other unicode
encoding.

Some people use UTF-16 as a disk storage format and expand to
UTF-32 in memory. This allows one to directly access characters
by index for unicode strings in memory, while avoiding crass
inefficiency in disk usage. But for general multilingual processing,
UTF-8 seems more efficient and handier as a disk storage format.
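
In modern Ruby terms, that scheme is roughly this (the filename is made
up, and an array of Integers stands in for UTF-32 in memory):

    text  = File.binread("doc.utf16").force_encoding("UTF-16BE")
    chars = text.codepoints  # one Integer per code point, surrogates combined
    chars[80]                # the 81st code point, by direct index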

The unicode consortium has recently promulgated yet another
encoding form, CESU-8, intended only for internal use by programs,
and not for data transfer between applications. CESU-8 is byte-
grained and similar to UTF-8, but CESU-8 has been designed so CESU-8
text will have the same binary collation as equivalent UTF-16 text.
I don’t know if there is a reason for Ruby to support this.

Though notoriously unwise myself, I'd like to make a plea for some
wisdom. Many people here have a great deal of experience with
internationalization, and rightly consider themselves experts. But
expertise comes in many flavors, and one should think twice before
making assertions about what other people need. The need for
internationalization, M17n, and so forth by a maker of corporate web
sites is different from the need of a mathematician, musician, or
someone trying to computerize Akkadian tablets. We should avoid the
parochial thought that our interests are the only important or
“practical” ones.

                Regards, Bret

http://www.rexx.com/~oinkoink/
oinkoink
at
rexx
dot
com

oinkoink+unet@rexx.com (Bret Jolly) writes:

For example, I understand that one of the top 5 politicians in China
has a name which does not appear in unicode. This makes life hard
for journalists and political scholars writing in Chinese.

Which leads to some interesting Orwellian possibilities: not only could
people be removed from history, but the character set could be arranged
so that their names cannot even be expressed.

D*ve

UTF-8 parsers must ignore “broken” characters because, as I pointed
out in a previous message, “broken” characters are never valid UTF-8,
due to the UTF-8 design. The standard now only allows parsing of
valid characters (the loopholes that existed in unicode version 3.0
were eliminated by updates in versions 3.1 and 3.2). The unicode
standard expressly forbids the interpretation of illegal UTF-8
sequences.

Ah. So does this mean that if I break a String into two in the
middle of a UTF-8 sequence both broken sequence parts will be
preserved, so that the character reappears if I put the two strings
back together again? This, to my mind, is one of the big advantages of
the UTF-16 surrogate character specification.

There are also advantages to a fixed-width encoding, such as the
recently introduced UTF-32…

I think I’ve already said this about eight million times, but:

UTF-32 is not fixed width, due to combining characters.

But UTF-16 in both big-endian and little-endian variants is sure to
be one of those technical blunders which far outlives its
excusability…

Well, we’ll just have to agree to differ on this. I deal with a lot of
Japanese text in my various programs, and at the lowest level (String
objects and suchlike) I find UTF-16 to be by far the most convenient
way of dealing with it. It’s small, efficient, lets me do basic
handling of stuff with ease, and lets me push up some of the harder
issues into just the classes that really need it, rather than having to
deal with them everywhere.

Though notoriously unwise myself, I'd like to make a plea for some
wisdom. Many people here have a great deal of experience with
internationalization, and rightly consider themselves experts. But
expertise comes in many flavors, and one should think twice before
making assertions about what other people need. The need for
internationalization, M17n, and so forth by a maker of corporate web
sites is different from the need of a mathematician, musician, or
someone trying to computerize Akkadian tablets. We should avoid the
parochial thought that our interests are the only important or
“practical” ones.

Well, I’ve said all along that Unicode just is not suitable for a
lot of very technical purposes. My argument is that it’s impossible
for a single character set to deal with everything, and even dealing
with most of it is completely impractical. Thus, use a simple
character set like Unicode and its relatively simple accompanying
algorithms for day-to-day work, and do something custom when you
have requirements beyond that.

cjs

···

On Tue, 13 Aug 2002, Bret Jolly wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC