Unicode in Ruby now?

Thu, 1 Aug 2002 19:53:07 +0900, Curt Sampson cjs@cynic.net wrote:

Not just political reasons, but practical reasons. Unicode is designed
to work if you restrict yourself to using only 16-bit chars, and I
expect most programs are going to limit themselves to that. So even if
it were folded in to the extension space, most people wouldn’t use it.

Unicode is not 16-bit. The code point range is 0..0x10FFFF.

We don’t need to do the stupid thing that Java and MS Windows do,
i.e. using UTF-16 internally, which combines the disadvantages of UTF-8
and UTF-32: it is variable-length and incompatible with ASCII.
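
A rough illustration of both complaints, assuming a Ruby with String#encode (1.9 or later, which did not exist when this thread was written). The variable names are only for the example:

```ruby
# Sketch only: assumes Ruby 1.9+ where String#encode is available.
ascii = "A"
clef  = "\u{1D11E}"   # U+1D11E, a code point above 0xFFFF

utf16_ascii = ascii.encode("UTF-16BE").bytes.to_a
utf16_clef  = clef.encode("UTF-16BE").bytes.to_a
utf8_ascii  = ascii.encode("UTF-8").bytes.to_a

p utf16_ascii   # a 0x00 byte precedes 0x41, so the bytes are not plain ASCII
p utf16_clef    # four bytes: this one character needs two 16-bit units
p utf8_ascii    # a single byte, identical to ASCII
```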

The most straightforward internal representation is UTF-32: each
character is stored in 4 bytes. All other encodings are either
variable-length or can’t represent all Unicode characters.
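
As a sketch of what fixed width buys you, again assuming Ruby 1.9.3 or later (String#encode and String#byteslice): the code point at index i sits at byte offset 4 * i, so it can be read without scanning. The sample string and names are made up for the example:

```ruby
# Sketch only: assumes Ruby 1.9.3+ (String#encode, String#byteslice).
text  = "A\u00E9\u65E5\u{1D11E}"   # four code points of very different magnitude
utf32 = text.encode("UTF-32BE")

total_bytes = utf32.bytesize       # 4 bytes per code point

# Constant-time access to the code point at index i.
i  = 2
cp = utf32.byteslice(4 * i, 4).unpack("N").first

p total_bytes        # 16
p cp.to_s(16)        # "65e5"
```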

If you need compactness or ASCII compatibility, use UTF-8. Most
characters (i.e. below U+FFFF) are encoded with 1, 2 or 3 bytes.
Good for data transmission and already widely known.
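
The byte counts can be checked with a short sketch, assuming Ruby 1.9 or later. The code points below were picked as boundary cases and are not from the original message:

```ruby
# Sketch only: Array#pack("U") emits the UTF-8 bytes for a code point.
boundaries = [0x41, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
sizes = boundaries.map { |cp| [cp].pack("U").bytesize }

p sizes   # [1, 2, 2, 3, 3, 4, 4]

# ASCII compatibility: pure ASCII text has the same bytes in both encodings.
same_bytes = "hello".encode("UTF-8").bytes.to_a ==
             "hello".encode("US-ASCII").bytes.to_a
p same_bytes   # true
```

Everything up to U+FFFF fits in at most 3 bytes; only code points from U+10000 upward need 4.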

Anything else should be necessary only at the border with the outside
world.



__("< Marcin Kowalczyk
__/ qrczak@knm.org.pl
^^ Blog człowieka poczciwego.

The most straightforward internal representation is UTF-32: each
character is stored in 4 bytes. All other encodings are either
variable-length or can’t represent all Unicode characters.

UTF-8, UTF-16 and UTF-32 are all able to represent all Unicode
characters, and are all variable length in one sense or another.
(UTF-32 still has combining characters.)
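
To make the combining-character point concrete, here is a small sketch assuming Ruby 1.9 or later. The two strings both display as an e with an acute accent, but they differ in code point count even in UTF-32:

```ruby
# Sketch only: assumes Ruby 1.9+ for \u escapes and String#encode.
precomposed = "\u00E9"     # single code point U+00E9
decomposed  = "e\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

p precomposed.length                          # 1
p decomposed.length                           # 2
p precomposed.encode("UTF-32BE").bytesize     # 4
p decomposed.encode("UTF-32BE").bytesize      # 8
p precomposed == decomposed                   # false
```

So "4 bytes per code point" is fixed, but "bytes per displayed character" is not.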

If you need compactness or ASCII compatibility, use UTF-8.

UTF-8 is much less compact than UTF-16 for Asian text.
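
A quick size comparison, sketched under the assumption of Ruby 1.9 or later. The sample strings are my own: three CJK ideographs (the word "nihongo") and a four-letter ASCII word.

```ruby
# Sketch only: compares encoded sizes of a CJK string and an ASCII string.
cjk   = "\u65E5\u672C\u8A9E"   # three code points in the CJK ideograph block
latin = "Ruby"

cjk_utf8    = cjk.encode("UTF-8").bytesize
cjk_utf16   = cjk.encode("UTF-16BE").bytesize
latin_utf8  = latin.encode("UTF-8").bytesize
latin_utf16 = latin.encode("UTF-16BE").bytesize

p [cjk_utf8, cjk_utf16]       # [9, 6]
p [latin_utf8, latin_utf16]   # [4, 8]
```

The ratio flips depending on the script: 3 bytes against 2 for the ideographs, 1 against 2 for ASCII.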

Most characters (i.e. below U+FFFF) are encoded with 1, 2 or 3 bytes.

As opposed to UTF-16, where you can change that statement to “2 bytes”.

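And in UTF-16, surrogate pairs are encoded with 4 bytes, whereas
they take 6 bytes in UTF-8 if each surrogate is encoded separately
(standard UTF-8 encodes the code point directly in 4 bytes).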
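
The arithmetic behind those byte counts can be sketched as follows, assuming Ruby 1.9 or later. The lambda three_byte is a hypothetical helper for this example: it applies the 3-byte UTF-8 bit pattern to one 16-bit value, which is how the 6-byte figure arises when each surrogate is encoded on its own. Encoding the code point directly, as standard UTF-8 does, gives 4 bytes.

```ruby
# Sketch only: splits U+1D11E into a surrogate pair and counts bytes.
cp     = 0x1D11E
offset = cp - 0x10000
high   = 0xD800 + (offset >> 10)     # high (leading) surrogate
low    = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate

utf16_bytes = [high, low].pack("n*").bytesize

# Hypothetical helper: the 3-byte UTF-8 bit pattern for one 16-bit value.
three_byte = lambda do |u|
  [0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)]
end
per_surrogate = three_byte.call(high) + three_byte.call(low)

direct_utf8 = [cp].pack("U").bytes.to_a

p [high.to_s(16), low.to_s(16)]   # ["d834", "dd1e"]
p utf16_bytes                     # 4
p per_surrogate.size              # 6
p direct_utf8.size                # 4
```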

cjs


On Wed, 7 Aug 2002, Marcin ‘Qrczak’ Kowalczyk wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC