>>> so, you guys are telling me a language developed since the year 2000
>>> doesn't support unicode strings natively? in my opinion, that's a
>>> pretty glaring problem.
>> Please note that Ruby itself is ten years old. Unicode has only
>> *recently* (the last three or four years, with the release of Windows
>> XP) become a major factor, especially in Japan. Unix support for
>> Unicode is still in the stone ages because of the nonsense that POSIX
>> put on Unix ages ago. (When Unix filesystems can write UTF-16 as
>> their native filename format, then we're going to be much better.
>> That will, however, break some assumptions by really stupid
>> programs.)
> Why the hell utf-16? It is no longer compatible with ascii, yet 16
> bits are far from sufficient to cover current unicode. So you still
> get multiword characters. It is not even dword aligned for fast
> processing by current cpus. I would like utf-8 for compatibility, and
> utf-32 for easy string processing. But I do not see much use for
> utf-16.
UTF-16 is actually pretty performant and the implementation of wchar_t
on MacOS X and Windows is (you guessed it!) UTF-16. The filesystems for
these two operating systems (which have *far* better Unicode
support than anything else) both use UTF-16 as the native filename
encoding (this is true for HFS+, NTFS4, and NTFS5). The only difference
between what MacOS X does and Windows does for this is that Apple chose
to use decomposed characters instead of composed characters (e.g.,
LOWERCASE E + COMBINING ACUTE ACCENT instead of LOWERCASE E ACUTE
ACCENT).
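The composed versus decomposed distinction corresponds to what Unicode calls the NFC and NFD normalization forms. A minimal Python sketch (not from the thread; the variable names are illustrative) shows that the two spellings of the same accented letter are different code point sequences until they are normalized:

```python
import unicodedata

composed = "\u00e9"      # single code point: e with acute accent
decomposed = "e\u0301"   # lowercase e followed by COMBINING ACUTE ACCENT

# The two strings look identical on screen but do not compare equal.
raw_equal = composed == decomposed

# Normalizing converts one spelling into the other.
to_decomposed = unicodedata.normalize("NFD", composed)
to_composed = unicodedata.normalize("NFC", decomposed)

print(raw_equal, len(composed), len(decomposed))
print(to_decomposed == decomposed, to_composed == composed)
```

A program comparing filenames across the two systems would presumably need to normalize both sides first.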
Look at the performance numbers for ICU4C: it's pretty damn good. UTF-32
isn't exactly space conservative (since with UTF-16 the whole BMP
can be represented with a single 16-bit code unit, and only characters
outside it need surrogates taking up exactly *two* code units, whereas
*all* characters take up four bytes, one uint32_t, under UTF-32). ICU4C
uses UTF-16 internally. Exclusively.
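The space argument can be checked with a small Python sketch. The sample characters and the helper name `byte_len` are assumptions chosen for illustration; the `-le` codecs are used so that no byte-order mark is counted.

```python
def byte_len(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

bmp_char = "\u65e5"         # a character inside the BMP
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

utf16_bmp = byte_len(bmp_char, "utf-16-le")        # one 16-bit code unit
utf16_astral = byte_len(astral_char, "utf-16-le")  # surrogate pair, two units
utf32_bmp = byte_len(bmp_char, "utf-32-le")        # always one 32-bit unit
utf32_astral = byte_len(astral_char, "utf-32-le")

print(utf16_bmp, utf16_astral, utf32_bmp, utf32_astral)  # 2 4 4 4
```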
I do not care what Windows, OS X, or ICU uses. I care what I want to
use. Even if most characters are encoded with a single word, you have to
cope with multiword characters. That means that a character is not a
simple type. You cannot have character arrays. And no library can
completely wrap this inconsistency and isolate you from dealing with
it.
Even if the library handles multiword characters efficiently, it is
complex. That makes it more prone to errors, both in itself and in the
software that interfaces with it.
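The "no character arrays" point can be made concrete. In this hedged Python sketch (the helpers `utf16_units` and `count_code_points` are hypothetical names, not part of any library), a string is turned into an array of 16-bit code units, and the array length no longer matches the number of characters once a surrogate pair is present:

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def count_code_points(units):
    """Count characters by skipping the second half of each surrogate pair."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

text = "a\U0001D11Eb"  # 'a', MUSICAL SYMBOL G CLEF, 'b'
units = utf16_units(text)

print([hex(u) for u in units])               # ['0x61', '0xd834', '0xdd1e', '0x62']
print(len(units), count_code_points(units))  # 4 3
```

Indexing `units[1]` yields half of a character, which is the special case every piece of code touching the array has to remember.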
You say that utf-16 is more space-conserving for languages like
Japanese. Nice. But I do not care. I guess text consumes a very small
portion of memory on my system, both RAM and hard drive. I do not care
if that doubles or quadruples. In the very few cases when I want to
save space (i.e., when sending email attachments) I can use gzip. It can
even compress repetitive text, which no encoding can.
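A rough Python sketch of the compression argument follows. It uses `zlib`, which implements the DEFLATE algorithm that gzip is built on, and a deliberately repetitive sample string, so it overstates the savings real text would see; the sample text and the `sizes` dictionary are assumptions for illustration only.

```python
import zlib

# Highly repetitive sample text; real documents compress less well.
text = "日本語のテキスト。" * 200

sizes = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    raw = text.encode(encoding)
    sizes[encoding] = (len(raw), len(zlib.compress(raw)))

for encoding, (raw_len, packed_len) in sizes.items():
    print(encoding, raw_len, packed_len)
```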
> Austin Ziegler wrote:
>> Unix support for Unicode is still in the stone ages because of the
>> nonsense that POSIX put on Unix ages ago. (When Unix filesystems can
>> write UTF-16 as their native filename format, then we're going to be
>> much better. That will, however, break some assumptions by really
>> stupid programs.)
> Ummm, no. UTF-16 filenames would break *every* correctly-implemented
> UNIX program: UTF-16 allows the octet 0x00, which has always been the
> end-of-string marker.
You're right. And I'm saying that I don't care. People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I'll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems. One could do
what I *think* that Apple has done and provide two filesystem
interfaces that are synchronized. The native interface -- and the more
efficient one -- will be using UTF-16 because that's what HFS+ speaks.
The secondary interface (that also works on UFS filesystems) would
translate to UTF-8 and/or follow the nonsensical POSIX rules for native
encodings.
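The 0x00 objection quoted above can be seen in a short Python sketch. The helper `c_string_view` is a hypothetical stand-in for any C routine that treats the first zero byte as the end of a string; it is not a real API.

```python
def c_string_view(raw):
    """Return what a NUL-terminated string routine would see of raw."""
    return raw.split(b"\x00", 1)[0]

name = "file.txt"
as_utf8 = name.encode("utf-8")
as_utf16 = name.encode("utf-16-le")

print(b"\x00" in as_utf8, b"\x00" in as_utf16)  # False True
print(c_string_view(as_utf8))                    # b'file.txt'
print(c_string_view(as_utf16))                   # b'f'
```

UTF-8 never produces a zero byte except for the NUL character itself, which is why byte-oriented programs keep working with it.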
> Personally, my file names have been in UTF-8 for quite some time now,
> and it works well: What exactly is this 'stone age' you refer to?
Change an environment variable and watch the programs that had
worked so well with Unicode break. *That* is the stone age that I refer to.
I'm also guessing that you don't do much with long Japanese filenames or
deep paths that involve *anything* except US-ASCII (a subset of UTF-8).
Hmm, so you call the possibility of choosing your encoding living in
the stone age. I would call it living in reality. There are various
encodings out there.
> UTF-8 can take multiple octets to represent a character. So can UTF-16,
> UTF-32, and every other variation of Unicode.
This last statement is true only because you use the term "octet." It's
a useless term here, because UTF-8 only has any level of efficiency for
US-ASCII. Even if you step to European content, UTF-8 is no longer
perfectly efficient, and when you step to Asian content, UTF-8 is so
bloody inefficient that most folks who have to deal with it would rather
work in a native encoding (EUC-JP or SJIS, anyone?) which is 1..2 bytes
or do everything in UTF-16.
No, I suspect the reason for using EUC-JP, SJIS, ISO-8859-*, and
other weird encodings is historical.
What do you mean by efficiency? If you want space efficiency use
compression. If you want speed, use utf-32 or a similar encoding that
does not have to deal with special cases.
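To put numbers on the space-efficiency claims in this exchange, here is a small Python sketch that counts bytes for three sample strings. The samples and the `sizes` table are assumptions for illustration; Python's bundled `euc_jp` and `shift_jis` codecs stand in for the native Japanese encodings mentioned.

```python
samples = {
    "ascii": "hello",
    "european": "h\u00e9llo",            # one accented letter
    "japanese": "\u65e5\u672c\u8a9e",    # three kanji
}

sizes = {
    label: {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    for label, text in samples.items()
}

# The legacy Japanese encodings can only be applied to the Japanese sample.
sizes["japanese"]["euc_jp"] = len(samples["japanese"].encode("euc_jp"))
sizes["japanese"]["shift_jis"] = len(samples["japanese"].encode("shift_jis"))

for label, row in sizes.items():
    print(label, row)
```

For these samples UTF-8 is smallest for ASCII and European text, while UTF-16 and the native encodings tie for the Japanese one.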
> Depending on content, a string in UTF-8 can consume more octets than
> the same string in UTF-16, or vice versa.
>
> Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
> get to have the fun of picking between big- and little-endian!
Are people always this stupid when it comes to things that they clearly
don't understand? Yes, UTF-16 may have the problem of not knowing if
you're dealing with UTF-16BE or UTF-16LE, but it's my understanding that
this is *only* an issue when you're dealing with both on the same
system. Additionally, most platforms specify a default. It's been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.
IIRC there are even byte-order marks. If you insert one in every
string, you can identify its byte order at any time without doubt.
But do not trust me on that. I do not know anything about Unicode, and
I want to sidestep the issue by using an encoding that is easy to work
with, even for the ignorant.
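For what it is worth, the byte-order mark idea can be sketched in Python with the standard `codecs` module. The variable names are illustrative, and this only shows the mechanism, not what ICU4C or any particular platform does by default.

```python
import codecs

text = "hi"

# Prefix each byte sequence with the byte-order mark for its byte order.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic "utf-16" decoder reads the mark, picks the order, and drops it.
from_be = big_endian.decode("utf-16")
from_le = little_endian.decode("utf-16")

# Without a mark, guessing the wrong order silently yields other characters.
misread = text.encode("utf-16-be").decode("utf-16-le")

print(from_be, from_le, misread == text)
```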
Thanks
Michal
···
On 3/11/06, Austin Ziegler <halostatue@gmail.com> wrote:
On 3/10/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> On 3/10/06, Austin Ziegler <halostatue@gmail.com> wrote:
>> On 3/8/06, Richard Gyger <richard@bytethink.com> wrote:
On 3/10/06, Anthony DeRobertis <aderobertis@metrics.net> wrote: