Unicode in ruby

Richard_Gyger · 8 March 2006 11:46

i'm using IO.foreach to parse the lines in a file. now i'm trying to get it to work with unicode encoded files. does ruby support unicode? how do i compare a variable with a unicode constant string?

the script goes something like:

IO.foreach("myfile.txt") { |line|
if line.downcase[0,2] == "id"

Michal_hramrach_Such · 8 March 2006 12:44

To get unicode downcase you probably want icu4r. To handle the cases
you are interested in you could write your own. However, the []
operator of ruby strings returns bytes, not characters.
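
A minimal sketch of the byte/character distinction, assuming Ruby 1.9 or later, where `String#[]` indexes characters and `byteslice` gives the byte-oriented view that `[]` had in the 1.8 releases current at the time of this thread. The sample word is made up for illustration.

```ruby
word = "\u00C9dition"            # "Édition"; the \u escape yields a UTF-8 string

by_bytes = word.byteslice(0, 2)  # first two bytes: just "É", which is 2 bytes in UTF-8
by_chars = word[0, 2]            # first two characters: "Éd"

puts word.bytesize               # number of bytes
puts word.length                 # number of characters
puts by_bytes
puts by_chars
```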

hth

Michal

On 3/8/06, Richard Gyger <richard@bytethink.com> wrote:

i'm using IO.foreach to parse the lines in a file. now i'm trying to get
it to work with unicode encoded files. does ruby support unicode? how do
i compare a variable with a unicode constant string?

the script goes something like:

IO.foreach("myfile.txt") { |line|
if line.downcase[0,2] == "id"

--
Support the freedom of music!
Maybe it's a weird genre .. but weird is *not* illegal.
Maybe next time they will send a special forces commando
to your picnic .. because they think you are weird.
www.music-versus-guns.org http://en.policejnistat.cz

Pere_Noel1 · 8 March 2006 12:53

you don't make use of "\n" at uni-berlin.de when wrapping ?

could be more readable

Michal Suchanek <hramrach@centrum.cz> wrote:

On 3/8/06, Richard Gyger <richard@bytethink.com> wrote:

i'm using IO.foreach [.. no \n ]

--
une bévue

Richard_Gyger · 8 March 2006 18:13

so, you guys are telling me a language developed since the year 2000 doesn't support unicode strings natively? in my opinion, that's a pretty glaring problem.

Une bévue wrote:

Michal Suchanek <hramrach@centrum.cz> wrote:

On 3/8/06, Richard Gyger <richard@bytethink.com> wrote:

i'm using IO.foreach [.. no \n ]

you don't make use of "\n" at uni-berlin.de when wrapping ?

could be more readable

Logan_Capaldo · 8 March 2006 18:21

Ruby doesn't really support any strings natively. It just happens to have a bytevector class that acts a lot like a string. Having said that, have you tried:
$KCODE="u" # Assumes the source file is encoded as UTF8, affects literal strings, regexps, etc.

If your source file is UTF16 or some other non-UTF8 encoding you'll have to use iconv to get into UTF8 to compare with the literals in your source.
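
For readers on later versions: iconv was dropped from the standard library in Ruby 2.0, and from 1.9 onward the same conversion can be sketched with `String#encode`. The sample line and the in-memory stand-in for file contents below are assumptions made for illustration, not part of the original script.

```ruby
# Stand-in for the raw bytes of one line read from a UTF-16LE file.
raw = "ID: caf\u00E9\n".encode("UTF-16LE").b

# Label the bytes with their real encoding, then transcode to UTF-8
# so they can be compared with UTF-8 literals in the source.
line = raw.dup.force_encoding("UTF-16LE").encode("UTF-8")

matches = line[0, 2].downcase == "id"
puts matches
```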

On Mar 8, 2006, at 1:13 PM, Richard Gyger wrote:

so, you guys are telling me a language developed since the year 2000 doesn't support unicode strings natively? in my opinion, that's a pretty glaring problem.

Michal_hramrach_Such · 8 March 2006 18:24

For me it is a problem as well. But getting unicode right is hard.
Look at the size of the icu library and the size of ruby itself.
Anyway, unicode regexps are planned for ruby 2.0 iirc.

Thanks

Michal

On 3/8/06, Richard Gyger <richard@bytethink.com> wrote:

so, you guys are telling me a language developed since the year 2000
doesn't support unicode strings natively? in my opinion, that's a pretty
glaring problem.

Austin_Ziegler5 · 10 March 2006 04:34

Please note that Ruby itself is ten years old. Unicode has only
*recently* (the last three or four years, with the release of Windows
XP) become a major factor, especially in Japan. Unix support for Unicode
is still in the stone ages because of the nonsense that POSIX put on
Unix ages ago. (When Unix filesystems can write UTF-16 as their native
filename format, then we're going to be much better. That will, however,
break some assumptions by really stupid programs.)

I've been following what Matz has had to say and have recently done
quite a bit of work with Unicode. The reality is that Unicode is hard,
and there are cultural and other reasons for Ruby *not* to have Unicode
(UTF-16 or UTF-8) strings by default. I think that Matz's plan for M17N
strings is far superior to assuming Unicode by default.

Basically, Ruby will have the capabilities to work with UTF-8, UTF-16,
and probably the ISO-8859-* encodings natively, as well as the existing
SJIS and EUC-JP support. I wouldn't be surprised if it also includes
other EUC-* encodings. Essentially, you'll be able to do:

  s = "école"
  s.encoding # -> :raw (or something like that)
  s.encoding = :iso8859_1 # "école"
  s.encoding = :utf8 # "Ã©cole"
  s.capitalize! # "Ã‰cole"
  s.encoding = :iso8859_1 # "École"

More than that, using the same string:

  s[0] # "É"
  s.encoding = :utf8
  s[0] # "Ã‰"

I've shown everything as a byte string here. The point is, though, that
going from the raw encoding -- which may be the default, or the default
may be able to be set -- shouldn't cause any byte conversions. I suspect
that Matz will have a different way to get at the underlying bytes, but
that's what will be happening for Ruby 2.0.
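
For comparison, a sketch of how per-string encodings behave in Ruby 1.9 and later, which differs in detail from the anticipated API above: `force_encoding` relabels the same bytes, while `encode` converts them. The variable names are illustrative only.

```ruby
s = "\u00E9cole"                                  # "école" in UTF-8: 5 characters, 6 bytes

relabeled  = s.dup.force_encoding("ISO-8859-1")   # same bytes, read as Latin-1
transcoded = s.encode("ISO-8859-1")               # converted: é becomes the single byte 0xE9

puts s.encoding                  # UTF-8
puts relabeled.bytes == s.bytes  # true: no bytes changed
puts relabeled.length            # 6: each byte is now one Latin-1 character
puts transcoded.bytesize         # 5
puts relabeled.encode("UTF-8")   # Ã©cole
```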

The last indication I had seen suggested that M17N strings were closer,
but not yet done. I'm looking forward to them.

-austin

On 3/8/06, Richard Gyger <richard@bytethink.com> wrote:

so, you guys are telling me a language developed since the year 2000
doesn't support unicode strings natively? in my opinion, that's a
pretty glaring problem.

--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca

Michal_hramrach_Such · 8 March 2006 18:31

err, no that is not what people want when they speak about downcase in unicode.
Sure, you can write a string encoded in utf-8 in your source, and
verify it is byte-identical to another string. That is about all you
get this way.
I suspect regexps won't work right with multibyte characters; for
downcase or case-insensitive regexps you would even need to know the
language.
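
The point about needing to know the language can be seen in later Ruby versions. A sketch, assuming Ruby 2.4 or later, where `String#downcase` is Unicode-aware and accepts a `:turkic` option for language-specific rules:

```ruby
puts "\u00C9COLE".downcase    # "école": non-ASCII letters are mapped too
puts "I".downcase             # "i" under the default rules
puts "I".downcase(:turkic)    # "ı" (dotless i, U+0131) under Turkic rules
```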

Thanks

Michal

On 3/8/06, Logan Capaldo <logancapaldo@gmail.com> wrote:

On Mar 8, 2006, at 1:13 PM, Richard Gyger wrote:

> so, you guys are telling me a language developed since the year
> 2000 doesn't support unicode strings natively? in my opinion,
> that's a pretty glaring problem.
>

Ruby doesn't really support any strings natively. It just happens to
have a bytevector class that acts a lot like a string. Having said
that, have you tried:
$KCODE="u" # Assumes the source file is encoded as UTF8, affects literal strings, regexps, etc.

If your source file is UTF16 or some other non-UTF8 encoding you'll
have to use iconv to get into UTF8 to compare with the literals in
your source.

Daniel_Harple · 8 March 2006 18:53

Unicode strings are also planned for Ruby 2 (possibly implemented already?).

-- Daniel

On Mar 8, 2006, at 7:24 PM, Michal Suchanek wrote:

Anyway, unicode regexps are planned for ruby 2.0 iirc.

Eric_Jacoboni2 · 8 March 2006 19:03

Logan Capaldo <logancapaldo@gmail.com> writes:

Ruby doesn't really support any strings natively. It just happens to
have a bytevector class that acts a lot like a string

.... that acts a lot like a string /of ASCII chars/, actually. Rather
anachronistic, imho.

I can't consider that "il était une fois".length == 18 is the way it
should be with a string in a modern language.
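
On Ruby 1.9 and later the two counts are separate methods, whereas 1.8's `length` counted bytes. A small sketch, assuming the phrase is stored as UTF-8:

```ruby
phrase = "il \u00E9tait une fois"   # "il était une fois"

puts phrase.length     # 17 characters
puts phrase.bytesize   # 18 bytes, because é takes two bytes in UTF-8
```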

Of course, tweaking with -K and jcode and/or other third-party
modules and/or various hacks allows some enhancements (we have a
jlength method that seems to work), but that's nothing to write home
about, either (case methods support only ASCII chars, etc.)

Waiting for a plain support in Rite (much more important to me than
the "end" issues...).

--
Eric Jacoboni, born 1445284322 seconds ago

Michal_hramrach_Such · 10 March 2006 13:07

Why the hell utf-16? It is no longer compatible with ascii, yet 16
bits are far from sufficient to cover current unicode. So you still
get multiword characters. It is not even dword aligned for fast
processing by current cpus.
I would like utf-8 for compatibility, and utf-32 for easy string
processing. But I do not see much use for utf-16.
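
The "multiword characters" mentioned here are UTF-16 surrogate pairs. A sketch that shows one, assuming Ruby 1.9 or later; the choice of sample character is arbitrary:

```ruby
clef = "\u{1D11E}"   # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

utf16_units = clef.encode("UTF-16BE").unpack("n*")   # 16-bit code units
utf32_units = clef.encode("UTF-32BE").unpack("N*")   # 32-bit code units

puts utf16_units.map { |u| format("%04X", u) }.join(" ")   # D834 DD1E
puts utf32_units.map { |u| format("%08X", u) }.join(" ")   # 0001D11E
puts clef.bytesize                                          # 4 bytes in UTF-8
```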

Thanks

Michal

On 3/10/06, Austin Ziegler <halostatue@gmail.com> wrote:

On 3/8/06, Richard Gyger <richard@bytethink.com> wrote:
> so, you guys are telling me a language developed since the year 2000
> doesn't support unicode strings natively? in my opinion, that's a
> pretty glaring problem.

Please note that Ruby itself is ten years old. Unicode has only
*recently* (the last three or four years, with the release of Windows
XP) become a major factor, especially in Japan. Unix support for Unicode
is still in the stone ages because of the nonsense that POSIX put on
Unix ages ago. (When Unix filesystems can write UTF-16 as their native
filename format, then we're going to be much better. That will, however,
break some assumptions by really stupid programs.)

Anthony_DeRobertis · 10 March 2006 20:57

Austin Ziegler wrote:

Unix support for
Unicode is still in the stone ages because of the nonsense that POSIX
put on Unix ages ago. (When Unix filesystems can write UTF-16 as their
native filename format, then we're going to be much better. That will,
however, break some assumptions by really stupid programs.)

Ummm, no. UTF-16 filenames would break *every* correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been the
end-of-string marker.
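
A quick way to see the zero octets, sketched in Ruby 1.9 or later with a made-up file name:

```ruby
name = "id.txt"

utf16 = name.encode("UTF-16LE")
puts utf16.bytes.inspect           # every ASCII character is followed by a 0 byte
puts utf16.bytes.include?(0)       # true
puts name.bytes.include?(0)        # false: UTF-8 uses 0x00 only for U+0000 itself
```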

Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this 'stone age' you refer to?

UTF-8 can take multiple octets to represent a character. So can UTF-16,
UTF-32, and every other variation of Unicode.

Depending on content, a string in UTF-8 can consume more octets than
the same string in UTF-16, or vice versa.

Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
get to have the fun of picking between big- and little-endian!

Brad5 · 8 March 2006 19:18

Eric Jacoboni wrote:

Waiting for a plain support in Rite (much more important to me than
the "end" issues...)

Speaking of Rite... is there a timeline on its release yet? One year? Two years? More?

Richard_Gyger · 10 March 2006 02:08

guess i'll wait till then. thanks for the info guys.

Daniel Harple wrote:

On Mar 8, 2006, at 7:24 PM, Michal Suchanek wrote:

Anyway, unicode regexps are planned for ruby 2.0 iirc.

Unicode strings are also planned for Ruby 2 (possibly implemented already?).

-- Daniel

Richard_Gyger · 10 March 2006 02:19

exactly. utf-8 doesn't mean one byte per char necessarily.

how have folks solved this problem when writing web sites in rails?

Michal Suchanek wrote:

On 3/8/06, Logan Capaldo <logancapaldo@gmail.com> wrote:

On Mar 8, 2006, at 1:13 PM, Richard Gyger wrote:

so, you guys are telling me a language developed since the year
2000 doesn't support unicode strings natively? in my opinion,
that's a pretty glaring problem.

Ruby doesn't really support any strings natively. It just happens to
have a bytevector class that acts a lot like a string. Having said
that, have you tried:
$KCODE="u" # Assumes the source file is encoded as UTF8, affects literal strings, regexps, etc.

If your source file is UTF16 or some other non-UTF8 encoding you'll
have to use iconv to get into UTF8 to compare with the literals in
your source.

err, no that is not what people want when they speak about downcase in unicode.
Sure, you can write a string encoded in utf-8 in your source, and
verify it is byte-identical to another string. That is about all you
get this way.
I suspect regexps won't work right with multibyte characters; for
downcase or case-insensitive regexps you would even need to know the
language.

Thanks

Michal

Austin_Ziegler5 · 11 March 2006 04:42

so, you guys are telling me a language developed since the year 2000
doesn't support unicode strings natively? in my opinion, that's a
pretty glaring problem.

Please note that Ruby itself is ten years old. Unicode has only
*recently* (the last three or four years, with the release of Windows
XP) become a major factor, especially in Japan. Unix support for
Unicode is still in the stone ages because of the nonsense that POSIX
put on Unix ages ago. (When Unix filesystems can write UTF-16 as
their native filename format, then we're going to be much better.
That will, however, break some assumptions by really stupid
programs.)

Why the hell utf-16? It is no longer compatible with ascii, yet 16
bits are far from sufficient to cover current unicode. So you still
get multiword characters. It is not even dword aligned for fast
processing by current cpus. I would like utf-8 for compatibility, and
utf-32 for easy string processing. But I do not see much use for
utf-16.

UTF-16 is actually pretty performant and the implementation of wchar_t
on MacOS X and Windows is (you guessed it!) UTF-16. The filesystems for
both of these operating systems (which have *far* better Unicode
support than anything else) both use UTF-16 as the native filename
encoding (this is true for HFS+, NTFS4, and NTFS5). The only difference
between what MacOS X does and Windows does for this is that Apple chose
to use decomposed characters instead of composed characters (e.g.,
LOWERCASE E + COMBINING ACUTE ACCENT instead of LOWERCASE E ACUTE
ACCENT).
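
The composed/decomposed difference can be sketched with `String#unicode_normalize`, assuming Ruby 2.2 or later (the method did not exist when this thread was written):

```ruby
composed   = "\u00E9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # e + COMBINING ACUTE ACCENT

puts composed == decomposed                            # false: different code points
puts composed.unicode_normalize(:nfd) == decomposed    # true
puts decomposed.unicode_normalize(:nfc) == composed    # true
```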

Look at the performance numbers for ICU4C: it's pretty damn good. UTF-32
isn't exactly space conservative (since with UTF-16 *most* of the BMP
can be represented with a single wchar_t, and only a few need surrogates
taking up exactly *two* wchar_ts, whereas *all* characters would take up
four uint32_t under UTF-32). ICU4C uses UTF-16 internally. Exclusively.

Austin Ziegler wrote:

Unix support for Unicode is still in the stone ages because of the
nonsense that POSIX put on Unix ages ago. (When Unix filesystems can
write UTF-16 as their native filename format, then we're going to be
much better. That will, however, break some assumptions by really
stupid programs.)

Ummm, no. UTF-16 filenames would break *every* correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been the
end-of-string marker.

You're right. And I'm saying that I don't care. People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I'll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems. One could do
what I *think* that Apple has done and provide two filesystem
interfaces that are synchronized. The native interface -- and the more
efficient one -- will be using UTF-16 because that's what HFS+ speaks.
The secondary interface (that also works on UFS filesystems) would
translate to UTF-8 and/or follow the nonsensical POSIX rules for native
encodings.

Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this 'stone age' you refer to?

Change an environment variable and watch your programs break that had
worked so well with Unicode. *That* is the stone age that I refer to.
I'm also guessing that you don't do much with long Japanese filenames or
deep paths that involve *anything* except US-ASCII (a subset of UTF-8).

UTF-8 can take multiple octets to represent a character. So can UTF-16,
UTF-32, and every other variation of Unicode.

This last statement is true only because you use the term "octet." It's
a useless term here, because UTF-8 only has any level of efficiency for
US-ASCII. Even if you step to European content, UTF-8 is no longer
perfectly efficient, and when you step to Asian content, UTF-8 is so
bloody inefficient that most folks who have to deal with it would rather
work in a native encoding (EUC-JP or SJIS, anyone?) which is 1..2 bytes
or do everything in UTF-16.

Depending on content, a string in UTF-8 can consume more octets than
the same string in UTF-16, or vice versa.

Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
get to have the fun of picking between big- and little-endian!

Are people always this stupid when it comes to things that they clearly
don't understand? Yes, UTF-16 may have the problem of not knowing if
you're dealing with UTF-16BE or UTF-16LE, but it's my understanding that
this is *only* an issue when you're dealing with both on the same
system. Additionally, most platforms specify a default. It's been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.

There. Problem solved.

If you're going to babble on about Unicode, it'd be nice if you knew more than
the knee-jerk stuff you've posted so far. Either of you.

-austin

Daniel_Harple · 8 March 2006 19:25

http://www.atdot.net/yarv/
http://redhanded.hobix.com/cult/yarvMergedMatz.html

-- Daniel

On Mar 8, 2006, at 8:18 PM, rtilley wrote:

Speaking of Rite... is there a timeline on its release yet? One year? Two years? More?

PJ1 · 10 March 2006 02:39

It's a huge f*cking pain in the ass. We've been trying to convert
Wayfaring.com over to UTF8 off and on for about a month and it's
completely useless. Either you start the site using UTF8 (using crappy
hacks IMO) or forgetaboutit. We're about to break ground on a new site
and I almost don't want to do it until ruby 2.0 comes out with the
unicode support built in.

-PJ
http://pjhyett.com

On 3/9/06, Richard Gyger <richard@bytethink.com> wrote:

exactly. utf-8 doesn't mean one byte per char necessarily.

how have folks solved this problem when writing web sites in rails?

Michal_hramrach_Such · 11 March 2006 22:02

>>> so, you guys are telling me a language developed since the year 2000
>>> doesn't support unicode strings natively? in my opinion, that's a
>>> pretty glaring problem.
>> Please note that Ruby itself is ten years old. Unicode has only
>> *recently* (the last three or four years, with the release of Windows
>> XP) become a major factor, especially in Japan. Unix support for
>> Unicode is still in the stone ages because of the nonsense that POSIX
>> put on Unix ages ago. (When Unix filesystems can write UTF-16 as
>> their native filename format, then we're going to be much better.
>> That will, however, break some assumptions by really stupid
>> programs.)
> Why the hell utf-16? It is no longer compatible with ascii, yet 16
> bits are far from sufficient to cover current unicode. So you still
> get multiword characters. It is not even dword aligned for fast
> processing by current cpus. I would like utf-8 for compatibility, and
> utf-32 for easy string processing. But I do not see much use for
> utf-16.

UTF-16 is actually pretty performant and the implementation of wchar_t
on MacOS X and Windows is (you guessed it!) UTF-16. The filesystems for
both of these operating systems (which have *far* better Unicode
support than anything else) both use UTF-16 as the native filename
encoding (this is true for HFS+, NTFS4, and NTFS5). The only difference
between what MacOS X does and Windows does for this is that Apple chose
to use decomposed characters instead of composed characters (e.g.,
LOWERCASE E + COMBINING ACUTE ACCENT instead of LOWERCASE E ACUTE
ACCENT).

Look at the performance numbers for ICU4C: it's pretty damn good. UTF-32
isn't exactly space conservative (since with UTF-16 *most* of the BMP
can be represented with a single wchar_t, and only a few need surrogates
taking up exactly *two* wchar_ts, whereas *all* characters would take up
four uint32_t under UTF-32). ICU4C uses UTF-16 internally. Exclusively.

I do not care what Windows, OS X, or ICU uses. I care what I want to
use. Even if most characters are encoded with a single word you have to
cope with multiword characters. That means that a character is not a
simple type. You cannot have character arrays. And no library can
completely wrap this inconsistency and isolate you from dealing with
it.

Even if the library is performant with multiword characters it is
complex. That means more prone to errors. Both in itself and in the
software that interfaces it.

You say that utf-16 is more space-conserving for languages like
Japanese. Nice. But I do not care. I guess text consumes a very small
portion of memory on my system. Both RAM and hard drive. I do not care
if that doubles or quadruples. In the very few cases when I want to
save space (i.e. when sending email attachments) I can use gzip. It can
even compress repetitive text which no encoding can.

> Austin Ziegler wrote:
>> Unix support for Unicode is still in the stone ages because of the
>> nonsense that POSIX put on Unix ages ago. (When Unix filesystems can
>> write UTF-16 as their native filename format, then we're going to be
>> much better. That will, however, break some assumptions by really
>> stupid programs.)
> Ummm, no. UTF-16 filenames would break *every* correctly-implemented
> UNIX program: UTF-16 allows the octet 0x00, which has always been the
> end-of-string marker.

You're right. And I'm saying that I don't care. People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I'll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems. One could do
what I *think* that Apple has done and provide two filesystem
interfaces that are synchronized. The native interface -- and the more
efficient one -- will be using UTF-16 because that's what HFS+ speaks.
The secondary interface (that also works on UFS filesystems) would
translate to UTF-8 and/or follow the nonsensical POSIX rules for native
encodings.

> Personally, my file names have been in UTF-8 for quite some time now,
> and it works well: What exactly is this 'stone age' you refer to?

Change an environment variable and watch your programs break that had
worked so well with Unicode. *That* is the stone age that I refer to.
I'm also guessing that you don't do much with long Japanese filenames or
deep paths that involve *anything* except US-ASCII (a subset of UTF-8).

Hmm, so you call the possibility to choose your encoding living in
the stone age. I would call it living in reality. There are various
encodings out there.

> UTF-8 can take multiple octets to represent a character. So can UTF-16,
> UTF-32, and every other variation of Unicode.

This last statement is true only because you use the term "octet." It's
a useless term here, because UTF-8 only has any level of efficiency for
US-ASCII. Even if you step to European content, UTF-8 is no longer
perfectly efficient, and when you step to Asian content, UTF-8 is so
bloody inefficient that most folks who have to deal with it would rather
work in a native encoding (EUC-JP or SJIS, anyone?) which is 1..2 bytes
or do everything in UTF-16.

No, I suspect the reason for using EUC-JP, SJIS, or ISO-8859-*, and
other weird encodings is historical.
What do you mean by efficiency? If you want space efficiency use
compression. If you want speed, use utf-32 or similar encoding that
does not have to deal with special cases.

> Depending on content, a string in UTF-8 can consume more octets than
> the same string in UTF-16, or vice versa.
>
> Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
> get to have the fun of picking between big- and little-endian!

Are people always this stupid when it comes to things that they clearly
don't understand? Yes, UTF-16 may have the problem of not knowing if
you're dealing with UTF-16BE or UTF-16LE, but it's my understanding that
this is *only* an issue when you're dealing with both on the same
system. Additionally, most platforms specify a default. It's been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.

iirc there are even byte-order marks. If you insert one in every
string you can get them identified at any time without doubt.
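
A minimal sketch of byte-order-mark detection, assuming Ruby 2.0 or later. The helper name `utf16_with_bom_to_utf8` is hypothetical, and it deliberately ignores the case of UTF-32 data, whose little-endian mark begins with the same two bytes.

```ruby
# Hypothetical helper: decode UTF-16 bytes to UTF-8 using a leading byte-order mark.
def utf16_with_bom_to_utf8(bytes)
  data = bytes.b
  encoding =
    if data.start_with?("\xFF\xFE".b)
      "UTF-16LE"
    elsif data.start_with?("\xFE\xFF".b)
      "UTF-16BE"
    else
      raise ArgumentError, "no UTF-16 byte-order mark found"
    end
  # Drop the two BOM bytes, label the rest, and transcode.
  data.byteslice(2, data.bytesize - 2).force_encoding(encoding).encode("UTF-8")
end

little = "\uFEFFid".encode("UTF-16LE")
big    = "\uFEFFid".encode("UTF-16BE")
puts utf16_with_bom_to_utf8(little)
puts utf16_with_bom_to_utf8(big)
```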

But do not trust me on that. I do not know anything about unicode, and
I want to sidestep the issue by using an encoding that is easy to work
with, even for the ignorant.

Thanks

Michal

Anthony_DeRobertis · 13 March 2006 23:32

Austin Ziegler wrote:

Ummm, no. UTF-16 filenames would break *every* correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been
the end-of-string marker.

You're right. And I'm saying that I don't care.

Well, I suspect most other people want to maintain backwards
compatibility. Hence the existence of UTF-8.

People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I'll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems.

Why? POSIX gives nearly binary-transparent file names; the only
exception is the single octet 0x00. Considering the 1:1 mapping between
UTF-8 and other Unicode encodings, how can the choice of one or another
"badly limit" what can be done?

Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this 'stone age' you refer to?

Change an environment variable and watch your programs break that had
worked so well with Unicode. *That* is the stone age that I refer to.

dd if=/dev/urandom of=/lib/ld-linux.so.2 and watch all my programs
break, too. What's your point?

It is always possible to break a computer system if you try hard enough
(or, all too often, not hard at all); but if the user actively attempts
to make his machine malfunction, that's not the OS's problem.

I'm also guessing that you don't do much with long Japanese filenames
or deep paths that involve *anything* except US-ASCII (a subset of
UTF-8).

Well, I have Japanese file names (though not that many in the grand
scheme of things), and have a lot of files and directories named in non
US-ASCII. Yeah, I know that file name length and path length limits
suck, but that's an implementation limitation of e.g. ext3, nothing
fundamental.

UTF-8 can take multiple octets to represent a character. So can
UTF-16, UTF-32, and every other variation of Unicode.

This last statement is true only because you use the term "octet."

You're correct; that isn't what I meant to say. Something along the
lines of the following is better worded:

        UTF-8 can take more than one octet to represent a
        character; UTF-16 can take more than two; UTF-32
        more than four; etc.

It's a useless term here, because UTF-8 only has any level of
efficiency for US-ASCII.

English, I've heard, is a rather common language.

Even if you step to European content, UTF-8
is no longer perfectly efficient,

Of course not --- but still generally better than UTF-16, I think.
Spanish, I've heard, is also a rather common language.

and when you step to Asian content,
UTF-8 is so bloody inefficient that most folks who have to deal with
it would rather work in a native encoding (EUC-JP or SJIS, anyone?)
which is 1..2 bytes or do everything in UTF-16.

Yes, for CJK, UTF-8 is fairly inefficient. A full 50% bigger than
UTF-16.
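
The size trade-off is easy to measure. A sketch, assuming Ruby 1.9 or later; the `report` helper and the sample words are made up for illustration:

```ruby
# Hypothetical helper: print and return the byte counts of a string in both encodings.
def report(label, text)
  utf8  = text.bytesize
  utf16 = text.encode("UTF-16LE").bytesize
  puts "#{label}: UTF-8 #{utf8} bytes, UTF-16 #{utf16} bytes"
  [utf8, utf16]
end

english  = report("English",  "hello")
spanish  = report("Spanish",  "espa\u00F1ol")
japanese = report("Japanese", "\u65E5\u672C\u8A9E")   # "日本語"
```

Each of the three Japanese characters takes three bytes in UTF-8 and two in UTF-16, while the ASCII and mostly-ASCII samples go the other way.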

OTOH, it has some nice advantages over UTF-16, like being backwards
compatible with C strings, being resynchronizable (if an octet is lost),
not having byte-order issues, etc.

Now, honestly, what portion of your hard disk is taken up by file names?
