Unicode roadmap?

Almost all typical Unicode tasks can be handled with the UTF-8 support in
Regexp, Iconv, jcode and $KCODE=u, plus the unicode[1] library (as in
unicode_hacks[2]) :)
(but case-insensitive regexps don't work for non-ASCII chars in Ruby 1.8;
that can probably be solved using the latest Oniguruma).
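
For example, a minimal sketch of what already works in 1.8 (assuming the
source file and data are UTF-8; jcode supplies the character-wise helpers):

  $KCODE = 'u'       # treat strings as UTF-8 where the core honours it
  require 'jcode'    # jlength, each_char, ...
  require 'iconv'

  s = "Grüße"
  s.length                              # => 7 (bytes)
  s.jlength                             # => 5 (characters, via jcode)
  s =~ /ß./u                            # UTF-8-aware matching with the /u flag
  Iconv.conv('ISO-8859-1', 'UTF-8', s)  # re-encode for output
  "Ü" =~ /ü/ui                          # => nil -- case folding stops at ASCII in 1.8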

But if you're looking for a deeper level of "Unicode support", e.g. as
described in the Unicode FAQ[3], those problems aren't about handling Unicode
per se, but are rather L10N/I18N problems, such as locale-dependent text
breaks, collation, formatting, etc.
To deal with them from Ruby, take a look at the somewhat broken wrappers to
the ICU library: icu4r[4], g11n[5] and Ruby/CLDR[6].

And if you want Unicode as the default String encoding and want to use
national chars in names for your vars/functions/classes in Ruby code, I
believe that will never happen. :)

Links:
[1] http://www.yoshidam.net/Ruby.html
[2] http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/
[3] http://www.unicode.org/faq/basic_q.html#13
[4] http://rubyforge.org/projects/icu4r
[5] http://rubyforge.org/projects/g11n
[6] http://www.yotabanana.com/hiki/ruby-cldr.html

To: ruby-talk ML
Subject: Re: Unicode roadmap?

Almost all typical Unicode tasks can be handled with the UTF-8 support in
Regexp, Iconv, jcode and $KCODE=u, plus the unicode[1] library (as in
unicode_hacks[2]) :)
(but case-insensitive regexps don't work for non-ASCII chars in Ruby 1.8;
that can probably be solved using the latest Oniguruma).

But if you're looking for a deeper level of "Unicode support", e.g. as
described in the Unicode FAQ[3], those problems aren't about handling Unicode
per se, but are rather L10N/I18N problems, such as locale-dependent text
breaks, collation, formatting, etc.
To deal with them from Ruby, take a look at the somewhat broken wrappers to
the ICU library: icu4r[4], g11n[5] and Ruby/CLDR[6].

Thanks Dmitry!

And if you want Unicode as the default String encoding and want to use
national chars in names for your vars/functions/classes in Ruby code, I
believe that will never happen. :)

Hmmm... I thought Unicode IS the default String encoding when $KCODE=u.
Isn't it?

V.

···

From: Dmitry Severin [mailto:dmitry.severin@gmail.com]
Sent: Wednesday, June 14, 2006 11:20 AM

No. Current String implementation has no notion of "encoding" (Ruby String
is just a sequence of bytes) and $KCODE is just a hint for methods to change
their behaviour (e.g. in Regexp) and treat those bytes as text represented
in some encoding.
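
A quick illustration of the byte-oriented String (the values assume a UTF-8
source file):

  $KCODE = 'u'
  s = "häh"
  s.length           # => 4   -- bytes, whatever $KCODE says
  s[1]               # => 195 -- an integer byte, not the character "ä"
  s.scan(/./u).size  # => 3   -- only the /u-flagged regexp counts characters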

···

On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:

Hmmm... I thought Unicode IS the default String encoding when $KCODE=u.
Isn't it?

Strictly speaking, Unicode is not an encoding, but UTF-8 is.

For my personal vision of "proper" Unicode support, I'd like to have
UTF-8 as the standard internal string format, and Unicode code points as
the standard character code, and *all* String functions to just work
intuitively "right" on a character basis rather than a byte basis. Thus
the internal String encoding is a technical matter only, as long as it
is capable of supporting all Unicode characters, and these internal
details are not exposed via public methods.
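
Something like this purely hypothetical sketch (the class name and methods are
invented here, not an existing API): characters on the outside, the byte
encoding an internal detail:

  class CharString
    def initialize(utf8)
      @points = utf8.unpack('U*')   # internal storage choice is hidden
    end

    def length
      @points.length                # characters, not bytes
    end

    def [](index)
      [@points[index]].pack('U')    # a one-character string
    end

    def reverse
      CharString.new(@points.reverse.pack('U*'))
    end

    def to_s
      @points.pack('U*')            # back to UTF-8 only at the boundary
    end
  end

  CharString.new("Grüße").length    # => 5, not 7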

I/O and String functions should be able to convert to and from
different external encodings, via plugin modules. Note I don't require
non-Unicode String classes, just the possibility to do I/O with
foreign character sets, or conversion to byte arrays. Strings should
consist of characters, not just be a sequence of bytes that is meaningless
without external information about their encoding.
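
(Ruby 1.8 can already approximate the conversion part with Iconv; a rough
sketch of just the boundary conversions:)

  require 'iconv'

  latin1 = Iconv.conv('ISO-8859-1', 'UTF-8', "Grüße")  # out to an external encoding
  utf8   = Iconv.conv('UTF-8', 'ISO-8859-1', latin1)   # and back in again
  bytes  = utf8.unpack('C*')                           # "conversion to byte arrays"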

No ruby apps or libraries should break because they are surprised by
(Unicode) Strings, or it should be obvious the fault is with them.

Optionally, additional String classes with different internal Unicode
encodings might be a boon for certain performance sensitive
applications, and they should all work together much like Numbers of
different kinds do.

While I want ruby source files to be UTF-8 encoded, in no way do I
want identifiers to consist of additional national characters. I like
names in APIs everyone can actually type, but literal Strings are a
different matter.

I know this is a bit vague on the one hand, and might demand intrusive
changes on the other. Java's history shows proper Unicode support
is no trivial matter, and I don't feel qualified to give advice on how to
implement this. It's just my vision of how Strings ideally would be.

And of course for my personal vision to become perfect, everyone
outside Ruby should adopt Unicode too.

Jürgen

···

On Wed, Jun 14, 2006 at 05:26:58PM +0900, Victor Shepelev wrote:

From: Dmitry Severin [mailto:dmitry.severin@gmail.com]
Sent: Wednesday, June 14, 2006 11:20 AM
> To: ruby-talk ML
> Subject: Re: Unicode roadmap?
>
> Almost all typical Unicode tasks can be handled with the UTF-8 support in
> Regexp, Iconv, jcode and $KCODE=u, plus the unicode[1] library (as in
> unicode_hacks[2]) :)
> (but case-insensitive regexps don't work for non-ASCII chars in Ruby 1.8;
> that can probably be solved using the latest Oniguruma).
>
> But if you're looking for a deeper level of "Unicode support", e.g. as
> described in the Unicode FAQ[3], those problems aren't about handling Unicode
> per se, but are rather L10N/I18N problems, such as locale-dependent text
> breaks, collation, formatting, etc.
> To deal with them from Ruby, take a look at the somewhat broken wrappers to
> the ICU library: icu4r[4], g11n[5] and Ruby/CLDR[6].

Thanks Dmitry!

> And if you want Unicode as the default String encoding and want to use
> national chars in names for your vars/functions/classes in Ruby code, I
> believe that will never happen. :)

Hmmm... I thought Unicode IS the default String encoding when $KCODE=u.
Isn't it?

V.

--
The box said it requires Windows 95 or better so I installed Linux

Matz,

thanks for taking part in this discussion. I (and probably all other non-US
citizens) would really appreciate an elegant Unicode solution in Ruby from
the master himself :)

In most cases I would be happy if at least these functions of class String
had a Unicode equivalent (a rough sketch follows the list):

capitalize
upcase
downcase
reverse
slice
split
index
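
A hedged sketch of character-wise stand-ins for a few of these, built on
unpack('U*')/pack('U*') (the helper names are made up here; capitalize,
upcase and downcase additionally need Unicode case tables, e.g. from the
unicode extension):

  $KCODE = 'u'

  def ureverse(s)     s.unpack('U*').reverse.pack('U*')           end
  def uslice(s, i, n) s.unpack('U*')[i, n].to_a.pack('U*')        end
  def uindex(s, ch)   s.unpack('U*').index(ch.unpack('U*').first) end

  ureverse("Grüße")      # => "eßürG"
  uslice("Grüße", 1, 3)  # => "rüß"
  uindex("Grüße", "ß")   # => 3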

Maybe it's because I'm no regexp guru, but I can't imagine a trivial solution.

Another issue is that ActiveRecord (and other additional libraries)
are not Unicode-aware because there is no _transparent_ Unicode support.

Just as an example,

functions like:

ActiveRecord::Validations::ClassMethods::validates_length_of

using parameters like

# minimum - The minimum size of the attribute
# maximum - The maximum size of the attribute

will most probably use String#size, which gives the byte length,
not the character length.
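
For instance (assuming UTF-8 data; the attribute name and limit below are
invented for illustration):

  name = "Jürgen"
  name.size              # => 7 -- bytes; this is what the validation sees
  name.scan(/./u).size   # => 6 -- characters

  # a possible workaround inside the model, counting characters by hand:
  def validate
    errors.add(:name, "is too long") if name.to_s.scan(/./u).size > 40
  end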

The Ruby 2.0 solution I read about (each string carries its encoding inside) sounds fantastic (not to mention bytecode execution). Could you imagine an implementation of that before Ruby 2.0?

Best regards
Peter

-------- Original Message --------

···

Date: Wed, 14 Jun 2006 17:38:40 +0900
From: Dmitry Severin <dmitry.severin@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: Unicode roadmap?

On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
>
> Hmmm... I thought Unicode IS the default String encoding when $KCODE=u.
> Isn't it?

No. Current String implementation has no notion of "encoding" (Ruby String
is just a sequence of bytes) and $KCODE is just a hint for methods to change
their behaviour (e.g. in Regexp) and treat those bytes as text represented
in some encoding.

For my personal vision of "proper" Unicode support, I'd like to have
UTF-8 the standard internal string format, and Unicode Points the
standard character code, and *all* String functions to just work
intuitively "right" on a character base rather than byte base. Thus
the internal String encoding is a technical matter only, as long as it
is capable of supporting all Unicode characters, and these internal
details are not exposed via public methods.

Maybe Juergen is saying the same thing I'm going to say, but since I don't
understand / recall what UTF-8 encoding is exactly:

I'm beginning to think (with a newbie sort of perspective) that Unicode is too
complicated to deal with inside a program. My suggestion would be that
Unicode be an external format...

What I mean is, when you have a program that must handle international text,
convert the Unicode to a fixed width representation for use by the program.
Do the processing based on these fixed width characters. When it's complete,
convert it back to Unicode for output.
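
In Ruby terms, the fixed-width form could simply be an array of code points
(effectively UTF-32), converting only at the edges; a rough sketch:

  points = "Grüße".unpack('U*')  # => [71, 114, 252, 223, 101]
  points.length                  # => 5, with O(1) indexing: points[2] => 252
  points.reverse!                # process on fixed-width "characters"
  points.pack('U*')              # => "eßürG", back to UTF-8 for output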

It seems to me that would make a lot of things easier.

Then I might have two basic "types" of programs--programs that can handle any
text (i.e., international), and other programs that can handle only English
(or maybe only European languages that can work with an 8 bit byte). (I
suggest these two types of programs because I suspect those that have to
handle the international character set will be slower than those that don't.)

Aside: What would that take to handle all the characters / ideographs (is that
what they call them, the Japanese, Chinese, ... characters) presently in use
in the world--iirc, 16 bits (2**16) didn't cut it for Unicode--would 32 bits?

Randy Kramer

···

On Wednesday 14 June 2006 06:01 am, Juergen Strobel wrote:

I/O and String functions should be able to convert to and from
different external encodings, via plugin modules. Note I don't require
non Unicode String classes, just the possibility to do I/O with
foreign characters sets, or conversion to byte arrays. Strings should
consist of characters, not just be a sequence of bytes meaningless
without external information about their encoding.

No ruby apps or libraries should break because they are surprised by
(Unicode) Strings, or it should be obvious the fault is with them.

Optionally, additional String classes with different internal Unicode
encodings might be a boon for certain performance sensitive
applications, and they should all work together much like Numbers of
different kinds do.

While I want ruby source files to be UTF-8 encoded, in no way do I
want identifiers to consist of additional national characters. I like
names in APIs everyone can actually type, but literal Strings is a
different matter.

I know this is a bit vague on the one hand, and might demand intrusive
changes on the other one. Java history shows proper Unicode support
is no trivial matter, and I don't feel qualified to give advice how to
implement this. It's just my vision of how Strings ideally would be.

And of course for my personal vision to become perfect, everyone
outside Ruby should adopt Unicode too.

Jürgen

> For my personal vision of "proper" Unicode support, I'd like to have
> UTF-8 the standard internal string format, and Unicode Points the
> standard character code, and *all* String functions to just work
> intuitively "right" on a character base rather than byte base. Thus
> the internal String encoding is a technical matter only, as long as it
> is capable of supporting all Unicode characters, and these internal
> details are not exposed via public methods.

Maybe Juergen is saying the same thing I'm going to say, but since I don't
understand / recall what UTF-8 encoding is exactly:

Wikipedia has a decent article on Unicode at http://en.wikipedia.org/wiki/Unicode.

Basically, Unicode gives every character worldwide a unique number,
called a code point. Since these numbers can be quite large (currently
up to 21 bits), and western users especially tend to use only a tiny
subset, different encodings were created to save space or to remain
backward compatible with 7-bit ASCII.

UTF-8 encodes every Unicode code point as a variable-length sequence
of 1 to 4 (I think) bytes. Most western symbols only require 1 or 2
bytes. This encoding is space-efficient, and ASCII-compatible as long
as only 7-bit characters are used. Certain string operations
are quite hard or inefficient, since the position of characters, or
even the length of a string, given a byte stream, is uncertain without
counting actual characters (no pointer/index arithmetic!).
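
Counting the characters in a UTF-8 byte string means walking it; one way (a
rough sketch) is to count every byte that is not a 10xxxxxx continuation byte:

  def utf8_length(str)
    str.unpack('C*').inject(0) do |n, byte|
      (byte & 0xC0) == 0x80 ? n : n + 1   # skip continuation bytes
    end
  end

  utf8_length("Grüße")  # => 5 (the string itself is 7 bytes long)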

UTF-32 encodes every code point as a single 32 bit word. This enables
simple, efficient substring access, but wastes space.

Other encodings have yet different characteristics, but all deal with
encoding the same code points. A Unicode String class should expose
code points, or sequences of code points (characters), not the
internal encoding used to store them; that is the core of my
argument.

I'm beginning to think (with a newbie sort of perspective) that Unicode is too
complicated to deal with inside a program. My suggestion would be that
Unicode be an external format...

What I mean is, when you have a program that must handle international text,
convert the Unicode to a fixed width representation for use by the program.
Do the processing based on these fixed width characters. When it's complete,
convert it back to Unicode for output.

UTF-32 would be such an encoding. It uses quadruple the space for simple
7-bit ASCII characters, but with such a dramatically larger total
character set, some tradeoffs are unavoidable.

It seems to me that would make a lot of things easier.

Then I might have two basic "types" of programs--programs that can handle any
text (i.e., international), and other programs that can handle only English
(or maybe only European languages that can work with an 8 bit byte). (I
suggest these two types of programs because I suspect those that have to
handle the international character set will be slower than those that don't.)

Aside: What would that take to handle all the characters / ideographs (is that
what they call them, the Japanese, Chinese, ... characters) presently in use
in the world--iirc, 16 bits (2**16) didn't cut it for Unicode--would 32 bits?

Randy Kramer

Currently Unicode requires 21 bits, but this has changed in the past.
Java got bitten by that by defining the character type as 16 bits and
hardcoding this in their VM, and now they need some kludges.
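
(A quick check of the arithmetic:

  0x10FFFF               # => 1114111, the highest Unicode code point
  0x10FFFF.to_s(2).size  # => 21 -- bits needed
  2 ** 16                # => 65536, so a 16-bit char can't cover it directly)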

A split into simple and Unicode-aware programs will divide code into two
camps, which will remain slightly incompatible or require dirty hacks. I'd
rather prolong the status quo, where Strings can be seen to contain
bytes in whatever encoding the user sees fit, but might break if used
with foreign code which has other notions of encoding.

Jürgen

···

On Thu, Jun 15, 2006 at 06:34:11AM +0900, Randy Kramer wrote:

On Wednesday 14 June 2006 06:01 am, Juergen Strobel wrote:

--
The box said it requires Windows 95 or better so I installed Linux

[ snip essentially accurate information ]

UTF-8 encodes every Unicode code point as a variable length sequence
of 1 to 4 (I think) bytes.

It could be up to six bytes at one point. However, I think that there
is still support for surrogate characters meaning that a single glyph
*might* take as many as eight bytes to represent in the 1-4 byte
representation. Even with that, though, those are rare and usually
user-defined (private) ranges IIRC. This also doesn't deal with
(de)composed glyphs/combining glyphs.

Currently Unicode requires 21 bit, but this has changed in the past.

Yes. Unicode went from 16-bit (I think) to 32-bit to 21-bit.

Java got bitten by that by defining the character type to 16 bit and
hardcoding this in their VM, and now they need some kludges.

Um. I think that the initial Java definition used UCS-2 (same as
Windows did for NTFS and VFS) but now uses UTF-16, which has surrogate
support (UCS-2 did not).

-austin

···

On 6/15/06, Juergen Strobel <strobel@secure.at> wrote:
--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

Um, hi everyone. I'm a Ruby newbie but a very, very old hand at Unicode & text processing. I wrote all those articles Charles Nutter pointed to the other day. I spent years doing full-text search for a living, adapted a popular engine to handle Japanese text, and co-edited the XML spec and helped work out its character-encoding issues. Lots more war stories on request.

Anyhow, I have some ideas about what good ways to do text processing in a language like Ruby might be, but I thought for the moment I'd just watch this interesting debate go by and serve as an information resource.

UTF-8 encodes every Unicode code point as a variable length sequence
of 1 to 4 (I think) bytes.

UTF-8 can do the 1,114,112 Unicode codepoints in 4 bytes. We probably don't need any more codepoints until we meet alien civilizations.

Most western symbols only require 1 or 2
bytes. This encoding is space efficient

UTF-8 is racist. The further East you go, the less efficient it is to store text. Having said that, it has a lot of other advantages. Also, when almost every storage device is increasingly being used for audio and video, at megabytes per minute, it may be the case that the efficiency of text storage is less likely to be a bottleneck.

Java got bitten by that by defining the character type to 16 bit and
hardcoding this in their VM, and now they need some kludges.

Java screwed up, with the result that a Java (and C#) "char" represents a UTF-16 codepoint. Blecch.

  -Tim

···

On Jun 15, 2006, at 11:17 AM, Juergen Strobel wrote:

[ snip essentially accurate information ]

>UTF-8 encodes every Unicode code point as a variable length sequence
>of 1 to 4 (I think) bytes.

It could be up to six bytes at one point. However, I think that there
is still support for surrogate characters meaning that a single glyph
*might* take as many as eight bytes to represent in the 1-4 byte
representation. Even with that, though, those are rare and usually
user-defined (private) ranges IIRC. This also doesn't deal with
(de)composed glyphs/combining glyphs.

No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for
all characters. Only Java may need more than that because of their use
of UTF-16 surrogates and special \0 handling in an intermediary step. See

>Currently Unicode requires 21 bit, but this has changed in the past.

Yes. Unicode went from 16-bit (I think) to 32-bit to 21-bit.

>Java got bitten by that by defining the character type to 16 bit and
>hardcoding this in their VM, and now they need some kludges.

Um. I think that the initial Java definition used UCS-2 (same as
Windows did for NTFS and VFS) but now uses UTF-16, which has surrogate
support (UCS-2 did not).

-austin

Java has its own *character* type apart from string. Like C's char,
only it is 16 bits wide, and is not directly related to internal
string encoding. Note that Java strings are more than a simple
sequence of objects of character type. And 16 bits is not enough for
some Unicode characters, which leads to the weird situation of needing
two character objects to represent a single character sometimes (via
surrogates).
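
The arithmetic behind "two character objects for one character", sketched in
Ruby for a supplementary code point:

  cp   = 0x1D11E               # MUSICAL SYMBOL G CLEF, above U+FFFF
  v    = cp - 0x10000
  high = 0xD800 + (v >> 10)    # => 0xD834 -- first UTF-16 code unit (Java char)
  low  = 0xDC00 + (v & 0x3FF)  # => 0xDD1E -- second UTF-16 code unit
  [cp].pack('U').length        # => 4 -- one character, 4 bytes in plain UTF-8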

Jürgen

···

On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Ziegler wrote:

On 6/15/06, Juergen Strobel <strobel@secure.at> wrote:

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
              * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
              * austin@zieglers.ca

--
The box said it requires Windows 95 or better so I installed Linux

...

> It could be up to six bytes at one point. However, I think that there
> is still support for surrogate characters meaning that a single glyph
> *might* take as many as eight bytes to represent in the 1-4 byte
> representation. Even with that, though, those are rare and usually
> user-defined (private) ranges IIRC. This also doesn't deal with
> (de)composed glyphs/combining glyphs.

No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for
all characters. Only Java may need more than that because of their use
of UTF-16 surrogates and special \0 handling in an intermediary step. See

Austin's correct about six bytes, actually. The original UTF-8
specification *was* for up to six bytes:
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

However, no codepoints were ever defined in the upper part of the
range, and once Unicode was officially restricted to the range
1-0x10FFFF, there was no longer any need for the five- and six-byte
sequences.
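
The byte counts under the current (RFC 3629) definition are easy to check
with pack('U'), which emits plain UTF-8:

  [0x7F].pack('U').length      # => 1  (ASCII)
  [0x7FF].pack('U').length     # => 2
  [0xFFFF].pack('U').length    # => 3
  [0x10FFFF].pack('U').length  # => 4  (the top of the Unicode range)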

Compare RFC 2279 from 1998 (six bytes)

and RFC 3629 from 2003 (four bytes)

That Java encoding (UTF-8-encoded UTF-16) isn't really UTF-8, though,
so you'd never get eight bytes in valid UTF-8:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters. (RFC 3629)

Paul.

···

On 15/06/06, Juergen Strobel <strobel@secure.at> wrote:

On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Ziegler wrote:

Please, do not use Wikipedia as an argument. It can contain useful
information, but it may just as well contain utter nonsense. I may just go
there and change that 4 to 32. Maybe somebody will notice and correct
it, maybe not. You never know.
When reading anything on Wikipedia you should verify it against other
sources. That applies to other websites as well. But with Wikipedia you
have no clue who wrote it.

If you want to get a better idea of the quality of some Wikipedia
articles, search for Wikipedia and Seigenthaler in your favorite search
engine (preferably non-Google :).

One of the many results returned:
http://www.usatoday.com/news/opinion/editorials/2005-11-29-wikipedia-edit_x.htm

Thanks

Michal

···

On 6/16/06, Juergen Strobel <strobel@secure.at> wrote:

On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Ziegler wrote:
> On 6/15/06, Juergen Strobel <strobel@secure.at> wrote:
> [ snip essentially accurate information ]
>
> >UTF-8 encodes every Unicode code point as a variable length sequence
> >of 1 to 4 (I think) bytes.
>
> It could be up to six bytes at one point. However, I think that there
> is still support for surrogate characters meaning that a single glyph
> *might* take as many as eight bytes to represent in the 1-4 byte
> representation. Even with that, though, those are rare and usually
> user-defined (private) ranges IIRC. This also doesn't deal with
> (de)composed glyphs/combining glyphs.

No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for
all characters. Only Java may need more than that because of their use
of UTF-16 surrogates and special \0 handling in an intermediary step. See

Well, there is the official http://unicode.org/ site no one has
mentioned so far.

There's all sorts of technical information on Unicode.
http://www.unicode.org/reports/index.html

Including the latest version: Unicode 4.1.0

···

On 6/16/06, Juergen Strobel <strobel@secure.at> wrote:

On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Ziegler wrote:
> On 6/15/06, Juergen Strobel <strobel@secure.at> wrote:
> [ snip essentially accurate information ]
>
> >UTF-8 encodes every Unicode code point as a variable length sequence
> >of 1 to 4 (I think) bytes.
>
> It could be up to six bytes at one point. However, I think that there
> is still support for surrogate characters meaning that a single glyph
> *might* take as many as eight bytes to represent in the 1-4 byte
> representation. Even with that, though, those are rare and usually
> user-defined (private) ranges IIRC. This also doesn't deal with
> (de)composed glyphs/combining glyphs.

No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for
all characters. Only Java may need more than that because of their use
of UTF-16 surrogates and special \0 handling in an intermediary step. See

UTF-8 - Wikipedia

>
> >Currently Unicode requires 21 bit, but this has changed in the past.
>
> Yes. Unicode went from 16-bit (I think) to 32-bit to 21-bit.

...
>> It could be up to six bytes at one point. However, I think that there
>> is still support for surrogate characters meaning that a single glyph
>> *might* take as many as eight bytes to represent in the 1-4 byte
>> representation. Even with that, though, those are rare and usually
>> user-defined (private) ranges IIRC. This also doesn't deal with
>> (de)composed glyphs/combining glyphs.
>
>No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for
>all characters. Only Java may need more than that because of their use
>of UTF-16 surrogates and special \0 handling in an intermediary step. See

Austin's correct about six bytes, actually. The original UTF-8
specification *was* for up to six bytes:
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

However, no codepoints were ever defined in the upper part of the
range, and once Unicode was officially restricted to the range
1-0x10FFFF, there was no longer any need for the five- and six-byte
sequences.

Compare RFC 2279 from 1998 (six bytes)
RFC 2279 - UTF-8, a transformation format of ISO 10646
and RFC 3629 from 2003 (four bytes)
RFC 3629 - UTF-8, a transformation format of ISO 10646

I don't care who is technically correct here; that's not the point.

But when working on Unicode support for Ruby, I think it would be best
to focus on the new and current standard before worrying about whether
we should support obsoleted RFCs. We might take care to be open to future
changes alongside old ones, but that's hard to predict and I wouldn't
waste time guessing. And Ruby is much more dynamic and less vulnerable
to such changes than, for example, Java.

Jürgen

···

On Fri, Jun 16, 2006 at 05:27:04PM +0900, Paul Battley wrote:

On 15/06/06, Juergen Strobel <strobel@secure.at> wrote:
>On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Ziegler wrote:

That Java encoding (UTF-8-encoded UTF-16) isn't really UTF-8, though,
so you'd never get eight bytes in valid UTF-8:

  The definition of UTF-8 prohibits encoding character numbers between
  U+D800 and U+DFFF, which are reserved for use with the UTF-16
  encoding form (as surrogate pairs) and do not directly represent
  characters. (RFC 3629)

Paul.

--
The box said it requires Windows 95 or better so I installed Linux

Good point. Unfortunately, a lot of it is only available as PDFs of
each chapter. The bookmarks help:
http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html

The technical reports are really useful:
http://www.unicode.org/reports/index.html

Paul.

···

On 17/06/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:

Well, there is the official http://unicode.org/ site no one has
mentioned so far.

There's all sorts of technical information on Unicode.
Technical Reports

Including the latest version: Unicode 4.1.0

I don't care who is technically correct here, that's not the point.

On the contrary: it's exactly the point in a technical discussion of
the number of bytes taken by various encodings.

But when working on Unicode support for Ruby, I think it would be best
to focus on the new and current standard, before worrying if we should
support obsoleted RFCs.

No one suggested supporting obsolete RFCs. I compared the obsolete and
current RFCs precisely so that everyone could get a clearer idea of
what constitutes the current state of UTF-8 - which is what we should
support. I hope you agree that they are more reliable sources for
technical information than Wikipedia.

Paul.

···

On 17/06/06, Juergen Strobel <strobel@secure.at> wrote:

>I don't care who is technically correct here, that's not the point.

On the contrary: it's exactly the point in a technical discussion of
the number of bytes taken by various encodings.

The discussion is about a Unicode Roadmap for Ruby. The number of
bytes per UTF-8 encoded character is tangential to this.

>But when working on Unicode support for Ruby, I think it would be best
>to focus on the new and current standard, before worrying if we should
>support obsoleted RFCs.

No one suggested supporting obsolete RFCs. I compared the obsolete and
current RFCs precisely so that everyone could get a clearer idea of
what constitutes the current state of UTF-8 - which is what we should
support. I hope you agree that they are more reliable sources for
technical information than Wikipedia.

Paul.

If you can point to an official and *current* standard which proves my
statement of 1-4 bytes per plain UTF-8 encoded character false,
I'll concede the point. Please don't bring in combining characters;
you know what I mean by now. If you like, s/character/code point/g.

Merely pointing to a concise and actually well-written Wikipedia
article with a nice summary is way more informative than using
obsoleted RFCs to reinforce one's own argument. Besides, we all know
how reliable Wikipedia is. End of discussion from my side.

Jürgen

···

On Sat, Jun 17, 2006 at 06:05:20PM +0900, Paul Battley wrote:

On 17/06/06, Juergen Strobel <strobel@secure.at> wrote:

--
The box said it requires Windows 95 or better so I installed Linux

If you can point to an official and *current* standard which proves my
statement of 1-4 bytes per plain UTF-8 encoded character false,
I'll concede the point. Please don't bring in combining characters;
you know what I mean by now. If you like, s/character/code point/g.

You are correct. And why Wikipedia? www.unicode.org has it all:
UTR#17: Unicode Character Encoding Model and
following

···

Merely pointing to a concise and actually well-written Wikipedia
article with a nice summary is way more informative than using
obsoleted RFCs to reinforce one's own argument. Besides, we all know
how reliable Wikipedia is. End of discussion from my side.

Jürgen

--
The box said it requires Windows 95 or better so I installed Linux
