A few good articles on Unicode

To add a little fuel to the discussion (and to help dispel some rumors,
myths, and legends about Unicode) I present you with Tim Bray's 4-part
trilogy of articles on Unicode, why it's important, and why you should use
it. The first article provides a nice overview, even mentioning some of the
political and technical difficulties of CJK languages and Unicode (as well
as the previously-mentioned gaiji). The second article discusses character
strings in general. The third, perhaps most relevant to the Ruby Unicode
discussion, is an exploration of characters versus bytes, and how the various
encodings work. The fourth article discusses Java's use of UTF-16
internally, and why that may be a good or bad thing.
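
For anyone who wants to poke at the characters-versus-bytes distinction directly, here is a minimal sketch. It assumes a Ruby with per-string encodings (1.9 or later), where String#length counts characters; on 1.8 it counts bytes. The variable names are only for illustration.

```ruby
# "héllo" with one non-ASCII character (é is U+00E9).
# The \u escape keeps this file pure ASCII; the resulting string is UTF-8.
word = "h\u00e9llo"

char_count  = word.length                       # number of characters
utf8_bytes  = word.bytesize                     # number of bytes in the UTF-8 form
utf16_bytes = word.encode("UTF-16LE").bytesize  # same text re-encoded as UTF-16

puts "characters:   #{char_count}"
puts "UTF-8 bytes:  #{utf8_bytes}"
puts "UTF-16 bytes: #{utf16_bytes}"

# The raw UTF-8 bytes: é takes two of them (195, 169).
p word.bytes
```

The same five characters take six bytes in UTF-8 and ten in UTF-16, which is the gap the third article spends most of its time on.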

At any rate, they're entertaining to read and cleared up a number of my own
questions about Unicode. Perhaps they will help the rest of us in the Ruby
community to understand Unicode as well.

Part 1: On the Goodness of Unicode -
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
Part 2: On Character Strings -
http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings
Part 3: Characters vs. Bytes -
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
Part 4: Programming Languages and Text -
http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings

And while not directly related, Tim also fiddled with a fully
Unicode-aware UTF-8 string class in Java with many of the typical
C string operations (strcpy, strstr, ...). Some of the logic he uses for his
byte-vector-as-Unicode-string might be applicable to Ruby as well:

Yooster (Ustr): http://www.tbray.org/ongoing/When/200x/2003/05/17/Yooster

···

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ jruby.sourceforge.net
Application Architect @ www.ventera.com

Excellent! I'm particularly interested to learn more about the pros and cons of using UTF-16 internally for all strings (Java) versus being able to specify a different encoding for each string object (Ruby 2.0).
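
To make the per-string idea concrete, here is a rough sketch of what it looks like in a Ruby where every String carries its own encoding (1.9 and later). This only illustrates the model; it says nothing about how either runtime is implemented, and the variable names are made up for the example.

```ruby
# Three CJK characters (U+65E5 U+672C U+8A9E), written with escapes
# so the source stays ASCII. The literal is a UTF-8 string.
utf8  = "\u65e5\u672c\u8a9e"

# The same text converted to UTF-16; each string remembers its own encoding.
utf16 = utf8.encode(Encoding::UTF_16LE)

puts utf8.encoding    # encoding tag of the first string
puts utf16.encoding   # encoding tag of the second string

# Same character count, different byte counts.
puts utf8.length
puts utf16.length
puts utf8.bytesize    # three bytes per character in UTF-8
puts utf16.bytesize   # two bytes per character in UTF-16

# Strings in different encodings do not compare equal until one is converted.
same_raw       = (utf8 == utf16)
same_converted = (utf16.encode(Encoding::UTF_8) == utf8)
puts same_raw
puts same_converted
```

For this text UTF-16 is the more compact form (6 bytes against 9), while for mostly-ASCII text UTF-8 wins. With a single internal encoding that choice is made once for every string; with per-string encodings it becomes the programmer's call, at the cost of explicit conversions like the one above.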

Thanks for sharing,

Daesan

Dae San Hwang
daesan@gmail.com

···

On Jun 16, 2006, at 1:15 AM, Charles O Nutter wrote:

The fourth article discusses Java's use of UTF-16
internally, and why that may be a good or bad thing.

While we are at it, also see
"The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)"

···

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

And it's probably worth mentioning that O'Reilly has a 678-page book on
Unicode coming to bookstores by the end of the month:

http://www.oreilly.com/catalog/unicode/index.html

HTH,
Keith

···

On Thursday 15 June 2006 3:50 pm, Christian Neukirchen wrote:

While we are at it, also see