State of unicode support

I've heard rumors that "oniguruma fixes everything", and the like. I'm
sure that's a touch of hyperbole, but in any case:

What's the current state of Unicode support in Ruby? My recollection is
that Unicode support is somewhat lacking.

···

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
Brian K. Reid: "In computer science, we stand on each other's feet."

Oh man, I really don't have the energy for this thread again :) Chad: if you
get a straight answer about this, let me know. Others: Is there a simple,
straightforward FAQ entry somewhere that says "to use Unicode you have the
following choices"? This keeps coming up.

--
Contribute to RubySpec! @ Welcome to headius.com
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn
JRuby Developer @ www.jruby.org
Application Architect @ www.ventera.com

This isn't a complete answer, but it's the best I can do to help Chad out.
If you really want to solve the question now, Chad, I'd read Julian Tarkhanov's
UNICODE_PRIMER[1].

First, Oniguruma[2] is a regular expression engine. It supports Unicode regular
expressions under many encodings, and it's very handy. If all you want to do is
search strings for Unicode text, then great, use it.

Ruby's strings are not Unicode-aware. There is a library called 'jcode', which
comes with Ruby and tries to help out, but it's very simple and only good for a
few things like counting characters and iterating through characters. Again,
UTF-8 only.

Ruby itself also understands UTF-8 regular expressions to a degree, using the
'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of
str.scan(/./u), which returns an array of strings, each string containing one
(possibly multibyte) character. (Also: str.unpack('U*').)
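
For anyone who hasn't seen the two idioms side by side, here is a minimal sketch. It is written for a current Ruby rather than 1.8 (the "\u" string escape is newer syntax), and the sample string is just an illustration.

```ruby
# "h\u00E9llo" is "héllo"; the é takes two bytes in UTF-8.
str = "h\u00E9llo"

chars      = str.scan(/./u)     # array of one-character strings
codepoints = str.unpack('U*')   # array of integers, one per code point

puts chars.inspect
puts codepoints.inspect
puts str.bytesize               # byte count, for comparison with chars.length
```

Both results have five elements, while the string itself is six bytes long.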

If you are using Unicode strings in Rails, check out Julian's unicode_hacks
plugin: <http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/>
They have a channel on irc.freenode.net: #multibyte_rails.

The unicode_hacks plugin is interesting in that it tries to load one of several
Ruby unicode extensions before falling back to str.unpack('U*') mode.

Here are the extensions it prefers, in order:

* icu4r: a Ruby extension to IBM's ICU library. Adds UString, URegexp, etc.
  classes for containing Unicode stuffs.
  (project page[3] and docs[4])
* utf8proc: a small library for iterating through characters and converting
  ints to code points. Adds String#utf8map and Integer#utf8, for example.
  (download[5])
* unicode: a little extension by Yoshida Masato which adds Unicode class
  methods for `strcmp`, `[de]compose`, normalization and case conversion for
  utf-8.
  (download[6] and readme[7])

So, many options, some massive, but most only partial and in their infancy.
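
The "try the best extension first, then fall back" idea can be sketched roughly as below. This is not the plugin's actual code: the method name pick_unicode_backend is hypothetical, and the require names are assumptions that simply reuse the extension names from the list above.

```ruby
# Try each candidate library in order. Return the name of the first one that
# loads, or :unpack when none is installed (the str.unpack('U*') fallback).
# The method name and the require names are illustrative assumptions.
def pick_unicode_backend(candidates)
  candidates.each do |lib|
    begin
      require lib
      return lib
    rescue LoadError
      next   # not installed, try the next candidate
    end
  end
  :unpack
end

backend = pick_unicode_backend(%w[icu4r utf8proc unicode])
puts "Using backend: #{backend}"
```

On a machine with none of the extensions installed this reports the :unpack fallback.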

The most recent entrant into this race, though, is Nikolai Weibull's
ruby-character-encoding library, which aims to get complete multibyte support
into Ruby 1.8's string class. If you use it, it will probably break a lot of
libraries which are used to strings acting the way they do now.
He is trying to emulate the Ruby 2.0 Unicode plans outlined by Matz.[8]

Nevertheless, it is a very promising library and Nikolai is working at
break-neck pace to appease the nations, all tongues and peoples.[9] And
discussion is here[10] with links to the mailing list and all that.

This might be a landslide of information, but it's better than spending all day
Googling and extracting tarballs and poring through READMEs just to get a
picture of what's happening these days.

Signed in elaborate calligraphy with a picture of grapes at the end,

_why

[1] http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/UNICODE_PRIMER
[2] http://www.geocities.jp/kosako3/oniguruma/
[3] http://rubyforge.org/projects/icu4r/
[4] http://icu4r.rubyforge.org/
[5] flexiguided.de
[6] http://www.yoshidam.net/Ruby.html
[7] http://www.yoshidam.net/unicode.txt
[8] http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html
[9] http://git.bitwi.se/?p=ruby-character-encodings.git;a=summary
[10] http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html

···

On Sat, Jul 29, 2006 at 01:08:06AM +0900, Charles O Nutter wrote:

Oh man, I really don't have the energy for this thread again :) Chad: if you
get a straight answer about this, let me know. Others: Is there a simple,
straightforward FAQ entry somewhere that says "to use Unicode you have the
following choices"? This keeps coming up.

This might be a landslide of information, but it's better than spending all day
Googling and extracting tarballs and poring through READMEs just to get a
picture of what's happening these days.

That was most excellent. Thank you for your kind assistance: it answers
my question quite well, and I appreciate your effort.

Signed in elaborate calligraphy with a picture of grapes at the end,

. . . and as always, you manage to entertain in the process.

···

On Sat, Jul 29, 2006 at 04:13:04AM +0900, why the lucky stiff wrote:

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"The first rule of magic is simple. Don't waste your time waving your
hands and hopping when a rock or a club will do." - McCloctnick the Lucid

So, the problem with Unicode support in Ruby is that the code
currently assumes that each letter is one byte, instead of multiple?
This includes presumably search algorithms (for Regexs, et al), then?

Or is my understanding warped and wrong?

_Why, et al, if you could break down the actual difficulties with
implementing Unicode support into Ruby 1.8, I think that might clear
up the questions we have as to whether a library eradicates all
problems (obviously, some problems can't be fixed, but merely hacked
or worked around).

Cheers, folks; remember to be nice. We're on the same team.

M.T.

Very nice; it should be on a wiki somewhere under the bold, flashing
headline "WHAT'S UP WITH UNICODE IN RUBY". Thank you!

--
Contribute to RubySpec! @ Welcome to headius.com
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn
JRuby Developer @ www.jruby.org
Application Architect @ www.ventera.com

Spectacular summary. As a lurker on this thread,
I greatly appreciate it.

···

Er, uh, well, it doesn't do Unicode properties, so you can't use things like \p{L}, which, once you've found them, quickly come to feel essential. Anytime you write [a-zA-Z] in a regex, you've probably just uttered a bug. So I would say that Oniguruma has holes.

Otherwise, a very useful landslide indeed. -Tim

···

Which is actually useless, because this breaks your string between codepoints, not between characters. ICU4R currently resolves this, as does a library posted
on ruby-talk a while ago (with proper text boundary handling).
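
To make the codepoint-versus-character distinction concrete, here is a small sketch. It assumes a current Ruby (String#grapheme_clusters needs 2.5 or later), which is well beyond what the 1.8 discussed in this thread offers.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, but one
# character as far as a reader is concerned.
str = "caf" + "e\u0301"

by_codepoint = str.scan(/./u)          # splits between code points
by_grapheme  = str.grapheme_clusters   # splits between user-visible characters

puts by_codepoint.length
puts by_grapheme.length
```

The scan splits the accent off from its base letter, so it reports one more element than the grapheme-aware split.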

···

On 28-jul-2006, at 21:13, why the lucky stiff wrote:

Ruby itself also understands UTF-8 regular expressions to a degree. Using the
'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of:
str.scan(/./u), which returns an array of strings, each string containing a
multibyte character. (Also: str.unpack('U*').)

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

So, the problem with Unicode support in Ruby is that the code
currently assumes that each letter is one byte, instead of multiple?
This includes presumably search algorithms (for Regexs, et al), then?

Or is my understanding warped and wrong?

Regexes in 1.8 can do utf-8.

_Why, et al, if you could break down the actual difficulties with
implementing Unicode support into Ruby 1.8, I think that might clear
up the questions we have as to whether a library eradicates all
problems (obviously, some problems can't be fixed, but merely hacked
or worked around).

The problem is with compatibility. In 1.8 it is expected that strings
are arrays of bytes. You can split them to characters with a regex or
convert into a sequence of codepoints. But no standard library or
function would understand that (except the single one that is there
for undoing the transformation).
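
A minimal sketch of that round trip, assuming a current Ruby for the "\u" escapes: unpack('U*') turns the string into code points, and Array#pack('U*') is the one function that undoes it. The byte-wise reversal at the end is only there to show what goes wrong when a string is treated as an array of bytes.

```ruby
str = "na\u00EFve"                      # "naïve"

codepoints = str.unpack('U*')           # array of integers, one per code point
reversed   = codepoints.reverse.pack('U*')

puts reversed                           # reversed character by character

# Reversing the raw bytes instead splits the two-byte ï and leaves garbage.
byte_reversed = str.bytes.reverse.pack('C*').force_encoding('UTF-8')
puts byte_reversed.valid_encoding?
```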

So you have the choice to work with utf-8 strings and regexes, and
whenever you want characters convert the strings so that you get to
characters.

Or you can use a special unicode string class (such as from icu4r)
that no standard functions understand. Some may be able to do to_s but
you get a normal string then.

Or you can change the strings to handle utf-8 (or any other multibyte)
characters, and probably break most of the standard functions.

None of these is completely satisfactory because it is far from
_transparent_ unicode support in the standard string class. That is
planned for 2.0.

Thanks

Michal

···

On 7/28/06, Matt Todd <chiology@gmail.com> wrote:

Tim Bray wrote:

···

On Jul 28, 2006, at 12:13 PM, why the lucky stiff wrote:

First, Oniguruma[2] is a regular expression engine. It supports Unicode regular
expressions under many encodings, it's very handy. If all you want to do is
search strings for Unicode text, then great, use it.

Er uh well it doesn't do unicode properties so you can't use things like \p{L}

Off topic, what does/would that do? Match a lower-case symbol?

--
Alex

> Ruby itself also understands UTF-8 regular expressions to a
> degree. Using the
> 'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of:
> str.scan(/./u), which returns an array of strings, each string
> containing a
> multibyte character. (Also: str.unpack('U*').)

Which is actually useless because this breaks your string between
codepoints, not between characters. ICU4R currently resolves this, as
well as a library posted on ruby-talk a while ago (with proper text
boundary handling).

Whilst it's certainly useless for a lot of tasks, I'm not sure that
Ruby is any worse than other languages in this regard. As far as I'm
aware, most languages that 'support' Unicode don't handle grapheme
clusters without using additional libraries.

I, for one, am very saddened every time the topic comes up because I'm
sick of the brokenness (I actually start looking at these Other
Languages and Other Frameworks that take l10n and i18n seriously).

Actually, that's a really good idea. Which languages/frameworks have
you found that actually do it right? We could learn from their
example.

Paul.

···

On 31/07/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:

Unicode characters have named properties. "L" means it's a letter. There are sub-properties like Lu and Ll for upper and lower case. There are lots more properties for things like being numbers, being white-space, combining forms and particular properties of Asian characters and so on. Tremendously useful in regexes, particularly for those of us round-eye gringos who are prone to write [a-zA-Z] and think we're matching letters, which we're not. If you don't support properties, you don't support Unicode. -Tim
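
Here is a short sketch of the difference. It assumes a current Ruby, whose regex engine does support \p{...} properties, unlike the Oniguruma build discussed above; the sample words are just illustrations.

```ruby
word = "caf\u00E9"   # "café"

ascii_only = /\A[a-zA-Z]+\z/
any_letter = /\A\p{L}+\z/

puts(word =~ ascii_only ? "[a-zA-Z] matches" : "[a-zA-Z] misses")
puts(word =~ any_letter ? "\\p{L} matches" : "\\p{L} misses")

# Sub-properties: Lu is uppercase letters, Ll is lowercase letters.
uppercase = "\u00C9cole".scan(/\p{Lu}/)   # "École"
puts uppercase.inspect
```

The ASCII class rejects a perfectly ordinary word, while \p{L} accepts it.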

···

Whilst it's certainly useless for a lot of tasks, I'm not sure that
Ruby is any worse than other languages in this regard. As far as I'm
aware, most languages that 'support' Unicode don't handle grapheme
clusters without using additional libraries.

AFAIK Python regexps do that properly, and ICU does for sure (both as free iterators and regexps).

Actually, that's a really good idea. Which languages/frameworks have
you found that actually do it right? We could learn from their
example.

To my knowledge you are intimately familiar with the subject so I take it as sarcasm.

But if you really feel like being constructive you can update the Unicode gem (which you promised about a month ago) :-))

···

On 31-jul-2006, at 17:48, Paul Battley wrote:

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Tim Bray wrote:

···

On Jul 31, 2006, at 7:52 AM, Alex Young wrote:

First, Oniguruma[2] is a regular expression engine. It supports Unicode regular
expressions under many encodings, it's very handy. If all you want to do is
search strings for Unicode text, then great, use it.

Er uh well it doesn't do unicode properties so you can't use things like \p{L}

Off topic, what does/would that do? Match a lower-case symbol?

Unicode characters have named properties. "L" means it's a letter. There are sub-properties like Lu and Ll for upper and lower case. There are lots more properties for things like being numbers, being white-space, combining forms and particular properties of Asian characters and so on. Tremendously useful in regexes, particularly for those of us round-eye gringos who are prone to write [a-zA-Z] and think we're matching letters, which we're not. If you don't support properties, you don't support Unicode. -Tim

Gotcha. Thanks for that.

--
Alex

That's one of the reasons why you _need_ tables when working with Unicode, and you _will_ spend memory on them. What Ruby does now is nowhere near, and Matz wrote that he hasn't included complete tables for Oniguruma in 1.9 yet.

With proper regex support other funky things become possible, for instance {all_cyrillic_letters} in a regex etc.
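
As a rough illustration of the {all_cyrillic_letters} idea: a current Ruby (an assumption, since this thread is about 1.8 and early 1.9) spells it as the script property \p{Cyrillic}. The sample text is made up.

```ruby
text = "Ruby \u0420\u0443\u0431\u0438"   # "Ruby Руби"

# \p{Cyrillic} matches any character in the Cyrillic script.
cyrillic_words = text.scan(/\p{Cyrillic}+/)

puts cyrillic_words.inspect
```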

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

> Whilst it's certainly useless for a lot of tasks, I'm not sure that
> Ruby is any worse than other languages in this regard. As far as I'm
> aware, most languages that 'support' Unicode don't handle grapheme
> clusters without using additional libraries.

AFAIK Python regexps do that properly, and ICU does for sure (both as
free iterators and regexps).

That's what I mean: ICU is a separate library, not part of a language
core. We can use ICU in Ruby too - it's still pre-alpha and not
seamless, but the possibility exists. From what I've read, Python
doesn't do the heavyweight stuff natively, either. (Please tell me if
I'm wrong - I don't use Python.)

> Actually, that's a really good idea. Which languages/frameworks have
> you found that actually do it right? We could learn from their
> example.

To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.

I'm not being sarcastic at all, though perhaps I could have phrased it
better. It's just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we'd have some useful
yardsticks. You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?

But if you really feel like being constructive you can update the
Unicode gem (which you promised about a month ago) :-))

I promised I'd try :) Thanks for the reminder, though! I'll get on with it.

Paul.

···

On 31/07/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:

On 31-jul-2006, at 17:48, Paul Battley wrote:

> Whilst it's certainly useless for a lot of tasks, I'm not sure that
> Ruby is any worse than other languages in this regard. As far as I'm
> aware, most languages that 'support' Unicode don't handle grapheme
> clusters without using additional libraries.

AFAIK Python regexps do that properly, and ICU does for sure (both as
free iterators and regexps).

That's what I mean: ICU is a separate library, not part of a language
core.

PHP took the best of both - they are integrating ICU into the core. Although I always hated
their tendency to bloat the core, this is one of the cases of bloat that I would want to applaud as a gesture
of sanity and common sense.

We can use ICU in Ruby too - it's still pre-alpha and not
seamless, but the possibility exists.

Except for the fact that the maintainer has abandoned it and nobody has stepped in. I don't do C.

From what I've read, Python
doesn't do the heavyweight stuff natively, either. (Please tell me if
I'm wrong - I don't use Python.)

It depends on what you call "heavyweight". For the purists out there, I gather, even including a complete Unicode table with
codepoint properties might be "heavyweight".

To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.

I'm not being sarcastic at all, though perhaps I could have phrased it
better. It's just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we'd have some useful
yardsticks.

The problem being, my "Right Examples" are nowhere near others' "Right Examples", which in turn spurs flamewars.
My "right example" is simple - Unicode on no terms, no encoding choice, characters only - but most are already dissatisfied with such
an attitude, and the issue has been discussed in detail with no solution satisfying all parties being devised. Too much compromise.

You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?

ICU in all its incarnations (Java and C), compulsory character-oriented Strings without choice of encoding in Java and the upcoming Unicode support in Python (again - compulsory Unicode for all strings, byte arrays for everything else). Perl's regex support. I know everyone will disagree (how do I match a PNG header in a string???) but that's what I consider good.

As to localization - resource bundles are good, and of course I consider good all languages that _did_ bother to print localized dates. Shame on Ruby.

But if you really feel like being constructive you can update the
Unicode gem (which you promised about a month ago) :-))

I promised I'd try :) Thanks for the reminder, though! I'll get on with it.

Gotcha :)

···

On 31-jul-2006, at 18:51, Paul Battley wrote:

On 31/07/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:

On 31-jul-2006, at 17:48, Paul Battley wrote:

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Paul Battley wrote:

> Actually, that's a really good idea. Which languages/frameworks have
> you found that actually do it right? We could learn from their
> example.

To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.

I'm not being sarcastic at all, though perhaps I could have phrased it
better. It's just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we'd have some useful
yardsticks. You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?

I second that. I see a lot of people asking for "transparent" unicode support but I don't see how that is possible. To me it's like asking for a language that has transparent bug recovery. I know that ruby has weaknesses when it comes to multibyte encodings, but the main problem is human in nature; too many people assume that char==byte, which results in bugs when someone unexpectedly uses "weird" characters. IMHO no amount of "transparent support" will change that. But I would love to be shown otherwise with examples of languages that "do it right".
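
A small sketch of the char==byte bug described above, assuming a current Ruby where String#length counts characters and String#byteslice cuts by bytes. The name is just a sample value.

```ruby
name = "Zo\u00EB"                     # "Zoë": 3 characters, 4 bytes in UTF-8

puts name.length                      # characters
puts name.bytesize                    # bytes

# "Keep the first 3 characters" done with a byte-oriented cut splits the
# two-byte ë in half and leaves an invalid string behind.
cut_by_bytes = name.byteslice(0, 3)
cut_by_chars = name[0, 3]

puts cut_by_bytes.valid_encoding?
puts cut_by_chars.valid_encoding?
```

With plain ASCII input both cuts agree, which is exactly why this kind of bug tends to go unnoticed until a "weird" character shows up.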

Daniel

By transparent I mean that I can iterate, compare, match, index, ...
not only bytes but also at least code points (and grapheme clusters if
somebody is so nice and implements that - but for me it is not very
important now). Using the standard string class that all standard
functions accept.

In ruby 1.8 working with anything but bytes is like scratching your
right ear with your left hand .. or leg.

Thanks

Michal
