State of Unicode support

>>
>> > Whilst it's certainly useless for a lot of tasks, I'm not sure that
>> > Ruby is any worse than other languages in this regard. As far as I'm
>> > aware, most languages that 'support' Unicode don't handle grapheme
>> > clusters without using additional libraries.
>>
>> AFAIK Python regexps do that properly, and ICU does for sure (both as
>> free iterators and regexps).
>
> That's what I mean: ICU is a separate library, not part of a language
> core.
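
For readers unfamiliar with the term, here is a minimal sketch of what "handling grapheme clusters" means. It assumes a later Ruby (1.9 or newer) than the one discussed in this thread, where strings carry an encoding and the regexp engine supports `\X`; the variable names are made up for illustration.

```ruby
# Assumption: Ruby 1.9+ with a UTF-8 string; \X matches one grapheme cluster.
text = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Two code points are stored in the string...
code_points = text.codepoints.to_a

# ...but a reader perceives a single character, i.e. one grapheme cluster.
clusters = text.scan(/\X/)

puts code_points.length
puts clusters.length
```

Code that iterates by code point would split the accent from its base letter; iterating by cluster keeps them together.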

PHP took the best of both - they are integrating ICU into the core.
Although I always hated their tendency to bloat the core, this is one
of the cases of bloat that I would want to applaud as a gesture of
sanity and common sense.

Last time I looked ICU was in C++. Requiring a C++ compiler and
runtime is quite a bit of bloat :)

> We can use ICU in Ruby too - it's still pre-alpha and not
> seamless, but the possibility exists.

Except for the fact that the maintainer has abandoned it and nobody
has stepped in. I don't do C.

> From what I've read, Python
> doesn't do the heavyweight stuff natively, either. (Please tell me if
> I'm wrong - I don't use Python.)

It depends on what you call "heavyweight". For the purists out there,
I gather, even including a complete Unicode table with codepoint
properties might be "heavyweight".

I am not sure how large that might be. But if it is about the size of
the interpreter including the rest of the standard libraries I would
consider it "heavyweight". It would be a reason to start "optional
standard libraries" I guess :)

>>
>> To my knowledge you are intimately familiar with the subject so I
>> take it as sarcasm.
>
> I'm not being sarcastic at all, though perhaps I could have phrased it
> better. It's just that all Unicode discussions in Ruby end up going
> round and round in circles; if we as a community could identify some
> first-class examples of Doing It Right, I think we'd have some useful
> yardsticks.

The problem is that my "Right Examples" are nowhere near others'
"Right Examples", which in turn spurs flamewars.
My "right example" is simple - Unicode on no terms, no encoding
choice, characters only - but most are already dissatisfied with such
an attitude, and the issue has been discussed in detail without any
solution that satisfies all parties being devised. Too much compromise.

It has also been said that giving more options does not stop you from
using only Unicode. If your "right example" is only about restricting
choice then there is really not much to it.

The "right examples" people were interested in are probably more like
the libraries/languages that implement enough functionality to give
you full Unicode support for your definition of "full".

Thanks

Michal

On 7/31/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:

On 31-jul-2006, at 18:51, Paul Battley wrote:
> On 31/07/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
>> On 31-jul-2006, at 17:48, Paul Battley wrote:

Daniel DeLorme wrote:

I second that. I see a lot of people asking for "transparent" Unicode support but I don't see how that is possible. To me it's like asking for a language that has transparent bug recovery. I know that Ruby has weaknesses when it comes to multibyte encodings, but the main problem is human in nature; too many people assume that char==byte, which results in bugs when someone unexpectedly uses "weird" characters. IMHO no amount of "transparent support" will change that. But I would love to be shown otherwise with examples of languages that "do it right".
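
The char==byte assumption can be sketched in a few lines. This is only an illustration and assumes a later Ruby (1.9.3 or newer), where `String` tracks its encoding and `byteslice` exists; the variable names are hypothetical.

```ruby
# Assumption: Ruby 1.9.3+; the \u escape yields a UTF-8 string.
word = "h\u00E9llo"          # "héllo"; U+00E9 takes two bytes in UTF-8

char_count = word.length     # number of characters
byte_count = word.bytesize   # number of bytes

# Truncating by bytes cuts the two-byte character in half...
broken = word.byteslice(0, 2)
# ...while truncating by characters keeps it whole.
intact = word[0, 2]

puts char_count
puts byte_count
puts broken.valid_encoding?
puts intact.valid_encoding?
```

With plain ASCII input both counts agree and both slices are valid, which is why this class of bug tends to surface only once "weird" characters show up.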

It can be done. Java gets it almost right, and in such a way that most people will never stub their toes on the flaws. Python, it seems, is going to get it right next time around. It's clearly possible to do Unicode correctly. What Matz wants is much harder: a String type that can contain strings of characters from arbitrary character sets in arbitrary encodings, Unicode being just one special case, and also serve as a byte buffer.

  -Tim
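
As a rough sketch of the kind of String described above (per-string encoding plus use as a byte buffer), here is how it looks in later Ruby versions (1.9 and newer), which are assumed here and postdate this thread. It is not a statement of what was planned at the time; the variable names are made up.

```ruby
# Assumption: Ruby 1.9+, where every String is tagged with an encoding.
utf8   = "caf\u00E9"                        # UTF-8, 4 characters, 5 bytes
latin1 = utf8.encode("ISO-8859-1")          # transcoded: 4 characters, 4 bytes
raw    = utf8.dup.force_encoding("BINARY")  # same bytes, viewed as a byte buffer

puts utf8.encoding.name
puts latin1.encoding.name
puts latin1.bytesize
puts raw.length
```

`encode` converts the bytes to another encoding, whereas `force_encoding` only relabels the existing bytes, so `raw` has one "character" per byte.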

> Last time I looked ICU was in C++. Requiring a C++ compiler and
> runtime is quite a bit of bloat :)

It still is. And it's huge and takes ages to build. If I knew of something much lighter and better, I would have dismissed ICU.

> I am not sure how large that might be. But if it is about the size of
> the interpreter including the rest of the standard libraries I would
> consider it "heavyweight". It would be a reason to start "optional
> standard libraries" I guess :)

I'm stopping right here. Unicode is not an option.

> It has also been said that giving more options does not stop you from
> using only Unicode.

In 90% of the cases giving more options means programmers ignore
Unicode, for reasons ranging from speed to ignorance. My experience
as a user over the years has proven it.

But then again, I stop right here. And I urge you to do the same :)

On 1-aug-2006, at 12:05, Michal Suchanek wrote:

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl