Unicode roadmap?

In my opinion, Ruby is practically useless for many applications without
proper Unicode support. How a modern language can ignore this issue is
really beyond me.

Is there a plan to get Unicode support into the language anytime soon?

···

--
Posted via http://www.ruby-forum.com/.

Hi,

In my opinion, Ruby is practically useless for many applications without
proper Unicode support. How a modern language can ignore this issue is
really beyond me.

Define "proper Unicode support" first.

Is there a plan to get Unicode support into the language anytime soon?

I'm planning to enhance Unicode support in 1.9 in a year or so
(finally). But I'm not sure that conforms to your definition of "proper
Unicode support". Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.

              matz.
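
To make the Regexp remark concrete, here is a minimal sketch of character-level operations built only on a UTF-8 aware pattern. The helper names (utf8_chars, utf8_length, utf8_reverse) are made up for illustration and are not part of Ruby; the sketch assumes the input really is UTF-8.

```ruby
# Hypothetical helpers, assuming the input string holds UTF-8 text.
# The /u flag makes the pattern UTF-8 aware; /m lets "." match newlines.

def utf8_chars(str)
  str.scan(/./mu)                # one array element per character
end

def utf8_length(str)
  utf8_chars(str).size           # characters, not bytes
end

def utf8_reverse(str)
  utf8_chars(str).reverse.join   # reverse by character, not by byte
end
```

With these, a word such as "Grüße" counts as five characters even though it occupies seven bytes in UTF-8.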

···

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:

Roman Hausner wrote:

In my opinion, Ruby is practically useless for many applications without
proper Unicode support. How a modern language can ignore this issue is
really beyond me.

Is there a plan to get Unicode support into the language anytime soon?

I also think that this is very important.

···

--
Posted via http://www.ruby-forum.com/.

Hello, everyone. I am sorry, I was a bit embarrassed by the quantity of
text in this discussion and I may not have read it carefully enough to
figure out the answer, and the discussion itself seems to be a year
old, so I've decided to ask:

Finally, is there a convenient support for Unicode in Ruby? Or, if not,
when will it be?

I am going to develop an international website (with pages in some
European languages, including those using non-Latin alphabets). I think
it should prove to be a good idea to make such a website totally in
Unicode (probably UTF-16), without using any legacy encodings at all.
The DBMS I am going to use is Oracle 10g (Express edition until I hit
its limitations).

I would also like to ask when the next Ruby release is planned. If
it comes this year, I should probably try nightly builds, as it seems
wise to start a new project targeting an early-access version of the next release.

Thanks in advance.

···

--
Posted via http://www.ruby-forum.com/.

Define "proper Unicode support" first.

having a Unicode equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin... but, alas, who wants to write stuff like 'normalize_KC' etc. if you just want the frickin' substring of a string?!

you need to read books on unicode just to properly use the plugin...
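
For what it is worth, a code-point based size and slice can be sketched with nothing but unpack and pack. The names u_size and u_slice are hypothetical, the input is assumed to be valid UTF-8, and this counts code points only: a letter followed by a combining accent still counts as two, which is the kind of case normalization exists for.

```ruby
# Hypothetical helpers; they assume str holds valid UTF-8.

def u_size(str)
  str.unpack("U*").size            # one integer per code point
end

def u_slice(str, start, len)
  points = str.unpack("U*")[start, len]
  points && points.pack("U*")      # nil when start is out of range
end
```

So u_slice("Grüße", 2, 2) gives "üß" rather than a fragment cut in the middle of a multi-byte character.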

aargg :-((

Best regards
Peter

Yukihiro Matsumoto wrote:

···

Hi,

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:

>In my opinion, Ruby is practically useless for many applications without
>proper Unicode support. How a modern language can ignore this issue is
>really beyond me.

Define "proper Unicode support" first.

>Is there a plan to get Unicode support into the language anytime soon?

I'm planning to enhance Unicode support in 1.9 in a year or so
(finally). But I'm not sure that conforms to your definition of "proper
Unicode support". Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.

              matz.

Yukihiro Matsumoto wrote:

Hi,

>In my opinion, Ruby is practically useless for many applications without
>proper Unicode support. How a modern language can ignore this issue is
>really beyond me.

Define "proper Unicode support" first.
  

I won't define "proper Unicode support" here.

But there must be a problem somewhere, since pure-Ruby Ferret doesn't support UTF-8. You need to use the C extension of Ferret to have it support UTF-8 (which doesn't work on Windows yet :-( ). I don't know if that is just a sucky implementation in Ferret or if it's Ruby that makes it so.

Maybe Dave Balmain can enlighten us why UTF-8 doesn't work in the pure Ruby version and what is needed of Ruby to make it work (if it's actually Ruby's fault that is)?

My personal belief is that it should just work in a case like this, if the data going in is UTF-8 and the search strings are UTF-8, without the lib author and/or user having to do anything very special to make it work (apart from specifying the encoding). Am I wrong in this?

Regards,

Marcus

···

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:

There are a lot of answers to that question, and I strongly suggest
you search as this is a hotly debated discussion.

Google is more useful for searching this than ruby-forum.com. You will
find out when there will be a new release, and the current state of
Unicode.

-austin

···

On 5/31/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:

Hello, everyone. I am sorry, I was a bit embarrassed by the quantity of
text in this discussion and I may not have read it carefully enough to
figure out the answer, and the discussion itself seems to be a year
old, so I've decided to ask:

Finally, is there a convenient support for Unicode in Ruby? Or, if not,
when will it be?

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

Well, Ruby 1.9 (which is due in December) will have some Unicode
support. (So you'll have a `chars` method on strings, like with
Rails.) Matz is working on it right now even, as he posted that he
was tooling around with string.c earlier this week on his blog.

That is, nothing's been checked in yet. Because he wants it to be
good, you see?

_why

···

On Fri, Jun 01, 2007 at 05:29:31AM +0900, Ivan Mashchenko wrote:

Finally, is there a convenient support for Unicode in Ruby? Or, if not,
when will it be?

It depends on your definition of 'convenient'.

The short answer is that unicode applications can be made in Ruby,
particularly Web Apps. It is not especially difficult, but it is not
'for free' or seamless. You generally have to use an encoding-aware
string type, or modify the existing string class to support multi-byte
characters.

A longer answer would contain references to the fact that there are
multiple options here, that web apps (Rails in particular) are ahead
of pure Ruby in terms of Unicode, and that there are actually a lot
of projects to investigate.

The hardest part of Ruby and Unicode is that not all of the libraries
support it, or that some of the meta-hackery to the string class
could break libraries that expect chars.length to equal bytes.length
(there are other examples). Some of the more popular libraries are
like this, or they inherit the encoding from your O/S settings and
cannot be driven from an API.
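
A small sketch of the kind of breakage meant here, assuming UTF-8 input. The helper names are invented for the example: truncate_bytes stands in for library code that treats a string as a byte array, truncate_chars for a character-aware version, and utf8_valid? is a rough validity check built on unpack.

```ruby
# Hypothetical helpers illustrating byte-based vs character-based truncation.

def truncate_bytes(str, n)
  str.unpack("C*")[0, n].pack("C*")   # cuts after n bytes, wherever that lands
end

def truncate_chars(str, n)
  str.unpack("U*")[0, n].pack("U*")   # cuts after n code points
end

def utf8_valid?(str)
  str.unpack("U*")                    # raises ArgumentError on malformed UTF-8
  true
rescue ArgumentError
  false
end
```

Truncating "Grüße" to three bytes leaves half of the "ü" behind, so the result is no longer valid UTF-8, while truncating to three characters gives "Grü".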

I am going to develop an international website (with pages in some
European languages, including those using non-Latin alphabets). I think
it should prove to be a good idea to make such a website totally in
Unicode (probably UTF-16), without using any legacy encodings at all.

Well yes, but I would use UTF-8 instead. It's the Unicode encoding best suited to the
web (and UTF-16 is a bit weird in some ways - there are at least 3 kinds
of UTF-16 that I am aware of).
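
To illustrate the UTF-16 variants, here is a sketch that prints the bytes of one character in each encoding. It assumes a Ruby with encoding-aware strings (String#encode, i.e. 1.9 or later, not the 1.8 discussed in this thread); hex_bytes and encodings_of are made-up helper names.

```ruby
# Hypothetical helpers; needs String#encode (Ruby 1.9 or later).

def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

def encodings_of(text)
  ["UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"].map do |name|
    [name, hex_bytes(text.encode(name))]
  end
end

# "\u00E9" is the letter e with an acute accent.
encodings_of("\u00E9").each { |name, hex| puts "#{name.ljust(8)} #{hex}" }
```

The same character comes out as C3 A9 in UTF-8, 00 E9 in UTF-16BE and E9 00 in UTF-16LE, and Ruby's plain "UTF-16" encoder prepends a byte order mark (FE FF). A consumer has to know which of these it is being handed, which is much less of an issue with UTF-8.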

Rails 1.2 introduced some pretty impressive support for Unicode in the
last release; all of the major i18n plugins should be compatible with
these changes by now.

I would also like to ask when the next Ruby release is planned. If
it comes this year, I should probably try nightly builds, as it seems
wise to start a new project targeting an early-access version of the next release.

AFAIK there is no release schedule. YARV is basically Ruby 1.9, and it
is scheduled for release around the end of the year. However there is no
firm commitment to make it the next Ruby version. Also Ruby 1.9 is going
to break/deprecate stuff - I wouldn't develop against it, it will be a
rough experience.
Ruby 1.9 is kind of a staging release; migrating from 1.8 -> 1.9 is going
to be tricky, but 1.9 -> 2.0 should be a drop-in. That's the intention: isolate
the biggest changes to the 1.9 release.

If you are moving to Ruby 1.9, do it with a complete working application.
Or better still, develop against Rails versions, not Ruby versions. Let the
Rails team figure out the best Ruby migration strategy for you.

···

On 5/31/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:

Define "proper Unicode support" first.

having a Unicode equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin... but, alas, who wants to write stuff like 'normalize_KC' etc. if you just want the frickin' substring of a string?!

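def substring(str, start, len)
   # /m lets . match newlines; /u makes the regexp UTF-8 aware
   md = str.match(/\A.{#{start}}(.{#{len}})/mu)
   md && md[1]   # nil when the string is too short
end

def strlength(str)
   # count UTF-8 characters by scanning one character at a time
   str.scan(/./mu).size
end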

See! Regexps do everything!

Just you know, set $KCODE and use these methods and you are set!

(I am kidding... btw)

···

On Jun 13, 2006, at 6:34 PM, Pete wrote:

you need to read books on unicode just to properly use the plugin...

aargg :-((

Best regards
Peter

Yukihiro Matsumoto wrote:

Hi,

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:

>In my opinion, Ruby is practically useless for many applications without
>proper Unicode support. How a modern language can ignore this issue is
>really beyond me.

Define "proper Unicode support" first.

>Is there a plan to get Unicode support into the language anytime soon?

I'm planning to enhance Unicode support in 1.9 in a year or so
(finally). But I'm not sure that conforms to your definition of "proper
Unicode support". Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.

              matz.

I suspect the Japanese posters on this list can answer better than I can,
but my impression is that Unicode is, shall we say, not highly thought of
outside Europe and North America. The way they dealt with "Chinese"
characters was apparently more than a bit of a hack, and just doesn't work
very well in the real world. Reading some of the explanations for glyphs
versus characters in Unicode just makes you shake your head. What were they
thinking? Sure doesn't pass the smell test, although I'll be the first to
admit I haven't exactly thought deeply about the subject.

There's another problem with Japanese - I've got a friend who's been dealing
with some issues around the fact that Japanese apparently innovates new
characters on a regular basis, and everyone is expected to use the new
characters. (I believe this is called gaiji). The concept of a fixed
character set apparently just isn't a good idea to start with.

[Awaiting corrections from people who actually know something about this
topic :-)...]

- James Moore

If it helps any, I've moved ~2000 web pages in an internal work project that had mixed UTF-8/CP-1252 (in the content, not just between pages) and Ruby handled it very gracefully. I was using 1.8.5-p12 and Hpricot (but not Hpricot's encoding features, which last I checked are broken) for the process.

While I'm certainly not an authority on the subject, I've thoroughly battle-tested this and I have a high degree of confidence that it works. Certainly better than Perl and libxml2, which was our original implementation.

···

On 2007-05-31 15:30:50 -0700, "Austin Ziegler" <halostatue@gmail.com> said:

On 5/31/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:

Hello, everyone. I am sorry, I was a bit embarrassed by the quantity of
text in this discussion and I may not have read it carefully enough to
figure out the answer, and the discussion itself seems to be a year
old, so I've decided to ask:

Finally, is there a convenient support for Unicode in Ruby? Or, if not,
when will it be?

There are a lot of answers to that question, and I strongly suggest
you search as this is a hotly debated discussion.

Google is more useful for searching this than ruby-forum.com. You will
find out when there will be a new release, and the current state of
Unicode.

Richard Conroy wrote:

It depends on your definition of 'convenient'.

IMHO convenient is as in C#. There I don't have to bother about how
strings are stored in memory; they just work and are international.

Well yes, but I would use UTF-8 instead.

Won't there be a problem if the data is stored in UTF-16 (as far as I
know, Oracle's NVARCHAR uses 16 bits per symbol)?

Also Ruby 1.9 is going to break/deprecate stuff - I wouldn't develop against it
migrating from 1.8 -> 1.9 is going to be tricky

So why should anyone develop a new project against 1.8 if it is going to
be deprecated?

If you are moving to Ruby 1.9, do it with a complete working
application.

But isn't it going to be tricky, as you've said?

I don't have to be moving for now as I have no line of Ruby code (I have
only an idea in my head) for today. And no Ruby experience (I am a C++,
C#, Java and T-SQL developer). I've chosen Ruby as it seems almost good
and free.

Have I understood you correctly - you think I should make it Ruby 1.8
and then do a tricky move when it comes?

Or better still, develop against Rails versions, not Ruby versions.

This advice can prove useful. I'll think about it.

···

--
Posted via http://www.ruby-forum.com/.

From the theoretical point of view this is quite interesting. Also I understand the humor :-)

Performance and memory consumption should be breathtaking using regexp just everywhere...

Also there are a _few_ methods left :-)

As I am German the 'missing' unicode support is one of the greatest obstacles for me (and probably all other Germans doing their stuff seriously)...

Logan Capaldo wrote:

···

On Jun 13, 2006, at 6:34 PM, Pete wrote:

Define "proper Unicode support" first.

having a Unicode equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin... but, alas, who wants to write stuff like 'normalize_KC' etc. if you just want the frickin' substring of a string?!

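def substring(str, start, len)
  # /m lets . match newlines; /u makes the regexp UTF-8 aware
  md = str.match(/\A.{#{start}}(.{#{len}})/mu)
  md && md[1]   # nil when the string is too short
end

def strlength(str)
  # count UTF-8 characters by scanning one character at a time
  str.scan(/./mu).size
end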

See! Regexps do everything!

Just you know, set $KCODE and use these methods and you are set!

(I am kidding... btw)

you need to read books on unicode just to properly use the plugin...

aargg :-((

Best regards
Peter

Yukihiro Matsumoto wrote:

Hi,

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:

>In my opinion, Ruby is practically useless for many applications without
>proper Unicode support. How a modern language can ignore this issue is
>really beyond me.

Define "proper Unicode support" first.

>Is there a plan to get Unicode support into the language anytime soon?

I'm planning to enhance Unicode support in 1.9 in a year or so
(finally). But I'm not sure that conforms to your definition of "proper
Unicode support". Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.

              matz.

There is a good summary of the Han unification controversy on Wikipedia:

    Han unification - Wikipedia

···

On 6/14/06, James Moore <banshee@banshee.com> wrote:

I suspect the Japanese posters on this list can answer better than I can,
but my impression is that Unicode is, shall we say, not highly thought of
outside Europe and North America. The way they dealt with "Chinese"
characters was apparently more than a bit of a hack, and just doesn't work
very well in the real world. Reading some of the explanations for glyphs
versus characters in Unicode just makes you shake your head. What were they
thinking? Sure doesn't pass the smell test, although I'll be the first to
admit I haven't exactly thought deeply about the subject.

There's another problem with Japanese - I've got a friend who's been dealing
with some issues around the fact that Japanese apparently innovates new
characters on a regular basis, and everyone is expected to use the new
characters. (I believe this is called gaiji). The concept of a fixed
character set apparently just isn't a good idea to start with.

[Awaiting corrections from people who actually know something about this
topic :-)...]

I have one Japanese person here who's never heard of this gaiji concept. But it could be new and behind a generation gap of some kind. They do sure like to add symbols where they can, though. Especially graphical star characters. I see that a lot.
-Mat

···

On Jun 13, 2006, at 7:56 PM, James Moore wrote:

There's another problem with Japanese - I've got a friend who's been dealing
with some issues around the fact that Japanese apparently innovates new
characters on a regular basis, and everyone is expected to use the new
characters. (I believe this is called gaiji). The concept of a fixed
character set apparently just isn't a good idea to start with.

[Awaiting corrections from people who actually know something about this
topic :-)...]

Richard Conroy wrote:

> It depends on your definition of 'convenient'.

IMHO convenient is as in C#. There I don't have to bother about how
strings are stored in memory; they just work and are international.

It's not *that* convenient. By default Ruby strings are 8-bit (byte strings). You can make
them Unicode-aware very easily by setting $KCODE (plus the jcode library, IIRC), and they
will behave as Unicode in a way that you don't have to think about. You don't
have to use a different string type.

The problem occurs when you use code that you didn't write that expects
strings to be single-byte. So every time you evaluate a Ruby library, Rails
plugin or gem, you have to do more homework than you would in the
Unicode-centric Java or C#.

> Well yes, but I would use UTF-8 instead.

Won't there be a problem if the data is stored in UTF-16 (as far as I
know, Oracle's NVARCHAR uses 16 bits per symbol)?

Every database worth using lets you specify the encoding of your string
and character types. Check your manuals or the Oracle forums. Anything
that is in any way associated with web development supports UTF-8.
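
If the storage side really does hand back UTF-16 bytes, the usual approach is to convert once at the boundary and keep UTF-8 everywhere else. A minimal sketch, assuming a Ruby with String#encode (1.9 or later): from_db and to_db are hypothetical names, and the "UTF-16BE" default is only an assumption about what a driver might return (many drivers transcode for you).

```ruby
# Hypothetical boundary helpers; requires encoding-aware strings (Ruby 1.9+).
# The "UTF-16BE" default is an assumption, not a statement about any driver.

def from_db(raw_bytes, db_encoding = "UTF-16BE")
  # Label the bytes with their real encoding, then transcode to UTF-8.
  raw_bytes.dup.force_encoding(db_encoding).encode("UTF-8")
end

def to_db(text, db_encoding = "UTF-16BE")
  # Transcode application text into the storage encoding.
  text.encode(db_encoding)
end
```

The application then only ever sees UTF-8, whatever the column type uses internally.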

> Also Ruby 1.9 is going to break/deprecate stuff - I wouldn't develop against it
> migrating from 1.8 -> 1.9 is going to be tricky

So why should anyone develop a new project against 1.8 if it is going to
be deprecated?

Okay, you misunderstood me. There is a feature roadmap towards Ruby 2.0,
where major changes are coming in; the two biggest that I recall are Unicode
support and native/pre-emptive threads. The only reasonable way to implement
them is by altering the behaviour of core classes and the standard library.

This will mean that Ruby code of any sophistication written for Ruby 1.8,
including many libraries, is likely to break.

Ruby 1.8 is not going away. Ruby is an open language, with a public source
repository. Compare that with .NET, say, where Microsoft distributes the runtime in
binary-only form and can make older versions difficult to get. You have no
obligation to migrate to the most recent version, and there is no technical
reason that multiple runtimes (application specific) cannot co-exist on the
same machine.

Chasing the latest release is really something that you only do with commercial
languages. It's not something that is generally done with open languages.

> If you are moving to Ruby 1.9, do it with a complete working
> application.

But isn't it going to be tricky, as you've said?

It would be one hell of a lot easier than developing against a moving
target, not knowing if the issues in your code are your issues or
due to the latest release candidate.

Bleeding edge software development is for people who can spare a
lot of blood loss.

I don't have to be moving for now as I have no line of Ruby code (I have
only an idea in my head) for today. And no Ruby experience (I am a C++,
C#, Java and T-SQL developer). I've chosen Ruby as it seems almost good
and free.

Yeah, it's a great language. Make a point of checking out the JRuby project.
It's an exceptionally well developed Ruby runtime; it is considerably more
than an interpreter or language bridge - the JRuby guys have basically
doubled the size of the Java platform (or Ruby platform depending on POV).
Ruby is strong where Java is weak, and vice versa.

Have I understood you correctly - you think I should make it Ruby 1.8
and then do a tricky move when it comes?

Use Rails, where the most compelling features in Ruby 1.9/2.0 are already
present: Unicode, native concurrency (via processes) and good performance
(via all those <foo> caching mechanisms). When the Rails guys go to Ruby 1.9,
you can too.

> Or better still, develop against Rails versions, not Ruby versions.

This advice can prove useful. I'll think about it.

regards,
Richard.

···

On 6/1/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:

As I am German the 'missing' unicode support is one of the greatest
obstacles for me (and probably all other Germans doing their stuff
seriously)...

The same goes for Russians/Ukrainians. In our programming communities the question
"does the programming language support Unicode natively?" has very high
priority.

BTW, this is one of the areas where Python beats Ruby completely.

V.

···

From: Pete [mailto:pertl@gmx.org]
Sent: Wednesday, June 14, 2006 1:58 AM

Objective-C (through the Cocoa framework) also handles Unicode superbly. Problem is, it is not cross-platform and is in fact strictly OS X stuff, but you could indeed use those libraries (NSString, etc...) through RubyCocoa, but of course that is far from convenient or optimal for most purposes.

Ideally, if major OS vendors got behind Ruby full force and put their Unicode know-how into the codebase, things would be smoother. They're the ones who really have already figured out pretty good ways to handle that stuff, and all the major scripting languages could benefit from it.

···

On Jun 1, 2007, at 9:23 AM, Richard Conroy wrote:

On 6/1/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:

Richard Conroy wrote:

> It depends on your definition of 'convenient'.

IMHO convenient is as in C#. There I don't have to bother about how
strings are stored in memory; they just work and are international.

It's not *that* convenient. By default Ruby strings are 8-bit (byte strings). You can make
them Unicode-aware very easily by setting $KCODE (plus the jcode library, IIRC), and they
will behave as Unicode in a way that you don't have to think about. You don't
have to use a different string type.

The problem occurs when you use code that you didn't write that expects
strings to be single-byte. So every time you evaluate a Ruby library, Rails
plugin or gem, you have to do more homework than you would in the
Unicode-centric Java or C#.

> Well yes, but I would use UTF-8 instead.

Won't there be a problem if the data is stored in UTF-16 (as far as I
know, Oracle's NVARCHAR uses 16 bits per symbol)?

Every database worth using lets you specify the encoding of your string
and character types. Check your manuals or the Oracle forums. Anything
that is in any way associated with web development supports UTF-8.

> Also Ruby 1.9 is going to break/deprecate stuff - I wouldn't develop against it
> migrating from 1.8 -> 1.9 is going to be tricky

So why should anyone develop a new project against 1.8 if it is going to
be deprecated?

Okay, you misunderstood me. There is a feature roadmap towards Ruby 2.0,
where major changes are coming in; the two biggest that I recall are Unicode
support and native/pre-emptive threads. The only reasonable way to implement
them is by altering the behaviour of core classes and the standard library.

This will mean that Ruby code of any sophistication written for Ruby 1.8,
including many libraries, is likely to break.

Ruby 1.8 is not going away. Ruby is an open language, with a public source
repository. Compare that with .NET, say, where Microsoft distributes the runtime in
binary-only form and can make older versions difficult to get. You have no
obligation to migrate to the most recent version, and there is no technical
reason that multiple runtimes (application specific) cannot co-exist on the
same machine.

Chasing the latest release is really something that you only do with commercial
languages. It's not something that is generally done with open languages.

> If you are moving to Ruby 1.9, do it with a complete working
> application.

But isn't it going to be tricky, as you've said?

It would be one hell of a lot easier than developing against a moving
target, not knowing if the issues in your code are your issues or
due to the latest release candidate.

Bleeding edge software development is for people who can spare a
lot of blood loss.

I don't have to be moving for now as I have no line of Ruby code (I have
only an idea in my head) for today. And no Ruby experience (I am a C++,
C#, Java and T-SQL developer). I've chosen Ruby as it seems almost good
and free.

Yeah, it's a great language. Make a point of checking out the JRuby project.
It's an exceptionally well developed Ruby runtime; it is considerably more
than an interpreter or language bridge - the JRuby guys have basically
doubled the size of the Java platform (or Ruby platform depending on POV).
Ruby is strong where Java is weak, and vice versa.

Have I understood you correctly - you think I should make it Ruby 1.8
and then do a tricky move when it comes?

Use Rails, where the most compelling features in Ruby 1.9/2.0 are already
present: Unicode, native concurrency (via processes) and good performance
(via all those <foo> caching mechanisms). When the Rails guys go to Ruby 1.9,
you can too.

> Or better still, develop against Rails versions, not Ruby versions.

This advice can prove useful. I'll think about it.

regards,
Richard.

Hi,

···

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

From: Pete [mailto:pertl@gmx.org]
Sent: Wednesday, June 14, 2006 1:58 AM

As I am German the 'missing' unicode support is one of the greatest
obstacles for me (and probably all other Germans doing their stuff
seriously)...

The same goes for Russians/Ukrainians. In our programming communities the question
"does the programming language support Unicode natively?" has very high
priority.

Alright, then what specific features are you (both) missing? I don't
think it is a method to get the number of characters in a string. It
can't be THAT crucial. I do want to cover "your missing features" in
the future M17N support in Ruby.

              matz.