Unicode/multibyte string support in Ruby1.9/Ruby summary?

David_Garamond2 · 15 January 2005 14:13

If someone could summarize the recent Unicode/multibyte string discussion on a wiki, that would be nice (and _very_ useful). It will help programmers prepare their code for Unicode support and backward compatibility in the future. Topics should include:

- how will strings be stored in memory (which will probably be different between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

- how to check a string's charset, encoding;

- how to do various operations on the new multibyte string, especially those which will be done differently compared to the classic string;

- what will happen to the classic string (e.g. will it perhaps be renamed to ByteArray or something);

- comparison rules for cross-encoding and cross-charset strings;

- regexes;

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte string support (especially since Ruby is a relative latecomer to the Unicode scene);

Regards,
dave

Florian_Gross · 15 January 2005 17:46

David Garamond wrote:

If someone could summarize the recent Unicode/multibyte string discussion on a wiki, that would be nice (and _very_ useful). It will help programmers prepare their code for Unicode support and backward compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll try to answer the questions as accurately as possible.

- how will strings be stored in memory (which will probably be different between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple bytes for one character.) Note that the RString record of Ruby will get a new field for the encoding.
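For readers on a later Ruby, here is a minimal sketch of the "raw bytes, possibly several per character" point. It assumes Ruby 1.9 or newer as eventually released (not the plan under discussion in this thread), and the sample string is purely illustrative.

```ruby
# Assumes Ruby 1.9+; "\u00E9" is e-acute, which UTF-8 stores as two bytes.
s = "h\u00E9llo"

char_count = s.length       # counts characters
byte_count = s.bytesize     # counts the raw bytes held by the String
raw_bytes  = s.bytes.to_a   # the underlying byte values

puts "#{char_count} characters stored in #{byte_count} bytes: #{raw_bytes.inspect}"
```

The character count and the byte count differ because one of the five characters needs two bytes.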

- how to check a string's charset, encoding;

String#encoding. It will return a String.
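As a hedged illustration only: in Ruby 1.9 and newer as eventually released, String#encoding returns an Encoding object rather than a String, and the name is available through Encoding#name. The sketch below assumes such a Ruby; the sample string is made up.

```ruby
# Assumes Ruby 1.9+ as released; behaviour may differ from the 2005 plan.
s = "caf\u00E9"             # the \u escape yields a UTF-8 string

enc  = s.encoding           # an Encoding object
name = enc.name             # its name as a plain String

puts "encoding: #{name} (#{enc.class})"
```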

- how to do various operations on the new multibyte string, especially those which will be done differently compared to the classic string;

Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.

- what will happen to the classic string (e.g. will it perhaps be renamed to ByteArray or something);

The String interface will remain the same. Strings will just gain the encoding facilities, but will remain largely backwards compatible AFAIK.

- comparison rules for cross-encoding and cross-charset strings;

Strings that have the same encoding and the same bytes are equivalent.
Strings that have different but ASCII-compatible encodings and contain only ASCII characters are equivalent.
Everything else is different.

I think there will be ways for converting from one encoding to another one, but I don't know the details.
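The three comparison rules above, plus conversion, can be sketched as follows. This assumes Ruby 1.9 or newer as released, where String#encode transcodes and String#force_encoding only relabels the bytes; the variable names are illustrative.

```ruby
# Assumes Ruby 1.9+; a sketch of the comparison rules, not a specification.

# Rule 2: ASCII-only content in different ASCII-compatible encodings.
ascii_utf8   = "abc".encode("UTF-8")
ascii_latin1 = "abc".encode("ISO-8859-1")
ascii_equal  = (ascii_utf8 == ascii_latin1)

# Rule 3: identical bytes, but a different encoding label and non-ASCII content.
e_utf8       = "\u00E9"
e_relabelled = e_utf8.dup.force_encoding("ISO-8859-1")
relabelled_equal = (e_utf8 == e_relabelled)

# Conversion: transcoding changes the bytes, and converting back restores them.
e_latin1   = e_utf8.encode("ISO-8859-1")
round_trip = (e_latin1.encode("UTF-8") == e_utf8)

puts [ascii_equal, relabelled_equal, round_trip].inspect
```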

- regexes;

Regexp#encoding is introduced; matching uses rules similar to those for String comparison.
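A small sketch of how this looks in Ruby 1.9 and newer as released, under the assumption that a regexp containing non-ASCII characters is tied to its encoding: ASCII-only patterns match ASCII-compatible strings freely, while a mismatch involving non-ASCII data raises Encoding::CompatibilityError. The patterns and strings are invented for illustration.

```ruby
# Assumes Ruby 1.9+ as released.
ascii_re = /abc/                       # ASCII-only pattern
utf8_re  = Regexp.new("\u00E9")        # pattern with a non-ASCII UTF-8 character

latin1_ascii = "xabc".encode("ISO-8859-1")     # ASCII-only content
latin1_e     = "\u00E9".encode("ISO-8859-1")   # non-ASCII content

ascii_match = ascii_re =~ latin1_ascii   # works across ASCII-compatible encodings
utf8_match  = utf8_re =~ "caf\u00E9"     # same encoding, index counted in characters

mismatch =
  begin
    utf8_re =~ latin1_e
    :no_error
  rescue Encoding::CompatibilityError
    :incompatible
  end

puts [ascii_re.encoding, utf8_re.encoding, ascii_match, utf8_match, mismatch].inspect
```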

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte string support (especially since Ruby is a relative latecomer to the Unicode scene);

I can't really do an in-depth comparison here, because I don't know the other languages.

Note that str[0] will return a one-character String and that ?x will do the same. There will be a new method like String#code point for getting the underlying raw bytes. I think the one-character Strings can later still be optimized fairly easily so that they can be immediate Objects.
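For comparison, a hedged sketch of where this landed in Ruby 1.9 and newer as released: str[0] and ?x both give one-character Strings, String#ord and String#codepoints give code points, and String#bytes gives the raw bytes. The method names in the thread were still tentative, so treat this as an illustration of the released API only.

```ruby
# Assumes Ruby 1.9+ as released.
s = "\u00E9t\u00E9"               # three characters, the first and last non-ASCII

first   = s[0]                    # a one-character String, not an Integer
literal = ?x                      # also a one-character String

code        = first.ord           # code point of the first character
points      = s.codepoints.to_a   # code points of every character
first_bytes = first.bytes.to_a    # raw UTF-8 bytes of the first character

puts [first.class, literal, code, points, first_bytes].inspect
```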

ts1 · 15 January 2005 17:58

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
bytes for one character.) Note that the RString record of Ruby will get
a new field for the encoding.

Are you sure? Or have I not understood what you are trying to say?

Guy Decoux

David_Garamond2 · 15 January 2005 20:01

Florian Gross wrote:

David Garamond wrote:

If someone could summarize the recent Unicode/multibyte string discussion on a wiki, that would be nice (and _very_ useful). It will help programmers prepare their code for Unicode support and backward compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll try to answer the questions as accurately as possible.

Thanks for the answers, Florian. Yes, I was following the thread on ruby-core too, but forgot that this is ruby-talk.

I have created the first draft in RubyGarden:

http://www.rubygarden.org/ruby?UnicodeInRuby2

It's very raw and bare-bones (plus I'm an ASCII guy and totally clueless regarding multibyte/Unicode). I invite people to improve on it.

Thanks.

Regards,
dave

gabriele_renzi · 16 January 2005 11:41

Florian Gross wrote:

David Garamond wrote:

If someone could summarize the recent Unicode/multibyte string discussion on a wiki, that would be nice (and _very_ useful). It will help programmers prepare their code for Unicode support and backward compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll try to answer the questions as accurately as possible.

- how will strings be stored in memory (which will probably be different between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple bytes for one character.) Note that the RString record of Ruby will get a new field for the encoding.

- how to check a string's charset, encoding;

String#encoding. It will return a String.

- how to do various operations on the new multibyte string, especially those which will be done differently compared to the classic string;

Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.

- what will happen to the classic string (e.g. will it perhaps be renamed to ByteArray or something);

The String interface will remain the same. Strings will just gain the encoding facilities, but will remain largely backwards compatible AFAIK.

- comparison rules for cross-encoding and cross-charset strings;

Strings that have the same encoding and the same bytes are equivalent.
Strings that have different but ASCII-compatible encodings and contain only ASCII characters are equivalent.
Everything else is different.

I think there will be ways for converting from one encoding to another one, but I don't know the details.

- regexes;

Regexp#encoding is introduced; matching uses rules similar to those for String comparison.

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte string support (especially since Ruby is a relative latecomer to the Unicode scene);

I can't really do an in-depth comparison here, because I don't know the other languages.

Note that str[0] will return a one-character String and that ?x will do the same. There will be a new method like String#code point for getting the underlying raw bytes. I think the one-character Strings can later still be optimized fairly easily so that they can be immediate Objects.

An addition and two questions: the encoding of the source file will be indicated using the same approach as Python:
#!/usr/bin/ruby
# -*- coding: <encoding name> -*-

or by a command-line option (maybe -K) or at compile/configuration time.
But I wonder: why can't we keep using $KCODE for this instead of having to use that ugly magic string?

Also, not that I am an expert, but is localization supposed to work?
I.e., are accented letters, which are common in European languages, supposed to be capitalizable and such?
Isn't this related to a charset property of the string, distinct from its encoding?
IIRC in Parrot-land a string is a <stream of bytes>+<encoding>+<charset>+<language>; how come we only care about one of these things?

Also, given that this seems like a huge amount of work, will it be spun off into a proper independent libm17n library?

Yukihiro_Matsumoto2 · 15 January 2005 18:36

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?" on Sun, 16 Jan 2005 02:58:20 +0900, ts <decoux@moulon.inra.fr> writes:

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
bytes for one character.) Note that the RString record of Ruby will get
a new field for the encoding.

Are you sure? Or have I not understood what you are trying to say?

He's right, except that the encoding will be stored using the FL_USER
flags or an instance variable of the string.

matz.

Nobuyoshi_Nakada · 16 January 2005 14:11

Hi,

At Sun, 16 Jan 2005 20:41:08 +0900,
gabriele renzi wrote in [ruby-talk:126677]:

An addition and two questions: the encoding of the source file will be
indicated using the same approach as Python:
#!/usr/bin/ruby
# -*- coding: <encoding name> -*-

or by a command-line option (maybe -K) or at compile/configuration time.
But I wonder: why can't we keep using $KCODE for this instead of having
to use that ugly magic string?

Since encodings may vary per file, -K would not be enough.
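To make the per-file point concrete, here is a rough sketch assuming Ruby 1.9 or newer as released, where the magic comment sets each file's source encoding and `__ENCODING__` reports it. The file names and the `$source_encodings` global are hypothetical, introduced only for this demonstration.

```ruby
require "tmpdir"

# Hypothetical collector, filled in by the generated files below.
$source_encodings = {}

Dir.mktmpdir do |dir|
  # Two throwaway source files, each declaring its own encoding.
  { "utf8_file.rb" => "utf-8", "latin1_file.rb" => "iso-8859-1" }.each do |name, coding|
    path = File.join(dir, name)
    File.write(path, "# -*- coding: #{coding} -*-\n" \
                     "$source_encodings[#{name.inspect}] = __ENCODING__\n")
    load path   # each file is parsed with the encoding its own comment declares
  end
end

$source_encodings.each { |file, enc| puts "#{file}: #{enc}" }
```

A single process-wide switch could not describe both files at once, which is the limitation of -K pointed out above.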

--
Nobu Nakada

ts1 · 16 January 2005 10:59

He's right, except that the encoding will be stored using the FL_USER
flags or an instance variable of the string.

My question was precisely about

"RString record of Ruby will get a new field"

i.e. I've read ruby_m17n

Guy Decoux

Yukihiro_Matsumoto2 · 16 January 2005 14:00

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?" on Sun, 16 Jan 2005 19:59:30 +0900, ts <decoux@moulon.inra.fr> writes:

He's right, except that the encoding will be stored using the FL_USER
flags or an instance variable of the string.

My question was precisely about

"RString record of Ruby will get a new field"

i.e. I've read ruby_m17n

I know that you know. It's just for the rest of us.

matz.
