Unicode/multibyte string support in Ruby1.9/Ruby summary?

If someone could summarize the recent Unicode/multibyte string discussion on a wiki, that would be nice (and _very_ useful). It will help programmers prepare their code for Unicode support and backward compatibility in the future. Topics should include:

- how will strings be stored in memory (which will probably be different between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

- how to check a string's charset, encoding;

- how to do various operations in the new multibyte string, especially those which will be done differently compared to the classic string;

- what will happen to the classic string (e.g. will it perhaps be renamed to ByteArray or something);

- comparison rules for cross-encoding and cross-charset strings;

- regexes;

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte string support (especially since Ruby is pretty much a latecomer in the Unicode scene);

Regards,
dave

David Garamond wrote:

If someone could summarize the recent Unicode/multibyte string discussion on a wiki, that would be nice (and _very_ useful). It will help programmers prepare their code for Unicode support and backward compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll try to answer the questions as accurately as possible.

- how will strings be stored in memory (which will probably be different between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple bytes for one character.) Note that the RString record of Ruby will get a new field for the encoding.
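
A minimal sketch of the bytes-versus-characters distinction, assuming the Ruby 1.9+ API that eventually shipped (`String#bytesize` and `String#bytes` are not named in this thread):

```ruby
# "\u00e9" is e-acute; the \u escape yields a UTF-8 string regardless of
# the encoding of the source file itself.
word = "caf\u00e9"

char_count = word.length      # counts characters
byte_count = word.bytesize    # counts the raw bytes actually stored
raw_bytes  = word.bytes.to_a  # the stored bytes as integers

puts char_count         # 4
puts byte_count         # 5
puts raw_bytes.inspect  # [99, 97, 102, 195, 169]
```

The last character takes two bytes in UTF-8, which is why the two counts differ.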

- how to check a string's charset, encoding;

String#encoding. It will return a String.
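
For comparison, in the Ruby 1.9 that eventually shipped, `String#encoding` returns an `Encoding` object rather than a String; the name is available through `Encoding#name`. A small sketch under that assumption:

```ruby
text = "caf\u00e9"          # the \u escape makes this a UTF-8 string

enc = text.encoding
puts enc.class              # Encoding
puts enc.name               # UTF-8

is_utf8 = (enc == Encoding::UTF_8)  # compare against the predefined constant
valid   = text.valid_encoding?      # checks the bytes are well-formed for the tag
puts is_utf8                # true
puts valid                  # true
```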

- how to do various operations in the new multibyte string, especially those which will be done differently compared to the classic string;

Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
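
As a rough illustration, assuming Ruby 1.9 or later, the familiar methods operate on characters rather than bytes once the string carries a multibyte encoding:

```ruby
word = "na\u00efve caf\u00e9"          # "naive cafe" with i-diaeresis and e-acute

replaced   = word.gsub(/\u00e9/, "e")  # substitute one whole character
reversed   = word.reverse              # reverses characters, not bytes
c_position = word.index("c")           # character index, not byte offset
first_five = word[0, 5]                # slices by characters

puts replaced
puts reversed
puts c_position   # 6
puts first_five
```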

- what will happen to the classic string (e.g. will it perhaps be renamed to ByteArray or something);

The String interface will remain the same. Strings will just gain encoding facilities, but will remain largely backwards compatible AFAIK.

- comparison rules for cross-encoding and cross-charset strings;

Strings that have the same encoding and the same bytes are equivalent.
Strings that have different but ASCII-compatible encodings and contain only ASCII characters are equivalent.
Everything else is different.

I think there will be ways for converting from one encoding to another one, but I don't know the details.
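
A sketch of how these rules look in the Ruby 1.9+ that shipped. `String#encode` and `String#force_encoding` are the released method names and are not taken from this thread:

```ruby
# ASCII-only content in two different ASCII-compatible encodings: equal.
ascii_utf8   = "abc".encode("UTF-8")
ascii_latin1 = "abc".encode("ISO-8859-1")
same_ascii   = (ascii_utf8 == ascii_latin1)

# The same character in two encodings has different bytes: not equal.
e_utf8    = "\u00e9"
e_latin1  = e_utf8.encode("ISO-8859-1")   # converts the bytes
same_char = (e_utf8 == e_latin1)

# Identical non-ASCII bytes under different encoding tags: not equal.
e_cp1252   = e_latin1.dup.force_encoding("Windows-1252")
same_bytes = (e_latin1 == e_cp1252)

# Converting back to a common encoding makes the strings comparable again.
round_trip = (e_latin1.encode("UTF-8") == e_utf8)

puts same_ascii   # true
puts same_char    # false
puts same_bytes   # false
puts round_trip   # true
```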

- regexes;

Regexp#encoding is introduced; matching uses rules similar to those for String comparison.
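
A small sketch of this behaviour, assuming Ruby 1.9 or later. The `Encoding::CompatibilityError` exception is what the released interpreter raises and is not mentioned in this thread:

```ruby
ascii_re = /abc/          # ASCII-only literal
utf8_re  = /caf\u00e9/    # the \u escape fixes this regexp to UTF-8

puts ascii_re.encoding    # US-ASCII
puts utf8_re.encoding     # UTF-8

latin1_text = "abc caf\u00e9".encode("ISO-8859-1")

# An ASCII-only regexp can match a string in any ASCII-compatible encoding.
ascii_hit = (ascii_re =~ latin1_text)

# A UTF-8 regexp against non-ASCII Latin-1 text is rejected.
mismatch = begin
  utf8_re =~ latin1_text
  false
rescue Encoding::CompatibilityError
  true
end

puts ascii_hit   # 0
puts mismatch    # true
```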

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte string support (especially since Ruby is a pretty latecomer in the Unicode scene);

I can't really do an in-depth comparison here, because I don't know the other languages.

Note that str[0] will return a one-character String and that ?x will do the same. There will be a new method like String#codepoint for getting the underlying raw bytes. I think the one-character Strings can later still be optimized fairly easily so that they can be immediate objects.
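
For reference, a sketch against the released Ruby 1.9+ API, where the relevant methods ended up being called `String#ord`, `String#codepoints` and `String#bytes` (names that do not appear in this thread):

```ruby
word = "\u00e9t\u00e9"          # three characters: e-acute, t, e-acute

first   = word[0]               # a one-character String, not an Integer
literal = ?x                    # also a one-character String

code   = first.ord              # code point of the first character
points = word.codepoints.to_a   # code points of every character
bytes  = first.bytes.to_a       # raw bytes behind the first character

puts first.class      # String
puts literal          # x
puts code             # 233
puts points.inspect   # [233, 116, 233]
puts bytes.inspect    # [195, 169]
```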

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
bytes for one character.) Note that the RString record of Ruby will get
a new field for the encoding.

Are you sure? Or have I not understood what you are trying to say?

Guy Decoux

Florian Gross wrote:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll try to answer the questions as accurately as possible.

Thanks for the answers, Florian. Yes I was following the thread on ruby-core too, but forgot that this is ruby-talk.

I have created the first draft in RubyGarden:

  http://www.rubygarden.org/ruby?UnicodeInRuby2

It's very raw and bare-bones (plus I'm an ASCII guy and totally clueless regarding multibyte/Unicode). I invite people to improve on it.

Thanks.

Regards,
dave

Florian Gross wrote:

An addition and two questions: the encoding of the source file will be indicated with the same approach as Python:
  #!/usr/bin/ruby
  # -*- coding: <encoding name> -*-

or a command-line option (maybe -K) or compile-time configuration.
But I wonder: why can't we keep using $KCODE for this instead of having to use that ugly magic string?

Also, not that I am an expert, but is localization supposed to work?
I.e., are accented letters, which are common in European languages, supposed to be able to be capitalized and such?
Isn't this related to a charset property of the string, distinct from its encoding?
IIRC in parrot-land a string is a <stream of bytes>+<encoding>+<charset>+<language>; how come we only care about one of these things?
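
For what it is worth, a sketch of how case mapping behaves in much later releases, assuming Ruby 2.4 or newer (earlier versions only changed the case of ASCII letters). The `:turkic` option is the released API's way of selecting language-specific rules and is not something discussed in this thread:

```ruby
word = "\u00e9t\u00e9"               # e-acute, t, e-acute

upper     = word.upcase              # full Unicode case mapping on Ruby 2.4+
plain_i   = "i".upcase               # default rules
turkish_i = "i".upcase(:turkic)      # Turkic rules give a dotted capital I (U+0130)

puts upper
puts plain_i
puts turkish_i
```

The language is passed per call here; the string itself still carries only an encoding.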

Also, given that this seems like a huge amount of work, will it be spun off into a proper independent libm17n library? :)

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?" on Sun, 16 Jan 2005 02:58:20 +0900, ts <decoux@moulon.inra.fr> writes:

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
bytes for one character.) Note that the RString record of Ruby will get
a new field for the encoding.

Are you sure? Or have I not understood what you are trying to say?

He's right, except that the encoding will be stored using the FL_USER
flags or an instance variable of the string.
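
Seen from the Ruby side, this means the encoding is a tag kept alongside the bytes. A minimal sketch, assuming the `String#force_encoding` method of the released Ruby 1.9+ (not named in this thread), which changes the tag and leaves the bytes alone:

```ruby
utf8     = "\u00e9"                                  # two bytes, tagged UTF-8
retagged = utf8.dup.force_encoding("ISO-8859-1")     # same bytes, new tag

same_bytes = (utf8.bytes.to_a == retagged.bytes.to_a)

puts same_bytes        # true
puts utf8.length       # 1
puts retagged.length   # 2
puts retagged.encoding # ISO-8859-1
```

Only the interpretation of the bytes changes, so the character count differs while the stored data does not.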

              matz.

Hi,

At Sun, 16 Jan 2005 20:41:08 +0900,
gabriele renzi wrote in [ruby-talk:126677]:

An addition and two questions: the encoding of the source file will be
indicated with the same approach as Python:
  #!/usr/bin/ruby
  # -*- coding: <encoding name> -*-

or a command-line option (maybe -K) or compile-time configuration.
But I wonder: why can't we keep using $KCODE for this instead of having to use
that ugly magic string?

Encodings may vary per file, so -K would not be enough.
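
A minimal sketch of that point, assuming the magic-comment syntax and the `__ENCODING__` keyword of the released Ruby 1.9+. The helper `source_encoding_for` and the global `$seen_encoding` are hypothetical names used only for this illustration:

```ruby
require "tempfile"

# Hypothetical helper: writes a tiny script that starts with the given magic
# comment, loads it, and returns the source encoding that script reported.
def source_encoding_for(coding)
  Tempfile.create(["magic", ".rb"]) do |f|
    f.write("# -*- coding: #{coding} -*-\n")
    f.write("$seen_encoding = __ENCODING__\n")
    f.flush
    load f.path
  end
  $seen_encoding
end

latin1 = source_encoding_for("iso-8859-1")
utf8   = source_encoding_for("utf-8")

puts latin1   # ISO-8859-1
puts utf8     # UTF-8
```

Each loaded file reports its own source encoding within the same process, which a single global switch could not express.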

--
Nobu Nakada

He's right, except that the encoding will be stored using the FL_USER
flags or an instance variable of the string.

My question was precisely about

  "RString record of Ruby will get a new field"

i.e. I've read ruby_m17n :)

Guy Decoux

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?" on Sun, 16 Jan 2005 19:59:30 +0900, ts <decoux@moulon.inra.fr> writes:

He's right, except that the encoding will be stored using the FL_USER
flags or an instance variable of the string.

My question was precisely about

"RString record of Ruby will get a new field"

i.e. I've read ruby_m17n :)

I know that you know. It's just for the rest of us.

              matz.