David Garamond wrote:
If someone could summarize the recent Unicode/multibyte string discussion on a wiki, that would be nice (and _very_ useful). It will help programmers prepare their code for Unicode support and backward compatibility in the future. Topics should include:
Note that lots of this was recently discussed in [ruby-core:04146]. I'll try to answer the questions as accurately as possible.
- how will strings be stored in memory (which will probably be different between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);
AFAIK just as raw bytes, as before. (Encodings such as UTF-8 can use multiple bytes for one character.) Note that Ruby's RString struct will get a new field for the encoding.
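To make the bytes-versus-characters distinction concrete, here is a small sketch. It assumes a Ruby with the encoding support as it was eventually released (1.9 or later); the method names `bytesize` and `bytes` come from that release, not from the discussion summarized here.

```ruby
# "\u00e9t\u00e9" is the word "été"; \u escapes always yield a UTF-8 String.
s = "\u00e9t\u00e9"

char_count = s.length    # counts characters
byte_count = s.bytesize  # counts the raw bytes stored in memory
raw_bytes  = s.bytes     # each "é" takes two bytes in UTF-8: 0xC3 0xA9

puts "#{char_count} characters, #{byte_count} bytes: #{raw_bytes.inspect}"
# prints: 3 characters, 5 bytes: [195, 169, 116, 195, 169]
```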
- how to check a string's charset, encoding;
String#encoding. It will return a String.
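A hedged sketch of how this looks in practice. Note one difference from the plan described above: in Ruby as later released (1.9+), String#encoding returns an Encoding object, and the name as a String is available through Encoding#name. The sample string is my own.

```ruby
s = "\u00e9"             # a \u escape always produces a UTF-8 String

enc      = s.encoding    # an Encoding object in released Ruby
enc_name = enc.name      # the encoding's name as a plain String

puts enc_name            # prints: UTF-8

# Relabeling the same bytes with another encoding does not convert them.
relabeled = s.dup.force_encoding("ISO-8859-1")
puts relabeled.encoding.name   # prints: ISO-8859-1
puts relabeled.bytes == s.bytes # prints: true
```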
- how to do various operations in the new multibyte string, especially those which will be done differently compared to the classic string;
Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
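A minimal sketch of what "just like before" means for multibyte data, assuming released Ruby (1.9+) semantics where these methods work on characters rather than bytes. The sample strings are made up. I only show downcase on ASCII letters, because case mapping for non-ASCII characters depends on the Ruby version.

```ruby
s = "na\u00efve caf\u00e9"   # "naïve café" as UTF-8

# gsub replaces whole characters, even when they span several bytes.
plain = s.gsub("\u00ef", "i").gsub("\u00e9", "e")

# reverse works per character, so multibyte sequences stay intact.
backwards = s.reverse

# ASCII case conversion behaves as it always did.
lower = "ABC".downcase

puts plain      # prints: naive cafe
puts s.length   # prints: 10
puts lower      # prints: abc
```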
- what will happen to the classic string (e.g. will it perhaps be renamed to ByteArray or something);
The String interface will remain the same. Strings will simply gain the encoding facilities, and will remain largely backwards compatible AFAIK.
- comparison rules for cross-encoding and cross-charset strings;
Strings that have the same encoding and the same bytes are equivalent.
Strings that have different but ASCII-compatible encodings and contain only the same ASCII characters are equivalent.
Everything else is different.
I think there will be ways of converting from one encoding to another, but I don't know the details.
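The comparison rules above can be sketched as follows. This assumes Ruby as later released (1.9+), where the conversion method turned out to be String#encode; that name is not from the discussion itself, and the sample strings are my own.

```ruby
# Different ASCII-compatible encodings, ASCII-only content: equivalent.
utf8   = "abc".encode("UTF-8")
latin1 = "abc".encode("ISO-8859-1")
ascii_equal = (utf8 == latin1)

# Same single byte 0xE9 in two different encodings, non-ASCII: different.
e_latin1 = "\u00e9".encode("ISO-8859-1")
e_cp1252 = "\u00e9".encode("Windows-1252")
same_bytes      = (e_latin1.bytes == e_cp1252.bytes)
non_ascii_equal = (e_latin1 == e_cp1252)

# Converting to a common encoding makes the strings comparable again.
converted_equal = (e_latin1.encode("UTF-8") == e_cp1252.encode("UTF-8"))

puts ascii_equal      # prints: true
puts same_bytes       # prints: true
puts non_ascii_equal  # prints: false
puts converted_equal  # prints: true
```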
- regexes;
Regexp#encoding will be introduced; matching uses rules similar to those for String comparison.
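A hedged sketch of how this plays out, assuming the behavior of Ruby as later released (1.9+). There, a mismatch between a non-ASCII regex and a non-ASCII string in another encoding raises Encoding::CompatibilityError rather than silently failing to match. The patterns and strings are made up for illustration.

```ruby
ascii_re = /t/          # ASCII-only pattern
utf8_re  = /\u00e9/     # the \u escape fixes this pattern to UTF-8

utf8_str   = "\u00e9t\u00e9"                 # "été" in UTF-8
latin1_str = utf8_str.encode("ISO-8859-1")   # same text, different bytes

# An ASCII-only pattern matches strings in any ASCII-compatible encoding.
pos_utf8   = ascii_re =~ utf8_str
pos_latin1 = ascii_re =~ latin1_str

# Same encoding on both sides: matches normally.
pos_same = utf8_re =~ utf8_str

# Non-ASCII pattern against a non-ASCII string in another encoding.
raised = false
begin
  utf8_re =~ latin1_str
rescue Encoding::CompatibilityError
  raised = true
end

puts utf8_re.encoding.name  # prints: UTF-8
puts pos_latin1             # prints: 1
puts raised                 # prints: true
```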
- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte string support (especially since Ruby is a relative latecomer to the Unicode scene);
I can't really do an in-depth comparison here, because I don't know the other languages.
Note that str[0] will return a one-character String and that ?x will do the same. There will be a new method like String#codepoint for getting the underlying raw bytes. I think the one-character Strings can still be optimized fairly easily later on, so that they become immediate objects.
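A short sketch of the indexing behavior, assuming Ruby as later released (1.9+). The method names there ended up as String#ord and String#codepoints for code points, and String#getbyte and String#bytes for raw bytes; those names are from the released API, not from the discussion.

```ruby
s = "\u00e9t\u00e9"        # "été" in UTF-8

first   = s[0]             # a one-character String, not an Integer
literal = ?x               # character literal, also a one-character String

code_point  = first.ord    # Unicode code point of the first character
code_points = s.codepoints # code points of every character
first_byte  = s.getbyte(0) # first raw byte of the UTF-8 representation

puts first.length          # prints: 1
puts literal               # prints: x
puts code_point            # prints: 233
puts first_byte            # prints: 195
```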