Unicode roadmap?

Victor_Zverok_Shepel · 14 June 2006 05:26

>From: Pete [mailto:pertl@gmx.org]
>Sent: Wednesday, June 14, 2006 1:58 AM
>> As I am German the 'missing' unicode support is one of the greatest
>> obstacles for me (and probably all other Germans doing their stuff
>> seriously)...
>
>The same is for Russians/Ukrainians. In our programming communities question
>"does the programming language supports Unicode as 'native'?" has very high
>priority.

Alright, then what specific features are you (both) missing? I don't
think it is a method to get number of characters in a string. It
can't be THAT crucial. I do want to cover "your missing features" in
the future M17N support in Ruby.

matz.

I suppose all we (non-English writers) need is to have all string-related
methods working. For now, I am thinking of simply testing each string
method; some other classes can also be affected by Unicode (possibly
regexps and paths). Regexps seem to work fine (in my 1.9), but paths do
not: File.open with Russian letters in the path doesn't find the file.

More generally, it may make sense to have Unicode as the "base" mode, with
non-Unicode staying as the "old, compatibility" mode.

Something like this.

V.

···

From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 5:37 AM

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

Michal_hramrach_Such · 14 June 2006 09:22

What I want is all methods working seamlessly with unicode strings so
that I do not have to think about the encoding.

Regexps do work with utf-8 strings if KCODE is set to u (but it
defaults to n even when locale uses UTF-8).

String searches should probably work but they would return the wrong position.
Things like split should work for utf-8, the encoding is pretty well defined.

But one might want to use length and [] to work with strings.
It can be simulated with unicode_string=string.scan(/./). But it is no
longer a string. It is composed of characters only as long as I assign
only characters using []=.
The string functions should do the right thing even for utf-8. But I
guess utf-32 is more useful for working with strings this way.
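
To make the UTF-32 idea concrete, here is a minimal sketch. It assumes Ruby 1.9.3 or later (String#encode and String#byteslice do not exist in the 1.8 interpreter discussed in this thread), and the helper name char_at_utf32 is made up for illustration. Because every code point takes exactly four bytes in UTF-32, the character at index i starts at byte 4 * i.

```ruby
# Sketch only: fixed-width UTF-32 turns character offsets into arithmetic.
# Assumes Ruby 1.9.3+; char_at_utf32 is a hypothetical helper name.

# Return the character at index i by slicing 4 bytes out of a UTF-32LE copy.
def char_at_utf32(utf8_string, i)
  utf32 = utf8_string.encode("UTF-32LE")   # 4 bytes per code point
  utf32.byteslice(4 * i, 4).encode("UTF-8")
end

word = "Привет"
puts word.bytesize                     # size of the UTF-8 form in bytes
puts word.encode("UTF-32LE").bytesize  # 4 bytes for each of the 6 characters
puts char_at_utf32(word, 2)            # third character of the word
```

The price is space: the UTF-32 copy of this word is twice the size of the UTF-8 original.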

It might be a good idea to stick encoding information into strings (it
is probably the only way how internationalization can be done and the
sanity of all involved preserved at the same time). The functions for
comparison, etc could use it to do the right thing even if strings
come in several encodings. ie. cp1251 from the system, utf-8 from a
web page, ...

Functions like open could convert the string correctly according to
locale. One should be able to set the encoding information (ie for web
page title when the meta tag for content type is found in a web
page), and remove it to suppress string conversion. It should be also
possible to convert the string (ie to UTF-32 to speed up character
access).

Things like <=>, upcase, downcase, etc make sense only in context of
locale (language). Only the encoding does not define them.
I guess the default <=> is based on the binary representation of the
string. This would mean different sorting of the same strings in
different encodings. Sorting by the unicode code point would be at
least the same for any encoding.
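
The sorting point can be illustrated with a small sketch, assuming Ruby 1.9 or later so that strings carry an encoding; the method names byte_sorted and codepoint_sorted are made up for the example. KOI8-R is a convenient test case because it stores Cyrillic letters in a different order than Unicode does.

```ruby
# Sketch only: byte-wise sorting depends on the encoding, code point
# sorting does not. Assumes Ruby 1.9+; both method names are hypothetical.

# Sort words by raw bytes after converting them to the given encoding,
# then convert back to UTF-8 for display.
def byte_sorted(words, encoding)
  words.map { |w| w.encode(encoding) }.sort.map { |w| w.encode("UTF-8") }
end

# Sort by Unicode code points, whatever encoding the words are stored in.
def codepoint_sorted(words, encoding)
  words.map { |w| w.encode(encoding) }
       .sort_by { |w| w.encode("UTF-8").codepoints }
       .map { |w| w.encode("UTF-8") }
end

words = ["г", "в"]
p byte_sorted(words, "UTF-8")        # в sorts before г
p byte_sorted(words, "KOI8-R")       # г sorts before в (0xC7 < 0xD7)
p codepoint_sorted(words, "KOI8-R")  # в sorts before г again
```

Code point order is stable across encodings, but it is still not a language-aware collation.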

Thanks

Michal

···

On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

Hi,

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

>From: Pete [mailto:pertl@gmx.org]
>Sent: Wednesday, June 14, 2006 1:58 AM
>> As I am German the 'missing' unicode support is one of the greatest
>> obstacles for me (and probably all other Germans doing their stuff
>> seriously)...
>
>The same is for Russians/Ukrainians. In our programming communities question
>"does the programming language supports Unicode as 'native'?" has very high
>priority.

Alright, then what specific features are you (both) missing? I don't
think it is a method to get number of characters in a string. It
can't be THAT crucial. I do want to cover "your missing features" in
the future M17N support in Ruby.

Alexey_Borzenkov · 25 June 2006 14:41

Yukihiro Matsumoto wrote:

Alright, then what specific features are you (both) missing? I don't
think it is a method to get number of characters in a string. It
can't be THAT crucial. I do want to cover "your missing features" in
the future M17N support in Ruby.

Sorry for butting in, but here are my 5 cents. When I first
found out about Ruby, I practically fell in love with the
language. Unfortunately, after some studying and experimenting I
found that it lacks proper Unicode support on Win32, in
particular with file IO and OLE automation, i.e. in the two cases where I
had to interoperate with the rest of the world.

Win32 really differs from Linux and maybe other Unixes in its API: on
*nix you don't have to worry about Unicode or anything else, because the
whole system depends on your current locale. On Win32 there are two sets
of APIs, ANSI and Unicode. Maybe that was a bad decision by Microsoft,
but that's the reality.

Now, I am Russian, and when I write scripts I have to make sure
that not only Russian characters survive, but characters of
other languages as well. If I receive, say, an Excel file with a
lot of languages in it and I have to process that file somehow, I have
to be sure that no letters will be lost or garbled, so converting
it to the current code page (1251) is not an option for me. The same goes
for filenames: the fact that I'm running Russian WinXP doesn't mean that I
have only filenames that fit in code page 1251. I also have filenames
with European characters (umlauts and such), as well as Japanese, and
when I want to write a script that processes these files, I have to
be able to work with them.

At that time this caused me to move to Tcl
(it uses UTF-8 everywhere, and it converts to the required encoding
when interoperating with the world). Since then I have been waiting for
proper Unicode support in Ruby (read: proper interoperability with the
operating system and its components using the Unicode API versions, the ones
ending with W) and maybe a way to define in which locale (specific code
page, UTF-8, etc.) the current script is running.

Hope that clarifies what is currently missing for me (and maybe others,
I don't know).

···

--
Posted via http://www.ruby-forum.com/.

Yukihiro_Matsumoto2 · 14 June 2006 06:34

Hi,

I suppose, all we (non-English-writers) need is to have all string-related
methods working. Just for now, I think about plain testing each string
method;

In that sense, _I_ am one of the non-English-writers, so that I can
suppose I know what we need. And I have no problem with the current
UTF-8 support. Maybe that's because Japanese don't have cases in our
characters. Or maybe I'm missing something. Can you show us your
concrete problems caused by Ruby's lack of "proper" Unicode support?

also, some other classes can be affected by Unicode (possibly
regexps, and paths). Regexps seem to work fine (in my 1.9), but paths do
not: File.open with Russian letters in the path doesn't find the file.

Strange. Ruby does not convert encodings, so there should be no
problem opening files if you are using strings in the encoding your OS
expects. If they differ, you have to specify (and convert) them
properly, no matter what the Unicode support is like.

matz.

···

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 14:26:30 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

Eric_Hodel1 · 14 June 2006 08:22

On OS X multibyte filenames work:

$ cat x.rb
$KCODE = 'u'

puts File.read('Cyrillic_Я.txt')
$ cat Cyrillic_\320\257.txt
test file with Я!
$ ruby x.rb
test file with Я!
$ uname -a
Darwin kaa.jijo.segment7.net 8.6.0 Darwin Kernel Version 8.6.0: Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power Macintosh powerpc
$ ruby -v
ruby 1.8.4 (2006-05-18) [powerpc-darwin8.6.0]
$

···

On Jun 13, 2006, at 10:26 PM, Victor Shepelev wrote:

Regexps seem to work fine (in my 1.9), but paths do
not: File.open with Russian letters in the path doesn't find the file.

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com

Paul_Battley · 14 June 2006 09:40

utf8_string.unpack('U*') is pretty close to this, giving an array of codepoints.
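
A short sketch of the unpack('U*') approach and its inverse, pack('U*'). Both calls exist in Ruby 1.8 and later; the wrapper names codepoints_of and string_from are made up for the example.

```ruby
# Sketch only: treat a UTF-8 string as an array of code points and back.
# codepoints_of and string_from are hypothetical wrapper names.

def codepoints_of(utf8_string)
  utf8_string.unpack("U*")   # one Integer per code point
end

def string_from(codepoints)
  codepoints.pack("U*")      # encode the code points as UTF-8 again
end

cps = codepoints_of("Я!")
p cps                          # => [1071, 33]
p cps.length                   # => 2
puts string_from(cps.reverse)  # the two characters in reverse order
```

As noted in the reply that follows, the result is an Array of Integers, so String methods are not available on it until it is packed back.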

Paul.

···

On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:

It should be also
possible to convert the string (ie to UTF-32 to speed up character
access).

Austin_Ziegler5 · 14 June 2006 12:35

That will *never* happen. Even with Unicode, you have to think about
the encoding, because UTF-32 (the closest representation to the
Platonic ideal "Unicode" you'll ever find) is unlikely to be supported
in the general case. Matz's idea of m17n strings is the right one: you
have a "byte stream" and an attribute which indicates how the byte
stream is encoded. This will sort of be like $KCODE but on an
individual string level so that you could meaningfully have Unicode
(probably UTF-8) and ShiftJIS strings in the same data and still
meaningfully call #length on them.
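
The "byte stream plus encoding attribute" model can be sketched as follows. This assumes Ruby 1.9 or later, where each String carries an Encoding and String#force_encoding changes the tag without touching the bytes; the helper name tagged is made up for the example.

```ruby
# Sketch only: one byte sequence, two encoding tags, two different lengths.
# Assumes Ruby 1.9+; tagged is a hypothetical helper name.

# Re-tag a copy of the bytes without converting them.
def tagged(bytes, encoding)
  bytes.dup.force_encoding(encoding)
end

raw = "\xD0\xAF"                             # two bytes

as_utf8   = tagged(raw, "UTF-8")             # one character: Я
as_cp1251 = tagged(raw, "Windows-1251")      # two Cyrillic characters

puts as_utf8.length                          # => 1
puts as_cp1251.length                        # => 2
puts as_utf8.bytesize == as_cp1251.bytesize  # => true
```

The bytes are identical in both cases; only the attribute decides what #length counts.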

You will *always* have to care about the encoding. As well as,
ultimately, your locale.

-austin

···

On 6/14/06, Michal Suchanek <hramrach@centrum.cz> wrote:

What I want is all methods working seamlessly with unicode strings so
that I do not have to think about the encoding.

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
* austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
* austin@zieglers.ca

Yukihiro_Matsumoto2 · 25 June 2006 15:07

Hi,

Hope that clarifies what is currently missing for me (and maybe others,
I don't know).

Unfortunately, not. I understand that Russian people have problems with
multiple encodings, but I don't know how we can help you.

You said Tcl has Unicode support that works well for you. So I
think treating all of them in UTF-8 is OK for you. Then how can it
determine which should be in the current code page, or in Unicode?
Or would using the Win32 APIs ending with W let you live in
Unicode?

matz.

···

In message "Re: Unicode roadmap?" on Sun, 25 Jun 2006 23:41:48 +0900, Snaury Miyoto <snaury@gmail.com> writes:

Victor_Zverok_Shepel · 14 June 2006 06:56

Hi,

>I suppose, all we (non-English-writers) need is to have all string-related
>methods working. Just for now, I think about plain testing each string
>method;

In that sense, _I_ am one of the non-English-writers,

Sorry, Matz, I know, of course. But I know too little about Japanese to see
how close our tasks are. By "non-English-writers" I perhaps should have said
"European languages" or so, which have common punctuation, LTR writing,
"words" and "whitespace" and so on. I have almost no knowledge of
the needs of Japanese, Korean, Arabic, or Hebrew writers.

so that I can
suppose I know what we need. And I have no problem with the current
UTF-8 support. Maybe that's because Japanese don't have cases in our
characters. Or maybe I'm missing something.

Just what I've said above.

Can you show us your
concrete problems caused by Ruby's lack of "proper" Unicode support?

As mentioned in this topic, it's String#length, upcase, downcase,
capitalize.

BTW, does String#length work well for you?

Moreover, there seem to be some huge problems with paths containing Russian
letters, but I'm not really convinced that Ruby has to handle this.

>also, some other classes can be affected by Unicode (possibly
>regexps, and paths). Regexps seem to work fine (in my 1.9), but paths do
>not: File.open with Russian letters in the path doesn't find the file.

Strange. Ruby does not convert encoding, so that there should be no
problem opening files, if you are using strings in the encoding your OS
expect. If they are differ, you have to specify (and convert) them
properly, no matter how Unicode support is.

Oh, that's a rather hard topic for me. I know Windows XP must support Unicode
file names; I see my filenames in Russian, but I know too little about
system internals to say whether they are really Unicode.

Setting those problems aside, only the String problems remain, but
these are such basic core methods!

V.

···

From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 9:35 AM

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 14:26:30 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

Michal_hramrach_Such · 14 June 2006 10:52

But I want it to be string after the conversion, so that I can use
the standard string functions with sane results. I do not want to
think about various encodings myself if my application has to use
them. The runtime should do that.

Thanks

Michal

···

On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:

On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> It should be also
> possible to convert the string (ie to UTF-32 to speed up character
> access).

utf8_string.unpack('U*') is pretty close to this, giving an array of codepoints.

Michal_hramrach_Such · 15 June 2006 10:59

No. Since I have a locale, stdin can be marked with the proper encoding
information so that all strings originating there have the proper
encoding information.

The string methods should not just blindly operate on bytes but use
the encoding information to operate on characters rather than bytes.
Sure something like byte_length is needed when the string is stored
somewhere outside Ruby but standard string methods should work with
character offsets and characters, not byte offsets nor bytes.

Since my stdout can be also marked with correct encoding the strings
that are output there can be converted to that encoding. Even if it
originates from a source file that happens to be in a different
encoding.
Hmm, perhaps it will be necessary to mark source files with encoding
tags as well. It could be quite tedious to assign the tag manually to
every string in a source file.

When strings are compared, concatenated, .. the encoding is known so
the methods should do the right thing.

I do not have to care about encoding. You may make a string
implementation that forces me to care (such as the current one). But I
do not have to. I can always turn to perl if I get really desperate.

Thanks

Michal

···

On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:

On 6/14/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> What I want is all methods working seamlessly with unicode strings so
> that I do not have to think about the encoding.

That will *never* happen. Even with Unicode, you have to think about
the encoding, because UTF-32 (the closest representation to the
Platonic ideal "Unicode" you'll ever find) is unlikely to be supported
in the general case. Matz's idea of m17n strings is the right one: you
have a "byte stream" and an attribute which indicates how the byte
stream is encoded. This will sort of be like $KCODE but on an
individual string level so that you could meaningfully have Unicode
(probably UTF-8) and ShiftJIS strings in the same data and still
meaningfully call #length on them.

You will *always* have to care about the encoding. As well as,
ultimately, your locale.

Austin_Ziegler5 · 25 June 2006 19:06

Matz,

I've mentioned it before, but I will be happy to make the Windows APIs
work with Unicode once the m17n Strings exist. Yes, I will be making
them use either UTF-8 (conversion required, most likely to be compatible
with existing code) or UTF-16 (no conversion required). It will work
well: I have done a similar implementation for code that I have written
at work.

-austin

···

On 6/25/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

In message "Re: Unicode roadmap?" on Sun, 25 Jun 2006 23:41:48 +0900, Snaury Miyoto <snaury@gmail.com> writes:
>Hope that clarifies what is currently missing for me (and maybe
>others, I don't know).
Unfortunately, not. I understand Russian people having problem with
multiple encoding, but I don't know how can we help you.

You said Tcl has Unicode support that works well with you. So that I
think treating all of them in UTF-8 is OK for you. Then how can it
determine which should be in the current code page, or in Unicode?
Or using Win32 API ending with W could allow you living in the
Unicode?

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
* austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
* austin@zieglers.ca

Alexey_Borzenkov · 26 June 2006 07:46

Yukihiro Matsumoto wrote:

You said Tcl has Unicode support that works well with you. So that I
think treating all of them in UTF-8 is OK for you.

It's actually not about treating everything in UTF-8, it just unifies
everything in Tcl in a way that you can have all variety of characters
in strings.

Then how can it
determine which should be in the current code page, or in Unicode?
Or using Win32 API ending with W could allow you living in the
Unicode?

Well, currently (I just downloaded the latest CVS sources) Ruby uses the ANSI
versions of the CreateFile and FindFirstFile/FindNextFile APIs, so even if I
set, for example, KCODE to UTF-8 (I'm not sure how you can currently make
Ruby work with UTF-8), the ANSI versions of the APIs are still called, and that
means that

  1) if there are filenames with characters that don't fall in range of
current codepage, I will receive '?' in place of them when I enumerate
directory contents.
  2) I receive filenames in current code page, and not in UTF-8
  3) There is no way for me to open a file with these characters using
standard ruby classes
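
The first point can be modelled without touching any Win32 API. The sketch below assumes Ruby 1.9 or later and uses String#encode to imitate what happens when a name is forced through one ANSI code page; the method name through_codepage and the file names are made up for the example.

```ruby
# Sketch only: what is lost when names are squeezed through one code page.
# Assumes Ruby 1.9+. This imitates the effect with String#encode and does
# not call any Win32 API; through_codepage is a hypothetical helper name.

# Convert a UTF-8 name to the given code page, replacing every character
# that has no mapping there with "?", then convert back for display.
def through_codepage(name, codepage)
  name.encode(codepage, undef: :replace, replace: "?").encode("UTF-8")
end

names = ["отчет.txt", "日本語.txt"]
names.each do |name|
  puts through_codepage(name, "Windows-1251")
end
```

The Russian name survives the round trip, while each Japanese character comes back as "?", so the second name can no longer be used to open the file.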

The same with win32ole extension, I can see a lot of ole_wc2mb/ole_mb2wc
there, which breaks things horribly when interoperating with, for
example, Excel and trying to work with russian/greek/japanese and all
other languages all on the same sheet (after I process the sheet,
modifying all of the cells, it will just strip all languages except
russian from it).

In *nixes you can just change your locale to *.UTF-8 and you're ok with
that, because everything you receive when enumerating directory is
UTF-8, and File.open will expect UTF-8. Unfortunately, for Windows that
is not possible: MS already provides 'wide' versions of APIs for those
who need them, and there is no UTF-8 ANSI codepage you can set as
default (because UTF-8 codepage in Windows is somewhat 'virtual', for
conversion purposes only).

In Tcl you have all of your strings in UTF-8, and when Tcl interoperates
with the rest of the world, it converts strings appropriately (for
example, on Win9x there are mostly no 'wide' APIs, so it converts
strings to the current code page and uses the ANSI APIs, but on WinNT it
converts them to Unicode and uses the 'wide' APIs). What I was thinking of
is a way to set the "current codepage" for Ruby on Win32 (including
the possibility of setting it to UTF-8), so that when Ruby works with the
world it would use the 'wide' APIs when possible, converting to and from
this codepage. Unlike Tcl, where the encoding is hard-coded
to UTF-8, there would be a possibility to choose. There is
no other way for a user to do this on Windows (the user can't set the
current codepage to UTF-8).

···


Michael_Glaesemann · 14 June 2006 07:08

Just to chime in, aren't upcase, downcase, and capitalize a locale/localization issue rather than a Unicode-only issue per se? For example, different languages will have different rules for capitalization. Or am I wrong? Does Unicode in and of itself address these issues?

Granted, proper support for upcase, downcase, and capitalize is important, but I think it's a separate issue, part of m17n as a whole rather than support for Unicode in particular.
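
On the question above: Unicode defines default case mappings and, separately, a few language-specific exceptions, so both views hold in part. A minimal sketch of the difference, assuming Ruby 2.4 or later (well after this thread), where String#upcase handles non-ASCII letters and accepts a :turkic option; the two wrapper names are made up for the example.

```ruby
# Sketch only: the same letter upcases differently under different
# language rules. Assumes Ruby 2.4+; wrapper names are hypothetical.

def upcase_default(s)
  s.upcase            # Unicode default case mapping
end

def upcase_turkic(s)
  s.upcase(:turkic)   # mapping adapted for Turkic languages
end

puts upcase_default("привет")  # Cyrillic needs no language-specific rule
puts upcase_default("i")       # plain capital I
puts upcase_turkic("i")        # capital I with a dot above (U+0130)
```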

Michael Glaesemann
grzm seespotcode net

···

On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:

As mentioned in this topic, it's String#length, upcase, downcase,
capitalize.

Vincent_Isambart · 14 June 2006 07:14

Hi,

As mentioned in this topic, it's String#length, upcase, downcase,
capitalize.

BTW, does String#length works good for you?

To have the length of a Unicode string, just do str.split(//).length,
or "require 'jcode'" at the beginning of your code.
For the other functions, try looking at the unicode library
http://www.yoshidam.net/Ruby.html#unicode
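
A minimal sketch of the split(//) approach. In the Ruby 1.8 of this thread it needs $KCODE = 'u' to split on characters; the sketch assumes Ruby 1.9 or later, where the split follows the string's own encoding. The method name char_count is made up for the example.

```ruby
# Sketch only: counting characters rather than bytes with split(//).
# Assumes Ruby 1.9+; char_count is a hypothetical helper name.

def char_count(str)
  str.split(//).length   # one array element per character
end

s = "Straße"
puts char_count(s)   # => 6
puts s.bytesize      # => 7, because ß takes two bytes in UTF-8
```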

> >also, some other classes can be affected by Unicode (possibly
> >regexps, and paths). Regexps seem to work fine (in my 1.9), but paths do
> >not: File.open with Russian letters in the path doesn't find the file.
>
> Strange. Ruby does not convert encoding, so that there should be no
> problem opening files, if you are using strings in the encoding your OS
> expect. If they are differ, you have to specify (and convert) them
> properly, no matter how Unicode support is.

Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
file names; I see my filenames in Russian, but I have low knowledge of
system internals to say, are they really Unicode?

Windows XP does support Unicode file names, but I'm not sure you can
use them with Ruby (I do not use Ruby much under Windows). Try
converting the file names to your current locale, it should work if
the file names can be converted to it. What I mean is that Russian
file names encoded in the Windows Russian encoding should work on a
Russian PC.

Hope this helps,

Cheers,
Vincent ISAMBART

Yukihiro_Matsumoto2 · 14 June 2006 07:20

Hi,

Can you show us your
concrete problems caused by Ruby's lack of "proper" Unicode support?

As mentioned in this topic, it's String#length, upcase, downcase,
capitalize.

OK. Case is the problem. I understand.

BTW, does String#length works good for you?

I don't remember the last time I needed the length method to count
characters. Actually, I don't count string length at all, either
in bytes or in characters, in my string processing. Maybe this is a
special case. I am too optimized for Ruby string operations using
Regexp.

Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
file names; I see my filenames in Russian, but I have low knowledge of
system internals to say, are they really Unicode?

Win32 path encoding is a nightmare. Our Win32 maintainers are often
troubled by unexpected OS behavior. I am sure we _can_ handle Russian
path names, but we need help from Russian people to improve.

matz.

···

In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 15:56:02 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

Austin_Ziegler5 · 14 June 2006 12:22

They are UTF-16 internally. I haven't been paying attention to Ruby
1.9 lately, but when I have time and have noticed that Matz has
checked in support for m17n strings, I will be enhancing support for
Windows files to use Unicode. Currently, Ruby is built using the
non-Unicode form *only*. And no, using -DUNICODE is the *wrong*
answer, thanks. We'd have to start using TCHAR instead of char, and it
would actually mean that we'd be using wchar_t instead of char in this
case.

I've already done a similar (but more complex) project at work.

-austin

···

On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:

Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
file names; I see my filenames in Russian, but I have low knowledge of
system internals to say, are they really Unicode?

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
* austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
* austin@zieglers.ca

Randy_Kramer · 14 June 2006 21:36

(RE my previous post): Oops, maybe UTF-32 is exactly what I was alluding to?

Randy Kramer

(Should have waited a little longer before posting.)

···

On Wednesday 14 June 2006 06:52 am, Michal Suchanek wrote:

On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > It should be also
> > possible to convert the string (ie to UTF-32 to speed up character
> > access).

Juergen_Strobel · 17 June 2006 11:08

>> What I want is all methods working seamlessly with unicode strings so
>> that I do not have to think about the encoding.
>
>That will *never* happen. Even with Unicode, you have to think about
>the encoding, because UTF-32 (the closest representation to the
>Platonic ideal "Unicode" you'll ever find) is unlikely to be supported
>in the general case. Matz's idea of m17n strings is the right one: you
>have a "byte stream" and an attribute which indicates how the byte
>stream is encoded. This will sort of be like $KCODE but on an
>individual string level so that you could meaningfully have Unicode
>(probably UTF-8) and ShiftJIS strings in the same data and still
>meaningfully call #length on them.
>
>You will *always* have to care about the encoding. As well as,
>ultimately, your locale.

No. Since I have a locale, stdin can be marked with the proper encoding
information so that all strings originating there have the proper
encoding information.

The string methods should not just blindly operate on bytes but use
the encoding information to operate on characters rather than bytes.
Sure something like byte_length is needed when the string is stored
somewhere outside Ruby but standard string methods should work with
character offsets and characters, not byte offsets nor bytes.

I emphatically agree. I'll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

1. Strings should deal in characters (code points in Unicode) and not
in bytes, and the public interface should reflect this.

2. Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be encapsulated
by the string class completely, except for a few related classes which
may opt to work with the gory details for performance reasons.
The internal encoding has to be decided, probably between UTF-8,
UTF-16, and UTF-32 by the String class implementor.

3. Whenever Strings are read or written to/from an external source,
their data needs to be converted. The String class encapsulates the
encoding framework, likely with additional helper Modules or Classes
per external encoding. Some methods take an optional encoding
parameter, like #char(index, encoding=:utf8), or
#to_ary(encoding=:utf8), which can be used as helper Class or Module
selector.

4. IO instances are associated with a (modifiable) encoding. For
stdin, stdout this can be derived from the locale settings. String-IO
operations work as expected.

5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.

6. More exotic operations can easily be provided by additional
libraries because of Ruby's open classes. Those operations may be
coded against String's public interface for simplicity, or
work with the internal representation directly for performance.

7. This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like Fixnum
and Bignum).

8. Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby's canonical String class. This will break some
old uses of String, but now is the right time for that.

9. The String class does not worry over character representation
on-screen, the mapping to glyphs must be done by UI frameworks or the
terminal attached to stdout.

10. Be flexible. <placeholder for future idea>
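
Points 3 and 4 (conversion at the IO boundary) can be sketched with the encoding-aware file modes that Ruby gained in 1.9. This is an illustration under that assumption, not the API proposed in this post; the method name roundtrip_via is made up, and Tempfile.create with a block needs Ruby 2.1 or later.

```ruby
# Sketch only: encodings attached to IO objects, conversion at the boundary.
# Assumes Ruby 2.1+; roundtrip_via is a hypothetical helper name.
require "tempfile"

# Write text to a file in the given external encoding, then read it back
# as UTF-8. Returns [bytes_on_disk, text_as_read].
def roundtrip_via(encoding, text)
  Tempfile.create("m17n") do |f|
    f.close
    # "w:ENC" converts strings to ENC as they are written.
    File.open(f.path, "w:#{encoding}") { |out| out.write(text) }
    on_disk = File.binread(f.path).bytesize
    # "r:ENC:UTF-8" reads ENC from disk and hands back UTF-8 strings.
    in_ruby = File.open(f.path, "r:#{encoding}:UTF-8") { |io| io.read }
    [on_disk, in_ruby]
  end
end

size, text = roundtrip_via("Windows-1251", "Привет")
puts size            # one byte per letter on disk
puts text.encoding   # UTF-8 inside the program
puts text
```

Inside the program the string is always UTF-8; only the file handle knows about code page 1251.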

This approach has several advantages and a few disadvantages, and I'll
try to bring in some new angles to this now too:

*Advantages*

-POL, Encapsulation-

All Strings behave exactly the same everywhere, are predictable,
and do the hard work for their users.

-Cross Library Transparency-

No String user needs to worry which Strings to pass to a library, or
worry which Strings he will get from a library. With Web-facing
libraries like rails returning encoding-tagged Strings, you would be
likely to get Strings of all possible encodings otherwise, and is the
String user prepared to deal with this properly? This is a *big* deal
IMNSHO.

-Limited Conversions-

Encoding conversions are limited to the time Strings are created or
written or explicitly transformed to an external representation.

-Correct String Operations-

Even basic String operations are very hard in the world of Unicode. If
we leave the String users to look at the encoding tags and sort it out
themselves, they are bound to make mistakes because they don't care,
don't know, or have no time. And these mistakes may be _security_
_sensitive_, since most often credentials are represented as Strings
too. There already have been exploits related to Unicode.

*Disadvantages* (with mitigating reasoning of course)

- String users need to learn that #byte_length(encoding=:utf8) >=
#size, but that's not too hard, and applies everywhere. Users do not
need to learn about an encoding tag, which is surely worse to handle
for them.

- Strings cannot be used as simple byte buffers any more. Either use
an array of bytes, or an optimized ByteBuffer class. If you need
regular expression support, Regexp can be extended for ByteBuffers or
even more.

- Some String operations may perform worse than a naive user might
expect, in time or in space. But we do this so the
String user doesn't need to do it himself, and we are probably better at it
than the user too.

- For very simple uses of String, there might be unnecessary
conversions. If a String is just to be passed through somewhere,
without inspecting or modifying it at all, in- and outwards conversion
will still take place. You could and should use a ByteBuffer to avoid
this.

- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don't commit to a
particular encoding of Unicode strongly.

- More work and time to implement. Some could call it
over-engineered. But it will save a lot of time and troubles when shit
hits the fan and users really do get unexpected foreign characters in
their Strings. I could offer help implementing it, although I have
never looked at ruby's source, C-extensions, or even done a lot of
ruby programming yet.

Close to the start of this discussion Matz asked what the problem with
current strings really was for western users. Somewhere later he
concluded case folding. I think it is more than that: we are lazy and
expect character handling to be always as easy as with 7 bit ASCII, or
as close as possible. Fixed 8-bit codepages worked quite fine most of
the time in this regard, and breakage was limited to special
characters only.

Now let's ask the question in reverse: are eastern programmers so used
to doing elaborate byte-stream to character handling by hand they
don't recognize how hard this is any more? Surely it is a target for
DRY if I ever saw one. Or are there actual problems not solvable this
way? I looked up the mentioned Han-Unification issue, and as far as I
understood this could be handled by future Unicode revisions
allocating more characters, outside of Ruby, but I don't see how it
requires our Strings to stay dumb byte buffers.

Jürgen

···

On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:

On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
>On 6/14/06, Michal Suchanek <hramrach@centrum.cz> wrote:

Since my stdout can be also marked with correct encoding the strings
that are output there can be converted to that encoding. Even if it
originates from a source file that happens to be in a different
encoding.
Hmm, perhaps it will be necessary to mark source files with encoding
tags as well. It could be quite tedious to assign the tag manually to
every string in a source file.

When strings are compared, concatenated, .. the encoding is known so
the methods should do the right thing.

I do not have to care about encoding. You may make a string
implementation that forces me to care (such as the current one). But I
do not have to. I can always turn to perl if I get really desperate.

Thanks

Michal

--
The box said it requires Windows 95 or better so I installed Linux

Alexey_Borzenkov · 26 June 2006 07:50

Snaury Miyoto wrote:

Yukihiro Matsumoto wrote:

Then how can it
determine which should be in the current code page, or in Unicode?
Or using Win32 API ending with W could allow you living in the
Unicode?

Well, currently (just downloaded latest cvs sources) ruby uses ansi
versions of CreateFile and FindFirstFile/FindNextFile APIs, so even if I
say, for example, KCODE to UTF-8 (not sure how you can currently make
ruby work with UTF-8) ansi versions of APIs are still called, and that
means that
The same with win32ole extension, I can see a lot of ole_wc2mb/ole_mb2wc
there, which breaks things horribly when interoperating with, for
example, Excel and trying to work with russian/greek/japanese and all
other languages all on the same sheet (after I process the sheet,
modifying all of the cells, it will just strip all languages except
russian from it).

Ah, well, for ole that's not true, only now I realized I can set
codepage there to UTF-8, but still similar thing for win32 file io (and
maybe for other things where win32 API or win32 cruntime used) would be
great.

···

