a), unless either of the strings is "ascii" and the other is "ascii"
compatible. This point is arguable.
What is "ascii"? Specifically, I would like string operations to succeed
in cases when both strings are encoded as different subsets of Unicode
(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
string should result in a UTF-* string, not an error.
Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, whereas EBCDIC,
UTF-16 and UTF-32 are not. No other automatic conversion shall be done,
since we don't particularly encourage the mixed encoding model.
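A minimal sketch of how these rules eventually surfaced in Ruby 1.9's
M17N implementation (shown purely for illustration; this thread predates
that release, and the literals are made up):

  utf8  = "caf\u00E9"                    # UTF-8, ascii-compatible
  ascii = "menu".encode("US-ASCII")      # plain ascii

  p Encoding::UTF_8.ascii_compatible?    # => true
  p Encoding::UTF_16BE.ascii_compatible? # => false

  joined = utf8 + ascii                  # ascii mixes freely with UTF-8
  p joined.encoding                      # => #<Encoding:UTF-8>

  # Mixing two non-ascii strings in different encodings raises, since
  # no automatic conversion is performed:
  latin2 = "\xBC".force_encoding("ISO-8859-2")
  begin
    utf8 + latin2
  rescue Encoding::CompatibilityError => e
    puts e.message   # incompatible character encodings: UTF-8 and ISO-8859-2
  end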
>- what to do with IO:
> a) IO will return strings in "raw/bytes"?
> b) IO can be tagged and will return Strings with the given encoding tag?
> c) IO can be tagged and is by default tagged with global encoding tag?
> d) IO can be tagged, but is not tagged by default, although methods
>returning strings (such as read, readlines) will use global encoding tag?
> e) if IO is tagged and one tries to write to it a String with different
>encoding, what will happen?
c), the global default shall be set from the locale setting.
I am not sure this is good for network IO as well. For diagnostics it
might be useful to set the default to none, and have string raise an
exception when such strings are combined with other strings.
It is only obvious for STDIN and STDOUT that they should follow the
locale setting.
Restricting the locale-derived default encoding to STDIO may be a good
idea. There are still open issues, since the default encoding from the
locale is not covered by the prototype, so we need more experience.
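A small sketch of how option (c) eventually looked in Ruby 1.9 (again
for illustration only; "data.txt" is a made-up file name):

  File.write("data.txt", "hello")    # create a small file to read back

  p Encoding.default_external        # e.g. #<Encoding:UTF-8> under a UTF-8 locale

  # An IO can be tagged explicitly, overriding the global default:
  File.open("data.txt", "r:ISO-8859-1") do |f|
    p f.external_encoding            # => #<Encoding:ISO-8859-1>
    p f.read.encoding                # => #<Encoding:ISO-8859-1>
  end

  # An untagged IO falls back to the locale-derived default:
  File.open("data.txt") do |f|
    p f.read.encoding                # follows Encoding.default_external
  end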
matz.
···
In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 21:39:33 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:
I wonder. Why cannot Strings throughout Ruby be _always_ represented
as Unicode, and why not let ICU handle the conversion between various
encodings for incoming and outgoing data?
I know, it is a long-standing issue with Unicode's Han unification
process, but without proper Unicode support Ruby is destined to be a
toy for the English-speaking and Japanese communities only. (And as
I'm gearing up to prepare a web site in Russian, Turkish and English,
I feel that using Ruby could prove to be a major pain in the nether
regions of my body.)
···
On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
Hi,
In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 21:39:33 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:
>> a), unless either of the strings is "ascii" and the other is "ascii"
>> compatible. This point is arguable.
>
>What is "ascii"? Specifically, I would like string operations to succeed
>in cases when both strings are encoded as different subsets of Unicode
>(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
>string should result in a UTF-* string, not an error.
Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, whereas EBCDIC,
UTF-16 and UTF-32 are not. No other automatic conversion shall be done,
since we don't particularly encourage the mixed encoding model.
Reading what you said, it appears it would only be possible to add
ascii strings to ascii-compatible strings. That does not sound very
useful.
If the intended meaning was rather that operations on two
ascii-compatible strings should always be possible, and that the
result is again ascii-compatible, that would sound better.
But it makes these "ascii" encodings a special case. In particular, it
makes UTF-32 less convenient to use.
I guess that for a calculation so complex that it would really benefit
from the fast random access of UTF-32 it is reasonable to create a
wrapper that converts the arguments and results. However, if one wants
to perform several such (different) consecutive calculations, there are
going to be several useless conversions. It is certainly possible to
make the input interface clever enough to get it right for both UTF-32
and ascii strings, but requiring the user to do the conversion on
results does not look nice.
The compatibility attribute could also be a general value that
specifies the encoding family. Different families would then be
possible, though I am not sure whether any other encoding families of
significance exist.
Thanks
Michal
···
On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
Hi,
In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 21:39:33 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:
>> a), unless either of the strings is "ascii" and the other is "ascii"
>> compatible. This point is arguable.
>
>What is "ascii"? Specifically, I would like string operations to succeed
>in cases when both strings are encoded as different subsets of Unicode
>(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
>string should result in a UTF-* string, not an error.
Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, whereas EBCDIC,
UTF-16 and UTF-32 are not. No other automatic conversion shall be done,
since we don't particularly encourage the mixed encoding model.
This entire discussion is centered around a proposal to do exactly
that. There are many *very good* reasons to avoid doing this. Unicode
Is Not Always The Answer.
It's *usually* the answer, but there are times when it's just easier
to work with data in an established code page.
-austin
···
On 6/19/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
I wonder. Why cannot Strings throughout Ruby be _always_ represented
as Unicode, and why not let ICU handle the conversion between various
encodings for incoming and outgoing data?
I know, it is a long-standing issue with Unicode's Han unification
process, but without proper Unicode support Ruby is destined to be a
toy for the English-speaking and Japanese communities only. (And as
I'm gearing up to prepare a web site in Russian, Turkish and English,
I feel that using Ruby could prove to be a major pain in the nether
regions of my body.)
Reading what you said, it appears it would only be possible to add
ascii strings to ascii-compatible strings. That does not sound very
useful.
In the usual case you will have all your strings in the encoding you
choose as the internal encoding, so you will have few compatibility
problems. Only if you want to handle multiple encodings at a time do
you need explicit code conversion for mixed-encoding operations.
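As a concrete illustration of that explicit conversion at the boundary
(using the String#encode API this plan later produced in Ruby 1.9; the
literal is made up):

  latin1 = "gr\xFC\xDFen".force_encoding("ISO-8859-1")  # "grüßen" in Latin-1
  internal = latin1.encode("UTF-8")   # explicit conversion, not automatic
  p internal.encoding                 # => #<Encoding:UTF-8>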
I guess that for a calculation so complex that it would really benefit
from the fast random access of UTF-32 it is reasonable to create a
wrapper that converts the arguments and results. However, if one wants
to perform several such (different) consecutive calculations, there are
going to be several useless conversions.
I am not sure what you mean. I feel that my plan does not have
anything against UTF-32 in this regard. Perhaps I am missing
something. What is going to cause useless conversions?
matz.
···
In message "Re: Unicode roadmap?" on Tue, 20 Jun 2006 02:20:10 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:
I totally agree with that. IMO, the point lies exactly in this
"*usually* an answer". When was the last time 90% of developers had to
wonder what encoding their data was in? And with the advent of
Unicode (and storage becoming cheaper and cheaper, and developers
becoming lazier and lazier) more and more of that data is
going to be Unicode.
So, since Unicode is *usually* the answer, make it as painless as
possible. Make all String methods and any other functions that work
with strings accept Unicode straight out of the box without any
worries on the developer's part. And provide alternatives (or optional
parameters?) that would allow the few more encoding-aware gurus to do
whatever they want with encodings.
Because otherwise we are at risk of ending up with incompatible
extensions to strings that "simplify" a developer's life (and the
trend's already begun). I wouldn't want a C/C++ scenario with string
class upon string class upon extension upon extension that aim to do
something String should do from the start.
All is IMHO, of course
···
On 6/19/06, Austin Ziegler <halostatue@gmail.com> wrote:
On 6/19/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
> I wonder. Why cannot Strings throughout Ruby be _always_ represented
> as Unicode, and why not let ICU handle the conversion between various
> encodings for incoming and outgoing data?
> I know, it is a long-standing issue with Unicode's Han unification
> process, but without proper Unicode support Ruby is destined to be a
> toy for the English-speaking and Japanese communities only. (And as
> I'm gearing up to prepare a web site in Russian, Turkish and English,
> I feel that using Ruby could prove to be a major pain in the nether
> regions of my body.)
This entire discussion is centered around a proposal to do exactly
that. There are many *very good* reasons to avoid doing this. Unicode
Is Not Always The Answer.
It's *usually* the answer, but there are times when it's just easier
to work with data in an established code page.
To enlighten the ignorant, could you describe one or two scenarios
where a Unicode-based String class would get in the way? To use your
words, make things less easy? I would probably not agree that there
are "*many good*" reasons to avoid this, but probably that's just
because I've been fortunate enough to not encounter the problem
scenarios. This material would have application in a far larger
domain than just Ruby, obviously. -Tim
···
On Jun 19, 2006, at 6:31 AM, Austin Ziegler wrote:
This entire discussion is centered around a proposal to do exactly
that. There are many *very good* reasons to avoid doing this. Unicode
Is Not Always The Answer.
It's *usually* the answer, but there are times when it's just easier
to work with data in an established code page.
>Reading what you said, it appears it would only be possible to add
>ascii strings to ascii-compatible strings. That does not sound very
>useful.
In the usual case you will have all your strings in the encoding you
choose as the internal encoding, so you will have few compatibility
problems. Only if you want to handle multiple encodings at a time do
you need explicit code conversion for mixed-encoding operations.
If I read pieces of text from web pages they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subsets of
Unicode.
It was the complaint of one of the people here that in Python strings
with different encodings exist but operations on them fail. That
makes the life of anybody working with such strings unnecessarily
hard: they have to be converted explicitly.
>I guess that for a calculation so complex that it would really benefit
>from the fast random access of UTF-32 it is reasonable to create a
>wrapper that converts the arguments and results. However, if one wants
>to perform several such (different) consecutive calculations, there are
>going to be several useless conversions.
I am not sure what you mean. I feel that my plan does not have
anything against UTF-32 in this regard. Perhaps I am missing
something. What is going to cause useless conversions?
If automatic conversions aren't implemented at all, UTF-32 does not
really stand out in this regard.
Thanks
Michal
···
On 6/20/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
In message "Re: Unicode roadmap?" on Tue, 20 Jun 2006 02:20:10 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:
I think that's more likely with (a) what we have now and (b) a
Unicode-internal approach. (Indeed, a Unicode-internal approach
*requires* separating a byte vector from String, which doubles
interface complexity.) I would suggest that you look through the whole
discussion and pay particular attention to Matz's statements.
-austin
···
On 6/19/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
Because otherwise we are at risk of ending up with incompatible
extensions to strings that "simplify" a developer's life (and the
trend's already begun). I wouldn't want a C/C++ scenario with string
class upon string class upon extension upon extension that aim to do
something String should do from the start.
I've found that a Unicode-based string class gets in the way when it
forces you to work around it. For most text-processing purposes, it
*isn't* an issue. But when you've got text whose origin encoding you
don't *know* (and you're probably working in a different code page),
a Unicode-based string class usually guesses wrong.
Transparent Unicode conversion only works when it is guaranteed that the
starting code page and the ending code page are identical. It's
*definitely* a legacy data issue, and doesn't affect most people, but it
has affected me in dealing with (in a non-Ruby context) NetWare.
Additionally, the overhead of converting to Unicode if your entire data
set is in ISO-8859-1 is unnecessary; again, this is a specialized case.
More problematic, from the Ruby perspective, is that a Unicode-based
string class would require that there be a wholly separate byte vector
class; I am not sure that is necessary or wise. The first time I read a
JPG into a String, I was delighted -- the interface presented was so
clean and nice as opposed to having to muck around in languages that
force multiple interfaces because of such a presentation.
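For illustration, here is how that single-class model looks with the
API Ruby later grew ("photo.jpg" is a made-up file name):

  jpg = File.binread("photo.jpg")   # binary data in an ordinary String
  p jpg.encoding                    # => #<Encoding:ASCII-8BIT>, i.e. raw bytes
  p jpg[0, 2].unpack("C2")          # => [255, 216], the JPEG SOI marker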
Like I said, I'm not anti-Unicode, and I want Ruby's Unicode support to
be the best, bar none. I'm not willing to compromise on API or
flexibility to gain that, though.
-austin
···
On 6/19/06, Tim Bray <tbray@textuality.com> wrote:
On Jun 19, 2006, at 6:31 AM, Austin Ziegler wrote:
This entire discussion is centered around a proposal to do exactly
that. There are many *very good* reasons to avoid doing this. Unicode
Is Not Always The Answer.
It's *usually* the answer, but there are times when it's just easier
to work with data in an established code page.
To enlighten the ignorant, could you describe one or two scenarios
where a Unicode-based String class would get in the way? To use your
words, make things less easy? I would probably not agree that there
are "*many good*" reasons to avoid this, but probably that's just
because I've been fortunate enough to not encounter the problem
scenarios. This material would have application in a far larger
domain than just Ruby, obviously. -Tim
Having different encodings on one web page is a good way to make sure that
the page won't display correctly, since all the browsers I know of display
all text on a page using just one encoding. Granted, if the encoding is a
subset of unicode, it may still manage to work out, but personally I keep
running into pages that display some of the characters as garbage no matter
what encoding I instruct my browser to use. So, no, I don't think it should
be valid to concatenate strings with different encodings.
···
On 6/20/06, Michal Suchanek <hramrach@centrum.cz> wrote:
If I read pieces of text from web pages they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subsets of
Unicode.
I'm not sure I understand what 'subset of unicode' means.
Do you mean two different encodings of Unicode code points?
As in 'UTF-8 and UTF-16 are subsets of Unicode'?
That usage seems unusual to me. Are you using 'subset' and 'encoding'
as synonyms, or am I missing a subtle difference?
Gary Wright
···
On Jun 20, 2006, at 8:09 AM, Michal Suchanek wrote:
If I read pieces of text from web pages they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subsets of
Unicode.
So we shouldn't do it because it doesn't work in web browsers?
Hopefully we don't apply that criterion globally, or we'd never get anything done.
···
On Jun 20, 2006, at 14:54, Timothy Bennett wrote:
On 6/20/06, Michal Suchanek <hramrach@centrum.cz> wrote:
If I read pieces of text from web pages they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subsets of
Unicode.
Having different encodings on one web page is a good way to make sure that
the page won't display correctly, since all the browsers I know of display
all text on a page using just one encoding. Granted, if the encoding is a
subset of unicode, it may still manage to work out, but personally I keep
running into pages that display some of the characters as garbage no matter
what encoding I instruct my browser to use. So, no, I don't think it should
be valid to concatenate strings with different encodings.
No, I meant that the strings are, of course, converted to a common
encoding such as utf-8 before they are concatenated.
The point is that you do not have to care which encoding you
obtained the pieces in, or convert them manually to a common encoding,
if the string class can do it automatically for you.
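A hypothetical sketch of that behaviour (AutoString is invented here
for illustration, built on the conversion API Ruby later gained; it is
not part of any real proposal):

  class AutoString
    attr_reader :str

    def initialize(str)
      @str = str
    end

    def +(other)
      # Promote both operands to a common Unicode encoding, then join.
      AutoString.new(@str.encode("UTF-8") + other.str.encode("UTF-8"))
    end
  end

  latin1 = AutoString.new("caf\xE9".force_encoding("ISO-8859-1"))
  latin2 = AutoString.new(" \xBE".force_encoding("ISO-8859-2"))
  combined = latin1 + latin2         # no manual conversion needed
  p combined.str.encoding            # => #<Encoding:UTF-8>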
Thanks
Michal
···
On 6/20/06, Timothy Bennett <timothy.s.bennett@gmail.com> wrote:
On 6/20/06, Michal Suchanek <hramrach@centrum.cz> wrote:
>
> If I read pieces of text from web pages they can be in different
> encodings. I do not see any reason why such pieces of text could not
> be automatically concatenated as long as they are all subsets of
> Unicode.
Having different encodings on one web page is a good way to make sure that
the page won't display correctly, since all the browsers I know of display
all text on a page using just one encoding. Granted, if the encoding is a
subset of unicode, it may still manage to work out, but personally I keep
running into pages that display some of the characters as garbage no matter
what encoding I instruct my browser to use. So, no, I don't think it should
be valid to concatenate strings with different encodings.
Having different encodings on one web page is a good way to make sure that
the page won't display correctly
...
So, no, I don't think it should
be valid to concatenate strings with different encodings.
Well, unless you had a String class that took care of the encoding
details and, when you were ready to output, allowed you to say "Give
me that in ISO-8859 or UTF-8 or whatever". -Tim
···
On Jun 20, 2006, at 6:54 AM, Timothy Bennett wrote:
I mean that the iso-8859-1 and iso-8859-2 encodings (as well as many
others) encode a subset of the characters available in Unicode and in
any of its utf-* encodings. Thus any string encoded that way can be
losslessly and automatically converted to an encoding of full Unicode
such as utf-8, and operations on several such converted strings make
sense even if the strings were encoded using different encodings
before the conversion.
The automatic conversion would simplify things if you get strings in
different encodings from outside sources such as various web pages,
databases, etc.
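For example (illustrative only, using the conversion API Ruby
eventually shipped; the bytes are made up):

  original = "\xB1\xE6".force_encoding("ISO-8859-2")   # two Polish letters
  via_utf8 = original.encode("UTF-8")                  # lossless by construction
  p via_utf8.encode("ISO-8859-2") == original          # => true: round trip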
Thanks
Michal
···
On 6/20/06, gwtmp01@mac.com <gwtmp01@mac.com> wrote:
On Jun 20, 2006, at 8:09 AM, Michal Suchanek wrote:
> If I read pieces of text from web pages they can be in different
> encodings. I do not see any reason why such pieces of text could not
> be automatically concatenated as long as they are all subsets of
> Unicode.
I'm not sure I understand what 'subset of unicode' means.
Do you mean two different encodings of Unicode code points?
As in 'UTF-8 and UTF-16 are subsets of Unicode'?
That usage seems unusual to me. Are you using 'subset' and 'encoding'
as synonyms, or am I missing a subtle difference?
In message "Re: Unicode roadmap?" on Tue, 20 Jun 2006 23:33:43 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:
No, I meant that the strings are, of course, converted to a common
encoding such as utf-8 before they are concatenated.
The point is that you do not have to care which encoding you
obtained the pieces in, or convert them manually to a common encoding,
if the string class can do it automatically for you.
If you choose to convert all input text data into Unicode (and convert
them back at output), there's no need for unreliable automatic
conversion.
That's what I suggested, basically. The problem seems to be mainly
non-Unicode demands on the one hand, and performance issues on the
other. And it makes Strings useless as byte buffers, since you have to
specify the encoding of the external representation you create the
String from at creation time. To recap:
- Private extensions to Unicode are deemed too complex to implement
  (Matz).
- Transforming legacy or special (non-Unicode) data to a Ruby-private
  internal storage format on I/O is too performance/space intensive
  (Matz).
- Strings as byte buffers are important to some people, and they don't
  want to use another class or array for it, even if RegExp et al. would
  be extended to handle these too.
- While it would be proper OO design, encapsulating the internal String
  implementation hampers direct access to the "raw" data for C hackers,
  creating unwanted hurdles, and again performance issues.
I am still not convinced the arguments against this approach really
will hold in the long run, but since I am not the one implementing it
and can't really participate there due to language barriers, I can
only lean back and wait for the first release of M17N. Learning
English was hard enough for me.
-Jürgen
···
On Wed, Jun 21, 2006 at 01:04:55AM +0900, Tim Bray wrote:
On Jun 20, 2006, at 6:54 AM, Timothy Bennett wrote:
>Having different encodings on one web page is a good way to make
>sure that
>the page won't display correctly
...
> So, no, I don't think it should
>be valid to concatenate strings with different encodings.
Well, unless you had a String class that took care of the encoding
details and, when you were ready to output, allowed you to say "Give
me that in ISO-8859 or UTF-8 or whatever". -Tim
--
The box said it requires Windows 95 or better so I installed Linux
Well, it's actually you who chose the conversion on input for me.
Since the strings aren't automatically converted, I have to ensure
that all my strings are encoded using the same encoding. And the only
reasonable way I can think of is to convert any string that enters my
application (or class) to an arbitrary encoding I choose in advance.
This is no more reliable than automatic conversion. The reliability or
(un)reliability of the conversion is based on the (un)reliability with
which the actual encoding of the string is determined when it is
obtained. If the encoding tag is wrong the string will be converted
incorrectly. That is the only cause of incorrect conversion, whether
it happens manually or automatically.
If conversion was done automatically by the string class, it could be
performed lazily. The strings are kept in the encoding in which they
were obtained, and only converted when needed because they are
combined with a string in a different encoding. And users of the
strings still have the choice to convert them explicitly when they see
fit.
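A hypothetical sketch of that lazy scheme (LazyString is invented here
for illustration and is not a real class):

  class LazyString
    attr_reader :str

    def initialize(str)
      @str = str
    end

    def +(other)
      if @str.encoding == other.str.encoding
        LazyString.new(@str + other.str)   # same encoding: no conversion
      else
        # Encodings differ: convert only now, at the point of combination.
        LazyString.new(@str.encode("UTF-8") + other.str.encode("UTF-8"))
      end
    end
  end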
When such automatic conversion is not available it makes interfacing
with libraries that fetch external data more difficult.
a) I could instruct the library that fetches data from a database or
the web to always return data in the encoding I chose for
representing strings in my application, regardless of the encoding
the data was originally obtained in.
The disadvantage is that if the encoding was determined incorrectly on
input to the library the data is already garbled.
b) I could get the data from the library in the original encoding in
which it was obtained. Either because I would like to check that the
encoding is correct before converting the data or because the library
does not implement the interface for (a).
The disadvantage is that I have to traverse a potentially complex data
structure and convert all strings so that they work with the other
strings inside my application.
c) Every time I perform a string operation I should first check
(manually) that the two strings are compatible (or catch the exception
very near the operation so that I can convert the arguments and retry).
I do not think this is a reasonable option for the common case that
should be made as simple as possible: the strings can be represented
in Unicode. This may be necessary to some extent in applications
dealing with encodings that are incompatible with Unicode but it
should not be required for the common case.
The people with experience from other languages are complaining that
they have to do (b) or (c) because (a) is usually not implemented. And
ensuring any of the three looks like extra work that could be solved
elsewhere: in the string class.
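A sketch of the traversal that (b) requires (deep_encode is a made-up
helper, not a library function):

  # Walk a nested structure returned by a library and normalize every
  # String in it to the application's chosen encoding.
  def deep_encode(obj, enc = "UTF-8")
    case obj
    when String
      obj.encode(enc)
    when Array
      obj.map { |e| deep_encode(e, enc) }
    when Hash
      obj.each_with_object({}) { |(k, v), h| h[deep_encode(k, enc)] = deep_encode(v, enc) }
    else
      obj
    end
  end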
Thanks
Michal
···
On 6/20/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
Hi,
In message "Re: Unicode roadmap?" on Tue, 20 Jun 2006 23:33:43 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:
>No, I meant that the strings are, of course, converted to a common
>encoding such as utf-8 before they are concatenated.
>The point is that you do not have to care which encoding you
>obtained the pieces in, or convert them manually to a common encoding,
>if the string class can do it automatically for you.
If you choose to convert all input text data into Unicode (and convert
them back at output), there's no need for unreliable automatic
conversion.