Unicode roadmap?

Um. Do you mean UTF-32? Because there's *no* binary representation of
Unicode Character Code Points that isn't an encoding of some sort. If
that's the case, it's unacceptable from a memory-usage perspective.

Yes, I do mean the String *interface* to be UTF-32, or pure code
points, which is the same thing but less susceptible to standard
changes, if accessed at character level. If accessed at substring
level, a substring of a String is obviously a String, and you don't
need a bitwise representation at all.

Again, this is completely unacceptable from a memory usage perspective.
I certainly don't want my programs taking up 4x the memory for string
handling.

But "pure code points" is a red herring and a mistake in any case. Code
points aren't sufficient. You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING ACUTE
ACCENT as opposed to A ACUTE). Indeed, some glyphs can *only* be
produced with multiple code points. Dealing with this intelligently
requires a *lot* of smarts, but it's precisely what we should do.
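
To make that concrete, here is a tiny sketch (using the
grapheme-cluster API Ruby only gained years later, in 2.5; purely
illustrative, not something proposed in this thread):

    # One user-perceived character ("glyph"), two code points.
    s = "a\u0301"                 # LOWERCASE A + COMBINING ACUTE ACCENT
    s.length                      #=> 2  (code points)
    s.grapheme_clusters.length    #=> 1  (glyphs)
    "\u00E1".length               #=> 1  (precomposed A ACUTE)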

According to my proposal, Strings do not need an encoding from the
String user's point of view when working just with Strings, and users
won't care apart from memory/performance consumption, which I believe
can be made good enough with a totally encapsulated, internal storage
format to be decided later. I will avoid a premature optimization
debate here now.

Again, you are incorrect. I *do* care about the encoding of each String
that I deal with, because only that allows me (or String) to deal with
conversions appropriately. Granted, *most* of the time, I won't care.
But I do work with legacy code page stuff from time to time, and
pronouncements that I won't care are just arrogance or ignorance.

Of course encoding matters when Strings are read or written somewhere,
or converted to bit-/bytewise representation explicitly. The Encoding
Framework, however it'll look, needs to be able to convert to and from
Unicode code points for these operations only, and not between
arbitrary encodings. (You *may* code this to recode directly from
the internal storage format for performance reasons, but that'll be
transparent to the String user.)
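
As a sketch of that pivot idea, written against the String#encode API
Ruby eventually gained (the encodings named are arbitrary examples):

    # Convert between two legacy encodings by pivoting through Unicode.
    latin2 = "Stra\u00DFe".encode("ISO-8859-2")  # some ISO-8859-2 bytes
    utf8   = latin2.encode("UTF-8")              # legacy -> Unicode
    cp1250 = utf8.encode("Windows-1250")         # Unicode -> other legacy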

I prefer arbitrary encoding conversion capability.

This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue. But Unicode set out to prevent exactly this, and if we
believe in Unicode at all, we can only hope they'll fix this in an
upcoming revision. Meanwhile we could map any additional characters
(or sets thereof) we need to higher, unused Unicode planes; that'll
be no worse than having different, possibly incompatible kinds of
Strings.

Those choices aren't ours to make.

We'll need an additional class for pure byte vectors, or just use
Array for this kind of work, and I think this is cleaner.

I don't. Such an additional class adds unnecessary complexity to
interfaces. This is the *main* reason that I oppose the foolish choice
to pick a fixed encoding for Ruby Strings.

Legacy data and performance.

Map legacy data, that is characters still not in Unicode, to a high
Plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them we can change that to the official
code points. Note there are no files in String's internal storage
format, so we don't have to worry about reencoding them.

Um. This is the statement of someone who is ignoring legacy issues.
Performance *is* a big issue when you're dealing with enough legacy
data. Don't punish people because of your own arrogance about encoding
choices.

Again: Unicode Is Not Always The Right Choice. Anyone who tells you
otherwise is selling you a Unicode toolkit and only has their wallet in
mind. Unicode is *often* the right choice, but it's *not* the only
choice and there are times when having the *flexibility* to work in
other encodings without having to work through Unicode as an
intermediary is the right choice. And from an API perspective,
separating String and "ByteVector" is a mistake.

On the other hand, conversions need to be done at other times with my
proposal than for M17N Strings, and it depends on the application
whether that is more or less often. String-String operations never
need to do recoding, as opposed to M17N Strings. I/O always needs
conversion, and may need conversion with M17N too. I have a hunch
that allowing different kinds of Strings around (as in M17N,
presumably) would require recoding far more often.

Unlikely. Mixed-encoding data handling is uncommon.

-austin

···

On 6/18/06, Juergen Strobel <strobel@secure.at> wrote:

On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

Hi,

>Language implementation, and usage of the String class should be
>easier if this set is
>
>- well defined
>- All characters are equally allowed in all Strings.

I understand these attributes might make implementation easier. But
who cares if I don't care. And I am not sure how these make usage
easier, really.

Somebody who owns gigabytes of text data in legacy encoding (e.g. me),
wants to avoid encoding conversion back and forth between Unicode and
legacy encoding every time. Another somebody wants text processing on
historical texts whose character set is far bigger than Unicode. The
"well-defined" simple implementation just prohibits those demands. On
the contrary, M17N approach does not bother Universal Character Set
solution. You just need to choose Unicode (UTF-8 or UTF-16) as
internal string representation, and convert encoding on I/O as you
might have done in Unicode centric languages. Nothing lost.

You may worry about implementation difficulty (and performance), but
don't. It's _my_ concern. I made a prototype, and am convinced that
I can implement it with acceptable performance.

I never worried about performance much, that's Austin. :P

Thanks for clarifying that. So far I could not find much info on how
exactly M17N will work, especially on the role of the encoding tag, so
I had to guess a lot.

Given your explanation, it seems our ways are quite similar on the
interface side of things, so far as Unicode is concerned. You chose a
more powerful (and more complex) parametric class design where I
would have left open only the possibility of transparently usable
subclasses for performance reasons.

I am happy we've worked that out now. And you are right, I am not that
much interested in the implementation, thank you for doing it. My
concern was with the interface of the String class, but several
posters misunderstood me and tried to draw me into implementation
issues.

Jürgen

···

On Mon, Jun 19, 2006 at 01:33:54AM +0900, Yukihiro Matsumoto wrote:

In message "Re: Unicode roadmap?" > on Sun, 18 Jun 2006 23:46:40 +0900, Juergen Strobel <strobel@secure.at> writes:

>Unicode code points are pretty good in this respect, better than the
>union of all characters in all encodings of possible M17N Strings.
>And we may use private extensions to Unicode for legacy characters not
>included in Unicode already.

"private extensions". No. It just cause another nightmare.

              matz.

--
The box said it requires Windows 95 or better so I installed Linux

IIRC, Matz has said that internally String won't change, and I suspect that
a CharString class (or something like it) won't ever be added.

Maybe just introducing a String#encoding flag and adding new methods to
String with prefixes, like char_array, char_slice, char_length, char_index,
char_downcase, char_strcoll, char_strip, etc. that will internally look at
the encoding flag and process the bytes in this particular string
accordingly, without conversion (just maybe some hidden), and leaving the
old byte-processing methods intact, would be the way to keep older code
working and enjoy M17N?

Though, as for me, it is still unclear, what should happen, if one tries to
perform operation on two strings with different String#encoding...

But quite a few people here look like they do know. I do not know much
about regexes but I can imagine just about any other string operation.
And the current regexes already do operate on multiple encodings.

Oh, lord... Have you at least tried that before making such assumptions? In other words, tell me, can Ruby's regexes cope with the following:

/[ะฐ-ั]/
/[ะฐ-ั]/i

or something like this:
http://rubyforge.org/cgi-bin/viewvc.cgi/icu4r/samples/demo_regexp.rb?revision=1.2&root=icu4r&view=markup

And how that leads to the conclusion that there should be only one encoding?

Very simply - I use many pieces of software written in many languages all the time, with non-Latin text.
I know that when they want to get "historically compatible", problems arise. And the software that settles on Unicode
internally or somehow enforces it on the programmer usually works best (all Cocoa and all C#. And to a certain extent, yes, Java).

Bluntly put, I am selfish and I don't believe in the "saving grace"
of the M17N (because I just can't wrap it around my head and I sure
as hell know it's going to be VERY complex).

That's the point. If it is wrapped into the string class you do not
have to wrap it around your head.

This is rather naive.

And that is exactly why a fixed encoding is bad. If strings can be
encoded in any way, there is no point in religious discussions about
which encoding you like the most.

Yes, it just becomes hard and error prone to process them.

It is JustGoodEnough for most cases but not for all. It is not useless
for CJK, just suboptimal because of the Han unification. And it also
does not try to include the historic characters.

I think this thread is going to end the same as the one in 2002 did.

···

On 18-jun-2006, at 13:08, Michal Suchanek wrote:

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING ACUTE
ACCENT as opposed to A ACUTE).

This is another thing you need your String class to be smart about. You want an equality test between "más" and "más" to always be true even if their "á" characters are encoded differently. The right way to solve this is called "Early Uniform Normalization" (see Character Model for the World Wide Web 1.0); the idea is you normalize the composed characters at the time you create the string, then the internal equality test can be done with strcmp() or equivalent.
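
A minimal sketch of Early Uniform Normalization, assuming an NFC
normalizer like the String#unicode_normalize Ruby later shipped:

    # Normalize once at creation time; afterwards a plain bytewise
    # compare is sufficient.
    def make_string(raw)
      raw.unicode_normalize(:nfc)
    end

    a = make_string("ma\u0301s")  # "más" with a combining accent
    b = make_string("m\u00E1s")   # "más" precomposed
    a == b                        #=> true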

Map legacy data, that is characters still not in Unicode, to a high
Plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them we can change that to the official
code points. Note there are no files in String's internal storage
format, so we don't have to worry about reencoding them.

Um. This is the statement of someone who is ignoring legacy issues.
Performance *is* a big issue when you're dealing with enough legacy
data.

Note that you don't have to use a high plane. The Private Use Area in the Basic Multilingual Plane has 6,400 code points, which is quite a few. Even if you did use a high plane, it's not obvious there'd be a detectable runtime performance penalty.
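
A sketch of such a PUA mapping (the legacy byte values and the table
are made up for illustration):

    # Map legacy characters that Unicode lacks into the BMP Private
    # Use Area (U+E000..U+F8FF).
    PUA_BASE = 0xE000
    LEGACY_TO_PUA = { 0x81 => PUA_BASE, 0x82 => PUA_BASE + 1 }  # hypothetical

    ch = LEGACY_TO_PUA[0x81].chr(Encoding::UTF_8)
    ch.ord.to_s(16)               #=> "e000"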

Unicode is *often* the right choice, but it's *not* the only
choice and there are times when having the *flexibility* to work in
other encodings without having to work through Unicode as an
intermediary is the right choice.

That may be the case. You need to do a cost-benefit analysis; you could buy a lot of simplicity by decreeing all-Unicode-internally; would the benefits of allowing non-Unicode characters be big enough to compensate for the loss of simplicity? I don't know the answer, but it needs thinking about.

  -Tim

···

On Jun 18, 2006, at 8:29 AM, Austin Ziegler wrote:

Tim Bray <tbray@textuality.com> writes:

You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING
ACUTE
ACCENT as opposed to A ACUTE).

This is another thing you need your String class to be smart about.
You want an equality test between "más" and "más" to always be true
even if their "á" characters are encoded differently. The right way to
solve this is called "Early Uniform Normalization" (see Character
Model for the World Wide Web 1.0); the idea
is you normalize the composed characters at the time you create the
string, then the internal equality test can be done with strcmp() or
equivalent.

Does that mean that binary.to_unicode.to_binary != binary is possible?
That could turn out pretty bad, no?

···

-Tim

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Hi,

···

In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 00:29:46 +0900, Julian 'Julik' Tarkhanov <listbox@julik.nl> writes:

In other words, tell me, can Ruby's regexes cope with the following:

/[ะฐ-ั]/
/[ะฐ-ั]/i

1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.

              matz.

And it does as long as you are careful. One of the things I do is normalize everything that comes IN
into something that is suitable and predictable.

···

On 18-jun-2006, at 21:17, Christian Neukirchen wrote:

Tim Bray <tbray@textuality.com> writes:

You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING
ACUTE
ACCENT as opposed to A ACUTE).

This is another thing you need your String class to be smart about.
You want an equality test between "más" and "más" to always be true
even if their "á" characters are encoded differently. The right way to
solve this is called "Early Uniform Normalization" (see Character
Model for the World Wide Web 1.0); the idea
is you normalize the composed characters at the time you create the
string, then the internal equality test can be done with strcmp() or
equivalent.

Does that mean that binary.to_unicode.to_binary != binary is possible?
That could turn out pretty bad, no?

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Yes, but having "más" != "más" is pretty bad too; the alternative is normalizing at comparison time, which would really hurt for example in a big sort, so you'd need to cache the normalized form, which would be a lot more code.

binary.to_unicode looks a little weird to me... can you do that without knowing what the binary is? If it's text in a known encoding, no breakage should occur. If it's unknown bit patterns, you can't really expect anything sensible to happen... or am I missing an obvious scenario? -Tim

···

On Jun 18, 2006, at 12:17 PM, Christian Neukirchen wrote:

This is another thing you need your String class to be smart about.
You want an equality test between "más" and "más" to always be true
even if their "á" characters are encoded differently. The right way to
solve this is called "Early Uniform Normalization" (see Character
Model for the World Wide Web 1.0); the idea
is you normalize the composed characters at the time you create the
string, then the internal equality test can be done with strcmp() or
equivalent.

Does that mean that binary.to_unicode.to_binary != binary is possible?
That could turn out pretty bad, no?

I'll try to check. Oniguruma on 1.8.4 didn't cope, but maybe it just wasn't hooked in properly.

···

On 19-jun-2006, at 1:00, Yukihiro Matsumoto wrote:

Hi,

In message "Re: Unicode roadmap?" > on Mon, 19 Jun 2006 00:29:46 +0900, Julian 'Julik' Tarkhanov > <listbox@julik.nl> writes:

>In other words, tell me, can Ruby's regexes cope with the following:
>
>/[ะฐ-ั]/
>/[ะฐ-ั]/i

1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Hi,

···

In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 08:09:29 +0900, Julian 'Julik' Tarkhanov <listbox@julik.nl> writes:

>/[ะฐ-ั]/
>/[ะฐ-ั]/i

1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.

I'll try to check. Oniguruma on 1.8.4 didn't cope, but maybe it just
wasn't hooked in properly.

If you have any problem, send us a report with what you expect and
what you get.

              matz.

Tim Bray <tbray@textuality.com> writes:

···

On Jun 18, 2006, at 12:17 PM, Christian Neukirchen wrote:

This is another thing you need your String class to be smart about.
You want an equality test between "más" and "más" to always be true
even if their "á" characters are encoded differently. The right way to
solve this is called "Early Uniform Normalization" (see Character
Model for the World Wide Web 1.0); the idea
is you normalize the composed characters at the time you create the
string, then the internal equality test can be done with strcmp() or
equivalent.

Does that mean that binary.to_unicode.to_binary != binary is
possible?
That could turn out pretty bad, no?

Yes, but having "más" != "más" is pretty bad too; the alternative is
normalizing at comparison time, which would really hurt for example
in a big sort, so you'd need to cache the normalized form, which
would be a lot more code.

binary.to_unicode looks a little weird to me... can you do that
without knowing what the binary is? If it's text in a known
encoding, no breakage should occur. If it's unknown bit patterns,
you can't really expect anything sensible to happen... or am I
missing an obvious scenario? -Tim

Those were just fictive method calls. But let's say I read from
a pipe and I know it contains UTF-16 with BOM, then .to_unicode
would make perfect sense, no?

In case of binary bit patterns, I sooner or later would expect some
kind of EncodingError, given this API. (I haven't seen yet drafts of
how the API really will be.)

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Well, I tried on the CVS latest (1.9) and I get:

irb(main):011:0> "НЕБлагодарНая" =~ /[а-я]/i
=> 6 (should be zero)

That is - character classes work, casefolding doesn't.
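
(For reference, once full Unicode case-folding tables landed in the
regexp engine, the same test matches at offset 0 on current Rubies:)

    "НЕБлагодарНая" =~ /[а-я]/i   #=> 0 on a modern Ruby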

···

On 19-jun-2006, at 1:56, Yukihiro Matsumoto wrote:

Hi,

In message "Re: Unicode roadmap?" > on Mon, 19 Jun 2006 08:09:29 +0900, Julian 'Julik' Tarkhanov > <listbox@julik.nl> writes:

>> >/[ะฐ-ั]/
>> >/[ะฐ-ั]/i
>>
>> 1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.
>
>I'll try to check. Oniguruma on 1.8.4. didn't cope, but maybe it just
>weren't hooked in properly.

If you have any problem, send us a report with what you expect and
what you get.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Yep. And yes, calling to_unicode on it might in fact change the bit patterns if you adopted Early Uniform Normalization (which would be a good thing to do). -Tim

···

On Jun 19, 2006, at 4:16 AM, Christian Neukirchen wrote:

Does that mean that binary.to_unicode.to_binary != binary is
possible?
That could turn out pretty bad, no?

Yes, but having "más" != "más" is pretty bad too; the alternative is
normalizing at comparison time, which would really hurt for example
in a big sort, so you'd need to cache the normalized form, which
would be a lot more code.

binary.to_unicode looks a little weird to me... can you do that
without knowing what the binary is? If it's text in a known
encoding, no breakage should occur. If it's unknown bit patterns,
you can't really expect anything sensible to happen... or am I
missing an obvious scenario? -Tim

Those were just fictive method calls. But let's say I read from
a pipe and I know it contains UTF-16 with BOM, then .to_unicode
would make perfect sense, no?

Hi,

···

In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 10:32:08 +0900, Julian 'Julik' Tarkhanov <listbox@julik.nl> writes:

Well, I tried on the CVS latest (1.9) and I get:

irb(main):011:0> "НЕБлагодарНая" =~ /[а-я]/i
=> 6 (should be zero)

That is - character classes work, casefolding doesn't.

I found out that Oniguruma casefolding works only for characters
within iso8859-*. Considering the size of the casefolding table, it is
a compromise for the time being. I will fix this in the future.

              matz.

Thanks for the clarification :)

···

On 19-jun-2006, at 6:05, Yukihiro Matsumoto wrote:

>
>That is - character classes work, casefolding doesn't.

I found out that Oniguruma casefolding works only for characters
within iso8859-*. Considering the size of the casefolding table, it is
a compromise for the time being. I will fix this in the future.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Correct me,if I'm wrong, but for Matz's plan on M17N, summary is:
1. String internally will remain the same : char *ptr, long len - in bytes
2. String instances will have encoding tag
3. All String/Regexp methods will respect that encoding tag and return
char(glyph) indexes
4. Methods like byte_size, codepoints, each_char, each_codepoint will be
introduced(?)
5. slice will always accept chars indices and return substrings

I'd say that WOULD BE GOOD, and with methods like
String#enforce_encoding!(encoding) and String#coerce_encoding!(otherstring)
it won't require developers (for C extensions also) to look at the
encoding tag, just set it when needed.

But, I can see several implementation issues and possible options that
should be considered:
- what will happen if one tries to perform str1.operation(str2) on two
strings with different encodings:
   a) raise exception
   b) silently coerce one or both strings to some "compatible"
charset/encoding, update encoding of result, replacing non-convertible chars
using fallback mappings? (ouch, this can be split into a set of options)
   c) same as b) but raise exception if lossless conversion is not possible?
   d) same as b) but warn if lossless conversion is not possible?
   e) downgrade encoding tag of acceptor to "raw/bytes" and process it?

- what will happen if one changes encoding tag for a String instance:
   a) check and raise exception if current bytes don't represent a valid
encoding sequence?
   b) just set new tag?
   c) convert byte sequence to given encoding, using fallback mappings?

- what to do with IO:
   a) IO will return strings in "raw/bytes"?
   b) IO can be tagged and will return Strings with given encoding tag?
   c) IO can be tagged and is by default tagged with global encoding tag?
   d) IO can be tagged, but is not tagged by default, although methods
returning strings (such as read, readlines) will use global encoding tag?
   e) if IO is tagged and one tries to write to it a String with different
encoding, what will happen?

- what will be the default encoding tag for new Strings:
   a) "raw/bytes"
   b) derived from system properties of host platform
   c) option b) and can be overridden in application (btw, $KCODE, as present,
must definitely go away!!!)

- how to process source code files:
   a) restrict them to ASCII and require all non-ASCII strings to be
externalized?
   b) process them as "raw/bytes"?
   c) introduce some kind of commented pragma for source files allowing to
set encoding?

- at present the Ruby parser can parse only sources in ASCII-compatible
encodings. Would that change?

- what encoding will Numeric#to_s, Time#to_s etc. produce, and what
encoding does a String have to conform to for String#to_f, String#to_i?

On Unicode:
- case-independent canonical string matches/searches DO MATTER. And even for
encodings that encode variants of glyphs at different codepoints,
"variant-insensitive" search, as for me, is desired. Will there be such
functionality?

- string comparison: will <=> use at least UCA rules for Unicode strings, or
only byte-order comparisons will stay?

- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter, when writing
a custom parser. Will those methods be provided for one-char strings?

Yes, this is a short and incomplete list, but you should get my point: it's
not that easy -- there are dozens of decisions, with their pros and cons, to
be made and implemented :(

Hi,

But, I can see several implementation issues and possible options that
should be considered:

Thank you for the ideas.

- what will happen if one tries to perform str1.operation(str2) on two
strings with different encodings:
a) raise exception
b) silently coerce one or both strings to some "compatible"
charset/encoding, update encoding of result, replacing non-convertible chars
using fallback mappings? (ouch, this can be split into a set of options)
c) same as b) but raise exception if lossless conversion is not possible?
d) same as b) but warn if lossless conversion is not possible?
e) downgrade encoding tag of acceptor to "raw/bytes" and process it?

a), unless either of the strings is "ascii" and the other is "ascii"
compatible. This point is arguable.
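
This is essentially the rule Ruby 1.9 later shipped; a sketch of its
effect:

    utf8 = "résumé"                 # UTF-8 string
    utf8 + "plain ascii"            # ok: other operand is 7-bit ASCII
    koi8 = "текст".encode("KOI8-R")
    utf8 + koi8                     # raises Encoding::CompatibilityError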

- what will happen if one changes encoding tag for a String instance:
a) check and raise exception if current bytes don't represent a valid
encoding sequence?
b) just set new tag?
c) convert byte sequence to given encoding, using fallback mappings?

b), encoding conformance check shall be done lazily. I think there's a
need for an explicit encoding conformance check method.
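
Ruby 1.9 ended up with exactly this split: retagging is lazy, and
String#valid_encoding? is the explicit check. A sketch:

    s = "\xC3\x28".force_encoding("UTF-8")  # tag set; no validation yet
    s.valid_encoding?                       #=> false (broken UTF-8 sequence)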

- what to do with IO:
a) IO will return strings in "raw/bytes"?
b) IO can be tagged and will return Strings with given encoding tag?
c) IO can be tagged and is by default tagged with global encoding tag?
d) IO can be tagged, but is not tagged by default, although methods
returning strings (such as read, readlines) will use global encoding tag?
e) if IO is tagged and one tries to write to it a String with different
encoding, what will happen?

c), the global default shall be set from locale setting.
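
As later shipped in 1.9: Encoding.default_external is initialized from
the locale, and individual IOs can override it (the file name below is
hypothetical):

    Encoding.default_external       # e.g. #<Encoding:UTF-8>, from locale
    File.open("data.txt", "r:ISO-8859-1") do |f|
      f.read.encoding               #=> #<Encoding:ISO-8859-1>
    end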

- what will be the default encoding tag for new Strings:
a) "raw/bytes"
b) derived from system properties of host platform
c) option b) and can be overridden in application (btw, $KCODE, as present,
must definitely go away!!!)

The encoding for literal strings is set by pragma.

- how to process source code files:
a) restrict them to ASCII and require all non-ASCII strings to be
externalized?
b) process them as "raw/bytes"?
c) introduce some kind of commented pragma for source files allowing to
set encoding?

1.9 already has an encoding pragma a la Python PEP 263.
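
The pragma is a magic comment on the first line of the file (or the
second, after a shebang):

    # encoding: utf-8
    # With the magic comment above, literals below are tagged UTF-8.
    str = "déjà vu"
    str.encoding                    #=> #<Encoding:UTF-8>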

- at present the Ruby parser can parse only sources in ASCII-compatible
encodings. Would that change?

No. Ruby would not allow scripts in EBCDIC, nor UTF-16, although it
allows processing of those encodings.

- what encoding will Numeric#to_s, Time#to_s etc. produce, and what
encoding does a String have to conform to for String#to_f, String#to_i?

Good point. Currently, I think they should work on ASCII.

On Unicode:
- case-independent canonical string matches/searches DO MATTER. And even for
encodings that encode variants of glyphs at different codepoints,
"variant-insensitive" search, as for me, is desired. Will there be such
functionality?

Casefold search/match will be provided for Regexp. "variant
insensitive" search should be accomplished by explicit normalization
or collation.

- string comparison: will <=> use at least UCA rules for Unicode strings, or
only byte-order comparisons will stay?

Byte order comparison. UCA rules or such should be done explicitly
via normalization or collation.
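
So bytewise <=> can disagree with linguistic order; for example, in
UTF-8 accented letters sort after all of ASCII:

    "échelle" <=> "zebra"   #=> 1  ("é" starts with 0xC3, past "z" = 0x7A)
    # Linguistic ordering needs an explicit collation step instead.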

- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter, when writing
a custom parser. Will those methods be provided for one-char strings?

Those functions will be provided via Regexp. I am not sure if we will
provide character classification methods for strings.

              matz.

···

In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 14:57:22 +0900, "Dmitry Severin" <dmitry.severin@gmail.com> writes:

Hi,

>But, I can see several implementation issues and possible options that
>should be considered:

Thank you for the ideas.

>- what will happen if one tries to perform str1.operation(str2) on two
>strings with different encodings:
> a) raise exception
> b) silently coerce one or both strings to some "compatible"
>charset/encoding, update encoding of result, replacing non-convertible chars
>using fallback mappings? (ouch, this can be split into a set of options)
> c) same as b) but raise exception if lossless conversion is not possible?
> d) same as b) but warn if lossless conversion is not possible?
> e) downgrade encoding tag of acceptor to "raw/bytes" and process it?

a), unless either of the strings is "ascii" and the other is "ascii"
compatible. This point is arguable.

What is "ascii"? Specifically I would like string operations to suceed
in cases when both strings are encoded as different subset of Unicode
(or anything else). ie concatenating an ISO-8859-2 and an ISO-8859-1
string sould result in UTF-* string, not an error.

However, this would make the errors from incompatible encodings more
surprising as they would be very infrequent.

I wonder what operations on raw strings (ones without specified
encoding) would do. Or where one of the strings is raw, and the other
is not.

>- what to do with IO:
> a) IO will return strings in "raw/bytes"?
> b) IO can be tagged and will return Strings with given encoding tag?
> c) IO can be tagged and is by default tagged with global encoding tag?
> d) IO can be tagged, but is not tagged by default, although methods
>returning strings (such as read, readlines) will use global encoding tag?
> e) if IO is tagged and one tries to write to it a String with different
>encoding, what will happen?

c), the global default shall be set from locale setting.

I am not sure this is good for network IO as well. For diagnostics it
might be useful to set the default to none, and have string raise an
exception when such strings are combined with other strings.

It is only obvious for STDIN and STDOUT that they should follow the
locale setting.

hmm, but one would need to carefully consider which operations should
work on raw strings and which not. Perhaps it is not as nice as it
looks at first glance.

Thanks

Michal

···

On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

In message "Re: Unicode roadmap?" > on Mon, 19 Jun 2006 14:57:22 +0900, "Dmitry Severin" <dmitry.severin@gmail.com> writes:

And what about the minilanguages incorporated in Ruby: regexp patterns,
sprintf and strftime patterns, etc.?
Regexp syntax uses several metacharacters ( {}()+-*?.\: ) and Latin
letters, lower- and uppercase.
But there are charsets/encodings which don't have some of them, e.g.:
GB_2312-80 has none of them, JIS_X0201 doesn't have backslash, and
ebcdic-cp-ar1 doesn't have backslash, square brackets, or curly brackets.
So, regexp patterns can't be constructed in these charsets/encodings.

···

On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

>- at present the Ruby parser can parse only sources in ASCII-compatible
>encodings. Would that change?

No. Ruby would not allow scripts in EBCDIC, nor UTF-16, although it
allows processing of those encodings.