>> What I want is all methods working seamlessly with unicode strings so
>> that I do not have to think about the encoding.
>
>That will *never* happen. Even with Unicode, you have to think about
>the encoding, because UTF-32 (the closest representation to the
>Platonic ideal "Unicode" you'll ever find) is unlikely to be supported
>in the general case. Matz's idea of m17n strings is the right one: you
>have a "byte stream" and an attribute which indicates how the byte
>stream is encoded. This will sort of be like $KCODE but on an
>individual string level so that you could meaningfully have Unicode
>(probably UTF-8) and ShiftJIS strings in the same data and still
>meaningfully call #length on them.
>
>You will *always* have to care about the encoding. As well as,
>ultimately, your locale.
No. Since I have locale stdin can be marked with the proper encoding
information so that all stings originating there have the proper
encoding information.
The string methods should not just blindly operate on bytes but use
the encoding information to operate on characters rather than bytes.
Sure something like byte_length is needed when the string is stored
somewhere outside Ruby but standard string methods should work with
character offsets and characters, not byte offsets nor bytes.
I empathically agree. I'll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:
1. Strings should deal in characters (code points in Unicode) and not
in bytes, and the public interface should reflect this.
2. Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be encapsulated
by the string class completely, except for a few related classes which
may opt to work with the gory details for performance reasons.
The internal encoding has to be decided, probably between UTF-8,
UTF-16, and UTF-32 by the String class implementor.
3. Whenever Strings are read or written to/from an external source,
their data needs to be converted. The String class encapsulates the
encoding framework, likely with additional helper Modules or Classes
per external encoding. Some methods take an optional encoding
parameter, like #char(index, encoding=:utf8), or
#to_ary(encoding=:utf8), which can be used as helper Class or Module
selector.
4. IO instances are associated with a (modifyable) encoding. For
stdin, stdout this can be derived from the locale settings. String-IO
operations work as expected.
5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.
6. More exotic operations can easily be provided by additional
libraries because of Ruby's open classes. Those operations may be
coded depending on on String's public interface for simplicissity, or
work with the internal representation directly for performance.
7. This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like FixInt
and BigInt).
8. Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby's canonical String class. This will break some
old uses of String, but now is the right time for that.
9. The String class does not worry over character representation
on-screen, the mapping to glyphs must be done by UI frameworks or the
terminal attached to stdout.
10. Be flexible. <placeholder for future idea>
This approach has several advantages and a few disadvantages, and I'll
try to bring in some new angles to this now too:
*Advantages*
-POL, Encapsulation-
All Strings behave exactly the same everywhere, are predictable,
and do the hard work for their users.
-Cross Library Transparency-
No String user needs to worry which Strings to pass to a library, or
worry which Strings he will get from a library. With Web-facing
libraries like rails returning encoding-tagged Strings, you would be
likely to get Strings of all possible encodings otherwise, and isthe
String user prepared to deal with this properly? This is a *big* deal
IMNSHO.
-Limited Conversions-
Encoding conversions are limited to the time Strings are created or
written or explicitly transformed to an external representation.
-Correct String Operations-
Even basic String operations are very hard in the world of Unicode. If
we leave the String users to look at the encoding tags and sort it out
themselves, they are bound to make mistakes because they don't care,
don't know, or have no time. And these mistakes may be _security_
_sensitive_, since most often credentials are represented as Strings
too. There already have been exploits related to Unicode.
*Disadvantages* (with mitigating reasoning of course)
- String users need to learn that #byte_length(encoding=:utf8) >=
#size, but that's not too hard, and applies everywhere. Users do not
need to learn about an encoding tag, which is surely worse to handle
for them.
- Strings cannot be used as simple byte buffers any more. Either use
an array of bytes, or an optimized ByteBuffer class. If you need
regular expresson support, RegExp can be extended for ByteBuffers or
even more.
- Some String operations may perform worse than might be expected from
a naive user, in both the time or space domain. But we do this so the
String user doesn't need to himself, and are problably better at it
than the user too.
- For very simple uses of String, there might be unneccessary
conversions. If a String is just to be passed through somewhere,
without inspecting or modifying it at all, in- and outwards conversion
will still take place. You could and should use a ByteBuffer to avoid
this.
- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don't commit to a
particular encoding of Unicode strongly.
- More work and time to implement. Some could call it
over-engineered. But it will save a lot of time and troubles when shit
hits the fan and users really do get unexpected foreign characters in
their Strings. I could offer help implementing it, although I have
never looked at ruby's source, C-extensions, or even done a lot of
ruby programming yet.
Close to the start of this discussion Matz asked what the problem with
current strings really was for western users. Somewhere later he
concluded case folding. I think it is more than that: we are lazy and
expect character handling to be always as easy as with 7 bit ASCII, or
as close as possible. Fixed 8-bit codepages worked quite fine most of
the time in this regard, and breakage was limited to special
characters only.
Now let's ask the question in reverse: are eastern programmers so used
to doing elaborate byte-stream to character handling by hand they
don't recognize how hard this is any more? Surely it is a target for
DRY if I ever saw one. Or are there actual problems not solveable this
way? I looked up the mentioned Han-Unification issue, and as far as I
understood this could be handled by future Unicode revisions
allocating more characters, outside of Ruby, but I don't see how it
requires our Strings to stay dumb byte buffers.
Jรผrgen
ยทยทยท
On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
>On 6/14/06, Michal Suchanek <hramrach@centrum.cz> wrote:
Since my stdout can be also marked with correct encoding the strings
that are output there can be converted to that encoding. Even if it
originates from a source file that happens to be in a different
encoding.
Hmm, prehaps it will be necessary to mark source files with encoding
tags as well. It could be quite tedious to assingn the tag manually to
every string in a source file.
When strings are compared, concatenated, .. the encoding is known so
the methods should do the right thing.
I do not have to care about encoding. You may make a string
implemenation that forces me to care (such a the current one). But I
do not have to. I can always turn to perl if I get really desperate.
Thanks
Michal
--
The box said it requires Windows 95 or better so I installed Linux