Unicode roadmap?

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.
>
> BTW, does String#length work well for you?

To get the length of a Unicode string, just do str.split(//).length,
or put "require 'jcode'" at the beginning of your code.
For the other functions, try looking at the unicode library
http://www.yoshidam.net/Ruby.html#unicode
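For instance, a quick sketch of the counting workaround (Ruby 1.8; the
//u flag is the explicit form of what -Ku or $KCODE = 'u' gives you
globally):

    str = "Привет"          # Russian; 6 characters, 12 bytes in UTF-8
    str.length              # => 12 -- bytes, not characters
    str.split(//u).length   # => 6  -- the u flag makes the regexp UTF-8-aware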

I know about it. But, theoretically speaking, such "core" methods must
be in core, no?

> > > also, some other classes can be affected by Unicode (possibly
> > > regexps, and paths). Regexps seem to work fine (in my 1.9), but
> > > paths are not: File.open with Russian letters in the path doesn't
> > > find the file.
> >
> > Strange. Ruby does not convert encodings, so there should be no
> > problem opening files if you are using strings in the encoding your
> > OS expects. If they differ, you have to specify (and convert) them
> > properly, no matter what the Unicode support is.
>
> Oh, it's a bit of a hard topic for me. I know Windows XP must support
> Unicode file names; I see my filenames in Russian, but I have too
> little knowledge of system internals to say whether they are really
> Unicode.

Windows XP does support Unicode file names, but I'm not sure you can
use them with Ruby (I do not use Ruby much under Windows). Try
converting the file names to your current locale, it should work if
the file names can be converted to it. What I mean is that Russian
file names encoded in the Windows Russian encoding should work on a
Russian PC.
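For example, a minimal sketch of that conversion with the standard
iconv library (the UTF-8 source name here is only an assumption; any
source encoding works):

    require 'iconv'

    utf8_name = "отчёт.txt"  # a file name held in UTF-8
    # convert to the Windows Russian code page before handing it to the OS:
    local_name = Iconv.iconv('CP1251', 'UTF-8', utf8_name).first
    file = File.open(local_name)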

Yes, they work. But I still can't settle the question: should Ruby's
Unicode support include filename operations?

V.

···

From: Vincent Isambart [mailto:vincent.isambart@gmail.com]
Sent: Wednesday, June 14, 2006 10:14 AM

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.

Just to chime in, aren't upcase, downcase, and capitalize a locale/
localization issue rather than a Unicode-only issue per se? For
example, different languages will have different rules for
capitalization.

Really? I know about two cases: European capitalization and no
capitalization.

But really, you may be right. I suppose Florian Gross could say something
about German-specific capitalization issues.

Granted, proper support for upcase, downcase, and capitalize is
important, but I think it's a separate issue, part of m17n as a whole
rather than support for Unicode in particular.

Maybe. Generally, sometimes I want Unicode, and sometimes (for "quick and
dirty" scripts) I'd prefer capitalization and regexps to "just work" with
Windows-1251 (a one-byte Russian encoding).

V.

···

From: Michael Glaesemann [mailto:grzm@seespotcode.net]
Sent: Wednesday, June 14, 2006 10:08 AM

On Jun 14, 2006, at 15:56, Victor Shepelev wrote:

>> Can you show us your
>> concrete problems caused by Ruby's lack of "proper" Unicode support?
>
>As mentioned in this topic, it's String#length, upcase, downcase,
>capitalize.

OK. Case is the problem. I understand.

>BTW, does String#length work well for you?

I don't remember the last time I needed a length method to count
characters. Actually, I don't measure string length at all, in either
bytes or characters, in my string processing. Maybe this is a special
case: I am too used to doing Ruby string operations with Regexp.

I can confirm. But I'm afraid that some libraries I rely on use #length and
can break when #length doesn't work.

>Oh, it's a bit of a hard topic for me. I know Windows XP must support
>Unicode file names; I see my filenames in Russian, but I have too little
>knowledge of system internals to say whether they are really Unicode.

Windows 32 path encoding is a nightmare. Our Win32 maintainers are often
troubled by unexpected OS behavior. I am sure we _can_ handle Russian
path names, but we need help from Russian people to improve it.

In the Russian encoding (Windows-1251) on a Russian PC, everything works
well. In Unicode it doesn't, but I'm not convinced it must.

In any case, I'm ready to spend my time helping the Ruby community (especially
with Russian/Ukrainian localization issues), because I really love the
language.

V.

···

From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 10:20 AM

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 15:56:02 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

str.sub!('32 path encoding ', '') # :)

I don't use Windows much, but as I understand it, Ruby interacts with
most of the Win32 API using the 'legacy code page', which is only a
subset of what the filesystem can handle. (Windows NT and its
successors use Unicode internally, and the filesystem is UTF-16
KC-normalised IIRC). Windows does provide Unicode API functions, but
to use those, a layer of translation between UTF-16 and UTF-8 would be
needed, as Ruby can't do anything useful with UTF-16 at present. I
believe that Austin Ziegler was looking into this; I don't know if
he's made any progress.

Even if a Ruby program uses UTF-8 internally, it should be possible to
access the filesystem by Iconv'ing paths to the appropriate code page
- providing that they don't contain characters not in the code page.
It's far from ideal, though: the real solution is for Ruby to use the
Unicode functions (those suffixed with W) in the API. The upside is
that UTF-8/UTF-16 conversion should be less expensive than the code
page conversion that's inside each of Win32's non-Unicode functions.

On the other hand, plenty of Windows programs don't support Unicode
properly either.

Paul.

···

On 14/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

Windows 32 path encoding is a nightmare. Our Win32 maintainers are often
troubled by unexpected OS behavior. I am sure we _can_ handle Russian
path names, but we need help from Russian people to improve it.

You can't currently use them with Ruby. The file operations in Ruby
are using the likes of CreateFileA instead of CreateFileW (it's not
that explicit; Ruby is compiled without -DUNICODE -- which is the
correct thing to do in Ruby's case -- which means that CreateFile is
CreateFileA).

All files are stored on the filesystem as UTF-16, though, even if you
are using "ANSI" access.

By the way, there are multiple Russian encodings, so Unicode is
better on this point. As I said in my previous message, I have
already planned to enhance the Windows filesystem support when Matz
gets the m17n strings in so that I can *always* force the file
routines on Windows to provide either UTF-8 or UTF-16 (probably the
former, since it will also make it easier to work with existing
extensions) and indicate that the strings are such.

-austin

···

On 6/14/06, Vincent Isambart <vincent.isambart@gmail.com> wrote:

Windows XP does support Unicode file names, but I'm not sure you can
use them with Ruby (I do not use Ruby much under Windows). Try
converting the file names to your current locale, it should work if
the file names can be converted to it. What I mean is that Russian
file names encoded in the Windows Russian encoding should work on a
Russian PC.

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

It's not that bad, Matz. I started as a Unix developer, but in the
last two years I have learned *quite* a bit about how Windows handles
this stuff and we can adapt what I did for work with no problem.

I just need M17N strings to support this. I should look at what I
can/should do to provide this as an extension, I just have no time. :(

-austin

···

On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

Windows 32 path encoding is a nightmare. Our Win32 maintainers are often
troubled by unexpected OS behavior. I am sure we _can_ handle Russian
path names, but we need help from Russian people to improve it.

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

Every time these unicode discussions come up my head spins like a top. You
should see it.

We JRubyists have headaches from the unicode question too. Since JRuby is
currently 1.8-compatible, we do not have what most would call *native* unicode
support. This is primarily because we do not wish to create an incompatible
version of Ruby, or to build in unicode support now that would conflict with
Ruby 2.0 in the future. It is, however, embarrassing to say that although we
run on top of Java, which has arguably pretty good unicode support, we don't
support unicode. Perhaps you can see our conundrum.

I am no unicode expert. I know that Java uses UTF-16 strings internally,
converted to/from the current platform's encoding of choice by default. It
also supports converting those UTF-16 strings into just about every encoding
out there, just by telling it to do so. Java supports version 3.0 of the
Unicode specification. So Unicode is not a problem for Java.

We would love to be able to support unicode in JRuby, but there's always
that nagging question of what it should look like and what would mesh well
with the Ruby community at large. With the underlying platform already rich
with unicode support, it would not take much effort to modify JRuby. So then
there's a simple question:

What form would you, the Ruby users, want unicode to take? Is there a
specific library that you feel encompasses a reasonable implementation of
unicode support, e.g. icu4r? Should the support be transparent, e.g. no
longer treat or assume strings are byte vectors? JRuby, because we use
Java's String, is already using UTF-16 strings exclusively... however there's
no way to get at them through core Ruby APIs. What would be the most
comfortable way to support unicode now, considering where Ruby may go in the
future?

···

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ jruby.sourceforge.net
Application Architect @ www.ventera.com

[...]

> The string methods should not just blindly operate on bytes but
> use the encoding information to operate on characters rather than
> bytes. Sure something like byte_length is needed when the string
> is stored somewhere outside Ruby but standard string methods
> should work with character offsets and characters, not byte
> offsets nor bytes.

I emphatically agree. I'll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

Juergen, I agree with most of what you have written. I will
add my thoughts.

1. Strings should deal in characters (code points in Unicode) and
not in bytes, and the public interface should reflect this.

2. Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be
encapsulated by the string class completely, except for a few
related classes which may opt to work with the gory details for
performance reasons. The internal encoding has to be decided,
probably between UTF-8, UTF-16, and UTF-32 by the String class
implementor.

Full ACK. Ruby programs shouldn't need to care about the
*internal* string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

3. Whenever Strings are read or written to/from an external source,
their data needs to be converted. The String class encapsulates the
encoding framework, likely with additional helper Modules or
Classes per external encoding. Some methods take an optional
encoding parameter, like #char(index, encoding=:utf8), or
#to_ary(encoding=:utf8), which can be used as helper Class or
Module selector.

I think the encoding/decoding API should be separated from the
String class. IMO, the most important change is to strictly
differentiate between arbitrary binary data and character
data. Character data is represented by an instance of the
String class.

I propose adding a new core class, maybe call it ByteString
(or ByteBuffer, or Buffer, whatever) to handle strings of
bytes.

Given a specific encoding, the encoding API converts
ByteStrings to Strings and vice versa.

This could look like:

    my_character_str = Encoding::UTF8.encode(my_byte_buffer)
    buffer = Encoding::UTF8.decode(my_character_str)

4. IO instances are associated with a (modifiable) encoding. For
stdin, stdout this can be derived from the locale settings.
String-IO operations work as expected.

I propose one of:

1) A low level IO API that reads/writes ByteBuffers. String IO
   can be implemented on top of this byte-oriented API.

   The basic binary IO methods could look like:

   binfile = BinaryIO.new("/some/file", "r")
   buffer = binfile.read_buffer(1024) # read 1K of binary data

   binfile = BinaryIO.new("/some/file", "w")
   binfile.write_buffer(buffer) # Write the byte buffer

   The standard File class (or IO module, whatever) has an
   encoding attribute. The default value is set by the
   constructor by querying OS settings (on my Linux system
   this could be $LANG):

   # read strings from /some/file, assuming it is encoded
   # in the system's default encoding.
   text_file = File.new("/some/file", "r")
   contents = text_file.read

   # alternatively one can explicitly set an encoding before
   # the first read/write:
   text_file = File.new("/some/file", "r")
   text_file.encoding = Encoding::UTF8

   The File class (or IO module) will probably use a BinaryIO
   instance internally.

2) The File class/IO module as of current Ruby just gets
   additional methods for binary IO (through ByteBuffers) and
   an encoding attribute. The methods that do binary IO don't
   need to care about the encoding attribute.

I think 1) is cleaner.

5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations
like case folding, sorting, comparing etc.

If the strings are represented as a sequence of Unicode
codepoints, it is possible for external libraries to implement
more advanced Unicode operations.

Since IMO a new "character" class would be overkill, I propose
that the String class provides codepoint-wise iteration (and
indexing) by representing a codepoint as a Fixnum. AFAIK a
Fixnum consists of 31 bits on a 32 bit machine, which is
enough to represent the whole range of unicode codepoints.
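Something close to this is already possible in 1.8 for UTF-8 data via
String#unpack and Array#pack, which suggests the proposed interface is
implementable (a sketch, not the proposal's final API):

    "Fran\303\247ais".unpack("U*")  # => [70, 114, 97, 110, 231, 97, 105, 115]
    [70, 114, 97, 110, 231, 97, 105, 115].pack("U*")  # => "Français"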

6. More exotic operations can easily be provided by additional
libraries because of Ruby's open classes. Those operations may be
coded depending on String's public interface for simplicity,
or work with the internal representation directly for performance.

7. This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like
FixInt and BigInt).

I think providing different internal String representations
would be too much work, especially for maintenance in the long
run.

8. Because Strings are tightly integrated into the language with
the source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby's canonical String class. This will break
some old uses of String, but now is the right time for that.

9. The String class does not worry over character representation
on-screen, the mapping to glyphs must be done by UI frameworks or
the terminal attached to stdout.

10. Be flexible. <placeholder for future idea>

The advantages of this proposal over the current situation and
tagging a string with an encoding are:

* There is only one internal string (where string means a
  string of characters) representation. String operations
  don't need to be written for different encodings.

* No need for $KCODE.

* Higher abstraction.

* Separation of concerns. I always found it strange that most
  dynamic languages simply mix handling of character and
  arbitrary binary data (just think of pack/unpack).

* Reading of character data in one encoding and representing
  it in other encoding(s) would be easy.

It seems that the main argument against using Unicode strings
in Ruby is because Unicode doesn't work well for eastern
countries. Perhaps there is another character set that works
better that we could use instead of Unicode. The important
point here is that there is only *one* representation of
character data in Ruby.

If Unicode is chosen as the character set, there is the
question of which encoding to use internally. UTF-32 would be a
good choice with regards to simplicity in implementation,
since each codepoint takes a fixed number of bytes. Consider
indexing of Strings:

        "some string"[4]

If UTF-32 is used, this operation can internally be
implemented as a simple, constant array lookup. If UTF-16 or
UTF-8 is used, this is not possible to implement as an array
lookup, since any codepoint before the fifth could occupy more
than one (8 bit or 16 bit) unit. Of course there is the
argument against UTF-32 that it takes to much memory. But I
think that most text-processing done in Ruby spends much more
memory on other data structures than in actual character data
(just consider an REXML document), but I haven't measured that
:wink:
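To make the indexing cost concrete (a sketch in 1.8 terms; the
complexity claims restate the argument above, they are not
measurements):

    # fixed-width (UTF-32-like): character i sits at a known offset,
    # so "some string"[4] can be one constant-time array lookup.
    # variable-width (UTF-8): a lookup must decode from the start:
    "some string".unpack("U*")[4]   # => 32 (the space) -- an O(n) scan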

An advantage of using UTF-8 would be that for pure ASCII files
no conversion would be necessary for IO.

Thank you for reading so far. Just in case Matz decides to
implement something similar to this proposal, I am willing to
help with Ruby development (although I don't know much about
Ruby's internals and not too much about Unicode either).

I do not have a CS degree and I'm not a Unicode expert, so
perhaps the proposal is garbage, in this case please tell me
what is wrong about it or why it is not realistic to implement
it.

···

On Saturday 17 June 2006 13:08, Juergen Strobel wrote:

On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:

--
Stefan

I emphatically agree. I'll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

1. Strings should deal in characters (code points in Unicode) and not
in bytes, and the public interface should reflect this.

Agree, mostly. Strings should have a way to indicate the buffer size of
the String.

2. Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be encapsulated
by the string class completely, except for a few related classes which
may opt to work with the gory details for performance reasons.
The internal encoding has to be decided, probably between UTF-8,
UTF-16, and UTF-32 by the String class implementor.

Completely disagree. Matz has the right choice on this one. You can't
think in just terms of a pure Ruby implementation -- you *must* think
in terms of the Ruby/C interface for extensions as well.

3. Whenever Strings are read or written to/from an external source,
their data needs to be converted. The String class encapsulates the
encoding framework, likely with additional helper Modules or Classes
per external encoding. Some methods take an optional encoding
parameter, like #char(index, encoding=:utf8), or
#to_ary(encoding=:utf8), which can be used as helper Class or Module
selector.

Conversion should be possible at any time. An "external source" may be
an extension that your Ruby program can't distinguish. Again, this point
fails because your #2 is unacceptable.

4. IO instances are associated with a (modifiable) encoding. For
stdin, stdout this can be derived from the locale settings. String-IO
operations work as expected.

Agree, realising that the internal implementation of String must be
completely different than you've suggested. It is also important to
retain *raw* reading; a JPEG should not be interpreted as Unicode.

5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.

Agreed, but this would be expected regardless of the actual encoding of
a String.

6. More exotic operations can easily be provided by additional
libraries because of Ruby's open classes. Those operations may be
coded depending on String's public interface for simplicity, or
work with the internal representation directly for performance.

Agreed.

7. This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like FixInt
and BigInt).

Um. Disagree. Matz's proposed approach does this; yours does not. Yours,
in fact, makes things *much* harder.

8. Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby's canonical String class. This will break some
old uses of String, but now is the right time for that.

"Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

9. The String class does not worry over character representation
on-screen, the mapping to glyphs must be done by UI frameworks or the
terminal attached to stdout.

The String class doesn't worry about that now.

10. Be flexible. <placeholder for future idea>

And little is more flexible than Matz's m17n String.

This approach has several advantages and a few disadvantages, and I'll
try to bring in some new angles to this now too:

*Advantages*

-POL, Encapsulation-

All Strings behave exactly the same everywhere, are predictable,
and do the hard work for their users.

Remember: POLS is not an acceptable reason for anything. Matz's m17n
Strings would be predictable, too. a + b would be possible if and only
if a and b are the same encoding or one of them is "raw" (which would
mean that the other is treated as the defined encoding) *or* there is a
built-in conversion for them.
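The failure mode that rule guards against is easy to demonstrate in
today's 1.8, where a String is just untagged bytes (a sketch; iconv
encoding names as commonly available):

    require 'iconv'

    utf8   = "r\303\251sum\303\251"                          # "résumé" in UTF-8
    latin1 = Iconv.iconv('ISO-8859-1', 'UTF-8', utf8).first  # same text, Latin-1 bytes

    # 1.8 happily concatenates the raw bytes and produces mojibake;
    # m17n-tagged strings would instead convert, or raise:
    mixed = utf8 + latin1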

-Cross Library Transparency-
No String user needs to worry which Strings to pass to a library, or
worry which Strings he will get from a library. With Web-facing
libraries like rails returning encoding-tagged Strings, you would be
likely to get Strings of all possible encodings otherwise, and is the
String user prepared to deal with this properly? This is a *big* deal
IMNSHO.

This will be true with m17n strings. However, your proposal does *not*
work for Ruby/C interfaced items. Sorry.

-Limited Conversions-

Encoding conversions are limited to the time Strings are created or
written or explicitly transformed to an external representation.

This is a mistake. I may need to know the internal representation of a
particular encoding of a String inside of a program. Trust me on this
one: I *have* done some low-level encoding work. Additionally, even
though I might have marked a network object as "UTF-8", I may not know
whether it's *actually* UTF-8 or not until I get HTTP headers -- or
worse, a <meta http-equiv> tag. Assuming UTF-8 reading in today's world
is doomed to failure.

-Correct String Operations-
Even basic String operations are very hard in the world of Unicode. If
we leave the String users to look at the encoding tags and sort it out
themselves, they are bound to make mistakes because they don't care,
don't know, or have no time. And these mistakes may be _security_
_sensitive_, since most often credentials are represented as Strings
too. There already have been exploits related to Unicode.

This is a misunderstanding on your part. Nothing about Matz's m17n
Strings suggests that String users would have to look at the encoding
tags. Merely that they *could*. I suspect that there will be pragma-
like behaviours to enforce a particular internal representation at all
times.

*Disadvantages* (with mitigating reasoning of course)
- String users need to learn that #byte_length(encoding=:utf8) >=
#size, but that's not too hard, and applies everywhere. Users do not
need to learn about an encoding tag, which is surely worse to handle
for them.

True, but the encoding tag is not worse. Anyone who assumes that
developers can ignore encoding at any time simply *doesn't* know about
the level of problems that can be encountered.

- Strings cannot be used as simple byte buffers any more. Either use
an array of bytes, or an optimized ByteBuffer class. If you need
regular expression support, RegExp can be extended for ByteBuffers or
even more.

I see no reason for this.

- Some String operations may perform worse than might be expected from
a naive user, in both the time and space domains. But we do this so the
String user doesn't need to do it himself, and we are probably better at it
than the user too.

This is a wash.

- For very simple uses of String, there might be unnecessary
conversions. If a String is just to be passed through somewhere,
without inspecting or modifying it at all, in- and outwards conversion
will still take place. You could and should use a ByteBuffer to avoid
this.

This is a wash.

- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don't commit to a
particular encoding of Unicode strongly.

This is a wash. I think that it's better to leave the options open.
After all, it *is* a hope of mine to have Ruby running on iSeries
(AS/400) and *that* still uses EBCDIC.

- More work and time to implement. Some could call it over-engineered.
But it will save a lot of time and trouble when the shit hits the fan and
users really do get unexpected foreign characters in their Strings. I
could offer help implementing it, although I have never looked at
ruby's source, C-extensions, or even done a lot of ruby programming
yet.

I would call it the amount of work necessary. But the work needs to be
done for a *variety* of encodings, and not just Unicode. *Especially*
because of C extensions.

Close to the start of this discussion Matz asked what the problem with
current strings really was for western users. Somewhere later he
concluded case folding. I think it is more than that: we are lazy and
expect character handling to be always as easy as with 7 bit ASCII, or
as close as possible. Fixed 8-bit codepages worked quite fine most of
the time in this regard, and breakage was limited to special
characters only.

Now let's ask the question in reverse: are eastern programmers so used
to doing elaborate byte-stream to character handling by hand they
don't recognize how hard this is any more? Surely it is a target for
DRY if I ever saw one. Or are there actual problems not solvable this
way? I looked up the mentioned Han-Unification issue, and as far as I
understood it, this could be handled by future Unicode revisions
allocating more characters, outside of Ruby, but I don't see how it
requires our Strings to stay dumb byte buffers.

No one has ever suggested that Ruby Strings stay byte buffers. However,
blindly choosing Unicode *adds* unnecessary complexity to the situation.

-austin

···

On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

1. Strings should deal in characters (code points in Unicode) and not
in bytes, and the public interface should reflect this.

Be careful. People who care about this stuff might want to read "Character Model for the World Wide Web 1.0: Fundamentals". It turns out that characters do not correspond one-to-one with units of sound, or units of input, or units of display. Except for low-level stuff like regexps, it's very difficult to write any code that goes character-at-a-time that doesn't contain horrible i18n bugs. For practical purposes, a String is a more useful basic tool than a character.

5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.

Be careful. Case folding is a horrible can of worms, is rarely implemented correctly, and when it is (the Java library tries really hard) is insanely expensive. The reason is that case conversion is not only language-sensitive but jurisdiction-sensitive (in some respects different in France & Québec). Trying to do case-folding on text that is not known to be ASCII is likely a symptom of a bug.

- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don't commit to a
particular encoding of Unicode strongly.

For information: The XML view is that Shift-JIS, KOI8-R, EBCDIC, and many others are all encodings of Unicode and a best effort should be made to accept and emit all sane encodings on demand. Most XML software sticks to a single encoding, internally.

  -Tim

···

On Jun 17, 2006, at 4:08 AM, Juergen Strobel wrote:

Those libraries should probably be considered broken; they can and
should be patched to do any human-readable-string processing in an
encoding-safe manner (e.g. by using jcode's jlength and each_char
methods).
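For reference, the jcode-based pattern Paul means (Ruby 1.8, UTF-8
mode):

    require 'jcode'
    $KCODE = 'u'

    s = "日本語"
    s.length                     # => 9 -- raw byte count
    s.jlength                    # => 3 -- character count
    s.each_char { |c| puts c }   # iterates characters, not bytes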

Paul.

···

On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:

I can confirm. But I'm afraid that some libraries I rely on use #length and
can break when #length doesn't work.

There is variety even within western European languages - Dutch, for
example, differs from English (the digraph 'ij' capitalizes as a unit:
IJsselmeer).
Paul.

···

On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:

> Just to chime in, aren't upcase, downcase, and capitalize a locale/
> localization issue rather than a Unicode-only issue per se? For
> example, different languages will have different rules for
> capitalization.

Really? I know about two cases: European capitalization and no
capitalization.

From: Michael Glaesemann [mailto:grzm@seespotcode.net]
Sent: Wednesday, June 14, 2006 10:08 AM
>
> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
>
> Just to chime in, aren't upcase, downcase, and capitalize a locale/
> localization issue rather than a Unicode-only issue per se? For
> example, different languages will have different rules for
> capitalization.

Really? I know about two cases: European capitalization and no
capitalization.

Really. There is no such thing as European capitalization. There is
only <insert your language> capitalization.
The German character ß has no uppercase version. In most languages
using Latin script the uppercase of 'i' is 'I'. But Turkish has both a
dotted and a dotless i, and the uppercase of its 'i' is, of course,
'İ' (I with a dot).
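Concretely (codepoints per the Unicode case mapping tables):

    'i' (U+0069) upcases to 'İ' (U+0130) in Turkish, but to 'I' (U+0049) elsewhere
    'ı' (U+0131), the Turkish dotless i, upcases to plain 'I' (U+0049)
    'ß' (U+00DF) has no single-character uppercase; it case-maps to "SS"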

Thanks

Michal

···

On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:

> On Jun 14, 2006, at 15:56, Victor Shepelev wrote:

Every time these unicode discussions come up my head spins like a top. You
should see it.

We would love to be able to support unicode in JRuby, but there's always
that nagging question of what it should look like and what would mesh well
with the Ruby community at large. With the underlying platform already rich
with unicode support, it would not take much effort to modify JRuby. So then
there's a simple question:

Yukihiro Matsumoto wrote:

Define "proper Unicode support" first.

I'm planning to enhance Unicode support in 1.9 in a year or so
(finally). But I'm not sure that conforms to your definition of "proper
Unicode support". Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.

Hello everyone, and sorry for chiming in so fiercely. I got into some confusion with the ML controls.

Just joined the list seeing the subject popping up once more. I am doing Unicode-aware apps in Rails and Ruby right now and it hurts. I'll try to define "proper Unicode support" as I (dream of it at night) see it.

1. All string indexing (length, index, slice, insert) works with characters instead of bytes, whatever the byte length of the characters.
String methods (index or =~) should _never_ return offsets that will damage the string's characters if employed for slicing - you shouldn't have to manually translate a byte offset of 2 to a character offset of 1 because the second character is multibyte.

Simple example:

     def translate_offset(str, byte_offset)
       chunk = str[0..byte_offset]
       begin
         chunk.unpack("U*").length - 1
       rescue ArgumentError # this offset is just wrong! shift upwards and retry
         chunk = str[0..(byte_offset+=1)]
         retry
       end
     end
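For instance (assuming valid UTF-8 input, as the method itself does):

     str = "Привет мир"               # Cyrillic, two bytes per letter in UTF-8
     byte_off = str =~ /мир/          # => 13 -- a byte offset in Ruby 1.8
     translate_offset(str, byte_off)  # => 7  -- the character offset of 'м'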

I think it's unnecessarily painful for something as easy as string =~ /pattern/. Yes, you can take the offset you receive from =~, then get the slice of the string, then split it again with /./mu to get the same number, etc...

2. Case-insensitive regexes actually work. Even in my Oniguruma-enabled builds of 1.8.2 this was not true (maybe it has changed now). At the least, "Unicode general" collation casefolding (such a thing exists) should be available built-in on every platform.
4. Locale-aware sorting, including multibyte charsets, if provided by the OS.
5. Preferably a separate (and strictly purposed) Bytestring that you get out of Sockets and use in Servers etc. - or the ability to "force" all strings received from external resources to be flagged uniformly as being of a certain encoding in _your_ program, not somewhere in someone's library. If flags have to be set by libraries, they won't be set, because most developers sadly don't care:

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

6. Unicode-aware strip dealing with weirdo whitespace (hair space, thin space, etc.)
7. And no, as I mentioned, 1.8 doesn't handle this properly via Regexp, because the /i modifier is broken; to work without it you need to downcase BOTH the regexp and the string itself. A closed circle - you go and get the Unicode gem with its tables.

All of this can be controlled either per String (in which case 99 out of 100 libraries I use will get it wrong - see above) or by a global setting such as $KCODE.

As an example, something that is ridiculously backwards to do in Ruby right now is this (I spent some time refactoring it today):
http://dev.rubyonrails.org/browser/trunk/actionpack/lib/action_view/helpers/text_helper.rb#L44

Here you have a major problem, because the /i flag doesn't do anything (Ruby is incapable of Unicode-aware casefolding), and using offsets means you are always one step away from damaging someone's text. It's just wrong that it has to be so painful.

Python3000, IMO, gets this right (as does Java) - the byte array and the String are completely separate, and String operates with characters and characters only.

That's what I would expect. Hope this makes sense somewhat :)

···

On 15-jun-2006, at 2:11, Charles O Nutter wrote:
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl


8. Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby's canonical String class. This will break some
old uses of String, but now is the right time for that.

"Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

Most probably wise, but I have needed casefolding and character classes to work since yesteryear.
Oniguruma is there, but even if you compile with it (which is still not the default) you don't get char classes (AFAIK)
and you don't get casefolding. Case-insensitive search/replace quickly becomes bondage.

I am maintaining a gem whose tests fail due to different regexps in Oniguruma, but I would be able to fix it quickly once Oniguruma is in the stable branch.

10. Be flexible. <placeholder for future idea>

And little is more flexible than Matz's m17n String.

I couldn't find a proper description of that - as I said already, the thing I'd least prefer would be:

# get a string from the database
p str + my_unicode_chars # Ok, bail out with an ugly exception because the author of the DB adaptor didn't care to send me proper Strings...

If strings in the system are allowed to have varying encodings, I don't understand how the engine is going to upgrade/downgrade strings automatically.
Especially remembering that the receiver is on the left, so I might actually get different exceptions depending on the order:

p my_unicode_chars + mojikyo_str # who wins?

or

p mojikyo_str + my_unicode_chars # who wins?

or (especially)

p mojikyo_str + bytestring_that_i_just_grabbed_by_http_and_i_know_it_is_mojikyo_but_its_not # who wins?

···

On 17-jun-2006, at 15:52, Austin Ziegler wrote:

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Full ACK. Ruby programs shouldn't need to care about the
*internal* string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

This is incorrect. *Most* Ruby programs won't need to care about the
internal string encoding. Experience suggests, however, that it is
*most*. Definitely not all.

I think the encoding/decoding API should be separated from the
String class. IMO, the most important change is to strictly
differentiate between arbitrary binary data and character
data. Character data is represented by an instance of the
String class.

I propose adding a new core class, maybe call it ByteString
(or ByteBuffer, or Buffer, whatever) to handle strings of
bytes.

Given a specific encoding, the encoding API converts
ByteStrings to Strings and vice versa.

This could look like:

    my_character_str = Encoding::UTF8.encode(my_byte_buffer)
    buffer = Encoding::UTF8.decode(my_character_str)

Unnecessarily complex and inflexible. Before you go too much further, I
*really* suggest that you look in the archives and Google to find more
about Matz's m17n String proposal. It's a really good one, as it allows
developers (both pure Ruby and extension) to choose what is appropriate
with the ability to transparently convert as well.

4. IO instances are associated with a (modifiable) encoding. For
stdin, stdout this can be derived from the locale settings.
String-IO operations work as expected.

I propose one of:

1) A low level IO API that reads/writes ByteBuffers. String IO
   can be implemented on top of this byte-oriented API.

[...]

2) The File class/IO module as of current Ruby just gets
   additional methods for binary IO (through ByteBuffers) and
   an encoding attribute. The methods that do binary IO don't
   need to care about the encoding attribute.

I think 1) is cleaner.

I think neither is necessary and both would be a mistake. It is, as I
indicated to Juergen, sometimes *impossible* to determine the encoding
to be used for an IO until you have some data from the IO already.

5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.

If the strings are represented as a sequence of Unicode codepoints, it
is possible for external libraries to implement more advanced Unicode
operations.

This would be true regardless of the encoding.

Since IMO a new "character" class would be overkill, I propose that
the String class provides codepoint-wise iteration (and indexing) by
representing a codepoint as a Fixnum. AFAIK a Fixnum consists of 31
bits on a 32 bit machine, which is enough to represent the whole range
of unicode codepoints.

This does not match what Matz will be doing.

  str = "Fran\303\247ais"
  str[5] # -> "\303\247"

This is better than doing a Fixnum representation. It is character
iteration, but each character is, itself, a String.

7. This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like
FixInt and BigInt).

I think providing different internal String representations
would be too much work, especially for maintenance in the long
run.

If you're depending on classes to do that, especially given that Ruby's
String, Array, and Hash classes don't inherit well, you're right.

The advantages of this proposal over the current situation and
tagging a string with an encoding are:

The problem, of course, is that this proposal -- and your take on it --
don't account for the m17n String that Matz has planned. The current
situation is a mess. But the current situation is *not* what is planned.
I've had to do some encoding work for work in the last two years, and
while I *prefer* a UTF-8/UTF-16 internal representation, I also know
that's *impossible* in some situations and you have to be flexible. I
also know that POSIX handles this situation worse than any other
setup.

With the work that I've done on this, Matz is *right* about this, and
the people claiming that Unicode is the Only Way ... are wrong. In an
ideal world, Unicode would be the correct and only way. In the real
world, however, it's a lot messier, and Ruby has to be aware of that.

We can *still* make it as easy as possible for the common case (which
will be UTF-8 encoding data and filenames). But we shouldn't make the
mistake of assuming that the common case is all that Ruby should handle.

* There is only one internal string (where string means a
  string of characters) representation. String operations
  don't need to be written for different encodings.

This is still (mostly) correct under the m17n String proposal.

* No need for $KCODE.

This is true under the m17n String.

* Higher abstraction.

This is true under the m17n String.

* Separation of concerns. I always found it strange that most dynamic
  languages simply mix handling of character and arbitrary binary data
  (just think of pack/unpack).

The separation makes things harder most of the time.

* Reading of character data in one encoding and representing it in
  other encoding(s) would be easy.

This is true under the m17n String.

It seems that the main argument against using Unicode strings in Ruby
is because Unicode doesn't work well for eastern countries. Perhaps
there is another character set that works better that we could use
instead of Unicode. The important point here is that there is only
*one* representation of character data in Ruby.

This is a mistake.

If Unicode is chosen as the character set, there is the question of which
encoding to use internally. UTF-32 would be a good choice with regards
to simplicity in implementation, since each codepoint takes a fixed
number of bytes. Consider indexing of Strings:

Yes, but this would be very hard on memory requirements. There are
people who are trying to get Ruby to fit into small-memory environments.
This would destroy any chance of that.

[...]

Thank you for reading so far. Just in case Matz decides to implement
something similar to this proposal, I am willing to help with Ruby
development (although I don't know much about Ruby's internals and not
too much about Unicode either).

I would suggest that you look for discussions about m17n Strings in
Ruby. Matz has this one right.

I do not have a CS degree and I'm not a Unicode expert, so perhaps the
proposal is garbage, in this case please tell me what is wrong about
it or why it is not realistic to implement it.

I don't have a CS degree either, but I have been in the business for a
*long* time and I've been immersed in Unicode and encoding issues for
the last two years. If everyone used Unicode -- and POSIX weren't stupid
-- your proposal would be much more realistic. I *agree* that Ruby
should encourage the use of Unicode as much as is practical. But it also
shouldn't tie our hands like other programming languages do.

-austin

···

On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

I don't claim to be a Unicode expert, but shouldn't the goal be to have Ruby work with *any* text encoding on a per-string basis? Why would you want to force all strings into Unicode, for example in a context where you aren't using Unicode? ("The internal encoding has to be....") And of course even in the Unicode world you have several different encodings (UTF-8, UTF-16, and so on). Juergen, when you say 'internal encoding' are you talking about the text encoding of Ruby source code?

It seems to me that irrespective of any particular text encoding scheme you need clean support of a simple byte vector data structure completely unencumbered with any notion of text encoding or locale. Right now that is done by the String class, whose name I think certainly creates much confusion. If the class had been called Vector and then had methods like:

  Vector#size # size in bytes
  Vector#str_size # size in characters (encoding and locale considered)

I think this discussion would be clearer because it would be the behavior of the str* methods that would need to understand text encodings and/or locale settings while the underlying byte vector methods remained oblivious. The #[] method is the most confusing, since sometimes you want to extract bytes and sometimes you want to extract sub-strings (i.e. consider the encoding). One method, two interpretations, bad headache.
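A concrete instance of that two-interpretations headache in 1.8 (str_size and str_slice are hypothetical names in the spirit of the str* methods above, not real API):

    s = "Caf\303\251"   # "Café" in UTF-8
    s.size              # => 5   -- bytes
    s[3]                # => 195 -- a byte (0xC3), not the character 'é'
    # under the proposal, byte access and text access would be distinct:
    # s.str_size        # => 4   (characters)
    # s.str_slice(3)    # => "é" (hypothetical)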

It seems that three distinct behaviors are being shoehorned (with good reason) into a single class framework (String):

  byte vector
  text encoding (encoded sequence of code points)
  locale (cultural interpretations of the encoded sequence of code points)

I'm just suggesting that these distinctions seem to be lost in much of this discussion, especially for folks (like myself) who have a practical interest in this but certainly aren't text-encoding gurus.

Gary Wright

···

On Jun 17, 2006, at 9:50 AM, Stefan Lang wrote:

On Saturday 17 June 2006 13:08, Juergen Strobel wrote:

2. Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be
encapsulated by the string class completely, except for a few
related classes which may opt to work with the gory details for
performance reasons. The internal encoding has to be decided,
probably between UTF-8, UTF-16, and UTF-32 by the String class
implementor.

Full ACK. Ruby programs shouldn't need to care about the
*internal* string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

Not to mention that Matz has explicitly stated in the past that he
wants Ruby to support other encodings (TRON, Mojikyo, etc.) that
aren't compatible with a Unicode internal representation.

Not tying String to Unicode is also the right thing to do: it allows
for future developments. Java's weird encoding system is entirely down
to the fact that it standardised on UCS-2; when codepoints beyond
65535 arrived, they had to be shoehorned in via an ugly hack. As far
as possible, Ruby should avoid that trap.

Paul.

···

On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote:

> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> really consider something else? Note that we don't commit to a
> particular encoding of Unicode strongly.

This is a wash. I think that it's better to leave the options open.
After all, it *is* a hope of mine to have Ruby running on iSeries
(AS/400) and *that* still uses EBCDIC.

>I emphatically agree. I'll even repeat and propose a new Plan for
>Unicode Strings in Ruby 2.0 in 10 points:
>
>1. Strings should deal in characters (code points in Unicode) and not
>in bytes, and the public interface should reflect this.

Agree, mostly. Strings should have a way to indicate the buffer size of
the String.

>2. Strings should neither have an internal encoding tag, nor an
>external one via $KCODE. The internal encoding should be encapsulated
>by the string class completely, except for a few related classes which
>may opt to work with the gory details for performance reasons.
>The internal encoding has to be decided, probably between UTF-8,
>UTF-16, and UTF-32 by the String class implementor.

Completely disagree. Matz has the right choice on this one. You can't
think in just terms of a pure Ruby implementation -- you *must* think
in terms of the Ruby/C interface for extensions as well.

I admit I don't know about Ruby's C extensions. Are they unable to
access String's methods? That is all that is needed to work with them.

And since this String class does not have a parametric encoding
attribute, it should be even easier to crunch in C.

>3. Whenever Strings are read or written to/from an external source,
>their data needs to be converted. The String class encapsulates the
>encoding framework, likely with additional helper Modules or Classes
>per external encoding. Some methods take an optional encoding
>parameter, like #char(index, encoding=:utf8), or
>#to_ary(encoding=:utf8), which can be used as helper Class or Module
>selector.

Conversion should be possible at any time. An "external source" may be
an extension that your Ruby program can't distinguish. Again, this point
fails because your #2 is unacceptable.

Note that explicit conversion to characters, arrays, etc. is possible
for any supported character set and encoding. I have even given method
examples. "External" is to be seen in the context of the String class.

>4. IO instances are associated with a (modifiable) encoding. For
>stdin, stdout this can be derived from the locale settings. String-IO
>operations work as expected.

Agree, realising that the internal implementation of String must be
completely different than you've suggested. It is also important to
retain *raw* reading; a JPEG should not be interpreted as Unicode.

>5. Since the String class is quite smart already, it can implement
>generally useful and hard (in the domain of Unicode) operations like
>case folding, sorting, comparing etc.

Agreed, but this would be expected regardless of the actual encoding of
a String.

I am unaware of Matz's exact plan. Any good English-language links?

I was under the impression that users of Matz's String instances need
to look at the encoding tag to implement e.g. #version_sort. If that is
not the case, our proposals are not that different, only Matz's one
is even more complex to implement than mine.

>6. More exotic operations can easily be provided by additional
>libraries because of Ruby's open classes. Those operations may be
>coded depending on String's public interface for simplicity, or
>work with the internal representation directly for performance.

Agreed.

>7. This approach leaves open the possibility of String subclasses
>implementing different internal encodings for performance/space
>tradeoff reasons which work transparently together (a bit like FixInt
>and BigInt).

Um. Disagree. Matz's proposed approach does this; yours does not. Yours,
in fact, makes things *much* harder.

If Matz's approach requires looking at the encoding tag from the
outside, it is not as transparent as mine. If it doesn't, it just boils
down to a parametric-class versus subclass-hierarchy design decision,
and I don't see much difference and would be happy with either one.

>8. Because Strings are tightly integrated into the language with the
>source reader and are used pervasively, much of this cannot be
>provided by add-on libraries, even with open classes. Therefore the
>need to have it in Ruby's canonical String class. This will break some
>old uses of String, but now is the right time for that.

"Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

My original title, somewhere snipped out, was "A Plan for Unicode
Strings in Ruby 2.0". I don't want to rush things or break 1.8 either.

>9. The String class does not worry over character representation
>on-screen, the mapping to glyphs must be done by UI frameworks or the
>terminal attached to stdout.

The String class doesn't worry about that now.

I was just playing it safe here.

>10. Be flexible. <placeholder for future idea>

And little is more flexible than Matz's m17n String.

I had flexibility with respect to Unicode standards in mind, to avoid
falling into traps similar to Java's. A simple-to-use String class,
powerful enough to include every character of the world, was my goal,
with the ability to convert to and from other representations that are
external from the String class's point of view.

The flexibility to have parametric String encodings inside the String
class was not what I was going for; rather, I would keep that
inaccessible, or at least unnecessary to access, for the common String
user, and I provided a somewhat weaker but maybe still sufficient
technique via subclassing.

>This approach has several advantages and a few disadvantages, and I'll
>try to bring in some new angles to this now too:
>
>*Advantages*
>
>-POL, Encapsulation-
>
>All Strings behave exactly the same everywhere, are predictable,
>and do the hard work for their users.

Remember: POLS is not an acceptable reason for anything. Matz's m17n
Strings would be predictable, too. a + b would be possible if and only
if a and b are the same encoding or one of them is "raw" (which would
mean that the other is treated as the defined encoding) *or* there is a
built-in conversion for them.

Since I probably cannot control which Strings I get from libraries,
and don't want to worry about which ones I'll have to provide to them,
this is weaker than my approach in this respect; see my next point.

>-Cross Library Transparency-
>No String user needs to worry which Strings to pass to a library, or
>worry which Strings he will get from a library. With Web-facing
>libraries like rails returning encoding-tagged Strings, you would be
>likely to get Strings of all possible encodings otherwise, and is the
>String user prepared to deal with this properly? This is a *big* deal
>IMNSHO.

This will be true with m17n strings. However, your proposal does *not*
work for Ruby/C interfaced items. Sorry.

Please elaborate on this or provide pointers. I cannot believe C cannot
crunch on my Strings, which are less parametric than Matz's.

>-Limited Conversions-
>
>Encoding conversions are limited to the time Strings are created or
>written or explicitly transformed to an external representation.

This is a mistake. I may need to know the internal representation of a
particular encoding of a String inside of a program. Trust me on this
one: I *have* done some low-level encoding work. Additionally, even
though I might have marked a network object as "UTF-8", I may not know
whether it's *actually* UTF-8 or not until I get HTTP headers -- or
worse, a <meta http-equiv> tag. Assuming UTF-8 reading in today's world
is doomed to failure.

Read it as binary, and decide later. These problems should be locally
containable, and methods are still able to return Strings after
determining the encoding.

>-Correct String Operations-
>Even basic String operations are very hard in the world of Unicode. If
>we leave the String users to look at the encoding tags and sort it out
>themselves, they are bound to make mistakes because they don't care,
>don't know, or have no time. And these mistakes may be _security_
>_sensitive_, since most often credentials are represented as Strings
>too. There already have been exploits related to Unicode.

This is a misunderstanding on your part. Nothing about Matz's m17n
Strings suggests that String users would have to look at the encoding
tags. Merely that they *could*. I suspect that there will be pragma-
like behaviours to enforce a particular internal representation at all
times.

Previously you stated that users need to look at the encoding to
determine whether simple operations like a + b work.

Can you point to more info? I am interested in how this pragma stuff
works, and whether not doing it "right" can break things.

>*Disadvantages* (with mitigating reasoning of course)
>- String users need to learn that #byte_length(encoding=:utf8) >=
>#size, but that's not too hard, and applies everywhere. Users do not
>need to learn about an encoding tag, which is surely worse to handle
>for them.

True, but the encoding tag is not worse. Anyone who assumes that
developers can ignore encoding at any time simply *doesn't* know about
the level of problems that can be encountered.

For String concatenation, substring access, search, etc., I expect to
be able to ignore encoding totally. Only when interfacing with
non-String-class objects (I/O and/or explicit conversion) would I need
encoding info.

>- Strings cannot be used as simple byte buffers any more. Either use
>an array of bytes, or an optimized ByteBuffer class. If you need
>regular expression support, RegExp can be extended for ByteBuffers or
>even more.

I see no reason for this.

In my proposal, Unicode Strings cannot represent arbitrary binary data
in their internal representation, since not every byte sequence would
be valid characters. In fact, you cannot set the internal
representation directly.

The interface could accept a code point sequence of values
(0..255), but that would be wasteful compared to an array of bytes.

>- Some String operations may perform worse than might be expected from
>a naive user, in both the time and space domains. But we do this so the
>String user doesn't need to do it himself, and we are probably better at it
>than the user too.

This is a wash.

Only trying to refute weak arguments in advance.

>- For very simple uses of String, there might be unnecessary
>conversions. If a String is just to be passed through somewhere,
>without inspecting or modifying it at all, in- and outwards conversion
>will still take place. You could and should use a ByteBuffer to avoid
>this.

This is a wash.

Not a big problem either, but someone was bound to bring it up.

>- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
>really consider something else? Note that we don't commit to a
>particular encoding of Unicode strongly.

This is a wash. I think that it's better to leave the options open.
After all, it *is* a hope of mine to have Ruby running on iSeries
(AS/400) and *that* still uses EBCDIC.

>- More work and time to implement. Some could call it over-engineered.
>But it will save a lot of time and trouble when the shit hits the fan and
>users really do get unexpected foreign characters in their Strings. I
>could offer help implementing it, although I have never looked at
>ruby's source, C-extensions, or even done a lot of ruby programming
>yet.

I would call it the amount of work necessary. But the work needs to be
done for a *variety* of encodings, and not just Unicode. *Especially*
because of C extensions.

>Close to the start of this discussion Matz asked what the problem with
>current strings really was for western users. Somewhere later he
>concluded case folding. I think it is more than that: we are lazy and
>expect character handling to be always as easy as with 7 bit ASCII, or
>as close as possible. Fixed 8-bit codepages worked quite fine most of
>the time in this regard, and breakage was limited to special
>characters only.

>Now let's ask the question in reverse: are eastern programmers so used
>to doing elaborate byte-stream to character handling by hand they
>don't recognize how hard this is any more? Surely it is a target for
>DRY if I ever saw one. Or are there actual problems not solvable this
>way? I looked up the mentioned Han-Unification issue, and as far as I
>understood this could be handled by future Unicode revisions
>allocating more characters, outside of Ruby, but I don't see how it
>requires our Strings to stay dumb byte buffers.

No one has ever suggested that Ruby Strings stay byte buffers. However,
blindly choosing Unicode *adds* unnecessary complexity to the situation.

-austin
--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
              * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
              * austin@zieglers.ca

The way I see it, we have to choose a character set. I proposed
Unicode because its official goal is to be the one unifying set,
and if it isn't yet, I hope it will be sometime.

If that is not enough, we will effectively create our own character
set - let's call it RubyCode - which will contain characters from the
union of Unicode and a few other sets. Each String will have a
particular encoding, which will determine which characters of RubyCode
are valid in that particular String instance. Hopefully many
characters will be valid in multiple encodings. But it doesn't sound
like a very clear design to me.

Jürgen

···

On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Ziegler wrote:

On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:

--
The box said it requires Windows 95 or better so I installed Linux

It seems that the main argument against using Unicode strings
in Ruby is because Unicode doesn't work well for eastern
countries.

Point of information: there are highly successful word-processing products, selling well in countries whose writing systems include Han characters, that internally use Unicode. So while the Han-unification problems have been much discussed and are regarded as important by people who are not fools, there is in fact existence proof that Unicode does work well enough for wide deployment in commercial software.

If Unicode is chosen as the character set, there is the
question of which encoding to use internally. UTF-32 would be a
good choice with regards to simplicity in implementation,

UTF-32 has a practical problem in that in C code, you can't use strcmp() and friends because it's full of null bytes. Of course if you're careful to code everything using wchar_t you'll be OK, but lots of code isn't. (UTF-8 doesn't have this problem and is much more compact).

Consider
indexing of Strings:

        "some string"[4]

If UTF-32 is used, this operation can internally be
implemented as a simple, constant array lookup. If UTF-16 or
UTF-8 is used, this is not possible to implement as an array

Correct. But in practice this seems not to be too huge a problem, since in practice text is most often accessed sequentially. The times that you really need true random access to the N'th character are rare enough that for some problems, the advantages of UTF-8 are big enough to compensate for this problem. Note that in a variable-length character encoding, there's no trouble whatever with a table of pointers into text; the *only* problem is when you need to find the Nth character cheaply.

An advantage of using UTF-8 would be that for pure ASCII files
no conversion would be necessary for IO.

Be careful. There are almost no pure ASCII files left. Café. Ordoñez. “Smart quotes”

  -Tim

···

On Jun 17, 2006, at 6:50 AM, Stefan Lang wrote: