Andy Roonie is perhaps excessively optimistic


(Benjamin Peterson) #1

Ruby, then, has reached its Rubicon,
http://www.ibiscom.com/caesar.htm
Will Matz be its Caesar and build a new Empire? Or
will the important
issues noted by Benjamin be ignored? Full Unicode,
threading, and a fast
VM are going to be critical factors.

They will not be ignored. But only History knows the
future.

…and only Prophecy knows the past.

Like many others, I would be happy to devote a large
amount of time to Ruby. In my particular case it
would be to i18n, since I can’t use Ruby without it.
But in practice, I have no way to find out whether
someone in Japan is already making an i18n effort, or
whether any changes I made would be accepted, or
whether matz has decided what i18n should consist of,
so it doesn’t really make sense for me to do anything
at all.

Except sit here and carp :-)

Benjamin

Postscript:

If Ruby were my own project, I might do the following:

1 – Bite the Unicode bullet and accept that despite
the legitimate concerns of many Japanese, it is the
standard and it works well enough to get things done.
I wish TRON had won. It didn’t win.
2 – Use wide characters in Ruby internally. Forget
surrogates, like every other implementation does.
UCS2 characters are fast (always the same length) and
pretty near standard. A change like this is worth
having to recompile things for.
3 – Plug in rxpp regular expressions to replace the
narrow-character GNU regex file we have.
4 – Isolate IO routines (including console IO) to
provide a layer for translating encodings. There
could be more than one layer (I would want a
windows-specific one, but to start with you could just
put in a dumb ‘squashing’ of the internal UCS2 to
ASCII). All translations would be between UCS2 and
the currently active IO encoding.
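A minimal sketch of point 4 in today's Ruby, with invented names (TranslatingIO is hypothetical; modern Ruby later grew its own encoding machinery, so this only illustrates the shape of the idea, with UTF-16LE standing in for the proposed internal UCS2):

```ruby
require "stringio"

# Hypothetical sketch: an IO wrapper that translates between an
# internal wide encoding and the currently active external encoding.
# Class and method names are invented for illustration.
class TranslatingIO
  INTERNAL = Encoding::UTF_16LE  # stand-in for the internal UCS2

  def initialize(io, external: Encoding::ASCII)
    @io = io
    @external = external
  end

  # Read everything and convert into the internal encoding.
  def read
    @io.read.force_encoding(@external).encode(INTERNAL)
  end

  # Convert from the internal encoding before writing.  Characters with
  # no representation in the target get "squashed" to '?', like the dumb
  # ASCII layer described above.
  def write(str)
    @io.write(str.encode(@external, undef: :replace, replace: "?"))
  end
end

out = StringIO.new
layer = TranslatingIO.new(out, external: Encoding::ASCII)
layer.write("caf\u00e9".encode(Encoding::UTF_16LE))
out.string  # => "caf?"
```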

This is just what would seem most obvious to me if I
were developing Ruby for my personal use, not
necessarily a recommendation for Ruby in real life.

The last document expressing the will of matz seems to
be this one from 2000, about the time discussion of
i18n petered out:

http://www.inac.co.jp/~maki/ruby/matz-000516.html

The current validity of these thoughts is impossible
to ascertain, but (unless perhaps my Japanese is at
fault) they seem to express a somewhat MULE-ish
attitude…



(Yukihiro Matsumoto) #2

Hi,

Like many others, I would be happy to devote a large
amount of time to Ruby. In my particular case it
would be to i18n, since I can’t use Ruby without it.
But in practice, I have no way to find out whether
someone in Japan is already making an i18n effort, or
whether any changes I made would be accepted, or
whether matz has decided what i18n should consist of,
so it doesn’t really make sense for me to do anything
at all.

You can tell me what you’d like to see in the future, although I cannot
promise you anything (yet). I mean I’d like to hear about the spec,
not about the implementation. For your information, you can get and
see my experimental M17N implementation from the CVS ruby_m17n branch.

The current validity of these thoughts is impossible
to ascertain, but (unless perhaps my Japanese is at
fault) they seem to express a somewhat MULE-ish
attitude…

I’m not sure what you mean by “MULE-ish”. I think we are going to
go a step further.

						matz.

In message “Re: Andy Roonie is perhaps excessively optimistic” on 02/06/28, Benjamin Peterson bjsp123@yahoo.com writes:


(Frank Mitchell) #3

Benjamin Peterson wrote:

If Ruby were my own project, I might do the following:

1 – Bite the Unicode bullet and accept that despite
the legitimate concerns of many Japanese, it is the
standard and it works well enough to get things done.
I wish TRON had won. It didn’t win.
2 – Use wide characters in Ruby internally. Forget
surrogates, like every other implementation does.
UCS2 characters are fast (always the same length) and
pretty near standard. A change like this is worth
having to recompile things for.
3 – Plug in rxpp regular expressions to replace the
narrow-character GNU regex file we have.
4 – Isolate IO routines (including console IO) to
provide a layer for translating encodings. There
could be more than one layer (I would want a
windows-specific one, but to start with you could just
put in a dumb ‘squashing’ of the internal UCS2 to
ASCII). All translations would be between UCS2 and
the currently active IO encoding.

Java programmers will tell you that converting Unicode to a native
encoding takes up a surprisingly large amount of time. Reading a string
from a file, doing a trivial substitution, and writing it to another
file does an unnecessary amount of work. Granted, nobody expects a Ruby
script to be blindingly fast, but other threads in this newsgroup are
complaining about I/O being slow.

Maybe this has been suggested already but, since Ruby is
object-oriented, I’d vote for two (or more) virtually indistinguishable
String classes, one for Unicode strings, one for single-byte strings.
Perhaps byte strings could have an “encoding” attribute (a Symbol) to
make converting from one representation to another automatic. Maybe
you’d also need a distinction between getting the Nth byte, and getting
the Nth character (always converted to a Unicode character.)

Note that Python has two string types which are virtually
indistinguishable. Every string function “does the right thing” whether
it operates on a byte string or a wide string, as far as I know. (I
haven’t tried the regexp functions yet.)
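For what it’s worth, the design Ruby eventually shipped (in 1.9) is a close cousin of this idea: one String class tagged with an encoding, with separate byte and character views rather than two classes. A quick illustration in modern Ruby:

```ruby
# One String class, tagged with an encoding; bytes and characters
# are reached through separate methods.
s = "caf\u00e9"
s.encoding    # => #<Encoding:UTF-8>
s.length      # => 4  (characters)
s.bytesize    # => 5  (the é is two bytes in UTF-8)
s.getbyte(4)  # => 0xA9  (the second byte of the é)
s[3]          # => "é"  (the 4th character, not the 4th byte)
```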



Frank Mitchell (frankm@bayarea.net)

Please avoid sending me Word or PowerPoint attachments.
See http://www.fsf.org/philosophy/no-word-attachments.html


(Curt Sampson) #4

Java programmers will tell you that converting Unicode to a native
encoding takes up a surprisingly large amount of time.

Well, I’m a Java programmer, and I tend to disagree with that.
Especially if you’re using Latin-1, the conversion is very, very
cheap. (Because it does almost nothing!)
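The Latin-1 case is cheap because its 256 byte values coincide exactly with the first 256 Unicode code points, so “conversion” is just widening each byte, with no table lookups. In modern Ruby terms:

```ruby
# Latin-1 bytes and Unicode code points line up one-to-one,
# so conversion is a per-byte widening.
latin1 = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1)
latin1.bytes                               # => [99, 97, 102, 233]
latin1.encode(Encoding::UTF_8).codepoints  # => [99, 97, 102, 233]
```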

Reading a string
from a file, doing a trivial substitution, and writing it to another
file does an unnecessary amount of work.

It certainly does! But that’s because you programmed it poorly for
Java’s model. Something like this, if you want it to be efficient,
should never use a java.lang.String class.

Almost every time I’ve seen poor performance in String handling in
Java, it’s been because the programmer is using strings badly and
forcing a lot of data copies.

Maybe this has been suggested already but, since Ruby is
object-oriented, I’d vote for two (or more) virtually indistinguishable
String classes, one for Unicode strings, one for single-byte strings.

Now this, I agree with, and I sure wish Java had it. We need,
essentially, “character strings” (which are Unicode) and “byte
strings” (which are a set of arbitrary bytes).

Perhaps byte strings could have an “encoding” attribute (a Symbol) to
make converting from one representation to another automatic.

That could be handy, yes.

Maybe
you’d also need a distinction between getting the Nth byte, and getting
the Nth character (always converted to a Unicode character.)

Err…I’d say provide just “get the Nth byte,” and leave it to
character strings to get the Nth character; should the programmer
need it he can use ByteString.getCharacterString or whatever.
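A sketch of that split in Ruby terms (the class and method names are invented, following the “ByteString.getCharacterString or whatever” above):

```ruby
# Hypothetical ByteString: byte access only; anyone who wants
# characters must ask for the conversion explicitly.
class ByteString
  def initialize(bytes, encoding)  # encoding: an Encoding, e.g. ISO-8859-1
    @bytes = bytes.dup.force_encoding(Encoding::BINARY)
    @encoding = encoding
  end

  def byte_at(n)                   # the Nth byte, uninterpreted
    @bytes.getbyte(n)
  end

  def get_character_string         # explicit conversion to Unicode
    @bytes.dup.force_encoding(@encoding).encode(Encoding::UTF_8)
  end
end

bs = ByteString.new("caf\xE9".b, Encoding::ISO_8859_1)
bs.byte_at(3)            # => 233 (0xE9)
bs.get_character_string  # => "café"
```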

cjs


On Sat, 29 Jun 2002, Frank Mitchell wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC