4 – Isolate IO routines (including console IO) to
provide a layer for translating encodings. There
could be more than one layer (I would want a
Windows-specific one, but to start with you could
put in a dumb ‘squashing’ of the internal UCS2 to
ASCII). All translations would be between UCS2 and
the currently active IO encoding.
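To make the 'squashing' layer concrete, here is a minimal sketch. It assumes a modern Ruby where String#encode exists, and it uses UTF-16LE as a stand-in for the internal UCS2 form. The SquashIO module and its method names are hypothetical, made up for illustration, not part of any library.

```ruby
# Hypothetical translation layer; SquashIO is an illustrative name only.
# Assumption: UTF-16LE stands in for the internal "UCS2" representation.
module SquashIO
  INTERNAL = Encoding::UTF_16LE

  # Internal string -> external encoding. Characters the external
  # encoding cannot represent are squashed to "?".
  def self.to_external(str, external = Encoding::US_ASCII)
    str.encode(external, invalid: :replace, undef: :replace, replace: "?")
  end

  # Bytes read from outside -> internal form. Raises if the bytes are
  # not valid in the declared external encoding.
  def self.to_internal(bytes, external = Encoding::US_ASCII)
    bytes.dup.force_encoding(external).encode(INTERNAL)
  end
end

internal = "caf\u00e9".encode(SquashIO::INTERNAL)
puts SquashIO.to_external(internal)   # the non-ASCII character is squashed
```

A Windows-specific layer would be another pair of methods with a different external encoding; the IO routines would only ever see the internal form.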
Java programmers will tell you that converting Unicode
to a native encoding takes up a surprisingly large
amount of time. Reading a string from a file, doing a
trivial substitution, and writing it to another file
does an unnecessary amount of work. Granted, nobody
expects a Ruby script to be blindingly fast, but other
threads in this newsgroup are complaining about I/O
being slow.
I totally agree, and in fact I don’t think there is
any one string strategy that is best for everyone. If
the savings for the ASCII community were sufficiently
great then UTF8 might be a better default than UCS2.
I feel the important thing is to have strings,
regexes, and IO operate on characters, and then once
this character interface is established, worry about
the efficiency of implementation.
Maybe this has been suggested already but, since Ruby
is object-oriented, I’d vote for two (or more) virtually
String classes, one for Unicode strings, one for
I’d vote for any number of underlying string
implementations, provided there is one character based
interface on them that I use when I call length().
Certainly an ANSI-only implementation and a UCS2 one,
maybe a UTF8 one and a TRON one as well. But the
important part would be to create the interface, so
that people with various needs could add
implementations in their own time. It’s the making of
a character based interface to strings in Ruby that is
the ‘blocking’ operation.
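A rough sketch of what I mean by one character based interface over several implementations. All class names here (CharString, AsciiString, Utf8String) are hypothetical, and only two backends are shown; a UCS2 or TRON one would just supply its own each_char.

```ruby
# Hypothetical interface: subclasses supply each_char, and the
# character-based operations are derived from it.
class CharString
  include Enumerable

  def each(&block)
    each_char(&block)
  end

  # Length in characters, never in bytes.
  def length
    count
  end

  # Character at a zero-based index, or nil if out of range.
  def char_at(index)
    each_with_index { |ch, i| return ch if i == index }
    nil
  end
end

# One byte per character.
class AsciiString < CharString
  def initialize(bytes)
    @bytes = bytes.b
  end

  def each_char
    @bytes.each_byte { |b| yield b.chr }
  end
end

# Variable-width storage; decoding happens inside the implementation.
class Utf8String < CharString
  def initialize(bytes)
    @bytes = bytes.b
  end

  def each_char
    @bytes.unpack("U*").each { |cp| yield [cp].pack("U") }
  end
end

s = Utf8String.new("caf\u00e9")
puts s.length   # counts characters, although the storage is 5 bytes
```

The caller only ever sees length and char_at in terms of characters; how slow or fast each backend is can be worried about later.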
Note that Python has two string types which are
indistinguishable. Every string function “does the
right thing” whether it operates on a byte string or a
wide string, as far as I know. (I haven’t tried the
regexp functions yet.)
They do the right thing (and IIRC because other string
ops in Python are based on them, the rest of Python
does the right thing too). It looks to me like a
‘good enough’ solution, much better than byte-array