Um. Do you mean UTF-32? Because there's *no* binary representation of
Unicode Character Code Points that isn't an encoding of some sort. If
that's the case, that's unacceptable from a memory-representation
perspective.

Yes, I do mean the String *interface* to be UTF-32, or pure code
points, which is the same but less susceptible to standard changes, if
accessed at character level. If accessed at substring level, a
substring of a String is obviously a String, and you don't need a
bitwise representation at all.
Again, this is completely unacceptable from a memory usage perspective.
I certainly don't want my programs taking up 4x the memory just for
string handling.
But "pure code points" is a red herring and a mistake in any case. Code
points aren't sufficient. You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING ACUTE
ACCENT as opposed to A ACUTE). Indeed, some glyphs can *only* be
produced with multiple code points. Dealing with this intelligently
requires a *lot* of smarts, but it's precisely what we should do.
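
For instance (a minimal sketch in today's Ruby, using the
unicode_normalize method later versions ship with, purely as an
illustration):

    composed   = "\u00E1"       # LATIN SMALL LETTER A WITH ACUTE, one code point
    decomposed = "a\u0301"      # lowercase a + COMBINING ACUTE ACCENT, two code points

    composed == decomposed                          # => false
    composed.length                                 # => 1
    decomposed.length                               # => 2
    composed == decomposed.unicode_normalize(:nfc)  # => true

Same glyph, two different code point sequences; a character-level API
has to decide which of those views it exposes.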
According to my proposal, Strings do not need an encoding from the
String user's point of view when working just with Strings, and users
won't care apart from memory and performance consumption, which I
believe can be made good enough with a totally encapsulated internal
storage format to be decided later. I will avoid a premature
optimization debate here for now.
Again, you are incorrect. I *do* care about the encoding of each String
that I deal with, because only that allows me (or String) to deal with
conversions appropriately. Granted, *most* of the time, I won't care.
But I do work with legacy code page stuff from time to time, and
pronouncements that I won't care are just arrogance or ignorance.
Of course encoding matters when Strings are read or written somewhere,
or converted to bit-/bytewise representation explicitly. The Encoding
Framework, however it'll look, needs to be able to convert to and from
Unicode code points for these operations only, and not between
arbitrary encodings. (You *may* code this to recode directly from
the internal storage format for performance reasons, but that'll be
transparent to the String user.)
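
Roughly what I have in mind, sketched with the String#encode call Ruby
later shipped (not the API under discussion, only an illustration):

    text   = "Grüße"                    # just a String; internal storage is opaque
    latin1 = text.encode("ISO-8859-1")  # convert only when bytes leave the program
    back   = latin1.encode("UTF-8")     # and again when bytes come back in
    back == text                        # => true

Inside the program, String-to-String operations never touch an encoding
at all.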
I prefer arbitrary encoding conversion capability.
This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue. But Unicode set out to prevent exactly this, and if we
believe in Unicode at all, we can only hope they'll fix this in an
upcoming revision. Meanwhile we could map any additional characters
(or sets of them) we need to higher, unused Unicode planes; that'll be
no worse than having different, possibly incompatible kinds of Strings.
Those choices aren't ours to make.
We'll need an additional class for pure byte vectors, or just use
Array for this kind of work, and I think this is cleaner.
I don't. Such an additional class adds unnecessary complexity to
interfaces. This is the *main* reason that I oppose the foolish choice
to pick a fixed encoding for Ruby Strings.
Legacy data and performance.
Map legacy data, that is, characters still not in Unicode, to a high
plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them we can change that to the official
code points. Note there are no files in String's internal storage
format, so we don't have to worry about reencoding them.
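
As a sketch of what I mean (the byte values and the choice of Plane 15
are made up for illustration):

    PLANE_15_BASE = 0xF0000   # Supplementary Private Use Area-A

    def map_legacy_char(byte)
      # park each not-yet-encoded legacy character in a private-use plane
      (PLANE_15_BASE + byte).chr(Encoding::UTF_8)
    end

    text = [0x81, 0x9F].map { |b| map_legacy_char(b) }.join
    text.each_codepoint { |cp| printf("U+%05X\n", cp) }   # U+F0081, U+F009F

When Unicode eventually assigns real code points, only this mapping
table changes, not the Strings themselves.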
Um. This is the statement of someone who is ignoring legacy issues.
Performance *is* a big issue when you're dealing with enough legacy
data. Don't punish people because of your own arrogance about encoding
choices.
Again: Unicode Is Not Always The Right Choice. Anyone who tells you
otherwise is selling you a Unicode toolkit and only has their wallet in
mind. Unicode is *often* the right choice, but it's *not* the only
choice and there are times when having the *flexibility* to work in
other encodings without having to work through Unicode as an
intermediary is the right choice. And from an API perspective,
separating String and "ByteVector" is a mistake.
On the other hand, with my proposal conversions need to be done at
other times than for M17N Strings, and it depends on the application
whether that is more or less often. String-to-String operations never
need to do recoding, as opposed to M17N Strings. I/O always needs
conversion, and may need conversion with M17N too. I have a hunch that
allowing different kinds of Strings around (as in M17N, presumably)
would require recoding far more often.
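
For comparison, this is roughly how strings with per-object encodings
behave in the M17N design Ruby eventually shipped (sketched with
current syntax, only to show where recoding, or failure, can occur):

    utf8   = "Grüße"
    latin1 = "Grüße".encode("ISO-8859-1")

    utf8 + latin1
    # => Encoding::CompatibilityError: incompatible character encodings:
    #    UTF-8 and ISO-8859-1

With a single internal representation, the concatenation above would
just work, and recoding would be confined to I/O.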
Unlikely. Mixed-encoding data handling is uncommon.
-austin
···
On 6/18/06, Juergen Strobel <strobel@secure.at> wrote:
On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:
--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
* austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
* austin@zieglers.ca