Unicode in Ruby now?

Unicode 3.1 is 32 bits wide.

I have just looked at my 3.0 standard and the 3.1 and 3.2 updates on the
web site, and I do not see any evidence of this. Did I miss something?
See the message I just posted for the details as I know them.
UAX #19: UTF-32:

–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–

3 Relation to ISO/IEC 10646 and UCS-4

ISO/IEC 10646 defines a 4-byte encoding form called UCS-4. Since UTF-32 is
simply a subset of UCS-4 characters, it is conformant to ISO/IEC 10646 as
well as to the Unicode Standard.

As of the recent publication of the second edition of ISO/IEC 10646-1,
UCS-4 still assigns private use codepoints (E00000₁₆…FFFFFF₁₆ and
60000000₁₆…7FFFFFFF₁₆) that are not in the range of valid Unicode
codepoints. To promote interoperability among the Unicode encoding forms
JTC1/SC2/WG2 has approved a motion removing those private use assignments:

Resolution M38.6 (Restriction of encoding space) [adopted unanimously]

“WG2 accepts the proposal in document N2175 towards removing the provision
for Private Use Groups and Planes beyond Plane 16 in ISO/IEC 10646, to
ensure internal consistency in the standard between UCS-4, UTF-8 and
UTF-16 encoding formats, and instructs its project editor [to] prepare
suitable text for processing as a future Technical Corrigendum or an
Amendment to 10646-1:2000.”

While this resolution must still be approved as an Amendment to
10646-1:2000, the Unicode Technical Committee has every expectation that
once the text for that Amendment completes its formal balloting it will
proceed smoothly to publication as part of that standard.

Until the formal balloting is concluded, the term UTF-32 can be used to
refer to the subset of UCS-4 characters that are in the range of valid
Unicode code points. After it passes, UTF-32 will then simply be an alias
for UCS-4 (with the extra requirement that Unicode semantics are observed).

–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–8<–

I do not see a reason for projects like Mojikyo to exist
when the same thing can be done perfectly well in 32-bit Unicode.

Mojikyo is doing things like setting code points for characters that
will never exist in Unicode, because those characters are combined due
to the character combining rules. Mojikyo has a different purpose from
Unicode: Unicode wants to make doing standard, day-to-day work easy;
Mojikyo wants to give maximum flexibility in the display of Chinese
characters. Given the number and complexity of kanji, these two aims are
basically incompatible.
I still don’t see why both goals should be incompatible a priori. But this
is possibly off-topic here. :slight_smile:

···

On Thu, Aug 01, 2002 at 09:55:48PM +0900, Curt Sampson wrote:

On Thu, 1 Aug 2002, Alexander Bokovoy wrote:


/ Alexander Bokovoy

I went to a Grateful Dead Concert and they played for SEVEN hours. Great song.
– Fred Reuss

This is just a comment from an interested but mildly uninvolved
bystander (though I’m dealing with similar issues with Parrot) but…
Given that the people who’ve made these decisions have made them
about their native language (a language that is neither your nor my
native language) perhaps it’s a bit presumptuous to decide that what
they’ve done is wrong and some other way is better. It’d be about the
same as someone else deciding that there’s no need for a character
set to deal with upper and lowercase roman letters since, after all,
they represent the same thing. Or that you’re only supporting
whatever Esperanto needs since that should be good enough for anyone.

This is someone’s language you’re dealing with. It existed long
before computers did, it’s deeply rooted in culture, and is by far
more important than any computer issue. Language is important–it
conveys meaning and culture, and is the data. The computer is a tool.
If the tool can’t deal with the language, it means the tool is
broken, not the language.

···

At 1:50 PM +0900 8/1/02, Curt Sampson wrote:

On Thu, 1 Aug 2002, Hal E. Fulton wrote:

Seriously, since you have some expertise, I’m sure your knowledge will
be valuable in improving Ruby… talk to vruz also.

I doubt it. My opinion of the matter is that the correct way to
do things is to go with Unicode internally. (This does not rule
out processing non-Unicode things, but you process them as binary
byte-strings, not as character strings.) You lose a little bit of
functionality this way, but overall it’s easy, fast, and gives you
everything you really need.

Unfortunately, a lot of Japanese programmers disagree with this. They
feel the need, for example, to have separate code points for a single
character, simply because one stroke is slightly different between the
way Japanese and Chinese people write it. (The meaning is exactly the
same.)


Dan

--------------------------------------“it’s like this”-------------------
Dan Sugalski even samurai
dan@sidhe.org have teddy bears and even
teddy bears get drunk

“Curt Sampson” cjs@cynic.net wrote in message
news:Pine.NEB.4.44.0208021752530.442-100000@angelic.cynic.net

But isn’t this what matz suggests?
Each stream is tagged, which is the same as having different types. It’s
basically just a different way to store the type while having a lot of
common string operations.

No, because then you have to deal with conversions. Most popular
character sets are convertible to Unicode and back without loss. That is
not true of any arbitrary pair of character sets, though, even if you go
through Unicode.

Not really, you just produce a runtime error saying that the data cannot
be - say - concatenated. You just use the same vehicle to carry different
cargo. You can have special functions for converting to Unicode and from
Unicode, and similar ones for popular Asian scripts.
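
To sketch what that tagging might look like (the class and method names
here are hypothetical, just to show the shape of the idea, not an
existing Ruby feature):

    # Hypothetical sketch of a tagged string: the encoding travels with
    # the data, and incompatible concatenation raises at runtime.
    class TaggedString
      attr_reader :bytes, :encoding

      def initialize(bytes, encoding)
        @bytes    = bytes
        @encoding = encoding
      end

      def +(other)
        unless other.encoding == encoding
          raise "cannot concatenate #{encoding} with #{other.encoding}"
        end
        TaggedString.new(bytes + other.bytes, encoding)
      end
    end

    a = TaggedString.new("abc", "EUC-JP")
    b = TaggedString.new("def", "Shift_JIS")
    a + b   # raises RuntimeError: cannot concatenate EUC-JP with Shift_JIS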

But there’s an even better reason than this for converting to
Unicode on input, rather than doing internal tagging. If you don’t
have conversion tables for a particular character encoding, it’s
much better to find out at the time you try to get the information
in to the system than at some arbitrary later point when you try
to do a conversion. That way you know where the problem information
is coming from.

That is essentially static versus dynamic typing. I’d say in most
situations the application is pretty well aware of what it is doing. The
tagging allows a generic routine to handle multiple formats if it so
chooses, or to raise an error if it gets anything but Unicode (or
whatever).

2. Add a UString or similar for dealing with UTF-16 data. There's

Obviously you know more about Unicode than most. What is the practical
difference between UCS-4, UCS-2 and UTF-16? Is it that “extended
characters” - or surrogates - will take up more space than UCS-4 but
typically take up the same space as UCS-2?

and options for future extensions.

Not that I know of. Can you explain what these are?

I guess you know more about this than I do. I can’t give you details; I
only have it from memory. It is possible that it is covered by reserved
ranges of code pairs.
In that case UCS-4 should be sufficient.

Hence UCS-4 is a strategy with limited timespan.

Not at all, unless they decide to change Unicode to the point where it
no longer uses 16-bit code values, or add escape codes, or something
like that. That would be backward-incompatible, severely complicate
processing, and generally cause all hell to break loose. So I’d rate
this as “Not Likely.”

It wouldn’t be the first time hell breaks loose in this area though.

2. There are many situations where, even if surrogate pairs
are present, you don't know or care, and need do nothing to
correctly deal with them.

Does this mean that UCS-2 is the best format?

So I propose just what the Unicode standard itself proposes in
section 5.4: UString (or whatever we call it) should have the
Surrogate Support Level “none”; i.e., it completely ignores the
existence of surrogate pairs. Things that use UString that have
the potential to encounter surrogate pair problems or wish to
interpret them can add simple or complex code, as they need, to
deal with the problem at hand. (Many users of UString will need to
do nothing.)

Is that UCS-2 or UTF-16 then?

Note that there’s a big difference between this and your UTF-8
proposal: ignoring multibyte stuff in UTF-8 is going to cause much,
much more lossage because there’s a much, much bigger chance of
breaking things when using Asian languages. With UTF-16, you probably
won’t even encounter surrogates, whereas with Japanese in UTF-8,
pretty much every character is multibyte.

I did not mean that you should ignore the content. But you can process it
as if it were ASCII, because in many languages everything that is not text
is found in the ASCII range. Due to the way UTF-8 is encoded you never risk
getting a spurious ASCII character following this path. For example, you
can find delimited text simply by scanning from one double quote to the
next. Everything in between is sound text, possibly in UTF-8 - you do not
need to care about this. Subsequently you may wish to convert the delimited
string into UCS-2 or whatever.

This approach avoids complex character type lookups when parsing text. It
will, for example, work for XML and Ruby.
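
As a minimal sketch of that quote-to-quote scan (the helper name is made
up, and modern Ruby string handling is assumed):

    # Scan for double-quoted sections without decoding the data: UTF-8
    # lead and continuation bytes are all >= 0x80, so a '"' byte (0x22)
    # can never appear inside a multibyte character.
    def quoted_sections(data)
      data.scan(/"([^"]*)"/).flatten
    end

    quoted_sections(%(name = "東京", id = "42"))
    # => ["東京", "42"]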

In order to add international identifier support to a UTF-8 stream processed
as ASCII you can use the following pattern (I believe Ruby already does
something similar).

identifier = /[A-Za-z_\x80-\xfd][A-Za-z_0-9\x80-\xfd]*/

Mikkel

Reasonable. How do you want to specify the reading/writing charset?

						matz.
···

In message “Re: Unicode in Ruby now?” on 02/08/01, Curt Sampson cjs@cynic.net writes:

Me? Not so much, actually. I need to be able to read in ISO-8859-1,
ISO-2022-JP, EUC-JP, Shift_JIS and UTF-8, convert them to a common
internal format so I don’t have to worry about where the data I am
manipulating came from, and do the various conversions again for output.

Unicode 3.1 is 32 bits wide.

I have just looked at my 3.0 standard and the 3.1 and 3.2 updates on the
web site, and I do not see any evidence of this. Did I miss something?
See the message I just posted for the details as I know them.

UAX #19: UTF-32:
[section 3, Relation to ISO/IEC 10646 and UCS-4]

Actually, I was looking for someone to attack my argument, not
support it. :slight_smile:

What this says is that they are removing some private code areas
in the ISO 10646 UCS-4 encoding so that it will become smaller and
compatible with UTF-32. And, as it says at the beginning of that
document:

UTF-32 is restricted in values to the range 0..10FFFF₁₆, which
precisely matches the range of characters defined in the Unicode
Standard (and other standards such as XML), and those representable
by UTF-8 and UTF-16.

So Unicode is not 32-bit in any sense of the word. A character in
the UTF-32 encoding of Unicode takes up 32 bits, but many of those
bits are unused.
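
(A quick check of the numbers, in Ruby, purely for illustration:)

    # The highest valid Unicode code point needs only 21 bits, so a
    # 32-bit UTF-32/UCS-4 code unit leaves the top 11 bits unused.
    0x10FFFF.to_s(2).length   # => 21
    0x10FFFF <= 2**21 - 1     # => true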

Mojikyo wants to give maximum flexibility in the display of Chinese
characters. Given the number and complexity of kanji, these two aims are
basically incompatible.

I still don’t see why both goals should be incompatible a priori. But this
is possibly off-topic here. :slight_smile:

Partly efficiency concerns. As the speed of CPUs increases relative
to memory, the relative cost of string handling (which is pretty
memory intensive) gets higher and higher. And also things like ease
of use; avoiding duplications makes things like pattern matching
and use of dictionaries much easier. (Imagine, for example, that
ASCII had two 'e’s in it, and people used one or the other randomly,
as they liked. Now instead of writing s/feet/fleet/, you have to
write at least s/f[ee][ee]t/fleet/, or in certain fussy cases even
s/f([ee][ee])t/fl\1t/. Ouch.)

cjs

···

On Thu, 1 Aug 2002, Alexander Bokovoy wrote:

On Thu, Aug 01, 2002 at 09:55:48PM +0900, Curt Sampson wrote:

On Thu, 1 Aug 2002, Alexander Bokovoy wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Given that the people who’ve made these decisions have made them
about their native language (a language that is neither your nor my
native language) perhaps it’s a bit presumptuous to decide that what
they’ve done is wrong and some other way is better.

I don’t think it’s so presumptuous.

First, I do know some Japanese and Chinese, including kanji, so this
stuff isn’t a complete mystery to me. Second, through experience
building I18N web sites and suchlike, I’d say I have a better
understanding of I18N issues than many Japanese programmers do.
Certainly Japanese systems builders more often than not do not take I18N
issues into account. (And for many of them, why should they? They’re not
interested in anything outside of Japan, so it’s not worth spending time,
effort and money on it.)

Also, note that the Unicode-haters in Japan, while noisy amongst
programmers, are far from representative of the users. Most Japanese
couldn’t care less whether you even have 薔薇 (bara, rose) available in
kanji, much less anything in the Unicode surrogate area.

For those who really do need support for all the kanji, rather than just
those generally used in modern life, there are solutions that are much
better than Unicode will ever be, and they should use those. Those
solutions are also much higher in overhead (for both programming and
machine resources), though, and that burden shouldn’t be put on all
software.

An analogy might be text files versus DTP. ASCII doesn’t have things
like font sizes, kerning information, and so on, so it alone isn’t
useful for DTP. For that you use another, customized software system
that adds the capabilities you need. But this is a good thing, it means
that all those systems out there that don’t care about font, size,
kerning, etc. (such as your local database server) don’t deal with the
overhead of it.

If the tool can’t deal with the language, it means the tool is
broken, not the language.

No tool can deal with everything in the language. ASCII, or even
ISO-8859-1, certainly doesn’t deal with a huge number of issues in
English. Yet ASCII does a good job for a lot of everyday needs,
and doesn’t cost too much, so it serves us well. (Certainly seems
to be working ok in this e-mail message, anyway!)

cjs

···

On Fri, 2 Aug 2002, Dan Sugalski wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Obviously you know more about Unicode than most. What is the practical
difference between UCS-4, UCS-2 and UTF-16?

I don’t have my spec. handy, so I’m going from memory here; someone with
the spec in front of him should correct me if I’m wrong.

UCS-4 is a 4-byte encoding, and UCS-2 is a two-byte encoding for ISO-10646.
UCS-2 is similar to UTF-16, which is a Unicode encoding.

Is it that “extended
characters” - or surrogates - will take up more space than UCS-4 but
typically take up the same space as UCS-2?

All characters take up 4 bytes in UCS-4. Each code value takes up
two bytes in UCS-2 and UTF-16; some characters need two code values.
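
For concreteness, here is how an ASCII letter and a supplementary-plane
character (U+1D11E, MUSICAL SYMBOL G CLEF) come out under modern Ruby’s
String#encode; this API did not exist at the time and is shown only as
an illustration:

    "A".encode("UTF-32BE").bytesize    # => 4
    "A".encode("UTF-16BE").bytesize    # => 2
    clef = [0x1D11E].pack("U")         # build the character from its code point
    clef.encode("UTF-32BE").bytesize   # => 4 (always one 4-byte unit)
    clef.encode("UTF-16BE").bytesize   # => 4 (a surrogate pair: two 2-byte units)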

Not at all, unless they decide to change Unicode to the point where it
no longer uses 16-bit code values, or add escape codes, or something
like that. That would be backward-incompatible, severely complicate
processing, and generally cause all hell to break loose. So I’d rate
this as “Not Likely.”

It wouldn’t be the first time hell breaks loose in this area though.

It would for Unicode. I don’t think they’re likely to completely
break backwards compatibility.

2. There are many situations where, even if surrogate pairs
are present, you don't know or care, and need do nothing to
correctly deal with them.

Does this mean that UCS-2 is the best format?

In my opinion, yes.

I did not mean that you should ignore the content. But you can process it
as if it were ASCII, because in many languages everything that is not text
is found in the ASCII range. Due to the way UTF-8 is encoded you never risk
getting a spurious ASCII character following this path. For example, you
can find delimited text simply by scanning from one double quote to the
next. Everything in between is sound text, possibly in UTF-8 - you do not
need to care about this.

Right. The same is true of UTF-16. However, UTF-16 has the advantage
that it’s more compact when representing Japanese or other Asian
languages, and it’s easier to manipulate individual characters.

cjs

···

On Sat, 3 Aug 2002, MikkelFJ wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Well, what Java does works fine for me. Basically, Java has “streams”
which do binary I/O, and “readers/writers” which do character I/O. So to
do character I/O, you hand your InputStream or OutputStream to an
InputStreamReader or OutputStreamWriter, which does the character
encoding conversion. Typically methods that open files or whatever and
return a reader or writer will have two signatures: one which uses the
“system default” character encoding (set by the locale when you start
the JVM), and the other where you can explicitly specify the character
encoding as a parameter.

This makes it easy to write “auto-sensing” readers, too; so
long as they can read enough in advance of the first read from the
consumer, they can buffer it and look to see if it’s, for example,
Shift_JIS versus EUC-JP.
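
A rough Ruby equivalent of that layering might look like the sketch
below; the class name and API are hypothetical, and modern Ruby’s
encoding methods are assumed:

    # Hypothetical sketch: wrap a binary IO and convert from a stated
    # external encoding to a common internal one (UTF-8 here) on read.
    class EncodedReader
      def initialize(io, external_encoding)
        @io  = io
        @enc = external_encoding
      end

      def read
        @io.read.force_encoding(@enc).encode("UTF-8")
      end
    end

    # File.open(path, "rb") { |f| EncodedReader.new(f, "Shift_JIS").read }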

But I’d be open to other ways of doing this, too.

BTW, I’d prefer to use the term “character encoding” rather than
“character set,” as technically, the character set is just the
characters themselves, and not their assignment to binary numbers
or number sequences. Also, it would probably be best to use the
IANA standard names for character encodings (though they call them
“character sets”), available at

http://www.iana.org/assignments/character-sets

cjs

···

On Thu, 1 Aug 2002, Yukihiro Matsumoto wrote:

Reasonable. How do you want to specify the reading/writing charset?


Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

How about

class IO
  # nl_langinfo(3) is a libc function, probably not available on Windows
  USER_DEFAULT_ENCODING = nl_langinfo(CODESET)
  attr_accessor :encoding
end

Have read/write look at @encoding. It should also be added to the IO
constructors in some way, but made optional.
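
Intended usage would then be something like this (purely a sketch of the
proposed behaviour, not anything that exists today):

    io = File.open("data.txt")
    io.encoding = "EUC-JP"   # falls back to USER_DEFAULT_ENCODING if unset
    line = io.gets           # would come back already converted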

Tobias

···

On Thu, 1 Aug 2002, Yukihiro Matsumoto wrote:

Reasonable. How do you want to specify the reading/writing charset?

I have just looked at my 3.0 standard and the 3.1 and 3.2 updates on the
web site, and I do not see any evidence of this. Did I miss something?
See the message I just posted for the details as I know them.

UAX #19: UTF-32:
[section 3, Relation to ISO/IEC 10646 and UCS-4]

Actually, I was looking for someone to attack my argument, not
support it. :slight_smile:
:slight_smile: The reason I’ve pointed to that section is that there will soon be
no difference between ISO/IEC 10646 and Unicode in terms of covered code
space. This means that as soon as that is accomplished, uniform expansion
into the unused bits of the 32-bit space can start. If the CJK community
has an interest in it, of course. As you may remember, there were some
complaints in the past about the ‘small’ code space for covering CJK in
Unicode. That would no longer be the case relatively soon.

Mojikyo wants to give maximum flexability in the display of Chinese
characters. Given the number and complexity of kanji, these two aims are
basically incompatable.

I still don’t see why both goals should be incompatible a priori. But this
is possibly off-topic here. :slight_smile:

Partly efficiency concerns. As the speed of CPUs increases relative
to memory, the relative cost of string handling (which is pretty
memory intensive) gets higher and higher. And also things like ease
of use; avoiding duplications makes things like pattern matching
and use of dictionaries much easier. (Imagine, for example, that
ASCII had two 'e’s in it, and people used one or the other randomly,
as they liked. Now instead of writing s/feet/fleet/, you have to
write at least s/f[ee][ee]t/fleet/, or in certain fussy cases even
s/f([ee][ee])t/fl\1t/. Ouch.)
Well, it raises a completely different problem set. It attacks a
foundation upon which the current meaning of ‘character encoding’ is
built. Remember that ‘character encoding’ is usually understood as a way
to address and differentiate ‘characters’ in a ‘string’ using one
property – their position in some abstract ‘alphabet’ – which has little
to do with real-life language properties. For example, CP1251, which is
used in Belarus and other Slavic countries, has two ‘i’ – one from ASCII
and another (with exactly the same glyph in fonts) for the Belarusian and
Ukrainian languages. There is no information in the CP1251 encoding to
differentiate those two, except by attaching an external property table
(which is done in the IANA proposal by mapping all positions of the
encoding to Unicode code points, which, in turn, have all the needed
properties assigned).
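
To make the CP1251 example concrete under modern Ruby (the 0xB3 byte
value is quoted from the Windows-1251 table as I recall it, so treat it
as illustrative):

    # Latin 'i' and the Belarusian/Ukrainian Cyrillic 'і' look identical
    # in a font; only the mapping to Unicode code points tells them apart.
    latin    = "i"                                    # ASCII 0x69
    cyrillic = "\xB3".force_encoding("Windows-1251")  # CP1251 0xB3
    latin.ord                      # => 105  (U+0069)
    cyrillic.encode("UTF-8")       # => "і"
    cyrillic.encode("UTF-8").ord   # => 1110 (U+0456)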

What you are showing above is a need to perform operations on these
‘external’ properties, as is done in ICU, for example. Actually, it would
be much more productive to implement something like Mojikyo inside ICU.

···

On Thu, Aug 01, 2002 at 11:23:38PM +0900, Curt Sampson wrote:


/ Alexander Bokovoy

Ever notice that even the busiest people are never too busy to tell you
just how busy they are?

Hi,

Well, what Java does works fine for me. Basically, Java has “streams”
which do binary I/O, and “readers/writers” which do character I/O. So to
do character I/O, you hand your InputStream or OutputStream to an
InputStreamReader or OutputStreamWriter, which does the character
encoding conversion. Typically methods that open files or whatever and
return a reader or writer will have two signatures: one which uses the
“system default” character encoding (set by the locale when you start
the JVM), and the other where you can explicitly specify the character
encoding as a parameter.

This makes it easy to write “auto-sensing” readers, too; so
long as they can read enough in advance of the first read from the
consumer, they can buffer it and look to see if it’s, for example,
Shift_JIS versus EUC-JP.

Thank you for the information.

BTW, I’d prefer to use the term “character encoding” rather than
“character set,” as technically, the character set is just the
characters themselves, and not their assignment to binary numbers
or number sequences. Also, it would probably be best to use the
IANA standard names for character encodings (though they call them
“character sets”), available at

http://www.iana.org/assignments/character-sets

I use the following definitions:

character set:
the set of characters, with a number assigned to each character.

code point:
the number assigned to each character in a particular
character set.

character encoding scheme:
the way to represent a sequence of code points.

Mojikyo is a character set. Unicode is a character set. UTF-8 is a
character encoding scheme for Unicode. Shift_JIS is a character
encoding scheme. ISO 10646 defines both a character set and encoding
schemes.
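
(As an aside, the distinction shows up directly in modern Ruby; this API
did not exist at the time and is shown only as an illustration.)

    # Same character, same code point, different encoding schemes.
    a = "あ"                     # HIRAGANA LETTER A
    a.ord                        # => 12354 (U+3042, the code point)
    a.encode("UTF-8").bytes      # => [227, 129, 130]
    a.encode("EUC-JP").bytes     # => [164, 162]
    a.encode("Shift_JIS").bytes  # => [130, 160]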

						matz.
···

In message “Re: Unicode in Ruby now?” on 02/08/01, Curt Sampson cjs@cynic.net writes:

:slight_smile: The reason I’ve pointed to that section is that there will soon be
no difference between ISO/IEC 10646 and Unicode in terms of covered code
space.

Right. They’re reducing the ISO/IEC code space to match Unicode.

This means that as soon as that is accomplished, uniform expansion
into the unused bits of the 32-bit space can start.

I very, very much doubt that. Remember, Unicode uses 16-bit code values,
and all high and low surrogate characters are immediately identifiable.
Breaking this would cause a great deal of pain.
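
The surrogate mechanics are simple enough to show directly; this is the
standard UTF-16 mapping, written as a small Ruby sketch (the function
name is made up):

    # Map a supplementary-plane code point to its UTF-16 surrogate pair.
    # High surrogates are 0xD800..0xDBFF, low surrogates 0xDC00..0xDFFF,
    # so both halves are immediately recognisable in a stream of code values.
    def surrogate_pair(cp)
      v = cp - 0x10000
      [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
    end

    surrogate_pair(0x1D11E).map { |u| u.to_s(16) }   # => ["d834", "dd1e"]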

If the CJK community has an
interest in it, of course. As you may remember, there were some complaints
in the past about the ‘small’ code space for covering CJK in Unicode. That
would no longer be the case relatively soon.

Yeah, but for day to day use, nobody even uses the surrogate pairs. This
is part of the whole point of Unicode; you can safely ignore them or do
only very minimal processing to deal with them, and all but specialized
applications will still work.

cjs

···

On Fri, 2 Aug 2002, Alexander Bokovoy wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Hi,

matz@ruby-lang.org (Yukihiro Matsumoto) wrote:

Mojikyo is a character set. Unicode is a character set.

IMHO, Mojikyo is a character-glyph set. It defines details
of glyph design (HANE, TOME, HARAI) which are unified in other
character sets (JIS and UCS).
That’s the reason why Mojikyo is not (and should not be) unified
into Unicode, I think.

Regards,

TAKAHASHI ‘Maki’ Masayoshi E-mail: maki@rubycolor.org

Curt Sampson wrote:

Remember, Unicode uses 16-bit code values,

No. Unicode uses UCS-4 characters, 32 bits. It also provides UCS-2,
which has surrogates, which don’t allow easy extension to encoding
all UCS-4 characters. However that’s not a good argument why programs
should deal with characters as anything less than 32-bit. UCS-2 has
always been a broken encoding and should be avoided, but UTF-8
resolves the issue (up to 31 bits anyway).

···


Clifford Heath


TAKAHASHI Masayoshi wrote:

That’s the reason why Mojikyo is not (and should not be) unified
into Unicode, I think.

That sounds fair, but computers still need to process such symbols.
IMO the ISO 10646 folk should be approached to allocate a 24-bit
block inside the UCS-4 encoding, but outside the Unicode space.
That way the UCS-4 Mojikyo characters can be encoded using either
the 4/5/6 byte extension of UTF-8, or using the UTF-8 style of
encoding with 1/2/3/4 bytes (i.e. with an assumed UCS-4 top byte
that isn’t zero, as with Unicode).
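
For reference, here are the byte lengths under the original 31-bit UTF-8
definition Clifford alludes to, sketched as a lookup; the 5- and 6-byte
forms are the ones later dropped when UTF-8 was restricted to the
Unicode range (the function name is made up):

    # Byte length of a code point under the original (pre-restriction)
    # UTF-8 scheme, which covered values up to 0x7FFFFFFF.
    def old_utf8_length(cp)
      case cp
      when 0x00..0x7F          then 1
      when 0x80..0x7FF         then 2
      when 0x800..0xFFFF       then 3
      when 0x10000..0x1FFFFF   then 4
      when 0x200000..0x3FFFFFF then 5
      else                          6
      end
    end

    old_utf8_length(0x10FFFF)    # => 4 (all of Unicode fits in 4 bytes)
    old_utf8_length(0x4000000)   # => 6 (only reachable outside the Unicode range)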

I agree with Dan’s comments, and think this would be the best way
to resolve the issue.

···


Clifford Heath

These statements are both very wrong. Please consult the Unicode
specification or read previous messages here under this subject
line.

cjs

···

On Mon, 5 Aug 2002, Clifford Heath wrote:

Curt Sampson wrote:

Remember, Unicode uses 16-bit code values,

No. Unicode uses UCS-4 characters, 32 bits. It also provides UCS-2,
which has surrogates, which don’t allow easy extension to encoding
all UCS-4 characters.


Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

IMO the ISO 10646 folk should be approached to allocate a 24-bit
[…]
I agree with Dan’s comments, and think this would be the best way
to resolve the issue.

Then we would delay i18n string processing until the ISO 10646 people have
made such a decision? This will probably never happen! And even if it
does, then not in the next few years. We need Unicode in Ruby now. It
seems we can get by with a choice of two possible canonical encodings to
be used for the result of concatenating strings with different encodings:
UTF-8 and some Mojikyo encoding, chosen based on the encodings of the
original strings. Let’s implement it.

Tobias

···

On Fri, 2 Aug 2002, Clifford Heath wrote:

Curt Sampson wrote:

These statements are both very wrong.

I was deliberately being “reinterpretive”, but what I said is the effective
truth. If you want to do Unicode 3 correctly and simply, then a 32-bit
internal representation is the right one - surrogates simply recreate
exactly the same problems of variable-length encoding that plagued
earlier efforts at providing a simple way (for the programmer) to code
correctly. Externally, a more compact encoding is needed (UTF-8 or UTF-16
are valid choices), but internally, UTF-16 is bogus in the extreme.
Even internally, if appropriately hidden behind an API that only exposes
whole characters, a more compact encoding (such as I’ve recently
described) can be worthwhile.

You seem to be so wedded to the Java/Unicode model that you can’t see
out of the hole into which you’ve dug yourself.

···


Clifford Heath

Hi,

···

In message “Re: Unicode in Ruby now?” on 02/08/02, Tobias Peters tpeters@invalid.uni-oldenburg.de writes:

Then we would delay i18n string processing until the ISO 10646 people have
made such a decision? This will probably never happen! And even if it
does, then not in the next few years. We need Unicode in Ruby now. It
seems we can get by with a choice of two possible canonical encodings to
be used for the result of concatenating strings with different encodings:
UTF-8 and some Mojikyo encoding, chosen based on the encodings of the
original strings. Let’s implement it.

I’m not against Unicode or any other charset. I just want applications
written in Ruby to be able to choose their canonical encodings.
Many of them will choose Unicode in the future. But I don’t want to force
Unicode in any way when EUC-JP is good enough. And I’m implementing
it now.

						matz.