Unicode in Ruby now?

Tobias1 · 31 July 2002 12:38

I’ve read the thread “Unicode in Ruby’s Future?” [ruby-talk: 40016]. It
remains a bit vague.

Of course you can already translate strings between multiple encodings
with one of the existing chartacter encoding libs right now. A problem
that I am currently facing:

When I export a string to an utf-8 encoded stream, how can I possibly know
its current encoding. Strings do not have an “encoding” tag. Will they
have in future?

Wouldn’t it be a better solution to store strings in memory in a canonical
format (be it utf-8 for space savings, ucs-4 for O(1) indexing operations,
whatever) and let string sources and sinks have an “encoding” property,
and do the transformation on the fly?

We would need to identify possible sinks and sources of character strings
and how to determine their encoding anyway. Anyone interested? Perhaps
I’ll create a Wiki Page at rubygarden for that.

Examples: stdin and stderr would be influenced by the user’s locale.
Literal strings in ruby source code are a string source. There should be a
mechanism to state the character encoding for ruby source files, with a
reasonable default (which?). Filesystem names returned by Dir objects have
a charset encoding. How to determine that?

You get the picture. If you do Data Serialization to formats that restrict
the character encoding (be it xml or yaml), you have to know the encoding
of strings in memory. It would be helpful if ruby determined the character
encoding right when a string was created. Later on, there is no chance to
do that (except for error-prone heuristics).

Tobias

Yukihiro_Matsumoto2 · 31 July 2002 14:24

Hi,

When I export a string to an utf-8 encoded stream, how can I possibly know
its current encoding. Strings do not have an “encoding” tag. Will they
have in future?

Yes.

Wouldn’t it be a better solution to store strings in memory in a canonical
format (be it utf-8 for space savings, ucs-4 for O(1) indexing operations,
whatever) and let string sources and sinks have an “encoding” property,
and do the transformation on the fly?

No. Considering the existence of “big character set” like Mojikyo
(charset developed in Japan, which is bigger than Unicode), there
cannot be any ideal canonical format. In addition, from my
estimation, the cost penalty from code conversion to/from the
canonical character set is intolerable if one processes mainly on
non-ASCII, non-Unicode text data, like we do in Japan.

						matz.

···

In message “Unicode in Ruby now?” on 02/07/31, Tobias Peters tpeters@invalid.uni-oldenburg.de writes:

Clifford_Heath1 · 1 August 2002 00:20

Matz,

Is Mojikyo a superset of Unicode? If not, how hard is the translation to
UCS-4?

I designed the UCS-4 string class we use here in C++, with a UTF-8 storage
format (up to 31-bit with a six-byte UTF-8 sequence). The string class
remembers which character you last accessed and at what byte offset it
started, so that when you ask for another character, it can decide whether
heuristically to search forward from the start, forward or backward from
the remembered point (most common), or if it has ever counted the
characters, backward from the end. This minimises the search cost since
most string processing is largely sequential.

With the “remembered point” feature, I think UTF-8 has been a good tradeoff,
so much so that although I implemented the class using a pure interface and
a factory to allow alternate formats, we haven’t needed to do it.

BTW, re “style”, I like the definition I heard from a fashion figure:
“quirkiness with confidence”. I guess the definition doesn’t hold so well
for software though :-).

···

–
Clifford Heath

Tobias1 · 1 August 2002 10:01

When I export a string to an utf-8 encoded stream, how can I possibly know
its current encoding. Strings do not have an “encoding” tag. Will they
have in future?

Yes.

Nice. I still think sources and sinks of characters also need an
“encoding” property. Strings originating from some character source will
then have the source’s encoding. Strings exported to character sinks will
have to be converted on the fly in case of a different encodings. We could
make behaviour in case of unconvertible characters a property of character
sinks.

We also need rules how to combine strings with different encoding then.
concatenating two strings encoded in koi8-r and iso-8859-1, respectively,
may only be possible when the result is encoded in some unicode
representation.

Are there any other character sets of relevance that are not part of
unicode yet? Otherwise, we could probably live with just two possible
canonical character encodings. With canonical, here I mean the encoding of
a string that is the result of a combination of other strings with
different encodings.

No. Considering the existence of “big character set” like Mojikyo
(charset developed in Japan, which is bigger than Unicode), there
cannot be any ideal canonical format.

I understand that Mojikyo will not be folded into unicode due to
political reasons. Combining ruby strings in some unicode encoding with
ruby strings encoded in some Mojiko encoding might result in a runtime
error then.

In addition, from my
estimation, the cost penalty from code conversion to/from the
canonical character set is intolerable if one processes mainly on
non-ASCII, non-Unicode text data, like we do in Japan.

I understand that. It would affect all countries that use non-ascii
encodings.

Due to ruby’s dynamic nature we could probably implement most of what is
required for international string with the current ruby version. The
biggest problems that I see are:

Determining the encoding of string literals in source code. This should
be specified in the source file itself. Perhaps it’s possible to
implement by overriding “require” and “load”, read the whole file in
memory, and convert it to a user/system-specific default character set
before calling eval on it.
Determining the encoding of strings that describe File system Paths. I
have no idea if operating systems provide this information to user space
applications.

Anyone interested in working on it?

Tobias

···

On Wed, 31 Jul 2002, Yukihiro Matsumoto wrote:

In message “Unicode in Ruby now?” > on 02/07/31, Tobias Peters tpeters@invalid.uni-oldenburg.de writes:

Marcin_Qrczak_Kowalc · 6 August 2002 20:45

Thu, 1 Aug 2002 19:01:39 +0900, Tobias Peters tpeters@invalid.uni-oldenburg.de pisze:

Determining the encoding of strings that describe File system Paths. I
have no idea if operating systems provide this information to user space
applications.

On Unix it’s generally the encoding determined by the locale.
At least it should be in theory.

···

–
__("< Marcin Kowalczyk
__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Curt_Sampson · 1 August 2002 03:30

Yes.

Basically, Ruby being a Japanese product, it is unlikely ever to be able
to standardise on Unicode. The Japanese (well, enough of them to matter)
hate Unicode. (They seem to hate I18N in general, for that matter; they
certainly seem to pay less attention to it than, say, Americans.) Maybe
they just like doing their own thing rather than interoperating with the
rest of the world.

(Sorry about the minor flame. I spent two years in the U.S. and another
year here in Japan doing I18N work, and it was pretty darn frustrating.
And it was my great disappointment with ruby to find out that the I18N
support is so poor.)

cjs

···

On Thu, 1 Aug 2002, Clifford Heath wrote:

Is Mojikyo a superset of Unicode?

–
Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Yukihiro_Matsumoto2 · 1 August 2002 05:07

Hi,

Is Mojikyo a superset of Unicode? If not, how hard is the translation to
UCS-4?

Mojikyo character set contains all the CJK characters in ISO 10646.
But codepoint number for each character is different, so that we can
only do Unicode → Mojikyo conversion using table lookup. The
translation to UCS-4 from Mojikyo is nearly impossible. (How can one
assign non-Unicode codepoint number in UCS-4?)

http://www.mojikyo.org/

I designed the UCS-4 string class we use here in C++, with a UTF-8 storage
format (up to 31-bit with a six-byte UTF-8 sequence). The string class
remembers which character you last accessed and at what byte offset it
started, so that when you ask for another character, it can decide whether
heuristically to search forward from the start, forward or backward from
the remembered point (most common), or if it has ever counted the
characters, backward from the end. This minimises the search cost since
most string processing is largely sequential.

Remembered point technique is very interesting. When we meet
performance problem with Ruby I18N, I will use it.

						matz.

···

In message “Re: Unicode in Ruby now?” on 02/08/01, Clifford Heath cjh_nospam@managesoft.com writes:

Curt_Sampson · 1 August 2002 10:53

…
We also need rules how to combine strings with different encoding then.
concatenating two strings encoded in koi8-r and iso-8859-1, respectively,
may only be possible when the result is encoded in some unicode
representation.

Yeah. This is getting into complex nightmare city. That’s why I’d prefer
to have the basic system just work completely in Unicode. One could have
a separate character system (character and string classes, byte-stream
to char converters, etc.) to work with this tagged format if one wished.

Are there any other character sets of relevance that are not part of
unicode yet?

Yeah. There are tons of obscure Japanese characters that are not in and
will never be in Unicode, some of which exist in various other character
sets. In particular there’s Mojikyo (http://www.mojikyo.org/) which is
at 80,000 characters and growing.

I understand that Mojikyo will not be folded into unicode due to
political reasons.

Not just political reasons, but practical reasons. Unicode is designed
to work if you restrict yourself to using only 16-bit chars, and I
expect most programs are going to limit themselves to that. So even if
it were folded in to the extension space, most people wouldn’t use it.

cjs

···

On Thu, 1 Aug 2002, Tobias Peters wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

HAL_9000 · 1 August 2002 03:51

Well, here is your chance to contribute to
the I18N support…

Seriously, since you have some expertise, I’m
sure your knowledge will be valuable in
improving Ruby… talk to vruz also.

Hal

···

----- Original Message -----
From: “Curt Sampson” cjs@cynic.net
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Wednesday, July 31, 2002 10:30 PM
Subject: Re: Unicode in Ruby now?

(Sorry about the minor flame. I spent two years in the U.S. and another
year here in Japan doing I18N work, and it was pretty darn frustrating.
And it was my great disappointment with ruby to find out that the I18N
support is so poor.)

Yukihiro_Matsumoto2 · 1 August 2002 04:58

Hi,

Basically, Ruby being a Japanese product, it is unlikely ever to be able
to standardise on Unicode. The Japanese (well, enough of them to matter)
hate Unicode. (They seem to hate I18N in general, for that matter; they
certainly seem to pay less attention to it than, say, Americans.) Maybe
they just like doing their own thing rather than interoperating with the
rest of the world.

I don’t think we hate I18N in general, but I admit many Japanese hate
Unicode-centric I18N.

(Sorry about the minor flame. I spent two years in the U.S. and another
year here in Japan doing I18N work, and it was pretty darn frustrating.
And it was my great disappointment with ruby to find out that the I18N
support is so poor.)

It’s because no one but me is working on it. We need power.

						matz.

···

In message “Re: Unicode in Ruby now?” on 02/08/01, Curt Sampson cjs@cynic.net writes:

Tobias1 · 1 August 2002 12:01

Are there any other character sets of relevance that are not part of
unicode yet?

Yeah. [… Mojikyo …]

Sorry, with “other” I meant besides Mojikyo. ?

Not just political reasons, but practical reasons. Unicode is designed
to work if you restrict yourself to using only 16-bit chars, and I
expect most programs are going to limit themselves to that.

I thought this was a design error, and Unicode has now roughly 32 bits for
code points, and I read somewhere it is expected that 21 bits will suffice
to encode all possible glyphs.

Tobias

···

On Thu, 1 Aug 2002, Curt Sampson wrote:

On Thu, 1 Aug 2002, Tobias Peters wrote:

Philipp_Meier · 1 August 2002 12:14

Can’t we follow an approach like ruby or java handles numbers? It’s
naturally possible to add a floating point number and an integer. The
result is a floating point. I will sketch some ideas:

class String
alias :+ :__+

def + (other)
if other.is_a? self.class
self.__+(other)
else
self += convert(other)
end
end
end

class ISO88591 < String
def convert_from(other)
if other.is_a? ASCII
return other
elsif other.is_a? UTF8
return from_utf8
else
convert_from(UTF8.convert_from(other))
end
end

def from_utf8()
…
end

def to_utf8()
…
end

end

class UTF8 < String
def convert_from(other)
other.to_utf8
end
end

The convert method can of course try to delegate everything to a UTF8
class, so that ascii → iso-8859-1 will result in ascii → utf8 →
iso8859-1. That means when introducing a new encoding one must only
provide a method to convert from / to utf8.

-billy.

···

On Thu, Aug 01, 2002 at 07:53:07PM +0900, Curt Sampson wrote:

On Thu, 1 Aug 2002, Tobias Peters wrote:

…
We also need rules how to combine strings with different encoding then.
concatenating two strings encoded in koi8-r and iso-8859-1, respectively,
may only be possible when the result is encoded in some unicode
representation.

Yeah. This is getting into complex nightmare city. That’s why I’d prefer
to have the basic system just work completely in Unicode. One could have
a separate character system (character and string classes, byte-stream
to char converters, etc.) to work with this tagged format if one wished.

–
Meisterbohne Söflinger Straße 100 Tel: +49-731-399 499-0
eLösungen 89077 Ulm Fax: +49-731-399 499-9

Alexander_Bokovoy · 1 August 2002 12:30

Unicode 3.1 is 32-bit wide. I do not see reason to exist projects like
Mojikyo when it is perfectly can be done in 32-bit Unicode. Also, Mojikyo
institute restrictions on Mojikyo fonts are somewhat useless if they
want to make their encoding system more widespread and usefull.

···

On Thu, Aug 01, 2002 at 07:53:07PM +0900, Curt Sampson wrote:

I understand that Mojikyo will not be folded into unicode due to
political reasons.

Not just political reasons, but practical reasons. Unicode is designed
to work if you restrict yourself to using only 16-bit chars, and I
expect most programs are going to limit themselves to that. So even if
it were folded in to the extension space, most people wouldn’t use it.
–
/ Alexander Bokovoy

Maintence window broken

MikkelFJ1 · 1 August 2002 15:12

“Curt Sampson” cjs@cynic.net wrote in message
news:Pine.NEB.4.44.0208011945110.539-100000@angelic.cynic.net…

…
We also need rules how to combine strings with different encoding then.
concatenating two strings encoded in koi8-r and iso-8859-1,
respectively,
may only be possible when the result is encoded in some unicode
representation.

Yeah. This is getting into complex nightmare city. That’s why I’d prefer
to have the basic system just work completely in Unicode. One could have
a separate character system (character and string classes, byte-stream
to char converters, etc.) to work with this tagged format if one wished.

But isn’t this what matz suggest?
Each stream is tagged, that is the same as having different types. It’s
basically just a different way to store the type while having a lot of
common string operations.
The only major issues I see are: fixed size characters versus variable
length characters - for example UTF-8 versus UCS-4.
I think a string class for each type might make sense.

I think UTF-8 is universal enough to warrant its own string class, but for
fixed size formats, why should it matter what encoding is used? The
important thing is that the encoding is stored, possibly along with the byte
width of the character.

BTW: Unicode is not a fixed with format. It is an almost fixed with format -
but there are escape codes and options for future extensions. Hence UCS-4 is
a strategy with limited timespan. That’s why I prefer UTF-8 - it recognizes
the variable length issue up front, yet usually takes up less space than
UCS-2 or UCS-4 except if you are from Asia in which case you probably want a
different encoding.

One other detail about UTF-8 - its really nice to work with as raw 8 bit
character. When writing a lexer for UTF-8 you can define anything above 127
to be a letter and A-Za-z to also be a letter. Then you can forget
everything about UTF-8 and still parse everything correctly including
indentifiers. UCS-2, UCS-4 lexers get bloated or impossible due to the huge
lookup.

For practical string processing, it is actually very seldom that you
actually need to index a character sequence. Normally you simply break the
string into pieces instead.
You could have a iterator object for the string class which does not care
about variable length issues. It cannot index, but it can remember a
position even after a string is changed.

Mikkel

···

On Thu, 1 Aug 2002, Tobias Peters wrote:

Curt_Sampson · 1 August 2002 04:50

I doubt it. My opinion of the matter is that the correct way to
do things is to go with Unicode internally. (This does not rule
out processing non-Unicode things, but you process them as binary
byte-strings, not as character strings.) You lose a little bit of
functionality this way, but overall it’s easy, fast, and gives you
everything you really need.

Unfortunately, a lot of Japanese programmers disagree with this. They
feel the need, for example, to have separate code points for a single
character, simply because one stroke is slightly different between the
way Japanese and Chinese people write it. (The meaning is exactly the
same.)

They sometimes even feel the need to have the source language encoded
within strings, rather than having only applications that need this
information deal with it in their data formats. (It’s not that there
aren’t uses for these sorts of features, but they are not useful enough
to put the burden and overhead of them on every single program that
wants to deal with a bit of text.)

Basically, if I18N is not going to be completely impossible, you’re
going to have to live with a bit of lossage when it comes to putting
data into a computer, especially kanji data. But everybody suffers this
loss: even in English we lived through all the days of ASCII without the
ability to spell co-operate properly (with a diaeresis over the second
‘o’, instead of the hyphen). Or naive (diaeresis over the ‘i’), for that
matter. We lived.

Anyway, I’ve had it with that battle. Ruby gets what it gets, and maybe
one day I’ll be able easily to use it for I18N work, maybe not. In the
mean time there’s perl and Java.

cjs

···

On Thu, 1 Aug 2002, Hal E. Fulton wrote:

Seriously, since you have some expertise, I’m sure your knowledge will
be valuable in improving Ruby… talk to vruz also.

–
Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Yukihiro_Matsumoto2 · 1 August 2002 05:32

Hi,

···

In message “Re: Unicode in Ruby now?” on 02/08/01, Yukihiro Matsumoto matz@ruby-lang.org writes:

And it was my great disappointment with ruby to find out that the I18N
support is so poor.)

It’s because no one but me is working on it. We need power.

By the way, what are the required I18N features to satisfy you?
Enlighten me.

						matz.

Curt_Sampson · 1 August 2002 12:50

No, and no. As specified in section 2.2 of the standard (I quote
from 3.0, but 3.1 and 3.2 do not appear to have updated these
sections):

Plain Unicode text consists of sequences of 16-bit Unicode character
codes.... From the full range of 65,536 code values, 63,486 are
available to represent characters with single 16-bit code values,
and 2,048 code values are available to represent an additional
1,048,544 chaaracters through paired 16-bit code values.

So note that the character codes are still 16-bit. though sometimes
it takes two character codes to represent a character.

Section 2.9, “Conforming to the Unicode Standard” states as its
first precept that, “An implementation that conforms to the Unicode
Standard… treats characters as 16-bit units.” It goes on to say
that the standard, “does not require that an application be capable
of interpreting and rendering all Unicode characters so as to be
conformant.”

Section 5.4 really gets into the details of handling surrogate pairs.
There are essentially three levels at which you can do this: none, where
you treat each 16-bit code as an individual, unknown character and do
not guarantee integrity of the pairs (i.e., you don’t guarantee that
you won’t split them); weak where you interpret some pairs, but treat
the others as “none” treats them; and “strong” where you interpet some
pairs, and guarantee the integrity of them.

It continues with:

As with text-element boundaries, the lowest-level string-handling
routines (such as wcschr) do not necessarily need to be modified
to prevent surrogates from being damaged. In practice, it is
sufficient that only certain higher-level processes...be aware of
surrogate pairs; the lowest-level routines can continue to function
on sequences of 16-bit Unicode code values without having to treat
surrogates specially.

So from all of this, it’s pretty obvious that the expectation is that
only those systems that really need to work with surrogates are going to
be doing more than the bare minimum to support them, and those that do
support it will be doing it at something above the basic language level.

I should also point out that Java has no special support for surrogates
in the String class; String.length() returns the number of code values
in the string, not the number of characters. It’s not a problem in practice.

cjs

···

On Thu, 1 Aug 2002, Tobias Peters wrote:

Not just political reasons, but practical reasons. Unicode is designed
to work if you restrict yourself to using only 16-bit chars, and I
expect most programs are going to limit themselves to that.

I thought this was a design error, and Unicode has now roughly 32 bits for
code points…

–
Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Curt_Sampson · 1 August 2002 12:55

Unicode 3.1 is 32-bit wide.

I have just looked at my 3.0 standard and the 3.1 and 3.2 updates on the
web site, and I do not see any evidence of this. Did I miss something?
See the message I just posted for the details as I know them.

I do not see reason to exist projects like
Mojikyo when it is perfectly can be done in 32-bit Unicode.

Mojikyo is doing things like setting code points for characters that
will never exist in Unicode, because those characters are combined due
to the character combining rules. Mojikyo has a different purpose from
Unicode: Unicode wants to make doing standard, day-to-day work easy;
Mojikyo wants to give maximum flexability in the display of Chinese
characters. Given the number and complexity of kanji, these two aims are
basically incompatable.

cjs

···

On Thu, 1 Aug 2002, Alexander Bokovoy wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Curt_Sampson · 2 August 2002 09:24

Yeah. This is getting into complex nightmare city. That’s why I’d prefer
to have the basic system just work completely in Unicode. One could have
a separate character system (character and string classes, byte-stream
to char converters, etc.) to work with this tagged format if one wished.

But isn’t this what matz suggest?
Each stream is tagged, that is the same as having different types. It’s
basically just a different way to store the type while having a lot of
common string operations.

No, because then you have to deal with conversions. Most popular
character sets are convertable to Unicode and back without loss. That is
not true of any arbitrary pair of character sets, though, even if you go
through Unicode.

The reason for this is as follows. Say character set Foo has split
a unified hanji, “a”, and also has “A”. When converting to Unicode,
that “A” will be preserved because it’s assigned a code point in
a compatability area, and when you convert back from Unicode, that
“A” will be translated to “A” in Foo. However, if character set Bar
does not have “A”, just “a”, the “A” will be converted to “a”. When you
go from Bar back to Unicode, you end up with “a” again because there’s
no way to tell that it was originally “A” when you converted out.

But there’s an even better reason than this for converting to
Unicode on input, rather than doing internal tagging. If you don’t
have conversion tables for a particular character encoding, it’s
much better to find out at the time you try to get the information
in to the system than at some arbitrary later point when you try
to do a conversion. That way you know where the problem information
is coming from.

In terms of interface, I would say:

1. Continue to use String as it is for "binary" data. This is
efficient, if you don't need to do much processing.

2. Add a UString or similar for dealing with UTF-16 data. There's
no need for surrogate support in this, for reasons I will get into
below, so this is straight fixed width. Reasonably efficient (almost
maximally efficient for those of us using Asian languages :-)) and
very easy to use.

3. Add other, specialized classes when you need to do special
purpose things. No need for this in the standard distribution.

BTW: Unicode is not a fixed with format.

In terms of code values, it is fixed width. However, some characters are
represented by pairs of code values.

…but there are escape codes…

No, there are no escape codes. The high and low code values for
surrogate characters have their own special areas, and so are easily
identifiable.

and options for future extensions.

Not that I know of. Can you explain what these are?

Hence UCS-4 is a strategy with limited timespan.

Not at all, unless they decide to change Unicode to the point where it
no longer uses 16-bit code values, or add escape codes, or something
like that. That would be backward-incompatable, severely complicate
processing, and generally cause all hell to break lose. So I’d rate
this as “Not Likely.”

Here are a few points to keep in mind about Unicode processing:

1. The surrogate pairs are almost never used. Two years ago
there weren't even any characters assigned to those code points.

2. There are many situations where, even if surrogate pairs
are present, you don't know or care, and need do nothing to
correctly deal with them.

3. Broken surrogate pairs are not a problem; the standard says you
must be able to ignore broken pairs, if you interpret surrogate
pairs at all.

3. The surrogate pairs are extremely easy to distinguish, even
if you don't interpret them.

4. The code for dealing with surrogate pairs well (basically,
not breaking them) is very simple.

The implication of point 1 is that one should not spend a lot of effort
dealing with surrogate pairs, as very few users will ever use them. Very
few Asian users will ever use them in their lifetimes, in fact.

The implication of points 2 and 3 are that not everything that deals
with Unicode has to deal with, or even know about, surrogate pairs. If
you are writing a web application, for example, your typical fields you
just take as a whole from the web browser or database, and give as a whole
to the web browser or database. Thus only the web browser really has any
need at all to deal with surrogate pairs.

If you take a substring of a string and in the process end up with
a surrogate pair half on either end, that’s no problem. It just
gets ignored by whatever facilities deal with surrogate pairs, or
treated as an unknown character by those that don’t (rather than
two unknown characters for an unsplit surrogate pair).

The only time you really run into a problem is if you insert
something into a string; there’s a chance you might split the
surrogate pair, and lose the character. This is pretty uncommon
except in interactive input situations, though, where you know how
to handle surrogate pairs and can avoid doing this, or where you
don’t know and the user can’t see the characters anyway.

Well, another area you can run into problems with is line wrapping, but
there’s no single algorithm for that anyway, and plenty of algorithms
break on languages for which they were not designed. So there you should
add some very simple code that avoids splitting surrogate pairs. (This
code is much simpler than the line wrapping code anyway, so it’s hardly
a burden.) That shows the advantages of points 3 and 4 (essentially the
same point).

So I propose just what the Unicode standard itself proposes in
section 5.4: UString (or whatever we call it) should have the
Surrogate Support Level “none”; i.e., it completely ignores the
existence of surrogate pairs. Things that use UString that have
the potential to encounter surrogate pair problems or wish to
interpret them can add simple or complex code, as they need, to
deal with the problem at hand. (Many users of UString will need to
do nothing.)

Note that there’s a big difference between this and your UTF-8
proposal: ignoring multibyte stuff in UTF-8 is going to cause much,
much more lossage because there’s a much, much bigger chance of
breaking things when using Asian languages. With UTF-16, you probably
won’t even encounter surrogates, whereas with Japanese in UTF-8,
pretty much every character is multibyte.

cjs

···

On Fri, 2 Aug 2002, MikkelFJ wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Curt_Sampson · 1 August 2002 08:33

Me? Not so much, actually. I need to be able to read in ISO-8859-1,
ISO-2022-JP, EUC-JP, Shift_JIS and UTF-8, convert them to a common
internal format so I don’t have to worry about where the data I am
manipulating came from, and do the various conversions again for output.
And use your basic regexps and such as well, on the internal format.
The internal format should treat a character as a single character,
regardless of whether it’s represented by one, two or more bytes in any
particular encoding.

I’ll probably need to convert to/from various Chinese and/or Korean
character sets at some point in the future, but I don’t need that right
now.

cjs

···

On Thu, 1 Aug 2002, Yukihiro Matsumoto wrote:

By the way, what are the required I18N features to satisfy you?
Enlighten me.

–
Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Topic		Replies	Views
Unicode in Ruby and a Ruby Reference ruby-talk	9	125	15 December 2004
A plan for another unicode string hack ruby-talk	27	153	15 June 2006
Ruby unicode./encoding support ruby-talk	9	71	4 June 2003
Ruby 1.9 - US-ASCII vs UTF-8 ruby-talk	2	149	19 December 2009
How to send utf8 data to remote computer in ruby 1.9.2 ruby-talk	2	140	17 August 2011

Unicode in Ruby now?

On Thu, 1 Aug 2002, Tobias Peters wrote:

On Thu, 1 Aug 2002, Alexander Bokovoy wrote:

On Fri, 2 Aug 2002, MikkelFJ wrote:

Related topics