> And anything after “Special” looks really quite special to me. At least
> Western languages as well as Kanji, Hiragana and Katakana are supported.
> IMHO, pragmatically, 16 bits are good enough.
I assume you’re saying that there are no more than 65536 characters on
earth in daily use, even including Asian ideograms (Kanji).
You would be right, if we lived in an idealistic world.
The problems are:
Japan, China, Korea and Taiwan have characters of the same origin but
with different glyphs (appearance). Due to Han unification, Unicode
assigns the same code point to those characters.
We used to use encodings to carry script (country) information in
internationalized applications. Unicode does not allow this approach,
so we need to implement another layer to switch scripts.
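For illustration, here is a minimal sketch in modern Ruby (the per-string
encoding API shown here arrived long after this thread, with Ruby 1.9),
using U+9AA8, a character whose strokes are drawn differently in Japanese
and Chinese typography:

    # Han unification: the Japanese and the Chinese character get the
    # same code point, so the string itself carries no script information.
    ja = "骨"                        # as it appears in a Japanese document
    zh = "骨"                        # as it appears in a Chinese document
    ja.codepoints == zh.codepoints   # => true (both are [0x9AA8])
    # The "which glyph?" decision must live in an extra layer
    # (font selection, markup, locale), outside the character data.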
Due to historical reasons and unification, some characters do not
round-trip through conversion from/to Unicode. Sometimes we lose
information through implicit Unicode conversion.
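A well-known concrete case is the Shift_JIS/Windows-31J “wave dash” split;
this sketch (again in modern Ruby terms) shows the same legacy bytes mapping
to two different Unicode code points depending on the conversion table used:

    # One byte sequence, two Unicode interpretations: a blind round
    # trip through Unicode can silently change the data.
    bytes = "\x81\x60".b
    bytes.dup.force_encoding("Shift_JIS").encode("UTF-8")
    # => "〜" (U+301C WAVE DASH)
    bytes.dup.force_encoding("Windows-31J").encode("UTF-8")
    # => "～" (U+FF5E FULLWIDTH TILDE)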
Asian people have used multibyte encodings (EUC-JP, for example) for a
long time. We have gigabytes of files in legacy encodings. The cost of
code conversion is not negligible, and we also have to care about the
round-trip problem.
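In a per-encoding model, that cost can simply be avoided: legacy data is
processed in its own encoding, with no up-front conversion pass. A sketch,
where "archive.euc" is a hypothetical file name:

    # Read a legacy EUC-JP file as-is; no conversion, no round-trip
    # risk, and lengths are counted in EUC-JP characters.
    chars = 0
    File.open("archive.euc", "r:EUC-JP") do |f|
      f.each_line { |line| chars += line.length }
    end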
There are some huge character sets little known to the Western world.
For example, the TRON code contains 170,000 characters. They are
important to researchers, novelists, and people who care about
characters.
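TRON aside, even Unicode’s own repertoire has outgrown 16 bits. A small
sketch with a supplementary-plane kanji:

    # 𠮷 (U+20BB7), a variant of 吉 found in Japanese family names,
    # lies outside the 16-bit Basic Multilingual Plane.
    name = "\u{20BB7}"
    format("U+%04X", name.ord)        # => "U+20BB7"
    name.encode("UTF-16BE").bytesize  # => 4 (a surrogate pair, not 2 bytes)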
In conclusion, Unicode is a good thing, but it is not perfect. Unicode
will be handled just like any other encoding in future versions of
Ruby. It will not be (and cannot be) the pivot of character encodings,
as in the Java way.
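For a rough picture of what “not a pivot” means, here is a sketch of the
per-string encoding model that Ruby eventually shipped in 1.9: every string
carries its own encoding tag, and nothing is converted through Unicode
behind your back.

    # Strings of different encodings coexist; mixing incompatible
    # ones is an explicit error, not a silent Unicode conversion.
    euc = "こんにちは".encode("EUC-JP")
    euc.encoding       # => #<Encoding:EUC-JP>
    "hello".encoding   # => #<Encoding:UTF-8>
    euc + "日本語"     # raises Encoding::CompatibilityError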
matz.
···
In message “Re: Strange behaviour of Strings in Range” on 04/05/04, “Robert Klemme” bob.news@gmx.net writes:
“Yukihiro Matsumoto” matz@ruby-lang.org wrote in the news posting
news:1083604752.355707.7861.nullmailer@picachu.netlab.jp…
> Hi,
>> Your note “The definition of “character” should belong to the
>> application domain” sounded to me like you didn’t consider enhancing
>> Unicode treatment in Ruby. I’m sorry if I misread you.
>> Then what’s the approach planned at the moment?
> Basic idea is your “alternative” in [ruby-talk:99089]. We will prove
> it’s not insane through making a prototype.
Ah, ok. Is that present in the ruby_m17n branch?
> Could you search the ruby-talk archive with the keyword I18N for more
> detail? Or you can check the ruby_m17n branch in CVS.
>> And anything after “Special” looks really quite special to me. At
>> least Western languages as well as Kanji, Hiragana and Katakana are
>> supported. IMHO, pragmatically, 16 bits are good enough.
> I assume you’re saying that there are no more than 65536 characters on
> earth in daily use, even including Asian ideograms (Kanji).
Yes, that’s what I thought - until now. :-}
> You would be right, if we lived in an idealistic world.
> The problems are:
> Japan, China, Korea and Taiwan have characters of the same origin but
> with different glyphs (appearance). Due to Han unification, Unicode
> assigns the same code point to those characters.
> We used to use encodings to carry script (country) information in
> internationalized applications. Unicode does not allow this approach,
> so we need to implement another layer to switch scripts.
> Due to historical reasons and unification, some characters do not
> round-trip through conversion from/to Unicode. Sometimes we lose
> information through implicit Unicode conversion.
> Asian people have used multibyte encodings (EUC-JP, for example) for a
> long time. We have gigabytes of files in legacy encodings. The cost of
> code conversion is not negligible, and we also have to care about the
> round-trip problem.
> There are some huge character sets little known to the Western world.
> For example, the TRON code contains 170,000 characters. They are
> important to researchers, novelists, and people who care about
> characters.
I see. Pardon my ignorance, I did not know about any of these problems.
I naively assumed that Unicode had taken care of all these issues, but
apparently it didn’t (and probably couldn’t).
> In conclusion, Unicode is a good thing, but it is not perfect. Unicode
> will be handled just like any other encoding in future versions of
> Ruby. It will not be (and cannot be) the pivot of character encodings,
> as in the Java way.
Well, then it’s even better than the Java approach, isn’t it? That’s
great news!
Matz, thank you for taking the time to explain this!
Kind regards
robert
···
> In message “Re: Strange behaviour of Strings in Range” on 04/05/04, “Robert Klemme” bob.news@gmx.net writes: