Deprecation and Unicode

Hello all,

There’s a nice article at
http://www.onjava.com/pub/a/onjava/2002/07/31/java3.html
that discusses improvements for Java. Here are two things that also apply
to Ruby:

10 and 9: are methods and such going to be deprecated in Ruby? I think
there are some good candidates, like every method that ends with a “2” or
“3”…

7: According to this point, Unicode has -many- more characters than 65536,
so it needs more than two bytes. I haven’t seen that mentioned here
before.

Danny

Actually, we’ve pretty much beat this to death in the past few days.

My understanding from what I’ve heard (and I don’t know much about
Unicode, so please correct me if I’m wrong), in layman’s terms (i.e.
not standard-compliant language):

Unicode is defined to be representable using 2-byte numbers. There
are provisions for using sequences of two (or more now?) of these
numbers to represent characters outside the basic 64K or so. There
are few applications that deal fully with these extended characters
(“surrogate pairs”?); Java (for instance) doesn’t.

In a program, representing Unicode by 2-byte numbers is probably ideal
unless you’re very space-constrained (in which case UTF-8 is best) or
have to do a lot of character indexing and are using these
extension characters (in which case you might want to use a 4-byte
representation).
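The trade-offs above can be seen directly in modern Ruby (1.9+, which postdates this thread): a character outside the basic 64K range is one character, but its byte cost differs per encoding, and in UTF-16 it occupies two 2-byte units, i.e. a surrogate pair.

```ruby
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
clef = [0x1D11E].pack("U")            # build the character from its code point

p clef.length                         # => 1   one character
p clef.bytesize                       # => 4   UTF-8: four bytes
p clef.encode("UTF-16BE").bytesize    # => 4   UTF-16: two 2-byte units (a surrogate pair)
p clef.encode("UTF-32BE").bytesize    # => 4   UTF-32: one fixed-width 4-byte unit
```

For this particular character the three encodings happen to tie at four bytes; the difference shows up in unit count, which is what character indexing cares about.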

···

On Saturday 03 August 2002 09:21 am, Danny van Bruggen wrote:

7: According to this point, Unicode has -many- more characters than
65536, so it needs more than two bytes. I haven’t seen that
mentioned here before.


Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

There’s a nice article at
http://www.onjava.com/pub/a/onjava/2002/07/31/java3.html
that discusses improvements for Java.

I’d be careful with this article; it’s got errors in it. I’m not going
to bother to do a full analysis, but here’s an example of a blatant one:

Encodings should be identified with IANA-registered names like
ISO-8859-1 and UTF-16 instead of Java class names like 8859_1
and UTF16.

Well, this is already the case and has been for a long time; you can see
this in the 1.3 JDK documentation at

http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html

7: According to this point, Unicode has -many- more characters than 65536,
so it needs more than two bytes. I haven’t seen that mentioned here
before.

Look back through the archives for my posts over the past few days;
it’s been discussed quite heavily.

cjs

···

On Sun, 4 Aug 2002, Danny van Bruggen wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Hi,

···

At Sun, 4 Aug 2002 01:21:02 +0900, Danny van Bruggen wrote:

10 and 9: are methods and such going to be deprecated in Ruby? I think
there are some good candidates, like every method that ends with a “2” or
“3”…

What methods do you mean? open3?


Nobu Nakada

I think you have it. Unicode is formally a 31 bit character set but the
first 16 bits are called the “basic multilingual plane” (BMP) and were
supposed to be adequate for representing most natural languages. Some of the
characters in the BMP are combining or ‘dead key’ characters.

What I have learned here is that either some or all of the Japanese think
that it doesn’t work for their language. It’s not clear whether they think
that more bytes are required or just that the existing standard doesn’t
assign them correctly.

I’m ignoring the distinction between ‘character’ and ‘glyph’ because the
discussion here has always made the distinction the same way as the Unicode
folks do.

···

On 8/3/02 10:25 AM, “Ned Konz” ned@bike-nomad.com wrote:

On Saturday 03 August 2002 09:21 am, Danny van Bruggen wrote:

7: According to this point, Unicode has -many- more characters than
65536, so it needs more than two bytes. I haven’t seen that
mentioned here before.

Actually, we’ve pretty much beat this to death in the past few days.

My understanding from what I’ve heard (and I don’t know much about
Unicode, so please correct me if I’m wrong), in layman’s terms (i.e.
not standard-compliant language):

Unicode is defined to be representable using 2-byte numbers. There
are provisions for using sequences of two (or more now?) of these
numbers to represent characters outside the basic 64K or so. There
are few applications that deal fully with these extended characters
(“surrogate pairs”?); Java (for instance) doesn’t.

In a program, representing Unicode by 2-byte numbers is probably ideal
unless you’re very space-constrained (in which case UTF-8 is best) or
have to do a lot of character indexing and are using these
extension characters (in which case you might want to use a 4-byte
representation).


Many individuals have, like uncut diamonds, shining qualities beneath a
rough exterior. - Juvenal, poet (c. 60-140)

Hi,

···

At Sun, 4 Aug 2002 02:25:35 +0900, Ned Konz wrote:

Unicode is defined to be representable using 2-byte numbers. There
are provisions for using sequences of two (or more now?) of these
numbers to represent characters outside the basic 64K or so. There
are few applications that deal fully with these extended characters
(“surrogate pairs”?); Java (for instance) doesn’t.

I guess you mean UTF-16, one of the variable-length encodings; it
would be supported by M17N ruby.


Nobu Nakada

Unicode is defined to be representable using 2-byte numbers. There
are provisions for using sequences of two (or more now?) of these
numbers to represent characters outside the basic 64K or so.

Right.

There
are few applications that deal fully with these extended characters
(“surrogate pairs”?); Java (for instance) doesn’t.

“Java” deals with it just fine, if you do what the Unicode spec
tells you to do. You decide what level of support you need and
write it in. Many common activities need no extra support for
dealing with surrogate pairs.
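What the spec tells you to do is simple arithmetic: a code point above U+FFFF is split into two 16-bit units. A sketch of that arithmetic in Ruby (the formula is from the Unicode standard; the function name is my own):

```ruby
# Split a supplementary code point (> U+FFFF) into a UTF-16 surrogate pair.
def to_surrogate_pair(codepoint)
  v = codepoint - 0x10000          # 20 bits remain after the offset
  high = 0xD800 + (v >> 10)        # top 10 bits -> high (leading) surrogate
  low  = 0xDC00 + (v & 0x3FF)      # bottom 10 bits -> low (trailing) surrogate
  [high, low]
end

p to_surrogate_pair(0x1D11E).map { |u| u.to_s(16) }  # => ["d834", "dd1e"]
```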

In a program, representing Unicode by 2-byte numbers is probably ideal
unless you’re very space-constrained (in which case UTF-8 is best)

UTF-8 is best if you’re using ISO-8859-1, yes. However, in Japanese
for example, UTF-8 will take up more space than UTF-16.
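Byte counts in modern Ruby (1.9+) bear this out: ASCII text is half the size in UTF-8, while Japanese kana and kanji take three bytes each in UTF-8 but only two in UTF-16.

```ruby
english  = "hello"
japanese = "こんにちは"   # five hiragana characters

p english.encode("UTF-8").bytesize      # => 5
p english.encode("UTF-16BE").bytesize   # => 10
p japanese.encode("UTF-8").bytesize     # => 15
p japanese.encode("UTF-16BE").bytesize  # => 10
```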

cjs

···

On Sun, 4 Aug 2002, Ned Konz wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Just want to chime in and let you know that I agree with Danny. Now,
there don’t seem to be too many which end in digits, until you require
‘date’…

DateTime#exist1?
DateTime#new2
DateTime#new0
DateTime#new1
DateTime#exist3?
DateTime#exist2?
DateTime#new3
Date#exist1?
Date#new2
Date#new0
Date#new1
Date#exist3?
Date#exist2?
Date#new3

Cor blimey, it’s worse than I remember from last time I used it :) Some
verbose, descriptive aliases wouldn’t go amiss, since this is part of
the standard library.

(Not intentionally slagging the date module, it’s the only one I used
where the numbers started confusing me.)

···

nobu.nokada@softhome.net wrote:

At Sun, 4 Aug 2002 01:21:02 +0900, Danny van Bruggen wrote:

10 and 9: are methods and such going to be deprecated in Ruby? I think
there are some good candidates, like every method that ends with a “2” or
“3”…

What methods do you mean? open3?


Kent Dahl ~ http://www.stud.ntnu.no/~kentda/ ~ NTNU student
4th year, graduate engineering, Industrial economics and technological management
(engineering.discipline=Computer::Technology)

Yes, names like that. But I didn’t mean to open a flamewar on naming
standards, I wanted to see if people wanted an official way to mark their
things “deprecated”. I know I would like to…
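A hypothetical sketch of the kind of “official” marker being asked for: a module that wraps an existing method so calls emit a warning before delegating to the original. The `deprecate` name and this whole API are inventions for illustration, not anything Ruby ships.

```ruby
module Deprecation
  # Redefine +name+ so it warns, then calls the original implementation.
  def deprecate(name, replacement = nil)
    original = instance_method(name)
    define_method(name) do |*args, &blk|
      msg = "#{self.class}##{name} is deprecated"
      msg += "; use #{replacement}" if replacement
      warn msg
      original.bind(self).call(*args, &blk)
    end
  end
end

class Document
  extend Deprecation

  def new2(n)   # stand-in for one of the old numbered methods
    n * 2
  end
  deprecate :new2, "Document#double"
end

p Document.new.new2(3)  # warns on stderr, then => 6
```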

Danny

···

On Sun, 4 Aug 2002 nobu.nokada@softhome.net wrote:

At Sun, 4 Aug 2002 01:21:02 +0900, Danny van Bruggen wrote:

10 and 9: are methods and such going to be deprecated in Ruby? I think
there are some good candidates, like every method that ends with a “2” or
“3”…

What methods do you mean? open3?

I think you have it. Unicode is formally a 31 bit character set…

Been here before, just the other day. Unicode is not a 31-bit
character set. The entire range can be represented in 21 bits.
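A quick modern-Ruby check of that figure (the U+10FFFF cap is how the range was later fixed; `Integer#bit_length` is Ruby 2.1+):

```ruby
max_codepoint = 0x10FFFF        # top of the Unicode code space
p max_codepoint.bit_length      # => 21
p max_codepoint < 2**21         # => true
```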

…first 16 bits are called the “basic multilingual plane” (BMP) and were
supposed to be adequate for representing most natural languages.

In fact it does appear to be adequate. At least, I can testify that
it’s fine for English, Japanese and Chinese.

cjs

···

On Sun, 4 Aug 2002, Chris Gehlker wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC

Hi,

I think you have it. Unicode is formally a 31 bit character set but the
first 16 bits are called the “basic multilingual plane” (BMP) and were
supposed to be adequate for representing most natural languages. Some of the
characters in the BMP are combining or ‘dead key’ characters.

I think “Unicode” here should be replaced by “ISO 10646”.

What I have learned here is that either some or all of the Japanese think
that it doesn’t work for their language. It’s not clear whether they think
that more bytes are required or just that the existing standard doesn’t
assign them correctly.

Historically, Japanese people supported the original 32-bit ISO 10646, but
the Unicode people brought in Han Unification to fit everything in the
16-bit space, and ISO 10646 adopted it. That left Japanese (and probably
other Asian) people disappointed. It requires a table lookup for every
code conversion.

But practically there is no problem handling Japanese with Unicode.
It covers almost all Japanese characters used daily. It is already used
as the canonical encoding in major applications (for example, Microsoft
Word) without any problems.

The feeling behind “noisy” Japanese people around Curt is probably
caused by the “defeat” at the ISO 10646 discussion.

I personally don’t have any negative feeling toward Unicode, but I
don’t trust it wholeheartedly. I don’t use Unicode in my daily life.
Most of my Japanese text is in EUC-JP or ISO-2022-JP. I don’t want
to be forced to convert it back and forth from Unicode when I don’t
care about any non-Japanese text. When I want to treat Japanese,
Korean, and Chinese at the same time, I’d happily use Unicode. When I
want to handle characters not in Unicode, I’d use a bigger charset
like Mojikyo. I want to provide the freedom of choice. Unicode-centric
I18N is simpler, but the choices are ASCII or Unicode. Ruby I18N
will provide a way to handle user-definable encodings. You will be
able to choose ASCII, EUC-JP, Shift_JIS, EUC-KR, ISO-8859-1, or
Unicode. It will be more complex, but the complex part will be hidden
behind the framework.
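For the record, this is roughly how the framework later shipped in Ruby 1.9’s M17N: every String carries its own encoding tag, and conversion between encodings is explicit. A small sketch in modern Ruby:

```ruby
# The bytes B4 C1 BB FA are the word 漢字 ("kanji") in EUC-JP.
euc = "\xB4\xC1\xBB\xFA".force_encoding("EUC-JP")

p euc.encoding          # => #<Encoding:EUC-JP>
utf8 = euc.encode("UTF-8")
p utf8.encoding         # => #<Encoding:UTF-8>
p utf8                  # => "漢字"
```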

						matz.
···

In message “Re: Deprecation and Unicode” on 02/08/04, Chris Gehlker gehlker@fastq.com writes:

Yes, names like that. But I didn't mean to open a flamewar on naming
standards, I wanted to see if people wanted an official way to mark their
things "deprecated". I know I would like to...

Well, it depends what you call "official", but actually you have

pigeon% /usr/bin/ruby -ve '[].indexes'
ruby 1.6.7 (2002-03-01) [i686-linux]
pigeon%

pigeon% ./ruby -ve '[].indexes'
ruby 1.7.2 (2002-07-13) [i686-linux]
-e:1: warning: Array#indexes is deprecated; use Array#select
pigeon%

Guy Decoux

Hi,

Just want to chime in and let you know that I agree with Danny. Now,
there don’t seem to be too many which end in digits, until you require
‘date’…

I forgot it, thank you.

Cor blimey, it’s worse than I remember from last time I used it :) Some
verbose, descriptive aliases wouldn’t go amiss, since this is part of
the standard library.

In 1.7, the methods now have new descriptive names, and the old
names are aliases.

···

At Sun, 4 Aug 2002 18:37:12 +0900, Kent Dahl wrote:


Nobu Nakada