Andy Roonie is perhaps excessively optimistic


(Jan Witt) #1

To begin with, I do not know who Andy Roonie is,
either.
I think it is worthwhile to point out some
of the serious problems relating to natural
languages, bytes, fonts, chars, glyphs and all
that:
As I see it, the Unicode effort has been deeply
misguided right from the beginning and the Java
brotherhood has misunderstood the thing as well:
(1) In many languages there are more glyphs than
letters in the alphabet, e.g. because of ligatures,
i.e. letters that get intertwined with their
neighbors (take Hindi or Arabic as examples).
Unicode does not cater for this.
(2) Diacritics are not everywhere as simple as
accents in French or umlauts in German, which
luckily could be fit into Latin-1.
(3) Some languages are written from left to right,
some from the top down and texts may be mixed.
(4) Some historic languages are even written in
boustrophedon style, or have symbols that face
left or right depending on context, like
hieroglyphics.
Please consider that a multilingual text editor
must know about the [possibly varying] glyph bindings
of all of its languages.
(5) Japanese, as you probably know, has the rich
choice of kanji characters and the two kana alphabets,
but no ligatures.
(6) Collating sequences are a nontrivial issue.
In classical Spanish, e.g., LL and CH are
considered separate characters.
(7) Real and virtual keyboards are a major issue:
There are “latinized” keyboards that allow non-native
speakers of Greek, Russian, Arabic, Hebrew etc. to
find the equivalents of, say, English letters in
the same places. Diacritics like accents also
impact on the keyboard.
(8) UTF-8, an open-ended variable-length encoding,
is obviously the way to go, but I wish I knew who
is looking after these things in Ruby and how far
they’ve got. (As far as I know, as of now, you do not
get cut and dried solutions in Java yet.)
(9) How about CGI scripts and Tk GUIs in Urdu?
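
To make the "variable-length" part of point (8) concrete, here is a minimal sketch in Python (chosen only for illustration; nobody in the thread uses it). The helper name `utf8_length` is made up for this example. It counts how many bytes UTF-8 needs for characters from a few different scripts.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))

# One sample character per script; the selection is arbitrary.
samples = [
    ("A", "Latin capital A"),
    ("\u00e9", "Latin e with acute (also in Latin-1)"),
    ("\u0627", "Arabic alef (used in Urdu as well)"),
    ("\u0915", "Devanagari ka (Hindi)"),
    ("\u3042", "Hiragana a (Japanese)"),
    ("\U0001F600", "a character outside the 16-bit range"),
]

for ch, description in samples:
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")
```

ASCII stays one byte per character, which is why existing ASCII-only scripts keep working unchanged; everything else takes two to four bytes.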

Jan



(Curt Sampson) #2

As I see it, the Unicode effort has been deeply
misguided right from the beginning…

I think that you’re deeply misguided right from the beginning about
what Unicode is supposed to do. :-)

(1) In many languages there are more glyphs than
letters in the alphabet, e.g. because of ligatures,
i.e. letters that get intertwined with their
neighbors (take Hindi or Arabic as examples).
Unicode does not cater for this.

Nor is it supposed to. These are typesetting issues, not data
issues. The word “fish” contains the same letters whether or not
you use a ligature for the “fi”.
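
A small sketch of that distinction, assuming Python's standard `unicodedata` module (the function name `fold_presentation_forms` is invented for this example). Unicode does carry a handful of ligature code points for compatibility with older character sets, and compatibility normalization (NFKC) folds them back to the plain letters, which fits the view that the ligature is a presentation detail rather than different text.

```python
import unicodedata

def fold_presentation_forms(text):
    """Apply NFKC normalization, which replaces compatibility characters
    such as ligatures with their plain-letter equivalents."""
    return unicodedata.normalize("NFKC", text)

plain = "fish"
with_ligature = "\ufb01sh"  # U+FB01 LATIN SMALL LIGATURE FI, then "sh"

# As raw code point sequences the two strings differ...
print(plain == with_ligature)
# ...but after folding they are the same four letters.
print(fold_presentation_forms(with_ligature) == plain)
```

In normal use the text would simply be stored as "fish" and the font or typesetter would decide whether to draw the ligature.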

(2) Diacritics are not everywhere as simple as
accents in French or umlauts in German, which
luckily could be fit into Latin-1.

So? Unicode deals with a lot of diacritical marks. (Take a look
at the Vietnamese support, for example.) Where exactly does Unicode
fall down in supporting diacriticals?
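
As an illustration of the Vietnamese case, here is a hedged sketch using Python's standard `unicodedata` module (the helper `code_points` is a made-up name). It takes one Vietnamese letter that carries two marks and shows that Unicode can represent it either as a single precomposed code point or as a base letter followed by combining marks, with normalization converting between the two.

```python
import unicodedata

def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
precomposed = "\u1ec7"

# NFD splits it into the base letter plus its combining marks.
decomposed = unicodedata.normalize("NFD", precomposed)

# NFC puts the pieces back together into the single code point.
recomposed = unicodedata.normalize("NFC", decomposed)

print(code_points(precomposed))
print(code_points(decomposed))
print(recomposed == precomposed)
```

Because combining marks can be stacked on any base letter, the set of accented forms is not limited to the precomposed characters that happen to have their own code points.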

(3) Some languages are written from left to right,
some from the top down and texts may be mixed.

Unicode supports mixed-direction writing.
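
One way to see this, sketched with Python's standard `unicodedata` module (the function `bidi_classes` is a name invented here): every character carries a bidirectional class, and a renderer that implements the Unicode bidirectional algorithm uses those classes to lay out mixed left-to-right and right-to-left text. The sketch only inspects the classes; it does not implement the layout algorithm itself.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Three Latin letters, a space, then three Hebrew letters (alef, bet, gimel).
mixed = "abc \u05d0\u05d1\u05d2"

for ch, cls in zip(mixed, bidi_classes(mixed)):
    print(f"U+{ord(ch):04X} -> {cls}")
```

Latin letters report class `L` (left-to-right), Hebrew letters `R` (right-to-left), and Arabic letters a separate class `AL`; the text itself is stored in logical (reading) order in all cases.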

Please consider that a multilingual text editor
must know about the [possibly varying] glyph bindings
of all of its languages.

Not really. I get by just fine with an editor that cannot generate
"proper" (in print terms) glyphs for “fi”, “fl”, “ffl”, and so on.
I suspect most others do, too.

(5) Japanese, as you probably know, has the rich
choice of kanji characters and the two kana alphabets,
but no ligatures.

Since I know a little bit of Japanese, I’d be particularly interested
in what you think the Unicode problems are in relation to Japanese.

(6) Collating sequences are a nontrivial issue.
In classical Spanish, e.g., LL and CH are
considered separate characters.

Unicode does not specify any collating sequences.
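
To show that ordering can be layered on top of the character encoding, here is a rough sketch in Python of a sort key for the classical Spanish convention mentioned in the quote. Everything in it (`ALPHABET`, `RANK`, `spanish_key`) is hypothetical and written for this example; it ignores accented vowels and is not how any real collation library works.

```python
# Classical Spanish alphabet, with "ch" and "ll" treated as single letters.
ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
    "y", "z",
]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}

def spanish_key(word):
    """Turn a word into a list of alphabet positions, reading "ch" and "ll"
    as one letter each. Unknown characters sort after "z"."""
    lowered = word.lower()
    key = []
    i = 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            key.append(RANK.get(lowered[i], len(ALPHABET)))
            i += 1
    return key

words = ["llama", "luz", "chico", "cuna", "dama", "loro"]
print(sorted(words))                   # plain code point order
print(sorted(words, key=spanish_key))  # classical Spanish order
```

The same strings sort differently depending on the key, so the collation rule lives in the sorting code (or a locale library), not in the code points.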

cjs


On Wed, 26 Jun 2002, Jan Witt wrote:

Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC


(Evan Martin) #3

This thread has gone horribly OT, and I suggest the first poster quoted
here consult a Unicode FAQ, as many of his concerns have already been
addressed elsewhere.

As for Japanese-specific problems, that is the one place the consortium
may have messed up in unifying CJK kanji; see
http://www.debian.or.jp/~kubota/ if you want the gory details.


On Thu, Jun 27, 2002 at 08:46:28PM +0900, Curt Sampson wrote:

(5) Japanese, as you probably know, has the rich
choice of kanji characters and the two kana alphabets,
but no ligatures.

Since I know a little bit of Japanese, I’d be particularly interested
in what you think the Unicode problems are in relation to Japanese.


Evan Martin
martine@cs.washington.edu
http://neugierig.org