Unicode in Ruby?

Hi,
I do not want to bore you, but I need some help with
abstract concepts and terminology.
Let me give you a very small example that is at least
30 years old:
When ECMA (the European Computer Manufacturers
Association) TC 1 was dealing with character codes and
glyphs, they had long discussions about an unbiased
name for “currency symbol”, as well as a glyph for it.
They wound up with something they called “solidus”,
and the glyph looked like a sun symbol: a circle with
a dot in the middle.
Only very late did it dawn on them that it was not
such a good idea to send 27.5 solidi from the UK to
the US when “solidus” would print as a British pound
sign on one end and as a US dollar sign on the other.
(Today, this device might be useful for doing some
“creative bookkeeping”.)
All the new wonderful standards talk about
“characters” and their encodings.
This is only one side of the coin!
It makes a difference where one attaches which
attributes: it impacts multilingual OCR, spell
checkers, virtual keyboards, collating sequences,
and much more.
Let us assume that dictionary entries [of a variant
of a language] are bound to a canonic font (a
glyph-char pair set), and that the complete dictionary
is organized along a canonic collating sequence.
Assume further that each glyph can theoretically be
OCRed unambiguously, but that one and the same glyph
may be bound to many chars across many languages and
their variants.
Assert that whatever can be printed can be OCRed back,
if the language and its variant are known.
Assume that there are human beings who can read and
understand display text and hardcopy text, provided
they understand the language.
Assume that such understanders react to the text on
two levels:

  • tolerate certain systematic but cognitively neutral
    glyph deformations

  • react to the conscious and subliminal cues aimed
    at directing their attention
    (highlights, the “small print”, etc.)

If all the above applies, we are led to the following
insights:

  • if we store text, we want to be able to restore it
    without losses.

  • we also want to be able to take the text apart,
    separating out linguistic facets for text analysis
    or mechanical translation

  • we want to be able to do all this for all living
    and dead languages on this planet (as far as they
    have a written representation for their texts).

    For a given piece of [glyphic] text, all we should
    need to know is its language and its variant.
    (If either or both are not known, we might need
    additional devices to identify them; this is not
    part of the current argument.)

  • there are languages that have a few problems not
    present in Japanese or English. These problems are
    sufficiently well known and have already been
    solved in local contexts.
    (Let us leave out the truly archaic stuff, like
    boustrophedon writing.)
    Arabic glyph texts are mostly characterized by
    three facts (see the small sketch just after this
    list):

    • consonants take begin, middle, end and isolated
      form glyphs
    • vowels exist in their short form as diacritics
    • they run from right to left

    Arabic characters, with variations, exist in Urdu
    and other Indic languages, Farsi (Iran), and a few
    other languages.
    Hebrew also runs from right to left, but does not
    show the positional dependencies of Arabic.

    Devanagari and other Indic scripts are written
    from left to right.
    Their main complication is ligatures:
    neighbouring characters intertwining.
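
To make the positional-form point concrete, here is a
tiny Ruby sketch (assuming a current Ruby with
String#unicode_normalize; the letter BEH is just one
example, and the hash is mine, not anything standard):
the abstract character U+0628 has four presentation-form
code points, and compatibility normalization folds each
of them back to the one base character.

    # One abstract Arabic character, four positional glyph forms.
    # Code points are from the Unicode "Arabic Presentation Forms-B" block.
    beh_forms = {
      isolated: "\uFE8F",   # ARABIC LETTER BEH ISOLATED FORM
      final:    "\uFE90",   # ARABIC LETTER BEH FINAL FORM
      initial:  "\uFE91",   # ARABIC LETTER BEH INITIAL FORM
      medial:   "\uFE92",   # ARABIC LETTER BEH MEDIAL FORM
    }

    beh_forms.each do |position, glyph|
      # NFKC normalization maps each presentation form back to the
      # single abstract character ARABIC LETTER BEH (U+0628).
      base = glyph.unicode_normalize(:nfkc)
      puts format("%-8s U+%04X -> base U+%04X", position, glyph.ord, base.ord)
    end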

Now, for me, the world would be optimally organized
if we had two tables (a small sketch follows below):

glyphset:    set of pairs (globally unique code point,
             glyph), the glyph being a TrueType or
             PostScript description
lang_glyphs: language, variant, subset of glyphset

Now, using brute force, why can we not make scale,
boldface, italic, and all that jazz attributes of
glyph instances?
(Maybe even, I dare not say it, katakana or hiragana
as an attribute.)
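
Purely to fix ideas, here is how those two tables and
the per-instance attributes might be sketched in Ruby;
every name below (Glyph, GLYPHSET, LANG_GLYPHS,
GlyphInstance) is made up for illustration, and the
three entries are arbitrary.

    # Hypothetical shape of the two tables proposed above.
    Glyph = Struct.new(:code_point, :outline)   # outline: TrueType/PostScript data

    GLYPHSET = {
      0x0041 => Glyph.new(0x0041, "<outline for LATIN CAPITAL LETTER A>"),
      0x05D0 => Glyph.new(0x05D0, "<outline for HEBREW LETTER ALEF>"),
      0x0628 => Glyph.new(0x0628, "<outline for ARABIC LETTER BEH>"),
    }

    # lang_glyphs: (language, variant) -> subset of code points from glyphset
    LANG_GLYPHS = {
      ["en", "GB"] => [0x0041],
      ["he", nil]  => [0x05D0],
      ["ar", nil]  => [0x0628],
    }

    # Scale, weight, slant and friends live on the *instance*,
    # not in the glyph table itself.
    GlyphInstance = Struct.new(:glyph, :scale, :bold, :italic)

    def glyphs_for(language, variant)
      LANG_GLYPHS.fetch([language, variant], []).map { |cp| GLYPHSET.fetch(cp) }
    end

    # Example: the "en"/"GB" subset at double size, boldface.
    glyphs_for("en", "GB").each do |g|
      inst = GlyphInstance.new(g, 2.0, true, false)
      puts format("U+%04X scale=%.1f bold=%s italic=%s",
                  inst.glyph.code_point, inst.scale, inst.bold, inst.italic)
    end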
Font is of course a more subtle problem
(German children today are unable to read “gothic”
text).
Unfortunately, we must also come to some decision on
real “quotes”, as in:
The chief would answer this with “Urrrgh”.
He drew the following shape in the sand: “…”

Now, having all this, we would [finally] have to come
down to real encodings.
Today, it seems pretty ridiculous to discuss memory
space and/or CPU performance.
At worst, we are facing expansion/bandwidth increase
factors between 2 and 3 on average.
For an end user, the critical question is whether
he/she can do something at all in a reliable fashion,
not whether it takes three times as long.
If you tell end users that their data are encoded via
code X, they will not be interested. They will get
worried if their application cannot find the right
characters.
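
For what it is worth, the factor of 2 to 3 is easy to
see with a couple of bytesize comparisons in today's
Ruby (the sample strings are arbitrary):

    samples = {
      "English"  => "character",
      "Japanese" => "\u6587\u5B57\u30B3\u30FC\u30C9",   # the word for "character code"
    }

    samples.each do |label, text|
      puts format("%-8s chars=%d  UTF-8=%d bytes  UTF-16=%d bytes",
                  label, text.length,
                  text.encode("UTF-8").bytesize,
                  text.encode("UTF-16BE").bytesize)
    end
    # English:  9 chars ->  9 bytes (UTF-8), 18 bytes (UTF-16)
    # Japanese: 5 chars -> 15 bytes (UTF-8), 10 bytes (UTF-16)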

So, ultimately,
as computers and networks are increasing performance
daily,
why can we not concentrate on functionality?
Once we get language across, we will be able to get
meaning across.

Think it over,
guys and dolls!

Jan

···


The question is really about where you put the functionality. Should
all this be at the lowest-level of string handling (your basic string
class), or should some of it be up at higher levels?

If you look at some common applications, such as dealing with small
strings of data that move between a user’s web browser and a service
provider’s database (login names, passwords, addresses, etc.),
there’s very little processing to be done. Often you don’t even
care what language or anything else the data are in; you just need
to know the character set encoding (so you don’t mix encodings on
output). The web browser takes care of most or all of your input
and display issues, including even wrapping text. (And that’s a big,
big issue right there; Japanese and English use quite different text
wrapping algorithms, for example.)
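
A minimal sketch of that “just know the encoding” case,
written in today's Ruby (where every String carries an
Encoding tag); the function name and the Latin-1 sample
are mine:

    # Normalize whatever the browser sent into one canonical encoding
    # before it goes anywhere near the database, so output never mixes
    # encodings.
    def canonicalize(raw_bytes, declared_charset)
      s = raw_bytes.dup.force_encoding(declared_charset)
      unless s.valid_encoding?
        raise ArgumentError, "bytes do not match #{declared_charset}"
      end
      s.encode("UTF-8")    # store everything as UTF-8
    end

    # A name submitted by a browser as ISO-8859-1 bytes ("café"):
    latin1_bytes = "caf\xE9".b
    name = canonicalize(latin1_bytes, "ISO-8859-1")
    puts name.encoding     # => UTF-8
    puts name              # => café

Note that nothing here needs to know the language of
the text, only its encoding, which is exactly the point.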

Unicode is designed to let you leave the lower levels quite simple,
and use upper-level code to deal with more complexity only if you
really need to. Moving this kind of stuff into lower level code
leads to several problems:

1. It's not efficient to have lower level code do things that
you don't need it to do. This is not a killer problem, I agree,
but is still a concern.

2. Having the lower-level code do this stuff can
actually force unnecessary complexity onto the
upper-level code that uses it.

3. If the lower-level stuff is broken in some way, or doesn't
do all that you need, you still end up having to have further
processing in the upper layers anyway.

To my mind, reason number three is the real killer: text processing
varies in complexity and can get very, very, very complex. And for
those applications, there’s a good chance that the lower levels
are actually going to do things wrong in some cases, in which case it
makes life far more complex for the programmer than if it had just stayed
out of the way in the first place.

Take even the simple idea of iterating over the characters in a
string, for example. As soon as you bring in right-to-left languages,
such as Hebrew, how does this work? Do you iterate in the order
the text is read? What happens when the direction of reading changes
in the middle? (Such as when you mix Hebrew and Arabic numerals?)
What if it’s a multi-line string? Investigate the issues and work
out how you might deal with this, and you’ll quickly realize that
you really want all this complexity in a class separate from your
basic string.
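
To make that concrete, here is a small sketch in today's Ruby
(the BidiText class is invented; UAX #9 is the Unicode bidirectional
algorithm): the basic string iterates in logical (storage) order,
and the visual reordering belongs in a separate, higher-level class.

    # Hebrew letters (stored in logical order) followed by ASCII digits.
    # "shalom": shin, lamed, vav, final mem -- then " 42".
    mixed = "\u05E9\u05DC\u05D5\u05DD 42"

    # The basic string class iterates in logical (storage) order,
    # not in left-to-right visual order.
    mixed.each_char.with_index do |ch, i|
      puts format("%2d  U+%04X", i, ch.ord)
    end

    # Visual reordering is a display-layer concern; a separate class
    # is where that complexity could live.
    class BidiText
      def initialize(logical)
        @logical = logical
      end

      def logical_chars
        @logical.each_char.to_a
      end

      # A real implementation would apply the bidi algorithm (UAX #9);
      # this placeholder only marks where it belongs.
      def visual_chars
        raise NotImplementedError, "apply UAX #9 here"
      end
    end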

cjs

···

On Mon, 5 Aug 2002, Jan Witt wrote:

So, ultimately, as computers and networks are increasing performance
daily, why can we not concentrate on functionality? Once we get
language across, we will be able to get meaning across.


Curt Sampson cjs@cynic.net +81 90 7737 2974 http://www.netbsd.org
Don’t you know, in this new Dark Age, we’re all light. --XTC