Hi,
I do not want to bore you, but I need help with abstract concepts and terminology:
Let me give you a very small example that is at least 30 years old:
When ECMA (European Computer Manufacturers Association) TC 1 was dealing with character codes and glyphs, they had long discussions about an unbiased name for the “currency symbol” as well as a glyph for it. They wound up with something they called “solidus”, and the glyph looked like a sun symbol: a circle with a dot in the middle. Very late it dawned on them that it was not such a good idea to send 27.5 solidi from the UK to the US, when the solidus would print as a British pound sign on one end and as a US dollar sign on the other. (Today, this device might be useful for doing some “creative bookkeeping”.)
All the new wonderful standards talk about “characters” and their encodings. This is only one side of the coin! It makes a difference where one hooks up which attributes: it impacts multilingual OCR, spell checkers, virtual keyboards, collating sequences and much more:
Let us assume that dictionary entries [of a variant of a language] are bound to a canonical font (a set of glyph-character pairs), and that the complete dictionary is organized along a canonical collating sequence.
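As a toy illustration of what such a canonical collating sequence could mean operationally, here is a minimal Python sketch; the symbol order and the sample entries are invented for the example:

  # A canonical collating sequence as an explicit ordering table:
  # each symbol of the (language, variant) gets a rank, and the
  # dictionary is sorted by ranks, not by raw code points.
  CANONIC_ORDER = {ch: rank for rank, ch in enumerate("aäbcdefg")}

  def collation_key(word):
      return [CANONIC_ORDER[ch] for ch in word]

  entries = ["bad", "äce", "ace"]
  print(sorted(entries, key=collation_key))  # -> ['ace', 'äce', 'bad']

Sorting by raw code points would put “äce” after “bad”; the explicit table keeps it where the language variant wants it.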
Assume further that each glyph can theoretically be OCRed unambiguously, but that one and the same glyph may be bound to many characters across many languages and their variants.
Assert that whatever can be printed can be OCRed back, if the language and its variant are known.
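To make the last two assumptions concrete, here is a minimal sketch; the glyph identifiers, language names, and bindings are invented:

  # One and the same glyph may be bound to different characters in
  # different languages; once (language, variant) is fixed, the
  # binding is unique, so printed text can be OCRed back.
  GLYPH_TO_CHAR = {
      # (glyph_id, language, variant) -> character
      ("G42", "lang_A", "std"): "x",
      ("G42", "lang_B", "std"): "y",  # same glyph, different character
  }

  def ocr_resolve(glyph_id, language, variant):
      return GLYPH_TO_CHAR[(glyph_id, language, variant)]

  assert ocr_resolve("G42", "lang_A", "std") == "x"
  assert ocr_resolve("G42", "lang_B", "std") == "y"

Without the language and variant, glyph G42 is ambiguous; with them, the round trip is lossless.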
Assume that there are human beings who can read and understand display text and hardcopy text, if they understand its language.
Assume that such understanders react to the text on two levels:
- they tolerate certain systematic but cognitively neutral glyph deformations
- they react to the conscious and subliminal cues aimed at directing their attention (highlights, the “small print” etc.)
If all the above applies, we are led to the following insights:
- if we store text, we want to be able to restore it without losses
- we also want to be able to take the text apart, separating out linguistic facets for text analysis or mechanical translation
- we want to be able to do all this for all living and dead languages on this planet (as far as they have a written representation for their texts). For a given piece of [glyphic] text, all we should need to know is its language and its variant. (If either or both are not known, we might need additional devices to identify them; this is not part of the current argument.)
- there are languages that have a few problems not present in Japanese or English. These problems are sufficiently well known and have already been solved in local contexts. (Let us leave out the truly archaic stuff, like boustrophedon writing.)
Arabic glyph texts are mostly characterized by three facts:
- consonants take initial, medial, final and isolated form glyphs
- vowels exist in their short form as diacritics
- the text runs from right to left
Arabic characters, with variations, exist in Urdu and other languages of the Indian subcontinent, in Farsi (Iran), and in a few other languages.
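The positional dependency can be modelled as a function of a character and its place in the word. The sketch below uses the real Unicode presentation forms of the letter beh (U+0628) to make this concrete; it deliberately ignores non-joining letters and the vowel diacritics:

  # One character, four positional glyphs.
  BEH_FORMS = {
      "isolated": "\uFE8F",  # ARABIC LETTER BEH ISOLATED FORM
      "initial":  "\uFE91",  # ARABIC LETTER BEH INITIAL FORM
      "medial":   "\uFE92",  # ARABIC LETTER BEH MEDIAL FORM
      "final":    "\uFE90",  # ARABIC LETTER BEH FINAL FORM
  }

  def positional_form(index, length):
      if length == 1:
          return "isolated"
      if index == 0:
          return "initial"
      if index == length - 1:
          return "final"
      return "medial"

  # A word of three identical consonants shows all joining behaviour:
  glyphs = [BEH_FORMS[positional_form(i, 3)] for i in range(3)]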
Hebrew also runs from right to left, but does not show the positional dependencies of Arabic.
Devanagari and other Indic scripts are written from left to right. Their main complication is ligatures: neighbouring characters intertwining.
Now, for me, the world would be optimally organized if we had 2 tables:
glyphset: a set of pairs: number (globally unique code point) - glyph (TrueType or PostScript description)
lang_glyphs: language, variant, subset from glyphset
Now, using brute force, why can we not make scale, boldface, italics, all that jazz, attributes of glyph instances? (Maybe even, I dare not say it, katakana or hiragana as an attribute.)
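A minimal sketch of those two tables, with style carried by the glyph instance rather than by separate code points; all identifiers and sample values are invented:

  from dataclasses import dataclass

  # glyphset: globally unique code point -> glyph description
  glyphset = {
      1001: "outline for glyph A (TrueType or PostScript data)",
      1002: "outline for glyph B",
  }

  # lang_glyphs: (language, variant) -> subset of glyphset
  lang_glyphs = {
      ("lang_A", "std"): {1001, 1002},
      ("lang_B", "std"): {1001},
  }

  # Scale, boldface, italics, all that jazz as attributes of the
  # glyph instance, not of the code point:
  @dataclass
  class GlyphInstance:
      code_point: int
      scale: float = 1.0
      bold: bool = False
      italic: bool = False

  text = [GlyphInstance(1001, bold=True), GlyphInstance(1002, italic=True)]

Under this model the code point identifies what the glyph is, and everything about how it appears travels with the instance.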
Font is of course a more subtle problem. (German children today are unable to read “gothic” text.)
Unfortunately, we must also come to some decision on real “quotes”:
  The chief would answer this with “Urrrgh”.
  He drew the following shape in the sand: “…”
Now, having all this, we would [finally] have to come down to real encodings. Today, it seems pretty ridiculous to discuss memory space and/or CPU performance. In the worst case we are facing expansion/bandwidth requirement increase factors between 2 and 3 on average.
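With the Unicode encodings that exist today, such factors can simply be measured; a quick Python sketch (the sample strings are arbitrary):

  # Storage cost of the same text under different encodings.
  samples = {
      "English":  "character codes and glyphs",
      "Greek":    "αλφάβητο",
      "Japanese": "文字と字体",
  }
  for name, text in samples.items():
      for enc in ("utf-8", "utf-16-le", "utf-32-le"):
          print(f"{name:9} {enc:10} {len(text.encode(enc)):3d} bytes")

Depending on the script, the factors land between 1 and 4, in the same ballpark as the 2 to 3 quoted above.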
For an end user, the critical question is whether he/she can do something at all in a reliable fashion, not whether it takes 3 times as long. If you tell the end user that his/her data are encoded via code X, they will not be interested. They will get worried if their application cannot find the right characters.
So, ultimately, as computers and networks are increasing in performance daily, why can we not concentrate on functionality? Once we get language across, we will be able to get meaning across.
Think it over,
guys and dolls!
Jan