Unicode in Ruby now?

As someone who just wants to use a number of
different character sets, I do not find the current
discussions about Unicode, I18N and all the different
standards very helpful.
Some of the discussions also display some
unpleasant parochialism, maybe traceable
to a few conceptual inconsistencies.
I suspect that “character” is an ambiguous concept
as used in the standards. To clarify matters, we must
distinguish between
meaning, representation and encodings.

Let us try and agree on a few basic issues:

What you see on the screen, analyze via OCR, write on
paper or see in a book,
are elementary shapes or “glyphs” (from Ancient Greek
glyphe = the carving).
These glyphs may be scaled or otherwise deformed,
emboldened etc.,
without introducing interpretation problems on that
level.
(Size and emphasis play important roles on the level
of words and
sentences, but not typically on the encoding level of
single characters.)

Just as terms and meanings overlap on the level of
text (synonyms and homonyms),
so glyphs and [language] characters overlap when we
map one to the other:
there are synmorphs (several glyphs for one character)
and homomorphs (one glyph for several characters).

A trivial case of homomorphism is the relationship
between
the glyphs representing an uppercase Latin H and the
Greek letter Eta.
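
A minimal sketch of this case in present-day Ruby,
assuming UTF-8 strings and Ruby 2.4 or later for the
Unicode-aware downcase. The two variable names are
mine. The glyphs look the same, but the characters
behind them are numbered differently and behave
differently:

```ruby
latin_h   = "H"       # U+0048 LATIN CAPITAL LETTER H
greek_eta = "\u0397"  # U+0397 GREEK CAPITAL LETTER ETA

# Same shape on screen, different characters underneath.
puts latin_h == greek_eta             # false
puts format("U+%04X", latin_h.ord)    # U+0048
puts format("U+%04X", greek_eta.ord)  # U+0397

# Lowercasing exposes the difference in meaning:
# Latin H becomes "h", Greek Eta becomes U+03B7 (small eta).
puts latin_h.downcase
puts greek_eta.downcase
```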

The classic case of synmorphism is the use of a
variety of fonts
(sets of glyph-character pairings) within a single
language.

A less simple case is context-dependent synmorphism,
as in
all languages written with Arabic characters, where
the choice of glyph for a given
character depends on the position of the character in
the word (beginning, middle, end).
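
Unicode happens to carry some of these positional
glyphs as separate "presentation form" code points,
which allows a small illustration. This is only a
sketch, assuming Ruby 2.2 or later for
String#unicode_normalize. The constant name BEH_FORMS
is mine. It shows four glyph forms of the Arabic letter
Beh folding back to one character under NFKC
normalization:

```ruby
# Presentation forms of ARABIC LETTER BEH (U+0628).
BEH_FORMS = {
  isolated: "\uFE8F",
  final:    "\uFE90",
  initial:  "\uFE91",
  medial:   "\uFE92"
}.freeze

BEH_FORMS.each do |position, glyph|
  # NFKC replaces a presentation form by the character it stands for.
  character = glyph.unicode_normalize(:nfkc)
  puts format("%-8s U+%04X -> U+%04X", position, glyph.ord, character.ord)
end
```

In ordinary Arabic text only the base character is
stored and the rendering software picks the glyph, so
the presentation forms mostly matter for legacy data.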

So far, I have assumed that we are talking about
characters/glyphs that follow
linear textual orderings like:

Most modern European languages: glyphs: left to
right, lines top down
Japanese (if not following the order below): same as above

Japanese: glyphs: top down,
lines right to left

Arabic, Hebrew (printed forms): glyphs: right to
left, lines top down

There are 2 exceptions to this:
diacritics (small marks above, below, beside or
within a glyph)
ligatures (intertwining of 2 or more
characters, e.g. in Indian languages)
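
Both exceptions can be shown with Unicode strings. This
is a rough sketch, assuming Ruby 2.2 or later for
String#unicode_normalize. The variable names are mine,
and a Latin "fi" ligature stands in for the far richer
Indian cases:

```ruby
# Diacritic: one shape, two possible encodings.
precomposed = "\u00E9"   # e with acute as a single code point
combining   = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

puts precomposed == combining                          # false
puts combining.unicode_normalize(:nfc) == precomposed  # true
puts combining.length                                  # 2

# Ligature: one glyph standing for two characters.
ligature = "\uFB01"      # LATIN SMALL LIGATURE FI
puts ligature.unicode_normalize(:nfkc)                 # fi
```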

Since the advent of Adobe PostScript and Apple/Microsoft
TrueType, glyphs have become manageable
as geometric objects, which means that they can be
encoded and decoded as such.

Also, for all languages, people use dictionaries.
When doing so, they know
what language they are dealing with, and they know
the standard
collating sequence of that language. (In classical
Spanish, do not search for ‘chorro’ under ‘c’; ‘ch’ is a letter of its own.)
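
To make the collating point concrete, here is a small
sketch of a language-specific sort key. The alphabet
constant, the helper spanish_sort_key and the word list
are my own assumptions. Accented vowels are not
handled, and a real system would use a proper collation
library:

```ruby
# Traditional Spanish alphabet, with "ch" and "ll" as letters of their own.
TRADITIONAL_SPANISH =
  %w[a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z].freeze
RANK = TRADITIONAL_SPANISH.each_with_index.to_h.freeze

# Split a word into letters (digraphs first) and map each to its rank.
# Unknown letters sort after everything else.
def spanish_sort_key(word)
  word.downcase.scan(/ch|ll|./).map do |letter|
    RANK.fetch(letter, TRADITIONAL_SPANISH.size)
  end
end

words = %w[cuna chorro dama casa]
puts words.sort.inspect
# ["casa", "chorro", "cuna", "dama"]   (plain code point order)
puts words.sort_by { |w| spanish_sort_key(w) }.inspect
# ["casa", "cuna", "chorro", "dama"]   (traditional Spanish order)
```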

To produce an online or hard copy dictionary, you
must choose a “canonical
representative” amongst the available fonts and
select a collating sequence.
(There may be several to choose from)

As a multilingual reader, I want my browser to find
out what language
a piece of text is written in and to display the
appropriate glyphs for
the different HTML elements as defined by the text
author.

As a multilingual writer or OCR reader, I will tell
my word processor what language I
currently want to use, and it will give me a virtual
keyboard, provide me with a
dictionary, a spelling checker etc.

Assuming that all glyphs are uniquely numbered,
and knowing the language,
it should always be possible to retrieve text from
the glyph representation
and to deal with it on any linguistic level desired.
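
A toy sketch of that idea follows. The glyph numbers,
the table GLYPH_TABLE and the helper text_from_glyphs
are all hypothetical and taken from no standard. They
only show how a glyph number plus a language name can
resolve a homomorph:

```ruby
# Hypothetical glyph numbers: each shape is listed once, together with
# the character it stands for in each language.
GLYPH_TABLE = {
  1 => { english: "H", greek: "\u0397" },  # H shape: Latin H or Greek Eta
  2 => { english: "A", greek: "\u0391" },  # A shape: Latin A or Greek Alpha
  3 => { english: "P", greek: "\u03A1" }   # P shape: Latin P or Greek Rho
}.freeze

# Turn a sequence of glyph numbers into text for the given language.
def text_from_glyphs(glyph_ids, language)
  glyph_ids.map { |id| GLYPH_TABLE.fetch(id).fetch(language) }.join
end

glyphs = [1, 2, 3]
puts text_from_glyphs(glyphs, :english)  # Latin characters
puts text_from_glyphs(glyphs, :greek)    # Greek characters, same shapes
```

The user only supplies the language name; the numbering
stays hidden inside the table.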

The end user has to be given a list of standardized
names of languages, nothing more.
He/she should not need to know any glyph or
“character” encodings explicitly.

For someone who wants to write a multilingual
information system in Ruby, the question is how the
ideal situation just described can best be
approximated.

Jan

  • Más sabe el diablo por viejo que por diablo. ( The
    devil knows more from being old than from being the
    devil.)