“pj” peterjohannsen@hotmail.com wrote in message
news:127ce4a9.0303081532.2011b08f@posting.google.com…
There is a Ruby FAQ which I read that said that Ruby only supports
ASCII. That is, AFAIK, a 7 bit character encoding scheme for a very
old character set (one with less than 128 characters).
…
I’m frankly hoping that UTF-8 is available, because I would happily
stick to UTF-8 – well, assuming there is something like iconv
…
I was designing a domain-specific scripting language and asked matz how he
was dealing with non-ASCII identifiers without getting into complex
what-is-a-letter issues. As I recall, matz’s reply was something like: deal
with ASCII the normal way, and simply consider anything above it a letter
for the purpose of identifiers. Up front this was quite useful to me, but it
does not appear Ruby is currently using it - perhaps it does in Kanji mode?
Internationalization is important to us because we deal with questionnaires
worldwide.
Here is what we did: First, the input can be in Unicode, and identifiers are
matched with the expression below. This is not entirely exact according to
the Unicode specs, but it is simple and works quite well. (The actual
identifiers are slightly different, as we also allow @ and a few other ASCII
symbols, but that is beside the point.)
identifier = /[A-Za-z\x80-\xff][A-Za-z0-9-\x0080-\xffff]*/
Quite interestingly, an 8-bit character stream in UTF-8 or Latin-1 can have
its identifiers detected with the following expression, oblivious to which of
the two encodings the stream actually uses:
utf8ident = /[A-Za-z\x80-\xff][A-Za-z0-9-\x80-\xff]*/
In Latin-1 it works as expected.
In UTF-8 the magic works because any non-ASCII character is encoded as a
sequence of two or more bytes, all in the \x80-\xff range and never in the
ASCII range. It is very important that ASCII bytes never appear inside a
non-ASCII UTF-8 sequence. This makes it extremely easy to support UTF-8 in a
parser normally dealing with ASCII symbols for operators etc. (+, ', *, \ and
so on). The expression may treat one character as several characters, but
that is completely irrelevant - it works.
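To illustrate, here is a small sketch in current Ruby (whose encoding
handling is newer than this post) of how such a byte-oriented expression
might be used to pick out identifiers; the /n flag, the BINARY re-encoding
and the names UTF8_IDENT and scan_identifiers are my additions, and the
literal hyphen is moved to the end of the character class where it is
unambiguous:

# Byte-oriented identifier scan that treats Latin-1 and UTF-8 input alike,
# because every byte of a non-ASCII UTF-8 sequence falls in \x80-\xff.
UTF8_IDENT = /[A-Za-z\x80-\xff][A-Za-z0-9\x80-\xff-]*/n

def scan_identifiers(src)
  # Match on the raw bytes so the regexp never needs to know which of the
  # two encodings the stream actually uses.
  src.dup.force_encoding(Encoding::BINARY).scan(UTF8_IDENT)
end

p scan_identifiers("größe = größe + 1")
# => two byte-string matches for "größe"; "1" is not a valid first character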
Therefore, I would guess that Ruby could instantly support UTF-8 and Latin-1
identifiers by a slight change to the lexer.
UTF-8 has an optional leading 3-byte sequence, the BOM (Byte Order Mark,
encoded as EF BB BF), that should be stripped.
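A minimal sketch of stripping it, assuming the script has just been read
from disk as raw bytes (newer Rubys can also handle this with an external
encoding of "BOM|UTF-8", but that is beside the point here):

UTF8_BOM = [0xEF, 0xBB, 0xBF]

def strip_utf8_bom(src)
  # Compare the first three bytes; getbyte returns nil past the end,
  # so short strings fall through to the else branch.
  if (0..2).all? { |i| src.getbyte(i) == UTF8_BOM[i] }
    src.byteslice(3, src.bytesize - 3)
  else
    src
  end
end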
…
available to project down to CP-1252 for example when going to US
English MS-Windows (or down to UCS-2 when going to MS-Windows NT,
because I don’t think I’d risk believing scattered claims that (some
of) NT handles UTF-16).
…
UTF-16 text starts with a 2-byte BOM that indicates big-endian (FE FF) or
little-endian (FF FE) byte order.
Ruby could easily support these by detecting the BOM and converting the
script to UTF-8 before parsing. The conversion is completely mechanical,
using a few shift and logical operations per character. By converting to
UTF-8, the lexer need not be able to handle 16-bit characters.
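As an illustration only (not Ruby’s actual lexer), here is a sketch of what
the BOM detection and the mechanical UCS-2 to UTF-8 conversion could look
like; the names utf16_bom and ucs2_to_utf8 are mine, and it assumes BMP-only
text (no surrogate pairs) with an even number of bytes:

# Returns :big_endian, :little_endian, or nil if there is no UTF-16 BOM.
def utf16_bom(data)
  case [data.getbyte(0), data.getbyte(1)]
  when [0xFE, 0xFF] then :big_endian
  when [0xFF, 0xFE] then :little_endian
  end
end

# Mechanical UCS-2 -> UTF-8 conversion: a few shifts and masks per character.
def ucs2_to_utf8(data, endian)
  out = []
  data.bytes.each_slice(2) do |hi, lo|
    hi, lo = lo, hi if endian == :little_endian
    cp = (hi << 8) | lo
    if cp < 0x80                       # 1 byte:  0xxxxxxx
      out << cp
    elsif cp < 0x800                   # 2 bytes: 110xxxxx 10xxxxxx
      out << (0xC0 | (cp >> 6)) << (0x80 | (cp & 0x3F))
    else                               # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      out << (0xE0 | (cp >> 12)) << (0x80 | ((cp >> 6) & 0x3F)) << (0x80 | (cp & 0x3F))
    end
  end
  out.pack("C*").force_encoding(Encoding::UTF_8)
end

Converting the BOM character itself simply yields the UTF-8 BOM bytes EF BB
BF, which the strip shown above then removes.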
The parser I’m working with currently only reads ASCII, Latin-1 and UTF-8
directly from disk, but does a UCS-2 conversion when receiving script
in memory from other components.
The process may not be 100% bulletproof, but it seems to work well enough.
Mikkel