What character sets are available in Ruby?

There is a Ruby FAQ which I read that said that Ruby only supports
ASCII. That is, AFAIK, a 7 bit character encoding scheme for a very
old character set (one with less than 128 characters).

I am hoping that that is out of date ?

I found one long thread, but it appeared to be speculative, about what
to do about Unicode, and I think it may have mixed up character sets
and character encoding schemes. (I often see Microsoft documentation
and Microsoft-trained or influenced programmers expounding on how
Unicode is 16 bits; this is the type of misunderstanding of which I’m
quite tired now).

What character set(s) are supported under Ruby, using (of course) what
character encoding schemes ?

I’m frankly hoping that UTF-8 is available, because I would happily
stick to UTF-8 – well, assuming there is something like iconv
available to project down to CP-1252 for example when going to US
English MS-Windows (or down to UCS-2 when going to MS-Windows NT,
because, I don’t think I’d risk believing scattered claims that (some
of) NT handles UTF-16).

“pj” peterjohannsen@hotmail.com wrote in message
news:127ce4a9.0303081532.2011b08f@posting.google.com

There is a Ruby FAQ which I read that said that Ruby only supports
ASCII. That is, AFAIK, a 7 bit character encoding scheme for a very
old character set (one with less than 128 characters).

I’m frankly hoping that UTF-8 is available, because I would happily
stick to UTF-8 – well, assuming there is something like iconv

I was designing a domain-specific scripting language and asked matz how he
was dealing with non-ASCII identifiers without getting into complex
what-is-a-letter issues. As I recall, matz's reply was something like: deal
with ASCII the normal way, and anything above would simply be considered
letters for the purpose of identifiers. Up front this was quite useful to
me, but it does not appear Ruby is currently using it - perhaps in Kanji
mode?

Internationalization is important to us because we deal with questionnaires
worldwide.

Here is what we did: First, the input can be in Unicode
This is not entirely exact according to the Unicode specs, but it is simple and
works quite well. (The actual identifiers are slightly different, as we also
allow @ and a few other ASCII symbols, but that is beside the point.)

identifier = /[A-Za-z\x80-\xff][A-Za-z0-9-\x0080-\xffff]*/

Interestingly, an 8-bit character stream in UTF-8 or Latin-1 can detect
identifiers with the following expression, oblivious to the fact that the
stream:

utf8ident = /[A-Za-z\x80-\xff][A-Za-z0-9-\x80-\xff]*/

In Latin-1 it works as expected.
In UTF-8 the magic works because any non-ASCII character is made up of a
sequence of bytes in the \x80-\xff range and never in the ASCII range. It is
very important that ASCII characters never appear inside a non-ASCII UTF-8
sequence. This makes it extremely easy to support UTF-8 in a parser normally
dealing with ASCII symbols for operators etc. (+ ' * \ etc.)

The expression may treat one multi-byte character as several characters, but
that is completely irrelevant - it works.
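
For example, something along these lines (just a rough sketch, written against
a current Ruby where String#b gives the raw bytes, using a /n "no encoding"
regexp and a slightly simplified character class) picks identifiers out of a
UTF-8 byte stream:

  utf8ident = /[A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*/n  # byte-oriented
  source    = "pris = v\xC3\xA6rdi + 10".b                 # "pris = værdi + 10" as raw UTF-8 bytes
  p source.scan(utf8ident)                                 # => ["pris", "v\xC3\xA6rdi"]

The multi-byte "æ" is simply carried along as two \x80-\xff bytes.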

Therefore, I would guess that Ruby could instantly support UTF-8 and Latin-1
identifiers by a slight change to the lexer.

UTF-8 has an optional leading 3-byte sequence called a BOM (Byte Order
Mark) that should be stripped.
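
In byte terms that is the EF BB BF prefix. A rough sketch (the file name is
only for illustration):

  src = File.binread("script.rb")                     # read the script as raw bytes
  src = src[3..-1] if src[0, 3] == "\xEF\xBB\xBF".b   # drop a UTF-8 BOM if present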

available to project down to CP-1252 for example when going to US
English MS-Windows (or down to UCS-2 when going to MS-Windows NT,
because, I don’t think I’d risk believing scattered claims that (some
of) NT handles UTF-16).

UTF-16 text has a big-endian or little-endian BOM (a 2-byte sequence).
Ruby could easily support these by detecting the BOM and converting the
script to UTF-8 before parsing. The conversion is completely mechanical
using a few shift and logical operations per character.

By converting to UTF-8, the Lexer need not be able to handle 16 bit
characters.
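
A sketch of what that conversion could look like (a hypothetical helper,
assuming BMP-only input, i.e. no surrogate pairs, and defaulting to big
endian when there is no BOM):

  def ucs2_to_utf8(raw)
    b = raw.bytes
    little = (b[0, 2] == [0xFF, 0xFE])                  # BOM FF FE => little endian
    b = b.drop(2) if little || b[0, 2] == [0xFE, 0xFF]  # strip the BOM itself
    out = "".b
    b.each_slice(2) do |b1, b2|
      u = little ? (b2 << 8) | b1 : (b1 << 8) | b2      # one 16-bit code unit
      if    u < 0x80  then out << u
      elsif u < 0x800 then out << (0xC0 | (u >> 6)) << (0x80 | (u & 0x3F))
      else                 out << (0xE0 | (u >> 12)) << (0x80 | ((u >> 6) & 0x3F)) << (0x80 | (u & 0x3F))
      end
    end
    out
  end

  p ucs2_to_utf8("\xFE\xFF\x00K\x00\xF8".b)             # big-endian "Kø" => "K\xC3\xB8"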

In the script I’m working with, it currently only reads ASCII, Latin-1 and
UTF-8 directly from disk, but does a UCS-2 conversion when receiving script
in-memory from other components.

It may be that the process is not 100% bulletproof, but it seems to work well
enough.

Mikkel

Hi,


In message “What character sets are available in Ruby ?” on 03/03/09, pj peterjohannsen@hotmail.com writes:

There is a Ruby FAQ which I read that said that Ruby only supports
ASCII. That is, AFAIK, a 7 bit character encoding scheme for a very
old character set (one with less than 128 characters).

Ruby does handle ASCII, UTF-8, EUC-JP, SJIS, and perhaps ISO-8859-x.

						matz.

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message
news:3e6b1d39$0$137$edfadb0f@dtext01.news.tele.dk…

identifiers with the following expression, oblivious to the fact that the
stream:
utf8ident = /[A-Za-z\x80-\xff][A-Za-z0-9-\x80-\xff]*/

should read … oblivious to the fact that the stream is UTF-8 or Latin-1.

Mikkel

“Yukihiro Matsumoto” matz@ruby-lang.org wrote in message
news:1047215982.535206.10664.nullmailer@picachu.netlab.jp…

Hi,

There is a Ruby FAQ which I read that said that Ruby only supports
ASCII. That is, AFAIK, a 7 bit character encoding scheme for a very
old character set (one with less than 128 characters).

Ruby does handle ASCII, UTF-8, EUC-JP, SJIS, and perhaps ISO-8859-x.

But not in identifiers?
The following does not work out of the box (the Danish name followed by the
English name of Copenhagen):

#translations:
København = "Copenhagen"

Mikkel


In message “What character sets are available in Ruby ?” on 03/03/09, pj peterjohannsen@hotmail.com writes:

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message news:3e6b1d39$0$137$edfadb0f@dtext01.news.tele.dk

“pj” peterjohannsen@hotmail.com wrote in message
news:127ce4a9.0303081532.2011b08f@posting.google.com

There is a Ruby FAQ which I read that said that Ruby only supports
ASCII. That is, AFAIK, a 7 bit character encoding scheme for a very
old character set (one with less than 128 characters).

I’m frankly hoping that UTF-8 is available, because I would happily
stick to UTF-8 – well, assuming there is something like iconv

Here is what we did: First, the input can be in Unicode

(You obviously mean UTF-8 encoded Unicode, as you make clear below.)

This is not entirely exact according to the Unicode specs, but it is simple and
works quite well. (The actual identifiers are slightly different, as we also
allow @ and a few other ASCII symbols, but that is beside the point.)

identifier = /[A-Za-z\x80-\xff][A-Za-z0-9-\x0080-\xffff]*/

Interestingly, an 8-bit character stream in UTF-8 or Latin-1 can detect
identifiers with the following expression, oblivious to the fact that the
stream:

utf8ident = /[A-Za-z\x80-\xff][A-Za-z0-9-\x80-\xff]*/

In Latin-1 it works as expected.
In UTF-8 the magic works because any non-ASCII character is made up of a
sequence of bytes in the \x80-\xff range and never in the ASCII range. It is
very important that ASCII characters never appear inside a non-ASCII UTF-8
sequence. This makes it extremely easy to support UTF-8 in a parser normally
dealing with ASCII symbols for operators etc. (+ ' * \ etc.)

The expression may treat one multi-byte character as several characters, but
that is completely irrelevant - it works.

Therefore, I would guess that Ruby could instantly support UTF-8 and Latin-1
identifiers by a slight change to the lexer.

But, isn’t there a string#length – does it give you character length
or byte length ?

I am pretty ignorant of Ruby – does it provide uppercase, lowercase,
titlecase functions ? If so, I would guess that they only support
UTF-8 if they were designed for UTF-8 ?

UTF-8 has an optional leading 3-byte sequence called a BOM (Byte Order
Mark) that should be stripped.

available to project down to CP-1252 for example when going to US
English MS-Windows (or down to UCS-2 when going to MS-Windows NT,
because, I don’t think I’d risk believing scattered claims that (some
of) NT handles UTF-16).

UTF-16 text has a big-endian or little-endian BOM (a 2-byte sequence).
Ruby could easily support these by detecting the BOM and converting the
script to UTF-8 before parsing. The conversion is completely mechanical
using a few shift and logical operations per character.

By converting to UTF-8, the Lexer need not be able to handle 16 bit
characters.

Ok, you’re talking about how Ruby could be altered to do things, right
?

I’m wondering what Ruby out of the box does, altho, it is pretty
interesting to hear what Ruby could do too.

In the script I’m working with, it currently only reads ASCII, Latin-1 and
UTF-8 directly from disk, but does a UCS-2 conversion when receiving script
in-memory from other components.

It may be that the process is not 100% bulletproof, but it seems to work well
enough.

Mikkel

I’ve not been following, but I read at least a rumor that China put
out a new standard (GB18030 ? I probably have the numbers wrong) that
cannot be represented in UCS-2, but can be represented in Unicode 3.1
(ie, UTF-8, UTF-16, etc). I dunno if that matters to anyone outside of
China though.

Ruby does handle ASCII, UTF-8, EUC-JP, SJIS, and perhaps ISO-8859-x.

But not in identifiers?
The following does not work out of the box (the Danish name followed by
the English name of Copenhagen):

#translations:
København = "Copenhagen"

I don’t think this is intended to work. I know
it does in Java, but that’s a different beast.

Hal


----- Original Message -----
From: “MikkelFJ” mikkelfj-anti-spam@bigfoot.com
Newsgroups: comp.lang.ruby
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Sunday, March 09, 2003 11:21 AM
Subject: Re: What character sets are available in Ruby ?

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com

But not in identifiers?
The following does not work out of the box (the Danish name
followed by the English name of Copenhagen):

#translations:
København = "Copenhagen"

I must not have understood the concept behind i18n.
I have never had a need to write such programs.

So please allow me a question out of ignorance.

I thought that it had to do with ** data ** that the program
manipulates and ** not ** with the program itself.
In other words, as long as a language supports:

 Copenhagen = "København"

we can say that it has i18n support.

Whether it supports variables, statements etc. like MikkelFJ suggested:

 København = "Copenhagen"

does not really matter.

Wrong, eh ?

– shanko

Hi,


In message “Re: What character sets are available in Ruby ?” on 03/03/10, “MikkelFJ” mikkelfj-anti-spam@bigfoot.com writes:

But not in identifiers?

Even in identifiers (for UTF-8, EUC-JP, SJIS), if you set -K option
properly. But I strongly discourage it.

						matz.

“pj” peterjohannsen@hotmail.com wrote in message
news:127ce4a9.0303091534.7c5e7d8d@posting.google.com

Here is what we did: First, the input can be in Unicode

(You obviously mean UTF-8 encoded Unicode, as you make clear below.)

No, I meant Unicode as in UCS-2 - that is why the expression was
\x0080-\xffff and not \x80-\xff (although I forgot it in the lead
character):

identifier = /[A-Za-z\x80-\xff][A-Za-z0-9-\x0080-\xffff]*/

I then went on to argue that it also works in UTF-8 representation.

But, isn’t there a string#length – does it give you character length
or byte length ?

As I also answered in another posting, it is a separate issue. I’m talking
about the lexer process and symbol table lookup. String classes can easily
have 16 or 32 bit representation internally. The main point is that it is
easy to use an 8bit lexer.

I am pretty ignorant of Ruby – does it provide uppercase, lowercase,
titlecase functions ? If so, I would guess that they only support
UTF-8 if they were designed for UTF-8 ?

I’m not sure I understand your question. Ruby is case sensitive.
Generally, case-sensitive languages are great because you completely avoid
the problem of what an upper-case letter is (a moving target in Unicode, i.e.
a large table that must be updated frequently).
However, Ruby uses upper case to identify constants. This means that Ruby
does not avoid the upper-case problem after all.
This problem is not properly addressed by my expression, unless you only
allow A-Z upper-case constants. In the end you have to use a large table to
do it properly.

By converting to UTF-8, the Lexer need not be able to handle 16 bit
characters.

Ok, you’re talking about how Ruby could be altered to do things, right
?

Well, yes - I’m partly talking about how I thought Ruby did it (and apparently
does, using the proper -K switch), and partly about my own experience with
how UTF-8 support can easily be retrofitted to a parser. I have already used
it several times, so I exploit every opportunity to spread the merry
message ;-) Actually I picked the expression almost directly from a parser
I’m hacking in Ruby.

I’ve not been following, but I read at least a rumor that China put
out a new standard (GB18030 ? I probably have the numbers wrong) that
cannot be represented in UCS-2, but can be represented in Unicode 3.1
(ie, UTF-8, UTF-16, etc). I dunno if that matters to anyone outside of
China though.

Is this the Big-5?
Anyway - I think this is one reason that matz refuses to settle for Unicode
only. Initially I thought this was wrong, but now I realize how important it
is not to settle on any single format. Indeed, I sent an email to Paul
Graham suggesting that he consider this point in his new Arc language.

Mikkel

“Shashank Date” sdate@everestkc.net wrote in message news:1d858c178229b890eb27952ca739f334@news.teranews.com

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com

But not in identifiers?
The following does not work out of the box (the Danish name
followed by the English name of Copenhagen):

#translations:
København = "Copenhagen"

I must not have understood the concept behind i18n.
I have never had a need to write such programs.

So please allow me a question out of ignorance.

I thought that it had to do with ** data ** that the program
manipulates and ** not ** with the program itself.
In other words, as long as a language supports:

 Copenhagen = "København"

we can say that it has i18n support.

Whether it supports variables, statements etc. like MikkelFJ suggested:

 København = "Copenhagen"

does not really matter.

Wrong, eh ?

– shanko

But, wouldn’t you also want to deal with the string length method, and any
character methods (I dunno Ruby – does it have uppercase and stuff
like that, or iswhitespace) ?

“Yukihiro Matsumoto” matz@ruby-lang.org wrote in message
news:1047250120.292535.21994.nullmailer@picachu.netlab.jp…

Hi,

But not in identifiers?

Even in identifiers (for UTF-8, EUC-JP, SJIS), if you set -K option
properly. But I strongly discourage it.

I agree that it is not good practice to use identifiers in local character
sets, for portability reasons.
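
(For reference, the switch in question is the -K command-line option, which
can also go on the shebang line - e.g. something like:

  ruby -Ku script.rb      # parse the source as UTF-8
  #!/usr/bin/ruby -Ku     # or the equivalent in the script itself

with -Ke and -Ks for EUC-JP and SJIS respectively.)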

Mikkel


In message “Re: What character sets are available in Ruby ?” on 03/03/10, “MikkelFJ” mikkelfj-anti-spam@bigfoot.com writes:

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message news:3e6bde55$0$155$edfadb0f@dtext01.news.tele.dk

“pj” peterjohannsen@hotmail.com wrote in message
news:127ce4a9.0303091534.7c5e7d8d@posting.google.com

Here is what we did: First, the input can be in Unicode

(You obviously mean UTF-8 encoded Unicode, as you make clear below.)

No, I meant Unicode as in UCS-2 - that is why the expression was
\x0080-\xffff and not \x80-\xff (although I forgot it in the lead
character):

Oh yes, I was misled by not reading very carefully.

identifier = /[A-Za-z\x80-\xff][A-Za-z0-9-\x0080-\xffff]*/

I then went on to argue that it also works in UTF-8 representation.

Right, I see.

But, isn’t there a string#length – does it give you character length
or byte length ?

As I also answered in another posting, it is a separate issue. I’m talking
about the lexer process and symbol table lookup. String classes can easily
have 16 or 32 bit representation internally. The main point is that it is
easy to use an 8bit lexer.

I think you’re addressing identifier parsing – am I getting that right ?

I am pretty ignorant of Ruby – does it provide uppercase, lowercase,
titlecase functions ? If so, I would guess that they only support
UTF-8 if they were designed for UTF-8 ?

I’m not sure I understand your question. Ruby is case sensitive.
Generally, case-sensitive languages are great because you completely avoid
the problem of what an upper-case letter is (a moving target in Unicode, i.e.
a large table that must be updated frequently).
However, Ruby uses upper case to identify constants. This means that Ruby
does not avoid the upper-case problem after all.
This problem is not properly addressed by my expression, unless you only
allow A-Z upper-case constants. In the end you have to use a large table to
do it properly.

Yes, I suppose “constants begin with upper-case letters” is a
good rule for an ASCII language, but not so good for a modern one – or at
least, not so easy :-)

By converting to UTF-8, the Lexer need not be able to handle 16 bit
characters.

Ok, you’re talking about how Ruby could be altered to do things, right
?

Well, yes - I’m partly talking about how I thought Ruby did it (and apparently
does, using the proper -K switch), and partly about my own experience with
how UTF-8 support can easily be retrofitted to a parser. I have already used
it several times, so I exploit every opportunity to spread the merry
message ;-) Actually I picked the expression almost directly from a parser
I’m hacking in Ruby.

I’ve not been following, but I read at least a rumor that China put
out a new standard (GB18030 ? I probably have the numbers wrong) that
cannot be represented in UCS-2, but can be represented in Unicode 3.1
(ie, UTF-8, UTF-16, etc). I dunno if that matters to anyone outside of
China though.

Is this the Big-5?

Here is a url that talks about it:
http://www.anycities.com/gb18030/introduce.htm

My thought is that China has a very difficult fight.

I think UCS-2 was an unfortunate, short-sighted mistake that would
probably be dead and gone, except that Microsoft invested heavily
in it - so heavily that I cannot imagine Windows having a 32-bit wchar_t.
Because MS-Windows is such a force, I think the UCS-2 mistake will
therefore be with us for a long time. AFAIK, this is fine for almost
everyone outside of China, so they are the sole big loser, isolated,
and with a rather difficult road ahead if they try to fight it.

Anyway - I think this is one reason that matz refuses to settle for Unicode
only. Initially I thought this was wrong, but now I realize how important it
is not to settle on any single format. Indeed, I sent an email to Paul
Graham suggesting that he consider this point in his new Arc language.

Mikkel

What I wish for is some language where characters are a primitive,
and I would not have to muck about with them myself. But I fear that
maybe most languages use UCS-2 at best, so that eventually they
will (may ?) all require the programmers to muck about with surrogate
pairs and whatnot. I’m lazy, and I want a language that does my work
for me (altho of course that is a rather utopian dream) :-)

But I’ll grant you that if you use 16 bit characters, your life is
much easier with UCS-2 than UTF-16; UTF-16 will only bring you
heartache (what do you do with string slices that cut through
surrogate characters – ugh).

“pj” peterjohannsen@hotmail.com wrote in message
news:127ce4a9.0303091535.2b52a780@posting.google.com

“Shashank Date” sdate@everestkc.net wrote in message
news:1d858c178229b890eb27952ca739f334@news.teranews.com

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com

snip UTF-8 in identifiers

But, wouldn’t you also want to deal with the string length method, and any
character methods (I dunno Ruby – does it have uppercase and stuff
like that, or iswhitespace) ?

Yes, but it is a separate issue. A simple solution is to parse strings as
UTF-8 and convert them to 16 or 32 bits internally. I think matz is working
on a multi-representation style of string class, because Unicode isn’t the
solution for everything.
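
For example, something as simple as pack/unpack with the "U*" template gets
you between UTF-8 bytes and integer codepoints, assuming well-formed input:

  codepoints = "K\xC3\xB8benhavn".unpack("U*")  # => [75, 248, 98, 101, 110, 104, 97, 118, 110]
  utf8       = codepoints.pack("U*")            # back to the UTF-8 string "København"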

Mikkel

“pj” peterjohannsen@hotmail.com wrote in message
news:127ce4a9.0303100359.3291cffb@posting.google.com

I think you’re addressing identifier parsing – am I getting that right ?

Yes, I was mostly discussing identifier parsing - sorry if I wasn’t clear on
that. However, the technique works generally - you can easily parse a quoted
string the same way. This works because a quote byte is never part of any
character other than the quote itself (an ASCII byte always means an ASCII
character in UTF-8).
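
A rough example in the same byte-oriented style as before:

  str_re = /"[^"]*"/n                                  # a double-quoted string, on raw bytes
  p "x = \"K\xC3\xB8benhavn\"".b.scan(str_re)          # => ["\"K\xC3\xB8benhavn\""]

The quote byte can never occur inside a multi-byte UTF-8 sequence, so the
match never splits a character.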

I think UCS-2 was an unfortunate, short-sighted mistake that would
probably be dead and gone, except that Microsoft invested heavily

There was a very long discussion in comp.lang.ruby a while back.

I think UCS-2 works fine in practice because it mostly works, and when it
doesn’t, a character just takes up more space, with the locations easily
identified if you are concerned with variable-length issues. However, given
that it isn’t truly fixed-length, I find UTF-8 the best external
representation, whereas UCS-2 or UCS-4 are sometimes easier to process
internally. UTF-8 is also a great external format because it is so easy to
parse - as I just explained.

Mikkel

“pj” peterjohannsen@hotmail.com wrote in message
news:127ce4a9.0303100359.3291cffb@posting.google.com

Here is a url that talks about it:
http://www.anycities.com/gb18030/introduce.htm

In this format an ASCII-valued byte does not necessarily represent an ASCII character:

Single-byte: 0x00-0x7f
Two-byte: 0x81-0xfe + 0x40-0x7e, 0x80-0xfe
Four-byte: 0x81-0xfe + 0x30-0x39 + 0x81-0xfe + 0x30-0x39

Very unfortunate, since it makes it more difficult to parse than UTF-8.
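
A small illustration (using a Ruby that knows the GB18030 encoding - the byte
values are taken from the table above):

  gb = "\x81\x40".force_encoding("GB18030")  # one two-byte GB18030 character
  gb.valid_encoding?                         # => true
  gb.bytes                                   # => [129, 64] - and 64 is ASCII '@'

So a byte-oriented scanner looking for ASCII punctuation can be fooled here,
which never happens with UTF-8.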

Mikkel

But, wouldn’t you also want to deal with the string length method, and any
character methods (I dunno Ruby – does it have uppercase and stuff
like that, or iswhitespace) ?

Ruby only handles byte strings, so #length, #downcase, etc., are not
Unicode-aware. You can find some of the functionality in the library
‘unicode’, which works with UTF-8 strings.

http://raa.ruby-lang.org/list.rhtml?name=unicode
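
For example, the byte/character distinction the question is getting at
(a rough illustration; pack/unpack works the same way on old and new Rubies):

  s = "K\xC3\xB8benhavn"      # the UTF-8 bytes of "København"
  s.unpack("U*").size         # => 9   characters (codepoints)
  s.unpack("C*").size         # => 10  bytes - what a byte-string #length reports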