Ruby unicode/encoding support

Hello,

I searched the internet and the mailing list and realised this question has been asked many times, but I never found an answer that satisfied me.

I have an application that will read data (web pages) in different encodings (at least latin1, latin2, possibly cp1250; ideally I would like to support all possible encodings by letting the user give me a string… maybe I’m dreaming).
For now I only support latin2, but once I add utf-8 and other encodings, I’ll end up mixing all those encodings in one file…
I would like to convert these strings to unicode and from then on deal only with unicode.

From what I understand:

a/ character set conversion is not supported by ruby proper. There are
libraries (where???). I found uconv, but it seems to be more about utf8 and
Japanese encodings than latin1/2. I didn’t find anything else in
http://raa.ruby-lang.org/cat.rhtml?category_major=Library;category_minor=I18N
and
http://raa.ruby-lang.org/cat.rhtml?category_major=Library;category_minor=Text

maybe http://raa.ruby-lang.org/list.rhtml?name=codeconv but I see only binary,
euc (what’s that?), sjis and utf8 support.
I don’t think EUC is latin1/2, but I may be wrong.

For this conversion I could call “recode” from the command line, although it’s not
very elegant :O(

b/ anyway it’s a moot point, because ruby doesn’t handle unicode strings… at
least that’s what I understood. Am I wrong? Well, I read it should work, but
then from what I understood capitalize, size etc. don’t work. Once more, if I
understood correctly, there are libraries for this (one is at version 0.1, one
at 0.2, "may work").

I’m also worried after seeing this in an xterm -u8:
[emmanuel@emmanuels output]$ irb
irb(main):001:0> "a".size
=> 1
irb(main):002:0> "č".size
=> 2
irb(main):003:0>

maybe you can’t see the second character: it’s a latin2 letter (c with caron), and it’s just one
character…

so, is it as bad as it seems, doctor?

emmanuel

PS: thank you for reading this long mail, and as I said, I’m aware this was
asked many times, only I didn’t find anything clear yet…


“If there is any kind of God, it’s not in you or in me,
but in the space between us”
– Celine, “Before Sunrise”
(It’s not about what you do, it’s about what you give.)

Emmanuel Touzery wrote:

b/ anyway it’s a moot point, because ruby doesn’t handle unicode strings… at
least it’s what I understood. am I wrong? Well, i read it should work, but
then from what I understood capitalize, size etc don’t work. once more, if I
understood correctly, there are libraries for this (one is 0.1, one is 0.2
“may work”).

Use the ‘jcode’ library, and set the encoding to unicode with the -K
option (-Ku for UTF-8), or by setting the $KCODE variable. jcode overrides most of the
methods in class String so that they support the encoding you set with
the -K option. The exception is the length and size methods: it leaves
these untouched and instead gives you two new methods, jlength and jsize:

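$KCODE = "u"        # treat strings as UTF-8 (same as running ruby -Ku)
require 'jcode'     # adds jlength/jsize, makes chop! etc. multibyte-aware
a = "∂x/∂y"
puts a
puts a.length       # bytes
puts a.jlength      # characters
puts a.chop!        # removes one character, not one byte
puts a.chop!
puts a.length
puts a.jlength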

generates

∂x/∂y
9
5
∂x/∂
∂x/
5
3

(I don’t know if this e-mail will encode the above properly. If not, the
original string was "\xe2\x88\x82x/\xe2\x88\x82y", which is
∂x/∂y, where ∂ is U+2202, the partial differential sign.)
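
[Note for readers on a current Ruby: a rough equivalent of the jcode example above, as a sketch only. It assumes Ruby 1.9 or later, where every String carries its own encoding, so length counts characters and bytesize counts bytes, and jcode is no longer needed. byte_and_char_counts is a hypothetical helper name used just for this illustration.]

```ruby
# Assumption: Ruby 1.9+ (strings know their encoding; no $KCODE, no jcode).
# byte_and_char_counts is a made-up helper, not a library method.
def byte_and_char_counts(str)
  [str.bytesize, str.length]
end

s = "\u2202x/\u2202y"            # the same five-character string, in UTF-8
p byte_and_char_counts(s)        # [9, 5]  (bytes, characters)
p byte_and_char_counts(s.chop)   # [8, 4]  chop drops one character (3 bytes here)
```

So what jlength and jsize did under 1.6/1.8 is roughly what plain length and size do there, and the old byte count is available as bytesize.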

Cheers

Dave

Hi,

At Wed, 28 May 2003 23:06:37 +0900, Emmanuel Touzery wrote:

a/ the character set conversion is not supported from ruby proper. there are
libraries (where???). i found uconv, but it seems more utf8-japanese
encodings than latin1/2. i didn’t find anything else in
http://raa.ruby-lang.org/cat.rhtml?category_major=Library;category_minor=I18N
and
http://raa.ruby-lang.org/cat.rhtml?category_major=Library;category_minor=Text

iconv has been included in the standard library since 1.8, and in shim for 1.6, so I
removed it from RAA.


Nobu Nakada
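
[Note for readers on a current Ruby: in 1.8 the conversion asked about would go through iconv, as far as I know with a call of the form Iconv.conv(to, from, str). Iconv was later dropped from the standard library (Ruby 2.0), so the sketch below uses String#encode instead and assumes Ruby 2.0 or later. to_utf8 is a hypothetical helper name, and the sample byte is only an illustration.]

```ruby
# Assumption: Ruby 2.0+ (String#b, String#encode). Not the 1.8 Iconv API.
# to_utf8 is a made-up helper name for this sketch.
def to_utf8(raw_bytes, source_encoding)
  # Label the bytes with the encoding they were written in, then transcode.
  raw_bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

latin2_byte = "\xE8".b                    # 0xE8 is "č" in ISO-8859-2 (latin2)
puts to_utf8(latin2_byte, "ISO-8859-2")   # č
puts to_utf8(latin2_byte, "ISO-8859-1")   # è  (same byte, read as latin1)
```

The second call shows why the source encoding has to be known up front: the bytes alone do not say whether they are latin1 or latin2.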

Use the ‘jcode’ library, and set the encoding to unicode with the -K
option (or by setting the KCODE variable). jcode overrides most of the
methods in class String so that they support the encoding you set with
the -K option. The exception is the length and size methods: it leaves
these untouched and instead gives you two new methods, jlength and jsize:
[…]

And add ‘unicode’ and you can also compare, capitalize, etc…

irb(main):001:0> $KCODE='u'
=> "u"
irb(main):002:0> require 'jcode'
=> true
irb(main):003:0> a="č"
=> "č"
irb(main):004:0> a.upcase
=> "č"
irb(main):005:0> require 'unicode'
=> true
irb(main):006:0> Unicode.upcase a
=> "Č"
irb(main):007:0>
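
[Note for readers on a current Ruby: a small sketch of the same case conversion without the ‘unicode’ library. It assumes Ruby 2.4 or later, where String#upcase and String#capitalize apply Unicode case mapping by default; the :ascii option restricts them to a-z, which is what the plain a.upcase above did. The sample word is just an illustration.]

```ruby
# Assumption: Ruby 2.4+ (Unicode-aware case mapping built into String).
word = "\u010Daj"              # "čaj"

puts word.upcase               # ČAJ
puts word.capitalize           # Čaj
puts word.upcase(:ascii)       # čAJ  (only ASCII letters are changed)
```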

Quoting nobu.nokada@softhome.net, on Thu, May 29, 2003 at 07:45:09AM +0900:

iconv has been contained in 1.8 standard and shim for 1.6, so I
had removed it from RAA.

Maybe on the shim-ruby page you can mention iconv and utf8 and
multilingual so that it shows up when searching the archive?

Right now, if you were searching for utf8 support, you wouldn’t be able
to tell that it’s in the shim, and it’s really, really useful. If I hadn’t
found your iconv module, I would have been porting some C code I have
to ruby to do it myself…

Cheers,
Sam

Hi,

At Thu, 29 May 2003 10:02:52 +0900, Sam Roberts wrote:

iconv has been contained in 1.8 standard and shim for 1.6, so I
had removed it from RAA.

Maybe on the shim-ruby page you can mention iconv and utf8 and
multilingual so that it shows up when searching the archive?

The description certainly lacks a few things. But I’m not the
maintainer of it, and cannot update that page.


Nobu Nakada

First of all, I’m happy to read I was wrong and that ruby does have unicode
support as standard :O)

But all of this means that I can’t really make unicode optional in my
application… that would require a series of wrapper functions… so either I
go 100% unicode, or I don’t go at all :O(
I’ll see what I’ll do. I don’t need it that much after all…

I hope future versions of ruby will become more transparent wrt all of this (from
what I read it seems they will).

thanks for the quick and clear answers!

emmanuel

On Wednesday 28 of May 2003 18:18, Carlos wrote:

Use the ‘jcode’ library, and set the encoding to unicode with the -K
option (or by setting the KCODE variable). jcode overrides most of the
methods in class String so that they support the encoding you set with
the -K option. The exception is the length and size methods: it leaves
these untouched and instead gives you two new methods, jlength and jsize:

[…]

And add ‘unicode’ and you can also compare, capitalize, etc…

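irb(main):001:0> $KCODE='u'
=> "u"
irb(main):002:0> require 'jcode'
=> true
irb(main):003:0> a="č"
=> "č"
irb(main):004:0> a.upcase
=> "č"
irb(main):005:0> require 'unicode'
=> true
irb(main):006:0> Unicode.upcase a
=> "Č"
irb(main):007:0>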


“Droit devant soi, on ne peut pas aller bien loin”
- Le petit prince, Antoine de Saint Exupéry

“Carlos” angus@quovadis.com.ar wrote in message
news:20030528161716.GA4410@quovadis.com.ar…

Use the ‘jcode’ library, and set the encoding to unicode with the -K
option (or by setting the KCODE variable). jcode overrides most of the
methods in class String so that they support the encoding you set with
the -K option. The exception is the length and size methods: it leaves
these untouched and instead gives you two new methods, jlength and
jsize:

It seems that one has to set the encoding on a per-process basis. Is
this so? I find it a bit strange. IMHO the downside is that you either
can’t read from different sources with different encodings in one
application, or you have to ensure very carefully that no other thread is
reading / writing while the encoding is temporarily changed.

Personally I think Java has one of the best approaches taken here: all
strings consist of unicode characters and streams have an encoding
attached that is applied during reading and writing. Maybe this is not
the ideal solution for ruby. Maybe one should add a class UnicodeString
that supports encodings and conversions. What do others think?

Cheers

robert
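
[Note for readers on a current Ruby: the "encoding attached to the stream" idea described above is roughly what Ruby 1.9 and later ended up with. The sketch below assumes Ruby 2.1 or later (for Tempfile.create); read_as_utf8 is a hypothetical helper name and the temporary file only stands in for a real latin2 source.]

```ruby
require "tempfile"

# Assumption: Ruby 2.1+. read_as_utf8 is a made-up helper for this sketch.
def read_as_utf8(path, external_encoding)
  # Mode "r:EXT:INT": the file is stored as EXT, strings come back as INT.
  File.open(path, "r:#{external_encoding}:UTF-8") { |f| f.read }
end

Tempfile.create("latin2") do |tmp|
  tmp.binmode
  tmp.write("\xE8aj\n".b)        # "čaj" written as ISO-8859-2 bytes
  tmp.flush

  text = read_as_utf8(tmp.path, "ISO-8859-2")
  puts text                      # čaj
  puts text.encoding             # UTF-8
end
```

Because the encoding is given per open call rather than per process, two files in different encodings can be read side by side, with no global setting for other threads to trip over.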

“Robert Klemme” bob.news@gmx.net wrote in message news:bbf1bd$8nl22$1@ID-52924.news.dfncis.de

Personally I think Java has one of the best approaches taken here: all
strings consist of unicode characters and streams have an encoding
attached that is applied during reading and writing. Maybe this is not
the ideal solution for ruby. Maybe one should add a class UnicodeString
that supports encodings and conversions. What do others think?

I think you’re right. There was a time about 2 years ago when this
problem was quite widely discussed in the ruby community, but I think
everyone’s resigned themselves to the status quo now :-)

A unicodestring class would be nice (actually, I could have sworn
there was one already) but would there be any way to make it
seamlessly replace the existing string, so that string literals,
regexes, and File.readline etc would all do the right thing? There’s
been a (proper, not \uXXXX) unicode regex module available for ruby
for a long time, I think.

“Benjamin Peterson” bjsp123@imap.cc wrote in message
news:986d2608.0306030832.7e9dab83@posting.google.com

“Robert Klemme” bob.news@gmx.net wrote in message
news:bbf1bd$8nl22$1@ID-52924.news.dfncis.de

Personally I think Java has one of the best approaches taken here: all
strings consist of unicode characters and streams have an encoding
attached that is applied during reading and writing. Maybe this is not
the ideal solution for ruby. Maybe one should add a class UnicodeString
that supports encodings and conversions. What do others think?

I think you’re right. There was a time about 2 years ago when this
problem was quite widely discussed in the ruby community, but I think
everyone’s resigned themselves to the status quo now :-)

*sigh* Thanks for replying anyway.

A unicodestring class would be nice (actually, I could have sworn
there was one already) but would there be any way to make it
seamlessly replace the existing string, so that string literals,
regexes, and File.readline etc would all do the right thing?

Hm, of course there would be some work in the interpreter, in IO and its
subclasses, in Regexp and of course in String. But with a default
encoding it should be possible to leave most existing applications
untouched, I guess. Still, it is a major effort.

There’s
been a (proper, not \uXXXX) unicode regex module available for ruby
for a long time, I think.

Thanks!

Regards

robert