UTF-8 question

Nikolai_lone-star_We · 12 August 2003 14:31

I’ve finally switched to UTF-8. It’s awesome. Now, if I can only find
a utility that draws ANSI-art with UNICODE’s line/block-drawing
characters ;-). (I guess it’d have to be UNICODE-art then.) My
problem, however, is: Ruby doesn’t seem to support UTF-8, which is bad
for me. I need it to parse incoming UTF-8 strings to convert them to
valid ISO9660 filenames. Anyone have any suggestions as to how to solve
this? Are there any good libraries?
nikolai

···

–
::: name: Nikolai Weibull :: aliases: pcp / lone-star :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,php,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

Brian_Candler · 12 August 2003 14:55

Try ‘iconv’, supplied with 1.8.0 or available from RAA as part of the “shim”
module.

Note: this didn’t build on my FreeBSD system, even though I have libiconv
and iconv.h:

$ less ext/iconv/mkmf.log
have_header: checking for iconv.h… --------------------
gcc -E -I/home/brian/rubykit18/work/ruby-1.8.0 -I/home/brian/rubykit18/work/ruby-1.8.0 -g -O2 -o conftest.i conftest.c
conftest.c:1: iconv.h: No such file or directory
checked program was:
/* begin */
#include <iconv.h>
/* end */

I guess it doesn’t look in /usr/local/include unless told explicitly to do
so.

Regards,

Brian.

···

On Tue, Aug 12, 2003 at 11:31:53PM +0900, Nikolai Weibull wrote:

I’ve finally switched to UTF-8. It’s awesome. Now, if I can only find
a utility that draws ANSI-art with UNICODE’s line/block-drawing
characters ;-). (I guess it’d have to be UNICODE-art then.) My
problem, however, is: Ruby doesn’t seem to support UTF-8, which is bad
for me. I need it to parse incoming UTF-8 strings to convert them to
valid ISO9660 filenames. Anyone have any suggestions as to how to solve
this? Are there any good libraries?

Yukihiro_Matsumoto2 · 12 August 2003 15:51

Hi,

···

In message “UTF-8 question” on 03/08/12, Nikolai Weibull lone-star@home.se writes:

My problem, however, is: Ruby doesn’t seem to support UTF-8, which is bad
for me. I need it to parse incoming UTF-8 strings to convert them to
valid ISO9660 filenames. Anyone have any suggestions as to how to solve
this? Are there any good libraries?

As usual, my first answer is:

define “UTF-8 support” first.

Ruby does support UTF-8, mostly through its UTF-8-aware regex engine.
Besides, for conversion between encodings, you have the iconv module.

						matz.

Brian_Candler · 12 August 2003 15:09

I guess it doesn’t look in /usr/local/include unless told explicitly to do
so.

FYI, I just rebuilt using

CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure && make && make test && sudo make install

and now I have iconv.
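
One quick way to confirm that a rebuilt interpreter can load the extension is simply to try requiring it. This is only a minimal sketch: the variable name `iconv_available` is made up for illustration, and on an interpreter where iconv is not bundled the `LoadError` branch is the expected outcome.

```ruby
# Try to load the iconv extension and report whether it is available.
begin
  require "iconv"
  iconv_available = true
rescue LoadError
  iconv_available = false
end

puts(iconv_available ? "iconv loaded" : "iconv not available in this build")
```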

Cheers,

Brian.

Nikolai_lone-star_We · 12 August 2003 16:34

Yukihiro Matsumoto matz@ruby-lang.org [Aug, 12 2003 18:10]:

Hi,

My problem, however, is: Ruby doesn’t seem to support UTF-8, which is bad
for me. I need it to parse incoming UTF-8 strings to convert them to
valid ISO9660 filenames. Anyone have any suggestions as to how to solve
this? Are there any good libraries?

As usual, my first answer is:

define “UTF-8 support” first.

hm…I messed up. I was trying to do "hispañola".gsub(/\xf1/, 'n'), but
should have been doing "hispañola".gsub(/ñ/, 'n')

Ruby does support UTF-8, mostly through its UTF-8-aware regex engine.
Besides, for conversion between encodings, you have the iconv module.

Yeah, I found it. Too bad I couldn’t find the documentation for it (but
I guessed the method names and it works fine; principle of least
surprise at work, I guess).
thanks,
nikolai

···

In message “UTF-8 question” on 03/08/12, Nikolai Weibull lone-star@home.se writes:


Ollivier_Robert5 · 15 August 2003 21:53

[courtesy cc of this posting sent to cited author via email]

In article 20030812155530.A58473@linnet.org,

Note: this didn’t build on my FreeBSD system, even though I have libiconv
and iconv.h:

On FreeBSD, kny-san has divided Ruby into several ports, including
ruby-iconv, ruby-gdbm and ruby-mode.el. You have to install all of them to
get the equivalent of the full Ruby distribution.

I guess it doesn’t look in /usr/local/include unless told explicitly to do
so.

That is correct, the built-in gcc doesn’t automatically look in
/usr/local.

···

Brian Candler B.Candler@pobox.com wrote:

Ollivier ROBERT -=- Eurocontrol EEC/ITM -=- roberto@eurocontrol.fr
Usenet Canal Historique FreeBSD: The Power to Serve!

Nobuyoshi_Nakada · 12 August 2003 23:51

Hi,

···

At Wed, 13 Aug 2003 00:09:39 +0900, Brian Candler wrote:

FYI, I just rebuilt using

CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure && make && make test && sudo make install

Alternatively,

./configure --with-iconv-dir=/usr/local && make

or

./configure && CONFIGURE_ARGS=--with-iconv-dir=/usr/local make

–
Nobu Nakada

Nobuyoshi_Nakada · 12 August 2003 23:56

Hi,

···

At Wed, 13 Aug 2003 01:34:41 +0900, Nikolai Weibull wrote:

hm…I messed up. I was trying to do "hispañola".gsub(/\xf1/, 'n'), but
should have been doing "hispañola".gsub(/ñ/, 'n')

Although "\xf1" looks like ISO-8859-1 rather than UTF-8, you have to
use the -Ku option on the command line or in the shebang to write
literals in UTF-8.

–
Nobu Nakada

Nikolai_lone-star_We · 13 August 2003 00:15

nobu.nokada@softhome.net nobu.nokada@softhome.net [Aug, 13 2003 02:00]:

Although "\xf1" looks like ISO-8859-1 rather than UTF-8, you have to
use the -Ku option on the command line or in the shebang to write
literals in UTF-8.

ruby -Ku -e 'puts "\xf1"'
doesn’t seem to work right (I get a box, not a squiggly n)


Mark_J_Reed · 13 August 2003 01:05

Right. In a UTF-8 file, the string “hispañola” doesn’t
contain the byte \xf1. It contains the UTF-8 encoding of the character
U+00F1, which is \xc3 \xb1. Ruby can examine the language settings of its
runtime environment, but has no way of doing so for the environment
in which the script was written. So it has no way of knowing what
character encoding a program file itself uses. The -Ku option tells Ruby
that the program file is written in UTF-8.

I believe it assumes ISO-8859-1 otherwise. I think it would also honor
a Unicode Byte Order Mark at the top of the file, but a BOM gets in the
way of the #!.
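
The byte sequence mentioned above can be checked directly. The following is a small sketch using only `Array#pack` and `String#unpack`; the variable names are illustrative.

```ruby
# Build the single character U+00F1 as a UTF-8 string, then list its bytes.
n_tilde = [0x00F1].pack("U")
bytes = n_tilde.unpack("C*")

# Print each byte in the \xNN notation used in this thread.
puts bytes.map { |b| format("\\x%02x", b) }.join(" ")   # prints \xc3 \xb1
```

So one character in the editor corresponds to two bytes in the file, which is why a pattern for the single byte \xf1 never matches.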

-Mark

···

On Wed, Aug 13, 2003 at 08:56:34AM +0900, nobu.nokada@softhome.net wrote:

Hi,

At Wed, 13 Aug 2003 01:34:41 +0900, Nikolai Weibull wrote:

hm…I messed up. I was trying to do "hispañola".gsub(/\xf1/, 'n'), but
should have been doing "hispañola".gsub(/ñ/, 'n')

Although "\xf1" looks like ISO-8859-1 rather than UTF-8, you have to
use the -Ku option on the command line or in the shebang to write
literals in UTF-8.

Nobuyoshi_Nakada · 13 August 2003 00:38

Hi,

···

At Wed, 13 Aug 2003 09:15:13 +0900, Nikolai Weibull wrote:

Although "\xf1" looks like ISO-8859-1 rather than UTF-8, you have to
use the -Ku option on the command line or in the shebang to write
literals in UTF-8.

ruby -Ku -e 'puts "\xf1"'
doesn’t seem to work right (I get a box, not a squiggly n)

I meant that a single "\xf1" is not valid UTF-8. Do you want
to use ISO-8859 instead?
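
To see why a lone \xf1 is rejected, here is a hedged sketch that assumes a later Ruby (1.9 or newer) where strings carry an encoding; `force_encoding` and `valid_encoding?` do not exist in 1.8, so this is not something the posters could have run at the time.

```ruby
# Assumption: Ruby 1.9+ (encoding-aware strings).
lone_byte = "\xf1".dup.force_encoding("UTF-8")      # ISO-8859-1 byte for n-tilde
two_bytes = "\xc3\xb1".dup.force_encoding("UTF-8")  # UTF-8 bytes for U+00F1

puts lone_byte.valid_encoding?   # false
puts two_bytes.valid_encoding?   # true
```

In UTF-8, 0xF1 is the lead byte of a multi-byte sequence, so on its own it is incomplete rather than a character.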

–
Nobu Nakada

Mark_J_Reed · 13 August 2003 01:05

That’s going the other way. -Ku tells Ruby to treat the source code
as UTF-8; it doesn’t tell it to generate UTF-8 from methods like
puts(). I think you have to do that explicitly with iconv (although
if there’s an easier way I’d love to learn about it):

ruby -riconv -e 'puts Iconv.iconv("UTF-8", "ISO-8859-1", "\xf1")'
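
For readers on a Ruby where the iconv extension is no longer bundled, roughly the same conversion can be sketched with `String#encode`. This assumes Ruby 1.9 or newer and is offered as an equivalent, not as what was available in 1.8.0; the variable names are illustrative.

```ruby
# Assumption: Ruby 1.9+. The second argument names the encoding of the input bytes.
latin1 = "\xf1"                                  # n-tilde as a single ISO-8859-1 byte
utf8 = latin1.encode("UTF-8", "ISO-8859-1")

# Show the converted bytes in the \xNN notation used in this thread.
puts utf8.bytes.map { |b| format("\\x%02x", b) }.join(" ")   # prints \xc3 \xb1
```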

···

On Wed, Aug 13, 2003 at 09:15:13AM +0900, Nikolai Weibull wrote:

ruby -Ku -e 'puts "\xf1"'
doesn’t seem to work right (I get a box, not a squiggly n)

Nobuyoshi_Nakada · 13 August 2003 01:22

Hi,

···

At Wed, 13 Aug 2003 10:05:14 +0900, Mark J. Reed wrote:

I believe it assumes ISO-8859-1 otherwise. I think it would also honor
a Unicode Byte Order Mark at the top of the file, but a BOM gets in the
way of the #!.

A BOM (byte order mark) sounds like nonsense in UTF-8, which is a
multibyte encoding.

–
Nobu Nakada

Nikolai_lone-star_We · 13 August 2003 01:25

Mark J. Reed markjreed@mail.com [Aug, 13 2003 03:10]:

Right. In a UTF-8 file, the string “hispañola” doesn’t
contain the byte \xf1. It contains the UTF-8 encoding of the character
U+00F1, which is \xc3 \xb1. Ruby can examine the language settings of its
runtime environment, but has no way of doing so for the environment
in which the script was written. So it has no way of knowing what
character encoding a program file itself uses. The -Ku option tells Ruby
that the program file is written in UTF-8.

OK. Thanks for the explanation. Now to the real issue:
What I want to be able to do is
input.tr!("\xf1", "n")
where input contains unicode characters. This doesn’t seem to be
possible. String is only for ISO-8859-1 strings? I feel I’m being
stupid here, but sadly I don’t know enough about UNICODE to be smart
about it (I hope that makes sense ;-). I guess I need to be tr’ing for
\xc3\xb1 but that won’t be possible.
nikolai


Nikolai_lone-star_We · 13 August 2003 00:58

nobu.nokada@softhome.net nobu.nokada@softhome.net [Aug, 13 2003 02:40]:

ruby -Ku -e 'puts "\xf1"'
doesn’t seem to work right (I get a box, not a squiggly n)

I meant that a single "\xf1" is not valid UTF-8. Do you want
to use ISO-8859 instead?

ah, sorry, I misunderstood. Hm, yeah, you’re right, it’s not valid.
OK, so how would one make it valid? I get the number from VIM’s digraph
table (running under UTF-8).
nikolai


Nobuyoshi_Nakada · 13 August 2003 01:42

Hi,

···

At Wed, 13 Aug 2003 10:25:04 +0900, Nikolai Weibull wrote:

What I want to be able to do is
input.tr!("\xf1", "n")
where input contains unicode characters. This doesn’t seem to be
possible. String is only for ISO-8859-1 strings? I feel I’m being
stupid here, but sadly I don’t know enough about UNICODE to be smart
about it (I hope that makes sense ;-). I guess I need to be tr’ing for
\xc3\xb1 but that won’t be possible.

If it is really in UTF-8,

input.gsub!(/\xc3\xb1/u, "n")

or

input.gsub!(/ñ/, "n") # with -Ku option.

–
Nobu Nakada

Neil_Hodgson · 13 August 2003 13:06

Nobu Nakada:

A BOM (byte order mark) sounds like nonsense in UTF-8, which is a
multibyte encoding.

While it does sound like nonsense, the sequence of bytes [0xEF, 0xBB,
0xBF], which is the UTF-8 rendering of the BOM character U+FEFF, is often
used to identify files as UTF-8.
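
The three bytes can be derived rather than memorised. In this sketch the helper name `utf8_bom?` is hypothetical and exists only for illustration; it is not part of Ruby.

```ruby
# Encode U+FEFF as UTF-8 and show the resulting bytes.
bom_bytes = [0xFEFF].pack("U").unpack("C*")
puts bom_bytes.map { |b| format("0x%02X", b) }.join(", ")   # prints 0xEF, 0xBB, 0xBF

# Hypothetical helper: true if the raw data starts with those three bytes.
def utf8_bom?(data)
  data.unpack("C3") == [0xEF, 0xBB, 0xBF]
end

puts utf8_bom?("\xEF\xBB\xBFputs 1")   # true
puts utf8_bom?("puts 1")               # false
```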

Neil

Nikolai_lone-star_We · 13 August 2003 01:59

nobu.nokada@softhome.net nobu.nokada@softhome.net [Aug, 13 2003 03:50]:

If it is really in UTF-8,

input.gsub!(/\xc3\xb1/u, "n")

or

input.gsub!(/ñ/, "n") # with -Ku option.

yeah, this works. The problem is - I would prefer tr(), as I could then
list other characters as well. It’s a bitch having to give a long list
of gsub()'s.
nikolai


Mark_J_Reed · 13 August 2003 13:47

yeah, this works. The problem is - I would prefer tr(), as I could then
list other characters as well. It’s a bitch having to give a long list
of gsub()'s.

The problem is that Ruby Strings are still logically treated as
collections of bytes rather than collections of characters. So
String#tr only works for characters whose representation takes up a
single byte - which, in the case of UTF-8, means only the 7-bit
ASCII characters.

To avoid the unsightly gsub chaining, you can write your own tr-like
method that takes an array of regexes and a corresponding array
of substitutions:

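class String
    # Apply each pattern in res in turn, replacing every match with the
    # entry at the same index in subs (or deleting it if there is none).
    def trsub!(res, subs)
        res.each_with_index do |re, i|
            gsub!(re, subs[i] || '')
        end
        self
    end
end

# ¡ and ¿ have no counterpart in the second list, so they are removed.
s.trsub!(%w<ñ á é í ó ú ü ¡ ¿>, %w<n a e i o u u>)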
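
A variation on the same idea is a single pass with one regex and a lookup table, which avoids rescanning the string once per character. This is a sketch under assumptions: it needs Ruby 1.9 or newer so that matching is per character, and the names `FOLD`, `FOLD_RE` and `fold_accents` are invented for illustration. The characters are written as `\u` escapes so the snippet does not depend on the encoding of the source file.

```ruby
# Assumption: Ruby 1.9+ (character-aware regexes and \u escapes).
# Keys: n-tilde, a/e/i/o/u-acute, u-diaeresis, inverted ! and inverted ?.
FOLD = {
  "\u00f1" => "n", "\u00e1" => "a", "\u00e9" => "e", "\u00ed" => "i",
  "\u00f3" => "o", "\u00fa" => "u", "\u00fc" => "u",
  "\u00a1" => "",  "\u00bf" => ""
}
FOLD_RE = Regexp.union(FOLD.keys)

# Replace every listed character with its ASCII stand-in in one pass.
def fold_accents(str)
  str.gsub(FOLD_RE) { |ch| FOLD[ch] }
end

puts fold_accents("\u00bfhispa\u00f1ola?")   # prints hispanola?
```

Adding another character then means adding one hash entry rather than another gsub call.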

···

On Wed, Aug 13, 2003 at 10:59:18AM +0900, Nikolai Weibull wrote:

Nobuyoshi_Nakada · 15 August 2003 04:51

Hi,

If it is really in UTF-8,

input.gsub!(/\xc3\xb1/u, "n")

or

input.gsub!(/ñ/, "n") # with -Ku option.

yeah, this works. The problem is - I would prefer tr(), as I could then
list other characters as well. It’s a bitch having to give a long list
of gsub()'s.

You can use jcode.rb.

$ cat n.rb
input = "hispañola"
input.tr!("ñ", "n")
p input
$ iconv -f iso-8859-1 -t utf-8 n.rb | ruby -Ku
“hispannola”
$ iconv -f iso-8859-1 -t utf-8 n.rb | ruby -Ku -rjcode
“hispanola”
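
The difference between the two outputs above can be reproduced on a later Ruby without jcode. This is a sketch that assumes Ruby 2.0 or newer (for `String#b`); it imitates the byte-wise behaviour with a binary copy of the string rather than running 1.8 itself.

```ruby
# Assumption: Ruby 2.0+ (String#b and encoding-aware String#tr).
input = "hispa\u00f1ola"   # n-tilde is stored as the two bytes \xc3 \xb1

# Byte-wise view, similar in spirit to 1.8 without jcode: each of the two
# bytes is translated on its own, so two n's come out.
bytewise = input.b.tr("\xc3\xb1".b, "n")
puts bytewise              # prints hispannola

# Character-wise view: the two bytes are treated as one character.
charwise = input.tr("\u00f1", "n")
puts charwise              # prints hispanola
```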

···

At Wed, 13 Aug 2003 10:59:18 +0900, Nikolai Weibull wrote:

–
Nobu Nakada
