UTF-8 question

I’ve finally switched to UTF-8. It’s awesome. Now, if I can only find
a utility that draws ANSI-art with UNICODE’s line/block-drawing
characters ;-). (I guess it’d have to be UNICODE-art then ;-) My
problem, however, is: Ruby doesn’t seem to support UTF-8, which is bad
for me. I need it to parse incoming UTF-8 strings to convert them to
valid ISO9660 filenames. Anyone have any suggestions as to how to solve
this? Are there any good libraries?
nikolai

···


::: name: Nikolai Weibull :: aliases: pcp / lone-star :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,php,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

Try ‘iconv’, supplied with 1.8.0 or available from RAA as in the “shim”
module.

Note: this didn’t build on my FreeBSD system, even though I have libiconv
and iconv.h:

$ less ext/iconv/mkmf.log
have_header: checking for iconv.h… --------------------
gcc -E -I/home/brian/rubykit18/work/ruby-1.8.0 -I/home/brian/rubykit18/work/ruby-1.8.0 -g -O2 -o conftest.i conftest.c
conftest.c:1: iconv.h: No such file or directory
checked program was:
/* begin */
#include <iconv.h>
/* end */

I guess it doesn’t look in /usr/local/include unless told explicitly to do
so.

Regards,

Brian.

···

On Tue, Aug 12, 2003 at 11:31:53PM +0900, Nikolai Weibull wrote:

I’ve finally switched to UTF-8. It’s awesome. Now, if I can only find
a utility that draws ANSI-art with UNICODE’s line/block-drawing
characters ;-). (I guess it’d have to be UNICODE-art then ;-) My
problem, however, is: Ruby doesn’t seem to support UTF-8, which is bad
for me. I need it to parse incoming UTF-8 strings to convert them to
valid ISO9660 filenames. Anyone have any suggestions as to how to solve
this? Are there any good libraries?

Hi,

···

In message “UTF-8 question” on 03/08/12, Nikolai Weibull lone-star@home.se writes:

My problem, however, is: Ruby doesn’t seem to support UTF-8, which is bad
for me. I need it to parse incoming UTF-8 strings to convert them to
valid ISO9660 filenames. Anyone have any suggestions as to how to solve
this? Are there any good libraries?

As usual, my first answer is:

define “UTF-8 support” first.

Ruby does support UTF-8, mostly through its UTF-8 aware regex engine.
Besides, for conversion between encodings, you have the iconv module.

						matz.

I guess it doesn’t look in /usr/local/include unless told explicitly to do
so.

FYI, I just rebuilt using

CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure && make && make test && sudo make install

and now I have iconv.

Cheers,

Brian.

Hi,

My problem, however, is: Ruby doesn’t seem to support UTF-8, which is bad
for me. I need it to parse incoming UTF-8 strings to convert them to
valid ISO9660 filenames. Anyone have any suggestions as to how to solve
this? Are there any good libraries?

As usual, my first answer is:

define “UTF-8 support” first.

hm… I messed up. I was trying to do "hispañola".gsub(/\xf1/, 'n'), but
should have been doing "hispañola".gsub(/ñ/, 'n')

Ruby does support UTF-8, mostly through its UTF-8 aware regex engine.
Besides, for conversion between encodings, you have the iconv module.

Yeah, I found it. Too bad I couldn’t find the documentation for it (but
I guessed the method names and it works fine ;-) (principle of least
surprise at work, I guess ;-)
thanks,
nikolai

···

In message “UTF-8 question” on 03/08/12, Nikolai Weibull lone-star@home.se writes:



[courtesy cc of this posting sent to cited author via email]

In article 20030812155530.A58473@linnet.org,

Note: this didn’t build on my FreeBSD system, even though I have libiconv
and iconv.h:

On FreeBSD, kny-san has divided Ruby into several ports, including
ruby-iconv, ruby-gdbm and ruby-mode.el. You have to install all of them to
get the same as the Ruby distribution.

I guess it doesn’t look in /usr/local/include unless told explicitly to do
so.

That is correct, the builtin gcc doesn’t automatically look into
/usr/local.

···

Brian Candler B.Candler@pobox.com wrote:

Ollivier ROBERT -=- Eurocontrol EEC/ITM -=- roberto@eurocontrol.fr
Usenet Canal Historique FreeBSD: The Power to Serve!

Hi,

···

At Wed, 13 Aug 2003 00:09:39 +0900, Brian Candler wrote:

FYI, I just rebuilt using

CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure && make && make test && sudo make install

Alternatively,

./configure --with-iconv-dir=/usr/local && make

or

./configure && CONFIGURE_ARGS=--with-iconv-dir=/usr/local make


Nobu Nakada

Hi,

···

At Wed, 13 Aug 2003 01:34:41 +0900, Nikolai Weibull wrote:

hm… I messed up. I was trying to do "hispañola".gsub(/\xf1/, 'n'), but
should have been doing "hispañola".gsub(/ñ/, 'n')

Although "\xf1" looks like ISO-8859-1 rather than UTF-8, you have to
use the -Ku option on the command line or in the shebang line to write
literals in UTF-8.


Nobu Nakada

Although “\xf1” seems ISO-8859-1 instead of UTF-8, you have to
use -Ku option in command line or shebang to write literals in
UTF-8.

ruby -Ku -e 'puts "\xf1"'
doesn’t seem to work right (I get a box, not a squiggly n)

···



Right. In a UTF-8 file, the string “hispañola” doesn’t
contain the byte \xf1. It contains the UTF-8 encoding of the character
U+00F1, which is \xc3 \xb1. Ruby can examine the language settings of its
runtime environment, but has no way of doing so for the environment
in which the script was written. So it has no way of knowing what
character encoding a program file itself uses. The -Ku option tells Ruby
that the program file is written in UTF-8.
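
To make the two byte sequences concrete, here is a minimal sketch. It only uses Array#pack and String#unpack; the variable names are made up for illustration.

```ruby
# U+00F1 (ñ) written out as UTF-8 versus as a single raw byte.
utf8_bytes   = [0xF1].pack("U").unpack("C*")  # "U" emits the UTF-8 encoding of a code point
latin1_bytes = [0xF1].pack("C").unpack("C*")  # "C" emits exactly one byte

puts utf8_bytes.map { |b| "\\x%02x" % b }.join(" ")    # prints \xc3 \xb1
puts latin1_bytes.map { |b| "\\x%02x" % b }.join(" ")  # prints \xf1
```

So a pattern or tr set written as "\xf1" can only ever match the one-byte form, never the two-byte UTF-8 form.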

I believe it assumes ISO-8859-1 otherwise. I think it would also honor
a Unicode Byte Order Mark at the top of the file, but a BOM gets in the
way of the #!.

-Mark

···

On Wed, Aug 13, 2003 at 08:56:34AM +0900, nobu.nokada@softhome.net wrote:

Hi,

At Wed, 13 Aug 2003 01:34:41 +0900, Nikolai Weibull wrote:

hm… I messed up. I was trying to do "hispañola".gsub(/\xf1/, 'n'), but
should have been doing "hispañola".gsub(/ñ/, 'n')

Although "\xf1" seems ISO-8859-1 instead of UTF-8, you have to
use -Ku option in command line or shebang to write literals in
UTF-8.

Hi,

···

At Wed, 13 Aug 2003 09:15:13 +0900, Nikolai Weibull wrote:

Although “\xf1” seems ISO-8859-1 instead of UTF-8, you have to
use -Ku option in command line or shebang to write literals in
UTF-8.

ruby -Ku -e 'puts "\xf1"'
doesn’t seem to work right (I get a box, not a squiggly n)

I meant that a single "\xf1" byte is not valid in UTF-8. Do you want
to use ISO-8859 instead?


Nobu Nakada

That’s going the other way. -Ku tells Ruby to treat the source code
as UTF-8; it doesn’t tell it to generate UTF-8 from methods like
puts(). I think you have to do that explicitly with iconv (although
if there’s an easier way I’d love to learn about it):

ruby -riconv -e 'puts Iconv.iconv("UTF-8", "ISO-8859-1", "\xf1")'
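
For readers on a later Ruby than the 1.8 discussed here: the same Latin-1 to UTF-8 conversion can be sketched with String#encode. This assumes Ruby 1.9 or newer, where strings carry an encoding; it is not what the thread's participants had available.

```ruby
# Assumption: Ruby 1.9+ (String#force_encoding and String#encode exist).
latin1 = [0xF1].pack("C").force_encoding("ISO-8859-1")  # one byte, tagged as Latin-1
utf8   = latin1.encode("UTF-8")                         # transcode to UTF-8

p utf8.bytes.to_a  # the two bytes 0xC3 and 0xB1, shown in decimal
```

As with the iconv call, the source encoding has to be stated explicitly; Ruby cannot guess that a lone \xf1 byte was meant as Latin-1.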

···

On Wed, Aug 13, 2003 at 09:15:13AM +0900, Nikolai Weibull wrote:

ruby -Ku -e 'puts "\xf1"'
doesn’t seem to work right (I get a box, not a squiggly n)

Hi,

···

At Wed, 13 Aug 2003 10:05:14 +0900, Mark J. Reed wrote:

I believe it assumes ISO-8859-1 otherwise. I think it would also honor
a Unicode Byte Order Mark at the top of the file, but a BOM gets in the
way of the #!.

A BOM (byte order mark) sounds like nonsense in UTF-8, which is a
multibyte encoding. :-)


Nobu Nakada

Right. In a UTF-8 file, the string “hispañola” doesn’t
contain the byte \xf1. It contains the UTF-8 encoding of the character
U+00F1, which is \xc3 \xb1. Ruby can examine the language settings of its
runtime environment, but has no way of doing so for the environment
in which the script was written. So it has no way of knowing what
character encoding a program file itself uses. The -Ku option tells Ruby
that the program file is written in UTF-8.

OK. Thanks for the explanation. Now to the real issue:
What I want to be able to do is
input.tr!("\xf1", "n")
where input contains unicode characters. This doesn’t seem to be
possible. String is only for ISO-8859-1 strings? I feel I’m being
stupid here, but sadly I don’t know enough about UNICODE to be smart
about it (I hope that makes sense ;-). I guess I need to be tr’ing for
\xc3\xb1 but that won’t be possible.
nikolai

···



ruby -Ku -e 'puts "\xf1"'
doesn’t seem to work right (I get a box, not a squiggly n)

I meant that single “\xf1” is not valid in UTF-8, do you want
to use ISO-8859 instead?

ah, sorry, I misunderstood. Hm, yeah, you’re right, it’s not valid.
OK, so how would one make it valid? I get the number from VIM’s digraph
table (running under UTF-8).
nikolai

···



Hi,

···

At Wed, 13 Aug 2003 10:25:04 +0900, Nikolai Weibull wrote:

What I want to be able to do is
input.tr!("\xf1", "n")
where input contains unicode characters. This doesn’t seem to be
possible. String is only for ISO-8859-1 strings? I feel I’m being
stupid here, but sadly I don’t know enough about UNICODE to be smart
about it (I hope that makes sense ;-). I guess I need to be tr’ing for
\xc3\xb1 but that won’t be possible.

If it is really in UTF-8,

input.gsub!(/\xc3\xb1/u, "n")

or

input.gsub!(/ñ/, "n") # with -Ku option.


Nobu Nakada

Nobu Nakada:

A BOM (byte order mark) sounds like nonsense in UTF-8, which is a
multibyte encoding. :-)

While it does sound like nonsense, the byte sequence [0xEF, 0xBB, 0xBF],
which is the UTF-8 encoding of the BOM character U+FEFF, is often used
to identify files as UTF-8.
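
A rough sketch of checking for that signature and skipping it. The helper names utf8_bom? and strip_utf8_bom are hypothetical, and String#byteslice assumes Ruby 1.9.3 or newer.

```ruby
BOM_BYTES = [0xEF, 0xBB, 0xBF]  # U+FEFF encoded as UTF-8

# Hypothetical helper: true when the data starts with the three BOM bytes.
def utf8_bom?(data)
  data.unpack("C3") == BOM_BYTES
end

# Hypothetical helper: returns the data without a leading BOM, if there is one.
def strip_utf8_bom(data)
  utf8_bom?(data) ? data.byteslice(3, data.bytesize - 3) : data
end

sample = BOM_BYTES.pack("C*") + "puts 1"
p utf8_bom?(sample)
p strip_utf8_bom(sample)
```

The check works on raw bytes on purpose, so it does not depend on what encoding the string happens to be tagged with.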

Neil

If it is really in UTF-8,

input.gsub!(/\xc3\xb1/u, "n")

or

input.gsub!(/ñ/, "n") # with -Ku option.

yeah, this works. The problem is - I would prefer tr(), as I could then
list other characters as well. It’s a bitch having to give a long list
of gsub()'s.
nikolai

···



yeah, this works. The problem is - I would prefer tr(), as I could then
list other characters as well. It’s a bitch having to give a long list
of gsub()'s.

The problem is that Ruby Strings are still logically treated as
collections of bytes rather than collections of characters. So
String#tr only works for characters whose representation takes up a
single byte - which, in the case of UTF-8, means only the 7-bit
ASCII characters.
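
A small sketch of the byte versus character distinction, again using only pack and unpack (the variable names are illustrative):

```ruby
s = [0x68, 0x69, 0xF1].pack("U*")   # "hiñ", built from code points and encoded as UTF-8

byte_count = s.unpack("C*").length  # counts raw bytes
char_count = s.unpack("U*").length  # decodes UTF-8 and counts code points

puts "#{byte_count} bytes, #{char_count} characters"  # prints: 4 bytes, 3 characters
```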

To avoid the unsightly gsub chaining, you can write your own tr-like
method that takes an array of patterns (regexes or plain strings) and a
corresponding array of substitutions:

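class String
  # Replace each pattern in res with the entry at the same index in subs.
  # Patterns with no matching entry in subs are simply deleted.
  def trsub!(res, subs)
    res.each_with_index do |re, i|
      gsub!(re, subs[i] || '')
    end
    self
  end
end

s.trsub!(%w<ñ á é í ó ú ü ¡ ¿>, %w<n a e i o u u>)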
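
A variation on the same idea, as a sketch: instead of one gsub per character, build a single pattern from a lookup table and substitute in one pass. ASCII_FOR, PATTERN and asciify are made-up names, and this assumes Ruby 1.9 or newer with a UTF-8 source file.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical lookup table; extend it with whatever characters you need.
ASCII_FOR = {
  "ñ" => "n", "á" => "a", "é" => "e", "í" => "i",
  "ó" => "o", "ú" => "u", "ü" => "u", "¡" => "", "¿" => ""
}

# One alternation built from the keys, so the string is scanned only once.
PATTERN = Regexp.union(*ASCII_FOR.keys)

def asciify(str)
  str.gsub(PATTERN) { |match| ASCII_FOR[match] }
end

puts asciify("¿dónde está la señora?")  # prints: donde esta la senora?
```

Keeping the mapping in a hash also makes it harder for the two lists to drift out of step, which is easy to do with parallel arrays.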
···

On Wed, Aug 13, 2003 at 10:59:18AM +0900, Nikolai Weibull wrote:

Hi,

If it is really in UTF-8,

input.gsub!(/\xc3\xb1/u, "n")

or

input.gsub!(/ñ/, "n") # with -Ku option.

yeah, this works. The problem is - I would prefer tr(), as I could then
list other characters as well. It’s a bitch having to give a long list
of gsub()'s.

You can use jcode.rb.

$ cat n.rb
input = "hispañola"
input.tr!("ñ", "n")
p input
$ iconv -f iso-8859-1 -t utf-8 n.rb | ruby -Ku
“hispannola”
$ iconv -f iso-8859-1 -t utf-8 n.rb | ruby -Ku -rjcode
“hispanola”
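
For comparison, a sketch of the same thing on a later Ruby. This assumes Ruby 1.9 or newer, where String#tr works on characters rather than bytes and no extra library is needed; it does not apply to the 1.8 setup shown above.

```ruby
# -*- coding: utf-8 -*-
# Assumption: Ruby 1.9+ and a UTF-8 source file.
input = "hispañola"
result = input.tr("ñáéíóúü", "naeiouu")  # each character maps to the one at the same position
p result  # prints "hispanola"
```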

···

At Wed, 13 Aug 2003 10:59:18 +0900, Nikolai Weibull wrote:


Nobu Nakada