Inconsistent results using Iconv

Howdy. I'm working with Iconv and discovered that this code

  require "iconv"
  ["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
    puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
  end

generates different results depending on how the code is loaded.
"ægis" and "straße" convert fine no matter how they're called but
"kierkegård" and "joão" only convert correctly when loaded or typed or
pasted within an open instance of irb. Sadly, poor "bjørk" never
manages to get correctly converted. At the bottom of the email, I have
pasted the results of the various ways I've tried running this code.

Note: The last attempt I show below apparently _does_ work for a
friend of mine using Fedora 7. Just more weirdness? I dunno. My mind's
reeling. Time for a break. ;)

Thanks in advance,

RSL

Here are the various code runs...

rsl@sneaky ~ > irb
irb(main):001:0> load "omg.rb"
ægis => aegis
straße => strasse
kierkegård => kierkegard
joão => joao
bjørk => bj?rk

rsl@sneaky ~ > ./omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk

rsl@sneaky ~ > ruby omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk

rsl@sneaky ~ > irb omg.rb
omg.rb(main):001:0> #!/usr/bin/env ruby
omg.rb(main):002:0* require "iconv"
=> true
omg.rb(main):003:0> ["ægis", "straße", "kierkegård", "joão",
"bjørk"].each do |word|
omg.rb(main):004:1* puts "#{word} =>
#{Iconv.iconv('ascii//translit', 'utf-8', word)}"
omg.rb(main):005:1> end
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
=> ["\303\246gis", "stra\303\237e", "kierkeg\303\245rd",
"jo\303\243o", "bj\303\270rk"]
omg.rb(main):006:0> exit

rsl@sneaky ~ > cat omg.rb | irb
#!/usr/bin/env ruby
require "iconv"
true
["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
  puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
end
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
["\303\246gis", "stra\303\237e", "kierkeg\303\245rd", "jo\303\243o",
"bj\303\270rk"]
exit

I just realized that I didn't really ask a question there, did I?
Whoops. I'd like to know what I'm doing wrong, or perhaps just how to do
it right so that I get the same results every time. Do I need to manually
include another library that somehow isn't getting included in the other
ways I've tried? I'd really like to be able to count on my Iconv.iconv code
doing what I need all the time, but it seems I can't at the moment. :(

Here's hoping someone can help me solve this puzzle.

RSL

[Russell Norris <rsl@swimcommunity.org>, 2007-09-03 00.40 CEST]
[...]

Howdy. I'm working with Iconv and discovered that this code

  require "iconv"
  ["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
    puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
  end

[...]

rsl@sneaky ~ > ./omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk

Hi, Russell. Apparently the ASCII transliteration rules are defined in
locale data files, and not all locales define all of them (and some that
do define them do so differently). The resolution of this bug report
explains the situation a little more:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=376272

Good luck.
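
If the rules really do come from locale data, a first step is to compare which locale each way of running the script sees. Here is a minimal sketch of that check; the helper name effective_ctype_locale is made up for illustration, and it only reads the environment using the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG). Whether a given process actually applies those settings via setlocale is a separate question that this does not answer.

```ruby
# Hypothetical helper: report which environment variable would decide the
# character-handling locale (LC_CTYPE), using POSIX precedence:
# LC_ALL overrides LC_CTYPE, which overrides LANG.
def effective_ctype_locale(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return [name, value] unless value.nil? || value.empty?
  end
  # Nothing set: the C/POSIX locale is the default.
  ["(none)", "C"]
end

name, value = effective_ctype_locale
puts "LC_CTYPE is effectively #{value} (from #{name})"
```

Running this from the shell, from a script, and from inside irb would show whether the environment differs between them.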

RSL ___ wrote:
I just realized that I didn't really ask a question there, did I?
Whoops. I'd like to know what I'm doing wrong, or perhaps just how to do
it right so that I get the same results every time. Do I need to manually
include another library that somehow isn't getting included in the other
ways I've tried? I'd really like to be able to count on my Iconv.iconv code
doing what I need all the time, but it seems I can't at the moment. :(

Here's hoping someone can help me solve this puzzle.

RSL

Converting from UTF-8 to ISO-8859-1 seems to work, though (the source
file is encoded in UTF-8 and the character set encoding of Terminal.app
is set to ISO Latin 1, i.e. ISO-8859-1).

...

p "#{word} => #{Iconv.iconv('iso-8859-1//translit', 'utf-8', word)}"

...

=>
irb(main):020:0> load "omg.rb"
"\303\246gis => \346gis"
"stra\303\237e => stra\337e"
"kierkeg\303\245rd => kierkeg\345rd"
"jo\303\243o => jo\343o"
"bj\303\270rk => bj\370rk"
=> true

Cheers

j.k.

Carlos wrote:

[Russell Norris <rsl@swimcommunity.org>, 2007-09-03 00.40 CEST]
[...]

Howdy. I'm working with Iconv and discovered that this code

  require "iconv"
  ["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
    puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
  end

[...]

rsl@sneaky ~ > ./omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk

Hi, Russell. Apparently the ASCII transliteration rules are defined in
locale data files, and not all locales define all of them (and some that
do define them do so differently). The resolution of this bug report
explains the situation a little more:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=376272

Also, I get different results depending on the platform:

Ubuntu/iconv (GNU libc) 2.3.6:
   ruby -riconv -e 'puts Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", "caff\303\250")'
   => caff?

FreeBSD/iconv (GNU libiconv 1.9):
   ruby -riconv -e 'puts Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", "caff\303\250")'
   => caff`e

This doesn't explain why you get different results in irb and ruby, but it
does show how unreliable Iconv's //TRANSLIT can be. IMHO you'd be better
off using the Unicode gem if you want to decompose UTF-8 strings.
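
To illustrate the decomposition idea, here is a rough sketch that does not touch Iconv or the locale at all. It is not the Unicode gem's API: it assumes a much newer Ruby (2.2 or later), where String#unicode_normalize ships with the standard library. The to_ascii method and the SPECIAL_CASES table are made-up names, and the table only covers the letters from this thread.

```ruby
# Letters such as "ø", "æ" and "ß" have no canonical decomposition, so they
# need explicit replacements. Illustrative table, not an exhaustive one.
SPECIAL_CASES = {
  "æ" => "ae", "Æ" => "AE",
  "ø" => "o",  "Ø" => "O",
  "ß" => "ss"
}.freeze

# Hypothetical helper: decompose, drop the accents, map the leftovers.
def to_ascii(word)
  decomposed = word.unicode_normalize(:nfd)    # "å" becomes "a" + U+030A
  stripped   = decomposed.gsub(/\p{Mn}/, "")   # remove combining marks
  replaced   = stripped.gsub(/[æÆøØß]/, SPECIAL_CASES)
  # Anything still outside ASCII becomes "?", as Iconv printed above.
  replaced.encode("US-ASCII", undef: :replace, replace: "?")
end

%w[ægis straße kierkegård joão bjørk].each do |word|
  puts "#{word} => #{to_ascii(word)}"
end
```

Because nothing here consults locale data, the result should be the same whether the code is run with ruby, loaded in irb, or piped in.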

Daniel