Iconv transfer code

Pen_Ttt · 17 April 2010 05:22

On my computer (ubuntu9.1+ruby1.9):
pt@pt-laptop:~$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "[\"��˵\"]"

On my friend's computer (ubuntu9.1+ruby1.9):
$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "\316\322\313\265"
irb(main):003:0> puts Iconv.iconv('UTF-8', 'GBK', str).to_s
我说
=> nil

What's wrong with my system?

--
Posted via http://www.ruby-forum.com/.

Brian_Candler · 18 April 2010 09:22

Pen Ttt wrote:

On my computer (ubuntu9.1+ruby1.9):
pt@pt-laptop:~$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "[\"��˵\"]"

On my friend's computer (ubuntu9.1+ruby1.9):
$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "\316\322\313\265"
irb(main):003:0> puts Iconv.iconv('UTF-8', 'GBK', str).to_s
我说
=> nil

What's wrong with my system?

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That's even if the two
machines have identical versions of ruby and OS *and* you are feeding in
the same input data.

My advice is to stick with ruby 1.8.x, where the behaviour is both sane
and predictable. However there are other people who will vociferously
tell you that I am doing the entire ruby community a disservice by
recommending this to you. It's up to you whose advice to follow.

If you want to persevere with ruby 1.9, I suggest the following:

* Check you have exactly identical versions of 1.9 (check the
RUBY_DESCRIPTION constant) on both machines. The behaviour is subtle,
and a lot of it has changed.

* Look at str.bytes.to_a to see if the byte sequence is correct or not.
That is, the fact that irb displays the string wrongly or rightly
doesn't mean anything; don't trust what you see.

* Instead of using irb, write a .rb script, and run it from the command
line directly.

* Check the environments are the same on both. You could try
experimenting with setting LANG and/or LC_ALL environment variables
before starting ruby.

* I tried to understand how this all works, and I documented what I
found at string19/string19.rb at master · candlerb/string19 · GitHub

There are about 200 cases of encoding behaviour described there.
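The checks in the list above can be gathered into one small script and run on both machines, so the two outputs can be compared line by line. This is only a sketch: environment_report is a hypothetical helper name, and the set of values it collects is an assumption about what is worth comparing, not something prescribed in this thread.

```ruby
# Sketch: collect the facts worth comparing between two machines.
# environment_report is a hypothetical helper name, not part of Ruby.
def environment_report(env = ENV)
  internal = Encoding.default_internal
  {
    "ruby"             => RUBY_DESCRIPTION,
    "default_external" => Encoding.default_external.name,
    "default_internal" => internal ? internal.name : nil,
    "source_encoding"  => __ENCODING__.name,
    "LANG"             => env["LANG"],
    "LC_CTYPE"         => env["LC_CTYPE"],
    "LC_ALL"           => env["LC_ALL"]
  }
end

# Print one "key: value" line per entry; save this as a .rb file and run it
# from the command line rather than from irb.
environment_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

Any line that differs between the two machines is a candidate explanation for the different behaviour.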

Also, it's possible to do what you're trying to do in ruby 1.9 without
using Iconv, but instead tagging str with its correct encoding, and then
using encode! to convert it to another. Whether it appears correctly on
the terminal or not, especially within irb, is still not something to
trust. Again, use str.bytes.to_a to see if it is the expected sequence
of bytes in the new encoding.
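A minimal sketch of the encode-based approach described above, assuming a Ruby with String#encode (1.9 or later). It is not code from this thread: the variable names are illustrative, and the expected GBK bytes are the ones shown in the friend's irb session (\316\322\313\265 in octal is 206, 210, 203, 181 in decimal).

```ruby
# Sketch: convert UTF-8 text to GBK with String#encode instead of Iconv,
# then inspect the raw bytes rather than trusting the terminal display.

utf8 = "\u6211\u8BF4"               # "我说"; \u escapes always produce a UTF-8 string
gbk  = utf8.encode("GBK")           # transcodes; returns a new string tagged GBK

p gbk.encoding                      # the Encoding object for GBK
p gbk.bytes.to_a                    # [206, 210, 203, 181]

# If bytes arrive from outside with the wrong tag, fix the tag first
# (force_encoding changes the label, not the bytes), then transcode.
raw  = [206, 210, 203, 181].pack("C*")            # binary string holding GBK bytes
back = raw.force_encoding("GBK").encode("UTF-8")
p back == utf8                      # true
```

As a side note, Iconv.iconv returns an Array, and in Ruby 1.9 Array#to_s gives the same text as inspect. That would account for the "[\"...\"]" wrapper in the first session; the plain, octal-escaped result in the second session looks like 1.8 behaviour, though that is only a guess from the output shown.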

Good luck,

Brian.

Benoit_Daloze · 18 April 2010 14:18

Hi,

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That's even if the two
machines have identical versions of ruby and OS *and* you are feeding in
the same input data.

Please don't be so pessimistic without a real reason
(that said, please show some code that gives different results under the
conditions you described).

Maybe what you're describing is caused by different revisions, but that
also happened in 1.8, no?

* Look at str.bytes.to_a to see if the byte sequence is correct or not.

That is, the fact that irb displays the string wrongly or rightly
doesn't mean anything; don't trust what you see.

Yes, that's true; irb still often gets encodings wrong.

B.D.

JEG2 · 18 April 2010 14:31

I'm pretty sure that's true with Ruby 1.8 as well. For example, don't the encodings available to iconv vary depending on the platform?

James Edward Gray II

On Apr 18, 2010, at 4:22 AM, Brian Candler wrote:

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That's even if the two
machines have identical versions of ruby and OS *and* you are feeding in
the same input data.

Brian_Candler · 18 April 2010 17:06

Benoit Daloze wrote:

Please don't be so pessimist without real reason
(that said, show some code that has different result in the conditions
you said).

Sure. Here's a simple one:

  File.open("myfile.txt") do |f|
    line = f.gets
    line =~ /./
  end

You can run this script on two machines, with the same version of OS and
ruby and the same myfile.txt but with different environment variable
settings, and get it to crash on one but not the other. (One way: if the
default external encoding on one machine is US-ASCII and myfile.txt
contains any byte with the top bit set)
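The scenario above can be reproduced on a single machine. The sketch below is an approximation rather than the exact setup from the post: it sets Encoding.default_external directly as a stand-in for different LANG/LC_ALL settings, assumes no default internal encoding is configured, and uses a hypothetical helper named first_line_match and a temporary file.

```ruby
require "tmpdir"

# Sketch: run the same read-and-match code under two default external
# encodings. first_line_match is a hypothetical helper name.
def first_line_match(path)
  File.open(path) do |f|
    line = f.gets
    line =~ /./        # raises ArgumentError if line is not valid in its encoding
  end
rescue ArgumentError => e
  e.class
end

saved_external = Encoding.default_external
saved_verbose, $VERBOSE = $VERBOSE, nil   # silence the warning about changing defaults
results = {}

Dir.mktmpdir do |dir|
  path = File.join(dir, "myfile.txt")
  File.binwrite(path, "caf\xC3\xA9\n")    # UTF-8 text; two bytes have the top bit set

  [Encoding::US_ASCII, Encoding::UTF_8].each do |enc|
    Encoding.default_external = enc       # stands in for a different locale setting
    results[enc.name] = first_line_match(path)
  end
end

Encoding.default_external = saved_external
$VERBOSE = saved_verbose
p results
```

Under US-ASCII the match raises ArgumentError (invalid byte sequence), while under UTF-8 it returns the match position, even though the script and the file are identical in both runs.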

Maybe what you're describing is caused by different revisions, but that
happened also in 1.8, no?

This is intentional behaviour in ruby 1.9.

Brian_Candler · 18 April 2010 17:14

James Edward Gray II wrote:

On Apr 18, 2010, at 4:22 AM, Brian Candler wrote:

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That's even if the two
machines have identical versions of ruby and OS *and* you are feeding in
the same input data.

I'm pretty sure that's true with Ruby 1.8 as well. For example, don't
the encodings available to iconv vary depending on the platform?

Perhaps, but I was talking about an identical platform, O/S, and
installation of ruby - but different configured locale (such as LANG,
LC_CTYPE or LC_ALL environment variables)

Unless you write your ruby script defensively, it will behave
differently dependent on those environment settings when everything else
is identical.

JEG2 · 18 April 2010 20:17

So your main complaint is that Ruby honors the settings of your environment?

James Edward Gray II

On Apr 18, 2010, at 12:14 PM, Brian Candler wrote:

James Edward Gray II wrote:

On Apr 18, 2010, at 4:22 AM, Brian Candler wrote:

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That's even if the two
machines have identical versions of ruby and OS *and* you are feeding in
the same input data.

I'm pretty sure that's true with Ruby 1.8 as well. For example, don't
the encodings available to iconv vary depending on the platform?

Perhaps, but I was talking about an identical platform, O/S, and
installation of ruby - but different configured locale (such as LANG,
LC_CTYPE or LC_ALL environment variables)

Benoit_Daloze · 18 April 2010 21:02

On 18 April 2010 22:17, James Edward Gray II <james@graysoftinc.com> wrote:

So your main complaint is that Ruby honors the settings of your
environment?

James Edward Gray II

Beautiful, that one (I couldn't come up with a cool answer, so I waited for
somebody else's answer).

Yeah, I think it's normal that it saves in an encoding that depends on the
environment. And if you want something that doesn't depend on the
environment, there are many possibilities.

The easiest with File: File.open("myfile.ext", "w:UTF-8")

Brian_Candler · 19 April 2010 08:03

Perhaps, but I was talking about an identical platform, O/S, and
installation of ruby - but different configured locale (such as LANG,
LC_CTYPE or LC_ALL environment variables)

So your main complaint is that Ruby honors the settings of your
environment?

My complaints are listed at
string19/soapbox.rb at master · candlerb/string19 · GitHub - but I guess
the main one is what the OP saw. Same program, same data, same ruby,
different behaviour.

Normally when analysing a program you only need to look at the program
and its input, but ruby 1.9 has extra "hidden" input data in the form of
environment variables which can alter your program's behaviour, or not,
depending on the content of the input data as well.

I wonder how many Ruby users are fully aware of which environment
variables influence POSIX locales, and which ones take precedence over
the others?
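For reference, the usual POSIX lookup order for the character-type locale is LC_ALL first, then LC_CTYPE, then LANG, with empty values treated as unset. The sketch below encodes just that order; effective_ctype_locale is a hypothetical helper, and the "C" fallback reflects what implementations typically do rather than anything Ruby guarantees.

```ruby
# Sketch of the POSIX lookup order for the character-type locale:
# LC_ALL wins over LC_CTYPE, which wins over LANG; empty values count as unset.
# effective_ctype_locale is a hypothetical helper, not a Ruby or libc API.
def effective_ctype_locale(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return [name, value] unless value.nil? || value.empty?
  end
  ["(default)", "C"]   # with nothing set, implementations typically use the C locale
end

name, value = effective_ctype_locale
puts "character-type locale comes from #{name}: #{value}"
puts "Ruby's default external encoding: #{Encoding.default_external}"
```

Running this under different settings (for example with LC_ALL=C exported) shows which variable is actually in effect alongside the encoding Ruby picked up.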

I also note that there is an effort underway to standardise the Ruby
language definition, and this has chosen 1.8.7 as its baseline.

Brian_Candler · 19 April 2010 07:42

Benoit Daloze wrote:

The easiest with File: File.open("myfile.ext", "w:UTF-8")

This is a poor example of the point in question, although a good example
of how hard ruby 1.9 is to understand.

In fact: the default external encoding is nil for files opened for
write, and does not depend on the environment at all. That is,

File.open("myfile.ext","w") { |f| f.puts str }

just outputs whatever bytes are in str, without meddling with them.
Whereas

File.open("myfile.ext","w:UTF-8") { |f| f.puts str}

will attempt to re-encode str from its current encoding to UTF-8, and
may raise an exception if it cannot do so.

So if you want to write programs which don't crash, the first is
arguably better.

The rules for *reading* from files are completely different, and indeed
"r:UTF-8" is the right thing to do if you are reading from a file which
contains UTF-8 text and you don't want this to be affected by
environment variable magic.
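The reading side can be sketched as follows. As an assumption for illustration, Encoding.default_external is changed in-process to stand in for different locale settings, and the file name and contents are made up.

```ruby
require "tmpdir"

# Sketch: an explicit "r:UTF-8" pins the external encoding, so the result
# does not change when the default external encoding does.
saved_external = Encoding.default_external
saved_verbose, $VERBOSE = $VERBOSE, nil     # silence the warning about changing defaults
encodings = []

Dir.mktmpdir do |dir|
  path = File.join(dir, "myfile.ext")
  File.binwrite(path, "caf\xC3\xA9\n")      # UTF-8 bytes on disk

  [Encoding::US_ASCII, Encoding::UTF_8].each do |enc|
    Encoding.default_external = enc         # stands in for a different locale
    line = File.open(path, "r:UTF-8") { |f| f.gets }
    encodings << [line.encoding, line.valid_encoding?]
  end
end

Encoding.default_external = saved_external
$VERBOSE = saved_verbose
p encodings   # both entries report UTF-8 and a valid string
```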

botp1 · 19 April 2010 08:01

... File.open("myfile.ext","w:UTF-8") { |f| f.puts str}

will attempt to re-encode str from its current encoding to UTF-8, and
may raise an exception if it cannot do so.

good

So if you want to write programs which don't crash, the first is
arguably better.

we disagree there but what do you mean by "crash"?

best regards -botp

Brian_Candler · 19 April 2010 08:31

botp wrote:

So if you want to write programs which don't crash, the first is
arguably better.

we disagree there but what do you mean by "crash"?

I mean "raise an exception". The first example I wrote will never raise
an exception. The second can.

Code to demonstrate:

  str = "\xff"
  File.open("out1","w") { |f| f.puts str }
  File.open("out2","w:UTF-8") { |f| f.puts str }

Line 2 will never raise an exception, regardless of the content or the
encoding of str, and regardless of environment variable settings. It
just writes the string to the file.

Line 3 may raise an exception. It does in this particular program
because str has data tagged as ASCII-8BIT which cannot be transcoded to
UTF-8.
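A variant of the demonstration that rescues the exception and records what each mode did might look like this. It is a sketch under assumptions: a Unix-like system (no newline translation), no default internal encoding set, and Ruby 2.0 or later for String#b. The .b call is there because on newer Rubies a bare "\xff" literal in a UTF-8 source file is tagged UTF-8 rather than ASCII-8BIT, which would not reproduce the case described above.

```ruby
require "tmpdir"

# Sketch: write the same binary-tagged string through "w" and "w:UTF-8".
# On Ruby 1.9, use "\xff".force_encoding("ASCII-8BIT") instead of "\xff".b.
str = "\xff".b
outcome = {}

Dir.mktmpdir do |dir|
  %w[w w:UTF-8].each do |mode|
    path = File.join(dir, "out-#{mode.tr(':', '-')}")
    begin
      File.open(path, mode) { |f| f.puts str }
      outcome[mode] = File.binread(path).bytes.to_a   # bytes that reached the file
    rescue Encoding::UndefinedConversionError => e
      outcome[mode] = e.class                         # transcoding to UTF-8 failed
    end
  end
end

p outcome
```

With "w" the byte 0xFF and the newline are written untouched; with "w:UTF-8" the write fails with Encoding::UndefinedConversionError because the ASCII-8BIT byte has no UTF-8 mapping.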

JEG2 · 19 April 2010 15:46

That's grossly inaccurate. You may not have write permission to the file, the volume you are trying to place the file on may be out of space, etc.

These are more examples of how you could move the same code to a new machine and have it fail. Ignoring the environment that code runs in will not make it go away.

James Edward Gray II

On Apr 19, 2010, at 3:31 AM, Brian Candler wrote:

Code to demonstrate:

str = "\xff"
File.open("out1","w") { |f| f.puts str }
File.open("out2","w:UTF-8") { |f| f.puts str }

Line 2 will never raise an exception, regardless of the content or the
encoding of str, and regardless of environment variable settings. It
just writes the string to the file.

Brian_Candler · 19 April 2010 20:28

James Edward Gray II wrote:

Code to demonstrate:

str = "\xff"
File.open("out1","w") { |f| f.puts str }
File.open("out2","w:UTF-8") { |f| f.puts str }

Line 2 will never raise an exception, regardless of the content or the
encoding of str, and regardless of environment variable settings. It
just writes the string to the file.

That's grossly inaccurate. You may not have write permission to the
file, the volume you are trying to place the file on may be out of
space, etc.

Of course syscalls can fail due to insufficient resources and other
system-level problems. I'm talking about the normal flow of execution.

The point remains: Benoit said that one way to make your program immune
to influence from environment variables was to use
File.open("myfile.ext","w:UTF-8"). I was trying to highlight that this advice
is incorrect, because the regular File.open("myfile.ext","w") is immune
to environment variables already. Furthermore, "w:UTF-8" can crash in
the normal flow under more circumstances than "w" - and those
circumstances depend on string contents and encodings, which _can_ be
affected by environment variables.
