Ruby 1.9.2 UTF-8 Encoding issues while reading/writing files

Atoli_Atoli · 18 November 2010 03:31

Hello

My Ruby 1.9.2 does some strange things when manipulating encodings,
especially when reading and writing text files.
I have a file (attached) with broken UTF-8 characters: E5 AD 97 E6 99 2E
The offending sequence is E6 99 (the start of a three-byte character
whose third byte is missing).

And here's the example I'm using to try to figure out how the heck
this encoding stuff works:


#--------------------------------------
data = File.open("broken.txt", "r:UTF-8") { |f| f.read }
puts data.valid_encoding?

utf_a = data.encode("UTF-8", invalid: :replace, undef: :replace, replace: "_")
puts utf_a.valid_encoding?

utf_b = data.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "_")
puts utf_b.valid_encoding?
puts (utf_a == utf_b) && (data == utf_b)

File.open("valid.txt", "w:UTF-8") { |f| f.write(utf_a) }
#--------------------------------------

The output is:
false
false
true
true

Basically I'm trying to replace the broken sequences with "_", but the
encode method doesn't seem to do any replacements, maybe because the
forced encoding is already set to UTF-8?

I've read James Edward's post on string encodings in Ruby 1.9 and
also candlerb's doc, but didn't find anything.

This can't be so complicated, right? Surely I'm missing something.

Thank you.

Atoli.

Attachments:
http://www.ruby-forum.com/attachment/5415/broken.txt

--
Posted via http://www.ruby-forum.com/.

Brabuhr · 18 November 2010 05:04

For the case of fixing broken files, I would probably use iconv from the shell:

$ cat broken.txt
字?.
$ iconv -f UTF8 -t UTF8 --byte-subst=_ broken.txt
字__.

(I don't know if Ruby's Iconv module supports the subst options.)

For short strings, this seems to work:

irb(main):001:0> s = "\xE5\xAD\x97\xE6\x99\x2E"
=> "字\xE6\x99."
irb(main):002:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> s.valid_encoding?
=> false
irb(main):004:0> t = s.chars.map{|c| c.valid_encoding? ? c : '_'}.join
=> "字__."
irb(main):005:0> t.valid_encoding?
=> true
irb(main):006:0> t.encoding
=> #<Encoding:UTF-8>
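
The same per-character check can be applied to a whole file. What follows is only a minimal sketch: the helper name replace_invalid_chars and the temporary file are made up for illustration, and the bytes are the ones quoted in the original post.

```ruby
require "tempfile"

# Hypothetical helper (not part of Ruby): replace every character that is
# not valid in the string's own encoding with a substitute string.
def replace_invalid_chars(str, substitute = "_")
  str.chars.map { |c| c.valid_encoding? ? c : substitute }.join
end

# Recreate the broken file from the original post: E5 AD 97 E6 99 2E
broken = Tempfile.new("broken")
broken.binmode
broken.write([0xE5, 0xAD, 0x97, 0xE6, 0x99, 0x2E].pack("C*"))
broken.close

# Read it tagged as UTF-8, then rebuild the string character by character.
data  = File.open(broken.path, "r:UTF-8") { |f| f.read }
fixed = replace_invalid_chars(data)

puts data.valid_encoding?   # false
puts fixed.valid_encoding?  # true
```

Since this builds an array with one element per character, it is probably better suited to small files than to very large ones.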


Atoli_Atoli · 18 November 2010 13:54

Thanks for the tip.

For now, making Ruby "think" the encoding is valid seems to work (it
doesn't break regular expressions, at least).
So I just call
encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "_")
each time I read my files.
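
A variant that does not depend on how a UTF-8 to UTF-8 conversion is treated is to round-trip through a different encoding, so that the transcoder has to inspect every byte. This is only a sketch under assumptions: the helper name read_utf8_lenient is made up, and UTF-16BE is just one possible intermediate encoding.

```ruby
require "tempfile"

# Hypothetical helper (not part of Ruby): read a file as UTF-8 and replace
# invalid byte sequences by transcoding to UTF-16BE and back again.
def read_utf8_lenient(path, substitute = "_")
  raw = File.open(path, "rb") { |f| f.read }
  raw.force_encoding("UTF-8")
     .encode("UTF-16BE", invalid: :replace, undef: :replace, replace: substitute)
     .encode("UTF-8")
end

# Recreate the broken file from the original post: E5 AD 97 E6 99 2E
broken = Tempfile.new("broken")
broken.binmode
broken.write([0xE5, 0xAD, 0x97, 0xE6, 0x99, 0x2E].pack("C*"))
broken.close

clean = read_utf8_lenient(broken.path)
puts clean.valid_encoding?  # true
puts clean.encoding         # UTF-8
```

The number of substitutes produced for one broken sequence may differ from the per-character approach, so it is worth checking the result against real data. On Ruby 2.1 and later, String#scrub covers the same need directly.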

