Inconsistent IO character reading when converting encoding

In Ruby 1.9.3-p429, I am trying to parse plain-text files in various
encodings, converting them to UTF-8 strings as they are read. Non-ASCII
characters come through fine when the file is encoded as UTF-8, but
problems come up with non-UTF-8 files.

Simplified example:

File.open(file) do |io|
  # Transcode from the file's charset to UTF-8 while reading
  io.set_encoding("#{charset.upcase}:#{Encoding::UTF_8}")
  line, char = "", nil

  # Read one character at a time until EOF or an end-of-line character
  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "SLICE FAIL" unless char == char.slice(0, 1)

    line << char
  end
  line
end

Both test files contain just the single string áÁð, encoded
appropriately. I have verified that the files are encoded correctly via
"$ file -i <file_name>".

With a UTF-8 file, I get back:
Character á has 1 codepoints
Character Á has 1 codepoints
Character ð has 1 codepoints

With an ISO-8859-1 file:
Character á has 2 codepoints
SLICE FAIL
Character Á has 2 codepoints
SLICE FAIL
Character ð has 2 codepoints
SLICE FAIL

The way I am interpreting this is that readchar is returning an
incorrectly converted string, which in turn causes slice to return the
wrong result.
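
To isolate the conversion itself, running the same bytes through
String#encode outside of IO produces the expected single-codepoint
characters (a sketch, using the test file generated above):

# encoding: utf-8
raw  = File.read("test_latin1.txt", mode: "rb")   # raw bytes, no transcoding
utf8 = raw.force_encoding("ISO-8859-1").encode("UTF-8")
utf8.each_char { |c| puts "Character #{c} has #{c.each_codepoint.count} codepoints" }
# => each character reports 1 codepoint

This suggests the conversion rules themselves are fine and that the
problem is specific to reading through the IO layer.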

Is this behavior correct, or am I specifying the file's external
encoding incorrectly? I would rather not rewrite this process, so I am
hoping I am making a mistake somewhere. There are reasons why I am
parsing files character by character, but I don't think they are
relevant to my question. Specifying the internal and external encodings
as options to File.open yielded the same results.
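
By that I mean variants along these lines (a sketch; charset is assumed
to hold the detected charset name, as above):

# Encodings embedded in the mode string:
File.open(file, "r:#{charset.upcase}:#{Encoding::UTF_8}") do |io|
  # ... same readchar loop as above ...
end

# Equivalent explicit options:
File.open(file, mode: "r",
                external_encoding: charset.upcase,
                internal_encoding: Encoding::UTF_8) do |io|
  # ... same readchar loop as above ...
end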
