Invalid byte sequence in UTF-8 (ArgumentError) - Ruby - how to handle invalid bytes at runtime

I have data in a file like the line below:

    Cote 0.56 0.6 0.71 0.93 0.08 0.21 0.98 0.96 CÙte d'Ivoire CÙte d'Ivoire 20.15 0.002 0.002 0.003 0

The problem is caused by the `Ù`.

I was getting an exception every time.

So I came up with the code below:

    File.open('/home/kirti/workspace/Ruby/project_free/data/2013_diamonds.txt') do |file|
      file.readlines.map do |i|
        begin
          i = i.gsub(/[\u0022]/,'')
        rescue
          p $!
          p i
          p i.encode('UTF-16', :invalid => :replace, :replace => '').encode('utf-8')
        end
      end
    end

    # >> #<ArgumentError: invalid byte sequence in UTF-8>
    # >> "Cote\t0.56\t0.6\t0.71\t0.93\t0.08\t0.21\t0.98\t0.96\tC\xD9te d'Ivoire\t\tC\xD9te d'Ivoire\t20.15\t0.002\t0.002\t0.003\t0\r\n"
    # >> "Cote\t0.56\t0.6\t0.71\t0.93\t0.08\t0.21\t0.98\t0.96\tCte d'Ivoire\t\tCte d'Ivoire\t20.15\t0.002\t0.002\t0.003\t0\r\n"

Now the problem: the line `i.encode('UTF-16', :invalid => :replace,
:replace => '').encode('utf-8')` handles the error, but it replaces
each invalid byte with `""`. As you can see, I got `Cte
d'Ivoire\t\tCte..`, where the character `Ù` is missing. Can some logic
be put in place of `...:replace => ''`? Instead of `""`, I am looking
for the character that caused the error to be substituted dynamically,
so that after some processing the replacement is that `Ù` or any ...
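
For reference, a minimal sketch of the per-byte replacement asked about
here. It assumes Ruby 2.1 or later (for `String#scrub`) and assumes the
stray bytes are ISO-8859-1; the helper name `repair_invalid_bytes` and
its `fallback` parameter are made up for illustration.

```ruby
# Hypothetical helper: re-read each invalid byte sequence as ISO-8859-1.
# Assumes Ruby 2.1+ (String#scrub) and that the stray bytes are Latin-1.
def repair_invalid_bytes(line, fallback = Encoding::ISO_8859_1)
  line.scrub do |bad|
    # `bad` holds only the bytes that are not valid in line's encoding.
    bad.dup.force_encoding(fallback).encode(Encoding::UTF_8)
  end
end

# The bytes from the post, labelled UTF-8 (as happens when the file is
# read with a UTF-8 default external encoding).
line = "C\xD9te d'Ivoire".b.force_encoding(Encoding::UTF_8)
p line.valid_encoding?   # false: 0xD9 is not followed by a continuation byte

fixed = repair_invalid_bytes(line)
p fixed.valid_encoding?  # true: the byte was re-read as Latin-1
```

This only touches bytes that are invalid as UTF-8, so it is a repair
for mostly-UTF-8 data. If the whole file is in another encoding,
declaring that encoding when opening the file is likely more robust.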

--
Posted via http://www.ruby-forum.com/.

Dear Arup Rakshit,

I have a (humble) guess about your problem...
I think you are trying to open an ISO-8859-1 encoded file as if it
were UTF-8.
If so, you don't need to recreate the character/encoding translation
logic the way you are trying to do. Ruby will kindly do that for you.
You just have to open the file telling Ruby that it is an ISO-8859-1
file.

Try...

    File.open('/home/kirti/workspace/Ruby/project_free/data/2013_diamonds.txt', external_encoding: Encoding::ISO_8859_1) do |file|

I think no error will be raised.

This worked for me, at least with the two lines that you gave as example.

Just tell me if it worked with the whole file.

You can read more in the Ruby documentation for Encoding, IO and File
(see internal and external encoding).
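
A minimal, self-contained sketch of the suggestion above, assuming the
file really is ISO-8859-1. The helper name `read_latin1_lines` and the
sample data are made up for illustration; the `internal_encoding`
option is an addition here, so that the strings come back as UTF-8.

```ruby
require 'tempfile'

# Hypothetical helper: read a Latin-1 file and return UTF-8 lines with
# the double quotes stripped (as the gsub in the original code does).
def read_latin1_lines(path)
  File.open(path,
            external_encoding: Encoding::ISO_8859_1,
            internal_encoding: Encoding::UTF_8) do |file|
    file.readlines.map { |line| line.chomp.delete('"') }
  end
end

# Demo with made-up data: the byte 0xD9 is "Ù" in ISO-8859-1.
Tempfile.create('latin1_demo') do |tmp|
  tmp.binmode
  tmp.write("Cote\t0.56\tC\xD9te d'Ivoire\r\n".b)
  tmp.close

  lines = read_latin1_lines(tmp.path)
  p lines.first.encoding         # the internal encoding, UTF-8
  p lines.first.valid_encoding?  # true, so no ArgumentError later on
end
```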

Best regards,
Abinoam Jr.

Abinoam Jr. wrote in post #1135206:

> Dear Arup Rakshit,
>
> I have a (humble) guess about your problem...
>
> Try...
>
>     File.open('/home/kirti/workspace/Ruby/project_free/data/2013_diamonds.txt', external_encoding: Encoding::ISO_8859_1) do |file|

Very good suggestion indeed.

One topic in Ruby has always troubled me: the rationale behind these
encodings. When should I think about `internal_encoding` and
`external_encoding`? Why not just `encoding`? Sometimes in this
situation I have also used `force_encoding`. I used all of these by
trial and error; I wasn't really aware of what I was doing. My goal
was just to fix the error. First try `encoding`, then try
`force_encoding`...

Can you shed some light on this topic?

Dear Arup,

To understand the rationale, just relax and think about why _you_
(not me ;-) ) were trying to `encoding` or `force_encoding` a string
coming from a file whose encoding differs from the one used
internally in your program.

Perhaps you will notice that you are receiving data (external data) in
an encoding different from the one used internally.

Ruby does exactly what _you_ were trying to accomplish, but in a more
elegant way ;-).

When you set the external encoding of a file, Ruby tries to translate
all data coming from the file from the external encoding to the
internal one (when an internal encoding is set; otherwise the strings
are simply tagged with the external encoding).
And when you write to the file, it does the reverse.
That way you can preserve the original encoding of the file and don't
have to worry about encoding compatibility inside your program.
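
The translation in both directions can be sketched as a round trip.
This is only an illustration: the helper names `write_latin1` and
`read_as_utf8` are made up, and the file is assumed to be ISO-8859-1
on disk.

```ruby
require 'tempfile'

# Hypothetical helper: the program hands over UTF-8 text, and Ruby
# transcodes it to the file's external encoding while writing.
def write_latin1(path, text)
  File.open(path, 'w', external_encoding: Encoding::ISO_8859_1) do |file|
    file.write(text)
  end
end

# Hypothetical helper: Ruby transcodes from the external encoding
# (ISO-8859-1) to the internal one (UTF-8) while reading.
def read_as_utf8(path)
  File.open(path, 'r',
            external_encoding: Encoding::ISO_8859_1,
            internal_encoding: Encoding::UTF_8) do |file|
    file.read
  end
end

Tempfile.create('roundtrip_demo') do |tmp|
  tmp.close
  write_latin1(tmp.path, "C\u00D9te")

  # On disk the character is the single Latin-1 byte 0xD9.
  p File.binread(tmp.path).bytes   # [67, 217, 116, 101]

  # Inside the program it is UTF-8 again.
  p read_as_utf8(tmp.path).encoding
end
```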

Go for at

Feel free to ask if it's not clear yet.
The encoding problem is complex (but the solution we have in Ruby is
simple, IMHO).
I'm a native Portuguese speaker and have to rely on good encoding
support.
As Ruby has its roots among Japanese programmers, I think they really
care about good encoding support, with a rich set of features to deal
with it.

Kind regards,
Abinoam Jr.

Abinoam Jr. wrote in post #1135236:

> Feel free to ask if it's not clear yet.
> The encoding problem is complex (but the solution we have in Ruby is
> simple IMHO).
> I'm a native portuguese speaker and have to rely on good encoding
> support.

One line from a file once gave me a lot of trouble. I fixed it as
below:

    data_line.chomp.force_encoding('windows-1252').encode('utf-8')

But before doing this, I first tried:

    (a) data_line.chomp.encoding('utf-8')
    (b) data_line.chomp.force_encoding('utf-8')

Then finally

    data_line.chomp.force_encoding('windows-1252').encode('utf-8')

worked.

Why didn't attempts (a) and (b) work? As I told you earlier, I have
always fixed this kind of thing by trial and error.

Can you explain this? Maybe with your help I can strengthen my
understanding of such encoding-related issues.

Can you give me the line?
I can try to help you.
Most of these problems come from the following: one opens a file
telling Ruby (implicitly or explicitly) its encoding, for example
UTF-8, but the bytes inside it are actually in another encoding.
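
A small sketch of that mismatch, using the `C\xD9te` bytes from the
start of the thread as stand-in data (the actual line asked about is
not shown in the thread, so this is an assumption). It uses only
`String#valid_encoding?`, `String#force_encoding` and `String#encode`.

```ruby
# Windows-1252 / ISO-8859-1 bytes that carry a UTF-8 label, as happens
# when such a file is read with UTF-8 as the declared encoding.
raw = "C\xD9te d'Ivoire".b.force_encoding(Encoding::UTF_8)
p raw.valid_encoding?          # false: the label does not match the bytes

# (a) String#encoding only reports the label and takes no arguments.
begin
  raw.encoding('utf-8')
rescue ArgumentError => e
  puts "encoding('utf-8') raised #{e.class}"
end

# (b) force_encoding changes the label only; the bytes stay as they are,
# so labelling them UTF-8 again leaves the string invalid.
relabelled = raw.dup.force_encoding('utf-8')
p relabelled.valid_encoding?   # false

# The fix: first put the correct label on the bytes, then transcode.
fixed = raw.dup.force_encoding('windows-1252').encode('utf-8')
p fixed.valid_encoding?        # true
p fixed.bytes.length           # one byte longer: "Ù" takes two bytes in UTF-8
```

In short, `force_encoding` answers "what are these bytes?", while
`encode` answers "convert them to something else"; the second only
works once the first is right.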
