ArgumentError - invalid byte sequence in UTF-8

Hi,

Every now and then I get errors relating to UTF-8 encodings, and each time I fail to guess the right combination of words to get Ruby 1.9.2 to play nice with some string it doesn't like.

Right now I want to open a log file and read it, but some script kiddie has decided to connect using some crazy non-ASCII characters, and this line in my script

    File.readlines(logfile, :encoding => "UTF-8" )

now spits out the error:

  ArgumentError - invalid byte sequence in UTF-8

when encountering lines like this:

83.44.178.124 - - [19/Jul/2011:19:15:00 +0100] ?.???S\x08\x02?N~],>~Q?~@6\x15`ҷ?~Vg?'dR\x1C??\x08?F\x06w?~H?~F?\x08P~V?\x0Bf\x22?\x17~M^??{??j\x1E??p?~AU~\\
"400" 166 "-" "-" "-"

I'd really like to know how to fix this without dropping 1.9. Does anyone know the magic words that will get this logfile read? These are my best efforts:

    File.readlines(logfile, :encoding => "UTF-8" ).map{|e| e.force_encoding('UTF-8')}

    File.readlines(logfile, :encoding => "UTF-8" ).map{|e| e.encode('UTF-8', undef: :replace, replace: "??")}

    File.readlines(logfile, :encoding => "UTF-8" ).map{|e| e.encode('iso-8859-1', undef: :replace, replace: "??")}

They fail :( They do read a logfile with valid UTF-8 in it, though. Any help is much appreciated.

Regards,
Iain

1) No matter what you do, there can always be an invalid byte sequence.
2) You have to know the encoding of a file to read it.

#encoding: utf-8

puts RUBY_VERSION

str = "m€, ¥ou"

File.open('text.txt', 'w') do |f|
  f.puts str
end

# Note: IO.foreach's second argument is a line separator, not a mode
# string, so pass the encoding explicitly rather than 'r'.
IO.foreach('text.txt', :encoding => 'UTF-8') do |line|
  p line.encoding.name
  p line
end

--output:--
1.9.2
"UTF-8"
"m€, ¥ou\n"


Iain Barnett wrote in post #1012004:

    File.readlines(logfile, :encoding => "UTF-8" )

Now spits out the error:

  ArgumentError - invalid byte sequence in UTF-8

Are you sure it's that particular line which spits out the error?

There are no hard-and-fast rules, because of the whole incoherent design
of ruby 1.9, but in many cases you can *read* a string which has invalid
encodings, but you get an error later on when you try to do things like
regexp matches on it.

irb(main):002:0> File.open("zzz1","wb") { |f| f.write("\xdd\xdd") }
=> 2
irb(main):003:0> File.readlines("zzz1")
=> ["\xDD\xDD"]
irb(main):004:0> File.readlines("zzz1", :encoding=>"UTF-8")
=> ["\xDD\xDD"]
irb(main):005:0> File.readlines("zzz1", :encoding=>"UTF-8")[0] =~ /./
ArgumentError: invalid byte sequence in UTF-8
  from (irb):5
  from /usr/local/bin/irb192:12:in `<main>'
irb(main):006:0>

You can of course set :encoding=>"BINARY" (or "ASCII-8BIT") when you
read the file. Or you could open the file in binary mode ("rb"), which I
don't think File.readlines supports directly, but File.open does. The
two are not exactly the same; binary mode also prevents CR/CRLF
translations on non-Unix platforms.
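
Roughly, either option looks like this (just a sketch; logfile is the path variable from the original post):

    # Option 1: ask readlines for binary strings explicitly.
    lines = File.readlines(logfile, :encoding => "BINARY")

    # Option 2: open in binary mode ("rb") and read the lines yourself.
    lines = File.open(logfile, "rb") { |f| f.readlines }

    p lines.first.encoding.name   #=> "ASCII-8BIT" either way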

I'd suggest that BINARY mode is the way to go for you. If your objective
is to read in some log lines, chomp them, and write them out again,
whilst allowing arbitrary byte sequences, this will Just Work [TM], just
like it would in ruby 1.8.
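
For instance, a binary-mode read/chomp/rewrite loop might be sketched like this (the output filename is only a placeholder):

    # Sketch: arbitrary byte sequences pass through untouched.
    # "cleaned.log" is a placeholder name, not something from the thread.
    File.open("cleaned.log", "wb") do |out|
      File.open(logfile, "rb") do |f|
        f.each_line do |line|
          out.puts line.chomp
        end
      end
    end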

However, regexp matches will be against individual bytes of the string,
rather than entire UTF-8 characters.

It's strange how in ruby 1.9 a string can hold an invalid byte sequence
just fine, but str =~ /./ on it cannot. But that's only one of many
strange things about ruby 1.9.
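
If the later regexes only care about the ASCII parts of each line (the IP address, the status code and so on), one way to live with this is to match against a binary-tagged copy. Another workaround, not covered above, is to scrub the invalid bytes by transcoding through UTF-16 and back (String#scrub does this directly, but only from Ruby 2.1 onwards). A rough sketch of both, with a made-up sample line and an arbitrary replacement character:

    # A made-up stand-in for one of the bad log lines; the \xDD bytes make
    # it invalid UTF-8.
    line = "83.44.178.124 - - \xDD\xDD \"400\" 166".force_encoding("UTF-8")

    # 1. Match a binary-tagged copy: the regexp then sees bytes, not characters.
    if line.dup.force_encoding("BINARY") =~ /\A(\d{1,3}(?:\.\d{1,3}){3})/n
      p $1   #=> "83.44.178.124"
    end

    # 2. Scrub invalid bytes by round-tripping through UTF-16BE and back.
    clean = line.encode("UTF-16BE", :invalid => :replace, :undef => :replace,
                        :replace => "?").encode("UTF-8")
    p clean.valid_encoding?   #=> true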


Thanks, Brian. I am running some regexes on the lines later, so that is where the script is actually choking. I'll just have to put up with this, I suppose.

Again, many thanks.

Regards,
Iain
