ArgumentError - invalid byte sequence in UTF-8

Hi,

Every now and then I get errors relating to UTF-8 encodings, and each time I fail to guess the right combination of words to get Ruby 1.9.2 to play nice with some string it doesn't like.

Right now I want to open a log file and read it, but some script kiddie has decided to connect using some crazy non-ASCII characters, and this line in my script

    File.readlines(logfile, :encoding => "UTF-8" )

now spits out the error:

  ArgumentError - invalid byte sequence in UTF-8

when encountering lines like this:

83.44.178.124 - - [19/Jul/2011:19:15:00 +0100] ?.???S\x08\x02?N~],>~Q?~@6\x15`ҷ?~Vg?'dR\x1C??\x08?F\x06w?~H?~F?\x08P~V?\x0Bf\x22?\x17~M^??{??j\x1E??p?~AU~\\
"400" 166 "-" "-" "-"

I'd really like to know how to fix this without dropping 1.9. Does anyone know the magic words that will get this logfile read? These are my best efforts:

    File.readlines(logfile, :encoding => "UTF-8" ).map{|e| e.force_encoding('UTF-8')}

    File.readlines(logfile, :encoding => "UTF-8" ).map{|e| e.encode('UTF-8', undef: :replace, replace: "??")}

    File.readlines(logfile, :encoding => "UTF-8" ).map{|e| e.encode('iso-8859-1', undef: :replace, replace: "??")}

They fail :( They do read a logfile with valid UTF-8 in it, though. Any help is much appreciated.

Regards,
Iain

1) No matter what you do, there can always be an invalid byte sequence.
2) You have to know the encoding of a file to read it.

#encoding: utf-8

puts RUBY_VERSION

str = "m€, ¥ou"

File.open('text.txt', 'w') do |f|
  f.puts str
end

# Note: IO.foreach's second argument is a line separator, not a mode
# string, so pass the encoding explicitly rather than 'r'.
IO.foreach('text.txt', :encoding => 'UTF-8') do |line|
  p line.encoding.name
  p line
end

--output:--
1.9.2
"UTF-8"
"m€, ¥ou\n"


Iain Barnett wrote in post #1012004:

    File.readlines(logfile, :encoding => "UTF-8" )

Now spits out the error:

  ArgumentError - invalid byte sequence in UTF-8

Are you sure it's that particular line which spits out the error?

There are no hard-and-fast rules, because of the whole incoherent design
of ruby 1.9, but in many cases you can *read* a string which has invalid
encodings, but you get an error later on when you try to do things like
regexp matches on it.

irb(main):002:0> File.open("zzz1","wb") { |f| f.write("\xdd\xdd") }
=> 2
irb(main):003:0> File.readlines("zzz1")
=> ["\xDD\xDD"]
irb(main):004:0> File.readlines("zzz1", :encoding=>"UTF-8")
=> ["\xDD\xDD"]
irb(main):005:0> File.readlines("zzz1", :encoding=>"UTF-8")[0] =~ /./
ArgumentError: invalid byte sequence in UTF-8
  from (irb):5
  from /usr/local/bin/irb192:12:in `<main>'
irb(main):006:0>

You can of course set :encoding=>"BINARY" (or "ASCII-8BIT") when you
read the file. Or you could open the file in binary mode ("rb"), which I
don't think File.readlines supports directly, but File.open does. The
two are not exactly the same; binary mode also prevents CR/CRLF
translations on non-Unix platforms.
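
Roughly, either option looks like this (just a sketch; logfile is the path variable from the original post):

    # Option 1: ask readlines for binary strings explicitly.
    lines = File.readlines(logfile, :encoding => "BINARY")

    # Option 2: open in binary mode ("rb") and read the lines yourself.
    lines = File.open(logfile, "rb") { |f| f.readlines }

    p lines.first.encoding.name   #=> "ASCII-8BIT" either way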

I'd suggest that BINARY mode is the way to go for you. If your objective
is to read in some log lines, chomp them, and write them out again,
whilst allowing arbitrary byte sequences, this will Just Work [TM], just
like it would in ruby 1.8.
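
For instance, a binary-mode read/chomp/rewrite loop might be sketched like this (the output filename is only a placeholder):

    # Sketch: arbitrary byte sequences pass through untouched.
    # "cleaned.log" is a placeholder name, not something from the thread.
    File.open("cleaned.log", "wb") do |out|
      File.open(logfile, "rb") do |f|
        f.each_line do |line|
          out.puts line.chomp
        end
      end
    end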

However, regexp matches will be against individual bytes of the string,
rather than entire UTF-8 characters.

It's strange how in ruby 1.9 a string can hold an invalid byte sequence
just fine, but str =~ /./ on it cannot. But that's only one of many
strange things about ruby 1.9.
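
If the later regexes only care about the ASCII parts of each line (the IP address, the status code and so on), one way to live with this is to match against a binary-tagged copy. Another workaround, not covered above, is to scrub the invalid bytes by transcoding through UTF-16 and back (String#scrub does this directly, but only from Ruby 2.1 onwards). A rough sketch of both, with a made-up sample line and an arbitrary replacement character:

    # A made-up stand-in for one of the bad log lines; the \xDD bytes make
    # it invalid UTF-8.
    line = "83.44.178.124 - - \xDD\xDD \"400\" 166".force_encoding("UTF-8")

    # 1. Match a binary-tagged copy: the regexp then sees bytes, not characters.
    if line.dup.force_encoding("BINARY") =~ /\A(\d{1,3}(?:\.\d{1,3}){3})/n
      p $1   #=> "83.44.178.124"
    end

    # 2. Scrub invalid bytes by round-tripping through UTF-16BE and back.
    clean = line.encode("UTF-16BE", :invalid => :replace, :undef => :replace,
                        :replace => "?").encode("UTF-8")
    p clean.valid_encoding?   #=> true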


Thanks, Brian. I am running some regexes on the lines later, so that is where the script is actually choking. I'll just have to put up with this, I suppose.

Again, many thanks.

Regards,
Iain
