Ruby unicode/string explosion (0xFF in utf-8)

Sylvester_T_Cat · 11 December 2010 04:40

Hi, I'm using Ruby 1.9.2.

I'm reading a CSV file that has some non-US-ASCII characters. I want
to parse each value in each row and strip any leading/trailing
whitespace.
However, when I come across some unusual characters, I get "invalid
byte sequence in UTF-8".

Here is an example:

irb(main):041:0* a = "\xFF"
=> "\xFF"
irb(main):042:0> a.encoding
=> #<Encoding:UTF-8>
irb(main):043:0> a.strip
ArgumentError: invalid byte sequence in UTF-8
        from (irb):43:in `strip'
        from (irb):43
        from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands/console.rb:44:in `start'
        from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands/console.rb:8:in `start'
        from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands.rb:23:in `<top (required)>'
        from script/rails:6:in `require'
        from script/rails:6:in `<main>'

# So now I try to change the encoding, but this doesn't work either:

irb(main):044:0> a.encode!("ASCII-8BIT", undef: :replace)
Encoding::InvalidByteSequenceError: "\xFF" on UTF-8
        from (irb):44:in `encode!'
        from (irb):44
        from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands/console.rb:44:in `start'
        from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands/console.rb:8:in `start'
        from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands.rb:23:in `<top (required)>'
        from script/rails:6:in `require'
        from script/rails:6:in `<main>'

Is there any way to strip out these characters while staying with
utf-8 encoding?

botp1 · 11 December 2010 08:52

see Gray Soft / Not Found

eg,

require 'iconv'
#=> true
a="abcde \xFF ghi"
#=> "abcde \xFF ghi"
a.encoding
#=> #<Encoding:UTF-8>
ic = Iconv.new 'UTF-8//IGNORE', 'UTF-8'
#=> #<Iconv:0x944f468>
ic.iconv a
#=> "abcde ghi"

kind regards -botp
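
For readers on later Rubies: Iconv was deprecated in 1.9.3 and removed from the standard library in 2.0, so the snippet above only runs as-is on 1.9.x. Below is a rough sketch of two replacements that stay in UTF-8, assuming the goal is simply to drop the invalid bytes. The variable names are made up for illustration. Note also that the `encode!` attempt in the original question passed only `undef: :replace`; invalid bytes are controlled by the separate `invalid:` option.

```ruby
# Sketch: drop invalid bytes from a string that is labelled UTF-8.
a = "abcde \xFF ghi"

# Ruby 2.1+: String#scrub replaces invalid byte sequences. Passing ""
# removes them instead of substituting U+FFFD.
cleaned = a.scrub("")

# Rubies without scrub: transcode through another encoding and back.
# A plain UTF-8 -> UTF-8 encode can be treated as a no-op and leave the
# bad byte in place, hence the detour through UTF-16BE.
cleaned_legacy = a.encode("UTF-16BE", invalid: :replace, replace: "")
                  .encode("UTF-8")

puts cleaned.strip
puts cleaned_legacy.strip
```

Either result is valid UTF-8, so `strip` and friends no longer raise. The spaces on both sides of the removed byte are kept.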


Brian_Candler · 12 December 2010 21:57

Sylvester T Cat wrote in post #967790:

I'm reading a CSV file that has some non US-ASCII characters. I want
to parse each value in each row and strip out any leading/lagging
potential whitespace.
However, when I come across some unusual characters, I get invalid
byte sequence in UTF-8

I guess it's not genuinely UTF-8.
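
A quick way to check this, as a minimal sketch: the encoding on a Ruby 1.9 string is only a label, and `String#valid_encoding?` reports whether the bytes actually fit it.

```ruby
a = "\xFF"

label = a.encoding         # the label says UTF-8
valid = a.valid_encoding?  # but the byte 0xFF can never occur in UTF-8
bytes = a.bytes.to_a       # the raw byte values behind the string

puts "#{label}: valid=#{valid}, bytes=#{bytes.inspect}"
```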

If you think it *is* sort of broken UTF-8 which includes FF characters
for some reason, then you could force encoding to binary, remove the FF
characters, then force back to UTF-8.
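
The force/remove/force-back idea could look roughly like this. It is only a sketch: the sample string is made up, and it assumes 0xFF is the only stray byte in otherwise valid UTF-8.

```ruby
# Made-up sample: valid UTF-8 ("café") plus one stray 0xFF byte.
raw = "caf\xC3\xA9 \xFF ok"

# force_encoding only changes the label; the bytes are untouched.
bytes = raw.dup.force_encoding("ASCII-8BIT")

# 0xFF.chr is a one-byte ASCII-8BIT string, so it is compatible with
# the binary-labelled data. 0xFF never occurs in well-formed UTF-8, so
# deleting it cannot damage a valid multi-byte character.
bytes.delete!(0xFF.chr)

# Relabel as UTF-8 and confirm the result is now well-formed.
cleaned = bytes.force_encoding("UTF-8")
puts cleaned.valid_encoding?
puts cleaned.strip
```

If other invalid bytes can appear, checking `valid_encoding?` afterwards (as above) will at least show that the cleanup was incomplete.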

More likely I'd have thought it was a single-byte encoding (like
ISO-8859-1 perhaps). But in any case, if you're just doing CSV parsing,
you can quite legitimately treat UTF-8 as binary - since all you need to
do is recognise commas and double quotes, and the rest just gets passed
through.
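
If the data really is ISO-8859-1 (a guess, nothing in the thread confirms it), one possible approach is to label it as such and transcode to UTF-8 before parsing. The sample data and variable names below are invented for the sketch.

```ruby
require "csv"

# Invented sample: 0xE9 and 0xFF are "é" and "ÿ" in ISO-8859-1, but
# both are invalid as standalone bytes in UTF-8.
raw = "name, city \nJos\xE9,  \xFF town \n"

# Every byte is a valid ISO-8859-1 character, so relabelling and
# transcoding to UTF-8 cannot raise an invalid-byte error.
text = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Parse, then strip leading/trailing whitespace from each value.
# Empty fields come back as nil, hence the guard.
rows = CSV.parse(text).map do |row|
  row.map { |value| value && value.strip }
end

rows.each { |row| puts row.inspect }
```

For a file on disk, I believe passing `encoding: "ISO-8859-1:UTF-8"` to `CSV.foreach` does the same transcoding while reading, but check the CSV documentation for your Ruby version.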

More info at https://github.com/candlerb/string19/blob/master/string19.rb, which begins:

#!/usr/bin/env ruby19
# encoding: UTF-8
# This document is Copyright (C) Brian Candler 2009 and released under a
# Creative Commons Attribution-NonCommercial 3.0 Unported License.

############# CONTENTS ###################

# -1. PREAMBLE
#  0. INTRODUCTION
#  1. ENCODINGS
#  2. PROPERTIES OF ENCODINGS
#  3. STRING, FILE AND REGEXP ENCODINGS
#  4. VALID AND FIXED ENCODINGS
#  5. COMPATIBLE OBJECTS
#  6. STRING CONCATENATION
#  7. THE BINARY / ASCII-8BIT ENCODING
#  8. SINGLE CHARACTERS
#  9. EQUALITY AND COLLATION
# 10. HASH AND EQL?
# 11. UPPER AND LOWER CASE

(The rest of the file is not shown here.)

Or just use ruby 1.8.

--
Posted via http://www.ruby-forum.com/.
