Ruby 1.9.2: How to sanitize text with invalid characters?

Andreas_S4 · 11 October 2010 21:08

I process a lot of text files whose encoding I know, but which
might contain a few invalid bytes (i.e., bytes that make gsub fail with
"ArgumentError: invalid byte sequence in US-ASCII" or "... in UTF-8").
What's the best way to handle this situation gracefully, by ignoring or
removing the invalid characters?

--
Posted via http://www.ruby-forum.com/.

Scott_Gonyea · 11 October 2010 22:45

Code / Data samples?

r_string = 'blah'.encode('UTF-8')
r_regex = /#{r_string}/
text = "wahlahblahblahwahbalablah".encode("UTF-8")
text.gsub!(r_regex, '')

That's a horrible example. Still, if you have ASCII in one place, and
UTF-8 in another, it's conceivable that the matcher may just throw up
its hands. Force the encoding and try again. If it doesn't work,
please post more information (preferably with a Gist / pastie). If
that helps, please mention it so that Google can direct other poor
souls to this post.

Scott


Andreas_S4 · 11 October 2010 23:03

Scott Gonyea wrote in post #949026:

Code / Data samples?

Trivial example:
"#{0xFF.chr} abcde".force_encoding("utf-8").gsub(/a/,'')
ArgumentError: invalid byte sequence in UTF-8
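
For reference, this failure can be detected before gsub is called. The following is only a minimal sketch: it assumes String#valid_encoding? (available since Ruby 1.9), and the helper name gsub_if_valid is hypothetical, not something from this thread.

```ruby
# Hypothetical helper (name not from the thread): run gsub only when the
# string is well formed for the encoding it is tagged with.
def gsub_if_valid(str, pattern, replacement)
  # valid_encoding? scans the bytes and returns true/false without raising.
  return nil unless str.valid_encoding?
  str.gsub(pattern, replacement)
end

# Same broken input as in the trivial example above.
broken = "\xFF abcde".dup.force_encoding(Encoding::UTF_8)

p broken.valid_encoding?            # false: a lone 0xFF byte is never valid UTF-8
p gsub_if_valid(broken, /a/, '')    # nil instead of an ArgumentError
p gsub_if_valid(" abcde", /a/, '')  # " bcde"
```

Returning nil is just one possible policy; a caller could equally log the file name or fall back to a cleaning step.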


Scott_Gonyea · 11 October 2010 23:12

Will this work?

blah1 = "#{0xFF.chr} abcde"
blah2 = blah1.split(/[^[:print:]]/).join


Andreas_S4 · 11 October 2010 23:16

Using iconv to clean the string works:
require 'iconv'
Iconv.conv('utf-8//IGNORE','utf-8',"#{0xFF.chr} abcde")
=> " abcde"

However, it would be nicer if there was a way to do this with the
built-in encoding functions of Ruby 1.9.


Andreas_S4 · 11 October 2010 23:35

Scott Gonyea wrote in post #949256:

Will this work?

blah1 = "#{0xFF.chr} abcde"
blah2 = blah1.split(/[^[:print:]]/).join

Only if the desired encoding is ASCII.
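
The limitation shows up on a byte string: there, [[:print:]] only covers printable ASCII, so every byte of a multi-byte character is treated as a separator and dropped along with the invalid ones. A small sketch, assuming Ruby 2.0 or later for String#b; the method name printable_ascii_only is made up for illustration.

```ruby
# Hypothetical wrapper around the split/join idea quoted above.
def printable_ascii_only(str)
  # String#b returns a copy tagged ASCII-8BIT, so the regexp sees raw bytes
  # and never raises on invalid sequences.
  str.b.split(/[^[:print:]]/).join
end

p printable_ascii_only("\xFF abcde")  # " abcde": the invalid byte is gone
p printable_ascii_only("h\u00E9llo")  # "hllo": the valid e-acute is gone too
```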


Marvin_GA_lker · 12 October 2010 07:29

String#encode can do this much more nicely:

============================================
$ irb
irb(main):001:0> RUBY_DESCRIPTION
=> "ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]"
irb(main):002:0> str = "#{0xFF.chr}"
=> "\xFF"
irb(main):003:0> str.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> str.encode("UTF-8")
Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8
  from (irb):4:in `encode'
  from (irb):4
  from /opt/rubies/ruby-1.9.2-p0/bin/irb:12:in `<main>'
irb(main):005:0> str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "?"
irb(main):006:0>

In order to remove invalid chars completely, use an empty string instead
of "?".

Vale,
Marvin
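
Note that the irb session above converts from ASCII-8BIT, where every byte above 0x7F is undefined for UTF-8, so bytes belonging to valid multi-byte characters would be replaced as well. For a string whose real encoding is UTF-8, a sketch along the following lines may be closer to what was asked. It rests on assumptions that go beyond this thread: String#scrub only exists from Ruby 2.1 on, the round trip through UTF-16LE is a commonly used workaround for older versions rather than anything proposed here, and the name sanitize_utf8 is hypothetical.

```ruby
# Hypothetical helper; the name `sanitize_utf8` is not from the thread.
# Returns a UTF-8 copy of `str` with invalid byte sequences replaced.
def sanitize_utf8(str, replacement = "")
  # Work on a copy tagged as UTF-8 so the caller's string is untouched.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  if utf8.respond_to?(:scrub)
    # Ruby 2.1+ (assumption: newer than the 1.9.2 used in this thread).
    utf8.scrub(replacement)
  else
    # Older Rubies: convert to a different encoding and back so that the
    # :invalid option is actually applied during transcoding.
    utf8.encode(Encoding::UTF_16LE,
                :invalid => :replace, :undef => :replace,
                :replace => replacement).encode(Encoding::UTF_8)
  end
end

p sanitize_utf8("\xFF abcde")       # " abcde"
p sanitize_utf8("\xFF abcde", "?")  # "? abcde"
```

Valid multi-byte characters pass through unchanged, which is the main difference from converting out of ASCII-8BIT.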
