Forcing a string to valid UTF-8

Gavin_Kistner3 · 26 April 2010 22:45

I have some legacy text data that's gone through several databases and
web services in its life, playing promiscuously with dirty web
servers, browsers, and encodings.

It's coming out of the source database as ASCII-8BIT. I'm trying to
bring it all into UTF-8. I've found ways to coerce many of the bad
entries into compliance, but now I've hit one that is simply bad. I
want to just delete the minimum necessary to make it valid UTF-8. What
I'm trying isn't working. Here's my code:

  if new_value.is_a? String
    begin
      utf8 = new_value.force_encoding('UTF-8')
      if utf8.valid_encoding?
        new_value = utf8
      else
        new_value.encode!( 'UTF-8', 'Windows-1252' )
      end
    rescue EncodingError => e
      puts "Bad encoding: #{old_table}.#{pk}:#{old_row[pk]} - #{new_value.inspect}"
      new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace, replace: '' )
      p new_value.encoding unless new_value.valid_encoding?
    end
  end

When I fall into the rescue clause, I'm getting out:
Bad encoding: bugs.id:2469 - "Indexing C:\\\\πé│πâö\xE3\x81E \x81E \x81EZCa_zu5.264"
#<Encoding:UTF-8>
The conversion resulted in an invalid UTF-8 string (that happens to be
the same as the original, as far as I can tell.) I'm surprised,
because I thought the purpose of invalid/undef replace was to clean
things up.

How do I force it into a valid UTF-8 encoding, losing as little data
as possible but happily throwing out the senseless bits?

Brian_Candler · 27 April 2010 10:19

Gavin Kistner wrote:
> How do I force it into a valid UTF-8 encoding, losing as little data
> as possible but happily throwing out the senseless bits?

AFAICS, the trouble with your rescue clause is that the string failed to
be encoded into Windows-1252, so it remains with its existing UTF-8 tag,
and so an attempt to "re-encode" as UTF-8 is silently ignored because
it's already UTF-8, even though it contains invalid characters.

For example, this doesn't do anything:

a = "abc\xffdef".force_encoding("UTF-8")

=> "abc\xFFdef"

b = a.encode("UTF-8", :invalid=>:replace, :replace=>"?")

=> "abc\xFFdef"

but this does:

b = a.encode("UTF-16BE", :invalid=>:replace, :replace=>"?").encode("UTF-8")

=> "abc?def"

Proviso: ruby 1.9 string handling is undocumented and subject to
continuous change. I tested the above with

RUBY_DESCRIPTION

=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

so it may or may not work with your version, or with future versions of
Ruby.
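
The round trip through UTF-16BE shown above could be wrapped in a small helper. This is only a sketch: the method name `drop_invalid_utf8` and its `replacement` parameter are made up for illustration, and it assumes the bytes are meant to be UTF-8 in the first place.

```ruby
# Hypothetical helper (name and signature are illustrative, not from any library).
# Assumes the bytes in `str` are intended to be UTF-8.
def drop_invalid_utf8(str, replacement = "")
  tagged = str.dup.force_encoding("UTF-8") # dup so the caller's string is untouched
  return tagged if tagged.valid_encoding?

  # Transcoding to a different encoding makes Ruby examine every byte;
  # invalid sequences are swapped for `replacement` on the way through.
  tagged.encode("UTF-16BE", invalid: :replace, undef: :replace, replace: replacement)
        .encode("UTF-8")
end

cleaned = drop_invalid_utf8("abc\xFFdef")
puts cleaned.inspect
puts cleaned.valid_encoding?
```

Passing "?" as the second argument reproduces the "abc?def" result from the example; the default of an empty string simply deletes the bad bytes.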

Gavin_Kistner3 · 27 April 2010 14:45

> Gavin Kistner wrote:
> > How do I force it into a valid UTF-8 encoding, losing as little data
> > as possible but happily throwing out the senseless bits?
>
> AFAICS, the trouble with your rescue clause is that the string failed to
> be encoded into Windows-1252, so it remains with its existing UTF-8 tag,
> and so an attempt to "re-encode" as UTF-8 is silently ignored because
> it's already UTF-8, even though it contains invalid characters.

Excellent point. Fixing that led me to a similar error earlier: I had
assumed that
s2 = s1.force_encoding(...)
left s1 intact. In fact, it modifies and returns s1. Thank you very
much, Brian.

For those that care or stumble upon this via Google, here's a modified
version that works:

  # Converting ASCII-8BIT to UTF-8 based on domain-specific guesses
  if new_value.is_a? String
    begin
      # Try it as UTF-8 directly
      cleaned = new_value.dup.force_encoding('UTF-8')
      unless cleaned.valid_encoding?
        # Some of it might be old Windows code page
        cleaned = new_value.encode( 'UTF-8', 'Windows-1252' )
      end
      new_value = cleaned
    rescue EncodingError
      # Force it to UTF-8, replacing unconvertible bytes with U+FFFD
      new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace )
    end
  end
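
The same fallback chain can be wrapped in a method so it is easy to try against sample bytes. This is a sketch, not the code from the post: the name `coerce_to_utf8` is hypothetical, the sample inputs are invented, and the last step passes `replace: ""` (as the first version of the code did) so that unconvertible bytes are dropped rather than replaced.

```ruby
# Hypothetical wrapper; assumes `raw` holds bytes of unknown origin.
def coerce_to_utf8(raw)
  # Try it as UTF-8 directly, on a copy so `raw` keeps its own tag.
  as_utf8 = raw.dup.force_encoding("UTF-8")
  return as_utf8 if as_utf8.valid_encoding?

  # Not valid UTF-8: guess the old Windows code page.
  raw.encode("UTF-8", "Windows-1252")
rescue EncodingError
  # Last resort: treat the input as raw bytes and drop whatever has no mapping.
  # Note that this discards every non-ASCII byte, valid or not.
  raw.encode("UTF-8", "ASCII-8BIT", invalid: :replace, undef: :replace, replace: "")
end

p coerce_to_utf8("caf\xC3\xA9".b)   # bytes that are already UTF-8
p coerce_to_utf8("caf\xE9".b)       # Windows-1252 bytes
p coerce_to_utf8("a\x81b".b)        # 0x81 is unassigned in Windows-1252
```

Because the last resort throws away all non-ASCII bytes, a string that is mostly good UTF-8 with a few stray bytes would lose less if it were round-tripped through UTF-16BE instead, as suggested earlier in the thread.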

> Proviso: ruby 1.9 string handling is undocumented and subject to
> continuous change. I tested the above with

FWIW my new code works on ruby 1.9.1p243 (2009-07-16 revision 24175) [i386-mingw32]

Thanks again!
