I get some untrusted input from some of our partners that should be
UTF-8 (or generally plain 7-bit ASCII), but isn't always. In some cases
it appears to be multiple incompatible string encodings concatenated
together, truncated strangely and then joined, or perhaps just noise.
I'd like to convert the string into something that's valid UTF-8 so I
can work with it, ideally keeping as much of the validly encoded parts
of the string as possible. I tried encode! but ran into weirdness where
it returns a string that claims to be valid but isn't (which seems like
a bug).
# test strings
1.9.3p0> str1 = "ceramic
rollers1\x82ры/Рейд-боссы/50—59F\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls&tempFileName=1310611982277\xC1\xA6110ȸ
\xC7հ\xDD\xC0ڸ\xED\xB4\xDC(\xC1\xF6\xBF\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls"
1.9.3p0> str2 = "hydroxide+caustic 田由\xE7\xBE"
# encode!
1.9.3p0> a = str1.dup
1.9.3p0> a.valid_encoding?
=> false
1.9.3p0> a.encode!(Encoding::UTF_8, Encoding::UTF_8, :invalid=>:replace,
:undef=>:replace, :replace=>'')
=> "ceramic
rollers1\x82ры/Рейд-боссы/50—59F\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls&tempFileName=1310611982277\xC1\xA6110ȸ
\xC7հ\xDD\xC0ڸ\xED\xB4\xDC(\xC1\xF6\xBF\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls"
1.9.3p0> a.valid_encoding?
=> true
# so far so good
1.9.3p0> a.squeeze(' ')
ArgumentError: invalid byte sequence in UTF-8
from (irb):10:in `squeeze'
from (irb):10
from /home/tgarnett/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
# !!! ruby just claimed the encoding was valid! BUG??
# a.dup.squeeze(' '), "#{a} ".squeeze(' ') both fail as well
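A minimal sketch of one possible workaround, assuming the cause is that
transcoding from UTF-8 to UTF-8 is treated as a no-op on this Ruby
version (the session above is consistent with that, but it is an
assumption). Going through a different encoding such as UTF-16LE and
back forces every byte to be examined. The helper name
strip_invalid_utf8 is hypothetical.

```ruby
# Hypothetical helper; the name is not from the original session.
def strip_invalid_utf8(str)
  # Transcoding to a *different* encoding makes Ruby walk every byte.
  # Invalid or truncated sequences are replaced with '' (i.e. dropped).
  str.dup.force_encoding(Encoding::UTF_8)
     .encode(Encoding::UTF_16LE, invalid: :replace, undef: :replace, replace: '')
     .encode(Encoding::UTF_8)
end

str2 = "hydroxide+caustic 田由\xE7\xBE"
cleaned = strip_invalid_utf8(str2)
cleaned.valid_encoding?   # expected to be true
cleaned.squeeze(' ')      # should no longer raise ArgumentError
```

The original string is left untouched because the helper works on a dup.
Well-formed characters such as 田由 should survive; only the bytes that
cannot be decoded are removed.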
I also tried iconv with //IGNORE, but it returns invalid strings on some
inputs and raises an exception on others. I've had better luck with
unpack/pack, but I was wondering if anyone knew a better way to do this.
# iconv
1.9.3p0> require 'iconv'
1.9.3p0> a = str1.dup
1.9.3p0> a = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(a)
=> "ceramic
rollers1ры/Рейд-боссы/50—59F)-ټ.xls&tempFileName=1310611982277110ȸ
հڸ(\xF6\xBF\xAA\xB3)-ټ.xls"
1.9.3p0> a.valid_encoding?
=> false
# no luck here either...
1.9.3p0> b = str2.dup
1.9.3p0> b = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(b)
Iconv::InvalidCharacter: "\xE7\xBE"
from (irb):22:in `iconv'
from (irb):22
from /home/tgarnett/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
# ok, can crash too...
# unpack, pack
1.9.3p0> a = str2.dup
1.9.3p0> a = a.unpack('C*').pack('U*')
=> "hydroxide+caustic ç\u0094°ç\u0094±ç¾"
1.9.3p0> a.valid_encoding?
=> true
1.9.3p0> a.squeeze(' ')
=> "hydroxide+caustic ç\u0094°ç\u0094±ç¾"
# some success, also works for str1
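Note that unpack('C*').pack('U*') turns every byte into the code point
with the same number, which is why 田 came out as "ç\u0094°" above: the
result is valid, but the characters that were already well formed are
lost. A minimal sketch of an alternative that keeps them, assuming it is
acceptable to simply drop the undecodable bytes (keep_valid_chars is a
hypothetical name):

```ruby
# Hypothetical helper; the name is not from the original session.
def keep_valid_chars(str)
  # each_char yields well-formed characters whole and broken bytes as
  # separate chunks whose valid_encoding? is false, so filtering on
  # that drops only the broken bytes.
  str.dup.force_encoding(Encoding::UTF_8)
     .each_char.select(&:valid_encoding?).join
end

str2 = "hydroxide+caustic 田由\xE7\xBE"

kept = keep_valid_chars(str2)
kept.valid_encoding?   # expected to be true, with 田由 preserved

# For comparison: unpack/pack gives the same result as reinterpreting
# the bytes as ISO-8859-1 and transcoding that to UTF-8.
latin1 = str2.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
latin1 == str2.unpack('C*').pack('U*')
```

On Ruby 2.1 and later, String#scrub covers the same ground directly; it
is not available on 1.9.3.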
--
Posted via http://www.ruby-forum.com/.