[snipped]
I dislike that Iconv raises an exception when it finds characters it
cannot convert. I would prefer if it could be made to ignore invalid
characters and just try to make the best of the text.
Seconded, Thirded, and Quadrupled.
Iconv needs an "as close as I could get with transliteration and ignoring
invalid characters" mode.
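For what it's worth, on a Ruby new enough to have String#encode (1.9 and later), the "ignore invalid characters" half of such a mode can be sketched roughly as below. This is not Iconv, and it does no transliteration; it only replaces or drops what cannot be converted. The name lossy_convert and its default arguments are made up for illustration.

```ruby
# Hypothetical helper (not part of Iconv or Raggle): convert UTF-8 text
# to a target charset without raising on bad input.
#   invalid: :replace  handles byte sequences that are not valid UTF-8
#   undef:   :replace  handles characters the target charset lacks
# Pass '' as the replacement to drop such characters entirely.
def lossy_convert(text, target = 'ISO-8859-1', replacement = '?')
  text.encode(target, 'UTF-8',
              invalid: :replace, undef: :replace, replace: replacement)
end

converted = lossy_convert("caf\u00e9 \u2603")  # e-acute survives, snowman becomes '?'
dropped   = lossy_convert("ab\xFFcd", 'ISO-8859-1', '')
```

The replacement string has to be representable in the target charset itself, so a plain ASCII marker is the safe choice.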
We're doing something comparable in Raggle by trapping the exception and
stripping out the invalid character. Obviously this doesn't work
properly for multibyte characters, and you won't be able to use a lookup
table for arbitrary source encodings, but it's a start.
begin
  # convert element_text to native charset (note: in this case we're
  # converting from utf-8 to the native charset, but the only thing
  # about the code that's utf-8 specific is the assumption about
  # character width and the unicode lookup table below)
  ret = $iconv.iconv(element_text) << $iconv.iconv(nil)
rescue Iconv::IllegalSequence => e
  # save the portion of the string that was successful, the
  # invalid character, and the remaining (pending) string
  success_str = e.success
  ch, pending_str = e.failed.split(//, 2)
  # split returns a single element if nothing follows the bad character
  pending_str ||= ''
  # code point of the bad character (String#to_i would yield 0 for
  # anything but a digit); assumes ch is one complete utf-8 character
  ch_int = ch.unpack('U').first

  # see if we have a map for that character
  if String::UNICODE_LUT.has_key?(ch_int)
    # we have a mapping for this character, so convert it and
    # re-process the string

    # log status
    err_str = _('converting unicode')
    $log.warn(meth) { "#{err_str} ##{ch_int}" }

    # create new string, with the bad character mapped
    element_text = success_str + String::UNICODE_LUT[ch_int] + pending_str
  else
    if $config['iconv_munge_illegal']
      # munge the illegal character with a safe string

      # log status
      err_str = _('munging unicode')
      $log.warn(meth) { "#{err_str} ##{ch_int}" }

      # create new string, with the bad character munged
      munge_str = $config['unicode_munge_str']
      element_text = success_str + munge_str + pending_str
    else
      # just drop the character altogether

      # log status
      err_str = _('dropping unicode')
      $log.warn(meth) { "#{err_str} ##{ch_int}" }

      # create new string, sans the bad character
      element_text = success_str + pending_str
    end
  end

  retry
end
Not a perfect solution, but it helps a bit.
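For anyone reading this on a Ruby without Iconv (it is no longer in the standard library as of 2.0), the same three-way fallback (map via a lookup table, else munge, else drop) can be approximated with String#encode. This is only a sketch under assumptions, not Raggle's code: convert_with_fallback and SAMPLE_LUT are hypothetical names, and the table entries are examples only.

```ruby
# Hypothetical lookup table keyed by Unicode code point (example
# entries only; the real String::UNICODE_LUT is not shown in this post).
SAMPLE_LUT = {
  0x2018 => "'",   # left single quotation mark
  0x2019 => "'",   # right single quotation mark
  0x2014 => "--"   # em dash
}

# Convert UTF-8 text to target one character at a time. For a character
# that cannot be converted: use the table entry if there is one, else
# the munge string if given, else drop the character.
def convert_with_fallback(text, target, lut, munge = nil)
  text.each_char.map { |ch|
    begin
      ch.encode(target)
    rescue Encoding::UndefinedConversionError,
           Encoding::InvalidByteSequenceError
      # invalid byte sequences have no code point, so skip the table
      mapped = ch.valid_encoding? ? lut[ch.ord] : nil
      (mapped || munge || '').encode(target)
    end
  }.join
end

convert_with_fallback("it\u2019s \u2603 ok", 'ISO-8859-1', SAMPLE_LUT, '?')
# => "it's ? ok"
```

Going character by character is slow but sidesteps the multibyte problem mentioned above, since each_char never splits a valid UTF-8 character.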
* Andreas S. (f@andreas-s.net) wrote:
--
Paul Duncan <pabs@pablotron.org> pabs in #ruby-lang (OPN IRC)
http://www.pablotron.org/ OpenPGP Key ID: 0x82C29562