How are people making use of Iconv?

Wilson_Bilkovich · 21 December 2005 05:16

Since Iconv jumped out of the pond and chewed on my leg the other
week, I've been toying with the idea of a character-set conversion
library implemented totally in Ruby, with identical behavior on every
platform.
However, I'm only using Iconv for simple things, like converting my
music tags from Shift-JIS to UTF-8.

What 'serious' things are people using this for? Are there any unit
tests? Any gems on RubyForge I can download containing projects that
make use of Iconv? What do you hate about Iconv?

Thanks,
--Wilson.

Andreas_S1 · 21 December 2005 10:56

Wilson Bilkovich wrote:

> Since Iconv jumped out of the pond and chewed on my leg the other
> week, I've been toying with the idea of a character-set conversion
> library implemented totally in Ruby, with identical behavior on every
> platform.
> However, I'm only using Iconv for simple things, like converting my
> music tags from Shift-JIS to UTF-8.

Well, that's all that Iconv is supposed to be used for.

> What 'serious' things are people using this for? Are there any unit
> tests? Any gems on RubyForge I can download containing projects that
> make use of Iconv?

Rails uses Iconv, at least in ActionMailer.

> What do you hate about Iconv?

I dislike that Iconv raises an exception when it finds characters it can
not convert. I would prefer if it could be made to ignore invalid
characters and just try to make the best of the text.

···

--
Posted via http://www.ruby-forum.com/.

Paul_Duncan · 21 December 2005 14:54

[snipped]

> I dislike that Iconv raises an exception when it finds characters it can
> not convert. I would prefer if it could be made to ignore invalid
> characters and just try to make the best of the text.

Seconded, Thirded, and Quadrupled.

Iconv needs an "as close as I could get with transliteration and ignoring
invalid characters" mode.

We're doing something comparable in Raggle by trapping the exception and
stripping out the invalid character. Obviously this doesn't work
properly for multibyte characters, and you won't be able to use a lookup
table for arbitrary source encodings, but it's a start.

    begin
      # convert element_text to native charset (note: in this case we're
      # converting from utf-8 to the native charset, but the only thing
      # about the code that's utf-8 specific is the assumption about
      # character width and the unicode lookup table below)
      ret = $iconv.iconv(element_text) << $iconv.iconv(nil)
    rescue Iconv::IllegalSequence => e
      # save the portion of the string that was successful, the
      # invalid character, and the remaining (pending) string
      success_str = e.success
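      # split off the first character; the //u flag keeps a multibyte
      # UTF-8 character in one piece
      ch, pending_str = e.failed.split(//u, 2)
      pending_str ||= ''
      # String#to_i yields 0 for any non-digit, so take the code point
      ch_int = ch.unpack('U').first

      # see if we have a map for that character
      if String::UNICODE_LUT.has_key?(ch_int)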
        # we have a mapping for this character, so convert it and
        # re-process the string

        # log status
        err_str = _('converting unicode')
        $log.warn(meth) { "#{err_str} ##{ch_int}" }

        # create new string, with the bad character mapped
        element_text = success_str + String::UNICODE_LUT[ch_int] + pending_str
      else
        if $config['iconv_munge_illegal']
          # munge the illegal character with a safe string

          # log status
          err_str = _('munging unicode')
          $log.warn(meth) { "#{err_str} ##{ch_int}" }

          # create new string, with the bad character munged
          munge_str = $config['unicode_munge_str']
          element_text = success_str + munge_str + pending_str
        else
          # just drop the character altogether

          # log status
          err_str = _('dropping unicode')
          $log.warn(meth) { "#{err_str} ##{ch_int}" }

          # create new string, sans the bad character
          element_text = success_str + pending_str
        end
      end
      retry
    end

Not a perfect solution, but it helps a bit.

···

* Andreas S. (f@andreas-s.net) wrote:

--
Paul Duncan <pabs@pablotron.org> pabs in #ruby-lang (OPN IRC)
http://www.pablotron.org/ OpenPGP Key ID: 0x82C29562

Wilson_Bilkovich · 21 December 2005 15:55

What if String just had a couple of new methods on it:

    String#transcode(from_encoding, to_encoding)

..and

    String#transcode!(from_encoding, to_encoding)

..and the "modifies receiver" version returned true or false,
depending on whether it managed to convert every character?
Then you could do:

    unless some_string.transcode!('Shift-JIS', 'UTF-8')
      puts "Some characters got mangle-fied!"
    end

Is that a mess? I kinda like it, at first glance.
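
To make the proposal concrete, here is a rough sketch of how the two methods could behave. It is only an illustration: it leans on String#encode, which arrived in later Ruby versions and was not available when this was posted, and String#transcode and String#transcode! remain hypothetical names from the proposal, not real core methods.

```ruby
# Hypothetical String#transcode / #transcode!, sketched on top of
# String#encode (an assumption: that method postdates this thread).
class String
  # Returns a new string converted from from_encoding to to_encoding,
  # silently dropping anything that cannot be converted.
  def transcode(from_encoding, to_encoding)
    encode(to_encoding, from_encoding,
           invalid: :replace, undef: :replace, replace: '')
  end

  # Converts the receiver in place. Returns true if every character
  # converted cleanly, false if some had to be dropped.
  def transcode!(from_encoding, to_encoding)
    replace(encode(to_encoding, from_encoding)) # strict attempt first
    true
  rescue Encoding::UndefinedConversionError,
         Encoding::InvalidByteSequenceError
    replace(transcode(from_encoding, to_encoding)) # lossy second attempt
    false
  end
end

clean = 'plain text'.dup
lossy = "caf\u00e9".dup # "café"

ok_clean = clean.transcode!('UTF-8', 'US-ASCII')
ok_lossy = lossy.transcode!('UTF-8', 'US-ASCII')

puts "Some characters got mangle-fied!" unless ok_lossy
```

The strict conversion is tried first so that the return value really does mean "nothing was lost"; the lossy pass only runs when the strict one raises.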

Christian_Neukirche1 · 21 December 2005 19:24

Paul Duncan <pabs@pablotron.org> writes:

···

* Andreas S. (f@andreas-s.net) wrote:
[snipped]

>> I dislike that Iconv raises an exception when it finds characters it can
>> not convert. I would prefer if it could be made to ignore invalid
>> characters and just try to make the best of the text.
>
> Seconded, Thirded, and Quadrupled.
>
> Iconv needs an "as close as I could get with transliteration and ignoring
> invalid characters" mode.

Can't you just use //IGNORE?
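
With Iconv itself, the //IGNORE suffix is appended to the target encoding name given to Iconv.new. For readers on a current Ruby, where Iconv is no longer bundled, the following is a minimal sketch of comparable "drop whatever cannot be converted" behaviour. It assumes String#encode, which postdates this thread, and the helper name convert_ignoring is made up for the example.

```ruby
# Sketch only: String#encode is a later Ruby facility, not something
# available when this thread was written. The helper name is hypothetical.
def convert_ignoring(text, from, to)
  # invalid: bytes that are malformed in the source encoding
  # undef:   characters with no equivalent in the target encoding
  # Both are replaced with the empty string, i.e. dropped.
  text.encode(to, from, invalid: :replace, undef: :replace, replace: '')
end

utf8  = "caf\u00e9 \u266b" # "café ♫"
ascii = convert_ignoring(utf8, 'UTF-8', 'US-ASCII')
puts ascii.inspect
```

Dropping characters silently is convenient but lossy, so it suits display text better than data that has to round-trip.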

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Paul_Duncan · 21 December 2005 17:33

I know a future version of Ruby (2.0?) will make a distinction between
strings as arrays of bytes and strings as sets of characters with an
encoding (with the former being an obvious superset of the latter), so
I'm not sure how well that method would work with the new way of
handling strings.

That said, I like the idea, although I'd like an optional block to
handle unknown characters. I'd also add a hash as an optional third
argument which allows you to toggle transliteration, munging, and
exception behavior.
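
As a rough illustration of the block-plus-options idea, here is one possible shape. Everything about it is an assumption made for the example: the helper name transcode_text, the :raise and :munge option keys, and the use of String#encode, which only exists in later Ruby versions. Transliteration is left out of the sketch.

```ruby
# Hypothetical helper; the name and the option keys are illustrative only.
def transcode_text(text, from, to, opts = {}, &on_unknown)
  if on_unknown
    # The block decides what each unconvertible character becomes.
    text.encode(to, from, fallback: on_unknown)
  elsif opts[:raise]
    # Strict mode: let the conversion error propagate.
    text.encode(to, from)
  else
    # Munge mode: substitute a fixed string (default: drop the character).
    text.encode(to, from, invalid: :replace, undef: :replace,
                          replace: opts.fetch(:munge, ''))
  end
end

text = "caf\u00e9 \u266b" # "café ♫"

# Block form: tag each unknown character with its code point.
tagged = transcode_text(text, 'UTF-8', 'US-ASCII') do |ch|
  format('[U+%04X]', ch.ord)
end

# Options form: munge unknown characters to a safe string.
munged = transcode_text(text, 'UTF-8', 'US-ASCII', munge: '?')

puts tagged
puts munged
```

Note that in this sketch the block only sees characters that have no equivalent in the target encoding; malformed input bytes would still raise in the block form.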

Paul_Duncan · 21 December 2005 19:30

I wasn't aware of "//IGNORE". I'll check it out. Thanks!

Paul_Duncan · 21 December 2005 19:35

You sir, are a genius. That works great here.
