Reliable character encodings conversion

Hubert_Lepicki · 30 September 2008 12:30

Hi,

I am looking for a reliable and error-resistant way to convert character
encodings to UTF-8. The input encodings vary, and I have fairly good input
encoding detection in place.

I am using the Iconv library wrapper to convert texts to UTF-8, but it is
throwing an "Iconv::IllegalSequence" exception. The problem is that the input
texts are user-generated and sometimes contain mixed character
encodings.

Does anyone have experience with this kind of situation, or can
anyone suggest alternative libraries?

Thanks,
Hubert

--
Regards,
Hubert Łępicki
-----------------------------------------------
[ http://hubertlepicki.com ]

James_Edward_Gray_II · 30 September 2008 13:04

You can add a //TRANSLIT to the end of the "to" encoding to have Iconv attempt to convert characters to reasonable equivalents in that encoding. This is usually more helpful when your input is all one encoding and just has some characters that won't translate well (like a UTF-8 … going to ISO-8859-1).

Your case of mixed encodings is probably best handled with //IGNORE instead, which asks Iconv to skip over any characters that cannot be converted. You will lose some data with this, but it will convert what it can.

You can also use //TRANSLIT//IGNORE to convert what can be converted and skip the rest.
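To make the difference between the two behaviours concrete, here is a minimal sketch. It does not use Iconv (which is no longer bundled with current Ruby versions) but the built-in String#encode, so it only approximates what //IGNORE and //TRANSLIT do. The helper names and the one-entry substitution table are assumptions made for illustration.

```ruby
# Hypothetical helpers, not from the thread. They use Ruby's built-in
# String#encode (Ruby 1.9+) as a stand-in for Iconv.

# Roughly like //IGNORE: drop characters the target encoding cannot hold.
def encode_dropping(text, to)
  text.encode(to, invalid: :replace, undef: :replace, replace: "")
end

# Roughly like //TRANSLIT, but with a hand-picked table (an assumption;
# Iconv's own transliteration rules are far more extensive).
SUBSTITUTES = { "\u2026" => "..." }.freeze

# Characters missing from the target are looked up in SUBSTITUTES;
# anything not listed there becomes "?".
def encode_substituting(text, to)
  text.encode(to, fallback: ->(char) { SUBSTITUTES.fetch(char, "?") })
end

# The ellipsis (U+2026) does not exist in ISO-8859-1; the pound sign does.
sample = "Wait\u2026 \u00A35"
dropped = encode_dropping(sample, "ISO-8859-1")
substituted = encode_substituting(sample, "ISO-8859-1")
```

Dropping is the safer choice when the input may be damaged, while substituting keeps more of the meaning when the input is clean and only the target encoding is too small.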

Hope that helps.

James Edward Gray II

Hubert_Lepicki · 30 September 2008 13:20

Thanks, //IGNORE//TRANSLIT seems to help a bit, but it's not perfect.
I am losing characters like the British pound sign that were placed in
text marked as us-ascii, for example. Is there some smart library out there
that can help with common problems like this one?

I have noticed that there is the ICU library (http://www.icu-project.org/)
for C++ that I could use if it's any smarter. Has anyone had any
experience with it?
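One possible way to keep a stray pound sign in text labelled us-ascii is to treat the detected encoding as a first guess only and try a short list of candidates before giving up. The sketch below uses Ruby's built-in encoding support (Ruby 2.1+ for String#scrub) rather than Iconv or ICU; the helper name and the candidate list are assumptions, not something from this thread.

```ruby
# Hypothetical helper, not from the thread: return the text converted to
# UTF-8 using the first candidate encoding under which the bytes are valid.
def convert_with_candidates(bytes, candidates)
  candidates.each do |name|
    begin
      text = bytes.dup.force_encoding(name)
      next unless text.valid_encoding?
      return text.encode("UTF-8")
    rescue ArgumentError, EncodingError
      next # unknown encoding name, or a byte with no mapping; try the next one
    end
  end
  # Last resort: keep whatever is valid UTF-8 and mark the rest with "?".
  bytes.dup.force_encoding("UTF-8").scrub("?")
end

# 0xA3 is the pound sign in ISO-8859-1 and Windows-1252, but it is not
# valid US-ASCII and not valid UTF-8 on its own.
raw = "cost: \xA3 5".b
utf8 = convert_with_candidates(raw, ["US-ASCII", "UTF-8", "Windows-1252"])
```

The order of the candidates matters: single-byte encodings such as Windows-1252 accept almost any byte, so they belong at the end of the list, after the stricter ones have been ruled out.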

Best,
H.

James_Edward_Gray_II · 30 September 2008 13:58

You listed those backwards. Is that really what you tried? Does reversing them make any difference?

James Edward Gray II

Marcin_Raczkowski · 30 September 2008 14:34

You can use the RChardet library.

Here's what I use:

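require 'iconv'
require 'rchardet'

class String
   # Note: on Ruby 1.9+ this shadows the built-in String#encoding.
   def encoding
     @encoding ||= guess_encoding
   end

   def encoding=(new)
     @encoding = new
   end

   # Converts in place from the current (or guessed) encoding.
   def convert_to(new)
     self.replace(Iconv.iconv(new, encoding, self)[0])
     @encoding = new
   end

   def guess_encoding
     @encoding = CharDet.guess(self)["encoding"]
   end

   # this enables "foo".convert "us-ascii" => "utf-8"
   def convert(hash)
     from = hash.keys[0]
     to = hash[from]
     self.replace(Iconv.iconv(to.to_s, from.to_s, self)[0])
     # remember the new encoding so a later convert_to starts from it
     @encoding = to.to_s
   end
end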

It handles translation pretty well.
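A variation on the same idea that avoids reopening String is sketched below. It is only an illustration under stated assumptions: `detector` stands for any callable that returns a Hash with "encoding" and "confidence" keys (a stub is used here instead of a real detection library), and the helper name, the confidence threshold and the Windows-1252 default are all made up for the example. It uses String#encode rather than Iconv.

```ruby
# Hypothetical helper: convert bytes to UTF-8 using a detector's guess,
# falling back to a default encoding when the guess is missing, weak,
# or not a known encoding name.
def detect_and_convert(bytes, detector, min_confidence: 0.5, default: "Windows-1252")
  guess = detector.call(bytes) || {}
  name = guess["encoding"]
  name = default if name.nil? || guess["confidence"].to_f < min_confidence
  text = bytes.dup.force_encoding(name)
  text.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?").scrub("?")
rescue ArgumentError, EncodingError
  bytes.dup.force_encoding(default)
       .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Stub detectors standing in for a real library (assumed result shape).
confident = ->(_bytes) { { "encoding" => "ISO-8859-2", "confidence" => 0.9 } }
unsure    = ->(_bytes) { { "encoding" => "ISO-8859-2", "confidence" => 0.1 } }

raw = "\xA3\xEApicki".b   # "Łępicki" encoded as ISO-8859-2
trusted  = detect_and_convert(raw, confident)
fallback = detect_and_convert(raw, unsure)
```

With the confident stub the bytes are read as ISO-8859-2; with the unsure one the default is used instead, which shows how much the result depends on the guess.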
