Unicode Question

Hello all,

I downloaded a file from a website, and one of the lines looks like this:

row = "\0001\0002\0003\000\t\0002\0004\0008\0004\0005\000\t\000C\000o
\000m\000p\000U\000S\000A\000\t\000W
\0006\0003\0001\0000\0003\0000\0006\0000\0001\0000\0001\000\t
\0002\0000\0000\0009\000-\0000\0003\000-\0002\0007\000\t
\0001\0005\000:\0005\0001\000:\0000\0000\000\t\000U\000L\000T\000-\000D
\0002\000P\000K\000\t\0000\000.\0005\000\t\0001\000\t
\0000\000.\0000\0001\000\t\0002\0000\0000\0009\000-\0000\0003\000-
\0002\0008\000\t\0001\0003\000:\0003\0006\000:\0002\0004\000\r\000\n"

On my Mac, if I do:

Iconv.iconv("UTF8", "UCS-2", row)

I get:

["123\t24845\tCompUSA\tW63103060101\t2009-03-27\t15:51:00\tULT-D2PK
\t0.5\t1\t0.01\t2009-03-28\t13:36:24\r\n"]

Which is exactly right. On the production Linux box (Ubuntu 8.04),
doing the same thing yields:

["㄀㈀㌀ऀ㈀㐀㠀㐀㔀ऀ䌀漀洀瀀唀匀䄀ऀ圀㘀㌀㄀ ㌀ 㘀 ㄀ ㄀ऀ㈀ 㤀ⴀ ㌀ⴀ㈀㜀ऀ㄀㔀㨀㔀㄀㨀 ऀ唀䰀吀ⴀ䐀㈀倀䬀ऀ ⸀㔀ऀ㄀ऀ
⸀ ㄀ऀ㈀ 㤀ⴀ ㌀ⴀ㈀㠀ऀ㄀㌀㨀㌀㘀㨀㈀㐀ഀ਀"]

I figured it out and fixed it by doing:

Iconv.iconv("UTF8", "UCS-2BE", row)

Which works on both environments. I fixed it by reading about encoding
and trial and error, so I'm left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

Thanks,
Ahmed

Hi,

At Thu, 23 Apr 2009 10:10:45 +0900,
Daly wrote in [ruby-talk:334732]:

I figured it out and fixed it by doing:

Iconv.iconv("UTF8", "UCS-2BE", row)

Which works on both environments. I fixed it by reading about encoding
and trial and error, so I'm left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

It seems to be a bug in Ubuntu's iconv. According to the Unicode
Consortium, UCS-2 defaults to big endian if no BOM is present.

Specify the endianness explicitly anyway. Ruby 1.9 does not provide UCS
encoding names without an endianness.
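To make the byte-order point concrete, here is a minimal sketch using only core String methods (Ruby 1.9 or later) rather than Iconv. The helper name decode_ucs2 is hypothetical, and it uses the UTF-16BE/UTF-16LE encoding names, which are byte-for-byte identical to UCS-2 for characters in the Basic Multilingual Plane. It is an illustration of the rule described above, not a drop-in replacement for iconv.

```ruby
# The first three characters of the row from the original post: 00 31 00 32 00 33
raw = "\x001\x002\x003".b

# Read as big endian, each pair 00 xx is the ASCII character xx.
as_be = raw.dup.force_encoding("UTF-16BE").encode("UTF-8")

# Read as little endian, each pair becomes U+xx00, which is how "1"
# turns into U+3100 and so on, matching the garbage seen on Linux.
as_le = raw.dup.force_encoding("UTF-16LE").encode("UTF-8")

# Hypothetical helper: honour a BOM if one is present, otherwise
# assume big endian.
def decode_ucs2(bytes)
  data = bytes.b
  if data.start_with?("\xFE\xFF".b)      # big-endian BOM
    data[2..-1].force_encoding("UTF-16BE").encode("UTF-8")
  elsif data.start_with?("\xFF\xFE".b)   # little-endian BOM
    data[2..-1].force_encoding("UTF-16LE").encode("UTF-8")
  else                                   # no BOM: default to big endian
    data.force_encoding("UTF-16BE").encode("UTF-8")
  end
end

puts as_be
puts decode_ucs2(raw)
```

Naming the byte order explicitly, as in the "UCS-2BE" fix, avoids depending on whatever default a particular iconv build picks.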

···

--
Nobu Nakada

I've been trying to cover character encodings with a heavy Ruby slant on my blog for just this reason:

My coverage is about 98% complete now, in case you want to browse a bit.

Of course, in this case it just seems that you hit a bug as Nobu said.

James Edward Gray II

···

On Apr 22, 2009, at 8:10 PM, Daly wrote:

I fixed it by reading about encoding
and trial and error, so I'm left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

Hi James,

75% of my research was done on your blog :)

Thanks for your answers.

···

On Apr 22, 10:37 pm, James Gray <ja...@grayproductions.net> wrote:

On Apr 22, 2009, at 8:10 PM, Daly wrote:

> I fixed it by reading about encoding
> and trial and error, so I'm left with a working solution, but not
> knowing why it works on the Mac but not in Linux. Could someone please
> explain?

I've been trying to cover character encodings with a heavy Ruby
slant on my blog for just this reason:

My coverage is about 98% complete now, in case you want to browse a bit.

Of course, in this case it just seems that you hit a bug as Nobu said.

James Edward Gray II

Hi James,

I was wondering if you wanted to add a note on how to deal with
potentially unknown character encodings. This was one of the more
annoying problems that I hit with trying to use Iconv directly for ruby
1.8. In the past, I'd used the Mozilla character detection library (in
Java) when doing processing of XML in order to ensure that the file
encoding matched the declaration in the <?xml > header.

Fortunately, I recently found the rchardet gem which is the port of this
library to Ruby, and it's helped me deal with giving more appropriate
encoding information to Iconv.

Usage goes something like this:

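require 'rchardet'
require 'iconv'

cd = CharDet.detect(text)   # text holds the raw bytes read from the file
encoding = cd['encoding']
puts "Reading detected encoding '#{encoding}' text with confidence: %.2f%%" % [cd['confidence'] * 100]
iconv = Iconv.new("UTF-8", encoding)
utf8_text = iconv.iconv(text)   # perform the conversion before reporting success
puts "Conversion to UTF-8 successful."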

This time, I needed this sort of thing when trying to ensure I could
load arbitrary text files from unknown sources into GTK+ widgets.
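For comparison, here is a rough sketch of a much cruder fallback that needs no gem: try a short list of candidate encodings and keep the first one under which the bytes are valid and convertible. This is not what rchardet does (it uses statistical detection), the helper name to_utf8 and the candidate list are assumptions made for illustration, and it relies on core String methods from Ruby 1.9 or later.

```ruby
# Assumed candidate list, most restrictive first. A single-byte encoding
# such as Windows-1252 accepts almost any input, so it has to come last.
CANDIDATES = ["UTF-8", "Windows-1252"].freeze

# Hypothetical helper: returns [utf8_string, encoding_name] for the first
# candidate that fits, or nil if none does.
def to_utf8(bytes, candidates = CANDIDATES)
  candidates.each do |name|
    attempt = bytes.dup.force_encoding(name)
    next unless attempt.valid_encoding?
    begin
      return [attempt.encode("UTF-8"), name]
    rescue EncodingError
      next   # valid byte sequence, but no mapping to Unicode
    end
  end
  nil
end

text, guessed = to_utf8("caf\xE9".b)
puts "Guessed #{guessed}: #{text}"
```

A validity check like this only rules encodings out; it cannot tell two single-byte encodings apart, which is where a detector that reports a confidence value is more useful.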

I've actually rarely been in the case where I knew the encoding of the
input I was trying to deal with if it wasn't the same as the system
default, but maybe that's just me... ;)

The referenced blog post looks really good. Thanks for your efforts.

Cheers,

ast

···

On Thu, 2009-04-23 at 11:37 +0900, James Gray wrote:

On Apr 22, 2009, at 8:10 PM, Daly wrote:

> I fixed it by reading about encoding
> and trial and error, so I'm left with a working solution, but not
> knowing why it works on the Mac but not in Linux. Could someone please
> explain?

I've been trying to cover character encodings with a heavy Ruby
slant on my blog for just this reason:

My coverage is about 98% complete now, in case you want to browse a bit.

--
Andrew S. Townley <ast@atownley.org>
http://atownley.org

Thanks for the information. I just added a comment with a link to your email here:

James Edward Gray II

···

On Apr 24, 2009, at 6:05 AM, Andrew S. Townley wrote:

I was wondering if you wanted to add a note on how to deal with
potentially unknown character encodings.