Iconv problem - not handling \r correctly

Louise_Rains · 26 October 2008 21:48

I have an XML file that I need to process. I'm working in the Windows
environment. Here is the head of the file:

0000000000 ■ < \0 A \0 u \0 t \0 o \0 S \0 t \0
0000000020 a \0 t \0 \0 J \0 a \0 v \0 a \0 C \0
0000000040 l \0 a \0 s \0 s \0 = \0 " \0 c \0 o \0
0000000060 m \0 . \0 a \0 u \0 t \0 o \0 s \0 i \0
0000000100 m \0 . \0 a \0 s \0 t \0 . \0 a \0 u \0
0000000120 t \0 o \0 m \0 o \0 d \0 . \0 A \0 M \0
0000000140 H \0 e \0 a \0 d \0 e \0 r \0 " \0 > \0
0000000160 \r \0 \n \0 \t \0 < \0 B \0 a \0 s \0 e \0
0000000200 A \0 u \0 t \0 o \0 S \0 t \0 a \0 t \0
0000000220 \0 J \0 a \0 v \0 a \0 C \0 l \0 a \0

Notice the character sequence \r \0 \n \0.

I need to edit some of the text elements in this file. I have used both
REXML and Hpricot to edit the file successfully, after converting to
UTF-8. Here is the head of the UTF-8 file:

0000000000 < A u t o S t a t J a v a C l
0000000020 a s s = ' c o m . a u t o s i m
0000000040 . a s t . a u t o m o d . A M H
0000000060 e a d e r ' > \r \n \t < B a s e A
0000000100 u t o S t a t J a v a C l a s
0000000120 s = ' c o m . a u t o s i m . a
0000000140 s t . a u t o m o d . A M H e a
0000000160 d e r ' S a v e F i l e V e r
0000000200 s i o n = ' 1 . 3 ' > \r \n \t \t <
0000000220 P r o p e r t i e s J a v a C

Notice that \r \n shows up in the next to last line.

Now in order for the edited XML file to work with my original
application, I need to convert back to UTF-16. Here is the code that I
use:

file = File.read("sta_utf8.xml")
conv = Iconv.new("UTF-16LE", "UTF-8")
result = conv.iconv(file);
result= 0xFF.chr << 0xFE.chr << result
file = File.new("sta_utf16.xml", "w")
file.write(result)
file.close

The resulting file (sta_utf16.xml) looks like:

0000000000 ■ < \0 A \0 u \0 t \0 o \0 S \0 t \0
0000000020 a \0 t \0 \0 J \0 a \0 v \0 a \0 C \0
0000000040 l \0 a \0 s \0 s \0 = \0 ' \0 c \0 o \0
0000000060 m \0 . \0 a \0 u \0 t \0 o \0 s \0 i \0
0000000100 m \0 . \0 a \0 s \0 t \0 . \0 a \0 u \0
0000000120 t \0 o \0 m \0 o \0 d \0 . \0 A \0 M \0
0000000140 H \0 e \0 a \0 d \0 e \0 r \0 ' \0 > \0
0000000160 \r \n \0 \t \0 < \0 B \0 a \0 s \0 e \0 A
0000000200 \0 u \0 t \0 o \0 S \0 t \0 a \0 t \0
0000000220 \0 J \0 a \0 v \0 a \0 C \0 l \0 a \0 s

Notice that the \r does not have a \0 following it. This means that
every other line in my sta_utf16.xml file is in the wrong byte order and
I get garbled results:

<AutoStat
JavaClass='com.autosim.ast.automod.AMHeader'>਍ऀ㰀䈀愀猀攀䄀甀琀漀匀琀愀琀䨀愀瘀愀䌀氀愀猀猀㴀✀挀漀洀⸀愀甀琀漀猀椀洀⸀愀猀琀⸀愀甀琀漀洀漀搀⸀䄀䴀䠀攀愀搀攀爀✀ 匀愀瘀攀䘀椀氀攀嘀攀爀猀椀漀渀㴀✀㄀⸀㌀✀㸀ഀ
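
The byte shift described above can be reproduced without Windows. This is only a minimal sketch, not the code from the post: it uses String#encode (Ruby 1.9 and later) instead of Iconv, and it imitates a text-mode write with a plain gsub, which is an assumption about the newline translation rather than a call into it. The sample string and the hex helper are made up for illustration.

```ruby
# Hypothetical helper: render a byte string as space-separated hex.
def hex(bytes)
  bytes.unpack("C*").map { |b| format("%02x", b) }.join(" ")
end

# UTF-16LE bytes for a short two-line sample, kept as a binary string.
utf16 = "a>\nb".encode("UTF-16LE").b

# Assumption: a text-mode write turns every 0x0A byte into 0x0D 0x0A.
damaged = utf16.gsub("\n".b, "\r\n".b)

puts hex(utf16)    # → 61 00 3e 00 0a 00 62 00
puts hex(damaged)  # → 61 00 3e 00 0d 0a 00 62 00

# The pair that used to hold the newline now decodes as one wrong character.
puts format("U+%04X", damaged[4, 2].unpack("v").first)  # → U+0A0D
```

U+0A0D is the first garbled character in the output above, which is consistent with a single extra \r byte pushing the rest of the line out of alignment.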

Is this a defect in Iconv?

Thanks,
LG

--
Posted via http://www.ruby-forum.com/.

Heesob_Park · 27 October 2008 04:55

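No, it's not a defect in Iconv. It comes from how Windows handles files
opened in text mode: on write, every \n byte is expanded to \r \n (and
on read, \r \n is collapsed to \n). In UTF-16 data that inserts a lone
\r byte with no \0 after it, so the bytes that follow are out of
alignment.

Use the binary flag when opening both files, like this: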

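require 'iconv'

# Read in binary mode so \r \n in the input is left untouched.
utf8 = File.open("sta_utf8.xml", "rb") { |f| f.read }
conv = Iconv.new("UTF-16LE", "UTF-8")
result = conv.iconv(utf8)
# Prepend the UTF-16LE byte order mark (FF FE).
result = 0xFF.chr << 0xFE.chr << result
# Write in binary mode so no \r is inserted before \n bytes.
File.open("sta_utf16.xml", "wb") { |f| f.write(result) }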
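
For readers on a newer Ruby, where Iconv is no longer part of the standard library (it was removed in Ruby 2.0), the same idea can be sketched with String#encode. This is an illustrative variant under that assumption, not what was posted in the thread; the helper name utf8_file_to_utf16le and the sample content are hypothetical.

```ruby
require "tmpdir"

# Hypothetical helper: convert a UTF-8 file to UTF-16LE with a BOM.
# Both ends use binary I/O, so no newline translation can take place.
def utf8_file_to_utf16le(src, dst)
  utf8  = File.binread(src).force_encoding("UTF-8")
  utf16 = utf8.encode("UTF-16LE")
  bom   = "\xFF\xFE".b                 # UTF-16LE byte order mark
  File.binwrite(dst, bom + utf16.b)
end

Dir.mktmpdir do |dir|
  src = File.join(dir, "sta_utf8.xml")
  dst = File.join(dir, "sta_utf16.xml")
  File.binwrite(src, "<a>\r\n\t<b/>\r\n</a>")   # made-up sample input

  utf8_file_to_utf16le(src, dst)

  # Every \r and \n should now be followed by a \0 byte.
  puts File.binread(dst).unpack("C*").map { |b| format("%02x", b) }.join(" ")
end
```

File.binread and File.binwrite open the file in binary mode themselves, so there is no mode string to get wrong.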

Regards,

Park Heesob

Louise_Rains · 27 October 2008 11:12

It looks like IO#binmode does the same thing as well:

# binmode switches the handle to binary, so plain "w" is enough here
file = File.new("sta_utf16.xml", "w")
file.binmode
file.write(result)
file.close
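
A small self-check of the binmode route, written as a sketch rather than a definitive recipe. The helper write_with_binmode and the sample payload are hypothetical, and String#encode stands in for Iconv. On non-Windows systems the comparison passes regardless, since no newline translation exists there; the point is that binmode is called before the first write.

```ruby
require "tmpdir"

# Hypothetical helper: open with plain "w", then switch the handle to
# binary with IO#binmode before anything is written.
def write_with_binmode(path, payload)
  File.open(path, "w") do |file|
    file.binmode
    file.write(payload)
  end
end

# BOM followed by a UTF-16LE sample that contains a \r\n pair.
payload = "\xFF\xFE".b + "x\r\ny".encode("UTF-16LE").b

Dir.mktmpdir do |dir|
  path = File.join(dir, "sta_utf16.xml")
  write_with_binmode(path, payload)
  # Read back in binary mode and compare byte for byte.
  puts File.binread(path) == payload   # → true
end
```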

Thanks!
