I have a UTF-16 XML file that I need to process on Windows. Here is the
head of the file:
0000000000 ■ < \0 A \0 u \0 t \0 o \0 S \0 t \0
0000000020 a \0 t \0 \0 J \0 a \0 v \0 a \0 C \0
0000000040 l \0 a \0 s \0 s \0 = \0 " \0 c \0 o \0
0000000060 m \0 . \0 a \0 u \0 t \0 o \0 s \0 i \0
0000000100 m \0 . \0 a \0 s \0 t \0 . \0 a \0 u \0
0000000120 t \0 o \0 m \0 o \0 d \0 . \0 A \0 M \0
0000000140 H \0 e \0 a \0 d \0 e \0 r \0 " \0 > \0
0000000160 \r \0 \n \0 \t \0 < \0 B \0 a \0 s \0 e \0
0000000200 A \0 u \0 t \0 o \0 S \0 t \0 a \0 t \0
0000000220 \0 J \0 a \0 v \0 a \0 C \0 l \0 a \0
Notice the character sequence \r \0 \n \0.
I need to edit some of the text elements in this file. I have used both
REXML and Hpricot to edit the file successfully, after converting to
UTF-8. Here is the head of the UTF-8 file:
0000000000 < A u t o S t a t J a v a C l
0000000020 a s s = ' c o m . a u t o s i m
0000000040 . a s t . a u t o m o d . A M H
0000000060 e a d e r ' > \r \n \t < B a s e A
0000000100 u t o S t a t J a v a C l a s
0000000120 s = ' c o m . a u t o s i m . a
0000000140 s t . a u t o m o d . A M H e a
0000000160 d e r ' S a v e F i l e V e r
0000000200 s i o n = ' 1 . 3 ' > \r \n \t \t <
0000000220 P r o p e r t i e s J a v a C
Notice that \r \n shows up in the next to last line.
Now in order for the edited XML file to work with my original
application, I need to convert back to UTF-16. Here is the code that I
use:
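require 'iconv'

# Read the edited UTF-8 file and convert it to UTF-16LE.
file = File.read("sta_utf8.xml")
conv = Iconv.new("UTF-16LE", "UTF-8")
result = conv.iconv(file)
# Prepend the UTF-16LE byte order mark (FF FE).
result = 0xFF.chr << 0xFE.chr << result
file = File.new("sta_utf16.xml", "w")
file.write(result)
file.close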
The resulting file (sta_utf16.xml) looks like:
0000000000 ■ < \0 A \0 u \0 t \0 o \0 S \0 t \0
0000000020 a \0 t \0 \0 J \0 a \0 v \0 a \0 C \0
0000000040 l \0 a \0 s \0 s \0 = \0 ' \0 c \0 o \0
0000000060 m \0 . \0 a \0 u \0 t \0 o \0 s \0 i \0
0000000100 m \0 . \0 a \0 s \0 t \0 . \0 a \0 u \0
0000000120 t \0 o \0 m \0 o \0 d \0 . \0 A \0 M \0
0000000140 H \0 e \0 a \0 d \0 e \0 r \0 ' \0 > \0
0000000160 \r \n \0 \t \0 < \0 B \0 a \0 s \0 e \0 A
0000000200 \0 u \0 t \0 o \0 S \0 t \0 a \0 t \0
0000000220 \0 J \0 a \0 v \0 a \0 C \0 l \0 a \0 s
Notice that the \r is followed directly by \n, with no \0 in between.
Everything after that point is shifted by one byte, so every other line
in my sta_utf16.xml file is read in the wrong byte order and I get
garbled results:
<AutoStat
JavaClass='com.autosim.ast.automod.AMHeader'>ऀ㰀䈀愀猀攀䄀甀琀漀匀琀愀琀 䨀愀瘀愀䌀氀愀猀猀㴀✀挀漀洀⸀愀甀琀漀猀椀洀⸀愀猀琀⸀愀甀琀漀洀漀搀⸀䄀䴀䠀攀愀搀攀爀✀ 匀愀瘀攀䘀椀氀攀嘀攀爀猀椀漀渀㴀✀⸀㌀✀㸀ഀ
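The \r \n \0 pattern in the dump of sta_utf16.xml would be consistent with newline translation in Windows text mode rather than with the conversion itself. The sketch below simulates that on a short made-up string. It is a guess at the mechanism, not a confirmed diagnosis: the two gsub calls stand in for what text-mode reading and writing are assumed to do, and String#encode (Ruby 1.9 and later) stands in for Iconv.

```ruby
# Simulate Windows text-mode newline translation around a UTF-16LE conversion.
# Assumption: reading without "b" turns "\r\n" into "\n", and writing with
# mode "w" turns every 0x0A byte into 0x0D 0x0A, even inside UTF-16 data.
utf8_on_disk = "<a>\r\n\t<b>"                       # made-up sample content

as_read    = utf8_on_disk.gsub("\r\n", "\n")         # stand-in for a text-mode read
utf16      = as_read.encode("UTF-16LE").b            # String#encode stands in for Iconv
as_written = utf16.gsub("\n".b, "\r\n".b)            # stand-in for a text-mode write

hex = as_written.bytes.map { |b| "%02x" % b }.join(" ")
puts hex
# prints: 3c 00 61 00 3e 00 0d 0a 00 09 00 3c 00 62 00 3e 00
```

The 0d 0a 00 run in the middle is the same \r \n \0 shape as in the dump, and the odd total byte count is what knocks the following 16-bit units out of alignment.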
Is this a defect in Iconv?
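One way to rule Iconv in or out would be to keep Ruby from touching the line endings at all, by opening both files in binary mode ("rb" and "wb"). Below is a minimal sketch of that test. The helper name utf8_file_to_utf16le is hypothetical, the sample content is made up, and String#encode (Ruby 1.9 and later) is used in place of Iconv so the snippet runs without extra libraries.

```ruby
require "tmpdir"

# Hypothetical helper: convert a UTF-8 file to UTF-16LE with a BOM,
# reading and writing in binary mode so no newline translation can occur.
def utf8_file_to_utf16le(src, dst)
  text  = File.open(src, "rb") { |f| f.read }.force_encoding("UTF-8")
  bytes = "\xFF\xFE".b + text.encode("UTF-16LE").b   # BOM, then the converted text
  File.open(dst, "wb") { |f| f.write(bytes) }
end

# Usage with throwaway files and made-up content.
Dir.mktmpdir do |dir|
  src = File.join(dir, "sta_utf8.xml")
  dst = File.join(dir, "sta_utf16.xml")
  File.open(src, "wb") { |f| f.write("<a>\r\n\t<b>") }

  utf8_file_to_utf16le(src, dst)

  out = File.open(dst, "rb") { |f| f.read }
  puts out.bytes.map { |b| "%02x" % b }.join(" ")
end
```

If the output of a binary-mode run shows \r \0 \n \0 again, the conversion is fine and the stray byte comes from the file mode.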
Thanks,
LG
--
Posted via http://www.ruby-forum.com/.