REXML Input File Question

Mike_Pe · 19 July 2010 19:13

Hi,

I am having trouble parsing a document with the Ruby XML parser REXML.
The problem seems to be the first line of my file, the XML declaration
that identifies it as an XML file.

Here are two XML documents. REXML does not parse the first one, but it
parses the second properly:

error = <<EOF
     <?xml version="1.0" encoding="UTF-16"?>
     <document test="yes">
     </document>
EOF

noerror = <<EOF
<document test="yes">
</document>
EOF

When I parse "error", REXML does not read any of the attributes or
elements.

require 'rexml/document'
include REXML

doc = Document.new error
puts doc.root.attributes["test"]   # --> nil

doc = Document.new noerror
puts doc.root.attributes["test"]   # --> yes

I considered that REXML might only accept UTF-8 encoded files, but when
I convert these files from UTF-16 to UTF-8, they still do not parse
properly.

Does anyone know what I am doing wrong? Thank you very much.

Mike

Attachments:
http://www.ruby-forum.com/attachment/4868/error.xml

--
Posted via http://www.ruby-forum.com/.

Robert_K1 · 19 July 2010 19:52

The string is most likely not UTF-16 encoded so REXML cannot parse it properly. Which Ruby version? If it is 1.9ish you'll find information about i18n here:

Can you show what exactly you did?
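
One quick check that would help answer this, as a minimal sketch: compare the encoding Ruby reports for the string with the encoding its XML declaration claims. It assumes Ruby 1.9 or later (for String#encoding) and reuses the heredoc from the original post; the variable names actual and declared are only for illustration.

```ruby
# The heredoc from the original post, unchanged.
error = <<EOF
     <?xml version="1.0" encoding="UTF-16"?>
     <document test="yes">
     </document>
EOF

# Encoding Ruby actually uses for the string (normally the source encoding).
actual = error.encoding.name

# Encoding the XML declaration claims; nil if there is no declaration.
declared = error[/<\?xml[^>]*encoding="([^"]+)"/, 1]

puts "actual:   #{actual}"
puts "declared: #{declared}"
puts "declared and actual encodings differ" unless actual.casecmp(declared.to_s).zero?
```

If the two differ, a parser that trusts the declaration may try to decode bytes that are not in that encoding.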

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Mike_Pe · 23 July 2010 22:12

Hi Robert,

The issue is that the first line of my input file:

<?xml version="1.0" encoding="UTF-16"?>

causes the file to be read as an "xml application". Basically, I just
want to use REXML to parse this XML file, but it does not parse properly
with this line at the beginning of my input file (otherwise it works
fine).

I tried converting the files with the iconv commands from your link, but
with both UTF-16 and UTF-8 the same error occurs, regardless of format.

Why is this line interfering with the parser, and how would I fix it?
Thank you for your help.

Best,
Mike

Robert_K1 · 27 July 2010 09:01

Please provide the code you are using so others can try this out
themselves. I asked for this already (see above).

It seems there is no UTF-16 support:

irb(main):009:0> f=File.open "x", "r:UTF-16"
(irb):9: warning: Unsupported encoding UTF-16 ignored
=> #<File:x>

So there is no point in trying to import a UTF-16 encoded file in Ruby.
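
The warning above is about the name UTF-16 without a byte order. A sketch of one possible workaround, assuming a Ruby that provides the byte-order-specific encodings Encoding::UTF_16LE and Encoding::UTF_16BE: read the raw bytes, pick the byte order from the byte order mark (BOM), and transcode to UTF-8. The helper name utf16_to_utf8 and its fallback parameter are hypothetical, not part of Ruby or REXML.

```ruby
# Hypothetical helper: decode raw UTF-16 bytes into a UTF-8 string, using
# the byte order mark (BOM) to choose between little and big endian.
def utf16_to_utf8(bytes, fallback = Encoding::UTF_16LE)
  data = bytes.dup.force_encoding(Encoding::BINARY)   # work on raw bytes
  bom_le = [0xFF, 0xFE].pack("C*")
  bom_be = [0xFE, 0xFF].pack("C*")

  source =
    if data.start_with?(bom_le) then Encoding::UTF_16LE
    elsif data.start_with?(bom_be) then Encoding::UTF_16BE
    end

  data = data[2..-1] if source   # drop the BOM itself
  # Assumption: without a BOM, use the caller-supplied fallback byte order.
  data.force_encoding(source || fallback).encode(Encoding::UTF_8)
end

# Usage (file name taken from later in the thread):
#   xml = utf16_to_utf8(File.binread("orig-utf16.xml"))
```

The result is an ordinary UTF-8 string. Its XML declaration is left untouched, so it may still name UTF-16.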

Kind regards

robert

Mike_Pe · 27 July 2010 16:47

Hi Robert,

As for the code that I am using, I simplified it in my original post.
The first line:

doc = REXML::Document.new error

should parse the XML document and recognize the root, elements,
attributes, etc. from the input document.

For example:
puts doc.root.attributes["test"]

should return "yes", because the attribute in the error XML (see above)
is "yes". With the extra line, it prints "nil" (because the parser did
not do its job).

I tried converting all of the files to UTF-8 and they still did not
work. (If you remove the extra line, it does work.) I do not think the
problem is with the Unicode encoding.
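
One thing that may be worth ruling out, sketched under assumptions: after the bytes are converted to UTF-8, the declaration still says encoding="UTF-16", so the declared and actual encodings disagree. The XML specification also requires the declaration to be the very first thing in the document, and the heredoc indents it. The hypothetical helper declare_utf8 below strips leading whitespace and rewrites the declared encoding. Whether this is the cause here is a guess, not a confirmed diagnosis.

```ruby
# Hypothetical helper: make the XML declaration agree with a string that has
# already been converted to UTF-8. Documents without a declaration are
# returned with only their leading whitespace removed.
def declare_utf8(xml)
  xml.lstrip.sub(/\A(<\?xml\b[^>]*?\bencoding=)(["'])[^"']*\2/) do
    # $1 is everything up to "encoding=", $2 is the quote character used.
    "#{$1}#{$2}UTF-8#{$2}"
  end
end

fixed = declare_utf8(<<EOF)
     <?xml version="1.0" encoding="UTF-16"?>
     <document test="yes">
     </document>
EOF
puts fixed
```

The string in fixed could then be passed to REXML::Document.new in place of the original.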

Thanks,

Mike

Robert_K1 · 27 July 2010 21:20

What is "error"? How do you obtain it?

Hmm...

robert

Mike_Pe · 28 July 2010 06:23

By "error", I meant my file called error from my first post:

error = <<EOF
     <?xml version="1.0" encoding="UTF-16"?>
     <document test="yes">
     </document>
EOF

Brabuhr · 28 July 2010 13:49

Could you provide a link to a zip file that contains an original input
file that fails, a re-encoded input file that fails, an input file that
does not fail, and a script that loads them?

Or, provide a more detailed step-by-step of what you did, e.g.:

# poke at the original file to see what it looks like
ls -l orig-utf16.xml
file orig-utf16.xml
wc -c orig-utf16.xml
enca orig-utf16.xml
head orig-utf16.xml

# convert the file
iconv -t UTF8 -f UTF16 < orig-utf16.xml > new-utf8.xml

# poke at the new file to see what it looks like
ls -l new-utf8.xml
file new-utf8.xml
wc -c new-utf8.xml
enca new-utf8.xml
head new-utf8.xml

# load the files in the script
cat rexmltest.rb
ruby rexmltest.rb orig-utf16.xml
ruby rexmltest.rb new-utf8.xml
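
A minimal sketch of what rexmltest.rb could look like, assuming it only needs to report what Ruby and REXML see in each file named on the command line. The helper name inspect_xml and the report keys are made up for illustration; REXML::Document.new, root and attributes[] are the same REXML calls used earlier in the thread.

```ruby
# rexmltest.rb (hypothetical): show what Ruby and REXML see in each file
# named on the command line.
require 'rexml/document'

# Returns a hash describing one chunk of raw XML bytes.
def inspect_xml(bytes)
  report = {
    :size        => bytes.bytesize,
    :first_bytes => bytes.unpack("C4")   # a BOM would show up here
  }
  begin
    doc = REXML::Document.new(bytes.dup)   # dup: leave the caller's string alone
    root = doc.root
    report[:root] = root && root.name
    report[:test] = root && root.attributes["test"]
  rescue StandardError => e
    # Record parse or encoding failures instead of aborting the run.
    report[:error] = "#{e.class}: #{e.message[0, 60]}"
  end
  report
end

# Read each file in binary mode so nothing is transcoded before REXML runs.
ARGV.each do |path|
  puts "#{path}: #{inspect_xml(File.binread(path)).inspect}"
end
```

Posting the output for the failing and the working file side by side would show whether the bytes, the root element, or the attribute lookup is where things go wrong.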

Thanks.
