REXML::Document could not parse UTF-8 "<name>\302</name>"

Hi all,

I'm working with some UTF-8 data, and if I run this:

require 'rexml/document'
data = "<name>\302</name>"
doc = REXML::Document.new(data)

I get an error that says I did not close the <name> tag:
REXML::ParseException: #<REXML::ParseException: No close tag for ["name"]>
/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:26:in `parse'
/usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
/usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
(irb):48:in `new'
(irb):48:in `irb_binding'
/usr/lib/ruby/1.8/irb/workspace.rb:52:in `irb_binding'
/usr/lib/ruby/1.8/irb/workspace.rb:52
...
No close tag for ["name"]
Line:
Position:
Last 80 unconsumed characters:

        from /usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:89:in `parse'
        from /usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
        from /usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
        from (irb):48:in `new'
        from (irb):48

The code only works if I use single quotes instead, i.e.
doc = REXML::Document.new('<name>\302</name>')

But since data is a variable, I can't simply declare it with single
quotes.
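
The difference between the two quoting styles can be checked directly. This is only a small sketch, assuming Ruby 1.9 or later; the constant names are made up for illustration. In double quotes, \302 is an octal escape for one raw byte, while in single quotes it stays as four ordinary ASCII characters, so the single-quoted version parses only because it no longer contains the byte in question.

```ruby
# Double quotes process escapes: "\302" is an octal escape for the single byte 0xC2.
DOUBLE_QUOTED = "<name>\302</name>"
# Single quotes do not: '\302' stays as four characters (backslash, 3, 0, 2).
SINGLE_QUOTED = '<name>\302</name>'

puts DOUBLE_QUOTED.bytesize        # 14
puts SINGLE_QUOTED.bytesize        # 17
puts DOUBLE_QUOTED.bytes.to_a[6]   # 194, i.e. 0xC2
puts SINGLE_QUOTED[6, 4]           # \302
```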

Any ideas why REXML::Document doesn't parse this properly? Or is there
perhaps a way around it? Maybe I can convert to some other character
encoding to avoid the problem...

Best regards,

Jesse

Hi,

In message "Re: REXML::Document could not parse UTF-8 "<name>\302</name>"" on Sat, 5 Jan 2008 02:40:00 +0900, "Jesse P." <j.prabawa@gmail.com> writes:

>I'm working with some UTF-8 data, and if I run this:
>
>require 'rexml/document'
>data = "<name>\302</name>"
>doc = REXML::Document.new(data)

"<name>\302</name>" is not a valid UTF-8 byte sequence. The rest is
up to you, once you recognize that you are working with non-UTF-8 data.
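
For reference, \302 is the byte 0xC2, which in UTF-8 opens a two-byte sequence and must be followed by a continuation byte in the range 0x80..0xBF. Here it is followed by "<" (0x3C), so the sequence is broken. The following is a rough sketch of how this could be checked, assuming Ruby 1.9 or later, where strings carry an encoding (the 1.8 shown in the backtrace has no String#valid_encoding?). The helper name first_invalid_byte_offset is hypothetical.

```ruby
# Hypothetical helper: returns the byte offset of the first invalid UTF-8
# sequence in str, or nil when the whole string is valid UTF-8.
def first_invalid_byte_offset(str)
  # Work on a copy so the caller's string keeps its original encoding tag.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return nil if utf8.valid_encoding?

  offset = 0
  utf8.each_char do |ch|
    # each_char yields broken bytes as separate, invalid "characters".
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

p first_invalid_byte_offset("<name>\302</name>")      # 6
p first_invalid_byte_offset("<name>\303\251</name>")  # nil
```

The second call passes because \303\251 (0xC3 0xA9) is a complete two-byte sequence.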

              matz.

Hi Matz,

Thanks for your help. So I guess my problem is this:
1. I get an XML document that is declared to be valid UTF-8, but
2. when I process some of the values, as you pointed out, some are not
valid UTF-8, and
3. this causes a lot of problems when parsed by REXML.

For a string of characters (e.g. some XML file), is there any way I can
detect just the non-UTF-8 characters and convert them to UTF-8?

This way I can make sure that what REXML processes is valid UTF-8
without unnecessarily processing characters in the string that are
already valid UTF-8.

Best regards,

Jesse


Hi,

>Thanks for your help. So I guess my problem is this:
>1. I get an XML document that is declared to be valid UTF-8, but
>2. when I process some of the values, as you pointed out, some are not
>valid UTF-8, and
>3. this causes a lot of problems when parsed by REXML.
>
>For a string of characters (e.g. some XML file), is there any way I can
>detect just the non-UTF-8 characters and convert them to UTF-8?

I guess you first have to define what you want to do with this broken
UTF-8 data. As long as you treat the data as UTF-8, it is impossible
to process it correctly. You can

  * fix the data before reading it via REXML
  * parse data as Latin-1 or some other single byte encoding
  * replace the broken data with some valid UTF-8 sequence

But YMMV.
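
The three options above might look roughly like this. It is a sketch only, assuming Ruby 2.1 or later for String#scrub and that the stray bytes are really Latin-1 (ISO-8859-1), which nothing in this thread confirms. The helper names are hypothetical. The second helper stands in for "parse as Latin-1" by transcoding up front rather than telling the parser.

```ruby
require 'rexml/document'

# Option 1 (hypothetical helper): repair only the invalid bytes, assuming
# they are stray Latin-1 characters. Valid UTF-8 sequences are left alone.
def repair_as_latin1(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub do |bad|
    bad.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
  end
end

# Option 2 (hypothetical helper): treat the whole input as Latin-1.
# Any genuine multi-byte UTF-8 in the input would come out garbled.
def reinterpret_as_latin1(str)
  str.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
end

# Option 3 (hypothetical helper): replace each broken sequence with a
# placeholder, by default U+FFFD REPLACEMENT CHARACTER.
def replace_invalid(str, replacement = "\uFFFD")
  str.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

data = "<name>\302</name>"
doc = REXML::Document.new(replace_invalid(data))
puts doc.root.name
```

Which one is right depends on where the broken bytes came from, which is why the choice has to be made before parsing.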

              matz.

In message "Re: REXML::Document could not parse UTF-8 "<name>\302</name>"" on Sun, 6 Jan 2008 03:00:04 +0900, "Jesse P." <j.prabawa@gmail.com> writes:

Thanks Matz :)
