Encoding hell

Damphyr · 5 September 2005 13:34

OK, I am officially frustrated/lost/bewildered (take your pick) with all this encoding/decoding of character sets.
I'm trying to grab some book data from a web service using ISBN numbers. I'm using a simple GET HTTP request and on a query the service returns the following:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB /&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>

which I can't parse with REXML :(. If all the &lt; and &gt; were < and > then no prob, everything checks out fine. Same code with the above snippet refuses to extract the data. Obviously I'm missing something.
Is there a way to parse this string so that all the escaped stuff goes back to normal? Can REXML understand the ampersand thingies?
Any help will be appreciated,
Cheers,
V.-
P.S. I'd have used Pickaxe 2.ed for the example if only the book was in their database

James_Edward_Gray_II · 5 September 2005 15:20

This is definitely a hack, but it's working for the data you showed:

require "rexml/document"

def unescape( xml )
  xml.gsub!("&lt;", "<")
  xml.gsub!("&gt;", ">")
  xml.gsub!("&amp;", "&")   # last, so "&amp;lt;" is not unescaped twice
  xml
end

doc = REXML::Document.new(unescape(DATA.read))
doc.each_element("string/ISBNORG/RECORD/*") { |e| p e }

__END__
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB /&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>

Hope that helps.

James Edward Gray II

Zach_Dennis1 · 5 September 2005 15:25

You are seeing already escaped characters. You need to unescape them.

require "cgi"
require "rexml/document"
str = CGI.unescapeHTML( string )
REXML::Document.new( str )

HTH,

Zach

Joshua_Haberman1 · 5 September 2005 15:35

Another way of looking at this: you're getting one XML document embedded in another:

enclosing_doc = REXML::Document.new(str)
real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)
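
A fuller sketch of the same idea, for anyone who wants something to paste into a file. The sample response is a shortened copy of the one in the original post, and the helper names (inner_document, record_fields) are made up for illustration. It reads the wrapper's text through root rather than the "/string" path, purely so the sketch does not depend on how the default namespace is matched:

```ruby
require "rexml/document"

# Shortened copy of the response from the original post: the inner
# document arrives as escaped text inside the <string> wrapper.
SAMPLE_RESPONSE = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <string xmlns="http://www.webserviceX.NET">
  &lt;ISBNORG&gt;
  &lt;RECORD&gt;
  &lt;ISBN&gt;0764558315&lt;/ISBN&gt;
  &lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
  &lt;DATE&gt;c2004.&lt;/DATE&gt;
  &lt;/RECORD&gt;
  &lt;/ISBNORG&gt;
  </string>
XML

# Parse the wrapper, then parse the wrapper's (already decoded) text
# as a document in its own right.
def inner_document(response)
  enclosing_doc = REXML::Document.new(response)
  inner_xml = enclosing_doc.root.text.strip
  REXML::Document.new(inner_xml)
end

# Collect every child of RECORD into a name => text hash.
def record_fields(doc)
  fields = {}
  doc.each_element("ISBNORG/RECORD/*") { |e| fields[e.name] = e.text }
  fields
end

record_fields(inner_document(SAMPLE_RESPONSE)).each do |name, value|
  puts "#{name}: #{value}"
end
```

No unescaping code appears anywhere in it; the first parse is what turns the &lt; and &gt; back into markup.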

Josh

Damphyr · 5 September 2005 15:37

Zach Dennis wrote:

You are seeing already escaped characters. You need to unescape them.

str = CGI.unescapeHTML( string )
REXML::Document.new( str )

Aaaaargh, I knew it. That's where I saw the 'double escaping' reference: reading about the changes in CGI between 1.6 and 1.8.
Thanks, that's what I was looking for (sorry James, the whole purpose of the mail was to avoid the hack you so kindly provided).
Cheers,
V.-

Christian_Neukirche1 · 5 September 2005 15:58

Joshua Haberman <joshua@reverberate.org> writes:

Another way of looking at this: you're getting one XML document
embedded in another:

enclosing_doc = REXML::Document.new(str)
real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)

This is the right way to tackle the problem. Don't unescape on your own,
let the XML parser do it.
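
One concrete way a hand-rolled unescape can fall short, as a small sketch. The response below is hypothetical (the service in this thread was not shown returning attributes), and hand_unescape and parser_unescape are illustrative names. A routine that only knows &lt;, &gt; and &amp; leaves other references such as &quot; untouched, while the parser decodes all of the predefined ones:

```ruby
require "rexml/document"

# Hypothetical response: the embedded markup has a quoted attribute, so
# the quotes are escaped as well as the angle brackets.
QUOTED_RESPONSE =
  '<string>&lt;TITLE lang=&quot;en&quot;&gt;Ruby&lt;/TITLE&gt;</string>'

# Hand-rolled unescaping that only knows three entities.
def hand_unescape(xml)
  xml.gsub("&lt;", "<").gsub("&gt;", ">").gsub("&amp;", "&")
end

# Let the parser decode the text of the wrapper element instead.
def parser_unescape(xml)
  REXML::Document.new(xml).root.text
end

puts hand_unescape(QUOTED_RESPONSE)    # still contains &quot;
puts parser_unescape(QUOTED_RESPONSE)  # quotes are decoded
```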

Whoever created that webservice should be shot, by the way. Namespaces
don't exist without a reason.

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Zach_Dennis1 · 5 September 2005 16:26

Good call. Then unescape real_doc...

Zach

Joshua_Haberman1 · 5 September 2005 16:34

No, that's the whole point. All escapes were interpreted when you parsed enclosing_doc, and replaced by their corresponding characters.

The text of the <string> element is itself a valid XML document, and incidentally, the XML document you really care about.

Try 'puts enclosing_doc.elements["/string"].text', and it should all make more sense.
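
To see that each parse undoes exactly one level of escaping, here is a minimal sketch with a made-up response (the publisher name and the decode_levels helper are hypothetical, not from the service). A literal ampersand in the inner document has to be escaped twice on the wire, once per level:

```ruby
require "rexml/document"

# Hypothetical response where the inner document contains a literal "&".
# It is escaped once for the inner document (&amp;) and once more for
# the enclosing one (&amp;amp;).
AMP_RESPONSE =
  "<string>&lt;PUBLISHER&gt;Wiley &amp;amp; Sons&lt;/PUBLISHER&gt;</string>"

# Returns the text after the first parse and the text after the second.
def decode_levels(response)
  inner_xml = REXML::Document.new(response).root.text
  publisher = REXML::Document.new(inner_xml).root.text
  [inner_xml, publisher]
end

inner_xml, publisher = decode_levels(AMP_RESPONSE)
puts inner_xml   # the inner document, still carrying its own &amp;
puts publisher   # the final value, with a plain ampersand
```

Running a separate unescape on the result of either parse would strip one level too many.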

Josh

Zach_Dennis1 · 5 September 2005 17:41

Ah, yep, you're right. Doing that makes more sense. =) I didn't know
REXML would auto-unescape for you. Pretty cool. Thanks Josh!

Zach

Christian_Neukirche1 · 5 September 2005 18:30

Zach Dennis <zdennis@mktec.com> writes:

Ah, yep, you're right. Doing that makes more sense. =) I didn't know
REXML would auto-unescape for you. Pretty cool. Thanks Josh!

We all know REXML does a lot of stuff in interesting ways... but of
course it resolves the core entities correctly!
