Encoding hell

OK, I am officially frustrated/lost/bewildered (take your pick) with all this encoding/decoding of character sets.
I'm trying to grab some book data from a web service using ISBN numbers. I'm using a simple GET HTTP request and on a query the service returns the following:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB /&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>

which I can't parse with REXML :(. If all the &lt; &gt; were < and > then no prob, everything checks out fine. Same code with the above snippet refuses to extract the data. Obviously I'm missing something.
Is there a way to parse this string so that all the escaped stuff goes back to normal? Can REXML understand the ampersand thingies?
Any help will be appreciated,
Cheers,
V.-
P.S. I'd have used Pickaxe 2.ed for the example if only the book was in their database :)

···

____________________________________________________________________
http://www.freemail.gr - δωρεάν υπηρεσία ηλεκτρονικού ταχυδρομείου.
http://www.freemail.gr - free email service for the Greek-speaking.

This is definitely a hack, but it's working for the data you showed:

require "rexml/document"

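# Undo the three escapes. &amp; goes last, so an escaped "&amp;lt;"
# ends up as the literal text "&lt;" instead of turning into "<".
def unescape( xml )
  xml.gsub!("&lt;", "<")
  xml.gsub!("&gt;", ">")
  xml.gsub!("&amp;", "&")
  xml
end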

doc = REXML::Document.new(unescape(DATA.read))
doc.each_element("string/ISBNORG/RECORD/*") { |e| p e }

__END__
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB /&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>

Hope that helps.

James Edward Gray II

···

On Sep 5, 2005, at 8:34 AM, Damphyr wrote:

OK, I am officially frustrated/lost/bewildered (take your pick) with all this encoding/decoding of character sets.
I'm trying to grab some book data from a web service using ISBN numbers. I'm using a simple GET HTTP request and on a query the service returns the following:

You are seeing already escaped characters. You need to unescape them.

require "cgi"
require "rexml/document"
str = CGI.unescapeHTML( string )  # string is the raw response body
doc = REXML::Document.new( str )

HTH,

Zach
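For anyone who wants to try the unescape-first suggestion end to end, here is a minimal sketch. It assumes a shortened copy of the response from the original post (the `response` heredoc is a stand-in, not a live call to the service) and reuses the element path from the earlier reply.

```ruby
require "cgi"
require "rexml/document"

# Stand-in for the service response: a shortened copy of the data
# quoted in the original post, not fetched from the real service.
response = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <string xmlns="http://www.webserviceX.NET">
  &lt;ISBNORG&gt;
  &lt;RECORD&gt;
  &lt;ISBN&gt;0764558315&lt;/ISBN&gt;
  &lt;DATE&gt;c2004.&lt;/DATE&gt;
  &lt;/RECORD&gt;
  &lt;/ISBNORG&gt;
  </string>
XML

# Unescape the whole response once, so &lt;ISBN&gt; becomes <ISBN>,
# then parse the result as a single document.
unescaped = CGI.unescapeHTML(response)
doc = REXML::Document.new(unescaped)

isbn = doc.elements["string/ISBNORG/RECORD/ISBN"].text
puts isbn
```

Note that this unescapes the enclosing document as well as the embedded one, which happens to be harmless for this particular response.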

···

Another way of looking at this: you're getting one XML document embedded in another:

enclosing_doc = REXML::Document.new(str)
real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)

Josh

···

Zach Dennis wrote:

You are seeing already escaped characters. You need to unescape them.

str = CGI.unescapeHTML( string )
REXML::Document.new( str )

Aaaaargh, I knew it. That's where I saw the 'double escaping' reference: reading about the changes in CGI between 1.6 and 1.8.
Thanks, that's what I was looking for (sorry James, the whole purpose of the mail was to avoid the hack you so kindly provided :) ).
Cheers,
V.-

···

Joshua Haberman <joshua@reverberate.org> writes:

Another way of looking at this: you're getting one XML document
embedded in another:

enclosing_doc = REXML::Document.new(str)
real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)

This is the right way to tackle the problem. Don't unescape on your own,
let the XML parser do it.

Whoever created that webservice should be shot, by the way. Namespaces
don't exist without a reason.

···

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Good call. Then unescape real_doc...

Zach

···

No, that's the whole point. All escapes were interpreted when you parsed enclosing_doc, and replaced by their corresponding characters.

The text of the <string> element is itself a valid XML document, and incidentally, the XML document you really care about.

Try "puts enclosing_doc.elements["/string"].text", and it should all make more sense.

Josh
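A minimal sketch of the embedded-document approach, for anyone following along. It assumes a shortened copy of the response from the original post (the `response` heredoc and the `record` hash are illustrative names, not part of the service).

```ruby
require "rexml/document"

# Stand-in for the service response: a shortened copy of the data
# quoted in the original post, not fetched from the real service.
response = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <string xmlns="http://www.webserviceX.NET">
  &lt;ISBNORG&gt;
  &lt;RECORD&gt;
  &lt;ISBN&gt;0764558315&lt;/ISBN&gt;
  &lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
  &lt;DATE&gt;c2004.&lt;/DATE&gt;
  &lt;/RECORD&gt;
  &lt;/ISBNORG&gt;
  </string>
XML

# Parsing the outer document resolves &lt; and &gt; in the text of
# <string>, so .text hands back the inner document as plain markup.
enclosing_doc = REXML::Document.new(response)
inner_xml = enclosing_doc.elements["/string"].text

# The inner markup is a document in its own right; parse it again.
real_doc = REXML::Document.new(inner_xml)

# Collect the child elements of RECORD into a hash keyed by tag name.
record = {}
real_doc.each_element("ISBNORG/RECORD/*") { |e| record[e.name] = e.text }

puts record["ISBN"]
puts record["AUTHOR"]
```

No manual unescaping step is involved: the first parse does it, and the second parse only ever sees well-formed markup.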

···

Ah, yep, you're right. Doing that makes more sense. =) I didn't know
REXML would auto-unescape for you. Pretty cool. Thanks Josh!

Zach

···

Zach Dennis <zdennis@mktec.com> writes:

Try "puts enclosing_doc.elements["/string"].text", and it should all
make more sense.

Ah, yep, you're right. Doing that makes more sense. =) I didn't know
REXML would auto-unescape for you. Pretty cool. Thanks Josh!

We all know REXML does a lot of stuff in interesting ways... but of
course it resolves the core entities correctly!

···

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org