OK, I am officially frustrated/lost/bewildered (take your pick) with all this encoding/decoding of character sets.
I'm trying to grab some book data from a web service using ISBN numbers. I'm using a simple GET HTTP request and on a query the service returns the following:
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;
&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB /&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;
&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>
which I can't parse with REXML :(. If all the &lt; &gt; were < and > then no prob, everything checks out fine. The same code with the above snippet refuses to extract the data. Obviously I'm missing something.
Is there a way to parse this string so that all the escaped stuff goes back to normal? Can REXML understand the ampersand thingies?
Any help will be appreciated,
Cheers,
V.-
P.S. I'd have used Pickaxe 2.ed for the example if only the book was in their database
This is definitely a hack, but it's working for the data you showed:
require "rexml/document"
def unescape( xml )
  xml.gsub!("&lt;", "<")
  xml.gsub!("&gt;", ">")
  xml.gsub!("&amp;", "&")  # must come last, or "&amp;lt;" would turn into "<"
  xml
end
doc = REXML::Document.new(unescape(DATA.read))
doc.each_element("string/ISBNORG/RECORD/*") { |e| p e }
__END__
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;
&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB /&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;
&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>
Hope that helps.
James Edward Gray II
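One detail worth spelling out for a hand-rolled unescape like the one above: the ampersand entity has to be replaced last. The sketch below compares the two orderings on a made-up string. The helper names `unescape_amp_last` and `unescape_amp_first` and the sample text are hypothetical, chosen only for illustration.

```ruby
# Hypothetical helpers for illustration; they are not part of the original post.
def unescape_amp_last(text)
  text.gsub("&lt;", "<").gsub("&gt;", ">").gsub("&amp;", "&")
end

def unescape_amp_first(text)
  text.gsub("&amp;", "&").gsub("&lt;", "<").gsub("&gt;", ">")
end

# "&amp;lt;" is the escaped form of the literal text "&lt;".
sample = "a &amp;lt; b"

good = unescape_amp_last(sample)   # one level of escaping removed
bad  = unescape_amp_first(sample)  # two levels removed by accident

puts good
puts bad
```

Replacing the ampersand first creates a fresh `&lt;` that the next substitution then consumes, so the text is unescaped twice.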
On Sep 5, 2005, at 8:34 AM, Damphyr wrote:
> OK, I am officially frustrated/lost/bewildered (take your pick) with all this encoding/decoding of character sets.
> I'm trying to grab some book data from a web service using ISBN numbers. I'm using a simple GET HTTP request and on a query the service returns the following:
You are seeing already escaped characters. You need to unescape them.
Aaaaargh, I knew it. That's where I saw the 'double escaping' reference: reading about the changes in CGI between 1.6 and 1.8.
Thanks, that's what I was looking for (sorry James, the whole purpose of the mail was to avoid the hack you so kindly provided).
Cheers,
V.-
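For readers following along, here is a minimal sketch of the library route, assuming the CGI reference above points at `CGI.unescapeHTML` from Ruby's standard `cgi` library (the thread does not name the method). The sample string is a shortened, hypothetical piece of the escaped payload.

```ruby
require "cgi"

# Hypothetical, shortened sample of the escaped payload.
escaped = "&lt;ISBN&gt;0764558315&lt;/ISBN&gt;"

# CGI.unescapeHTML turns &lt; &gt; &amp; &quot; back into the literal characters.
unescaped = CGI.unescapeHTML(escaped)

puts unescaped
```

Unescaping the whole response this way has the same caveat as the manual hack: a literal `&amp;` inside the record text becomes a bare `&`, which would make the result ill-formed XML.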
On Tue, 2005-09-06 at 00:58 +0900, Christian Neukirchen wrote:
>
> Another way of looking at this: you're getting one XML document
> embedded in another:
>
> enclosing_doc = REXML::Document.new(str)
> real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)
>
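The two quoted lines can be expanded into a self-contained sketch. The response below is a shortened, hypothetical version of the one in the question, and the sketch uses `root` rather than the `"/string"` path only to sidestep any question about how the default namespace is matched.

```ruby
require "rexml/document"

# Hypothetical, shortened response: the inner record is escaped text inside <string>.
str = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <string xmlns="http://www.webserviceX.NET">&lt;ISBNORG&gt;
  &lt;RECORD&gt;
  &lt;ISBN&gt;0764558315&lt;/ISBN&gt;
  &lt;DATE&gt;c2004.&lt;/DATE&gt;
  &lt;/RECORD&gt;
  &lt;/ISBNORG&gt;</string>
XML

# First parse: the outer document, whose root element holds only text.
enclosing_doc = REXML::Document.new(str)

# Element#text returns the text with entities already unescaped,
# so no manual gsub is needed here.
inner_xml = enclosing_doc.root.text

# Second parse: the embedded document.
real_doc = REXML::Document.new(inner_xml)

isbn = real_doc.elements["ISBNORG/RECORD/ISBN"].text
date = real_doc.elements["ISBNORG/RECORD/DATE"].text

puts isbn
puts date
```

Because the XML parser does the unescaping, an `&amp;amp;` in the outer document correctly becomes `&amp;` in the inner one, and the inner document stays well formed.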