Odd puts behaviour with REXML

David_Sainte-claire · 30 November 2009 18:19

Hello,

This is really weird, and I'm not sure if maybe there's a formatting
option that I can set on the puts method that will solve it. Hopefully
someone has done something similar??

I have been using REXML to parse through an XML document that I get back
from calling a ReST-ish web service, and up until now it has parsed
every single element as I'd expect. But I run into this small patch of
data where it acts weird. The XML looks like this:

<profile>
<support>
<title>eSupport</title>
<prefix>Find related product support and help in</prefix>
<link>
http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&LOC=3&mdl=someModel
</link>
</support>
</profile>

And I'm parsing it like this:
data.elements.each("product/profile/support"){|element|
  element.elements.each() do |child|
    puts "Did XPATH match any profile elements?"
    puts child.text
  end
}

I know it's pretty rudimentary, but I'm just confused as to why this
same pattern has been able to get text out of every other element I have
searched for, but here I get something like this when I loop through the
subelements:
Did XPATH match any profile elements?
???Product_Support_Link_Title???
Did XPATH match any profile elements?
???Product_Support_Link_Prefix???
Did XPATH match any profile elements?
???Product_Support_Link???&mdl=someModel

It looks like XPATH doesn't like the format of the text somehow, but I
have no idea what it is about it that it doesn't like since it appears
to be just normal text to me.

Anyone have any ideas? It looks like something that REXML is doing
since it outputs the path to the element in the document Product -->
Support but I don't have a clue where the _Link is coming from?

Thanks in advance for all your help!
David

--
Posted via http://www.ruby-forum.com/.

John_W_Higgins1 · 30 November 2009 18:33

Morning David,

Rule one of anything like this is to remove each element from the XML file
one by one and see which one is breaking you. In this case you will find
that it's the link element because you are using ampersands in the URL. See
here

http://www.w3.org/TR/xhtml1/guidelines.html#C_12

John
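A minimal sketch of the failure John describes, assuming the REXML that ships with current Ruby versions (older releases may behave differently). The `BAD_XML` sample and the `parse_or_nil` helper are hypothetical names used only for illustration; the point is that a raw `&` in element text makes the whole document ill-formed, so the parser rejects it rather than returning a partial tree.

```ruby
require 'rexml/document'

# Hypothetical sample mirroring the thread's <link> element, with raw ampersands.
BAD_XML = <<~XML
  <product>
    <link>http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&LOC=3&mdl=someModel</link>
  </product>
XML

# Hypothetical helper: returns the parsed document, or nil when the
# input is not well-formed XML.
def parse_or_nil(xml)
  REXML::Document.new(xml)
rescue REXML::ParseException => e
  warn "Not well-formed XML (#{e.class})"
  nil
end

doc = parse_or_nil(BAD_XML)
puts doc.nil? ? "parse failed" : "parsed"
```

Rescuing `REXML::ParseException` at the point where the service response is parsed makes this kind of problem visible immediately instead of surfacing later as odd output.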

David_Sainte-claire · 30 November 2009 18:43

Thanks for the help, in terms of how to narrow it down, but if I change
my XPATH statement to look more like this:

    data.elements.each("product/profile/support/title"){|element|
      puts element.text
    }

Where product is the root node, and I'm only pulling out the title
element, I still get output that looks like this:

???Product_Support_Link_Title???

Can the link element be messing up my XPATH query even though I'm not
looking at that element?

Robert_K1 · 30 November 2009 18:45

One more hint: when posting things like this it is best to provide a _complete_ example. From the XPath given it is clear that something must be missing (there are no "product" tags). It's pretty easy with REXML:

robert@fussel ~
$ cat x.rb

require 'rexml/document'

data = REXML::Document.new <<'DOC'
<product>
<profile>
<support>
<title>eSupport</title>
<prefix>Find related product support and help in</prefix>
<link>
http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&LOC=3&mdl=someModel
</link>
</support>
</profile>
</product>
DOC

data.elements.each("product/profile/support"){|element|
   element.elements.each() do |child|
     puts "Did XPATH match any profile elements?"
     puts child.text
   end
}

Which on my box produces:

robert@fussel ~
$ ruby19 x.rb
/usr/local/lib/ruby19/1.9.1/rexml/parsers/treeparser.rb:95:in `rescue in parse': #<RuntimeError: Illegal character '&' in raw string " (REXML::ParseException)
http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&LOC=3&mdl=someModel
">
/usr/local/lib/ruby19/1.9.1/rexml/text.rb:155:in `block in check'
/usr/local/lib/ruby19/1.9.1/rexml/text.rb:153:in `scan'
/usr/local/lib/ruby19/1.9.1/rexml/text.rb:153:in `check'
/usr/local/lib/ruby19/1.9.1/rexml/text.rb:125:in `parent='
/usr/local/lib/ruby19/1.9.1/rexml/parent.rb:19:in `add'
/usr/local/lib/ruby19/1.9.1/rexml/parsers/treeparser.rb:45:in `parse'
/usr/local/lib/ruby19/1.9.1/rexml/document.rb:228:in `build'
/usr/local/lib/ruby19/1.9.1/rexml/document.rb:43:in `initialize'
x.rb:4:in `new'
x.rb:4:in `<main>'
...
Illegal character '&' in raw string "
http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&LOC=3&mdl=someModel
"
Line: 8
Position: 221
Last 80 unconsumed characters:
</link>
         from /usr/local/lib/ruby19/1.9.1/rexml/parsers/treeparser.rb:20:in `parse'
         from /usr/local/lib/ruby19/1.9.1/rexml/document.rb:228:in `build'
         from /usr/local/lib/ruby19/1.9.1/rexml/document.rb:43:in `initialize'
         from x.rb:4:in `new'
         from x.rb:4:in `<main>'

robert@fussel ~
$

How did you manage to get REXML to parse this?

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

John_W_Higgins1 · 30 November 2009 18:51

Morning Again David,

The ampersand messes things up because once you confuse the parser, how is
it supposed to figure out where the end of the support element is? XML is
sort of an "it works or it doesn't" concept most of the time: a document is
either well-formed or it isn't. Not much wiggle room for the most part.

John

David_Sainte-claire · 30 November 2009 18:58

That's really interesting. For me this parses without an exception, but
it just gives that odd output, like this: ???Product_Support_Link_Title???

Here is exactly what I'm doing:

I have XML that looks like this:
<product>
<profile>
<support>
<title>eSupport</title>
<prefix>Find related product support and help in</prefix>
<link>
http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&LOC=3&mdl=someModel
</link>
</support>
</profile>
</product>

And I'm walking it using REXML like this:
data.elements.each("product/profile/support"){|element|
  element.elements.each() do |child|
    puts "Did XPATH match any profile elements?"
    puts child.text
  end
}

Where data is the body of an HTTP GET request to a ReST-ish web service
(I can't post the endpoint of the web service since it's an internal
server)

I'm invoking the web service from a file called product.rb using the
Rails command script/runner app/models/product.rb

Maybe if I had done it all from IRB I would have gotten the illegal
character exception.

Is there any way that I can ignore the formatting rules for illegal
characters and just take the whole text as a string? The article that
the first person posted said that the web service should return links in
this format:
http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&amp;LOC=3&amp;mdl=someModel

using &amp; instead of just & between arguments in the URL, which might
fly with REXML, but I'm not sure how to strip all that back off to make
it into a valid link, since my browser doesn't know what to do with a
URL that has &amp; between all the query parameters.

Again, apologies if these are totally naive questions. I'm pretty new
at this...

Thanks,
David

John_W_Higgins1 · 30 November 2009 19:14

David,

Where data is the body of an HTTP GET request to a ReST-ish web service
(I can't post the endpoint of the web service since it's an internal
server)

First, you need to get your internal service fixed so it returns properly
XML-encoded URLs. You are not currently getting well-formed XML back from
the service, so you're stuck with garbage at the moment. If you get really
stuck, you're going to need to find and replace those ampersands yourself.
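The find-and-replace fallback mentioned above could be sketched roughly as follows. This is a stopgap under stated assumptions, not a substitute for fixing the service: the `BARE_AMPERSAND` pattern and the `escape_bare_ampersands` helper are hypothetical names, and the pattern would also rewrite ampersands inside CDATA sections, which is wrong if the payload uses them.

```ruby
require 'rexml/document'

# Matches an "&" that does not already start an entity or character
# reference such as &amp;, &#38; or &#x26;.
BARE_AMPERSAND = /&(?!(?:[A-Za-z_][\w.-]*|#\d+|#x[0-9A-Fa-f]+);)/

# Hypothetical helper: escape stray ampersands so the payload becomes
# well-formed before it is handed to the parser.
def escape_bare_ampersands(xml)
  xml.gsub(BARE_AMPERSAND, '&amp;')
end

raw = '<link>http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&LOC=3&mdl=someModel</link>'
doc = REXML::Document.new(escape_bare_ampersands(raw))
puts doc.root.text
```

Ampersands that are already escaped are left alone, so running the helper on a correct response does no harm.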

Is there any way that I can ignore the formatting rules for illegal
characters and just take the whole text as a string? The article that
the first person posted said that the web service should return links in
this format:

http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&amp;LOC=3&amp;mdl=someModel

using &amp; instead of just & between arguments in the URL, which might
fly with REXML, but I'm not sure how to strip all that back off to make
it into a valid link, since my browser doesn't know what to do with a
URL that has &amp; between all the query parameters.

You don't have to - any XML parser will make the change for you because it
understands the &amp; encodings and will return you the proper string with
nice pretty ampersands. Please try using the proper XML string prior to
complaining that it's not what you want.

John
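A short sketch of the point John makes here, assuming a hypothetical well-formed version of the payload in which each `&` in the URL is written as `&amp;`. REXML decodes the entity when the text is read, so the string handed back is already a usable URL.

```ruby
require 'rexml/document'

# Hypothetical well-formed payload: "&" in the URL is written as "&amp;".
xml = <<~XML
  <product>
    <profile>
      <support>
        <title>eSupport</title>
        <link>http://www.somecompany.com/US/perl/model-home.pl?XID=M:products:crmportal&amp;LOC=3&amp;mdl=someModel</link>
      </support>
    </profile>
  </product>
XML

doc = REXML::Document.new(xml)

# Element#text returns the unescaped value, so no manual cleanup is needed.
link = doc.elements['product/profile/support/link'].text.strip
puts link
```

The printed link contains plain `&` separators between the query parameters, which is what a browser expects.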

David_Sainte-claire · 30 November 2009 19:21

Hey John! Thanks! You saved me so much time! I wasn't exactly
complaining. I just didn't fully understand the implications of having
raw ampersands in an XML document. Now that I know, I can tell the guys
who wrote the service that they need to clean it up! I'll try to keep my
questions to a minimum, or at least the really obviously NEWB ones...

Thanks again
