[libxml]: Can't find nodes using XPath, namespaces mess

Hi,

I am having problems accessing elements in an XML document using
XPath. My XML document looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<configuration-data
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
xsi:schemaLocation="urn:company:platform:foundation:configuration:defn:v1"
xmlns="urn:company:platform:foundation:configuration:defn:v1">
<attributeList>
<attribute name="siteid" validationRuleName="String" description="Site
id">
<tree name="siteid_hierarchy">
<treenode name="Root">
<treenode name="1" />
</treenode>
</tree>
</attribute>
</attributeList>
</configuration-data>

My XPath only works when I remove all the namespaces from the root node
but I do need to access it without modifying the xml.

I am using:
ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-mswin32]
libxml-ruby (1.1.3)

···

--
Posted via http://www.ruby-forum.com/.

Have you run your XML through a validator? That semicolon looks invalid to
me. m.


As Matt said, the document is not well-formed XML. Try adding the
RECOVER option to the parser, which tells libxml to ignore syntax
errors like that.

> xmlns="urn:company:platform:foundation:configuration:defn:v1">

  ^^^^^ That says that all nodes inside this document (if not explicitly
  namespaced) belong to the default namespace declared here.

> My XPath only works when I remove all the namespaces from the root node
> but I do need to access it without modifying the xml.

You need to register that namespace with the libxml xpath engine. I'm
not sure how you register namespaces with libxml-ruby, but with
nokogiri, I would do this:

  doc = Nokogiri::XML(xml)
  doc.xpath('//ns:attribute', 'ns' => 'urn:company:platform:foundation:configuration:defn:v1')

Nokogiri will automatically register root level namespaces, so you could
also do this:

  doc = Nokogiri::XML(xml)
  doc.xpath('//xmlns:attribute')

I know there is a way to do this with libxml-ruby, I just don't know the
syntax off the top of my head. Look through the libxml-ruby
documentation for "find", and I'm sure you'll find how to register
namespaces.

···

On Sat, Aug 01, 2009 at 05:33:31AM +0900, Stanislaw Wozniak wrote:

--
Aaron Patterson
http://tenderlovemaking.com/

Hi, this was a typo, no semicolon in there:

<?xml version="1.0" encoding="UTF-8"?>
<configuration-data
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:company:platform:foundation:configuration:defn:v1"
xmlns="urn:company:platform:foundation:configuration:defn:v1">
<attributeList>
<attribute name="siteid" validationRuleName="String" description="Site
id">
<tree name="siteid_hierarchy">
<treenode name="Root">
<treenode name="1" />
</treenode>
</tree>
</attribute>
</attributeList>
</configuration-data>


Then what's the problem? XPath works:

s = <<END
<?xml version="1.0" encoding="UTF-8"?>
<configuration-data
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:company:platform:foundation:configuration:defn:v1"
xmlns="urn:company:platform:foundation:configuration:defn:v1">
<attributeList>
<attribute name="siteid" validationRuleName="String" description="Site
id">
<tree name="siteid_hierarchy">
<treenode name="Root">
<treenode name="1" />
</treenode>
</tree>
</attribute>
</attributeList>
</configuration-data>
END
require 'rexml/document'
include REXML
doc = Document.new(s)
p XPath.match(doc, "//treenode['Root']/treenode")
#=> [<treenode name='1'/>]

Oh, wait, you said you were using libxml:

s = <<END
<?xml version="1.0" encoding="UTF-8"?>
<configuration-data>
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:company:platform:foundation:configuration:defn:v1"
xmlns="urn:company:platform:foundation:configuration:defn:v1">
<attributeList>
<attribute name="siteid" validationRuleName="String" description="Site
id">
<tree name="siteid_hierarchy">
<treenode name="Root">
<treenode name="1" />
</treenode>
</tree>
</attribute>
</attributeList>
</configuration-data>
END
require 'rubygems'
require 'xml'
doc = XML::Document.string(s)
doc.find("//treenode['Root']/treenode").each do |el|
  p el #=> <treenode name="1"/>
end

Sorry, I'm failing to guess what problem you're having. Perhaps if you
showed your actual code? m.

> Then what's the problem? XPath works:
>
> p XPath.match(doc, "//treenode['Root']/treenode")
> #=> [<treenode name='1'/>]

Wow. These results are just wrong. This is a bug in REXML. In XPath,
when you do not specify a namespace for your node, that means that you
want a node *with no namespace*.

For example:

  require 'rexml/document'
  
  include REXML
  
  s = <<END
  <?xml version="1.0" encoding="UTF-8"?>
  <shop>
    <!-- car inventory -->
    <inventory xmlns="http://gm.com/">
      <tire name="all season" />
    </inventory>
  
    <!-- bike inventory -->
    <inventory xmlns="http://schwinn.com/">
      <tire name="street" />
    </inventory>
  
    <!-- no namespace inventory -->
    <inventory>
      <tire name="wtf" />
    </inventory>
  </shop>
  END
  
  doc = Document.new(s)
  
  p XPath.match(doc, "//tire")

REXML matches *all three* tires. Surely a car tire is not the same as a bike
tire? Using XPath, how would I query for a tire that has *no namespace*
(the third one) without matching the two that *do* belong in a
namespace (it's possible to do this with REXML, just strange)? The XPath used
above *should* only match the third entry.

This is a broken implementation of XPath.
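
For comparison, here is a minimal sketch of how the two namespaced tires can
be told apart in REXML by passing an explicit prefix-to-URI mapping as the
third argument to XPath.match. The prefixes "car" and "bike" are names made
up for this query, and http://schwinn.com/ is an assumed URI for the bike
inventory:

```ruby
require 'rexml/document'

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <shop>
    <inventory xmlns="http://gm.com/">
      <tire name="all season" />
    </inventory>
    <inventory xmlns="http://schwinn.com/">
      <tire name="street" />
    </inventory>
    <inventory>
      <tire name="wtf" />
    </inventory>
  </shop>
XML

doc = REXML::Document.new(xml)

# Hypothetical prefixes, chosen only for this query; each one is bound to a
# namespace URI declared in the document above.
namespaces = {
  'car'  => 'http://gm.com/',
  'bike' => 'http://schwinn.com/'
}

car_tires  = REXML::XPath.match(doc, '//car:tire', namespaces)
bike_tires = REXML::XPath.match(doc, '//bike:tire', namespaces)

car_names  = car_tires.map  { |tire| tire.attributes['name'] }
bike_names = bike_tires.map { |tire| tire.attributes['name'] }

puts "car:  #{car_names.inspect}"
puts "bike: #{bike_names.inspect}"
```

With the mapping in place, each prefixed query should return only the tire
from its own inventory.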

Oh, wait, you said you were using libxml:

You have an error in your XML below

s = <<END
<?xml version="1.0" encoding="UTF-8"?>
<configuration-data>

                     ^ That ">" should not be there.
libxml-ruby has corrections turned on by default, so you've effectively
removed all namespaces from this document.

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:company:platform:foundation:configuration:defn:v1"
xmlns="urn:company:platform:foundation:configuration:defn:v1">
<attributeList>
<attribute name="siteid" validationRuleName="String" description="Site
id">
<tree name="siteid_hierarchy">
<treenode name="Root">
<treenode name="1" />
</treenode>
</tree>
</attribute>
</attributeList>
</configuration-data>
END
require 'rubygems'
require 'xml'
doc = XML::Document.string(s)
doc.find("//treenode['Root']/treenode").each do |el|
  p el #=> <treenode name="1"/>
end

Sorry, I'm failing to guess what problem you're having. Perhaps if you
showed your actual code? m.

Since the namespaces were removed, this example succeeds.

···

On Sun, Aug 02, 2009 at 12:50:05AM +0900, Matt Neuburg wrote:

Stanislaw Wozniak <stan@wozniak.com> wrote:


Thanks for spotting that. I must have removed the namespace and then put
it back, to see if I could duplicate the OP's problems, and I must have
put it back wrong. I wish libxml had just complained that my XML was
bad...

You're right; fixing the error, I can now duplicate the OP's problem in
libxml (but not in REXML, as you also observed). And then I can solve
it:

s = <<END
<?xml version="1.0" encoding="UTF-8"?>
<configuration-data
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:company:platform:foundation:configuration:defn:v1"
xmlns="urn:company:platform:foundation:configuration:defn:v1">
<attributeList>
<attribute name="siteid" validationRuleName="String" description="Site
id">
<tree name="siteid_hierarchy">
<treenode name="Root">
<treenode name="1" />
</treenode>
</tree>
</attribute>
</attributeList>
</configuration-data>
END
require 'rubygems'
require 'xml'
doc = XML::Document.string(s)
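# The prefix used in the XPath is arbitrary; it only has to be mapped to
# the URI of the document's default namespace.
ns = {"xsi" => "urn:company:platform:foundation:configuration:defn:v1"}
# [@name='Root'] tests the attribute; a bare ['Root'] is always true.
doc.find("//xsi:treenode[@name='Root']/xsi:treenode", ns).each do |el|
  p el #=> <treenode name="1"/>
end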

That is the desired sort of result, I take it. Notice that we register
the namespace with the XPath engine and that we actually use the
namespace in our XPath expression. m.
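
The registered prefix does not have to be "xsi", or any prefix that appears
in the document; the XPath engine matches on the namespace URI. Here is a
small sketch of that point. It uses REXML rather than libxml-ruby only
because REXML ships with Ruby, and the prefix "cfg" and the URI
urn:example:other are made-up names:

```ruby
require 'rexml/document'

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <configuration-data
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="urn:company:platform:foundation:configuration:defn:v1">
    <attributeList>
      <attribute name="siteid" validationRuleName="String" description="Site id">
        <tree name="siteid_hierarchy">
          <treenode name="Root">
            <treenode name="1" />
          </treenode>
        </tree>
      </attribute>
    </attributeList>
  </configuration-data>
XML

doc = REXML::Document.new(xml)
default_ns = 'urn:company:platform:foundation:configuration:defn:v1'

# "cfg" is a hypothetical prefix used only inside the XPath expression.
# It is mapped to the URI of the document's default namespace.
children = REXML::XPath.match(doc, '//cfg:treenode/cfg:treenode',
                              { 'cfg' => default_ns })
child_names = children.map { |node| node.attributes['name'] }

# The same prefix mapped to an unrelated URI finds nothing, because the
# elements are not in that namespace.
wrong = REXML::XPath.match(doc, '//cfg:treenode/cfg:treenode',
                           { 'cfg' => 'urn:example:other' })

puts child_names.inspect
puts wrong.length
```

The same idea applies to the hash passed to find in libxml-ruby above: only
the URI on the right-hand side has to agree with the document.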

···

Aaron Patterson <aaron@tenderlovemaking.com> wrote:

You have an error in your XML below