Libxml: is it possible not to use doctype declaration?

hi all,

I am trying to find elements in XML documents using the XPath
support in libxml:

require 'xml/libxml'
doc = XML::Document.file( file)
node = doc.find_first( 'doc/p[@att]/@att')

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

Is there a way to tell the Document object that it should ignore the
doctype declaration if present? Or should I first remove the
declaration from the document before calling new?

regards, Ruud

ruud grosmann wrote:

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

If the doctype is HTML, open the document like this:

     xp = XML::HTMLParser.new()
     xp.string = xhtml
     XML::Parser.default_pedantic_parser = false
     doc = xp.parse

My assertxpath gem shows how, in the method assert_libxml.

--
   Phlip

Check whether your XML processor supports XML catalog files. They provide a mapping from web-based
paths to local file names.
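
As an illustration, a catalog that maps the identifiers from this thread to a local copy of the DTD might be set up as sketched below. libxml2 reads the XML_CATALOG_FILES environment variable when it initialises its catalogs. The local DTD path and the catalog location are made up for the example, and whether a given libxml build has catalog support enabled is something to check.

```ruby
require 'tmpdir'

# Hypothetical local copy of the DTD; adjust to wherever the file really lives.
local_dtd = File.join(Dir.tmpdir, 'test.dtd')

# OASIS XML catalog mapping the public and system identifiers to the local file.
catalog = <<~XML
  <?xml version="1.0"?>
  <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <public publicId="-//FARAWAY//DTD-verweg//NL" uri="file://#{local_dtd}"/>
    <system systemId="http://some.site.nl/dtd/test.dtd" uri="file://#{local_dtd}"/>
  </catalog>
XML

catalog_path = File.join(Dir.tmpdir, 'catalog.xml')
File.write(catalog_path, catalog)

# libxml2 consults this variable when it first loads its catalogs,
# so set it before the first document is parsed.
ENV['XML_CATALOG_FILES'] = catalog_path
```

With the variable set, a lookup of the remote DTD should be redirected to the local file instead of going out over the network.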

On 29 jul 2008, at 15.44, ruud grosmann wrote:

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

-------------------------------------
This sig is dedicated to the advancement of Nuclear Power
Tommy Nordgren
tommy.nordgren@comhem.se

Give fastxml a try. It's also a ruby interface to libxml.

http://fastxml.rubyforge.org/
--mg

hi Phlip,

thanks for the suggestion. The document is not an HTML document. It is
an XML document. It is something like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">
<test>
<p>this is a test</p>
</test>

I don't want XML::Document to resolve the URL and wait for a
timeout. I couldn't find anything in the documentation on this.

regards, Ruud

hi Mark,

thanks for this hint. I had decided libxslt was not for me because of
a problem with garbage collection after starting to use it (see other
post).
So a good alternative is welcome. I'll check it out later this week.

regards, Ruud

ruud grosmann wrote:

I don't want XML::Document to resolve the URL and wait for a
timeout. I couldn't find anything in the documentation on this.

Use string surgery to yank out the DOCTYPE.
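
A minimal sketch of that string surgery in plain Ruby follows. The regular expression and the helper name strip_doctype are illustrative, not taken from any library. The pattern copes with a simple internal subset in square brackets but is not a full XML grammar, so unusual declarations (for example a ">" inside a quoted literal) would need more care.

```ruby
# Matches a DOCTYPE declaration, optionally with an internal subset in [...].
DOCTYPE_RE = /<!DOCTYPE\s+[^\[>]*(?:\[.*?\]\s*)?>\s*/m

# Return a copy of the XML text with the first DOCTYPE declaration removed.
def strip_doctype(xml)
  xml.sub(DOCTYPE_RE, '')
end

xml = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
    "http://some.site.nl/dtd/test.dtd">
  <test>
  <p>this is a test</p>
  </test>
XML

clean = strip_doctype(xml)
puts clean
```

The cleaned string can then be handed to the parser as a string instead of a file name (with libxml-ruby, presumably via XML::Parser.string).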

hi Phlip,

thank you for the hint. I did that already, but I was wondering if there
is some hidden option that would do it for me.

Is my assumption correct that the class is not documented very well?
After googling for some time I only found something that appeared to
be outdated. That is why I eventually posted my question here.

Is using libxml the right thing to do, or are there smarter alternatives?

thanks, Ruud

ruud grosmann wrote:

Is using libxml the right thing to do, or are there smarter alternatives?

Libxml-ruby is the most complete & accurate parser of the big three (REXML, Libxml-ruby, and Hpricot), and its documentation can be very challenging. How much of the original C Libxml documentation have you been able to read?

I tried to reply to this via the ruby-talk mailing list and it didn't
work. Not sure why not, maybe someone can fill me in on that. Anyway,
here's my take:

To start, the rdoc documentation can be found at
http://libxml.rubyforge.org/rdoc/index.html. Now I don't know this for
sure, but

<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">

doesn't look like a real doctype definition, so if you can pull it out
of your xml (by hand, not programmatically) before trying to parse it,
I'd say that would be a good idea. That being said, there are two
attributes of the XML::Parser class that look like they may be of
interest: default_load_external_dtd and default_validity_checking. Try
setting both of those to false, unless you have a real dtd to validate
against and the example above was fake. Of course, since this is using
XML::Parser instead of XML::Document I think you would need to do e.g.:
parser = XML::Parser.file(<file>)
parser.default_load_external_dtd = false
parser.default_validity_checking = false
doc = parser.parse

... and then go from there.
Phill D.

--
Posted via http://www.ruby-forum.com/.

Whoops, those were supposed to be class variables. What you really want to do (I think) is more like:

LibXML::XML::Parser.default_load_external_dtd = false
LibXML::XML::Parser.default_validity_checking = false

And then:
parser = LibXML::XML::Parser.file(file)  # file holds the path to the XML document
doc = parser.parse

That seems to work with your example.

hi Phill,

I've tried it right away. I ended up with the following:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

parser = XML::Parser.file( file)
#parser.default_substitute_entities = false
#parser.default_load_external_dtd = false
#parser.default_validity_checking = false
doc = parser.parse
node = doc.find( xpath).first

But the script still tries to resolve the entity. The doctype
definition is a slightly changed real one. The message I get with the
above code is:

Operation in progress./tmp/ut21.uit:3: I/O warning : failed to load
external entity "http://ruud.grosmann.nl/op/dtd/publicatie.dtd"
e publicaties 1.0//NL" "http://ruud.grosmann.nl/op/dtd/publicatie.dtd"

You were right that the methods are not instance methods, although I
am not sure how to conclude that from the documentation.

Did I do something wrong in the script?

regards, Ruud

ruud grosmann wrote:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

Did I do something wrong in the script?

When I was researching the difference between the normal XML parser and the HTML parser, I also observed those variables not working. That's why I didn't bring them up.
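
Later libxml-ruby releases accept parse options per parser rather than through class-level defaults. A minimal sketch, assuming such a release is installed (this API postdates the thread, and the helper name parse_without_network is made up here): the NONET option asks libxml2 to refuse network access, so a remote DTD named in the doctype should not be fetched.

```ruby
# Hypothetical helper: parse a file while forbidding network access,
# so a remote DTD in the doctype is not fetched.
def parse_without_network(path)
  # libxml-ruby gem; required lazily so that defining the helper has no side effects.
  require 'libxml'

  options = LibXML::XML::Parser::Options::NONET
  LibXML::XML::Parser.file(path, options: options).parse
end

# Usage (assumes the gem is installed and test.xml exists):
#   doc  = parse_without_network('test.xml')
#   node = doc.find_first('/test/p')
```

Because the option travels with the individual parser, it does not depend on the global defaults that failed to take effect above.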

--
   Phlip

Hey Ruud,
    Nope, I can't see that you're doing anything wrong. All I can suggest
is that you send the actual XML so I can try it myself, because your
original example works fine for me as long as I set those class
variables. Also, the error message you sent was broken up; if you could
send it again, that would probably help. Here's what I'm using:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL" "http://some.site.nl/dtd/test.dtd">
<test>
<p>this is a test</p>
</test>

And here's the error I get when I don't set those class variables:

test.xml:2: I/O warning : failed to load HTTP resource
TYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL" "http://some.site.nl/dtd/test.dtd">

Thanks,
Phill

Hm, Java XML parsers I know have a special callback that you can set
that will deal with resolving external entities. I could not find
anything similar in libxml documentation but maybe I just looked in
the wrong places. With that you could load the file just once (or
even fetch it from some internal memory or file system). Also, I find
it a bit strange that those flags are global - this can introduce
weird bugs when using an application which parses XML concurrently and
needs different flags for each process...
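
One way to contain that risk, when the flags really are global, is to change them only inside a block guarded by a lock and to restore the previous value afterwards. The sketch below uses a stand-in ParserDefaults class rather than the real parser class, so all names are illustrative. A lock like this only helps within a single process, and only if every parse goes through it.

```ruby
# Stand-in for a class with global parser settings (not the real libxml class).
class ParserDefaults
  class << self
    attr_accessor :load_external_dtd
  end
  self.load_external_dtd = true
end

FLAG_LOCK = Mutex.new

# Temporarily override the global flag, run the block, then restore the old value.
def with_external_dtd(value)
  FLAG_LOCK.synchronize do
    previous = ParserDefaults.load_external_dtd
    ParserDefaults.load_external_dtd = value
    begin
      yield
    ensure
      ParserDefaults.load_external_dtd = previous
    end
  end
end

# The block sees the overridden value; the global is restored afterwards.
seen = with_external_dtd(false) { ParserDefaults.load_external_dtd }
puts seen
puts ParserDefaults.load_external_dtd
```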

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

thanks everybody,

I think I'd rather do a system call to Saxon. There are just too many
little bugs and uncertainties for me. Thanks anyway for your efforts
and for helping me.

Regards, Ruud
