Libxml: is it possible not to use doctype declaration?

ruud · 29 July 2008 13:44

hi all,

I have tried to find elements in XML documents with xpath expression
support in libxml:

require 'xml/libxml'
doc = XML::Document.file( file)
node = doc.find_first( 'doc/p[@att]/@att')

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

Is there a way to tell the Document-object that it should ignore the
doctype declaration if present? Or should I first remove the
declaration from the document before calling new?

regards, Ruud

Phlip1 · 29 July 2008 14:14

ruud grosmann wrote:

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

If the doctype is HTML, open the document like this:

     xp = XML::HTMLParser.new()
     xp.string = xhtml
     XML::Parser.default_pedantic_parser = false
     doc = xp.parse

My assertxpath gem shows how, in the method assert_libxml.

--
Phlip

Tommy_Nordgren · 29 July 2008 17:48

Check whether your XML processor supports XML catalog files. They provide a mapping from web-based
paths to local file names.
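
A minimal sketch of the catalog idea, assuming a Unix-like system and that the underlying libxml2 consults the XML_CATALOG_FILES environment variable when it resolves the DTD. The helper name write_catalog, the temporary directory, and the one-line local DTD are made up for illustration; the system identifier is the one from the sample document posted elsewhere in this thread.

```ruby
require 'tmpdir'

# System identifier taken from the sample document in this thread.
SYSTEM_ID = 'http://some.site.nl/dtd/test.dtd'

# Hypothetical helper: writes an OASIS XML catalog that maps one system
# identifier to a local file and returns the catalog's path.
# Note: the values are interpolated as-is, so they must not contain
# characters that need XML escaping (quotes, ampersands, angle brackets).
def write_catalog(dir, system_id, local_dtd_path)
  catalog_path = File.join(dir, 'catalog.xml')
  File.write(catalog_path, <<~XML)
    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <system systemId="#{system_id}" uri="file://#{local_dtd_path}"/>
    </catalog>
  XML
  catalog_path
end

dir = Dir.mktmpdir
local_dtd = File.join(dir, 'test.dtd')
# Placeholder DTD content, only here so the mapped file exists.
File.write(local_dtd, '<!ELEMENT test (#PCDATA)>')

catalog_path = write_catalog(dir, SYSTEM_ID, local_dtd)

# libxml2 reads this variable as a list of catalog files.
ENV['XML_CATALOG_FILES'] = catalog_path
puts File.read(catalog_path)
```

Set the variable before the first document is parsed. Whether the Ruby binding actually picks the catalog up is something to verify on your own install.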

-------------------------------------
This sig is dedicated to the advancement of Nuclear Power
Tommy Nordgren
tommy.nordgren@comhem.se

Mark_Guzman · 31 July 2008 06:50

Give fastxml a try. It's also a ruby interface to libxml.

http://fastxml.rubyforge.org/
--mg

ruud · 29 July 2008 14:44

hi Phlip,

thanks for the suggestion. The document is not an HTML document. It is
an XML document. It is something like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">
<test>
this is a test
</test>

I don't want XML::Document to resolve the URL and wait for a
timeout. I couldn't find anything in the documentation on this.

regards, Ruud

ruud · 31 July 2008 07:51

hi Mark,

thanks for this hint. I had decided libxslt was not for me because of
a problem with garbage collection after I started using it (see other
post).
So a good alternative is welcome. I'll check it out later this week.

regards, Ruud

Phlip1 · 29 July 2008 15:49

ruud grosmann wrote:

I don't want XML::Document to resolve the URL and wait for a
timeout. I couldn't find anything in the documentation on this.

Use string surgery to yank out the DOCTYPE.
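
A rough sketch of what that string surgery could look like, using only plain Ruby. The method name strip_doctype and the pattern are my own, not from any library. The pattern assumes the quoted identifiers contain no ">" or "[" and that an internal subset, if present, contains no "]" inside quoted values; anything fancier needs a real parser.

```ruby
# Matches a DOCTYPE declaration, with an optional internal subset in [...].
# Assumption: no ">" or "[" inside the quoted public/system identifiers.
DOCTYPE_PATTERN = /<!DOCTYPE\s[^\[>]*(?:\[.*?\]\s*)?>/m

# Hypothetical helper: returns a copy of the XML text without its DOCTYPE.
def strip_doctype(xml)
  xml.sub(DOCTYPE_PATTERN, '')
end

sample = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
  "http://some.site.nl/dtd/test.dtd">
  <test>
  this is a test
  </test>
XML

stripped = strip_doctype(sample)
puts stripped
```

The stripped string can then be handed to the parser as a string instead of a file name. Note that entities declared in the DTD will no longer resolve once the declaration is gone.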

ruud · 29 July 2008 17:31

hi Phlip,

thank you for the hint. I did it already, but I was wondering if there
is some hidden option that would do it for me.

Is my assumption correct that the class is not documented very well?
After googling for some time I only found something that appeared to
be outdated. That's why I eventually posted my question here.

Is using libxml the right thing to do, or are there smarter alternatives?

thanks, Ruud

Phlip1 · 29 July 2008 18:19

ruud grosmann wrote:

Is using libxml the right thing to do, or are there smarter alternatives?

Libxml-ruby is the most complete & accurate parser of the big three (REXML, Libxml-ruby, and Hpricot), but its documentation can be very challenging. How much of the original C Libxml documentation have you been able to read?
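
For comparison, a small sketch with REXML, one of the three parsers named above (it ships with Ruby, as a bundled gem in recent versions). REXML is pure Ruby and, to my knowledge, does not fetch the external DTD named in the DOCTYPE, so the lookup should not touch the network. The p element and its att attribute are invented here to mirror the XPath from the original question.

```ruby
require 'rexml/document'

# Sample document with an external DTD reference; the <p att="..."> element
# is an illustrative addition, not part of the original sample.
xml = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL" "http://some.site.nl/dtd/test.dtd">
  <test>
    <p att="value">this is a test</p>
  </test>
XML

doc = REXML::Document.new(xml)

# Select the att attribute of the first <p> that has one.
attribute = REXML::XPath.first(doc, '/test/p[@att]/@att')
attribute_value = attribute && attribute.value
puts attribute_value
```

REXML is generally slower than libxml on large documents, so this is a trade-off rather than a drop-in win.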

Phill_Davies · 29 July 2008 21:38

I tried to reply to this via the ruby-talk mailing list and it didn't
work. Not sure why not, maybe someone can fill me in on that. Anyway,
here's my take:

To start, the rdoc documentation can be found at
http://libxml.rubyforge.org/rdoc/index.html. Now I don't know this for
sure, but

<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">

doesn't look like a real doctype definition, so if you can pull it out
of your xml (by hand, not programmatically) before trying to parse it,
I'd say that would be a good idea. That being said, there are two
attributes of the XML::Parser class that look like they may be of
interest: default_load_external_dtd and default_validity_checking. Try
setting both of those to false, unless you have a real dtd to validate
against and the example above was fake. Of course, since this is using
XML::Parser instead of XML::Document I think you would need to do e.g.:

parser = XML::Parser.file(<file>)
parser.default_load_external_dtd = false
parser.default_validity_checking = false
doc = parser.parse

... and then go from there.
Phill D.

Phill_Davies · 30 July 2008 00:53

Whoops, those were supposed to be class variables. What you really want to do (I think) is more like:

LibXML::XML::Parser.default_load_external_dtd = false
LibXML::XML::Parser.default_validity_checking = false

And then:
parser = LibXML::XML::Parser.file(<file>)
doc = parser.parse

That seems to work with your example.

ruud · 30 July 2008 08:37

hi Phill,

I've tried it right away. I ended up with the following:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

parser = XML::Parser.file( file)
#parser.default_substitute_entities = false
#parser.default_load_external_dtd = false
#parser.default_validity_checking = false
doc = parser.parse
node = doc.find( xpath).first

But the script still tries to resolve the entity. The doctype
definition is a slightly changed real one. The message I get with the
above code is:

Operation in progress./tmp/ut21.uit:3: I/O warning : failed to load
external entity "http://ruud.grosmann.nl/op/dtd/publicatie.dtd"
e publicaties 1.0//NL" "http://ruud.grosmann.nl/op/dtd/publicatie.dtd"

You were right that the methods are not instance methods, although I
am not sure how to conclude that from the documentation.

Did I do something wrong in the script?

regards, Ruud

Phlip1 · 30 July 2008 11:14

ruud grosmann wrote:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

Did I do something wrong in the script?

When I was researching the difference between the normal XML parser and the HTML parser, I also observed those variables not working. That's why I didn't bring them up.

--
Phlip

Phill_Davies · 30 July 2008 15:41

Hey Ruud,
Nope, I can't see that you're doing anything wrong. I guess all I can say is that if you can send the actual XML, I can give it a try (when I use your original example it seems to work fine as long as I set those class variables). Also, the error message you sent was broken up; if you could send that again it would probably help. Here's what I'm using:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL" "http://some.site.nl/dtd/test.dtd">
<test>
this is a test
</test>

And here's the error I get when I don't set those class variables:

test.xml:2: I/O warning : failed to load HTTP resource
TYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL" "http://some.site.nl/dtd/test.dtd">

Thanks,
Phill

Robert_K1 · 1 August 2008 09:49

Hm, Java XML parsers I know have a special callback that you can set
that will deal with resolving external entities. I could not find
anything similar in libxml documentation but maybe I just looked in
the wrong places. With that you could load the file just once (or
even fetch it from some internal memory or file system). Also, I find
it a bit strange that those flags are global - this can introduce
weird bugs when using an application which parses XML concurrently and
needs different flags for each process...
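
One way to contain the global-flag problem described above, as a sketch only: change the class-level defaults inside a lock and restore them afterwards. FakeParser is a hypothetical stand-in so the example runs without libxml; with the real library you would pass XML::Parser instead, assuming it exposes the same default_* accessors used earlier in this thread. The helper name with_parser_defaults is also made up.

```ruby
# Hypothetical stand-in for a parser class with class-level defaults.
class FakeParser
  class << self
    attr_accessor :default_load_external_dtd, :default_validity_checking
  end
  self.default_load_external_dtd = true
  self.default_validity_checking = true
end

# One lock for every caller that touches the global defaults.
# Mutex is not reentrant, so do not nest calls to the helper below.
PARSER_DEFAULTS_LOCK = Mutex.new

# Temporarily applies the given class-level settings, runs the block,
# then restores the previous values even if the block raises.
def with_parser_defaults(parser_class, settings)
  PARSER_DEFAULTS_LOCK.synchronize do
    saved = settings.keys.to_h { |name| [name, parser_class.public_send(name)] }
    begin
      settings.each { |name, value| parser_class.public_send("#{name}=", value) }
      yield
    ensure
      saved.each { |name, value| parser_class.public_send("#{name}=", value) }
    end
  end
end

inside = with_parser_defaults(FakeParser, default_load_external_dtd: false) do
  FakeParser.default_load_external_dtd
end
puts inside
puts FakeParser.default_load_external_dtd
```

This only protects threads in one process that all go through the helper; code that reads or sets the defaults directly can still race with it.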

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

ruud · 2 August 2008 17:38

thanks everybody,

I think I'd rather do a system call to saxon. There are just too many little
bugs and uncertainties for me. Thanks anyway for your efforts and
for helping me.

Regards, Ruud
