I’ve written an HTML parser that builds trees from HTML source. After
I wrote it, I discovered REXML, which does the same thing for XML.
Then I made an add-on that uses REXML’s XPath support to do XPath
queries on the resultant HTML tree. In my test version, these queries
return REXML tree elements, rather than my HTML tree elements.
Having two very similar tree structures (HTML and REXML) smells, to
me. The fact that they have somewhat different APIs confuses even me.
What I’m wondering (and would like your input on):
Should I just require REXML and not bother with my own tree
elements? I could, after all, just build a REXML document instead.
This has the disadvantage of requiring yet another package to be
installed, though.
If I don’t build an REXML tree, I could still return my own
elements from XPath queries. That is, I could use REXML transparently
and not expose the user to any of REXML’s elements. Would this be a
preferable way to provide XPath support?
I’ve written an HTML parser that builds trees from HTML source. After
I wrote it, I discovered REXML, which does the same thing for XML.
I started looking at your parser (thanks!) and wanted to load the
resulting HTML into REXML.
Then I made an add-on that uses REXML’s XPath support to do XPath
queries on the resultant HTML tree. In my test version, these queries
return REXML tree elements, rather than my HTML tree elements.
… and then I wanted to do XPath queries. So far, so good!
Having two very similar tree structures (HTML and REXML) smells, to
me. The fact that they have somewhat different APIs confuses even me.
What I’m wondering (and would like your input on):
Should I just require REXML and not bother with my own tree
elements? I could, after all, just build a REXML document instead.
This has the disadvantage of requiring yet another package to be
installed, though.
True, but REXML is part of the Windows PragProg install, and it’s quick and
easy to install from source, anyway.
If I don’t build an REXML tree, I could still return my own
elements from XPath queries. That is, I could use REXML transparently
and not expose the user to any of REXML’s elements. Would this be a
preferable way to provide XPath support?
Wouldn’t REXML still need to be installed?
BTW, quick question: Is there any documentation (aside from the terse API
docs) about your parser? Is there a way to grab the parsed HTML and assign
it to a string? All I’ve seen is how to write to a file.
I think implementing your HTML parser as an add-on to REXML is a great idea.
If your parser builds a REXML document, you will inherit any improvements to
REXML, such as faster XPath support, with no extra effort on your part. Also
it will be easier for people writing applications that process both XML and
HTML if they only need to learn one document API. The only difficulty I can
see is getting the document to write itself out as HTML rather than XML.
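To make that last point concrete, here is a rough sketch of what "writing out as HTML" could mean for a REXML tree: elements such as br are emitted without a closing tag or a trailing slash. The to_html helper and the VOID_ELEMENTS list are made up for illustration (they are not part of REXML or of the parser discussed here), and the list is deliberately incomplete.

```ruby
require 'rexml/document'

# Hypothetical, partial list of HTML elements written without a closing tag.
VOID_ELEMENTS = %w[area base br col hr img input link meta].freeze

# Illustrative helper (not a REXML method): serialize a node as HTML.
def to_html(node)
  case node
  when REXML::Element
    out = "<#{node.name}"
    node.attributes.each_attribute do |attr|
      out << %( #{attr.expanded_name}="#{attr.to_s}")
    end
    out << '>'
    # Void elements get no children and no end tag.
    return out if VOID_ELEMENTS.include?(node.name.downcase)
    node.children.each { |child| out << to_html(child) }
    out << "</#{node.name}>"
  when REXML::Text
    node.to_s # keeps entities such as &amp; in their escaped form
  else
    '' # comments, processing instructions, etc. are skipped in this sketch
  end
end

doc = REXML::Document.new('<p class="x">Hi<br/>there</p>')
html_out = to_html(doc.root)
puts html_out
```

A real implementation would also have to deal with doctype declarations, script and style content, and attribute minimization, none of which is attempted here.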
Cheers,
Nat.
Dr. Nathaniel Pryce
B13media Ltd.
Studio 3a, Aberdeen Business Centre, 22/24 Highbury Grove, London, N5 2EA http://www.b13media.com
----- Original Message -----
From: “Ned Konz” ned@bike-nomad.com
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Tuesday, June 04, 2002 4:30 PM
Subject: HTML Parser suggestions wanted
I’ve written an HTML parser that builds trees from HTML source. After
I wrote it, I discovered REXML, which does the same thing for XML.
Then I made an add-on that uses REXML’s XPath support to do XPath
queries on the resultant HTML tree. In my test version, these queries
return REXML tree elements, rather than my HTML tree elements.
Having two very similar tree structures (HTML and REXML) smells, to
me. The fact that they have somewhat different APIs confuses even me.
What I’m wondering (and would like your input on):
Should I just require REXML and not bother with my own tree
elements? I could, after all, just build a REXML document instead.
This has the disadvantage of requiring yet another package to be
installed, though.
If I don’t build an REXML tree, I could still return my own
elements from XPath queries. That is, I could use REXML transparently
and not expose the user to any of REXML’s elements. Would this be a
preferable way to provide XPath support?
I decided to have it both ways: you can either make an “old-style”
HTMLTree::Element representation (without needing REXML), or you can
make a REXML::Document representation (which of course needs REXML).
If you have REXML you can convert the HTMLTree::Element style into the
REXML::Document style.
XPath queries against HTMLTree::Element trees still return REXML
elements; I’ll probably change this soon.
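For anyone who has not used REXML's XPath support, a minimal sketch of the kind of query involved follows. It builds the REXML::Document directly from a well-formed string rather than through the HTML parser, whose API is not shown in this thread, so treat the input as a stand-in for whatever the parser would produce.

```ruby
require 'rexml/document'

# Stand-in input: REXML itself needs well-formed markup.
source = '<html><body>' \
         '<p class="intro">Hello</p><p>World</p>' \
         '<a href="http://www.ruby-lang.org/">Ruby</a>' \
         '</body></html>'
doc = REXML::Document.new(source)

# XPath.match returns an array of REXML::Element objects.
paragraphs = REXML::XPath.match(doc, '//p')
texts = paragraphs.map(&:text)

# XPath.first returns the first matching element, or nil if none match.
link = REXML::XPath.first(doc, '//a[@href]')
href = link.attributes['href']

puts texts.inspect
puts href
```

Note that the results are REXML elements, which is exactly the mismatch described above when the tree being queried is not a REXML one.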
True, but REXML is part of the Windows PragProg install, and it’s
quick and easy to install from source, anyway.
Yes, it’s just another dependency.
If I don’t build an REXML tree, I could still return my own
elements from XPath queries. That is, I could use REXML
transparently and not expose the user to any of REXML’s elements.
Would this be a preferable way to provide XPath support?
BTW, quick question: Is there any documentation (aside from the
terse API docs) about your parser?
No, but there should be some (there is the RDoc stuff). Care to write
it? I guess I was waiting for the API to stabilize. There will be an
article in the near future on the O’Reilly site, I believe.
Is there a way to grab the
parsed HTML and assign it to a string? All I’ve seen is how to
write to a file.
No, but I’ll add it. Is there a string IO class in Ruby? If there
were, you could use print_on() (after I make it take a stream
argument).
I see stream.rb in the RAA, but don’t know whether it would work well
with Strings.
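As a sketch of how the stream idea could work: current Ruby versions ship a StringIO class in the standard library, and anything that writes with << can target it. The SketchNode class below is a hypothetical stand-in, and its print_on(stream) signature is an assumption about what the method might look like once it takes a stream argument; it is not the parser's actual API.

```ruby
require 'stringio'

# Hypothetical stand-in for a parsed tree node; not the parser's real class.
class SketchNode
  def initialize(name, children = [])
    @name = name
    @children = children
  end

  # Assumed signature: write this node and its children to any object
  # that responds to <<, defaulting to standard output.
  def print_on(stream = $stdout)
    stream << "<#{@name}>"
    @children.each do |child|
      if child.respond_to?(:print_on)
        child.print_on(stream)
      else
        stream << child.to_s
      end
    end
    stream << "</#{@name}>"
  end
end

tree = SketchNode.new('p', ['Hello, ', SketchNode.new('b', ['world'])])

buffer = StringIO.new
tree.print_on(buffer)
html = buffer.string # => "<p>Hello, <b>world</b></p>"
puts html
```

Because String also responds to <<, passing an empty mutable String instead of a StringIO would work with this particular sketch as well.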