HTML Parser suggestions wanted

I’ve written an HTML parser that builds trees from HTML source. After
I wrote it, I discovered REXML, which does the same thing for XML.

Then I made an add-on that uses REXML’s XPath support to do XPath
queries on the resultant HTML tree. In my test version, these queries
return REXML tree elements, rather than my HTML tree elements.
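
For readers unfamiliar with REXML's XPath support, here is a minimal sketch of the kind of query described above. It assumes the HTML has already been turned into well-formed markup; the helper name link_targets and the sample markup are made up for illustration and are not part of the parser or the add-on.

```ruby
require 'rexml/document'

# Hypothetical helper: return the href of every <a> element that has one.
def link_targets(doc)
  # XPath.match returns an Array of the matching REXML nodes.
  REXML::XPath.match(doc, '//a[@href]').map { |a| a.attributes['href'] }
end

# Well-formed markup standing in for a tree built from HTML source.
doc = REXML::Document.new(<<~HTML)
  <html>
    <body>
      <p>See <a href="http://bike-nomad.com/ruby/">the Ruby page</a>.</p>
      <p>Or <a href="http://www.ruby-lang.org/">ruby-lang.org</a>.</p>
    </body>
  </html>
HTML

targets = link_targets(doc)

# XPath.first returns only the first match (or nil). Note that the result
# is a REXML::Element, not an element of the HTML parser's own tree.
first_para = REXML::XPath.first(doc, '//p')
```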

Having two very similar tree structures (HTML and REXML) smells, to
me. The fact that they have somewhat different APIs confuses even me.

What I’m wondering (and would like your input on):

  1. Should I just require REXML and not bother with my own tree
    elements? I could, after all, just build a REXML document instead.
    This has the disadvantage of requiring yet another package to be
    installed, though.

  2. If I don’t build an REXML tree, I could still return my own
    elements from XPath queries. That is, I could use REXML transparently
    and not expose the user to any of REXML’s elements. Would this be a
    preferable way to provide XPath support?
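
One way option 2 could look, as a rough sketch only: keep a private REXML mirror of the tree, run the query there, and map each result back to the original element. HtmlNode and XPathFacade are hypothetical names invented for this example; they are not the parser's real classes.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the parser's own element class.
HtmlNode = Struct.new(:name, :attrs, :children)

# Hypothetical wrapper that hides REXML from the caller.
class XPathFacade
  def initialize(root)
    # Maps each mirrored REXML element back to the node it was built from.
    @own_for = {}.compare_by_identity
    @doc = REXML::Document.new
    mirror(root, @doc)
  end

  # Run the query on the private REXML mirror and return the caller's own
  # nodes. Results that are not elements (text, attributes) are dropped.
  def match(path)
    REXML::XPath.match(@doc, path).map { |el| @own_for[el] }.compact
  end

  private

  def mirror(node, parent)
    el = parent.add_element(node.name, node.attrs)
    @own_for[el] = node
    node.children.each { |child| mirror(child, el) }
  end
end

link = HtmlNode.new('a', { 'href' => 'http://bike-nomad.com/ruby/' }, [])
tree = HtmlNode.new('html', {}, [HtmlNode.new('body', {}, [link])])
found = XPathFacade.new(tree).match('//a[@href]')
```

The cost of this approach is that the tree exists twice in memory and the mirror must be rebuilt whenever the original tree changes.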

Thanks,

···


Ned Konz
http://bike-nomad.com/ruby/
GPG key ID: BEEA7EFE

I’ve written an HTML parser that builds trees from HTML source. After
I wrote it, I discovered REXML, which does the same thing for XML.

I started looking at your parser (thanks!) and wanted to load the
resulting HTML into REXML.

Then I made an add-on that uses REXML’s XPath support to do XPath
queries on the resultant HTML tree. In my test version, these queries
return REXML tree elements, rather than my HTML tree elements.

… and then I wanted to do XPath queries. So far, so good!

Having two very similar tree structures (HTML and REXML) smells, to
me. The fact that they have somewhat different APIs confuses even me.

What I’m wondering (and would like your input on):

  1. Should I just require REXML and not bother with my own tree
    elements? I could, after all, just build a REXML document instead.
    This has the disadvantage of requiring yet another package to be
    installed, though.

True, but REXML is part of the Windows PragProg install, and it’s quick and
easy to install from source, anyway.

  2. If I don’t build an REXML tree, I could still return my own
    elements from XPath queries. That is, I could use REXML transparently
    and not expose the user to any of REXML’s elements. Would this be a
    preferable way to provide XPath support?

Wouldn’t REXML still need to be installed?

BTW, quick question: Is there any documentation (aside from the terse API
docs) about your parser? Is there a way to grab the parsed HTML and assign
it to a string? All I’ve seen is how to write to a file.

Thanks,

James

···

I think implementing your HTML parser as an add-on to REXML is a great idea.
If your parser builds a REXML document, you will inherit any improvements to
REXML, such as faster XPath support, with no extra effort on your part. Also
it will be easier for people writing applications that process both XML and
HTML if they only need to learn one document API. The only difficulty I can
see is getting the document to write itself out as HTML rather than XML.

Cheers,
Nat.

···

Dr. Nathaniel Pryce
B13media Ltd.
Studio 3a, Aberdeen Business Centre, 22/24 Highbury Grove, London, N5 2EA
http://www.b13media.com

----- Original Message -----
From: “Ned Konz” ned@bike-nomad.com
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Tuesday, June 04, 2002 4:30 PM
Subject: HTML Parser suggestions wanted

I decided to have it both ways: you can either make an “old-style”
HTMLTree::Element representation (without needing REXML), or you can
make a REXML::Document representation (which of course needs REXML).

If you have REXML you can convert the HTMLTree::Element style into the
REXML::Document style.

XPath queries against HTMLTree::Element trees still return REXML
elements; I’ll probably change this soon.

Get it from http://bike-nomad.com/ruby/

I’d still like to hear your suggestions.

···

On Tuesday 04 June 2002 08:30 am, I wrote:

I’ve written an HTML parser that builds trees from HTML source.
After I wrote it, I discovered REXML, which does the same thing for
XML.


Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

True, but REXML is part of the Windows PragProg install, and it’s
quick and easy to install from source, anyway.

Yes, it’s just another dependency.

  2. If I don’t build an REXML tree, I could still return my own
    elements from XPath queries. That is, I could use REXML
    transparently and not expose the user to any of REXML’s elements.
    Would this be a preferable way to provide XPath support?

Wouldn’t REXML still need to be installed?

Only for XPath support (you can see my pre-1.03 at
http://bike-nomad.com/ruby/ruby-htmltools-1.03.tar.gz)

BTW, quick question: Is there any documentation (aside from the
terse API docs) about your parser?

No, but there should be some (there is the RDoc stuff). Care to write
it? I guess I was waiting for the API to stabilize. There will be an
article in the near future on the O’Reilly site, I believe.

Is there a way to grab the
parsed HTML and assign it to a string? All I’ve seen is how to
write to a file.

No, but I’ll add it. Is there a string IO class in Ruby? If there
were, you could use print_on() (after I make it take a stream
argument).

I see stream.rb in the RAA, but don’t know whether it would work well
with Strings.
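
On the string IO question: Ruby's standard library does include a StringIO class in current versions, and it can stand in wherever a method writes to a stream. The sketch below assumes a print_on that takes a stream argument, which is the change proposed above and not the parser's existing signature; SketchNode is a made-up class for illustration.

```ruby
require 'stringio'

# Hypothetical stand-in for a tree node. Its print_on accepts any object
# that responds to <<, such as a File, $stdout, or a StringIO.
class SketchNode
  def initialize(name, children = [])
    @name = name
    @children = children
  end

  def print_on(stream = $stdout)
    stream << "<#{@name}>"
    @children.each { |child| child.print_on(stream) }
    stream << "</#{@name}>"
    stream
  end
end

tree = SketchNode.new('html', [SketchNode.new('body', [SketchNode.new('p')])])

# Write into an in-memory buffer instead of a file, then read it back.
buffer = StringIO.new
tree.print_on(buffer)
html = buffer.string
```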

···

On Tuesday 04 June 2002 08:50 am, james@rubyxml.com wrote:


Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Of course, I can extend the REXML Document/Element to do this. And the
XML output from REXML looks like it would be valid XHTML.

···

On Tuesday 04 June 2002 09:07 am, Nat Pryce wrote:

The only difficulty I can see
is getting the document to write itself out as HTML rather than
XML.


Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Nat Pryce wrote:

The only difficulty I can
see is getting the document to write itself out as HTML rather than XML.

The current version of HTML is XHTML is XML.

Tobi

···


http://www.pinkjuice.com/

BTW, quick question: Is there any documentation (aside from the
terse API docs) about your parser?

No, but there should be some (there is the RDoc stuff). Care to write
it?

Oh, if I only had time … :)

James

Nat Pryce wrote:

The only difficulty I can
see is getting the document to write itself out as HTML rather than XML.
The current version of HTML is XHTML is XML.

There is still a need for older versions of HTML. For example, Pocket
Internet Explorer only supports HTML 3.

Cheers,
Nat.

···

From: “Tobias Reif” tobiasreif@pinkjuice.com


Dr. Nathaniel Pryce
B13media Ltd.
Studio 3a, Aberdeen Business Centre, 22/24 Highbury Grove, London, N5 2EA
http://www.b13media.com

XHTML has certain guidelines to maintain compatibility with clients
that expect SGML/tag-soup-like code - things like:

instead of:

Presumably some sort of XHTML “pretty-printer” could be written for this
sort of thing.
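
A rough sketch of what such a printer might look like on top of REXML. It follows two of the compatibility guidelines from XHTML 1.0 Appendix C: put a space before the closing "/>" of empty elements, and never use the minimized form for elements that may have content. The method and constant names are invented for this example, and comments and processing instructions are simply dropped.

```ruby
require 'rexml/document'

# Elements that HTML 4.01 declares as always empty.
VOID_ELEMENTS = %w[area base basefont br col frame hr img input
                   isindex link meta param].freeze

# Escape a raw attribute value for use inside double quotes.
def escape_attr(value)
  value.gsub('&', '&amp;').gsub('<', '&lt;').gsub('"', '&quot;')
end

# Hypothetical serializer: returns the node as HTML-compatible XHTML.
def to_compatible_xhtml(node)
  case node
  when REXML::Element
    parts = []
    node.attributes.each_attribute do |attr|
      parts << %( #{attr.expanded_name}="#{escape_attr(attr.value)}")
    end
    attrs = parts.join
    if VOID_ELEMENTS.include?(node.name)
      "<#{node.name}#{attrs} />"           # space before the slash
    else
      inner = node.children.map { |child| to_compatible_xhtml(child) }.join
      "<#{node.name}#{attrs}>#{inner}</#{node.name}>"  # never <p/>
    end
  when REXML::CData
    "<![CDATA[#{node.value}]]>"
  when REXML::Text
    node.to_s                              # already entity-escaped
  else
    ''                                     # dropped in this sketch
  end
end

doc = REXML::Document.new('<div><p/><br/><img src="a.png"/>text</div>')
html = to_compatible_xhtml(doc.root)
```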

···

Nat Pryce wrote:

The only difficulty I can see is getting the document to write
itself out as HTML rather than XML.

The current version of HTML is XHTML is XML.


Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

In less than a century, computers will be making substantial
progress on … the overriding problem of war and peace.
– James Slagle

Nat Pryce wrote:

There is still a need for older versions of HTML. For example, Pocket
Internet Explorer only supports HTML 3.

A “HTML compatible” [1] (MIME Type) [2] subset of XHTML works for most
of these cases.

Tobi

[1] XHTML 1.0: The Extensible HyperText Markup Language (Second Edition)
[2] XHTML Media Types - Second Edition

···


http://www.pinkjuice.com/

The current version of HTML is XHTML is XML.

There is still a need for older versions of HTML. For example, Pocket
Internet Explorer only supports HTML 3.

There are, I suspect, elements appearing in numerous web pages that are not
proper XHTML.
For example, the contents of script and style elements need to be wrapped in CDATA sections.

It might not be so simple to read in HTML and emit XHTML.
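
To illustrate the CDATA point, here is a small sketch using REXML's CData node. It assumes the elements in question are ones with raw text content, such as script; the helper name script_element and the sample JavaScript are made up for this example.

```ruby
require 'rexml/document'

# Hypothetical helper: build a <script> element whose body is held in a
# CDATA section, so characters like < and & need no entity escaping.
def script_element(source)
  script = REXML::Element.new('script')
  script.add_attribute('type', 'text/javascript')
  script.add(REXML::CData.new(source))
  script
end

js = 'if (a < b && b < c) { go(); }'
xml = script_element(js).to_s

# Parsing the result back yields the original, unescaped source text.
round_trip = REXML::Document.new(xml).root.text
```

A converter that reads arbitrary HTML would have to apply this kind of wrapping itself, since the source pages generally will not contain the CDATA markers.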

James

···

james@rubyxml.com wrote:

There are, I suspect, elements appearing in numerous web pages that are not
proper XHTML.

Yes; unfortunately, most web pages are neither well-formed XML nor valid
XHTML, and many aren’t even valid HTML.

Writing a thing which can handle any tagsoup doesn’t seem like much fun.

If you know that you’ll only have to deal with XHTML (templates in a
templating framework for example), then you can subclass REXML’s classes.

Tobi

···


http://www.pinkjuice.com/