HTML Parser suggestions wanted

Ned_Konz · 4 June 2002 15:30

I’ve written an HTML parser that builds trees from HTML source. After
I wrote it, I discovered REXML, which does the same thing for XML.

Then I made an add-on that uses REXML’s XPath support to do XPath
queries on the resultant HTML tree. In my test version, these queries
return REXML tree elements, rather than my HTML tree elements.
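
For readers who have not used REXML's XPath support, here is a minimal sketch of the kind of query being described. It assumes the markup is already well-formed XHTML so that REXML can parse it directly; the sample document and variable names are made up for illustration and are not taken from the parser under discussion.

```ruby
require 'rexml/document'

# Illustrative, well-formed XHTML fragment (an assumption; real-world
# HTML would first need a tolerant parser such as the one in this thread).
source = <<~XHTML
  <html>
    <body>
      <p>See <a href="http://bike-nomad.com/ruby/">the download page</a>.</p>
      <p>No link here.</p>
    </body>
  </html>
XHTML

doc = REXML::Document.new(source)

# XPath.match returns an array of matching nodes; here they are
# REXML::Element objects, not elements of any other tree type.
links = REXML::XPath.match(doc, '//a')
hrefs = links.map { |a| a.attributes['href'] }

puts links.first.class
puts hrefs
```

The results are `REXML::Element` instances, which is the API mismatch the post goes on to describe.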

Having two very similar tree structures (HTML and REXML) smells, to
me. The fact that they have somewhat different APIs confuses even me.

What I’m wondering (and would like your input on):

Should I just require REXML and not bother with my own tree
elements? I could, after all, just build a REXML document instead.
This has the disadvantage of requiring yet another package to be
installed, though.

If I don’t build an REXML tree, I could still return my own
elements from XPath queries. That is, I could use REXML transparently
and not expose the user to any of REXML’s elements. Would this be a
preferable way to provide XPath support?

Thanks,

–
Ned Konz
http://bike-nomad.com/ruby/
GPG key ID: BEEA7EFE

James8 · 4 June 2002 15:50

> I’ve written an HTML parser that builds trees from HTML source. After
> I wrote it, I discovered REXML, which does the same thing for XML.

I started looking at your parser (thanks!) and wanted to load the
resulting HTML into REXML.

> Then I made an add-on that uses REXML’s XPath support to do XPath
> queries on the resultant HTML tree. In my test version, these queries
> return REXML tree elements, rather than my HTML tree elements.

… and then I wanted to do XPath queries. So far, so good!

> Having two very similar tree structures (HTML and REXML) smells, to
> me. The fact that they have somewhat different APIs confuses even me.
>
> What I’m wondering (and would like your input on):
>
> Should I just require REXML and not bother with my own tree
> elements? I could, after all, just build a REXML document instead.
> This has the disadvantage of requiring yet another package to be
> installed, though.

True, but REXML is part of the Windows PragProg install, and it’s quick and
easy to install from source, anyway.

> If I don’t build an REXML tree, I could still return my own
> elements from XPath queries. That is, I could use REXML transparently
> and not expose the user to any of REXML’s elements. Would this be a
> preferable way to provide XPath support?

Wouldn’t REXML still need to be installed?

BTW, quick question: Is there any documentation (aside from the terse API
docs) about your parser? Is there a way to grab the parsed HTML and assign
it to a string? All I’ve seen is how to write to a file.

Thanks,

James

Nat_Pryce · 4 June 2002 16:07

I think implementing your HTML parser as an add-on to REXML is a great idea.
If your parser builds a REXML document, you will inherit any improvements to
REXML, such as faster XPath support, with no extra effort on your part. Also
it will be easier for people writing applications that process both XML and
HTML if they only need to learn one document API. The only difficulty I can
see is getting the document to write itself out as HTML rather than XML.

Cheers,
Nat.

Dr. Nathaniel Pryce
B13media Ltd.
Studio 3a, Aberdeen Business Centre, 22/24 Highbury Grove, London, N5 2EA
http://www.b13media.com

Ned_Konz · 4 June 2002 21:29

I decided to have it both ways: you can either make an “old-style”
HTMLTree::Element representation (without needing REXML), or you can
make a REXML::Document representation (which of course needs REXML).

If you have REXML you can convert the HTMLTree::Element style into the
REXML::Document style.

XPath queries against HTMLTree::Element trees still return REXML
elements; I’ll probably change this soon.
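
One way such a change could work is sketched below: mirror the tree into REXML, remember which original node each REXML element came from, and translate the XPath results back. This is only an illustration under stated assumptions. `HtmlNode`, `mirror`, and `xpath_own_nodes` are hypothetical names, and the real HTMLTree::Element API is not shown in this thread.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the parser's own element class.
HtmlNode = Struct.new(:tag, :attributes, :children)

# Copy an HtmlNode subtree under a REXML parent, remembering which
# HtmlNode each REXML element was built from.
def mirror(node, rexml_parent, origin)
  element = rexml_parent.add_element(node.tag, node.attributes)
  origin[element] = node
  node.children.each do |child|
    if child.is_a?(HtmlNode)
      mirror(child, element, origin)
    else
      element.add_text(child.to_s)
    end
  end
end

# Run an XPath query through REXML but hand back the caller's own nodes.
def xpath_own_nodes(root, query)
  origin = {}.compare_by_identity
  doc = REXML::Document.new
  mirror(root, doc, origin)
  # Non-element matches (text, attributes) have no HtmlNode and are dropped.
  REXML::XPath.match(doc, query).map { |match| origin[match] }.compact
end

link = HtmlNode.new('a', { 'href' => 'http://bike-nomad.com/ruby/' }, ['download'])
para = HtmlNode.new('p', {}, ['Get it: ', link])
tree = HtmlNode.new('html', {}, [HtmlNode.new('body', {}, [para])])

found = xpath_own_nodes(tree, '//a[@href]')
puts found.first.equal?(link)
```

The sketch rebuilds the mirror on every query, which keeps it short; a real implementation would probably cache the REXML document alongside the tree.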

Get it from http://bike-nomad.com/ruby/

I’d still like to hear your suggestions.

On Tuesday 04 June 2002 08:30 am, I wrote:

> I’ve written an HTML parser that builds trees from HTML source.
> After I wrote it, I discovered REXML, which does the same thing for
> XML.

–
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Ned_Konz · 4 June 2002 16:37

> True, but REXML is part of the Windows PragProg install, and it’s
> quick and easy to install from source, anyway.

Yes, it’s just another dependency.

>> If I don’t build an REXML tree, I could still return my own
>> elements from XPath queries. That is, I could use REXML
>> transparently and not expose the user to any of REXML’s elements.
>> Would this be a preferable way to provide XPath support?
>
> Wouldn’t REXML still need to be installed?

Only for XPath support (you can see my pre-1.03 at
http://bike-nomad.com/ruby/ruby-htmltools-1.03.tar.gz)

> BTW, quick question: Is there any documentation (aside from the
> terse API docs) about your parser?

No, but there should be some (there is the RDoc stuff). Care to write
it? I guess I was waiting for the API to stabilize. There will be an
article in the near future on the O’Reilly site, I believe.

> Is there a way to grab the
> parsed HTML and assign it to a string? All I’ve seen is how to
> write to a file.

No, but I’ll add it. Is there a string IO class in Ruby? If there
were, you could use print_on() (after I make it take a stream
argument).

I see stream.rb in the RAA, but don’t know whether it would work well
with Strings.
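
For reference, current Ruby ships a `StringIO` class in its standard library. The sketch below shows how a `print_on` method that accepts a stream argument could write into a string. The `Node` class is a hypothetical stand-in written for this example, not the parser's real element class, and the `print_on` signature is an assumption based on the description above.

```ruby
require 'stringio'

# Hypothetical tree node, for illustration only.
class Node
  def initialize(tag, children = [])
    @tag = tag
    @children = children
  end

  # Writes the node to any object that responds to <<
  # (an IO, a StringIO, or a plain String).
  def print_on(stream = $stdout)
    stream << "<#{@tag}>"
    @children.each do |child|
      if child.respond_to?(:print_on)
        child.print_on(stream)
      else
        stream << child.to_s
      end
    end
    stream << "</#{@tag}>"
  end
end

buffer = StringIO.new
Node.new('p', ['Hello, ', Node.new('b', ['world'])]).print_on(buffer)
html = buffer.string
puts html
```

Because the method only relies on `<<`, the same call also works with a `File` opened for writing, so the write-to-file behaviour does not need a separate code path.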

On Tuesday 04 June 2002 08:50 am, james@rubyxml.com wrote:

–
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Ned_Konz · 4 June 2002 16:39

Of course, I can extend the REXML Document/Element to do this. And the
XML output from REXML looks like it would be valid XHTML.

On Tuesday 04 June 2002 09:07 am, Nat Pryce wrote:

> The only difficulty I can see
> is getting the document to write itself out as HTML rather than
> XML.

–
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Tobi_Reif · 4 June 2002 19:09

Nat Pryce wrote:

> The only difficulty I can
> see is getting the document to write itself out as HTML rather than XML.

The current version of HTML is XHTML is XML.

Tobi

–
http://www.pinkjuice.com/

James8 · 4 June 2002 16:46

>> BTW, quick question: Is there any documentation (aside from the
>> terse API docs) about your parser?
>
> No, but there should be some (there is the RDoc stuff). Care to write
> it?

Oh if I only had time …

James

Nat_Pryce · 4 June 2002 20:19

> Nat Pryce wrote:
>
>> The only difficulty I can
>> see is getting the document to write itself out as HTML rather than XML.
>
> The current version of HTML is XHTML is XML.

There is still a need for older versions of HTML. For example, Pocket
Internet Explorer only supports HTML 3.

Cheers,
Nat.

From: “Tobias Reif” tobiasreif@pinkjuice.com

Dr. Nathaniel Pryce
B13media Ltd.
Studio 3a, Aberdeen Business Centre, 22/24 Highbury Grove, London, N5 2EA
http://www.b13media.com

Thomas_Hurst · 4 June 2002 22:45

XHTML has certain guidelines to maintain compatibility with clients
that expect SGML/tag-soup-like code - things like:

instead of:

Presumably some sort of XHTML “pretty-printer” could be written for this
sort of thing.
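
A rough sketch of such a printer follows. It applies two guidelines from XHTML 1.0 Appendix C: elements declared EMPTY are written with a space before the slash (`<br />`), and all other elements always get an explicit end tag, even when they have no content. The function and constant names are invented for this example; it walks a REXML tree and is not part of REXML itself.

```ruby
require 'rexml/document'

# Elements declared EMPTY in the XHTML 1.0 DTDs.
VOID_ELEMENTS = %w[area base basefont br col frame hr img input
                   isindex link meta param].freeze

# Minimal escaping for a double-quoted attribute value.
def escape_attr(value)
  value.gsub('&', '&amp;').gsub('<', '&lt;').gsub('"', '&quot;')
end

# Serialize a REXML node using the HTML-compatible conventions.
def to_compatible_html(node)
  return node.to_s unless node.is_a?(REXML::Element)

  attrs = +''
  node.attributes.each do |name, value|
    attrs << %( #{name}="#{escape_attr(value)}")
  end

  if VOID_ELEMENTS.include?(node.name)
    # Space before the slash keeps older HTML user agents happy.
    "<#{node.name}#{attrs} />"
  else
    # Never minimize an element that is allowed to have content.
    inner = node.children.map { |child| to_compatible_html(child) }.join
    "<#{node.name}#{attrs}>#{inner}</#{node.name}>"
  end
end

doc = REXML::Document.new('<p>one<br/>two<img src="a.png"/><span/></p>')
output = to_compatible_html(doc.root)
puts output
```

Only element and text handling is shown; doctype declarations, comments, and namespaces would need more care in a real tool.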

Tobias Reif (tobiasreif@pinkjuice.com) wrote:

> Nat Pryce wrote:
>
>> The only difficulty I can see is getting the document to write
>> itself out as HTML rather than XML.
>
> The current version of HTML is XHTML is XML.

–
Thomas ‘Freaky’ Hurst - freaky@aagh.net - http://www.aagh.net/

In less than a century, computers will be making substantial
progress on … the overriding problem of war and peace.
– James Slagle

Tobi_Reif · 4 June 2002 20:24

Nat Pryce wrote:

> There is still a need for older versions of HTML. For example, Pocket
> Internet Explorer only supports HTML 3.

An “HTML compatible” [1] (MIME Type) [2] subset of XHTML works for most
of these cases.

Tobi

[1] XHTML 1.0: The Extensible HyperText Markup Language (Second Edition)
[2] XHTML Media Types - Second Edition

–
http://www.pinkjuice.com/

James8 · 4 June 2002 21:01

>> The current version of HTML is XHTML is XML.
>
> There is still a need for older versions of HTML. For example, Pocket
> Internet Explorer only supports HTML 3.

There are, I suspect, elements appearing in numerous web pages that are not
proper XHTML.
For example, <script> and <style> need to be wrapped in CDATA sections.

It might not be so simple to read in HTML and emit XHTML.

James

Tobi_Reif · 4 June 2002 21:33

james@rubyxml.com wrote:

> There are, I suspect, elements appearing in numerous web pages that are not
> proper XHTML.

Yes; unfortunately, most web pages are neither well-formed XML nor valid
XHTML, and many aren’t even valid HTML.

Writing a thing which can handle any tagsoup doesn’t seem like much fun.

If you know that you’ll only have to deal with XHTML (templates in a
templating framework for example), then you can subclass REXML’s classes.

Tobi

–
http://www.pinkjuice.com/
