HTML parsing by REXML

Hello world.

Sorry for returning to sorta well-discussed (but not in a sense I need) topic.
I can’t parse xml files by rexml since some tags in html are open (such as

, etc). Document.new errors with a message about such a tag.

Is there any workaround? I really like REXML, am accustomed to it, and don’t
want to use another lib.



Yours truly, WBR, Paul Argentoff.
Jabber: paul@jabber.rtelekom.ru
RIPE: PA1291-RIPE

Hi,


On Fri, 02 Apr 2004 00:58:24 +0900, Paul Argentoff wrote:

Sorry for returning to sorta well-discussed (but not in a sense I need)
topic.
I can’t parse xml files by rexml since some tags in html are open (such as

, etc). Document.new errors with a message about such a tag.

Do I understand your problem right, that REXML gives you an Exception
because you did not close a tag? If so, a possible solution would be to
use XHTML instead of normal HTML.

Hi

Paul Argentoff wrote:

Sorry for returning to sorta well-discussed (but not in a sense I need) topic.
I can’t parse xml files by rexml since some tags in html are open (such as

, etc). Document.new errors with a message about such a tag.

Is there any workaround? I really like REXML, am accustomed to it, and don’t
want to use another lib.

You want Ned Konz’s HTML tools.

http://bike-nomad.com/ruby/index.html

These are based on html-parser (so it’ll work with any HTML, not just
XHTML), and allow you to treat the resultant parse tree as an REXML
document (so you can search it using XPath etc). I’ve used it in a few
apps and it works perfectly.

cheers
alex

To expand on this: HTML is not XML, but XHTML is.
In HTML you can leave some tags unclosed, but in XHTML every tag has to
be closed, or has to close itself. XHTML does this to be compatible with
XML, which it is based on.
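A minimal sketch of the difference as REXML sees it. The `<br>` tag, the sample strings, and the helper name `well_formed?` are my own picks for illustration, not taken from the posts above:

```ruby
require "rexml/document"

# Hypothetical sample markup: the same paragraph, once with an
# unclosed <br> (plain HTML) and once with a self-closed <br /> (XHTML).
HTML_FRAGMENT  = "<p>first line<br>second line</p>"
XHTML_FRAGMENT = "<p>first line<br />second line</p>"

# Returns true if REXML can parse the markup, false if it raises
# a parse error (which is what happens for an unclosed tag).
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts "HTML fragment parses:  #{well_formed?(HTML_FRAGMENT)}"
puts "XHTML fragment parses: #{well_formed?(XHTML_FRAGMENT)}"
```

The first fragment fails because REXML reaches `</p>` while `<br>` is still open; the second parses because `<br />` closes itself.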

If you have a bunch of regular html files that you need to parse, I
would suggest running them through “HTML Tidy”, which can convert them
to well-formed xhtml for you. You should then be able to parse them
with REXML. See http://tidy.sourceforge.net/
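Roughly, the Tidy route could look like the sketch below. It assumes the `tidy` command-line program is installed and on your PATH; the method names `html_to_xhtml` and `parse_html_via_tidy` are made up for this example. Treat it as a starting point, not a tested recipe:

```ruby
require "open3"
require "rexml/document"

# Pipes HTML through the tidy executable and returns XHTML text.
# Assumption: a `tidy` binary is available (override with tidy_bin:).
#   -q        suppress non-essential output
#   -asxhtml  convert the input to well-formed XHTML
#   -numeric  write numeric entities instead of named ones
def html_to_xhtml(html, tidy_bin: "tidy")
  xhtml, _stderr, status =
    Open3.capture3(tidy_bin, "-q", "-asxhtml", "-numeric", stdin_data: html)
  # Tidy exits with 0 when clean, 1 on warnings, 2 on errors.
  code = status.exitstatus
  raise "tidy failed (exit status #{code.inspect})" unless code && code <= 1
  xhtml
end

# Cleans the HTML with tidy, then hands the result to REXML.
def parse_html_via_tidy(html, tidy_bin: "tidy")
  REXML::Document.new(html_to_xhtml(html, tidy_bin: tidy_bin))
end

# Example use (needs tidy installed):
#   doc = parse_html_via_tidy(File.read("page.html"))
#   puts doc.root.name
```

Note that Tidy puts the output elements into the XHTML namespace, so XPath queries against the result may need to take that namespace into account.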

–Mark


On Apr 1, 2004, at 9:14 AM, Dario Linsky wrote:

Hi,

On Fri, 02 Apr 2004 00:58:24 +0900, Paul Argentoff wrote:

Sorry for returning to sorta well-discussed (but not in a sense I
need)
topic.
I can’t parse xml files by rexml since some tags in html are open
(such as

, etc). Document.new errors with a message about such a tag.

Do I understand your problem right, that REXML gives you an Exception
because you did not close a tag? If so, a possible solution would be to
use XHTML instead of normal HTML.

Yeah, I had the same problem recently. I think that since HTML allows lax
closing of elements, REXML will just barf. In the end I used regular
expressions to catch the lines I was interested in and to capture the
fields I wanted. Works really well. There’s also an HTML parser class
based on the Python one, but it was so badly documented and seemed so
poorly supported that I chose not to use it.
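For what it’s worth, the regex approach can be sketched like this. The sample table, the patterns, and the method name `extract_records` are invented for the example; real pages will need their own patterns, and this only holds up while the markup stays regular:

```ruby
# Hypothetical input: sloppy HTML with unclosed <tr>/<td> tags,
# one record per line.
SAMPLE_HTML = <<~HTML
  <table>
  <tr><td>Alice<td>42
  <tr><td>Bob<td>17
  </table>
HTML

# Step 1 pattern: which lines are interesting at all.
ROW_LINE = /<tr>/i
# Step 2 pattern: capture a name cell followed by a numeric cell.
ROW_FIELDS = /<td>([^<]+)<td>(\d+)/i

# Returns an array of [name, number] pairs, skipping lines that
# look like rows but do not match the field pattern.
def extract_records(html)
  html.each_line.grep(ROW_LINE).map do |line|
    m = ROW_FIELDS.match(line)
    m && [m[1].strip, m[2].to_i]
  end.compact
end

extract_records(SAMPLE_HTML).each do |name, number|
  puts "#{name}: #{number}"
end
```

No parser is involved, so unclosed tags don’t matter, but any change in the page layout means revisiting the patterns.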


Yan-Fa Li wrote:

Yeah, I had the same problem recently. I think that since HTML allows lax
closing of elements, REXML will just barf. In the end I used regular
expressions to catch the lines I was interested in and to capture the
fields I wanted. Works really well. There’s also an HTML parser class
based on the Python one, but it was so badly documented and seemed so
poorly supported that I chose not to use it.

You might have more luck with mine:

http://rubyforge.org/projects/htmltokenizer/

It is more forgiving, and pretty easy to use.

Ben