HTML parsing by REXML

Paul_Argentoff · 1 April 2004 15:58

Hello world.

Sorry for returning to sorta well-discussed (but not in a sense I need) topic.
I can’t parse xml files by rexml since some tags in html are open (such as
, etc). Document.new errors with a message about such a tag.

Is there any workaround? I really like REXML, am accustomed to it, and don’t
want to use another lib.

–
Yours truly, WBR, Paul Argentoff.
Jabber: paul@jabber.rtelekom.ru
RIPE: PA1291-RIPE

Dario_Linsky1 · 1 April 2004 17:14

Hi,

On Fri, 02 Apr 2004 00:58:24 +0900, Paul Argentoff wrote:

Sorry for returning to sorta well-discussed (but not in a sense I need)
topic.
I can’t parse xml files by rexml since some tags in html are open (such as
, etc). Document.new errors with a message about such a tag.

Do I understand your problem right, that REXML gives you an Exception
because you did not close a tag? If so, a possible solution would be to
use XHTML instead of normal HTML.

alex_f · 1 April 2004 22:54

Hi

Paul Argentoff wrote:

Sorry for returning to sorta well-discussed (but not in a sense I need) topic.
I can’t parse xml files by rexml since some tags in html are open (such as
, etc). Document.new errors with a message about such a tag.
Is there any workaround? I really like and accustomed to REXML and don’t want
to use another lib.

You want Ned Konz’s HTML tools.

http://bike-nomad.com/ruby/index.html

These are based on html-parser (so it’ll work with any HTML, not just
XHTML), and allow you to treat the resultant parse tree as an REXML
document (so you can search it using XPath etc). I’ve used it in a few
apps and it works perfectly.

cheers
alex
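
For reference, this is roughly what searching a REXML document with XPath looks like. It is only a sketch: it does not use the HTML tools mentioned above, and the sample page and its URLs are invented, already well-formed markup standing in for a parse tree handed over by an HTML parser.

```ruby
require "rexml/document"

# Hypothetical, already well-formed page used as a stand-in for a parse tree
# that an HTML parser has exposed as a REXML document.
page = <<~XHTML
  <html>
    <body>
      <a href="http://example.org/one">One</a>
      <p>Some text <a href="http://example.org/two">Two</a></p>
    </body>
  </html>
XHTML

doc = REXML::Document.new(page)

# Collect the href attribute of every <a> element anywhere in the tree.
links = REXML::XPath.match(doc, "//a").map { |a| a.attributes["href"] }
puts links
```

The same `REXML::XPath.match` call should work on any REXML document, whichever parser produced it.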

Mark_Hubbart · 1 April 2004 18:15

to expand on this:
html is not xml. xhtml is xml.
in html, you can have tags that are never closed, but in xhtml every tag has to be closed or close itself. xhtml does this to be compatible with xml, which it is based on.
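
To make the difference concrete, here is a minimal sketch. It assumes `<br>` as the unclosed tag, since the thread does not say which tag was involved. The sample strings and the `try_parse` helper are made up for illustration; `REXML::Document.new` and `REXML::ParseException` are REXML's own.

```ruby
require "rexml/document"

# Hypothetical sample markup: the same fragment written as HTML and as XHTML.
html_snippet  = "<p>line one<br>line two</p>"
xhtml_snippet = "<p>line one<br/>line two</p>"

# Returns the parsed document, or nil if REXML rejects the markup.
def try_parse(markup)
  REXML::Document.new(markup)
rescue REXML::ParseException => e
  puts "REXML rejected the input: #{e.message.lines.first}"
  nil
end

html_doc  = try_parse(html_snippet)   # unclosed <br>: not well-formed XML
xhtml_doc = try_parse(xhtml_snippet)  # self-closed <br/>: well-formed XML

puts "HTML parsed?  #{!html_doc.nil?}"
puts "XHTML parsed? #{!xhtml_doc.nil?}"
```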

If you have a bunch of regular html files that you need to parse, I
would suggest running them through “HTML Tidy”, which can convert them
to well-formed xhtml for you. You should then be able to parse them
with REXML. See http://tidy.sourceforge.net/

–Mark
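
A rough sketch of that Tidy-then-REXML pipeline, assuming the `tidy` command-line tool is installed and on the PATH. The helper names `parse_html_via_tidy` and `tidy_available?` and the sample input are hypothetical; `-asxhtml`, `-numeric` and `-quiet` are Tidy command-line options.

```ruby
require "open3"
require "rexml/document"

# Pipes HTML through the `tidy` command-line tool (assumed to be installed)
# and parses the XHTML it prints with REXML.
def parse_html_via_tidy(html)
  # -asxhtml: emit well-formed XHTML; -numeric: write numeric instead of named
  # entities, which a plain XML parser would not know; -quiet: less chatter.
  xhtml, _diagnostics, status = Open3.capture3(
    "tidy", "-quiet", "-asxhtml", "-numeric", stdin_data: html
  )
  # Tidy exits with 0 on success, 1 on warnings and 2 on errors.
  code = status.exitstatus
  raise "tidy could not repair the input" if code.nil? || code > 1
  REXML::Document.new(xhtml)
end

# Checks whether a `tidy` executable can be found on the PATH.
def tidy_available?
  ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    File.executable?(File.join(dir, "tidy"))
  end
end

if tidy_available?
  doc = parse_html_via_tidy("<title>t</title><p>line one<br>line two")
  puts doc.root.name
else
  puts "tidy not found on PATH, skipping"
end
```

Shelling out keeps the sketch free of extra libraries; for many files it may be worth converting them once up front instead of on every parse.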

Yan-Fa_Li · 1 April 2004 19:18

Yeah, I had the same problem recently. I think that since html allows lax
closing of elements, rexml will just barf. In the end I used regular
expressions to catch the lines I was interested in and to capture the
fields I wanted. Works really well. There’s also an html parser class
based on the python one, but it was so badly documented and seemed so
poorly supported that I chose not to use it.
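
A minimal sketch of that regex approach, assuming a made-up table layout; the sample HTML and the `row_pattern` name are invented for illustration and are not taken from the post.

```ruby
# Hypothetical input: HTML with unclosed tags that an XML parser would reject.
html = <<~HTML
  <table>
  <tr><td>Alice<td>30
  <tr><td>Bob<td>25
  </table>
HTML

# Matches a table row and captures the text of its two cells.
row_pattern = /<tr><td>([^<]+)<td>([^<\n]+)/i

# Keep only the lines that look like rows, then pull out the captured fields.
rows = html.each_line.grep(row_pattern) do |line|
  line.match(row_pattern).captures.map(&:strip)
end

rows.each { |name, age| puts "#{name}: #{age}" }
```

This only holds up while the markup stays as regular as the pattern expects; a change in the page layout means a change in the regex.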

Ben_Giddings1 · 1 April 2004 20:13

Yan-Fa Li wrote:

Yeah I had the same problem recently. I think since html allows lax
closing of elements rexml will just barf. In the end I used regular
expressions to slurp catch the lines I was interested in and regex to
capture the fields I wanted. Works really well. There’s also a html
parser class based on the python one, but it was so badly documented and
it seems to be poorly supported that I chose not to use it.

You might have more luck with mine:

http://rubyforge.org/projects/htmltokenizer/

It is more forgiving, and pretty easy to use.

Ben
