HTML parsing

Gavin_Sinclair · 2 February 2004 12:42

Hi folks,

I need to parse some HTML. I’ve dug around the archives and so on and
found the best solution to be Ned Konz’s ‘ruby-htmltools’, which
relies on ‘html-parser’. Both of these projects are not really
maintained, so I’m wondering what other people currently use.

Cheers,
Gavin

Emmanuel_Touzery · 2 February 2004 12:48

Gavin Sinclair wrote:

Hi folks,

I need to parse some HTML. I’ve dug around the archives and so on and
found the best solution to be Ned Konz’s ‘ruby-htmltools’, which
relies on ‘html-parser’. Both of these projects are not really
maintained, so I’m wondering what other people currently use.

I was using a home-made solution, but I just decided (this weekend) to
convert it to REXML: I would run HTML Tidy (which is already needed for
~60% of the pages I’m parsing now) and ask it to emit XHTML. With my
home-made solution, besides duplicating the work of parsing HTML, I
needed a list of tags that don’t have to be closed, and so on. In XHTML
all of that is done for me, and then I get the familiar REXML API
[even though I have never used REXML yet :O) ].

I think that’s the best approach.
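
For anyone reading along, here is a rough sketch of the tidy-then-REXML idea described above. It assumes the `tidy` command-line tool is installed and on the PATH; the helper names `tidy_to_xhtml` and `extract_links` are made up for illustration and are not part of any library.

```ruby
require "open3"
require "rexml/document"

# Assumption: the `tidy` executable is available on PATH.
# -asxhtml asks for XHTML output, -numeric emits numeric entities
# (so the XML parser does not trip over names like &nbsp;),
# -quiet suppresses the informational banner.
def tidy_to_xhtml(html)
  out, _err, _status = Open3.capture3(
    "tidy", "-quiet", "-asxhtml", "-numeric", stdin_data: html
  )
  out
end

# Hypothetical helper: collect [href, text] pairs for every <a> element
# in a well-formed XHTML string.
def extract_links(xhtml)
  doc = REXML::Document.new(xhtml)
  links = []
  # each_recursive visits every descendant element in document order.
  doc.each_recursive do |el|
    next unless el.name == "a"
    links << [el.attributes["href"], el.text]
  end
  links
end
```

Usage would be something like `extract_links(tidy_to_xhtml(File.read("page.html")))`, where `page.html` is a placeholder file name. Matching on `el.name` rather than an XPath expression sidesteps any question about the default XHTML namespace that tidy adds to the root element.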

emmanuel

Robert · 2 February 2004 13:59

“Gavin Sinclair” gsinclair@soyabean.com.au wrote in message
news:541813804653.20040202234122@soyabean.com.au…

Hi folks,

I need to parse some HTML. I’ve dug around the archives and so on and
found the best solution to be Ned Konz’s ‘ruby-htmltools’, which
relies on ‘html-parser’. Both of these projects are not really
maintained, so I’m wondering what other people currently use.

Last time I needed that, I used some home-cooked regexp scanning.
But I didn’t need a real parser; I just wanted to extract a portion of
the HTML file.
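
A minimal sketch of what that kind of regexp scanning might look like, for the case where only a fragment is wanted. The helper names are hypothetical, and this is not a real parser: it will miss unquoted attributes and can be fooled by markup inside comments or scripts.

```ruby
# Hypothetical helper: return the text of the first <title> element, or nil.
# The /m flag lets "." match newlines; /i ignores tag case.
def extract_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}im)
  match && match[1].strip
end

# Hypothetical helper: return every quoted href value found in <a ...> tags.
def extract_hrefs(html)
  html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']*)["']/i).flatten
end
```

This is fine for pulling one known piece out of pages whose layout you control or have inspected; for anything structural, a tidy/REXML style approach is likely to be more robust.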

robert

Emmanuel_Touzery · 2 February 2004 12:48

Emmanuel Touzery wrote:

i just decided (this WE) to convert it to REXML: I would use HTML tidy
(which is already needed for ~60% of the pages i’m parsing now)

(It was needed for many pages because of sloppy/invalid HTML, which tidy
corrects.)

emmanuel

Gavin_Sinclair · 2 February 2004 14:03

The library I mentioned gives you a REXML::Document as well, so I’m
using REXML for the first time. It’s very good, but I’m struggling to
really get a grip.

The single most useful improvement to REXML for a beginner, IMO, is
this: (more) reasonable implementations of #to_s and/or #inspect on
Element and Attribute objects.

As it is, I believe every element contains a link to its document,
which in my case is large, and #inspect spits out thousands of lines
of rubbish when all I want to see is the element I’m looking at. This
makes it hard to use in ‘irb’.

(I know, I should start with a small document, but I’m trying to get
my task done.)
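
As a stopgap for poking around in irb, one could print a short summary of an element instead of calling #inspect. The sketch below is only an illustration: `brief` is a made-up helper, not something REXML provides.

```ruby
require "rexml/document"

# Hypothetical helper: a one-line summary of an element, showing its name,
# its attributes, and how many child elements it has, without walking
# back up to the owning document.
def brief(el)
  attrs = []
  el.attributes.each { |name, value| attrs << "#{name}='#{value}'" }
  head = ([el.expanded_name] + attrs).join(" ")
  "<#{head}> (#{el.elements.size} child elements)"
end
```

In irb, `puts brief(doc.root)` then prints a single line regardless of how large the document is.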

Cheers,
Gavin

