Parsing HTML into a tree

Eli_Bendersky · 25 April 2007 18:20

Hello,

I have an HTML document which I need to parse into a tree. I was
looking for a suitable module to use, and found two:

1) Hpricot: while it looks like a nice and fast HTML parser, its
interface appears to be completely unsuitable for parsing an HTML
file into a tree. Is this possible, and am I missing something?

2) htree: looks closer to what I need, but its documentation is very
poor (almost nonexistent). When I 'pp' a parsed HTree document I see a
representation, but how can I actually traverse the tree? At the
moment I'm using reflection to "disassemble" the htree tree structure.
There must be a better way! Can someone please provide an example of
how to recursively print out the tree, telling for each node what kind
of node it is?

Are there other options?

Thanks in advance
Eli

Ganesh_Gunasegaran · 25 April 2007 21:32

Hi Eli,

I am not sure what you are trying to do.

> Can someone please provide an example of
> how to recursively print out the tree, telling for each node what kind
> of node it is?

gg.html (sample file)

···

~~~~~~~~~~~~~~~~
<html>
<head>
<title>Test Doc</title>
</head>
<body>
<div id="main">
Main div Content
<div id="sub">
Sub div content
<span> Test span content</span>
</div>
</body>
</html>

gg.rb
~~~~
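require "htree"

# Parse the HTML document read from standard input.
tree = HTree.parse(STDIN)

# Visit every element in document order, printing its name and text.
tree.traverse_all_element do |e|
  puts e.name
  puts e.extract_text
end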

Output
~~~~~~
imayam:~/work/temp/htree-0.6/test gg$ ruby gg.rb < gg.html
{XHTML namespace}html
           Test Doc
                         Main div Content
                                 Sub div content
                                  Test span content
{XHTML namespace}head
           Test Doc
{XHTML namespace}title
Test Doc
{XHTML namespace}body
                         Main div Content
                                 Sub div content
                                  Test span content
{XHTML namespace}div
                         Main div Content
                                 Sub div content
                                  Test span content
{XHTML namespace}div
                                 Sub div content
                                  Test span content
{XHTML namespace}span
Test span content

You can also traverse using 'traverse_some_element', 'each_child', 'traverse_text', etc. You can also convert the HTree document to REXML and traverse that. Some context on what you are actually trying to achieve would be helpful.
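
To make the REXML route concrete, here is a minimal sketch of a recursive walk that reports what kind of node each one is. It is an illustration under assumptions rather than something tested against htree: it parses a well-formed copy of the sample with REXML directly, and both the `describe` helper and the cleaned-up markup are made up for this example. In practice the document would come from converting the HTree result, which I believe is done with `to_rexml`.

```ruby
require "rexml/document"

# Well-formed stand-in for gg.html (REXML only accepts valid XML).
sample = <<~HTML
  <html>
    <head><title>Test Doc</title></head>
    <body>
      <div id="main">Main div content
        <div id="sub">Sub div content <span>Test span content</span></div>
      </div>
    </body>
  </html>
HTML

# Hypothetical helper: recursively collect one line per node,
# indented by depth and labelled with the kind of node.
def describe(node, depth = 0, lines = [])
  indent = "  " * depth
  case node
  when REXML::Element
    lines << "#{indent}element: #{node.name}"
    node.children.each { |child| describe(child, depth + 1, lines) }
  when REXML::Text
    text = node.value.strip
    # Skip whitespace-only text nodes to keep the listing short.
    lines << "#{indent}text: #{text}" unless text.empty?
  when REXML::Comment
    lines << "#{indent}comment: #{node.string.strip}"
  else
    # Anything else (processing instructions, etc.): show the class.
    lines << "#{indent}#{node.class}"
  end
  lines
end

puts describe(REXML::Document.new(sample).root)
```

Each element and each non-blank text node gets one line, indented by its depth in the tree, so the nesting of the two divs and the span is visible at a glance.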

Cheers,
Ganesh Gunasegaran

