Parsing HTML into a tree

Hello,

I have an HTML document which I need to parse into a tree. I was
looking for a suitable module to use, and came across two:

1) Hpricot: while it looks like a nice, fast HTML parser, its
interface appears completely unsuitable for parsing an HTML
file into a tree. Is this possible, and am I missing something?

2) htree: looks closer to what I need, but its documentation is very
poor (almost nonexistent). When I 'pp' a parsed HTree document I see a
representation, but how can I actually traverse the tree? At the
moment I'm using reflection to "disassemble" the htree tree structure.
There must be a better way! Can someone please provide an example of
how to recursively print out the tree, telling for each node what kind
of node it is?

Are there other options?

Thanks in advance
Eli

Hi Eli,

I am not sure what you are trying to do.

> Can someone please provide an example of
> how to recursively print out the tree, telling for each node what kind
> of node it is?

gg.html (sample file)
~~~~~~~~~~~~~~~~
<html>
<head>
   <title>Test Doc</title>
</head>
<body>
<div id="main">
Main div Content
<div id="sub">
Sub div content
<span> Test span content</span>
</div>
</body>
</html>

gg.rb
~~~~
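require "htree"

# Parse the HTML arriving on standard input into an HTree document.
tree = HTree.parse(STDIN)

# Visit every element in document order, printing its name and text content.
tree.traverse_all_element do |e|
   puts e.name
   puts e.extract_text
end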

Output
~~~~~~
imayam:~/work/temp/htree-0.6/test gg$ ruby gg.rb < gg.html
{XHTML namespace}html
           Test Doc
                         Main div Content
                                 Sub div content
                                  Test span content
{XHTML namespace}head
           Test Doc
{XHTML namespace}title
Test Doc
{XHTML namespace}body
                         Main div Content
                                 Sub div content
                                  Test span content
{XHTML namespace}div
                         Main div Content
                                 Sub div content
                                  Test span content
{XHTML namespace}div
                                 Sub div content
                                  Test span content
{XHTML namespace}span
Test span content

You can also traverse with 'traverse_some_element', 'each_child', 'traverse_text', and so on, or convert the HTree document to REXML and traverse that. Some context on what you are actually trying to achieve would probably be helpful.
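
For the recursive, node-by-node printout asked about, here is a minimal sketch using REXML, the library mentioned above as a conversion target. It is an illustration under assumptions, not HTree's own API: it feeds REXML a well-formed variant of the sample directly (REXML requires balanced tags, so the closing tag for the "main" div is added), and the helper name `describe` and the constants `SAMPLE` and `TREE_LINES` are made up for this example.

```ruby
require "rexml/document"

# Well-formed variant of the sample document (an assumption for this sketch:
# REXML is an XML parser and will not repair unbalanced HTML).
SAMPLE = <<~HTML
  <html>
    <head><title>Test Doc</title></head>
    <body>
      <div id="main">Main div Content
        <div id="sub">Sub div content
          <span> Test span content</span>
        </div>
      </div>
    </body>
  </html>
HTML

# Hypothetical helper: walk the tree recursively and collect one line per
# node, indented by depth and labelled with the kind of node it is.
def describe(node, depth = 0, lines = [])
  indent = "  " * depth
  case node
  when REXML::Element
    lines << "#{indent}element: #{node.name}"
    node.each_child { |child| describe(child, depth + 1, lines) }
  when REXML::Text
    text = node.value.strip
    # Skip text nodes that contain only whitespace between tags.
    lines << "#{indent}text: #{text}" unless text.empty?
  when REXML::Comment
    lines << "#{indent}comment"
  end
  lines
end

TREE_LINES = describe(REXML::Document.new(SAMPLE).root)
puts TREE_LINES
```

The same shape should carry over to a tree obtained by converting an HTree document to REXML, since the walk only depends on the node classes.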

Cheers,
Ganesh Gunasegaran

On 25-Apr-07, at 11:50 PM, Eli Bendersky wrote the original message at the top of this thread.