Hello,
I have an HTML document which I need to parse into a tree. I was
looking for a suitable module to use, and found two:
1) Hpricot: while it looks like a nice and fast HTML parser, its
interface appears to be completely unsuitable for parsing an HTML
file into a tree. Is this possible, and am I missing something?
2) htree: it looks closer to what I need, but its documentation is very
poor (almost nonexistent). When I 'pp' a parsed HTree document I see a
representation, but how can I actually traverse the tree? At the
moment I'm using reflection to "disassemble" the htree tree structure.
There must be a better way! Can someone please provide an example of
how to recursively print out the tree, telling for each node what kind
of node it is?
Are there other options?
Thanks in advance
Eli
Hi Eli,
I am not sure what you are trying to do.
> Can someone please provide an example of
> how to recursively print out the tree, telling for each node what kind
> of node it is?
gg.html (sample file)
~~~~~~~~~~~~~~~~
<html>
<head>
<title>Test Doc</title>
</head>
<body>
<div id="main">
Main div Content
<div id="sub">
Sub div content
<span> Test span content</span>
</div>
</div>
</body>
</html>
gg.rb
~~~~
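require "htree"
# Parse the HTML document read from standard input.
tree = HTree.parse(STDIN)
# Visit every element in the tree, in document order.
tree.traverse_all_element do |e|
  puts e.name          # element name, qualified with its namespace
  puts e.extract_text  # text of the element and all its descendants
end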
Output
~~~~~~
imayam:~/work/temp/htree-0.6/test gg$ ruby gg.rb < gg.html
{XHTML namespace}html
Test Doc
Main div Content
Sub div content
Test span content
{XHTML namespace}head
Test Doc
{XHTML namespace}title
Test Doc
{XHTML namespace}body
Main div Content
Sub div content
Test span content
{XHTML namespace}div
Main div Content
Sub div content
Test span content
{XHTML namespace}div
Sub div content
Test span content
{XHTML namespace}span
Test span content
You can also traverse using 'traverse_some_element', 'each_child', 'traverse_text', etc. You can also convert the HTree tree to REXML and traverse that. Some context on what you are actually trying to achieve would be helpful.
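For the REXML route, here is a minimal sketch of a recursive walk that prints what kind of node each one is. It parses a small well-formed snippet with REXML directly, as a stand-in for a tree converted from HTree (the conversion step is not shown). The helper name describe_tree and the sample markup are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical helper (not part of htree or REXML): returns one line per
# node, indented by depth, saying what kind of node it is.
def describe_tree(node, depth = 0, lines = [])
  indent = "  " * depth
  case node
  when REXML::Document            # must come first: Document is an Element subclass
    lines << "#{indent}Document"
  when REXML::Element
    lines << "#{indent}Element <#{node.name}>"
  when REXML::Text
    text = node.value.strip
    # Skip whitespace-only text nodes that come from source indentation.
    lines << "#{indent}Text #{text.inspect}" unless text.empty?
  else
    lines << "#{indent}#{node.class}"   # comments, declarations, etc.
  end
  # Only container nodes (documents, elements) have children to recurse into.
  if node.is_a?(REXML::Parent)
    node.children.each { |child| describe_tree(child, depth + 1, lines) }
  end
  lines
end

# Sample input, assumed well-formed because REXML is an XML parser.
html = <<~HTML
  <html>
    <body>
      <div id="main">Main div content
        <span>Test span content</span>
      </div>
    </body>
  </html>
HTML

puts describe_tree(REXML::Document.new(html))
```

REXML only accepts well-formed markup, so for real-world HTML you would still parse with HTree first and walk the converted tree.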
Cheers,
Ganesh Gunasegaran
On 25-Apr-07, at 11:50 PM, Eli Bendersky wrote: