Hpricot question

I'm trying to use Hpricot to clean up the text in a big site full of old-style HTML. I'm just trying to do things like replacing literal quote characters with <q> and </q>. I'm hampered by the fact that my understanding of the HTML DOM comes from reading one web site yesterday and I don't know any JavaScript. Nonetheless, it seems that Hpricot should be able to easily give me all the text in the <body> element of each page because it has a traverse_text() method. The problem seems to be that if I apply it to a whole page, I also get the text in the <head> element, and all the methods for selecting seem to return an element, not a tree.

There is a get_subnode method but it doesn't seem to work as expected.

Thanks in advance for any help


The folly of mistaking a paradox for a discovery, a metaphor for a proof, a torrent of verbiage for a spring of capital truths, and oneself for an oracle, is inborn in us.
-Paul Valery, poet and philosopher (1871-1945)


get_subnode fails with:
...hpricot/traverse.rb:23:in `get_subnode': undefined method `get_subnode_internal' for #<Hpricot::Doc:0x5c182c>

because Why literally hasn't written get_subnode_internal yet. Maybe I'll try to write it when/if I get some time.
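
In the meantime, the quote replacement itself doesn't need get_subnode. Here is a minimal sketch of that step in plain Ruby, with no Hpricot calls in it. The name quotes_to_q is made up for illustration, and it assumes the quotes are straight double quotes that come in pairs within a single piece of text.

```ruby
# Hypothetical helper (not part of Hpricot): wraps each pair of straight
# double quotes in <q>...</q>. A lone, unpaired quote is left untouched.
def quotes_to_q(text)
  text.gsub(/"([^"]*)"/) { "<q>#{$1}</q>" }
end

puts quotes_to_q('He said "hello" and left.')
```

This should only be run on the strings that traverse_text hands you, never on the raw HTML, since it would also rewrite the quotes around attribute values. Whether traverse_text can be started from the <body> element instead of the whole document is something I'm assuming rather than something I've checked.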


On Jul 31, 2006, at 6:17 AM, Chris Gehlker wrote:

I'm trying to use Hpricot to clean up the text in a big site full of old-style HTML. I'm just trying to do things like replacing literal quote characters with <q> and </q>. I'm hampered by the fact that my understanding of the HTML DOM comes from reading one web site yesterday and I don't know any JavaScript. Nonetheless, it seems that Hpricot should be able to easily give me all the text in the <body> element of each page because it has a traverse_text() method. The problem seems to be that if I apply it to a whole page, I also get the text in the <head> element, and all the methods for selecting seem to return an element, not a tree.

There is a get_subnode method but it doesn't seem to work as expected.

For blocks are better cleft with wedges,
Than tools of sharp or subtle edges,
And dullest nonsense has been found
By some to be the most profound.
-Samuel Butler,