Michael Neumann wrote:
James Britt wrote:
I had to hack Mechanize to have it grab 'p' elements, but it is dead easy to do.
What exactly did you have to hack? If it's worthwhile, I'll add it to the lib.
At first, I just used the built-in 'links' property to get the search result links. That sort of worked; I could get an array of URLs, but they had no descriptive context. Looking at the HTML coming back from Google I saw I really needed the 'p' elements that held the search result URL + the description.
As best I could tell, the Page object has only a few built-in arrays (links, forms, maybe another, I don't recall) that get populated when calling parse_html. Adding another array, and telling parse_html to populate this array, was super easy.
In retrospect, I think I could have run some sort of XPath query over the tree of nodes held by the Page object, but I just took what seemed to be the easiest route at the time. (Besides, XPath over the full node set is going to be slower than simply assembling a set of particular nodes on the first pass over the document done by parse_html.)
Where parse_html has:
  when 'a'
    @links << Link.new(node)
I added in
  when 'p'
    @paragraphs << Para.new(node)
The Para class is nothing more than a wrapper for a generic node.
I then ask for page.paragraphs and grab the ones I want.
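A minimal, self-contained sketch of that first approach follows. It is not the Mechanize source: the Node struct is a hypothetical stand-in for whatever node type the parser produces, and this Page class only imitates the shape of parse_html described above.

```ruby
# Hypothetical stand-in for a parsed HTML node (assumption, not Mechanize's node type).
Node = Struct.new(:name, :text)

# Thin wrappers around a generic node, one per element type of interest.
class Link
  attr_reader :node
  def initialize(node)
    @node = node
  end
end

class Para
  attr_reader :node
  def initialize(node)
    @node = node
  end
end

# Imitation of a page object that sorts nodes into arrays while parsing.
class Page
  attr_reader :links, :paragraphs

  def initialize(nodes)
    @links = []
    @paragraphs = []
    parse_html(nodes)
  end

  private

  # Single pass over the nodes, collecting the ones we care about.
  def parse_html(nodes)
    nodes.each do |node|
      case node.name
      when 'a'
        @links << Link.new(node)
      when 'p'
        @paragraphs << Para.new(node)
      end
    end
  end
end

page = Page.new([
  Node.new('a', 'Ruby home'),
  Node.new('p', 'http://www.ruby-lang.org/ - A dynamic language'),
  Node.new('div', 'ignored')
])

# Ask for the paragraphs and grab the ones that look like search results.
wanted = page.paragraphs.select { |para| para.node.text.include?('ruby-lang') }
```

The cost of this route is that every new element type means another instance variable and another branch in the case statement.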
BTW, while writing this post, I started thinking about my hackish implementation, and ended up replacing it with an arguably less hackish implementation, one that lets you do this:
agent = WWW::Mechanize.new {|a|
  a.log = Logger.new(STDERR)
}
agent.watch_for_set = { 'style' => Style, 'p' => Para }
page = agent.get( url )
page.body
paragraphs = page.elements[ 'p' ]
styles = page.elements[ 'style' ]
The calling code just has to define the classes passed in via the 'watch_for_set' hash. Each of these classes then has to implement this constructor:
def initialize( node ) ; end
It's up to each class then to extract what data it wants from the node.
So one could write a Style class that grabs the text value of the node and makes each CSS selector available for inspection.
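As a rough illustration of such a Style class, here is one possible sketch. The StyleNode struct and its text accessor are assumptions standing in for the real node, and the regex-based split is deliberately naive: it ignores CSS comments, @media blocks, and anything else with nested braces.

```ruby
# Hypothetical stand-in for the node handed to the constructor (assumption).
StyleNode = Struct.new(:name, :text)

class Style
  attr_reader :rules

  def initialize(node)
    @rules = {}
    # Naive "selector { declarations }" split; keys are the selector text as written.
    node.text.scan(/([^{}]+)\{([^{}]*)\}/) do |selector, body|
      @rules[selector.strip] = body.strip
    end
  end

  # The selectors found in the style block, in source order.
  def selectors
    @rules.keys
  end
end

css = "p { color: red; }\nh1, h2 { margin: 0 }"
style = Style.new(StyleNode.new('style', css))
style.selectors.each { |sel| puts "#{sel} => #{style.rules[sel]}" }
```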
(page.elements of course only has an array for each of those element names passed in.)
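For anyone curious how the dispatch might look, here is a standalone sketch of the idea. It is not the actual patch: Node, Para, and this Page class are hypothetical stand-ins, and the hash is passed to the constructor here rather than set on an agent.

```ruby
# Hypothetical stand-in for a parsed HTML node (assumption).
Node = Struct.new(:name, :text)

# Caller-defined wrapper; the only contract is initialize(node).
class Para
  attr_reader :text
  def initialize(node)
    @text = node.text
  end
end

class Page
  attr_reader :elements

  # watch_for_set maps element names to the classes that wrap them.
  def initialize(nodes, watch_for_set = {})
    @watch_for_set = watch_for_set
    @elements = {}
    # Only watched element names get an array.
    @watch_for_set.each_key { |name| @elements[name] = [] }
    parse_html(nodes)
  end

  private

  # One pass over the nodes; unwatched elements are skipped.
  def parse_html(nodes)
    nodes.each do |node|
      klass = @watch_for_set[node.name]
      @elements[node.name] << klass.new(node) if klass
    end
  end
end

nodes = [
  Node.new('p', 'first result'),
  Node.new('a', 'a link nobody asked for'),
  Node.new('p', 'second result')
]
page = Page.new(nodes, { 'p' => Para })
paragraphs = page.elements['p']
```

Because the lookup is driven by the hash, adding a new element type needs no change to the parsing code, only a new class and a new hash entry.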
James