Fun with WWW::Mechanize

I thought I would see about adding some search function to ruby-doc, and ended up taking Mike Neumann's WWW::Mechanize[0] for a test drive.

Sweet it be, as it took almost no time to get running code that takes search words, queries Google, parses the results, and creates a new page.

Try it here:

http://www.ruby-doc.org/gs.rb/REXML%20pullparser
http://www.ruby-doc.org/gs.rb/Timeout
http://www.ruby-doc.org/gs.rb/testunit+assert

I had to hack Mechanize to have it grab 'p' elements, but it is dead easy to do.
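
For the curious, the skeleton of the thing is tiny. Roughly this (the require path and query URL here are guesses rather than the real gs.rb; the agent calls are the ones that show up later in this thread):

   require 'logger'
   require 'www/mechanize'  # require path for Mike Neumann's lib; may differ

   # Build an agent that logs to STDERR.
   agent = WWW::Mechanize.new {|a|
      a.log = Logger.new(STDERR)
   }

   # Hypothetical query URL; the real page also restricts the search
   # to ruby-doc.org and rebuilds the page around the results.
   words = ARGV.join('+')
   page  = agent.get("http://www.google.com/search?q=#{words}")

   puts page.body  # raw HTML; page.links holds the parsed Link objects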

Nice work, Herr Neumann. And a tip of the hat to the folks behind Narf, whose htmltools are needed for WWW::Mechanize.

James

[0] http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc

James Britt wrote:

> I thought I would see about adding some search function to ruby-doc, and ended up taking Mike Neumann's WWW::Mechanize[0] for a test drive.

Nice to hear ;-)

> Sweet it be, as it took almost no time to get running code that takes search words, queries Google, parses the results, and creates a new page.
>
> Try it here:
>
> http://www.ruby-doc.org/gs.rb/REXML%20pullparser
> http://www.ruby-doc.org/gs.rb/Timeout
> http://www.ruby-doc.org/gs.rb/testunit+assert
>
> I had to hack Mechanize to have it grab 'p' elements, but it is dead easy to do.

What exactly did you have to hack? If it's worth it, I'll add it to the lib.

Regards,

   Michael

> I thought I would see about adding some search function to
> ruby-doc, and ended up taking Mike Neumann's WWW::Mechanize[0] for
> a test drive.
>
> Sweet it be, as it took almost no time to get running code that
> takes search words, queries Google, parses the results, and
> creates a new page.
>
> Try it here:
>
> http://www.ruby-doc.org/gs.rb/REXML%20pullparser
> http://www.ruby-doc.org/gs.rb/Timeout
> http://www.ruby-doc.org/gs.rb/testunit+assert
>
> I had to hack Mechanize to have it grab 'p' elements, but it is
> dead easy to do.
>
> Nice work, Herr Neumann. And a tip of the hat to the folks behind
> Narf, whose htmltools are needed for WWW::Mechanize.

Very cool, James. Be warned, though, that Google frowns on "screen
scraping", preferring people to use the SOAP API.

> [0] http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc

-austin

···

On Sun, 23 Jan 2005 13:45:02 +0900, James Britt <jamesUNDERBARb@neurogami.com> wrote:
--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca

Michael Neumann wrote:

> James Britt wrote:

>> I had to hack Mechanize to have it grab 'p' elements, but it is dead easy to do.

> What exactly did you have to hack? If it's worth it, I'll add it to the lib.

At first, I just used the built-in 'links' property to get the search result links. That sort of worked; I could get an array of URLs, but they had no descriptive context. Looking at the HTML coming back from Google I saw I really needed the 'p' elements that held the search result URL + the description.

As best I could tell, the Page object has only a few built-in arrays (links, forms, maybe another, I don't recall) that get populated when calling parse_html. Adding another array, and telling parse_html to populate this array, was super easy.

In retrospect I think I could have done some sort of XPath thing over the tree of nodes held by the Page object, but I just took what seemed to be the easiest route at the time. (Besides, XPath over the full node set is going to be slower than simply assembling a set of particular nodes on the first pass over the document done by parse_html.)
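
The XPath route would have been close to a one-liner, something like this (assuming the Page's node tree is REXML-compatible, with root as its root node):

   require 'rexml/xpath'

   # Re-walks the whole tree on every query, unlike collecting during the parse.
   paragraphs = REXML::XPath.match( root, '//p' )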

Where parse_html has:

       when 'a'
         @links << Link.new(node)

I added in

       when 'p'
         @paragraphs << Para.new(node)

The Para class is nothing more than a wrapper for a generic node.

I then ask for page.paragraphs and grab the ones I want.
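
Pieced together, the whole change amounts to something like this (a rough reconstruction; the each_node helper just stands in for however parse_html actually walks the parsed tree, and Link is the class already in the lib):

   # Trivial wrapper for a generic node.
   class Para
     def initialize( node )
       @node = node
     end

     def to_s
       @node.to_s
     end
   end

   class Page
     attr_reader :links, :paragraphs  # :paragraphs added alongside :links

     def parse_html
       @links      = []
       @paragraphs = []
       each_node do |node|  # hypothetical stand-in for the real traversal
         case node.name
         when 'a'
           @links << Link.new(node)
         when 'p'
           @paragraphs << Para.new(node)
         end
       end
     end
   end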

BTW, while writing this post, I started thinking about my hackish implementation, and ended up replacing it with an arguably less hackish implementation, one that lets you do this:

   agent = WWW::Mechanize.new {|a|
      a.log = Logger.new(STDERR)
   }
   agent.watch_for_set = { 'style' => Style, 'p' => Para }
   page = agent.get( url )
   page.body
   paragraphs = page.elements[ 'p' ]
   styles = page.elements[ 'style' ]

You just have to have the calling code define the classes passed in as part of the 'watch_for_set' hash. Each of these classes then has to implement this constructor:

   def initialize( node ) ; end

It's up to each class then to extract what data it wants from the node.

So one could write a Style class that grabs the text value of the node and makes each CSS selector available for inspection.
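
Say, something like this (a sketch; the selector scraping is deliberately naive):

   class Style
     attr_reader :css_text

     # The one-argument constructor that watch_for_set classes must implement.
     def initialize( node )
       @css_text = node.text.to_s
     end

     # Pull the selector part out of each "selector { ... }" rule.
     def selectors
       @css_text.scan(/([^{}]+)\{[^}]*\}/).map {|m| m[0].strip }
     end
   end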

(page.elements of course only has an array for each of those element names passed in.)

James

Austin Ziegler wrote:

> Very cool, James. Be warned, though, that Google frowns on "screen
> scraping", preferring people to use the SOAP API.

Yes, well that's one reason it is not linked from the main ruby-doc page.

I started off thinking I would just add a way to do a straight-up Google search, limited to the ruby-doc.org site, but was unhappy with the resulting page being, well, another site.

Framing the search results page seemed undesirable, so I thought about scraping the results. Curiously enough, while putting this page up on ruby-doc.org, I came across some old code that did actually use the Google API, but apparently some required files were lost in assorted site moves and upgrades.

Anyway, I thought this was a neat enough demo of how easy it is to use Mechanize that I should share it. How the actual search page ends up is another matter. Time to go find my Google API key, perhaps.

(Regarding Google frowning on scraping, I wondered if this was because of volume, which I expect would be low, or that it typically means omitting the sponsored ads. I figured I would add code to *keep* the sponsored ads in the resulting page, figuring that's part of Google's revenue model. Maybe I can just keep the entire Google search results page, and simply insert a set of links to get back to ruby-doc.org. So many choices.)

James

James Britt wrote:

> Michael Neumann wrote:

>> James Britt wrote:

>>> I had to hack Mechanize to have it grab 'p' elements, but it is dead easy to do.

>> What exactly did you have to hack? If it's worth it, I'll add it to the lib.

> At first, I just used the built-in 'links' property to get the search result links. That sort of worked; I could get an array of URLs, but they had no descriptive context. Looking at the HTML coming back from Google I saw I really needed the 'p' elements that held the search result URL + the description.

I've added a find_all_recursive method to REXML::Node.

This should just return all paragraph nodes:

   root.find_all_recursive {|n| n.name == 'p'}
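
A minimal version of it is just a depth-first walk, along these lines (a sketch, not the actual patch):

   module REXML
     module Node
       # Collect this node and every descendant for which the block is true.
       # (The block must cope with every node type it gets handed.)
       def find_all_recursive( &block )
         found = []
         found << self if block.call(self)
         if respond_to?(:children)
           children.each {|c| found.concat(c.find_all_recursive(&block)) }
         end
         found
       end
     end
   end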

> As best I could tell, the Page object has only a few built-in arrays (links, forms, maybe another, I don't recall) that get populated when calling parse_html. Adding another array, and telling parse_html to populate this array, was super easy.

I'll add more if there's a need for it.

> In retrospect I think I could have done some sort of XPath thing over the tree of nodes held by the Page object, but I just took what seemed to be the easiest route at the time. (Besides, XPath over the full node set is going to be slower than simply assembling a set of particular nodes on the first pass over the document done by parse_html.)

> Where parse_html has:
>
>       when 'a'
>         @links << Link.new(node)
>
> I added in
>
>       when 'p'
>         @paragraphs << Para.new(node)
>
> The Para class is nothing more than a wrapper for a generic node.

You could just collect the nodes themselves. I see no need for a special Para class...

> I then ask for page.paragraphs and grab the ones I want.

> BTW, while writing this post, I started thinking about my hackish implementation, and ended up replacing it with an arguably less hackish implementation, one that lets you do this:
>
>   agent = WWW::Mechanize.new {|a|
>      a.log = Logger.new(STDERR)
>   }
>   agent.watch_for_set = { 'style' => Style, 'p' => Para }
>   page = agent.get( url )
>   page.body
>   paragraphs = page.elements[ 'p' ]
>   styles = page.elements[ 'style' ]

Ah, that looks nice.

Here's my idea:

   # nil === just return the node
   agent.watch_for_set = { 'style' => nil, 'p' => Para }
   agent.watches['p']

I'd expect #elements to behave like the #elements method of a REXML::Node, so it's better to use #watches.
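
The matching dispatch inside parse_html would then be roughly this (a sketch; only watch_for_set and watches are the proposed names):

   # For each parsed node: if its name is watched, either wrap it in the
   # given class or, for nil, store the raw node.
   if @watch_for_set && @watch_for_set.has_key?(node.name)
     klass = @watch_for_set[node.name]
     (@watches[node.name] ||= []) << (klass ? klass.new(node) : node)
   end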

Regards,

   Michael

James Britt wrote:

> Austin Ziegler wrote:

>> Very cool, James. Be warned, though, that Google frowns on "screen
>> scraping", preferring people to use the SOAP API.

> Yes, well that's one reason it is not linked from the main ruby-doc page.

> I started off thinking I would just add a way to do a straight-up Google search, limited to the ruby-doc.org site, but was unhappy with the resulting page being, well, another site.

> Framing the search results page seemed undesirable, so I thought about scraping the results. Curiously enough, while putting this page up on ruby-doc.org, I came across some old code that did actually use the Google API, but apparently some required files were lost in assorted site moves and upgrades.

> Anyway, I thought this was a neat enough demo of how easy it is to use Mechanize that I should share it. How the actual search page ends up is another matter. Time to go find my Google API key, perhaps.

Would you like to share the code with us? Should I include it as an example in WWW::Mechanize?

Regards,

   Michael

Michael Neumann wrote:

> James Britt wrote:

>> ..

>> Anyway, I thought this was a neat enough demo of how easy it is to use Mechanize that I should share it. How the actual search page ends up is another matter. Time to go find my Google API key, perhaps.

> Would you like to share the code with us? Should I include it as an example in WWW::Mechanize?

Sure. The live version uses the first pass at the Mechanize hack; the runs-at-home version uses the more flexible version I wrote while replying to your earlier post. ("Ruby: Ain't it cool?")

But that code is different from your suggestion (and, I gather, implementation) on how else to do this (though in practice it is quite similar).

So, yes, if Mechanize adopts a way to pass in a 'watch_for' set, and then makes them available via 'watches', then the Google scrape code might make a good example, even if it never goes 'live' on ruby-doc.org.

I'd just need to clean it up to use the most current API.

Note that root.find_all_recursive {|n| n.name == 'p'} would work as well as what I do now; my Para class does nothing more than call node.to_s. The advantage, though, of having parse_html collect nodes on the HTML stream parse is that it is faster than re-iterating over the node tree every time you want a set of nodes.

My Google search code, then, is a somewhat gratuitous use of agent.watch_for_set (it is a good example of "Gee, I wonder if ..."), though I could perhaps add something that gives a more practical example of collecting nodes as custom classes.

Maybe create a version of Para that exposes the element CSS class and id as properties. Then replace the element CSS class value with one of my own to better control the resulting page style. Or something.
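
Say (hypothetical; assumes the node exposes an attributes hash, as REXML elements do):

   class Para
     attr_reader :node, :id

     def initialize( node )
       @node = node
       @id   = node.attributes['id']
     end

     # Expose the element's CSS class, and let the caller swap it out
     # to restyle the resulting page.
     def css_class
       @node.attributes['class']
     end

     def css_class=( value )
       @node.attributes['class'] = value
     end

     def to_s
       @node.to_s
     end
   end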

Thanks,

James