[ANN] WWW::Mechanize 0.6.0 (Rufus)

Hi,

I would like to announce that my Mechpricot pie is done baking and is
ready to eat. The main feature of this release is that Mechanize uses
Hpricot as its internal HTML parser and that you can now treat a page
object returned from mechanize as an Hpricot object. This makes screen
scraping using mechanize much easier.
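The idea of a page object that you can treat as its parser document can be pictured with a small delegation sketch. This is illustrative only (FakeDoc and Page here are made-up stand-ins, not Mechanize's real source): the page forwards methods it doesn't implement to the underlying parsed document, so calls like search work directly on the page.

```ruby
# Illustrative sketch only -- not Mechanize's actual code. It shows the
# general pattern behind "a page object you can treat as an Hpricot object":
# the page delegates unknown method calls to its underlying parser document.
class FakeDoc
  # Stand-in for an Hpricot document with a trivial search method.
  def initialize(links)
    @links = links
  end

  def search(selector)
    selector == 'a' ? @links : []
  end
end

class Page
  def initialize(doc)
    @doc = doc
  end

  # Forward anything the page itself doesn't implement to the parsed
  # document, so page.search(...) behaves like doc.search(...).
  def method_missing(name, *args, &block)
    @doc.respond_to?(name) ? @doc.send(name, *args, &block) : super
  end

  def respond_to_missing?(name, include_private = false)
    @doc.respond_to?(name) || super
  end
end

page = Page.new(FakeDoc.new(%w[home about contact]))
page.search('a')   # => ["home", "about", "contact"]
```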

You can download it through gems:
  gem install mechanize -y

or get it here:
  http://rubyforge.org/projects/mechanize/

Check out the release notes and changelog for more cool stuff.

--Aaron

Aaron Patterson wrote:

> The main feature of this release is that Mechanize uses Hpricot as its
> internal HTML parser and that you can now treat a page object returned
> from mechanize as an Hpricot object. [...]

Currently, I use mechanize to grab nodes based on a watch list. These are REXML Element nodes, and code that works with them expects the REXML API.

Has this changed?


--
James Britt

"I can see them saying something like 'OMG Three Wizards Awesome'"
   - billinboston, on reddit.com

I'm noticing some issues with the changed behavior of
WWW::Mechanize::Page#links.text

I used to just be able to grab a link using
page.links.text(/pattern/).first and it would work even if the <a> had
children. It doesn't seem to work anymore. I'm working on pinning
the issue down, but you likely have more insight. Is there a new way
to do this that's more hpricot friendly?
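A toy model of the behavior described above (the names here are hypothetical, not Mechanize internals): matching a link's text should use the text of the whole <a> subtree, not just its direct text node, so a link like `<a><b>Download</b> now</a>` still matches /Download/.

```ruby
# Minimal sketch of link-text matching when an <a> has child elements.
# Node models an element with its own text plus child nodes.
Node = Struct.new(:text, :children) do
  # Concatenate this node's own text with all descendant text, depth-first.
  def full_text
    text.to_s + children.map(&:full_text).join
  end
end

def find_link(links, pattern)
  links.find { |l| l.full_text =~ pattern }
end

plain  = Node.new('plain link', [])
nested = Node.new('', [Node.new('Download', []), Node.new(' now', [])])

find_link([plain, nested], /Download/)  # matches the nested link
```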

Hpricot integration seems like a fine idea though, glad to see you
making use of it. Thanks for all the hard work.

On 9/6/06, Aaron Patterson <aaron_patterson@speakeasy.net> wrote:

> I would like to announce that my Mechpricot pie is done baking and is
> ready to eat. [...]

Yes. You will get back Hpricot nodes in 0.6.0. I plan on having a pluggable
parser in 0.6.1 that will return REXML nodes for you. Hpricot seems to
support some methods similar to REXML, so depending on how complicated your
logic is, you may be able to use Hpricot just fine. Otherwise, don't
upgrade until 0.6.1.
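A pluggable parser could be sketched roughly like this. Everything here is a guess at the shape of the feature (the class and accessor names are hypothetical, and 0.6.1 did not exist yet when this was written): the agent holds a parser class and uses whichever one is plugged in to build each page's document.

```ruby
# Hypothetical sketch of a pluggable-parser hook, with dummy parsers
# standing in for Hpricot and REXML.
class DummyHpricotParser
  def self.parse(html)
    "hpricot-doc for: #{html}"
  end
end

class DummyREXMLParser
  def self.parse(html)
    "rexml-doc for: #{html}"
  end
end

class Agent
  attr_accessor :pluggable_parser

  def initialize(parser = DummyHpricotParser)
    @pluggable_parser = parser
  end

  # Build a document for fetched HTML using whatever parser is plugged in.
  def fetch(html)
    @pluggable_parser.parse(html)
  end
end

agent = Agent.new
agent.fetch('<p>hi</p>')                 # uses the Hpricot stand-in
agent.pluggable_parser = DummyREXMLParser
agent.fetch('<p>hi</p>')                 # now builds a REXML-style doc
```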

--Aaron


On Thu, Sep 07, 2006 at 06:30:10AM +0900, James Britt wrote:


Currently, I use mechanize to grab nodes based on a watch list. These
are REXML Element nodes, and code that works with them expects the REXML
API.

Has this changed?

On Fri, Sep 08, 2006 at 05:20:38AM +0900, Mat Schaffer wrote:

> I'm noticing some issues with the changed behavior of
> WWW::Mechanize::Page#links.text
>
> I used to just be able to grab a link using
> page.links.text(/pattern/).first and it would work even if the <a> had
> children. It doesn't seem to work anymore. I'm working on pinning
> the issue down, but you likely have more insight. Is there a new way
> to do this that's more hpricot friendly?

This may be a bug in hpricot. That functionality should have remained
the same. The only difference is the parser being used. Could you possibly
send sample code or sample html to one of the mechanize mailing lists:

http://rubyforge.org/mail/?group_id=1453

I don't want to clutter ruby-talk with mechanize support stuff. :)

> Hpricot integration seems like a fine idea though, glad to see you
> making use of it. Thanks for all the hard work.

No problem. Hopefully I can help you out!

--Aaron

Aaron Patterson wrote:


Yes. You will get back Hpricot nodes in 0.6.0. I plan on having a pluggable
parser in 0.6.1 that will return REXML nodes for you. Hpricot seems to
support some methods similar to REXML, so depending on how complicated your
logic is, you may be able to use Hpricot just fine. Otherwise, don't
upgrade until 0.6.1.

Ah, thanks. My code takes these nodes and uses them to instantiate assorted domain objects, using REXML's XPath and element methods to populate instance variables. That might be simple enough to replace with Hpath, but I'll wait to upgrade until I'm sure.
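The kind of REXML-based extraction described above can be shown as a small runnable sketch. The XML and the Item class are made up for illustration; only the REXML calls (XPath.first, XPath.match, Element#text) are the real stdlib API. Each watched node becomes a domain object whose instance variables are filled in via XPath.

```ruby
require 'rexml/document'

# Hypothetical domain object built from a watched node.
class Item
  attr_reader :title, :price

  def initialize(element)
    # REXML::XPath.first returns the first node matching the expression,
    # evaluated relative to the given element.
    @title = REXML::XPath.first(element, 'title').text
    @price = REXML::XPath.first(element, 'price').text.to_f
  end
end

xml = <<~XML
  <watchlist>
    <item><title>Ruby book</title><price>29.95</price></item>
    <item><title>Hpricot mug</title><price>9.50</price></item>
  </watchlist>
XML

doc   = REXML::Document.new(xml)
items = REXML::XPath.match(doc, '//item').map { |el| Item.new(el) }
items.first.title   # => "Ruby book"
```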

James