Screen scraping an aspx site with Mechanize

Hi,

I've been googling for over a week now, but I can't find out how to
screen scrape an aspx-site with Mechanize (and I must use mechanize).

The site (https://portal.if.se/kopforsakring/ProductList.aspx?linkID=4)
contains a button with the text "Bil" that I want to click, and after
that click the continue button (green one with the text "Fortsätt").

I've tried using Firebug to find viewstate, but I don't know what to do
with it once I've found it.

Am I on the right track?
Can anybody help me?

···

--
Posted via http://www.ruby-forum.com/.

Have you looked into nokogiri at all? You can use mechanize for the
server interaction (GET, POST, etc), then parse the response
object's .body with nokogiri.

As long as you don't have to deliberately replicate a "click", this
would work fine (use Selenium if you actually need a click event).
Otherwise, GETing and POSTing to the links produces the same results.
Make sense?

require 'mechanize'   # also loads Nokogiri

req = Mechanize.new
resp = req.get("/path/to/desired/url")   # placeholder URL
page = Nokogiri::HTML resp.body
link = page.xpath("//xpath/to/link")     # placeholder XPath
resp = req.get(link)

···

________________________________________________________________________

Alex Stahl | Sr. Quality Engineer | hi5 Networks, Inc. | astahl@hi5.com

m: 415.710.6961

On Thu, 2010-12-02 at 11:19 -0600, Sofie Willander wrote:

> Hi,
>
> I've been googling for over a week now, but I can't find out how to
> screen scrape an aspx-site with Mechanize (and I must use mechanize).
>
> The site (https://portal.if.se/kopforsakring/ProductList.aspx?linkID=4)
> contains a button with the text "Bil" that I want to click, and after
> that click the continue button (green one with the text "Fortsätt").
>
> I've tried using Firebug to find viewstate, but I don't know what to do
> with it once I've found it.
>
> Am I on the right track?
> Can anybody help me?

You should watch Ryan Bates's excellent screencast on scraping data with
Mechanize:

#191 Mechanize - RailsCasts

···

I don't think Mechanize will work for this. Mechanize can't process
JavaScript. I'd recommend Watir or Selenium, which actually launch and drive
instances of the browser.

···

Even though you can probably do this with Mechanize, I'd advise against it.

ASPX websites keep a lot of state in view state parameters and other hidden
form fields. Use either
1. Celerity in JRuby or
2. Watir or Selenium.

You can also drop Mechanize, parse the page with Nokogiri alone, and POST to
https://portal.if.se/kopforsakring/ProductList.aspx?linkID=4 yourself with all
the form data. That could be messy, though.
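
To make the "post all the form data" idea a bit more concrete, here is a
minimal sketch that uses only Ruby's standard library. It assumes the hidden
fields (such as __VIEWSTATE) have already been scraped from the previous
response into a Hash. The helper name build_postback_body, the field values,
and the button name btnBil are made up for illustration and are not taken
from the actual site.

```ruby
require 'uri'

# Hypothetical helper: builds the body of an ASP.NET-style postback.
# hidden_fields - Hash of hidden inputs copied from the previous response
# button_name / button_value - the submit button being "clicked"
def build_postback_body(hidden_fields, button_name, button_value)
  URI.encode_www_form(hidden_fields.merge(button_name => button_value))
end

# Illustrative values only; real view state strings are much longer.
hidden = { '__VIEWSTATE' => 'dDwtMTI=', '__EVENTVALIDATION' => 'abc/def' }
body = build_postback_body(hidden, 'btnBil', 'Bil')
puts body
```

The resulting string would be sent as the body of a POST with the content
type application/x-www-form-urlencoded. The hidden values change with every
response, so they have to be re-read from each page before the next POST.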

My 2 cents

Piyush

···

OK, you all seem to agree that Mechanize is not a good option for
what I want to do. I will look into the other options. Thanks so much
for your help!

···

Mike Dalessio wrote in post #965776:

> You should watch Ryan Bates's excellent screencast on scraping data with
> Mechanize:
>
> #191 Mechanize - RailsCasts

I've already watched that Railscast (and the one about screen scraping
with Nokogiri). It worked fine on a non-aspx site but did not work
at all on an aspx site (I got posted back to the same page over and
over...). Have you screen scraped an aspx site with the method Ryan Bates
shows?

···

Thank you for your reply! I haven't gotten it to work yet, though. I get
an error on the following:

Alex Stahl wrote in post #965773:

> resp = req.get(link)

The error reads:

Mechanize::ResponseCodeError: 400 => Net::HTTPBadRequest
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:259:in `get'
        from (irb):8
        from :0

It seems to find the button (I assumed that the XPath to the link was
actually the button's XPath. Correct?). I get the button as an object;
can I use req.get() on it? What am I doing wrong?

I've attached a text file with the output of the last two commands. I
would be so glad if you could help me once again.

Attachments:
http://www.ruby-forum.com/attachment/5509/HTTPBadRequest.txt

···


Glad I could help... a few things to know:

- Mechanize raises an exception on any response that is not a success or
  redirect status (such as HTTP 200 or 302). So the error you're
  receiving, HTTP 400, is not handled by Mechanize and needs to be
  handled by your own code.
- #get takes a URL as its parameter, so link should be a URL string.
  (Actually, there's more than one way to pass the URL - check the
  following link if that's not what you want:
  http://mechanize.rubyforge.org/mechanize/Mechanize.html#M000231)
- Starting an XPath with "//*" causes the parser to look at *every*
  element until it finds one which has the @id you supplied. Better to
  replace "*" with the actual HTML element.

Based on the XPath in the error at the link, you're not extracting a
URL - you're getting an HTML object (or, more specifically, an XML
node/nodeset). Instead, what you want is the "href" attribute of the
<a> tag located at the XPath. (In the example below, '//path/to' would
be the unique HTML element(s) which is/are the parent of the anchor
tag.) Access the attribute like so:

link = page.xpath("//path/to/a/@href").to_s
p link

It's also helpful to output the link prior to using it as a param to
#get to see what you'll ask for.

···

________________________________________________________________________

Alex Stahl | Sr. Quality Engineer | hi5 Networks, Inc. | astahl@hi5.com

m: 415.710.6961

On Fri, 2010-12-03 at 02:27 -0600, Sofie Willander wrote:

> Thank you for your reply! I haven't gotten it to work yet though. I get
> an error on the following:
>
> Alex Stahl wrote in post #965773:
> > resp = req.get(link)
>
> The error read:
> Mechanize::ResponseCodeError: 400 => Net::HTTPBadRequest
>         from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:259:in `get'
>         from (irb):8
>         from :0
>
> It seems to find the button (I assumed that the xpath to the link was
> actually the button's xpath. Correct?). I get the button as an object,
> can I use req.get() on it? What am I not doing correctly?
>
> I've attached a textfile with the output of the two last commands. I
> would be so glad if you could help me once again.
>
> Attachments:
> http://www.ruby-forum.com/attachment/5509/HTTPBadRequest.txt

Alex Stahl wrote in post #965922:

> Based on the xpath in the error at the link, you're not extracting a URL
> - you're getting an HTML object (or, more specifically, an XML
> node/nodeset). Instead, what you want is the "href" property of the
> <a> tag located at the xpath. (In the below example, '//path/to' would
> be the unique HTML element(s) which is/are the parent of the anchor
> tag). Access the property like so:
>
> link = page.xpath("//path/to/a/@href").to_s
> p link

Now I'm even more confused... Have you got any examples to show me?
How do I find the href for a button?

···

Sorry, I hadn't looked too closely at the site you're scraping and had
assumed the button was wrapped by a link. But upon closer inspection
that doesn't appear to be the case. Looking at the page source, the
button is a form input element which doesn't actually cause a request to
be sent. In this case, as I noted in my first email, you will in fact
need to generate a click event. Unfortunately, that's not really what
Nokogiri is for. Since you need that specific event to fire, as was
previously recommended, Watir or Selenium are the more appropriate
tools.

Another option would be to use Wireshark to sniff for any requests which
are sent, and then try to reconstruct and send those requests via
Mechanize. But this would be a little more complex than just using the
right tools.

Of course, I just checked the Mechanize docs again... and there is a
#click_button method on the form object, so that could be a solution as
well. (http://mechanize.rubyforge.org/mechanize/Mechanize/Form.html)

···

________________________________________________________________________

Alex Stahl | Sr. Quality Engineer | hi5 Networks, Inc. | astahl@hi5.com

On Fri, 2010-12-03 at 05:22 -0600, Sofie Willander wrote:

> Alex Stahl wrote in post #965922:
> > Based on the xpath in the error at the link, you're not extracting a URL
> > - you're getting an HTML object (or, more specifically, an XML
> > node/nodeset). Instead, what you want is the "href" property of the
> > <a> tag located at the xpath. (In the below example, '//path/to' would
> > be the unique HTML element(s) which is/are the parent of the anchor
> > tag). Access the property like so:
> >
> > link = page.xpath("//path/to/a/@href").to_s
> > p link
>
> Now I'm even more confused.. Have you got any examples to show me?
> How do I find the href for a button?