Mechanize file save on generated link

Dan_Mansfield · 12 September 2010 20:53

Hi there,
I'm working on a project to automate retrieval of content and the
download of a pdf bill from an ISP's website. I have 30 connections with
this ISP and each has its own username and password. So far I have been
able to get the content I need that is actually stored within the page.

You have to log in and go to a specific page to be able to click on the
link.
The link itself isn't the pdf, as it's generated on the fly.
Source code of the link:
<a href="/be-portal/downloadPdf"
onclick="cmCreatePageElementTag('Download PDF', 'Member Centre');"
class="btn_Link">Download PDF</a>

I'm selecting it with:
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

How can I click this link to download the pdf and store it to the
filesystem?

btw, I only started with Ruby and Mechanize less than 24 hours ago.
TIA
Regards,
Dan

--
Posted via http://www.ruby-forum.com/.

Andrea_Dallera · 12 September 2010 21:02

Hi Dan,

try with

File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end

where agent is the mechanize agent you used to log in and get the link.

Dan_Mansfield · 12 September 2010 21:54

Thanks, so my script is now:
....
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end

This results in:
C:/ruby/test.rb:27:in `[]': can't convert String into Integer (TypeError)
        from C:/ruby/test.rb:27:in `block in <main>'
        from C:/ruby/test.rb:26:in `open'
        from C:/ruby/test.rb:26:in `<main>'

What is the proper method to see the response back from clicking the
link?

Mike_Dalessio1 · 13 September 2010 12:45

links_with returns an array. Try using .first to pick out the first result,
so:

link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
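
The TypeError above can be reproduced without Mechanize. This is only a minimal sketch: the Array of Hashes is a stand-in (an assumption for illustration) for the Array of link objects that links_with returns, so that it runs on plain Ruby.

```ruby
# Stand-in for the result of links_with: an Array of link-like objects.
# A Hash is used here only so the sketch runs without Mechanize.
links = [{ 'href' => '/be-portal/downloadPdf' }]

error = nil
begin
  links['href']        # indexing the Array itself with a String
rescue TypeError => e
  error = e            # Array#[] expects an Integer index, not a String
end

link = links.first     # pick the single match out of the Array
href = link['href']    # now the lookup is on the element, not the Array

puts error.class
puts href
```

If exactly one match is expected, Mechanize pages also offer the singular link_with, which returns the first match directly instead of an Array.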

Dan_Mansfield · 15 September 2010 20:16

Mike Dalessio wrote:

> links_with returns an array. Try using .first to pick out the first
> result, so:
>
> link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first

Ok, so some progress:
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
page = agent.click(link)
#pp page
File.open('myfile.pdf', 'w+') do |file|
  file << page
end

If I look at the content of page now it contains the stream of PDF data
as well as:
@code="200",
@filename="Invoice_14051844_08/09/2010.pdf",
@response=
{"date"=>"Wed, 15 Sep 2010 19:13:04 GMT",
  "server"=>"Apache",
  "expires"=>"Wed, 15 Sep 2010 19:14:04 GMT",
  "cache-control"=>"max-age=60",
  "content-disposition"=>
   "attachment;filename=Invoice_14051844_08/09/2010.pdf",
  "vary"=>"Accept-Encoding",
  "content-encoding"=>"gzip",
  "content-length"=>"7950",
  "keep-alive"=>"timeout=5, max=92",
  "connection"=>"Keep-Alive",
  "content-type"=>"application/octet-stream"},
@uri=
#<URI::HTTPS:0x335bc38
URL:https://www.bethere.co.uk/be-portal/downloadPdf>>

Dan_Mansfield · 15 September 2010 20:19

I tried this too:
File.open('myfile.pdf', 'w+') do |file|
  file << page.body
end

This almost works but produces a corrupt pdf file. I can see the
document properties of the pdf file but there is no content.

I have also tried using:
agent.pluggable_parser.pdf = Mechanize::FileSaver
agent.click(link)

which did not produce an error, but did not produce a pdf file either.

Dan_Mansfield · 20 September 2010 12:12

SO CLOSE!

By doing a binary compare of a working version downloaded through the
browser and the one through Mechanize, I have found that it is saving
the line breaks as 0D 0A in hex versus just 0A in the working file.

While I dig around to find out how to stop Mechanize/Ruby from doing
that, has anyone else come across this and found a solution?
Thanks
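
The binary compare described above can also be done in Ruby itself. The following is a rough sketch, not a definitive tool: crlf_count and first_difference are hypothetical helper names, and the two byte strings are made-up stand-ins for the browser download and the Mechanize download.

```ruby
# Count CR LF pairs (0D 0A) in a string of raw bytes.
def crlf_count(bytes)
  bytes.scan("\r\n".b).length
end

# Return the offset of the first differing byte, or nil if identical.
def first_difference(a, b)
  limit = [a.bytesize, b.bytesize].min
  (0...limit).each do |i|
    return i if a.getbyte(i) != b.getbyte(i)
  end
  a.bytesize == b.bytesize ? nil : limit
end

# Stand-ins for the two downloads: the second has every LF turned
# into CR LF, which is the difference seen in the hex compare.
good = "%PDF-1.4\nstream\nendstream\n".b
bad  = good.gsub("\n", "\r\n")

puts crlf_count(good)
puts crlf_count(bad)
puts first_difference(good, bad).inspect
```

With real files, each side could be loaded with File.binread(path), which reads the bytes without any newline translation.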

Dan_Mansfield · 20 September 2010 12:39

Thanks to everyone who helped. Writing the file in binary mode did the
trick.

In case anyone has this problem in the future, here is my full script:

require 'rubygems'
require 'mechanize'

URL_LOGIN =
'https://www.bethere.co.uk/cas/login?service=https://www.bethere.co.uk/c/portal/login'
URL_BILLING = 'https://www.bethere.co.uk/group/beportal/billsandpayment'

abort "Usage: #{$0} <username> <password>" unless ARGV.length == 2

agent = Mechanize.new
agent.follow_meta_refresh = true
agent.redirect_ok = true
agent.user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6'
login_page = agent.get(URL_LOGIN)

login_form = login_page.forms.first
login_form.username = ARGV[0]
login_form.password = ARGV[1]

redirect_page = agent.submit(login_form)

invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
page = agent.click(link)

File.open(page.filename.gsub("/","_"), 'w+b') do |file|
  file << page.body.strip
end
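
The last two steps of the script (making the server-supplied file name safe and writing in binary mode) can be tried in isolation. This is a sketch under stated assumptions: safe_filename and save_binary are hypothetical helper names, the byte string stands in for page.body, and replacing backslashes as well as slashes is an extra precaution that is not in the script above. Unlike the script, the sketch writes the bytes unchanged rather than calling strip.

```ruby
require 'tmpdir'

# Replace path separators so a header-supplied name is a single file name.
# (Also replacing backslashes is an assumption beyond the original script.)
def safe_filename(name)
  name.gsub(%r{[/\\]}, '_')
end

# Write bytes in binary mode ('wb') so no newline translation happens.
def save_binary(path, bytes)
  File.open(path, 'wb') { |file| file.write(bytes) }
end

bytes = "%PDF-1.4\nbody\n%%EOF\n".b   # stand-in for page.body
name  = safe_filename('Invoice_14051844_08/09/2010.pdf')

# Write into a temporary directory and read the bytes straight back.
round_trip = Dir.mktmpdir do |dir|
  path = File.join(dir, name)
  save_binary(path, bytes)
  File.binread(path)
end

puts name
puts round_trip == bytes
```

The 'b' flag is what matters on Windows: without it, Ruby opens the file in text mode and writes each 0A as 0D 0A, which matches the corruption found in the hex compare.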
