I'm working on a couple of large sites that aren't sending the correct
response codes for missing pages. I want to use the Yahoo API key to
search Yahoo's cache to see if it has any clues about what pages are
sending bad responses. So I need to get all 1000 results from the
Yahoo cache and write them to a spreadsheet. Then I can sort the URL
data and response codes in the spreadsheet.
I can only request 100 results at a time and can set a different
"start" number for each request. The first request would be start=1,
the second, start=101 and so on.
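
The "start" values for the ten requests follow a simple pattern. Here is a minimal sketch of how they could be computed; start_offsets is a hypothetical helper name, not something from my script, and the defaults of 1000 total results and 100 per request are just the numbers mentioned above:

```ruby
# Hypothetical helper: list the "start" value for each request,
# assuming the service counts results from 1.
def start_offsets(total = 1000, per_page = 100)
  pages = (total + per_page - 1) / per_page   # round up to whole pages
  (0...pages).map { |i| 1 + i * per_page }
end

start_offsets.each_with_index do |start, i|
  puts "request #{i + 1}: start=#{start}"
end
```

Each value could then be dropped into the 'start' entry of the POST arguments.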
The other problem is that it won't get the response codes. I just get
this unhelpful error message:
c:/ruby/lib/ruby/1.8/net/http.rb:1467:in `initialize': HTTP request path is empty (ArgumentError)
        from c:/ruby/lib/ruby/1.8/net/http.rb:1585:in `initialize'
        from hpricot_test.rb:32:in `new'
        from hpricot_test.rb:32:in `get_headers'
        from hpricot_test.rb:80:in `generate_workbook'
        from hpricot_test.rb:70:in `each'
        from hpricot_test.rb:70:in `generate_workbook'
        from hpricot_test.rb:94
Here is the code:
#!/usr/bin/ruby -w
require 'net/http'
require 'uri'
require 'hpricot'
require 'spreadsheet/excel'
include Spreadsheet
def get_cache
  # set variables for POST request
  appid = 'yahooAPI-key' # a Yahoo API key goes here
  query = 'http://www.example.com' # a Web site to check goes here
  # this gets the first 100 results, but I want to loop through
  # it 10 times with a different "start" number to get all 1000
  # available results
  results = 100
  start = 1
  post_args = {
    'appid' => appid,
    'query' => query,
    'results' => results,
    'start' => start
  }
  url = URI.parse('http://search.yahooapis.com/SiteExplorerService/V1/pageData')
  # send post request
  @resp, @data = Net::HTTP.post_form(url, post_args)
  # read XML
  @doc = Hpricot(@data)
end
def get_headers(url)
  # This gets the response code for the page to see if it exists
  # (200, 301, 404, etc.)
  page = URI.parse(url)
  req = Net::HTTP::Get.new(page.path)
  res = Net::HTTP.start(page.host, page.port) { |http|
    http.request(req)
  }
  return res.code
end
def generate_workbook
  # create new workbook and worksheet
  workbook = Spreadsheet::Excel.new("yahoo_cache.xls")
  worksheet = workbook.add_worksheet('Yahoo Cache')
  # set variables
  current_row = 2
  format_nil = nil
  format_header = Format.new(
    :color => 'white',
    :bg_color => 'gray',
    :bold => true
  )
  workbook.add_format(format_header)
  workbook.add_format(format_nil)
  # Add header row
  worksheet.write(0, 0, "Yahoo's Cache for Site", format_nil)
  worksheet.write(1, 0, "TITLE", format_header)
  worksheet.write(1, 1, "URL", format_header)
  # worksheet.write(1, 2, "CODE", format_header)
  # worksheet.write(1, 3, "LOCATION", format_header) # coming soon
  # Add xml_data to worksheet
  (@doc/"result").each do |el|
    result_title = (el/"title").text
    result_url = (el/"url").text
    worksheet.write(current_row, 0, result_title, format_nil)
    worksheet.write(current_row, 1, result_url, format_nil)
    # get response codes -- this is causing an error with
    # "result_url" -- maybe it isn't a URL in a string?
    # see error message at top of this post
    # response_code ||= 0
    # response_code = get_headers(result_url) # this works if I put
    # a URL here, but not with the result_url variable
    # worksheet.write(current_row, 2, response_code, format_nil)
    # move to the next row in the spreadsheet before going to the
    # next XML item
    current_row += 1
  end
  # finished, close the workbook
  workbook.close
end
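
One possible cause of the ArgumentError: URI.parse returns an empty path for a URL that has no trailing slash (for example http://www.example.com), and Net::HTTP::Get.new rejects an empty path. The sketch below is a guard under that assumption. request_path is a hypothetical helper, and stripping whitespace from the URL is also an assumption, in case the text pulled out of the XML carries stray spaces or newlines:

```ruby
require 'uri'

# Hypothetical helper: build a request path that is never empty.
def request_path(url)
  page = URI.parse(url.strip)                 # assumption: trim stray whitespace
  path = page.path.empty? ? '/' : page.path   # fall back to the site root
  page.query ? "#{path}?#{page.query}" : path # keep any query string
end

puts request_path('http://www.example.com')
puts request_path('http://www.example.com/a/b?x=1')
```

In get_headers, the result could be passed to Net::HTTP::Get.new in place of page.path.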
···
====
The above code works (except the part that gets response codes). The
following code is a previous version where I tried to loop through all
1000 results. (It was using xmlsimple.) I couldn't figure out how to
store each set of XML -- each request is an entire XML file. I tried
@pass[count], but it wasn't working. Any ideas about a good way to
store each request?
# prepare to loop through 100 results
count = 1
start = 1
# pass[] = each of the 10 requests to Yahoo
@pass = []
# perform the loop
while count < 11 do
  post_args = {
    'appid' => appid,
    'query' => query,
    'results' => results,
    'start' => start
  }
  # send post request
  @resp, @data = Net::HTTP.post_form(url, post_args)
  # read XML
  xml_data = XmlSimple.xml_in(@data)
  @pass[count] = xml_data
  # puts "Count: #{count}"
  # print @pass[count]
  # puts "Start: #{start}"
  # puts
  count += 1
  start += 100
end
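
One thing that may matter in the loop above: count starts at 1, so @pass[0] is left as nil, and anything that later walks the whole array hits that nil first. Appending with << avoids tracking an index at all. This is a rough sketch of the idea with the network call left out. fetch_page is a hypothetical stand-in for the Net::HTTP.post_form request plus parsing, and collect_pages is likewise a made-up name:

```ruby
# Hypothetical stand-in for the POST request and XML parsing;
# it returns a small string so the sketch runs without a network.
def fetch_page(start, results)
  "<ResultSet start=\"#{start}\" count=\"#{results}\"/>"
end

# Collect one entry per request by appending to an array.
def collect_pages(pages = 10, results = 100)
  passes = []
  pages.times do |i|
    start = 1 + i * results            # 1, 101, 201, ...
    passes << yield(start, results)    # append, so no nil slot at index 0
  end
  passes
end

passes = collect_pages { |start, results| fetch_page(start, results) }
puts passes.length
puts passes.first
```

With the parsed documents in one array, the spreadsheet code could iterate over passes and then over the results inside each entry.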