Yahoo API and Ruby

rb1 · 1 January 2007 21:50

I'm working on a couple of large sites that aren't sending the correct
response codes for missing pages. I want to use the Yahoo API to
search Yahoo's cache for clues about which pages are sending bad
responses. So I need to get all 1000 results from the Yahoo cache and
write them to a spreadsheet. Then I can sort the URL data and response
codes in the spreadsheet.

I can only request 100 results at a time and can set a different
"start" number for each request. The first request would be start=1,
the second, start=101 and so on.
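
A minimal sketch of that paging scheme, assuming the same four parameter names used in the code further down ('appid', 'query', 'results', 'start'). The helper name page_args is hypothetical; it only builds the argument hashes and makes no request:

```ruby
# Hypothetical helper: build one set of POST arguments per page.
# With the defaults it yields start = 1, 101, 201, ... 901 (10 pages).
def page_args(appid, query, total = 1000, per_page = 100)
  1.step(total, per_page).map do |start|
    {
      'appid'   => appid,
      'query'   => query,
      'results' => per_page,
      'start'   => start
    }
  end
end

page_args('yahooAPI-key', 'http://www.example.com').each do |args|
  puts "results=#{args['results']} start=#{args['start']}"
end
```

Each hash could then be handed to Net::HTTP.post_form in turn.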

The other problem is that it won't get the response codes. I just get
this unhelpful error message:
c:/ruby/lib/ruby/1.8/net/http.rb:1467:in `initialize': HTTP request path is empty (ArgumentError)
        from c:/ruby/lib/ruby/1.8/net/http.rb:1585:in `initialize'
        from hpricot_test.rb:32:in `new'
        from hpricot_test.rb:32:in `get_headers'
        from hpricot_test.rb:80:in `generate_workbook'
        from hpricot_test.rb:70:in `each'
        from hpricot_test.rb:70:in `generate_workbook'
        from hpricot_test.rb:94

Here is the code:

#!/usr/bin/ruby -w

require 'net/http'
require 'uri'
require 'hpricot'
require 'spreadsheet/excel'
include Spreadsheet

  def get_cache
    # set variables for POST request
    appid = 'yahooAPI-key' # a Yahoo API key goes here
    query = 'http://www.example.com' # a Web site to check goes here

    # this gets the first 100 results, but I want to loop through
    # it 10 times with a different "start" number to get all 1000
    # available results
    results = 100
    start = 1

    post_args = {
      'appid' => appid,
      'query' => query,
      'results' => results,
      'start' => start
    }
    url = URI.parse('http://search.yahooapis.com/SiteExplorerService/V1/pageData')

    # send post request
    @resp, @data = Net::HTTP.post_form(url, post_args)

    # read XML
    @doc = Hpricot(@data)
  end

  def get_headers(url)
    # This gets the response code for the page to see if it exists
    # (200, 301, 404, etc.)
    page = URI.parse(url)
    req = Net::HTTP::Get.new(page.path)
    res = Net::HTTP.start(page.host, page.port) { |http|
      http.request(req)
    }
    return res.code
  end

  def generate_workbook
    # create new workbook and worksheet
    workbook = Spreadsheet::Excel.new("yahoo_cache.xls")
    worksheet = workbook.add_worksheet('Yahoo Cache')

    # set variables
    current_row = 2
    format_nil = nil
    format_header = Format.new(
      :color => 'white',
      :bg_color => 'gray',
      :bold => true
    )
    workbook.add_format(format_header)
    workbook.add_format(format_nil)

    # Add header row
    worksheet.write(0,0,"Yahoo's Cache for Site", format_nil)
    worksheet.write(1,0,"TITLE",format_header)
    worksheet.write(1,1,"URL", format_header)
    # worksheet.write(1,2,"CODE", format_header)
    # worksheet.write(1,3,"LOCATION", format_header) # coming soon

    # Add xml_data to worksheet
    (@doc/"result").each do |el|
      result_title = (el/"title").text
      result_url = (el/"url").text
      worksheet.write(current_row, 0, result_title, format_nil)
      worksheet.write(current_row, 1, result_url, format_nil)

      # get response codes -- this is causing an error with
      # "result_url" -- maybe it isn't a URL in a string?
      # see error message at top of this post
      # response_code ||= 0
      # response_code = get_headers(result_url) # this works if I put
      # a URL here, but not with the result_url variable
      # worksheet.write(current_row, 2, response_code, format_nil)

      # move to the next row in the spreadsheet before going to the
      # next XML item
      current_row += 1
    end

    # finished, close the workbook
    workbook.close
  end

···
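
One possible cause of the ArgumentError in get_headers: Net::HTTP::Get.new raises an ArgumentError when it is given an empty path, and URI.parse(url).path is an empty string for a bare-host URL such as http://www.example.com. If some of the returned URLs have no path, a sketch like the following might avoid it. The helper name request_path is hypothetical, and the strip call is only a guess that the text pulled out of the XML could carry stray whitespace:

```ruby
require 'uri'

# Hypothetical helper: the path to put in the GET request line.
# URI#request_uri returns "/" when the URL has no path, and keeps
# any query string, unlike URI#path.
def request_path(url)
  URI.parse(url.strip).request_uri
end

puts request_path('http://www.example.com')
puts request_path('http://www.example.com/a/b?x=1')
```

In get_headers this would mean passing request_path(url) (or page.request_uri) to Net::HTTP::Get.new instead of page.path.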

====
The above code works (except the part that gets response codes). The
following code is a previous version where I tried to loop through all
1000 results. (It was using xmlsimple.) I couldn't figure out how to
store each set of XML -- each request is an entire XML file. I tried
@pass[count], but it wasn't working. Any ideas about a good way to
store each request?

    # prepare to loop through all 1000 results, 100 at a time
    count = 1
    start = 1

    # pass[] = each of the 10 requests to Yahoo
    @pass = []

    # perform the loop
    while count < 11 do
      post_args = {
        'appid' => appid,
        'query' => query,
        'results' => results,
        'start' => start
      }

      # send post request
      @resp, @data = Net::HTTP.post_form(url, post_args)

      # read XML
      xml_data = XmlSimple.xml_in(@data)
      @pass[count] = xml_data
      # puts "Count: #{count}"
      # print @pass[count]

      # puts "Start: #{start}"
      # puts
      count += 1
      start += 100
    end
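
One way to store each request, sketched under assumptions: append every page's results to a single array rather than indexing by count (starting count at 1 also leaves @pass[0] as nil). FAKE_FETCH is a hypothetical stand-in for the real request and parse step so the sketch runs without a network connection or an API key, and the 'Title' and 'Url' keys are illustrative only:

```ruby
# Hypothetical stand-in for one Yahoo request: returns fake results
# for the page that begins at `start`, without touching the network.
FAKE_FETCH = lambda do |start, per_page|
  (start...(start + per_page)).map do |n|
    { 'Title' => "Page #{n}", 'Url' => "http://www.example.com/#{n}" }
  end
end

# Call the fetcher once per page and collect everything into one
# flat array, so later code can iterate over all results at once.
def fetch_all(fetcher, total = 1000, per_page = 100)
  all_results = []
  1.step(total, per_page) do |start|
    all_results.concat(fetcher.call(start, per_page))
  end
  all_results
end

results = fetch_all(FAKE_FETCH)
puts results.length
```

With a real fetcher that posts the form and parses the XML for one page, the spreadsheet loop could then walk the flat array directly.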
