--------------------------------------------------
#!/usr/bin/ruby -w

require 'net/http'

# read the page data

http = Net::HTTP.new('finance.yahoo.com', 80)
resp, page = http.get('/q?s=IBM', nil)

# BEGIN processing HTML

# return the contents of each <tag>...</tag> pair found in the data
def parse_html(data, tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

output = []
table_data = parse_html(page, "table")
table_data.each do |table|
  out_row = []
  row_data = parse_html(table, "tr")
  row_data.each do |row|
    cell_data = parse_html(row, "td")
    cell_data.each do |cell|
      cell.gsub!(%r{<.*?>}, "") # strip any markup left inside the cell
    end
    out_row << cell_data
  end
  output << out_row
end

# END processing HTML

# examine the result

# print a nested array as an indented, indexed outline
def parse_nested_array(array, tab = 0)
  n = 0
  array.each do |item|
    if item.size > 0
      puts "#{"\t" * tab}[#{n}] {"
      if item.class == Array
        parse_nested_array(item, tab + 1)
      else
        puts "#{"\t" * (tab + 1)}#{item}"
      end
      puts "#{"\t" * tab}}"
    end
    n += 1
  end
end

parse_nested_array(output)
--------------------------------------------------
Notice that about half of this program parses the Web page and builds an
array of arrays, while the remainder displays that array. The entire task
of scraping the page is carried out in the middle of the program.
If you examine the array display created in the latter part of the program,
you will see that all the data are placed in an array that can be indexed
by table, row and cell. Simply select which array elements you want.
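To see the display format without depending on Yahoo's page, you can feed
parse_nested_array a small hand-built array (the values below are made up
purely for illustration):
--------------------------------------------------
sample = [ [ ["IBM", "120.50"] ] ]  # one table, one row, two cells
parse_nested_array(sample)
# prints (tab-indented):
# [0] {
#     [0] {
#         [0] {
#             IBM
#         }
#         [1] {
#             120.50
#         }
#     }
# }
--------------------------------------------------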
I want to emphasize something. The 21 lines, including spaces and comments,
between "# BEGIN processing HTML" and "# END processing HTML" are all that
is required to scrape the page. After this, you simply choose which table
cells you want to use by indexing the array.
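Once you have read the indices off that display, selecting a value is plain
array indexing. A minimal sketch, with made-up index values (the real ones
depend on Yahoo's current page layout):
--------------------------------------------------
# hypothetical: suppose the last-trade price appears in the display as
# table 2, row 0, cell 1
last_trade = output[2][0][1]
puts "IBM last trade: #{last_trade}"
--------------------------------------------------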
This way of scraping pages is better if you have to post-process the
extracted data, if you need a lightweight solution for environments with
limited resources, if you want detailed control over the scraping process,
if you don't want to figure out how to use a large, powerful library that
can do absolutely anything, or if you want to learn how to write Ruby
programs.
And this way of scraping pages is not for everyone.
Also, I must add, if the Web page contains certain kinds of HTML syntax
errors, in particular any unpaired <table>, <tr> or <td> tags, my program
will break, and Hpricot probably won't. If, on the other hand, the page is
syntactically correct, this program is perfectly adequate to extract the
data.
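A quick way to see that failure mode, using the parse_html method from the
program above (the HTML fragments here are invented for illustration):
--------------------------------------------------
good = "<td>42.15</td>"
bad  = "<td>42.15"        # unpaired: no closing </td>
p parse_html(good, "td")  # => ["42.15"]
p parse_html(bad,  "td")  # => [] -- the regex needs both tags, so the
                          #    cell silently disappears
--------------------------------------------------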
Obligatory editorial comment: Yahoo exists because it can expose you to
advertising. That is the foundation of their business model. When you
scrape pages, you avoid having to look at their advertising. If everyone
did this, for better or worse Yahoo would go out of business (or change
their business model).
Those are the facts. If this page-scraping business becomes commonplace,
Yahoo and other similar Web sites will eventually choose a different
strategy; for example, they might sell subscriptions. Or they might try to
do more than they are already doing to discourage scraping. This activity
might end up being a contest between the scrapers and the scrapees, with
the scrapees making their pages more and more complex.
I think eventually these content providers might put up their content as
graphics rather than text, as the spammers are now doing. Then the scrapers
would have to invest in OCR to get the content.
This scraping activity isn't illegal, unless of course you exploit or
re-post the scraped content.
End of editorial.
--
Paul Lutus
http://www.arachnoid.com