So. Discuss
(Thanks in advance for any advice.)
Well, there is a new kid on the block: scRUBYt!, a web scraping framework based on Mechanize and Hpricot. DISCLAIMER: I am the author, so I may be a bit biased ;-) I am planning to add WATIR alongside Mechanize (or replace Mechanize with it, not sure yet) so that it can handle JavaScript, too. If you can do without JavaScript for the moment, I think scRUBYt! is an interesting choice, because:
1) Mechanize and Hpricot are great in themselves. Now sum the power of the two and multiply it by n because of the added functionality (you decide the value of n; for everyone I have gotten feedback from so far, it was much greater than 1).
2) scRUBYt! is easy to learn and use, quite powerful, has tons of docs (check out http://scrubyt.org), is nicely documented (http://scrubyt.rubyforge.org), unit tested, black-box tested etc. The API and the whole thing are designed to be extensible with your own code, and I am usually available for support if this is still not enough.
3) I am planning to invest a lot of time into scRUBYt!. I am releasing the next version as I write this mail, my TODO list has 200+ items, and the community seems to be very active: I have already received tons of bug reports, feature requests and even patches (and the whole thing has been out for only about 2 weeks).
4) I am planning to launch a community site where (hopefully) users will upload, tag, rate etc. the extractors they create, which could also be interesting if it works out.
A quick example:
=====================================================================
amazon_stuff = Scrubyt::Extractor.define do
  fetch 'http://www.amazon.com'
  fill_textfield 'field-keywords', 'logitech keyboard'
  choose_option 'url', 'Computers & PC Hardware'
  submit

  stuff do
    item_name "Logitech diNovo Edge ( 967685-0403 )"
    price "$169.98"
  end
end

amazon_stuff.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(amazon_stuff)
output:
[MODE] learning
[ACTION] fetching document: http://www.amazon.com
[ACTION] typing logitech keyboard into the textfield named 'field-keywords'
[ACTION] selecting option Computers & PC Hardware from the option list 'url'
[ACTION] submitting form...
[ACTION] fetched Amazon.com : logitech keyboard
<root>
  <stuff>
    <item_name>Logitech diNovo Edge ( 967685-0403 )</item_name>
    <price>$169.98</price>
  </stuff>
  <stuff>
    <item_name>Logitech G15 Gaming Keyboard</item_name>
    <price>$77.74</price>
  </stuff>
  <stuff>
    <item_name>Logitech Media Keyboard Elite- Black ( 967559-0403 )</item_name>
    <price>$27.43</price>
  </stuff>
  <stuff>
    <item_name>Logitech Cordless Desktop S510</item_name>
    <price>$52.79</price>
  </stuff>
  <stuff>
    <item_name>Logitech Cordless Desktop MX 3000 Laser (967553-0403)</item_name>
    <price>$60.93</price>
  </stuff>
  <stuff>
    <item_name>Logitech Classic Keyboard</item_name>
    <price>$11.99</price>
  </stuff>
  <stuff>
    <item_name>Logitech Cordless Desktop LX 300</item_name>
    <price>$38.74</price>
  </stuff>
  <stuff>
    <item_name>Logitech diNovo Cordless Desktop</item_name>
    <price>$104.99</price>
  </stuff>
  <stuff>
    <item_name>Logitech Cordless Desktop MX 5000 Laser (967558-0403)</item_name>
    <price>$116.99</price>
  </stuff>
  <stuff>
    <item_name>Logitech Media Keyboard</item_name>
  </stuff>
  <stuff>
    <item_name>Logitech Cordless Desktop MX3200 Laser</item_name>
    <price>$76.98</price>
  </stuff>
  <stuff>
    <item_name>Logitech G11 Gaming Keyboard</item_name>
    <price>$61.73</price>
  </stuff>
  <stuff>
    <item_name>Logitech Cordless Desktop S 530 Laser for Mac ( 967664-0403 )</item_name>
    <price>$67.94</price>
  </stuff>
  <stuff>
    <item_name>Logitech Cordless Desktop Comfort Laser</item_name>
    <price>$77.81</price>
  </stuff>
  <stuff>
    <item_name>Logitech Cordless Desktop EX110 ( 967561-0403 )</item_name>
    <price>Used & new
    from $24.97</price>
  </stuff>
  <stuff>
    <item_name>Sony Playstation 2 USB Keyboard</item_name>
  </stuff>
</root>
stuff extracted 16 instances.
item_name extracted 16 instances.
price extracted 14 instances.
I think you get the idea... scRUBYt! hides all the ugly stuff (HTML, XPaths, form names, whatnot) and figures out everything based on your examples.
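To give a rough feel for what "figuring out everything based on your examples" can mean, here is a toy sketch in plain Ruby using REXML. It is not how scRUBYt! is implemented, just one simple way to go from a single example string to all similar items: find the element containing the example, note where it sits in the tree, and collect every element that sits in the same kind of place. The page markup and the `signature` and `extract_like` helpers are made up for this illustration.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a search result page.
PAGE = <<~HTML
  <html><body><table>
    <tr><td>Logitech diNovo Edge</td><td>$169.98</td></tr>
    <tr><td>Logitech G15 Gaming Keyboard</td><td>$77.74</td></tr>
    <tr><td>Logitech Classic Keyboard</td><td>$11.99</td></tr>
  </table></body></html>
HTML

# Describe where an element sits: the tag names from the root down to it,
# plus its position among its sibling elements. Positions of the ancestors
# are deliberately ignored, so the description also fits the other rows.
def signature(element)
  tags = []
  node = element
  until node.nil? || node.is_a?(REXML::Document)
    tags.unshift(node.name)
    node = node.parent
  end
  position = element.parent.elements.to_a.index { |sibling| sibling.equal?(element) }
  [tags, position]
end

# Find the element whose text equals the example, then return the text of
# every element that has the same signature.
def extract_like(doc, example)
  elements = REXML::XPath.match(doc, '//*')
  model = elements.find { |el| el.text.to_s.strip == example }
  return [] unless model

  wanted = signature(model)
  elements.select { |el| signature(el) == wanted }.map { |el| el.text.to_s.strip }
end

doc = REXML::Document.new(PAGE)
names  = extract_like(doc, 'Logitech diNovo Edge')
prices = extract_like(doc, '$169.98')
names.zip(prices).each { |name, price| puts "#{name}: #{price}" }
```

Real pages are messier than this (malformed HTML, missing prices, nested layout tables), which is exactly the kind of thing a framework has to deal with for you.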
BTW, don't try to run this example with 0.2.0 (the currently released version); it needs 0.2.3, which I am going to release in a few hours.
scRUBYt! has many more features than this example suggests. If you are interested, check out http://scrubyt.org.
Cheers,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.