Page crawling and URL grabbing

Patrick_L · 27 January 2009 00:55

Hey guys,
I'm trying to write an application that goes onto a website (istockphoto
specifically), opens up istockphoto.com/file_browse.php and grabs the
URLs of the photos that appear there.

It's my first time doing something like this. I'm reading some
documentation right now...but a hand would be greatly appreciated. I'm
not really sure how to do regex on an html file...or even find the right
stuff within that file. I'm guessing it's...

open('http://www.istockphoto.com/file_browse.php/') do |f|
f.find # dot something something
end

but I really have no idea. Any help would be great - thanks in advance!

--
Posted via http://www.ruby-forum.com/.

Jesus_Gabriel_y_Gala · 27 January 2009 08:36

Generally speaking, regular expressions are not the best tool to extract
information from HTML. Take a look at these other tools:

Mechanize
Hpricot
Scrubyt
Nokogiri

This is an example that might get you started, although I recommend taking
a look at the above tools:

require 'open-uri'
require 'hpricot'

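# Fetch the page and parse it with Hpricot
h = Hpricot(open("http://www.istockphoto.com/file_browse.php"))
# Select every element whose class is searchImg
imgs = h.search("//[@class = searchImg]")
# Collect the src attribute of each match
imgs.map {|img| img["src"]}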

You should customize the criteria to choose the images (in my little
example I selected all tags which had a class searchImg, which at a
quick glance seemed what you wanted, but double check).
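
As a rough sketch of what customizing the criteria could look like once
you have the src values: the snippet below keeps only the entries that
look like JPEGs and resolves relative paths against the page URL with
Ruby's standard URI library. The sample src strings and the JPEG filter
are assumptions for illustration, not what the real page returns.

```ruby
require 'uri'

# Page the src values were scraped from (the URL from the original question).
BASE = 'http://www.istockphoto.com/file_browse.php'

# Hypothetical src values, standing in for imgs.map {|img| img["src"]}.
srcs = [
  '/file_thumbview/123/photo-one.jpg',
  'http://www.istockphoto.com/file_thumbview/456/photo-two.jpg',
  '/static/images/spacer.gif',
  nil
]

# Drop missing values, keep only JPEGs, and make every URL absolute.
photo_urls = srcs.compact
                 .select { |src| src =~ /\.jpe?g\z/i }
                 .map { |src| URI.join(BASE, src).to_s }

puts photo_urls
```

URI.join leaves URLs that are already absolute untouched, so mixed lists
like this one are fine.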

I recall reading somewhere that nokogiri has better XPath support than
Hpricot, so check it out.

Jesus.

Miroslaw_Niegowski · 27 January 2009 08:38

Try Mechanize.
It's easy:

require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://www.istockphoto.com/file_browse.php')
page.links.text(/jpg/)
...

Patrick_L · 27 January 2009 23:41

That's great, or it sounds great. Is there any documentation aside from
blog posts and this: http://mechanize.rubyforge.org/mechanize/ ? What
did you use to learn it?


Tsunami_Script · 27 January 2009 23:45

Mechanize is very easy and intuitive. You could basically learn to use
it just by playing with it in irb. Combine that with reading the docs,
and you're good to go.
