I've just checked, and the code with pragmaticprogrammer.com (you probably made
a typo) returns a 200 OK, and the program works as expected. The typo could be
the cause: programaticprogrammer.com doesn't resolve for me, and you're getting
a 302 (a redirect to another page), so I don't know exactly what's happening.
Take a look at the HTTP response codes if the problem wasn't the URL.
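For what it's worth, here is one way to follow a 302 by hand with Net::HTTP. This is only a rough sketch, and the URL in the usage comment is just my guess at the address the book intended:

```ruby
require "net/http"
require "uri"

# Rough sketch: follow 3xx redirects manually with Net::HTTP.
# The limit argument guards against redirect loops.
def fetch(url, limit = 5)
  raise "too many redirects" if limit.zero?
  response = Net::HTTP.get_response(URI.parse(url))
  if response.is_a?(Net::HTTPRedirection)
    # A 3xx response carries a Location header saying where the page moved.
    # (Location can be relative in practice; this sketch assumes absolute.)
    fetch(response["location"], limit - 1)
  else
    response
  end
end

# Usage (hostname is my guess at the one the book meant):
# response = fetch("http://www.pragmaticprogrammer.com/index.html")
# puts response.code
```

After following the redirect, response.message should be "OK" and the original scan over response.body should print something.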
Regards.
Diego.
On 9/13/07, 7stud -- <dolgun@excite.com> wrote:
The following is from "Programming Ruby 2nd" p.133:
----
require "net/http"
h = Net::HTTP.new("www.programaticprogrammer.com", 80)
response = h.get("/index.html")
if response.message == "OK"
puts response.body.scan(/<img src="(.*?)"/m).uniq
end
----
It doesn't work: nothing is printed. So, I modified it a little:
-----
require "net/http"
h = Net::HTTP.new("www.programaticprogrammer.com", 80)
response = h.get("/index.html")
puts response.message
puts response.code
if response.message == "OK"
puts "*"
puts response.body.scan(/<img src="(.*?)"/m).uniq
end
-----
and the output was:
Found
302
I clicked a link on their home page and tried to access the page that
was displayed, but I got the same result. What am I doing wrong?
--
Posted via http://www.ruby-forum.com/.
Just remember that with screen scraping you are anticipating a file served by a web server, and on top of that you are generally anticipating a very particular structure in that document. Web sites change frequently and without notice, and even the smallest change can break your scraper. So inspect the various pages of the sites you plan to scrape carefully, and write your scraper to check for what it expects and to not fail when something isn't found.
With some clever programming and a little knowledge of the site, you can make a simple but smart scraper. It will still be fairly fragile, though: HTML/XHTML is just too loose and human-language-like, full of ambiguity and implicit meaning that humans pick up easily but machines struggle with.
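To illustrate the "check for things and don't fail" idea, here is a small defensive version of the book's image scan. It is only a sketch: it guards against a missing body and tolerates pages with no matching tags, and the sample HTML is made up:

```ruby
# Defensive extraction: return an empty list instead of blowing up
# when the body is nil or contains no <img> tags at all.
def image_sources(html)
  return [] unless html
  # Allow attributes (e.g. alt=) to appear before src, unlike the
  # book's stricter /<img src="(.*?)"/ pattern.
  html.scan(/<img[^>]+src="(.*?)"/m).flatten.uniq
end

html = '<p><img src="a.png"><img src="a.png"><img alt="x" src="b.png"></p>'
puts image_sources(html)          # a.png and b.png, duplicates removed
puts image_sources(nil).inspect   # []
```

The guard clauses are the point: when the site changes and the pattern stops matching, the scraper degrades to an empty result you can check for, rather than raising mid-run.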