Problem fetching web-page

I'm not sure if this is the best place to ask this, but I hope someone
will be able to help, or at least point me somewhere else.

I've written a screen-scraper (in Perl) for digg.com. It uses
HTTP::Lite to retrieve the page and regexps to parse the information. It
works, but I'd like to create a Ruby version to help me learn Ruby.

Here is the code I'm trying to use:

require 'net/http'
require 'uri'

Net::HTTP.start( 'www.digg.com', 80 ) do |http|
    print( http.get( '/' ).body )
end

If I use this to get another site (e.g. slashdot.org), it returns all the
HTML, as expected. With digg.com, I get this:

<BR clear="all">
<HR noshade size="1px">
<ADDRESS>
Generated Mon, 28 Nov 2005 20:22:05 GMT by Prolexic.com (SI2LON1/2.0)
</ADDRESS>
</BODY></HTML>

That looks like (I'm guessing) some kind of return message from a
load-balancer or other proxy. I've tried this from 3 different systems
(which use different ISPs) so I don't think it's my system.

Does anyone have any ideas about this? Why does the Perl code work,
but not the Ruby? Is there a fix?

Using Ruby 1.8.3 under Linux, also tried it with Ruby 1.8.2 on Mac OS X.

TIA

--
James M

Using open-uri, this is what I get:

irb(main):001:0> require 'open-uri'
=> true
irb(main):002:0> open('http://www.digg.com').read
OpenURI::HTTPError: 403 Forbidden
        from /usr/local/lib/ruby/1.8/open-uri.rb:574:in `proxy_open'
        from /usr/local/lib/ruby/1.8/open-uri.rb:525:in `direct_open'
        from /usr/local/lib/ruby/1.8/open-uri.rb:169:in `open_loop'
        from /usr/local/lib/ruby/1.8/open-uri.rb:164:in `catch'
        from /usr/local/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
        from /usr/local/lib/ruby/1.8/open-uri.rb:134:in `open_uri'
        from /usr/local/lib/ruby/1.8/open-uri.rb:424:in `open'
        from /usr/local/lib/ruby/1.8/open-uri.rb:85:in `open'
        from (irb):2

On 11/28/05, James Mulholland <james.mulholland@gmail.com> wrote:

That looks like (I'm guessing) some kind of return message from a
load-balancer or other proxy. I've tried this from 3 different systems
(which use different ISPs) so I don't think it's my system.

Gregory Brown wrote:

Using open-uri, this is what I get:

When I launched the Web 2.0 Validator (web2.0validator.com), I pointed it at digg.com (among other sites), and digg.com rejected the request.

I changed the user agent sent in the request headers, and the requests were fine after that.

James Britt

--

http://www.ruby-doc.org - Ruby Help & Documentation
Ruby Code & Style - Ruby Code & Style: Writers wanted
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools

> I changed the user agent sent in the request headers, and the requests
> were fine after that.

Precisely that -- thanks! It seems that changing the UA to anything (I
chose the well-known "foobar" browser for my first test :-) ) will do the
trick:

print( http.get( '/', "User-Agent" => "foobar" ).body )

Thanks again.

--
James M
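
For anyone finding this thread later, here is a rough sketch of the whole fix in one place, assuming the server only objects to Net::HTTP's default User-Agent header. The helper names build_request and fetch are made up for this example, "foobar" is just the placeholder value from the test above, and it sticks to plain HTTP as in the original code.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a GET request that carries an explicit
# User-Agent header instead of the library default.
def build_request(url, user_agent = 'foobar')
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['User-Agent'] = user_agent
  [uri, request]
end

# Hypothetical helper: send the request and return the response object,
# so the caller can look at the status code as well as the body.
def fetch(url, user_agent = 'foobar')
  uri, request = build_request(url, user_agent)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.request(request)
  end
end

# Example use (performs a real network request, so it is left commented out):
#   response = fetch('http://www.digg.com/')
#   puts "#{response.code} #{response.message}"
#   puts response.body
```

Printing response.code along with the body should make a rejection like the 403 that open-uri reported visible straight away, rather than leaving only an error page to puzzle over.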
