Youtube...urgent, please help

Hi,

I'm new to ruby and my co. has given me an assignment in ruby. It is
regarding html extraction. It works fine except for some sites like
http://www.youtube.com and http://www.gmail.com, where I get the errors
'400 Bad Request' and 'getaddrinfo: Name or service not known
(SocketError)' respectively. I have read that this may be because the
URL is being redirected, but I'm not sure about that. My code for HTML
extraction is:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'dbi'

puts "Enter domain name :"
domain = gets
# concatenating 'http://www.' with the domain to open the page
url = "http://www." + domain
document = open(url)
#getting the original url of the site
url2 = document.base_uri.to_s

Can anybody please help? It is urgent. I'll be really grateful to anyone
who replies.

Regards,
Arun Kumar

Attachments:
http://www.ruby-forum.com/attachment/3450/htmlParse.rb

--
Posted via http://www.ruby-forum.com/.

Arun Kumar wrote:

Hi,

I'm new to ruby and my co. has given me an assignment in ruby. It is
regarding html extraction.

You probably want Mechanize.

domain = gets
#concatinating 'http://www.' with the url to open the page
url = "http://www."+domain
  
Take a look at that URL -- I'd say you don't need 'www' in that.

But I'm guessing what's hurting is the newline at the end of it.

Quick fix:

domain = gets.chomp
url = "http://#{domain}"
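
To make the newline point concrete, here is a minimal sketch. `build_url` is a hypothetical helper name (it is not in the original script), and the string `raw` just simulates what `gets` hands back after you type a domain and press Enter. It only demonstrates the string handling; whether that alone cures the '400 Bad Request' and SocketError is not something this sketch can confirm.

```ruby
require 'uri'

# Hypothetical helper (not from the original script): turn raw keyboard
# input into a URL string without stray whitespace.
def build_url(raw_input)
  domain = raw_input.strip                    # drops the trailing "\n" left by gets
  domain = domain.sub(%r{\Ahttps?://}i, '')   # tolerate a pasted "http://" prefix
  "http://#{domain}"
end

raw = "youtube.com\n"                 # simulated result of gets
puts(("http://www." + raw).inspect)   # shows that the "\n" is still in the URL
url = build_url(raw)
puts url.inspect                      # the cleaned-up URL string
puts URI.parse(url).host              # the host name the request would go to
```

Using `strip` rather than `chomp` is just a choice made here so that accidental leading or trailing spaces are removed as well.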

David Masover wrote:

Quick fix:

domain = gets.chomp
url = "http://#{domain}"

Sorry to say David, I tried that but the same error is producing. Is it
because i've not set the user agent. Can u please tell me how to set the
user_agent for mozilla.
Thanks for ur immediate reply

Sorry to say David, I tried that but the same error is producing. Is it
because i've not set the user agent. Can u please tell me how to set the
user_agent for mozilla.

http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html has some
examples setting the user agent. Google around and see what the
mozilla user agent should be -
http://www.user-agents.org/index.shtml?moz has an extensive list, for
instance.

Thanks for ur immediate reply

Don't do that, it's annoying.

martin

On Tue, Mar 17, 2009 at 11:28 AM, Arun Kumar <arunkumar@innovaturelabs.com> wrote:

Martin DeMello wrote:

On Tue, Mar 17, 2009 at 11:28 AM, Arun Kumar <arunkumar@innovaturelabs.com> wrote:

Sorry to say David, I tried that but the same error is producing. Is it
because i've not set the user agent. Can u please tell me how to set the
user_agent for mozilla.

http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html has some
examples setting the user agent. Google around and see what the
mozilla user agent should be -
http://www.user-agents.org/index.shtml?moz has an extensive list, for
instance.

Thanks for ur immediate reply

Don't do that, it's annoying.

martin

Can I use user agents with Hpricot, or can they be used only with
Mechanize? I've found a user agent for Mozilla:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But it still shows the same error.

I found this:

http://schf.uc.org/articles/2007/02/14/scraping-gmail-with-mechanize-and-hpricot

It scrapes Gmail. If my memory doesn't fail, that is one of the sites
giving you problems.

Cheers,

Serabe

2009/3/17 Arun Kumar <arunkumar@innovaturelabs.com>:

Can I use user agents with Hpricot, or can they be used only with
Mechanize? I've found a user agent for Mozilla:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But it still shows the same error.

--
http://www.serabe.com

Hpricot is an html parser, I don't think it concerns itself with
actually fetching the page. Use mechanize for that.

martin
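
To illustrate the split between fetching and parsing: if you stay with open-uri for the fetch, the User-Agent is a request header that the fetching layer sends, and Hpricot only ever sees the HTML string afterwards. A rough sketch follows. `request_headers` and `fetch_html` are hypothetical helper names, the User-Agent is simply the string quoted earlier in this thread, and `URI.open` is the spelling in current Ruby (in 2009-era Ruby the call was plain `open`). Nothing in this thread confirms that changing the header fixes the errors.

```ruby
require 'open-uri'

# User-Agent string quoted earlier in this thread (an MSIE 6 identifier).
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; ' \
             '.NET CLR 1.1.4322; .NET CLR 2.0.50727)'

# Hypothetical helper: the extra request headers handed to open-uri.
def request_headers(user_agent = USER_AGENT)
  { 'User-Agent' => user_agent }
end

# Hypothetical helper: fetch the raw HTML only; parsing is a separate step.
# open-uri treats string keys in the options hash as HTTP header fields.
def fetch_html(url, headers = request_headers)
  URI.open(url, headers) { |io| io.read }
end

# Usage (needs network access, so it is left commented out here):
#   html = fetch_html('http://youtube.com')
#   doc  = Hpricot(html)   # the parser only receives the fetched string
```

With Mechanize the same idea applies, except that the agent object carries the User-Agent setting; see the examples page linked earlier in the thread.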

On Tue, Mar 17, 2009 at 11:55 AM, Arun Kumar <arunkumar@innovaturelabs.com> wrote:

Can I use user agents with Hpricot, or can they be used only with
Mechanize?

Martin DeMello wrote:

On Tue, Mar 17, 2009 at 11:55 AM, Arun Kumar <arunkumar@innovaturelabs.com> wrote:

Can I use user agents with Hpricot, or can they be used only with
Mechanize?
    
Hpricot is an html parser, I don't think it concerns itself with
actually fetching the page. Use mechanize for that.
  
What's more, mechanize doesn't even use hpricot anymore -- it uses nokogiri.