Youtube...urgent, please help

Arun_Kumar2 · 17 March 2009 04:42

Hi,

I'm new to Ruby and my company has given me an assignment in Ruby. It is
regarding HTML extraction. It works fine except for some sites like
http://www.youtube.com and http://www.gmail.com, where I get the errors
'400 Bad Request' and 'getaddrinfo: Name or service not known
(SocketError)' respectively. I have heard that this may be because the
URL is being redirected, but I'm not sure about it. My code for HTML
extraction is:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'dbi'

puts "Enter domain name :"
domain = gets
# concatenating 'http://www.' with the url to open the page
url = "http://www." + domain
document = open(url)
#getting the original url of the site
url2 = document.base_uri.to_s

Can anybody please help? It is urgent. I'll be really grateful to those
who reply.

Regards,
Arun Kumar

Attachments:
http://www.ruby-forum.com/attachment/3450/htmlParse.rb

--
Posted via http://www.ruby-forum.com/.

David_Masover · 17 March 2009 05:42

Arun Kumar wrote:

Hi,

I'm new to Ruby and my company has given me an assignment in Ruby. It is
regarding HTML extraction.

You probably want Mechanize.

domain = gets
# concatenating 'http://www.' with the url to open the page
url = "http://www." + domain

Take a look at that URL -- I'd say you don't need 'www' in that.

But I'm guessing what's hurting is the newline at the end of it.

Quick fix:

domain = gets.chomp
url = "http://#{domain}"
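
A slightly fuller sketch of the same idea, for illustration only. `gets` keeps the trailing newline, so the URL ends in "\n" unless it is stripped first. The helper name `build_url` is made up for this example and is not part of the original script.

```ruby
require 'uri'

# Hypothetical helper: turn raw user input into a clean URL string.
def build_url(raw_input)
  # strip removes the trailing "\n" that gets leaves on the input
  domain = raw_input.strip
  # only prepend a scheme if the user did not type one
  domain = "http://#{domain}" unless domain =~ %r{\Ahttps?://}i
  # URI.parse raises URI::InvalidURIError if the result is malformed
  URI.parse(domain).to_s
end

puts build_url("youtube.com\n") # prints http://youtube.com
```

In the real script the argument would come from `gets` rather than a literal string.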

Arun_Kumar2 · 17 March 2009 05:58

Sorry to say, David, I tried that but I still get the same error. Is it
because I haven't set the user agent? Can you please tell me how to set
the user_agent for Mozilla?
Thanks for ur immediate reply

Martin_DeMello · 17 March 2009 06:08

Sorry to say, David, I tried that but I still get the same error. Is it
because I haven't set the user agent? Can you please tell me how to set
the user_agent for Mozilla?

http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html has some
examples of setting the user agent. Google around and see what the
Mozilla user agent should be -
http://www.user-agents.org/index.shtml?moz has an extensive list, for
instance.
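
For anyone who wants to see the header being set without Mechanize, here is a rough sketch using only Net::HTTP from the standard library. The helper names `build_request` and `fetch` are invented for this example, and the user-agent string is just the one quoted elsewhere in this thread, not a recommendation.

```ruby
require 'net/http'
require 'uri'

# Example value only; any browser-style string is set the same way.
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; ' \
             '.NET CLR 1.1.4322; .NET CLR 2.0.50727)'

# Hypothetical helper: build a GET request carrying a custom User-Agent.
def build_request(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['User-Agent'] = user_agent
  [uri, request]
end

# Hypothetical helper: send the request and return the response object.
def fetch(url, user_agent)
  uri, request = build_request(url, user_agent)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

# Usage (needs network access):
#   response = fetch('http://www.youtube.com', USER_AGENT)
#   puts response.code
```

open-uri also accepts request headers as options, so passing `"User-Agent" => USER_AGENT` as the second argument to `open` should have a similar effect.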

Thanks for ur immediate reply

Don't do that, it's annoying.

martin

Arun_Kumar2 · 17 March 2009 06:25

Can I use user agents with Hpricot, or can they be used only with
Mechanize? I've found a user agent for Mozilla:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But it is still showing the same error.

Serabe · 17 March 2009 06:35

I found this:

http://schf.uc.org/articles/2007/02/14/scraping-gmail-with-mechanize-and-hpricot

It scrapes Gmail. If my memory serves, that is one of the sites giving
you problems.

Cheers,

Serabe

--
http://www.serabe.com

Martin_DeMello · 17 March 2009 06:53

Hpricot is an html parser, I don't think it concerns itself with
actually fetching the page. Use mechanize for that.
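
To make the split concrete: the fetch step, including the redirects mentioned in the original question, can be sketched with the standard library alone, and the returned HTML string is then what gets handed to the parser. This is only an illustration under that assumption; `redirect_target` and `fetch_html` are made-up names, and Mechanize handles all of this for you.

```ruby
require 'net/http'
require 'uri'

# Resolve a Location header (absolute or relative) against the current URL.
def redirect_target(current_url, location)
  URI.join(current_url, location).to_s
end

# Fetch a page body, following at most `limit` redirects.
def fetch_html(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https') do |http|
    http.get(uri.request_uri)
  end

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    fetch_html(redirect_target(url, response['location']), limit - 1)
  else
    response.value # raises an exception for any other non-2xx response
  end
end

# Usage (needs network access):
#   html = fetch_html('http://www.youtube.com')
#   # html is a plain String that can be passed to an HTML parser
```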

martin

David_Masover · 17 March 2009 19:23

What's more, mechanize doesn't even use hpricot anymore -- it uses nokogiri.
