[noob] Parsing problems using https and redirects

Hello list,
   I have to develop a simple script to parse some parts of a web site and I
thought it could be a good opportunity to start trying Ruby.
   I found that there are two network libraries that I could supposedly use
to retrieve the contents of the web site: open-uri and net-http.

*First problem*
   This web site can only be accessed over https and uses a self-signed
certificate. So far this has made it impossible for me to access the
contents of the web site.
   Simple examples from the Hpricot html parsing library like this one:

require 'hpricot'
require 'open-uri'
doc = Hpricot(open("https://xxxxxx"))

   will not work, because the call to open fails on the https URL.

*Second problem*
   I also need to know how to handle redirection and cookies. But to be
fair, I can still do some further reading myself on these issues.

   Thank you very much.

Quoth Ramiro Diaz Trepat:
> *First problem*
>    This web site is accessed only with https and has a self issued
> certificate.
>
> *Second problem*
>    I need to know also how to handle redirection and cookies.

2) Look at mechanize.
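
For anyone who wants to see what a library like mechanize takes care of, here is a rough sketch of redirect and cookie handling using only the standard Net::HTTP, not mechanize itself. The names fetch_following_redirects and REAL_FETCH are made up for this example, and the cookie handling is deliberately naive (it ignores domain, path and expiry), so treat it as an illustration rather than a replacement.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Follow redirects starting at +url+, for at most +limit+ requests, and
# send back any cookies the server set along the way.
# +fetch+ is a block taking (uri, cookie_header) and returning a
# Net::HTTPResponse; passing it in keeps the network code swappable.
def fetch_following_redirects(url, limit = 5, cookies = {}, &fetch)
  raise ArgumentError, 'too many redirects' if limit <= 0

  uri = URI.parse(url)
  cookie_header = cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  response = fetch.call(uri, cookie_header)

  # Naive cookie jar: keep only the name=value part of each Set-Cookie line.
  (response.get_fields('set-cookie') || []).each do |line|
    name, value = line.split(';').first.split('=', 2)
    cookies[name.strip] = value.to_s.strip
  end

  if response.is_a?(Net::HTTPRedirection) && response['location']
    # Location may be relative, so resolve it against the current URL.
    next_url = URI.join(url, response['location']).to_s
    fetch_following_redirects(next_url, limit - 1, cookies, &fetch)
  else
    response
  end
end

# A fetcher that really talks to the network (hypothetical helper).
REAL_FETCH = lambda do |uri, cookie_header|
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  headers = cookie_header.empty? ? {} : { 'Cookie' => cookie_header }
  http.request_get(uri.request_uri, headers)
end

# Usage (performs real requests):
#   response = fetch_following_redirects('http://www.google.com/', 5, {}, &REAL_FETCH)
#   puts response.body
```

Mechanize does all of this for you, plus forms and a proper cookie jar, which is probably why it is the easier route for a scraping script.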

1) Look at http-access2 (or whatever it's been renamed to).
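
If you would rather stay with the standard library for the https part, a minimal sketch with Net::HTTP (not http-access2) could look like the following. It assumes you trust the host, since it switches certificate verification off entirely so that the self-signed certificate is accepted. The helper names insecure_client_for and fetch_body are invented for this example. On Ruby 1.8 you would also need require 'net/https'.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Build a Net::HTTP client for +url+. For https URLs, TLS is enabled and
# certificate verification is disabled, so a self-signed certificate is
# accepted. This removes protection against impostor servers.
def insecure_client_for(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  if uri.scheme == 'https'
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  end
  http
end

# Fetch the body of a page as a string (no redirect handling here).
def fetch_body(url)
  uri = URI.parse(url)
  insecure_client_for(url).request_get(uri.request_uri).body
end

# Usage (performs a real request); Hpricot accepts the HTML as a string:
#   doc = Hpricot(fetch_body("https://xxxxxx"))
```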

Regards,

--
Konrad Meyer <konrad@tylerc.org> http://konrad.sobertillnoon.com/

Thank you very much Konrad, it seems that I am on my way now.
The only weird thing that has happened with Mechanize is that it all works
perfectly on my Linux machine but it doesn't on my Mac/Leopard.
Both have Ruby 1.8.6.

On the mac I get the following error while trying to execute the first
Mechanize example:

./mechanize.rb:4: uninitialized constant WWW (NameError)
    from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:27:in `gem_original_require'
    from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:27:in `require'
    from goog.rb:2

and the code is the first example from mechanize:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("http://www.google.com/")
search_form = page.forms.with.name("f").first
search_form.q = "Hello"
search_results = agent.submit(search_form)
puts search_results.body

I really don't know why this constant is uninitialized or how I could
initialize it. Besides, it worries me that on Linux, after installing the
mechanize gem, everything worked out of the box.

Thanks again
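
One thing the trace above hints at: the error is raised in ./mechanize.rb, line 4, and is reached from goog.rb through require. That would be consistent with a script named mechanize.rb sitting in the current directory on the Mac and being loaded instead of the gem, though this is a guess from the trace and not something confirmed in the thread. A small sketch to check which plain .rb files a require could pick up (require_candidates is a made-up helper, and it does not look inside gems that have not been activated yet):

```ruby
# List every "<name>.rb" file found on the given load path, in search
# order. A hit in the current directory suggests that a local script is
# shadowing the library of the same name.
def require_candidates(name, load_path = $LOAD_PATH)
  load_path.map { |dir| File.join(dir.to_s, "#{name}.rb") }
           .select { |path| File.file?(path) }
end

# Usage, run from the directory that holds the failing script:
#   puts require_candidates('mechanize')
# On Ruby 1.8 the load path includes "." by default; on newer Rubies
# pass ['.'] + $LOAD_PATH to include the current directory as well.
```

If a local mechanize.rb shows up, renaming that script should let require find the gem again.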

On Dec 14, 2007 10:14 PM, Konrad Meyer <konrad@tylerc.org> wrote:
> 2) Look at mechanize.
>
> 1) Look at http-access2 (or whatever it's been renamed to).