I have to collect URLs from HTML files and transform the relative URLs to
absolute ones. I have to handle, for example, URLs beginning with '../' and
'./', which is kind of annoying (my head hurts from the debugging process).

Currently I test every case using regexps. Maybe you can help me find
something faster?
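To give an idea, here is a simplified sketch of the sort of case-by-case handling I mean (not my real code; it already breaks on plenty of edge cases):

  require 'uri'

  def naive_absolutise(base, href)
    case href
    when %r{\Ahttps?://}              # already absolute
      href
    when %r{\A/}                      # root-relative: keep scheme and host
      u = URI.parse(base)
      "#{u.scheme}://#{u.host}#{href}"
    else                              # './foo', '../foo', plain 'foo'
      dir  = base.sub(%r{[^/]*\z}, '')     # strip the filename from the base
      href = href.sub(%r{\A\./}, '')       # drop a leading './'
      while href.sub!(%r{\A\.\./}, '')     # each '../' pops one directory
        dir = dir.sub(%r{[^/]+/\z}, '')
      end
      dir + href
    end
  end

  naive_absolutise('http://example.com/a/b/page.html', '../c/pic.png')
  # => "http://example.com/a/c/pic.png"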
If you can shoe-horn your problem into Mechanize, it's got a private method WWW::Mechanize#to_absolute_uri, which does precisely this. Don't know if that's of any use, but it might be worthwhile looking at how it's implemented at least.
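Untested sketch of what I mean (to_absolute_uri is private, so you'd have to go through send, and the exact argument list may differ between Mechanize versions):

  require 'rubygems'
  require 'mechanize'

  agent = WWW::Mechanize.new
  page  = agent.get('http://example.com/articles/index.html')

  page.links.each do |link|
    # to_absolute_uri is private; the arguments here are a guess
    puts agent.send(:to_absolute_uri, link.href, page)
  end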
Not sure, but perhaps the standard library will do what you want?
Assuming that this_page is the URL of the page you're scraping links from and '../qux' is the link URL you want to absolutise:
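(untested sketch; the example URL is only for illustration)

  require 'uri'

  this_page = URI.parse('http://example.com/foo/bar/baz.html')
  puts this_page.merge('../qux')    # => http://example.com/foo/qux

URI#merge (and URI.join) do the '../' and './' resolution for you, so there's no need for hand-rolled regexps.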
+1 for Alex's solution.
I have tried to implement this in scRUBYt! (I did not know about Mechanize's to_absolute_uri back then) and, well, failed (fortunately I have since discovered it in Mechanize). My solution worked for 99% of the cases, but the rest was a total PITA to hunt down. I believe Aaron and the Mechanize community have already done this, so why reinvent the wheel?
Believe me, if you can shoe-horn it into Mechanize as Alex suggested, do it - it will save you a lot of time, nerves, headaches, money, whatnot.