I have to collect URLs from HTML files and transform the relative URLs to
absolute ones. I have to handle, for example, URLs beginning with '../' and
'./', which is kind of annoying (my head hurts from the debugging process).

Currently I test every case using regexps. Maybe you can help me find
something faster?
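To give an idea, here is a simplified sketch of the sort of case-by-case handling I mean (not my real code; it already breaks on plenty of edge cases):

  require 'uri'

  def naive_absolutise(base, href)
    case href
    when %r{\Ahttps?://}              # already absolute
      href
    when %r{\A/}                      # root-relative: keep scheme and host
      u = URI.parse(base)
      "#{u.scheme}://#{u.host}#{href}"
    else                              # './foo', '../foo', plain 'foo'
      dir  = base.sub(%r{[^/]*\z}, '')     # strip the filename from the base
      href = href.sub(%r{\A\./}, '')       # drop a leading './'
      while href.sub!(%r{\A\.\./}, '')     # each '../' pops one directory
        dir = dir.sub(%r{[^/]+/\z}, '')
      end
      dir + href
    end
  end

  naive_absolutise('http://example.com/a/b/page.html', '../c/pic.png')
  # => "http://example.com/a/c/pic.png"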
If you can shoe-horn your problem into Mechanize, it's got a private method WWW::Mechanize#to_absolute_uri, which does precisely this. Don't know if that's of any use, but it might be worthwhile looking at how it's implemented at least.
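Untested sketch of what I mean (to_absolute_uri is private, so you'd have to go through send, and the exact argument list may differ between Mechanize versions):

  require 'rubygems'
  require 'mechanize'

  agent = WWW::Mechanize.new
  page  = agent.get('http://example.com/articles/index.html')

  page.links.each do |link|
    # to_absolute_uri is private; the arguments here are a guess
    puts agent.send(:to_absolute_uri, link.href, page)
  end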
Not sure, but perhaps the standard library will do what you want?
Assuming that this_page is the URL of the page you're scraping links from and '../qux' is the link URL you want to absolutise:
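(untested sketch; the example URL is only for illustration)

  require 'uri'

  this_page = URI.parse('http://example.com/foo/bar/baz.html')
  puts this_page.merge('../qux')    # => http://example.com/foo/qux

URI#merge (and URI.join) do the '../' and './' resolution for you, so there's no need for hand-rolled regexps.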
+1 for Alex's solution.
I have tried to implement this in scRUBYt! (I did not know about Mechanize's to_absolute_uri back then) and, well, failed (fortunately I have since discovered it in Mechanize). My solution worked for 99% of the cases, but the rest was a total PITA to hunt down. I believe Aaron and the Mechanize community have already done this, so why reinvent the wheel?
Believe me, if you can shoe-horn it into Mechanize as Alex suggested, do it - it will save you a lot of time, nerves, headaches, money, whatnot.