Some time ago I wrote a Ruby program that follows web links and makes
it easy to retrieve interesting files. This little program works well.
To extract the links from a downloaded web page I use URI.extract, and
I’ve noticed that URI.extract misses a lot of links. In fact,
URI.extract doesn’t seem to understand (resolve?) relative links (for
example link). Am I wrong? If not, what would you advise so that I can
be sure to retrieve all the relative links?
I think I remember there was a method URI.join which could join an absolute
URI and a relative one. Or was it URL.join?
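For what it's worth, the method is URI.join; Ruby's standard library has no URL module. A minimal sketch of how it resolves a relative reference against the page it was found on. The example.com address is only a placeholder, not something from the original program:

```ruby
require 'uri'

# Hypothetical address of the page the links were found on.
base = "http://example.com/docs/index.html"

# URI.join returns a URI object; to_s turns it back into a string.
root_relative    = URI.join(base, "/help").to_s
page_relative    = URI.join(base, "page2.html").to_s
already_absolute = URI.join(base, "http://google.com/help").to_s

puts root_relative
puts page_relative
puts already_absolute
```

An absolute link passes through unchanged, so the same call can be applied to every link without checking its form first.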
IIRC, URI.extract(str) just scans plain text for URIs. So links in
HTML would have to be absolute, not relative, i.e.
“http://google.com/help”, not just “/help”.
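A small sketch of that behaviour, assuming a made-up HTML fragment with one absolute and one relative link. Note that recent Ruby versions mark URI.extract as obsolete, although it is still available:

```ruby
require 'uri'

# Hypothetical page fragment: one absolute link, one relative link.
html = '<a href="http://google.com/help">absolute</a> <a href="/help">relative</a>'

# Restricting the schemes keeps the scan to http and https URIs.
found = URI.extract(html, %w[http https])

puts found.inspect
```

Only the absolute address comes back; the relative "/help" is skipped because, taken on its own, nothing marks it as a URI.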
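require 'uri'

# Assumed here: uris holds the link strings taken from the page, and
# uri holds the URL of the page they came from.
uris.map do |item|
  case item
  when %r{^/}    # it's relative to the site root
    "http://" + URI.parse(uri).host + item
  when /^http:/  # it's absolute
    item
  else           # it's relative to the current page
    # merge the two uris here. This is left as an exercise
  end
end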
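One possible way to finish the exercise is to hand every case to URI.join instead of branching on the link's form. This is only a rough sketch under some assumptions: the links are pulled from href attributes with a simple regular expression (a real HTML parser would be more robust), and the helper name absolute_links, the sample HTML, and the example.com address are all invented for illustration.

```ruby
require 'uri'

# Hypothetical helper: returns every quoted href in +html+ as an absolute
# URL string, resolved against +page_url+ (where the page was downloaded from).
def absolute_links(html, page_url)
  # Naive attribute scan; it only sees href values wrapped in quotes.
  hrefs = html.scan(/href\s*=\s*["']([^"']+)["']/i).flatten
  resolved = hrefs.map do |href|
    begin
      URI.join(page_url, href).to_s
    rescue URI::Error
      nil # skip anything the URI library cannot parse
    end
  end
  resolved.compact
end

# Made-up sample page covering the three cases from the snippet above.
html = <<~HTML
  <a href="http://google.com/help">absolute</a>
  <a href="/help">relative to the site root</a>
  <a href="page2.html">relative to the current page</a>
  <a href="../images/logo.png">relative, one directory up</a>
HTML

links = absolute_links(html, "http://example.com/docs/index.html")
puts links
```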
HTH,
Mark
On May 25, 2004, at 1:43 PM, Nicolas Cavigneaux wrote:
IIRC, URI.extract(str) just scans plain text for URIs. So links in HTML
would have to be absolute, not relative, i.e. “http://google.com/help”, not
just “/help”.
OK, that’s what I was thinking.
else # it's relative to the current page
# merge the two uris here. This is left as an exercise ;)
end
Heh heh. Thank you for your help and for this little exercise.
Bye.
On Wed, 26 May 2004 07:22:52 +0900, Mark Hubbart wrote: