Some problems with URI.extract?

Nicolas_Cavigneaux · 25 May 2004 20:43

Hello,

Some time ago I wrote a Ruby program that follows web links and makes it
easy to retrieve interesting files. It works well. To extract the links
from a downloaded web page I use URI.extract, and I’ve noticed that
URI.extract misses a lot of links. In fact, URI.extract doesn’t seem to
understand (resolve?) relative links (for example, a link with a relative
href). Am I wrong? If I’m not, what would you advise so that I can be sure
to retrieve all the relative links?

Thank you and good evening.

–
Nicolas Cavigneaux | GPG KeyID : F0954C41
bounga@altern.org | http://bounga.ath.cx

Simon_Strandgaard1 · 25 May 2004 20:55

I guess you are using Ruby 1.9 from CVS?

I just read in Oniguruma’s ChangeLog:
2004/05/25: [bug] (thanks Masahiro Sakai) [ruby-dev:23560]
ruby -ruri -ve 'URI::ABS_URI =~ "http://example.org/Andr\xC3\xA9"'
nested STK_REPEAT type stack can’t backtrack repeat_stk.
add OP_REPEAT_INC_SG and OP_REPEAT_INC_NG_SG.

I have no idea what that problem was, only that it was URI-related.

Does it work on Ruby 1.8.1/2?

–
Simon Strandgaard

Robert · 25 May 2004 21:04

I think I remember there was a method URI.join which could join an absolute
URI and a relative one. Or was it URL.join?
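
For reference, Ruby's standard uri library does provide URI.join. The following is only a small sketch of how it behaves; the page address and the link values are made up for illustration.

```ruby
require 'uri'

# Hypothetical page location, chosen only for this example.
base = 'http://example.org/docs/index.html'

# A link relative to the site root.
absolute_from_root = URI.join(base, '/help').to_s
# A link relative to the directory of the current page.
absolute_from_dir = URI.join(base, 'images/logo.png').to_s
# A link that is already absolute is returned unchanged.
already_absolute = URI.join(base, 'http://example.com/other').to_s

puts absolute_from_root # http://example.org/help
puts absolute_from_dir  # http://example.org/docs/images/logo.png
puts already_absolute   # http://example.com/other
```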

robert

Mark_Hubbart · 25 May 2004 22:22

IIRC, URI.extract(str) just scans plain text for URIs. So links in HTML
would have to be absolute, not relative, i.e.
“http://google.com/help”, not just “/help”.
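
A small sketch of that behaviour, assuming a made-up HTML fragment with one absolute and one relative link. Restricting the scan to the http and https schemes is a choice made here to keep the result predictable. Only the absolute link should come back.

```ruby
require 'uri'

# Hypothetical HTML fragment: one absolute link, one relative link.
html = '<a href="http://example.org/help">absolute</a> ' \
       '<a href="/help">relative</a>'

# Look only for http and https URIs in the text.
found = URI.extract(html, ['http', 'https'])

puts found.inspect
puts "relative link found? #{found.include?('/help')}"
```

Note that recent Ruby versions mark URI.extract as obsolete and print a warning in verbose mode.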

To get all the links out of HTML, you would probably need to create a
regular expression that finds all link-ish HTML attributes (href, src,
etc.), parse them to see what type of link they are, then construct a full
URI based on the page’s original location.

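A quick, incomplete, untested example:

  # open-uri is nice
  require 'open-uri'

  def get_URI_list(uri)
    # download the page at the uri passed
    # (URI.open on Ruby 2.5+; plain open(uri) on 1.8)
    page_data = URI.open(uri) { |f| f.read }
    # scan it for the contents of html
    # attributes that are usually links
    uris = page_data.scan(/(?:href|src)="([^"]*)"/).flatten
    # each entry is either absolute already or relative to
    # the current page; relative ones still need merging with uri
    uris
  end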
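
The remaining step, turning each scanned value into a full URI based on the page's location, might be sketched as follows. This is one possible approach, not the code from the original message: the helper name absolute_links, the sample HTML, and the page address are all made up, and a regular expression is only a rough stand-in for a real HTML parser.

```ruby
require 'uri'

# Hypothetical helper: collects href/src attribute values from an HTML
# string and resolves each one against the address of the page itself.
def absolute_links(html, page_uri)
  html.scan(/(?:href|src)\s*=\s*["']([^"']*)["']/i).flatten.map do |link|
    begin
      # URI.join leaves absolute links alone and resolves relative ones.
      URI.join(page_uri, link).to_s
    rescue URI::Error
      nil # skip values that are not valid URI references
    end
  end.compact.uniq
end

# Made-up sample page with absolute and relative links.
html = <<~HTML
  <a href="http://example.org/help">absolute</a>
  <a href="/about">root-relative</a>
  <img src="images/logo.png">
HTML

links = absolute_links(html, 'http://example.org/docs/index.html')
puts links
```

For the sample above this prints http://example.org/help, http://example.org/about and http://example.org/docs/images/logo.png, one per line.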

HTH,
Mark


Nicolas_Cavigneaux · 25 May 2004 21:04

No, I’m using Ruby 1.8.1.

On Wed, 26 May 2004 05:55:07 +0900, Simon Strandgaard wrote:

I guess you are using Ruby 1.9 from CVS ?

Nicolas_Cavigneaux · 28 May 2004 13:01

On Wed, 26 May 2004 07:22:52 +0900, Mark Hubbart wrote:

IIRC, URI.extract(str) just scans plain text for URIs. So, links in html
would have to be absolute, not relative, i.e. “http://google.com/help”, not
just “/help”.

OK, that’s what I thought.

 else # it's relative to the current page
   # merge the two uris here. This is left as an exercise ;)
 end

Heh, thank you for your help and for this little exercise.

Bye.