Alright, I think I may have stumbled upon a bug - correct me if I'm
wrong.
I just wrote a script to pull linked-to files of a certain type off of
webpages. Granted, I'm still pretty new to Ruby... this really has me
stumped, though. I've googled around and looked through all the docs on
the relevant classes, and I'm not getting anywhere with this.
I have a class that has two main ways of pulling the files. In one, you
give it an absolute URL to a page, and it searches through the page
looking for links ending with a given extension, builds a list, and
calls wget to fetch them. In the other, it goes through a page of links
to other pages and checks each one for the filetype.
When I pass it a page using the absolute way, it works fine. If I pass
it in the other way, I get
/usr/lib/ruby/1.8/uri/common.rb:432:in `split': bad URI(is not URI?):
"www.thisistheurlblahexample.example" (URI::InvalidURIError)
	from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'
Alright. So it's not a properly formed URI... or something. Thing is,
if I copy and paste the erroneous URL into the call for the absolute
method, it works fine. I figure I must be missing something critical
here - like data changing when it's passed from method to method
within a class? I don't know.
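In case it helps, I poked at URI.parse directly in irb to see what kinds of strings it rejects (the host name here is made up):

```ruby
require 'uri'

# A bare host with no scheme actually parses fine - it just comes
# back as a path-only URI, with no host set:
URI.parse("www.example.com")

# But a string that still carries quote characters (like the ones
# around an href attribute) raises the same error I'm seeing:
begin
  URI.parse('"www.example.com"')
rescue URI::InvalidURIError => e
  puts e.class   # URI::InvalidURIError
end
```

So it seems like it's not the missing scheme by itself that triggers the error.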
Here's what I think is the relevant code (sorry if it's shoddy; again,
I'm pretty new at Ruby):
require 'net/http'
require 'uri'

# if you want to grab multiple filetypes, make filetype the form of "(f1|f2|f3)"
class Grabber
  def initialize(url, filetype)
    @url = url
    @filetype = filetype
    @filelist = String.new
  end

  def linkCrawl
    page = Net::HTTP.get URI.parse(@url)
    page.each do |line|
      # check for a link
      if line =~ %r|<a\shref\s*="http.*">|i
        # make the list of files from that page
        # sorry about this, people. I know it's ugly. was trying things
        # here to see if it made a difference if I changed the string
        # beforehand instead of calling it inline. it didn't, and it
        # really shouldn't..... frustration
        line.gsub!(%r|<a\shref=|, "")
        line.gsub!(%r|^\s+|, "")
        print "line: #{line}\n"
        createList(line)
      end
    end
    print "filelist: #{@filelist}\n"
    exec "wget #{@filelist}"
  end

  def createList(url)
    page = Net::HTTP.get URI.parse(url)
    page.each do |line|
      # check for a link containing one of the filetypes
      if line =~ %r|.*<a\shref\s*=".*#{@filetype}".*>.*|i
        # strip the url of the filetype out of the html
        @filelist.concat "#{line.slice(%r|<a\shref=".*#{@filetype}"|i).gsub!(%r|<a href=|i, "")}\n"
      end
    end
  end

  def grabFiles
    createList(@url)
    exec "wget #{@filelist}"
  end
end
test = Grabber::new("http://urlgoeshere.com", "(f1|f2|f3)")
test.linkCrawl
##the one below here always works
#test = Grabber::new("http://urlgoeshere.com", "(f1|f2|f3)")
#test.grabFiles
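And for reference, one of the variants I tried for yanking just the URL out of the anchor tag, with a capture group instead of the chained gsub!s (the sample line here is made up):

```ruby
# Hypothetical sample line, like one matched in linkCrawl:
line = '  <a href="http://www.example.com/page.html">a page</a>'

# Capture only what's between the quotes instead of gsub!-ing
# pieces of the tag away:
url = line[/<a\s+href\s*=\s*"([^"]+)"/i, 1]
puts url   # http://www.example.com/page.html
```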
--
Posted via http://www.ruby-forum.com/.