Problem with URI.parse

Alright, I think I may have stumbled upon a bug, correct me if I'm
wrong.

I just wrote a script to pull linked-to files of a certain type off of
webpages. Granted, I'm still pretty new to Ruby... this really has me
stumped though. I've googled around and looked through all the docs on
the relevant classes, and I'm not getting anywhere with this.

I have a class that has two main ways of pulling the files - in one, you
give it an absolute URL to a page, and it searches through looking for
links ending with an extension, creates a list, and calls wget to get
them. In the other, it goes through a page of links to other pages and
checks each of those pages for the filetype.

When I pass it a page using the absolute way, it works fine. If I pass
it in the other way, I get

"usr/lib/ruby/1.8/uri/common.rb:432:in `split': bad URI(is not URI?):
"www.thisistheurlblahexample.example" (URI::InvalidURIError)
        from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'"

Alright. So it's not a properly formed URI, right... or something. Thing
is, if I copy and paste the erroneous URL into the call for the absolute
method, it works fine. I figure I must be missing something critical
here - like data changing when it's passed from function to function
within a class? I don't know.

Here's what I'm thinking is the relevant code: (sorry if it's shoddy,
again I am pretty new at Ruby)

require 'net/http'
require 'uri'

#if you want to grab multiple filetypes, make filetype of the form "(f1|f2|f3)"
class Grabber
  def initialize(url, filetype)
    @url = url
    @filetype = filetype
    @filelist = String::new
  end

  def linkCrawl
    page = Net::HTTP.get URI.parse("#{@url}")
    page.each do |line|
      #check for a link
      if line =~ %r|<a\shref\s*="http.*">|i
        #make the list of files from that page
        #sorry about this, people. I know it's ugly. was trying things here to
        #see if it made a difference if I changed the string beforehand instead
        #of calling it inline. it didn't, and it really shouldn't..... frustration
        line.gsub!(%r|<a\shref=|, "")
        line.gsub!(%r|^\s+|, "")
        print "line: #{line}\n"
        createList("#{line}")
      end
    end
    print "filelist: #{@filelist}\n"
    exec "wget #{@filelist}"
  end

  def createList(url)
    page = Net::HTTP.get URI.parse(url)
    page.each do |line|
      #check for a link containing one of the filetypes
      if line =~ %r|.*<a\shref\s*=".*#{@filetype}".*>.*|i
        #strip the url of the filetype out of the html
        @filelist.concat "#{line.slice(%r|<a\shref=".*#{@filetype}"|i).gsub!(%r|<a href=|i, "")}\n"
      end
    end
  end

  def grabFiles
    createList("#{@url}")
    exec "wget #{@filelist}"
  end
end

test = Grabber::new("http://urlgoeshere.com", "(f1|f2|f3)")
test.linkCrawl

##the one below here always works
#test = Grabber::new("http:urlgoeshere.com", "(f1|f2|f3)")
#test.grabFiles

···

--
Posted via http://www.ruby-forum.com/.

You really shouldn't be trying to parse HTML with regular expressions; there are a few libraries available for this. Instead of using an external program (wget), you can also download the URI in Ruby.

Trying your code I get:

     /uri/common.rb:432:in `split': bad URI(is not URI?): <A HREF="http://www.google.com/">here</A> (URI::InvalidURIError)

which seems reasonable.
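
The string handed to URI.parse still contains its HTML wrapper, and characters like '<', '>' and the double quote aren't legal in a URI, so the parse is rejected. A minimal sketch of the difference (assuming the stock 1.8 uri library; the URL is just an example):

require 'uri'

# A plain absolute URL parses fine.
p URI.parse('http://www.google.com/')

# The same URL still wrapped in its anchor tag does not, because '<', '>',
# '"' and the space are not valid URI characters.
begin
  URI.parse('<A HREF="http://www.google.com/">here</A>')
rescue URI::InvalidURIError => e
  puts e.message   # bad URI(is not URI?): <A HREF="http://www.google.com/">here</A>
end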

Here is an example using RubyfulSoup:

require 'open-uri'
require 'fileutils'
require 'uri'
require 'rubygems' # http://docs.rubygems.org/
require 'rubyful_soup' # Rubyful Soup: "The brush has got entangled in it!" -- sudo gem install rubyful_soup

class Grabber
   def initialize(uri, file_types = [])
     @uri = uri
     @file_types_re = %r{#{Regexp.union(*file_types)}$}
   end

   def grab_files
     find_uris.each do |link|
       begin
         data = open(link) { |a| a.read }
         file_path = link.host + link.path
         FileUtils.mkdir_p(File.dirname(file_path))
         open(file_path, 'wb') { |f| f.write(data) }
       rescue Exception => e
         $stderr.puts "#{e.class}: #{e}"
       end
     end
   end

   def find_uris
     soup = BeautifulSoup.new(open(@uri) { |f| f.read })
     soup.find_all('a') { |a| a['href'] =~ @file_types_re }.map do |a|
       uri = URI.parse(a['href'])
       # Create an absolute uri
       uri.host ? uri : URI.join(@uri, uri)
     end
   end
end

Grabber.new('http://google.com', %w{html}).grab_files

__END__

-- Daniel

···

On Apr 26, 2006, at 5:09 PM, Jeremiah Dodds wrote:

Alright, I think I may have stumbled upon a bug, correct me if I'm
wrong.

I just wrote a script to pull linked-to files of a certain type off of
webpages. Granted, I'm still pretty new to Ruby... this really has me
stumped though. I've googled around and looked through all the docs on
the relevant classes, and I'm not getting anywhere with this.

I have a class that has two main ways of pulling the files - in one, you
give it an absolute URL to a page, and it searches through looking for
links ending with an extension, creates a list, and calls wget to get
them. In the other, it goes through a page of links to other pages and
checks each of those pages for the filetype.

When I pass it a page using the absolute way, it works fine. If I pass
it in the other way, I get

"usr/lib/ruby/1.8/uri/common.rb:432:in `split': bad URI(is not URI?):
"www.thisistheurlblahexample.example" (URI::InvalidURIError)
        from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'"

Alright. So it's not a properly formed URI, right... or something. Thing
is, if I copy and paste the erroneous URL into the call for the absolute
method, it works fine. I figure I must be missing something critical
here - like data changing when it's passed from function to function
within a class? I don't know.

A URI has a protocol scheme.
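
A quick sketch of the distinction (hypothetical host and path, not your real URLs):

require 'uri'
require 'net/http'

# Without a scheme, the string is treated as a relative reference.
u = URI.parse('www.example.com/files/a.tar.gz')
p u.scheme   # => nil
p u.host     # => nil

# With the scheme it becomes an absolute HTTP URI that Net::HTTP can fetch.
abs = URI.parse('http://www.example.com/files/a.tar.gz')
p abs.scheme   # => "http"
p abs.host     # => "www.example.com"
# Net::HTTP.get(abs) works; Net::HTTP.get(u) does not, since u has no host.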

Here's what I'm thinking is the relevant code: (sorry if it's shoddy,
again I am pretty new at Ruby)

require 'net/http'
require 'uri'

#if you want to grab multiple filetypes, make filetype of the form "(f1|f2|f3)"
class Grabber
  def initialize(url, filetype)
    @url = url
    @filetype = filetype
    @filelist = String::new

       @filelist = ''

  end

  def linkCrawl
    page = Net::HTTP.get URI.parse("#{@url}")

       page = Net::HTTP.get URI.parse(@url)

    page.each do |line|
      #check for a link
      if line =~ %r|<a\shref\s*="http.*">|i
        #make the list of files from that page
        #sorry about this, people. I know it's ugly. was trying things here to
        #see if it made a difference if I changed the string beforehand instead
        #of calling it inline. it didn't, and it really shouldn't..... frustration

If it is ugly you should fix it. Don't leave broken windows.

        line.gsub!(%r|<a\shref=|, "")
        line.gsub!(%r|^\s+|, "")
        print "line: #{line}\n"

           puts "line: #{line}"

        createList("#{line}")

           createList line

      end
    end
    print "filelist: #{@filelist}\n"

       puts "filelist: #{@filelist}"

    exec "wget #{@filelist}"
  end

  def createList(url)

       url = @url + url
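
If the hrefs aren't always simple paths hanging directly off @url, URI.join is a more robust way to build the absolute URL than string concatenation; a small sketch with made-up paths:

       require 'uri'

       # A relative href is resolved against the page URL.
       p URI.join('http://example.com/files/', 'a.tar.gz').to_s
       # => "http://example.com/files/a.tar.gz"

       # A server-absolute href replaces the base path instead of being appended.
       p URI.join('http://example.com/files/', '/other/b.zip').to_s
       # => "http://example.com/other/b.zip"

       # An already-absolute href is left alone.
       p URI.join('http://example.com/files/', 'http://mirror.example.org/c.iso').to_s
       # => "http://mirror.example.org/c.iso"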

    page = Net::HTTP.get URI.parse(url)
    page.each do |line|
      #check for a link containing one of the filetypes
      if line =~ %r|.*<a\shref\s*=".*#{@filetype}".*>.*|i
        #strip the url of the filetype out of the html
        @filelist.concat "#{line.slice(%r|<a\shref=".*#{@filetype}"|i).gsub!(%r|<a href=|i, "")}\n"
      end
    end
  end

  def grabFiles
    createList("#{@url}")

       createList @url

    exec "wget #{@filelist}"
  end
end

test = Grabber::new("http://urlgoeshere.com", "(f1|f2|f3)")
test.linkCrawl

##the one below here always works
#test = Grabber::new("http:urlgoeshere.com", "(f1|f2|f3)")
#test.grabFiles

--
Posted via http://www.ruby-forum.com/.

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com