How to get an HTTP directory listing

Hi,

Has anyone ever tried to do something like this? I'm trying to log in to a url and get a listing of all the files on that url. For example, with wget I can fetch the files this way:

wget http://username:password@172.16.1.1/logs/somelog.log

this would get the individual file and download it to my local machine. However, I'm trying to automate this script to go out and look at the contents of this directory and do something for each file in there. For example, I want to do something like this:

Dir.foreach("http://username:password@172.16.1.1/logs") {|x| <do some logic here>}

However, I'm not sure if there is a simple way to do something like this in Ruby. Anyone encountered this before?

Thanks in advance,
Carlos
(Ruby Rookie)

Carlos Diaz wrote:

Hi,

Has anyone ever tried to do something like this? I'm trying to log in
to a url and get a listing of all the files on that url. For example,
with wget I can fetch the files this way:

wget http://username:password@172.16.1.1/logs/somelog.log

this would get the individual file and download it to my local
machine. However, I'm trying to automate this script to go out and
look at the contents of this directory and do something for each file
in there. For example, I want to do something like this:

Dir.foreach("http://username:password@172.16.1.1/logs") {|x| <do some
logic here>}

However, I'm not sure if there is a simple way to do something like
this in Ruby. Anyone encountered this before?

The problem is that HTTP has no standard mechanism for listing a
directory. Many servers are even configured to refuse to list directory
contents.

Maybe wget already does the job on its own: its recursive mode (-r)
follows the links it finds in an index page.

Kind regards

    robert
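
To make the point about servers concrete, here is a minimal sketch of asking for the index page with the credentials passed explicitly as HTTP Basic authentication, using Net::HTTP from the standard library. The helper names build_listing_request and fetch_listing are made up for illustration, and the sketch assumes the server uses Basic authentication and returns some kind of index page for the directory URL. A nil result means the server did not answer with a 2xx status, which is what you would expect when listings are forbidden.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a GET request carrying HTTP Basic credentials.
def build_listing_request(uri, user, password)
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(user, password)
  request
end

# Hypothetical helper: return the body of the index page, or nil when the
# server does not answer with a 2xx status (for example 401 or 403).
def fetch_listing(url, user, password)
  uri = URI.parse(url)
  request = build_listing_request(uri, user, password)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    response = http.request(request)
    response.is_a?(Net::HTTPSuccess) ? response.body : nil
  end
end

# Usage (not executed here, since it needs the real server):
# body = fetch_listing('http://172.16.1.1/logs/', 'username', 'password')
```

Whatever comes back is only an HTML page, so the file names still have to be pulled out of its links.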

Carlos Diaz wrote:

Hi,

Has anyone ever tried to do something like this? I'm trying to log in
to a url and get a listing of all the files on that url. For example,
with wget I can fetch the files this way:

wget http://username:password@172.16.1.1/logs/somelog.log

this would get the individual file and download it to my local machine.
However, I'm trying to automate this script to go out and look at the
contents of this directory and do something for each file in there. For
example, I want to do something like this:

Dir.foreach("http://username:password@172.16.1.1/logs") {|x| <do some
logic here>}

However, I'm not sure if there is a simple way to do something like this
in Ruby. Anyone encountered this before?

Niklas Frykholm may have tussled with this problem:

http://raa.ruby-lang.org/project/webfetcher/

=> 402 Access Denied

Worth a look for ideas?

daz

Carlos Diaz wrote:

However, I'm trying to automate this script to go out and look at the contents of this directory and do something for each file in there. For example, I want to do something like this:

Dir.foreach("http://username:password@172.16.1.1/logs") {|x| <do some logic here>}

However, I'm not sure if there is a simple way to do something like this in Ruby. Anyone encountered this before?

I assume you're talking about the normal automatically-generated directory page, where Apache generates a list of files with links to each file. In which case...

require 'uri'
require 'open-uri'
require 'html/htmltokenizer'

class WebPage
   attr_reader :links # URLs of all links on page

   # Get a web page from a specified URL
   def get(url)
     @uri = URI.parse(url)
     # URI.open comes from open-uri; plain Kernel#open stopped accepting URLs in Ruby 3.0
     URI.open(url) {|result| @body = result.read }
   end

   # Parse the web page, extracting links
   def parse
     if !@body
       return
     end
     tokenizer = HTMLTokenizer.new(@body)
     @links = Array.new
     while tag = tokenizer.getTag('a')
       # Normalize to a full URL
       url = tag.attr_hash['href']
       uri = @uri.merge(url)
       @links.push(uri.to_s)
     end
   end
end

wp = WebPage.new
wp.get('http://www.example.com/')
wp.parse
for link in wp.links
   puts link
end

You'll find HTMLTokenizer at <URL:http://rubyforge.org/projects/htmltokenizer/>. You could also do it with REXML, of course, but the code would probably be a little harder to follow.

Making the above code robust to things like <a> elements with no href is left as an exercise for the reader :)

mathew
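
For anyone who wants a head start on that exercise, here is one possible sketch that uses only the standard library. The helper name listing_entries is made up for illustration, and the regex scan stands in for a real HTML parser, so treat it as a simplification rather than a drop-in replacement for HTMLTokenizer. It assumes an Apache-style index page and a directory URL that ends in a slash. Anchors without an href are ignored, as are the column-sort links and the "Parent Directory" link.

```ruby
require 'uri'

# Hypothetical helper: return absolute URLs of the entries linked from a
# directory index page, skipping anchors that have no usable href.
def listing_entries(html, base_url)
  base = URI.parse(base_url)
  # Simplification: a regex scan instead of a real HTML parser.
  hrefs = html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']*)["']/i).flatten
  hrefs.filter_map do |href|
    # Skip empty hrefs, column-sort links ("?C=N;O=D") and fragments.
    next if href.empty? || href.start_with?('?', '#')
    begin
      uri = base.merge(href)
    rescue URI::Error
      next # unparsable href: ignore it
    end
    # Keep only entries below the listing itself; this drops "Parent Directory".
    next unless uri.host == base.host
    next unless uri.path.start_with?(base.path) && uri.path != base.path
    uri.to_s
  end
end

# Usage, roughly in the spirit of Dir.foreach (index_html is the page body
# you have already fetched):
# listing_entries(index_html, 'http://172.16.1.1/logs/').each { |url| puts url }
```

Note that subdirectories show up as entries too, so filter on a trailing slash if you only want files.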