IE leech

Hi out there,
does anybody know how to leech all the links from a previously opened IE
window? I want to do this with IE, not with net/http. It would be nice if
anyone could help me.

Google this group for IE and win32ole references and also check out the MSDN
docs on InternetExplorer.Application – I’m pretty sure you can talk to the
DOM and get all the links out of the current page.
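
For reference, a minimal sketch of that win32ole approach. It assumes a
Windows machine with Ruby's win32ole library and uses the Shell.Application
COM object to enumerate windows that are already open. collect_links and
each_open_ie_document are hypothetical helper names, and this has not been
checked against any particular IE version.

```ruby
# Collect the href of every link in an IE DOM document. The argument only
# needs to answer #links, and that collection #length and #item(i), which is
# what the IE DOM exposes through WIN32OLE.
def collect_links(document)
  links = document.links
  (0...links.length).map { |i| links.item(i).href }
end

# Windows only: yield the URL and document of every window that
# Shell.Application reports. File Explorer windows show up in the same
# collection, so callers should expect some documents without #links.
def each_open_ie_document
  require 'win32ole' # loaded lazily so the file still loads elsewhere
  shell = WIN32OLE.new('Shell.Application')
  shell.Windows.each do |window|
    begin
      yield window.LocationURL, window.Document
    rescue StandardError => err
      $stderr.puts "skipping window: #{err.message}"
    end
  end
end
```

On Windows, something like
each_open_ie_document { |url, doc| puts url, collect_links(doc) }
should then print the links of each open page.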


----- Original Message -----
From: "Andi Weiss" <miner@arcor.de>
Newsgroups: comp.lang.ruby
To: "ruby-talk ML" <ruby-talk@ruby-lang.org>
Sent: Thursday, November 07, 2002 12:48 PM
Subject: IE leech

Hi out there,
does anybody know how to leech all the links from a previously opened IE
Window. I want to do this with IE not with NET/HTTP. Would be nice if anyone
can help me.

This is very easy with xmlscan (htmlscan) and net/http. Assuming you
save the following as ‘leech.rb’, do ./leech.rb http://path/to/page

You can create your own file-naming method if you want and pass it as a
method object into Leech.

#!/usr/bin/env ruby -w

require 'uri'
require 'net/http'

require 'htmlscan' # part of xmlscan

class Leech < HTMLScanner
  # url is the page to scan; filename_fun is an optional Method (or Proc)
  # that maps a URI to the name of the file the link is saved under.
  def initialize(url, filename_fun = nil)
    super()
    @base = url.kind_of?(URI) ? url : URI.parse(url).normalize
    @filename_fun = (filename_fun || self.method(:default_filename_fun))
  end

  # Fetch the base page and feed its body to the scanner.
  def leech
    parse(fetch(@base).body)
  end

  private

  # Default file name: "<host>-<last path segment>".
  def default_filename_fun(url)
    url.path =~ %r{/([^/]+)$}
    filename = $1 ? $1 : ''
    return "#{url.host}-#{filename}"
  end

  # Turn a link found on the page into an absolute URI, using base for
  # whatever parts (directory, scheme, host, port) the link leaves out.
  def cleanup_url(base, link)
    link = URI.parse(link).normalize unless link.kind_of? URI

    return link if link.scheme == 'mailto' or not link.relative?

    if link.relative? && link.path[0, 1] != '/'
      path_ary = base.path.split('/')
      link.path = path_ary[0...path_ary.length-1].join('/') + '/' + link.path
    end

    link.scheme = base.scheme unless link.scheme
    link.host = base.host unless link.host
    link.port = base.port unless link.port

    return link
  end

  # Fetch url over HTTP, following redirects; returns nil for other schemes.
  def fetch(url)
    response = nil
    url = cleanup_url(@base, url)

    case url.scheme
    when 'http'
      begin
        url.normalize!

        Net::HTTP.start(url.host, url.port) { |http|
          response, = http.get(url.path)
        }
      rescue Net::ProtoRetriableError => err
        url = URI.parse(err.response['location'])
        retry if url.host
      end
    else
      # implement other handlers here
    end

    return response
  end

  # Scanner callback for a start tag: save the target of every <a href=...>.
  def on_stag(element, attrs)
    return unless element == 'a' && attrs['href']
    page = nil
    url = cleanup_url(@base, attrs['href'])

    $stderr.puts "fetching: #{url}" if $DEBUG

    File.open(@filename_fun.call(url), 'w') { |fp|
      begin
        page = fetch(url)
        fp << page.body if page
      rescue Exception => err
        $stderr.puts "Caught #{err} while fetching #{url}"
      end
    }
  end
end

if __FILE__ == $0
  Leech.new(ARGV.shift).leech
end
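
As an aside on cleanup_url above: Ruby's standard uri library can also
resolve a link against a base URL with URI.join, which may be simpler than
rebuilding the path by hand. A small sketch; resolve_link is a hypothetical
helper name and the example.com URLs are made up.

```ruby
require 'uri'

# Resolve link (String or URI) against base the way a browser would.
# Links that are already absolute come back as-is, parsed into a URI.
def resolve_link(base, link)
  URI.join(base.to_s, link.to_s)
end

base = 'http://example.com/docs/index.html'
puts resolve_link(base, 'page2.html')                   # same directory as base
puts resolve_link(base, '/top.html')                    # relative to the site root
puts resolve_link(base, 'http://other.example/x.html')  # already absolute
```

Unlike cleanup_url, this does not treat mailto: links specially, so a caller
would still want to skip those before fetching.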

Eric Hodel - drbrain@segment7.net - http://segment7.net
All messages signed with fingerprint:
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04

What you’ve written looks very good, but I don’t know how to work with it.
Sorry, I haven’t been using Ruby for long. Ruby always returns an error
message: undefined superclass HTMLScanner.
Could you help me?

"Eric Hodel" <drbrain@segment7.net> wrote in message
news:20021108081333.GE34910@segment7.net

You need XMLScan. You can get it here:

http://www.ruby-lang.org/en/raa-list.rhtml?id=334

If you want, you can define your own file-naming function and pass it
in; the one I provided is very simple.

def my_filename_fun(url)
  # build and return the file name for url here
end

Leech.new(ARGV.shift, method(:my_filename_fun)).leech
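
To make that concrete, here is one possible naming function. This is only a
sketch: the name my_filename_fun comes from the post above, but the body is
an assumption about what a useful scheme could look like (host plus path,
with every run of characters other than letters, digits, dot, underscore and
hyphen collapsed to a single underscore).

```ruby
require 'uri'

# Map a URI to a flat file name that keeps the path, so that two pages with
# the same last segment on the same host do not overwrite each other.
def my_filename_fun(url)
  "#{url.host}#{url.path}".gsub(/[^A-Za-z0-9._-]+/, '_')
end

# The function receives a URI object, so it can be tried on its own:
puts my_filename_fun(URI.parse('http://example.com/docs/index.html'))
```

It would then be passed in as method(:my_filename_fun), the second argument
to Leech.new.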

Eric Hodel - drbrain@segment7.net - http://segment7.net

Sorry if I’m getting on your nerves, but I still don’t know how to use it.
Can you give me an example? Ruby still returns: undefined superclass
HTMLScanner. Please help me again; it would be very nice…
Thanks

"Eric Hodel" <drbrain@segment7.net> wrote in message
news:20021108174521.GF34910@segment7.net