Hi out there,
does anybody know how to leech all the links from a previously opened IE
window? I want to do this with IE itself, not with net/http. It would be nice if
anyone could help me.
Google this group for IE and win32ole references and also check out the MSDN
docs on InternetExplorer.Application – I’m pretty sure you can talk to the
DOM and get all the links out of the current page.
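A minimal sketch of that idea, assuming Windows and Ruby's win32ole library: the Shell.Application COM object lists open shell windows, and each Internet Explorer window exposes its DOM through Document, whose links collection holds the anchors. The method names hrefs_from_document and open_ie_documents are made up for this example, and filtering windows by the iexplore.exe executable name is an assumption about how to tell IE windows from file-explorer windows.

```ruby
# Collect the href of every link in an IE DOM document (or in any object
# that offers the same links/href interface).
def hrefs_from_document(document)
  hrefs = []
  document.links.each { |link| hrefs << link.href.to_s }
  hrefs.uniq
end

# Windows only: return the DOM documents of all Internet Explorer windows
# that are already open. Shell.Application also lists file-explorer
# windows, so they are skipped by executable name (an assumption).
def open_ie_documents
  require 'win32ole'
  shell = WIN32OLE.new('Shell.Application')
  documents = []
  shell.Windows.each do |window|
    documents << window.Document if window.FullName.to_s =~ /iexplore\.exe\z/i
  end
  documents
end
```

On a Windows machine with an IE window open, something like open_ie_documents.each { |doc| puts hrefs_from_document(doc) } should print the links of each page.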
···
----- Original Message -----
From: “Andi Weiss” miner@arcor.de
Newsgroups: comp.lang.ruby
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Thursday, November 07, 2002 12:48 PM
Subject: IE leech
This is very easy with xmlscan (htmlscan) and net/http. Assuming you
save the following as ‘leech.rb’, run ./leech.rb http://path/to/page
You can write your own file-naming method if you want and pass it as a
Method object into Leech.new.
#!/usr/bin/env ruby -w

require 'uri'
require 'net/http'
require 'htmlscan' # part of xmlscan

class Leech < HTMLScanner

  def initialize(url, filename_fun = nil)
    super() # assumes HTMLScanner's constructor takes no arguments
    @base = url.kind_of?(URI) ? url : URI.parse(url).normalize
    @filename_fun = (filename_fun || self.method(:default_filename_fun))
  end

  # Fetch the base page and scan it; on_stag is called for every start tag.
  def leech
    parse(fetch(@base).body)
  end

  private

  # Default naming scheme: "<host>-<last path component>".
  def default_filename_fun(url)
    url.path =~ /\/([^\/]+)$/
    filename = $1 ? $1 : ''
    return "#{url.host}-#{filename}"
  end

  # Turn a possibly relative link into an absolute URL based on base.
  def cleanup_url(base, link)
    link = URI.parse(link).normalize unless link.kind_of? URI
    return link if link.scheme == 'mailto' or not link.relative?

    # A path without a leading slash is relative to the base page's directory.
    if link.path[0, 1] != '/'
      path_ary = base.path.split('/')
      link.path = path_ary[0...path_ary.length-1].join('/') + '/' + link.path
    end

    link.scheme = base.scheme unless link.scheme
    link.host = base.host unless link.host
    link.port = base.port unless link.port

    return link
  end

  # Fetch url over HTTP, following redirects; returns nil for other schemes.
  def fetch(url)
    response = nil
    url = cleanup_url(@base, url)

    case url.scheme
    when 'http'
      begin
        url.normalize!
        Net::HTTP.start(url.host, url.port) { |http|
          response, = http.get(url.path)
        }
      rescue Net::ProtoRetriableError => err
        # Redirect: retry with the URL from the Location header.
        url = URI.parse(err.response['location'])
        retry if url.host
      end
    else
      # implement other handlers here
    end

    return response
  end

  # Scanner callback: save the target of every <a href="..."> to a file.
  def on_stag(element, attrs)
    return unless element == 'a'
    return unless attrs['href'] # skip anchors without a link target

    page = nil
    url = cleanup_url(@base, attrs['href'])
    $stderr.puts "fetching: #{url}" if $DEBUG

    File.open(@filename_fun.call(url), 'w') { |fp|
      begin
        page = fetch(url)
        fp << page.body if page
      rescue Exception => err
        $stderr.puts "Caught #{err} while fetching #{url}"
      end
    }
  end

end

if __FILE__ == $0
  Leech.new(ARGV.shift).leech
end
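The cleanup_url method above resolves relative links by hand. As an alternative sketch (not what the script itself does), Ruby's standard uri library offers URI.join, which applies the usual relative-reference rules. The helper name absolute_link and the example.com URLs are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: resolve link against base with URI.join.
# Links that are already absolute (including mailto:) come back unchanged.
def absolute_link(base, link)
  URI.join(base.to_s, link.to_s)
end

base = 'http://example.com/docs/index.html'
puts absolute_link(base, 'page2.html')             # sibling of index.html
puts absolute_link(base, '/top.html')              # relative to the server root
puts absolute_link(base, 'mailto:me@example.com')  # already absolute
```

Whether this can replace cleanup_url depends on the Ruby version in use; it is shown only to make the resolution rules concrete.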
–
Eric Hodel - drbrain@segment7.net - http://segment7.net
All messages signed with fingerprint:
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04
What you’ve written looks very good, but I don’t know how to work with it.
Sorry, I haven’t been using Ruby for long. Ruby always returns an error message:
undefined superclass HTMLScanner.
Could you help me?
“Eric Hodel” drbrain@segment7.net wrote in message
news:20021108081333.GE34910@segment7.net…
You need XMLScan; you can get it here:
http://www.ruby-lang.org/en/raa-list.rhtml?id=334
If you want, you can define your own file-naming function and pass it
in; the one I provided is very simple.
def my_filename_fun(url)
  …
end

Leech.new(ARGV.shift, method(:my_filename_fun)).leech
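As one illustration of such a naming function, here is a minimal sketch that folds the whole URL path into the file name, so that pages sharing the same last path component do not overwrite each other. The name path_filename_fun, the "index" fallback, and the naming scheme are hypothetical; only the calling convention (one URI argument, a String result) is taken from default_filename_fun in the script.

```ruby
require 'uri'

# Hypothetical naming function: "<host>-<path with unsafe characters replaced>".
# Takes a URI object and returns a String, like default_filename_fun.
def path_filename_fun(url)
  path = url.path.to_s.sub(%r{\A/+}, '')        # drop leading slashes
  path = 'index' if path.empty?                 # assumed fallback name for "/"
  flat = path.gsub(%r{[^A-Za-z0-9._-]+}, '_')   # keep only filename-safe characters
  "#{url.host}-#{flat}"
end
```

It would be passed in the same way as above: Leech.new(ARGV.shift, method(:path_filename_fun)).leech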
Sorry if I’m getting on your nerves, but I still don’t know how to use it. Can
you give me an example? Ruby still returns: undefined superclass
HTMLScanner. Please help me again; it would be very nice…
Thanks
“Eric Hodel” drbrain@segment7.net wrote in message
news:20021108174521.GF34910@segment7.net…