I need to spider a site and build a sitemap for it. I've looked
around on rubyforge, and RAA, and don't see an exact match. Has
anybody done this, or is there a library out there that I missed?
--
Bill Guindon (aka aGorilla)
Bill Guindon said:
I need to spider a site and build a sitemap for it. I've looked
around on rubyforge, and RAA, and don't see an exact match. Has
anybody done this, or is there a library out there that I missed?
You could do this with WWW::Mechanize fairly easily. There isn't a
built-in spider system yet, but it would be a nice addition and I'm sure
Michael would add it if it was general enough.
I'm pretty familiar with Mechanize now and could help you out if you have
a problem. The basic idea would be to recursively get pages and click links
until you run out of links (of course I'm sure you know this). The cool
thing is that turning that idea into code with Mechanize is very easy, since it
collects links for you and allows you to "click" them.
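Just to sketch the idea (untested; the queue, the same-host check, and the
example URL are my own assumptions, not anything built into Mechanize):

require 'rubygems'
require 'mechanize'
require 'uri'

# Rough spider sketch: follow links breadth-first from a start page,
# staying on the same site, and remember every URL visited.
agent = WWW::Mechanize.new
start = 'http://www.example.com/'   # placeholder
seen  = {}
queue = [start]

until queue.empty?
  url = queue.shift
  next if seen[url]
  seen[url] = true
  begin
    page = agent.get(url)
  rescue Exception => e
    puts "skipping #{url}: #{e.class}"   # catch-all, good enough for a sketch
    next
  end
  page.links.each do |link|
    href = link.node.attributes['href']
    next if href.nil?
    abs = URI.join(url, href).to_s rescue next   # resolve relative links
    queue << abs if abs.index(start) == 0        # stay on the same site
  end
end

puts seen.keys.sort

A real spider would need to normalize and filter the links more carefully,
but that's the shape of it.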
In case you can't tell, I really like Mechanize
Ryan
I have a site mapping tool I'm working on which does not yet read
remote files but does map links between local files.
http://sterfish.com/lab/sitemapper/
I've been putting off announcing it until I have an actual page there,
but I guess I'm too slow.
- Shad
On 6/22/05, Bill Guindon <agorilla@gmail.com> wrote:
I need to spider a site and build a sitemap for it. I've looked
around on rubyforge, and RAA, and don't see an exact match. Has
anybody done this, or is there a library out there that I missed?
--
Bill Guindon (aka aGorilla)
--
----------
Please do not send personal (non-list-related) mail to this address.
Personal mail should be sent to polyergic@sterfish.com.
Bill Guindon said:
> I need to spider a site and build a sitemap for it. I've looked
> around on rubyforge, and RAA, and don't see an exact match. Has
> anybody done this, or is there a library out there that I missed?

You could do this with WWW::Mechanize fairly easily. There isn't a
built-in spider system yet, but it would be a nice addition and I'm sure
Michael would add it if it was general enough.

I'm pretty familiar with Mechanize now and could help you out if you have
a problem. The basic idea would be to recursively get pages and click links
until you run out of links (of course I'm sure you know this). The cool
thing is that turning that idea into code with Mechanize is very easy, since it
collects links for you and allows you to "click" them.
Grabbed it as a gem, trying a simple test. Oddly enough, had to add
its lib path to the LOAD_PATH to get rid of an error (uninitialized
constant WWW (NameError)).
Any docs available on this, or any public examples? Does look like
it'll give me a good start.
thanks much for the pointer.
On 6/22/05, Ryan Leavengood <mrcode@netrox.net> wrote:
In case you can't tell, I really like Mechanize
Ryan
--
Bill Guindon (aka aGorilla)
I have a site mapping tool I'm working on which does not yet read
remote files but does map links between local files.

http://sterfish.com/lab/sitemapper/
I've been putting off announcing it until I have an actual page there,
but I guess I'm too slow.
Thanks much. I need one that works remotely, but I'll certainly poke
around in there, and see what I can do with it.
--
Bill Guindon (aka aGorilla)
Bill Guindon said:
Grabbed it as a gem, trying a simple test. Oddly enough, had to add
its lib path to the LOAD_PATH to get rid of an error (uninitialized
constant WWW (NameError)).
Hmmm, I didn't have to do that. Do you have rubygems in your RUBYOPT?
Mechanize does mess around with the LOAD_PATH itself because it uses new
features from the Ruby v1.9 net libraries.
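If rubygems isn't being picked up automatically, the usual fix is either
setting RUBYOPT or requiring it explicitly at the top of the script:

require 'rubygems'   # or set RUBYOPT=rubygems in the environment
require 'mechanize'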
But for me it worked fine, as shown in the code below.
Any docs available on this, or any public examples? Does look like
it'll give me a good start.
Unfortunately the docs are a bit light at the moment. I learned a lot by
reading the source though, which is well written. Once I get my web-site
up I plan to write an article on Mechanize, but for now that doesn't
help you much.
It needs to be heavily refactored, but here is the prototype code I wrote
to help me renew books at my city library's web-site:
require 'mechanize'
require 'logger'   # for Logger.new below (harmless if mechanize already loads it)
require 'time'

# Holds the data scraped for one checked-out book.
class Book
  attr_accessor :title, :author, :due_date, :checkbox

  def due?
    (@due_date - Time.now) < 172800.0 # 2 days
  end

  def to_s
    "#@title by #@author, due on #@due_date\nCheckbox: #{checkbox.name}"
  end
end

agent = WWW::Mechanize.new {|a| a.log = Logger.new(STDERR) }

# Navigate from the library home page to the renewal form.
page = agent.get('http://www.coala.org/')
link = page.links.find {|l| l.node.text =~ /BOYNTON/ }
page = agent.click(link)
link = page.links.find {|l| l.node.text =~ /My Account/ }
page = agent.click(link)
link = page.links.find {|l| l.node.text =~ /Renew My Materials/ }
page = agent.click(link)

# Log in to the account.
form = page.forms[1]
form.fields.find {|f| f.name == 'user_id'}.value = 'my_id_removed'
form.fields.find {|f| f.name == 'password'}.value = 'my_password_removed'

# Collect all <td> elements from the page returned by the login submit.
agent.watch_for_set = {}
agent.watch_for_set['td'] = nil
page = agent.submit(form, form.buttons.first)

form = page.forms[1]
books_html = page.watches['td'].find_all {|n| n.attributes['class'] =~ /itemlisting/ }

# Build a Book object for each listed item.
books = []
books_html.each do |element|
  element.each_element do |subelem|
    if subelem.name == 'input' and subelem.attributes['type'] == 'checkbox'
      # Checkbox for renewal
      books << Book.new
      books[-1].checkbox = form.checkboxes.find {|c| c.name == subelem.attributes['name'] }
    elsif subelem.name == 'label'
      # Book title and author
      books[-1].title = subelem.texts[0]
      books[-1].author = subelem.texts[1]
    elsif subelem.name == 'strong'
      # Due date
      books[-1].due_date = Time.parse(subelem.text)
    end
  end
end

# Check the renewal box for anything due within the next two days.
books_due = false
books.each do |book|
  if book.due?
    books_due = true
    puts "#{book.title} is due, renewing!"
    book.checkbox.checked = true
  end
end

if books_due
  page = agent.submit(form, form.buttons.first)
  puts page.body
else
  puts 'Nothing was due, have a nice day!'
end

__END__
thanks much for the pointer.
No problem. Hope the above code helps too.
Ryan
> I have a site mapping tool I'm working on which does not yet read
> remote files but does map links between local files.
>
> http://sterfish.com/lab/sitemapper/
>
> I've been putting off announcing it until I have an actual page there,
> but I guess I'm too slow.

Thanks much. I need one that works remotely, but I'll certainly poke
around in there, and see what I can do with it.
Yeah. I made this to help me work on a site I'm now maintaining,
which was a hideous mess when I got to it. I do plan to make it map
remote pages as well, but it will probably be awhile.
Bill Guindon said:
>
> Grabbed it as a gem, trying a simple test. Oddly enough, had to add
> its lib path to the LOAD_PATH to get rid of an error (uninitialized
> constant WWW (NameError)).

Hmmm, I didn't have to do that. Do you have rubygems in your RUBYOPT?
Nope, guess it's time to add it.
Mechanize does mess around with the LOAD_PATH itself because it uses new
features from the Ruby v1.9 net libraries.

But for me it worked fine, as shown in the code below.
> Any docs available on this, or any public examples? Does look like
> it'll give me a good start.

Unfortunately the docs are a bit light at the moment. I learned a lot by
reading the source though, which is well written. Once I get my web-site
up I plan to write an article on Mechanize, but for now that doesn't
help you much.

It needs to be heavily refactored, but here is the prototype code I wrote
to help me renew books at my city library's web-site:
[helpful code snipped]
Thanks, that gives me a better idea of what can be done with it.

Now comes the fun part of parsing through relative URLs, checking for
base hrefs, and munging similar URLs (i.e. /some/file.html vs.
some/file.html, both called from the root). Should be interesting.
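For the relative-URL part, URI#merge handles most of it. A rough sketch
(the helper name is made up, and <base href> handling is simplified):

require 'uri'

# Resolve a link found on page_url to an absolute URL, honoring an
# optional <base href="..."> value if the page declared one.
def absolutize(page_url, href, base_href = nil)
  URI.parse(base_href || page_url).merge(href).to_s
end

absolutize('http://example.com/a/b.html', 'c.html')   # => "http://example.com/a/c.html"
absolutize('http://example.com/a/b.html', '/c.html')  # => "http://example.com/c.html"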
On 6/22/05, Ryan Leavengood <mrcode@netrox.net> wrote:
> thanks much for the pointer.
No problem. Hope the above code helps too.
Ryan
--
Bill Guindon (aka aGorilla)
I'll throw my little snippet in, in case anyone finds it useful.
I just wrote this up to spider my rails app to give me a list of all
the urls so I can use them later in a stress test.
Not terribly advanced, but gives you the format of:
http://www.blah.com/foo.html
{tab} http://www.blah.com/bar.html
Where the tabbed-out children of foo.html are pages that foo.html points to.
http://snippets.textdrive.com/posts/show/74
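(Not the actual snippet at that link, just an illustration of how that
indented shape can be produced:)

# Print each URL, then the pages it links to indented one tab deeper.
def print_tree(url, children, depth = 0, seen = {})
  return if seen[url]
  seen[url] = true
  puts "\t" * depth + url
  (children[url] || []).each {|kid| print_tree(kid, children, depth + 1, seen) }
end

children = { 'http://www.blah.com/foo.html' => ['http://www.blah.com/bar.html'] }
print_tree('http://www.blah.com/foo.html', children)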
-Matt
I'll throw my little snippet in, in case anyone finds it useful.
I just wrote this up to spider my rails app to give me a list of all
the urls so I can use them later in a stress test.

Not terribly advanced, but gives you the format of:
http://www.blah.com/foo.html
{tab} http://www.blah.com/bar.html

Where the tabbed-out children of foo.html are pages that foo.html points to.
Good stuff! It's missing a couple of features for stock sites
(handling javascript:, mailto:, #name links etc.), but those can
easily be added.
Thanks much for posting it.
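For example, filtering those out can be as simple as this (the pattern is
illustrative, not exhaustive):

# Drop link targets that aren't crawlable pages.
SKIP = /\A(javascript:|mailto:|#)/i

links = ['/about.html', 'mailto:someone@example.com', 'javascript:void(0)', '#top']
links.reject {|href| href =~ SKIP }   # => ["/about.html"]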
i noticed webfetcher in RPAbase, haven't had a chance to play with it:
http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/webfetcher.html
i noticed webfetcher in RPAbase, haven't had a chance to play with it:
Should've thought to scan RPA. Wish it was still being updated, I
sure do miss it.
http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/webfetcher.html
Gave it a couple test drives, and it's quite nice. The following gave
me exactly what I was looking for.
require 'webfetcher'
page = WebFetcher::Page.url('http://www.somedomain.com/')
links = page.recurse.links
File.open('links.txt', 'w+') {|f| f.puts links.uniq}
Thanks much for tracking it down.
I was looking for something to trap 404-type errors, kind of like
Mertz' code (but in Ruby):
http://gnosis.cx/TPiP/069.code
does this sound familiar to anybody?
The get_response method of Net::HTTP maybe:
http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M000033
The doc says it returns a Net::HTTPResponse object, which has the HTTP
result code in the attribute 'code'.
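For example (the URL is just a placeholder):

require 'net/http'
require 'uri'

res = Net::HTTP.get_response(URI.parse('http://www.example.com/some/page.html'))

case res.code            # the status as a string, e.g. "200", "404"
when '200' then puts 'found it'
when '404' then puts 'not found'
else puts "got #{res.code}"
end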
-m
On 6/30/05, Gene Tani <gene.tani@gmail.com> wrote:
I was looking for something to trap 404-type errors, kind of like
Mertz' code (but in Ruby):
http://gnosis.cx/TPiP/069.code
does this sound familiar to anybody?
When detecting 404s, watch out for servers that return a 200 code with a pretty "Not found" page. Those can throw a real curve ball depending on what you are trying to do.
On Jun 30, 2005, at 10:10 AM, Gene Tani wrote:
I was looking for something to trap 404-type errors, kind of like
Mertz' code (but in Ruby):
http://gnosis.cx/TPiP/069.code
does this sound familiar to anybody?
- Bill
Try it: you'll get Errno::ECONNREFUSED (Net::HTTP) or it will time out
(open-uri) on a lot of large commercial websites, like ruby-lang.org.
So either I have to rewrite headers to emulate, say, the Mozilla browser,
or throttle down the number of GETs it's firing out, so as not to offend
the websites' firewalls. That part isn't clear from the stdlib doc link in
Marcus' post.
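Something along these lines is what I mean (the User-Agent string, the
sleep interval, and the URLs are all placeholders):

require 'open-uri'
require 'timeout'

['http://www.example.com/', 'http://www.example.org/'].each do |url|
  begin
    # Send a browser-like User-Agent header
    html = open(url, 'User-Agent' => 'Mozilla/5.0 (compatible)').read
    puts "#{url}: #{html.length} bytes"
  rescue OpenURI::HTTPError => e
    puts "#{url}: #{e.message}"          # e.g. "404 Not Found"
  rescue Errno::ECONNREFUSED, Timeout::Error => e
    puts "#{url}: #{e.class}"
  end
  sleep 2                                # throttle between requests
end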
Right, the point of Mertz' code is to parse <TITLE>, <META>, <BODY> for
phrases like "not found", "not available", "does not exist" when the
HTTP/FTP lib gives you a 200. But at this point I'd settle for
responses different from "timed out" or Errno::ECONNREFUSED.
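A crude version of that kind of check, just for illustration (the phrase
list and the helper name are made up, and it greps the whole body rather
than specific tags):

require 'net/http'
require 'uri'

SOFT_404 = /not found|not available|does not exist/i

def missing?(url)
  res = Net::HTTP.get_response(URI.parse(url))
  return true unless res.code == '200'
  res.body =~ SOFT_404 ? true : false
end

puts missing?('http://www.example.com/some/old/bookmark.html')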
I have two apps: one is simply validating personal bookmarks, the other is
commercial. For the commercial one, I'd be happy to register a spider
per O'Reilly's "Spidering Hacks". For my bookmarks, I figured this
would be easy...