I need to spider a site and build a sitemap for it. I've looked
around on rubyforge, and RAA, and don't see an exact match. Has
anybody done this, or is there a library out there that I missed?
--
Bill Guindon (aka aGorilla)
Bill Guindon said:
I need to spider a site and build a sitemap for it. I've looked
around on rubyforge, and RAA, and don't see an exact match. Has
anybody done this, or is there a library out there that I missed?
You could do this with WWW::Mechanize fairly easily. There isn't a
built-in spider system yet, but it would be a nice addition and I'm sure
Michael would add it if it was general enough.
I'm pretty familiar with Mechanize now and could help you out if you have
a problem. The basic idea would be to recursively get pages and click links
until you run out of links (of course I'm sure you know this). The cool
thing is that turning that idea into code with Mechanize is very easy, since it
collects links for you and allows you to "click" them.
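Just to sketch the idea (untested; the queue, the same-host check, and the
example URL are my own assumptions, not anything built into Mechanize):

require 'rubygems'
require 'mechanize'
require 'uri'

# Rough spider sketch: follow links breadth-first from a start page,
# staying on the same site, and remember every URL visited.
agent = WWW::Mechanize.new
start = 'http://www.example.com/'   # placeholder
seen  = {}
queue = [start]

until queue.empty?
  url = queue.shift
  next if seen[url]
  seen[url] = true
  begin
    page = agent.get(url)
  rescue Exception => e
    puts "skipping #{url}: #{e.class}"   # catch-all, good enough for a sketch
    next
  end
  page.links.each do |link|
    href = link.node.attributes['href']
    next if href.nil?
    abs = URI.join(url, href).to_s rescue next   # resolve relative links
    queue << abs if abs.index(start) == 0        # stay on the same site
  end
end

puts seen.keys.sort

A real spider would need to normalize and filter the links more carefully,
but that's the shape of it.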
In case you can't tell, I really like Mechanize
Ryan
I have a site mapping tool I'm working on which does not yet read
remote files but does map links between local files.
http://sterfish.com/lab/sitemapper/
I've been putting off announcing it until I have an actual page there,
but I guess I'm too slow.
- Shad
On 6/22/05, Bill Guindon <agorilla@gmail.com> wrote:
I need to spider a site and build a sitemap for it. I've looked
around on rubyforge, and RAA, and don't see an exact match. Has
anybody done this, or is there a library out there that I missed?
--
Bill Guindon (aka aGorilla)
--
----------
Please do not send personal (non-list-related) mail to this address.
Personal mail should be sent to polyergic@sterfish.com.
Bill Guindon said:
> I need to spider a site and build a sitemap for it. I've looked
> around on rubyforge, and RAA, and don't see an exact match. Has
> anybody done this, or is there a library out there that I missed?

You could do this with WWW::Mechanize fairly easily. There isn't a
built-in spider system yet, but it would be a nice addition and I'm sure
Michael would add it if it was general enough.

I'm pretty familiar with Mechanize now and could help you out if you have
a problem. The basic idea would be to recursively get pages and click links
until you run out of links (of course I'm sure you know this). The cool
thing is that turning that idea into code with Mechanize is very easy, since it
collects links for you and allows you to "click" them.
Grabbed it as a gem, trying a simple test. Oddly enough, had to add
its lib path to the LOAD_PATH to get rid of an error (uninitialized
constant WWW (NameError)).
Any docs available on this, or any public examples? Does look like
it'll give me a good start.
thanks much for the pointer.
On 6/22/05, Ryan Leavengood <mrcode@netrox.net> wrote:
In case you can't tell, I really like Mechanize
Ryan
--
Bill Guindon (aka aGorilla)
I have a site mapping tool I'm working on which does not yet read
remote files but does map links between local files.

http://sterfish.com/lab/sitemapper/
I've been putting off announcing it until I have an actual page there,
but I guess I'm too slow.
Thanks much. I need one that works remotely, but I'll certainly poke
around in there, and see what I can do with it.
--
Bill Guindon (aka aGorilla)
Bill Guindon said:
Grabbed it as a gem, trying a simple test. Oddly enough, had to add
its lib path to the LOAD_PATH to get rid of an error (uninitialized
constant WWW (NameError)).
Hmmm, I didn't have to do that. Do you have rubygems in your RUBYOPT?
Mechanize does mess around with the LOAD_PATH itself because it uses new
features from the Ruby v1.9 net libraries.
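If rubygems isn't being picked up automatically, the usual fix is either
setting RUBYOPT or requiring it explicitly at the top of the script:

require 'rubygems'   # or set RUBYOPT=rubygems in the environment
require 'mechanize'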
But for me it worked fine, as shown in the code below.
Any docs available on this, or any public examples? Does look like
it'll give me a good start.
Unfortunately the docs are a bit light at the moment. I learned a lot by
reading the source though, which is well written. Once I get my web-site
up I plan to write an article on Mechanize, but for now that doesn't
help you much.
It needs to be heavily refactored, but here is the prototype code I wrote
to help me renew books at my city library's web-site:
require 'mechanize'
require 'logger'   # for Logger.new below (harmless if mechanize already loads it)
require 'time'

# Holds the data scraped for one checked-out book.
class Book
  attr_accessor :title, :author, :due_date, :checkbox

  def due?
    (@due_date - Time.now) < 172800.0 # 2 days
  end

  def to_s
    "#@title by #@author, due on #@due_date\nCheckbox: #{checkbox.name}"
  end
end

agent = WWW::Mechanize.new {|a| a.log = Logger.new(STDERR) }

# Navigate from the library home page to the renewal form.
page = agent.get('http://www.coala.org/')
link = page.links.find {|l| l.node.text =~ /BOYNTON/ }
page = agent.click(link)
link = page.links.find {|l| l.node.text =~ /My Account/ }
page = agent.click(link)
link = page.links.find {|l| l.node.text =~ /Renew My Materials/ }
page = agent.click(link)

# Log in to the account.
form = page.forms[1]
form.fields.find {|f| f.name == 'user_id'}.value = 'my_id_removed'
form.fields.find {|f| f.name == 'password'}.value = 'my_password_removed'

# Collect all <td> elements from the page returned by the login submit.
agent.watch_for_set = {}
agent.watch_for_set['td'] = nil
page = agent.submit(form, form.buttons.first)

form = page.forms[1]
books_html = page.watches['td'].find_all {|n| n.attributes['class'] =~ /itemlisting/ }

# Build a Book object for each listed item.
books = []
books_html.each do |element|
  element.each_element do |subelem|
    if subelem.name == 'input' and subelem.attributes['type'] == 'checkbox'
      # Checkbox for renewal
      books << Book.new
      books[-1].checkbox = form.checkboxes.find {|c| c.name == subelem.attributes['name'] }
    elsif subelem.name == 'label'
      # Book title and author
      books[-1].title = subelem.texts[0]
      books[-1].author = subelem.texts[1]
    elsif subelem.name == 'strong'
      # Due date
      books[-1].due_date = Time.parse(subelem.text)
    end
  end
end

# Check the renewal box for anything due within the next two days.
books_due = false
books.each do |book|
  if book.due?
    books_due = true
    puts "#{book.title} is due, renewing!"
    book.checkbox.checked = true
  end
end

if books_due
  page = agent.submit(form, form.buttons.first)
  puts page.body
else
  puts 'Nothing was due, have a nice day!'
end

__END__
thanks much for the pointer.
No problem. Hope the above code helps too.
Ryan
> I have a site mapping tool I'm working on which does not yet read
> remote files but does map links between local files.
>
> http://sterfish.com/lab/sitemapper/
>
> I've been putting off announcing it until I have an actual page there,
> but I guess I'm too slow.

Thanks much. I need one that works remotely, but I'll certainly poke
around in there, and see what I can do with it.
Yeah. I made this to help me work on a site I'm now maintaining,
which was a hideous mess when I got to it. I do plan to make it map
remote pages as well, but it will probably be awhile.
Bill Guindon said:
>
> Grabbed it as a gem, trying a simple test. Oddly enough, had to add
> its lib path to the LOAD_PATH to get rid of an error (uninitialized
> constant WWW (NameError)).

Hmmm, I didn't have to do that. Do you have rubygems in your RUBYOPT?
Nope, guess it's time to add it.
Mechanize does mess around with the LOAD_PATH itself because it uses new
features from the Ruby v1.9 net libraries.

But for me it worked fine, as shown in the code below.
> Any docs available on this, or any public examples? Does look like
> it'll give me a good start.

Unfortunately the docs are a bit light at the moment. I learned a lot by
reading the source though, which is well written. Once I get my web-site
up I plan to write an article on Mechanize, but for now that doesn't
help you much.

It needs to be heavily refactored, but here is the prototype code I wrote
to help me renew books at my city library's web-site:
[helpful code snipped]
Thanks, that gives me a better idea of what can be done with it.

Now comes the fun part of parsing through relative URLs, checking for
base hrefs, and munging similar URLs (i.e. /some/file.html vs.
some/file.html, both called from the root). Should be interesting.
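For the relative-URL part, URI#merge handles most of it. A rough sketch
(the helper name is made up, and <base href> handling is simplified):

require 'uri'

# Resolve a link found on page_url to an absolute URL, honoring an
# optional <base href="..."> value if the page declared one.
def absolutize(page_url, href, base_href = nil)
  URI.parse(base_href || page_url).merge(href).to_s
end

absolutize('http://example.com/a/b.html', 'c.html')   # => "http://example.com/a/c.html"
absolutize('http://example.com/a/b.html', '/c.html')  # => "http://example.com/c.html"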
On 6/22/05, Ryan Leavengood <mrcode@netrox.net> wrote:
> thanks much for the pointer.
No problem. Hope the above code helps too.
Ryan
--
Bill Guindon (aka aGorilla)
I'll throw my little snippet in, in case anyone finds it useful.
I just wrote this up to spider my rails app to give me a list of all
the urls so I can use them later in a stress test.
Not terribly advanced, but gives you the format of:
http://www.blah.com/foo.html
{tab} http://www.blah.com/bar.html
Where the tabbed-out children of foo.html are pages that foo.html points to.
http://snippets.textdrive.com/posts/show/74
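(Not the actual snippet at that link, just an illustration of how that
indented shape can be produced:)

# Print each URL, then the pages it links to indented one tab deeper.
def print_tree(url, children, depth = 0, seen = {})
  return if seen[url]
  seen[url] = true
  puts "\t" * depth + url
  (children[url] || []).each {|kid| print_tree(kid, children, depth + 1, seen) }
end

children = { 'http://www.blah.com/foo.html' => ['http://www.blah.com/bar.html'] }
print_tree('http://www.blah.com/foo.html', children)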
-Matt
I'll throw my little snippet in, in case anyone finds it useful.
I just wrote this up to spider my rails app to give me a list of all
the urls so I can use them later in a stress test.

Not terribly advanced, but gives you the format of:
http://www.blah.com/foo.html
{tab} http://www.blah.com/bar.html

Where the tabbed-out children of foo.html are pages that foo.html points to.
Good stuff! It's missing a couple of features for stock sites
(handling javascript:, mailto:, #name links etc.), but those can
easily be added.
Thanks much for posting it.
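For example, filtering those out can be as simple as this (the pattern is
illustrative, not exhaustive):

# Drop link targets that aren't crawlable pages.
SKIP = /\A(javascript:|mailto:|#)/i

links = ['/about.html', 'mailto:someone@example.com', 'javascript:void(0)', '#top']
links.reject {|href| href =~ SKIP }   # => ["/about.html"]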
i noticed webfetcher in RPAbase, haven't had a chance to play with it:
http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/webfetcher.html
i noticed webfetcher in RPAbase, haven't had a chance to play with it:
Should've thought to scan RPA. Wish it was still being updated, I
sure do miss it.
http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/webfetcher.html
Gave it a couple test drives, and it's quite nice. The following gave
me exactly what I was looking for.
require 'webfetcher'
page = WebFetcher::Page.url('http://www.somedomain.com/')
links = page.recurse.links
File.open('links.txt', 'w+') {|f| f.puts links.uniq}
Thanks much for tracking it down.
I was looking for something to trap 404-type errors, kind of like
Mertz' code (but in Ruby):
http://gnosis.cx/TPiP/069.code
does this sound familiar to anybody?
The get_response method of Net::HTTP maybe:
http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M000033
The doc says it returns a Net::HTTPResponse object, which has the HTTP
result code in the attribute 'code'.
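For example (the URL is just a placeholder):

require 'net/http'
require 'uri'

res = Net::HTTP.get_response(URI.parse('http://www.example.com/some/page.html'))

case res.code            # the status as a string, e.g. "200", "404"
when '200' then puts 'found it'
when '404' then puts 'not found'
else puts "got #{res.code}"
end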
-m
On 6/30/05, Gene Tani <gene.tani@gmail.com> wrote:
I was looking for something to trap 404-type errors, kind of like
Mertz' code (but in Ruby):
http://gnosis.cx/TPiP/069.code
does this sound familiar to anybody?
When detecting 404s, watch out for servers that return a 200 code with a pretty "Not found" page. Those can throw a real curve ball depending on what you are trying to do.
On Jun 30, 2005, at 10:10 AM, Gene Tani wrote:
I was looking for something to trap 404-type errors, kind of like
Mertz' code (but in Ruby):
http://gnosis.cx/TPiP/069.code
does this sound familiar to anybody?
- Bill
Try it: you'll get Errno::ECONNREFUSED (Net::HTTP) or it will time out
(open-uri) on a lot of large commercial websites, like ruby-lang.org.
So either I have to rewrite headers to emulate, say, the Mozilla browser,
or throttle down the number of GETs it's firing out, so as not to offend
the websites' firewalls. That part isn't clear from the stdlib doc link in
Marcus' post.
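Something along these lines is what I mean (the User-Agent string, the
sleep interval, and the URLs are all placeholders):

require 'open-uri'
require 'timeout'

['http://www.example.com/', 'http://www.example.org/'].each do |url|
  begin
    # Send a browser-like User-Agent header
    html = open(url, 'User-Agent' => 'Mozilla/5.0 (compatible)').read
    puts "#{url}: #{html.length} bytes"
  rescue OpenURI::HTTPError => e
    puts "#{url}: #{e.message}"          # e.g. "404 Not Found"
  rescue Errno::ECONNREFUSED, Timeout::Error => e
    puts "#{url}: #{e.class}"
  end
  sleep 2                                # throttle between requests
end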
Right, the point of Mertz' code is to parse <TITLE>, <META>, <BODY> for
phrases like "not found", "not available", "does not exist" when the
HTTP/FTP lib gives you a 200. But at this point I'd settle for
responses different from "timed out" or Errno::ECONNREFUSED.
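A crude version of that kind of check, just for illustration (the phrase
list and the helper name are made up, and it greps the whole body rather
than specific tags):

require 'net/http'
require 'uri'

SOFT_404 = /not found|not available|does not exist/i

def missing?(url)
  res = Net::HTTP.get_response(URI.parse(url))
  return true unless res.code == '200'
  res.body =~ SOFT_404 ? true : false
end

puts missing?('http://www.example.com/some/old/bookmark.html')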
I have two apps: one is simply validating personal bookmarks, the other is
commercial. For the commercial one, I'd be happy to register a spider
per O'Reilly's "Spidering Hacks". For my bookmarks, I figured this
would be easy...