Waiter, there's a noob in my soup!

Jeff_Pritchard · 27 March 2006 03:07

Another thread here made me realize that I have a perfect use for
RubyfulSoup. I own a site that was built with the ISP's online site
building tools. I want to move the site to different hosting before my
one year "subscription" runs out, and scraping the site and formatting
just the text of each page against a template html file will be the best
way to do it.

Alas, I am a complete noob to roob, and I always find that I learn
things much more easily by studying (i.e. stealing and modifying)
example code than by studying docs.

I was wondering if anyone could point me to some example code that is
using RubyfulSoup to parse a sitemap to get links to all the pages on
that site and request each page and grab things from it.

many thanks,
jp

--
Posted via http://www.ruby-forum.com/.

Gene_Tani · 27 March 2006 03:58

Jeff Pritchard wrote:

Another thread here made me realize that I have a perfect use for
RubyfulSoup. I own a site that was built with the ISP's online site
building tools. I want to move the site to different hosting before my
one year "subscription" runs out, and scraping the site and formatting
just the text of each page against a template html file will be the best
way to do it.

Alas, I am a complete noob to roob, and I always find that I learn
things much more easily by studying (i.e. stealing and modifying)
example code than by studying docs.

I was wondering if anyone could point me to some example code that is
using RubyfulSoup to parse a sitemap to get links to all the pages on
that site and request each page and grab things from it.

how about a Beautiful soup example
http://sig.levillage.org/?p=599

Gene_Tani · 27 March 2006 04:33

Jeff Pritchard wrote:

Another thread here made me realize that I have a perfect use for
RubyfulSoup. I own a site that was built with the ISP's online site
building tools. I want to move the site to different hosting before my
one year "subscription" runs out, and scraping the site and formatting
just the text of each page against a template html file will be the best
way to do it.

Alas, I am a complete noob to roob, and I always find that I learn
things much more easily by studying (i.e. stealing and modifying)
example code than by studying docs.

I was wondering if anyone could point me to some example code that is
using RubyfulSoup to parse a sitemap to get links to all the pages on
that site and request each page and grab things from it.

this is kinda close, it uses BeautifulSoup
http://sig.levillage.org/?p=599

Ryan_Leavengood · 29 March 2006 00:59

WWW::Mechanize makes this easy. The HTML parsing has been pretty
robust in my experience. So far I've used it to scrape my library's
web site to see when books are due and automatically renew them, as
well as log into Cingular.com and get my mobile phone minutes. The
library web-site has weird redirects and some other things that
Mechanize handles great, and the Cingular has a weird multi-step login
system that I got going as well without too much trouble.

When I needed support for check boxes in the form on the library
web-site, the author of WWW::Mechanize, Michael Neumann, added them in
less than 24 hours.

So anyhow, this is a slick library, and very useful.

Ryan

On 3/26/06, Jeff Pritchard <jp@jeffpritchard.com> wrote:

I was wondering if anyone could point me to some example code that is
using RubyfulSoup to parse a sitemap to get links to all the pages on
that site and request each page and grab things from it.

Jeff_Pritchard · 28 March 2006 03:47

Thanks Gene, but I don't know which end is up with Python. I'm an old C
dog trying to learn a new (ruby) trick. Would rather not use up any of
my precious few functional neurons with Python.

Anybody have a Ruby example of grabbing html pages from a site map full
of links, or an example of using the RubyfulSoup package to grab text
pieces from html pages?

thanks,
jp


Jeff_Pritchard · 29 March 2006 03:59

Thanks to all who responded.

It looks like I could write some tiny perl scripts with mechanize and
pipe their output to a ruby program so that I could do most of the work
in ruby instead of perl. Perl reminds me of work, and that is a bad
thing. Writing perl also doesn't teach me any ruby, which is an
important part of this hack.

I also just found mention in the pickaxe book of the open-uri library
that will allow me to grab lines from a URL. This, I gather, gives me
something roughly equivalent to piping the output of curl into a ruby
program.

Thanks for the help guys, I think I'm armed with enough to be dangerous
now.

jp


Jeff_Pritchard · 28 March 2006 03:51

...or even just a Ruby example of how to use a URL to get the source of
a web page?

thanks,
jp


James_Britt4 · 28 March 2006 04:05

Jeff Pritchard wrote:

Thanks Gene, but I don't know which end is up with Python. I'm an old C dog trying to learn a new (ruby) trick. Would rather not use up any of my precious few functional neurons with Python.

Anybody have a Ruby example of grabbing html pages from a site map full of links, or an example of using the RubyfulSoup package to grab text pieces from html pages?

I build the RubyStuff.com site by snarfing a series of Cafe Press pages, grabbing out links and product descriptions, and reassembling them into a more cohesive set of pages.

I wrote about it here:

What I like about Mechanize (though this may also be true of RubyfulSoup; I've not used it) is that it makes it easy to take the slurped page parts and create objects that better map to my business logic.

--
James Britt

Judge a man by his questions, rather than his answers.
- Voltaire

Gene_Tani · 28 March 2006 11:33

Jeff Pritchard wrote:

Thanks Gene, but I don't know which end is up with Python. I'm an old C
dog trying to learn a new (ruby) trick. Would rather not use up any of
my precious few functional neurons with Python.

a couple other things to look at:

http://www.linux-magazine.com/issue/51/Ruby_Web_Spiders.pdf

Dimitri_Aivaliotis · 29 March 2006 11:54

Perl??

WWW::Mechanize is a Ruby library, see
http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc and
http://rubyforge.org/projects/wee, as referenced from James' post.

- Dimitri

On 3/29/06, Jeff Pritchard <jp@jeffpritchard.com> wrote:

Thanks to all who responded.

It looks like I could write some tiny perl scripts with mechanize and
pipe their output to a ruby program so that I could do most of the work
in ruby instead of perl. Perl reminds me of work, and that is a bad
thing. Writing perl also doesn't teach me any ruby, which is an
important part of this hack.

Gene_Tani · 28 March 2006 11:18

Jeff Pritchard wrote:

..or even just a Ruby example of how to use a URL to get the source of
a web page?

to get the file but not the style sheets, GIF spacers and all the other
junk:

`curl -O #{oneurl}`

Pistos_Christou1 · 28 March 2006 14:31

Jeff Pritchard wrote:

...or even just a Ruby example of how to use a URL to get the source of
a web page?

If you literally just want the source, you can use:

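require 'open-uri'

# URI.open is the open-uri entry point on current Rubies
# (plain Kernel#open no longer accepts URLs since Ruby 3.0).
html_source = URI.open('http://purepistos.net/diakonos') do |http|
  http.read
end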

But in most cases you don't just want to read it, you want to do stuff
with it, hence we recommend Mechanize and RubyfulSoup.
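
As a rough illustration of "doing stuff with it" for the original goal in this thread (pouring just the text of each page into a template), here is a minimal stdlib-only sketch. It assumes the page source is already in a string such as html_source; TEMPLATE, page_parts and render_page are made-up names for illustration, and the regex-based extraction is a crude stand-in for what Mechanize or RubyfulSoup would do properly.

```ruby
require 'erb'

# Hypothetical template; a real one would live in its own .html file.
TEMPLATE = ERB.new(<<~HTML)
  <html>
    <head><title><%= title %></title></head>
    <body>
      <div id="content"><%= text %></div>
    </body>
  </html>
HTML

# Crude extraction: only suitable for simple, well-formed pages.
# The extracted text keeps its original HTML entities, so it is
# inserted into the template as-is rather than escaped again.
def page_parts(html_source)
  title = html_source[%r{<title[^>]*>(.*?)</title>}mi, 1].to_s.strip
  body  = html_source[%r{<body[^>]*>(.*?)</body>}mi, 1].to_s
  text  = body.gsub(%r{<(script|style)\b.*?</\1>}mi, ' ') # drop scripts and styles
              .gsub(/<[^>]+>/, ' ')                       # drop the remaining tags
              .gsub(/\s+/, ' ').strip                     # collapse whitespace
  [title, text]
end

# Fill the template with the title and text pulled from one page.
def render_page(html_source)
  title, text = page_parts(html_source)
  TEMPLATE.result_with_hash(title: title, text: text)
end

sample = '<html><head><title>About us</title></head>' \
         '<body><h1>About</h1><p>We sell <b>soup</b>.</p></body></html>'
puts render_page(sample)
```

For anything beyond very regular markup, a real HTML parser will be more reliable than these regular expressions.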

Pistos


Daniel_Harple · 29 March 2006 15:28

Perl??

WWW::Mechanize is a Ruby library

It is a port of a Perl library.

I was wondering if anyone could point me to some example code that is
using RubyfulSoup to parse a sitemap to get links to all the pages on
that site and request each page and grab things from it.

Here's an example that finds all the links:

require 'net/http'
require 'rubygems'
require 'rubyful_soup'

uri = URI.parse('http://ruby-lang.org/en/')
soup = BeautifulSoup.new(
  Net::HTTP.get_response(uri).body
)

soup.find_all('a').
  map { |link| link['href'] }.
  reject { |link| link.nil? }

You will get links with a host, and some with just relative/absolute paths. You can do (I didn't test this):

link_uri = URI.parse('./20050820.html')
link_uri.host = uri.host unless link_uri.host

and fetch them. You will also want to check for redirections and other errors. See the [Net::HTTP docs][1] for fetching a page when redirected.

[1]: http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/
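
A possible way to do the two follow-up steps (making the links absolute, then fetching with redirects handled) using only the standard library is sketched below. It swaps the host assignment for URI.join, which resolves a relative href against the full URL of the page it came from, and the redirect handling follows the pattern shown in the Net::HTTP docs. The helper names absolute_links and fetch, the sample hrefs and the redirect limit of 5 are assumptions for illustration, not part of the original post.

```ruby
require 'net/http'
require 'uri'

# Turn the hrefs found on a page into absolute URLs.
# absolute_links is a hypothetical helper name.
def absolute_links(page_url, hrefs)
  hrefs.compact.map do |href|
    begin
      URI.join(page_url, href).to_s
    rescue URI::Error
      nil # skip anything that does not parse as a URI
    end
  end.compact.uniq
end

# Fetch a page body, following at most `limit` redirects.
def fetch(url, limit = 5)
  raise ArgumentError, 'too many HTTP redirects' if limit.zero?

  response = Net::HTTP.get_response(URI.parse(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may itself be relative, so resolve it too.
    fetch(URI.join(url, response['location']).to_s, limit - 1)
  else
    response.value # raises an exception describing the HTTP error
  end
end

hrefs = ['./20050820.html', '/en/news/', 'http://rubyforge.org/', nil]
links = absolute_links('http://ruby-lang.org/en/', hrefs)
puts links
# links.each { |link| puts fetch(link).length } # uncomment to hit the network
```

Links such as mailto: addresses pass through URI.join unchanged, so a real spider would probably also filter on the scheme and host before fetching.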

Regards,
-- Daniel

On Mar 29, 2006, at 1:54 PM, Dimitri Aivaliotis wrote:
On 3/26/06, Jeff Pritchard <jp@jeffpritchard.com> wrote:

Jeff_Pritchard · 29 March 2006 15:52

You guys are great!

Yeah, on mechanize, I googled it and landed on the perl variety. Didn't
realize it has been ported to ruby.

Looks like I have all the tools I need now. Just have to roll up my
sleeves and "play" with them.

thanks,
jp


James_Britt4 · 29 March 2006 16:00

Daniel Harple wrote:

On Mar 29, 2006, at 1:54 PM, Dimitri Aivaliotis wrote:

Perl??

WWW::Mechanize is a Ruby library

It is a port of a Perl library.

But I believe certain niceties were added, so it may not be exactly the same. (I know Michael Neumann added a feature I submitted.)

--
James Britt

"Blanket statements are over-rated"
