Scraping websites

Kev_Jackson2 · 24 March 2006 08:20

Hi

I'm considering a side-project where I'd have to scrape a couple of websites and aggregate the results. So far I've done a couple of experiments with html-parser, but I'm not really happy with it. I'd hate to just throw regexps at the html, so I'm really looking for an elegant way to select the correct data from the page.

Any ideas on what gems to use?

Thanks
Kev

Alder_Green · 24 March 2006 08:24

Hey Kev

Take a look at Rubyful Soup:

Very easy to parse X/HTML source, including the prevalent
not-strictly-correct-and-even-somewhat-corrupt sort.

Regards,
Alder

lg1 · 24 March 2006 10:43

I've had good experiences with Watir (http://wtr.rubyforge.org/ and
http://www.mjtnet.com/watir_webrecorder.htm), but it's a Windows-only
thing.

WWW::Mechanize is also very usable, and it's cross-platform
(http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc).

hope this helps,

lars

alex_f_il · 28 March 2006 04:38

Look at SWExplorerAutomation (www.webunittesting.com)
The program creates an automation API for any Web application which
uses HTML and DHTML and works with Microsoft Internet Explorer. The Web
application becomes programmatically accessible from any .NET language.

SWEA API provides access to Web application controls and content. The
API is generated using SWEA Visual Designer. SWEA Visual Designer helps
create programmable objects from Web page content.

Kev_Jackson2 · 24 March 2006 10:42

Alder Green wrote:

Hey Kev

Take a look at Rubyful Soup:

Rubyful Soup: "The brush has got entangled in it!"

Very easy to parse X/HTML source, including the prevalent
not-strictly-correct-and-even-somewhat-corrupt sort.

I've installed it as a gem and now I'm getting "uninitialized constant BeautifulSoup" errors:

require 'net/http'
require 'rubygems'
require_gem 'rubyful_soup'

class BBCScrape
  def read
    Net::HTTP.start("news.bbc.co.uk", 80) do |h|
      response = h.get("/sport1/hi/football/eng_prem/fixtures/default.stm")
      #p response
      s = BeautifulSoup.new response.body # <- fails
      p s.find_all('div', :attrs => { 'class' => 'mvb' })
    end
  end
end

I'm not sure what I'm doing wrong, and none of the documentation covers gem usage.

Sorry if this is me being thick at the end of a Friday...

Thanks
Kev

Kev_Jackson2 · 24 March 2006 10:56

lg wrote:

I've had good experiences with Watir (http://wtr.rubyforge.org/ and
http://www.mjtnet.com/watir_webrecorder.htm), but it's a Windows-only
thing.

I know how to use Watir. It's great for driving a browser, but that's not the right approach for this application. I want to request a page from a remote source and extract data from it. Rubyful Soup seems like the way to go, but for some reason I'm having difficulty with the code.

Thanks
Kev
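
For readers following along, the "select elements instead of matching regexps" idea can be sketched with REXML, which ships with Ruby. This is only an illustration and not the Rubyful Soup API: the sample markup, the fixture text, and the helper name div_texts are made up, and REXML needs well-formed markup, which real pages often are not (hence the Rubyful Soup suggestion).

```ruby
require 'rexml/document'

# Made-up, well-formed sample markup standing in for a fetched page body.
SAMPLE_PAGE = <<~HTML
  <html>
    <body>
      <div class="mvb">Arsenal v Chelsea</div>
      <div class="other">Advert</div>
      <div class="mvb">Everton v Fulham</div>
    </body>
  </html>
HTML

# Hypothetical helper: return the text of every <div> whose class
# attribute is exactly css_class, using an XPath query rather than
# a regular expression.
def div_texts(markup, css_class)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//div[@class='#{css_class}']").map do |div|
    div.text.strip
  end
end

p div_texts(SAMPLE_PAGE, 'mvb')
```

In a real scraper the markup argument would be the response body returned by Net::HTTP, provided the page parses as XML.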

James_Britt4 · 24 March 2006 16:51

lg wrote:

I've had good experiences with Watir (http://wtr.rubyforge.org/ and
http://www.mjtnet.com/watir_webrecorder.htm), but it's a Windows-only
thing.

WWW::Mechanize is also very usable, and it's cross-platform
(http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc).

This may help you with Mechanize:

···

--
James Britt


Kev_Jackson2 · 28 March 2006 06:48

Look at SWExplorerAutomation (www.webunittesting.com)
The program creates an automation API for any Web application which
uses HTML and DHTML and works with Microsoft Internet Explorer. The Web
application becomes programmatically accessible from any .NET language.

SWEA API provides access to Web application controls and content. The
API is generated using SWEA Visual Designer. SWEA Visual Designer helps
create programmable objects from Web page content.

Perhaps you've misunderstood my intentions. I want to scrape a website (BBC News for example) and extract some data from the HTML returned. I want to use Ruby to do this and I also want to avoid using regular expressions to manually parse the HTML myself.

Forgive me if I'm wrong, but your response seems to be an advert for an automation product for .Net.

Someone else has already suggested RubyfulSoup which I've had some success with and I'm moving ahead with this for now.

Kev

Ross_Bamford4 · 24 March 2006 12:12

See this post from yesterday:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185578

but watch out for the typo (I meant "*doesn't* (by default) actually
require anything")

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

alex_f_il · 28 March 2006 12:28

1. You don't have to use regular expressions to extract data. SWEA
works with XML and has XpathDataExtractor and TableDataExtractor to
simplify data extraction. You can define the extraction rules visually
with them.

2. You can use Ruby.Net for automation scripts, and I like .Net.

3. SWEA supports frames, JavaScript, popup windows, Windows and HTML
dialog boxes, and file and image downloads with cookies, etc. SWEA can
also work from a Windows service account.

SWEA has been used in many data scraping solutions with great
success. Look at SWJobSearch. I wrote it in a few days. Try to
write it using RubyfulSoup.

Good luck with RubyfulSoup!

Daniel_Harple · 24 March 2006 12:17

Also, check out this recent thread:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185581

-- Daniel

Kev_Jackson2 · 27 March 2006 01:14

See this post from yesterday:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185578

but watch out for the typo (I meant "*doesn't* (by default) actually
require anything")

Thanks! Just tried it and it worked perfectly. I've always used

require 'rubygems'
require_gem 'x'

before, so now I know if that fails to try:

require 'rubygems'
require 'x'

Again, thanks for the help in resolving this. Learnt something today, so it's not a wasted day.

Kev
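
The "try a plain require" approach above can be sketched as a small helper that reports whether a library could be loaded. This is a hedged illustration only: try_require is a hypothetical name, 'json' is a stand-in for whichever gem is being loaded, and the second library name is deliberately made up so that the load fails. Note also that require_gem no longer exists in current RubyGems, and that on Ruby 1.9 and later RubyGems is loaded by default, so a plain require is usually all that is needed.

```ruby
# Hypothetical helper: attempt to load a library by name and return
# true on success, false if Ruby cannot find it.
def try_require(name)
  require name
  true
rescue LoadError
  false
end

# 'json' stands in for a real gem here; the second name is made up
# on purpose to show the failure path.
puts try_require('json')
puts try_require('no_such_library_for_this_example')
```

Rescuing LoadError keeps the failure explicit at the require line, instead of surfacing later as an "uninitialized constant" error like the one earlier in the thread.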
