Performance comparison between screen scrapers

Conrad_Chu · 11 January 2007 08:36

Does anyone know how the following screen scrapers perform against one
another?

* ScrAPI
* RubyfulSoup
* HTree
* Hpricot

I'm trying to write a tool where a person enters a URL, and I use
an AJAX call to scrape the contents of that URL for title, description,
etc., so speed is really important. (I suppose regular expressions would
be the fastest, but I need something that is tree-based and supports
HTML tidying.)

Thanks
Conrad


Jan_Svitok · 11 January 2007 09:51

There was a comparison done on this list some time ago. Search the archives for the library names.

Ross_Bamford2 · 11 January 2007 10:25

I don't know about ScrAPI or HTree, but I recently blogged an informal benchmark run between Rubyful Soup, Hpricot, and the (still developmental) libxml2 HTML parser binding in Libxml-ruby. It's at:

http://cloverhead.blogspot.com/2006/12/bit-of-benchmarking.html

--
Ross Bamford - rosco@roscopeco.remove.co.uk
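
An informal benchmark like the one mentioned above can be sketched with Ruby's standard Benchmark module. This is only a rough harness under stated assumptions: the sample HTML, the `regex_extract` helper, the `CANDIDATES` table, and `time_candidates` are hypothetical names made up for illustration, and the only candidate included is a regex baseline so the script runs without any gems. Each real library (Hpricot, Rubyful Soup, and so on) would be added as another callable that takes the HTML string.

```ruby
require 'benchmark'

# Sample document (an assumption); a real run should use pages fetched
# from the kinds of sites the tool will actually scrape.
SAMPLE_HTML = <<~HTML
  <html>
    <head>
      <title>Example page</title>
      <meta name="description" content="A short description.">
    </head>
    <body><p>Hello</p></body>
  </html>
HTML

# Regex baseline: quick, but not tree-based and not tolerant of
# unusual attribute order or broken markup.
def regex_extract(html)
  title = html[/<title[^>]*>(.*?)<\/title>/mi, 1]
  description = html[/<meta\s+name=["']description["']\s+content=["'](.*?)["']/mi, 1]
  { title: title && title.strip, description: description }
end

# Hypothetical table of candidates: name => callable taking an HTML string.
# Parser-based extractors would be added here once their gems are installed.
CANDIDATES = {
  'regex baseline' => method(:regex_extract)
}

# Runs every candidate `iterations` times and returns name => elapsed seconds.
def time_candidates(candidates, html, iterations)
  candidates.each_with_object({}) do |(name, extractor), results|
    results[name] = Benchmark.realtime do
      iterations.times { extractor.call(html) }
    end
  end
end

time_candidates(CANDIDATES, SAMPLE_HTML, 1_000).each do |name, seconds|
  puts format('%-15s %.4fs', name, seconds)
end
```

Timings from a harness like this depend heavily on the input pages, so it is worth running it on several real documents, including malformed ones, before drawing conclusions.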

Interfecus · 13 January 2007 13:15

I haven't used them all, but Hpricot is fast (its parser is written in C
with Ragel), error-tolerant, and perfect for this task. Take a look at
its website for a guide on how to use it.
