Screen scraping via regex vs. htmltools (vs. REXML)

I've finally reimplemented the screen scraper I mentioned on
<http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/76e8bbd4a9e48277/396cb7ea35eab14f#396cb7ea35eab14f>
using regexes and no external libraries. It is, as Daz suggested, many
times faster than REXML. My question is whether it would be smarter
(faster? easier to code?) to use htmltools or HTMLTree::Parser
instead.

Any other comments on ways to make the code faster, cleaner, and more
Ruby-like? Finally, can you please tell me why I can't get strip to
work if I comment out the last gsub! in table_clean and uncomment the
e.strip line below it? (It doesn't remove the leading space in the
second element of the last 6 lines.) By contrast, that gsub! does
exactly what I want.

Thanks very much in advance for any advice you can offer on which tools
to use.

# The program parses out all of the rows and then looks
# for the right kinds of cells inside. It constructs
# two two-dimensional arrays of the results.

require 'logger'
require 'mechanize'

agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
page = agent.get('http://www.dankohn.com/uamileage.html').body

def table_clean(table)
  table.each { |row|
    row.each { |e|
      e.gsub!(/<.*?>|&nbsp;/m, "")  # strip tags and &nbsp; entities
      e.gsub!(/\s+/, " ")           # collapse whitespace runs to single spaces
      e.gsub!(/(^\s|\s$)/, "")      # trim leading and trailing whitespace
      #~ e.strip
    }
  }
end

miletable = []
summarytable = []
row = /<tr>(.*?)<\/tr>/m  # capture the contents of each table row
milecells = /
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?>(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>
  /mx
summarycells = /
  <td.*?class="t3".*?>(.*?)<\/td>\s*
  <td.*?class="t3".*?>(.*?)<\/td>
  /mx
activitycells = /
  <td.*?class="t4".*?>(.*?)<\/td>\s*
  <td.*?colspan=("4"|4).*?>(.*?)<\/td>  # $2 absorbs the quoted-or-bare colspan, so the text is $3
  /mx
page.scan(row) { |e|
  rowtext = e.to_s
  rowtext.scan(milecells) {
    miletable << [$1, $2, $3, $4, $5]
  }
  rowtext.scan(summarycells) {
    summarytable << [$1, $2]
  }
  rowtext.scan(activitycells) {
    summarytable << [$1, $3]
  }
}
table_clean(miletable)
table_clean(summarytable)
miletable.each {|e| print e.join(":"),"\n"}
summarytable.each {|e| print e.join(":"),"\n"}

          - dan


--
Dan Kohn <mailto:dan@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>

Dan Kohn wrote:

I've finally reimplemented the screen scraper I mentioned on
<http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/76e8bbd4a9e48277/396cb7ea35eab14f#396cb7ea35eab14f>
using regexes and no external libraries. It is, as Daz suggested, many
times faster than REXML. My question is whether it would be smarter
(faster?, easier to code?) to use htmltools or HTMLTree::Parser
instead.

The code in your post already uses Mechanize. If you are fetching the HTML with agent.get, then the page has already been parsed using htmltools and REXML. You can register callback objects that are invoked when the parsing process encounters matching nodes. Mechanize does this automatically for certain nodes (form stuff, I think), but you can assign a hash to watch_for_set= to define your own set of nodes to watch for.
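
A rough sketch, from memory (check the Mechanize docs for the exact constructor signature; I believe the watcher class is handed the parsed REXML node, and the instances turn up in page.watches):

require 'mechanize'

# Hypothetical watcher class; Mechanize builds one instance per
# matching node while it parses the page.
class CellWatcher
  attr_reader :node

  def initialize(node)
    @node = node
  end

  def text
    @node.texts.join(' ')  # REXML::Element#texts
  end
end

agent = WWW::Mechanize.new
agent.watch_for_set = { 'td' => CellWatcher }
page = agent.get('http://www.example.com/')
page.watches['td'].each { |cell| puts cell.text }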

This is what I use to construct the product pages for rubystuff.com from the multiple CafePress pages that contain the images, prices, and product descriptions. I tell Mechanize to watch for img, tr, and td elements, and it constructs sets of custom objects from just the parts of the source HTML matching certain criteria. Then I extract the data, create RSS feeds, and turn those into a set of aggregated HTML pages.

What I like about this is that the parse process gives me business objects, with (hopefully) self-explanatory behavior. For example, I can ask one of these objects for 'product_id' or 'description'; the object encapsulates the assorted XPath/regex code needed to get that from the source HTML node, making the main part of the app easier to maintain.
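
As a made-up illustration (not my production code, and the regex details are invented), one of those wrappers looks something like this:

# Wraps one parsed HTML node; callers ask for business-level
# attributes instead of poking at the markup themselves.
class ProductNode
  def initialize(node)
    @node = node
    @html = node.to_s
  end

  def product_id
    # e.g. dig an id out of an image URL (pattern is illustrative)
    @html[/prod(\d+)/, 1]
  end

  def description
    @html.gsub(/<.*?>/m, ' ').squeeze(' ').strip
  end
end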

James Britt


--

http://www.ruby-doc.org - Ruby Help & Documentation
Ruby Code & Style: Writers wanted
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools

Thanks for the response, James. My next question was actually about
debugging Mechanize
<http://groups.google.com/group/comp.lang.ruby/msg/04fc7473b08c16fc>.
Would you mind emailing me your scraping code, as I've been suffering
from a lack of examples to copy?

Also, are you sure Mechanize parses the whole page with get? It
doesn't wait for a find?

          - dan


--
Dan Kohn <mailto:dan@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>

"James Britt" <james_b@neurogami.com> wrote in message

This is what I use to construct the product pages for rubystuff.com from

Any chance you could make that code available? Sounds like a useful example.

Is Mechanize also a good option for writing acceptance tests, compared to
Watir?

Thanks.

Dan Kohn wrote:

Thanks for the response, James. My next question was actually about
debugging Mechanize
<http://groups.google.com/group/comp.lang.ruby/msg/04fc7473b08c16fc>.
Would you mind emailing me your scraping code, as I've been suffering
from a lack of examples to copy?

Also, are you sure Mechanize parses the whole page with get? It
doesn't wait for a find?

I don't think so, but I might be wrong. My code calls agent.get, then goes right into looping over the collected nodes.

I'll see about putting my code together as an example.

As for debugging Mechanize, I've found it helpful to go to the lib source and stick in some STDERR.puts calls to inspect request and response data to be sure things are getting passed around as expected.
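
The sort of thing I mean (shown here as a standalone Net::HTTP example; inside the Mechanize source you'd use whatever request and response objects are in scope at that point):

require 'net/http'
require 'uri'

uri = URI.parse('http://www.example.com/')
response = Net::HTTP.get_response(uri)

# Temporary tracing of the kind I drop into the lib while debugging:
STDERR.puts "REQUEST URI: #{uri}"
STDERR.puts "RESPONSE:    #{response.code} #{response.message}"
STDERR.puts "BODY LENGTH: #{response.body.length}"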

After that, unit tests are helpful.

James


--

http://www.ruby-doc.org - Ruby Help & Documentation
Ruby Code & Style: Writers wanted
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools

itsme213 wrote:

"James Britt" <james_b@neurogami.com> wrote in message

This is what I use to construct the product pages for rubystuff.com from

Any chance you could make that code available? Sounds like a useful example.

Is Mechanize also a good option for writing acceptance tests, compared to Watir?

WATIR exposes the HTML DOM as seen by IE, which is not the raw HTML source returned from the server (but perhaps someone more up on the latest WATIR knows otherwise). Mechanize will get you the source HTML, albeit sanitized for REXML parsing.

I find WATIR most useful for walking through a series of pages where automated typing and clicking are essential. Pretty much every Web app I've written in the last 9 months uses WATIR (plus my own custom DSL on top of it) for functional testing. Major time saver.
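
For instance (classic Watir::IE API; the page URL and field names are made up):

require 'watir'

# Drive a real IE instance through a login form.
ie = Watir::IE.start('http://www.example.com/login')
ie.text_field(:name, 'username').set('dan')
ie.text_field(:name, 'password').set('secret')
ie.button(:value, 'Log In').click
puts(ie.contains_text('Welcome') ? 'logged in' : 'login failed')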

I use Mechanize for data snarfing and occasional feed building.

James Britt


--

http://www.ruby-doc.org - Ruby Help & Documentation
Ruby Code & Style: Writers wanted
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools