Ruby screen scraping

Hi,

I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page,
so basically telling us whether the build succeeded or failed.

Does anyone have any opinions on what might be the best way to approach
this task? I've been looking at a number of different packages, including
HTree.

Thanks


--
Posted via http://www.ruby-forum.com/.

Hi,


On 11/19/06, Chris Gallagher <cgallagher@gmail.com> wrote:

I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page,
so basically telling us whether the build succeeded or failed.

If you want screen scraping, I would tell you to look at why's
excellent Hpricot HTML parser. It's really simple to use and very
effective.

http://code.whytheluckystiff.net/hpricot/
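For your case, a minimal sketch could look something like the following. The
URL and the CSS selector are only guesses on my part -- check the markup of
your CruiseControl page for the element that actually carries the build
status:

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Hypothetical URL and selector -- adjust both to your CruiseControl page.
doc = Hpricot(open("http://localhost:8080/"))
status = (doc/"td.buildresults").first
puts status ? status.inner_text.strip : "could not find the status element"

From there it is just a string check for "success" or "failed".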

Cheers,
Alvim.

For HTML scraping I recommend scrAPI.

gem install scrapi

homepage:
http://blog.labnotes.org/category/scrapi/

Example scraper:

Scraper.define do
  attr_accessor :title, :author, :pub_date, :content

  process "div#GuardianArticle > h1", :title => :text
  process "div#GuardianArticle > font[size=2] > b" do |element|
    @author = element.children[0].content
    @pub_date = element.children[2].content.strip
  end
  process "div#GuardianArticleBody", :content => :text
end
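To actually run it, you would assign the result of Scraper.define to a
constant and then (if I remember the scrAPI API correctly -- untested, so
double-check against the scrapi docs) call scrape on it with a URI:

require 'rubygems'
require 'scrapi'
require 'uri'

# Hypothetical: GuardianArticle is the constant holding the Scraper.define
# result from above, and the URL is just a placeholder.
article = GuardianArticle.scrape(URI.parse("http://example.com/article.html"))
puts article.title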


--
Posted via http://www.ruby-forum.com/.

Chris Gallagher wrote:

Hi,

I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page,
so basically telling us whether the build succeeded or failed.

Once you have the page (open-uri if you know the URL exactly, or
WWW::Mechanize if you need to navigate there, i.e. fill text fields,
click buttons etc.), I recommend checking out these possibilities:

1) regular expressions
2) Hpricot
3) scrAPI
4) Rubyful Soup

Regular expressions would be the most old-school solution; in some cases
such a wrapper is the most robust (but since, as I understood, you are in
control of the generated page, robustness is probably not an issue).

If you can't do it with regexps, Hpricot will most probably be adequate
(I would need to see the concrete page).

Finally, if neither of the above works, you should try scrAPI; and though
I don't think you will get stuck even at that point, Rubyful Soup is
another possibility to check out.
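Just to show how small option 1) can be for your case, something like this
might already do it (the URL and the 'BUILD SUCCESSFUL' marker are
assumptions -- look at the source of your CruiseControl page to see what it
really prints):

require 'open-uri'

# Hypothetical URL and marker text for the CruiseControl status page.
html = open("http://localhost:8080/").read
if html =~ /BUILD SUCCESSFUL/i
  puts "build succeeded"
else
  puts "build failed (or the marker text has changed)"
end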

Peter


__
http://www.rubyrailways.com

Chris,

There are many ways to accomplish this, as others have pointed out. When I
approached a similar task three years ago, I was working in the Java world
and would have loved to have some of the tools available for Ruby today.
However, I believe that the technique I used still has merit in some
situations today.

I was screen scraping realtor sites for data (to find the perfect house)
because I was dissatisfied with the searching and data-mining capabilities
of the sites. I was mining multiple sites, so the technique had to be
flexible but also resilient, because I did not control the source sites (and
they would often change their layout). My first attempt used XPaths to try
to get to the data; however, that was futile, since developers would often
change a site's layout, and even small changes would break the logic (e.g.
changing the nesting of tables, or adding styling around data).

After taking a step back and considering the situation from a fresh
perspective, I scrapped the idea of using XML-style data location in
something that seemed too fluid, too fragile.

My second approach was much more resilient: I used simple regular
expressions to zoom in and find the data. After studying the source HTML, I
was able to discover a way to get to any data easily on the sites I was
working on.

The basic approach was this:

1) I would use a regular expression to search the HTML for something that
got me close to the data, something that seemed consistent and unlikely to
change (a reference point).

2) I would then extract a reasonable number of characters before and/or
after the reference point, based on where the data is located. It is not
necessary to know exactly where; just gather a conservative amount beyond
what you think you need.

3) Repeat from step 1 if needed, or use a regular expression to extract the
desired data from the subsection extracted in step 2.
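In Ruby (my code was Java, so this is just a rough sketch of the idea, with
an invented reference string and pattern), the core of it might look like
this:

# Find a stable reference point, grab a generous window of characters
# after it, then use a second regexp to pull the data out of that window.
def extract_near(html, reference, window_size, pattern)
  anchor = html.index(reference) or return nil
  window = html[anchor, window_size]
  window[pattern, 1]
end

html  = File.read("listing.html")                # or fetched with open-uri
price = extract_near(html, "Asking price", 300, /\$([\d,]+)/)
puts price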

I wrapped these basic ideas into a few simple methods to make them easy to
use, and it turned out to be a very successful approach. I could add new
sites pretty easily, and it proved to be a very robust technique, very
forgiving of changes to the surrounding fluff HTML. It was usually easy to
find a consistent reference point in the HTML, and once there the data was
close by, so I'd extract a healthy chunk and then it was easy to search
within this smaller amount of data. Use logging for each step to help you
while you are fine-tuning the approach. Once I switched over to this
approach, I never had to revisit the code after setting it up for a site; it
just worked. Low-tech, simple, but surprisingly effective.

Of course, after many months of daily operation mining the realtor sites, I
eventually found the perfect house and abandoned the code; it had served its
purpose well. So I don't have anything concrete to offer you (and it was in
Java), but if any of the methods mentioned by the others don't quite meet
your needs or end up being too fragile, you might consider a variation on
this approach for your own data extraction. It is especially flexible for
scraping sites that tend to vary over time. In your case it sounds like you
have control over the source, so many methods would work for you; just don't
forget that the markup may vary over time if you ever upgrade CruiseControl.

Hope it helps you or others that are pursuing this task!

Blessings,

Jeff Barczewski
MasterView project developer, http://masterview.org/
Inspired Horizons Training and Consultancy http://inspiredhorizons.com/


On 11/19/06, Chris Gallagher <cgallagher@gmail.com> wrote:

Hi,

I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page,
so basically telling us whether the build succeeded or failed.

Does anyone have any opinions on what might be the best way to approach
this task? I've been looking at a number of different packages, including
HTree.

Thanks guys, I'll look into both of them.

Another question I have is: how would I then get this scraped info
inserted into a MySQL database called, say, "build" and a table called
"results"?

For now, could you base answers on the following HTree code?

require 'open-uri'
require 'htree'
require 'rexml/document'

url = "http://www.google.com/search?q=ruby"
open(url) {
|page| page_content = page.read()
doc = HTree(page_content).to_rexml
doc.root.each_element('//a[@class="l"]') {
        |elem| puts elem.attribute('href').value }
}

which is returning a result of:

C:\>ruby script2.rb
http://www.ruby-lang.org/
http://www.ruby-lang.org/en/20020101.html


http://www.rubycentral.com/
http://www.rubycentral.com/book/


http://www.w3.org/TR/ruby/
http://poignantguide.net/
http://www.zenspider.com/Languages/Ruby/QuickRef.html

Cheers.


--
Posted via http://www.ruby-forum.com/.

Chris Gallagher wrote:

Thanks guys, I'll look into both of them.

Another question I have is: how would I then get this scraped info
inserted into a MySQL database called, say, "build" and a table called
"results"?

For now, could you base answers on the following HTree code?

require 'open-uri'
require 'htree'
require 'rexml/document'

url = "ruby - Google Search;
open(url) {
|page| page_content = page.read()
doc = HTree(page_content).to_rexml
doc.root.each_element('//a[@class="l"]') {
        |elem| puts elem.attribute('href').value }
}

Something along the lines of

require "mysql"

dbh = Mysql.real_connect("localhost", "chris", "", "build")
dbh.query("
      INSERT INTO results
      VALUES (whatever)")

Cheers,

Peter


__
http://www.rubyrailways.com

Thanks for the help.

I'll get on with it and see how it goes :)


--
Posted via http://www.ruby-forum.com/.

OK, here is the full code:

require 'open-uri'
require 'htree'
require 'rexml/document'
require 'mysql'

url = "http://www.google.com/search?q=ruby"
results = []

open(url) {
|page| page_content = page.read()
doc = HTree(page_content).to_rexml
doc.root.each_element('//a[@class="l"]') {
        |elem| results << elem.attribute('href').value }

dbh = Mysql.real_connect("localhost", "peter", "****", "build")

results.each do |result|
   dbh.query("INSERT INTO result VALUES ('#{result}')")
end
}
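One caution: interpolating the scraped value straight into the SQL will
break as soon as a URL contains a quote character. With the mysql gem you
can escape the value first, something like this (untested):

safe = Mysql.escape_string(result)
dbh.query("INSERT INTO results VALUES ('#{safe}')")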

HTH,

Peter


__
http://www.rubyrailways.com

Wow, thanks for that code.

One question, though: does the name of the field in the table into which
the scraped information is going to be inserted need to be specified in
the code? Or is it already, and I'm missing something here?


--
Posted via http://www.ruby-forum.com/.

Chris Gallagher wrote:

Wow, thanks for that code.

Welcome :)

One question, though: does the name of the field in the table into which
the scraped information is going to be inserted need to be specified in
the code? Or is it already, and I'm missing something here?

My code assumed that the table has a single column (e.g. 'url' in this case)
and that the values are inserted into that column.

Otherwise if you have more columns, you can do this:

INSERT INTO people
(name, age) VALUES('Peter Szinek', '23' ).

You can do

INSERT INTO people VALUES('Peter Szinek', '23' )

as well, but in this case you have to be sure that the columns in your
DB are in the same order as in your insert query. In the first example
you don't have to care about the column ordering in the DB, as long as
the mapping between the column names (first pair of parentheses) and the
values (second pair) is OK.

HTH,
Peter


__
http://www.rubyrailways.com

Ah, that's great.

Thanks again for your help :)


--
Posted via http://www.ruby-forum.com/.

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of class attributes on tags, and
other attributes like that. My question is: how would I modify the code in
order to get it to capture, say, a block of text such as:

<p>this is text that i want to scrape</p>

any ideas?

thanks.


--
Posted via http://www.ruby-forum.com/.

Chris Gallagher wrote:

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of class attributes on tags, and
other attributes like that. My question is: how would I modify the code in
order to get it to capture, say, a block of text such as:

<p>this is text that i want to scrape</p>

Hmm, this is hard to tell just from this example. If you need ALL the <p>s,
they can be queried with this XPath:

//p

I am not sure what you are using now, but in Hpricot this would be:

doc = Hpricot(open("http://stuff.com/"))
results = doc/"//p"

If you are still using HTree, query the same XPath there for the same results.
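Either way, once you have the <p> elements, getting at their text is a
one-liner; with Hpricot it should be something like:

# inner_text returns the element's text with the markup stripped
results.each { |para| puts para.inner_text }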

However, I guess you want something more sophisticated than ALL the
<p>s. Well, this is where the trouble begins with screen scraping: you
need to figure out rules which extract *exactly* what you want. Usually it
is not that hard to come up with rules that extract a bit more or a bit
less, but it is much harder to find exactly the right ones...

To solve this problem, you need to tell us what you want, i.e. an
example page and the set of objects you would like to extract.

Cheers,
Peter


__
http://www.rubyrailways.com

Chris Gallagher wrote:

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of class attributes on tags, and
other attributes like that. My question is: how would I modify the code in
order to get it to capture, say, a block of text such as:

<p>this is text that i want to scrape</p>

any ideas?

Really simple:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

Returns an array, each cell of which is a paragraph from the original page.

This is why it is a bad idea to adopt a package or library to accomplish
something that is easier to accomplish with a few lines of code, or even
one line as in this case.

At first the library seems as though it can do anything, with no need to
understand what is actually going on. Pretty quickly you encounter
something the library cannot do, and you have to ... understand what is
going on. Then you abandon the library and write normal code.

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.


--
Paul Lutus
http://www.arachnoid.com

I hope you're not arguing that HTML should be parsed with simple regular expressions instead of a real parser. I think most would agree with me when I say that strategy seldom holds up for long.

James Edward Gray II


On Nov 19, 2006, at 8:50 PM, Paul Lutus wrote:

Chris Gallagher wrote:

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of class attributes on tags, and
other attributes like that. My question is: how would I modify the code in
order to get it to capture, say, a block of text such as:

<p>this is text that i want to scrape</p>

any ideas?

Really simple:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

Returns an array, each cell of which is a paragraph from the original page.

This is why it is a bad idea to adopt a package or library to accomplish
something that is easier to accomplish with a few lines of code, or even
one line as in this case.

At first the library seems as though it can do anything, with no need to
understand what is actually going on. Pretty quickly you encounter
something the library cannot do, and you have to ... understand what is
going on. Then you abandon the library and write normal code.

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.

Please note that the P end tag isn't required in HTML 4.01:
http://www.w3.org/TR/html4/struct/text.html#h-9.3.1
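So a scan that wants to cope with such pages has to treat the next <p> (or
the end of the input) as a terminator too -- something along these lines
(untested), which is already getting hairy:

array = page_content.scan(%r{<p[^>]*>(.*?)(?=<p[\s>]|</p>|\z)}mi).flatten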


On 20/nov/06, at 03:50, Paul Lutus wrote:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

--
Gabriele Marrone

James Edward Gray II wrote:

Chris Gallagher wrote:

OK, that code all works great, but I have one last question :)

/ ...

My question is: how would I modify the code in order to get it to
capture, say, a block of text such as:

<p>this is text that i want to scrape</p>

any ideas?

Really simple:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

Returns an array, each cell of which is a paragraph from the
original page.

/ ...

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.

I hope you're not arguing that HTML should be parsed with simple
regular expressions instead of a real parser. I think most would
agree with me when I say that strategy seldom holds up for long.

That depends on the complexity of the problem to be solved, and the
reliability of the source page's HTML formatting.

For a page that can pass validation of one kind or another or that is XHTML,
the simplest kinds of parsers provide terrific results. For legacy pages
and those that can be expected to have "relaxed" syntax, more robust
parsers are required.

But I must say I regularly see requests here for parsers that can be
expected to do anything, when as often as not, IMHO, such a library
represents too much complexity for the majority of routine HTML/XML parsing
tasks involving Web pages and documents that are generated, not
hand-written.

This thread is an example. It began with the generic equivalent of "Is
there a library that can ...", followed almost immediately by "Great! But
how do I make it do this ...", requesting a really trivial extraction step
that can be accomplished in a single line of Ruby.

I find this rather ironic, since Ruby is meant to provide an easy way to
create solutions to everyday problems. One then sees a blizzard of
libraries whose purpose is to shield the user from the complexities of the
language, such that the remedy is often more complex than the problem
it is meant to solve.

In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but apparently
never considered writing code to solve the problem directly. After choosing
a library, the OP realized he didn't see an obvious way to solve the
original problem -- extracting specific content from the source pages.

As to modern XHTML Web pages that can pass a validator, I know from direct
recent experience that they yield to the simplest parser design, and can be
relied on to produce a tree of organized content, stripped of tags and
XHTML-specific formatting, in a handful of lines of Ruby code. It is hard
to justify bringing out the big guns for a task like this, when one could
instead use a small self-documenting routine such as I suggested.
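For example, for a well-formed XHTML page, plain REXML from the standard
library is already enough to walk the text content (a sketch, assuming the
page really does parse as XML; the URL is a placeholder):

require 'open-uri'
require 'rexml/document'

doc = REXML::Document.new(open("http://localhost:8080/page.xhtml").read)
doc.elements.each("//p") { |para| puts para.texts.join.strip }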

In the bad old days of assembly and comparatively heavy, inflexible
languages like C, C++ and the like, it is easy to see why people would be
motivated to create specialized libraries to solve generic problems just
once for all time. In fact, the argument can be made that Ruby is just such
a library of generics, broadly speaking an extension/amplification of the
STL project.

Now we see people writing easy-to-use application libraries, each composed
using the easy-to-use Ruby library, but that are sometimes harder to sort
out, or make practical use of, than a short bit of code would have been.

Lest my readers think I am going overboard here on a topic dear to my heart,
let me quote the OP once again:

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of class attributes on tags, and
other attributes like that. My question is: how would I modify the code in
order to get it to capture, say, a block of text such as:

<p>this is text that i want to scrape</p>

any ideas?

In other words, after choosing a library and playing with it for a while, he
found himself back at square one, unable to solve the original problem.

To quote one of my favorite authors (William Burroughs), it seems people are
busy inventing cures for which there are no diseases.


On Nov 19, 2006, at 8:50 PM, Paul Lutus wrote:

--
Paul Lutus
http://www.arachnoid.com

Hola,

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.

I hope you're not arguing that HTML should be parsed with simple regular
expressions instead of a real parser. I think most would agree with me
when I say that strategy seldom holds up for long.

I could not agree more with James here. HTML scraping is one of the most
tedious tasks these days. Paul, how far would your scraper get with this
'HTML':

<p>This is a para.
<b/>
<p>This is another...

With Hpricot, this code

require 'rubygems'
require 'hpricot'

doc = Hpricot(open("1.html").read)
results = doc/"//p"

works without any problems.

Of course I absolutely understand your viewpoint, but messed up HTML, as
you have seen, can make a real difference...

Peter


__
http://www.rubyrailways.com

Gabriele Marrone wrote:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

Please note that the P end tag isn't required in HTML 4.01:
http://www.w3.org/TR/html4/struct/text.html#h-9.3.1

Yes, I've just been converting all my site pages to XHTML, so I encountered
this difference big-time. My solution made some assumptions, one being the
OP's request --

My question is: how would I modify the code in order to get it to
capture, say, a block of text such as:

<p>this is text that i want to scrape</p>

any ideas?

-- was based on his knowledge that the pages in fact contained paragraphs
enclosed by <p> ... </p>.

The other assumption I made was based on context -- it seems the pages in
question are machine-generated, so presumably can be relied on to have
consistent syntax.


On 20/nov/06, at 03:50, Paul Lutus wrote:

--
Paul Lutus
http://www.arachnoid.com