Ruby screen scraping

Peter Szinek wrote:

/ ...

Of course I absolutely understand your viewpoint, but messed up HTML, as
you have seen, can make a real difference...

I agree completely (see my other post on this topic), but it appears the OP
was trying to read machine-generated Web content, presumably with reliable
syntax.

···

--
Paul Lutus
http://www.arachnoid.com

In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but apparently
never considered writing code to solve the problem directly.

Starting out by looking for a library that does the hard work for you is a good first step, I would say. Do we really want to be discouraging that?

As to modern XHTML Web pages that can pass a validator, I know from direct
recent experience that they yield to the simplest parser design, and can be
relied on to produce a tree of organized content, stripped of tags and
XHTML-specific formatting, in a handful of lines of Ruby code.

I've seen valid XHTML that wouldn't be much fun to parse. You still need to worry about whitespace, namespaces, the kind of quoting used, CDATA sections, ...
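To make the quoting and CDATA concerns concrete, here is a rough sketch (not code from this thread). The sample markup and both helper methods are hypothetical; it assumes the rexml gem is available, as it is in a stock Ruby install.

```ruby
require 'rexml/document'

# Hypothetical sample: valid XML, yet awkward for a naive pattern because of
# the single-quoted attribute and the CDATA section containing "</p>".
TRICKY = <<~XML
  <div>
    <p class='intro'><![CDATA[1 < 2 </p> is not a closing tag here]]></p>
  </div>
XML

# A pattern written with only the simplest markup in mind
# (double-quoted class attribute, no CDATA).
def naive_paragraphs(markup)
  markup.scan(%r{<p class="[^"]*">(.*?)</p>}m).flatten
end

# The same query through a real XML parser.
def parsed_paragraphs(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//p').map { |para| para.text }
end

p naive_paragraphs(TRICKY)   # finds nothing: the attribute is single-quoted
p parsed_paragraphs(TRICKY)
```

Loosening the pattern to accept either quote style would still leave the CDATA problem, which is the sort of edge case a parser already handles.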

James Edward Gray II

···

On Nov 20, 2006, at 2:45 AM, Paul Lutus wrote:

I agree completely (see my other post on this topic), but it appears the OP
was trying to read machine-generated Web content, presumably with reliable
syntax.

Then you are right of course. I guess the problem is in the definition
of the term 'screen scraping' ( or 'web extraction' or 'web mining' or
'html extraction' - people can not even agree on its name ).

For me 'screen scraping' means a complex thing: navigating to the
document, parsing it into something meaningful and querying the objects
of the parsed structure. In general, I am assuming that none of these
steps is trivial - maybe because I have been working for a web extraction
company for years now and I have seen every kind of nice trick from the
other side (a.k.a. the anti-scrape camp).

Of course, if you define screen scraping as the last step only (i.e. you
have a parsed model (e.g. a well formed page) and you need to query
that) - then of course regular expressions are always the first thing to
consider.

Since the OP was referring to a machine generated page, I think the
latter applies - so yep, as long as he needs only the <p>'s, regular
expressions are probably the easiest way to pull them out.
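As a rough illustration of the regular-expression approach for this case (a sketch, not the one-liner posted earlier in the thread): the sample markup and the `paragraphs` helper are made up, and the pattern assumes machine-generated markup in which <p> elements are never nested and always closed.

```ruby
# Hypothetical sample page; the OP's real page is not shown in the thread.
SAMPLE_HTML = <<~HTML
  <html><body>
    <p>First paragraph.</p>
    <p class="note">Second
    paragraph.</p>
  </body></html>
HTML

# Return the inner text of every <p> element in the given string.
def paragraphs(html)
  # \b keeps <pre> and <param> from matching; the m flag lets . span newlines.
  html.scan(%r{<p\b[^>]*>(.*?)</p>}mi)
      .flatten
      .map { |text| text.gsub(/\s+/, ' ').strip }
end

puts paragraphs(SAMPLE_HTML)
```

Nested tags inside a paragraph are returned as-is, so this only fits pages whose structure is known in advance.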

Peter

···

__
http://www.rubyrailways.com

James Edward Gray II wrote:

In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but
apparently
never considered writing code to solve the problem directly.

Starting out by looking for a library that does the hard work for you
is a good first step, I would say. Do we really want to be
discouraging that?

IMHO yes, when it doesn't solve the problem at hand. This is obviously a
matter of personal taste, but I always try coding a solution first, or at
least endeavor to understand what such a solution would entail, before
going shopping for a library. It's based on KISS, and large libraries that
can be relied on to solve any problem except the problem the adopter faces,
fail the KISS principle.

/ ...

I've seen valid XHTML that wouldn't be much fun to parse. You still
need to worry about whitespace, namespaces, the kind of quoting used,
CDATA sections, ...

These are all relatively easy to parse. Even the CDATA sections are clearly
and consistently delimited, so can be reliably skipped over and
encapsulated. That was the design goal of XHTML -- to be easy to parse, to
be consistent -- assuming the syntax is followed.

I just converted my 500-page Web site to XHTML, and at the end of the
project I found that I could parse any page on the site using a very simple
parser. This was about the time the pages also began passing XHTML
validation tests.
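For readers wondering what a "very simple parser" might look like, here is a minimal sketch. It is not the parser described above, which was not posted; `parse_tree` is a hypothetical helper, and it assumes well-formed XHTML with no comments, no CDATA sections, and no '>' inside attribute values.

```ruby
# Build a nested-array tree: each node is [name, child, child, ...],
# where a child is either another node or a stripped text string.
def parse_tree(xhtml)
  root = ['root']
  stack = [root]
  xhtml.scan(/<[^>]+>|[^<]+/) do |token|
    if token.start_with?('</')            # closing tag: return to the parent
      stack.pop
    elsif token.start_with?('<?', '<!')   # XML declaration or DOCTYPE: ignore
      next
    elsif token.start_with?('<')          # opening or self-closing tag
      node = [token[/\A<([^\s\/>]+)/, 1]]
      stack.last << node
      stack.push(node) unless token.end_with?('/>')
    else                                  # text between tags
      text = token.strip
      stack.last << text unless text.empty?
    end
  end
  root
end

p parse_tree('<ul><li>one</li><li>two <br/>three</li></ul>')
```

Attributes are discarded and nothing is validated, so malformed input produces a wrong tree rather than an error.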

But this is all by the way. The point is the OP adopted a powerful library,
only to discover he still couldn't solve his original problem, and if we
assume (as I did) that the pages are machine-generated and meet reasonable
syntax standards, the one-line solution I posted will meet his
requirements.

Your earlier point, that a Web page picked at random might be virtually
unparseable, is certainly true, and solutions like we are discussing assume
a high degree of cooperation between the page generator and the parser.

···

On Nov 20, 2006, at 2:45 AM, Paul Lutus wrote:

--
Paul Lutus
http://www.arachnoid.com

But if you use an already developed parser, you gain all their work on edge cases, all their testing efforts, all their optimization work, etc.

I see what you are saying about knowing you can count on the data, but your messages are filled with a lot of "as long as you are sure" conditions. Dropping a bunch of those conditions is just one more advantage to using a library.

You say you are always surprised when people build up all this hefty library code when a simple regex will do, but I'm always shocked when I can replace hundreds of lines of code by loading and making use of a library. If we have to err on one side of that, I would prefer it be on the library using side.

That said, I guess we'll just have to agree to disagree. That's for the intelligent and civil debate.

James Edward Gray II

···

On Nov 20, 2006, at 11:35 AM, Paul Lutus wrote:

James Edward Gray II wrote:

I've seen valid XHTML that wouldn't be much fun to parse. You still
need to worry about whitespace, namespaces, the kind of quoting used,
CDATA sections, ...

These are all relatively easy to parse. Even the CDATA sections are clearly
and consistently delimited, so can be reliably skipped over and
encapsulated. That was the design goal of XHTML -- to be easy to parse, to
be consistent -- assuming the syntax is followed.

Turns out I actually ended up abandoning HTree and the rest. I used
net/http in order to fetch the page and then took the table of the page
that I was interested in examining and converted that using rexml. I
have now been able to grab the values that I wanted using XPath :)

require 'net/http'
require 'uri'
require 'rexml/document'
include REXML
def fetch(uri_str, limit=10)
  fail 'http redirect too deep' if limit.zero?
  puts "Trying: #{uri_str}"
  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess
    response
  when Net::HTTPRedirection
    fetch(response['location'], limit-1)
  else
    response.error!
  end
end

response = fetch('http://10.37.150.55:8080')

scraped_data = response.body

table_start_pos = scraped_data.index('<table class="index" width="100%">')
#puts table_start_pos

# search from the table's start, and include the full closing tag
table_end_pos = scraped_data.index('</table>', table_start_pos) + '</table>'.length
#puts table_end_pos

height = table_end_pos - table_start_pos

gathered_data = response.body[table_start_pos,height]

converted_data = REXML::Document.new gathered_data
#puts converted_data

module_name = XPath.first(converted_data, "//td[@class='data']/a")
puts module_name

build_status = XPath.first(converted_data, "//td[2]/em")
puts build_status.text

last_failure = XPath.first(converted_data, "//tbody/tr/td[3]")
puts last_failure.text

last_success = XPath.first(converted_data, "//tbody/tr/td[4]")
puts last_success.text

build_number = XPath.first(converted_data, "//tbody/tr/td[5]")
puts build_number.text
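
The same extraction can be tried without network access. The following is only a sketch: the sample table is a guess at the page layout (the real page is not shown in the thread), and `build_info` is a hypothetical helper wrapping the XPath queries used above.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the build-status table.
SAMPLE_TABLE = <<~HTML
  <table class="index" width="100%">
    <tbody>
      <tr>
        <td class="data"><a href="/module">my-module</a></td>
        <td><em>passed</em></td>
        <td>2006-11-18 09:12</td>
        <td>2006-11-20 10:30</td>
        <td>42</td>
      </tr>
    </tbody>
  </table>
HTML

# Parse the table markup and collect the five values into a hash.
def build_info(table_markup)
  doc = REXML::Document.new(table_markup)
  {
    module_name:  REXML::XPath.first(doc, "//td[@class='data']/a").text,
    build_status: REXML::XPath.first(doc, '//td[2]/em').text,
    last_failure: REXML::XPath.first(doc, '//tbody/tr/td[3]').text,
    last_success: REXML::XPath.first(doc, '//tbody/tr/td[4]').text,
    build_number: REXML::XPath.first(doc, '//tbody/tr/td[5]').text
  }
end

build_info(SAMPLE_TABLE).each { |key, value| puts "#{key}: #{value}" }
```

Note that REXML needs the extracted table to be well-formed XML; a page with unclosed tags inside the table would raise a parse error instead.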

···

--
Posted via http://www.ruby-forum.com/.

James Edward Gray II wrote:

James Edward Gray II wrote:

I've seen valid XHTML that wouldn't be much fun to parse. You still
need to worry about whitespace, namespaces, the kind of quoting used,
CDATA sections, ...

These are all relatively easy to parse. Even the CDATA sections are
clearly
and consistently delimited, so can be reliably skipped over and
encapsulated. That was the design goal of XHTML -- to be easy to
parse, to
be consistent -- assuming the syntax is followed.

But if you use an already developed parser, you gain all their work
on edge cases, all their testing efforts, all their optimization
work, etc.

Yes, all to the good, if the feature set is needed and if the target
environment can support the library. And if the library actually solves the
original problem.

I see what you are saying about knowing you can count on the data,
but your messages are filled with a lot of "as long as you are sure"
conditions. Dropping a bunch of those conditions is just one more
advantage to using a library.

Yes, unless the library serves no purpose and occupies memory and machine
cycles better spent elsewhere. Without a library, you have to work out the
problem directly. With a library, you have to work out the problems caused
by the library.

My personal favorite example of this dichotomy is REXML, which apparently
can do anything, unless you have something specific in mind, in which case
IMHO you are better off writing your own code to parse XML data sets. It
isn't as though
XML is a dark and mysterious world that is beyond the reasoning powers of
mere mortals. If it were, the designers of the scheme were wasting their
time.

In the beginning, we had all sorts of weak and limited dataset protocols.
These weaknesses are well addressed by XML, but some think XML is too
complicated to manipulate directly. So libraries like REXML get created.
But the libraries often turn out to be so complex and difficult to put into
service that in some cases one is better off writing one's own
generator/parser for the simpler applications of XML.

The complexity referenced above seems to arise from an irresistible tendency
to put every feature into a library, with the side effect that important
and trivial/esoteric features often get mixed up together in the
documentation and the interface, and the library ends up too large to
justify for simple processing tasks.

Maybe now someone will write a library to bring REXML under control. Ad
infinitum.

You say you are always surprised when people build up all this hefty
library code when a simple regex will do,

No, not always, those are not my words. But in a case like this, where the
library accomplishes everything except what the OP actually wanted, yes.
Please note that I only made this plain-code argument after the OP
explained that he had put the library in place, had run it through its
paces, only to discover that he still couldn't solve the original problem.

but I'm always shocked when
I can replace hundreds of lines of code by loading and making use of
a library. If we have to err on one side of that, I would prefer it
be on the library using side.

For myself, I prefer to know what is going on. As I said, it's just a
personal preference.

That said, I guess we'll just have to agree to disagree. That's for
the intelligent and civil debate.

You're welcome (if I read you correctly). Such an exchange is always
possible, some might say likely, between two people who both want it that
way.

An aside with some small relevance. It's just possible that the Linux kernel
maintainers' tendency to adopt existing libraries over laboriously writing
fresh code will spawn a huge legal battle with Microsoft, who clearly
intend to argue (and who are now arguing) that it is their intellectual
property embedded in Linux, and therefore all those Linux users are
actually Microsoft customers.

I can see how this post may be interpreted, so I want to say I hope no one
is misled. If I were really intent on avoiding libraries, I would write
everything in assembly. My disdain for libraries is fully constrained by
reality and pragmatism, and there are plenty of libraries that I use with
something that approaches reckless abandon.

But ... when a specialized library can't solve a problem that is soluble
with one line of tautological Ruby code, I'm more than willing to speak up.

···

On Nov 20, 2006, at 11:35 AM, Paul Lutus wrote:

--
Paul Lutus
http://www.arachnoid.com

Chris Gallagher wrote:

Turns out I actually ended up abandoning HTree and the rest. I used
net/http in order to fetch the page and then took the table of the page
that I was interested in examining and converted that using rexml. I
have now been able to grab the values that I wanted using XPath :)

If you are keen on XPaths, why not:

table = XPath.first(doc, "//table[@class='index' and @width='100%']")

then use 'table' instead of 'converted_data'...

or even

module_name = XPath.first(doc, "//table[@class='index' and @width='100%']//td[@class='data']/a")

etc.

(Untested since I don't have your doc, but it should work, more or less)
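
A small self-contained sketch of selecting the table by attributes in XPath, assuming the whole page parses as XML. The sample page and the `first_module_link` helper are hypothetical; note that XPath combines predicate conditions with the keyword `and`.

```ruby
require 'rexml/document'

# Hypothetical page with two tables; only the second has class "index".
PAGE = <<~HTML
  <html><body>
    <table class="nav" width="100%"><tr><td class="data"><a>menu</a></td></tr></table>
    <table class="index" width="100%"><tr><td class="data"><a>my-module</a></td></tr></table>
  </body></html>
HTML

# Return the first link inside a data cell of the index table.
def first_module_link(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.first(
    doc,
    "//table[@class='index' and @width='100%']//td[@class='data']/a"
  )
end

puts first_module_link(PAGE).text
```

This drops the string-index slicing step, at the cost of requiring the entire page, not just the table, to be well-formed.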

Cheers,
Peter

···

__
http://www.rubyrailways.com