Peter Szinek wrote:
Ingo Weiss wrote:
Hi,
I would like to use Ruby to read the content of a web site, and then
extract certain data from it. The site is machine generated so the
format doesn't change, but unfortunately it is far from being valid
XHTML or similar.
In order to parse the page, you first need to push it through some kind
of tidy-up engine, so you can turn the invalid HTML into XML.
That depends on what data you are after, and where you want to look for it.
If, for example, you just want to get a list of css files referenced in a page, then regexen would likely be simpler and faster than the tidy-up approach.
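For instance, a minimal sketch of the regex route (the sample markup here is invented for illustration):

```ruby
# Extract stylesheet references from raw HTML with a plain regex --
# no tidying or parsing step required.
html = <<~HTML
  <html><head>
  <link rel="stylesheet" href="/css/main.css" type="text/css">
  <LINK REL="stylesheet" HREF="/css/print.css">
  </head><body></body></html>
HTML

css_files = html.scan(/<link[^>]*href=["']([^"']+\.css)["']/i).flatten
# css_files => ["/css/main.css", "/css/print.css"]
```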
I recommend
this one:
http://tidy.rubyforge.org/
After this step you have reduced the problem of arbitrary (possibly
invalid) HTML parsing to XML parsing which is definitely easier, e.g.
with REXML.
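Once the markup is valid XML, REXML (in the standard library) handles the rest. A small sketch over a hand-written, already-clean fragment:

```ruby
require 'rexml/document'

# Assume the page has already been tidied into well-formed XML.
xml = <<~XML
  <html><head><title>Example</title></head>
  <body><p class="intro">Hello</p></body></html>
XML

doc   = REXML::Document.new(xml)
title = doc.elements['//title'].text
# title => "Example"
intro = REXML::XPath.match(doc, "//p[@class='intro']").map(&:text)
# intro => ["Hello"]
```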
Sort of. I've seen tidy make some odd assumptions about what the "correct" output should be, based on surreal HTML input. And this can throw off the XML manipulation code.
What would be the easiest way to get there? I guess I need some kind of
HTML parser, right? And how do I read a web site into Ruby in the first place?
Another possibility would be Rubyful soup:
Rubyful Soup: "The brush has got entangled in it!"
You do not need pre-tidying here; just use it:

require 'rubyful_soup'
soup = BeautifulSoup.new(page)
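As for reading the site in the first place: the `page` string handed to BeautifulSoup would typically come from an HTTP fetch. One way, using only the standard library (the URL below is a placeholder):

```ruby
require 'net/http'
require 'uri'

# Fetch the raw HTML body of a page; hand the result to a parser of choice.
def fetch_page(url)
  Net::HTTP.get_response(URI(url)).body
end

# page = fetch_page('http://example.com/')
# soup = BeautifulSoup.new(page)
```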
I've just been trying out BeautifulSoup to parse some nasty del.icio.us markup (it has an XHTML DOCTYPE, but is painfully broken).
I had been using some simple regex iteration over the source, but they changed that page layout, my app broke, and I thought perhaps I'd give BeautifulSoup another shot. But I realized why I stopped using it in the first place: it's way too slow. (Or at least way slower than my hand-rolled hacks.)
I've tried a number of ways, over various applications, to extract stuff from HTML. If I can get predictable XML right off, then that's a big help; I can pass it into a stream parser, or use a DOM if the file isn't too large.
When handed broken markup, I've found that many times the problem is in only one or two places, most often the header (with malformed empty elements). Much time can be saved by grabbing a subset of the raw HTML (with some simple stateful line-by-line iteration) and cleaning up what I actually need (and often that extracted subset is proper XML all by itself).
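A hedged sketch of that subset approach (the landmark tags and the broken markup are invented): keep only the lines between two markers, then parse just that fragment.

```ruby
require 'rexml/document'

# Broken page: unclosed <title>, unescaped '&', no closing </html>.
html = <<~HTML
  <html><head><title>broken & nasty</head>
  <table id="posts">
  <tr><td>first</td></tr>
  <tr><td>second</td></tr>
  </table>
  </body>
HTML

# Simple stateful line-by-line iteration: collect the table we care about.
inside = false
subset = []
html.each_line do |line|
  inside = true  if line.include?('<table id="posts">')
  subset << line if inside
  inside = false if line.include?('</table>')
end

# The extracted subset happens to be proper XML all by itself,
# so REXML can take it from here.
fragment = subset.join
cells = REXML::Document.new(fragment).elements.to_a('//td').map(&:text)
# cells => ["first", "second"]
```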
There is a real cost to making the parsing/cleaning code highly robust, and if you can make certain assumptions about the source text (and live with the risks that things can change), you can often make the app faster/simpler.
--
James Britt
Judge a man by his questions, rather than his answers.
- Voltaire