Peter Szinek wrote:
Ingo Weiss wrote:
Hi,
I would like to use Ruby to read the content of a web site, and then
extract certain data from it. The site is machine generated so the
format doesn't change, but unfortunately it is far from being valid
XHTML or similar.
In order to parse the page, you first need to push it through some kind
of tidy-up engine, so you can turn the invalid HTML into XML.
That depends on what data you are after, and where you want to look for it.
If, for example, you just want to get a list of css files referenced in a page, then regexen would likely be simpler and faster than the tidy-up approach.
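For instance, a minimal sketch of the regex route (the sample markup here is invented for illustration):

```ruby
# Extract stylesheet references from raw HTML with a plain regex --
# no tidying or parsing step required.
html = <<~HTML
  <html><head>
  <link rel="stylesheet" href="/css/main.css" type="text/css">
  <LINK REL="stylesheet" HREF="/css/print.css">
  </head><body></body></html>
HTML

css_files = html.scan(/<link[^>]*href=["']([^"']+\.css)["']/i).flatten
# css_files => ["/css/main.css", "/css/print.css"]
```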
I recommend
this one:
http://tidy.rubyforge.org/
After this step you have reduced the problem of arbitrary (possibly
invalid) HTML parsing to XML parsing which is definitely easier, e.g.
with REXML.
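Once the markup is valid XML, REXML (in the standard library) handles the rest. A small sketch over a hand-written, already-clean fragment:

```ruby
require 'rexml/document'

# Assume the page has already been tidied into well-formed XML.
xml = <<~XML
  <html><head><title>Example</title></head>
  <body><p class="intro">Hello</p></body></html>
XML

doc   = REXML::Document.new(xml)
title = doc.elements['//title'].text
# title => "Example"
intro = REXML::XPath.match(doc, "//p[@class='intro']").map(&:text)
# intro => ["Hello"]
```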
Sort of. I've seen tidy make some odd assumptions about what the "correct" output should be, based on surreal HTML input. And this can throw off the XML manipulation code.
What would be the easiest way to get there? I guess I need some kind of
HTML parser, right? And how do I read a web site into Ruby in the first place?
Another possibility would be Rubyful soup:
Rubyful Soup: "The brush has got entangled in it!"
You do not need pre-tidying here; just use it:

require 'rubyful_soup'
soup = BeautifulSoup.new(page)
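As for reading the site in the first place: the `page` string handed to BeautifulSoup would typically come from an HTTP fetch. One way, using only the standard library (the URL below is a placeholder):

```ruby
require 'net/http'
require 'uri'

# Fetch the raw HTML body of a page; hand the result to a parser of choice.
def fetch_page(url)
  Net::HTTP.get_response(URI(url)).body
end

# page = fetch_page('http://example.com/')
# soup = BeautifulSoup.new(page)
```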
I've just been trying out BeautifulSoup to parse some nasty del.icio.us markup (it has an XHTML DOCTYPE, but is painfully broken).
I had been using some simple regex iteration over the source, but they changed that page layout, my app broke, and I thought perhaps I'd give BeautifulSoup another shot. But I realized why I stopped using it in the first place: it's way too slow. (Or at least way slower than my hand-rolled hacks.)
I've tried a number of ways, over various applications, to extract stuff from HTML. If I can get predictable XML right off, then that's a big help; I can pass it into a stream parser, or use a DOM if the file isn't too large.
When handed broken markup, I've found that many times the problem is in only one or two places, most often the header (with malformed empty elements). Much time can be saved by grabbing a subset of the raw HTML (with some simple stateful line-by-line iteration) and cleaning up what I actually need (and often that extracted subset is proper XML all by itself).
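A hedged sketch of that subset approach (the landmark tags and the broken markup are invented): keep only the lines between two markers, then parse just that fragment.

```ruby
require 'rexml/document'

# Broken page: unclosed <title>, unescaped '&', no closing </html>.
html = <<~HTML
  <html><head><title>broken & nasty</head>
  <table id="posts">
  <tr><td>first</td></tr>
  <tr><td>second</td></tr>
  </table>
  </body>
HTML

# Simple stateful line-by-line iteration: collect the table we care about.
inside = false
subset = []
html.each_line do |line|
  inside = true  if line.include?('<table id="posts">')
  subset << line if inside
  inside = false if line.include?('</table>')
end

# The extracted subset happens to be proper XML all by itself,
# so REXML can take it from here.
fragment = subset.join
cells = REXML::Document.new(fragment).elements.to_a('//td').map(&:text)
# cells => ["first", "second"]
```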
There is a real cost to making the parsing/cleaning code highly robust, and if you can make certain assumptions about the source text (and live with the risks that things can change), you can often make the app faster/simpler.
--
James Britt
Judge a man by his questions, rather than his answers.
- Voltaire