First let me be very clear. I hate the language that Larry "should be
lined up against a" Wall has written. IMO it encourages people
to program with, well, only men can program that way, instead of their
heads.
However, as bad as the language is, LWP is one of the best libraries
around when it comes to web-related applications. Most notably,
I have never found a library that can parse HTML as well as
LWP's HTML parser. It is my eternal hope that I can find a library as
good and dump the language.
With the advent of Ruby on Rails, I am hopeful that there might be a
package in Ruby that gives Perl's HTML parser a run for its money.
I'm not looking for an XML parser. XML parsers just can't handle
many of the web sites I want to parse. Neither can expat, libxml2,
or some of the more popular libraries. Don't suggest I pass it through
Tidy and then parse the XML. There are a lot of pages that Tidy can't
handle.
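To make the complaint concrete, here is a minimal sketch of a strict XML parser rejecting ordinary tag soup. It assumes Ruby's bundled REXML as a stand-in for any strict parser; the sample markup and the `strict_xml_ok?` helper are made up for illustration.

```ruby
require 'rexml/document'

# Tag soup of the kind found on real pages (hypothetical sample):
# <br> is never closed, and <b> and <p> are closed in the wrong order.
TAG_SOUP = '<html><body><p>Due <b>July 1<br></p></b></body></html>'

# Returns true if REXML accepts the markup as well-formed XML,
# false if it raises a parse error.
def strict_xml_ok?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts strict_xml_ok?(TAG_SOUP)
puts strict_xml_ok?('<p>Due <b>July 1</b><br/></p>')
```

A browser renders the first string without complaint, which is why pages like it never get fixed.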
Finally, there will be some smartass who will say that I should use
web sites that are written in good HTML. I don't have a choice of which
pages I, or the people who ask me to write scripts, take our content
from. Fine. If you have the millions to pay all those webmasters to
hire HTML gurus who will generate good HTML, let me know and
I will email you a list. As for me, I am too busy with real work on my
own projects to go around nagging people working on other things to
improve their coding style.
Thanks
The reply-to email address is olczyk2002@yahoo.com.
This is an address I ignore.
To reply via email, remove 2002 and change yahoo to
interaccess,
Thaddeus L. Olczyk, PhD
There is a difference between
*thinking* you know something,
and *knowing* you know something.
> With the advent of Ruby on Rails, I am hopeful that there might be a
> package in Ruby that gives Perl's HTML parser a run for its money.
Look at Narf, and its htmltools and xmltree.
Or Michael Neumann's Mechanize. It wraps htmltools and xmltree.
> I'm not looking for an XML parser. XML parsers just can't handle
> many of the web sites I want to parse. Neither can expat, libxml2,
> or some of the more popular libraries.
Have you tried libxml2 in parse_html mode with the recover option on?
I've never had a problem with any site. It handles broken, nasty HTML
quite nicely.
(Disclaimer: I don't know if the Ruby bindings expose this
functionality).
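For anyone wondering what a "recover" mode amounts to, here is a toy sketch in plain Ruby. It is not libxml2's algorithm and does not use its Ruby bindings; `parse_lenient`, `texts_of`, `VOID_TAGS`, and the recovery rules are all assumptions made up for illustration. The point is only that the parser repairs nesting errors instead of raising.

```ruby
require 'strscan'

# Elements that never take a closing tag in HTML (partial list).
VOID_TAGS = %w[br hr img input meta link].freeze

# Parse tag soup into nested [name, children] arrays without ever raising.
# Recovery rules (deliberately simplistic):
#   * a closing tag with no matching open element is ignored
#   * a closing tag also closes any elements still open inside its match
#   * anything left open at end of input is closed implicitly
def parse_lenient(html)
  root = ['#root', []]
  stack = [root]
  scanner = StringScanner.new(html)
  until scanner.eos?
    if scanner.scan(%r{<\s*/\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>})
      name = scanner[1].downcase
      idx = stack.rindex { |node| node[0] == name }
      stack.slice!(idx..-1) if idx
    elsif scanner.scan(/<\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>/)
      name = scanner[1].downcase
      node = [name, []]
      stack.last[1] << node
      self_closed = scanner.matched.end_with?('/>')
      stack.push(node) unless VOID_TAGS.include?(name) || self_closed
    elsif (text = scanner.scan(/[^<]+/))
      stack.last[1] << text unless text.strip.empty?
    else
      scanner.getch # stray "<" (comment, doctype, typo): skip it
    end
  end
  root
end

# Collect the direct text of every element with the given tag name.
def texts_of(node, name, found = [])
  tag, children = node
  found << children.grep(String).join if tag == name
  children.each { |child| texts_of(child, name, found) if child.is_a?(Array) }
  found
end

tree = parse_lenient('<p>Due <b>July 1<br></p></b><p>Title: <b>Dune')
p texts_of(tree, 'b')
```

A real recovering parser handles far more (attributes containing ">", comments, scripts, implied elements), so treat this as a picture of the idea rather than something to scrape with.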
I did a poor man's port of BeautifulSoup once...if there's enough
interest, we could turn it into something useful. I assume you're
doing some screen scraping thing?
Here's the original BeautifulSoup. Look like what you need?
> Look at Narf, and its htmltools and xmltree.
> Or Michael Neumann's Mechanize. It wraps htmltools and xmltree.
I used Mechanize over the weekend and I just love it. In fact I had a
couple small problems that Michael fixed within hours.
I am using it to automate renewal of library books using my library's
web-site. I was amazed at how quickly I got my solution working, because
the library web-site software has some gnarly URLs and redirects that I
figured would be "fun" to deal with. But Mechanize makes it trivial.
Anyhow, the HTML from the library web-site parses fine and I easily scrape
out the information I care about (book titles, authors, and due dates).
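The scraping step can be pictured roughly like this. This is not Mechanize and not the actual library site: the markup, the class names, the sample rows, and the `scrape_books` helper are hypothetical, and the sketch uses only core Ruby string scanning so it runs anywhere.

```ruby
# Hypothetical markup standing in for a "checked out items" page.
# Note the unquoted attributes and the unclosed <td> and <tr> elements.
PAGE = <<~HTML
  <table>
  <tr><td class=title>Dune<td class=author>Frank Herbert<td class=due>2005-07-01
  <tr><td class=title>Emma<td class=author>Jane Austen<td class=due>2005-07-08
  </table>
HTML

Book = Struct.new(:title, :author, :due)

# Build one Book per table row. A row runs until the next <tr>, the
# closing </table>, or end of input; a cell's text runs until the next
# tag, so missing closing tags do no harm.
def scrape_books(html)
  html.scan(%r{<tr\b[^>]*>(.*?)(?=<tr\b|</table|\z)}mi).map do |(row)|
    cells = row.scan(/<td[^>]*class=["']?(\w+)["']?[^>]*>([^<]*)/i).to_h
    Book.new(cells['title'].to_s.strip,
             cells['author'].to_s.strip,
             cells['due'].to_s.strip)
  end
end

scrape_books(PAGE).each do |book|
  puts "#{book.due}  #{book.title} (#{book.author})"
end
```

Regex scraping like this breaks as soon as the page layout changes, which is a good part of why a library that hands you parsed links and forms is so pleasant by comparison.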
I'm not a Python guy, so I don't know the library. However, I just browsed through the site and if you ask me, it looks downright handy.
James Edward Gray II
On Jun 21, 2005, at 12:51 PM, Daniel Amelang wrote:
> I did a poor man's port of BeautifulSoup once...if there's enough
> interest, we could turn it into something useful. I assume you're
> doing some screen scraping thing?
> Here's the original BeautifulSoup. Look like what you need?
On Jun 21, 2005, at 11:05 AM, James Edward Gray II wrote:
> On Jun 21, 2005, at 12:51 PM, Daniel Amelang wrote:
>> I did a poor man's port of BeautifulSoup once...if there's enough
>> interest, we could turn it into something useful. I assume you're
>> doing some screen scraping thing?
>> Here's the original BeautifulSoup. Look like what you need?