Associates worldwide, here is some unpolished software for you. You'll need a
compiler and the `iconv' library installed to build this one.
gem install hpricot --source code.whytheluckystiff.net
Hpricot is a fast HTML parser, based on HTree. I converted the HTree scanner to
C and I'm just now reworking the parser. I've also started adding a bunch of
nice methods to HTree so that you won't have any desire to use REXML objects
instead.
doc = File.open(path) { |f| Hpricot.parse(f) }
# supports xpath
doc.search("//p/a").set("href", "http://google.com")
# supports css selectors
doc.search("#menu .box").each { |ele| p ele }
# slash is a shortcut
(doc/"#menu .box").each ...
# symbols also imply css selectors of tag names
(doc/:p/:a).set("href", "http://google.com")
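If you're wondering how the slash and the symbols can hang together, here is a rough sketch. It is not Hpricot's code. ToyElem and ToyElems are made-up names, and the toy only matches descendants by tag name (no xpath, no real css), on the assumption that a symbol simply gets turned into a tag name string.

```ruby
# Toy element tree. NOT Hpricot's real classes; every name here is hypothetical.
class ToyElem
  attr_reader :name, :attrs, :children

  def initialize(name, attrs = {}, children = [])
    @name = name
    @attrs = attrs
    @children = children
  end

  # All descendants, depth-first, in document order.
  def descendants
    children.flat_map { |c| [c] + c.descendants }
  end

  # Find descendants by tag name. A Symbol is treated as a tag name.
  def search(selector)
    tag = selector.to_s
    ToyElems.new(descendants.select { |e| e.name == tag })
  end

  # Slash is just another spelling of search.
  def /(selector)
    search(selector)
  end
end

# A result set that can be searched again and updated in bulk.
class ToyElems < Array
  def search(selector)
    ToyElems.new(flat_map { |e| e.search(selector).to_a }.uniq)
  end

  def /(selector)
    search(selector)
  end

  # Set one attribute on every element in the set.
  def set(attr, value)
    each { |e| e.attrs[attr] = value }
    self
  end
end

a1 = ToyElem.new("a", { "href" => "/old" })
a2 = ToyElem.new("a", { "href" => "/other" })
doc = ToyElem.new("html", {}, [
  ToyElem.new("p", {}, [a1]),
  ToyElem.new("div", {}, [a2])
])

# Only the link inside the paragraph is touched.
(doc/:p/:a).set("href", "http://google.com")
```

Because search hands back a set that answers to search and / itself, the calls chain, and set can act on everything that matched in one go.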
The Hpricot scanner uses Ragel (the same state machine compiler used by Mongrel) and is
able to whip through hundreds of HTML documents in a second. (I'm benchmarking
against the sizeable Boing Boing home page, Slashdot, and others.) However,
this release still includes some of HTree's existing code, which slows things
down quite a bit and will be phased out over the next few releases.
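For a feel of what a state machine scanner does, here is a hand-rolled two-state toy in Ruby. It is only an illustration: Ragel generates its machines from a grammar and the real scanner is C, while toy_scan is a hypothetical name and assumes no ">" ever shows up inside an attribute value.

```ruby
# Hand-rolled two-state scanner. A toy, not Ragel output or Hpricot's scanner.
def toy_scan(html)
  tokens = []
  state = :text
  buf = +""
  html.each_char do |ch|
    case state
    when :text
      if ch == "<"
        # Leaving text: emit what we collected and start a tag.
        tokens << [:text, buf] unless buf.empty?
        buf = +"<"
        state = :tag
      else
        buf << ch
      end
    when :tag
      buf << ch
      if ch == ">"
        # Tag is complete: emit it and go back to reading text.
        tokens << [:tag, buf]
        buf = +""
        state = :text
      end
    end
  end
  # Flush whatever is left; an unterminated tag is kept as text.
  tokens << [:text, buf] unless buf.empty?
  tokens
end

tokens = toy_scan("<p>Hi <a href='/x'>there</a></p>")
tokens.each { |kind, text| puts "#{kind}: #{text}" }
```

Each character is looked at once and the only thing remembered between characters is the current state, which is a large part of why scanners of this shape can be quick.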
Anyway, I have high hopes for this little guy. Please don't forget to say the
name right. It's H-pricot. Like: AYYCHH-pricot.
Subversion is here: http://code.whytheluckystiff.net/svn/hpricot/trunk.
Gracias, mi rubistos!
_why