[ANN] Hpricot 0.1 -- quick, cinchy HTML parsing

Associates worldwide, here is some unpolished software for you. To install this
one, you'll need a compiler and the `iconv' library.

  gem install hpricot --source code.whytheluckystiff.net

Hpricot is a fast HTML parser, based on HTree. I converted the HTree scanner to
C and I'm just now reworking the parser. I've also started adding a bunch of
nice methods to HTree so that you won't have any desire to use REXML objects
instead.

  doc = File.open(path) { |f| Hpricot.parse(f) }

  # supports xpath
  doc.search("//p/a").set("href", "http://google.com")

  # supports css selectors
  doc.search("#menu .box").each { |ele| p ele }

  # slash is a shortcut
  (doc/"#menu box").each ...

  # symbols also imply css selectors of tag names
  (doc/:p/:a).set("href", "http://google.com")
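
For anyone curious how the slash shortcut and the symbol form can hang together, here is a minimal sketch. It is a toy tree, not Hpricot's actual implementation: the `ToyElement` and `ToyNodes` classes are hypothetical names made up for illustration, and only bare tag names are matched (no XPath, no CSS ids or classes).

```ruby
# Toy element tree, NOT Hpricot's implementation: just enough to show
# how `/` can alias `search` and how symbols can stand in for tag names.
class ToyNodes < Array
  # Collect matching descendants of every node in the set.
  def search(selector)
    ToyNodes.new(flat_map { |node| node.search(selector) })
  end
  alias_method :/, :search

  # Set one attribute on every node in the set, then return the set.
  def set(name, value)
    each { |node| node.attributes[name] = value }
    self
  end
end

class ToyElement
  attr_reader :name, :attributes, :children

  def initialize(name, attributes = {}, children = [])
    @name = name
    @attributes = attributes
    @children = children
  end

  # Only bare tag names are understood here. A Symbol is converted
  # to a String first, so :p and "p" behave the same.
  def search(selector)
    tag = selector.to_s
    found = ToyNodes.new
    children.each do |child|
      found << child if child.name == tag
      found.concat(child.search(tag))
    end
    found
  end
  alias_method :/, :search
end

link = ToyElement.new("a", { "href" => "http://example.com" })
doc  = ToyElement.new("html", {}, [ToyElement.new("p", {}, [link])])

# Each `/` narrows the previous result, so this reads like a path.
(doc/:p/:a).set("href", "http://google.com")
puts link.attributes["href"]
```

Because `/` returns a node set that itself answers `/` and `set`, the calls chain the same way the examples above do.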

The Hpricot scanner uses Ragel (the same state machine used by Mongrel) and is
able to whip through hundreds of HTML documents in a second. (I'm benchmarking
against the sizeable Boing Boing home page, Slashdot, and others.) However,
this release still includes some of HTree's existing code, which slows things
down quite a bit and will be phased out over the next few releases.

Anyway, I have high hopes for this little guy. Please don't forget to say the
name right. It's H-pricot. Like: AYYCHH-pricot.

Subversion is here: http://code.whytheluckystiff.net/svn/hpricot/trunk.

Gracias, mi rubistos!

_why

Very nice. Thanks _why

-Ezra

···

Okay, 0.2 is out. The above is a lifetime prescription.

If you'd rather not install Hpricot, but want to play with it, try out the
balloon. You can review it at http://balloon.hobix.com/hpricot, then run it
with:

  ruby -ropen-uri -e 'eval(open("http://balloon.hobix.com/hpricot").read)'

_why

···

why the lucky stiff wrote:

  gem install hpricot --source code.whytheluckystiff.net

Okay, 0.2 is out....

Hpricot chokes on this when I try it.

   require 'rubygems'
   require 'hpricot'
   require 'open-uri'
   Hpricot(open('http://www.pcmag.com/article2/0,1759,1765785,00.asp')).to_html

If I understand correctly, some methods (to_html, at least) in the version
of Hpricot I have (whatever was there yesterday morning) don't
like ugly "html" like "<input type=checkbox checked>", where attribute
values apparently end up being nil.

I can work around it by putting
    aval ||= ''
in STag in tag.rb
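
To make the failure mode concrete, here is a small sketch of why a nil attribute value can break serialization and how defaulting it to an empty string helps. This is not the code in Hpricot's tag.rb: `render_attrs` and its `default_nil` flag are hypothetical, and the assumption that the failure comes from calling a String method on nil is a guess.

```ruby
# Hypothetical attribute renderer, NOT the code in Hpricot's tag.rb.
# A valueless attribute such as `checked` arrives here with a nil value.
def render_attrs(attrs, default_nil: false)
  attrs.map do |name, aval|
    # The workaround from the message above: treat nil as an empty string.
    aval ||= '' if default_nil
    # Assumed failure point: nil has no gsub, so this raises NoMethodError.
    escaped = aval.gsub('"', '&quot;')
    " #{name}=\"#{escaped}\""
  end.join
end

attrs = { 'type' => 'checkbox', 'checked' => nil }

begin
  render_attrs(attrs)
rescue NoMethodError => e
  puts "without the default: #{e.class}"
end

puts "<input#{render_attrs(attrs, default_nil: true)}>"
```

With the default in place the valueless attribute comes out as `checked=""`, which is enough to get serialization through.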

···

On Tue, Jul 04, 2006 at 02:28:45PM +0900, why the lucky stiff wrote:

Wonderful, thank you. This is fixed in trunk now. Have a good time.

_why

···

On Sat, Jul 08, 2006 at 10:46:00PM +0900, Ron M wrote:

Hpricot chokes on this when I try it.

  require 'rubygems'
  require 'hpricot'
  require 'open-uri'
  Hpricot(open('http://www.pcmag.com/article2/0,1759,1765785,00.asp')).to_html