Hpricot/Rubyful Soup comparison

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

Thanks,
Wes

--
Posted via http://www.ruby-forum.com/.

Wes Gamble wrote:

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and

I have not run any benchmarks, but I scrape a lot of relatively big
pages on a daily basis and I can tell you that RubyfulSoup is orders of
magnitude slower than HPricot. I am absolutely sure about this. I do things
with HPricot which should be extremely slow (e.g. traversing the whole
tree and doing expensive operations on all Hpricot::Elements), yet
HPricot is surprisingly fast. RubyfulSoup is nowhere near.

b) preserves the original HTML better.
Hmm, this I don't know, but I guess the term 'preserves HTML better'
should first be defined with some metric (deviation from the HTML
standard?). There are so many badly formed HTML pages that even a human
could come up with several different ways to correct them.

I think the only real-life measure of quality is to process your pages with
both of them and see which one yields better results. I have not played
much with RubyfulSoup, but I am writing a fairly serious screen-scraping
framework based on Hpricot, and so far I have had no real problems, even
though I am doing all kinds of weird things.

Cheers,
Peter

__
http://www.rubyrailways.com

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

I switched from Rubyful Soup to Hpricot a while ago. The reason was performance on 1000-2000 character html chunks -- I didn't do a benchmark because there just was no need to... Hpricot is *a lot* faster.

I have no idea which preserves html better. I'm only using them to find specific bits of the html (e.g. links, images, a few other things). I do not use either to transform the input html; I *always* keep the input as it was. In all cases I have html in a string that I give to the parser. I do know that with Rubyful Soup it was absolutely necessary to dup the string first, or you were liable to have changes made to the input string.
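To illustrate the dup-first habit, here is a minimal sketch. The `destructive_parse` method is a hypothetical stand-in that mutates its argument in place; it is not Rubyful Soup's API and only mimics the side effect described above.

```ruby
# Hypothetical stand-in for a parser that edits its argument in place.
# It is NOT Rubyful Soup's API; it only mimics the side effect described above.
def destructive_parse(html)
  html.gsub!(/\s+/, ' ')   # in-place normalisation mutates the caller's String
  html.scan(/<a\b[^>]*>/)  # return the opening <a> tags that were found
end

original = "<p>See   <a href='/one'>one</a>\n and <a href='/two'>two</a></p>"
pristine = original.dup                  # untouched copy, kept only for comparison

links = destructive_parse(original.dup)  # the parser only ever sees a copy

puts links.length            # prints 2
puts original == pristine    # prints true: the input string was left alone
```

Passing `original` directly instead of `original.dup` would collapse the whitespace in the caller's string, which is the kind of surprise the dup avoids.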

Cheers,
Bob

On 21-Nov-06, at 5:27 PM, Wes Gamble wrote:

Thanks,
Wes


----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>

I have, in late August, and at that time, we found that Rubyful Soup
was ten times slower than Hpricot and Mechanize.

On 11/21/06, Wes Gamble <weyus@att.net> wrote:

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

Thanks,
Wes


--
Giles Bowkett
http://www.gilesgoatboy.org

I recently wrote a scraper in rubyfulsoup and then rewrote it in hpricot. The hpricot version was MUCH faster, had less code, and was easier to understand. I was a bit dubious of hpricot initially because of the 'strange syntax', but I am definitely sold now.

As for correctness, I can't comment.

On 22/11/2006, at 11:27 AM, Wes Gamble wrote:

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

I've used both Hpricot and Rubyful Soup to parse the Google News page
and found Hpricot to be much faster.

Luis
Wes Gamble wrote:

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

Thanks,
Wes


I recently did a small head-to-head with RubyfulSoup, Hpricot, and the up-and-coming (now in CVS, release in a few weeks) libxml-ruby binding to the libxml2 HTML parser. Running against the RubyfulSoup homepage (perhaps ironically, it's pretty badly formed) over 100 iterations, the attached benchmark gave the following results. Each benchmark parses the original HTML and then gets back a specific node set (Hpricot and libxml2 using XPath, RubyfulSoup using its own query API):

                               user     system      total        real
rubyful soup - simple     25.900000   0.710000  26.610000 ( 26.669350)
rubyful soup - trickier   26.220000   0.010000  26.230000 ( 26.252975)
hpricot - simple xpath     7.930000   0.000000   7.930000 (  7.950092)
hpricot - trickier xpath   8.200000   0.010000   8.210000 (  8.212230)
libxml2 - simple xpath     0.900000   0.000000   0.900000 (  0.899329)
libxml2 - trickier xpath   0.940000   0.000000   0.940000 (  1.217441)
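
The attached script is not reproduced in this thread, so the following is only a sketch of the general shape such a comparison could take with Ruby's standard `Benchmark` module. The lambdas are hypothetical placeholders (plain regex scans over a generated page); in a real run each one would parse the page with one library and perform one query.

```ruby
require 'benchmark'

# Generated sample page; a real comparison would load the page under test.
html = "<html><body>" +
       ("<p class='x'><a href='/l'>link</a></p>" * 200) +
       "</body></html>"

# Hypothetical stand-ins for "parse with library X, then run query Y".
candidates = {
  'regex scan - links'      => lambda { |doc| doc.scan(/<a\b[^>]*>/).size },
  'regex scan - paragraphs' => lambda { |doc| doc.scan(/<p\b[^>]*>/).size }
}

iterations = 100   # same iteration count as quoted above
results = {}       # last result per label, as a sanity check on each query

Benchmark.bm(26) do |bm|
  candidates.each do |label, work|
    bm.report(label) do
      iterations.times { results[label] = work.call(html) }
    end
  end
end
```

Keeping the parse step inside the timed block matters here, since the figures above measure parsing plus querying rather than the query alone.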

In terms of preserving the original HTML, I found the libxml2 and Hpricot parsers to be fairly even, with both doing a pretty good job of fixing up broken HTML. There were minor differences in the XML produced, and from a (biased, nitpicking) spec point of view I think libxml2's output is slightly more 'proper' (self-closing tags, etc). RubyfulSoup, on the other hand, seemed to have a few inconsistencies: it would occasionally lose tag attributes, and sometimes return varying results for the same query.

As for feature support, well, I don't want to rain on anyone's parade but the libxml HTML parser outputs an XML::Document with which you can transparently use all of libxml2's (many) features ... ;) I couldn't get XPath functions to work with Hpricot, but then I'm not sure how complete an XPath implementation it's aiming for, and apart from that it seems pretty solid. OTOH RubyfulSoup has no XPath support at all :(

libxml-perfcomp.rb (1.63 KB)

On Tue, 21 Nov 2006 22:27:15 -0000, Wes Gamble <weyus@att.net> wrote:

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

--
Ross Bamford - rosco@roscopeco.remove.co.uk

HPricot is partially written in C, so it should be faster than a
pure-Ruby lib like RubyfulSoup.

Also, RubyfulSoup aims to be very resilient to malformed markup, so it
must resort to heuristics that have a performance cost. I don't know
how HPricot handles HTML or XML with really serious flaws like tags
that open but never close and so on, but in my experience RubyfulSoup
has managed to deal amazingly well with such problems. If you need to
parse low quality markup, the performance penalty of RubyfulSoup may
be well worth the price.

Cheers,

Luciano

On 11/22/06, Peter Szinek <peter@rubyrailways.com> wrote:

I did not do any benchmarks, but I am scraping a lot of relatively big
pages on a daily basis and I can tell you, RubyfulSoup is magnitudes
slower than HPricot.

Thanks, Ross, that was great. Libxml2 has HTML fixup stuff? That's
sensational. Are the bindings pretty stable?

_why

On Sat, Nov 25, 2006 at 04:50:07AM +0900, Ross Bamford wrote:

In terms of preserving the original HTML, I found the libxml2 and Hpricot
parsers to be fairly even, with both doing pretty good job of fixing up
broken HTML.

Luciano Ramalho wrote:

I did not do any benchmarks, but I am scraping a lot of relatively big
pages on a daily basis and I can tell you, RubyfulSoup is magnitudes
slower than HPricot.

HPricot is partially written in C, so it should be faster than a
pure-Ruby lib like RubyfulSoup.

true

Also, RubyfulSoup aims to be very resilient to malformed markup,

So does HPricot. HPricot is not just an HTML parser which can parse
(relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
whether HPricot's 'somehow' is better or worse than RubyfulSoup's, but
it is a fact that HPricot handles malformed pages very well.

so it

must resort to heuristics that have a performance cost. I don't know
how HPricot handles HTML or XML with really serious flaws like tags
that open but never close and so on,

That particular case is handled absolutely fine. Maybe we need a list of
serious problems so we can see how Hpricot and RubyfulSoup handle each of
them. From what I have seen, HPricot has not had problems with any page...

has managed to deal amazingly well with such problems. If you need to
parse low quality markup, the performance penalty of RubyfulSoup may
be well worth the price.

I am still not sure what the added benefits of RubyfulSoup's parsing
over HPricot's are (although I am not claiming that there are none) - I
would like to see a really serious comparison to decide this...

Peter

On 11/22/06, Peter Szinek <peter@rubyrailways.com> wrote:

__
http://www.rubyrailways.com

It surely does: see libxml2's HTMLparser module, an interface for an HTML
4.0 non-verifying parser. It's a new addition to the bindings (still in
CVS), but it's really 'just another parser' and uses the same (reasonably
well tested) parser context / tree bindings as the regular XML parsers.

On Sat, 2006-11-25 at 09:00 +0900, _why wrote:

On Sat, Nov 25, 2006 at 04:50:07AM +0900, Ross Bamford wrote:
> In terms of preserving the original HTML, I found the libxml2 and Hpricot
> parsers to be fairly even, with both doing pretty good job of fixing up
> broken HTML.

Thanks, Ross, that was great. Libxml2 has HTML fixup stuff? That's
sensational. Are the bindings pretty stable?

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

} Luciano Ramalho wrote:

On Wed, Nov 22, 2006 at 08:03:54PM +0900, Peter Szinek wrote:
} > On 11/22/06, Peter Szinek <peter@rubyrailways.com> wrote:
[...]
} > Also, RubyfulSoup aims to be very resilient to malformed markup,
} So does HPricot. HPricot is not just an HTML parser which can parse
} (relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
} whether HPricot's 'somehow' is better or worse than RubyfulSoup's, but
} it is a fact that HPricot handles malformed pages very well.
}
} > so it must resort to heuristics that have a performance cost. I don't
} > know how HPricot handles HTML or XML with really serious flaws like
} > tags that open but never close and so on,
} This concretely is absolutely OK. Maybe we would need a list of serious
} problems and see how Hpricot vs RubyfulSoup is handling them. From what
} I have seen, HPricot did not have any problems with any page...

HPricot even keeps track of when tags are (incorrectly) closed by a
different close tag. This can allow you to track down issues in broken HTML
if that's your intent, but since I am mostly using HPricot for sanitization
I just set the close tags to nil so the output closes with the correct tag.
I do find it a little annoying that HPricot will always produce an
open/close pair even if the input was self-closing (e.g. <foo />) unless
the tag is known to be an empty tag by HPricot (see
Hpricot::ElementContent).

} > has managed to deal amazingly well with such problems. If you need to
} > parse low quality markup, the performance penalty of RubyfulSoup may
} > be well worth the price.
} I am still not sure what are the added benefits of RubyfulSoup parsing
} over HPricot (although I am not claiming that there are none) - I would
} like to see a real serious comparison to decide this...

I haven't tried RubyfulSoup, but HPricot suits my needs nicely. I am
delighted by its reliance on a bare minimum of HPricot-specific objects. It
doesn't try to behave like a real DOM, which means that it can use arrays
for child lists and ordinary references for parent nodes and hashes for
attributes, all read/write. It is possible to perform significant
transformations with minimal difficulty.
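
To make that design point concrete, here is a minimal sketch of a tree built only from arrays, hashes and plain references. The `Node` struct is hypothetical and written for illustration; it is not one of HPricot's classes.

```ruby
# Hypothetical node type for illustration only; these are not HPricot's classes.
Node = Struct.new(:name, :attributes, :children, :parent) do
  # Append a child and point its parent reference back at this node.
  def add(child)
    child.parent = self
    children << child
    child
  end

  # Serialise the subtree with open/close tag pairs (text children are Strings).
  def to_s
    attrs = attributes.map { |k, v| %( #{k}="#{v}") }.join
    inner = children.map(&:to_s).join
    "<#{name}#{attrs}>#{inner}</#{name}>"
  end
end

div  = Node.new('div', {}, [], nil)
para = div.add(Node.new('p', { 'class' => 'old' }, ['Hello'], nil))
div.add(Node.new('br', {}, [], nil))

para.attributes['class'] = 'new'              # attributes are a plain Hash
div.children.reject! { |c| c.name == 'br' }   # children are a plain Array

html = div.to_s
puts html   # prints <div><p class="new">Hello</p></div>
```

Because the containers are ordinary Ruby objects, every Array and Hash method is available for transformations without any tree-specific API.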

} Peter
--Greg

Thanks for the input, Peter. From your opinion and others', it seems
HPricot is the best option. Coming from Python, I was used to
BeautifulSoup, from which RubyfulSoup derived, and I was very happy
with it. But if we can have the same benefits with better performance,
then it's a no-brainer!

Cheers,

Luciano

On 11/22/06, Peter Szinek <peter@rubyrailways.com> wrote:

> Also, RubyfulSoup aims to be very resilient to malformed markup,
So does HPricot. HPricot is not just an HTML parser which can parse
(relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
whether HPricot's 'somehow' is better or worse than RubyfulSoup's, but
it is a fact that HPricot handles malformed pages very well.

Thanks for all of the comments.

I was pretty sure that Hpricot was faster since it is partially written
in C, but it's nice to hear a resounding "YES" on that topic.

My concern about "preserving original markup" has to do with the
application I'm writing, which grabs a page and then tries to display
it. When RubyfulSoup encountered bad HTML, it could parse it OK,
but it always attempted to fix it when I went to write out the parse
tree, which can cause problems when you try to redisplay the HTML.

Some malformed HTML is handled fine by browsers, so I'd like to preserve
the original HTML regardless of its quality. If Hpricot will not only
parse my HTML quickly, but also not fix the HTML on the way out (dumping
the parse tree), that would be ideal.

Again, thanks for all of the discussion - it's quite helpful.

Wes


Mmm. Okay, good point. So if a tag comes in as self-closing, keep it that way?
I think that's reasonable.

_why

On Wed, Nov 22, 2006 at 09:40:33PM +0900, Gregory Seidman wrote:

I do find it a little annoying that HPricot will always produce an
open/close pair even if the input was self-closing (e.g. <foo />) unless
the tag is known to be an empty tag by HPricot (see
Hpricot::ElementContent).

I totally agree with you regarding preserving the original markup. In fact,
the latest Hpricot code (in subversion) has two methods for output:

* `to_html` which outputs fully closed tags and strips out bogus end tags.
* `to_original_html` which outputs the original document (as close as it can)
   with your modifications made.

So, for example, I use the `to_original_html` method in MouseHole, which is a
scriptable personal HTTP proxy (sort of like greasemonkey). Some pages (like
Boing Boing, for instance) completely break if you try to fix up the HTML. But
this new method can successfully remove stuff and alter stuff without turning the
whole page upside-down.

_why

On Thu, Nov 23, 2006 at 12:28:30AM +0900, Wes Gamble wrote:

My concern about "preserving original markup" has to do with this
application I'm writing, which grabs a page and then tries to display
it. When RubyfulSoup would encounter bad HTML, it could parse it ok,
but it always attempts to fix it when I went to write the parse tree.
Which can cause problems when you try to redisplay the HTML.

_why wrote:

I totally agree with you regarding preserving the original markup. In
fact,
the latest Hpricot code (in subversion) has two methods for output:

* `to_html` which outputs fully closed tags and strips out bogus end
tags.
* `to_original_html` which outputs the original document (as close as
it can)
   with your modifications made.

sweet.


Wes Gamble wrote:

_why wrote:

I totally agree with you regarding preserving the original markup. In
fact,
the latest Hpricot code (in subversion) has two methods for output:

* `to_html` which outputs fully closed tags and strips out bogus end
tags.
* `to_original_html` which outputs the original document (as close as
it can)
   with your modifications made.

sweet.

Actually, I'm kind of hoping that I can make mods. to the parse tree,
but that no "unnecessary fixing" of bad HTML occurs.

So I'm wondering: does modifying the parse tree at all and then
outputting it imply that the malformed HTML will be fixed or
modified in some way?

Thanks,
Wes


Actually, I'm kind of hoping that I can make mods. to the parse tree,
but that no "unnecessary fixing" of bad HTML occurs.

So I'm wondering does modifying the parse tree at all and then
outputting it imply that all of the malformed HTML will be
fixed/modified in some way or not?

With `to_original_html`, no malformed HTML is fixed.

  >> require 'hpricot'
  >> doc = Hpricot("<div><p>Paragraph one<p>Paragraph two <b>with <i>some</b> tags in it <b etc.=></p>")
  >> (doc/:p).set('class', 'new')
  >> puts doc.to_original_html
  <div><p class="new">Paragraph one<p class="new">Paragraph two <b>with <i>some</b> tags in it <b etc.=></p>

With `to_html`, Hpricot will line up all the tags.

_why

On Thu, Nov 23, 2006 at 03:57:51AM +0900, Wes Gamble wrote: