Ray Chen wrote:
I am also working on a performance app that requires feed parsing.
As previously mentioned, feed-normalizer aims to produce a 'Feed' object that is independent of the underlying format. This means it will try each parser (in a user-defined order) until it gets back a successful parse and a usable object to interface with.
What this also means is that the *primary* goal of feed-normalizer is to produce the aforementioned Feed object graph. That might mean hitting three parsers before it gets a result, so performance isn't really a design consideration.
Of course, you could change the order of parsing so that feed-normalizer uses the fastest parser first, and so on. feed-normalizer currently orders parsers from most strict to most liberal by default. Right now, this just happens to be fastest parser first, too.
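To make the ordering idea concrete, here is a minimal sketch of a "try each parser in order" loop. It does not use feed-normalizer's actual API; StrictParser, LiberalParser and parse_with_fallback are hypothetical names made up for illustration, and the two lambdas are toy stand-ins for real parser wrappers.

```ruby
# Hypothetical stand-ins for real parser wrappers (NOT feed-normalizer's API).
# Each takes an XML string and returns a Hash on success, or nil on failure.

# "Strict": only accepts a document with both an opening and closing <rss> tag.
StrictParser = lambda do |xml|
  return nil unless xml.include?("<rss") && xml.include?("</rss>")
  { parser: :strict, title: xml[/<title>(.*?)<\/title>/, 1] }
end

# "Liberal": accepts anything it can pull a <title> out of.
LiberalParser = lambda do |xml|
  title = xml[/<title>(.*?)<\/title>/, 1]
  title ? { parser: :liberal, title: title } : nil
end

# Try each parser in the given order; return the first usable result.
# A parser that raises is treated the same as one that returns nil.
def parse_with_fallback(xml, parsers)
  parsers.each do |parser|
    begin
      result = parser.call(xml)
      return result if result
    rescue StandardError
      next
    end
  end
  nil
end

# Most strict first, most liberal last (the default order described above).
DEFAULT_ORDER = [StrictParser, LiberalParser].freeze

WELL_FORMED = "<rss><channel><title>Good</title></channel></rss>"
TRUNCATED   = "<rss><channel><title>Broken</title>"

p parse_with_fallback(WELL_FORMED, DEFAULT_ORDER)
p parse_with_fallback(TRUNCATED, DEFAULT_ORDER)
```

Reordering the array is all it takes to put a faster or more liberal parser first; the loop itself does not change.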
The two that I have tried are feedtools and syndication. First I tried feedtools for RSS and Atom, but that was too slow, so I switched to syndication for both RSS and Atom. I found syndication to break on a high percentage of Atom sites, so in the end, I sent RSS to syndication and Atom to feedtools and took the corresponding perf hit for Atom feeds.
In this case you could create a wrapper for feed-normalizer that interfaces with both syndication and feedtools, and tell feed-normalizer which one to use first. I assume you'll probably encounter more RSS than Atom.
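As a rough sketch of how that choice could be made per feed, the snippet below picks a preferred backend order from a quick look at the document. The detect_format and parser_order helpers and the PREFERRED table are hypothetical, the :syndication and :feedtools symbols are just labels rather than calls into those libraries, and the regex sniffing is a deliberately crude assumption, not a proper format check.

```ruby
# Crude format sniffing based on the root element name (an assumption for
# illustration; a real wrapper would want something more careful).
def detect_format(xml)
  return :atom if xml =~ /<feed[\s>]/
  return :rss  if xml =~ /<rss[\s>]|<rdf:RDF[\s>]/
  :unknown
end

# Preferred backend order per format: RSS goes to syndication first,
# Atom goes to feedtools first, mirroring the split described above.
PREFERRED = {
  rss:     [:syndication, :feedtools],
  atom:    [:feedtools, :syndication],
  unknown: [:syndication, :feedtools]
}.freeze

# Returns the list of backend labels to try, in order, for this document.
def parser_order(xml)
  PREFERRED.fetch(detect_format(xml))
end

p parser_order('<rss version="2.0"><channel></channel></rss>')
p parser_order('<feed xmlns="http://www.w3.org/2005/Atom"></feed>')
```

Keeping the second backend in each list means a feed that breaks the preferred parser still gets another attempt instead of failing outright.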
I find this approach to be decently robust, but not very elegant. I am going through > 10k feeds a day of all varieties.
Can someone comment on the robustness of Ruby RSS Parser and Lucas Carlson's SimpleRSS? I am curious about Andy's feed normalizer.
I personally have found Ruby's RSS library to be very good at handling RSS feeds that aren't broken. What that means is the results should be predictable, but the chance of a good parse may be lower.
SimpleRSS, on the other hand, is uber-liberal: if the feed even remotely resembles an RSS or Atom document, you'll probably get a pretty good result back, though there are sometimes small errors.
Bob Aman did an overview of both parsers, somewhere on sporkmonger.com.
Back to performance again: I did some rudimentary benchmarks[1] of both Ruby's RSS and SimpleRSS. I think the results really make the case for SimpleRSS being a great 'backup' parser to have when nothing else will parse an ill-formed feed.
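If you want to run a similar rough comparison on your own feeds, something along these lines would do. This is not the code behind the benchmarks in [1]; time_parsers is a hypothetical helper, and the two lambdas are placeholders that you would swap for calls into whichever parsers you are comparing.

```ruby
# Time each parser over the same document for a number of runs.
# `parsers` is a Hash of label => callable taking an XML string.
# Returns a Hash of label => total elapsed seconds.
def time_parsers(parsers, xml, runs = 100)
  parsers.each_with_object({}) do |(name, parser), timings|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    runs.times { parser.call(xml) }
    timings[name] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

SAMPLE_XML = "<rss><channel><title>Example</title></channel></rss>"

# Placeholder callables; replace the bodies with real parser invocations.
SAMPLE_PARSERS = {
  strict_stub:  ->(xml) { xml.scan(/<(\w+)>/).flatten },
  liberal_stub: ->(xml) { xml[/<title>(.*?)<\/title>/, 1] }
}.freeze

SAMPLE_TIMINGS = time_parsers(SAMPLE_PARSERS, SAMPLE_XML, 1_000)
SAMPLE_TIMINGS.each { |name, secs| puts format("%-14s %.6fs", name, secs) }
```

Numbers from a loop like this only mean something relative to each other on the same machine and the same set of feeds.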
And of course, I'm always looking for patches and new parser wrappers for feed-normalizer.
HTH,
Ray
Hope that helps.
Andy
[1] http://blog.andyis.textdriven.com/articles/2006/03/28/parsers-in-the-pool