RSS parsing error

Young_Gyu_Park · 13 July 2009 14:53

Lately I have been trying to parse 'http://www.forbes.com/news/index.xml' using
feedzirra.
If you open this URL, you can see what the problem is.

They added an unnecessary HTML tag, which makes the RSS malformed.

But Google's RSS reader processes it correctly, without any problem.
I wonder how they make that work when I can't.

Please help me narrow the gap between me and Google ^.^

Have a happy day.

Juvenn_Woo · 13 July 2009 15:26

Hi, Young:
I think you could check out the Universal Feed Parser at feedparser.org,
which is a Python package, mainly created by Mark Pilgrim. And I'm
guessing Google Reader uses it in the backend.
As far as I know, there's no equivalent Ruby package for that.

Regards,

Juvenn Woo

G_F · 7 September 2009 18:05

Young Gyu Park wrote:

Lately I have been trying to parse 'http://www.forbes.com/news/index.xml'
using feedzirra.
If you open this URL, you can see what the problem is.

They added an unnecessary HTML tag, which makes the RSS malformed.

Glancing at the output of their feed I see no malformed RSS. I do see
them "exercising some options" that most feeds don't, such as embedding
CDATA in the link tags.

Using Nokogiri to parse this feed is easy:

    #!/usr/bin/env ruby

    require 'nokogiri'
    require 'open-uri'

    url = 'http://www.forbes.com/news/index.xml'
    # URI.open replaces the bare open(url), which Ruby 3.0 and later
    # no longer accept for URLs.
    xml = Nokogiri::XML(URI.open(url))

    # % is Nokogiri's alias for at (first match); / is its alias for search.
    puts "Feed title: #{ (xml%'title').content }"
    puts "Feed description: #{ (xml%'description').content }"
    puts "Feed link: #{ (xml%'link').content }"

    # get the first item
    item = (xml/'item').first
    puts "Item title: #{ (item%'title').content }"
    puts "Item link: #{ (item%'link').content }"
    puts "Item pubDate: #{ (item%'pubDate').content }"
    puts "Item description: #{ (item%'description').content }"

    # author is optional in RSS, so guard against a missing element
    author = item%'author'
    puts "Item author: #{ author ? author.content : '(none)' }"

Not all feeds are this straightforward or well constructed. That's where
a pre-built parsing library comes in handy, but I haven't found one yet
that handles everything out there correctly. Even Google's reader gets
it wrong on some malformed feeds.

Aaron Patterson (AKA tenderlove) has done a great job with Nokogiri.
I've tested a lot of feeds and seen occasions where the built-in RSS
reader and other libraries puked or spun off and never returned. I've
run into feeds that caused Hpricot to be unable to strip broken HTML
embedded inside the descriptions, but Nokogiri was able to handle it.
So, if you can't get a library to do what you want, jump in with
Nokogiri and give it a try.


Kouhei_Sutou1 · 14 July 2009 11:49

Hi,

In <587ca64f0907130826q6eebbb0ar8c7b3404a277f522@mail.gmail.com>
"Re: rss parsing error." on Tue, 14 Jul 2009 00:26:37 +0900,
Juvenn Woo wrote:

Hi, Young:
I think you could check out the Universal Feed Parser at feedparser.org,
which is a Python package, mainly created by Mark Pilgrim. And I'm
guessing Google Reader uses it in the backend.
As far as I know, there's no equivalent Ruby package for that.

That RSS can be parsed with the bundled RSS Parser.
We don't need to use Universal Feed Parser.
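A minimal sketch of the bundled parser, using an inline sample feed invented for this example rather than the Forbes URL. It assumes the rss library is available; on Ruby 3.0 and later it ships as a separate gem, so it may need to be installed first. Passing false as the second argument turns validation off, which makes the parser more tolerant of feeds that are not strictly valid.

```ruby
require 'rss'

# Hypothetical RSS 2.0 feed, with CDATA inside <link> as in the thread.
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example News</title>
      <link>http://example.com/</link>
      <description>Sample feed for illustration</description>
      <item>
        <title>First story</title>
        <link><![CDATA[http://example.com/first]]></link>
        <description>Body of the first story</description>
      </item>
    </channel>
  </rss>
XML

# Second argument is do_validate; false skips strict validation.
feed = RSS::Parser.parse(SAMPLE_FEED, false)

puts "Feed title: #{feed.channel.title}"
feed.items.each do |item|
  puts "Item: #{item.title} (#{item.link})"
end
```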

--
kou

zotium · 7 September 2009 04:05

Has anybody tried comparing the performance of Feedzirra and Universal
Feed Parser? Which is faster when processing thousands of feeds?
