I'm starting to learn Ruby and I was thinking about a little app so I can
get things started as quickly as possible. Since I'm an avid blog reader,
the first thing that went through my mind was a small app that would extract
the RSS or Atom feed from a web page, given the URL.
My first choice was regexps, but I'm thinking that my little app may grow a
little bit more in the not-so-distant future and I might be doing more than
just extracting feeds.
but they don't look really standard and RAA doesn't look like it's currently
maintained. I've also heard that there's a Rails HTML parser but I couldn't
find more info (and I'll probably ask on one of the Rails lists).
Is there a more "standard" way to parse HTML pages in Ruby?
The closest you'll find to a standard is REXML, which is an XML parser that ships in the stdlib. REXML only accepts well-formed XML, though, so you'll want to run your HTML through Tidy first. That's an easy install.
There are a couple of alternatives: Hpricot and html-parser spring instantly to mind.
If you're doing feed parsing, you probably also want to check out feedtools.
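
For what it's worth, here is a rough sketch of the REXML route for the feed-extraction case. It assumes the page has already been cleaned up into well-formed XHTML (for example by Tidy), and it looks for the usual autodiscovery `<link rel="alternate">` tags. The names `feed_urls`, `FEED_TYPES`, and `SAMPLE_PAGE` are made up for illustration, and the sample markup is invented, so treat it as a starting point rather than a finished tool.

```ruby
require 'rexml/document'

# MIME types commonly used by feed autodiscovery links (assumed list).
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Returns the href of every <link rel="alternate"> element whose type
# looks like an RSS or Atom feed. Expects well-formed XHTML as a string.
def feed_urls(xhtml)
  doc = REXML::Document.new(xhtml)
  urls = []
  doc.elements.each('//link') do |link|
    rel  = link.attributes['rel'].to_s.downcase
    type = link.attributes['type'].to_s.downcase
    href = link.attributes['href']
    next unless href
    urls << href if rel.split.include?('alternate') && FEED_TYPES.include?(type)
  end
  urls
end

# Invented sample page, standing in for tidied output.
SAMPLE_PAGE = <<~XHTML
  <html>
    <head>
      <title>Example blog</title>
      <link rel="stylesheet" type="text/css" href="/style.css" />
      <link rel="alternate" type="application/rss+xml" href="http://example.com/feed.rss" />
      <link rel="alternate" type="application/atom+xml" href="http://example.com/feed.atom" />
    </head>
    <body><p>Hello</p></body>
  </html>
XHTML

puts feed_urls(SAMPLE_PAGE)
```

Run as-is, this prints the two feed URLs from the sample page and skips the stylesheet link. Raw, untidied HTML will usually make `REXML::Document.new` raise a parse error, which is why the Tidy step matters.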
> Since I'm an avid blog reader,
> the first thing that went though my mind was a small app that would extract
> the RSS or Atom feed from a web page, giving the URL.
>
> If you're doing feed parsing, you probably also want to check out feedtools.
Well... he probably won't learn much from the FeedTools code, but it is
convenient for this sort of thing: