REXML/RSS parse error

Hello,

I have a problem parsing an RSS file. I open a URL via open-uri and it usually works fine, but with the RSS URLs from ccMixter I get a parse error. It's a bit strange, because if I download the file and open it locally, it works fine.

I tried:
rss = RSS::Parser.parse("http://ccmixter.org/media/api/query?score=400&sinceu=1157536651&limit=25&tags=remix+editorial_pick&rand=1&format=rss",false)

And got:
RSS::NotWellFormedError: This is not well formed XML
Missing end tag for 'html' (got "head")
Line:
Position:
Last 80 unconsumed characters:
         from /usr/lib/ruby/1.8/rss/rexmlparser.rb:24:in `_parse'
         from /usr/lib/ruby/1.8/rss/parser.rb:163:in `parse'
         from /usr/lib/ruby/1.8/rss/parser.rb:78:in `parse'
         from (irb):43

If I save the file and try to open it, it works fine:
rss = RSS::Parser.parse("query",false)

IMHO there should be no difference between opening a local file and opening a URL.

Thanks for all the help I've received from this list over the last few days,
  Patrick

Hi,

In <4577F0FB.8020300@erdbeere.net>
  "REXML/RSS parse error" on Thu, 7 Dec 2006 19:45:53 +0900,

I have a problem while parsing an RSS file. I try to open a URL via
open-uri and it usually works fine, but with the RSS URLs from ccMixter
I get a parse error. It's a bit strange because if I download the file
and try to open it, it works fine.

I tried:
rss = RSS::Parser.parse("http://ccmixter.org/media/api/query?score=400&sinceu=1157536651&limit=25&tags=remix+editorial_pick&rand=1&format=rss",false)

And got:
RSS::NotWellFormedError: This is not well formed XML
Missing end tag for 'html' (got "head")

I got some garbage after the RSS 2.0 document:

  % ruby -r open-uri -e 'puts open("http://ccmixter.org/media/api/query?score=400&sinceu=1157536651&limit=25&tags=remix+editorial_pick&rand=1&format=rss").read' | tail -n 25
      </item>
    </channel>
  </rss>
  "/web/ccmixter/www/cclib/cc-util.php"(205): Cannot modify header information - headers already sent by (output started at /web/ccmixter/www/cclib/cc-feed.php:432) [2006-12-07 07:10 am][138.243.129.4][/media/api/query?score=400&sinceu=1157536651&limit=25&tags=remix+editorial_pick&rand=1&format=rss]
  <html>
  <head>
  <style>
          body {
              font-size: 11px;
              font-family: Verdana, sans-serif;
              background-color: #F99;
              margin: 4%;
             text-align: center;
      }
  </style>
  </head>
  <body>
  <p> <img src="/mixter-files/skull.gif" /></p>
      <h3>wups, ccMixter is experiencing technical difficulties...</h3>
      <p>If you were in the middle of an upload or posting a message it probably worked OK
      but you should click <a href="/">here</a> to get back to the site's home page or
      use your browser's BACK button to return to the site and make sure.</p>
      <p>The admins have been notified of the problem and will look into it very shortly.</p>
  </body>
  </head>
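
A quick way to check a response body for this kind of trailing output, as a minimal sketch. The helper name trailing_after_rss and the sample string are made up for illustration; in practice the body would come from open-uri as in the command above.

```ruby
# Return whatever follows the first closing </rss> tag, with surrounding
# whitespace removed. An empty string means nothing was appended to the feed.
# (Hypothetical helper, not part of the rss library.)
def trailing_after_rss(body)
  _feed, tag, rest = body.partition("</rss>")
  return "" if tag.empty? # no closing tag found, so nothing to report
  rest.strip
end

# Made-up sample: a tiny feed followed by the start of an HTML error page.
sample = "<rss version=\"2.0\"><channel></channel></rss>\n<html><head></head>"
puts trailing_after_rss(sample)
```

If the helper returns a non-empty string, the server appended something after the feed and a strict XML parser will most likely reject the document.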

Thanks,

Patrick Plattes <patrick@erdbeere.net> wrote:
--
kou

Kouhei Sutou wrote:

I got some garbage after the RSS 2.0 document:

Thank you, I hadn't seen that. I've written an e-mail to them, but most RSS readers are able to parse this malformed file. Do you know of any way to force the parser to read it? For RSS it would be OK to stop parsing after the closing RSS tag.

Thanks,
  Patrick

Hi,

In <45782BC1.1040307@erdbeere.net>
  "Re: REXML/RSS parse error" on Thu, 7 Dec 2006 23:56:38 +0900,

Patrick Plattes <patrick@erdbeere.net> wrote:

most RSS readers are able to parse this malformed file. Do you know of
any way to force the parser to read it? For RSS it would be OK to
stop parsing after the closing RSS tag.

What about gsub(/<\/rss>.*\z/m, '</rss>')?
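
A minimal sketch of that substitution applied to a fetched body. The helper name trim_after_rss and the sample document are made up for illustration; sub is used instead of gsub here because the pattern can only match once.

```ruby
# Drop everything after the first closing </rss> tag, using the pattern
# suggested above. With /m, "." also matches newlines, so the match runs
# from </rss> to the end of the string (\z).
# (Hypothetical helper, not part of the rss library.)
def trim_after_rss(body)
  body.sub(/<\/rss>.*\z/m, "</rss>")
end

# Made-up response body: a tiny feed followed by an HTML error page.
body = <<~XML
  <rss version="2.0">
    <channel><title>demo</title></channel>
  </rss>
  <html>
  <head>
  </head>
XML

trimmed = trim_after_rss(body)
puts trimmed
```

The trimmed string can then be passed to RSS::Parser.parse(trimmed, false) in place of the URL, assuming the body was read with open-uri first.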

Thanks,
--
kou

Kouhei Sutou wrote:

What about gsub(/<\/rss>.*\z/m, '</rss>')?

Yes, that works very well :-). I'm very happy now *g*. Thank you very much,
  Patrick