Open-uri fetches outdated content vs. curl

Try running the following program:

···

================
require 'open-uri'

feed_url = "http://www.slate.com/rss"

result1 = open(feed_url).read
puts "Saving result1.xml:"
File.open("result1.xml", "w") {|f| f.write(result1)}

result2 = `curl -L #{feed_url}`
puts "Saving result2.xml:"
File.open("result2.xml", "w") {|f| f.write(result2)}

command = "diff result1.xml result2.xml"
puts system(command)
================

result1 should be identical to result2, but it turns out that the feed
that open-uri fetches is outdated (by over a month), while the feed
that curl fetches is up to date. Can anyone please explain what is
going on?

Thanks!

Reasons I can think of:

i) Both approaches use different paths to the server, namely a different (or no) proxy.

ii) There is something in the request that makes the server send different data.

Can you try to obtain HTTP headers from both approaches? That might clear up a few things. Also, on Unix type systems check for environment variables and ~/.xyzrc files which might affect proxy settings.

Another good idea might be to try a different tool, e.g. a web browser, to see what that turns up.
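For example, something like this (a sketch, not tested against your setup; open-uri exposes the response headers via the #meta hash and the status line via #status on the object it returns, and on Ruby 3.0+ you would call URI.open instead of the bare open):

```ruby
require 'open-uri'

FEED_URL = "http://www.slate.com/rss"

# open-uri returns an IO-like object; #meta holds the response headers
# and #status the HTTP status code and message.
def open_uri_headers(url)
  io = open(url)  # URI.open(url) on Ruby 3.0 and later
  io.meta.merge("status" => io.status.join(" "))
end

# curl: -s silences the progress bar, -I fetches headers only,
# -L follows redirects as in your original script.
def curl_headers(url)
  `curl -sIL #{url}`
end

# Compare the two views of the same URL:
# open_uri_headers(FEED_URL).each { |k, v| puts "#{k}: #{v}" }
# puts curl_headers(FEED_URL)
```

If the two header sets name different proxies (Via, X-Cache and friends), that would confirm reason i).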

Kind regards

  robert

···

On 18.09.2008 02:05, Daniel Choi wrote:


result1 should be identical to result2, but it turns out that the feed
that open-uri fetches is outdated content (by over a month), while the
feed that curl fetches is up-to-date. Can anyone please explain what
is going on?

Thanks for these suggestions. The problem actually just cleared itself
up, after several days during which the open-uri fetch was getting
outdated content. I think it was a problem with upstream proxies. I'll
try to look at the headers out of curiosity.

···

On Sep 18, 2:26 am, Robert Klemme <shortcut...@googlemail.com> wrote:


Can you try to obtain HTTP headers from both approaches? That might
clear up a few things. Also, on Unix type systems check for environment
variables and ~/.xyzrc files which might affect proxy settings.


I used net/http to do the same thing, but this time I printed out the
redirect locations. The result is very interesting. If I don't set
the "User-Agent" header, I get redirected to one proxy -- the one
with outdated content. If I set the "User-Agent" header to "Mozilla/
5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/XX (KHTML, like
Gecko) Safari/YY" (faking Apple Safari), I get redirected to another
proxy, with the up-to-date content.

I didn't know that servers redirected requests to bad or good proxies
depending on the User-Agent header, but that seems to be the case
here.
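A sketch of that net/http experiment (the method name fetch_with_trace, the redirect limit, and the assumption of absolute Location headers over plain HTTP are mine, not the exact script I ran):

```ruby
require 'net/http'
require 'uri'

# Follow redirects by hand, printing each Location header, and
# optionally sending a User-Agent. Illustrative, not the exact script.
def fetch_with_trace(url, user_agent = nil, limit = 5)
  raise "too many redirects" if limit.zero?
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port) do |http|
    req = Net::HTTP::Get.new(uri.request_uri)
    req['User-Agent'] = user_agent if user_agent
    res = http.request(req)
    if res.is_a?(Net::HTTPRedirection)
      puts "redirected to: #{res['location']}"
      # Assumes an absolute Location header, which is what I saw.
      fetch_with_trace(res['location'], user_agent, limit - 1)
    else
      res
    end
  end
end

# Without a User-Agent vs. faking Safari (XX/YY left as placeholders):
# fetch_with_trace("http://www.slate.com/rss")
# fetch_with_trace("http://www.slate.com/rss",
#   "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/XX " \
#   "(KHTML, like Gecko) Safari/YY")
```

For comparison, curl sends its own "curl/..." User-Agent by default; curl -A <string> overrides it, which would let you reproduce both redirect paths from the shell.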

Daniel, thanks for the update! This is interesting stuff. The distinction is probably not so much between "bad" and "good" proxies as between proxies tailored to particular browser versions. Maybe it's a bug and you should show this to your IT department. It could be that they changed firewall rules in the past and the "bad" proxy never gets updated because it lacks connectivity. :-)

Cheers

  robert

···

On 24.09.2008 02:11, Daniel Choi wrote:


I didn't know that servers redirected requests to bad or good proxies
depending on what the User Agent header is. But this seems to be the
case here.