I'm working on a script where I want to download large files off a remote
web server and store them on a local filesystem.
At the moment I'm using code like this:
require 'open-uri'
open(filename, 'w') do |file|
  file.write(open(remote_url).read)
end
I assume this will read the complete content of the remote file into
memory before writing it to the local file. If that assumption is
correct, what is the best/easiest way to do a buffered piecemeal
fetch/store? I've looked at the net/http library but haven't found
anything in there that looks relevant to this.
···
--
Lars Haugseth
"If anyone disagrees with anything I say, I am quite prepared not only to
retract it, but also to deny under oath that I ever said it." -Tom Lehrer
Turns out the OpenURI module is indeed fetching the remote resource
in segments and storing to a temporary file. However, my code above
will read the complete contents of that file into memory before
writing it back out to another file.
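For what it's worth, one way to avoid that extra in-memory copy while still
using open-uri would be something like this (just a sketch; it relies on
open-uri spooling larger bodies to a Tempfile, which is what it does for
anything beyond a few kilobytes, while small bodies come back as a StringIO):

require 'open-uri'
require 'tempfile'
require 'fileutils'

open(remote_url) do |remote|
  if remote.is_a?(Tempfile)
    # large bodies are already spooled to disk; just copy the temp file
    FileUtils.cp(remote.path, filename)
  else
    # small bodies arrive as a StringIO, so reading them whole is harmless
    File.open(filename, 'wb') { |file| file.write(remote.read) }
  end
end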
By inspecting the OpenURI source code I've learned that this is how
it's done (sans proxy handling, error handling etc.):
require 'net/http'
require 'uri'

uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.request_get(uri.path) do |response|
  open(filename, 'w') do |file|
    # read_body yields the body in segments as they arrive off the socket,
    # so the whole download never has to fit in memory
    response.read_body do |segment|
      file.write(segment)
    end
  end
end
I'm a little surprised not to find any convenience method in the standard
libraries doing all this for me, though.
···
* Lars Haugseth <njus@larshaugseth.com> wrote:
I'm working on a script where I want to download large files off a remote
web server and store them on a local filesystem.
At the moment I'm using code like this:
require 'open-uri'
open(filename, 'w') do |file|
  file.write(open(remote_url).read)
end
I assume this will read the complete content of the remote file into
memory before writing it to the local file. If that assumption is
correct, what is the best/easiest way to do a buffered piecemeal
fetch/store? I've looked at the net/http library but haven't found
anything in there that looks relevant to this.
--
Lars Haugseth
"If anyone disagrees with anything I say, I am quite prepared not only to
retract it, but also to deny under oath that I ever said it." -Tom Lehrer
Why? It's all of one line:
output.write input.read(16384) until input.eof?
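In the context of the original question that would slot in roughly like
this (a sketch only; remote_url and filename are the names from the first
post, and the object open-uri yields responds to read and eof? whether it
is a Tempfile or a StringIO underneath):

require 'open-uri'

open(remote_url) do |input|
  File.open(filename, 'wb') do |output|
    # copy in 16 KB chunks so only one chunk is held in memory at a time
    output.write input.read(16384) until input.eof?
  end
end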
···
On Oct 2, 2008, at 01:03 AM, Lars Haugseth wrote:
Turns out the OpenURI module is indeed fetching the remote resource
in segments and storing to a temporary file. However, my code above
will read the complete contents of that file into memory before
writing it back out to another file.
I'm a little surprised not to find any convenience method in the standard
libraries doing all this for me, though.
Nice enough, but one will need a bit more than that single line to do the
whole operation from start to finish.
I was thinking more of something like SomeClass.mirror(url, filename).
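Something along these lines, say (HTTPMirror and .mirror are made-up names
for illustration, not anything that ships in the standard library):

require 'net/http'
require 'uri'

module HTTPMirror
  # stream the response body straight to disk, one segment at a time
  def self.mirror(url, filename)
    uri = URI.parse(url)
    Net::HTTP.start(uri.host, uri.port) do |http|
      http.request_get(uri.path) do |response|
        File.open(filename, 'wb') do |file|
          response.read_body { |segment| file.write(segment) }
        end
      end
    end
  end
end

HTTPMirror.mirror(url, filename)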
···
* Eric Hodel <drbrain@segment7.net> wrote:
On Oct 2, 2008, at 01:03 AM, Lars Haugseth wrote:
> Turns out the OpenURI module is indeed fetching the remote resource
> in segments and storing to a temporary file. However, my code above
> will read the complete contents of that file into memory before
> writing it back out to another file.
>
> I'm a little surprised not to find any convenience method in the
> standard
> libraries doing all this for me, though.
Why? It's all of one line:
output.write input.read(16384) until input.eof?
--
Lars Haugseth
"If anyone disagrees with anything I say, I am quite prepared not only to
retract it, but also to deny under oath that I ever said it." -Tom Lehrer
Today I came across the curb¹ gem (Ruby bindings for libcurl) while
reading a blog posting² about net/http performance, and this gem provides
a convenient class method that does exactly what I want:
require 'curb'
Curl::Easy.download(url, filename)
It also provides lots of other nice stuff, so I will definitely look
into using this one for my future HTTP client needs.
[1] http://curb.rubyforge.org/
[2] http://apocryph.org/analysis_ruby_18x_http_client_performance
···
* Lars Haugseth <njus@larshaugseth.com> wrote:
* Eric Hodel <drbrain@segment7.net> wrote:
>
> On Oct 2, 2008, at 01:03 AM, Lars Haugseth wrote:
> > Turns out the OpenURI module is indeed fetching the remote resource
> > in segments and storing to a temporary file. However, my code above
> > will read the complete contents of that file into memory before
> > writing it back out to another file.
> >
> > I'm a little surprised not to find any convenience method in the
> > standard
> > libraries doing all this for me, though.
>
> Why? It's all of one line:
>
> output.write input.read(16384) until input.eof?
Nice enough, but one will need a bit more than that single line to do the
whole operation from start to finish.
I was thinking more of something like SomeClass.mirror(url, filename).
--
Lars Haugseth
"If anyone disagrees with anything I say, I am quite prepared not only to
retract it, but also to deny under oath that I ever said it." -Tom Lehrer