How to stream or write data into a tar.gz file as if the data were from files?

I have a gazillion little files in memory (each is really just a chunk
of data, but it represents what needs to be a single file) and I need
to throw them all into a .tar.gz archive. In this case, it must be
in .tar.gz format and it must unzip into actual files--although I pity
the fellow that actually has to unzip this monstrosity.

Here are the solutions I've come up with so far:

1. Not portable, *extremely* slow:
    write out all these "files" into a directory and make a system
call to tar (tar -czf ...)

2. Portable but still just as slow:
    write out all these "files" into a directory and use
archive-tar-minitar to make the archive

3. Not portable, but fast:
    stream information into tar/gzip to create the archive (without
ever first writing out files)

I've been looking around on this and the closest I've come is this:
tar cvf - some_directory | gzip - > some_directory.tar.gz

Note that this would still require me to write the files to a
directory (which must be avoided at all costs), but at least the
problem now is how to write data into a tar file. I've been googling
and still haven't turned up anything yet.

4. Hack archive-tar-minitar to enable me to write my data directly
into the format. Looking at the source code, this doesn't seem
terribly hard, but not terribly easy either. Am I missing a method
already written for this kind of thing?

Others?

Right now, anything resembling #3 or #4 would work for me.

My feeling is that it shouldn't be that hard to write data into
a .tar.gz format in either linux or ruby without actually having any
files (i.e., everything in memory or streamed in).

Thanks a lot for any suggestions or ideas!

So why then do you say "without ever first writing out files"?

I'd say #3 (the original formulation) is the one to go. Googling for "ruby tar" quickly turned up this:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/32588

And there is zlib, which can read and write GZip streams. So if ruby-tar can write to any stream, you have your solution.
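
A minimal sketch of the zlib half of this, using only the standard library: Zlib::GzipWriter wraps any object that responds to write, so whatever produces the tar bytes can simply be pointed at it. The StringIO sink and the gzip_into helper name are only there for illustration.

```ruby
require 'zlib'
require 'stringio'

# Compress +payload+ into +sink+, which can be any IO-like object
# (file, socket, pipe, StringIO) that responds to #write.
def gzip_into(sink, payload)
  gz = Zlib::GzipWriter.new(sink)
  gz.write(payload)
  gz.finish   # writes the gzip trailer but leaves +sink+ open
  sink
end

sink = gzip_into(StringIO.new(String.new), 'tar bytes would go here')

# Reading it back shows the stream is a complete gzip file.
restored = Zlib::GzipReader.new(StringIO.new(sink.string)).read
```

Using finish rather than close keeps the underlying stream open, which matters if the caller still wants to use the sink afterwards.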

Kind regards

  robert


On 15.09.2008 20:35, bwv549 wrote:

I have a gazillion little files in memory (each is really just a chunk
of data, but it represents what needs to be a single file) and I need
to throw them all into a .tar.gz archive. In this case, it must be
in .tar.gz format and it must unzip into actual files--although I pity
the fellow that actually has to unzip this monstrosity.

3. Not portable, but fast:
    stream information into tar/gzip to create the archive (without
ever first writing out files)

I've been looking around on this and the closest I've come is this:
tar cvf - some_directory | gzip - > some_directory.tar.gz

Note that this would still require me to write the files to a
directory (which must be avoided at all costs), but at least the
problem now is how to write data into a tar file. I've been googling
and still haven't turned up anything yet.

Others?

Although it's not what you're asking for, as you mention "zipping" maybe
you could consider rubyzip:

  require 'zip/zipfilesystem'
  Zip::ZipFile.open("foo.zip") { |zfs|
    zfs.file.open("member.txt") { |f| f << data }
    zfs.commit
  }

zip is not tar, but it does have some advantages - in particular the
ability to get random-access to any particular member without having to
read through the whole thing from the start.

My feeling is that it shouldn't be that hard to write data into
a .tar.gz format in either linux or ruby without actually having any
files (i.e., everything in memory or streamed in).

When reading, rubyzip lets you spool directly out of the zip. When
writing, I think that behind the scenes it spools to a tempfile, and
when you commit it then packs this into the archive.

--
Posted via http://www.ruby-forum.com/.

This may be a little late, but better late than never.
Have you considered using #1 with a tmpfs and memory mapped files?
This isn't exactly portable, but should be pretty fast since as far as
tar is concerned your in-memory files just look like a regular
filesystem thanks to tmpfs.
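
A rough sketch of that idea, assuming /dev/shm is a tmpfs mount (common on Linux, but not guaranteed) and that a tar supporting -z is on the PATH. The helper name tar_via_scratch_dir is made up for this example, and it falls back to the ordinary temp directory, which may be disk-backed.

```ruby
require 'tmpdir'
require 'fileutils'

# Write the in-memory "files" into a scratch directory (tmpfs if available),
# then let the system tar build the .tar.gz from there.
def tar_via_scratch_dir(files, archive_path)
  archive_path = File.expand_path(archive_path)
  shm = '/dev/shm'
  base = (File.directory?(shm) && File.writable?(shm)) ? shm : Dir.tmpdir

  Dir.mktmpdir('targz', base) do |dir|
    files.each do |name, data|
      path = File.join(dir, name)
      FileUtils.mkdir_p(File.dirname(path))
      File.binwrite(path, data)
    end
    # -C makes member names relative to the scratch directory; passing "."
    # avoids putting a gazillion names on the command line.
    system('tar', '-czf', archive_path, '-C', dir, '.') || raise('tar failed')
  end
  archive_path
end

out = tar_via_scratch_dir({ 'a_dir/one.txt' => 'my data', 'two.txt' => 'my data also' },
                          File.join(Dir.tmpdir, "sketch-#{Process.pid}.tar.gz"))
```

The scratch directory is removed when the mktmpdir block exits, so only the archive is left behind.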


On Sep 15, 1:35 pm, bwv549 <jtpri...@gmail.com> wrote:

I have a gazillion little files in memory (each is really just a chunk
of data, but it represents what needs to be a single file) and I need
to throw them all into a .tar.gz archive. In this case, it must be
in .tar.gz format and it must unzip into actual files--although I pity
the fellow that actually has to unzip this monstrosity.

And Googling for "ruby tar library" turns up:

  <http://raa.ruby-lang.org/project/minitar/>

:: which looks pretty appropriate :)

FWIW,


--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com

So why then do you say "without ever first writing out files"?

I'm just trying to show that if I can stream out a tar file, then I
can at least pipe it into gzip (on many OS's). So, I'm really stuck
at making a tar file without actually having to write files to disk
first.

And there is zlib which allows to read and write GZip streams. So, if
ruby-tar allows to write into any stream you got your solution.

I looked at ruby-tar (on your suggestion) but ruby-tar turns out to
not have any write capabilities.

So, I'm still looking deeper into archive-tar-minitar. I also found
'tarruby' (bindings to the C libtar library) in rubyforge but it seems
more difficult to hack into than minitar.

As pointed out, the difficulty here has been narrowed down to writing
tar files without having to write files out to disk first.

Sincere thanks for the suggestions.

you could consider rubyzip:

require 'zip/zipfilesystem'
Zip::ZipFile.open("foo.zip") { |zfs|
zfs.file.open("member.txt") { |f| f << data }
zfs.commit
}

This is *exactly* what I need to be able to do, except with .tar.gz
files. I will use this solution for now, even while still searching
for (or maybe writing) the .tar.gz equivalent. Short term, this will
get me by... [even though a .tar.gz equivalent would be really nice].

Thanks!!

This is *exactly* what I need to be able to do, except with .tar.gz
files. I will use this solution for now

Do test it though. I tested it streaming large files in (100MB), and
found that it created a tempfile behind the scenes. If it does this for
*all* files, then it may not be any more efficient than using
archive-tar-minitar.

But it does have a simple API, which is essentially the same as File and
Dir. (Although unfortunately you can't use it to open a zipfile which is
within a zipfile :-)


IO.popen 'tar cfz -', 'w+' do |pipe|

end

and just send files down the pipe

a @ http://codeforpeople.com/


--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

Do test it though. I tested it streaming large files in (100MB), and

Yes, upon testing I saw that it was creating a bunch of temp files,
too. It's too bad since the API is so clean! Perhaps it will be
reimplemented someday...


********************************************************************
********************** A solution using Minitar *******************

So, I hacked on archive-tar-minitar for a while and came up with a
solution. Right now I add a class method that fits with the style of
the pack_file method (indeed, pilfers most of its code) and then I can
access it using the slightly lower level interface than 'pack':

require 'archive/tar/minitar'
require 'stringio'

module Archive::Tar::Minitar

  # entry may be a string (the name), or it may be a hash specifying the
  # following:
  # :name (REQUIRED)
  # :mode 33188 (0o100644, rw-r--r--) for files,
  #       16877 (0o40755, rwxr-xr-x) for dirs
  # :uid nil
  # :gid nil
  # :mtime Time.now
  #
  # if data == nil, then this is considered a directory!
  # (use an empty string for a normal empty file)
  # data should be something that can be opened by StringIO
  def self.pack_as_file(entry, data, outputter) #:yields action, name, stats:
    outputter = outputter.tar if outputter.kind_of?(Archive::Tar::Minitar::Output)

    stats = {}
    stats[:uid] = nil
    stats[:gid] = nil
    stats[:mtime] = Time.now

    if data.nil?
      # a directory
      stats[:size] = 4096 # is this OK???
      stats[:mode] = 16877 # rwxr-xr-x
    else
      stats[:size] = data.bytesize # the tar header needs bytes, not characters
      stats[:mode] = 33188 # rw-r--r--
    end

    if entry.kind_of?(Hash)
      name = entry[:name]

      entry.each { |kk, vv| stats[kk] = vv unless vv.nil? }
    else
      name = entry
    end

    if data.nil? # a directory
      yield :dir, name, stats if block_given?
      outputter.mkdir(name, stats)
    else # a file
      outputter.add_file_simple(name, stats) do |os|
        stats[:current] = 0
        yield :file_start, name, stats if block_given?
        StringIO.open(data, "rb") do |ff|
          until ff.eof?
            stats[:currinc] = os.write(ff.read(4096))
            stats[:current] += stats[:currinc]
            yield :file_progress, name, stats if block_given?
          end
        end
        yield :file_done, name, stats if block_given?
      end
    end
  end
end

#####################################
# Then to use it to make a .tgz file:
#####################################

require 'zlib'

file_names = ['a_dir/dorky1', 'dorky2', 'an_empty_dir']
file_data_strings = ['my data', 'my data also', nil]

tgz = Zlib::GzipWriter.new(File.open('my_tar.tgz', 'wb'))

Archive::Tar::Minitar::Output.open(tgz) do |outp|
  file_names.zip(file_data_strings) do |name, data|
    Archive::Tar::Minitar.pack_as_file(name, data, outp)
  end
end

***********************************************************

So, not terribly pretty, but not too terrible either.

Ara Howard wrote:

IO.popen 'tar cfz -', 'w+' do |pipe|

end

and just send files down the pipe

Uh??

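With a leading dash, "tar -cfz -" creates a tarfile called "z" and tries to
pack a file called "-" in it. In the dashless form above, "-" is taken as the
archive name (stdout), but there is still nothing named for tar to pack.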

"tar czf - file1 file2 file3" reads the named files from disk and sends
the *output* to stdout.

If you don't specify any files, then nothing is created:

  $ tar -czf -
  tar: Cowardly refusing to create an empty archive
  Try `tar --help' or `tar --usage' for more information.

That's for gnu tar, maybe others work differently. However, as far as I
know, you can't get tar to read the *content* of files on stdin - and
even if you could, how would you format them? That is, how would you
delimit the start and end of each file, and assign a name to each one?
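
For what it's worth, the tar format itself answers the delimiting question: each member is preceded by a 512-byte header carrying its name and size, the data is padded out to a 512-byte boundary, and two zero-filled blocks end the archive. A rough sketch of writing that by hand, under some assumptions: ustar header layout, regular files only, names of at most 100 bytes, uid/gid left at 0. The names tar_header and write_tar are made up for this example.

```ruby
require 'stringio'

TAR_BLOCK = 512

# Build one 512-byte ustar header for a regular file.
def tar_header(name, size, mode: 0o644, mtime: Time.now)
  raise ArgumentError, 'name longer than 100 bytes' if name.bytesize > 100

  header = [
    name,                         # name      (100 bytes)
    format('%07o', mode),         # mode      (8)
    format('%07o', 0),            # uid       (8)
    format('%07o', 0),            # gid       (8)
    format('%011o', size),        # size      (12)
    format('%011o', mtime.to_i),  # mtime     (12)
    ' ' * 8,                      # checksum counts as spaces while summing
    '0',                          # typeflag: regular file
    '',                           # linkname  (100)
    'ustar', '00',                # magic (NUL-padded to 6) and version
    '', '', '', '', '', ''        # uname, gname, devmajor, devminor, prefix, pad
  ].pack('a100a8a8a8a12a12a8a1a100a6a2a32a32a8a8a155a12')

  # The checksum is the byte sum of the header, stored as six octal digits,
  # a NUL and a space, at offset 148.
  header[148, 8] = format("%06o\0 ", header.bytes.sum)
  header
end

# Write each name/data pair as header + data + padding, then the end marker.
def write_tar(io, files)
  files.each do |name, data|
    io.write(tar_header(name, data.bytesize))
    io.write(data)
    io.write("\0" * ((-data.bytesize) % TAR_BLOCK))
  end
  io.write("\0" * (2 * TAR_BLOCK))
  io
end

files = { 'foo.txt' => 'This is file foo', 'bar.txt' => 'This is file bar' }
archive = write_tar(StringIO.new(String.new), files).string
```

So the data can come from memory, but something has to produce those headers; that is the job a tar-writing library does, and it is why raw file contents can't just be piped into tar's stdin.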


sorry. i misread the OPs question. tar can only unpack to stdout, not create from stdin.

a @ http://codeforpeople.com/


So, I hacked on archive-tar-minitar for a while and came up with a
solution.

You got me interested now.

I just installed the archive-tar-minitar gem and it looks pretty easy to
generate a tar file, without any patching of the library:

  require 'rubygems'
  require 'archive/tar/minitar'

  src = {
          "foo.txt" => "This is file foo",
          "bar.txt" => "This is file bar",
  }

  File.open("test.tar","wb") do |tarfile|
    Archive::Tar::Minitar::Writer.open(tarfile) do |tar|
      src.each do |name, data|
        tar.add_file_simple(name, :size=>data.bytesize, :mode=>0644) { |f| f.write(data) }
      end
    end
  end

All I did was a quick poke around the API (gem server --daemon; launch
web browser pointing at http://localhost:8808/) and look for something
called "Writer" :)
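
The example above writes a plain .tar; to get the .tar.gz that was asked for, the writer can be pointed at a gzip stream instead of the file. Here is a sketch of that shape which avoids needing the minitar gem by using Gem::Package::TarWriter, the tar writer bundled with RubyGems. That class is not mentioned anywhere in this thread, so treat it as a swapped-in stand-in for Minitar::Writer; the targz helper name is made up.

```ruby
require 'rubygems/package'
require 'zlib'
require 'stringio'

# Pack a { name => data } hash into +io+ as a gzip-compressed tar stream,
# without touching the filesystem.
def targz(files, io)
  gz = Zlib::GzipWriter.new(io)
  Gem::Package::TarWriter.new(gz) do |tar|
    files.each do |name, data|
      tar.add_file_simple(name, 0o644, data.bytesize) { |f| f.write(data) }
    end
  end
  gz.finish   # flush the gzip trailer, leave +io+ open
  io
end

src = { 'foo.txt' => 'This is file foo', 'bar.txt' => 'This is file bar' }
bytes = targz(src, StringIO.new(String.new)).string

# Read the archive back to check what went in.
tar_bytes = Zlib::GzipReader.new(StringIO.new(bytes)).read
unpacked = {}
Gem::Package::TarReader.new(StringIO.new(tar_bytes)).each do |entry|
  unpacked[entry.full_name] = entry.read
end
```

Passing File.open("test.tar.gz", "wb") instead of the StringIO would write the archive straight to disk.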

HTH,

Brian.
