Zlib::GzipReader and multiple compressed blobs in a single stream

Hi,

I'm trying to inflate a set of concatenated gzipped blobs stored in a single
file. As it stands, Zlib::GzipReader only inflates the first blob. It
appears that the #unused instance method would return the remaining data,
ready to be passed into Zlib::GzipReader, but calling it yields an error:

method `method_missing' called on hidden T_STRING object

What could be going on here?

On a related note, Zlib::GzipReader#{pos,tell} return the position in the
output stream (zstream.total_out), whereas I am looking for the position in
the input stream. I tried making zstream.total_in available, but the value
appears to be 18 bytes short in my test file; that is, the next header is
found 18 bytes beyond what zstream.total_in reports.

Does anybody know how to make the library return the correct offset into the
input stream so multiple compressed blobs can be handled?

Thanks,
Jos

···

--
Peace cannot be achieved through violence, it can only be attained through
understanding.

> Hi,
>
> I'm trying to inflate a set of concatenated gzipped blobs stored in a single
> file. As it stands, Zlib::GzipReader only inflates the first blob. It
> appears that the #unused instance method would return the remaining data,
> ready to be passed into Zlib::GzipReader, but calling it yields an error:
>
> method `method_missing' called on hidden T_STRING object
>
> What could be going on here?

I'm not sure what's going on, but I was hoping you could solve your
problem by running something like this:

require 'zlib'

File.open('gzipped.blobs') do |f|
  begin
    loop do
      Zlib::GzipReader.open(f) do |gz|
        puts gz.read
      end
    end
  rescue Zlib::GzipFile::Error
    # End of file reached.
  end
end

Unfortunately, Ruby 1.8 doesn't appear to support passing anything other
than a file name to Zlib::GzipReader.open, and Ruby 1.9 seems to always
reset the file position to the beginning of the file prior to starting
extraction when you really need it to just start working from the
current position. So it doesn't appear that you can do this with the
standard library.

As part of a ZIP library I wrote, there is a more general implementation
of a Zlib stream filter. Install the archive-zip gem and then try the
following:

gem 'archive-zip'
require 'archive/support/zlib'

File.open('gzipped.blobs') do |f|
  until f.eof? do
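    # 15 + 16: deflate window bits plus 16 to select gzip (not zlib) framing.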
    Zlib::ZReader.open(f, 15 + 16) do |gz|
      gz.delegate_read_size = 1
      puts gz.read
    end
  end
end

This isn't super efficient because we have to hack the
delegate_read_size to be 1 byte in order to ensure that the trailing
gzip data isn't sucked into the read buffer of the current ZReader
instance and hence lost between iterations. It shouldn't be too bad
though since the File object should be handling its own buffering.

BTW, I wrote some pretty detailed documentation for Zlib::ZReader. It
should explain what the 15 + 16 is all about in the open method in case
you need to tweak things for your own streams.

> On a related note, Zlib::GzipReader#{pos,tell} return the position in the
> output stream (zstream.total_out), whereas I am looking for the position in
> the input stream. I tried making zstream.total_in available, but the value
> appears to be 18 bytes short in my test file; that is, the next header is
> found 18 bytes beyond what zstream.total_in reports.

I think total_in is counting only the raw deflate data; however, each
gzip blob also carries a header of at least 10 bytes and an 8-byte
trailer (CRC-32 plus the uncompressed size), which accounts for the 18
bytes (10 + 8) you are seeing. You could probably always add 18 to
whatever you get, though the header grows when optional fields such as
the original file name are present, so a fixed offset is fragile. And
as I noted earlier, the implementation of GzipReader seems to always
reset any file object back to the beginning of the stream rather than
start processing it from an existing position. I can't find any
documentation listing a way to force GzipReader to jump to any other
file position after initialization either.
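
For reference, a single gzip blob is laid out like this per RFC 1952,
assuming none of the optional header fields are present:

    +------------------+-------------------------+------------------+
    | header: 10 bytes | deflate data            | trailer: 8 bytes |
    | magic, flags,    | (what total_in counts)  | CRC-32, ISIZE    |
    | mtime, XFL, OS   |                         |                  |
    +------------------+-------------------------+------------------+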

> Does anybody know how to make the library return the correct offset into the
> input stream so multiple compressed blobs can be handled?

Hopefully, my solution will work for you because I don't think the
current implementation in the standard library will do what you need.

-Jeremy

···

On 01/28/2011 05:09 PM, Jos Backus wrote:

It's a bug: the internal buffer that libz uses is dup'd, but that alone is not enough to make it safe for use by Ruby. I have filed a ticket and attached a stupid patch:

http://redmine.ruby-lang.org/issues/show/4360

···

On Jan 28, 2011, at 15:09, Jos Backus wrote:

I'm trying to inflate a set of concatenated gzipped blobs stored in a single
file. As it stands, Zlib::GzipReader only inflates the first blob. It
appears that the #unused instance method would return the remaining data,
ready to be passed into Zlib::GzipReader, but calling it yields an error:

method `method_missing' called on hidden T_STRING object

What could be going on here?

Hi Jeremy,

Thanks for your reply.

[snip]

> > Hi,
> >
> > I'm trying to inflate a set of concatenated gzipped blobs stored in a single
> > file. As it stands, Zlib::GzipReader only inflates the first blob. It
> > appears that the #unused instance method would return the remaining data,
> > ready to be passed into Zlib::GzipReader, but calling it yields an error:
> >
> > method `method_missing' called on hidden T_STRING object
> >
> > What could be going on here?

> I'm not sure what's going on, but I was hoping you could solve your
> problem by running something like this:
>
> require 'zlib'
>
> File.open('gzipped.blobs') do |f|
>   begin
>     loop do
>       Zlib::GzipReader.open(f) do |gz|
>         puts gz.read
>       end
>     end
>   rescue Zlib::GzipFile::Error
>     # End of file reached.
>   end
> end

I tried something like this, but as you point out, it doesn't work.

> Unfortunately, Ruby 1.8 doesn't appear to support passing anything other
> than a file name to Zlib::GzipReader.open, and Ruby 1.9 seems to always
> reset the file position to the beginning of the file prior to starting
> extraction when you really need it to just start working from the
> current position. So it doesn't appear that you can do this with the
> standard library.

That's what it looks like, yes. Bummer.

> As part of a ZIP library I wrote, there is a more general implementation
> of a Zlib stream filter. Install the archive-zip gem and then try the
> following:
>
> gem 'archive-zip'
> require 'archive/support/zlib'
>
> File.open('gzipped.blobs') do |f|
>   until f.eof? do
>     Zlib::ZReader.open(f, 15 + 16) do |gz|
>       gz.delegate_read_size = 1
>       puts gz.read
>     end
>   end
> end

> This isn't super efficient because we have to hack the
> delegate_read_size to be 1 byte in order to ensure that the trailing
> gzip data isn't sucked into the read buffer of the current ZReader
> instance and hence lost between iterations. It shouldn't be too bad
> though since the File object should be handling its own buffering.

This works, but sadly it is very slow. Whereas zcat takes under a second on my
test file, this code takes about 17 seconds.

> BTW, I wrote some pretty detailed documentation for Zlib::ZReader. It
> should explain what the 15 + 16 is all about in the open method in case
> you need to tweak things for your own streams.

Great. But I didn't have to tweak anything; it just worked :-)

> > On a related note, Zlib::GzipReader#{pos,tell} return the position in the
> > output stream (zstream.total_out), whereas I am looking for the position in
> > the input stream. I tried making zstream.total_in available, but the value
> > appears to be 18 bytes short in my test file; that is, the next header is
> > found 18 bytes beyond what zstream.total_in reports.

> I think total_in is counting only the raw deflate data; however, each
> gzip blob also carries a header of at least 10 bytes and an 8-byte
> trailer (CRC-32 plus the uncompressed size), which accounts for the 18
> bytes (10 + 8) you are seeing. You could probably always add 18 to
> whatever you get, though the header grows when optional fields such as
> the original file name are present, so a fixed offset is fragile. And
> as I noted earlier, the implementation of GzipReader seems to always
> reset any file object back to the beginning of the stream rather than
> start processing it from an existing position. I can't find any
> documentation listing a way to force GzipReader to jump to any other
> file position after initialization either.

Yeah, you'd have to feed GzipReader the right part of the input stream
yourself and figure out how much it processed. Something tells me it's not
always 18 but depends on internal buffering, which would invalidate the
assumption of a fixed offset.

> > Does anybody know how to make the library return the correct offset into the
> > input stream so multiple compressed blobs can be handled?

> Hopefully, my solution will work for you because I don't think the
> current implementation in the standard library will do what you need.

It does, but it's very slow. Sigh.

Thanks again, Jeremy.

Cheers,
Jos

···

On Mon, Jan 31, 2011 at 02:28:30AM +0900, Jeremy Bopp wrote:

On 01/28/2011 05:09 PM, Jos Backus wrote:

--
Jos Backus
jos at catnook.com

Once your fix is in place and GzipReader#unused works correctly, is
there any convenient way to take the returned string and continue
processing it along with the remaining file contents with an instance of
GzipReader?

From my testing, it appears that GzipReader.open in Ruby 1.9 always
rewinds any IO object you give it before inflating any data, so you
can't use that method to create your instance if you need to start
reading from anywhere other than the beginning of the stream.
GzipReader.new doesn't have that problem, but there isn't any easy way
to make use of that unused data from the earlier processing along with
the remaining file contents. According to the documentation, you could
create an IO-like wrapper that will first feed in that unused data
followed by the real file data, and GzipReader.new should be able to use
that, but that's a bit of a mess.
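
For example (a hypothetical, untested sketch; only the #read method that
GzipReader actually calls is implemented, and the class name is made up):

require 'zlib'

# Serves the unused bytes from a previous GzipReader first, then falls
# through to the underlying IO.
class PrefixedIO
  def initialize(prefix, io)
    @prefix = (prefix || '').dup
    @io = io
  end

  def read(length)
    chunk = @prefix.slice!(0, length)
    if chunk.length < length
      rest = @io.read(length - chunk.length)
      chunk << rest if rest
    end
    chunk.empty? ? nil : chunk
  end
end

# Usage would then be something like:
#
#   gz = Zlib::GzipReader.new(PrefixedIO.new(previous_reader.unused, file))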

If all that really is a design limitation of GzipReader, having the
unused data isn't very useful when attempting to inflate concatenated
gzip blobs as zcat does. You may be able to make it work with a little
judicious hacking, but it's certainly more effort than it should be.
Maybe a ZcatReader is needed to plaster over things?

BTW, why do GzipReader.open and GzipReader.new behave so differently
with regard to the IO object you pass into them? They're a little
closer in operation under Ruby 1.9 than they were under Ruby 1.8, but
the difference is still surprising given the idiom followed by File.open
and File.new where File.open is really just a simple wrapper around
File.new that can help ensure that File#close is called at the end of
your block.
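
That is, the familiar File idiom:

f = File.new('example.txt')          # caller is responsible for f.close
File.open('example.txt') do |f|      # closed automatically when the block exits
  f.read
end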

-Jeremy

···

On 02/02/2011 07:33 PM, Eric Hodel wrote:

On Jan 28, 2011, at 15:09, Jos Backus wrote:

I'm trying to inflate a set of concatenated gzipped blobs stored in a single
file. As it stands, Zlib::GzipReader only inflates the first blob. It
appears that the #unused instance method would return the remaining data,
ready to be passed into Zlib::GzipReader, but calling it yields an error:

method `method_missing' called on hidden T_STRING object

What could be going on here?

It's a bug: the internal buffer that libz uses is dup'd, but that alone is not enough to make it safe for use by Ruby. I have filed a ticket and attached a stupid patch:

http://redmine.ruby-lang.org/issues/show/4360

Thanks, Eric!

···

On Thu, Feb 03, 2011 at 10:33:59AM +0900, Eric Hodel wrote:

It's a bug: the internal buffer that libz uses is dup'd, but that alone is
not enough to make it safe for use by Ruby. I have filed a ticket and
attached a stupid patch:

http://redmine.ruby-lang.org/issues/show/4360

--
Jos Backus
jos at catnook.com

While I don't think you'll be able to make it as fast as zcat, given
that zcat is 100% native code, you might be able to take the
implementation of Zlib::ZReader and tweak it to avoid the need to read
only 1 byte at a time from the delegate stream. Doing so should speed
things up quite a bit. The existing code really isn't very involved.
Most of the logic you would need to tweak is in the
Zlib::ZReader#unbuffered_read method, which is actually fairly short.

When @inflater reports that it has finished, it looks like you should be
able to get whatever is left in its input buffer using
@inflater.flush_next_in (from Zlib::ZStream). Then you can initialize a
new Zlib::Inflate instance and pass that remaining data as the first
input buffer to process. You would repeat this process every time the
inflater reports it has finished until the end of the delegate is
reached and there is no further data returned by flush_next_in.
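
In rough, untested form, that loop might look something like this (a
sketch only; it assumes Zlib::Inflate accepts the 15 + 16 window bits
for gzip framing, just as Zlib::ZReader uses, and that flush_next_in
hands back the unconsumed input as described):

require 'zlib'

# Inflate each gzip blob with its own Zlib::Inflate, recovering the
# over-read bytes via flush_next_in whenever a blob ends.
def zcat(io, chunk_size = 64 * 1024)
  inflater = Zlib::Inflate.new(15 + 16)
  leftover = ''
  until leftover.empty? && io.eof?
    input = leftover.empty? ? io.read(chunk_size) : leftover
    leftover = ''
    yield inflater.inflate(input)
    if inflater.finished?
      leftover = inflater.flush_next_in  # bytes belonging to the next blob
      inflater.close
      inflater = Zlib::Inflate.new(15 + 16)
    end
  end
  inflater.close
end

File.open('gzipped.blobs') do |f|
  zcat(f) { |chunk| print chunk }
end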

If I get some time this evening, I'll look into creating a sample
implementation. No promises though. :-)

-Jeremy

···

On 2/2/2011 1:37 PM, Jos Backus wrote:

It does, but it's very slow. Sigh.

Fwiw, with the changes just committed to trunk, the following code works for me
on a file with multiple gzipped blobs (#unused returns the bytes GzipReader
over-read past the end of the current blob, so seeking back by that many bytes
leaves the stream positioned at the next gzip header):

    require 'stringio'
    require 'zlib'

    def inflate(filename)
      File.open(filename) do |file|
        zio = StringIO.new(file.read)
        loop do
          io = Zlib::GzipReader.new zio
          puts io.read
          unused = io.unused
          io.finish
          break if unused.nil?
          zio.pos -= unused.length
        end
      end
    end

    inflate "gz"

Thanks,
Jos

···

On Thu, Feb 03, 2011 at 02:03:49PM +0900, Jeremy Bopp wrote:

Once your fix is in place and GzipReader#unused works correctly, is
there any convenient way to take the returned string and continue
processing it along with the remaining file contents with an instance of
GzipReader?

--
Jos Backus
jos at catnook.com

That's great! How does the performance compare to zcat with your data?

BTW, this implementation does require that you have enough memory to
hold all of the gzipped file data at once. That will be a problem with
sufficiently large files or constrained resources.

-Jeremy

···

On 2/3/2011 3:57 PM, Jos Backus wrote:

On Thu, Feb 03, 2011 at 02:03:49PM +0900, Jeremy Bopp wrote:

Once your fix is in place and GzipReader#unused works correctly, is
there any convenient way to take the returned string and continue
processing it along with the remaining file contents with an instance of
GzipReader?

Fwiw, with the changes just committed to trunk the following code works for me
on a file with multiple gzipped blobs:

    require 'stringio'
    require 'zlib'

    def inflate(filename)
      File.open(filename) do |file|
        zio = StringIO.new(file.read)
        loop do
          io = Zlib::GzipReader.new zio
          puts io.read
          unused = io.unused
          io.finish
          break if unused.nil?
          zio.pos -= unused.length
        end
      end
    end

    inflate "gz"

> That's great! How does the performance compare to zcat with your data?

Comparable:

% time zcat gz > /dev/null
zcat gz > /dev/null 0.29s user 0.00s system 99% cpu 0.296 total
% time ./gzr > /dev/null
./gzr > /dev/null 0.31s user 0.07s system 99% cpu 0.383 total
%

> BTW, this implementation does require that you have enough memory to
> hold all of the gzipped file data at once. That will be a problem with
> sufficiently large files or constrained resources.

Using the file directly should avoid that. Since we have a File, we don't need
the StringIO object or the stringio require:

    require 'zlib'

    def inflate(filename)
      File.open(filename) do |file|
        zio = file
        loop do
          io = Zlib::GzipReader.new zio
          puts io.read
          unused = io.unused
          io.finish
          break if unused.nil?
          zio.pos -= unused.length
        end
      end
    end

    inflate "gz"

Cheers,
Jos

···

On Fri, Feb 04, 2011 at 07:38:04AM +0900, Jeremy Bopp wrote:

--
Jos Backus
jos at catnook.com

> > That's great! How does the performance compare to zcat with your data?

> Comparable:
>
> % time zcat gz > /dev/null
> zcat gz > /dev/null 0.29s user 0.00s system 99% cpu 0.296 total
> % time ./gzr > /dev/null
> ./gzr > /dev/null 0.31s user 0.07s system 99% cpu 0.383 total
> %

Excellent.

> > BTW, this implementation does require that you have enough memory to
> > hold all of the gzipped file data at once. That will be a problem with
> > sufficiently large files or constrained resources.

> Using the file directly should avoid that. Since we have a File, we don't need
> the StringIO object or the stringio require:
>
>     require 'zlib'
>
>     def inflate(filename)
>       File.open(filename) do |file|
>         zio = file
>         loop do
>           io = Zlib::GzipReader.new zio
>           puts io.read
>           unused = io.unused
>           io.finish
>           break if unused.nil?
>           zio.pos -= unused.length
>         end
>       end
>     end
>
>     inflate "gz"

The only case where I could see this failing now is if you were given a
non-seekable IO such as a socket or a pipe from which to read. Of
course, I apparently haven't been thinking of solutions to these
problems myself very well, but you'll probably figure out something
pretty quick. ;-)

-Jeremy

···

On 02/03/2011 06:12 PM, Jos Backus wrote:

On Fri, Feb 04, 2011 at 07:38:04AM +0900, Jeremy Bopp wrote: