Hi Jeremy,
Thanks for your reply.
[snip]
> > Hi,
> >
> > I'm trying to inflate a set of concatenated gzipped blobs stored in a single
> > file. As it stands, Zlib::GzipReader only inflates the first blob. It
> > appears that the unused instance method would return the remaining data,
> > ready to be passed into Zlib::GzipReader, but it yields an error:
> >
> >   method `method_missing' called on hidden T_STRING object
> >
> > What could be going on here?
> I'm not sure what's going on, but I was hoping you could solve your
> problem by running something like this:
>
>   require 'zlib'
>
>   File.open('gzipped.blobs') do |f|
>     begin
>       loop do
>         Zlib::GzipReader.open(f) do |gz|
>           puts gz.read
>         end
>       end
>     rescue Zlib::GzipFile::Error
>       # End of file reached.
>     end
>   end
I tried something like this, but as you point out, it doesn't work.
> Unfortunately, Ruby 1.8 doesn't appear to support passing anything other
> than a file name to Zlib::GzipReader.open, and Ruby 1.9 seems to always
> reset the file position to the beginning of the file prior to starting
> extraction when you really need it to just start working from the
> current position. So it doesn't appear that you can do this with the
> standard library.
That's what it looks like, yes. Bummer.
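For readers finding this thread later: there is a standard-library route that sidesteps GzipReader entirely, namely driving Zlib::Inflate yourself. Initialized with windowBits of Zlib::MAX_WBITS + 16, zlib parses the gzip header and trailer itself, so after a member finishes, total_in reports exactly how many input bytes it consumed. A minimal sketch (the helper name is made up; it also reads the whole blob into memory):

```ruby
require 'zlib'

# Inflate every member of a concatenated gzip byte string by driving
# Zlib::Inflate directly. MAX_WBITS + 16 tells zlib to expect a gzip
# wrapper, so it consumes the header and trailer itself, and total_in
# then gives the exact compressed size of the member just finished.
def each_gzip_member(data)
  until data.empty?
    zi = Zlib::Inflate.new(Zlib::MAX_WBITS + 16)
    out = zi.inflate(data)
    raise 'truncated gzip member' unless zi.finished?
    consumed = zi.total_in          # header + deflate data + trailer
    zi.close
    yield out
    data = data.byteslice(consumed, data.bytesize - consumed)
  end
end
```

For large files you would feed the stream in chunks instead of one string and track total_in incrementally, but the boundary arithmetic stays the same.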
> A ZIP library I wrote includes a more general implementation of a Zlib
> stream filter. Install the archive-zip gem and then try the following:
>   gem 'archive-zip'
>   require 'archive/support/zlib'
>
>   File.open('gzipped.blobs') do |f|
>     until f.eof? do
>       Zlib::ZReader.open(f, 15 + 16) do |gz|
>         gz.delegate_read_size = 1
>         puts gz.read
>       end
>     end
>   end
> This isn't super efficient because we have to hack the
> delegate_read_size to be 1 byte in order to ensure that the trailing
> gzip data isn't sucked into the read buffer of the current ZReader
> instance and hence lost between iterations. It shouldn't be too bad
> though since the File object should be handling its own buffering.
This works, but sadly it is very slow. Whereas zcat takes under a second on my
test file, this code takes about 17 seconds.
> BTW, I wrote some pretty detailed documentation for Zlib::ZReader. It
> should explain what the 15 + 16 is all about in the open method in case
> you need to tweak things for your own streams.
Great. But I didn't have to tweak anything; it just worked.
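For anyone puzzling over the 15 + 16 in the archive: it appears to follow zlib's windowBits convention, where 15 selects the maximum 32 KiB window and adding 16 selects the gzip wrapper (adding 32 instead auto-detects zlib or gzip). A quick illustration with the stock Zlib::Inflate, which uses the same convention:

```ruby
require 'zlib'

# zlib's windowBits convention for selecting the stream framing:
#   15        zlib wrapper (RFC 1950), 32 KiB window
#   15 + 16   gzip wrapper (RFC 1952)
#   15 + 32   auto-detect zlib or gzip
#   -15       raw deflate, no wrapper
zlib_data = Zlib::Deflate.deflate('example')  # zlib-wrapped by default

auto = Zlib::Inflate.new(15 + 32)             # accepts either framing
puts auto.inflate(zlib_data)                  # prints "example"
auto.close
```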
> > On a related note, Zlib::GzipReader#{pos,tell} returns the position in the
> > output stream (zstream.total_out) whereas I am looking for the position in
> > the input stream. I tried making zstream.total_in available but the value
> > appears to be 18 bytes short in my test file, that is, the next header is
> > found 18 bytes beyond what zstream.total_in reports.
> I think total_in is counting only the compressed data; however,
> following the compressed data is a trailer as required for gzip blobs.
> You could probably always add 18 to whatever you get, but as I noted
> earlier, the implementation of GzipReader seems to always reset any file
> object back to the beginning of the stream rather than start processing
> it from an existing position. I can't find any documentation listing a
> way to force GzipReader to jump to any other file position after
> initialization either.
Yeah, you'd have to feed GzipReader the right part of the input stream
yourself and figure out how much it processed. Something tells me it's not
always 18 but depends on internal buffering, which would invalidate the
assumption of a fixed offset. (For what it's worth, 18 bytes matches RFC
1952's minimal 10-byte header plus the 8-byte CRC-32/ISIZE trailer; members
carrying optional header fields such as FNAME would skew it further.)
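The unused-based approach from the original question did eventually become workable: later Zlib::GzipReader documentation shows essentially this loop for multi-member files, rewinding the IO by however many bytes the reader over-read past a member's end (on a 1.8/1.9-era interpreter the hidden T_STRING bug may still bite). A sketch, using a StringIO to stand in for the blob file:

```ruby
require 'zlib'
require 'stringio'

# Walk every member of a concatenated gzip stream via GzipReader#unused:
# after a member has been fully read, unused returns the bytes the reader
# consumed past that member's trailer, so rewinding the IO by that many
# bytes lands on the next header. finish (unlike close) leaves the
# underlying IO open for the next iteration.
def zcat_members(io)
  members = []
  until io.eof?
    gz = Zlib::GzipReader.new(io)
    members << gz.read
    unused = gz.unused
    gz.finish
    io.pos -= unused.bytesize if unused  # rewind to the next member
  end
  members
end
```

This works with any seekable IO (File, StringIO), since it only relies on read, eof?, and pos.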
> > Does anybody know how to make the library return the correct offset into the
> > input stream so multiple compressed blobs can be handled?
> Hopefully, my solution will work for you because I don't think the
> current implementation in the standard library will do what you need.
It does, but it's very slow. Sigh.
Thanks again, Jeremy.
Cheers,
Jos
On Mon, Jan 31, 2011 at 02:28:30AM +0900, Jeremy Bopp wrote:
On 01/28/2011 05:09 PM, Jos Backus wrote:
--
Jos Backus
jos at catnook.com