Speed gap between zcat and zlib's GzipReader

I'm still in 1.8.1-land, so this may be old news, but
GzipReader is (painfully) slow compared to using zcat
to accomplish the same thing:

The code:

#!/scratch/ruby/bin/ruby

require 'zlib'

f = ARGV[0]

s = Time.new
infile = Zlib::GzipReader.new(File.new(f, "r"))
#infile = IO.popen("zcat #{f}", "r")   # swap in to benchmark the zcat pipe instead
linecount = 0
infile.each_line { |l|
  linecount += 1
}
e = Time.new
print "Read #{linecount} lines in #{e - s} seconds\n"

···

------------------------------

Tested on:
FreeBSD port-installed ruby 1.8.1
Freshly compiled 1.8.1
Freshly compiled 1.8.1 with CFLAGS=-O2
CVS version, CFLAGS=-O2

Times in seconds:

             FBSD 1.8.1   1.8.1 -O0   1.8.1 -O2   CVS -O2
popen zcat:     2.3          2.3         2.3        2.3
GzipReader:     5.8          9.2         5.8        5.9

Yowza. Before I poke more, is this expected, or a known
slowness issue?

  -Dave

--
work: dga@lcs.mit.edu me: dga@pobox.com
      MIT Laboratory for Computer Science http://www.angio.net/

David G. Andersen wrote:

Yowza. Before I poke more, is this expected, or a known
slowness issue?

I had a similar problem which was discussed here at length a year or
so ago. If you avoid the block setup and use a fixed-length read, it's
quite a bit quicker. Still nowhere near as fast as Perl though :-(.

Ahh, thanks. So the problem is really in GzipReader's each_line
handling. With fixed-length reads it's actually pretty close to as
fast as it can go: doing byte-counting only, popen and GzipReader
both take 1.4 seconds on my test file. A zcat to /dev/null takes
1.18 seconds. Piping to 'wc' takes 1.83 seconds. No complaints.
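
For reference, the fixed-length-read variant looks something like
this (a sketch reconstructed from the description above, not the
exact harness; the 64 KB chunk size is an arbitrary choice):

#!/scratch/ruby/bin/ruby
# Byte-counting with fixed-length reads; GzipReader#read(len)
# returns nil at EOF, which ends the loop.

require 'zlib'

s = Time.new
infile = Zlib::GzipReader.new(File.new(ARGV[0], "r"))
bytecount = 0
while chunk = infile.read(65536)
  bytecount += chunk.length
end
infile.close
e = Time.new
print "Read #{bytecount} bytes in #{e - s} seconds\n"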

gzfile_read is fast. gzfile_read_more (used by gzfile_read) is
fast. But gzreader_gets... is a dog. It does a memcmp() against
the delimiter at every byte offset of the input buffer - yow! So
it looks like zlib's "gets" needs the equivalent of
rb_io_getline_fast. It would be nice if that were easily re-used,
but the FILE * access is buried pretty deep inside it.
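
In Ruby terms, the fix would amount to replacing a
test-at-every-offset loop with a single search per buffer (an
illustrative analogy only, not the actual C):

# Illustrative only: the slow path tests the delimiter at every
# offset; the fast path finds it with one search per buffer, the
# way memchr() does.
buf = "first line\nsecond line\n"
rs  = "\n"

# slow: compare at each byte offset, like the memcmp() loop
i = 0
i += 1 while i < buf.length && buf[i, rs.length] != rs

# fast: a single search call
j = buf.index(rs)

print "slow scan found offset #{i}, single search found #{j}\n"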

Guess I'll have to dig up some spare time next week. :)

  -Dave

···

On Fri, Oct 22, 2004 at 11:30:34AM +0900, Clifford Heath scribed:

> David G. Andersen wrote:
> > popen("zcat foo.gz", "r") faster than GzipReader.each_line

I had a similar problem which was discussed here at length a year or
so ago. If you avoid the block setup and use a fixed-length read, it's
quite a bit quicker. Still nowhere near as fast as Perl though :-(.

--
work: dga@lcs.mit.edu me: dga@pobox.com
      MIT Laboratory for Computer Science http://www.angio.net/

On Tue, Oct 26, 2004 at 10:06:55AM +0900, David G. Andersen scribed:

Ahh, thanks. So the problem is really in GzipReader's each_line
handling.
[...]

But gzreader_gets... is a dog. It does a memcmp() against
the delimiter at every byte offset of the input buffer [...]

I've attached a patch that reduces some of the overhead
for files with longer lines (but doesn't fix all of the
slowdowns). Some benchmarks, w/1.8.1 on FreeBSD, grabbing data out
of the gzipped file with file.gets() (times in seconds; the timing
loop is sketched after the table):

"tarfile" - compressed JDK. Line length is long (random data...)
"words" - /usr/share/dict/words gzipped. Lines are very short.
"logfile" - logfile from one of my experiments. Lines are
             between 15 and 120 bytes long.

          popen   GzReader-orig   GzReader-patched
          -----   -------------   ----------------
tarfile    2.06       5.65             2.95
words      0.914      2.4              2.22
logfile    1.18       3.65             2.27
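
The timing loop is essentially the each_line script from the top
of the thread with gets swapped in (a reconstruction, not the
exact harness):

#!/scratch/ruby/bin/ruby
# Sketch of the gets()-based timing loop; gets returns nil at EOF.

require 'zlib'

s = Time.new
infile = Zlib::GzipReader.new(File.new(ARGV[0], "r"))
linecount = 0
while infile.gets
  linecount += 1
end
infile.close
e = Time.new
print "Read #{linecount} lines in #{e - s} seconds\n"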

The patch is tiny and non-intrusive, which is a bonus, though its
performance improvement is not spectacular for short lines. Helps
with gzipped logfiles, at least, but someone with more {time,
knowledge of ruby's internals} might want to go in and overhaul
things for real.

  -Dave

--- orig-zlib.c	Mon Oct 25 22:01:18 2004
+++ zlib.c	Mon Oct 25 22:33:26 2004
@@ -2470,7 +2470,7 @@
 {
     struct gzfile *gz = get_gzfile(obj);
     VALUE rs, dst;
-    char *rsptr, *p;
+    char *rsptr, *p, *res;
     long rslen, n;
     int rspara;
 
@@ -2520,8 +2520,15 @@
             gzfile_read_more(gz);
             p = RSTRING(gz->z.buf)->ptr + n - rslen;
         }
-        if (memcmp(p, rsptr, rslen) == 0) break;
-        p++, n++;
+        res = memchr(p, rsptr[0], (gz->z.buf_filled - n + 1));
+        if (!res) {
+            n = gz->z.buf_filled + 1;
+        } else {
+            n += (long)(res - p);
+            p = res;
+            if (rslen == 1 || memcmp(p, rsptr, rslen) == 0) break;
+            p++, n++;
+        }
     }
 
     gz->lineno++;

Hi,

···

In message "Re: Speed gap between zcat and zlib's GzipReader" on Tue, 26 Oct 2004 11:37:50 +0900, "David G. Andersen" <dga@lcs.mit.edu> writes:

I've attached a patch that reduces some of the overhead
for files with longer lines (but doesn't fix all of the
slowdowns). Some benchmarks, w/1.8.1 on FreeBSD,
grabbing data out of the gzipped file with file.gets():

I'm impressed. I will merge your patch.

              matz.