SV: SV: [ANN] Archive 0.2

Thanks. :slight_smile: I’m not in a terrible hurry to add Zip support, but that
will surely help me or anybody who does it.

Great - It’s great “I’m” not about to be obsoleted just yet :slight_smile:

though I’d like to think rubyzip is good enough in itself :slight_smile:

Do you think it would be possible to add a set of proxies that adapt
rubyzip interface to Archive’s? After all, the whole Archive thing is
just a matter of convenience (learn to deal with archives once, apply
the knowledge on all formats).

I don’t see why not, and I’d even like to do it. Are the interfaces that rubyzip would have to implement apparent from the archive module code? If not, do you have some documentation?

Incidentally, I think that (i.e., interfaces not being explicit) is one of the drawbacks of dynamically typed languages, and indeed of languages with generic type support (like C++ templates): the interface one has to support becomes less easily identifiable. With templates, at least you will figure it out at compile time :slight_smile:

Cheers,

Thomas

Here is the skeleton of Archive::(Reader|Writer|Modifier|Entry)::Ar.
As you can see, most of the work is done in Entry::Ar.

module Archive
module Modifier
	class Ar < Generic
		# nothing to do here, taken care of in Generic
	end
end

module Reader
	class Ar < Generic
		def initialize(io)
			# mostly nothing to do here
		end
	end
end


module Writer
	class Ar < Generic
		# nothing to do here, taken care of in Generic
	end
end


module Entry
	class Ar < Generic
		def initialize(io)
			# creates two parts and extracts an
			# entry from the io stream
			
			super(io, :header, :data)
			parse
		end

		def parse 
			@parts[:header] = EntryPart.new(Header) do
				start = @io.pos
				raw = @io.read(Header::LEN)
				[start...@io.pos, raw]
			end

			@parts[:data] = EntryPart.new(String) do
				start = @io.pos
				raw = @io.read(@parts[:header].parsed.size)
				[start...@io.pos, raw]
			end

			# ...
		end

		def update 
			@parts[:header].parsed.size = @parts[:data].raw.length
		end

		def to_arc_format 
			# ...
			update
			dump = header.parsed.to_arc_format + data.raw
			return dump
		end
		
		class Header
			# various constants such as header
			# length defined here

			def initialize(header_raw)
				# takes raw header and parses it
				# to initialize itself
			end

			def to_arc_format 
				# ...
			end
		end 
	end
end

end

Basically, you need an Archive::Entry::Zip < Archive::Entry::Generic
that, given a io stream passed to initialize, reads and parses an
archive entry, and leaves the io cursor at the beginning of the next
one. If it can also output itself in Zip format again, you will be
able to pass entries to Archive::Writer for writing.

Each Entry::XYZ has `parts': a Zip entry will most likely have a `data'
and a `header' part. Each part, in turn, has a `raw' and a
`parsed' attribute: `raw' points to the raw stuff as read from the
stream, `parsed' to a structure modifiable from code. You will
probably create an Archive::Entry::Zip::Header object that takes care
of receiving the Zip raw header, parse it, and initialize itself. The
data part will usually remain unparsed.
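
As a rough illustration of the header part described above, here is a minimal sketch of a class that reads a Zip local file header and keeps both the raw bytes and the parsed fields. The class name `ZipLocalHeaderSketch` is hypothetical and it does not inherit from anything in Archive; the field layout (a 30-byte fixed part, then the file name and the extra field) follows the published Zip application note. It assumes an uncompressed (stored) entry with sizes present in the local header.

```ruby
require "stringio"

# Hypothetical stand-in for something like Archive::Entry::Zip::Header.
# It is not part of Archive or rubyzip.
class ZipLocalHeaderSketch
  SIGNATURE = 0x04034b50      # "PK\x03\x04" read as a little-endian integer
  FIXED_LEN = 30              # length of the fixed part of a local header
  FORMAT    = "VvvvvvVVVvv"   # V = 32-bit little-endian, v = 16-bit little-endian

  attr_reader :raw, :name, :extra, :compressed_size, :uncompressed_size

  # Reads one local header from io and leaves the cursor at the
  # first byte of the entry's data.
  def initialize(io)
    fixed = io.read(FIXED_LEN)
    raise ArgumentError, "truncated header" if fixed.nil? || fixed.bytesize < FIXED_LEN

    sig, _version, _flags, _compression, _time, _date, _crc,
      @compressed_size, @uncompressed_size, name_len, extra_len = fixed.unpack(FORMAT)
    raise ArgumentError, "not a local file header" unless sig == SIGNATURE

    @name  = io.read(name_len)
    @extra = io.read(extra_len)
    @raw   = fixed + @name + @extra   # the `raw' view of the header part
  end
end

# Build a tiny stored entry by hand and read it back.
fixed  = [0x04034b50, 20, 0, 0, 0, 0, 0, 5, 5, 9, 0].pack("VvvvvvVVVvv")
io     = StringIO.new(fixed + "hello.txt" + "hello")
header = ZipLocalHeaderSketch.new(io)
data   = io.read(header.compressed_size)   # the data part stays unparsed
```

After the data has been read, the cursor sits at the start of whatever follows the entry, which is the property the entry constructor is expected to guarantee.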

One nice thing of this design is that a mbox Entry can, for example,
implement a to_zip method. Then you could do something like this:

# `in' is a reserved word in Ruby, so the streams are named input/output
input = File.open("/var/mail/joe")
output = File.open("/home/joe/mails.zip", "w")

mbox = Archive::Reader::Mbox.new(input)
zip = Archive::Writer::Zip.new(output)

mbox.scan do |entry|
	zip.add(entry)
end

The `add' method from Archive::Writer will see that we’re trying to
write to a zip and that we have a Mbox::Entry#to_zip method, so it will
use that, and we’ll have converted from Mbox to Zip on the fly.

(This is not in Archive::Writer::Generic yet, because I have written
no to_xyz methods so far.)
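
Since that dispatch is not written yet, the following is only a guess at how it could look: the writer builds the converter name from its own format and asks the entry whether it responds to it. `WriterSketch` and `MboxEntrySketch` are hypothetical names, and the `to_zip` body is a placeholder rather than real Zip output.

```ruby
require "stringio"

# Hypothetical writer; Archive::Writer::Generic does not do this yet.
class WriterSketch
  def initialize(io, format)
    @io = io
    @converter = "to_#{format}"   # e.g. "to_zip" for format :zip
  end

  # Writes the entry if it knows how to convert itself to our format.
  def add(entry)
    unless entry.respond_to?(@converter)
      raise ArgumentError, "#{entry.class} does not implement #{@converter}"
    end
    @io.write(entry.public_send(@converter))
  end
end

# Hypothetical mbox entry with a placeholder conversion.
class MboxEntrySketch
  def initialize(text)
    @text = text
  end

  # A real implementation would emit Zip headers and (compressed) data.
  def to_zip
    "ZIP<#{@text}>"
  end
end

out = StringIO.new
zip = WriterSketch.new(out, :zip)
zip.add(MboxEntrySketch.new("From joe ..."))
```

Entries that lack the converter are rejected with an ArgumentError; falling back to some generic conversion would be another option.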

If you have any questions, just ask.

Massimiliano

p.s.: Sorry to ask, isn’t there some way to have your mail client set
correct References: or at least not modify the subject?

···

On Mon, Jul 08, 2002 at 08:41:41PM +0900, Thomas Søndergaard wrote:

Do you think it would be possible to add a set of proxies that adapt
rubyzip interface to Archive’s? After all, the whole Archive thing is
just a matter of convenience (learn to deal with archives once, apply
the knowledge on all formats).

I don’t see why not, and I’d even like to do it. Are the interfaces
that rubyzip would have to implement apparent from the archive
module code? If not, do you have some documentation?

So the assumption is that headers precede data?

Note that the Zip format usually has two headers per file: one before
the file data, and one at the very end of the file.

So ideally the archive reader would get the stream first and read the
end directory if the stream was seekable, then position before the
first header.

Then as individual pre-file headers were read, their information could
be merged as appropriate with the end-of-zip (Central Directory)
information.
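
A minimal sketch of the first step, finding the end-of-central-directory record on a seekable stream, might look like this. The method name `find_central_directory` is made up for illustration; the signature and the 22-byte fixed layout come from the Zip application note. The sketch ignores Zip64 and multi-disk archives, and it does not guard against the signature bytes appearing inside the archive comment.

```ruby
require "stringio"

EOCD_SIGNATURE = "PK\x05\x06".b   # end-of-central-directory signature
EOCD_MIN_LEN   = 22               # fixed part, without the archive comment
MAX_COMMENT    = 0xFFFF           # the comment length is a 16-bit field

# Returns [entry_count, cd_size, cd_offset], or nil if no record is found.
def find_central_directory(io)
  io.seek(0, IO::SEEK_END)
  file_size = io.pos
  tail_len  = [file_size, EOCD_MIN_LEN + MAX_COMMENT].min

  # Only the tail of the file can contain the record.
  io.seek(file_size - tail_len, IO::SEEK_SET)
  tail = io.read(tail_len).b

  idx = tail.rindex(EOCD_SIGNATURE)
  return nil if idx.nil? || idx + EOCD_MIN_LEN > tail.bytesize

  _sig, _disk, _cd_disk, _entries_here, total, cd_size, cd_offset, _comment_len =
    tail.byteslice(idx, EOCD_MIN_LEN).unpack("VvvvvVVv")
  [total, cd_size, cd_offset]
end

# Synthetic "archive": filler bytes followed by a hand-made record
# claiming 3 entries in a 150-byte directory at offset 1000.
eocd = [0x06054b50, 0, 0, 3, 3, 150, 1000, 0].pack("VvvvvVVv")
io   = StringIO.new("x" * 1150 + eocd)
info = find_central_directory(io)
io.rewind   # position before the first local header
```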

···

On Tuesday 09 July 2002 06:41 am, Massimiliano Mirra wrote:

Basically, you need an Archive::Entry::Zip <
Archive::Entry::Generic that, given a io stream passed to
initialize, reads and parses an archive entry, and leaves the io
cursor at the beginning of the next one. If it can also output
itself in Zip format again, you will be able to pass entries to
Archive::Writer for writing.


Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Basically, you need an Archive::Entry::Zip < Archive::Entry::Generic
that, given a io stream passed to initialize, reads and parses an
archive entry, and leaves the io cursor at the beginning of the next
one. If it can also output itself in Zip format again, you will be
able to pass entries to Archive::Writer for writing.

Each Entry::XYZ has `parts': a Zip entry will most likely have a `data'
and a `header' part. Each part, in turn, has a `raw' and a
`parsed' attribute: `raw' points to the raw stuff as read from the
stream, `parsed' to a structure modifiable from code. You will
probably create an Archive::Entry::Zip::Header object that takes care
of receiving the Zip raw header, parse it, and initialize itself. The
data part will usually remain unparsed.

Isn’t this unnecessarily low level access - I don’t see a need for exposing
the raw header data. Also, I don’t see a reason for treating the header and
data parts as if they are somewhat similar (ie. parts). It seems like
“over-generalization”?

One nice thing of this design is that a mbox Entry can, for example,
implement a to_zip method. […]

I don’t see any reason that mbox Entry should know about zip archives (to_zip
method implies some knowledge). With a general archive interface it is
possible to do better, and write generic code for writing entries of one
archive to another regardless of the archive type.

If you have any questions, just ask.

I don’t understand the interface completely. When you iterate over the items
in the archive, will you generate Entry objects that contain the full
uncompressed contents of the archive entry? That might be reasonable for
mbox files but for zip and tar archives it seems unreasonable, especially
for large archives. In rubyzip I have a ZipInputStream, which allows you to
iterate over the contents of a zip archive without having to read more than
the headers of the entries that you do not care about. Like this:

require 'zip'

Zip::ZipInputStream.open("test/rubycode.zip") { |zipStream|
	while (entry = zipStream.getNextEntry)
		# entry contains the header information. If you want the
		# data, read it from the zipStream as if it were an IO object
		puts "entry is #{entry.name}"
		puts "first 5 characters: '#{zipStream.read(5)}'"
	end
}

Don’t you think this is better? The interface for iterating over the entries
in the archive is a little raw, but that is only because ZipInputStream is
not the preferred way of iterating over the contents of a zip file - instead
ZipFile will read the central directory, so you can do this:

Zip::ZipFile.foreach("test/rubycode.zip") { |entry|
	puts "entry is #{entry.name}"
	puts "first 5 characters: '#{entry.getInputStream { |is| is.read(5) }}'"
}

In either case, the data is only uncompressed and read on demand.
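
To make the on-demand idea concrete without depending on rubyzip, here is a small sketch of an entry object that only remembers where its data lives and reads it when asked. `LazyEntrySketch` is a hypothetical class, not rubyzip's implementation, and it handles uncompressed data only; a real Zip entry would also have to inflate what it reads.

```ruby
require "stringio"

# Hypothetical lazy entry: holds a position in the stream, not the data.
class LazyEntrySketch
  attr_reader :name, :size

  def initialize(io, name, offset, size)
    @io, @name, @offset, @size = io, name, offset, size
  end

  # Reads at most `length` bytes of this entry's data, only when asked.
  def read(length = @size)
    @io.seek(@offset, IO::SEEK_SET)
    @io.read([length, @size].min)
  end
end

# The "archive" here is just a string with the entry's data in the middle.
io    = StringIO.new("aaaaHELLO WORLDbbbb")
entry = LazyEntrySketch.new(io, "greeting", 4, 11)
head  = entry.read(5)   # touches only the first five bytes of the entry
```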

p.s.: Sorry to ask, isn’t there some way to have your mail client set
correct References: or at least not modify the subject?

I have been using Outlook Web Access from home to read mail, and there are
no configuration options for the sending format. This one is sent with
Outlook Express - I hope the format is more agreeable to you.

Thomas

Basically, you need an Archive::Entry::Zip <
Archive::Entry::Generic that, given a io stream passed to
initialize, reads and parses an archive entry, and leaves the io
cursor at the beginning of the next one. If it can also output
itself in Zip format again, you will be able to pass entries to
Archive::Writer for writing.

So the assumption is that headers precede data?

No, the assumption (at least without modifying Reader::XYZ#scan) is
that archives are sequences of blocks and each block is an entry. How
to deal with how an entry is laid out internally is a task of
Entry::XYZ#new.

Note that the Zip format usually has two headers per file: one before
the file data, and one at the very end of the file.

So ideally the archive reader would get the stream first and read the
end directory if the stream was seekable, then position before the
first header.

Then as individual pre-file headers were read, their information could
be merged as appropriate with the end-of-zip (Central Directory)
information.

So the way to make a reader (note that Thomas was interested in making
a proxy to his code, though) would be to not just inherit the scan
method from Archive::Reader::Generic, but instead add code that goes
to the end of the file first, collects the headers it needs, goes back
to the beginning, and creates each entry with entry = Entry.new(io,
extra_data), extra_data being the collected information.

It breaks the regularity of entry = Entry.new(io), but if that’s the
only way…

Massimiliano

···

On Tue, Jul 09, 2002 at 11:35:48PM +0900, Ned Konz wrote:

So the assumption is that headers precede data?

Note that the Zip format usually has two headers per file: one before
the file data, and one at the very end of the file.

So ideally the archive reader would get the stream first and read the
end directory if the stream was seekable, then position before the
first header.

The information in the central directory structure is not too relevant
unless you want to extract the entry to a filesystem, in which case the
permission attributes will be relevant. Maybe someone will miss the entry
comment field, which is not in the local header either.

Then as individual pre-file headers were read, their information could
be merged as appropriate with the end-of-zip (Central Directory)
information.

The local headers do not carry any information that is not in the central
directory, so if you have already read the central directory entry, no
merging is necessary.

Thomas

Isn’t this unnecessarily low level access - I don’t see a need for exposing
the raw header data. Also, I don’t see a reason for treating the header and
data parts as if they are somewhat similar (ie. parts). It seems like
“over-generalization”?

Perhaps, but it has worked fine for me so far. :slight_smile:

One nice thing of this design is that a mbox Entry can, for example,
implement a to_zip method. […]

I don’t see any reason that mbox Entry should know about zip archives (to_zip
method implies some knowledge).

I don’t see a problem with it, I’ve always been happy with things like
String#to_i or String#to_a.

With a general archive interface it is
possible to do better, and write generic code for writing entries of one
archive to another regardless of the archive type.

I have trouble picturing this. Could you please post a pseudo code
example?

If you have any questions, just ask.
I don’t understand the interface completely. When you iterate over the items
in the archive, will you generate Entry objects that contain the full
uncompressed contents of the archive entry? That might be reasonable for
mbox files but for zip and tar archives it seems unreasonable,

It is on the to-do list already. The first incarnation of Archive
(which I posted quite some time ago) had a preloading/caching scheme,
but it was growing too complicated. When I started rewriting it, I
decided to shed it entirely and concentrate on the interface, and add
loading schemes afterwards. Plans included exposing entries as IO
objects, handling via WeakRef those that were cached, and caching the
`map' of the archive (even marshalling to disk, as Thomas Hurst
suggested) so that it wouldn’t have to be rescanned.
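
The marshalling part of that plan could be sketched roughly as follows. The shape of the map (entry name mapped to offset and size) is an assumption made for this example, not the structure Archive actually uses, and a real cache would also need to be invalidated when the archive changes.

```ruby
require "tempfile"

# Assumed map layout for illustration: entry name => [offset, size].
archive_map = { "msg-1" => [0, 1204], "msg-2" => [1204, 873] }

# Dump the map to disk so the archive need not be rescanned next time.
cache = Tempfile.new("archive-map")
cache.binmode
cache.write(Marshal.dump(archive_map))
cache.flush

# Later (possibly in another process): load the map back.
# Marshal.load should only be used on files this program wrote itself.
restored = Marshal.load(File.binread(cache.path))
cache.close!
```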

I have been using Outlook Web Access from home to read mail, and there are
no configuration options for the sending format. This one is sent with
Outlook Express - I hope the format is more agreeable to you.

Yes, it is more agreeable to standards, thus more agreeable to mutt,
thus more agreeable to me. Thanks. :slight_smile:

Massimiliano

···

On Wed, Jul 10, 2002 at 01:07:04AM +0900, Thomas Søndergaard wrote:

I have seen zips in which the two headers each have (different) extra
fields.

···

On Tuesday 09 July 2002 08:20 am, Thomas Søndergaard wrote:

The local headers do not carry any information that is not in the
central directory, so if you have already read the central
directory entry, no merging is necessary.


Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Isn’t this unnecessarily low level access - I don’t see a need for exposing
the raw header data. Also, I don’t see a reason for treating the header and
data parts as if they are somewhat similar (ie. parts). It seems like
“over-generalization”?

Perhaps, but it has worked fine for me so far. :slight_smile:

It’s not that I think it will cause problems; it is just unnecessarily
complex for users of the API. As the inventor you are not likely to have a
problem with it. :slight_smile:

One nice thing of this design is that a mbox Entry can, for example,
implement a to_zip method. […]

I don’t see any reason that mbox Entry should know about zip archives (to_zip
method implies some knowledge).

I don’t see a problem with it, I’ve always been happy with things like
String#to_i or String#to_a.

I don’t have a problem with String#to_i and String#to_a either, except I
don’t like the naming, but that is a different issue :wink:

With a general archive interface it is
possible to do better, and write generic code for writing entries of one
archive to another regardless of the archive type.

I have trouble picturing this. Could you please post a pseudo code
example?

I can try. This is what you wrote earlier:

quote = <<END_OF_QUOTE
# `in' is a reserved word in Ruby, so the streams are named input/output
input = File.open("/var/mail/joe")
output = File.open("/home/joe/mails.zip", "w")

mbox = Archive::Reader::Mbox.new(input)
zip = Archive::Writer::Zip.new(output)

mbox.scan do |entry|
	zip.add(entry)
end

The `add' method from Archive::Writer will see that we’re trying to
write to a zip and that we have a Mbox::Entry#to_zip method, so it will
use that, and we’ll have converted from Mbox to Zip on the fly.
END_OF_QUOTE

I think this example is basically fine, except your comment about how
zip.add(anEntry) will recognize that Mbox::Entry has a to_zip method. This
should not be necessary because Entry is (or should be) an object that
implements an interface that is common for all archive entries.
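
One possible reading of that suggestion, sketched with made-up classes: every entry answers the same two messages (`name` and `read` are assumptions here), and the writer uses nothing else, so the copying code never needs a to_zip on the source side. The output format written by `ToyWriterSketch` is a toy, not Zip.

```ruby
require "stringio"

# Hypothetical common interface: every entry answers #name and #read.
EntrySketch = Struct.new(:name, :data) do
  def read
    data
  end
end

# Hypothetical reader that yields entries, whatever the source format.
class ReaderSketch
  def initialize(entries)
    @entries = entries
  end

  def scan(&block)
    @entries.each(&block)
  end
end

# Hypothetical writer; it relies only on the common entry interface.
class ToyWriterSketch
  def initialize(io)
    @io = io
  end

  def add(entry)
    body = entry.read
    @io.write("#{entry.name}:#{body.bytesize}\n#{body}\n")
  end
end

# Generic copy: works for any reader/writer pair honouring the interface.
def copy_entries(reader, writer)
  reader.scan { |entry| writer.add(entry) }
end

out    = StringIO.new
reader = ReaderSketch.new([EntrySketch.new("a", "hi"), EntrySketch.new("b", "xyz")])
copy_entries(reader, ToyWriterSketch.new(out))
```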

If you have any questions, just ask.
I don’t understand the interface completely. When you iterate over the items
in the archive, will you generate Entry objects that contain the full
uncompressed contents of the archive entry? That might be reasonable for
mbox files but for zip and tar archives it seems unreasonable,

It is on the to-do list already. The first incarnation of Archive
(which I posted quite some time ago) had a preloading/caching scheme,
but it was growing too complicated. When I started rewriting it, I
decided to shed it entirely and concentrate on the interface, and add
loading schemes afterwards. Plans included exposing entries as IO
objects, handling via WeakRef those that were cached, and caching the
`map' of the archive (even marshalling to disk, as Thomas Hurst
suggested) so that it wouldn’t have to be rescanned.

Ok. It seems to me that your design is heavily influenced by the structure
of tar.gz files, where the archive is compressed, instead of the archive
containing compressed entries (I think this is a shortcoming of tar.gz
files). Are there other archive types that are as expensive to scan? What
other archives would benefit significantly from cached (to disk) maps of the
archive? Or from the iterative access, where you have to iterate over
entries, even if you know the id or name of the particular entry that
interests you?

Thomas

···

On Wed, Jul 10, 2002 at 01:07:04AM +0900, Thomas Søndergaard wrote:

I have seen zips in which the two headers each have (different) extra
fields.

Is that allowed according to the spec? I just looked at PKWARE’s Zip
application note (appnote.txt), and I can’t see anywhere that (any of) the
fields in the local header are required to be identical to those in the
central directory entry header. It seems to me, though, that they should be;
otherwise, which values would you use? Those in the cdir entry or those in
the local header?

Ned, I just discovered that you wrote a popular Zip module for Perl. I’d be
very interested in your comments about rubyzip, or indeed your help :wink:

Thomas

the raw header data. Also, I don’t see a reason for treating the header and
data parts as if they are somewhat similar (ie. parts). It seems like
“over-generalization”?
It’s not that I think it will cause problems other than it is unnecessarily
complex for users of the API.

Do you suggest adding a more abstracted interface to the archive
entry? Sounds reasonable.

With a general archive interface it is
possible to do better, and write generic code for writing entries of one
archive to another regardless of the archive type.

I have trouble picturing this. Could you please post a pseudo code
example?

I can try. This is what you wrote earlier:

quote = <<END_OF_QUOTE
# `in' is a reserved word in Ruby, so the streams are named input/output
input = File.open("/var/mail/joe")
output = File.open("/home/joe/mails.zip", "w")

mbox = Archive::Reader::Mbox.new(input)
zip = Archive::Writer::Zip.new(output)

mbox.scan do |entry|
	zip.add(entry)
end

The `add' method from Archive::Writer will see that we’re trying to
write to a zip and that we have a Mbox::Entry#to_zip method, so it will
use that, and we’ll have converted from Mbox to Zip on the fly.
END_OF_QUOTE

I think this example is basically fine, except your comment about how
zip.add(anEntry) will recognize that Mbox::Entry has a to_zip method. This
should not be necessary because Entry is (or should be) an object that
implements an interface that is common for all archive entries.

Sorry, I wasn’t clear. The pseudo code I was asking for is the one for
the solution you are proposing.

Ok. It seems to me that your design is heavily influenced by the structure
of tar.gz files, where the archive is compressed, instead of the archive
containing compressed entries

I’m not sure I follow you here. The module I wrote is a .tar reader,
not a .tar.gz reader.

(I think this is a shortcoming of tar.gz
files). Are there other archive types that are as expensive to scan? What
other archives would benefit significantly from cached (to disk) maps of the
archive?

mbox does. I know, I’ve been opening ruby-talk mboxes on a 486
lately. :slight_smile:

Or from the iterative access, where you have to iterate over
entries, even if you know the id or name of the particular entry that
interests you?

Every archive format that does not save a `table of contents', I
guess.

Massimiliano

···

On Wed, Jul 10, 2002 at 07:10:28AM +0900, Thomas Søndergaard wrote:

I have seen zips in which the two headers each have (different)
extra fields.

Is that allowed according to the spec? I just looked at pkware’s
Zip application note (appnote.txt), and I can’t see anywhere that
(any of) the fields in the local header are required to be
identical to those in the central directory entry header.

Their spec isn’t too helpful on the extra fields.

It seems
to me though, that they should - otherwise, which values would you
use? Those in the cdir entry or those in the local header?

Whichever you felt like, I guess. What I’d do is say that the CD ones
would have precedence over the LD ones (because they were written
later). But I’ve seen cases where the kind of field was different
(perhaps ownership in the LD and timestamps in the CD).

I think you can see this in Info-Zip’s zip files (this is from my
zipinfo.pl that comes with the Archive::Zip module):

$LHMEMBER35 = bless( {
	"uncompressedSize" => 252,
	"versionMadeBy" => 23,
	"bitFlag" => 0,
	"fileName" => "xx.vim~",
	"crc32" => 100336855,
	"desiredCompressionMethod" => 8,
	"localExtraField" =>
		"UT\t\0\3-\244\361;\233\325%=Ux\4\0\364\1\364\1",
	"desiredCompressionLevel" => -1,
	"externalFileAttributes" => "2176057344",
	"lastModFileDateTime" => 728594063,
	"compressionMethod" => 8,
	"cdExtraField" => "UT\5\0\3-\244\361;Ux\0\0",
	"diskNumberStart" => 0,
	"internalFileAttributes" => 1,
	"versionNeededToExtract" => 20,
	"fileComment" => "",
	"externalFileName" => "(xx.zip)",
	"compressedSize" => 111,
	"localHeaderRelativeOffset" => 41512,
	"fh" => $Archive::Zip::BufferedFileHandle,
	"fileAttributeFormat" => 3,
	"dataOffset" => 41570
}, 'Archive::Zip::ZipFileMember' );

Here both the CD and the LD have UT and Ux fields. And they’re
different.
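
For anyone wanting to check this from Ruby, the two extra fields in the dump can be split into their blocks with a few lines. This is only a sketch: `parse_extra_fields` is a made-up helper, and it assumes the usual extra-field layout of a 2-byte id, a 2-byte little-endian size, and then that many bytes of data. The final merge applies the "CD wins" rule blindly, which in this case throws away the ids stored in the local Ux block.

```ruby
# Splits a raw extra field into { id => data }.
# Assumes each block is: 2-byte id, 2-byte little-endian size, data.
def parse_extra_fields(raw)
  fields = {}
  pos = 0
  while pos + 4 <= raw.bytesize
    id   = raw.byteslice(pos, 2)
    size = raw.byteslice(pos + 2, 2).unpack1("v")
    fields[id] = raw.byteslice(pos + 4, size)
    pos += 4 + size
  end
  fields
end

# The two strings from the dump above, copied with their octal escapes.
local_raw   = "UT\t\0\3-\244\361;\233\325%=Ux\4\0\364\1\364\1".b
central_raw = "UT\5\0\3-\244\361;Ux\0\0".b

local   = parse_extra_fields(local_raw)
central = parse_extra_fields(central_raw)

# Central directory values take precedence over local ones.
merged = local.merge(central)
```

Merging per block id, and preferring whichever copy actually carries data, would be a gentler policy than overwriting wholesale.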

Ned, I just discovered that you wrote a popular Zip module for
Perl.

Yes. That’s the only reason I learned about the zip format. I also was
one of the designers of the Microsoft Tape Format, but don’t tell
anyone…

I’d be very interested in your comments about rubyzip. - or
indeed your help :wink:

I’m trying to help …

good luck,

···

On Tuesday 09 July 2002 03:26 pm, Thomas Søndergaard wrote:

Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE