[ANN] Metadata 1.2

tarball: http://dark.fhtr.org/repos/metadata/metadata-1.2.tar.gz
gem: http://dark.fhtr.org/repos/metadata/metadata-1.2.gem
git: http://dark.fhtr.org/repos/metadata

Changes

···

-------

  * fixed id3v2 date parsing
  * audio genres are turned into string names
    e.g. "(0)" becomes "Blues" and
           "(0)Blues" becomes "Blues" as well
  * handling "copyright"-field of mplayer output
  * ability to add file basename and file dirname to metadata
    with --name and --path
  * File.Size for directories is now the directory entry count
  * Don't try to md5sum/sha1sum non-files

Thanks
------

  Konrad Meyer for his patient testing and bug reports.
  Darren Kirby for the heads-up on wmainfo's ASF-parsing capabilities
  (along with being the author of wmainfo-rb and flacinfo-rb.)

Description
-----------

  This package `Metadata' comes with a library called `metadata' and
  a small program called `mdh'.

  The library probes files for their metadata (e.g. jpeg dimensions
  and camera make, mp3 artist, pdf text and word count) and returns the
  metadata as a Hash. All strings in the metadata are converted to UTF-8.

  The `mdh'-program can print out file metadata as YAML and package the
  metadata with the file.

  The metadata hash follows the shared file metadata spec naming, with some
  additional fields, see list at the end of this file (Appendix A.)

  For details on the MDH file format, see the end of this file (Appendix B.)

Usage
-----

  # print out metadata for myfile.jpg
  mdh myfile.jpg

  # create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg
  mdh -c myfile.jpg

  # print out the metadata header from an MDH file
  mdh -e -p myfile.jpg.mdh

  # strip out the metadata header from an MDH file and write the actual file
  # to myfile.jpg
  mdh -e myfile.jpg.mdh

  # include file path, filename, md5sum and sha1sum in the metadata header
  mdh --path --name -m -s myfile.jpg

  # print out the list of options
  mdh -h

  > require 'metadata'
  > Metadata.extract('myfile.jpg')
  > Metadata.extract_text('myfile.pdf')
  > Pathname.new("myfile.jpg").metadata

List of supported formats
-------------------------

  Audio:
    Whatever you manage to make mplayer play.
    Plus special handlers for FLAC, m4a, ape, musepack, wavepack and wma.

    Successfully tested with:
      mp3, flac, ogg, wav, ra, m4a, wma

    Should also work:
      wv, mpc, ape

  Video:
    Whatever you manage to make mplayer play.

    Successfully tested with:
      wmv, mov, divx, xvid, flv, ogm, mpg, mkv

  Images:
    Should handle pretty much anything.
    I.e. anything handled by ExifTool, ImageMagick, Imlib2 or dcraw.

    Successfully tested with:
      Web formats:
        jpeg, png, gif, svg
      Camera raws:
        nef, dng, crw, pef, orf
      Image editor state dumps:
        psd, xcf
      The rest:
        tga, tif, bmp, xpm, ppm

  Documents:
    Successfully tested with:
      Web formats:
        html, txt
      Print formats:
        pdf, ps, ps.gz
      OO formats:
        sxi, odp
      MS formats:
        doc, ppt, xls

    - I'm using unoconv to convert OO & MS docs to temp PDFs for the text &
      dimensions extraction, so those bits of data are missing. MSOffice docs
      are missing dimensions for the same reason. Here's a way to get them:
      ( first, get Thumbnailer: http://dark.fhtr.org/repos/thumbnailer/ )
      $ thumbnailer -s 1 -k foo.odp /tmp/foo.jpg
      $ mdh foo.odp
      $ rm foo.odp-temp.pdf /tmp/foo.jpg

  Others:
    - BitTorrent .torrent files
    - Archive contents
    - Whatever `extract' outputs and I am handling

Requirements
------------

  * Ruby 1.8

  * Tons of metadata extraction programs and libs.
    This package has many dependencies since there is no single universal
    metadata header format that all files use. Blame resource forks, filename
    extensions, bags of bytes and mimetypes.

    List of gems:
      flacinfo-rb
      wmainfo-rb
      MP4Info
      id3lib-ruby
      apetag

    List of Debian packages:
      dcraw
      libimlib2-ruby
      extract
      libimage-exiftool-perl
      poppler-utils
      mplayer
      html2text
      imagemagick
      unhtml
      pstotext
      antiword
      catdoc
      shared-mime-info

  * You do want to install the latest versions of dcraw and
    shared-mime-info to be able to handle camera raw images.
    http://cybercom.net/~dcoffin/dcraw/
    http://freedesktop.org/wiki/Software/shared-mime-info

  * Python + chardet library
    http://chardet.feedparser.org/

Install
-------

  De-compress archive and enter its top directory.
  Then type:

   ($ su)
    # ruby setup.rb

  These simple step installs this program under the default
  location of Ruby libraries. You can also install files into
  your favorite directory by supplying setup.rb some options.
  Try "ruby setup.rb --help".

Appendix A: Metadata fields
--------------------------------------

  This list contains the metadata fields output by Metadata and mdh.
  The list follows the shared file metadata spec for the most part.
  http://wiki.freedesktop.org/wiki/Specifications/shared-filemetadata-spec

  field name | field type
  ----------------------------------------------------------------------
  Archive.Contents array of pathnames

  Audio.Band string
  Audio.Composer string
  Audio.Conductor string
  Audio.Copyright string (copyright message)
  Audio.Grouping string
  Audio.Image binary string (embedded image data)
  Audio.InterpretedBy string
  Audio.Lyricist string
  Audio.Publisher string
  Audio.RemixedBy string
  Audio.Subtitle string
  Audio.Tempo integer
  Audio.VariableBitrate boolean
  Audio.Writer string
  Audio.Publicationright string
  Audio.File string
  Audio.EAN/UPC string
  Audio.ISBN string
  Audio.Catalog string
  Audio.LC string
  Audio.Media string
  Audio.Index string
  Audio.Related string
  Audio.ISRC string
  Audio.Abstract string
  Audio.Language string
  Audio.Bibliography string
  Audio.Introplay string
  Audio.Dummy string
  Audio.DebutAlbum string
  Audio.RecordDate string
  Audio.RecordLocation string
  v-- ORIGINAL FIELDS USED --v
  Audio.Title string
  Audio.Artist string
  Audio.Album string
  Audio.AlbumArtist string
  Audio.AlbumTrackCount integer
  Audio.TrackNo integer
  Audio.DiscNo integer
  Audio.Performer string
  Audio.Duration float
  Audio.ReleaseDate datetime
  Audio.Comment string
  Audio.Genre string
  Audio.Codec string
  Audio.Samplerate integer
  Audio.Bitrate float
  Audio.Channels integer
  Audio.Lyrics string

  Doc.Album string
  Doc.Artist string
  Doc.Charset string
  Doc.Description string
  Doc.Genre string
  Doc.Language string
  Doc.ModifyDate date
  Doc.PageSizeName string (A4, A5, letter, ...)
  Doc.RevisionHistory array of strings
  Doc.ParagraphCount integer
  Doc.LineCount integer
  Doc.CharacterCount integer
  Doc.LastSavedBy string
  Doc.Keywords array of strings
  Doc.Template string
  v-- ORIGINAL FIELDS USED --v
  Doc.Title string
  Doc.Subject string
  Doc.Author string
  Doc.PageCount integer
  Doc.WordCount integer
  Doc.Created datetime

  File.Software string (software used to create the file)
  File.MD5Sum string (md5sum of file's contents)
  File.SHA1Sum string (sha1sum of file's contents)
  v-- ORIGINAL FIELDS USED --v
  File.Name string (basename of the file)
  File.Path string (dirname of the file)
  File.Format string (mime type, inode/directory for dirs)
  File.Size integer
  File.Content string
  File.Modified string

  Image.DateCreated date
  Image.DateTimeCreated date
  Image.DateTimeOriginal date
  Image.DimensionUnit string (px, mm, pt, ...)
  Image.Editor string
  Image.EXIF string (exiftool output)
  Image.FrameCount integer
  Image.LayerCount integer
  Image.Modified date
  Image.OriginatingProgram string
  Image.ComponentCount integer
  Image.ColorMode string (e.g. RGB)
  Image.ColorSpace string (e.g. sRGB)
  v-- ORIGINAL FIELDS USED --v
  Image.Height float
  Image.Width float
  Image.Title string
  Image.Date datetime
  Image.Creator string
  Image.Description string
  Image.Software string
  Image.CameraMake string
  Image.CameraModel string
  Image.ExposureProgram string
  Image.ExposureTime float
  Image.Fnumber float
  Image.Flash boolean
  Image.FocalLength float
  Image.ISOSpeed float
  Image.MeteringMode string
  Image.WhiteBalance string
  Image.Copyright string

  Location.Latitude float
  Location.Longitude float

  Video.Album string
  Video.Artist string
  Video.Bitrate integer
  Video.Codec string
  Video.Comment string
  Video.Duration float
  Video.Framerate float (frames per second)
  Video.Genre string
  Video.ReleaseDate date
  Video.Title string
  Video.TrackNo integer
  Video.Demuxer string

  BitTorrent.Name string
  BitTorrent.Files array of { 'path' => string,
                                       'length' => integer,
                                       'md5sum' => string }
  BitTorrent.Length integer (size of single-file torrents)
  BitTorrent.MD5Sum string (md5sum for single-file torrents)
  BitTorrent.PieceCount integer
  BitTorrent.PieceLength integer (length of a single piece
  BitTorrent.Comment string
  BitTorrent.Announce string (announce url)
  BitTorrent.AnnounceList array of arrays of strings
  BitTorrent.Nodes array of [hostname, port] -arrays

Appendix B: The MDH file format
-------------------------------

  MDH files are built as follows:

  bytes | content
  ---------------
    3 | "MDH" - MDH file format identifier
    1 | "\x01" - MDH file format version number
    4 | Long, network byte order - the size of the metadata struct in bytes
   var | YAML - The MDH metadata struct
   var | The actual file contents

  All string fields in the metadata are UTF-8.

License
-------

  Ruby's

--
Ilmari Heikkinen <ilmari.heikkinen gmail com>