[SUMMARY] Mailing List Files (#115)

I've been playing a little with TMail lately, which is what really inspired this
quiz. I thought that a simple solution to this problem would be to pull the
pages down with open-uri and then dump them into TMail and just pull the
attachments from that. It turns out to be a bit harder to do that than I
expected, but one solution did follow that path.

What I love about this plan is the fact that you are just stitching the real
tools together. I like leaning on libraries to get tons of functionality with
just a few lines of code. Apparently, so does Louis J Scoras! Check out this
list of dependencies that kick-starts his solution (I've removed the excellent
comments in the code to save space):

  #!/usr/bin/env ruby
  
  require 'action_mailer'
  require 'cgi'
  require 'delegate'
  require 'elif'
  require 'fileutils'
  require 'hpricot'
  require 'open-uri'
  require 'tempfile'
  
  # ...

Wow.

Let's start with the standard libraries. Louis pulls in cgi to handle HTML
escapes, delegate to wrap existing classes, fileutils for easy directory
creation, open-uri to fetch web pages, and tempfile for creating temporary
files, of course. That's an impressive set of tools, all of which ship with
Ruby.

The other three dependencies are external. You can get them all as gems.
action_mailer is a component of the Rails framework used to handle email. Louis
doesn't actually use the action_mailer part, just the bundled TMail dependency.
This is a trick for getting TMail as a gem.

elif is a little library I wrote as a solution to an earlier quiz (#64). It
reads files line by line, but in reverse order. In other words, you get the
last line first, then the next to last line, all the way up to the first line.
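
To make the idea concrete, here is a minimal sketch of reverse line reading
using only the standard library. It is not how Elif works internally: this
version loads the whole file into memory before reversing it, which is fine
for a single email but not for large files. The helper name
each_line_reversed is hypothetical.

```ruby
require 'tempfile'

# Hypothetical helper: yields the lines of a file from last to first.
# Unlike Elif, this reads the entire file into memory before reversing.
def each_line_reversed(path, &block)
  File.readlines(path).reverse_each(&block)
end

# Build a small throwaway file to read back.
file = Tempfile.new('reverse_demo')
file.write("first\nsecond\nthird\n")
file.close

reversed = []
each_line_reversed(file.path) { |line| reversed << line.chomp }
p reversed
```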

hpricot is a fun little HTML parser from Why the Lucky Stiff. Its distinctive
interface makes it popular for web scraping.

Now that Louis has imported all the tools he could find, he's ready to do some
fetching. Here's the start of that code:

  module Quiz115
  class QuizMail < DelegateClass(TMail::Mail)
    class << self
      attr_reader :archive_base_url
  
      def archive_base_url
        @archive_base_url ||
        "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/"
      end
  
      def solutions(quiz_number)
        doc = Hpricot(
          open("http://www.rubyquiz.com/quiz#{quiz_number}.html")
        )
        (doc/'#links'/'li/a').collect do |link|
          [CGI.unescapeHTML(link.inner_text), link['href']]
        end
      end
    end
    
    # ...

This object we are examining now is a TMail enhancement, via delegation. This
section has some class methods added for easy usability. I believe the
attr_reader line is actually intended to be attr_writer though, giving you a way
to override the base URL. The reader is defined manually and just defaults to
the Ruby Talk mailing list.
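
As a rough sketch of what I believe was intended, here is the
writer-plus-defaulted-reader pattern on a stand-in class. ArchiveConfig and
DEFAULT_URL are hypothetical names used only for illustration, and the
example.com URL is made up:

```ruby
# A stand-in class (not the real QuizMail) showing a class-level setting
# with a generated writer and a hand-written reader that falls back to a
# default when nothing has been assigned.
class ArchiveConfig
  DEFAULT_URL = 'http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/'

  class << self
    attr_writer :archive_base_url

    def archive_base_url
      @archive_base_url || DEFAULT_URL
    end
  end
end

puts ArchiveConfig.archive_base_url   # the default archive URL
ArchiveConfig.archive_base_url = 'http://example.com/archive/'
puts ArchiveConfig.archive_base_url   # the override
```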

The solutions() method is a neat added feature of the code which allows you to
pass in a Ruby Quiz number in order to fetch all the solution emails for that
quiz. Here you can see some Hpricot parsing. Its XPath-in-Ruby style syntax is
used to pull the solution links off of the quiz page at rubyquiz.com.

Let's get to the real meat of this class now:

    # ...
    
    def initialize(mail)
      temp_path = to_temp_file(mail)
      boundary = MIME::BoundaryFinder.new(temp_path).find_boundary
  
      @tmail = TMail::Mail.load(temp_path)
      @tmail.set_content_type 'multipart', 'mixed',
        'boundary' => boundary if boundary
  
      super(@tmail)
    end
  
    private
  
    def to_temp_file(mail)
      temp = Tempfile.new('qmail')
  
      temp.write(if (Integer(mail) rescue nil)
        url = self.class.archive_base_url + mail
        open(url) { |f| cleanse_html f.read }
      else
        web = URI.parse(mail).scheme == 'http'
        open(mail) { |m| web ? cleanse_html(m.read) : m.read }
      end)
  
      temp.close
      temp.path
    end
  
    def cleanse_html(str)
      CGI.unescapeHTML(
        str.gsub(/\A.*?<div id="header">/mi,'').gsub(/<[^>]*>/m, '')
      )
    end
  end
  
  # ...

In initialize() the passed mail reference is fetched into a temporary file and a
special boundary search is performed, which we will examine in detail in just a
moment. The temp file is then handed off to TMail. After that a content_type
header is synthesized, as long as we found a boundary.

The actual fetch is made in to_temp_file(). The code that fills the Tempfile is
a little tricky there, but all it really does is recognize when we are loading
via the web so it can cleanse_html(). That method just strips the tags around
the message and unescapes entities.
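
Here is a small sketch of those two pieces in isolation. cleanse_html() is
copied from the solution, while message_number?() is a hypothetical helper
that just names the Integer() test. The sample page is invented and is not
the real archive markup:

```ruby
require 'cgi'

# Same approach as cleanse_html() in the solution: drop everything up to the
# header div, strip the remaining tags, then decode the HTML entities.
def cleanse_html(str)
  CGI.unescapeHTML(
    str.gsub(/\A.*?<div id="header">/mi, '').gsub(/<[^>]*>/m, '')
  )
end

# Hypothetical helper naming the test used in to_temp_file(): a reference
# that parses as an integer is treated as a ruby-talk message number.
def message_number?(ref)
  Integer(ref)
  true
rescue ArgumentError, TypeError
  false
end

# Invented sample page, only loosely shaped like an archive page.
page = <<~HTML
  <html><body><h1>Archive</h1>
  <div id="header">From: Someone &lt;someone@example.com&gt;
  <b>Subject</b>: [QUIZ] Example</div>
  </body></html>
HTML

puts cleanse_html(page)
p message_number?('12345')
p message_number?('http://example.com/1')
```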

Now we need to dig into that boundary problem I sidestepped earlier. The
messages on the web archives are missing their Content-type header and we need
to restore it in order to get TMail to accept the message. With messages that
contain attachments, that header should be multipart/mixed. However, the header
also points to a special boundary string that divides the parts of the message.
We have to find that string so we can set it in the header.
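
To show what the boundary does, here is a hand-written multipart message
and a naive splitter. The message, the boundary string "XYZ", and the helper
split_parts() are all made up for illustration, and real MIME parsing (which
TMail handles for us) covers many more cases than this sketch does:

```ruby
# An invented multipart message; the boundary string "XYZ" is arbitrary.
raw = <<~MAIL
  Content-Type: multipart/mixed; boundary="XYZ"

  --XYZ
  Content-Type: text/plain

  Here is my solution.
  --XYZ
  Content-Type: text/plain
  Content-Disposition: attachment; filename=solution.rb

  puts "hello"
  --XYZ--
MAIL

# Hypothetical helper: split a raw message into its parts, given the boundary.
def split_parts(raw, boundary)
  body = raw.split("\n\n", 2).last           # skip the top-level headers
  body = body.split("--#{boundary}--").first # drop the closing delimiter
  body.split("--#{boundary}\n").map(&:strip).reject(&:empty?)
end

parts = split_parts(raw, 'XYZ')
puts parts.length
puts parts.last
```

Without the boundary named in the header, a parser has no way to know where
one part ends and the next begins.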

The next class handles that operation:

  # ...
  
  module MIME
    class BoundaryFinder
      def initialize(file)
        @elif = ::Elif.new(file)
        @in_attachment_headers = false
      end
  
      def find_boundary
        while line = @elif.gets
          if @in_attachment_headers
            if boundary = look_for_mime_boundary(line)
              return boundary
            end
          else
            look_for_attachment(line)
          end
        end
        nil
      end
  
      private
  
      def look_for_attachment line
        if line =~ /^content-disposition\s*:\s*attachment/i
          puts "Found an attachment" if $DEBUG
          @in_attachment_headers = true
        end
      end
  
      def look_for_mime_boundary line
        unless line =~ /^\S+\s*:\s*/ || # Not a mail header
               line =~ /^\s+/ # Continuation line?
          puts "I think I found it...#{line}" if $DEBUG
          line.strip.gsub(/^--/, '')
        else
          nil
        end
      end
    end
  end
  end
  
  # ...

This class is a trivial parser that hunts for the missing boundary. It uses
Elif to read the file backwards, watching for an attachment to come up. When it
detects that it is inside an attachment, it switches modes. In the new mode it
skips over headers and continuation lines until it reaches the first line that
doesn't seem to be part of the headers. That's the boundary.
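
The same two-mode scan can be sketched without Elif by reversing the lines
of a string. This is a simplified stand-in and not Louis's class: the
find_boundary() method below and the sample message are hypothetical, though
the regular expressions are taken from the code above.

```ruby
# Scanning from the bottom, wait for an attachment's Content-Disposition
# header, then return the first line above it that is neither a header
# nor a continuation line, with any leading "--" removed.
def find_boundary(text)
  in_attachment_headers = false
  text.lines.reverse_each do |line|
    if in_attachment_headers
      # Skip "Name: value" headers and lines that begin with whitespace.
      next if line =~ /^\S+\s*:\s*/ || line =~ /^\s+/
      return line.strip.sub(/^--/, '')
    elsif line =~ /^content-disposition\s*:\s*attachment/i
      in_attachment_headers = true
    end
  end
  nil
end

# Invented message body; the boundary string "XYZ" is arbitrary.
message = <<~MAIL
  Here is my solution.
  --XYZ
  Content-Type: text/plain
  Content-Disposition: attachment; filename=solution.rb

  puts "hello"
  --XYZ--
MAIL

p find_boundary(message)
p find_boundary("no attachments here\n")
```

Reading backwards is what makes this simple: the Content-Disposition line is
reached before the boundary line that opens its part.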

The rest of the code just puts these tools to work:

  # ...
  
  include Quiz115
  include FileUtils
  
  def process_mail(mailh, outdir)
    begin
      t = QuizMail.new(mailh)
      if t.has_attachments?
        t.attachments.each do |attachment|
          outpath = File.join(outdir, attachment.original_filename)
          puts "\tWriting: #{outpath}"
          File.open(outpath, 'w') do |out|
            out.puts attachment.read
          end
        end
      else
        outfile = File.join(outdir, 'solution.txt')
        File.open(outfile, 'w') {|f| f.write t.body}
      end
    rescue => e
      puts "Couldn't parse mail correctly. Sorry! (E: #{e})"
    end
  end
  
  def to_dirname(solver)
    solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
  end
  
  # ...

process_mail() builds a QuizMail object out of the passed reference (a message
number or a URL), then copies the attachments from TMail to files in the
indicated directory. If the message has no attachments, the message body is
written to solution.txt instead.
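
The attachment-writing step can be tried without TMail by using a stand-in
object. Everything named here (FakeAttachment, save_attachments) is
hypothetical, and two details differ from the solution on purpose: the file
is opened in binary mode and written with write() so the content is copied
unchanged, and File.basename is applied to the filename.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a TMail attachment: an object that can be read
# and that also knows its original filename.
FakeAttachment = Struct.new(:original_filename, :content) do
  def read
    content
  end
end

# Writes each attachment into outdir and returns the paths written.
# File.basename guards against filenames that contain directory parts;
# that guard is an addition, not something the solution does.
def save_attachments(attachments, outdir)
  attachments.map do |attachment|
    outpath = File.join(outdir, File.basename(attachment.original_filename))
    File.open(outpath, 'wb') { |out| out.write(attachment.read) }
    outpath
  end
end

Dir.mktmpdir do |dir|
  paths = save_attachments([FakeAttachment.new('solution.rb', "puts 1\n")], dir)
  puts File.read(paths.first)
end
```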

to_dirname() is a directory name sanitizer for when the code is downloading the
solutions from a quiz, as mentioned earlier.
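
A quick look at what it produces, with the method repeated so the snippet
runs on its own (the second name is invented):

```ruby
# to_dirname() as it appears in the solution: lowercase the name, delete a
# handful of shell-unfriendly characters, and turn whitespace runs into
# underscores.
def to_dirname(solver)
  solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
end

puts to_dirname('Louis J Scoras')        # => louis_j_scoras
puts to_dirname('Some Solver (again!)')  # => some_solver_again
```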

Here's the application code:

  # ...
  
  query = ARGV[0]
  outdir = ARGV[1] || '.'
  
  unless query
    $stderr.puts "You must specify either a ruby-talk message id, or a " \
                 "quiz number (prefixed by 'q')"
    exit 1
  end
  
  if query =~ /\Aq/i
    quiz_number = query.sub(/\Aq/i, '')
    puts "Fetching all solutions for quiz \##{quiz_number}"
  
    QuizMail.solutions(quiz_number).each do |solver, url|
      puts "Fetching solution from #{solver}."
  
      dirname = to_dirname(solver)
      solver_dir = File.join(outdir, dirname)
  
      mkdir_p solver_dir
      process_mail(url, solver_dir)
    end
  else
    process_mail(query, outdir)
  end
  
  exit 0

This code just pulls in the arguments, and runs them through one of two
processes. If the first argument is prefixed with a q, the code scrapes
rubyquiz.com for that quiz number and pulls all the solutions. It creates a
directory for each solver, then processes each of those messages. Otherwise,
it handles just the individual message.
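
The branching rule is easy to check on its own. classify_query() is a
hypothetical helper that mirrors the test in the application code without
doing any fetching:

```ruby
# A leading "q" (either case) selects quiz mode and is stripped off;
# anything else is treated as a single message reference.
def classify_query(query)
  if query =~ /\Aq/i
    [:quiz, query.sub(/\Aq/i, '')]
  else
    [:message, query]
  end
end

p classify_query('q115')
p classify_query('Q42')
p classify_query('240123')
```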

My thanks to those who helped me solve this problem for all quiz fans. We now
have an excellent resource to share with people who ask about retrieving the
garbled solutions.

Tomorrow, it's back to fun and games for the quiz, but this time we're on a
search for pure strategy...