[SUMMARY] Statistician II (#168)

I don't know if it was the metaprogramming that scared people away
this week, or perhaps folks are away on summer vacations. In any case,
I'm going to summarize this week's quiz by looking at the submission
from _Matthias Reitinger_. The solution is, as Matthias indicates,
unexpectedly concise. "I guess that's just the way Ruby works."

Matthias' code implements the `Statistician` module in three parts,
each a class. Here is the first class, `Rule`:

    class Rule
      def initialize(pattern)
        @fields = []
        pattern = Regexp.escape(pattern).gsub(/\\\[(.+?)\\\]/, '(?:\1)?').
          gsub(/<(.+?)>/) { @fields << $1; '(.+?)' }
        @regexp = Regexp.new('^' + pattern + '$')
      end

      def match(line)
        @result = if md = @regexp.match(line)
          Hash[*@fields.zip(md.captures).flatten]
        end
      end

      def result
        @result
      end
    end

`Rule` uses regular expressions built up as discussed in the previous
quiz, so I won't cover that again here. I will point out, though, how
`@fields` is populated in the initializer. Note the last `gsub` call:
it uses the block form of `gsub`.

    gsub(/<(.+?)>/) { @fields << $1; '(.+?)' }

Since the `'(.+?)'` string is the last expression evaluated in the
block, it becomes the replacement text. Before that, however, the block
uses the just-matched expression (`$1`) to record the field name. This
avoids making a second pass over the source string to get those field
names, and is arguably simpler.
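
To see the block form in isolation, here is a minimal sketch. The
pattern string and the local variable names are made up for
illustration and are not part of Matthias' solution.

```ruby
# Collect field names while substituting, in a single pass.
fields = []
pattern = 'You hit <name> for <amount> damage'

regexp_source = pattern.gsub(/<(.+?)>/) do
  fields << $1   # side effect: remember the name between < and >
  '(.+?)'        # the block's last value becomes the replacement text
end

p fields         # field names, in order of appearance
p regexp_source  # the pattern with each <field> replaced by a capture group
```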

The `match` method matches input lines against the regular expression,
returning nil if the input didn't match, or a hash if it did. Field
names (`@fields`) are first paired (`zip`) with the matched values
(`md.captures`), then `flatten`-ed into a single array, finally
expanded (`*`) and passed to a `Hash` initializer that treats
alternate items as keys and values. The end result of `Rule#match`,
when the input matches, is a hash that looks like this:

    { 'amount' => '108', 'name' => 'Tempest Warg' }

That hash is returned, and also stored in the `@result` instance
variable for later use, where the last method, `result`, exposes it.
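
The steps that build the hash can be traced one at a time. This is only
a sketch: the two arrays below stand in for `@fields` and
`md.captures`, with sample values taken from the hash shown above.

```ruby
fields   = ['name', 'amount']          # stand-in for @fields
captures = ['Tempest Warg', '108']     # stand-in for md.captures

pairs = fields.zip(captures)   # one [field, value] pair per field
flat  = pairs.flatten          # a single array: key, value, key, value, ...
hash  = Hash[*flat]            # alternating items become keys and values

p pairs
p flat
p hash

# Hash[] also accepts the array of pairs directly, so the flatten and
# splat are a matter of style rather than necessity.
p Hash[pairs] == hash
```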

The next class is `Reportable`:

    class Reportable < OpenStruct
      class << self
        attr_reader :records

        def inherited(klass)
          klass.class_eval do
            @rules, @records = [], []
          end
          super
        end

        def rule(pattern)
          @rules << Rule.new(pattern)
        end

        def match(line)
          if rule = @rules.find { |rule| rule.match(line) }
            @records << self.new(rule.result)
          end
        end
      end
    end

This small class is the extent of the metaprogramming going on in the
solution, and it's not much, though perhaps unfamiliar to some. Let's
get into some of it. We'll ignore the `OpenStruct` inheritance for the
moment, coming back to it later.

Everything inside the `Reportable` class is surrounded by a block that
opens with `class << self`. There is a [good summary on the Ruby Talk
mailing list][1], but its use here can be summed up in two words:
class methods. The `class << self` mechanism is not strictly about
class methods, but in this context it has the same effect.
Alternatively, these methods could have been defined in this manner:

    class Reportable < OpenStruct
      def Reportable.rule(pattern)
        # etc.
      end

      def Reportable.match(line)
        # etc.
      end

      # etc.
    end

In the end, the `class << self` mechanism is cleaner looking, and also
allows for use of `attr_reader` in a natural way.
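
As a small illustration of that last point, here is a sketch using a
hypothetical `Counter` class (not part of the solution). Inside
`class << self`, `attr_reader` defines a reader on the class object
itself, so it returns the class-level `@count`.

```ruby
class Counter
  @count = 0   # instance variable of the Counter class object itself

  class << self
    attr_reader :count   # defines Counter.count

    def increment        # defines Counter.increment
      @count += 1
    end
  end
end

Counter.increment
Counter.increment
p Counter.count
```

With the `def Counter.count` style, the reader would have to be written
out by hand.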

The next interesting bit is the `inherited` method. This is a class
method, here implemented on `Reportable`, that is called whenever
`Reportable` is subclassed (which happens repeatedly in the client
code). It's a convenient hook that allows the other bit of
metaprogramming to happen.

    klass.class_eval do
      @rules, @records = [], []
    end

`klass` is the class derived from `Reportable` (i.e. our client's
classes for future statistical analysis). Here, Matthias initializes
two members, both to empty arrays, in the scope of class `klass`. This
serves to ensure that every class derived from `Reportable` gets its
own, separate members, not shared with other `Reportable` subclasses.
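
A stripped-down sketch shows the effect. The class names `Base`,
`First` and `Second` and the `add` method are hypothetical, chosen only
to keep the example short.

```ruby
class Base
  class << self
    attr_reader :items

    def inherited(klass)
      # Runs once per subclass; inside the block, self is the new subclass.
      klass.class_eval { @items = [] }
      super
    end

    def add(item)
      @items << item
    end
  end
end

class First < Base;  end
class Second < Base; end

First.add(:a)
First.add(:b)
Second.add(:c)

p First.items    # only the items added through First
p Second.items   # only the items added through Second
p Base.items     # nil: Base itself was never passed to inherited
```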

This could be done without metaprogramming, but would require effort
from the user.

    class Reportable
      # class methods here
    end

    class Offense < Reportable
      @rules, @records = [], []
      # rules, etc.
    end

    class Defense < Reportable
      @rules, @records = [], []
      # rules, etc.
    end

If the client forgot to initialize those two members, or got the names
wrong, the class wouldn't work, exceptions would be thrown, [cats and
dogs living together][2]... you get the idea.

You might consider defining those data members in the `Reportable`
class itself, like so:

    class Reportable
      @rules, @records = [], []

      # class methods, without inherited
    end

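The problem with this is that instance variables set in the body of
`Reportable` belong to `Reportable` alone and are not inherited: in each
subclass, `@rules` and `@records` would be `nil`, and the first call to
`rule` would fail. Switching to class variables (`@@rules`, `@@records`)
has the opposite problem, since every `Reportable` subclass would then
share the same rules and records arrays. Neither is the desired outcome.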

In the end, the `class_eval` used here, called from `inherited`, is
the right way to do things. It provides a way for the superclass to
inject functionality into the subclass.

Getting back to functionality, `Reportable.match` is straightforward,
but let me highlight one line:

    @records << self.new(rule.result)

If you recall, `result` returns a hash of field names to values. And
`Reportable` is attempting to pass that hash to its own initializer,
of which none is defined. This is where `OpenStruct` comes in.

[OpenStruct][3] "allows you to create data objects and set arbitrary
attributes." And `OpenStruct` provides an initializer that takes the
hash Matthias provides, and does the expected.

    require 'ostruct'

    data = OpenStruct.new( {'amount' => '108', 'name' => 'Tempest Warg'} )
    p data.amount # -> "108"
    p data.name   # -> "Tempest Warg"

By deriving `Reportable` from `OpenStruct`, all of the client's
classes inherit the same behavior, which fulfills many of the
requirements provided in the class specification.
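
Here is a brief sketch of that inheritance at work, using a
hypothetical `Hit` class in place of a client's `Reportable` subclass.
Note that the values are the captured strings, so any numeric work
needs an explicit conversion.

```ruby
require 'ostruct'

# Hypothetical record class; it adds nothing of its own to OpenStruct.
class Hit < OpenStruct
end

hit = Hit.new({ 'amount' => '108', 'name' => 'Tempest Warg' })

p hit.amount        # the value is still a string, exactly as captured
p hit.name
p hit.amount.to_i   # convert explicitly when a number is needed
```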

The final class, `Reporter`, is pretty trivial. It reads through a
data source a line at a time, finding a matching rule (and creating
the appropriate record in the process) or adding the input line to
`@unmatched`, which the client can query later.
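
Matthias' `Reporter` is not reproduced here. The following is only a
hypothetical sketch of the loop just described: `SketchReporter`,
`HitMatcher` and the constructor arguments are my own assumptions, and
`HitMatcher` is a stand-in for a `Reportable` subclass.

```ruby
require 'stringio'

# Hypothetical sketch, not Matthias' actual Reporter.
class SketchReporter
  attr_reader :unmatched

  # matchers: objects responding to match(line) that return a truthy
  # value when the line was recognized.
  def initialize(io, matchers)
    @unmatched = []
    io.each_line do |line|
      line = line.chomp
      # any? stops at the first matcher that accepts the line
      @unmatched << line unless matchers.any? { |m| m.match(line) }
    end
  end
end

# Stand-in matcher for the demo: records lines that begin with "You hit".
class HitMatcher
  @records = []

  class << self
    attr_reader :records

    def match(line)
      @records << line if line.start_with?('You hit')
    end
  end
end

log = StringIO.new("You hit the Tempest Warg for 108\nIt starts to rain\n")
reporter = SketchReporter.new(log, [HitMatcher])

p HitMatcher.records   # lines that a matcher accepted
p reporter.unmatched   # lines that no matcher accepted
```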

Next week we'll take a short break from the Statistician for some
simple stuff. (Part III of Statistician will return in the not-distant
future.)

[1]: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/57252
[2]: http://www.youtube.com/watch?v=w91-GMc3j7I
[3]: http://www.ruby-doc.org/stdlib/libdoc/ostruct/rdoc/classes/OpenStruct.html


--
Matthew Moss <matthew.moss@gmail.com>

I wanted to add one more note...

    klass.class_eval do
      @rules, @records = [], []
    end

Considering that this bit of code injects @rules and @records into
klass, my preference is that they be named something _less_
straightforward. My own, similar solution used @reportable_rules and
@reportable_records.

The reason? There is nothing preventing a client from further
extending their own subclasses of Reportable. Actually, I will lightly
encourage that in part 3. To avoid potential name conflicts with
client-side extensions, I'd go with names more complex than the simple
@rules and @records.
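
To make the risk concrete, here is a sketch of such a collision. The
`Base` and `Spells` classes are hypothetical, as is the client's own
use of `@rules` for unrelated settings.

```ruby
class Base
  class << self
    def inherited(klass)
      klass.class_eval { @rules = [] }   # generic name injected into subclass
      super
    end

    def rule(pattern)
      @rules << pattern
    end
  end
end

class Spells < Base
  rule 'You cast <spell>'

  # The client, unaware of the injected variable, reuses the name for
  # something unrelated and silently replaces the array of patterns.
  @rules = { :max_level => 60 }
end

p Spells.instance_variable_get(:@rules)   # the client's hash, not the patterns

begin
  Spells.rule 'You heal <name>'
rescue NoMethodError => e
  puts "rule failed: #{e.class}"   # a Hash does not respond to <<
end
```

A prefixed name such as `@reportable_rules` makes this kind of accident
much less likely.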