[SUMMARY] NDiff (#46)

Both solutions had some really interesting material in them. I would certainly
talk about both, if time and space allowed.

Paul Vaillant had some interesting parsing code (see DataFile.parse_line()) and
comparison code (see DataFile::CompareResults.compare()). Paul's entire
solution was also very clean and readable, though it did come out a bit longer
than Daniel Sheppard's code. Finally, Paul's code is over three times faster,
which could be a real asset for significant data sets.

Daniel's code is the one I want to look at below, because of the way it handles
an interesting Ruby problem. What makes this challenge fun, I think, is that
it's one of the edge cases where Ruby's internal iterator can feel awkward.
You're always walking through two separate collections in this problem. Ruby
has some nice solutions for this, buried in the dark corners of its standard
library, but many of us just aren't aware of them. One of these tools,
SyncEnumerator from the generator library, is heavily used in the following
code. Let's see how something like that can make a real difference in the end
result:

  require 'generator'
  require 'pathname'
  require 'optparse'
  require 'bigdecimal'
  
  class NumberLine
    include Enumerable
    def initialize(line)
      @line = line
    end
    def each
      @line.scan( / ((-)?([0-9]+|\.)
                    \.?([0-9]+)?[Ee]?
                    ([-+]?[0-9]+)?) /x) do |match|
        yield BigDecimal.new(match[0])
      end
    end
    def to_s
      @line
    end
  end
  
  # ...

This is a simple encapsulation of a single line of one or more numbers. The
only method of interest here is each(), which scan()s the line looking for all
the numbers it can find. That lengthy Regexp just matches an optional minus
sign, followed by some digits (or a period), optionally followed by a period and
some more digits, optionally following by an E or e and another number. This
isn't a fool-proof match since it will catch a String like "-." or even "..",
but it's probably good enough for most cases. Each number matched is fed to
BigDecimal and then yielded to a given block. The use of the library here slows
things down a bit, but it avoids floating point inaccuracies too, which is more
important to this challenge.

  # ...
  
  class NumericalDifferencer
    attr_accessor :digits, :tolerance
    def initialize
      @digits = 0
      @tolerance = 0.0
    end
    def each_diff(file1, file2)
      line = 0
      diff = nil
      SyncEnumerator.new(file1, file2).each do |line1, line2|
        line1, line2 = NumberLine.new(line1), NumberLine.new(line2)
        line += 1
        num_enum = SyncEnumerator.new(line1, line2)
        if num_enum.all? do |a,b|
          unless @digits == 0
            a,b = a.round(@digits-1), b.round(@digits-1)
          end
          (a - b).abs <= @tolerance
        end
          if diff
            yield diff
            diff = nil
          end
        else
          diff = NumericalDiff.new(line) unless diff
          diff.add_lines(line1, line2)
        end
      end
      yield diff if diff
    end
  end
  
  # ...

Here we get into the actual file comparison code and some SyncEnumerator fun.
The first SyncEnumerator created here walks through the two files given in step.
Each iteration of the block receives the next line from both files. Those lines
are turned into NumberLine objects, which we have already seen. Then a
SyncEnumerator is made from those objects, which allows the code to traverse
each number found in both lines, again in step. That object is used in the if
condition to ensure that all?() numbers in the line are within the needed
tolerance (rounding if requested). When any numbers of the lines don't match
close enough a NumericalDiff object is created and the current lines, plus any
following lines that also differ, are added to it. When the NumericalDiff is
complete, because the code has returned to matching lines or completed running
through the files, the diff object is yielded to the provided block.

Let's take a look at that NumercialDiff class:

  # ...
  
  class NumericalDiff
    attr_reader :start_line, :left, :right
    def initialize(start_line)
      @start_line = start_line
      @left = []
      @right = []
    end
    def add_lines(line1, line2)
      @left << line1
      @right << line2
    end
    def to_s
      lines = "#{@start_line},#{@start_line + @left.length - 1}"
      str = "#{lines}c#{lines}\n"
      str << @left.collect { |x| "< #{x}" }.join
      str << "---\n"
      str << @right.collect { |x| "> #{x}" }.join
    end
  end
  
  # ...

No black magic in here. The work method is to_s() and it should be easy enough
to see that it's just creating the quiz format there.

That's really the majority of the solution. The rest is mainly option parsing:

  # ...
  
  differ = NumericalDifferencer.new
  quiet = false
  statistics = false
  
  opts = OptionParser.new do |opts|
          opts.banner = "Usage: #{$0} [options] file1 file2"
      
          opts.separator ""
          opts.separator "Numerically compare files line by line, " +
                         "numerical field by numerical field."
          opts.separator ""
          opts.on("-d", "--digits INT", Integer,
                  "Maximum number of significant digits that should match.",
                  "(default: 0)") do |digits|
            differ.digits = digits
          end
          opts.on("-t", "--tolerance DBL", String,
                  "Tolerate <= DBL distance between numbers.",
                  "(default: 0.0)") do |tolerance|
            differ.tolerance = BigDecimal.new(tolerance)
          end
          opts.on("-h", "--help", "Output this help.") do |help|
      puts opts
      exit 0
          end
    opts.on("-q", "--quiet", "No output, just exit code.") do |value|
      quiet = value
    end
    opts.on("-s", "--statistics",
            "Provide comparison statistics only.") do |value|
      statistics = value
    end
  end
  
  # ...

The top few lines there create a variables to hold the diff tools and setting
information. Then the last chunk of code is OptionParser 101.

Finally, here's the code that kicks it all off:

  # ...
  
  begin
    opts.parse!(ARGV)
    if quiet && statistics
      raise "--quiet and --statistics are mutually exclusive"
    end
    raise "Must pass two filenames" unless ARGV.length == 2
    files = ARGV.collect { |x| Pathname.new(x) }
    files.each do |f|
      unless f.exist? && f.file?
        raise "'#{f}' does not exist"
      end
    end
    File.open(files[0]) do |file1|
      File.open(files[1]) do |file2|
        if(statistics)
          distances = []
          SyncEnumerator.new(file1, file2).each do |line1, line2|
            line1, line2 = NumberLine.new(line1),
                           NumberLine.new(line2)
            SyncEnumerator.new(line1, line2).each do |num1, num2|
              distances << (num1 - num2).abs
            end
          end
          class << distances
            def median
              sorted = self.sort
              mid = sorted.size / 2
              if sorted.size % 2 == 0
                  (sorted[mid] + sorted[mid - 1]) / 2
              else
                  sorted[mid]
              end
            end
            def mean
              self.inject(0) { |x, sum| sum + x / self.size}
            end
          end
          puts(<<-EOF)
  Numbers compared: #{distances.size}
  Distance range: #{distances.min}..#{distances.max}
  Median distance: #{distances.median}
  Mean distance: #{distances.mean}
          EOF
        elsif(quiet)
          differ.each_diff(file1, file2) do |diff|
            exit 1
          end
          exit 0
        else
          different = false
          differ.each_diff(file1, file2) do |diff|
            different = true
            puts diff
          end
          exit(different ? 1 : 0)
        end
      end
    end
  rescue => e
    warn e
    warn opts
    exit 1
  end

I know that looks like a lot of code, but the majority of it is the extra credit
part of the challenge. If you can spot that big if statement in the middle,
you'll see that the majority of the code is in the first branch. That part
calculates statistics, by implementing its own inline custom traversal of the
files (again with SyncEnumerator). The really interesting feature in this code
is that the results are just compiled into an Array and then that Array is
modified with extra methods to make outputting the results easy. If you're not
familiar with Ruby's class << some_object ... end idiom, it simply adds the
methods to just that one object. A handy trick put to good use here, I think.

The other two very short branches (elsif and else), actually solve the quiz.
They just react according to the each_diff() method we examined earlier.

It's probably worth mentioning again here, the generator library is pure Ruby
and makes use of continuations, which can be quite slow. BigDecimal is never
going to be as fast as using Floats either, for obvious reasons. The result is
that this code is quite a bit slower than Paul's solution. (Over three times
slower, in my tests.) That doesn't affect small data sets, like the ones in the
quiz, very much, but you may start waiting a bit on longer inputs.

My thanks to both quiz solvers, for showing off some super cool Ruby techniques.

Tomorrows quiz is a simple challenge for all you web application junkies, that
may just prove interesting to the community at large. Join in the fun and show
us just how cool your favorite web application framework can be...

Ruby Quiz wrote:

Both solutions had some really interesting material in them. I would certainly
talk about both, if time and space allowed.

You need to take credit for the Summary
on the webpage version:

  Ruby Quiz - NDiff (#46)

Otherwise, the rest of my team will be
(incorrectly) dazzled by "my" Ruby knowledge
when I point them there. :slight_smile:

Later,

···

--
Bil
http://fun3d.larc.nasa.gov

My general rule is that if it doesn't have a by-line, it was me. (This has been added to the Ruby Quiz FAQ.)

However, I made an exception in this case and changed the page as requested.

James Edward Gray II

···

On Sep 15, 2005, at 11:01 AM, Bil Kleb wrote:

Ruby Quiz wrote:

Both solutions had some really interesting material in them. I would certainly
talk about both, if time and space allowed.

You need to take credit for the Summary
on the webpage version:

Ruby Quiz - NDiff (#46)

Otherwise, the rest of my team will be
(incorrectly) dazzled by "my" Ruby knowledge
when I point them there. :slight_smile: