[SUMMARY] Crossword Solver (#132)

Many of these solutions got pretty sluggish when run on large crosswords. The
reason is simple: the search space for this problem is quite large. You have
to try a significant number of words in each position before you will find a
good fit for the board.

The good news is that this summer Google has paid to bring a very powerful
search tool to Ruby. I've been lucky enough to have a ring-side seat for this
process and now I want to show you a little about this new tool.

You've probably seen Andreas Launila solving a fair number of the recent quizzes
using his Gecode/R library. Gecode/R is a wrapper over the C Gecode library for
constraint programming. Constraint programming is a technique for describing
problems in a way that a tool like Gecode can then use to search the solution
space and find answers for you. Gecode is a very smart little searcher though
and will heavily prune the search space based on your constraints. This leads
to some mighty quick results, as you can see:

  $ time ruby solve_crossword.rb /path/to/scowl-6/final/english-words.20
                                 < /path/to/test_board.txt
  Reading the dictionary...
  Please enter the template (end with ^D)
  Building the model...
  Searching for a solution...
  A B I D E
  
  N C A
  
  G H O S T
  
  E N E
  
  L A S E R
  
  real 0m0.430s
  user 0m0.360s
  sys 0m0.068s

Let's have a look at Andreas's code to see how it sets things up for Gecode.
Here's the start of the code:

  require 'enumerator'
  require 'rubygems'
  require 'gecoder'
  
  # The base we use when converting words to and from numbers.
  BASE = ('a'..'z').to_a.size
  # The offset of characters compared to digits in word-numbers.
  OFFSET = 'a'[0]
  # The range of integers that we allow converted words to be in. We are
  # only using the unsigned half, we could use both halves, but it would
  # complicate things without giving a larger allowed word length.
  ALLOWED_INT_RANGE = 0..Gecode::Raw::Limits::Int::INT_MAX
  # The maximum length of a word allowed.
  MAX_WORD_LENGTH = (Math.log(ALLOWED_INT_RANGE.last) /
    Math.log(BASE)).floor
  
  # ...

You can see that Andreas loads Enumerator and Gecode/R here as well as setting
up some constants. The constants relate to how this code will model words in
the database. The plan here is to represent words as numbers made up of base 26
digits which represent letters of the alphabet. This will allow Andreas to use
Gecode's integer variables to model the problem. The downside is that word size
will be limited by the maximum size of an integer and thus this solution has a
weakness in that it can't be used to solve the larger puzzles.

You can see this conversion from numbers to words in the Dictionary class:

  # ...
  
  # Describes an immutable dictionary which represents all contained words
  # as numbers of base BASE where each digit is the corresponding letter
  # itself converted to a number of base BASE.
  class Dictionary
    # Creates a dictionary from the contents of the specified dictionary
    # file which is assumed to contain one word per line and be sorted.
    def initialize(dictionary_location)
      @word_arrays = []
      File.open(dictionary_location) do |dict|
        previous_word = nil
        dict.each_line do |line|
          word = line.chomp.downcase
          # Only allow words that only contain the characters a-z and are
          # short enough.
          next if previous_word == word or word.size > MAX_WORD_LENGTH or
            word =~ /[^a-z]/
          (@word_arrays[word.length] ||= []) << self.class.s_to_i(word)
          previous_word = word
        end
      end
    end
  
    # Gets an enumeration containing all numbers representing word of the
    # specified length.
    def words_of_size(n)
      @word_arrays[n] || []
    end
  
    # Converts a string to a number of base BASE (inverse of #i_to_s ).
    def self.s_to_i(string)
      string.downcase.unpack('C*').map{ |x| x - OFFSET }.to_number(BASE)
    end
  
    # Converts a number of base BASE back to the corresponding string
    # (inverse of #s_to_i ).
    def self.i_to_s(int)
      res = []
      loop do
        digit = int % BASE
        res << digit
        int /= BASE
        break if int.zero?
      end
      res.reverse.map{ |x| x + OFFSET }.pack('C*')
    end
  end
  
  # ...

We've already talked about the number representation which is the majority of
the code here. Do have a look at initialize() and words_of_size() though, to
see how words are being stored. An Array is created where indices represent
word lengths and the values at those indices are nested Arrays of words with
that length. This makes getting a list of words that could work in a given slot
of the puzzle easy and fast.

The s_to_i() method above relies on a helper method added to Array, which is
simply this:

  class Array
    # Computes a number of the specified base using the array's elements
    # as digits.
    def to_number(base = 10)
      inject{ |result, variable| variable + result * base }
    end
  end

Again, this is just another piece of the conversion I explained earlier.

With a Dictionary created, it's time to model the problem in Gecode constraints:

  # Models the solution to a partially completed crossword.
  class Crossword < Gecode::Model
    # The template should take the format described in RubyQuiz #132 . The
    # words used are selected from the specified dictionary.
    def initialize(template, dictionary)
      @dictionary = dictionary
  
      # Break down the template and create a corresponding square matrix.
      # We let each square be represented by integer variable with domain
      # -1...BASE where -1 signifies # and the rest signify letters.
      squares = template.split(/\n\s*\n/).map!{ |line| line.split(' ') }
      @letters = int_var_matrix(squares.size, squares.first.size,
        -1...BASE)
  
      # Do an initial pass, filling in the prefilled squares.
      squares.each_with_index do |row, i|
        row.each_with_index do |letter, j|
          unless letter == '_'
            # Prefilled letter.
            @letters[i,j].must == self.class.s_to_i(letter)
          end
        end
      end
  
      # Add the constraint that sequences longer than one letter must form
      # words. @words will accumulate all word variables created.
      @words = []
      # Left to right pass.
      left_to_right_pass(squares, @letters)
      # Top to bottom pass.
      left_to_right_pass(squares.transpose, @letters.transpose)
  
      branch_on wrap_enum(@words), :variable => :largest_degree,
        :value => :min
    end
  
    # Displays the solved crossword in the same format as shown in the
    # quiz examples.
    def to_s
      output = []
      @letters.values.each_slice(@letters.column_size) do |row|
        output << row.map{ |x| self.class.i_to_s(x) }.join(' ')
      end
      output.join("\n\n").upcase.gsub('#', ' ')
    end
    
    # ...

After storing the dictionary, this code breaks the crossword template down into
an integer matrix created using the Gecode/R helper method int_var_matrix().
This will be our puzzle of words Gecode is expected to fill in.

The next two sections of the initialize() method build up the constraints.
These are the rules that must be satisfied when we have found a correct
solution.

The first of these chunks of code sets rules for any letters that were given to
us in the template. This code tells Gecode that the number in that position of
the matrix must equal the value of the provided letter. Take a good look at
this RSpec like syntax because Andreas has spent a considerable effort on making
it easy to express your constraints in a natural syntax and I hope you will
agree with me that the end result is quite nice.

The other chunk of constraints are defined using a helper method we will examine
in just a moment. The result of this code though is to ensure that the numbers
selected represent letters that form actual words.

The final step in describing the problem to Gecode is to choose a branching
strategy. This tells Gecode which variables it will need to make guesses about
in order to find a solution as well as selecting a heuristic to use when guesses
must be made. In this case, words will be selected based on how much of the
overall puzzle they affect, hopefully ruling out large sections of the search
space quickly.

The problem model we just examined is pretty much always how constraint
programming goes. You just need to remember the three steps: create some
variables for Gecode to fill in, define the rules for the values you want Gecode
to find, and select a strategy for Gecode to use in solving the problem.

The to_s() method above just creates the output used in the quiz examples for
display to the user.

Let's have a look at the helper methods used in the model definition now:

  # ...
  
    private
  
    # Parses the template from left to right, line for line, constraining
    # sequences of two or more subsequent squares to form a word in the
    # dictionary.
    def left_to_right_pass(template, variables)
      template.each_with_index do |row, i|
        letters = []
        row.each_with_index do |letter, j|
          if letter == '#'
            must_form_word(letters) if letters.size > 1
            letters = []
          else
            letters << variables[i,j]
          end
        end
        must_form_word(letters) if letters.size > 1
      end
    end
  
    # Converts a word from integer form to string form, including the #.
    def self.i_to_s(int)
      if int == -1
        return '#'
      else
        Dictionary.i_to_s(int)
      end
    end
  
    # Converts a word from string form to integer form, including the #.
    def self.s_to_i(string)
      if string == '#'
        return -1
      else
        Dictionary.s_to_i(string)
      end
    end
  
    # Constrains the specified variables to form a word contained in the
    # dictionary.
    def must_form_word(letter_vars)
      raise 'The word is too long.' if letter_vars.size > MAX_WORD_LENGTH
      # Create a variable for the word with the dictionary's words as
      # domain and add the constraint.
      word = int_var @dictionary.words_of_size(letter_vars.size)
      letter_vars.to_number(BASE).must == word
      @words << word
    end
  end
  
  # ...

The i_to_s() and s_to_i() methods here are mostly just wrappers over the
Dictionary counterparts we examined earlier. The real interest is the other two
methods that together define the word constraints.

First, left_to_right_pass() is used to walk the puzzle looking for runs of two
or more letters that will need to become words in the Dictionary. Each time it
finds one, a hand-off is made to must_form_word(), which builds the actual
constraint.

With the problem modeled, it takes just a touch more code to turn this into a
full solution:

  # ...
  
  puts 'Reading the dictionary...'
  dictionary = Dictionary.new(ARGV.shift || '/usr/share/dict/words')
  puts 'Please enter the template (end with ^D)'
  template = ''
  loop do
    line = $stdin.gets
    break if line.nil?
    template << line
  end
  puts 'Building the model...'
  model = Crossword.new(template, dictionary)
  puts 'Searching for a solution...'
  puts((model.solve! || 'Failed').to_s)

You can see that this application code is just reading in the dictionary and
template, then constructing a model and pulling a solution with the solve!()
method. That's all it really takes to get answers after describing your problem
to Gecode.

If you want to continue exploring Andreas's Gecode/R wrapper, drop by the Web
site which has links to many useful resources:

  http://gecoder.rubyforge.org/

My thanks to all who put a good deal of effort into a hard search problem.

Tomorrow we will take another peek at our dictionary, this time to see what
numbers are hiding in there...