[SUMMARY] Learning Tic-Tac-Toe (#11)

In an interesting contrast, this quiz generated a lot of good discussion, but
only two solutions. I believe that may be because the problem turned out to be
more complicated than I intended. I know I personally ran into a few
complications and didn't have a chance to finish my own solution.

Again, the discussion was excellent and you should probably skim the thread, if
you weren't following it this weekend. Multiple Tic-Tac-Toe servers (written in
Ruby, of course) were posted, so programs could play against each other
remotely.

I posted a message about how Tic-Tac-Toe positions can be transformed by
rotation and "mirroring" into other, seemingly different layouts that can be
handled the same way.
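
The idea reduces to shuffling cell indices. Here's a minimal sketch
(illustrative code only, simpler than the tictactoe.rb representation; it
assumes a position is a flat nine-element array, read left to right, top to
bottom):

  # Each transform is a permutation of the cell indices 0-8.
  ROTATE = [6, 3, 0, 7, 4, 1, 8, 5, 2]  # 90 degrees clockwise
  MIRROR = [2, 1, 0, 5, 4, 3, 8, 7, 6]  # left-right flip

  def transform(board, permutation)
    permutation.map { |i| board[i] }
  end

  # The eight symmetric forms of a position:  four rotations, each
  # plain and mirrored.
  def symmetries(board)
    forms = []
    4.times do
      forms << board << transform(board, MIRROR)
      board = transform(board, ROTATE)
    end
    forms.uniq
  end

Canonicalizing each position to, say, the smallest of its symmetric forms
would let a learner share what it knows across equivalent positions.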

I also posted a "tictactoe.rb" library that makes most interaction with the game
trivially simple.

Finally, a few of us posted some notes about the problems we ran into. I agree
with Hans Fugal, who said that you can learn almost as much from those.

The two solutions posted are similar. Basically, they learn to avoid their
mistakes over time. They accomplish this by "scoring" the moves they made at
each position in a game, based on whether they won or lost. Eventually, this
knowledge allows them to select mainly strong moves, simply by remembering how
they've done in the past, in the same position.

I'll show Brian Schröder's code here, but both were interesting to examine.
Brian's solution contained eight files of Ruby code, the embedded documentation
for said files, charts, a write-up of the process, and was all wrapped up in a
handy web page, so you can dig as deep as you like into what he's done. (And he
once asked where *I* find the time!) For the purposes of this summary, I'll
stick to his learning code.

Here it is:

  class Learning < BasicInterface
    attr_accessor :random_prob
    attr_reader :player

    def initialize
      @state_values      = Hash.new(0)  # state_id => learned score
      @state_transitions = {}           # state_id => {move_id => state_id}
      @random_prob       = 0.05         # chance of exploring a random move
    end

    def new_game(player)
      @player         = player
      @states_visited = []
    end

    def choose_move(game)
      moves = game.moves
      if !@state_transitions[game.state_id] or rand < random_prob
        # explore:  an unknown position, or a small chance of trying
        # something new anyway
        move = moves[rand(moves.length)]
      else
        # exploit:  pick the move leading to the best-scored state
        move_id = @state_transitions[game.state_id].max { |(ma, sa), (mb, sb)|
          @state_values[sa] <=> @state_values[sb]
        }[0]
        move = moves.select { |m| m.move_id == move_id }[0]
      end
      move
    end

    def inform_of_move(before, after, move)
      @states_visited << before.state_id << after.state_id
      (@state_transitions[before.state_id] ||= {})[move.move_id] =
        after.state_id

      if after.final?
        winner = after.winner
        if winner
          value = winner == self.player ? 100.0 : -1000.0
        else
          value = 0.0
        end

        # blend the game's value into each visited state's score,
        # weighting the most recent states most heavily
        factor = 1.0
        while state = @states_visited.pop
          @state_values[state] = (1.0 - factor) * @state_values[state] +
                                 factor * value
          factor *= 0.5
        end
      end
    end
  end

The initialize() method sets up Brian's @state_values and @state_transitions,
which constitute the AI's brain.

@state_values will hold scores for the positions the AI has won or lost with
before. @state_transitions holds a "map" of how to get from position to
position. When these are filled in, the AI will have "learned" what positions
are desirable and how it can reach them.
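
To make that concrete, here's roughly what one tiny slice of the brain might
look like after some training (the ids below are invented for illustration;
the real ones come from the quiz's tictactoe.rb library):

  # Invented ids, for illustration only.
  @state_transitions = { 1234 => { 0 => 5678, 4 => 9012 } }
  @state_values      = { 5678 => -437.5, 9012 => 75.0 }

  # Reading:  from position 1234, move 0 leads to position 5678, which
  # has mostly lost (score -437.5), while move 4 leads to position 9012,
  # which has mostly won (score 75.0).  Move 4 is the clear choice.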

Knowing this, choose_move() is easy to break down. It checks to see if it
knows anything about the moves from the current position. If it does, it
selects the highest score it can find for itself (else branch). If it
doesn't, it goes with a random choice from all available moves (if branch).
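
That max call is worth a second look, since iterating a Hash yields key/value
pairs. Here's the same idiom on a standalone example (made-up numbers, same
structure as Brian's data):

  transitions = { 0 => 5678, 4 => 9012 }          # move_id => state_id
  values      = { 5678 => -437.5, 9012 => 75.0 }  # state_id => score

  best_move_id = transitions.max { |(ma, sa), (mb, sb)|
    values[sa] <=> values[sb]
  }[0]
  best_move_id  # => 4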

The final piece of the puzzle is inform_of_move(). Mainly, this method
remembers all the moves made in a game. When it sees a final position, it
scores all those moves based on whether the AI won or lost. The scoring
scale is slanted to encourage the AI to avoid losses.
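
The halving factor is worth tracing by hand. Here's the same update rule run
standalone on a freshly won game (placeholder state names, fresh scores):

  value  = 100.0                 # the AI just won
  states = [:s1, :s2, :s3]       # positions visited, oldest first
  scores = Hash.new(0)

  factor = 1.0
  while state = states.pop       # walk backward from the final position
    scores[state] = (1.0 - factor) * scores[state] + factor * value
    factor *= 0.5
  end

  scores  # => {:s3=>100.0, :s2=>50.0, :s1=>25.0}

Positions near the end of the game absorb the most credit or blame, which
seems right, since they contributed most directly to the result.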

That's the heart of Brian's solution. The rest of the code is interface,
client/server, a perfect minimax player, and Tic-Tac-Toe details.

For the curious, this quiz was inspired by the research of Donald Michie. In
1961 he built a "machine" that learned to play perfect Tic-Tac-Toe against
humans, using matchboxes and beads. He called the machine MENACE (Matchbox
Educable Noughts And Crosses Engine).

304 matchboxes were labeled with images of Tic-Tac-Toe positions and filled
with colored beads representing possible moves. At each move, a bead would be
rattled out of the proper box to determine a move. When MENACE won, more
beads of the colors played were added to each position's box. When it lost,
those beads were left out, to discourage the losing moves.
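
In Ruby terms, each matchbox behaves something like this (a loose sketch of
the mechanism only; Michie's actual bead counts and reinforcement schedule
were more refined):

  class Matchbox
    def initialize(moves)
      @beads = moves * 4               # start with a few beads per move
    end

    def draw                           # rattle out a random bead
      @beads[rand(@beads.length)]
    end

    def reward(move)                   # a win:  add beads of that color
      3.times { @beads << move }
    end

    def punish(move)                   # a loss:  leave one bead out
      index = @beads.index(move)
      @beads.delete_at(index) if index
    end
  end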

Michie claimed that he trained MENACE over 220 games, losing to it in eight
of the final ten.

Thanks to those who tried this one, successful or not. Sorry it turned out to
be a bit involved.

Tomorrow, "How Ruby Can Help You Beat Your Grandmother In Scrabble" is the
topic. If your grandmother is anything like mine, you'll appreciate all the
help you can get!

On Thu, 16 Dec 2004 23:13:16 +0900 Ruby Quiz <james@grayproductions.net> wrote:

> Knowing this, choose_move() is easy to break down. It checks to see if it
> knows anything about the moves from the current position. If it does, it
> selects the highest score it can find for itself (else branch). If it
> doesn't, it goes with a random choice from all available moves (if branch).

Thanks for the writeup, James. I have to correct this paragraph, as the
program would not work if it worked the way you describe it. It is important
to differentiate between exploration and exploitation behaviour. Exploration
means that the player tries out new moves to learn more about the game;
exploitation means using the knowledge it has already learned. Your
description suggests that once the player knows a move from a position, it
always makes it. If it exhibited this behaviour, it would never learn to play
differently than it did in its first game. The only thing it would learn is
that it plays badly (driving the scores of those states toward -INFINITY over
an infinite number of games).

In fact, there is an adjustable chance that the player picks a random move,
even when it already knows a move. This is the exploration probability,
adjusted with the badly named random_prob attribute.

Regards,

Brian

--
Brian Schröder
http://ruby.brian-schroeder.de/

You're correct, of course. I was trying not to over-complicate my
explanation, but that is a pretty important detail I omitted. Thanks for
keeping me honest.

James Edward Gray II

On Dec 16, 2004, at 9:39 AM, Brian Schröder wrote:

> In fact, there is an adjustable chance that the player picks a random move,
> even when it already knows a move. This is the exploration probability,
> adjusted with the badly named random_prob attribute.