[QUIZ] Bytecode Compiler (#100)

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

···

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Ross Bamford

Note: This quiz isn't really as much work as it might seem!

This quiz involves writing (in Ruby, of course) a compiler for basic arithmetic
expressions. The output from this compiler should be an array of unsigned
byte-sized ints, which can be fed into the included interpreter
(http://www.rubyquiz.com/interp.rb) in order to execute the compiled expression.

The bytecode format is very simple, while the interpreter is also very simple,
implemented as a stack machine. They have the following general characteristics:

  * Bytecode is stored as an array of unsigned byte-sized Fixnums.
  * All stack-bound numbers are signed integers
  * The following operations are supported:
    * Addition
    * Subtraction
    * Multiplication
    * Division
    * Raise to power
    * Integer modulo
  * Where an operator would return a floating point value,
    the value is truncated to an integer.
  * Short CONST and long LCONST instructions allow constants
    to be pushed to the stack. These instructions expect their
    operands to hold a signed short or long, respectively,
    in network byte order.

Your compiler interface should be via a singleton method on a module 'Compiler',
taking a string, such that:

  Compiler.compile('3+2')

Returns an array of instructions (and operands) that, when fed to the
interpreter, will execute the expression 3+2 and return the result (hopefully
5). For example, a correct (but non-optimal) result from the above call would
be:

  [2,0,0,0,3, # LCONST 3
   2,0,0,0,2, # LCONST 2
   10] # ADD

Your compiler should support all basic arithmetic operations and explicit
precedence (parenthesis). As standard, syntax/precedence/ associativity/etc.
should follow Ruby itself. Obviously, specific implementation is entirely up to
you, though bear in mind that your compiler must be capable of running inline in
the same Ruby process as the interpreter, without affecting any code outside
itself.

The quiz also includes a number of tests
(http://www.rubyquiz.com/test_bytecode.rb) that will test your compiler's
functionality, with expressions becoming more complex as the tests go on. To
pass all the tests, a compiler will have to not only generate correct bytecode,
but it will also need to generate the shortest code it can for a given
expression.

Here is the bytecode spec:

  # 0x01: CONST (cbyte1, cbyte2) ... => ..., const
    Push a 15-bit signed integer to the stack.
    The two bytes immediately following the instruction represent the
    constant.
  
  # 0x02: LCONST (cbyte1, cbyte2, cbyte3, cbyte4) ... => ..., const
    Push a 31-bit signed integer to the stack.
    The four bytes immediately following the instruction represent the
    constant.
  
  # 0x0a: ADD () ..., value1, value2 => ..., result
    Pop the top two values from the stack, add them, and push the result
    back onto the stack.
  
  # 0x0b: SUB () ..., value1, value2 => ..., result
    Pop the top two values from the stack, subtract value2 from value1,
    and push the result back onto the stack.
  
  # 0x0c: MUL () ..., value1, value2 => ..., result
    Pop the top two values from the stack, multiply value1 by value2,
    and push the result back onto the stack.
  
  # 0x0d: POW () ..., value1, value2 => ..., result
    Pop the top two values from the stack, raise value1 to the power of
    value2, and push the result back onto the stack.
  
  # 0x0e: DIV () ..., value1, value2 => ..., result
    Pop the top two values from the stack, divide value1 by value2,
    and push the result back onto the stack.
  
  # 0x0f: MOD () ..., value1, value2 => ..., result
    Pop the top two values from the stack, modulo value1 by value2,
    and push the result back onto the stack.
  
  # 0xa0: SWAP () ..., value1, value2 => ..., value2, value1
    Swap the top two stack values.

require 'interp'

module Compiler

   # use eval and Value class below to compile
   # expression into bytecode
   def Compiler.compile(s)
     s.gsub!(/([0-9]+)/, 'Value.new(stack, \1)')
     stack = []
     eval(s)
     stack
   end

   class Value
     attr_reader :number # constant value or nil for on stack
     ON_STACK = nil

     def initialize(stack, number)
       @number = number
       @stack = stack
     end

     # generate code for each binary operator (except -@)
     # algorithm:
     # push constants (or don't if already on stack)
     # swap if necessary
     # push bytecode
     # create stack item
     {'+' => Interpreter::Ops::ADD,
      '-' => Interpreter::Ops::SUB,
      '*' => Interpreter::Ops::MUL,
      '**'=> Interpreter::Ops::POW,
      '/' => Interpreter::Ops::DIV,
      '%' => Interpreter::Ops::MOD}.each do |operator, byte_code|
        Value.module_eval <<-FUNC
         def #{operator}(rhs)
           push_const(@number)
           push_const(rhs.number)
           # may need to swap integers on stack for all but plus
           #{
             if operator != "+"
               "@stack << Interpreter::Ops::SWAP if rhs.number == nil &&
                                                    @number != nil"
             end
           }
           @stack << #{byte_code}
           Value.new(@stack, ON_STACK)
         end
        FUNC
     end

     def -@
       if @number != ON_STACK
         @number = -@number
         push_const(@number)
       else
         push_const(@number)
         push_const(0)
         @stack << Interpreter::Ops::SWAP
         @stack << Interpreter::Ops::SUB
       end
       Value.new(@stack, ON_STACK)
     end

     def push_const(number)
       if number != ON_STACK
         if (-32768..32767).include?(number)
           @stack << Interpreter::Ops::CONST
         else
           @stack << Interpreter::Ops::LCONST
           @stack << ((number >> 24) & 0xff)
           @stack << ((number >> 16) & 0xff)
         end
         @stack << ((number >> 8) & 0xff)
         @stack << (number & 0xff)
       end
     end
   end
end

The simplest way I found to do this problem was to let Ruby do the
legwork of parsing the expression for me, so I didn't have to worry
about things like parens or operator precedence.

I defined a Const and an Expr class and defined operators inside of
them to create a parse tree, added a to_const method to Fixum and
created a regular expression to convert an expression 1+1 to
'1.to_const() + 1.to_const()', so evaluating that expression would
produce a parse tree for that experssion.

Emitting the bytescodes is then a post-order traversal of the parse
tree.

---- Compiler.rb

# Operator overrides to create an expression tree. Mixed into
# Const and Expr so:
# Const <op> Const => Expr
# Const <op> Expr => Expr
# Expr <op> Const => Expr
module CreateExpressions
  def +(other) Expr.new(:add, self, other) end
  def -(other) Expr.new(:sub, self, other) end
  def *(other) Expr.new(:mul, self, other) end
  def /(other) Expr.new(:div, self, other) end
  def %(other) Expr.new(:mod, self, other) end
  def **(other) Expr.new(:pow, self, other) end
end

# Add a method to fixnum to create a const from an integer
class Fixnum
  def to_const
    Const.new(self)
  end
end

# An integer value
class Const
  include CreateExpressions
  # Opcodes to push shorts or longs respectively onto the stack
  OPCODES = {2 => 0x01, 4 => 0x02}

  def initialize(i)
    @value = i
  end

  def to_s
    @value
  end

  # Emits the bytecodes to push a constant on the stack
  def emit
    # Get the bytes in network byte order
    case @value
      when (-32768..32767): bytes = [@value].pack("n").unpack("C*")
      else bytes = [@value].pack("N").unpack("C*")
    end
    bytes.insert 0, OPCODES[bytes.size]
  end
end

# A binary expression
class Expr
  include CreateExpressions
  OPCODES = {:add => 0x0a, :sub => 0x0b, :mul => 0x0c, :pow => 0x0d,
    :div => 0x0e, :mod => 0x0f}

  def initialize(op, a, b)
    @op = op
    @first = a
    @second = b
  end

  # Emits a human-readable s-expression for testing
  # (preorder traversal of parse tree)
  def to_s
    "(#{@op.to_s} #{@first.to_s} #{@second.to_s})"
  end

  # Bytecode emitter for an expression (postorder traversal of parse
tree)
  def emit
    # emit LHS, RHS, opcode
    @first.emit << @second.emit << OPCODES[@op]
  end
end

# Compile and print out parse tree for expressions
class Compiler
  # Creates bytecodes from an arithmatic expression
  def self.compile(expr)
    self.mangle(expr).emit.flatten
  end

  # Prints a representation of the parse tree as an S-Expression
  def self.explain(expr)
    self.mangle(expr).to_s
  end

private
  # Name-mangles an expression so we create a parse tree when calling
  # Kernel#eval instead of evaluating the expression:
  # [number] => [number].to_const()
  def self.mangle(expr)
    eval(expr.gsub(/\d+/) {|s| "#{s}.to_const()"})
  end
end

Here is my solution, a simple recursive descent parser. It's a bit more code than is strictly necessary because it is loosely modeled after a parser for a real language I have written recently.

require 'English'

class Compiler
   def self.compile(expr)
     return self.new(expr).compile.unpack('C*')
   end

   def initialize(expr)
     # a very simple tokenizer
     @tok = []
     until expr.empty?
       case expr
       when /\A\s+/ # skip whitespace
       # don't tokenize '1-1' as '1', '-1'
       when (@tok.last.is_a? Integer) ? /\A\d+/ : /\A\-?\d+/
         @tok << $MATCH.to_i
       # any other character and '**' are literal tokens
       when /\A\*\*|./
         @tok << $MATCH
       end
       expr = $POSTMATCH
     end
   end

   def compile
     code = compile_expr(0)
     raise "syntax error" unless @tok.empty?
     return code
   end

private

   OPS = {'+'=>0xa, '-'=>0xb, '*'=>0xc, '**'=>0xd, '/'=>0xe, '%'=>0xf}

   def compile_expr(level)
     # get the tokens to parse at this precedence level
     tok = [['+', '-'], ['*', '/', '%'], ['**']][level]
     if tok
       # if we are to actually parse a bi-op, do so
       left = compile_expr(level + 1)
       # for left-associative ops, find as many ops in a row as possible
       while tok.include?(@tok.first)
         op = OPS[@tok.shift]
         # '**' is right-associative, so add a special case for that
         right = compile_expr(op == OPS['**'] ? level : level + 1)
         left << right + op.chr
       end
       return left
     end
     # if we are at a level higher than the ops, try to parse an
     # atomic - either a numeral or an expression in paranthesis
     tok = @tok.shift
     if tok == '('
       expr = compile_expr(0)
       raise "')' expected" unless @tok.shift == ')'
       return expr
     end
     raise 'number expected' unless tok.is_a? Integer
     return (tok < -32768 || tok > 32767) ? [2, tok].pack('CN') :
             [1, tok].pack('Cn')
   end
end

if $0 == __FILE__
   p Compiler.compile(ARGV[0])
end

This is my first RubyQuiz submission. Like Cameron Pope, I decdied to let
Ruby's parser do the heavy lifting by monkey-patching a to_expr onto Integer
and running a regex on Compiler::compile's input.

This differes from his entry by using method_missing to handle all
operators, and I didn't bother with separate number/expression classes.

Also, his byte conversion looks much shorter -- I'll have to see how it
works and wrap my head around Array#pack.

I've dabbled in Ruby for a while, but I still feel like I'm at the
hello_world stage; constructive criticism is *quite* welcome.

# Compiler.rb

class Integer
  # Much easier than properly parsing a string for Exper.new(???)
  # If I *really* wanted to avoid monkey-patching, I could define
  # Expr#-@ and expr#+@ instead.
  def to_expr
    Compiler::Expr.new self
  end
end

module Compiler
  CONST = 2**15
  LCONST = 2**31

  def Compiler.compile input
    # I initially tried Expr.new(\1), but this way lets me use Ruby's
    # sign-parsing.
    m = input. gsub /(\d+)(\D*)/, '\1.to_expr()\2'
    exp = eval m
    exp.compile
  end

  # The meat of this module...
  # Rather than do any parsing, I'm just converting all the expression's
numbers
  # into Expr objects, whose +-/*, etc. methods (all method_missing) just
build
  # a parse-tree when I run eval.
  class Expr
    attr_reader :val

    OPERATORS = { :+ => 0x0a,
                :- => 0x0b,
                :* => 0x0c,
                :** => 0x0d,
                :/ => 0x0e,
                :% => 0x0f,
                :swap => 0xa0 } # Swap doesn't have an operator, but
whatever.

    def initialize *v
      @val = v
    end

    # Take care of all those operators
    def method_missing sym, *args
      if OPERATORS.include? sym
        Expr.new [val, args.first, sym]
      else
        raise "Unknown operator: #{sym}, #{args.inspect}"
      end
    end

    def to_s
      "Expr: <#{flatten.join ' '}>"
    end

    def flatten
      # Flatten the array as much as we can, then tackle any Expr objects.
      # Finally, make sure the result is also flat
      # (because the map turns each Expr into an array)
      val.flatten.map do |i|
        if i.respond_to? :flatten
          i.flatten
        else
          i
        end
      end.flatten
    end

    def compile
      # Get a flat copy of our value, then encode each number and symbol.
      # Finally, flatten all the encoded numbers into our answer.
      arr = flatten
      arr.map do |i|
        if i.is_a? Integer
          bytes_for(i)
        elsif OPERATORS.include? i
          OPERATORS[i]
        else
          # What's the preferred method of dealing with this?
          # I could raise a different exception, or attempt to call the same
          # method in my superclass...
          raise "Unknown operator: #{i.inspect}, #{i.class}"
        end
      end.flatten
    end

    # Convert a number to bytes
    def bytes_for number
      type = size = 0
      values = []
      if number < CONST and number >= -CONST
        type, size = 1, 2
      elsif number < LCONST and number >= -LCONST
        type, size = 2, 4
      else
        raise "#{number} is too big to encode!"
      end

      size.times do |s|
        number, byte = number.divmod 256
        # I could use << here, but then I'd need to reverse values.
        values.unshift byte
      end
      [type, *values]
    end

  end #expr

end #Compiler

# This is a solution to Ruby Quiz #100

···

#
# It's basically just a shunting algorithm, but with a twist
# since it needs to distinguish between a "-" that's part of
# a number and a "-" that's an operator. To do that, I use
# a state machine while parsing to remember if I need next
# an operator or an integer.

require 'strscan'
class Compiler
  # A small class made so that I can use case ... when
  # with a StringScanner
  class Token < Regexp
    def initialize(re)
      super(re)
    end
    # Using is_a? instead of respond_to? isn't very duck-typey,
    # but unfortunately String#scan and StringScanner#scan mean
    # completely different things.
    def ===(s)
      if (s.is_a?(StringScanner))
        s.scan(self)
      else
        super(s)
      end
    end
  end
  
  # The tokens I need
  WSPACE = Token.new(/\s+/)
  LPAREN = Token.new(/\(/)
  RPAREN = Token.new(/\)/)
  OP = Token.new(/\*\*|[+*%\/-]/)
  NEG = Token.new(/-/)
  INT = Token.new(/\d+/)
  
  OpValMap = {'+' => 0x0a, '-' => 0x0b, '*' => 0x0c,
              '**' => 0x0d, '/' => 0x0e, '%' => 0x0f}

  def initialize(instring)
    @scanner = StringScanner.new(instring)
    @opstack = Array.new
    @outarr = Array.new
  end

  def compile()
    state = :state_int
    while state != :state_end
      case @scanner
      when WSPACE
        next
      else
        state = send(state)
        raise "Syntax error at index #{@scanner.pos}" if ! state
      end
    end
    while ! @opstack.empty?
      op = @opstack.pop
      raise "Mismatched parens" if LPAREN === op
      @outarr << OpValMap[op]
    end
    @outarr
  end
  
  # Class method as required by the test harness
  def self.compile(instring)
    new(instring).compile
  end

  private
  # Expecting an operator or right paren
  def state_op
    case @scanner
    when RPAREN
      while not LPAREN === @opstack[-1]
        raise "Mismatched parens" if @opstack.empty?
        @outarr << OpValMap[@opstack.pop]
      end
      @opstack.pop
      :state_op
    when OP
      op = @scanner.matched
      while is_lower(@opstack[-1], op)
        @outarr << OpValMap[@opstack.pop]
      end
      @opstack << op
      :state_int
    else
      # I would handle this with an EOS token, but unfortunately
      # StringScanner is broken w.r.t. @scanner.scan(/$/)
      :state_end if @scanner.eos?
    end
  end
  
  # state where we're expecting an integer or left paren
  def state_int
    case @scanner
    when LPAREN
      @opstack << @scanner.matched
      :state_int
    when INT
      integer(@scanner.matched.to_i)
      :state_op
    when NEG
      :state_neg
    end
  end
  
  # The state where we've seen a minus and are expecting
  # the rest of the integer
  def state_neg
    case @scanner
    when INT
      integer(-(@scanner.matched.to_i))
      :state_op
    end
  end
  
  # Handle an integer
  def integer(i)
    if (i <= 32767 and i >= -32768)
      @outarr << 0x01
      @outarr.push(*([i].pack("n").unpack("C*")))
    else
      @outarr << 0x02
      @outarr.push(*([i].pack("N").unpack("C*")))
    end
  end
  
  # Define the precedence order
  # One thing to note is that for an operator a,
  # is_lower(a,a) being true will make that operator
  # left-associative, while is_lower(a,a) being false
  # makes that operator right-associative. Note that
  # we want ** to be right associative, but all other
  # operators to be left associative.
  def is_lower(op_on_stack, op_in_hand)
    case op_on_stack
      when nil, LPAREN; false
      when /\*\*|[*\/%]/; op_in_hand =~ /^.$/
      when /[+-]/; op_in_hand =~ /[+-]/
    end
  end
end
__END__

--
s=%q( Daniel Martin -- martin@snowplow.org
       puts "s=%q(#{s})",s.map{|i|i}[1] )
       puts "s=%q(#{s})",s.map{|i|i}[1]

Note that the creator of this quiz left out one important case from
their tests:

  def test_02a
    assert_equal [2**1**2], Interpreter.new(Compiler.compile('2**1**2')).run
  end

This tests that your compiler is properly making **
right-associative. Some solutions already posted fail this test.

2**2**2 was an unfortunate test case to choose, since 2**4 == 4**2.

···

--
s=%q( Daniel Martin -- martin@snowplow.org
       puts "s=%q(#{s})",s.map{|i|i}[1] )
       puts "s=%q(#{s})",s.map{|i|i}[1]

My solution is a bit different than all the ones I've seen so far. I had no intention of writing an expression parser :slight_smile: All it does is

1. Define a method (to_bc) to have a fixnum return its own bytecode
2. Redefine the array operators to return the appropriate bytecode representations
3. Add .to_bc after every number in the expression string

Here it is:

class Compiler
  def self.compile(str)
    eval(str.gsub(/(\d+)([^\d])/,'\1.to_bc\2').gsub(/([^\d])(\d+)$/,'\1\2.to_bc'))
  end
end

class Fixnum
  def to_bc
    return (self >= 0 ? [1,self/256,self%256] : [1,(self+32768)/256+128,(self+32768)%256]) if self <= 32767 and self >= -32768
    res = [(24..30),(16..23),(8..15),(0..7)].map { |range| range.map {

x> self } }.map { |byte| byte.inject_with_index(0) { |s,x,i|

s+x*2**i } }
    ([2] << (self > 0 ? res[0] : res[0]+128) << res[1..3]).flatten
  end
end

class Array
  {:+ => 10, :- => 11, :* => 12, :** => 13, :confused: => 14, :% => 15}.each do

op,opcode|

    define_method(op) { |x| self.concat(x).concat([opcode]) }
  end
  def inject_with_index(sum)
    each_with_index { |x,i| sum = yield(sum,x,i) }
    sum
  end
end

Note: This quiz isn't really as much work as it might seem!

Ouch! :slight_smile: Oh well, I don't think my Ruby is up to some of the more
cunning techniques I see many have employed.

Your compiler should support all basic arithmetic operations and explicit
precedence (parenthesis). As standard, syntax/precedence/ associativity/etc.
should follow Ruby itself.

By associativity does this also mean that "+--+1" - which could be
rewritten as "0+(0-(0-(0+1)))" - should be parsed into a bytecode
which, however well optimised, on execution results in the answer "1"?

I should warn that my solution below is rather long. I went down a
possibly more traditional route of writing a generic tokenizer/lexer.
I don't know if these are still commonly used but I couldn't find an
existing implementation in the Ruby Standard Library.

I also tried to document the functions using rdoc so someone else
might make use of it. For those who haven't tried it yet, just type
'rdoc' at the command prompt and it makes a nice doc directory under
the current directory with an index.html to start browsing the
file/classes/methods. Nice!

Back to the task...

My wanting a solution that coped with all difficult expressions that
Ruby itself can deal with (using the lexicon allowed) meant having to
get things like the aforementioned negation with parentheses working:

-(---3) # => 3

...and power precedence (as others have pointed out) combined with
negation and parentheses turned out to be tricky:

64**-(-(-3+5)**3**2) #=> a big number

There's a big list of test cases such as these in my unit tests (included).

So having written the lexer class, I now set up the state transition
table and ask the lexer to tokenize the expression. The tokens are
then parsed and the bytecode is generated using a simple mapping.

The code follows; there are two files in total.

Thanks for another fun challenge and congrats all round for reaching
the 100th Ruby Quiz!

Marcel

#!/usr/bin/env ruby

···

On 03/11/06, Ruby Quiz <james@grayproductions.net> wrote:
##################################################################
# = compiler_mw.rb - bytecode compiler
#
# Author:: Marcel Ward <wardies ^a-t^ gmaildotcom>
# Documentation:: Marcel Ward
# Last Modified:: Monday, 06 November 2006

require 'interp'
require 'lexer_mw'

module Compiler
  # The lexer needs to know the character sets involved in deciding
  # which state transition will be fired...
  CHAR_SETS = {
        :plus => [?+], :minus => [?-],
        :digit => /\d/,
        :div_mod => [?/, ?%], # matches '/' or '%'
        :asterisk => [?*],
        :open_paren => [?(], :close_paren => [?)]
      }

  # Tell the lexer how to parse a datastream: which tokens to
  # generate, what state to switch to, etc.
  # This table was designed according to my vague recollection of
  # the dragon book on compiler construction by Aho/Sethi/Ullman.
  STATE_TRANS_TABLE = {
    :s_start => {
        :plus => {:next_s_skip => :s_start},
        :minus => {:next_s => :s_negate},
        :digit => {:next_s => :s_numeric},
        :open_paren => {:next_s => :s_start,
                          :token => :tok_open_paren}
      },
    :s_negate => {
        :plus => {:next_s_skip => :s_negate},
        :minus => {:next_s => :s_start},
        :digit => {:next_s => :s_numeric},
        :open_paren => {:next_s_backtrack => :s_start,
                          :token => :tok_negate}
      },
    :s_numeric => {
        :plus => {:next_s_backtrack => :s_operator,
                          :token => :tok_int},
        :minus => {:next_s_backtrack => :s_operator,
                          :token => :tok_int},
        :digit => {:next_s => :s_numeric},
        :div_mod => {:next_s_backtrack => :s_operator,
                          :token => :tok_int},
        :asterisk => {:next_s_backtrack => :s_operator,
                          :token => :tok_int},
        :close_paren => {:next_s_backtrack => :s_operator,
                          :token => :tok_int},
        :eof => {:next_s_backtrack => :s_operator,
                          :token => :tok_int},
      },
    :s_operator => {
        :plus => {:next_s => :s_start,
                          :token => :tok_add},
        :minus => {:next_s => :s_start,
                          :token => :tok_subtract},
        :div_mod => {:next_s => :s_start,
                          :token => :tok_div_mod},
        :asterisk => {:next_s => :s_mult_or_power},
        :close_paren => {:next_s => :s_operator,
                          :token => :tok_close_paren},
        :eof => {} # when :next_s... is absent, finish
      },
    :s_mult_or_power => {
        :plus => {:next_s_backtrack => :s_start,
                          :token => :tok_multiply},
        :minus => {:next_s_backtrack => :s_start,
                          :token => :tok_multiply},
        :digit => {:next_s_backtrack => :s_start,
                          :token => :tok_multiply},
        :asterisk => {:next_s => :s_start,
                          :token => :tok_power},
        :open_paren => {:next_s_backtrack => :s_start,
                          :token => :tok_multiply}
      }
  }

  # Compiles a string expression _sum_ into bytecode and returns
  # the bytecode array (as per Ruby Quiz 100 requirements).
  def self.compile(sum)
    lexer = LexerMW.new()
    lexer.init_char_sets(CHAR_SETS)
    lexer.init_state_transitions(STATE_TRANS_TABLE)

    toks = lexer.tokenize(sum)

    puts toks.inspect + "\n\n" + toks.map {|a,b| b}.join(' ') \
      if $DEBUG == 1

    # Get the mnemonic stack by parsing the tokens.
    mnemonic_stack = parse(toks)
    puts "\nParsed toks => #{mnemonic_stack.inspect}" if $DEBUG == 1

    # Last stage now, we convert our internal mnemonics directly
    # to a byte stack in the required bytecode format.
    mnemonics_to_bytecode(mnemonic_stack)
  end

  MNEMONIC_TO_BYTECODE = {
      :tok_add => Interpreter::Ops::ADD,
      :tok_subtract => Interpreter::Ops::SUB,
      :tok_multiply => Interpreter::Ops::MUL,
      :tok_divide => Interpreter::Ops::DIV,
      :tok_modulo => Interpreter::Ops::MOD,
      :tok_power => Interpreter::Ops::POW
    }

  # This exception is raised by the mnemonic-to-bytecode method when
  # an integer constant cannot be pushed onto the interpreter
  # bytecode stack because it is too big to fit the
  # Interpreter::Ops::LCONST instruction.
  class OutOfRangeError < StandardError
  end

  # Convert our internal _mnemonics_ directly to a byte array and
  # return this in the required bytecode format, ready to execute.
  def self.mnemonics_to_bytecode(mnemonics)
    bc =
    mnemonics.each do
      >mnem>
      if MNEMONIC_TO_BYTECODE.has_key? mnem
        bc << MNEMONIC_TO_BYTECODE[mnem]
      else
        # Try packing this value as a 2-or 4-byte signed string
        # and ensure we get back the same value on unpacking it.
        if [mnem] == [mnem].pack('s').unpack('s')
          # 2-bytes will be enough
          bc << Interpreter::Ops::CONST
          bc.concat([mnem].pack('n').unpack('C*'))
        elsif [mnem] == [mnem].pack('l').unpack('l')
          # 4-bytes will be enough
          bc << Interpreter::Ops::LCONST
          bc.concat([mnem].pack('N').unpack('C*'))
        else
          # It could be dangerous to silently fail when a
          # number will not fit in a 4-byte signed int.
          raise OutOfRangeError
        end
      end
    end
    bc
  end

  # If there is a mismatch in the number of parenthesis, this
  # exception is raised by the #parse routine.
  # E.g. "3+(4-2" and "(3-10))" are both considered invalid.
  class ParenthesisError < Exception
  end

  # The operator precedence hash helps the #parse method to
  # decide when to store up operators and when to flush a load
  # out. The
  PAREN_PRECEDENCE = 0
  OP_PRECEDENCE = {
      :tok_end => -1,
      :tok_open_paren => PAREN_PRECEDENCE,
      :tok_close_paren => PAREN_PRECEDENCE,
      :tok_add => 1, :tok_subtract => 1,
      :tok_multiply => 2, :tok_div_mod => 2,
      :tok_power => 3,
      :tok_negate => 4
    }

  # Parse an array of [token,value] pairs as returned by
  # LexerMW::tokenize. Returns our own internal quasi-bytecode
  # mnemonic array.
  def self.parse(tokens)
    operator_stack =
    ops =

    # Push the bottom-most element with precedence equivalent to that
    # of :tok_end so when we see :tok_end all pending operation
    # tokens on the stack get popped
    precedence_stack = [OP_PRECEDENCE[:tok_end]]

    tokens.each do
      >tok, val|
      if tok == :tok_int
        # "--3".to_i => 0 is bad, so use eval("--3") => 3 instead.
        ops << eval(val)
      else
        precedence = OP_PRECEDENCE[tok]
        if not tok == :tok_open_paren
          while precedence <= precedence_stack.last &&
                  precedence_stack.last > PAREN_PRECEDENCE
            # Workaround for the fact that the ** power operation
            # is calculated Right-to-left,
            # i.e. 2**3**4 == 2**(3**4) /= (2**3)**4
            break if tok == :tok_power &&
              precedence_stack.last == OP_PRECEDENCE[:tok_power]

            precedence_stack.pop
            ops << operator_stack.pop
          end
        end

        # Divide and modulo come out of the lexer as the same token
        # so override tok according to its corresponding value
        tok == :tok_div_mod && \
          tok = (val == '/') ? :tok_divide : :tok_modulo

        case tok
        when :tok_close_paren
          precedence_stack.pop == PAREN_PRECEDENCE \
            or raise ParenthesisError
        when :tok_negate
          # val contains just the minuses ('-', '--', '---', etc.)
          # Optimise out (x) === --(x) === ----(x), etc.
          if val.size % 2 == 1
            # No negate function for -(x) so simulate using 0 - (x)
            precedence_stack.push precedence
            operator_stack.push :tok_subtract
            ops << 0
          end
        when :tok_end
          raise ParenthesisError if precedence_stack.size != 1
        else
          precedence_stack.push precedence
          operator_stack.push tok unless tok == :tok_open_paren
        end
      end
    end
    ops
  end
end

if $0 == __FILE__
  eval DATA.read, nil, $0, __LINE__+4
end

__END__

require 'test/unit'

class TC_Compiler < Test::Unit::TestCase
  def test_simple
    @test_data = [
      '8', '124', '32767', # +ve CONST
      '-1', '-545', '-32768', # -ve CONST
      '32768', '294833', '13298833', # +ve LCONST
      '-32769', '-429433', '-24892810', # -ve LCONST
      '4+5', '7-3', '30+40+50', '14-52-125', # ADD, SUB
      '512243+1877324', '40394-12388423', # LCONST, ADD, SUB
      '3*6', '-42*-90', '94332*119939', # MUL
      '8/3', '-35/-15', '593823/44549', # DIV
      '8%3', '243%-59', '53%28%9', # MOD
      '531%-81%14', '849923%59422', #
      '-2147483648--2147483648', # SUB -ve LCONST
      '2**14', '-4**13+2' # POW
    ]
    @test_data.each do
      >sum>
      assert_equal [eval(sum)],
        Interpreter.new(Compiler.compile(sum)).run,
        "whilst calculating '#{sum}'"
    end
  end

  def test_advanced
    @test_data = [
      '-(423)', '-(-523*32)', '---0',
      '-(-(-(16**--++2)))',
      '3**(9%5-1)/3+1235349%319883+24*-3',
      '+42', '((2*-4-15/3)%16)', '4**3**((2*-4-15/3)%16)',
      '64**-(-(-3+5)**3**2)', '4*165%41*341/7/2/15%15%13',
      '--(---(4**3**((2*-4-15/3)%16))+++-410--4)'
    ]
    @test_data.each do
      >sum>
      assert_equal [eval(sum)],
        Interpreter.new(Compiler.compile(sum)).run,
        "whilst calculating '#{sum}'"
    end
  end
end

#!/usr/bin/env ruby
##################################################################
# = lexer_mw.rb - generic lexical analyser
#
# Author:: Marcel Ward <wardies ^a-t^ gmaildotcom>
# Documentation:: Marcel Ward
# Last Modified:: Monday, 06 November 2006
#
# Solution for Ruby Quiz number 100 - http://www.rubyquiz.com/

$DEBUG = 0

# If the lexer fails to find an appropriate entry in the state
# transition table for the current character and state, it
# raises this exception.
class LexerFailure < StandardError
end

# If the lexer encounters a character for which no matching charset
# has been supplied then it raises this exception.
#
# This exception will never be raised if #init_state_transitions
# has been called with an appropriate catch-all charset id.
class InvalidLexeme < StandardError
end

class LexerMW
  # Creates an instance of the lexer class.
  #
  # _lexer_eof_ascii_::
  # defines the ASCII byte value that the lexer considers as
  # end-of-file when it is encountered. When #tokenize is called,
  # the supplied datastream is automatically appended with this
  # character.
  def initialize(lexer_eof_ascii = 0)
    @s_trans = {}
    @columns = {}
    @lex_eof = lexer_eof_ascii
  end

  # Initialize the character set columns to be used by the lexer.
  #
  # _cs_defs_::
  # a hash containing entries of the form <tt>id => match</tt>,
  # where _match_ defines the characters to be matched and _id_
  # is the id that will be passed to the finite state machine
  # to inidicate the character grouping encountered.
  # _eof_charset_id_::
  # defines the character set identifier which the lexer will
  # attempt to match in the state machine table when the
  # end-of-file character defined in #new is encountered.
  #
  # The content of _match_ falls into one of two main categories:
  #
  # _regexp_:: e.g. <tt>/\d/</tt> will match any digit 0..9; or
  # _enum_:: an enumeration that describes the set of allowed
  # character byte values, e.g.
  # the array <tt>[?*, ?/, ?%]</tt> matches
  # <b>*</b>, <b>/</b> or <b>%</b>, while the range
  # <tt>(?a..?z)</tt> matches lowercase alphas.
  #
  # e.g.
  #
  # init_char_sets({
  # :alphanum => /[A-Z0-9]/,
  # :underscore => [?_],
  # :lower_vowel => [?a, ?e, ?i, ?o, ?u],
  # :special => (0..31)
  # },
  # :end_line)
  #
  # It is the responsibility of the caller to ensure that the
  # match sets for each column are mutually exclusive.
  #
  # If a 'catch-all' set is needed then it is not necessary
  # to build the set of all characters not already matched.
  # Instead, see #init_state_transitions parameter list.
  #
  # Note, the contents of the hash is duplicated and stored
  # internally to avoid any inadvertent corruption from outside.
  def init_char_sets(cs_defs, eof_charset_id = :eof)
    @charsets = {}
    # Make a verbatim copy of the lexer charset columns
    cs_defs.each_pair do
      >charset_id, match|
      @charsets[charset_id] = match.dup # works for array/regexp
    end
    # Add an end-of-file charset column for free
    @charsets[eof_charset_id] = [@lex_eof]
    puts "@charsets =\n#{@charsets.inspect}\n\n" if $DEBUG == 1
  end

  # Initialize the state transition table that will be used by the
  # finite state machine to convert incoming characters to tokens.
  #
  # _st_::
  # a hash that defines the state transition table to be used
  # (see below).
  # _start_state_::
  # defines the starting state for the finite state machine.
  # _catch_all_charset_id_::
  # defines an optional charset id to be tried if the character
  # currently being analysed matches none of the charsets
  # in the charset table. The default +nil+ ensures that the
  # InvalidLexeme exception is raised if no charsets match.
  #
  # The state transition table hash _st_ maps each valid original
  # state to a hash containing the _rules_ to match when in that
  # state.
  #
  # Each hash entry _rule_ maps one of the character set ids
  # (defined in the call to #init_char_sets) to the _actions_ to be
  # carried out if the current character being analysed by the lexer
  # matches.
  #
  # The _action_ is a hash of distinct actions to be carried out for
  # a match. The following actions are supported:
  #
  # <tt>:next_s => _state_</tt>::
  # sets the finite state machine next state to be _state_ and
  # appends the current character to the lexeme string being
  # prepared, absorbing the current character in the datastream.
  #
  # <tt>:next_s_skip => _state_</tt>::
  # as above but the lexeme string being prepared remains static.
  #
  # <tt>:next_s_backtrack => _state_</tt>::
  # as for _next_s_skip_ above but does not absorb the current
  # character (it will be used for the next state test).
  #
  # <tt>:token => _tok_</tt>::
  # appends a hash containing a single entry to the array of
  # generated tokens, using _tok_ as the key and a copy of the
  # prepared lexeme string as the value.
  #
  # When the end of the datastream is reached, the lexer looks for
  # a match against charset <tt>:eof</tt>.
  #
  # When the performed actions contain no +next_s+... action, the
  # lexer assumes that a final state has been reached and returns
  # the accumulated array of tokens up to that point.
  #
  # e.g.
  #
  # init_state_transitions({
  # :s1 => {:alpha => {next_s = :s2},
  # :period => {:token => :tok_period}},
  # :s2 => {:alphanum => {next_s = :s2},
  # :underscore => {next_s_skip == :s2},
  # :period => {next_s_backtrack = :s1}
  # :eof => {}}, // final state, return tokens
  # }, :s1, :other_chars)
  #
  # Note, the contents of the hash is duplicated and stored
  # internally to avoid any inadvertent corruption from outside.
  def init_state_transitions(st, start_state = :s_start,
                             catch_all_charset_id = nil)
    @start_state = start_state
    @others_key = catch_all_charset_id
    @s_trans = {}
    # Make a verbatim copy of the state transition table
    st.each_pair do
      >orig_state, lexer_rules|
      @s_trans[orig_state] = state_rules = {}
      lexer_rules.each_pair do
        >lexer_charset, lexer_actions|
        state_rules[lexer_charset] = cur_actions = {}
        lexer_actions.each_pair do
          >action, new_val|
          cur_actions[action] = new_val
        end
      end
    end
    puts "@s_trans =\n#{@s_trans.inspect}\n\n" if $DEBUG == 1
  end

  # Tokenize the datastream in _str_ according to the specific
  # character set and state transition table initialized through
  # #init_char_sets and #init_state_transitions.
  #
  # Returns an array of token elements where each element is
  # a pair of the form:
  #
  # [:token_name, "extracted lexeme string"]
  #
  # The end token marker [:tok_end, nil] is appended to the end
  # of the result on success, e.g.
  #
  # tokenize(str)
  # # => [[:tok_a, "123"], [:tok_b, "abc"], [:tok_end, nil]]
  #
  # Raises the LexerFailure exception if no matching state
  # transition is found for the current state and character.
  def tokenize(str)
    state = @start_state
    lexeme = ''
    tokens =
    # Append our end of file marker to the string to be tokenized
    str += "%c" % @lex_eof
    str.each_byte do
      >char>
      char_as_str = "%c" % char
      loop do
        match = @charsets.find {
          >id, match|
          (match.kind_of? Regexp) ? \
            (match =~ char_as_str) : (match.include? char)
          } || [@others_key, @charsets[@others_key]] or \
            raise InvalidLexeme

        # Look for the action matching our current state and the
        # character set id for our current char.
        action = @s_trans[state][match.first] or raise LexerFailure

        # If found, action contains our hash of actions, e.g.
        # {:next_s_backtrack => :s_operator, :token => :tok_int}
        puts "#{char==@lex_eof?'<eof>':char_as_str}: " \
          "#{state.inspect} - #{action.inspect}" if $DEBUG == 1

        # Build up the lexeme unless we're backtracking or skipping
        lexeme << char_as_str if action.has_key? :next_s

        tokens << [action[:token], lexeme.dup] && lexeme = '' if \
          action.has_key? :token

        # Set the next state, or - when there is no specified next
        # state - we've finished, so return the tokens.
        state = action[:next_s] || action[:next_s_skip] ||
          action[:next_s_backtrack] or
             return tokens << [:tok_end, nil]

        break unless action.has_key? :next_s_backtrack
      end
    end
    tokens
  end
end

if $0 == __FILE__
  eval DATA.read, nil, $0, __LINE__+4
end

__END__

require 'test/unit'

class TC_LexerMW < Test::Unit::TestCase
  def test_simple
    @lexer = LexerMW.new()

    @char_sets = {
        :letter => (?a..?z),
        :digit => (/\d/),
        :space => [?\s, ?_]
      }

    @lexer.init_char_sets(@char_sets)

    @st = {
        :extract_chars => {
          :letter => {:next_s => :extract_chars},
          :digit => {:next_s => :extract_chars},
          :space => {:next_s_skip => :extract_chars,
                       :token => :tok_text},
          :eof => {:token => :tok_text}
          },
        :extract_alpha => {
          :letter => {:next_s => :extract_alpha},
          :digit => {:next_s_backtrack => :extract_num,
                       :token => :tok_alpha},
          :space => {:next_s_skip => :extract_alpha,
                       :token => :tok_alpha},
          :other => {:next_s_skip => :extract_alpha},
          :eof_exit => {}
          },
        :extract_num => {
          :letter => {:next_s_backtrack => :extract_alpha,
                       :token => :tok_num},
          :digit => {:next_s => :extract_num},
          :space => {:next_s_skip => :extract_num},
          :others => {:next_s_skip => :extract_alpha,
                       :token => :tok_num}
          }
      }
    @lexer.init_state_transitions(@st, :extract_chars)
    assert_equal [
        [:tok_text, "123"], [:tok_text, "45"],
        [:tok_text, "6"], [:tok_text, "78"],
        [:tok_text, "abcd"], [:tok_text, "efghi"],
        [:tok_text, "jklmn"], [:tok_end, nil]
      ], @lexer.tokenize("123 45 6_78 abcd efghi_jklmn")

    @lexer = LexerMW.new(?$)
    @lexer.init_char_sets(@char_sets, :eof_exit)
    @lexer.init_state_transitions(@st, :extract_num, :others)
    assert_equal [
        [:tok_num, "12345678"], [:tok_alpha, "abcd"],
        [:tok_alpha, "efghi"], [:tok_num, "445"],
        [:tok_alpha, ""], [:tok_num, "1222"], [:tok_end, nil]
      ], @lexer.tokenize("123 45 6_78 abcd efghi445!12_22!ab$45")

  end
end

Some people started doing the Ruby quiz problems using Haskell, and
this was a perfect opportunity for me to learn some Haskell. So here's
my solution below, in Haskell. It's hard to test the byte code
interpretation but all the expression do evaluate to the correct
values.

If anyone has questions about the Haskell code, please let me know.
I'm just learning it and its really cool!

BTW, I too spent way more time on this than I should have!

(this solution, along with others, can be found on
http://www.haskell.org/haskellwiki/Haskell_Quiz/Bytecode_Compiler)

This solution should work correctly. I was unable to test the byte
codes generated, for obvious reasons. However, all test strings from
the quiz do evaluate to the correct values.

To see the (symbolic) byte codes generated, run generate_tests. To see
the actual byte codes, run compile_tests. To see that the values
produced by each expression match those expected, run eval_tests. The
tests are contained in the variables test1,test2, ..., test6, which
correspond to the six "test_n" methods fouind in the quiz's test
program.

The byte codes aren't optimized. For example, SWAP is never used.
However, they should produce correct results (even for negative and
LCONST/CONST values).

The code below is literate Haskell.

\begin{code}
import Text.ParserCombinators.Parsec hiding (parse)
import qualified Text.ParserCombinators.Parsec as P (parse)
import Text.ParserCombinators.Parsec.Expr
import Data.Bits

-- Represents various operations that can be applied
-- to expressions.
data Op = Plus | Minus | Mult | Div | Pow | Mod | Neg
  deriving (Show, Eq)

-- Represents expression we can build - either numbers or expressions
-- connected by operators.
data Expression = Statement Op Expression Expression
           > Val Integer
           > Empty
  deriving (Show)

-- Define the byte codes that can be generated.
data Bytecode = NOOP | CONST Integer | LCONST Integer
            > ADD
            > SUB
            > MUL
            > POW
            > DIV
            > MOD
            > SWAP
  deriving (Show)

-- Using imported Parsec.Expr library, build a parser for expressions.
expr :: Parser Expression
expr =
  buildExpressionParser table factor
  <?> "expression"
  where
  -- Recognizes a factor in an expression
  factor =
    do{ char '('
          ; x <- expr
          ; char ')'
          ; return x
          }
      <|> number
      <?> "simple expression"
  -- Recognizes a number
  number :: Parser Expression
  number = do{ ds <- many1 digit
              ; return (Val (read ds))
              }
          <?> "number"
  -- Specifies operator, associativity, precendence, and constructor to execute
  -- and built AST with.
  table =
    [[prefix "-" (Statement Mult (Val (-1)))],
      [binary "^" (Statement Pow) AssocRight],
      [binary "*" (Statement Mult) AssocLeft, binary "/" (Statement
Div) AssocLeft, binary "%" (Statement Mod) AssocLeft],
      [binary "+" (Statement Plus) AssocLeft, binary "-" (Statement
Minus) AssocLeft]
       ]
    where
      binary s f assoc
         = Infix (do{ string s; return f}) assoc
      prefix s f
         = Prefix (do{ string s; return f})

-- Parses a string into an AST, using the parser defined above
parse s = case P.parse expr "" s of
  Right ast -> ast
  Left e -> error $ show e

-- Take AST and evaluate (mostly for testing)
eval (Val n) = n
eval (Statement op left right)
        > op == Mult = eval left * eval right
        > op == Minus = eval left - eval right
        > op == Plus = eval left + eval right
        > op == Div = eval left `div` eval right
        > op == Pow = eval left ^ eval right
        > op == Mod = eval left `mod` eval right

-- Takes an AST and turns it into a byte code list
generate stmt = generate' stmt []
       where
               generate' (Statement op left right) instr =
                       let
                               li = generate' left instr
                               ri = generate' right instr
                               lri = li ++ ri
                       in case op of
                               Plus -> lri ++ [ADD]
                               Minus -> lri ++ [SUB]
                               Mult -> lri ++ [MUL]
                               Div -> lri ++ [DIV]
                               Mod -> lri ++ [MOD]
                               Pow -> lri ++ [POW]
               generate' (Val n) instr =
                if abs(n) > 32768
                then instr ++ [LCONST n]
                else instr ++ [CONST n]

-- Takes a statement and converts it into a list of actual bytes to
-- be interpreted
compile s = toBytes (generate $ parse s)

-- Convert a list of byte codes to a list of integer codes. If LCONST or CONST
-- instruction are seen, correct byte representantion is produced
toBytes ((NOOP):xs) = 0 : toBytes xs
toBytes ((CONST n):xs) = 1 : (toConstBytes (fromInteger n)) ++ toBytes xs
toBytes ((LCONST n):xs) = 2 : (toLConstBytes (fromInteger n)) ++ toBytes xs
toBytes ((ADD):xs) = 0x0a : toBytes xs
toBytes ((SUB):xs) = 0x0b : toBytes xs
toBytes ((MUL):xs) = 0x0c : toBytes xs
toBytes ((POW):xs) = 0x0d : toBytes xs
toBytes ((DIV):xs) = 0x0e : toBytes xs
toBytes ((MOD):xs) = 0x0f : toBytes xs
toBytes ((SWAP):xs) = 0x0a : toBytes xs
toBytes [] = []

-- Convert number to CONST representation (2 element list)
toConstBytes n = toByteList 2 n
toLConstBytes n = toByteList 4 n

-- Convert a number into a list of 8-bit bytes (big-endian/network byte order).
-- Make sure final list is size elements long
toByteList :: Bits Int => Int -> Int -> [Int]
toByteList size n =
    if (length bytes) < size
    then (replicate (size - (length bytes)) 0) ++ bytes
    else bytes
    where
      bytes = reverse $ toByteList' n
      -- for negative, and with signed bit and remove negative. Then
continue recursion.
      toByteList' 0 = []
      toByteList' a | a < 0 = (a .&. 511) : toByteList' (abs(a) `shiftR` 8)
                    > otherwise = (a .&. 255) : toByteList' (a `shiftR` 8)

-- All tests defined by the quiz, with the associated values they
should evaluate to.
test1 = [(2+2, "2+2"), (2-2, "2-2"), (2*2, "2*2"), (2^2, "2^2"), (2
`div` 2, "2/2"),
  (2 `mod` 2, "2%2"), (3 `mod` 2, "3%2")]

test2 = [(2+2+2, "2+2+2"), (2-2-2, "2-2-2"), (2*2*2, "2*2*2"), (2^2^2,
"2^2^2"), (4 `div` 2 `div` 2, "4/2/2"),
  (7`mod`2`mod`1, "7%2%1")]

test3 = [(2+2-2, "2+2-2"), (2-2+2, "2-2+2"), (2*2+2, "2*2+2"), (2^2+2, "2^2+2"),
  (4 `div` 2+2, "4/2+2"), (7`mod`2+1, "7%2+1")]

test4 = [(2+(2-2), "2+(2-2)"), (2-(2+2), "2-(2+2)"), (2+(2*2),
"2+(2*2)"), (2*(2+2), "2*(2+2)"),
  (2^(2+2), "2^(2+2)"), (4 `div` (2+2), "4/(2+2)"), (7`mod`(2+1), "7%(2+1)")]

test5 = [(-2+(2-2), "-2+(2-2)"), (2-(-2+2), "2-(-2+2)"), (2+(2 * -2),
"2+(2*-2)")]

test6 = [((3 `div` 3)+(8-2), "(3/3)+(8-2)"), ((1+3) `div` (2 `div`
2)*(10-8), "(1+3)/(2/2)*(10-8)"),
    ((1*3)*4*(5*6), "(1*3)*4*(5*6)"), ((10`mod`3)*(2+2),
"(10%3)*(2+2)"), (2^(2+(3 `div` 2)^2), "2^(2+(3/2)^2)"),
    ((10 `div` (2+3)*4), "(10/(2+3)*4)"), (5+((5*4)`mod`(2+1)),
"5+((5*4)%(2+1))")]

-- Evaluates the tests and makes sure the expressions match the expected values
eval_tests = map eval_tests [test1, test2, test3, test4, test5, test6]
  where
    eval_tests ((val, stmt):ts) =
      let eval_val = eval $ parse stmt
      in
        if val == eval_val
        then "True" : eval_tests ts
        else (stmt ++ " evaluated incorrectly to " ++ show eval_val ++
" instead of " ++ show val) : eval_tests ts
    eval_tests [] = []

-- Takes all the tests and displays symbolic bytes codes for each
generate_tests = map generate_all [test1,test2,test3,test4,test5,test6]
  where generate_all ((val, stmt):ts) = generate (parse stmt) : generate_all ts
        generate_all [] = []

-- Takes all tests and generates a list of bytes representing them
compile_tests = map compile_all [test1,test2,test3,test4,test5,test6]
  where compile_all ((val, stmt):ts) = compile stmt : compile_all ts
        compile_all [] = []

\end{code}

My 2nd solution, it's the same as the first except i stole the pack/unpack stuff.

class Compiler
  def self.compile(s)
    eval(s.gsub(/(\d+)([^\d])/,'\1.bc\2').gsub(/([^\d])(\d+)$/,'\1\2.bc'))
  end
end

class Fixnum
  def bc
    lead,pt = ( (-2**15...2**15)===self ? [1,'n'] : [2,'N'] )
    [lead].concat([self].pack(pt).unpack('C*'))
  end
end

class Array
  {:+ => 10,:- => 11,:* => 12,:** => 13,:confused: => 14,:% => 15}.each do |op,code|
    define_method(op) { |x| self.concat(x).concat([code]) }
  end
end

This quiz has turned out to have some interesting solutions, I'm looking
forward to writing up the summary :slight_smile: For the record, here's the sample
solution I made when I suggested the quiz to James - it takes a similar
approach to some of the other solutions in letting Ruby's parser do the
heavy lifting. I pulled _why's Sandbox into the mix again to let me keep
dangerously modified core classes on a tight rein.

# --- compiler.rb
require 'sandbox'

class Compiler
  class << self
    def sb
      unless @sb
        @sb = Sandbox.new

        @sb.eval <<-EOC
          class Object
            private
            def ldconsts(o)
              if (o > -32769) && (o < 32768)
                [].push(0x01, *[o].pack('n').unpack('C*'))
              else
                [].push(0x02, *[o].pack('N').unpack('C*'))
              end
            end
          end

          class Fixnum
            def +(o)
              if o.is_a? Array
                o.push(*ldconsts(self)).push(0x0a)
              else
                ldconsts(self).push(*ldconsts(o)).push(0x0a)
              end
            end

            def -(o)
              if o.is_a? Array
                o.push(*ldconsts(self)).push(0xa0, 0x0b)
              else
                ldconsts(self).push(*ldconsts(o)).push(0x0b)
              end
            end

            def *(o)
              if o.is_a? Array
                o.push(*ldconsts(self)).push(0x0c)
              else
                ldconsts(self).push(*ldconsts(o)).push(0x0c)
              end
            end

            def **(o)
              if o.is_a? Array
                o.push(*ldconsts(self)).push(0xa0, 0x0d)
              else
                ldconsts(self).push(*ldconsts(o)).push(0x0d)
              end
            end

            def /(o)
              if o.is_a? Array
                o.push(*ldconsts(self)).push(0xa0, 0x0e)
              else
                ldconsts(self).push(*ldconsts(o)).push(0x0e)
              end
            end

            def %(o)
              if o.is_a? Array
                o.push(*ldconsts(self)).push(0xa0, 0x0f)
              else
                ldconsts(self).push(*ldconsts(o)).push(0x0f)
              end
            end
          end

          class Array
            def +(o)
              if o.is_a? Array
                o.push(*self).push(0x0a)
              else
                self.push(*ldconsts(o)).push(0x0a)
              end
            end

            def -(o)
              if o.is_a? Array
                o.push(*self).push(0xa0, 0x0b)
              else
                self.push(*ldconsts(o)).push(0x0b)
              end
            end

            def *(o)
              if o.is_a? Array
                o.push(*self).push(0x0c)
              else
                self.push(*ldconsts(o)).push(0x0c)
              end
            end

            def **(o)
              if o.is_a? Array
                o.push(*self).push(0xa0, 0x0d)
              else
                self.push(*ldconsts(o)).push(0x0d)
              end
            end

            def /(o)
              if o.is_a? Array
                o.push(*self).push(0xa0, 0x0e)
              else
                self.push(*ldconsts(o)).push(0x0e)
              end
            end

            def %(o)
              if o.is_a? Array
                o.push(*self).push(0xa0, 0x0f)
              else
                self.push(*ldconsts(o)).push(0x0f)
              end
            end
          end
        EOC
      end

      @sb
    end

    def compile(code)
      [*sb.eval(code)]
    end
  end
end

My solution in Haskell - and this time it actually works. The previous
implementation didn't work well with negative numbers and CONST/LCONST
weren't generating correctly.

The code is "literate" haskell, which means it must be saved in a file
with "lhs" extension to run under WinHugs. To test the generated byte
codes, run "interpret_tests" after loading the file. Other functions
which demonstrate what is generated are:

  compile_tests - Spits out byte codes for all test expressions
  generate_tests - Spits out symbolic byte codes for all test expressions
  evaluate_tests - Evaluates ASTs generated (not bytecode) for all
test expressions.

This solution is also posted at

http://www.haskell.org/haskellwiki/Haskell_Quiz/Bytecode_Compiler/Solution_Justin_Bailey

Thanks again for a great quiz!

Justin

\begin{code}
import Text.ParserCombinators.Parsec hiding (parse)
import qualified Text.ParserCombinators.Parsec as P (parse)
import Text.ParserCombinators.Parsec.Expr
import Data.Bits
import Data.Int

-- Represents various operations that can be applied
-- to expressions.
data Op = Plus | Minus | Mult | Div | Pow | Mod | Neg
  deriving (Show, Eq)

-- Represents expression we can build - either numbers or expressions
-- connected by operators. This structure is the basis of the AST built
-- when parsing
data Expression = Statement Op Expression Expression
           > Val Integer
           > Empty
  deriving (Show)

-- Define the byte codes that can be generated.
data Bytecode = NOOP | CONST Integer | LCONST Integer
            > ADD
            > SUB
            > MUL
            > POW
            > DIV
            > MOD
            > SWAP
  deriving (Show)

-- Using imported Parsec.Expr library, build a parser for expressions.
expr :: Parser Expression
expr =
  buildExpressionParser table factor
  <?> "expression"
  where
  -- Recognizes a factor in an expression
  factor =
    do{ char '('
          ; x <- expr
          ; char ')'
          ; return x
          }
      <|> number
      <?> "simple expression"
  -- Recognizes a number
  number :: Parser Expression
  number = do{ ds <- many1 digit
              ; return (Val (read ds))
              }
          <?> "number"
  -- Specifies operator, associativity, precendence, and constructor to execute
  -- and built AST with.
  table =
    [[prefix "-" (Statement Mult (Val (-1)))],
      [binary "^" (Statement Pow) AssocRight],
      [binary "*" (Statement Mult) AssocLeft, binary "/" (Statement
Div) AssocLeft, binary "%" (Statement Mod) AssocLeft],
      [binary "+" (Statement Plus) AssocLeft, binary "-" (Statement
Minus) AssocLeft]
       ]
    where
      binary s f assoc
         = Infix (do{ string s; return f}) assoc
      prefix s f
         = Prefix (do{ string s; return f})

-- Parses a string into an AST, using the parser defined above
parse s = case P.parse expr "" s of
  Right ast -> ast
  Left e -> error $ show e

-- Take AST and evaluate (mostly for testing)
eval (Val n) = n
eval (Statement op left right)
        > op == Mult = eval left * eval right
        > op == Minus = eval left - eval right
        > op == Plus = eval left + eval right
        > op == Div = eval left `div` eval right
        > op == Pow = eval left ^ eval right
        > op == Mod = eval left `mod` eval right

-- Takes an AST and turns it into a byte code list
generate stmt = generate' stmt []
       where
               generate' (Statement op left right) instr =
                       let
                               li = generate' left instr
                               ri = generate' right instr
                               lri = li ++ ri
                       in case op of
                               Plus -> lri ++ [ADD]
                               Minus -> lri ++ [SUB]
                               Mult -> lri ++ [MUL]
                               Div -> lri ++ [DIV]
                               Mod -> lri ++ [MOD]
                               Pow -> lri ++ [POW]
               generate' (Val n) instr =
                if abs(n) > 32768
                then LCONST n : instr
                else CONST n : instr

-- Takes a statement and converts it into a list of actual bytes to
-- be interpreted
compile s = toBytes (generate $ parse s)

-- Convert a list of byte codes to a list of integer codes. If LCONST or CONST
-- instruction are seen, correct byte representantion is produced
toBytes ((NOOP):xs) = 0 : toBytes xs
toBytes ((CONST n):xs) = 1 : (toConstBytes (fromInteger n)) ++ toBytes xs
toBytes ((LCONST n):xs) = 2 : (toLConstBytes (fromInteger n)) ++ toBytes xs
toBytes ((ADD):xs) = 0x0a : toBytes xs
toBytes ((SUB):xs) = 0x0b : toBytes xs
toBytes ((MUL):xs) = 0x0c : toBytes xs
toBytes ((POW):xs) = 0x0d : toBytes xs
toBytes ((DIV):xs) = 0x0e : toBytes xs
toBytes ((MOD):xs) = 0x0f : toBytes xs
toBytes ((SWAP):xs) = 0x0a : toBytes xs
toBytes [] = []

-- Convert number to CONST representation (2 element list)
toConstBytes n = toByteList 2 n
toLConstBytes n = toByteList 4 n

-- Convert a number into a list of 8-bit bytes (big-endian/network byte order).
-- Make sure final list is size elements long
toByteList :: Bits Int => Int -> Int -> [Int]
toByteList size n = reverse $ take size (toByteList' n)
    where
      toByteList' a = (a .&. 255) : toByteList' (a `shiftR` 8)

-- All tests defined by the quiz, with the associated values they
should evaluate to.
test1 = [(2+2, "2+2"), (2-2, "2-2"), (2*2, "2*2"), (2^2, "2^2"), (2
`div` 2, "2/2"),
  (2 `mod` 2, "2%2"), (3 `mod` 2, "3%2")]

test2 = [(2+2+2, "2+2+2"), (2-2-2, "2-2-2"), (2*2*2, "2*2*2"), (2^2^2,
"2^2^2"), (4 `div` 2 `div` 2, "4/2/2"),
  (7`mod`2`mod`1, "7%2%1")]

test3 = [(2+2-2, "2+2-2"), (2-2+2, "2-2+2"), (2*2+2, "2*2+2"), (2^2+2, "2^2+2"),
  (4 `div` 2+2, "4/2+2"), (7`mod`2+1, "7%2+1")]

test4 = [(2+(2-2), "2+(2-2)"), (2-(2+2), "2-(2+2)"), (2+(2*2),
"2+(2*2)"), (2*(2+2), "2*(2+2)"),
  (2^(2+2), "2^(2+2)"), (4 `div` (2+2), "4/(2+2)"), (7`mod`(2+1), "7%(2+1)")]

test5 = [(-2+(2-2), "-2+(2-2)"), (2-(-2+2), "2-(-2+2)"), (2+(2 * -2),
"2+(2*-2)")]

test6 = [((3 `div` 3)+(8-2), "(3/3)+(8-2)"), ((1+3) `div` (2 `div`
2)*(10-8), "(1+3)/(2/2)*(10-8)"),
    ((1*3)*4*(5*6), "(1*3)*4*(5*6)"), ((10`mod`3)*(2+2),
"(10%3)*(2+2)"), (2^(2+(3 `div` 2)^2), "2^(2+(3/2)^2)"),
    ((10 `div` (2+3)*4), "(10/(2+3)*4)"), (5+((5*4)`mod`(2+1)),
"5+((5*4)%(2+1))")]

-- Evaluates the tests and makes sure the expressions match the expected values
eval_tests = concat $ map eval_tests [test1, test2, test3, test4, test5, test6]
  where
    eval_tests ((val, stmt):ts) =
      let eval_val = eval $ parse stmt
      in
        if val == eval_val
        then ("Passed: " ++ stmt) : eval_tests ts
        else ("Failed: " ++ stmt ++ "(" ++ show eval_val ++ ")") : eval_tests ts
    eval_tests [] = []

-- Takes all the tests and displays symbolic bytes codes for each
generate_tests = concat $ map generate_all [test1,test2,test3,test4,test5,test6]
  where generate_all ((val, stmt):ts) = (stmt, generate (parse stmt))
: generate_all ts
        generate_all [] = []

-- Takes all tests and generates a list of bytes representing them
compile_tests = concat $ map compile_all [test1,test2,test3,test4,test5,test6]
  where compile_all ((val, stmt):ts) = (stmt, compile stmt) : compile_all ts
        compile_all [] = []

interpret_tests = concat $ map f' [test1, test2, test3, test4, test5, test6]
  where
    f' tests = map f'' tests
    f'' (expected, stmt) =
      let value = fromIntegral $ interpret [] $ compile stmt
      in
        if value == expected
        then "Passed: " ++ stmt
        else "Failed: " ++ stmt ++ "(" ++ (show value) ++ ")"

fromBytes n xs =
  let int16 = (fromIntegral ((fromIntegral int32) :: Int16)) :: Int
      int32 = byte xs
      byte xs = foldl (\accum byte -> (accum `shiftL` 8) .|. (byte))
(head xs) (take (n - 1) (tail xs))
  in
    if n == 2
    then int16
    else int32

interpret [] [] = error "no result produced"
interpret (s1:s) [] = s1
interpret s (o:xs) | o < 10 = interpret ((fromBytes (o*2) xs):s) (drop (o*2) xs)
interpret (s1:s2:s) (o:xs)
  > o == 16 = interpret (s2:s1:s) xs
  > otherwise = interpret (((case o of 10 -> (+); 11 -> (-); 12 ->
(*); 13 -> (^); 14 -> div; 15 -> mod) s2 s1):s) xs

\end{code}

Perl's regex engine can do this, as can Ruby 1.9's Oniguruma engine. Both allow recursive definitions, which is what it takes.

James Edward Gray II

···

On Nov 5, 2006, at 9:43 AM, Wilson Bilkovich wrote:

Also, helpful tip I'd like to send back to my past self.. Regular
expressions aren't powerful enough to find matched parentheses.

Did you write that language in Ruby? I'm just curious.

James Edward Gray II

···

On Nov 5, 2006, at 11:18 AM, Dennis Ranke wrote:

Here is my solution, a simple recursive descent parser. It's a bit more code than is strictly necessary because it is loosely modeled after a parser for a real language I have written recently.

Good catch. Here's an updated copy of mine that:
A) Steals your cool C* unpack trick. :slight_smile:
B) Gets rid of a temporary variable and a couple of 'pop' loops in
exchange for some more golf.
C) Avoids re-initializing the hashes for improved performance.
D) Passes your new test_02a

class Compiler
  def self.compile(input)
    @bytecodes ||= {'+' => 0x0a, '-' => 0x0b, '*' => 0x0c, '**' =>
0x0d, '/' => 0x0e, '%' => 0x0f}
    encode postfix(input)
  end

  def self.encode(tokens)
    tokens.collect do |token|
      number = token =~ /\-?\d+/ ? token.to_i : nil
      if (-32768..32767).include?(number)
        [0x01] + [number].pack('n').unpack('C*')
      elsif !number.nil? # long
        [0x02] + [number].pack('N').unpack('C*')
      else
        @bytecodes[token]
      end
    end.flatten
  end

  def self.postfix(infix)
    stack, stream, last = , , nil
    tokens = infix.scan(/\d+|\*\*|[-+*\/()%]/)
    tokens.each_with_index do |token,i|
      case token
      when /\d+/; stream << token
      when *@bytecodes.keys
        if token == '-' and last.nil? || (last =~ /\D/ && tokens[i+1] =~ /\d/)
          tokens[i+1] = "-#{tokens[i+1]}"
        else
          stream << stack.pop while stack.any? && preceded?(stack.last, token)
          stack << token
        end
      when '('; stack << token
      when ')'; (stream += stack.slice!(stack.rindex('('),
stack.size).reverse).pop
      end
      last = token
    end
    stream += stack.reverse
  end

  def self.preceded?(last, current)
    @ops ||= {'+' => 1, '-' => 1, '%' => 2, '/' => 2, '*' => 2, '**'
=> 3, '(' => 0, ')' => 0}
    @ops[last] >= @ops[current] && current != '**' # right associative mayhem!
  end
end

···

On 11/5/06, Daniel Martin <martin@snowplow.org> wrote:

Note that the creator of this quiz left out one important case from
their tests:

  def test_02a
    assert_equal [2**1**2], Interpreter.new(Compiler.compile('2**1**2')).run
  end

This tests that your compiler is properly making **
right-associative. Some solutions already posted fail this test.

2**2**2 was an unfortunate test case to choose, since 2**4 == 4**2.

Note that the creator of this quiz left out one important case from
their tests:

  def test_02a
    assert_equal [2**1**2], Interpreter.new(Compiler.compile('2**1**2')).run
  end

This tests that your compiler is properly making **
right-associative. Some solutions already posted fail this test.

Phew - I expected I'd left out far more than just one important case :).
Seriously though, good catch - thanks. It looks like all the solutions
so far are passing it now.

2**2**2 was an unfortunate test case to choose, since 2**4 == 4**2.

Ahem. Oops. :slight_smile:

···

On Mon, 2006-11-06 at 11:27 +0900, Daniel Martin wrote:

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

Changed my previous solution to use some of the things others have used that need to be pounded into my skull like \d in regular expression and pack/unpack. Also added unary + and cleaned up code.

···

############
require 'interp'

module Compiler

   # compile expression into bytecode array
   def Compiler.compile(s)
     stack = []
     eval(s.gsub(/(\d+)/, 'Value.new(stack, \1)'))
     stack
   end

   class Value
     attr_reader :value # constant value or nil for on stack
     ON_STACK = nil

     def initialize(stack, value)
       @stack = stack
       @value = value
     end

     # generate code for each binary operator
     {'+' => Interpreter::Ops::ADD,
      '-' => Interpreter::Ops::SUB,
      '*' => Interpreter::Ops::MUL,
      '**'=> Interpreter::Ops::POW,
      '/' => Interpreter::Ops::DIV,
      '%' => Interpreter::Ops::MOD}.each do |operator, byte_code|
        Value.module_eval <<-OPERATOR_CODE
         def #{operator}(rhs)
           push_if_value(@value)
           push_if_value(rhs.value)
           # swap stack items if necessary
           #{if operator != "+"
               "@stack << Interpreter::Ops::SWAP if rhs.value == nil &&
                                                    @value != nil"
             end}
           @stack << #{byte_code}
           Value.new(@stack, ON_STACK)
         end
        OPERATOR_CODE
     end

     def -@
       if @value != ON_STACK
         push_if_value(-@value)
       else
         push_if_value(0)
         @stack << Interpreter::Ops::SWAP << Interpreter::Ops::SUB
       end
       Value.new(@stack, ON_STACK)
     end

     def +@
       push_if_value(@value)
       Value.new(@stack, ON_STACK)
     end

     def push_if_value(value)
       if value != ON_STACK
         if (-32768..32767).include?(value)
           @stack << Interpreter::Ops::CONST
           @stack.concat([value].pack("n").unpack("C*"))
         else
           @stack << Interpreter::Ops::LCONST
           @stack.concat([value].pack("N").unpack("C*"))
         end
       end
     end

   end
end

#!/usr/bin/env ruby

···

#
# compiler.rb - Byte-code compiler for simple arithmetic expressions
#
# Lou Scoras <louis.j.scoras@gmail.com>
# Wed Nov 8 20:33 EST 2006
#
# Here's my solution for Rubyquiz 100. Nothing too fancy on this one: I went
# for trying to make the shunting algorith as readable as possile.
#
# As for the parsing, StringScanner is very nice, although it might have been
# overkill for this problem.
#
# Thanks again to Ross and James for another fun quiz.

require 'enumerator'
require 'interp'
require 'optparse'
require 'strscan'

class Token
attr_reader :type, :value

def initialize(value)
   @value = value
end

%w|number lparen rparen op|.each do |a|
   module_eval %{ def #{a}?; false end }
end
end

class Paren < Token
def initialize(value, type)
   super(value)
   @type = type
end
def lparen?; @type == :lparen end
def rparen?; @type == :rparen end
end

class Number < Token
def initialize(value)
   super(value.to_i)
end

def to_bc
   code, fmt = ((-32768..32767).include? value) ? [0x01, 'n'] : [0x02, 'N']
   [code, *[value].pack(fmt).to_enum(:each_byte).to_a]
end
def number?; true end
end

class Op < Token
attr_reader :precedence

CodeTable = [:+, :-, :*, :**, :/, :%].to_enum(:each_with_index).
             inject({}) {|h, (op,i)| h[op] = i + 0x0a; h}

def initialize(value,assoc,prec)
   super(value.to_sym)
   @assoc, @precedence = assoc, prec
end

%w|assoc lassoc rassoc|.each do |a|
   module_eval %{
     def #{a}?
       @assoc == :#{a}
     end
   }
end

def op?; true end

def to_bc
   CodeTable[value]
end
end

class Compiler
class << self

   def compile(exp)
     shunting_yard(exp).collect {|t| t.to_bc }.flatten
   end

   def tokens(i)
     input = StringScanner.new(i)
     until input.eos?
       case
       when t = input.scan(/\d+/) : yield Number.new(t)
       when t = input.scan(/[(]/) : yield Paren.new(t, :lparen)
       when t = input.scan(/[)]/) : yield Paren.new(t, :rparen)
       when t = input.scan(/\*\*/) : yield Op.new(t, :rassoc, 3)
       when t = input.scan(%r<[%/]>) : yield Op.new(t, :lassoc, 2)
       when t = input.scan(%r<[*]>) : yield Op.new(t, :assoc, 2)
       when t = input.scan(%r<[-]>) : yield Op.new(t, :lassoc, 1)
       when t = input.scan(%r<[+]>) : yield Op.new(t, :assoc, 1)
       when input.scan(/\s+/) : # skip ws
       else
         raise RuntimeError, "Parse Error: near '#{input.peek(8)}'"
       end
     end
   end

   def shunting_yard(s)
     stack, queue = [] , []
     last_tok, negate = nil, false # detect unary minus
     tokens(s) do |token|
       case
       when token.number?
         queue << (negate ? Number.new(-token.value) : token)
         negate = false
       when token.op?
         if !last_tok || (last_tok.op? || last_tok.lparen?) &&
                         (token.value == :slight_smile:
           negate = true
         else
           while stack.size > 0 and stack.last.op?
             other_op = stack.last
             if ( token.assoc? || token.lassoc? and
      token.precedence <= other_op.precedence) ||
                 (token.rassoc? and token.precedence < other_op.precedence)
               queue << stack.pop
             else
               break
             end
           end
           stack << token
         end
       when token.lparen?
         stack << token
       when token.rparen?
         while stack.size != 0 and op = stack.pop
           break if op.lparen?
           queue << op
         end
       end
       last_tok = token
     end
     stack.reverse.each do |op|
       queue << op
     end
     queue
   end

   def to_rpn(exp)
     shunting_yard(exp).collect{|t| t.value}.join(' ')
   end

   DCBin = '/usr/bin/dc'

   def dc_eval(exp)
     if File.executable?(DCBin)
       exp = to_rpn(exp)
       IO.popen(DCBin, "w+") do |f|
         f.write(exp.gsub(/\*\*/, '^') + ' p')
         f.close_write
         f.read
       end
     end
   end

end
end

if $0 == __FILE__
opt = OptionParser.new do |opt|
   opt.banner = "Usage: #$0 compile_method"
   opt.separator ''

   opt.on('-c', '--compile [expression]',
           'prints bytecode sequence for [expression]') do |exp|
     p Compiler.compile(exp)
   end

   opt.on('-d', '--dc-eval [expression]',
           'trys to evaluate [expression] using dc(1)') do |exp|
     puts Compiler.dc_eval(exp)
   end

   opt.on('-i', '--interpret [expression]',
       'uses the byte-code interpreter to process [expression]') do |exp|
     puts Interpreter.new(Compiler.compile(exp)).run
   end

   opt.on('-r', '--show-rpn [expression]',
       'prints out an RPN translated version of [expression]') do |exp|
     puts Compiler.to_rpn(exp)
   end

   opt.on('-h', '--help') { puts opt }
end
if ARGV.empty?
   puts opt
else
   opt.parse(ARGV)
end
end