A few months back I needed a lexer written in Ruby. All I found on the
RAA was ruby-lex, which is apparently an extension dependent on flex.
So here's what I came up with. I'm posting it here as a snippet (since we
don't have Rubycookbook.org anymore)... there's probably also a
snippets section on the wiki and I'll put it there as well. Basically
you just define a hash of tokens where a regex is the key and a token
type is the value (see the bottom of the listing for a usage example).
Feel free to offer ideas for improvement; that's why I'm posting it.
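For instance, using two of the token classes from the example at the
bottom, a two-entry hash is already enough to get a token stream:

  tokens = { /([0-9]+)/ => Number, /(,)/ => Comma }
  Lexer.new("1, 2, 3", tokens).each_token { |type, str|
    puts "#{type} : #{str}"
  }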
###################Lexer.rb##################
class Lexer
  def initialize(string, tokenHash)
    @string = string
    @tokenHsh = tokenHash
  end

  def each_token
    until @string.empty?
      # skip leading whitespace
      if @string[0..0] =~ /\s/
        @string = @string[1..-1]
        next
      end
      beforeLen = @string.length
      @tokenHsh.each { |re, tokenType|
        # only accept a match anchored at the start of the string
        index = @string.index(re)
        next unless index == 0
        tokenString = $1
        puts "tokenType is: #{tokenType}" if $DEBUG
        puts "tokenString is: #{tokenString}" if $DEBUG
        # consume the lexeme and hand the token to the caller
        @string = @string[tokenString.length..-1]
        yield tokenType, tokenString
        break # rescan from the first regex (and skip whitespace) again
      }
      # nothing matched at the front of the remaining input
      if beforeLen == @string.length
        puts ">>>>>>>> ERROR <<<<<<<<<"
        puts "unknown token-> #@string"
        exit
      end
    end
  end # each_token
end
class Token
  def to_s
    self.class.name # a token prints as its class name
  end
end
# example:
if $0 == __FILE__
  # define some Token classes:
  class OpenParen < Token; end
  class CloseParen < Token; end
  class Word < Token; end
  class Str < Token; end
  class Number < Token; end
  class Comma < Token; end

  # define the token hash (each regex needs one capture group):
  tokens = {
    /(\()/ => OpenParen,
    /(\))/ => CloseParen,
    /([-[:digit:]]+)/ => Number,
    /([A-Za-z][0-9A-Za-z_]+)/ => Word,
    /("[-.0-9A-Za-z_\s+:]+")/ => Str,
    /(,)/ => Comma
  }

  string = '(comma, separated, list (999)("string")(((,)))'
  lexer = Lexer.new(string, tokens)
  puts "Tokenize: #{string}"
  lexer.each_token { |token, str|
    puts "#{token} : #{str}"
  }
end
##################################################################
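One alternative worth comparing: the standard library's strscan does the
anchored matching and offset bookkeeping itself. Here's a rough sketch of
the same token-hash idea on top of StringScanner (the function name is
just illustrative, not part of the class above):

  require 'strscan'

  def each_token(string, token_hash)
    s = StringScanner.new(string)
    until s.eos?
      next if s.skip(/\s+/)             # skip whitespace
      matched = token_hash.find { |re, tokenType|
        if lexeme = s.scan(re)          # scan only matches at the current position
          yield tokenType, lexeme
          true
        end
      }
      raise "unknown token-> #{s.rest}" unless matched
    end
  end

And as a usage note: running the original listing with ruby -d turns on
the $DEBUG output.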