Lexing and Parsing in Ruby

Ooh, this is so, so sweet…

If you know anything about lexical analyzers and parsers, read this
carefully and realize I have done something very very very Good
here…

Here is a lexical analyzer that breaks a Graphviz "Dot" file into
tokens…

digraph a_name {
  a -> b;
  c -> d -> e;
  f -> {a;b;c}
  d;
};


Yields ["digraph", "a_name", "{", "a", "->", "b", ";", …]

def tokenize(string)
  # whitespace | identifier | number | arrow | punctuation
  regexp = %r{ ^ ( \s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9.]+|[0-9.]+) | -> | { | } | ; ) }x
  match_data = regexp.match(string)
  while match_data
    token = match_data[0]
    yield token unless token =~ %r{ ^\s }x   # skip whitespace tokens
    match_data = regexp.match(match_data.post_match)
  end
end
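For example, feeding it the sample above (a minimal sketch; here the Dot
text is held inline in a string rather than read from a file):

dot_source = 'digraph a_name { a -> b; c -> d -> e; f -> {a;b;c} d; };'

tokens = []
tokenize(dot_source) { |token| tokens << token }
p tokens
# => ["digraph", "a_name", "{", "a", "->", "b", ";", ...]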

Now here is the trick…

If we map tokens to characters,

then parsing becomes a matter of…

matching a regular expression again!!

def parse(string)
  # Build up a list of the tokens in an array...
  token_list = []

  # ...and a one-to-one map between the token types and the
  # characters of a string.
  token_string = ''

  tokenize(string) do |token|
    token_list << token
    if token == 'digraph'
      token_string += 'd'
    elsif token =~ %r{ [{};] }x
      token_string += token
    elsif token == '->'
      token_string += '>'
    else
      token_string += 'i'
    end
  end

  # Now our example has become the string "di{i>i;i>i>i;i>{i;i;i}i;}"
  regexp = %r{ ^ d i { ( (i ; | i (> i)+ ; | i > { i (; i)* } )* ) } ;? $ }x
  match_data = regexp.match(token_string)
  raise "Parse error: '#{token_string}' doesn't match '#{regexp.source}'" unless match_data
  return [token_list[1], match_data[1], token_list[match_data.begin(1)...match_data.end(1)]]
end

name, edge_string, edge_list = parse(io)

Now edge_string is "i>i;i>i>i;i>{i;i;i}i;",

name is "a_name",

and edge_list is ["a", "->", "b", ";", "c", …]
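Putting the pieces together on the sample input (a sketch; the io above
presumably holds the Dot text, here inlined as a string):

dot_source = 'digraph a_name { a -> b; c -> d -> e; f -> {a;b;c} d; };'

name, edge_string, edge_list = parse(dot_source)
name         # => "a_name"
edge_string  # => "i>i;i>i>i;i>{i;i;i}i;"
edge_list    # => ["a", "->", "b", ";", "c", "->", "d", "->", "e", ";", ...]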

John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : john.carter@tait.co.nz
New Zealand

A Million Monkeys can inflict worse things than just Shakespeare on
your system.

"John Carter" <john.carter@tait.co.nz> wrote in message
news:Pine.LNX.4.58.0311191821040.6241@parore.tait.co.nz…

> Ooh, this is so, so sweet…
>
> If you know anything about lexical analyzers and parsers, read this
> carefully and realize I have done something very very very Good
> here…

Sorry, I don’t see what’s so very very very good about it. Will you
enlighten me?

> Here is a lexical analyzer that breaks a Graphviz "Dot" file into
> tokens…
>
>   digraph a_name {
>     a -> b;
>     c -> d -> e;
>     f -> {a;b;c}
>     d;
>   };
>
> Yields ["digraph", "a_name", "{", "a", "->", "b", ";", …]

> def tokenize(string)
>   regexp = %r{ ^ ( \s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9.]+|[0-9.]+) | -> | { | } | ; ) }x
>   match_data = regexp.match(string)
>   while match_data
>     token = match_data[0]
>     yield token unless token =~ %r{ ^\s }x
>     match_data = regexp.match(match_data.post_match)
>   end
> end

This can be written more compactly with String#scan:

def tokenize(string)
  string.scan(/[a-z_][a-z_\d]* | [-+]?[.\d]+ | -> | [{};]/xi) do |m|
    yield m
  end
end
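A quick check that the compact version yields the same token stream
(minimal sketch):

tokens = []
tokenize("a -> b;") { |t| tokens << t }
p tokens  # => ["a", "->", "b", ";"]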

> Now here is the trick…
>
> If we map tokens to characters,
>
> then parsing becomes a matter of…
>
> matching a regular expression again!!

I don't know the Graphviz format, but this only works as a real parser
(i.e., one that verifies the input against a grammar) if that file format
does not use nesting constructs of arbitrary depth, i.e., if it has a
regular grammar (as opposed to a context-free grammar). This property of
the language does not change if you replace tokens with shorter tokens.
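For instance, Dot subgraphs can nest, and one level more of nesting than
the fixed pattern anticipates is already rejected (hypothetical inputs,
reusing parse from above):

# One level of nesting is covered by the pattern:
parse('digraph g { a -> {b;c} d; };')            # matches

# A subgraph inside a subgraph is not, even though it is the
# same kind of construct, just one level deeper:
parse('digraph g { a -> {b -> {c;} d;} e; };')   # raises "Parse error ..."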

> def parse(string)
>   # Build up a list of the tokens in an array...
>   token_list = []
>
>   # ...and a one-to-one map between the token types and the
>   # characters of a string.
>   token_string = ''
>
>   tokenize(string) do |token|
>     token_list << token
>     if token == 'digraph'
>       token_string += 'd'
>     elsif token =~ %r{ [{};] }x
>       token_string += token
>     elsif token == '->'
>       token_string += '>'
>     else
>       token_string += 'i'
>     end
>   end

Btw, note that string << 'i' is more efficient than string += 'i'.
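A quick way to see the difference (a micro-benchmark sketch; += builds a
new string on every append, while << modifies the string in place):

require 'benchmark'

n = 100_000
Benchmark.bm(6) do |x|
  x.report('+=') { s = ''; n.times { s += 'i' } }
  x.report('<<') { s = ''; n.times { s << 'i' } }
end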

And you can combine the two steps with a slightly modified regexp, since
you already know what you get during scanning:

def tokenize(string)
  tk = []
  tl = ""

  string.scan( /(digraph) | ([a-z_][a-z_\d]*) | ([-+]?[.\d]+) | (->) |
                ([{};])/xi ) do |di, name, num, arr, noise|
    if di
      tk << di
      tl << 'd'
    elsif name
      tk << name
      tl << 'i'
    elsif num
      tk << num
      tl << 'i'
    elsif arr
      tk << arr
      tl << '>'
    elsif noise
      tk << noise
      tl << noise
    else
      # never here
    end
  end

  [tk, tl]
end

Now you can do

irb(main):144:0> tokens, pattern = tokenize( "digraph a_name {
irb(main):145:1" a -> b;
irb(main):146:1" c -> d -> e;
irb(main):147:1" f -> {a;b;c}
irb(main):148:1" d;
irb(main):149:1" };" )
=> [["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e",
"f", "->", "{", "a", "b", "c", "}", "d", "}"], "di{i>ii>i>ii>{iii}i}"]
irb(main):150:0> tokens
=> ["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e",
"f", "->", "{", "a", "b", "c", "}", "d", "}"]
irb(main):151:0> pattern
=> "di{i>ii>i>ii>{iii}i}"
irb(main):152:0>

Rest as before.

Cheers

robert
