I'm parsing some HTML and have a table-driven state machine that configures
itself by reading tuples (one per line) of "present state", "pattern to match",
and "next state" with this code:
    lineArray=Array.new()
    stateTable=Hash.new()
    File.open("stateTable.txt") { |file|
      file.each { |line|
        lineArray=line.scan(/[^\s]+/)
        stateTable[lineArray[0]]=lineArray[1..-1]
      }
    }
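For reference, here is a minimal self-contained sketch of the same loading step. The table contents and the helper name load_state_table are hypothetical (the real stateTable.txt is not shown), and a StringIO stands in for the file so the snippet runs on its own. It prints every key and value with inspect, which makes stray whitespace or control characters visible if there are any.

```ruby
require 'stringio'

# Hypothetical table contents; the real stateTable.txt is not shown in this post.
SAMPLE_TABLE = "html <html> title\ntitle <title> body\nbody <body> done\n"

# Build state => [pattern, next_state] from an IO, one tuple per line.
def load_state_table(io)
  table = {}
  io.each_line do |line|
    fields = line.split              # split on runs of whitespace, like scan(/[^\s]+/)
    next if fields.empty?            # ignore blank lines
    table[fields[0]] = fields[1..-1]
  end
  table
end

state_table = load_state_table(StringIO.new(SAMPLE_TABLE))

# inspect shows the strings quoted and escaped, so hidden characters stand out.
state_table.each { |state, rest| puts "#{state.inspect} => #{rest.inspect}" }
```

With a table loaded this way, each value is a two-element array: the pattern at index 0 and the next state at index 1.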
I've printed this hash out in several different ways with the same results: the
key-value pairs look as expected (no extraneous spaces, newlines, etc.). Once
the hash is set up, it drives a state machine with this code:
     1  state="html"
     2  while input=gets()                    # text lines are the s.m.'s "clock"
     3    if input.chomp().length>0           # skip blank lines
     4      if stateTable.has_key?(state)     # is current state defined by a tuple?
                                              # for now all states are defined
     5        if input=~Regexp.new(stateTable[state][0])   # change state if match
     6          state=stateTable[state][3]
     7        elsif                           # else complain
     8          print("\nline #{$NR}: no match on #{stateTable[state][0]}\n")
     9          exit
    10        end
    11      end   # if state in stateTable
    12    end     # if input.chomp()
    13  end       # while
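The loop above can also be sketched as a standalone method for experimenting outside the real program. Everything here is an assumption made for illustration: the table contents, the name run_machine, and feeding lines from an array instead of gets. It also assumes the next state is the second element of each stored array (index 1), since the loader keeps only lineArray[1..-1].

```ruby
# Hypothetical table in the shape the loader produces: state => [pattern, next_state]
STATE_TABLE = {
  "html"  => ["<html>",  "title"],
  "title" => ["<title>", "body"],
}

# Returns the list of states visited; raises if a non-blank line does not match.
def run_machine(table, lines, start = "html")
  state = start
  visited = [state]
  lines.each_with_index do |input, index|
    next if input.strip.empty?           # skip blank lines
    break unless table.key?(state)       # stop once a state has no tuple
    pattern, next_state = table[state]
    if input =~ Regexp.new(pattern)      # change state on a match
      state = next_state
      visited << state
    else                                 # otherwise complain
      raise "line #{index + 1}: no match on #{pattern}"
    end
  end
  visited
end

p run_machine(STATE_TABLE, ["<html>\n", "\n", "<title>Test</title>\n"])
```

Returning the visited states makes it easy to see how far the machine got before a mismatch.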
I have confirmed multiple times and in multiple ways that stateTable["html"][0]
contains "<html>", yet the if on line 5 never succeeds even though the first
non-blank line in the input is "<html>". I tried doing it manually by inserting
the following between lines 3 and 4:
    if input=~/<html>/ ...
and this worked (moving the state machine to the next state, which is "title"),
but the problem repeated itself in that state too. So I have no problem
pattern matching with regex literals, but I can't pattern match with regexes
derived from ostensibly identical strings read from a file.
For the conditional on line 5 I have also tried:
    Regexp.new(Regexp.escape(stateTable[state][0]))
and
    Regexp.new(stateTable[state][0].to_s)
and
    Regexp.new(stateTable[state][0]).match(input)   # returned nil
to no avail.
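The variants above can be compared side by side in isolation. This is a minimal sketch in which the string "<html>" stands in for stateTable[state][0] and the input line is made up. It only shows how the forms relate when the string really is "<html>", not what happens in the actual program.

```ruby
pattern = "<html>"        # stands in for stateTable[state][0]
input   = "<html>\n"      # a line in the form gets would return it

from_string  = Regexp.new(pattern)
from_escaped = Regexp.new(Regexp.escape(pattern))
literal      = /<html>/

# inspect quotes and escapes the string, so hidden characters would show up.
puts pattern.inspect

# Compare the compiled patterns with the literal.
puts from_string.source == literal.source
puts from_escaped.source == literal.source

# Try each form of the match.
puts((input =~ from_string).inspect)
puts((input =~ from_escaped).inspect)
puts from_string.match(input).inspect
```

If the string from the file is truly identical to the literal's source, all three forms should behave the same, so any difference points back at the string itself.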
For line 5 I initially had:
    if input=~stateTable[state][0]
This didn't work either and generated the following warning:
    warning: string=~string will be obsolete; use explicit regexp
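A small sketch of the difference between the two forms, using a made-up pattern and input. As far as I know, later Ruby versions (1.9 and up) turn this warning into a TypeError, so the snippet rescues that case; on 1.8 the first match would only warn. The explicit Regexp.new form is the one the warning asks for.

```ruby
pattern = "<html>"          # a String, as read from the table file
input   = "<html>\n"

begin
  input =~ pattern          # String on both sides of =~
  outcome = "no error"
rescue TypeError => e       # assumption: newer interpreters raise here
  outcome = "TypeError: #{e.message}"
end
puts outcome

# Building the Regexp explicitly avoids the String =~ String form entirely.
puts((input =~ Regexp.new(pattern)).inspect)
```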
I'm using version 1.8.1 (2003-12-25) on Windows (i386-mswin32).
The point of this post is not to get better ways to parse HTML (but feel free to
suggest them anyway); the point is to find out why I can't read a string from
a file and then use it (as expected) as a regex in a match operator expression.
I humbly await searing insight and enlightenment from the collective (to which
resistance is futile in any case).