Strings and regex's

jdm · 14 September 2005 03:06

I'm parsing some html and have a table-driven state machine that configures
itself by reading tuples (one per line) of "present state", "pattern to match",
and "next state" with this code:

lineArray=Array.new()
stateTable=Hash.new()

  File.open("stateTable.txt") { |file|
    file.each { |line|
      lineArray=line.scan(/[^\s]+/)
      stateTable[lineArray[0]]=lineArray[1..-1]
    }
  }

I've printed this hash out in several different ways with the same results: the
key-value pairs look as expected (no extraneous spaces, newlines, etc.). Once
the hash is set up, it drives a state machine with this code:

1 state="html"
2 while input=gets() # text lines are the s.m.'s "clock"
3 if input.chomp().length>0 # skip blank lines
4 if stateTable.has_key?(state) # is current state defined by a tuple?
# for now all states are defined
5 if input=~Regexp.new(stateTable[state][0]) # change state if match
6 state=stateTable[state][3]
7 elsif # else complain
8 print("\nline #{$NR}: no match on #{stateTable[state][0]}\n")
9 exit
10 end

11 end # if state in stateTable
12 end # if input.chomp()
13 end # while

I have confirmed multiple times and ways that stateTable["html"][0] contains
"<html>" yet the if on line 4 is never successful even though the first
non-blank line in the input is "<html>". I tried doing it manually by inserting
the following between lines 3 and 4:

if input=~/<html>/ ...

and this worked (moving the state machine to the next state which is "title")
but the problem repeated itself all over again in that state too. So I have no
problem pattern matching with regex literals but can't pattern match with
regex's derived from ostensibly identical strings read from a file.

For the conditional on line 5 I have also tried:
  Regexp.new(Regexp.escape(stateTable[state][0]))
and
  Regexp.new(stateTable[state][0].to_s)
and
  Regexp.new(stateTable[state][0]).match(input) # returned nil
to no avail.

For line 5 I initially had:
if input=~stateTable[state][0]
This didn't work either and generated the following warning:
warning: string=~string will be obsolete; use explicit regexp

I'm using version 1.8.1 (2003-12-25) on Windows (i386-mswin32).

The point of this post is not to get better ways to parse html (but feel free to
suggest them anyway); the point is to find out why I can't read a string from
a file and then use it (as expected) as a regex in a match operator expression.

I humbly await searing insight and enlightenment from the collective (to which
resistance is futile in any case).

David_A_Black3 · 14 September 2005 03:19

Hi --

On Wed, 14 Sep 2005, jdm wrote:

I'm parsing some html and have a table-driven state machine that configures
itself by reading tuples (one per line) of "present state", "pattern to match",
and "next state" with this code:

lineArray=Array.new()
stateTable=Hash.new()

File.open("stateTable.txt") { |file|
   file.each { |line|
     lineArray=line.scan(/[^\s]+/)
     stateTable[lineArray[0]]=lineArray[1..-1]
   }
}

I've printed this hash out in several different ways with the same results: the
key-value pairs look as expected (no extraneous spaces, newlines, etc.). Once
the hash is set up, it drives a state machine with this code:

1 state="html"
2 while input=gets() # text lines are the s.m.'s "clock"
3 if input.chomp().length>0 # skip blank lines

Is it possible that you need chomp! instead of chomp ? I'm not sure
what the regex looks like that you're testing it against later, but if
it doesn't allow a terminal \n then that may be the problem.
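For reference, a minimal sketch of how chomp and chomp! differ, and of how an unanchored pattern treats a trailing newline. The sample string is made up for illustration; nothing here comes from the original poster's files.

```ruby
input = +"<html>\n"            # unary plus gives an unfrozen copy

chomped   = input.chomp        # returns a new string; input is untouched
untouched = input.dup          # still ends with "\n" at this point

first  = input.chomp!          # strips the newline in place, returns the string
second = input.chomp!          # nothing left to strip, so this returns nil

# An unanchored pattern matches whether or not the newline is present.
with_newline = ("<html>\n" =~ Regexp.new("<html>"))
```

Because chomp! returns nil when it changes nothing, an expression such as input.chomp!.length can fail on a line that has no trailing newline. The last line also suggests that a leftover "\n" alone would not stop a pattern like <html> from matching.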

David

--
David A. Black
dblack@wobblini.net

David_A_Black3 · 14 September 2005 03:42

Hi --

On Wed, 14 Sep 2005, jdm wrote:

I'm parsing some html and have a table-driven state machine that configures
itself by reading tuples (one per line) of "present state", "pattern to match",
and "next state" with this code:

I think my chomp theory is probably wrong.

Can you share a line or two of stateTable.txt?
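While waiting for a real sample, here is a sketch that assumes a table line of the form "html <html> title". That layout is a guess based on the description in the original post, not the actual file. It walks through the same scan and slice as the loading code, so the indices are easy to check.

```ruby
line = "html <html> title\n"        # hypothetical table line

tokens = line.scan(/[^\s]+/)        # all whitespace-separated fields
entry  = tokens[1..-1]              # what gets stored under the "html" key

pattern = Regexp.new(entry[0])      # regex built from the string read in
match_position = ("<html>\n" =~ pattern)

next_state_at_1 = entry[1]          # second element of the stored array
next_state_at_3 = entry[3]          # index used on line 6 of the posted code
```

Under this assumption the regex built from the string matches fine, but the stored array has only two elements, so index 3 yields nil. If the real table has the same three-field layout, the state would become nil after the first successful match and has_key? would then be false for every later line. That may or may not be what is happening here.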

David

--
David A. Black
dblack@wobblini.net

Brian_Schroder1 · 14 September 2005 03:42

I can't see the error you describe, but I'll just clean up and debug
your code a bit, maybe that will help you see the problem. I'm typing
this directly into the mail, so beware of any spelling bugs I'll
introduce.

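There is a bug in line 7. You wrote elsif where you meant else, so Ruby
takes the print call on line 8 as the elsif condition. The print is
executed every time the match fails, but it returns nil, so the branch
body (the exit on line 9) is never reached. Therefore you get an error
message but never enter the error branch.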
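The effect can be reproduced in isolation. In this sketch the helper complain is hypothetical and simply stands in for print: it records a message and returns nil, as print does.

```ruby
# Stand-in for print: has a side effect and returns nil.
def complain(log)
  log << "complained"
  nil
end

def step(matched)
  log = []
  if matched
    log << "matched"
  elsif                 # no condition on this line...
    complain(log)       # ...so this call becomes the elsif condition
    log << "exited"     # body: skipped because the condition is nil
  end
  log
end

on_match    = step(true)
on_mismatch = step(false)
```

With a plain else in place of the elsif, both the complaint and the statement after it would run on a mismatch.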

Additionally you allow for only one arrow from each state. I don't
think that was your intention. I changed the code to allow for
multiple state changes.

A state machine is a nice thing, but it won't help you very much with
html, because it can't count. So it can't match opening and closing
tags. And if the input you are processing is normal html, you are not
feeding the sm tokens but lines which can include multiple tokens.
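One possible way to feed tokens rather than whole lines is to split each line into tags and text runs first. This is only a rough sketch with a made-up input line and a deliberately simple pattern; it ignores comments, attributes containing ">", and other real-world HTML.

```ruby
# Split a line into tags ("<...>") and the text between them.
def tokenize(line)
  line.strip.scan(/<[^>]+>|[^<]+/)
end

tokens = tokenize("<html><head><title>Hi</title>\n")
```

Each element of tokens could then be used as one "clock tick" for the state machine instead of the raw line.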

It would be more efficient to create each regexp once, when the table is
loaded, rather than on every match attempt. I.e.

   # Each state maps to a list of possible transitions.
   state_table = Hash.new { |h, k| h[k] = [] }
   StateChange = Struct.new(:pattern, :next_state)

   File.open("stateTable.txt") do |file|
     file.each do |line|
       present_state, pattern, next_state = *line.scan(/[^\s]+/)
       # Compile the pattern once, while loading the table.
       state_table[present_state] <<
         StateChange.new(Regexp.new(pattern), next_state)
     end
   end

   state = "html"
   while input = gets # text lines are the s.m.'s "clock"
     input.strip!
     next if input.empty? # skip blank lines

     raise "State undefined" unless state_table.has_key?(state)

     # Collect every transition whose pattern matches this line.
     next_states = state_table[state].select { |state_change|
       state_change.pattern =~ input
     }

     if next_states.length == 1
       state = next_states[0].next_state
     elsif next_states.empty?
       raise "No match on #{state}: #{input}"
     else
       raise "Too many matches on #{state}: #{input}"
     end
   end # while
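To try the idea without any files, the same approach can be run against in-memory data. Everything in this sketch is hypothetical: the table contents, the input lines, and the names Transition, load_table and run are invented for illustration and are not taken from the original poster's setup.

```ruby
# Hypothetical table: present state, pattern, next state.
TABLE_TEXT = <<~TABLE
  html   <html>   title
  title  <title>  body
  body   <body>   done
TABLE

Transition = Struct.new(:pattern, :next_state)

# Build a hash of state name => list of transitions.
def load_table(text)
  table = Hash.new { |h, k| h[k] = [] }
  text.each_line do |line|
    present, pattern, next_state = line.split
    next if present.nil?                      # skip blank table lines
    table[present] << Transition.new(Regexp.new(pattern), next_state)
  end
  table
end

# Run the machine over the lines and return every state visited.
def run(table, lines, state = "html")
  visited = [state]
  lines.each do |raw|
    input = raw.strip
    next if input.empty?                      # skip blank input lines
    matches = table.fetch(state).select { |t| t.pattern =~ input }
    raise "No match on #{state}: #{input}" if matches.empty?
    raise "Too many matches on #{state}: #{input}" if matches.length > 1
    state = matches.first.next_state
    visited << state
  end
  visited
end

sample_lines = ["<html>\n", "\n", "<title>Test</title>\n", "<body>\n"]
visited = run(load_table(TABLE_TEXT), sample_lines)
```

Using fetch instead of [] means an undefined state raises a KeyError rather than silently creating an empty transition list.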

hth,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

Gavin_Kistner2 · 14 September 2005 12:44

Just an aside - you might be interested in my TagTreeScanner class, just to look at.
http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html

(I still need to finish and upload the version with the nicer DSL setup. But at its heart, it's a state machine running regexp against strings to determine the next state (and build a tree along the way).)

David_A_Black3 · 14 September 2005 03:21

(Don't chain it, though.)

David

On Wed, 14 Sep 2005, David A. Black wrote:

3 if input.chomp().length>0 # skip blank lines

Is it possible that you need chomp! instead of chomp ?

--
David A. Black
dblack@wobblini.net
