Hello:
I have developed a fuzzy parser that uses regular expressions to
partition a text file. It can then be run again on individual partitions
to build a hierarchical representation of the data. It is called
'fuzzy' because a grammar can be defined to look for keywords that
define structure while ignoring other parts of the file.
I think it is a good start, but I was wondering whether there would be
any interest in my releasing it to the RAA. I would appreciate any
comments on it. I am happy with the code for the most part, except that
the initial parse method is clunky in the way it handles file or array
inputs.
The code and test cases are below.
Jim
— fuzzyparser.rb
# Author: Jim Freeze
# ···
# =Description
# FuzzyParser uses regular expressions to partition a file into
# a general hierarchy.
# =Revision History
# JFN 12/17/2002 Birthday
if RUBY_VERSION < "1.7.3"
  raise "fuzzyparser.rb requires Ruby 1.7.3 or later"
end
module FuzzyParser
  # A rule that matches a single line.
  FP_Line = Struct.new(:re, :proc)
  # A rule that matches every line from begin_re through end_re.
  FP_Range = Struct.new(:begin_re, :end_re, :proc)

  class Parser
    def initialize(grammar)
      @grammar = grammar
      @filter_mode = false
    end#initialize

    # Fetch the next line from the caller's block; nil marks end of input.
    def next_line(&block)
      return nil if @line.nil?
      @line = yield
      @line
    end

    # Apply a line rule to the current line; true if it matched.
    def parse_line(g)
      if g.re.match @line
        g.proc.call(@parse_tree, @line)
        true
      else
        false
      end
    end
    private :parse_line

    # Collect the lines of one range into a sub-array of the parse tree.
    def parse_range(g, &block)
      sub = []
      # Flip-flop: true from the line matching begin_re through the
      # line matching end_re.
      while @line =~ g.begin_re ... @line =~ g.end_re
        g.proc.call(sub, @line)
        next_line(&block)
      end
      @parse_tree << sub unless sub.empty?
      sub.size > 1
    end
    private :parse_range

    def _parse(&block)
      loop {
        next_line(&block)
        matched = false
        @grammar.each { |g|
          return if @line.nil?
          #puts " "*20 + g.class.to_s
          #puts " "*20 + g.begin_re.inspect if g.kind_of?(FP_Range)
          #puts " "*20 + g.re.inspect if g.kind_of?(FP_Line)
          case g
          when FP_Line
            if matched = parse_line(g)
              next_line(&block)
              # Block-level retry restarts @grammar.each from the first
              # rule (Ruby 1.8 and earlier; Ruby 1.9+ rejects it).
              retry
            end
          when FP_Range
            if parse_range(g, &block)
              puts " ========== '#{@line.to_s.strip}' =============="
              retry
            end
          else
            raise "Unrecognized parser statement #{g.class}" unless matched
          end#case
        }
        raise "No grammar defined for line #{$.}:\n => #{@line}" unless matched || @filter_mode
      }
    end
    private :_parse

    # Parse either a file (given by name) or an array of lines.
    def parse(file_or_array)
      @line = ""
      @parse_tree = []
      if file_or_array.kind_of?(Array)
        line_no = -1
        _parse {
          next nil if line_no == file_or_array.size - 1
          file_or_array[line_no += 1]
        }
      else
        File.open(file_or_array) { |f|
          _parse {
            next nil if f.eof?
            f.readline
          }
        }
      end
      @parse_tree
    end
  end#class Parser
end#module FuzzyParser
— end fuzzyparser.rb
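On the file-or-array handling in parse: one possible cleanup, sketched here rather than taken from fuzzyparser.rb, is to turn every kind of input into a single line source first, so the parse loop only ever asks for "the next line or nil". The helper names line_source and next_or_nil are hypothetical, and the sketch assumes a newer Ruby with Enumerator support, not the 1.7.3 targeted above.

```ruby
require 'stringio'

# Hypothetical helper (not part of fuzzyparser.rb): wrap an Array, a file
# name, or an IO-like object in an Enumerator that yields one line at a time.
def line_source(input)
  case input
  when Array  then input.each
  when String then File.foreach(input)
  else             input.each_line
  end
end

# Return the next line, or nil once the source is exhausted. nil is the
# end-of-input signal that the parse loop in the listing expects.
def next_or_nil(source)
  source.next
rescue StopIteration
  nil
end

# Usage: the same loop works for an array and for an IO-like object.
[["0\n", "1\n"], StringIO.new("2\n3\n")].each do |input|
  source = line_source(input)
  while (line = next_or_nil(source))
    puts line.strip
  end
end
```

With something like this, parse would need only one call to _parse, with a block that calls next_or_nil on the source.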
— begin tc_fuzzyparser.rb
require 'test/unit'
require 'fuzzyparser/fuzzyparser'

class TC_FuzzyParser < Test::Unit::TestCase
  include FuzzyParser

  INFILE1 = "lines"
  INFILE2 = "lines2"

  GRAMMAR1 = [
    FP_Line.new(/0/, proc { |stack, line| stack << line.strip }),
    FP_Line.new(/1/, proc { |stack, line| stack << line.strip }),
    FP_Line.new(/2/, proc { |stack, line| stack << line.strip }),
    FP_Line.new(/3/, proc { |stack, line| stack << line.strip }),
    FP_Range.new(/4/, /7/,
      proc { |stack, line| stack << line.strip }),
    FP_Line.new(/8/, proc { |stack, line| stack << line.strip }),
    FP_Line.new(/9/, proc { |stack, line| stack << line.strip }),
  ]
  OUT1 = ["0", "1", "2", "3", ["4", "5", "6", "7"], "8", "9"]

  OUT2 = [{"VERSION"=>"6.00"},
    ["BEGIN_HEADER", "line 3", "line 4", "END_HEADER"],
    ["BEGIN_DB", "line 8", "line 9", "line 10", "END_DB"],
    "Sol", "Solitary Line"]
  GRAMMAR2 = [
    FP_Line.new(/^\s*$/, proc { |stack, line| nil }),
    FP_Line.new(/^\s*#/, proc { |stack, line| nil }),
    FP_Line.new(
      /! VERSION/,
      proc { |stack, line|
        m = /^!\s*(VERSION)\s+=\s+(.*)$/.match(line.strip)
        stack << {m[1]=>m[2]} }
    ),
    FP_Range.new(
      /^BEGIN_HEADER/,
      /^END_HEADER/,
      proc { |stack, line| stack << line.strip }
    ),
    FP_Range.new(
      /^BEGIN_DB/,
      /^END_DB/,
      proc { |stack, line| stack << line.strip }
    ),
    FP_Line.new(/^Sol/, proc { |stack, line| stack << line.strip }),
    #FP_Line.new(/^\w+/, proc { |stack, line| stack << line.strip }),
  ]

  GRAMMAR3 = [
    FP_Line.new(/^\s*(line)\s+(\d+)/,
      proc { |stack, line|
        m = /^\s*(line)\s+(\d+)/.match(line.strip)
        stack << {m[1]=>m[2]} }
    ),
    FP_Line.new(/.*/, proc { |stack, line| nil }),
  ]
  OUT3 = [{"line"=>"3"}, {"line"=>"4"}]

  def set_up
  end

  def tear_down
  end

  def test_lines
    p = FuzzyParser::Parser.new(GRAMMAR1)
    o = p.parse(INFILE1)
    assert_equal(OUT1, o)
    #p o
  end#test_lines

  def test_lines2
    p = FuzzyParser::Parser.new(GRAMMAR2)
    o = p.parse(INFILE2)
    assert_equal(OUT2, o)
    #p o
    p_sub = FuzzyParser::Parser.new(GRAMMAR3)
    o_sub = p_sub.parse(o[1])
    assert_equal(OUT3, o_sub)
    #p o_sub
  end#test_lines2
end#class TC_FuzzyParser
—end tc_fuzzyparser.rb
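The FP_Range rule in GRAMMAR1 (/4/ through /7/) depends on Ruby's flip-flop operator in parse_range, which is easy to misread. The following is a small standalone sketch of that operator only; collect_range is a hypothetical name and is not part of fuzzyparser.rb. Unlike parse_range, which stops at the first line outside the range, this version scans every line.

```ruby
# Hypothetical illustration of the three-dot flip-flop used in parse_range.
# Collects every line from a begin_re match through the next end_re match,
# both boundary lines included.
def collect_range(lines, begin_re, end_re)
  sub = []
  lines.each do |line|
    # The condition is false until begin_re matches, then stays true up to
    # and including the line on which end_re matches. With three dots,
    # end_re is not tested on the same line that switched the range on.
    sub << line.strip if (line =~ begin_re)...(line =~ end_re)
  end
  sub
end

lines = ["0\n", "4\n", "5\n", "6\n", "7\n", "8\n"]
p collect_range(lines, /4/, /7/)  # ["4", "5", "6", "7"]
```

This is why OUT1 holds "4" through "7" as one nested array while "8" and "9" fall back to the line rules.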
— here are the data files
cat lines
0
1
2
3
4
5
6
7
8
9
cat lines2
! VERSION = 6.00
BEGIN_HEADER
line 3
line 4
END_HEADER
BEGIN_DB
line 8
line 9
line 10
END_DB
Sol
Solitary Line
#word 1
#word 2
#word 2
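To make the second pass in test_lines2 concrete, here is a simplified stand-in for the line-rule part of the parser, applied to the header partition of lines2. It is only a sketch under assumptions: Rule and apply_line_rules are hypothetical names, range rules are left out, and "the first matching rule wins" is my reading of how _parse behaves, not a guarantee.

```ruby
# Hypothetical, simplified stand-in for FuzzyParser's line rules.
Rule = Struct.new(:re, :action)

# Hand each line to the first rule whose regexp matches it; the rule's
# action decides what, if anything, goes into the resulting tree.
def apply_line_rules(rules, lines)
  tree = []
  lines.each do |line|
    rule = rules.find { |r| r.re.match(line) }
    raise "No grammar defined for: #{line}" if rule.nil?
    rule.action.call(tree, line)
  end
  tree
end

# Same shape as GRAMMAR3: keep "line N" entries, silently drop the rest.
rules = [
  Rule.new(/^\s*(line)\s+(\d+)/, proc { |stack, line|
    m = /^\s*(line)\s+(\d+)/.match(line)
    stack << { m[1] => m[2] }
  }),
  Rule.new(/.*/, proc { |stack, line| nil }),
]

# The header partition produced by the first pass over lines2.
header = ["BEGIN_HEADER", "line 3", "line 4", "END_HEADER"]
p apply_line_rules(rules, header)
```

The result has the same contents as OUT3 in the test: the catch-all /.*/ rule is what makes the second grammar 'fuzzy', since BEGIN_HEADER and END_HEADER match it and are ignored.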
--
Jim Freeze
Excellent time to become a missing person.