Generic Parsing Library

(Adam Sanderson) #1

I was wondering if anyone would be interested in, or knows of a generic
parsing library. I am continually faced with reading in bizarre text
files and parsing them. They tend to have regular structures though
(at the whim of the researcher who made them). I'd like to write up some
sort of declarative code to parse these files. There's a lot of room
for reuse.

The data tends to be structured, but not rigorously, and it changes
whenever someone feels like it. It's not hard to parse manually, but
wouldn't it be nice to do a little metaprogramming at the top of a
class and say something like this? (Not a rigorous example.)

class LoopDetector
  one :header, :hash, :start_after=>/^\*+$/, :end_before=>/^\*+$/, :split=>/:\s+/
  many :days, LoopData, :start_after=>:header, :end_before=>/\n\n\n/
end
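
One way the class-level declarations above could be wired up is sketched below. This is only an illustration of the metaprogramming idea: the SectionDSL module, the Section struct, and the empty LoopData placeholder are hypothetical names, and the macros merely record what was declared rather than parse anything.

```ruby
# Hypothetical sketch: class-level `one`/`many` macros that only record
# section declarations. None of these names come from an existing library.
module SectionDSL
  Section = Struct.new(:arity, :name, :type, :options)

  # Sections declared so far, in declaration order.
  def sections
    @sections ||= []
  end

  # Declare a section that appears exactly once in the file.
  def one(name, type, options = {})
    sections << Section.new(:one, name, type, options)
  end

  # Declare a section that repeats.
  def many(name, type, options = {})
    sections << Section.new(:many, name, type, options)
  end
end

class LoopData; end # placeholder for the per-day record class

class LoopDetector
  extend SectionDSL

  one  :header, :hash, :start_after => /^\*+$/, :end_before => /^\*+$/,
                       :split => /:\s+/
  many :days, LoopData, :start_after => :header, :end_before => /\n\n\n/
end

LoopDetector.sections.each do |section|
  puts "#{section.arity} #{section.name}: #{section.options.keys.join(', ')}"
end
```

A real parser would then walk LoopDetector.sections and use the :start_after and :end_before options to slice the input.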

Most of the data can be broken down into:
- Spacer lines
- Hashes
- Tables
- Garbage (no, seriously: many of these files contain completely
pointless information)

Any ideas folks?
  .adam sanderson

Here's one example of the type of data I get to play with (in reality
it goes from 00:00 -> 23:55 for each set of Loop Data, and there are
about 200 sets of Raw Loop Data). For anyone who's interested this is
loop detector data, which measures the amount of traffic on freeways.

···

***********************************
Filename: 0076ON04.cdl
Extracted by: CDR_Auto version 3.31 BETA g
Creation Date: Mar27/05 (Sun)
Creation Time: 20:23:09
    File Type: TEXT
***********************************

ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/01/04 (Thu)

---Raw Loop Data Listing---

  Time Vol Occ Flg nPds
  00:00 5 0.4% 1 15
  00:05 11 1.2% 1 15
  00:10 14 1.2% 1 15
  23:50 3 0.5% 2 15
  23:55 3 0.4% 1 15

ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/02/04 (Fri)

---Raw Loop Data Listing---

  Time Vol Occ Flg nPds
  00:00 0 0.0% 0 0
  00:05 0 0.0% 0 0
  00:10 0 0.0% 0 0
  00:15 0 0.0% 0 0
  23:50 0 0.0% 0 0
  23:55 26 3.8% 2 10
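
For comparison, here is roughly what parsing this sample by hand looks like. It is a minimal sketch, assuming the layout shown above (an asterisk-delimited header, a date line opening each day, then table rows); parse_loop_file and Row are made-up names, the sample is abbreviated, and it needs Ruby 2.3 or later for the squiggly heredoc.

```ruby
# Abbreviated copy of the sample data above.
SAMPLE = <<~DATA
  ***********************************
  Filename: 0076ON04.cdl
  Creation Date: Mar27/05 (Sun)
  ***********************************

  ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
  01/01/04 (Thu)

  ---Raw Loop Data Listing---

    Time Vol Occ Flg nPds
    00:00 5 0.4% 1 15
    00:05 11 1.2% 1 15

  ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
  01/02/04 (Fri)

  ---Raw Loop Data Listing---

    Time Vol Occ Flg nPds
    23:55 26 3.8% 2 10
DATA

Row = Struct.new(:time, :vol, :occ, :flg, :npds)

# Hypothetical hand-written parser for the layout shown in the sample.
def parse_loop_file(text)
  header    = {}
  days      = []
  in_header = false
  current   = nil

  text.lines.map(&:chomp).each do |line|
    case line
    when /\A\*+\z/                        # spacer line toggles the header block
      in_header = !in_header
    when /\A(\d\d\/\d\d\/\d\d) \(\w+\)\z/ # a date line starts a new day
      current = { date: $1, rows: [] }
      days << current
    when /\A\s*(\d\d:\d\d)\s+(\d+)\s+([\d.]+)%\s+(\d+)\s+(\d+)\s*\z/
      current[:rows] << Row.new($1, $2.to_i, $3.to_f, $4.to_i, $5.to_i) if current
    else
      # "Key: value" pairs count only inside the header; station lines,
      # column titles, and blank lines are skipped.
      if in_header && line =~ /:\s+/
        key, value = line.strip.split(/:\s+/, 2)
        header[key] = value
      end
    end
  end

  { header: header, days: days }
end

result = parse_loop_file(SAMPLE)
puts result[:header]["Filename"]
puts result[:days].map { |day| "#{day[:date]}: #{day[:rows].size} rows" }
```

Every format-specific rule lives in a regular expression here, which is the part a declarative library would let you state once and reuse.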

(Kirk Haines) #2

Once upon a time, in a job far, far away, I wrote an ETL (Extract, Transform,
Load) system in Perl. It had a lot of bells and whistles, but the core of it
was an XML language that described data sources, transformations, and data
destinations. The code was "compiled" to Perl to execute it.

It was neat. Ever since I started using Ruby, I've thought that it could be
done better with Ruby.

For starters, the whole notion of an XML language that is parsed and turned
into executable code could be replaced by a domain specific language. It'd
be a beautiful thing, and I bet it could be done in a LOT fewer lines than
when I did it in Perl.

Kirk Haines
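
To make the DSL idea concrete, here is a minimal sketch of a source/transform/destination description written directly in Ruby. It is not the Perl system described above: Pipeline and its source, transform, and destination methods are invented for illustration, and the "data source" is just an array.

```ruby
# Hypothetical mini-DSL: `source`, `transform`, and `destination` are
# invented names, not part of any existing library.
class Pipeline
  # Evaluate the block with the new pipeline as self, so the bare method
  # calls inside the block read like declarations.
  def self.define(&block)
    pipeline = new
    pipeline.instance_eval(&block)
    pipeline
  end

  def initialize
    @transforms = []
  end

  def source(enumerable)
    @source = enumerable
  end

  def transform(&block)
    @transforms << block
  end

  def destination(collector)
    @destination = collector
  end

  # Extract each record, apply every transform in order, load the result.
  def run
    @source.each do |record|
      @destination << @transforms.inject(record) { |value, step| step.call(value) }
    end
    @destination
  end
end

out = []
pipeline = Pipeline.define do
  source(["  5 ", " 11", "14 "])
  transform { |field| field.strip }
  transform { |field| field.to_i }
  destination(out)
end
pipeline.run
p out
```

Because the description is ordinary Ruby, there is no separate parse-and-compile step; the block is the program.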

···

On Tuesday 16 August 2005 3:46 pm, Adam Sanderson wrote:

I was wondering if anyone would be interested in, or knows of a generic
parsing library. I am continually faced with reading in bizarre text
files and parsing them. They tend to have regular structures though
(at the whim of the researcher who made them). I'd like to write up some
sort of declarative code to parse these files. There's a lot of room
for reuse.

(James Edward Gray II) #3

I've just recently been throwing together my own tool for this. I just got done using it in a real-world (paid) project. It's small and really just a chainsaw tool for data mining, but it seems to be a good start. I haven't documented it yet, but here are a couple of examples from my unit tests:

     def test_complex
         path = File.join(File.dirname(__FILE__), "ross_report.txt")
         test = self

         input(path) do
             @state = :skip
             start_skipping_at("\f")
             stop_skipping_at(/\A-[- ]+-\Z/)
             skip(/\A\s*\Z/)
             skip(/--\Z/)

             find_in_skipped(/((?:Period|Week)\s+\d.+?)\s*\Z/) do |period|
                 test.assert_equal("Period 02/2002", period)
             end

             stop_at("*** Selection Criteria ***")

             read do |line|
                 test.assert_match(/\A\s+(?:Sales|Cust|SA)|\A[-\w]+\s+/, line)
             end
         end

         path = File.join(File.dirname(__FILE__), "car_ads.txt")

         data = input(path, "") do
             @state = :skip
             stop_skipping_at("Save Ad")
             skip(/\A\s*\Z/)

             pre { @price = @miles = nil }
             read(/\$([\d,]+\d)/) { |price| @price = price.delete(",").to_i }
             read(/([\d,]*\d)\s*m/) { |miles| @miles = miles.delete(",").to_i }

             read do |ad|
                 if @price and @price < 20_000 and @miles and @miles < 40_000
                     (@ads ||= Array.new) << ad.strip
                 end
             end
         end

         assert_equal([<<END_AD.strip], data.ads)
2003 Chrysler Town & Country LX
      $16,990, green, 21,488 mi, air, pw, power locks, ps, power mirrors,
dual air bags, keyless entry, intermittent wipers, rear defroster, alloy,
pb, abs, cruise, am/fm stereo, CD, cassette, tinted glass
VIN:2C4GP44363R153238, Stock No:C153238, CALL DAN PERKINS AT 1-800-432-6326
END_AD
     end

__END__

The first half of that is parsing the report from Ruby Quiz #17 (http://www.rubyquiz.com/quiz17.html). The second half is parsing a listing of car ads (very unstructured data) looking for cars below a certain price and mileage.

If people think this looks promising, I'll be happy to make it available.

James Edward Gray II

···

On Aug 16, 2005, at 4:46 PM, Adam Sanderson wrote:

I was wondering if anyone would be interested in, or knows of a generic
parsing library.

(Patrick May) #4

Quoting Adam Sanderson <netghost@gmail.com>:

I was wondering if anyone would be interested in, or knows of a
generic parsing library.

http://i.loveruby.net/en/prog/racc.html might be of use.

~ Patrick

(7rans) #5

Hi Adam,

Actually I have written something similar to what you describe, though
it is token based. It may be adaptable to what you describe. Certainly
it could use some tweaking, more testing and any improvements you might
offer. Here's an example of parsing something like XML.

  require 'yaml'

  s = %Q{
  [p]
  This is plain paragraph.
  [t][b]This bold.[b.]This tee'd off.[t.]&tm;
  [p.]
  }

  tokens = []

  t = TokenParser::Token.new( :ONE )
  t.start = lambda { |match| %r{ \[ (.*?) \] }mx }
  t.stop = lambda { |match| %r{ \[ [ ]* (#{resc(match[1])}) (.*?) \. \] }mx }
  tokens << t

  t = TokenParser::UnitToken.new( :TWO )
  t.start = lambda { |match| ; %r{ \& (.*?) \; }x }
  tokens << t

  cp = TokenParser.new( *tokens )
  d = cp.parse( s )
  y d

outputs (don't let this scare you, it's easy to traverse the content)

  --- &id004 !ruby/array:TokenParser::Main
  - "
    "
  - &id002 !ruby/object:TokenParser::Marker
    content:
      - >

        This is plain paragraph.

      - &id001 !ruby/object:TokenParser::Marker
        content:
          - !ruby/object:TokenParser::Marker
            content:
              - This bold.
            inner_range: !ruby/range '36...46'
            match: !ruby/object:MatchData {}
            outer_range: !ruby/range '33...50'
            parent: *id001
            token: &id003 !ruby/object:TokenParser::Token
              key: :ONE
              parser:
              start: !ruby/object:Proc {}
              stop: !ruby/object:Proc {}
          - "This tee'd off."
        inner_range: !ruby/range '33...65'
        match: !ruby/object:MatchData {}
        outer_range: !ruby/range '30...69'
        parent: *id002
        token: *id003
      - !ruby/object:TokenParser::Marker
        content: []
        match: !ruby/object:MatchData {}
        outer_range: !ruby/range '69...73'
        parent: *id002
        token: !ruby/object:TokenParser::UnitToken
          key: :TWO
          parser:
          start: !ruby/object:Proc {}
    inner_range: !ruby/range '4...74'
    match: !ruby/object:MatchData {}
    outer_range: !ruby/range '1...78'
    parent: *id004
    token: *id003

Let me know if you'd like a copy to play with.

T.

(Gavin Kistner) #6

I wrote TagTreeScanner, which can be used to parse text files when the desired output is a hierarchy of nodes and text (i.e. an XML type file).

http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html

···

On Aug 16, 2005, at 3:46 PM, Adam Sanderson wrote:

I was wondering if anyone would be interested in, or knows of a generic
parsing library. I am continually faced with reading in bizarre text
files and parsing them. They tend to have regular structures though
(at the whim of the researcher who made them). I'd like to write up some
sort of declarative code to parse these files. There's a lot of room
for reuse.

(James Edward Gray II) #7

Any good tutorials on Racc hiding on some corner of the net? I would like to learn more about it.

James Edward Gray II

···

On Aug 16, 2005, at 5:31 PM, Patrick May wrote:

http://i.loveruby.net/en/prog/racc.html might be of use.

(Adam Sanderson) #8

It would be interesting to look at one way or another. It looks like
it could be useful for controlling some of the parsing. My old code
for parsing these types of files was in Java, and as with most of my
Java code, I realized that I've been trying to write in ruby all along
;)
  .adam sanderson

(Adam Sanderson) #9

The mention of rockit from above was good too, it looks pretty
compelling. Here's the project:
  http://rockit.sourceforge.net/
It looks pretty interesting. However rockit, racc, and rbison seem to
be somewhat involved for writing parsers. I'm thinking of something
much simpler perhaps. These work great for very common syntaxes where
you have a very large number of documents. I'm thinking more along the
lines of a flexible library for quickly defining many syntaxes with a
limited set of documents.

Then again, I might just need to play with these parsers a bit and see.
  .adam sanderson

(Steven Jenkins) #10

James Edward Gray II wrote:

Any good tutorials on Racc hiding on some corner of the net? I would like to learn more about it.

The Racc distribution includes several sample applications, including a 4-function calculator. If you've ever used yacc, that's probably enough to get you going. If not, Kernighan & Pike's _The Unix Programming Environment_ has a nice expository treatment of yacc in Chapter 8. (Now 22 years old, K&P is still a first-rate technical book.)

Steve

(Patrick Gundlach) #11

Hi,

http://i.loveruby.net/en/prog/racc.html might be of use.

Any good tutorials on Racc hiding on some corner of the net? I would
like to learn more about it.

If there is some interest, I might post a small parser
(using racc/stringscanner) which I am working on right now (the parsing part
is finished). I am a novice with racc, though. But I think it is
pretty straightforward.

Patrick

(Just a simple language, like

set a b

do this using this, that, other
write 'file a', 'file.b.txt'

and the like)
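
As a rough idea of the StringScanner half of such a parser, the sketch below tokenizes the little language shown above. It is not Patrick's code: the tokenize method and the token names are assumptions, and the [type, value] pairs are simply the shape that a Racc-generated parser is conventionally fed from its next_token method.

```ruby
require 'strscan'

# Hypothetical tokenizer for the sample language; token names are invented.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens  = []

  until scanner.eos?
    if scanner.scan(/[ \t]+/)
      # skip horizontal whitespace
    elsif scanner.scan(/\n+/)
      tokens << [:NEWLINE, "\n"]          # statements end at a line break
    elsif scanner.scan(/'([^']*)'/)
      tokens << [:STRING, scanner[1]]     # keep the text between the quotes
    elsif scanner.scan(/,/)
      tokens << [:COMMA, ","]
    elsif (word = scanner.scan(/[\w.]+/))
      tokens << [:WORD, word]
    else
      raise ArgumentError, "unexpected character #{scanner.getch.inspect}"
    end
  end

  tokens
end

tokenize("write 'file a', 'file.b.txt'\n").each { |token| p token }
```

The grammar file would then only have to say how WORD, STRING, COMMA, and NEWLINE tokens combine into statements.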

(James Edward Gray II) #12

I don't want to send out a big announcement message until I get documentation in there, but my parsing library is on RubyForge now:

http://rubyforge.org/projects/input/

It should be easy to figure out how to use it from the unit tests in CVS. I did release a gem, if you want to install it.

James Edward Gray II

···

On Aug 16, 2005, at 5:36 PM, Adam Sanderson wrote:

It would be interesting to look at one way or another.

(Eric Mahurin) #13

Here is an earlier version of what I'm working on:

http://groups-beta.google.com/group/comp.lang.ruby/browse_thread/thread/227313d6ff2ab1fa/8758532b85f2b001?q=bnf&rnum=2#8758532b85f2b001

Hopefully I'll do another release in a couple of weeks (under
the rubyforge project grammar).

Feature-wise, I think what I'm doing is closest to antlr (for
C++/C#/Java/Python). But, I believe it will be more powerful
and easier to use since you write the grammar in the target
programming language (Ruby) and there is no need for a
code-generation phase.
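
To illustrate what "writing the grammar in Ruby" can mean in general, here is a tiny sketch that builds a grammar out of plain lambdas, with no generated code. It does not show the API of the grammar project mentioned above; term, seq, and many are hypothetical helpers.

```ruby
require 'strscan'

# Each parser is a lambda that takes a StringScanner and returns the
# matched value, or nil on failure. All helper names are hypothetical.

# Match a regular expression at the current position.
def term(regex)
  lambda { |scanner| scanner.scan(regex) }
end

# Match every parser in order; rewind and fail if any of them fails.
def seq(*parsers)
  lambda do |scanner|
    start   = scanner.pos
    results = []
    matched = parsers.all? do |parser|
      value = parser.call(scanner)
      results << value unless value.nil?
      !value.nil?
    end
    if matched
      results
    else
      scanner.pos = start
      nil
    end
  end
end

# Match a parser zero or more times.
def many(parser)
  lambda do |scanner|
    results = []
    while (value = parser.call(scanner))
      results << value
    end
    results
  end
end

# Grammar: list := number (comma number)*
number = term(/\d+/)
comma  = term(/\s*,\s*/)
list   = seq(number, many(seq(comma, number)))

scanner = StringScanner.new("1, 2,3")
tree    = list.call(scanner)
p tree
p scanner.eos?
```

The grammar is just Ruby values, so rules can be built, combined, and tested like any other object.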

···

--- Adam Sanderson <netghost@gmail.com> wrote:

The mention of rockit from above was good too, it looks pretty
compelling. Here's the project:
  http://rockit.sourceforge.net/
It looks pretty interesting. However rockit, racc, and rbison seem to
be somewhat involved for writing parsers. I'm thinking of something
much simpler perhaps. These work great for very common syntaxes where
you have a very large number of documents. I'm thinking more along the
lines of a flexible library for quickly defining many syntaxes with a
limited set of documents.

Then again, I might just need to play with these parsers a bit and see.
  .adam sanderson


(Matthew Desmarais) #14

The mention of rockit from above was good too, it looks pretty
compelling. Here's the project:
  http://rockit.sourceforge.net/
It looks pretty interesting. However rockit, racc, and rbison seem to
be somewhat involved for writing parsers. I'm thinking of something
much simpler perhaps. These work great for very common syntaxes where
you have a very large number of documents. I'm thinking more along the
lines of a flexible library for quickly defining many syntaxes with a
limited set of documents.

Then again, I might just need to play with these parsers a bit and see.
  .adam sanderson

Adam,

I've actually worked a bunch with Racc fairly recently. I don't have
the time (or expertise) to write up any tutorials any time soon, but I can
give you a brief description of my experience.

I started out by making incorrect assumptions. Racc does a great job of
generating a parser. The problem (that I misunderstood) is that at its
heart, Racc is a parser-generator and not much more. Racc generates a
parser that will accept or reject a sentence according to the grammar that
it is generated from. Where you go from there is up to you.

I used Racc to generate an Ada parser (without too much trouble). Though
I haven't put too much time into it, I've found the hard part to be
generating an Abstract Syntax Tree that worked for me. It's the AST that
will allow me to manipulate the stuff that I parsed.

Oh geez, I almost forgot. I got started using the Parser Generators
chapter from the Ruby Developer's Guide from Syngress. I think the
chapter may have been written by Robert Feldt (Mr. Rockit, or at least the
one that isn't Herbie Hancock) and I enjoyed it. There's rockit stuff in
there too.

Good luck to you, and if you have any specific questions or problems feel
free to contact me personally.

mattD

(Adam Sanderson) #15

Great.
This looks very much like what I was imagining, or at least some part
of it. I think I'll play with it a little bit today or tonight and see
what I can do. By the way, I never knew you could do:
  ?u or ?a
to get the int codes for a letter, how odd.
  .adam sanderson
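
A note for anyone trying this today: the integer result is Ruby 1.8 behavior, which is what this thread was written against. From Ruby 1.9 on, ?a evaluates to the one-character string "a", and String#ord returns the code. A quick sketch for Ruby 1.9 or later:

```ruby
# On Ruby 1.9 and later, a character literal is a one-character String.
letter = ?a
puts letter.class  # String

# String#ord gives the integer code that ?a itself returned on Ruby 1.8.
code_a = "a".ord
code_u = ?u.ord
puts code_a
puts code_u

# Integer#chr goes the other way.
puts code_a.chr
```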

(James Edward Gray II) #16

Great.
This looks very much like what I was imagining, or at least some part
of it. I think I'll play with it a little bit today or tonight and see
what I can do.

I have ideas for more features and I will document the next release, so hopefully it will be more approachable. I'm using it to do real-world tasks now though, so I think it has potential.

By the way, I never knew you could do:
  ?u or ?a
to get the int codes for a letter, how odd.

Ruby's just full of surprises. ;)

James Edward Gray II

···

On Aug 19, 2005, at 12:41 PM, Adam Sanderson wrote: