Parsing, BNF, TreeTop, GhostWheel,

Hi,

I'm looking for a parser which can be fed with some (A)BNF-style rules.

Gave TreeTop a try
---cut---------------------------------------------------------
grammar MathGrammar
  rule additive
    multitive '+' additive / multitive
  end
  rule multitive
    primary '*' multitive / primary
  end
  rule primary
    '(' additive ')' / number
  end
  rule number
    [1-9] [0-9]*
  end
end
---cut---------------------------------------------------------

as well as GhostWheel
---cut---------------------------------------------------------
MathParser = GhostWheel.build_parser {

  rule( :additive , alt( :multitive,
                          seq( :multitive, '+', :additive )
  ))
  rule( :multitive , alt( seq( :primary, '*', :multitive ),
                          :primary
  ))
  rule( :primary , alt( seq( '(', :additive, ')' ),
                          :number
  ))
  rule( :number , seq( /[1-9]/ , zplus( /0-9/ ) ))

  parser( :main, seq(:additive, eof()) )
}
---cut---------------------------------------------------------

'1+1' parses fine.
However when I change the definition of "additive" from
    multitive '+' additive / multitive
to
    multitive / multitive '+' additive
it fails to parse.

Is this a problem with packrat/PEG parsers in general?
If so, which parser is more appropriate?
It should hand back a parse tree.

Memory consumption is not an issue.

Regards,
  Philipp

Philipp Kempgen wrote:

I'm looking for a parser which can be fed with some (A)BNF-style rules.

...

If so, which parser is more appropriate?
It should hand back a parse tree.

Alternatively I'd appreciate if Regexps could return *all*
occurrences of named capture groups inside repetitions etc.
instead of just the last match for each name. Feasible?

Regards,
  Philipp

(Sent previously, but it didn't appear. Sorry if you get this twice).

Philipp Kempgen wrote:

Is this a problem with packrat/PEG parsers in general?

Yes. PEG parsers are greedy (iterate as far as possible and never
backtrack on that), and also take the first alternative that succeeds,
without trying subsequent alternatives. It's actually quite useful
and natural, once you get the hang of it.

Pure PEGs have no lookahead, as Treetop does. It's often necessary
though, in the light of the above.

Also, your approach will yield the wrong results when you add '-' into
the mix, because of associativity. "1-2-3" will calculate as 1-(2-3).
A better way is to use iteration, see

<http://github.com/nathansobo/treetop/blob/master/examples/lambda_calculus/arithmetic.treetop&gt;

Clifford Heath.

I was trying to do the same thing and asked about repeated named captures.

It seems that using String#scan is the closest anything Ruby has, as the Oniguruma regex engine doesn't support it. I think it's a real shame as named capture groups are really useful.

Regards,
Iain

···

On 12 Aug 2010, at 19:52, Philipp Kempgen wrote:

Philipp Kempgen wrote:

I'm looking for a parser which can be fed with some (A)BNF-style rules.

...

If so, which parser is more appropriate?
It should hand back a parse tree.

Alternatively I'd appreciate if Regexps could return *all*
occurrences of named capture groups inside repetitions etc.
instead of just the last match for each name. Feasible?

Regards,
Philipp

<shameless plug>

You may want to look at: http://kanocc.rubyforge.org/

<http://kanocc.rubyforge.org/&gt;&lt;/shameless plug>

best regards

Christian Surlykke

···

2010/8/17 Clifford Heath <no@spam.please.net>

(Sent previously, but it didn't appear. Sorry if you get this twice).

Philipp Kempgen wrote:

Is this a problem with packrat/PEG parsers in general?

Yes. PEG parsers are greedy (iterate as far as possible and never
backtrack on that), and also take the first alternative that succeeds,
without trying subsequent alternatives. It's actually quite useful
and natural, once you get the hang of it.

Pure PEGs have no lookahead, as Treetop does. It's often necessary
though, in the light of the above.

Also, your approach will yield the wrong results when you add '-' into
the mix, because of associativity. "1-2-3" will calculate as 1-(2-3).
A better way is to use iteration, see

<
treetop/examples/lambda_calculus/arithmetic.treetop at master · nathansobo/treetop · GitHub
>

Clifford Heath.

Philipp Kempgen wrote:

I'm looking for a parser which can be fed with some (A)BNF-style rules.

...

If so, which parser is more appropriate?
It should hand back a parse tree.

Alternatively I'd appreciate if Regexps could return *all*
occurrences of named capture groups inside repetitions etc.
instead of just the last match for each name. Feasible?

As I understand you want /f(o)+/ =~ "foo" to return ["o", "o"] as match for the group (used normal capturing groups for simplicity).

I was trying to do the same thing and asked about repeated named captures.
OSDIR

It seems that using String#scan is the closest anything Ruby has, as the Oniguruma regex engine doesn't support it. I think it's a real shame as named capture groups are really useful.

Regular expression engines generally do only return one match per group - regardless of naming or not naming groups. I'm not up to date with current Perl's regular expressions which are the only ones I can imagine to be crazy enough to provide such a feature. :slight_smile: Otherwise String#scan is indeed the proper tool to get multiple matches. The example above could be done like this

if /f(o+)/ =~ s
   $1.scan /o/ do |match|
     p match
   end
end

or this

if /f(o+)/ =~ s
   matches = $1.scan /o/
   p matches
end

Kind regards

  robert

···

On 13.08.2010 00:50, Iain Barnett wrote:

On 12 Aug 2010, at 19:52, Philipp Kempgen wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Christian Surlykke wrote:

You may want to look at: http://kanocc.rubyforge.org/

Actually, I think Citrus does a better job than Treetop (of which I'm primary maintainer and a serious user). It provides a BNF built on ruby functions and operators, and also a Treetop/PEG-like parser for grammar files as an alternative. A brief look at Kanocc indicates Citrus may be preferable over it also.

Citrus doesn't memoize by default however, you have to enable that. It also doesn't have support for semantic predicates, an important feature (for me) which I added to Treetop. Otherwise it is preferable in most ways, and I think Michael Jackson has done a sterling job. Kudos!

<http://github.com/mjijackson/citrus/&gt;

Clifford Heath.

.Net definitely offers it, which is ironic as none of the .Net devs I used to work with ever used regex.

Scan is a really poor alternative (for this kind of task), I think. Instead of just writing one regex which expresses what I want to do much better, I end up having to write a lot of extra code to get the same effect. Scan's cool for more simple things.

Regards
Iain

···

On 13 Aug 2010, at 11:15, Robert Klemme wrote:

Regular expression engines generally do only return one match per group - regardless of naming or not naming groups. I'm not up to date with current Perl's regular expressions which are the only ones I can imagine to be crazy enough to provide such a feature. :slight_smile:

Robert Klemme wrote:

Philipp Kempgen wrote:

I'm looking for a parser which can be fed with some (A)BNF-style rules.

...

It should hand back a parse tree.

Alternatively I'd appreciate if Regexps could return *all*
occurrences of named capture groups inside repetitions etc.
instead of just the last match for each name. Feasible?

As I understand you want /f(o)+/ =~ "foo" to return ["o", "o"] as match
for the group (used normal capturing groups for simplicity).

FRUIT = '(?<FRUIT> Apple|Banana|Pear|Orange)'
FRUIT_COLLECTION = '(?<FRUIT_COLLECTION> ' << FRUIT << '*)'

re = Regexp.new( '^' << FRUIT_COLLECTION << '$', Regexp::EXTENDED )
re.match( 'PearBananaApple' )[:FRUIT]
=> "Apple"

I would want e.g. matchdata[:FRUIT] to be an Array
[ 'Pear', 'Banana', 'Apple' ]
and not just 'Apple' (the last FRUIT).

Actually something similar to a concrete syntax tree / parse tree
would be even better:

ROOT "PearBananaApple"
  FRUIT_COLLECTION "PearBananaApple"
    FRUIT "Pear"
    FRUIT "Banana"
    FRUIT "Apple"

the Oniguruma regex engine doesn't support it. I think it's a real shame as named capture groups are really useful.

Exactly.
Anyway thanks for your input.

Regards,
  Philipp

···

On 13.08.2010 00:50, Iain Barnett wrote:

On 12 Aug 2010, at 19:52, Philipp Kempgen wrote:

Clifford Heath wrote:

Yes. PEG parsers are greedy (iterate as far as possible and never
backtrack on that), and also take the first alternative that succeeds,
without trying subsequent alternatives. It's actually quite useful
and natural, once you get the hang of it.

Pure PEGs have no lookahead, as Treetop does. It's often necessary
though, in the light of the above.

What I wanted to do is to construct/generate a parser from a bunch
of (A)BNF rules (dozends) from an RFC so backtracking is a
requirement while lookahead is not. The grammar specification for
the parser (/ parser generator) does not need to be exactly like BNF
but should not require me to think about where lookahead is needed
as this manual conversion can be error-prone.
That seems to rule out PEG parsers.

Clifford Heath wrote:

Christian Surlykke wrote:

You may want to look at: http://kanocc.rubyforge.org/

Actually, I think Citrus does a better job than Treetop (of which I'm primary maintainer and a serious user). It provides a BNF built on ruby functions and operators, and also a Treetop/PEG-like parser for grammar files as an alternative. A brief look at Kanocc indicates Citrus may be preferable over it also.

Citrus doesn't memoize by default however, you have to enable that. It also doesn't have support for semantic predicates, an important feature (for me) which I added to Treetop.

<http://github.com/mjijackson/citrus/&gt;

Thanks a lot for the pointers Clifford and Christian!
Will have a look as soon as time permits.

  Philipp

Philipp Kempgen wrote:

Clifford Heath wrote:
What I wanted to do is to construct/generate a parser from a bunch
of (A)BNF rules (dozends) from an RFC so backtracking is a
requirement while lookahead is not. The grammar specification for
the parser (/ parser generator) does not need to be exactly like BNF
but should not require me to think about where lookahead is needed
as this manual conversion can be error-prone.

Most of the BNFs in RFCs cannot be directly automated by any parser
generator. In addition, quite a few of them are slightly incorrect
and/or incomplete, but haven't been fixed because they're always
implemented by humans, who paper over the issues, and introduce new
errors instead :wink:

The upshot is that you're not going to find a direct implementation
that works without thinking.

That seems to rule out PEG parsers.

On the contrary, I believe that a PEG parser might represent the most
direct mapping. PEGs do backtrack, just not on a successful rule or
iteration. You simply have to introduce enough lookahead to prevent
false success or excessive iteration... which is actually pretty easy
once you get the hang of it... the point here is that the *rest* of
the grammar is much more likely to look like the EBNF you started
with. Oh, yeah, you have to refactor any left-recursion too, which can
be difficult.

Clifford Heath.

Clifford Heath wrote:

Philipp Kempgen wrote:

Clifford Heath wrote:
What I wanted to do is to construct/generate a parser from a bunch
of (A)BNF rules (dozends) from an RFC so backtracking is a
requirement while lookahead is not. The grammar specification for
the parser (/ parser generator) does not need to be exactly like BNF
but should not require me to think about where lookahead is needed
as this manual conversion can be error-prone.

Most of the BNFs in RFCs cannot be directly automated by any parser
generator. In addition, quite a few of them are slightly incorrect
and/or incomplete, but haven't been fixed because they're always
implemented by humans, who paper over the issues, and introduce new
errors instead :wink:

The upshot is that you're not going to find a direct implementation
that works without thinking.

That seems to rule out PEG parsers.

On the contrary, I believe that a PEG parser might represent the most
direct mapping. PEGs do backtrack, just not on a successful rule or
iteration. You simply have to introduce enough lookahead to prevent
false success or excessive iteration... which is actually pretty easy
once you get the hang of it... the point here is that the *rest* of
the grammar is much more likely to look like the EBNF you started
with. Oh, yeah, you have to refactor any left-recursion too, which can
be difficult.

Actually, while it is true that PEGs as specificied in Bryan Ford's
paper do not support left recursion, PEGs are usually implemented
using packrat parsers which *can* support direct left recursion just
fine as Alessandro Warth showed in his aptly titled paper "Packrat
Parsers Can Support Left Recursion", although with the caveat that you
lose the linear time complexity guarantee that makes packrat parsing
so compelling in the first place.

jwm

Jörg W Mittag wrote:

Actually, while it is true that PEGs as specificied in Bryan Ford's
paper do not support left recursion, PEGs are usually implemented
using packrat parsers which *can* support direct left recursion just
fine as Alessandro Warth showed in his aptly titled paper "Packrat
Parsers Can Support Left Recursion",

Nice, thanks for the pointer. I'm not brave enough to try to implement
it in Treetop (the builder used for code gen is a bit of a pain), but
it'd certainly be cool to have. Probably easier to add to Citrus, where
there's no code generator in the way.

Clifford Heath.