Eric Mahurin wrote:
I still don't know enough about packrat parsing yet. Do you think it's
possible for this type of parser to reach the same performance as LL (or
LALR) parsers, or is there just extra overhead that you need at parse time?
Unless you can analyse the heck out of the source grammar, as ANTLR does,
I think there's always going to be an overhead. Terence Parr's achievement
is to figure out where backtracking might be needed and do it efficiently
only when needed.
Treetop could be improved with some local optimisations, like using Regexp,
and perhaps looking at the rest of the rule to reduce memoization, for
example, which might pick up a 10x improvement or more, but I expect
there'll always be a large overhead over using a hand-crafted parser.
There's a suggestion in the wind to implement "skip" rules, which produce
no syntax nodes, for things like whitespace skipping.
Currently, saying "space*" will generate one node for *each* space - Regexp
would fix that. So even for a PEG parser, Treetop is unnecessarily slow.
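To illustrate the point (a toy Ruby sketch, not Treetop's actual internals): consuming "space*" one character at a time builds an object per space, while a single Regexp match covers the whole run with one object.

```ruby
input = "    foo"  # four leading spaces

# Per-character approach: one tiny "syntax node" object for every space.
nodes = []
i = 0
while input[i] == " "
  nodes << { text: " ", offset: i }
  i += 1
end

# Regexp approach: one MatchData object covering the entire run.
match = /\A */.match(input)

puts nodes.length     # 4 objects built
puts match[0].length  # 4 characters consumed, but only 1 object
```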
It does handle lexer-free parsers (but you can have a lexer) and handles
LL(*) parsing, but with a performance penalty - backtracking. From my
understanding, packrat parsers shine at backtracking by memoizing to
maintain linear performance. I'm wondering if I could use this technique
and still maintain my non-backtracking performance.
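For concreteness, here's a minimal packrat-style sketch in plain Ruby (a toy illustration, not Grammar or Treetop code): each rule's result is memoized per input position, so when an alternative fails and the parser backtracks, re-applying a rule costs a hash lookup instead of a re-parse.

```ruby
class ToyPackrat
  def initialize(input)
    @input = input
    @memo  = Hash.new { |h, k| h[k] = {} }  # rule name => { pos => result }
  end

  # Memoized wrapper: result is [parsed_text, next_pos] or nil on failure.
  def apply(rule, pos)
    @memo[rule].fetch(pos) { @memo[rule][pos] = send(rule, pos) }
  end

  # digits <- [0-9]+
  def digits(pos)
    m = /\A\d+/.match(@input[pos..-1]) or return nil
    [m[0], pos + m[0].length]
  end

  # sum <- digits "+" digits / digits
  def sum(pos)
    left = apply(:digits, pos)
    if left && @input[left[1]] == "+" && (right = apply(:digits, left[1] + 1))
      ["#{left[0]}+#{right[0]}", right[1]]
    else
      apply(:digits, pos)  # second alternative: already memoized, just a lookup
    end
  end
end

p ToyPackrat.new("12+34").apply(:sum, 0)  # => ["12+34", 5]
p ToyPackrat.new("12").apply(:sum, 0)     # => ["12", 2]
```

When the first alternative of `sum` fails, the fallback call to `digits` at the same position hits the memo table rather than re-matching the input, which is what keeps a packrat parser linear.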
The major part of the cost is the construction of so many objects.
If you provide memoize-when-hinted, I think that'd be best.
If you also provided a sweet metagrammar (I find your earlier Grammar
examples make my eyes bleed), I'd be onto it like a shot. I need the
prospect of non-Ruby code generation however...
I'm curious, could you give an example of what makes your eyes bleed? I
would like to improve the usability where possible. The Grammar layer
itself will always be just ruby objects with a ruby DSL. But, I would like
to provide translators from various formats (BNF, PEG, regex, etc) to
Grammar objects. I'm sure I could even give you a faster replacement for
Regexp (using ruby2cext). These translators would simply be Grammars that
build Grammars (or maybe the code, so you could see/edit it).
Also, with Grammar 0.8, I've completely separated the parser generation.
Grammar is just a light-weight layer used to hold the description of the
language you are parsing. The real work is in the "engine" that does parser
generation. There aren't many bounds as to what an "engine" could do with a
Grammar. An engine could:
- generate a parser in another target language (engine might allow an
"action" to be simply a code string)
- directly parse while traversing a Grammar
- generate another type of parser - LALR, packrat ???
- build a Graphviz diagram of a Grammar tree
- "inspect" a Grammar (Grammar#inspect isn't useful because every Grammar
level mainly just has a lambda)
- translate to another format - regex, BNF, PEG
Right now I've implemented only a few engines (some built on each other) -
Ruby (generates very flat LL Ruby parsers), RubyCall (puts method calls in
certain places), Ruby0 (parses directly instead of generating a parser),
and Ruby2CExt (uses ruby2cext to generate a Ruby/C parser).
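As a rough illustration of the engine idea (a hypothetical toy, not Grammar 0.8's actual API): the grammar is plain data, and different engines walk the same tree to do different things with it - here one engine parses directly, and another translates to regex source.

```ruby
Lit = Struct.new(:text)   # match a literal string
Seq = Struct.new(:items)  # match a sequence of sub-grammars

# Engine 1: parse directly by traversing the grammar tree.
# Returns the position after the match, or nil on failure.
def interpret(node, input, pos = 0)
  case node
  when Lit
    input[pos, node.text.length] == node.text ? pos + node.text.length : nil
  when Seq
    node.items.each { |item| pos = interpret(item, input, pos) or return nil }
    pos
  end
end

# Engine 2: translate the same tree into another format (regex source here).
def to_regex_source(node)
  case node
  when Lit then Regexp.escape(node.text)
  when Seq then node.items.map { |item| to_regex_source(item) }.join
  end
end

greeting = Seq.new([Lit.new("foo"), Lit.new("bar")])

p interpret(greeting, "foobar")  # => 6
p interpret(greeting, "foobaz")  # => nil
p to_regex_source(greeting)      # => "foobar"
```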
Perhaps an option to "memoize everything" could gather statistics based
on actual parser behaviour with real input, and produce the backtracking
hints automatically? That'd be really sweet, because you wouldn't need to
be a language wizard to work out where to put them.
Personally, I don't like the idea of automatic backtracking. Regexp does
this. The big problem I see with auto-backtracking is that when the parse
fails, you just get a simple "it failed" response. Or you could possibly
generate a tree of possible failing locations (and expectations) in the
input. Nothing as simple as "expected this, got that, at line 123". Ever
try to debug a large Regexp? If it doesn't parse your input, it just
returns false and you have nothing to go on. I see the auto-backtracking
feature of Regexp as the reason it can't do much more than that. If/when I
make a Regexp translator, it would generate a Grammar that can backtrack on
every alternative (except with the (?>...) no backtrack construct) to be
compatible. I guess an "engine" could also backtrack by default.
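A quick Ruby illustration of the error-reporting complaint (a contrived example, not tied to any particular library): a failing Regexp just returns nil, while explicit parsing can report what was expected and where.

```ruby
# A failing Regexp gives you nothing to go on:
p(/\A\d+-\d+\z/.match("12-x4"))  # => nil -- no location, no expectation

# Explicit parsing can say "expected this, got that, at column N":
def parse_pair(s)
  m = /\A(\d+)-/.match(s) or return "expected digits then '-' at column 0"
  rest = s[m[0].length..-1]
  return :ok if /\A\d+\z/.match(rest)
  "expected digits, got #{rest[0].inspect} at column #{m[0].length}"
end

p parse_pair("12-x4")  # => "expected digits, got \"x\" at column 3"
p parse_pair("12-34")  # => :ok
```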
In my very first incarnation of Grammar (called Syntax on RAA), I did do
auto-backtracking, but found the above problem. I'd rather it be the
exception than the rule.
Does Treetop auto-backtrack? How do you deal with error reporting?
If backtracking isn't always on, do you memoize only when you might possibly
backtrack? In Grammar, I'm wondering if it is possible to take a memoization
performance hit only when backtracking is allowed.
I'm thinking I could also make an "engine" for Grammar that did
packrat or even LALR parsing instead of LL parsing.
Nathan was thinking about implementing Treetop using a VM. Truth is,
if the VM had the right hints, it'd be awesome.
I could also make a Grammar engine for generating VM bytecode if that were
possible.
Also, do any of you have a JSON parser for ANTLR with a Ruby target? JSON is
what I've been benchmarking with (because of the Ruby Quiz) and I'd like to
compare against ANTLR.
Not I, but I'd be surprised if Google doesn't find one. It'd be pretty
simple to throw one together anyhow.
I already found one on the ANTLR site, but not with a Ruby target. I didn't
want to go relearn ANTLR (v3 now) and figure out how to get it to generate Ruby.
Have you used ANTLRWorks BTW? It's excellent!
Once. I'd like to make something like this for Grammar (at least an engine
that graphically displays a Grammar).
Eric
···
On Tue, Sep 23, 2008 at 4:20 AM, Clifford Heath <no@spam.please.net> wrote: