ANTLR Target for Ruby

Hi everyone,

I've asked the following question in the ANTLR mailing list, with no
luck. Since it's quite Ruby-centric, perhaps someone here knows the
answer, given the popularity of ANTLR.

I'm in the process of migrating a YACC grammar to ANTLR.

However, since the rest of the project is written in Ruby, I'd like to
generate the ANTLR code in Ruby, if possible. Currently I'm generating
Java and using JRuby to circumvent this limitation. I'd prefer to use
MRI-Ruby, though

I've read the wiki page related to the Ruby Target (http://
www.antlr.org/wiki/display/ANTLR3/Antlr3RubyTarget), and the blog of
its developer (http://split-s.blogspot.com/2005/12/antlr-for-
ruby.html). The development seems to have halted in version 3.0b4.
This version could suit my needs as I'm only using basic Lexer and
Parser generation. No tree parsers.

Does anyone know where to get this version? It's not listed in the
Download directory (http://www.antlr.org/download/) and I couldn't
find it in the development repository. Has anyone used it
successfully?

Thanks in advance.

arcadio wrote:

I've asked the following question in the ANTLR mailing list, with no
luck. Since it's quite Ruby-centric, perhaps someone here knows the
answer, given the popularity of ANTLR.

I'm in the process of migrating a YACC grammar to ANTLR.

However, since the rest of the project is written in Ruby, I'd like to
generate the ANTLR code in Ruby, if possible.

Have you tried just adding the required options section below your grammar
statement? It just worked for me, though the generated code is a little
quirky. The Ruby generator has been a standard part of ANTLR for years.

options {
        language = Ruby;
        output = AST;
}

I found that ANTLR was very poor at reporting situations for which it
couldn't generate correct code (or rather, the designer has a strange
idea of "correct"), and in particular the lexing behaviour is very
non-intuitive. I was attempting a very difficult grammar that needs
LL(*) behaviour, and as I know now, needs to handle lexing using parse
rules (no lexer at all). After some unhelpful arguments, I gave up and
implemented my grammar using Treetop. Terribly slow, but powerful and
simple to use, once you get the hang of PEG parsing.

Clifford Heath.

Hey Clifford,

I still don't know enough about packrat parsing yet. Do you think it
possible for this type of parser to reach the same performance as LL (or
LALR) parsers or is there just extra overhead that you need at parse time
(as opposed to accounted for at parser generation time)? In my
benchmarking, the 3 packrat parsers I looked at (Treetop, Ghostwheel, and
Peggy) are 10-100 X slower than pure ruby LL and LALR parsers.

I also was a previous ANTLR user (you might find my C preprocessor in ANTLR
2.7.6 releases) before writing my own parser. I decided to stick with LL
parsing for my Grammar project. It does handle lexer-free parsers (but you
can have a lexer) and handles LL(*) parsing, but with a performance penalty
- backtracking. From my understanding, packrat parsers shine in
backtracking by memoizing to maintain linear performance. I'm wondering if
I could use this technique and still maintain my non-backtracking
performance. I thinking I could also make an "engine" for Grammar that did
packrat or even LALR parsing instead of LL parsing. Independently I should
be able to translate a PEG syntax, Regexp syntax, or BNF (YACC/ANTLR) to a
Grammar.

Also, any of you have JSON parser for ANTLR with a Ruby target? JSON is
what I've been benchmarking with (because of the ruby quiz) and I'd like to
compare against ANTLR.

Eric

···

On Sun, Sep 21, 2008 at 7:56 AM, Clifford Heath <no@spam.please.net> wrote:

arcadio wrote:

I've asked the following question in the ANTLR mailing list, with no
luck. Since it's quite Ruby-centric, perhaps someone here knows the
answer, given the popularity of ANTLR.

I'm in the process of migrating a YACC grammar to ANTLR.

However, since the rest of the project is written in Ruby, I'd like to
generate the ANTLR code in Ruby, if possible.

Have you tried just adding the required options section below your grammar
statement? It just worked for me, though the generated code is a little
quirky. The Ruby generator has been a standard part of ANTLR for years.

options {
      language = Ruby;
      output = AST;
}

I found that ANTLR was very poor at reporting situations for which it
couldn't generate correct code (or rather, the designer has a strange
idea of "correct"), and in particular the lexing behaviour is very
non-intuitive. I was attempting a very difficult grammar that needs
LL(*) behaviour, and as I know now, needs to handle lexing using parse
rules (no lexer at all). After some unhelpful arguments, I gave up and
implemented my grammar using Treetop. Terribly slow, but powerful and
simple to use, once you get the hang of PEG parsing.

Clifford Heath.

Eric Mahurin wrote:

I still don't know enough about packrat parsing yet. Do you think it
possible for this type of parser to reach the same performance as LL (or
LALR) parsers or is there just extra overhead that you need at parse time

Unless you can analyse the heck out of the source grammar, as ANTLR does,
I think there's always going to be an overhead. Terence Parr's achievement
is to figure out where backtracking might be needed and do it efficiently
only when needed.

Treetop could be improved with some local optimisations, like using Regexp,
and perhaps looking at the rest of the rule to reduce memoization for example,
which might pick up 10x improvement or mor, but I expect there'll always be a
a large overhead over using a hand-crafted parser. There's a suggestion in the
wind to implement "skip" rules, that produce no syntax nodes, for things like
whitespace skipping.

Currently, saying "space*" will generate one node for *each* space - Regexp
would fix that. So even for a PEG parser, Treetop is unnecessarily slow.

It does handle lexer-free parsers (but you
can have a lexer) and handles LL(*) parsing, but with a performance penalty
- backtracking. From my understanding, packrat parsers shine in
backtracking by memoizing to maintain linear performance. I'm wondering if
I could use this technique and still maintain my non-backtracking
performance.

The major part of the cost is the construction of so many objects.
If you provide memoize-when-hinted, I think that'd be best.
If you also provided a sweet metagrammar (I find your earlier Grammar
examples make my eyes bleed), I'd be onto it like a shot. I need the
prospect of non-Ruby code generation however...

Perhaps an option to "memoize everything" could gather statistics based
on actual parser behaviour with real input, and produce the backtracking
hints automatically? That'd be really sweet, because you wouldn't need to
be a language wizard to work out where to put them.

I thinking I could also make an "engine" for Grammar that did
packrat or even LALR parsing instead of LL parsing.

Nathan was thinking about implementing Treetop using a VM, Truth is,
if the VM had the right hints, it'd be awesome.

Also, any of you have JSON parser for ANTLR with a Ruby target? JSON is
what I've been benchmarking with (because of the ruby quiz) and I'd like to
compare against ANTLR.

Not I, but I'd be surprised if Google doesn't find one. It'd be pretty
simple to throw one together anyhow.

Have you used ANTRWorks BTW? It's excellent!

Clifford Heath.

Eric Mahurin wrote:

I still don't know enough about packrat parsing yet. Do you think it
possible for this type of parser to reach the same performance as LL (or
LALR) parsers or is there just extra overhead that you need at parse time

Unless you can analyse the heck out of the source grammar, as ANTLR does,
I think there's always going to be an overhead. Terence Parr's achievement
is to figure out where backtracking might be needed and do it efficiently
only when needed.

Treetop could be improved with some local optimisations, like using Regexp,
and perhaps looking at the rest of the rule to reduce memoization for
example,
which might pick up 10x improvement or mor, but I expect there'll always be
a
a large overhead over using a hand-crafted parser. There's a suggestion in
the
wind to implement "skip" rules, that produce no syntax nodes, for things
like
whitespace skipping.

Currently, saying "space*" will generate one node for *each* space - Regexp
would fix that. So even for a PEG parser, Treetop is unnecessarily slow.

It does handle lexer-free parsers (but you

can have a lexer) and handles LL(*) parsing, but with a performance
penalty
- backtracking. From my understanding, packrat parsers shine in
backtracking by memoizing to maintain linear performance. I'm wondering
if
I could use this technique and still maintain my non-backtracking
performance.

The major part of the cost is the construction of so many objects.
If you provide memoize-when-hinted, I think that'd be best.
If you also provided a sweet metagrammar (I find your earlier Grammar
examples make my eyes bleed), I'd be onto it like a shot. I need the
prospect of non-Ruby code generation however...

I'm curious, could you give an example of what makes your eyes bleed? I
would like to improve the usability where possible. The Grammar layer
itself will always be just ruby objects with a ruby DSL. But, I would like
to provide translators from various formats (BNF, PEG, regex, etc) to
Grammar objects. I'm sure I could even give you a faster replacement for
Regexp (using ruby2cext). These translators would simply be Grammars that
build Grammars (or maybe the code, so you could see/edit it).

Also, with Grammar 0.8, I've completely separated the parser generation.
Grammar is just a light-weight layer used to hold the description of the
language you are parsing. The real work is in the "engine" that does parser
generation. There aren't many bounds as to what an "engine" could do with a
Grammar. An engine could:

   - generate a parser in another target language (engine might allow an
   "action" to be simply a code string)
   - directly parse while traversing a Grammar
   - generate another type of parser - LALR, packrat ???
   - build a Graphviz diagram of a Grammar tree
   - "inspect" a Grammar (Grammar#inspect isn't useful because every Grammar
   level mainly just has a lambda)
   - translate to another format - regex, BNF, PEG

Right now I've implemented only a few engines (some built on each other) -
Ruby (generates very flat LL ruby parsers), RubyCall (puts method calls in
certain places), Ruby0 (parses directly instead of generating a parser),
Ruby2CExt (uses ruby2cext to generate a rubyC parser).

Perhaps an option to "memoize everything" could gather statistics based

on actual parser behaviour with real input, and produce the backtracking
hints automatically? That'd be really sweet, because you wouldn't need to
be a language wizard to work out where to put them.

Personally, I don't like the idea of automatic backtracking. Regexp does
this. The big problem I see with auto-backtracking is that when the parse
fails, you just get a simple "it failed" response. Or you could possibly
generate a tree of possible failing locations (and expectations) in the
input. Nothing as simple as "expected this, got that, at line 123". Ever
try to debug a large Regexp? If it doesn't parse your input, it just
returns false and you have nothing to go on. I see the auto-backtracking
feature of Regexp as the reason it can't do much more than that. If/when I
make a Regexp translator, it would generate a Grammar that can backtrack on
every alternative (except with the (?>...) no backtrack construct) to be
compatible. I guess an "engine" could also backtrack by default.

In my very first incarnation of Grammar (called Syntax on RAA), I did do
auto-backtracking, but found the above problem. I'd rather it be the
exception than the rule.

Does Treetop auto-backtrack? How do you deal with error reporting?

If backtracking isn't always on, do you memoize only when you might possibly
backtrack? In Grammar, I'm wondering if it is possible to take to a
memoizing performance hit only when backtracking is allowed.

  I thinking I could also make an "engine" for Grammar that did

packrat or even LALR parsing instead of LL parsing.

Nathan was thinking about implementing Treetop using a VM, Truth is,

if the VM had the right hints, it'd be awesome.

I'd also could make a Grammar engine for generating VM bytecode if that were
possible.

Also, any of you have JSON parser for ANTLR with a Ruby target? JSON is

what I've been benchmarking with (because of the ruby quiz) and I'd like
to
compare against ANTLR.

Not I, but I'd be surprised if Google doesn't find one. It'd be pretty
simple to throw one together anyhow.

I already found it on the antlr site, but not with a ruby target. I didn't
want to go relearn antlr (v3 now) and how to get it to generate ruby.

Have you used ANTRWorks BTW? It's excellent!

Once. I'd like to make something like this for Grammar (at least an engine
that graphically displays a Grammar).

Eric

···

On Tue, Sep 23, 2008 at 4:20 AM, Clifford Heath <no@spam.please.net> wrote:

Hopefully it's fixed now, but last time I tried using ANTLR for Ruby output it was very broken. That was a while ago though (shortly after version three released, I think).

James Edward Gray II

···

On Sep 23, 2008, at 8:58 AM, Eric Mahurin wrote:

I already found it on the antlr site, but not with a ruby target. I didn't want to go relearn antlr (v3 now) and how to get it to generate ruby.

Eric Mahurin wrote:

The major part of the cost is the construction of so many objects.
If you provide memoize-when-hinted, I think that'd be best.
If you also provided a sweet metagrammar (I find your earlier Grammar
examples make my eyes bleed), I'd be onto it like a shot. I need the
prospect of non-Ruby code generation however...

I'm curious, could you give an example of what makes your eyes bleed?

I must admit I haven't looked closely at Grammar for a while, however...

DSLs are fine for folk who don't know how to make proper grammars :-).
They tend to have a somewhat high signal-to-noise ratio, by which I mean
there are too many extraneous characters/symbols, or symbols that aren't
intuitively communicative.

Every non-essential symbol in a sentence is a significant barrier
to a newcomer, though experienced users learn to skip them and don't
understand what the fuss is about. The problem is that the difficulty
of grasping which symbols are significant and which aren't goes with
the *square* of the S/N ratio.

XML is a perfect example of how to do this wrongly :slight_smile: sendmail.cf
and Perl are even better examples ;-).

I would like
to provide translators from various formats (BNF, PEG, regex, etc) to
Grammar objects.

That's what I'm talking about, good stuff.

Also, with Grammar 0.8, I've completely separated the parser generation.
Grammar is just a light-weight layer used to hold the description of the
language you are parsing.

So you're already halfway to a VM, where the generated parser doesn't have
to be human-readable at all :slight_smile:

Personally, I don't like the idea of automatic backtracking. Regexp does
this. The big problem I see with auto-backtracking is that when the parse
fails, you just get a simple "it failed" response.

Not the case with Treetop. It assumes that the furthest point reached is
the failure point (not often true but an assumption that works, given...)
and lists the text leading up to that, and enumerates the terminal tokens
that would have allowed it to get further. This works *really* well.
Try it - call the "failure_reason" method on a failed parse to see it
in action.

Nothing as simple as "expected this, got that, at line 123".

It's exactly as simple as that. The diagnosis might be wrong, but to
the human, it's a comprehensible mistake.

Does Treetop auto-backtrack?

That's what memoizing is for - it's the core of how PEG works.

If backtracking isn't always on, do you memoize only when you might possibly
backtrack? In Grammar, I'm wondering if it is possible to take to a
memoizing performance hit only when backtracking is allowed.

I don't think you can guess when backtracking will be needed.
As I said, it requires heavy analysis - Terrence didn't get his
PhD for nothing!

But if you always allow backtracking ala PEG, but only memoize in
places where experience (aka statistics collected) have shown it's
needed, you could have the best of both worlds.

I already found it on the antlr site, but not with a ruby target. I didn't
want to go relearn antlr (v3 now) and how to get it to generate ruby.

You just ask for Ruby in your options block at the top of the grammar.
The Ruby target is quite strange and limited though IIRC. The lexer
in all ANTLR grammars is definitely non-intuitive, it doesn't choose
the longest possible match like every other lexer in the world. Terr
isn't concerned about his users, just about big-noting himself. I
wouldn't say "their weak support ran me off" so much as just "they ran
me off". When he pisses someone off, that just proves his superiority
more... and I have the emails to prove it. You mighta thought DHH had
an attitude problem, heh, he's tame!

Have you used ANTRWorks BTW? It's excellent!
Once. I'd like to make something like this for Grammar (at least an engine
that graphically displays a Grammar).

ANTLRWorks shows railroad diagrams, which I used to produce syntax
documentation for CQL. But the ability to interactively test individual
rules is great. A lot of work though - I'd have thought your skills
were better spent elsewhere.

Clifford Heath.

I doubt it is fixed... I tried to submit patches to get the ruby code to work and Terr was a PITA to no end so I walked away. He later came to me asking if I wanted to maintain the ruby side, but... too little, too late.

I just checked perforce and the last change to ruby runtime was:

Change 3846 on 2007/06/26 by martin@martin.MartinXP.antlr3 '- fixed various bugs related...'

My last emails from Terr were 2007-08, so ... no, probably not fixed in the slightest.

···

On Sep 23, 2008, at 07:08 , James Gray wrote:

On Sep 23, 2008, at 8:58 AM, Eric Mahurin wrote:

I already found it on the antlr site, but not with a ruby target. I didn't want to go relearn antlr (v3 now) and how to get it to generate ruby.

Hopefully it's fixed now, but last time I tried using ANTLR for Ruby output it was very broken. That was a while ago though (shortly after version three released, I think).

I wrote a PEG parser generator a couple of years ago to support my
Walrus object-oriented templating system (inspired by the Cheetah
template engine in Python, and carrying over a lot of its syntax and
directives). Actually, the reason I did this was because I didn't have
any luck getting ANTLR to do what I wanted it to do.

The parser generator uses a Ruby DSL to specify the grammar,
dynamically constructing the parser on the fly. To get an idea of what
the DSL looks like and what it can do, check out the parser for the
Walrus language itself; it is a fairly complex example because it
demonstrates things like "here documents", include files, and parses a
subset of Ruby itself in addition to the Walrus markup:

http://git.wincent.com/Walrus.git?a=blob;f=lib/walrus/parser.rb

Like Treetop, it is fairly slow, but it works. The major performance
penalty actually seems to be the memory bloat associated with
memoization. There is also a resource leak in there where a lot of the
memoized data doesn't appear to be getting garbage-collected when you
run the parser over lots of documents in the same session; never got
to the bottom of that one despite spending quite a few hours on it. My
workaround was to process multiple files in batches, running Walrus
once for each file rather than passing it all the files at once. Not
very elegant but it worked.

Given that it works I haven't really done any more development on it.
But an academic itch that I've wanted to scratch for some time now has
been turning it into a performance powerhouse; either by replacing the
dynamically generated parser with a custom Ragel lexer (packaged as a
C extension for Ruby) or by doing static grammar analysis like ANTLR
does to identify places where prediction could be used to circumvent
unnecessary backtracking and jump straight to the right alternative.

But I think both of those options would require me to lose the Ruby
DSL and switch to a custom, non-Ruby language for specifying grammars.
One of the things I always liked about the system was that you wrote
your grammars in Ruby. Evidently, you can only do static analysis once
you've parsed the grammar, and that would be fiendishly difficult (or
impossible?) to do with a dynamically-constructed one built on the fly
by evaluating the Ruby DSL.

Cheers,
Wincent

DSLs are fine for folk who don't know how to make proper grammars :-).

OK. I guess that's that source of the "bleeding eyes"... I'm sounds like I
won't be able to convince you why a DSL is better than a specific format,
and you vice-versa. A quick summary is: although you lose the exact syntax
that you may want, you gain in power and flexibility. Using Grammar
(directly), you could template Grammars (method/lambda could return a
Grammar based on Grammar args and other params), use plain ruby for extra
control when specifying a Grammar tree, or simply embed Grammar objects in a
ruby program (like Regexp objects can be - although they parse the grammar
from a string).

I would like

to provide translators from various formats (BNF, PEG, regex, etc) to
Grammar objects.

That's what I'm talking about, good stuff.

Also, with Grammar 0.8, I've completely separated the parser generation.

Grammar is just a light-weight layer used to hold the description of the
language you are parsing.

So you're already halfway to a VM, where the generated parser doesn't have
to be human-readable at all :slight_smile:

Better - it is a compiler. Compiles from a Grammar to a target language.
Grammar has always done this. Previously, Ruby code generation was done in
the Grammar class itself. I've separated it to add quite a bit of
flexibility.

Personally, I don't like the idea of automatic backtracking. Regexp does

this. The big problem I see with auto-backtracking is that when the parse
fails, you just get a simple "it failed" response.

Not the case with Treetop. It assumes that the furthest point reached is
the failure point (not often true but an assumption that works, given...)
and lists the text leading up to that, and enumerates the terminal tokens
that would have allowed it to get further. This works *really* well.
Try it - call the "failure_reason" method on a failed parse to see it
in action.

OK. I guess it just remembers the furthest point it reached when
backtracking and when all possibilities are exhausted, it reports this
furthest point. Sounds reasonable. I guess I could try something like this
when I backtrack. Right now, I ignore errors/mismatches that cause a
backtrack.

Does Treetop auto-backtrack?

That's what memoizing is for - it's the core of how PEG works.

Don't you mean packrat instead of PEG? I thought PEG was just a format for
describing something to parse, like BNF. And packrat refers to the type of
parser - like LL or LALR. From my understanding packrat is a cousin of LL,
but adds memoizing to achieve linear backtracking performance.

If backtracking isn't always on, do you memoize only when you might

possibly
backtrack? In Grammar, I'm wondering if it is possible to take to a
memoizing performance hit only when backtracking is allowed.

I don't think you can guess when backtracking will be needed.
As I said, it requires heavy analysis - Terrence didn't get his
PhD for nothing!

In Grammar, I don't try to guess. Backtracking is normally not done. To
allow for backtracking, I provide a Grammar#backtrack method that returns a
new Grammar that can backtrack to the beginning of where the Grammar started
parsing the input when it fails in the middle. I would only apply
memoization for one of these - if this is doable.

I already found it on the antlr site, but not with a ruby target. I

didn't
want to go relearn antlr (v3 now) and how to get it to generate ruby.

You just ask for Ruby in your options block at the top of the grammar.
The Ruby target is quite strange and limited though IIRC. The lexer
in all ANTLR grammars is definitely non-intuitive, it doesn't choose
the longest possible match like every other lexer in the world. Terr
isn't concerned about his users, just about big-noting himself. I
wouldn't say "their weak support ran me off" so much as just "they ran
me off". When he pisses someone off, that just proves his superiority
more... and I have the emails to prove it. You mighta thought DHH had
an attitude problem, heh, he's tame!

I didn't encounter that a few years ago. I mostly just learned a lot about
LL parsing and wanted to make something better. It does sound like others
have been run off though...

  Have you used ANTRWorks BTW? It's excellent!

Once. I'd like to make something like this for Grammar (at least an
engine
that graphically displays a Grammar).

ANTLRWorks shows railroad diagrams, which I used to produce syntax
documentation for CQL. But the ability to interactively test individual
rules is great. A lot of work though - I'd have thought your skills
were better spent elsewhere.

I was only thinking making a Grammar engine that generated dot files for
Graphviz to get a visualization of a Grammar.

Eric

···

On Tue, Sep 23, 2008 at 9:04 PM, Clifford Heath <no@spam.please.net> wrote:

Yeah, I too felt like Ruby was a second class citizen. I was trying to make the switch to ANTLR as my parsing tool, but their week support ran me off.

James Edward Gray II

···

On Sep 23, 2008, at 1:30 PM, Ryan Davis wrote:

On Sep 23, 2008, at 07:08 , James Gray wrote:

On Sep 23, 2008, at 8:58 AM, Eric Mahurin wrote:

I already found it on the antlr site, but not with a ruby target. I didn't want to go relearn antlr (v3 now) and how to get it to generate ruby.

Hopefully it's fixed now, but last time I tried using ANTLR for Ruby output it was very broken. That was a while ago though (shortly after version three released, I think).

I doubt it is fixed... I tried to submit patches to get the ruby code to work and Terr was a PITA to no end so I walked away. He later came to me asking if I wanted to maintain the ruby side, but... too little, too late.

I just checked perforce and the last change to ruby runtime was:

Change 3846 on 2007/06/26 by martin@martin.MartinXP.antlr3 '- fixed various bugs related...'

My last emails from Terr were 2007-08, so ... no, probably not fixed in the slightest.

It seems like everybody writing parser (generators) is coming from ANTLR. I
count Clifford Heath, James Gray, Ryan Davis, Wincent, and me.

Wincent, do you happen to have something in Walrus that parses JSON and
generates ruby (tree of hashs and arrays)? I've benchmarked a bunch of
parser generators out there using JSON and would always like to add more.

···

On Wed, Sep 24, 2008 at 3:49 AM, Wincent Colaiuta <win@wincent.com> wrote:

I wrote a PEG parser generator a couple of years ago to support my
Walrus object-oriented templating system (inspired by the Cheetah
template engine in Python, and carrying over a lot of its syntax and
directives). Actually, the reason I did this was because I didn't have
any luck getting ANTLR to do what I wanted it to do.

When I wrote GhostWheel, I was convinced the DSL was better too. Then I tried using it and changed my mind. :slight_smile:

The DSL adds a lot of Ruby noise that makes it harder to think about and understand the language being parsed. It keeps you from focusing on the right things.

For example, compare this JSON parser using GhostWheel's DSL:

http://ghostwheel.rubyforge.org/svn/trunk/test/example/tc_json_ruby.rb

and this using a grammar:

http://ghostwheel.rubyforge.org/svn/trunk/test/example/tc_json_ghost_wheel.rb

I'm not arguing my grammar is perfect either. I think it could get even more readable. But have a look at how clear rules like value_space, value_content, value, and value_with_array_sep play out.

I also think most of the advantages you list can be gained with non-DSL grammars. I mean, we can always dynamically build up a String in Ruby too and that's the dumbest thing I could possibly think of.

Anyway, the two don't have to be exclusive. As you see, GhostWheel supports both.

James Edward Gray II

···

On Sep 24, 2008, at 7:23 AM, Eric Mahurin wrote:

I'm sounds like I
won't be able to convince you why a DSL is better than a specific format,
and you vice-versa. A quick summary is: although you lose the exact syntax
that you may want, you gain in power and flexibility. Using Grammar
(directly), you could template Grammars (method/lambda could return a
Grammar based on Grammar args and other params), use plain ruby for extra
control when specifying a Grammar tree, or simply embed Grammar objects in a
ruby program (like Regexp objects can be - although they parse the grammar
from a string).

I think your right about the first part, but not about the second
part; Packrat has nothing to do with LL or LALR parsing.

A PEG is a Parsing Expression Grammar. Its distinguishing features
are:

- has the usual repetition (zero or more, one or more) operators,
sequence operators, alternative operators, as well as "and predicates"
and "not predicates"

- grammars are non-ambiguous because alternatives are specified as a
set of ordered choices

- lends itself to recognition by recursive descent, and maps quite
well onto the way human beings perform recognition in their own minds

A Packrat parser is a memoizing recursive-descent parser that
recognizes input based on a PEG. Its distinguishing feature is the
memoization.

At least, that's my understanding of it.

Cheers,
Wincent

···

On 24 sep, 14:23, Eric Mahurin <eric.mahu...@gmail.com> wrote:

On Tue, Sep 23, 2008 at 9:04 PM, Clifford Heath <n...@spam.please.net> wrote:
> Does Treetop auto-backtrack?

> That's what memoizing is for - it's the core of how PEG works.

Don't you mean packrat instead of PEG? I thought PEG was just a format for
describing something to parse, like BNF. And packrat refers to the type of
parser - like LL or LALR. From my understanding packrat is a cousin of LL,
but adds memoizing to achieve linear backtracking performance.

Eric Mahurin wrote:

DSLs are fine for folk who don't know how to make proper grammars :-).

A quick summary is: although you lose the exact syntax
that you may want, you gain in power and flexibility.

As James has said, I don't believe the latter is necessarily true.
It's beholden on we who want folk to create proper grammars to use
best practise ourselves, anyhow.

Don't you mean packrat instead of PEG?

You're right, but I identify the two, as I've never seen one used
without the other.

I was only thinking making a Grammar engine that generated dot files for
Graphviz to get a visualization of a Grammar.

Ok. That wouldn't use useful to me, because when I have to document a
grammar, the diagrams aren't as appropriate for that audience cf a railroad
diagram. See my "Introduction to CQL" here, which has railroads extracted
from ANTLRWorks: <dataconstellation.com/ActiveFacts/CQLIntroduction.pdf>.

Clifford Heath.

···

On Tue, Sep 23, 2008 at 9:04 PM, Clifford Heath <no@spam.please.net> wrote:

Wincent Colaiuta wrote:

Like Treetop, it is fairly slow, but it works.

Nice!

But an academic itch that I've wanted to scratch for some time now has
been turning it into a performance powerhouse; either by replacing the
dynamically generated parser with a custom Ragel lexer (packaged as a
C extension for Ruby) or by doing static grammar analysis like ANTLR
does to identify places where prediction could be used to circumvent
unnecessary backtracking and jump straight to the right alternative.

What do you think of my third option: Provide memoization only where a
test workload shows the need for it? Backtrack in every situation just
in case, but memoize where testing shows it's useful - totally pragmatic
approach...

One of the things I always liked about the system was that you wrote
your grammars in Ruby.

I think that's an error, based on the presumption that "is Ruby, is good".
Not all that's Ruby is golden, err... red ;-). By which I mean that
one size doesn't fit all, or we wouldn't need parsers in the first place.

Clifford Heath.

Hi Wincent,

Just because you are using a Ruby DSL, you shouldn't be limited. There is
nothing that says your DSL couldn't build exactly the same data structures
that you'd build if you read/parsed in a non-Ruby DSL. In my Grammar
project, I create a Grammar tree with a Ruby DSL (simply methods that create
and combine Grammars). I can parse directly using this tree or I can
generate highly optimized code. Here are a few optimizations: flattening
(method calls only needed for recursion), tail call optimization (right
recursion), left recursion (normally unheard of with LL), some refactoring,
and boolean minimizations. The main thing that should matter in terms of
analysis should be your data structures, not how you get there. A ruby DSL
will only enhance flexibility.

Eric

···

On Wed, Sep 24, 2008 at 3:49 AM, Wincent Colaiuta <win@wincent.com> wrote:

But I think both of those options would require me to lose the Ruby
DSL and switch to a custom, non-Ruby language for specifying grammars.
One of the things I always liked about the system was that you wrote
your grammars in Ruby. Evidently, you can only do static analysis once
you've parsed the grammar, and that would be fiendishly difficult (or
impossible?) to do with a dynamically-constructed one built on the fly
by evaluating the Ruby DSL.

I doubt they even gave it that long... *rimshot*

···

On Sep 23, 2008, at 12:19 , James Gray wrote:

Yeah, I too felt like Ruby was a second class citizen. I was trying to make the switch to ANTLR as my parsing tool, but their week support ran me off.

Have you tried the JSON parser one built in to 1.9? It's a native extension, but I don't know how its performance compares with other JSON parsers.

Dave

···

On Sep 24, 2008, at 7:23 AM, Eric Mahurin wrote:

Wincent, do you happen to have something in Walrus that parses JSON and
generates ruby (tree of hashs and arrays)? I've benchmarked a bunch of
parser generators out there using JSON and would always like to add more.

I think my DSL is still fairly clean because you can simply use variable
assignments for "rules" and use the "+" and "|" operators for sequences and
alternations. Those help a lot. Ideally I would use a space instead of "+"
to match BNF, but it doesn't seem that bad.

But, I think I'll also have converters (PEG, BNF, regex) to give the non-DSL
approach.

···

On Wed, Sep 24, 2008 at 8:32 AM, James Gray <james@grayproductions.net>wrote:

On Sep 24, 2008, at 7:23 AM, Eric Mahurin wrote:

I'm sounds like I

won't be able to convince you why a DSL is better than a specific format,
and you vice-versa. A quick summary is: although you lose the exact
syntax
that you may want, you gain in power and flexibility. Using Grammar
(directly), you could template Grammars (method/lambda could return a
Grammar based on Grammar args and other params), use plain ruby for extra
control when specifying a Grammar tree, or simply embed Grammar objects in
a
ruby program (like Regexp objects can be - although they parse the grammar
from a string).

When I wrote GhostWheel, I was convinced the DSL was better too. Then I
tried using it and changed my mind. :slight_smile:

The DSL adds a lot of Ruby noise that makes it harder to think about and
understand the language being parsed. It keeps you from focusing on the
right things.

For example, compare this JSON parser using GhostWheel's DSL:

http://ghostwheel.rubyforge.org/svn/trunk/test/example/tc_json_ruby.rb

and this using a grammar:

http://ghostwheel.rubyforge.org/svn/trunk/test/example/tc_json_ghost_wheel.rb

I'm not arguing my grammar is perfect either. I think it could get even
more readable. But have a look at how clear rules like value_space,
value_content, value, and value_with_array_sep play out.

I also think most of the advantages you list can be gained with non-DSL
grammars. I mean, we can always dynamically build up a String in Ruby too
and that's the dumbest thing I could possibly think of.

Anyway, the two don't have to be exclusive. As you see, GhostWheel
supports both.