When little languages grow

Dave_Burt2 · 27 January 2005 02:25

Hi,

I'm having trouble understanding what you mean by this. Can you say it in
Ruby?

Cheers,
Dave

···

"ts" <decoux@moulon.inra.fr> wrote:

> Thread.new(input){|source|
> $SAFE=5
> instance_eval source
> }.value

Sorry to say this but this is the most common error that I see when
someone try to eval some code with $SAFE >= 4

The code will be eval'ed with $SAFE >= 4 but the result (#value) will be
used with $SAFE = 0 and you can have problems.

The result of #eval must be cleaned with $SAFE >= 4, before it's
returned.

Hugh_Sasse · 27 January 2005 12:10

Hi Hugh.

Two suggestions:

Most of the docs for parser-generators assume some (lots?) knowledge of
compiler theory. An exception is the O'Reilly Lex&Yacc book. If any of

[...]
Thanks, I'll see if I can get my hands on this. I've read a few books
about these but that description sounds good.

I think that if you do this often, mastering one of these tools will be
a life skill you'll never regret! And yacc (or one of the others) is a
mini-language in its own right, always good to learn another.

Agreed, I'd really like to be fluent in it.

On another track, I have deliberately NOT used these grammers when
parseing the BNF for internet email. In my experience, if you can write
the grammer down in BNF, its usually pretty easy to write a recursive
descent parser for it. It's a little tedious though, which is why all
the tools exist to generate the code for you...

According to "Crafting a Compiler" ISBN 0-8053-3201-4 it seems that
recursive descent implies LL(1), so I think I'd agree with aiming
for that constraint. It describes tecniques for removing the
ambiguity that increases LL(1) to LL(k), so often this is doable.

Have fun,
Sam

Thank you,
Hugh

···

On Thu, 27 Jan 2005, Sam Roberts wrote:

Hugh_Sasse · 27 January 2005 13:35

(In response to news:Pine.GSO.4.60.0501261622400.24999
@brains.eng.cse.dmu.ac.uk by Hugh Sasse Staff Elec Eng)

I seem to have run into my parsing problem again. [...]
So I'm wondering what other people do.

Just a few random thoughts:

- All of this depends on how complex your mini-language is. Should it be
Turing complete or just really something like fstab ? See
http://www.faqs.org/docs/artu/ch08s01.html (a chapter from the Art of
Unix Programming, including a taxonomy of languages).

Interesting that declarative languages are considered less general,
but that may be an artifact of those selected for the figure.

- If your language needs loopiness and control constructs like Ruby has

Mine doesn't, but...

(or should look like Ruby), then you could help me think about a cut-down
version of Ruby for configuration files: This version of Ruby should be

Yes, I'd like one of those. Even having Lua Tables as a native data
type to Ruby would be a big help, but this would be good to have for
the more general case.

···

On Thu, 27 Jan 2005, Kaspar Schiess wrote:
,

accessible from within Ruby and be totally safe (=cut off of the external
world). Such a thing would have to be execution timeouted (endless loops
and stack overflows) too...

Good idea.

- Your problem was essentially the topic of one of my semester projects.
I tried out a lot of different parser generators, for different
languages. None really could score in the category of simplicity.
Probably just a complex subject. But the problem of having to create a
minilanguage pops up all the time, as you say.

Thanks for this confirmation that I'm still on the right track!

- Parse error reporting: A lot of programming languages have bad parse
error reporting, which essentially comes from the fact that the parser
generator does not help you in this. ANTLR is a bit better there, but IMO
still sucks.

Yes it is difficult. I think you have to sprinkle '.' all over yacc
grammars to do this.

I think there are two really good ideas in this thread:
a) Create and Integrate Grammars as first class objects in Ruby.
(Clifford Heath)
b) Use a cut-down version of Ruby that is designed with security in mind
(various people, this has been done).

I can't help feeling that we're missing a really simple way to do all of
this ...

I think so too, but I can't see the wood for the trees...

kaspar

hand manufactured code - www.tua.ch/ruby

Hugh

Mark_Probert · 26 January 2005 18:48

Hi ..

···

On Wednesday 26 January 2005 10:33, Hugh Sasse Staff Elec Eng wrote:

I'll probably stay with the pure ruby one, but I'll certainly look
at this. I suspect that the 1 might be the problem for what I'm
trying to do. I can't remember how LL differs from LR now, but I'll
find that pretty easily.

If you are starting from scratch, then LL(1) usually isn't a problem. If you
can define the grammar in well formed BNF, then LL(1) will be fine. The real
problem comes with feature creep and making sure that you don't hose your
grammar too much

Regards,

--
-mark. (probertm at acm dot org)

Ed_Hames · 26 January 2005 19:40

Is the "mini-language" given or you can build your own "mini-language"?
If you can write your own language, then you should consider writing a
DSL suitable for your problem a la Rails, and then, let the Ruby
interpreter do the whole job.

Kind Regards,
Ed

···

On Thu, 27 Jan 2005 04:31:05 +0900, Hugh Sasse Staff Elec Eng <hgs@dmu.ac.uk> wrote:

On Thu, 27 Jan 2005, Trans wrote:

> Hi Hugh,
>
> What are you trying to parse exactly? Is it full Ruby source or a small
> subset?

Not full ruby source, just a mini-language, maybe using Ruby's
parser instead of botching(!) my own...
>

--
Alcohol is the anesthesia by which we endure the operation of life.
-- George Bernard Shaw

7rans · 26 January 2005 22:30

Hugh, I sent you an email with parser.rb attached, but I'm not sure if
I had the right email address. Let me know if you got it thanks.

This is being sent via google groups, so I assume it will get to you

T.

Carlos · 27 January 2005 01:23

[Hugh Sasse Staff Elec Eng <hgs@dmu.ac.uk>, 2005-01-26 19.18 CET]

My present example is that I want to parse Constructive Solid
Geometry descriptions, at the moment limited to cones, spheres,
bricks with the co-ordinates specified, and I need to specify
material types as well.

I'd also like to be able to declare new objects so they can be
placed. Silly example: Get two small spheres to cap the ends of
a cylinder and call the result a Sausage. Then place several
Sausages in the space at different points.

Later I'd have to extend the language to be able to rotate them into
psoition. Lots of creeping featurism is likely, I suspect. I
didn't have material types to deal with before.

How about using POV-ray's scene description language?

Here somebody did a yacc parser for that language:
http://www.io.com/~wwagner/pov.html
(skip to header "The Parser"). Maybe you can adapt it to use with racc.

(I don't know about the licence.)

Good luck.

Mathieu_Bouchard1 · 27 January 2005 01:46

Another thing I like about Tcl is that it has _no_ special cases. Its
syntax, while superficially like shell scripts, is deeply like Lisp
(actually simpler: think Scheme with a few details backwards).

The equivalent of Ruby's "def" in Tcl is "proc". Well, proc is a
procedure, so you can redefine it, so as to add features to it.

That kind of goodies is the things that make Tcl quite bearable to me,
despite lacking a damn lot features (I originally switched to Perl from
Tcl)

I still have to work with Tcl because of a GUI made in Tk, but I'm quite
happy with it.

···

On Thu, 27 Jan 2005, Booker C. Bense wrote:

_ Of course it was vast overkill for 99% of what you need to
do with that kind of system. As much as I dislike Tcl for other
reasons, it is a very good language for extending applications
via simple scripting. Adding your own specialized commands is
fairly straightforward. I know quite a few scientific groups
that are using Python to do similar things these days.

_____________________________________________________________________
Mathieu Bouchard -=- Montréal QC Canada -=- http://artengine.ca/matju

Hugh_Sasse · 27 January 2005 11:56

Mark Probert wrote:

My personal solution to this is to use Coco/R, an LL(1) scanner/generator.

I haven't used Coco, but I'd second Mark's recommendation to stay with
an LL(1) parser if possible. If not, then LL(n) ala ANTLR, but not LALR
ala yacc. The LL parsers are easy to write manually using recursive
descent, which I've done a few times.

I dug out my copy of "Crafting a Compiler", ISBN 0-8053-3021-4
last night, and agree that LL(1) is simpler because it can be done
from a fairly straightforward table. Yacc and Bison are LALR I
think.

[1] I find that thinking in the manner of a shift/reduce parser is
particularly unnatural to me. ... Maybe there is something I can read which will turn the problem around, so it becomes easy to handle?

Shift just means "delay a decision about what I've just seen". Reduce is
the operation you do when you do decide. If you explore the ambiguity
in your grammar rules, these start to make more sense.

That's a good description, but it's difficult to hold this
state in one's, (or is that just my?), head

The primary disadvantage of Coco/R is the LL(1) part.

ANTLR does LL(n) for arbitrary n I believe - though you should avoid
n > 3 or humans start to have trouble parsing your language :-).

It's a shame that the ANTLR folk at Purdue went Java-only when they
dropped their old C-based implementation. A multi-lingual ANTLR would
be super-cool, especially if it would generate Ruby.

There is also PCCTS

http://dynamo.ecn.purdue.edu/~hankd/PCCTS/

but, although very flexible, it is not what I'd call simple. I've
not really retained what I read about it, but I remember it is a
good design of something with sufficient complexity to meet its
goals.

As an example, Ruby can not, as far as I have tried, be converted into an LL(1) grammar, though C can.

Not without a tie-in to the lexical analyser to help recognise goto
labels, which require LL(2). Such a tie-in is commonly used however.

A simple example of the ruby grammar

Good example, thanks Mark.

I should point out that the major reason for the success of XML
(contrary to most of the hyped claims about it) is that it allows
people to create languages without having to create parsers. Or
rather, they use an XML parser which yields a DOM, and can process
the AST at will.

This is a good point. I'd not really given much thought to XML.
I'm thinking of this for describing problems expressed in
constructive solid geometry, and XML is pretty unpleasant to edit by
hand. I've not got into XSLT, but maybe there {is, could be} a utility that
could take something with Rubyesque blocks and transform it into
XML. I'm not in a position to start writng it, and, of course, I'd
have to parse the input....

"Hence, by induction..."

If you can live with the ugliness of XML and the size&speed of Rexml,

Rexml does make XML handling nice, I have to say.

you should consider it.

There's no good reason why a language like Ruby shouldn't have
grammar rules as first-class objects (as Regexp's are), yielding
Ruby objects that reflect the AST, allowing attribute-grammar
parsers to be written and integrated directly within a program.

"That's a Rite good idea, is that!"

Such a tool, integrated into the Ruby interpreter itself, would
allow extension modules to define *Ruby syntax extensions*, so
that the language itself becomes plastic.

I haven't thought much about what these last two features would
look like in Ruby's case.

Clifford Heath.

Thank you,
Hugh

···

On Thu, 27 Jan 2005, Clifford Heath wrote:

Hugh_Sasse · 27 January 2005 12:28

> Thread.new(input){|source|
> $SAFE=5
> instance_eval source
> }.value

Sorry to say this but this is the most common error that I see when
someone try to eval some code with $SAFE >= 4

The code will be eval'ed with $SAFE >= 4 but the result (#value) will be
used with $SAFE = 0 and you can have problems.

The result of #eval must be cleaned with $SAFE >= 4, before it's
returned.

Hi,

I'm having trouble understanding what you mean by this. Can you say it in
Ruby?

My understanding is

               valid_input = Thread.new(input){|source|
                  $SAFE=5
                  result = instance_eval source
                  final_result = validate(result)
                }.value

would be better than:

               result = Thread.new(input){|source|
                  $SAFE=5
                  result = instance_eval source
                }.value
                final_result = validate(result) # OOPS!!!

because the validation takes places while "the safety catch is on"
in the first of these two cases.

Cheers,
Dave

Hugh

···

On Thu, 27 Jan 2005, Dave Burt wrote:

"ts" <decoux@moulon.inra.fr> wrote:

Clifford_Heath4 · 28 January 2005 03:26

Hugh Sasse Staff Elec Eng wrote:

According to "Crafting a Compiler" ISBN 0-8053-3201-4 it seems that
recursive descent implies LL(1)...

No. LL(1) just means that the recursive descent parser may only consider
the current unprocessed token. If the lexer allows perusal of two
unprocessed tokens, your recursive descent parser is LL(2).

Hugh_Sasse · 26 January 2005 19:34

Hi ..

I'll probably stay with the pure ruby one, but I'll certainly look
at this. I suspect that the 1 might be the problem for what I'm
trying to do. I can't remember how LL differs from LR now, but I'll
find that pretty easily.

If you are starting from scratch, then LL(1) usually isn't a problem. If you
can define the grammar in well formed BNF, then LL(1) will be fine. The real
problem comes with feature creep and making sure that you don't hose your
grammar too much

Yes. Just adding material types seems unsurmountable at the moment.

Regards,

thank you
Hugh

···

On Thu, 27 Jan 2005, Mark Probert wrote:

On Wednesday 26 January 2005 10:33, Hugh Sasse Staff Elec Eng wrote:

Hugh_Sasse · 26 January 2005 19:49

Hi Hugh,

What are you trying to parse exactly? Is it full Ruby source or a small
subset?

Not full ruby source, just a mini-language, maybe using Ruby's
parser instead of botching(!) my own...

Is the "mini-language" given or you can build your own "mini-language"?

Build my own.

If you can write your own language, then you should consider writing a
DSL suitable for your problem a la Rails, and then, let the Ruby

I've not grokked Rails yet, but yes, it is Domain Specific. The
problem is not designing the language (because I know the sort of
things I want to say); the problem is how to parse it with some
abstract machine such that I can maintain that machine and improve
it. Also I don't want to open up security holes unneccesarily:
"Deliver us from Eval". One doesn't intend to get hostile users, but
the possibilty is there.

interpreter do the whole job.

Kind Regards,
Ed

Thank you,
Hugh

···

On Thu, 27 Jan 2005, Edgardo Hames wrote:

On Thu, 27 Jan 2005 04:31:05 +0900, Hugh Sasse Staff Elec Eng > <hgs@dmu.ac.uk> wrote:

On Thu, 27 Jan 2005, Trans wrote:

Hugh_Sasse · 27 January 2005 11:59

[Hugh Sasse Staff Elec Eng <hgs@dmu.ac.uk>, 2005-01-26 19.18 CET]

My present example is that I want to parse Constructive Solid
Geometry descriptions, at the moment limited to cones, spheres,

[...]

How about using POV-ray's scene description language?

Here somebody did a yacc parser for that language:
http://www.io.com/~wwagner/pov.html
(skip to header "The Parser"). Maybe you can adapt it to use with racc.

Thanks, I'll take a look at that.

(I don't know about the licence.)

Good luck.

Hugh

···

On Thu, 27 Jan 2005, Carlos wrote:

ts1 · 27 January 2005 13:18

My understanding is

               valid_input = Thread.new(input){|source|
                  $SAFE=5
                  result = instance_eval source
                  final_result = validate(result)
                }.value

Well, a better example is

uln% cat b.rb
#!/usr/local/bin/ruby
source = <<EOT
a = Object.new
class << a
   def to_s
      puts "hello :-)"
      super
   end
end
a
EOT

puts Thread.new { $SAFE = 4; eval source }.value
uln%

uln% b.rb
hello
#<Object:0x2a955c0738>
uln%

i.e. you *MUST* validate the object with $SAFE = 4 and never trust it.

Guy Decoux

Dave_Burt2 · 27 January 2005 14:00

So essentially, the problem is that the result of eval here is tainted
(untrusted input) and needs to be treated with appropriate care. If it was
used at $SAFE=0, the result object would have normal Ruby priveleges. So the
safe thing to do is to only access that result with $SAFE in place, and
maybe to create a new object with valid data taken from the result to use
later.

Thanks for the explanation.

Cheers,
Dave

···

"Hugh Sasse Staff Elec Eng" <hgs@dmu.ac.uk> wrote:

On Thu, 27 Jan 2005, Dave Burt wrote:

"ts" <decoux@moulon.inra.fr> wrote:

> Thread.new(input){|source|
> $SAFE=5
> instance_eval source
> }.value

Sorry to say this but this is the most common error that I see when
someone try to eval some code with $SAFE >= 4

The code will be eval'ed with $SAFE >= 4 but the result (#value) will be
used with $SAFE = 0 and you can have problems.

The result of #eval must be cleaned with $SAFE >= 4, before it's
returned.

Hi,

I'm having trouble understanding what you mean by this. Can you say it in
Ruby?

My understanding is

              valid_input = Thread.new(input){|source|
                 $SAFE=5
                 result = instance_eval source
                 final_result = validate(result)
               }.value

would be better than:

              result = Thread.new(input){|source|
                 $SAFE=5
                 result = instance_eval source
               }.value
               final_result = validate(result) # OOPS!!!

because the validation takes places while "the safety catch is on"
in the first of these two cases.

Mark_Probert · 27 January 2005 19:13

Hi ..

>
> ANTLR does LL(n) for arbitrary n I believe - though you should avoid
> n > 3 or humans start to have trouble parsing your language :-).
>
> It's a shame that the ANTLR folk at Purdue went Java-only when they
> dropped their old C-based implementation. A multi-lingual ANTLR would
> be super-cool, especially if it would generate Ruby.

There is also PCCTS

PCCTS was the predecessor of ANTLR. I haven't played with either for quite
some time but I seem to remember that PCCTS is written in C and could
generate C or C++, whilst ANTLR is in Java and generates Java, C, C++ and C#.
Also, I think that Dr Parr rolled the lexer into ANTLR in release 2 (a la
Coco).

A quick search turned up a few other LL(k) generators, though most seem to be
directed at Java and C#.

but, although very flexible, it is not what I'd call simple.

That's why I liked Coco/R. A single pass lexer/scanner and a "readable"
grammar. Plus it is fast and efficient.

>> As an example, Ruby can not, as far as I have tried, be converted into
>> an LL(1) grammar, though C can.
>
> Not without a tie-in to the lexical analyser to help recognise goto
> labels, which require LL(2). Such a tie-in is commonly used however.
>

For most of my work, the LL(1) restriction hasn't been an issue. The only
problem came when I tried to parse Ruby code for fun. Not that I need to do
that for real, after all, there is eval()

> Such a tool, integrated into the Ruby interpreter itself, would
> allow extension modules to define *Ruby syntax extensions*, so
> that the language itself becomes plastic.
>
> I haven't thought much about what these last two features would
> look like in Ruby's case.

Forth? it's a case of back to the future ..

Regards,

···

On Thursday 27 January 2005 03:56, Hugh Sasse Staff Elec Eng wrote:

On Thu, 27 Jan 2005, Clifford Heath wrote:

--
-mark. (probertm at acm dot org)

tony_summerfelt · 27 January 2005 19:24

Another thing I like about Tcl is that it has _no_ special cases.

i'm not sure you'd define them as 'special cases' but tcl can be
quirky if you come from a language like perl:

set x 1

and to refer to it you use '$' so:

puts $x

but to increment it you'd do:

incr x

ie. no '$' used.

and the comment thing is a bit annoying for uneven braces in the same
line...

Its

That kind of goodies is the things that make Tcl quite bearable to me,
despite lacking a damn lot features

which features is it missing for you?

I still have to work with Tcl because of a GUI made in Tk, but I'm quite
happy with it.

tk is easily one of the quicker ways to crank out a gui for an
application in tcl...

although i'm really starting to like rubywebdialogs gui code...

http://home.cogeco.ca/~tsummerfelt1
telnet://ventedspleen.dyndns.org

···

On Thu, 27 Jan 2005 10:46:38 +0900, you wrote:

Hugh_Sasse · 28 January 2005 10:37

Hugh Sasse Staff Elec Eng wrote:

According to "Crafting a Compiler" ISBN 0-8053-3201-4 it seems that
recursive descent implies LL(1)...

No. LL(1) just means that the recursive descent parser may only consider
the current unprocessed token. If the lexer allows perusal of two
unprocessed tokens, your recursive descent parser is LL(2).

Isn't that just re-framing the problem so that all token pairs are
regarded as single entities? Then at the level of what the parser
processes, it can only take one entity at a time? Tokenizing
generally lumps things together anyway, so /(\d+)/ rather than
/(?:(\d)+)/, therefore having the lexer peruse two tokens feels
like a special case.

Anyway, I hope I haven't misrepresented what the book says.
Hugh

···

On Fri, 28 Jan 2005, Clifford Heath wrote:

Dave_Burt2 · 27 January 2005 14:05

Like this?

Thread.new { $SAFE = 4; puts eval(source) }

Or like this?

puts Thread.new { $SAFE = 4; String.new(eval(source).to_s) }.value

Both of these generate the SecurityError that your example above bypasses by
running at $SAFE=0.

Dave

···

"ts" <decoux@moulon.inra.fr> wrote:

Well, a better example is

uln% cat b.rb
#!/usr/local/bin/ruby
source = <<EOT
a = Object.new
class << a
  def to_s
     puts "hello :-)"
     super
  end
end
a
EOT

puts Thread.new { $SAFE = 4; eval source }.value
uln%

uln% b.rb
hello
#<Object:0x2a955c0738>
uln%

i.e. you *MUST* validate the object with $SAFE = 4 and never trust it.

Topic		Replies	Views
Using ruby for generic language parsing (or any language-specific parsing libraries out there?) ruby-talk	3	182	13 April 2009
Making a simple parser ruby-talk	8	89	19 April 2011
[ANN] redparse 0.8.1 Released ruby-talk	0	138	1 May 2009
Alternative Ruby grammar ruby-talk	22	241	18 November 2007
[ANN] Dhaka-2.0.0 ruby-talk	1	115	21 January 2007

When little languages grow

Related topics