Regexp: How to Find Legitimate Tokens?

Hi,

I am sure questions like this have been posted before, but for some
reason I could not find a specific one by searching the web.

Basically, I want to find a token such as “:” in Ruby source code, but
only where it really is a token, i.e., it is not inside a comment or
inside a string. My questions are:

  1. How do I do it using regexp in Ruby?

  2. In Ruby, are these the only things that can mask the significance of a
    token, or is there anything else?

    • being inside a comment
    • being inside a string

  3. I have tried to read Ruby’s “parse.y”, but it is still too hard for me.
    Is all of Ruby’s syntax (but not its semantics) completely defined by
    this single file, “parse.y”?

  4. Are tools such as lex and yacc more powerful than Ruby in terms of
    their lexing and parsing capabilities? Efficiency aside, is Ruby as
    easy to use for lexing as lex, and as easy to use for parsing as yacc?
    (Probably not, judging by the discussion of a parser tool in Ruby?)

Regards,

Bill
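
For question 1, a rough regexp-only sketch is possible if one assumes that only "..." strings, '...' strings and # comments can hide a token. The idea is to let the regexp consume the masked regions first, so that a colon only matches when nothing else claimed it. The names MASK_OR_COLON and live_colon_offsets are made up for this illustration, and the sketch deliberately ignores heredocs, %q() literals, regexp literals, ?: character literals, =begin/=end blocks and #{...} interpolation.

```ruby
# Assumption: only "..." strings, '...' strings and # comments mask a token.
# Alternatives are tried left to right, so masked regions are consumed whole
# before the bare colon gets a chance to match.
MASK_OR_COLON = /
    "(?:\\.|[^"\\])*"    # double-quoted string, honouring backslash escapes
  | '(?:\\.|[^'\\])*'    # single-quoted string
  | \#[^\n]*             # comment running to the end of the line
  | (?<colon>:)          # a colon that none of the above swallowed
/mx

# Returns the character offsets of every colon outside the masked regions.
def live_colon_offsets(source)
  offsets = []
  source.scan(MASK_OR_COLON) do
    m = Regexp.last_match
    offsets << m.begin(:colon) if m[:colon]
  end
  offsets
end

p live_colon_offsets(%q{x = c ? :a : "b:c" # d: e})
```

Note that "::" is reported as two separate colons here, which is one more reason a real lexer is usually the safer choice.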

Basically, I want to find in a Ruby source code a token, such as ":",
which is really a token, i.e., it is not inside a comment nor inside a
string. My questions are:

Check out ripper: Index of /archive/ripper

It can parse Ruby code. Here we grab the colons and "defs" from
Ruby source:

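#!/usr/local/bin/ruby

require 'ripper'

# Subclass Ripper and override only the event handlers of interest.
class R < Ripper

  # Handler for colon tokens: print the token text.
  def on__COLON( s )
    puts s
  end

  # Handler for "def": print the first argument, then the rest on one line.
  def on__def(d, *rest)
    puts d, (rest.join(" "))
  end

end

# Parse the files named on the command line (or standard input).
R.parse ARGF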

regards,
-joe
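
For readers on a current Ruby, where ripper ships in the standard library and the handler names differ from the ones in the post above, the same idea can be sketched with Ripper.lex. This is only an illustration: the MASKED list and the colon_tokens helper are assumptions made here, and the list of masked token types is probably not exhaustive.

```ruby
require 'ripper'

# Token types whose text is not live syntax (assumed list, not exhaustive).
MASKED = %i[on_comment on_embdoc_beg on_embdoc on_embdoc_end
            on_tstring_content on_heredoc_end on_CHAR].freeze

# Returns [[line, column], type, text] for every live token containing ":".
def colon_tokens(source)
  Ripper.lex(source)
        .reject { |_pos, type, _text| MASKED.include?(type) }
        .select { |_pos, _type, text| text.include?(':') }
        .map { |tok| tok.first(3) }
end

src = <<~'RUBY'
  flag = true
  x = flag ? :a : "b:c"  # note: comment
  h = { key: 1 }
RUBY

colon_tokens(src).each do |pos, type, text|
  puts "#{pos.inspect} #{type} #{text.inspect}"
end
```

On this sample it should report the symbol colon, the ternary colon and the key: label, while the colons inside the string and the comment are skipped.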

One data point.

Given the recent discussion on lexer/parsers for Ruby, I thought that I
would do something I have been meaning to do for a while, port Coco/R to
ruby.

For those who don’t know, Coco/R is an LL(1) recursive descent scanner/parser
generator that takes an attributed EBNF grammar as its input and generates
the scanner, parser and a driver. What I like about Coco/R is that the
attributed grammar is easy to understand, everything is in one place
(no separate lex and yacc files), and it is very fast.

So, I have taken the C/C++ source and started the convert-to-ruby process.
I have the table-driven scanner done and I have to say that ruby is -very-
powerful at doing this stuff. Given that this is a first-pass, the code
is a little ugly and not optimised in -any- way, but it performs pretty
well. The scanner is around 500 lines of commented code. I expect that
to drop. And I have found it easier to write than lex code.

Now on to the parser! Then no more yacc-ing …

-mark.
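
To give a feel for why scanners are comfortable to write in Ruby, here is a much smaller, regexp-driven sketch built on the standard StringScanner class. It is not Mark's table-driven Coco/R scanner; the RULES table, the token type names and the tokenize helper are all hypothetical and exist only for this example.

```ruby
require 'strscan'

# Hypothetical rule table: [token type, pattern]. The first rule that
# matches at the current position wins, so order matters.
RULES = [
  [:skip,   /[ \t\n]+/],               # whitespace
  [:skip,   /--[^\n]*/],               # comment running to end of line
  [:ident,  /[A-Za-z][A-Za-z0-9]*/],
  [:number, /\d+/],
  [:op,     /[-+*\/();]/]
].freeze

# Returns an array of [type, text] pairs, dropping skipped input.
def tokenize(text)
  ss = StringScanner.new(text)
  tokens = []
  until ss.eos?
    type, _pattern = RULES.find { |_type, pattern| ss.scan(pattern) }
    raise "unexpected #{ss.peek(1).inspect} at offset #{ss.pos}" unless type
    tokens << [type, ss.matched] unless type == :skip
  end
  tokens
end

p tokenize("a1 + 42; -- note\n(7)")
```

Because the comment rule is listed before the operator rule, "--" starts a comment instead of being read as two minus signs.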

At 11:44 PM 9/23/2002 +0900, Bill wrote:

[snip]
4) Are tools such as lex and yacc more powerful than Ruby in terms of
their lexing and parsing capabilities? Without regard to efficiency, is
to use Ruby as easy as to use lex? Is to use Ruby as easy as to use
yacc? (Probably no, such as shown by the discussion on a parser tool in
Ruby?)

Hi,

Thanks to all who replied to me directly via email. It seems that none of
the answers is trivial.

Regards,

Bill

In article 5.1.0.14.2.20020923123544.04fa7ee8@zcard04k.ca.nortel.com,
Mark Probert probertm@nortelnetworks.com wrote:

[snip]

Cool. I also don’t like having a separate lexer and parser (as with lex
and yacc), so this sounds interesting.

Can we take a look at what you’ve got so far?

Phil

At 05:39 AM 9/24/2002 +0900, Phil wrote:

Can we take a look at what you’ve got so far?

There is not too much to show at the moment. I scan a ruby-ised version
of the attributed grammar into tokens. That is the -easy- part :-). I think
that in a few weeks I will have something “real” to share. There is a lot
of fine print in working out the parser…

FWIW, a simple Coco/R grammar looks like this (it is the expr example from
the Dragon book):

--------< expr.atg >----------------

$C # generate the driver
COMPILER Expr

# This is a simple expression calculator

#----------------- Scanner Specifications ----------------------

CHARACTERS
letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".
digit = "0123456789".
tab = CHR(9).
eol = CHR(10).

TOKENS
ident = letter {letter|digit}.
number = digit { digit }.

IGNORE eol+tab

COMMENTS FROM "--" TO eol

#--------------------Parser Specification -------------------

PRODUCTIONS
Expr = StatSeq .

 StatSeq = SYNC { Stat ";" SYNC} .

 Stat =
    Expression<n>            (. puts "#{n}" .)
   .

 Expression<n> =
    Term<a>
    {   "+" Term<b>          (. a += b .)
      | "-" Term<b>          (. a -= b .)
    }
    .

 Term<n> =
    Factor<a>
    {   '*' Factor<b>        (. a *= b .)
      | '/' Factor<b>        (. a /= b .)
    }
    .

 Factor<n> =                 (. VAR sign =  1 .)
   [ "-"                     (.     sign = -1 .)
   ]
   (   Number<n>
     | "(" Expression<n> ")"
   )                         (. n *= sign .)
   .

Number<n>
     =
     number                  (. n = scanner.get_string.to_i .)
     .

Ident<s>
     =
     ident                   (. s = scanner.get_string .)
     .

END Expr.
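
For comparison, the productions above map fairly directly onto a hand-written recursive descent parser, with one method per production. The following is only a sketch of that correspondence, not output from Coco/R and not Mark's port: the ExprParser class and its method names are invented here, it returns the statement values instead of printing them, and characters the tokenizer does not recognise are silently dropped.

```ruby
# Hand-written recursive descent for the Expr grammar (illustrative only).
class ExprParser
  TOKEN = /\d+|[-+*\/();]/

  def initialize(text)
    # Strip "--" comments, then split into numbers and punctuation.
    @tokens = text.gsub(/--[^\n]*/, '').scan(TOKEN)
  end

  # StatSeq = { Stat ";" } : returns the value of each statement.
  def stat_seq
    results = []
    until @tokens.empty?
      results << expression
      expect(';')
    end
    results
  end

  private

  # Expression = Term { ("+" | "-") Term }
  def expression
    value = term
    while ['+', '-'].include?(@tokens.first)
      op = @tokens.shift
      rhs = term
      value = op == '+' ? value + rhs : value - rhs
    end
    value
  end

  # Term = Factor { ("*" | "/") Factor }   (integer division)
  def term
    value = factor
    while ['*', '/'].include?(@tokens.first)
      op = @tokens.shift
      rhs = factor
      value = op == '*' ? value * rhs : value / rhs
    end
    value
  end

  # Factor = [ "-" ] ( number | "(" Expression ")" )
  def factor
    sign = 1
    if @tokens.first == '-'
      @tokens.shift
      sign = -1
    end
    if @tokens.first == '('
      @tokens.shift
      value = expression
      expect(')')
    else
      tok = @tokens.shift
      raise "number expected, got #{tok.inspect}" unless /\A\d+\z/.match?(tok.to_s)
      value = tok.to_i
    end
    value * sign
  end

  def expect(symbol)
    tok = @tokens.shift
    raise "expected #{symbol.inspect}, got #{tok.inspect}" unless tok == symbol
  end
end

p ExprParser.new("1 + 2 * 3; -(4 - 6) * 5;").stat_seq
```

Each grammar rule needs only one token of lookahead (@tokens.first) to decide what to do next, which is the LL(1) property mentioned earlier in the thread.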