Regexp: How to Find Legitimate Tokens?

William_Djaja_Tjokr1 · 23 September 2002 14:44

Hi,

I am sure questions like these must have been posted before, but for some
reason I could not find any specific ones by searching the web.

Basically, I want to find a token in Ruby source code, such as “:”,
that is really a token, i.e., it is neither inside a comment nor inside a
string. My questions are:

1) How do I do it using regexp in Ruby?

2) In Ruby, are the only things that can mask the significance of a token:
   - inside comment
   - inside string
   ? Or anything else?

3) I have tried to read Ruby’s “parse.y”, but it is still too hard. Is
   all the Ruby syntax (but not semantics) completely defined by this single
   file “parse.y”?

4) Are tools such as lex and yacc more powerful than Ruby in terms of
   their lexing and parsing capabilities? Without regard to efficiency, is
   to use Ruby as easy as to use lex? Is to use Ruby as easy as to use
   yacc? (Probably no, such as shown by the discussion on a parser tool in
   Ruby?)

Regards,

Bill
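
A regexp-only approach to the first question can be sketched with one common trick: match strings and comments first so they are consumed, and capture the token only in the last alternative. This is a minimal sketch, not a full answer. The names SKIP_OR_COLON and colon_offsets are made up for illustration, and the pattern knows nothing about heredocs, %-literals, regexp literals, "#{...}" interpolation or =begin/=end blocks, all of which can also hide a ":".

```ruby
# Alternatives are tried left to right at each position, so string and
# comment text is consumed before the bare ":" branch can see it.
SKIP_OR_COLON = /
    "(?:\\.|[^"\\])*"     # double-quoted string, interpolation not parsed
  | '(?:\\.|[^'\\])*'     # single-quoted string
  | \#[^\n]*              # comment running to the end of the line
  | (:)                   # group 1: a colon outside all of the above
/x

# Return the character offsets of every ":" that the pattern considers live.
def colon_offsets(source)
  offsets = []
  source.scan(SKIP_OR_COLON) do
    m = Regexp.last_match
    offsets << m.begin(1) if m[1]   # only the last branch fills group 1
  end
  offsets
end

p colon_offsets(%q{x = c ? "a:b" : 1 # c : d})
```

Anything beyond simple cases is better served by a real lexer, since Ruby's literal syntax cannot be fully described by a single regular expression.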

Joseph_McDonald · 23 September 2002 16:42

Basically, I want to find a token in Ruby source code, such as ":",
that is really a token, i.e., it is neither inside a comment nor inside a
string. My questions are:

Check out ripper: Index of /archive/ripper

It can parse ruby code. Here we grab the colons and "defs" from
ruby source:

#!/usr/local/bin/ruby

require 'ripper'

class R < Ripper

  def on__COLON( s )
    puts s
  end

  def on__def(d, *rest)
    puts d, (rest.join(" "))
  end

end
R.parse ARGF

regards,
-joe
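
For readers on a current Ruby: Ripper now ships in the standard library, and its event names differ from the 2002 code above. A minimal sketch of the same idea using Ripper.lex follows. The helper name live_tokens and the MASKED list are assumptions made for this example, and the list of masked token types is probably not exhaustive.

```ruby
require 'ripper'

# Token types whose text is not live code: comments, =begin/=end blocks,
# the bodies of string-like literals, ?x character literals and __END__.
MASKED = %i[on_comment on_embdoc_beg on_embdoc on_embdoc_end
            on_tstring_content on_CHAR on___end__].freeze

# Return [position, type, text] for every live token containing `text`.
def live_tokens(source, text)
  Ripper.lex(source)
        .reject { |_pos, type, _tok| MASKED.include?(type) }
        .select { |_pos, _type, tok| tok.include?(text) }
        .map    { |pos, type, tok| [pos, type, tok] }
end

p live_tokens(%(x = cond ? "a:b" : 1 # c : d\n), ':')
```

Note that a ":" can surface under several token types (for example as an operator, as the start of a symbol, or as part of a hash label), so the caller still has to decide which of those count as "the" token.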

Mark_Probert2 · 23 September 2002 16:48

One data point.

Given the recent discussion on lexer/parsers for Ruby, I thought that I
would do something I have been meaning to do for a while, port Coco/R to
ruby.

For those who don’t know, Coco/R is an LL(1) recursive descent scanner/parser
generator that basically takes an attributed EBNF as its input and generates
the scanner, parser and a driver. The things that I like about Coco/R are that
the attributed grammar is easy to understand, everything is in one place
(no separate lex and yacc files), and it is very fast.

So, I have taken the C/C++ source and started the convert-to-ruby process.
I have the table-driven scanner done and I have to say that ruby is -very-
powerful at doing this stuff. Given that this is a first-pass, the code
is a little ugly and not optimised in -any- way, but it performs pretty
well. The scanner is around 500 lines of commented code. I expect that
to drop. And I have found it easier to write than lex code.

Now on to the parser! Then no more yacc-ing …

-mark.

At 11:44 PM 9/23/2002 +0900, Bill wrote:

[snip]
4) Are tools such as lex and yacc more powerful than Ruby in terms of
their lexing and parsing capabilities? Without regard to efficiency, is
to use Ruby as easy as to use lex? Is to use Ruby as easy as to use
yacc? (Probably no, such as shown by the discussion on a parser tool in
Ruby?)
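
To give a feel for how little code a small scanner needs in Ruby, here is a minimal sketch built on StringScanner. It is driven by a table of regexps, not by the state-transition tables a Coco/R scanner uses, and TOKEN_TABLE, scan_tokens and the token names are assumptions made for this example.

```ruby
require 'strscan'

# Ordered table of [token type, pattern]; the first pattern that matches wins.
TOKEN_TABLE = [
  [:ident,  /[A-Za-z][A-Za-z0-9]*/],
  [:number, /\d+/],
  [:op,     %r{[-+*/();]}]
].freeze

# Split text into [type, lexeme] pairs, skipping whitespace.
def scan_tokens(text)
  ss = StringScanner.new(text)
  tokens = []
  until ss.eos?
    next if ss.skip(/\s+/)
    type, = TOKEN_TABLE.find { |_, pattern| ss.scan(pattern) }
    raise ArgumentError, "unexpected character at offset #{ss.pos}" unless type
    tokens << [type, ss.matched]
  end
  tokens
end

p scan_tokens("x1 + 42")
```

Adding a token kind means adding one row to the table, which is roughly the convenience a lex specification offers.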

William_Djaja_Tjokr1 · 23 September 2002 17:21

Hi,

Thanks to all who replied to me directly via email. It seems that none of
the answers is trivial.

Regards,

Bill

Ptkwt1 · 23 September 2002 20:39

In article <5.1.0.14.2.20020923123544.04fa7ee8@zcard04k.ca.nortel.com>,
Mark Probert <probertm@nortelnetworks.com> wrote:

At 11:44 PM 9/23/2002 +0900, Bill wrote:

[snip]
4) Are tools such as lex and yacc more powerful than Ruby in terms of
their lexing and parsing capabilities? Without regard to efficiency, is
to use Ruby as easy as to use lex? Is to use Ruby as easy as to use
yacc? (Probably no, such as shown by the discussion on a parser tool in
Ruby?)

One data point.

Given the recent discussion on lexer/parsers for Ruby, I thought that I
would do something I have been meaning to do for a while, port Coco/R to
ruby.

For those who don’t know, Coco/R is an LL(1) recursive descent scanner/parser
generator that basically takes an attributed EBNF as its input and generates
the scanner, parser and a driver. The things that I like about Coco/R are that
the attributed grammar is easy to understand, everything is in one place
(no separate lex and yacc files), and it is very fast.

So, I have taken the C/C++ source and started the convert-to-ruby process.
I have the table-driven scanner done and I have to say that ruby is -very-
powerful at doing this stuff. Given that this is a first-pass, the code
is a little ugly and not optimised in -any- way, but it performs pretty
well. The scanner is around 500 lines of commented code. I expect that
to drop. And I have found it easier to write than lex code.

Now on to the parser! Then no more yacc-ing …

Cool. I also don’t like having a separate lexer and parser (like using lex
and yacc), so this sounds interesting.

Can we take a look at what you’ve got so far?

Phil

Mark_Probert2 · 24 September 2002 14:11

At 05:39 AM 9/24/2002 +0900, Phil wrote:

Given the recent discussion on lexer/parsers for Ruby, I thought that I
would do something I have been meaning to do for a while, port Coco/R to
ruby.

Cool. I also don’t like having a separate lexer and parser (like using lex
and yacc), so this sounds interesting.

Can we take a look at what you’ve got so far?

There is not too much to show at the moment. I scan a ruby-ised version
of the attributed grammar into tokens. That is the -easy- part :-). I think
that in a few weeks I will have something “real” to share. There is a lot
of fine-print in working out the parser…

FWIW, a simple Coco/R grammar looks like (this is the expr example from
the Dragon book):

--------< expr.atg >----------------

$C # generate the driver
COMPILER Expr

# This is a simple expression calculator

#----------------- Scanner Specifications ----------------------

CHARACTERS
letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".
digit = "0123456789".
tab = CHR(9).
eol = CHR(10).

TOKENS
ident = letter {letter|digit}.
number = digit { digit }.

IGNORE eol+tab

COMMENTS FROM "--" TO eol

#--------------------Parser Specification -------------------

PRODUCTIONS
Expr = StatSeq .

 StatSeq = SYNC { Stat ";" SYNC} .

 Stat =
    Expression<n>            (. puts "#{n}" .)
   .

 Expression<n> =
    Term<a>
    {   "+" Term<b>          (. a += b .)
      | "-" Term<b>          (. a -= b .)
    }
    .

 Term<n> =
    Factor<a>
    {   '*' Factor<b>        (. a *= b .)
      | '/' Factor<b>        (. a /= b .)
    }
    .

 Factor<n> =                 (. VAR sign =  1 .)
   [ "-"                     (.     sign = -1 .)
   ]
   (   Number<n>
     | "(" Expression<n> ")"
   )                         (. n *= sign .)
   .

 Number<n>
     =
     number                  (. n = scanner.get_string.to_i .)
     .

 Ident<s>
     =
     ident                   (. s = scanner.get_string .)
     .

END Expr.
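
For comparison, the productions above map almost one-to-one onto a hand-written recursive descent evaluator in plain Ruby. The sketch below is not Coco/R output. The class name ExprCalc and its method names are hypothetical, it returns the statement values instead of printing them, and it handles integers only.

```ruby
require 'strscan'

# Hand-written LL(1) recursive descent evaluator, one method per production.
class ExprCalc
  def initialize(text)
    @s = StringScanner.new(text)
  end

  # StatSeq = { Stat ";" } .  Returns the value of each statement.
  def stat_seq
    results = []
    until eos?
      results << expression
      expect(/;/)
    end
    results
  end

  private

  def eos?
    @s.skip(/\s+/)
    @s.eos?
  end

  # Consume the pattern if it comes next (after blanks); return it or nil.
  def accept(pattern)
    @s.skip(/\s+/)
    @s.scan(pattern)
  end

  def expect(pattern)
    accept(pattern) or
      raise ArgumentError, "expected #{pattern.inspect} at offset #{@s.pos}"
  end

  # Expression = Term { "+" Term | "-" Term } .
  def expression
    a = term
    while (op = accept(/[+-]/))
      b = term
      a = op == '+' ? a + b : a - b
    end
    a
  end

  # Term = Factor { "*" Factor | "/" Factor } .
  def term
    a = factor
    while (op = accept(%r{[*/]}))
      b = factor
      a = op == '*' ? a * b : a / b
    end
    a
  end

  # Factor = [ "-" ] ( number | "(" Expression ")" ) .
  def factor
    sign = accept(/-/) ? -1 : 1
    n = if accept(/\(/)
          value = expression
          expect(/\)/)
          value
        else
          expect(/\d+/).to_i
        end
    n * sign
  end
end

p ExprCalc.new("1 + 2 * 3; -(4 - 6) / 2;").stat_seq
```

Each loop decides what to do from the next token alone, which is the LL(1) property the grammar relies on.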