Regular expression question

Jesse_Brown · 13 September 2006 15:55

In trying to parse a C source file I have the following section of
code:

   ...
   ...
case line
when /^.*\/\*.*?\*\/.*$/ # single line comment(s)
   non_comments = line.split(/\/\*.*?\*\//).to_s
   process_code(non_comments)
when /^.*\/\*\*?[^(\*\/)]*$/ # multi-line start
   comment = true
   next
when /^[^(\/\*)]*\*\/.*$/ # multi-line end
   comment = false
   ...
   ...

I am running into a problem with the multi-line comment sections.
While something like:

/*
A comment
*/

will work (i.e. gets properly parsed out)

/* A
* comment */

OR

/* A *
* comment */

will not.
My guess is that it is because of the [^(\*\/)] construct blocking the
leading or trailing '*' character. However, I thought that by placing
the \*\/ within parentheses I avoided the characters being evaluated
individually.
Is there a way to look for the pattern '*/' without having a single '*'
break the search?
As an alternative, I use this:

when /^.*\/\*\*?[^(\*\/)]*\**?$/
   comment = true
   next
when /^.*?[^(\/\*)]*\*\/.*$/
   comment = false

Which *seems* to solve the problem, but I can see where it is brittle:

/* A * comment
for * instance */

Any suggestions?
Thanks in advance.

Jeff · 13 September 2006 16:17

L7 wrote:

In trying to parse a C source file I have the following section of
code:

   ...
   ...
case line
when /^.*\/\*.*?\*\/.*$/ # single line comment(s)
   non_comments = line.split(/\/\*.*?\*\//).to_s
   process_code(non_comments)
when /^.*\/\*\*?[^(\*\/)]*$/ # multi-line start
   comment = true
   next
when /^[^(\/\*)]*\*\/.*$/ # multi-line end
   comment = false
   ...
   ...

I am running into a problem with the multi-line comment sections.

My eyes glaze over with these kinds of expressions, but this might help:

http://www.regularexpressions.info/examplesprogrammer.html

Scroll down to the section on "Comments". They seem to have a simpler
solution; I think the trick is to be able to use . as matching newlines.

And you can turn on newline matching in Ruby by putting an "m" after the
expression:

/my_pattern_here/m
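
A quick sketch of what the "m" flag changes, using a made-up two-line comment (the sample text and variable names are assumptions for illustration only). Without the flag, . stops at the newline; with it, the lazy .*? can run on to the closing */:

```ruby
# Hypothetical sample text: one block comment spanning two lines.
text = "int x; /* A\n * comment */ int y;"

block_comment   = %r{/\*.*?\*/}   # '.' does not match "\n" here
block_comment_m = %r{/\*.*?\*/}m  # with /m, '.' matches "\n" as well

no_flag_match = text[block_comment]    # nil, because the comment crosses a newline
m_flag_match  = text[block_comment_m]  # the whole comment, newline included

puts no_flag_match.inspect
puts m_flag_match.inspect
```

Note that this only helps when the whole text is in one string; it does nothing for a loop that sees one line at a time.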

Hope this helps...?

Jeff
softiesonrails.com

···

--
Posted via http://www.ruby-forum.com/.

Francis_Cianfrocca · 13 September 2006 16:40

Remember that in C, nested comment-blocks are not permitted, for the
incredibly good reason that they are not recognizable by
regular-expressions ;-). Why don't you take a pre-pass through your C
file and take out the comments yourself before you run your main
parse? A recursive-descent parser to do the job would probably take
almost no code at all in Ruby.
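
A minimal sketch of such a pre-pass, assuming a small hand-written scanner built on Ruby's StringScanner is acceptable in place of a full recursive-descent parser. The method name strip_c_comments is hypothetical, and trigraphs and backslash-newline continuations are deliberately ignored:

```ruby
require 'strscan'

# Sketch only: removes /* ... */ and // comments from C source while
# copying string and character literals through untouched.
def strip_c_comments(source)
  scanner = StringScanner.new(source)
  output = +""
  until scanner.eos?
    if scanner.scan(/"(?:\\.|[^"\\])*"/m)      # string literal: copy verbatim
      output << scanner.matched
    elsif scanner.scan(/'(?:\\.|[^'\\])*'/m)   # character literal: copy verbatim
      output << scanner.matched
    elsif scanner.scan(%r{/\*.*?\*/}m)         # block comment: becomes one space
      output << " "
    elsif scanner.scan(%r{//[^\n]*})           # line comment: dropped up to the newline
      next
    else
      output << scanner.getch                  # anything else is ordinary code
    end
  end
  output
end

tricky = "// /*\nif (strcmp(x, \"*/\")) return 1;\n// \"*/\n"
puts strip_c_comments(tricky)
```

Because literals are consumed before comments are looked for, a "*/" inside a string is never mistaken for the end of a comment.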

···

On 9/13/06, L7 <jesse.r.brown@gmail.com> wrote:

In trying to parse a C source file I have the following section of
code:

Rod_Knowlton · 13 September 2006 23:38

In trying to parse a C source file I have the following section of
code:

   ...
case line
when /^.*\/\*.*?\*\/.*$/ # single line comment(s)
   non_comments = line.split(/\/\*.*?\*\//).to_s
   process_code(non_comments)
when /^.*\/\*\*?[^(\*\/)]*$/ # multi-line start
   comment = true
   next
when /^[^(\/\*)]*\*\/.*$/ # multi-line end
   comment = false
   ...

Is there a way to look for the pattern '*/' without having a single '*'
break the search?

If I'm not mistaken, what you need is a negative lookahead.

Try /^.*\/\*([^\/]|\/(?!\*))*$/ for multi-line start

and /^([^\*]|\*(?!\/))*\*\/.*$/ for multi-line end.

The key difference (from the start pattern) is ([^\/]|\/(?!\*)).

This breaks down like so:

(
[^\/] # anything but /

# or

\/(?!\*) # a / not followed by an * (don't eat the character after /, just peek at it)
)

The pattern for multi-line end uses the same technique, but with the characters reversed.

I'm sure this isn't the be-all and end-all of C comment matching regexes, but it handles all of the cases you described.
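
A small sketch that runs the patterns from the question and the lookahead patterns against the sample lines from the question. The constant and variable names are made up for the comparison. It also checks the underlying point: inside [...] parentheses are ordinary characters, so [^(\*\/)] excludes '(', '*', '/' and ')' individually rather than the pair */.

```ruby
# Patterns from the original question.
ORIGINAL_START  = /^.*\/\*\*?[^(\*\/)]*$/
ORIGINAL_END    = /^[^(\/\*)]*\*\/.*$/
# Lookahead patterns suggested in this reply.
LOOKAHEAD_START = /^.*\/\*([^\/]|\/(?!\*))*$/
LOOKAHEAD_END   = /^([^\*]|\*(?!\/))*\*\/.*$/

# Inside [...] the parentheses are literal: the class rejects '(' itself.
paren_is_excluded = !"(".match?(/[^(\*\/)]/)

start_lines = ["/* A", "/* A *", "/* A * comment"]
end_lines   = ["* comment */", "for * instance */"]

start_results = start_lines.map do |line|
  [line, ORIGINAL_START.match?(line), LOOKAHEAD_START.match?(line)]
end
end_results = end_lines.map do |line|
  [line, ORIGINAL_END.match?(line), LOOKAHEAD_END.match?(line)]
end

(start_results + end_results).each do |line, original, lookahead|
  puts "#{line.inspect}: original=#{original} lookahead=#{lookahead}"
end
```

The lookahead start pattern also matches a complete single-line comment, so it still relies on the single-line branch being tested first in the case statement.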

- Rod

···

On Sep 13, 2006, at 10:55 AM, L7 wrote:

Jesse_Brown · 13 September 2006 16:35

Jeff Cohen wrote:

My eyes glaze over with these kinds of expressions, but this might help:

http://www.regularexpressions.info/examplesprogrammer.html

Scroll down to the section on "Comments". They seem to have a simpler
solution; I think the trick is to be able to use . as matching newlines.

I don't think it applies to this directly. I didn't explicitly mention it,
but the processing is happening on a line-by-line basis. In order to
remove all commenting in the above manner I would first have to read
the file as a string, strip the comments, split on newlines, then parse
the code.

···

And you can turn on newline matching in Ruby by putting an "m" after the
expression:

/my_pattern_here/m

Hope this helps...?

Jeff
softiesonrails.com

--
Posted via http://www.ruby-forum.com/.

Jesse_Brown · 13 September 2006 16:55

Francis Cianfrocca wrote:

> In trying to parse a C source file I have the following section of
> code:
>
Remember that in C, nested comment-blocks are not permitted, for the
incredibly good reason that they are not recognizable by
regular-expressions ;-).

Agreed. However, something with '*' characters in it is allowed (so
long as they are not preceded or followed directly by '/') and that is
where I would get clobbered.

Why don't you take a pre-pass through your C
file and take out the comments yourself before you run your main

As I mentioned, that involved a bit of overhead. But with regard to the
project, I assume it is the 'best fix' to what I have.

···

On 9/13/06, L7 <jesse.r.brown@gmail.com> wrote:

Paul_Lutus · 13 September 2006 17:30

L7 wrote:

/ ...

I don't think it applies to this directly. I didn't explicitly mention it,
but the processing is happening on a line-by-line basis. In order to
remove all commenting in the above manner I would first have to read
the file as a string, strip the comments, split on newlines, then parse
the code.

Yes, and that may be the best way to approach this problem. There are a
number of problems where reading the entire file and processing it as a
long string is the best (one is tempted to say the only) way to proceed.

If you don't read the entire file, then you are obliged to carry more state
information in your algorithm between lines. IMHO, it is much better to
eliminate multiline comments in one go than to construct and maintain state
flags for this and any other contingencies that may carry over between
lines.

Obviously this poses practical problems for huge source files, but, again
IMHO, huge source files should not exist anyway -- they should be broken up
into manageable chunks.
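
A minimal sketch of the whole-string approach, assuming the source has already been read into one string (for example with File.read on some path). The method name strip_block_comments is hypothetical, and the sketch does not protect string literals that happen to contain /* or */:

```ruby
# Sketch: remove every /* ... */ comment from a whole source string at once.
# Each comment becomes a single space plus the newlines it contained, so
# adjacent tokens stay separated and later line numbers still line up.
def strip_block_comments(source)
  source.gsub(%r{/\*.*?\*/}m) do |comment|
    " " + "\n" * comment.count("\n")
  end
end

source = "int a; /* A *\n * comment */\nint b; /* one line */\n"
stripped = strip_block_comments(source)

# The existing per-line processing can then run on comment-free lines.
stripped.each_line { |line| puts line }
```

No comment flag has to be carried between lines, which is the state the line-by-line version was struggling with.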

···

--
Paul Lutus
http://www.arachnoid.com

Forum · 13 September 2006 20:49

I am intrigued. I believe that a regular expression to find all comments
in C must be very complex, and is probably not the correct tool. Look at
these snippets:

// /*
if(strcmp(x,"*/")
// "*/
etc. etc.

BTW, I cannot find a reason why the job cannot be done by a regular
expression, but that does not mean it can.
Robert

···

On 9/13/06, Paul Lutus <nospam@nosite.zzz> wrote:

L7 wrote:

/ ...

> I don't think it applies to this directly. I didn't explicitly mention it,
> but the processing is happening on a line-by-line basis. In order to
> remove all commenting in the above manner I would first have to read
> the file as a string, strip the comments, split on newlines, then parse
> the code.

Yes, and that may be the best way to approach this problem. There are a
number of problems where reading the entire file and processing it as a
long string is the best (one is tempted to say the only) way to proceed.

If you don't read the entire file, then you are obliged to carry more
state
information in your algorithm between lines. IMHO, it is much better to
eliminate multiline comments in one go than to construct and maintain
state
flags for this and any other contingencies that may carry over between
lines.

Obviously this poses practical problems for huge source files, but, again
IMHO, huge source files should not exist anyway -- they should be broken
up
into manageable chunks.

--
Paul Lutus
http://www.arachnoid.com

--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein

Tom_Copeland · 15 September 2006 02:15

I'm not sure if it's impossible to parse out C-style comments using a
regular expression, but the various JavaCC grammars I've seen all use
lexical states to do it instead. Another complication is trigraphs (*),
although I think those are unrecognized by default in most C
preprocessors.

Yours,

Tom

(*) Digraphs and trigraphs - Wikipedia

···

On Thu, 2006-09-14 at 05:49 +0900, Robert Dober wrote:

On 9/13/06, Paul Lutus <nospam@nosite.zzz> wrote:
I am intrigued. I believe that a regular expression to find all comments
in C must be very complex, and is probably not the correct tool. Look at
these snippets:

// /*
if(strcmp(x,"*/")
// "*/
etc. etc.

Francis_Cianfrocca · 15 September 2006 02:22

A C-style block comment can indeed be recognized by a regex. In fact, that's
how lexical states are generally invoked in scanners generated by tools like
flex. However, a *nested* C-style block comment cannot be detected by a
regex. A (theoretical) language supporting such a construct would be a
context-free language, not a regular language.
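
One commonly cited form of such a regex, shown here as a sketch rather than as what flex itself generates. It uses only character classes, concatenation, grouping and repetition (no lookahead, no lazy quantifiers), so it is regular in the formal sense. The constant name and sample text are assumptions:

```ruby
# "/*", a run of non-stars, then one or more stars; after that, repeat
# (one character that is neither '/' nor '*', non-stars, stars) until a
# run of stars is followed directly by '/'.
BLOCK_COMMENT = %r{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}

source = "a = 1; /* A * comment\nfor * instance */ b = 2; /* second **/ c = 3;"
comments = source.scan(BLOCK_COMMENT)

comments.each { |comment| puts comment.inspect }
```

Negated character classes match newlines, so no "m" flag is needed. Like any plain comment regex, it knows nothing about string literals.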

···

On 9/14/06, Tom Copeland <tom@infoether.com> wrote:

I'm not sure if it's impossible to parse out C-style comments using a
regular expression, but the various JavaCC grammars I've seen all use
lexical states to do it instead. Another complication is trigraphs (*),
although I think those are unrecognized by default in most C
preprocessors.

Francis_Cianfrocca · 15 September 2006 02:32

One more point. Someone upthread gave an example similar to this:

/* printf ("*/"); */

Considered strictly as a lexical construction, I think this is regular.
However, I have a funny feeling that this:

/* printf ("/*......*/"); */

is actually context-free. Does anyone know for sure?

···

On 9/14/06, Tom Copeland <tom@infoether.com> wrote:

I'm not sure if it's impossible to parse out C-style comments using a
regular expression, but the various JavaCC grammars I've seen all use
lexical states to do it instead. Another complication is trigraphs (*),
although I think those are unrecognized by default in most C
preprocessors.

Logan_Capaldo · 15 September 2006 02:43

> I'm not sure if it's impossible to parse out C-style comments using a
> regular expression, but the various JavaCC grammars I've seen all use
> lexical states to do it instead. Another complication is trigraphs (*),
> although I think those are unrecognized by default in most C
> preprocessors.

One more point. Someone upthread gave an example similar to this:

/* printf ("*/"); */

Pretty sure this would end up being a syntax error.

Considered strictly as a lexical construction, I think this is regular.
However, I have a funny feeling that this:

/* printf ("/*......*/"); */

This too.

gcc agrees with me at least:

% cat comments.c
#include <stdio.h>

int main(int argc, char **argv) {
  /* printf("*/"); */
  /* printf("/*.......*/"); */
  return 0;
}
% gcc -c comments.c
comments.c: In function 'main':
comments.c:4: error: missing terminating " character
comments.c:5: error: missing terminating " character

is actually context-free. Does anyone know for sure?

As for whether or not it's context-free, I don't know, but I think you
overestimated how hard C tries. /* */ are not nestable, for instance.

···

On Fri, Sep 15, 2006 at 11:32:33AM +0900, Francis Cianfrocca wrote:

On 9/14/06, Tom Copeland <tom@infoether.com> wrote:

Daniel_Martin · 15 September 2006 13:30

"Francis Cianfrocca" <garbagecat10@gmail.com> writes:

One more point. Someone upthread gave an example similar to this:

/* printf ("*/"); */

Considered strictly as a lexical construction, I think this is regular.
However, I have a funny feeling that this:

/* printf ("/*......*/"); */

is actually context-free. Does anyone know for sure?

So you want to know if a grammar is regular or not? Sounds like you
need the Myhill-Nerode theorem
(http://en.wikipedia.org/wiki/Myhill-Nerode_theorem).

And according to that, a language that allows arbitrary nesting of
comment expressions like this is indeed not regular, and therefore not
parseable with regular expressions as traditionally defined in
computer science. To parse arbitrarily nested constructs you either
need something like Perl's evaluate-code-at-regexp-match-time feature
(which so far as I know exists in no other language), or an actual
grammar (or anything else that can get as computationally complicated
as a pushdown automaton).
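
A minimal sketch of the extra machinery nesting requires, for a hypothetical language (not C) in which /* ... */ comments nest. The depth counter stands in for the pushdown automaton's stack, and the method name strip_nested_comments is made up:

```ruby
require 'strscan'

# Sketch: drop arbitrarily nested /* ... */ comments by counting depth.
# A fixed regular expression cannot keep this count for unbounded nesting.
def strip_nested_comments(source)
  scanner = StringScanner.new(source)
  output = +""
  depth = 0
  until scanner.eos?
    if scanner.scan(%r{/\*})
      depth += 1                       # entering one more comment level
    elsif depth > 0 && scanner.scan(%r{\*/})
      depth -= 1                       # leaving the innermost level
    else
      char = scanner.getch
      output << char if depth.zero?    # keep text only outside all comments
    end
  end
  raise ArgumentError, "unterminated comment" if depth > 0
  output
end

puts strip_nested_comments("a /* outer /* inner */ still comment */ b")
```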

···

--
s=%q( Daniel Martin -- martin@snowplow.org
puts "s=%q(#{s})",s.map{|i|i}[1] )
puts "s=%q(#{s})",s.map{|i|i}[1]

Francis_Cianfrocca · 15 September 2006 02:51

I know these are syntax errors in C. I was talking about a hypothetical
language (not C) that defined such constructs as legal. I'm still not sure
that it's impossible to use a regular language to generate this case:
/* "*/ */
I'm pretty convinced that the other case requires a context-free language.

···

On 9/14/06, Logan Capaldo <logancapaldo@gmail.com> wrote:

gcc agrees with me at least:

% cat comments.c
#include <stdio.h>

int main(int argc, char **argv) {
  /* printf("*/"); */
  /* printf("/*.......*/"); */
  return 0;
}
% gcc -c comments.c
comments.c: In function 'main':
comments.c:4: error: missing terminating " character
comments.c:5: error: missing terminating " character

>
> is actually context-free. Does anyone know for sure?
As for whether or not it's context-free, I don't know, but I think you
overestimated how hard C tries. /* */ are not nestable, for instance.

Logan_Capaldo · 15 September 2006 12:45

Well, for empirical evidence one could look at ML. (* comments (* are *)
nestable *).

···

On Fri, Sep 15, 2006 at 11:51:16AM +0900, Francis Cianfrocca wrote:

I know these are syntax errors in C. I was talking about a hypothetical
language (not C) that defined such constructs as legal. I'm still not sure
that it's impossible to use a regular language to generate this case:
/* "*/ */
I'm pretty convinced that the other case requires a context-free language.
