Regexp help sought

Roy · 24 February 2005 16:50

Hey All,

I'm trying to parse lines from my text editor's config file, which look
like this (pls watch for line wrap--there is one line per language,
starting with /L<<digit>>):

/L1"SAS" Line Comment = * Block Comment On = /* Block Comment Off = */
Block Comment On Alt = * Block Comment Off Alt = ; Nocase File
Extensions = SAS
/L2"Visual Basic" Line Comment = ' File Extensions = BAS FRM CLS VBS
CTL WSF
/L4"HTML" Nocase Noquote HTML_LANG Block Comment On =  Block Comment On Alt = <% Block Comment Off Alt = %>
String Chars = "' File Extensions = HTM HTML ASP SHTML HTT HTX JSP
/L11"Ruby" Line Comment Num = 2# Block Comment On = =begin Block
Comment Off = =end String Chars='" Escape Char = \ File Extensions = RB
RBW

I'm trying to write a method for extracting the comment markers & their
types (line/block & on/off). Regexps seemed the obvious tool, and I
eventually came up with this one:

c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+) ")

This is working well so far, except that it only grabs out the first
type of comment in each line. I'd hoped that I could make it get all
the comment types by putting an additional set of parens and a +
quantifier around the whole expression:

c = Regexp.new("((Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+))+ ")

But that just seems to break it--that version doesn't capture anything.

Anybody got a clue for me? I'm using v1.8 on windows. My code is
below. (And again, pls watch for line wrapping).

Thanks!

-Roy

def parse_comment_markers(line)
=begin

There are line comments & (2 different kinds of) block comments.

Line comments only have a start marker--EOL is the terminator.

Comment types are:
 Line Comment = <>
 Block Comment On = <>
 Block Comment Off = <>
 Block Comment On Alt = <>
 Block Comment Off Alt = <>

Where <> can be any contiguous set of non-whitespace chars.

   For Line comment marks, preceding digits specify the # of spaces
minus 1
   required after the nondigit portion of the marker. So for ruby, the
line
   comment mark is 2#, signifying that # is a comment only if it is
followed by
   a space. Ignore this for now.

   So--funky regexp time. We want to grab sequences centered around
the string " Comment ".
   We want the single word prior to "Comment" and all words between
"Comment" and " = ", and
   then of course the contiguous nonwhitespace following " = ".

=end
   puts line
   # Why doesn't the \S char class work?
   # c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*=
(\S+)")
   c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+) ")
   cm = c.match(line)
   if cm.nil?
      puts "No match!"
   else
      puts cm.captures.join(" || ")
      puts "Comment type is \"" + cm.captures[0] + "\", and comment
marker is \"" + cm.captures[2] + "\""
   end
end

parse_comment_markers("/L2 \"Ruby\" Line Comment = # Block Comment On =
' File Extensions = RB RBW")
parse_comment_markers("/L2 \"Ruby\" Block Comment On = =begin Block
Comment Off = =end File Extensions = RB RBW")

Robert · 24 February 2005 17:00

<rpardee@comcast.net> schrieb im Newsbeitrag
news:1109263535.280360.203760@g14g2000cwa.googlegroups.com...

Hey All,

I'm trying to parse lines from my text editor's config file, which look
like this (pls watch for line wrap--there is one line per language,
starting with /L<<digit>>):

/L1"SAS" Line Comment = * Block Comment On = /* Block Comment Off = */
Block Comment On Alt = * Block Comment Off Alt = ; Nocase File
Extensions = SAS
/L2"Visual Basic" Line Comment = ' File Extensions = BAS FRM CLS VBS
CTL WSF
/L4"HTML" Nocase Noquote HTML_LANG Block Comment On =  Block Comment On Alt = <% Block Comment Off Alt = %>
String Chars = "' File Extensions = HTM HTML ASP SHTML HTT HTX JSP
/L11"Ruby" Line Comment Num = 2# Block Comment On = =begin Block
Comment Off = =end String Chars='" Escape Char = \ File Extensions = RB
RBW

I'm trying to write a method for extracting the comment markers & their
types (line/block & on/off). Regexps seemed the obvious tool, and I
eventually came up with this one:

c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+) ")

This is working well so far, except that it only grabs out the first
type of comment in each line. I'd hoped that I could make it get all
the comment types by putting an additional set of parens and a +
quantifier around the whole expression:

c = Regexp.new("((Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+))+ ")

But that just seems to break it--that version doesn't capture anything.

Anybody got a clue for me? I'm using v1.8 on windows. My code is
below. (And again, pls watch for line wrapping).

You want String#scan

matches = line.scan(re)

or

line.scan(re) do |match|
....
end

Kind regards

robert

Roy_Pardee · 25 February 2005 14:45

Awesome--that's exactly what I needed. And much more readable than
elaborating the regexp.

Thanks!

-Roy

Topic		Replies	Views
Regexp help needed ruby-talk	3	86	2 May 2008
Regular expression question ruby-talk	14	52	15 September 2006
Regexps and HTML comments ruby-talk	3	61	22 November 2007
Regular expressions and conditionals ruby-talk	3	85	26 October 2008
Regular expression question ruby-talk	2	59	26 May 2008

Regexp help sought

Related Topics