Regex help

Chris_Morris · 19 January 2004 19:38

I need a re such that:

' /* comment */ String s = "*/"; '.gsub(re, "*\\")

returns:

' /* comment *\ String s = "*/"; '

I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?

–
Chris
http://clabs.org

Simon_Strandgaard1 · 19 January 2004 21:40

possible… but difficult.

I spent this evening making some broken experiments.

On Tue, 20 Jan 2004 04:38:12 +0900, Chris Morris wrote:

I need a re such that:

' /* comment */ String s = "*/"; '.gsub(re, "*\\")

returns:

' /* comment *\ String s = "*/"; '

I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?

–
Simon Strandgaard

ruby test_main.rb
Loaded suite TestMain
Started
test_balance_bad1(TestMain): .
test_balance_bad2(TestMain): F
test_balance_bad3(TestMain): F
test_balance_ok1(TestMain): F
test_balance_ok2(TestMain): .

Finished in 0.030393 seconds.

Failure:
test_balance_bad2(TestMain)
[test_main.rb:25:in `assert_x' test_main.rb:33:in `test_balance_bad2']:
<["xx ", "*/"]> expected but was
<nil>.
Failure:
test_balance_bad3(TestMain)
[test_main.rb:25:in `assert_x' test_main.rb:37:in `test_balance_bad3']:
<["xx /* /* */ ", "*/"]> expected but was
<["/* /* "]>.
Failure:
test_balance_ok1(TestMain)
[test_main.rb:25:in `assert_x' test_main.rb:41:in `test_balance_ok1']:
<nil> expected but was
<["/* "]>.

5 tests, 5 assertions, 3 failures, 0 errors

expand -t2 test_main.rb
require 'test/unit'

class TestMain < Test::Unit::TestCase
  def mk_re
    comment_begin = '\/\*'  # /*
    comment_end   = '\*\/'  # */
    re = /
      (
        #{comment_begin}
        .*?
      )
      #{comment_end}
      .*?
      (?! #{comment_begin} )
      (?= #{comment_end} )
    /x
    re
  end
  def assert_x(expected, input)
    actual = mk_re.match(input)
    if actual
      actual = actual.to_a
      actual.shift
    end
    assert_equal(expected, actual)
  end
  def test_balance_bad1
    s = 'xx /* comment */ String s = "*/"; '
    assert_x(['/* comment '], s)
  end
  def test_balance_bad2
    s = 'xx */ */ String s = "*/"; '
    assert_x(['xx ', '*/'], s)
  end
  def test_balance_bad3
    s = 'xx /* /* */ */'
    assert_x(['xx /* /* */ ', '*/'], s)
  end
  def test_balance_ok1
    s = ' /* */ /* */ '
    assert_x(nil, s)
  end
  def test_balance_ok2
    s = 'xx /* /* */ '
    assert_x(nil, s)
  end
end

if $0 == __FILE__
  require 'test/unit/ui/console/testrunner'
  Test::Unit::UI::Console::TestRunner.run(TestMain, 3)
end

Robert · 20 January 2004 09:09

"Chris Morris" chrismo@clabs.org wrote in message
news:400C31D0.7010902@clabs.org…

I need a re such that:

' /* comment */ String s = "*/"; '.gsub(re, "*\\")

returns:

' /* comment *\ String s = "*/"; '

I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?

str.gsub(%r{"[^"]*"|\*/}) {|m| m == '*/' ? '*\\' : m}

If you need single quotes as well, take this one:
str.gsub(%r{"[^"]*"|'[^']*'|\*/}) {|m| m == '*/' ? '*\\' : m}

Note: the order of the different alternatives matters.
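
As a rough illustration of the suggestion above, the pattern can be wrapped in a small helper and run on the string from the original question. The method name escape_comment_ends is made up for this sketch; only the regex and the block come from the post.

```ruby
# Hypothetical helper name; the pattern and block follow the reply above.
def escape_comment_ends(str)
  # A quoted string is consumed as one match and handed back unchanged,
  # so a */ inside it is never seen on its own. A bare */ becomes *\.
  str.gsub(%r{"[^"]*"|'[^']*'|\*/}) { |m| m == '*/' ? '*\\' : m }
end

puts escape_comment_ends(' /* comment */ String s = "*/"; ')
# prints:  /* comment *\ String s = "*/";
```

Because each quoted string is swallowed whole by the same scan, no lookbehind is needed to decide whether a */ sits inside quotes.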

Regards

robert

Nikolai_Weibull3 · 20 January 2004 14:57

Chris Morris chrismo@clabs.org [Jan, 19 2004 20:50]:

I need a re such that:

' /* comment */ String s = "*/"; '.gsub(re, "*\\")

returns:

' /* comment *\ String s = "*/"; '

I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?

there are better things, yes…

str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }

will find strings and your pattern quickly and efficiently while avoiding
prematurely terminated strings (that is, those that contain escaped
quotes),
nikolai

–
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

Chris_Morris · 19 January 2004 21:44

Simon Strandgaard wrote:

On Tue, 20 Jan 2004 04:38:12 +0900, Chris Morris wrote:

I need a re such that:

' /* comment */ String s = "*/"; '.gsub(re, "*\\")

returns:

' /* comment *\ String s = "*/"; '

I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?

possible… but difficult.

I spent this evening making some broken experiments.

I thought it might be. I’ve fallen back and written a simple
char-by-char parser that ignores pieces inside quotes.
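
The parser itself was not posted. A minimal sketch of what such a character-by-character scan might look like is shown here; the method name scan_escape and the handling of backslash escapes inside strings are assumptions, not the code actually used.

```ruby
# Hypothetical sketch: walk the source one character at a time and
# rewrite */ to *\ only while outside a double-quoted string.
def scan_escape(src)
  out = ''.dup
  in_string = false
  i = 0
  while i < src.length
    ch = src[i]
    if in_string
      if ch == '\\' && i + 1 < src.length
        # Copy a backslash escape (such as \") as one unit.
        out << ch << src[i + 1]
        i += 2
        next
      end
      in_string = false if ch == '"'
      out << ch
      i += 1
    elsif ch == '*' && src[i + 1] == '/'
      out << '*\\'
      i += 2
    else
      in_string = true if ch == '"'
      out << ch
      i += 1
    end
  end
  out
end

puts scan_escape(' /* comment */ String s = "*/"; ')
```

The only state carried between characters is the in_string flag, which is why this is easy to get right without a regex.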

–
Chris
http://clabs.org

Robert · 20 January 2004 16:20

"Nikolai Weibull" ruby-talk@pcppopper.org wrote in message
news:20040120150317.GB3012@puritan.bredbandsbolaget.se…

Chris Morris chrismo@clabs.org [Jan, 19 2004 20:50]:

I need a re such that:

' /* comment */ String s = "*/"; '.gsub(re, "*\\")

returns:

' /* comment *\ String s = "*/"; '

I want to find all */ instances that are not enclosed in
double-quotes.
Or is this one of those problems ill-suited for a regex?

there are better things, yes…

str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }

will find strings and your pattern quickly and efficiently while avoiding
prematurely terminated strings (that is, those that contain escaped
quotes),
nikolai

Did you test that? I’m afraid it doesn’t work:

irb(main):001:0> str=' /* comment */ String s = "*/"; '
=> " /* comment */ String s = \"*/\"; "
irb(main):002:0> str.gsub(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }
=> " /* comment */ String s = \"*/\"; "
irb(main):003:0> str == str.gsub(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }
=> true

If you want to allow quotes to be escaped, this one is the way to go:

irb(main):012:0> puts str.gsub(%r{"([^"\\]|\\")*"|\*/}) {|m| m == '*/' ? '*\\' : m}
 /* comment *\ String s = "*/";

Regards

robert

Zach_Dennis1 · 19 January 2004 21:55

Chris,

You want to take all asterisks with a following forward slash that are not
inside of a double-quote?

Zach

P.S. -

if String s = "hello*"world*\ then you would want to capture the last *
?

or would you like to convert the last *\ to a */ ?

I guess I’m not exactly following what you are attempting to do.

Nikolai_Weibull3 · 20 January 2004 16:51

Robert Klemme bob.news@gmx.net [Jan, 20 2004 17:30]:

str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }

Did you test that? I’m afraid it doesn’t work:

yes, but i seem to have made a mistake in copying it over for some
reason. the problem is the test, not the regex; it should be

str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*/' ? '*\\' : m }
                                                    ^^

If you want to allow quotes to be escaped, this one is the way to go:

irb(main):012:0> puts str.gsub(%r{"([^"\\]|\\")*"|\*/}) {|m| m == '*/' ? '*\\' : m}
 /* comment *\ String s = "*/";

well, this isn’t really correct: it only handles escaped quotes, so
escaped backslashes in your strings aren’t allowed for…it is of
course trivial to mend. do note that my version is a lot faster…see
“Mastering Regular Expressions” by Jeffrey E. F. Friedl on why this is
so. anyway, thanks for pointing out that something was wrong,
nikolai
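
The point about escaped backslashes can be seen with a string literal that ends in an escaped backslash. The sketch below is only an illustration: the constant names, the helper rewrite and the sample input are made up, and the two patterns are the ones discussed in this exchange.

```ruby
# Pattern that only knows about escaped quotes (\").
QUOTES_ONLY = %r{"([^"\\]|\\")*"|\*/}
# Pattern that accepts any backslash escape, including \\.
ANY_ESCAPE  = %r{"[^"\\]*(\\.[^"\\]*)*"|\*/}

# Assumed sample: the first literal is "a\\" (an a followed by an
# escaped backslash), then a bare */, then a second literal.
SRC = 'String s = "a\\\\"; */ String t = "b";'

def rewrite(re, str)
  str.gsub(re) { |m| m == '*/' ? '*\\' : m }
end

puts rewrite(QUOTES_ONLY, SRC)
puts rewrite(ANY_ESCAPE, SRC)
```

With the first pattern the literal "a\\" fails to match, the scan resynchronises on its closing quote, and the bare */ ends up inside a bogus "string" and is left alone. The second pattern consumes the literal correctly and rewrites the */.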

Chris_Morris · 19 January 2004 22:07

Zach Dennis wrote:

Chris,

You want to take all asterisks with a following forward slash that are not
inside of a double-quote?

Yeup.

if String s = "hello*"world*\ then you would want to capture the last *
?

Yeup.

or would you like to convert the last *\ to a */ ?

Yeup.

I guess I’m not exactly following what you are attempting to do.

I’m on a job where the code in question is J++, and MS supports
conditional compiles… I’m writing a ‘pre-processor’ that will process
out the conditional compile directives so I can run some of the code on
a pure java VM.

–
Chris
http://clabs.org

Robert · 20 January 2004 17:20

"Nikolai Weibull" ruby-talk@pcppopper.org wrote in message
news:20040120165640.GB1987@puritan.bredbandsbolaget.se…

Robert Klemme bob.news@gmx.net [Jan, 20 2004 17:30]:

str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }

Did you test that? I’m afraid it doesn’t work:

yes, but i seem to have made a mistake in copying it over for some
reason. the problem is the test, not the regex; it should be

str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*/' ? '*\\' : m }
                                                    ^^

Oops, yes you’re right.

If you want to allow quotes to be escaped, this one is the way to go:

irb(main):012:0> puts str.gsub(%r{"([^"\\]|\\")*"|\*/}) {|m| m == '*/' ? '*\\' : m}
 /* comment *\ String s = "*/";

well, this isn’t really correct: it only handles escaped quotes, so
escaped backslashes in your strings aren’t allowed for…it is of
course trivial to mend. do note that my version is a lot faster…see
“Mastering Regular Expressions” by Jeffrey E. F. Friedl on why this is
so. anyway, thanks for pointing out that something was wrong,
nikolai

I’ve added escaping of arbitrary chars and put it into a benchmark
(attached). The differences don’t look too big:

18:07:21 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.776000)
yours  1.719000   0.000000   1.719000 (  1.732000)
18:07:26 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.785000)
yours  1.704000   0.000000   1.704000 (  1.712000)
18:07:31 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.800000)
yours  1.719000   0.000000   1.719000 (  1.706000)
18:07:36 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.788000)
yours  1.703000   0.000000   1.703000 (  1.693000)
18:07:43 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.766000   0.000000   1.766000 (  1.788000)
yours  1.719000   0.000000   1.719000 (  1.707000)
18:07:52 [ruby]:

That’s certainly not what I’d call “a lot faster”. Maybe the effects of
GC dominate the rx timing. Here’s the output of the second benchmark:

18:15:22 [ruby]: ./rx-bm-2.rb
           user     system      total        real
mine   3.234000   0.000000   3.234000 (  3.227000)
yours  3.157000   0.016000   3.173000 (  3.188000)
18:15:33 [ruby]: ./rx-bm-2.rb
           user     system      total        real
mine   3.203000   0.000000   3.203000 (  3.218000)
yours  3.125000   0.016000   3.141000 (  3.231000)
18:15:44 [ruby]: ./rx-bm-2.rb
           user     system      total        real
mine   3.187000   0.000000   3.187000 (  3.257000)
yours  3.156000   0.000000   3.156000 (  3.186000)

Doesn’t look so much different. Any ideas or enlightening comments from
the aforementioned book?

Kind regards

robert

rx-bm.rb (380 Bytes)

rx-bm-2.rb (457 Bytes)
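
The attached scripts are not reproduced in the thread. A comparison of this kind could be set up roughly as follows with Ruby's standard Benchmark module; the sample line, the repetition count and the constant and method names are assumptions, so timings from this sketch are not comparable to the figures above.

```ruby
require 'benchmark'

# "mine": alternation inside the string part, any character may be escaped.
ALTERNATION = %r{"([^"\\]|\\.)*"|\*/}
# "yours": the unrolled form of the same string part.
UNROLLED    = %r{"[^"\\]*(\\.[^"\\]*)*"|\*/}

# Assumed input: one Java-like line repeated a few times.
LINE = ' /* comment */ String s = "a \\" quoted */ b"; ' * 20
N = 2_000 # assumed repetition count

def rewrite(re, str)
  str.gsub(re) { |m| m == '*/' ? '*\\' : m }
end

Benchmark.bm(6) do |bm|
  bm.report('mine')  { N.times { rewrite(ALTERNATION, LINE) } }
  bm.report('yours') { N.times { rewrite(UNROLLED, LINE) } }
end
```

On short, well-formed input like this, both patterns do little backtracking, which may be one reason the measured difference stays small.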

Zach_Dennis1 · 19 January 2004 22:39

Can this span multiple lines for a match?

Zach

Nikolai_Weibull3 · 20 January 2004 21:54

Robert Klemme bob.news@gmx.net [Jan, 20 2004 21:50]:

Doesn’t look so much different. Any ideas or enlightening comments from
the aforementioned book?

‘my’ version avoids a lot of unnecessary backtracking under certain
conditions. I can’t really delve into it further, but if you haven’t
got the book, it’s really worth buying. It’s very entertaining
and full of good knowledge.
nikolai

Chris_Morris · 19 January 2004 23:02

Zach Dennis wrote:

Can this span multiple lines for a match?

I don’t think a string literal can span multiple lines in a .java source
file, can it? If not, then I’m fine for now.

–
Chris
http://clabs.org

Zach_Dennis1 · 20 January 2004 21:58

I’ve got the book Nikolai, and tonight after the office closes my goal is to
dive into your code and Robert’s code and find out why. Your knowledge on
this amazes me!

Zach

Robert · 21 January 2004 11:00

"Nikolai Weibull" ruby-talk@pcppopper.org wrote in message
news:20040120220017.GC2265@puritan.bredbandsbolaget.se…

Robert Klemme bob.news@gmx.net [Jan, 20 2004 21:50]:

Doesn’t look so much different. Any ideas or enlightening comments
from the aforementioned book?

‘my’ version avoids a lot of unnecessary backtracking under certain
conditions. I can’t really delve into it further, but if you haven’t
got the book, it’s really worth buying. It’s very entertaining
and full of good knowledge.
nikolai

Hm… Maybe it’s because of the alternative in the first part:
([^"\\]|\\.). But the rx engine can detect at the first char which of the
two alternatives it has to take. Hmm… It seems I’ll have to get
that book…

Thanks anyway!

robert

Pit · 19 January 2004 23:36

You could try multiple steps:

extract string literals
change comment in remaining text
add string literals

For example:

def format str
  literal = true
  str.split( /("(?:[^"\\]|\\.)*")/ ).map do |s|
    ( literal = !literal ) ? s : s.gsub( /\*\//, "*\\" )
  end.join
end

The regexp used in split extracts all string literals (it handles the
"a\"b" notation, too). By enclosing the regexp in parentheses split will
add the string literals in its output, too. So the result of split is an
array with the 2nd, 4th, 6th, … elements being the string literals and
the 1st, 3rd, 5th, … elements being the text outside of the literals.

The map uses a binary switch to decide whether to perform the desired
substitution or not (this is the first time I could use map_with_index).
This ensures that the string literals are left unchanged.

The last step is joining the array elements back into one string.

I don’t think this will handle all possible cases, but maybe it is a
starting point.
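
As a small variation on the steps above, the position in the split result can be used instead of a toggled flag. This is only a sketch: the method name escape_outside_literals is made up, and it assumes string literals never span lines and are always terminated.

```ruby
# Hypothetical variant: split on string literals (kept via the capture
# group), then rewrite only the pieces that sit between literals.
def escape_outside_literals(str)
  pieces = str.split(/("(?:[^"\\]|\\.)*")/)
  pieces.each_with_index.map do |piece, i|
    # Odd indexes hold the captured string literals; leave them alone.
    i.odd? ? piece : piece.gsub('*/') { '*\\' }
  end.join
end

p 'a */ "b */" c'.split(/("(?:[^"\\]|\\.)*")/)
# => ["a */ ", "\"b */\"", " c"]
puts escape_outside_literals('a */ "b */" c')
# prints: a *\ "b */" c
```

If the input starts with a literal, split puts an empty string at index 0, so literals still land on the odd indexes.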

Regards,
Pit

Nikolai_Weibull3 · 20 January 2004 22:12

Zach Dennis zdennis@mktec.com [Jan, 20 2004 23:10]:

I’ve got the book Nikolai, and tonight after the office closes my goal
is to dive into your code and Robert’s code and find out why. Your
knowledge on this amazes me!

hehe, eh, thanks i suppose. It’s covered in Chapter 6. It’s a use of
what Friedl calls “Unrolling-the-Loop” for regexes. It’s one of the
coolest regex optimizations ever devised in my opinion.
nikolai

P.S.
Sorry for not being able to explain why (or more interestingly how) in
more detail, but Friedl spends some 60-70 pages on this, so it wouldn’t
really be possible.
D.S.
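
For readers without the book, the shape of the rewrite can at least be shown. The sketch below assumes the usual description of the technique, in which a loop of the form (normal|special)* is rewritten as normal*(special normal*)*. The constant names and sample strings are made up, and this says nothing about speed, only that the two forms accept the same string literals here.

```ruby
NORMAL  = /[^"\\]/ # an ordinary character inside a string literal
SPECIAL = /\\./    # a backslash escape

# Alternation form: choose between NORMAL and SPECIAL at every step.
ALTERNATION = /"(?:#{NORMAL}|#{SPECIAL})*"/
# Unrolled form: a run of NORMAL, then (SPECIAL, run of NORMAL) repeated.
UNROLLED    = /"#{NORMAL}*(?:#{SPECIAL}#{NORMAL}*)*"/

SAMPLES = ['"plain"', '"a\"b"', '"tail\\\\"', '"open', 'none']

SAMPLES.each do |s|
  same = s[ALTERNATION] == s[UNROLLED]
  puts "#{s.inspect}: #{same ? 'same match' : 'different match'}"
end
```

The unrolled form never has to choose between alternatives for ordinary characters, which is the part that is said to save backtracking.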

Chris_Morris · 20 January 2004 19:07

Pit Capitain wrote:

I don’t think this will handle all possible cases, but maybe it is a
starting point.

Thanks for posting this – very cool approach.

–
Chris
http://clabs.org
