I spent this evening making some broken experiments.
On Tue, 20 Jan 2004 04:38:12 +0900, Chris Morris wrote:
I need a re such that:
' /* comment */ String s = "*/"; '.gsub(re, "*\\")
returns:
' /* comment *\ String s = "*/"; '
I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?
--
Simon Strandgaard
ruby test_main.rb
Loaded suite TestMain
Started
test_balance_bad1(TestMain): .
test_balance_bad2(TestMain): F
test_balance_bad3(TestMain): F
test_balance_ok1(TestMain): F
test_balance_ok2(TestMain): .
Finished in 0.030393 seconds.
Failure:
test_balance_bad2(TestMain)
[test_main.rb:25:in `assert_x' test_main.rb:33:in `test_balance_bad2']:
<["xx ", "*/"]> expected but was
<nil>.
Failure:
test_balance_bad3(TestMain)
[test_main.rb:25:in `assert_x' test_main.rb:37:in `test_balance_bad3']:
<["xx /* /* */ ", "*/"]> expected but was
<["/* /* "]>.
Failure:
test_balance_ok1(TestMain)
[test_main.rb:25:in `assert_x' test_main.rb:41:in `test_balance_ok1']:
<nil> expected but was
<["/* "]>.
5 tests, 5 assertions, 3 failures, 0 errors
expand -t2 test_main.rb
require 'test/unit'

class TestMain < Test::Unit::TestCase
  # Build the experimental comment-balancing regex.
  def mk_re
    comment_begin = '\/\*'  # /*
    comment_end   = '\*\/'  # */
    re = /
      (
        #{comment_begin}
        .*?
      )
      #{comment_end}
      .*?
      (?! #{comment_begin} )
      (?= #{comment_end} )
    /x
    re
  end

  # Compare the captured groups (nil when nothing matches) with expected.
  def assert_x(expected, input)
    actual = mk_re.match(input)
    if actual
      actual = actual.to_a
      actual.shift
    end
    assert_equal(expected, actual)
  end

  def test_balance_bad1
    s = 'xx /* comment */ String s = "*/"; '
    assert_x(['/* comment '], s)
  end

  def test_balance_bad2
    s = 'xx / / String s = "/"; '
    assert_x(['xx ', '*/'], s)
  end

  def test_balance_bad3
    s = 'xx /* /* */ */'
    assert_x(['xx /* /* */ ', '*/'], s)
  end

  def test_balance_ok1
    s = ' /* */ /**/ '
    assert_x(nil, s)
  end

  def test_balance_ok2
    s = 'xx / /* */ '
    assert_x(nil, s)
  end
end

if $0 == __FILE__
  require 'test/unit/ui/console/testrunner'
  Test::Unit::UI::Console::TestRunner.run(TestMain, 3)
end
I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?
there are better things, yes…
str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }
will find strings and your pattern quickly and efficiently, while avoiding
prematurely terminated strings (that is, those that contain escaped
quotes),
nikolai
--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: www.pcppopper.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}
I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?
there are better things, yes…
str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }
will find strings and your pattern fast and efficient while avoiding
prematurely terminated strings (those that contain escaped quotes that
is),
nikolai
Did you test that? I’m afraid it doesn’t work:
irb(main):001:0> str=' /* comment */ String s = "*/"; '
=> " /* comment */ String s = \"*/\"; "
irb(main):002:0> str.gsub(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }
=> " /* comment */ String s = \"*/\"; "
irb(main):003:0> str == str.gsub(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }
=> true
If you want to allow quotes to be escaped, this one is the way to go:
irb(main):012:0> puts str.gsub(%r{"([^"\\]|\\")*"|\*/}) {|m| m == '*/' ? '*\\' : m}
 /* comment *\ String s = "*/";
You want to take all asterisks with a following forward slash that are not
inside of a double-quote?
Zach
P.S. -
if String s = "hello*"world*\ then you would want to capture the last *
?
or would you like to convert the last *\ to a */ ?
I guess I’m not exactly following what you are attempting to do.
-----Original Message-----
From: Chris Morris [mailto:chrismo@clabs.org]
Sent: Monday, January 19, 2004 4:44 PM
To: ruby-talk ML
Subject: Re: regex help
Simon Strandgaard wrote:
On Tue, 20 Jan 2004 04:38:12 +0900, Chris Morris wrote:
I need a re such that:
' /* comment */ String s = "*/"; '.gsub(re, "*\\")
returns:
' /* comment *\ String s = "*/"; '
I want to find all */ instances that are not enclosed in double-quotes.
Or is this one of those problems ill-suited for a regex?
possible… but difficult.
I spent this evening making some broken experiments.
I thought it might be. I’ve fallen back and written a simple
char-by-char parser that ignores pieces inside quotes.
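The char-by-char approach mentioned here could look roughly like the following. This is only a minimal sketch, not Chris's actual parser (which was not posted); the method name `escape_comment_ends` is made up for illustration, and it assumes backslash escapes inside double-quoted strings and no other quoting rules.

```ruby
# Hypothetical helper: replaces "*/" with "*\" everywhere except inside
# double-quoted strings. Not the parser from the thread, just a sketch.
def escape_comment_ends(src)
  out = String.new
  in_string = false
  i = 0
  while i < src.length
    ch = src[i]
    if in_string
      if ch == "\\" && i + 1 < src.length
        # Copy an escaped character (such as \" or \\) verbatim.
        out << ch << src[i + 1]
        i += 2
        next
      end
      in_string = false if ch == '"'
      out << ch
      i += 1
    elsif ch == '"'
      in_string = true
      out << ch
      i += 1
    elsif src[i, 2] == "*/"
      # Outside a string: rewrite the comment terminator.
      out << "*\\"
      i += 2
    else
      out << ch
      i += 1
    end
  end
  out
end

puts escape_comment_ends(' /* comment */ String s = "*/"; ')
```

The scanner keeps a single flag for "inside a string", so the `*/` inside the quoted literal is copied through untouched.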
str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }
Did you test that? I’m afraid it doesn’t work:
yes, but i seem to have made a mistake in copying it over for some
reason. the problem is the test in the block, not the regex; it should be
str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*/' ? '*\\' : m }
                                                    ^^
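For readers following along, here is a small self-contained sketch of the corrected one-pass idea: the regex matches either a whole string literal or a bare `*/`, and the block only rewrites the latter. The constant and method names (`STRING_OR_END`, `escape_outside_strings`) are made up for this example, and the pattern is a reconstruction of the one discussed above, with a non-capturing group.

```ruby
# Either a complete double-quoted string (allowing backslash escapes)
# or a bare "*/". The name is an assumption made for this sketch.
STRING_OR_END = /"[^"\\]*(?:\\.[^"\\]*)*"|\*\//

def escape_outside_strings(str)
  # String literals are returned unchanged; only a bare "*/" is rewritten.
  str.gsub(STRING_OR_END) { |m| m == '*/' ? '*\\' : m }
end

puts escape_outside_strings(' /* comment */ String s = "*/"; ')
# prints:  /* comment *\ String s = "*/";
```

Because string literals are consumed as whole matches, the `*/` inside them is never seen on its own by the second alternative.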
If you want to allow quotes to be escaped, this one is the way to go:
irb(main):012:0> puts str.gsub(%r{"([^"\\]|\\")*"|\*/}) {|m| m == '*/' ? '*\\' : m}
 /* comment *\ String s = "*/";
well, this isn’t really correct, that would only escape quotes and you
wouldn’t allow for escaped backslashes in your strings…it is of
course trivial to mend. do note that my version is a lot faster…see
“Mastering Regular Expressions” by Jeffrey E. F. Friedl on why this is
so. anyway, thanks for pointing out that something was wrong,
nikolai
You want to take all asterisks with a following forward slash that are not
inside of a double-quote?
Yeup.
if String s = "hello*"world*\ then you would want to capture the last *
?
Yeup.
or would you like to convert the last *\ to a */ ?
Yeup.
I guess I’m not exactly following what you are attempting to do.
I’m on a job where the code in question is J++, and MS supports
conditional compiles… I’m writing a ‘pre-processor’ that will process
out the conditional compile directives so I can run some of the code on
a pure java VM.
str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*' ? '*\\' : m }
Did you test that? I’m afraid it doesn’t work:
yes, but i seem to have made a mistake in copying it over for some
reason. the problem is the test in the block, not the regex; it should be
str.gsub!(/"[^"\\]*(\\.[^"\\]*)*"|\*\//){ |m| m == '*/' ? '*\\' : m }
                                                    ^^
Oops, yes you’re right.
If you want to allow quotes to be escaped, this one is the way to go:
irb(main):012:0> puts str.gsub(%r{"([^"\\]|\\")*"|\*/}) {|m| m == '*/' ? '*\\' : m}
 /* comment *\ String s = "*/";
well, this isn’t really correct, that would only escape quotes and you
wouldn’t allow for escaped backslashes in your strings…it is of
course trivial to mend. do note that my version is a lot faster…see
“Mastering Regular Expressions” by Jeffrey E. F. Friedl on why this is
so. anyway, thanks for pointing out that something was wrong,
nikolai
I’ve added escaping of arbitrary chars and put it into a benchmark
(attached). The differences don’t look too big:
18:07:21 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.776000)
yours  1.719000   0.000000   1.719000 (  1.732000)
18:07:26 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.785000)
yours  1.704000   0.000000   1.704000 (  1.712000)
18:07:31 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.800000)
yours  1.719000   0.000000   1.719000 (  1.706000)
18:07:36 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.781000   0.000000   1.781000 (  1.788000)
yours  1.703000   0.000000   1.703000 (  1.693000)
18:07:43 [ruby]: ./rx-bm.rb
           user     system      total        real
mine   1.766000   0.000000   1.766000 (  1.788000)
yours  1.719000   0.000000   1.719000 (  1.707000)
18:07:52 [ruby]:
That’s certainly not what I’d call “a lot faster”. Maybe the effects of
GC dominate the rx timing. Here’s the output of the second benchmark:
18:15:22 [ruby]: ./rx-bm-2.rb
           user     system      total        real
mine   3.234000   0.000000   3.234000 (  3.227000)
yours  3.157000   0.016000   3.173000 (  3.188000)
18:15:33 [ruby]: ./rx-bm-2.rb
           user     system      total        real
mine   3.203000   0.000000   3.203000 (  3.218000)
yours  3.125000   0.016000   3.141000 (  3.231000)
18:15:44 [ruby]: ./rx-bm-2.rb
           user     system      total        real
mine   3.187000   0.000000   3.187000 (  3.257000)
yours  3.156000   0.000000   3.156000 (  3.186000)
Doesn’t look so much different. Any ideas or enlightening comments from
the aforementioned book?
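The benchmark scripts were attachments and are not reproduced in this thread, so the following is only a guess at what such a comparison could look like, using Ruby's standard `Benchmark` module. The two patterns are reconstructions of the ones discussed above; the sample input, the repeat counts, and the names `ALTERNATION`, `UNROLLED`, `SAMPLE` and `convert` are all made up for illustration.

```ruby
require 'benchmark'

# Assumed reconstructions of the two patterns discussed in the thread.
ALTERNATION = /"(?:[^"\\]|\\.)*"|\*\//
UNROLLED    = /"[^"\\]*(?:\\.[^"\\]*)*"|\*\//

# Made-up sample input; the real benchmark data was not posted.
SAMPLE = ' /* comment */ String s = "a \"quoted\" */ text"; ' * 200

# Rewrite "*/" outside string literals using the given pattern.
def convert(str, re)
  str.gsub(re) { |m| m == '*/' ? '*\\' : m }
end

Benchmark.bm(12) do |bm|
  bm.report('alternation') { 200.times { convert(SAMPLE, ALTERNATION) } }
  bm.report('unrolled')    { 200.times { convert(SAMPLE, UNROLLED) } }
end
```

Timings will depend heavily on the input: short string literals with few escapes give the two forms little opportunity to differ.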
-----Original Message-----
From: Chris Morris [mailto:chrismo@clabs.org]
Sent: Monday, January 19, 2004 5:08 PM
To: ruby-talk ML
Subject: Re: regex help
Zach Dennis wrote:
Chris,
You want to take all asterisks with a following forward slash that are not
inside of a double-quote?
Yeup.
if String s = "hello*"world*\ then you would want to capture the last *
?
Yeup.
or would you like to convert the last *\ to a */ ?
Yeup.
I guess I’m not exactly following what you are attempting to do.
I’m on a job where the code in question is J++, and MS supports
conditional compiles… I’m writing a ‘pre-processor’ that will process
out the conditional compile directives so I can run some of the code on
a pure java VM.
Doesn’t look so much different. Any ideas or enlightening comments from
the aforementioned book?
‘my’ version avoids a lot of unnecessary backtracking under certain
conditions. I can’t really delve into it further, but if you haven’t
got the book, it’s really worth buying. It’s really very entertaining
and full of good knowledge.
nikolai
I’ve got the book Nikolai, and tonight after the office closes my goal is to
dive into your code and Robert’s code and find out why. Your knowledge on
this amazes me!
Zach
-----Original Message-----
From: Nikolai Weibull [mailto:ruby-talk@pcppopper.org]
Sent: Tuesday, January 20, 2004 4:55 PM
To: ruby-talk ML
Subject: Re: regex help
Doesn’t look so much different. Any ideas or enlightening comments from
the aforementioned book?
‘my’ version avoids a lot of unnecessary backtracking under certain
conditions. I can’t really delve into it further, but if you haven’t
got the book, it’s really worth buying. It’s really very entertaining
and full of good knowledge.
nikolai
Doesn’t look so much different. Any ideas or enlightening comments from
the aforementioned book?
‘my’ version avoids a lot of unnecessary backtracking under certain
conditions. I can’t really delve into it further, but if you haven’t
got the book, it’s really worth buying. It’s really very entertaining
and full of good knowledge.
nikolai
Hm… Maybe it’s because of the alternation in the first part:
([^"\\]|\\.). But the rx engine can detect at the first char which of the
two alternatives it has to take. Hmm… It seems I’ll have to get
that book…
def format str
  # split alternates: text outside literals, string literal, text outside, ...
  literal = true
  str.split( /("(?:[^"\\]|\\.)*")/ ).map do |s|
    ( literal = !literal ) ? s : s.gsub( /\*\//, "*\\" )
  end.join
end
The regexp used in split extracts all string literals (it handles the
"a\"b" notation, too). Because the regexp is enclosed in parentheses, split
includes the string literals in its output, too. So the result of split is an
array with the 2nd, 4th, 6th, … elements being the string literals and
the 1st, 3rd, 5th, … elements being the text outside of the literals.
The map uses a binary switch to decide whether to perform the desired
substitution or not (this is the first time I could use map_with_index).
This ensures that the string literals are left unchanged.
The last step is joining the array elements back into one string.
I don’t think this will handle all possible cases, but maybe it is a
starting point.
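To make the description of split concrete, here is a small illustration of what the array looks like. The sample string is made up, and `each_with_index.map` is used here in place of the toggle variable purely for illustration (it needs a newer Ruby than the one used in this thread).

```ruby
# A capture group in the pattern makes split keep the matched literals.
parts = 'a = "x*/"; b */ c'.split(/("(?:[^"\\]|\\.)*")/)
p parts
# prints: ["a = ", "\"x*/\"", "; b */ c"]

# Odd indices hold the string literals, even indices the text outside them.
result = parts.each_with_index.map { |s, i|
  i.odd? ? s : s.gsub('*/') { '*\\' }
}.join
puts result
# prints: a = "x*/"; b *\ c
```

If the input begins with a string literal, split returns an empty string as the first element, so the literals still land on the odd indices.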
I’ve got the book Nikolai, and tonight after the office closes my goal
is to dive into your code and Robert’s code and find out why. Your
knowledge on this amazes me!
hehe, eh, thanks i suppose. It’s covered in Chapter 6. It’s a use of
what Friedl calls “Unrolling-the-Loop” for regexes. It’s one of the
coolest regex optimizations ever devised, in my opinion.
nikolai
P.S.
Sorry for not being able to explain why (or more interestingly how) in
more detail, but Friedl spends some 60-70 pages on this, so it wouldn’t
really be possible.
D.S.
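As a rough illustration of the shape of the technique (not a substitute for the book's explanation), the two forms of the string-literal pattern from this thread can be put side by side. The general unrolled shape is: opening, normal*, (special normal*)*, closing. The constant names and sample strings below are made up for this sketch, and it only checks that both forms match the same text; it makes no claim about speed.

```ruby
# Alternation form: every character goes through the ( ... | ... )* loop.
ALTERNATION = /"(?:[^"\\]|\\.)*"/

# Unrolled form: opening normal* (special normal*)* closing
UNROLLED = /
  "              # opening quote
  [^"\\]*        # normal run: anything but a quote or a backslash
  (?:
    \\.          # special: a backslash plus the character it escapes
    [^"\\]*      # then another normal run
  )*
  "              # closing quote
/x

samples = ['say "hi"', 'path "C:\\\\dir" end', 'q "a \"b\" c" tail', 'no strings']
samples.each do |s|
  puts "#{s.inspect}: #{s[ALTERNATION].inspect} / #{s[UNROLLED].inspect}"
end
```

In the unrolled form the common case (a run of ordinary characters) is consumed by a single character class, and the group is only entered when a backslash actually appears.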