Problem replacing $data[abc] with $data['abc'] using gsub

Jan_E · 12 March 2012 18:40

The part ".*?" of the regular expression is very inefficient, because it
will at first consume every character until the end of the line and then
try to find the minimum of characters needed.

It's better to make an explicit expression for what you are looking for:

/\$data\[([^\]]*)\]/

This reads as "The String '$data[' followed by zero or more characters,
which are not closing square brackets, followed by a closing square
bracket".

Also you don't need the block version of gsub. You can simple use a
substitute string and refer to the parenthesized subexpression by \1:

string.gsub /\$data\[([^\]]*)\]/, '$data[\'\1\']'

Maybe the regular expression should also check if there are already
quotes.

···

--
Posted via http://www.ruby-forum.com/.

Jan_E · 13 March 2012 11:58

Joao Silva wrote in post #1051235:

Also you don't need the block version of gsub. You can simple use a
substitute string and refer to the parenthesized subexpression by \1:

string.gsub /\$data\[([^\]]*)\]/, '$data[\'\1\']'

This is what I was looking for. Does this have a name I could Google.

It's called backreference. You can also use it in your regular
expression to refer to subexpressions previously matched:

/(\w)\w*\1/

This matches any string that consists of word characters and begins and
ends with the same character.

Joao Silva wrote in post #1051235:

Maybe the regular expression should also check if there are already
quotes.

Using what you've demonstrated above, would this then be a good way to
exclude strings which contain a leading or trailing quote within the
square brackets?
/\$data\[([^'][^\]]*[^'])\]/

Yes, but you should also exclude double quotes:

/\$data\[([^'"][^\]]*[^'"])\]/

Albert's solution is probably even better. As a small modification, I
would include possible whitespace:

/\$data\[[ \t]*([a-z_][a-z\d_]*)[ \t]*\]/i

However, even this expression doesn't cover each and every case. For
example, the word in the square brackets could be the identifier of a
method or a local variable (then it must not be quoted). But fixing this
would probably be too complex. You won't be able to do it with a simple
regular expression.

···

--
Posted via http://www.ruby-forum.com/\.

Brian_Candler · 13 March 2012 20:27

Jan E. wrote in post #1051180:

The part ".*?" of the regular expression is very inefficient

Is it? Have you measured it?

, because it
will at first consume every character until the end of the line and then
try to find the minimum of characters needed.

Does it? There are many implementations of ruby, which particular one(s)
are you referring to?

Your argument suggests that

/(.+?)(.+?)(.+?)(.+?)(.+?)/ =~ "a"*1_000_000

would be extremely inefficient, but actually it runs very fast for me.

So let's demonstrate if you are right or wrong:

require 'benchmark'

LONGSTR = ("a" * 1_000_000).freeze

Benchmark.bmbm do |x|
x.report("chars") { 1_000_000.times { /aaaaa/ =~ LONGSTR } }
x.report("non-greedy") { 1_000_000.times { /.*?.*?.*?.*?.*?/ =~
LONGSTR } }
end

And the results for me, using ruby 1.8.7 under Mac OSX Lion on a Macbook
Air i7:

Rehearsal ----------------------------------------------
chars 0.520000 0.000000 0.520000 ( 0.516861)
non-greedy 0.510000 0.000000 0.510000 ( 0.511089)
------------------------------------- total: 1.030000sec

user system total real
chars 0.510000 0.000000 0.510000 ( 0.505664)
non-greedy 0.510000 0.000000 0.510000 ( 0.511662)

I see no difference there.

Also you don't need the block version of gsub. You can simple use a
substitute string and refer to the parenthesized subexpression by \1:

You can, but the block version is often clearer, especially if you are
doing things like backslash-escaping strings:

# clear
a.gsub(/(.)/) { "\\#{$1}" }

# same result but horrible
a.gsub(/(.)/), "\\\\\\1")

···

--
Posted via http://www.ruby-forum.com/\.

11142 · 13 March 2012 20:42

Sorry, but you're both wrong. The expression, as far as I know,
*will not* first consume eveyrthing, and then back off (or at least
this is implementation-defined). Such regexps can, however, be slow.
(Although .+ can be slow as well.)

Let's try something a tiny bit more complicated (but still simple).

irb(main):001:0> re = /.+?a.+?b/
=> /.+?a.+?b/
irb(main):002:0> s = "ac"*1_000_000 + "b"; nil
=> nil
irb(main):003:0> s =~ re
=> 0

This run fasts. Now, what if it can't match?

irb(main):004:0> s = "ac"*1_000_000; nil
=> nil
irb(main):005:0> s =~ re

I have no idea what the output is - it's been running for a few
minutes... the same happens if you substitute .+? with .+.

This is a bad case of catastrophic backtracking -
Runaway Regular Expressions: Catastrophic Backtracking - and stems from
the fact that matching a regex to a string is asymptotically
exponential.

In short - you should never use .+ or .+?, unless you have a very good
reason or know precisely what kinds of input you will get.

···

2012/3/13 Brian Candler <b.candler@pobox.com>:

Jan E. wrote in post #1051180:

The part ".*?" of the regular expression is very inefficient

Is it? Have you measured it?

, because it
will at first consume every character until the end of the line and then
try to find the minimum of characters needed.

Does it? There are many implementations of ruby, which particular one(s)
are you referring to?

Your argument suggests that

/(.+?)(.+?)(.+?)(.+?)(.+?)/ =~ "a"*1_000_000

would be extremely inefficient, but actually it runs very fast for me.

So let's demonstrate if you are right or wrong:

<snip>

Brian_Candler · 14 March 2012 21:28

Bartosz Dziewoński wrote in post #1051341:

This is a bad case of catastrophic backtracking -
http://www.regular-expressions.info/catastrophic.html - and stems from
the fact that matching a regex to a string is asymptotically
exponential.

Well, matching a *real* regular expression to a string can be done in
linear time by a DFA, with no backtracking. (However, it's true that the
conversion of some pathological regexps into DFAs can take exponential
resources)

The trouble is that "regexps" in programming languages seem to have
grown features beyond true regular expressions.

···

--
Posted via http://www.ruby-forum.com/\.

Topic		Replies	Views
Problem replacing $data[abc] with $data['abc'] using gsub ruby-talk	0	130	13 March 2012
Sub/gsub bug? ruby-talk	5	102	18 May 2008
Can't find appropriate regexp ruby-talk	16	102	24 June 2003
Gsub help ruby-talk	0	83	29 August 2007
String substitution without RegEx ruby-talk	19	123	24 July 2003

Problem replacing $data[abc] with $data['abc'] using gsub

Related topics