Regex parsing question

Why does this:

  text= "AA<X>BB<X>CC</X>DD</X>EE"
  regex = %r{(.*)<X>(.*)}

  t = text.sub( regex, "z" );
  print "$1=#{$1}\n$2=#{$2}\n$3=#{$3}\n$4=#{$4}\n"

Return this:

  $1=AA<X>BB
  $2=CC</X>DD</X>EE
  $3=
  $4=

Instead of:

  $1=AA
  $2=BB<X>CC</X>DD</X>EE
  $3=
  $4=

And how would I fix it?

Paul

Why does this:

  text= "AA<X>BB<X>CC</X>DD</X>EE"
  regex = %r{(.*)<X>(.*)}

  t = text.sub( regex, "z" );
  print "$1=#{$1}\n$2=#{$2}\n$3=#{$3}\n$4=#{$4}\n"

Return this:

  $1=AA<X>BB
  $2=CC</X>DD</X>EE
  $3=
  $4=

Because the construct .* means "zero or more non-newline characters, but as many as I can get". We say the * operator is "greedy".

Instead of:

  $1=AA
  $2=BB<X>CC</X>DD</X>EE
  $3=
  $4=

And how would I fix it?

One way would be to switch from the greedy * to the non-greedy (lazy) *?, which matches as few characters as it can while still letting the rest of the pattern match. Your Regexp would then look like this:

%r{(.*?)<X>(.*)}
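To see the difference side by side, here is a minimal sketch that runs both patterns against the string from the question. The helper name captures_for is made up for illustration; it only wraps Regexp#match and MatchData#captures.

```ruby
# Hypothetical helper (not from the thread): returns the capture groups
# for a pattern as an array, or nil if the pattern does not match.
def captures_for(text, pattern)
  m = pattern.match(text)
  m ? m.captures : nil
end

text = "AA<X>BB<X>CC</X>DD</X>EE"

greedy = captures_for(text, %r{(.*)<X>(.*)})   # first group grabs as much as possible
lazy   = captures_for(text, %r{(.*?)<X>(.*)})  # first group grabs as little as possible

puts "greedy: #{greedy.inspect}"  # ["AA<X>BB", "CC</X>DD</X>EE"]
puts "lazy:   #{lazy.inspect}"    # ["AA", "BB<X>CC</X>DD</X>EE"]
```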

Another way is to use split() with a limit:

irb(main):001:0> text= "AA<X>BB<X>CC</X>DD</X>EE"
=> "AA<X>BB<X>CC</X>DD</X>EE"
irb(main):002:0> first, rest = text.split(/<X>/, 2)
=> ["AA", "BB<X>CC</X>DD</X>EE"]
irb(main):003:0> first
=> "AA"
irb(main):004:0> rest
=> "BB<X>CC</X>DD</X>EE"
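One thing to watch with the split() approach is what happens when the string contains no <X> at all. The sketch below wraps the same split call in a helper (split_at_first_tag is a made-up name) to show that the second value comes back as nil in that case, so the caller may want to check for it.

```ruby
# Hypothetical helper (not from the thread): splits at the first <X> only.
# The limit of 2 means at most two pieces are returned, so any later <X>
# stays inside the second piece.
def split_at_first_tag(text)
  first, rest = text.split(/<X>/, 2)
  [first, rest]
end

p split_at_first_tag("AA<X>BB<X>CC</X>DD</X>EE")  # ["AA", "BB<X>CC</X>DD</X>EE"]
p split_at_first_tag("AABB")                      # ["AABB", nil] because there is no <X> to split on
```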

Hope that helps.

James Edward Gray II

On Mar 31, 2005, at 2:49 PM, Paul Hanchett wrote:

Hi --

Why does this:

  text= "AA<X>BB<X>CC</X>DD</X>EE"
  regex = %r{(.*)<X>(.*)}

  t = text.sub( regex, "z" );
  print "$1=#{$1}\n$2=#{$2}\n$3=#{$3}\n$4=#{$4}\n"

Return this:

  $1=AA<X>BB
  $2=CC</X>DD</X>EE
  $3=
  $4=

Instead of:

  $1=AA
  $2=BB<X>CC</X>DD</X>EE
  $3=
  $4=

Because * is "greedy" -- meaning, it eats up as many characters as
possible, from left to right, while still allowing for a successful
match overall.

So your first .* eats up everything until it reaches as far right as
it possibly can -- namely, just before the second <X> (which it then
leaves intact so that it can be matched by the literal <X> in your
regex). It even eats up the first <X>.
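The "as far right as it can" behaviour can be checked by asking the match where each group starts and ends. This is only a sketch: group_offsets is a made-up helper built on MatchData#offset, which returns the start and end positions of a group.

```ruby
# Hypothetical helper (not from the thread): returns [start, end] character
# offsets for every capture group, or nil if the pattern does not match.
def group_offsets(text, pattern)
  m = pattern.match(text)
  return nil unless m
  (1...m.size).map { |i| m.offset(i) }
end

text = "AA<X>BB<X>CC</X>DD</X>EE"

# Greedy: group 1 runs up to the second <X>, which sits at positions 7..9.
p group_offsets(text, %r{(.*)<X>(.*)})   # [[0, 7], [10, 24]]

# Non-greedy: group 1 stops at the first <X>, which sits at positions 2..4.
p group_offsets(text, %r{(.*?)<X>(.*)})  # [[0, 2], [5, 24]]
```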

And how would I fix it?

Use *? instead of * -- like this:

    regex = %r{(.*?)<X>(.*)}

David

On Fri, 1 Apr 2005, Paul Hanchett wrote:

--
David A. Black
dblack@wobblini.net

* Paul Hanchett (Mar 31, 2005 23:00):

  text= "AA<X>BB<X>CC</X>DD</X>EE"
  regex = %r{(.*)<X>(.*)}

use

        regex = %r{(.*?)<X>(.*)}

The first .* will match straight past the first <X> and will only relinquish
the second one so that an overall match can be made (for the <X> part of the regex).
        nikolai

--
::: name: Nikolai Weibull :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA :: loc atm: Gothenburg, Sweden :::
::: page: minimalistic.org :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

Thanks all for the help. I understand better now.

Paul