Regex and non-greedy matching?

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</></b>'

I need to match "name:" and "lightblue"
In other words:
  - What is between <> </>
and
  - What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

  $1 # => "b"
  $2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue


--
Posted via http://www.ruby-forum.com/.

Hi,

At Mon, 7 Apr 2008 08:25:28 +0900,
Marc Heiler wrote in [ruby-talk:297262]:

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</></b>'

I need to match "name:" and "lightblue"
In other words:
  - What is between <> </>
and
  - What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

/<([a-zA-Z]+)>.*?([^<>]+)<\/>/ =~ "<b><lightblue>name:</></b>"

$2 should only be name:
and $1 should only be lightblue

Non-greedy matching doesn't mean finding the shortest possible match.
The match still starts at the leftmost position where the pattern can succeed.
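
To make the leftmost rule concrete, here is a small sketch. The first pattern is the one from the question; the second is only an illustrative variant (an assumption of this sketch, not the pattern suggested above) that forbids '<' and '>' inside the second group, so the match cannot begin at "<b>".

```ruby
str = '<b><lightblue>name:</></b>'

# The lazy quantifier still begins at the leftmost possible start, the "<b>" tag,
# and then grows only as far as it must to reach the first "</>".
lazy = %r{<([a-zA-Z]+)>(.+?)</>}.match(str)
lazy_captures = lazy.captures

# Illustrative variant: the second group may not contain "<" or ">",
# so no match can begin at "<b>" and the engine moves on to "<lightblue>".
strict = %r{<([a-zA-Z]+)>([^<>]+)</>}.match(str)
strict_captures = strict.captures

p lazy_captures
p strict_captures
```

The first call prints the captures from the question ("b" and "<lightblue>name:"), the second prints "lightblue" and "name:".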


--
Nobu Nakada

Hi--


On Apr 6, 2008, at 4:25 PM, Marc Heiler wrote:

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</></b>'

I need to match "name:" and "lightblue"
In other words:
- What is between <> </>
and
- What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

$1 # => "b"
$2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue

You might want to look into hpricot (http://code.whytheluckystiff.net/hpricot/). It will give you pretty reliable parsing of XML markup. What you have here is not valid XML, because the closing tag for <lightblue> is not </lightblue>, but on the chance that that's a typo, I really recommend giving hpricot a try.

Marc Heiler wrote:

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</></b>'

I need to match "name:" and "lightblue"
In other words:
  - What is between <> </>
and
  - What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

  $1 # => "b"

This is your string:

'<b><lightblue>name:</></b>'

and the first part of your regex says to look for a '<', followed by one
or more letters, followed by a '>'. That certainly describes the
string '<b>'.

  $2 # => "<lightblue>name:"

This is your string again:

'<b> <--already matched this
    <lightblue>name:</></b>'

The second part of your regex says to look for any character one or
more times, followed by '</>'. That certainly describes the string
'<lightblue>name:</>'.
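
If it helps to see exactly which characters each part of the regex consumed, a quick sketch with MatchData#offset shows the start and end index of every group. The variable names here are just for illustration.

```ruby
str = '<b><lightblue>name:</></b>'
md = %r{<([a-zA-Z]+)>(.+?)</>}.match(str)

whole_match   = md[0]        # everything the regex consumed
group1_offset = md.offset(1) # [start, end) of the first group ("b")
group2_offset = md.offset(2) # [start, end) of the second group

puts whole_match
p group1_offset
p group2_offset
```

The second group starts at index 3, immediately after '<b>', which is why it swallows '<lightblue>' along with 'name:'.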

Note that since the characters '</>' appear only once in your string,
the non-greedy qualifier has no effect. By default, regex quantifiers are
greedy, so if your string looked like this:

'<b><lightblue>name:</></b>xxxxxxxxxxxxxxx</>'

then the greedy version of your regex:

/>(.+)<\/>/ <----(no '?')

would match:

<lightblue>name:</></b>xxxxxxxxxxxxxxx</>

That's because the portion:

<lightblue>name:</></b>xxxxxxxxxxxxxxx

is interpreted as "any character (.) one or more times (+)".

On the other hand, your non-greedy regex (i.e. with the '?') would match:

<lightblue>name:</>
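
Here is a rough way to check the greedy/non-greedy difference yourself. This sketch uses the full regex from the question (rather than the shortened one above) and builds the longer example string with a run of 15 x's; the variable names are made up for the example.

```ruby
tail = 'x' * 15
str  = "<b><lightblue>name:</></b>#{tail}</>"

# Greedy: (.+) runs to the end of the string, then backs up to the LAST "</>".
greedy = str[%r{<([a-zA-Z]+)>(.+)</>}, 2]

# Non-greedy: (.+?) stops at the FIRST "</>" it can reach.
lazy = str[%r{<([a-zA-Z]+)>(.+?)</>}, 2]

puts greedy
puts lazy
```

Both versions still start matching at '<b>'; the '?' only changes where the second group ends.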

If you examine your string again:

'<b><lightblue>name:</></b>'

the 'lightblue' substring is preceded by the characters '><', and that
is different from what precedes 'b'. You can use that fact to get
'lightblue' instead of 'b'. This regex will get 'lightblue':

><([^>]+)

That says to look for '><' followed by one or more characters that are
not a '>'. That will match:

'><lightblue'

To get 'name:', you can do something similar. This is the rest of the
string after 'lightblue':

'>name:</></b>'

Here is a regex to get 'name:':

>([^<]+)

That says to look for a '>', followed by one or more characters that are
not a '<'. Here it is altogether:

pattern = /><([^>]+)>([^<]+)/
str = "<b><lightblue>name:</></b>"

match_obj = pattern.match(str)
puts match_obj[1]
puts match_obj[2]

--output:--

lightblue
name:


Constructing a more specific regexp often helps:

irb(main):001:0> s='<b><lightblue>name:</></b>'
=> "<b><lightblue>name:</></b>"

irb(main):002:0> md = %r{<b>\s*<([^>]*)>([^<]*)</>}.match s
=> #<MatchData:0x7ff973f4>
irb(main):003:0> md.to_a
=> ["<b><lightblue>name:</>", "lightblue", "name:"]

irb(main):004:0> md = %r{<b>\s*<([^>]*)>\s*([^<]*)</>}.match s
=> #<MatchData:0x7ff85b54>
irb(main):005:0> md.to_a
=> ["<b><lightblue>name:</>", "lightblue", "name:"]
irb(main):006:0>

See how this works without a reluctant quantifier?

Cheers

robert


2008/4/7, Marc Heiler <shevegen@linuxmail.org>:

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</></b>'

I need to match "name:" and "lightblue"
In other words:
  - What is between <> </>
and
  - What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

  $1 # => "b"
  $2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue

--
use.inject do |as, often| as.you_can - without end