Regex and non-greedy matching?

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</></b>'

I need to match "name:" and "lightblue"
In other words:
  - What is between <> </>
  - What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

  $1 # => "b"
  $2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue




At Mon, 7 Apr 2008 08:25:28 +0900,
Marc Heiler wrote in [ruby-talk:297262]:

> I have a slight problem. I have strings with some tags such as
>
> '<b><lightblue>name:</></b>'
>
> I need to match "name:" and "lightblue"
> In other words:
>   - What is between <> </>
>   - What is inside the first <> right next to "name:"
>
> The following regex does not work:
>
> '<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

/<([a-zA-Z]+)>.*?([^<>]+)<\/>/ =~ "<b><lightblue>name:</></b>"

> $2 should only be name:
> and $1 should only be lightblue

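Non-greedy matching doesn't mean that the shortest possible result is
matched. The match still starts at the leftmost position where a match
is possible; the non-greedy quantifier only consumes as little as it can
from there.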
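
A small sketch of that point, using only core Ruby. The helper name `first_match` is made up for illustration and is not from this thread. It reports where the overall match starts, plus both captures, for the original lazy pattern and for a variant whose body may not contain '<' or '>':

```ruby
# Hypothetical helper: returns [start_offset, capture 1, capture 2]
# for the first match, or nil when nothing matches.
def first_match(pattern, str)
  m = pattern.match(str)
  m && [m.begin(0), m[1], m[2]]
end

str = '<b><lightblue>name:</></b>'

# Lazy body: the match still starts at offset 0, the leftmost '<'.
p first_match(/<([a-zA-Z]+)>(.+?)<\/>/, str)
# => [0, "b", "<lightblue>name:"]

# Body restricted to non-bracket characters: no match can start at '<b>',
# so the engine moves on and starts at '<lightblue>' instead.
p first_match(/<([a-zA-Z]+)>([^<>]+)<\/>/, str)
# => [3, "lightblue", "name:"]
```

The second pattern works because it makes a match starting at '<b>' impossible, not because it is lazier.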


Nobu Nakada



On Apr 6, 2008, at 4:25 PM, Marc Heiler wrote:

> I have a slight problem. I have strings with some tags such as
>
> '<b><lightblue>name:</></b>'
>
> I need to match "name:" and "lightblue"
> In other words:
> - What is between <> </>
> - What is inside the first <> right next to "name:"
>
> The following regex does not work:
>
> '<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/
>
> $1 # => "b"
> $2 # => "<lightblue>name:"
>
> $2 should only be name:
> and $1 should only be lightblue

You might want to look into hpricot. It will give you pretty reliable parsing of XML markup. What you have here is not valid XML, because the closing tag for <lightblue> is </> rather than </lightblue>, but on the chance that that's a typo, I really recommend giving hpricot a try.

Marc Heiler wrote:

> I have a slight problem. I have strings with some tags such as
>
> '<b><lightblue>name:</></b>'
>
> I need to match "name:" and "lightblue"
> In other words:
>   - What is between <> </>
>   - What is inside the first <> right next to "name:"
>
> The following regex does not work:
>
> '<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/
>
>   $1 # => "b"

This is your string:

'<b><lightblue>name:</></b>'

and the first part of your regex says to look for a '<', followed by one
or more letters, followed by a '>'. That certainly describes the
string '<b>'.

>   $2 # => "<lightblue>name:"

This is your string again:

'<b><lightblue>name:</></b>'
'<b>'  <--already matched this

The second part of your regex says to look for any character one or
more times, followed by '</>'. That certainly describes the string
'<lightblue>name:</>'.

Note that since the characters '</>' only appear once in your string,
the non-greedy qualifier has no effect. By default, regexes are greedy,
so if your string looked like this:


then the greedy version of your regex:

/>(.+)<\/>/ <----(no '?')

would match:


That's because the portion:

.+

is interpreted as "any character (.) one or more times (+)".

On the other hand, your non-greedy regex (i.e. with the '?') would match:


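As an illustration of the greedy/non-greedy difference, here is a sketch with a made-up input in which '</>' appears twice. Both the `sample` string and the helper name `capture` are hypothetical and not taken from this thread:

```ruby
# Hypothetical helper: returns the first capture group of the first match,
# or nil when the pattern does not match.
def capture(pattern, str)
  str[pattern, 1]
end

# Made-up input with two '</>' closers.
sample = '<red>one</><blue>two</>'

# Greedy: .+ runs to the end of the string, then backs off to the LAST '</>'.
p capture(/>(.+)<\/>/, sample)    # => "one</><blue>two"

# Non-greedy: .+? stops at the FIRST '</>' it can reach.
p capture(/>(.+?)<\/>/, sample)   # => "one"
```

With only one '</>' in the string, as in the original example, both versions capture the same text.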
If you examine your string again:

'<b><lightblue>name:</></b>'

the 'lightblue' substring is preceded by the characters '><', and that
is different from what precedes 'b'. You can use that fact to get
'lightblue' instead of 'b'. This regex will get 'lightblue':

/><([^>]+)/

That says to look for '><' followed by one or more characters that are
not a '>'. That will match:

'><lightblue'

To get 'name:', you can do something similar. This is the rest of the
string after 'lightblue':

'>name:</></b>'

Here is a regex to get 'name:':

/>([^<]+)/

That says to look for a '>', followed by one or more characters that are
not a '<'. Here it is altogether:

pattern = /><([^>]+)>([^<]+)/
str = "<b><lightblue>name:</></b>"

match_obj = pattern.match(str)
puts match_obj[1]   # lightblue
puts match_obj[2]   # name:
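
One caveat with the snippet above: `pattern.match(str)` returns nil when nothing matches, so `match_obj[1]` would raise a NoMethodError on such input. A possible nil-safe variant is sketched below. The names `TAG_AND_TEXT` and `tag_and_text` are hypothetical, and the named captures assume Ruby 1.9 or later:

```ruby
# Same pattern as above, with named groups (Ruby 1.9+).
TAG_AND_TEXT = /><(?<tag>[^>]+)>(?<text>[^<]+)/

# Hypothetical helper: returns [tag, text], or nil when the string
# does not contain the expected '><tag>text' shape.
def tag_and_text(str)
  m = TAG_AND_TEXT.match(str)
  return nil unless m
  [m[:tag], m[:text]]
end

p tag_and_text("<b><lightblue>name:</></b>")   # => ["lightblue", "name:"]
p tag_and_text("no tags here")                 # => nil
```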





Constructing a more specific regexp often helps:

irb(main):001:0> s='<b><lightblue>name:</></b>'
=> "<b><lightblue>name:</></b>"

irb(main):002:0> md = %r{<b>\s*<([^>]*)>([^<]*)</>}.match s
=> #<MatchData:0x7ff973f4>
irb(main):003:0> md.to_a
=> ["<b><lightblue>name:</>", "lightblue", "name:"]

irb(main):004:0> md = %r{<b>\s*<([^>]*)>\s*([^<]*)</>}.match s
=> #<MatchData:0x7ff85b54>
irb(main):005:0> md.to_a
=> ["<b><lightblue>name:</>", "lightblue", "name:"]

See how this works without a reluctant quantifier?
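
If a string holds several such entries, the same specific pattern could be combined with String#scan. This is only a sketch: the helper name `tagged_pairs` and the two-entry input are made up for illustration:

```ruby
# Hypothetical helper: collects every [tag, text] pair that follows a <b>.
def tagged_pairs(str)
  str.scan(%r{<b>\s*<([^>]*)>\s*([^<]*)</>})
end

# Made-up input with two entries.
line = '<b><lightblue>name:</></b> <b><red>age:</></b>'

p tagged_pairs(line)
# => [["lightblue", "name:"], ["red", "age:"]]
```

Because the pattern has two groups, scan returns one two-element array per match.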





use.inject do |as, often| as.you_can - without end