Regex and non-greedy matching?

Marc_Heiler · 6 April 2008 23:25

I have a slight problem. I have strings with some tags, such as

'<b><lightblue>name:</>'

I need to match "name:" and "lightblue".
In other words:
- What is between <> </>
and
- What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

$1 # => "b"
$2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue

···

--
Posted via http://www.ruby-forum.com/.

Nobuyoshi_Nakada1 · 7 April 2008 01:05

Hi,

At Mon, 7 Apr 2008 08:25:28 +0900,
Marc Heiler wrote in [ruby-talk:297262]:

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</>'

I need to match "name:" and "lightblue"
In other words:
- What is between <> </>
and
- What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

/<([a-zA-Z]+)>.*?([^<>]+)<\/>/ =~ "<lightblue>name:</>"

$2 should only be name:
and $1 should only be lightblue

Non-greedy matching doesn't mean the shortest result matching.
It matches at the leftmost position.
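
A minimal sketch of that point, assuming the input string carried a leading <b> tag (which is what the reported $1 of "b" implies). The variable names are only illustrative. The reluctant quantifier keeps group 2 as short as possible, but the match as a whole still begins at the first '<' in the string:

```ruby
# Assumed input: a leading <b> tag, as implied by $1 => "b" in the question.
str = '<b><lightblue>name:</>'

# Reluctant quantifier: the match still starts at the leftmost '<'.
lazy = str.match(/<([a-zA-Z]+)>(.+?)<\/>/)
p lazy[1]   # "b"
p lazy[2]   # "<lightblue>name:"

# Excluding '<' and '>' from both groups forces the match to start later.
tight = str.match(/<([^<>]+)>([^<>]+)<\/>/)
p tight[1]  # "lightblue"
p tight[2]  # "name:"
```

The second pattern cannot start at <b>, because the text group needs at least one character that is not '<' or '>' right after the tag.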

···

--
Nobu Nakada

Steve_Ross · 7 April 2008 01:59

Hi--

···

On Apr 6, 2008, at 4:25 PM, Marc Heiler wrote:

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</>'

I need to match "name:" and "lightblue"
In other words:
- What is between <> </>
and
- What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

$1 # => "b"
$2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue

You might want to look into hpricot (http://code.whytheluckystiff.net/hpricot/). It will give you pretty reliable parsing of XML markup. What you have here is not valid XML, because the closing tag for <lightblue> is not </lightblue>, but on the chance that it's a typo, I really recommend giving hpricot a try.

7stud · 7 April 2008 04:09

Marc Heiler wrote:

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</>'

I need to match "name:" and "lightblue"
In other words:
 - What is between <> </>
and
 - What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

 $1 # => "b"

This is your string:

'<b><lightblue>name:</>'

and the first part of your regex says to look for a '<', followed by one
or more letters, followed by a '>'. That certainly describes the
string '<b>'.

$2 # => "<lightblue>name:"

This is your string again:

'<b>    <--already matched this
<lightblue>name:</>'

The second part of your regex says to look for a '>', followed by any
character one or more times, followed by '</>'. That certainly
describes the string '><lightblue>name:</>'.

Note that since the characters '</>' only appear once in your string,
the non-greedy quantifier has no effect. By default, regexes are greedy,
so if your string looked like this:

'<b><lightblue>name:</>xxxxxxxxxxxxxxx</>'

then the greedy version of your regex:

/>(.+)<\/>/ <----(no '?')

would match:

><lightblue>name:</>xxxxxxxxxxxxxxx</>

That's because the portion:

<lightblue>name:</>xxxxxxxxxxxxxxx

is interpreted as "any character(.) one or more times(+)".

On the other hand, your non-greedy regex (i.e. with the '?') would match:

><lightblue>name:</>

If you examine your string again:

'<b><lightblue>name:</>'

the 'lightblue' substring is preceded by the characters '><', and that
is different from what precedes 'b'. You can use that fact to get
'lightblue' instead of 'b'. This regex will get 'lightblue':

><([^>]+)

That says to look for '><' followed by one or more characters that are
not a '>'. That will match:

'><lightblue'

To get 'name:', you can do something similar. This is the rest of the
string after 'lightblue':

'>name:</>'

Here is a regex to get 'name:':

>([^<]+)

That says to look for a '>', followed by one or more characters that are
not a '<'. Here it is altogether:

pattern = /><([^>]+)>([^<]+)/
str = "<b><lightblue>name:</>"

match_obj = pattern.match(str)
puts match_obj[1]
puts match_obj[2]

--output:--

lightblue
name:
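
The greedy versus non-greedy contrast described earlier in this post can be checked directly. This is only a sketch: the test string assumes the leading <b> tag implied by the question's $1 of "b", and the variable names are illustrative.

```ruby
# Assumed test string with two "</>" closers and a leading <b> tag.
str = '<b><lightblue>name:</>xxxxxxxxxxxxxxx</>'

# Greedy: .+ runs to the end of the string, then backs off to the LAST "</>".
greedy = str[/>(.+)<\/>/, 1]

# Non-greedy: .+? grows one character at a time and stops at the FIRST "</>".
lazy = str[/>(.+?)<\/>/, 1]

puts greedy  # <lightblue>name:</>xxxxxxxxxxxxxxx
puts lazy    # <lightblue>name:
```

Both versions start at the same '>' (the one that closes <b>); the '?' only changes where the match ends.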

···

--
Posted via http://www.ruby-forum.com/.

Robert_K1 · 7 April 2008 08:06

Constructing a regexp that matches more specifically often helps:

irb(main):001:0> s='<lightblue>name:</>'
=> "<lightblue>name:</>"

irb(main):002:0> md = %r{\s*<([^>]*)>([^<]*)</>}.match s
=> #<MatchData:0x7ff973f4>
irb(main):003:0> md.to_a
=> ["<lightblue>name:</>", "lightblue", "name:"]

irb(main):004:0> md = %r{\s*<([^>]*)>\s*([^<]*)</>}.match s
=> #<MatchData:0x7ff85b54>
irb(main):005:0> md.to_a
=> ["<lightblue>name:</>", "lightblue", "name:"]
irb(main):006:0>

See how this works without a reluctant quantifier?
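
The same idea might be wrapped in a small helper when a string holds several tags. This is a sketch only: tag_pairs is a hypothetical name, not something from this thread, and it uses [^<>]+ in both groups (slightly stricter than the [^>]* and [^<]* above) so that empty tags and empty text are skipped.

```ruby
# Hypothetical helper: returns [tag, text] pairs for every "<tag>text</>" run.
# Only negated character classes are used, no reluctant quantifier.
def tag_pairs(str)
  str.scan(%r{<([^<>]+)>([^<>]+)</>})
end

p tag_pairs('<b><lightblue>name:</>')   # [["lightblue", "name:"]]
p tag_pairs('<red>a</><green>b</>')     # [["red", "a"], ["green", "b"]]
```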

Cheers

robert

···

2008/4/7, Marc Heiler <shevegen@linuxmail.org>:

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</>'

I need to match "name:" and "lightblue"
In other words:
 - What is between <> </>
and
 - What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

 $1 # => "b"
 $2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue

--
use.inject do |as, often| as.you_can - without end
