Regular Expression problems

Hello,

I have a problem with regular expressions in Ruby. The closest I have
come to a solution looks like this:

re = /<name: ([a-zA-Z_]+)>\n(.*)(<name:.*)?/m
unprocessed = <<HERE
<name: a_name>
a little content
containing all sorts of
characters, even < and >
<name: another_name>
also containing all sorts
of things
<name: third_name>
i don’t know in advance how
many of these there are but
i settle with three here
HERE

match = re.match unprocessed

while match
    name = match[1]
    content = match[2]
    unprocessed = match[3]

    puts "MATCH", "name:#{name}", "content:#{content}",
         "unproc:#{unprocessed}"

    match = re.match unprocessed
end


Here is the output:


MATCH
name:a_name
content:a little content
containing all sorts of
characters, even < and >
<name: another_name>
also containing all sorts
of things
<name: third_name>
i don’t know in advance how
many of these there are but
i settle with three here


But I want the output to look like this:


MATCH
name:a_name
content:a little content
containing all sorts of
characters, even < and >
name:another_name
content:also containing all sorts
of things
name:third_name
content:i don’t know in advance how
many of these there are but
i settle with three here


So the problem is that the second group, (.*), is too ‘hungry’ and
doesn’t stop at the first occurrence of the third group, (<name:.*).

Is it possible to change this behavior of the second group? Or are
there any better ways to solve this problem?


Best regards,
Jonas

I always thought that record would stand until it was broken.

  • Yogi Berra

Hello again,

Perhaps I should have mentioned that I’m a Ruby newbie, so I would
appreciate any comments on my programming style. I have not had the
time to read Programming Ruby yet (other than as a reference), so I’m
not too familiar with the Ruby way.

By the way, I really like what I’ve seen about Ruby so far!


Best regards,
Jonas

We don’t know a millionth of one percent about anything.
Thomas A. Edison

Hi,

So the problem is that the second group, (.*), is too ‘hungry’ and
doesn’t stop at the first occurrence of the third group, (<name:.*).

Is it possible to change this behavior of the second group? Or are
there any better ways to solve this problem?

re = /<name: ([a-zA-Z_]+)>\n(.*?)(<name:.*|\z)/m

Or:
re = /<name: ([a-zA-Z_]+)>\n(.*?)(?=<name:|\z)/m
while match = re.match(unprocessed)
    name = match[1]
    content = match[2]
    unprocessed = match.post_match # <-
    puts "MATCH", "name:#{name}", "content:#{content}"
end

At Thu, 4 Jul 2002 02:25:01 +0900, Jonas Bengtsson wrote:


Nobu Nakada

Hello Nobu,

Thanks!
I didn’t see this in Programming Ruby before:
re ?
Matches zero or one occurrence of re. The *, +, and {m,n} modifiers
are greedy by default. Append a question mark to make them minimal.
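
To see the difference that trailing question mark makes, here is a minimal sketch. The sample string and the variable names (text, greedy, lazy) are made up for illustration and are not from the posts above:

```ruby
# Made-up sample with two sections, in the same format as the question.
text = "<name: a>\nfirst\n<name: b>\nsecond\n"

# Greedy: (.*) with /m runs to the end of the string.
greedy = /<name: (\w+)>\n(.*)/m.match(text)

# Minimal: (.*?) stops as soon as the lookahead sees the next
# "<name:" (or the end of the string).
lazy = /<name: (\w+)>\n(.*?)(?=<name:|\z)/m.match(text)

puts greedy[2].inspect   # "first\n<name: b>\nsecond\n"
puts lazy[2].inspect     # "first\n"
```

The only change between the two patterns is the ? after .* and the lookahead that gives the minimal match a place to stop.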


Best regards,
Jonas

Anyone can make the simple complicated. Creativity is making the complicated simple.

  • Charles Mingus

Hi –

One more variant, using the mighty #scan:

re = /<name: ([a-zA-Z_]+)>\n(.*?)(?=<name:|\z)/m
unprocessed.scan(re) do |n,c|
    puts "MATCH:", "name:#{n}", "content:#{c}"
end

(Jonas: contrary to your hand-made output, you did want “MATCH”
three times, didn’t you? :-)
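
For anyone who wants to run the #scan approach as-is, here is a self-contained sketch using the sample data from the original question. The names input and sections, and the chomp that trims each trailing newline, are my own additions for illustration and not part of the posts above:

```ruby
# Sample input copied from the original question.
input = <<HERE
<name: a_name>
a little content
containing all sorts of
characters, even < and >
<name: another_name>
also containing all sorts
of things
<name: third_name>
i don't know in advance how
many of these there are but
i settle with three here
HERE

re = /<name: ([a-zA-Z_]+)>\n(.*?)(?=<name:|\z)/m

# With capture groups in the pattern, scan yields one
# [name, content] pair per match.
sections = input.scan(re).map { |name, content| [name, content.chomp] }

sections.each do |name, content|
  puts "MATCH", "name:#{name}", "content:#{content}"
end
```

This prints “MATCH” once per section, three times for the sample data.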

David

On Thu, 4 Jul 2002 nobu.nokada@softhome.net wrote:

At Thu, 4 Jul 2002 02:25:01 +0900, Jonas Bengtsson wrote:


David Alan Black
home: dblack@candle.superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

For a good book on regular expressions, check out Jeffrey Friedl’s
Mastering Regular Expressions, published by O’Reilly.

You can find this answer, and many more, there :-)
Covering everything about regexes in Programming Ruby would take another
book in itself.

Daniel

A consultant is a person who borrows your watch, tells you what time it
is, pockets the watch, and sends you a bill for it.