Regular Expression problems


(Jonas Bengtsson) #1

Hello,

I got a problem with regular expressions in Ruby. The closest I have
come to a solution looks like this:

re = /<name: ([a-zA-Z_]+)>\n(.*)(<name:.*)?/m
unprocessed = <<HERE
<name: a_name>
a little content
containing all sorts of
characters, even < and >
<name: another_name>
also containing all sorts
of things
<name: third_name>
i don’t know in advance how
many of these there are but
i settle with three here
HERE

match = re.match unprocessed

while match
    name = match[1]
    content = match[2]
    unprocessed = match[3]

    puts "MATCH", "name:#{name}", "content:#{content}",
            "unproc:#{unprocessed}"

    match = re.match unprocessed
end


Here is the output:


MATCH
name:a_name
content:a little content
containing all sorts of
characters, even < and >
<name: another_name>
also containing all sorts
of things
<name: third_name>
i don’t know in advance how
many of these there are but
i settle with three here


But I want the output to look like this:


MATCH
name:a_name
content:a little content
containing all sorts of
characters, even < and >
name:another_name
content:also containing all sorts
of things
name:third_name
content:i don’t know in advance how
many of these there are but
i settle with three here


So the problem is that the second group, (.*), is too ‘hungry’ and
doesn’t stop at the first occurrence of the third group, (<name:.*).

Is it possible to change this behavior of the second group? Or are
there any better ways to solve this problem?


Best regards,
Jonas

I always thought that record would stand until it was broken.

  • Yogi Berra

(Jonas Bengtsson) #2

Hello again,

Perhaps I should have mentioned that I’m a Ruby newbie, so I would
appreciate any comments on my programming style. I have not had the
time to read Programming Ruby yet (except as a reference), so I’m
not too familiar with the Ruby way.

By the way, I really like what I’ve seen about Ruby so far!


Best regards,
Jonas

We don’t know a millionth of one percent about anything.
Thomas A. Edison


(Nobuyoshi Nakada) #3

Hi,

So the problem is that the second group, (.*), is too ‘hungry’ and
doesn’t stop at the first occurrence of the third group, (<name:.*).

Is it possible to change this behavior of the second group? Or are
there any better ways to solve this problem?

re = /<name: ([a-zA-Z_]+)>\n(.*?)(<name:.*|\z)/m

Or:
re = /<name: ([a-zA-Z_]+)>\n(.*?)(?=<name:|\z)/m
while match = re.match(unprocessed)
  name = match[1]
  content = match[2]
  unprocessed = match.post_match # <--
  puts "MATCH", "name:#{name}", "content:#{content}"
end
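
For anyone who wants to try the lookahead version, here is a minimal self-contained sketch. The shortened sample input and the `records` array are my own additions for illustration, not part of the original posts; the regex is the one given above.

```ruby
# Shortened sample input modelled on the original post (an assumption).
unprocessed = <<HERE
<name: a_name>
a little content
characters, even < and >
<name: another_name>
of things
HERE

# (.*?) is minimal, and (?=<name:|\z) is a lookahead: it checks for the
# next "<name:" (or the end of the string) without consuming it.
re = /<name: ([a-zA-Z_]+)>\n(.*?)(?=<name:|\z)/m

records = []   # hypothetical accumulator, only here to inspect the result
while (match = re.match(unprocessed))
  records << [match[1], match[2]]
  unprocessed = match.post_match   # continue after the current record
end

records.each do |name, content|
  puts "MATCH", "name:#{name}", "content:#{content}"
end
```

Because the lookahead leaves the next "<name:" in place, post_match starts exactly at the next record, and the loop ends once post_match is the empty string.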

At Thu, 4 Jul 2002 02:25:01 +0900, Jonas Bengtsson wrote:


Nobu Nakada


(Jonas Bengtsson) #4

Hello Nobu,

Wednesday, July 03, 2002, 8:56:22 PM, you wrote:

re = /<name: ([a-zA-Z_]+)>\n(.*?)(<name:.*|\z)/m

Or:
re = /<name: ([a-zA-Z_]+)>\n(.*?)(?=<name:|\z)/m
while match = re.match(unprocessed)
  name = match[1]
  content = match[2]
  unprocessed = match.post_match # <--
  puts "MATCH", "name:#{name}", "content:#{content}"
end

Thanks!
I didn’t see this in Programming Ruby before:
re ?
Matches zero or one occurrence of re. The *, +, and {m,n} modifiers
are greedy by default. Append a question mark to make them minimal.
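
A small sketch of the difference the quoted passage describes, using a made-up string (the variable names are only for illustration):

```ruby
text = "<a><b><c>"

greedy  = text[/<.*>/]    # * takes as much as it can
minimal = text[/<.*?>/]   # *? takes as little as it can

puts greedy    # <a><b><c>
puts minimal   # <a>

# The same suffix works on + and {m,n}:
plus_minimal     = "aaaa"[/a+?/]       # "a"
interval_minimal = "aaaa"[/a{2,3}?/]   # "aa"
puts plus_minimal, interval_minimal
```

Note the two roles of the question mark: on its own it means "zero or one", while directly after *, + or {m,n} it switches that quantifier to minimal matching.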


Best regards,
Jonas

Anyone can make the simple complicated. Creativity is making the complicated simple.

  • Charles Mingus

(David Alan Black) #5

Hi –

Hi,

So the problem is that the second group, (.*), is too ‘hungry’ and
doesn’t stop at the first occurrence of the third group, (<name:.*).

Is it possible to change this behavior of the second group? Or are
there any better ways to solve this problem?

re = /<name: ([a-zA-Z_]+)>\n(.*?)(<name:.*|\z)/m

Or:
re = /<name: ([a-zA-Z_]+)>\n(.*?)(?=<name:|\z)/m
while match = re.match(unprocessed)
  name = match[1]
  content = match[2]
  unprocessed = match.post_match # <--
  puts "MATCH", "name:#{name}", "content:#{content}"
end

One more variant, using the mighty #scan:

re = /<name: ([a-zA-Z_]+)>\n(.*?)(?=<name:|\z)/m
unprocessed.scan(re) do |n,c|
  puts "MATCH:", "name:#{n}", "content:#{c}"
end

(Jonas: contrary to your hand-made output, you did want "MATCH"
three times, didn’t you? :-)
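
If the goal is to keep the sections rather than print them, the same #scan call can fill a hash. This is only a sketch: the `sections` hash, the shortened input and the chomp of the trailing newline are my assumptions, not something the original poster asked for.

```ruby
# Shortened sample input modelled on the original post (an assumption).
unprocessed = <<HERE
<name: a_name>
a little content
<name: another_name>
of things
HERE

re = /<name: ([a-zA-Z_]+)>\n(.*?)(?=<name:|\z)/m

# With two capture groups, scan yields one [name, content] pair per match.
sections = {}   # hypothetical name => content lookup
unprocessed.scan(re) do |name, content|
  sections[name] = content.chomp   # drop the newline before the next header
end

puts sections["another_name"]
```

Unlike the while loop, #scan walks the string itself, so unprocessed is left untouched.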

David

On Thu, 4 Jul 2002 nobu.nokada@softhome.net wrote:

At Thu, 4 Jul 2002 02:25:01 +0900, Jonas Bengtsson wrote:


David Alan Black
home: dblack@candle.superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav


(Daniel) #6

For a good book on regular expressions, check out Jeffrey Friedl’s book
from O’Reilly, Mastering Regular Expressions.

You can find this answer, and many more, there :-)
For Programming Ruby to cover all the regex things, it would take another
book in itself.

Daniel


A consultant is a person who borrows your watch, tells you what time it
is, pockets the watch, and sends you a bill for it.