Multiple regexp matches

Kevin_Howe1 · 17 August 2004 18:45

I want to get multiple results of a regexp pattern match, offsets included.
The following code gets the proper results, but does not return offsets:

str = ' ... '
re = Regexp.new('(<(\/?)span>)', true) # match start or end tag
print str.scan(re).inspect

The Regexp class will return offsets, but Regexp#match only returns the
first match, so I'm not sure how to get the full list of matches.

Zach_Dennis1 · 17 August 2004 18:51

Instead of using str.scan(re) use:

re.match( str );

which returns a MatchData object. You can use the MatchData object's offset method to find your results, I believe.

Zach


Kevin_Howe1 · 17 August 2004 19:25

"Zach Dennis" <zdennis@mktec.com> wrote in message
news:4122539A.1000900@mktec.com...

Instead of using str.scan(re) use:

re.match( str );

which returns a MatchData object. You can use the MatchData object's
offset method to find your results.....i do believe.

Yes that's true, but if you read the second part of my message, I'd already
tried this:

···

The Regexp class will return offsets, but Regexp#match only returns the
first match, so I'm not sure how to get the full list of matches.

Zach_Dennis1 · 17 August 2004 19:55

According to rdoc you are mistaken.

I also think you are mistaken:

#!/usr/bin/ruby
t = "This is my 1 text"

re = /([^\s]*\s).*(\d)(\s.)/
md = re.match( t );
puts md.offset(0);
puts ""
puts md.offset( 1 );
puts ""
puts md.offset( 2 );
puts ""
puts md.offset( 3 );

It returns the correct offsets of the matches: offset(0) covers the whole match, offset(1) the first subexpression, offset(2) the second subexpression, and so on. It works.
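For reference, here is a small sketch of what those offset calls return for the string above, collected into one array instead of printed line by line. The second half uses a made-up string to show the limitation under discussion: match stops at the first hit.

```ruby
text = "This is my 1 text"
re = /([^\s]*\s).*(\d)(\s.)/

md = re.match(text)

# offset(n) returns [start, end) positions for group n; 0 is the whole match.
offsets = (0...md.size).map { |n| md.offset(n) }
p offsets  # => [[0, 14], [0, 5], [11, 12], [12, 14]]

# match stops at the first hit, even when the pattern occurs again later.
first_only = /\w+/.match("abc abc abc")
p first_only.offset(0)  # => [0, 3]
```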

Zach


David_A_Black3 · 17 August 2004 20:02

Hi --

According to rdoc you are mistaken.

I also think you are mistaken:

#!/usr/bin/ruby
t = "This is my 1 text"

re = /([^\s]*\s).*(\d)(\s.)/
md = re.match( t );
puts md.offset(0);
puts ""
puts md.offset( 1 );
puts ""
puts md.offset( 2 );
puts ""
puts md.offset( 3 );

It returns the correct offsets of the matches. offset(0) being the whole
regex, offset(1) does the first subexpression, offset(2) does the second
subexpression. It works.

The problem is that Kevin wanted to scan a string more than once with
the same regex:

str = "abc abc abc"
re = /(\w+)/ # not /(\w+) (\w+) (\w+)/

re will scan against str three times. The difficulty is getting hold
of the offsets of all the matches from all three times, in relation to
the total length of the string.

Someone will probably post a simple or elegant solution; in the
meantime, here's mine:

def find_offsets(str, re)
  offsets = []
  first = 0

  loop do
    break unless m = re.match(str[first..-1])
    break if m.captures.empty?
    m.captures.each_with_index do |c, i|
      of = m.offset(i + 1)
      res = [c, [of[0] + first, of[1] + first]]
      yield res if block_given?
      offsets << res
    end
    # Resume after the whole match, by at least one character, so a
    # match at offset 0 (or an empty match) cannot loop forever.
    first += [m.end(0), 1].max
  end

  offsets
end

# Little test:

str = ' ... '
re = /(<(\/?)span>)/i

puts str
(str.size/9).times { print "0123456789" }
puts; puts

find_offsets(str,re).each do |capture, (start, stop)|
  puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
end

# Output:
 ... 
0123456789012345678901234567890123456789

"" starts at 14, ends at 20
"" starts at 15, ends at 15
"" starts at 23, ends at 30
"/" starts at 24, ends at 25
"" starts at 31, ends at 38
"/" starts at 32, ends at 33
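On later Ruby versions (1.9 and up), Regexp#match accepts a start position and reports offsets relative to the whole string, so the slicing and manual adjustment in the loop above may not be needed. A minimal sketch under that assumption; the helper name each_capture_offset and the sample string are made up for illustration:

```ruby
# Collect [capture, [start, stop]] pairs for every match of re in str.
# Regexp#match(str, pos) reports offsets relative to the whole string.
def each_capture_offset(str, re)
  results = []
  pos = 0
  while pos <= str.length && (m = re.match(str, pos))
    m.captures.each_with_index do |capture, i|
      results << [capture, m.offset(i + 1)]
    end
    # Step past the match; move one extra character after an empty match.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  results
end

sample = "a <span>b</span> c"   # made-up input
pairs = each_capture_offset(sample, /(<(\/?)span>)/i)
pairs.each do |capture, (start, stop)|
  puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
end
```

For the sample string this gives the opening tag at 2..8 and the closing tag at 9..16, each followed by the offsets of the inner "/" group.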

David

···

On Wed, 18 Aug 2004, Zach Dennis wrote:

--
David A. Black
dblack@wobblini.net

Austin_Ziegler5 · 17 August 2004 20:25

This should do what you want.

-austin

str = ' ... '
re = /(<(\/?)span>)/i

str.scan(re) # => [["", ""], ["", "/"], ["", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end
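The same collection step can be written as one expression by wrapping scan in an Enumerator. This is only a sketch of a common idiom, assuming a Ruby version where to_enum is available (1.8.7 or later); the sample string is made up:

```ruby
str = "a <span>b</span> c"   # made-up sample
re  = /(<(\/?)span>)/i

# to_enum(:scan, re) turns the block form of scan into an Enumerator;
# map then collects the MatchData that scan sets for each hit.
matches = str.to_enum(:scan, re).map { Regexp.last_match }

whole_offsets = matches.map { |m| m.offset(0) }
p whole_offsets  # => [[2, 8], [9, 16]]
```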

Zach_Dennis1 · 17 August 2004 20:52

Ah...thanks for the clarification David. I was mistaken.

Sorry for the confusion Kevin.

Zach


Kevin_Howe1 · 17 August 2004 21:46

Awesome, that works great, thank you. I have to wonder why Ruby doesn't have
this built in; it's simple enough to add a method that yields a
MatchData object for each match, as follows:

class MultiRegexp < Regexp
  def matches(str)
    str.scan(self) do
      yield Regexp.last_match
    end
  end
end

str = ' ... '
re = MultiRegexp.new('(<(\/?)span>)', true)
re.matches(str) { |i|
  capture = i.captures[0]
  start,stop = i.offset(0)
  puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
}

An even nicer alternative would be to add a Regexp::MULTIMATCH constant:

str = ' ... '
re = Regexp.new('(<(\/?)span>)', Regexp::MULTIMATCH)
matches = re.match(str)

Just a thought

Robert · 18 August 2004 10:31

"Austin Ziegler" <halostatue@gmail.com> wrote in message
news:9e7db91104081713254f2eb39e@mail.gmail.com...

This should do what you want.

-austin

str = ' ... '
re = /(<(\/?)span>)/i

str.scan(re) # => [["", ""], ["", "/"], ["", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end

While that works, isn't it ridiculous that one has to resort to a class
method ("Regexp.last_match")? I mean, there should rather be something like

/o/.each( "foo" ) do |md|
  # md is MatchData
end

Or even

/o/.matcher( "foo" ).each do |md|
  # md is MatchData
end

That way Matcher could implement Enumerable.

Sounds like a candidate for a RCR. Any comments?
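To make the idea concrete, here is a rough sketch of what such a Matcher might look like. The class is hypothetical and not part of Ruby; it also assumes Regexp#match accepts a start position, which holds on Ruby 1.9 and later:

```ruby
# Hypothetical Matcher: pairs a regexp with a string and enumerates
# one MatchData per match. Not part of Ruby's standard library.
class Matcher
  include Enumerable

  def initialize(regexp, string)
    @regexp = regexp
    @string = string
  end

  def each
    return to_enum(:each) unless block_given?
    pos = 0
    while pos <= @string.length && (md = @regexp.match(@string, pos))
      yield md
      # Step past the match; move one extra character after an empty match.
      pos = md.end(0) == md.begin(0) ? md.end(0) + 1 : md.end(0)
    end
    self
  end
end

matcher = Matcher.new(/o/, "foo")
starts = matcher.map { |md| md.begin(0) }
p starts  # => [1, 2]
```

Because each is the only method defined by hand, map, select, to_a and the rest come from Enumerable, and the connection between regexp, string and MatchData is explicit rather than a side effect.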

robert

Simon_Strandgaard1 · 18 August 2004 10:46

What about $~ ?

bash-2.05b$ ruby a.rb
[[0, 13], [14, 20], [23, 30], [31, 38]]
bash-2.05b$ expand -t2 a.rb
str = ' ... '
re = Regexp.new('(<(\/?)span[^\n/]*?>)', true) # match start or end tag
positions = []
str.scan(re) do
  positions << [$~.begin(0), $~.end(0)]
end
p positions
bash-2.05b$


--
Simon Strandgaard

Austin_Ziegler5 · 18 August 2004 14:55

There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings. There are probably others, but those
are the ones that come to mind. This *will* break some code,
unfortunately, but that can be mitigated by adding #to_str. IMO,
this will make #gsub much easier to deal with, as you won't have to
resort to either Regexp.last_match or $[0-9] variables to be able to
work with captures. My Regexp.last_match call only presumes that
Regexp.last_match is actually threadsafe, whereas we know that the
ugly Perlish $ variables are threadsafe. I think this is an
acceptable level of incompatibility because of the use of #to_str
and the amount of flexibility that would be gained. As far as I
know, it wouldn't require *that* big a change, because for
Regexp.last_match to work, there must still be a MatchData object
*somewhere*.

What do you think?

-austin


--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca

Robert · 18 August 2004 11:05

"Simon Strandgaard" <neoneye@adslhome.dk> wrote in message
news:200408181437.58446.neoneye@adslhome.dk...


What about $~ ?

bash-2.05b$ ruby a.rb
[[0, 13], [14, 20], [23, 30], [31, 38]]
bash-2.05b$ expand -t2 a.rb
str = ' ... '
re = Regexp.new('(<(\/?)span[^\n/]*?>)', true) # match start or end tag
positions = []
str.scan(re) do
  positions << [$~.begin(0), $~.end(0)]
end
p positions
bash-2.05b$

This has the same problem, only that in this case you don't use a class
method but a global variable. Both of them are not in any way connected to
the regexp you use other than through a hidden side effect of the matching
process. I like more explicit connection similar to the one I suggested.

Kind regards

robert

Simon_Strandgaard1 · 18 August 2004 15:13

[snip]

Agreed, this would be nice. I think I saw an RCR about it a long time
ago (but I cannot locate that RCR).

Btw: my Ruby regexp engine does so; it yields MatchData instead of a string.
http://raa.ruby-lang.org/project/regexp/

···

On Wednesday 18 August 2004 16:55, Austin Ziegler wrote:

There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings.

--
Simon Strandgaard

Florian_Gross · 18 August 2004 20:30

Austin Ziegler wrote:

There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings.

I agree with this, and it seems that the only reason matz hasn't done this yet is backwards compatibility.

I'm referring to this posting of his:

http://groups.google.com/groups?selm=1061229894.060091.14659.nullmailer%40picachu.netlab.jp

What do you think?

I heavily agree with this. It's the way it should have been since the beginning. #to_str sounds like a way that shouldn't break too much code, and Ruby could issue a migration warning when it is called.

Rite was said to sacrifice compatibility for the sake of more elegance, so now might be a good time for switching.

Regards,
Florian Gross

Robert · 19 August 2004 08:15

"Austin Ziegler" <halostatue@gmail.com> wrote in message
news:9e7db911040818075512cd5a01@mail.gmail.com...

There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings. There are probably others, but those
are the ones that come to mind. This *will* break some code,
unfortunately, but that can be mitigated by adding #to_str. IMO,
this will make #gsub much easier to deal with, as you won't have to
resort to either Regexp.last_match or $[0-9] variables to be able to
work with captures. My Regexp.last_match call only presumes that
Regexp.last_match is actually threadsafe, whereas we know that the
ugly Perlish $ variables are threadsafe. I think this is an
acceptable level of incompatibility because of the use of #to_str
and the amount of flexibility that would be gained. As far as I
know, it wouldn't require *that* big a change, because for
Regexp.last_match to work, there must still be a MatchData object
*somewhere*.

What do you think?

I like the functionality very much, but I'd prefer *not* to change the
behavior of String#scan, #sub, and #gsub. I'd rather have Regexp#scan(str,
&block), Regexp#sub(str, replace=nil, &block) and Regexp#gsub(str,
replace=nil, &block) that yield MatchData if there is a block. There might
be other names, but since the behavior is quite similar to those methods in
String these names are probably good. The only drawback I can see is that
they might cause confusion ("Which were the ones that yielded MatchData?"),
but IMHO people can cope with this, especially since the old behavior does
not change. (Personally I would find it easy to remember that Regexp <->
MatchData and String <-> String or Array of String.)
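A rough sketch of how the Regexp-side variants could behave, written as module functions so that neither String nor Regexp is modified. The module name RegexpBlocks and both helpers are hypothetical stand-ins for the proposed Regexp#scan and Regexp#gsub; the replace-string forms are left out:

```ruby
# Hypothetical helpers standing in for the proposed Regexp#scan and
# Regexp#gsub; neither method exists on Regexp in core Ruby.
module RegexpBlocks
  # Yields one MatchData per match; String#scan itself is untouched.
  def self.scan(re, str)
    str.scan(re) { yield Regexp.last_match }
    str
  end

  # Like String#gsub with a block, but the block receives MatchData.
  def self.gsub(re, str)
    str.gsub(re) { yield Regexp.last_match }
  end
end

found = []
RegexpBlocks.scan(/(\w)(\d)/, "a1 b2") { |md| found << [md[1], md.offset(2)] }
p found  # => [["a", [1, 2]], ["b", [4, 5]]]

swapped = RegexpBlocks.gsub(/(\w)(\d)/, "a1 b2") { |md| md[2] + md[1] }
p swapped  # => "1a 2b"
```

Existing callers of String#scan and String#gsub keep receiving strings, which is the compatibility argument made above.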

Kind regards

robert

···

On Wed, 18 Aug 2004 19:31:01 +0900, Robert Klemme <bob.news@gmx.net> wrote:

Austin_Ziegler5 · 22 August 2004 05:00

Here is the RCR I will be submitting. There is a server error on
rcrchive that prevents me from submitting it there.

Make String#scan, #gsub, and #sub yield MatchData objects
backwards compatibility [x]

Abstract:
A "least-break" change to <code>String#scan</code>,
<code>#gsub</code>, and <code>#sub</code> to provide the MatchData to
attached code blocks.

Problem:
<code>String#scan</code>, <code>#gsub</code>, and <code>#sub</code>
yield the string value of the matched regular expression to a provided
block, which is of very limited value. Currently, we must rely upon
either ugly numeric match variables (<code>$1</code> - <code>$9</code>,
etc.) or a class method (<code>Regexp.last_match</code>) to
obtain the match.

<pre>str = ' ... '
re = /(<(\/?)span>)/i

str.scan(re)
# => [[" ", ""], [" ", "/"], [" ", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end</pre>

Proposal:
<code>String#scan</code>, <code>#sub</code>, and <code>#gsub</code>
yield MatchData objects instead of Strings. I think that this could be
achieved while breaking the least amount of code by adding a #to_str
implementation to MatchData.

Analysis:
I have written code as noted in the problem section; it feels
unnecessarily complex and fragile. This change will work in all cases
where a single string is provided; it will require a change to code
that deals with array values (e.g., when String#scan is given a pattern
with groups, because of the use of rb_reg_nth_match in scan_once).
Switching such code to MatchData#captures will work
just fine.

Implementation:
I *think* that the changes look something like this:
<pre>
--- re.c.old 2004-08-22 00:24:09 Eastern Daylight Time
+++ re.c 2004-08-22 00:18:50 Eastern Daylight Time

@@ -2320,6 +2320,7 @@
     rb_define_method(rb_cMatch, "pre_match", rb_reg_match_pre, 0);
     rb_define_method(rb_cMatch, "post_match", rb_reg_match_post, 0);
     rb_define_method(rb_cMatch, "to_s", match_to_s, 0);
+ rb_define_method(rb_cMatch, "to_str", match_to_s, 0);
     rb_define_method(rb_cMatch, "inspect", rb_any_to_s, 0); /* in object.c */
     rb_define_method(rb_cMatch, "string", match_string, 0);
}

--- string.c.old 2004-08-22 00:24:10 Eastern Daylight Time
+++ string.c 2004-08-22 00:20:35 Eastern Daylight Time

@@ -1928,7 +1928,7 @@

if (iter) {
 rb_match_busy(match);
- repl = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
+ repl = rb_obj_as_string(rb_yield(match));
 rb_backref_set(match);
 }
 else {
@@ -2043,7 +2043,7 @@
 regs = RMATCH(match)-> regs;
 if (iter) {
 rb_match_busy(match);
- val = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
+ val = rb_obj_as_string(rb_yield(match));
 rb_backref_set(match);
 }
 else {
@@ -4164,15 +4164,7 @@
 else {
 *start = END(0);
 }
- if (regs-> num_regs == 1) {
- return rb_reg_nth_match(0, match);
- }
- result = rb_ary_new2(regs-> num_regs);
- for (i=1; i < regs-> num_regs; i++) {
- rb_ary_push(result, rb_reg_nth_match(i, match));
- }

···

-
- return result;
+ return match;
}
return Qnil;
}
</pre>

I'm not 100% sure that this is right, and I haven't tested it. The
equivalent Ruby code would be (note: this code appears to work, but
it does cause problems with irb):

<pre>class MatchData
  def to_str
    self.to_s
  end
end

class String
  alias_method :old_scan, :scan
  alias_method :old_gsub!, :gsub!
  alias_method :old_sub!, :sub!

  def scan(pattern)
    if block_given?
      old_scan(pattern) { yield Regexp.last_match }
    else
      old_scan(pattern)
    end
  end

  def gsub(pattern, repl = nil, &block)
    s = self.dup
    s.gsub!(pattern, repl, &block)
    s
  end

  def gsub!(pattern, repl = nil)
    if block_given? and repl.nil?
      old_gsub!(pattern) { yield Regexp.last_match }
    elsif repl.nil?
      old_gsub!(pattern)
    else
      old_gsub!(pattern, repl)
    end
  end

  def sub(pattern, repl = nil, &block)
    s = self.dup
    s.sub!(pattern, repl, &block)
    s
  end

  def sub!(pattern, repl = nil)
    if block_given? and repl.nil?
      old_sub!(pattern) { yield Regexp.last_match }
    elsif repl.nil?
      old_sub!(pattern)
    else
      old_sub!(pattern, repl)
    end
  end
end</pre>

Austin_Ziegler5 · 22 August 2004 16:47

This has been resolved. This is now RCR 276.

http://rcrchive.net/rcr/RCR/RCR276

-austin


--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca

Nobuyoshi_Nakada · 22 August 2004 22:33

Hi,

At Sun, 22 Aug 2004 14:00:45 +0900,
Austin Ziegler wrote in [ruby-talk:110110]:

Proposal:
<code>String#scan</code>, <code>#sub</code>, and <code>#gsub</code>
yield MatchData objects instead of Strings. I think that this could be
achieved while breaking the least amount of code by adding a #to_str
implementation to MatchData.

#to_str doesn't solve everything. MatchData#[] returns a matched
portion for sub-patterns, whereas String#[] returns a byte at
the position.
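A short illustration of that mismatch, using a made-up pattern and string. Note the version difference: on Ruby 1.8, String#[] with an integer returned the byte as an Integer, while 1.9 and later return a one-character String, which is what the comments below assume:

```ruby
md = /(o)(b)/.match("foobar")

# MatchData#[] indexes capture groups: md[1] is the first group.
group = md[1]

# String#[] indexes positions within the matched text "ob".
# Ruby 1.8 returned the byte as an Integer here; 1.9 and later return
# a one-character String.
char = md.to_s[1]

p group  # => "o"
p char   # => "b"
```

So code that indexes the yielded value with [] would silently change meaning even if MatchData gained #to_str.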

···

--
Nobu Nakada

Robert · 22 August 2004 18:25

"Austin Ziegler" <halostatue@gmail.com> wrote in message
news:9e7db91104082209472cd80541@mail.gmail.com...

···

On Sun, 22 Aug 2004 01:00:39 -0400, Austin Ziegler <halostatue@gmail.com> wrote:
> Here is the RCR I will be submitting. There is a server error on
> rcrchive that prevents me from submitting it there.

This has been resolved. This is now RCR 276.

RCR::RCR276 - RCRchive home

-austin
--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca

Thx for including my comment. I was about to add it myself but saw it just
in time.

robert

Austin_Ziegler5 · 23 August 2004 00:09

Agreed. It also is 100% incompatible on #scan with groups in the
regexp (e.g., "foobar".scan(/(..)(.)/) will yield [["fo", "o"], ["ba",
"r"]]). This is the argument for Regexp#scan instead of modifying
String#scan. However, this is something that I believe should be
changed. An alternative is to yield both the normal values and the
match -- but that itself will be incompatible with #scan and most
current uses of #gsub and #sub that use the match value.
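For reference, a small sketch of the current behavior being discussed: with groups in the pattern, scan hands back (or yields) arrays of group strings, and the offsets are only reachable through the implicit MatchData.

```ruby
pairs = "foobar".scan(/(..)(.)/)
p pairs  # => [["fo", "o"], ["ba", "r"]]

# With a block, each call receives the groups of one match as an Array.
yielded = []
"foobar".scan(/(..)(.)/) { |groups| yielded << groups }

# The whole-match offsets are only reachable through the implicit MatchData.
offsets = []
"foobar".scan(/(..)(.)/) { offsets << Regexp.last_match.offset(0) }
p offsets  # => [[0, 3], [3, 6]]
```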

Yet another alternative is to add an optional parameter in all cases.
String#gsub currently expects a regexp and a replace pattern OR a
regexp and a block. #gsub could be modified such that when it gets a
regexp, a "boolean", and a block, it yields something different. This
could be, for example:

String#gsub(pattern, true) { |match_data| ... }
String#gsub(pattern) { |string| ... }

I would actually rather see the opposite form, if we do this:

String#gsub(pattern, true) { |string| ... }
String#gsub(pattern) { |match_data| ... }

This would encourage the use of the new form. By doing it this way, a
transition period can be introduced for this (e.g., in 1.8.3 it may
warn that the current replace will be changed to yield a match_data
instead of a string; in 1.9 it yields a match_data instead of a
string).

I have *not* analysed code out there that uses #gsub/#scan/#sub, but I
think that this is an ideal change.

-austin (I'm also adding this to the discussion on RCR276)


--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca
