Multiple regexp matches

I want to get multiple results of a regexp pattern match, offsets included.
The following code gets the proper results, but does not return offsets:

  str = '<span id="1"> <span>...</span> </span>'
  re = Regexp.new('(<(\/?)span>)', true) # match start or end tag
  print str.scan(re).inspect

The Regexp module will return offsets, but Regexp::match only returns the
first match, so I'm not sure how to get the full list of matches?

Instead of using str.scan(re) use:

re.match( str );

which returns a MatchData object. You can use the MatchData object's offset method to find your results, I believe.

Zach

Kevin Howe wrote:

···


"Zach Dennis" <zdennis@mktec.com> wrote in message
news:4122539A.1000900@mktec.com...

Instead of using str.scan(re) use:

re.match( str );

which returns a MatchData object. You can use the MatchData object's
offset method to find your results.....i do believe.

Yes that's true, but if you read the second part of my message, I'd already
tried this:

···

The Regexp module will return offsets, but Regexp::match only returns the
first match, so I'm not sure how to get the full list of matches?

According to rdoc you are mistaken.

I also think you are mistaken:

#!/usr/bin/ruby
t = "This is my 1 text"

re = /([^\s]*\s).*(\d)(\s.)/
md = re.match( t );
puts md.offset(0);
puts ""
puts md.offset( 1 );
puts ""
puts md.offset( 2 );
puts ""
puts md.offset( 3 );

It returns the correct offsets of the matches: offset(0) covers the whole match, offset(1) the first subexpression, offset(2) the second subexpression, and so on. It works.
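
For reference, here is a small sketch of what that script reports, with the positions worked out by hand from the pattern. The variable names (whole, first, second, third) are only for illustration.

```ruby
t  = "This is my 1 text"
re = /([^\s]*\s).*(\d)(\s.)/
md = re.match(t)

# offset(n) returns the [start, end] positions of group n in the
# original string; the end position is exclusive. Group 0 is the
# whole match.
whole  = md.offset(0)  # => [0, 14]   "This is my 1 t"
first  = md.offset(1)  # => [0, 5]    "This "
second = md.offset(2)  # => [11, 12]  "1"
third  = md.offset(3)  # => [12, 14]  " t"

p whole, first, second, third
```

Note that this still describes a single match; it says nothing about later matches in the same string.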

Zach

Kevin Howe wrote:

···

Hi --

The problem is that Kevin wanted to scan a string more than once with
the same regex:

  str = "abc abc abc"
  re = /(\w+)/ # not /(\w+) (\w+) (\w+)/

re will scan against str three times. The difficulty is getting hold
of the offsets of all the matches from all three times, in relation to
the total length of the string.
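
To make the "in relation to the total length" part concrete, here is a tiny sketch, assuming the search is restarted on a slice of the string. The starting index 4 is just an example value.

```ruby
str   = "abc abc abc"
re    = /(\w+)/
first = 4                        # index in str where the slice starts

m = re.match(str[first..-1])     # matches "abc" at the start of "abc abc"
relative = m.offset(1)           # positions within the slice
absolute = relative.map { |pos| pos + first }  # positions within str

p relative, absolute
```

The offsets reported by the MatchData are relative to the slice, so each one has to be shifted by the slice's starting index.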

Someone will probably post a simple or elegant solution; in the
meantime, here's mine:

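  def find_offsets(str,re)
    offsets = []
    first = 0

    loop do
      break unless m = re.match(str[first..-1])
      break if m.captures.empty?
      m.captures.each_with_index do |c,i|
        of = m.offset(i+1)
        # offsets are relative to the slice, so shift them by first
        res = [c, [of[0]+first, of[1]+first ]]
        yield res if block_given?
        offsets << res
      end
      # resume after the end of the whole match; always advance by at
      # least one character so that /(\w+)/ on "abc abc abc" terminates
      first += [m.end(0), 1].max
    end

    offsets
  end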

  # Little test:

  str = '<span id="1"> <span>...</span> </span>'
  re = /(<(\/?)span>)/i

  puts str
  (str.size/9).times { print "0123456789" }
  puts; puts

  find_offsets(str,re).each do |capture, (start, stop)|
    puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
  end

  # Output:
  <span id="1"> <span>...</span> </span>
  0123456789012345678901234567890123456789

  "<span>" starts at 14, ends at 20
  "" starts at 15, ends at 15
  "</span>" starts at 23, ends at 30
  "/" starts at 24, ends at 25
  "</span>" starts at 31, ends at 38
  "/" starts at 32, ends at 33

David

···

On Wed, 18 Aug 2004, Zach Dennis wrote:

--
David A. Black
dblack@wobblini.net

This should do what you want.

-austin

str = '<span id="1"> <span>...</span> </span>'
re = /(<(\/?)span>)/i

str.scan(re) # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end
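
The same collection step can be written more compactly. This is only a sketch of a common idiom, not something taken from the posts above: to_enum(:scan, re) wraps the block form of scan in an Enumerator, and map then picks up the MatchData that scan sets on each iteration.

```ruby
str = '<span id="1"> <span>...</span> </span>'
re  = /(<(\/?)span>)/i

# Each iteration of scan sets Regexp.last_match for the current match,
# so mapping over the enumerator collects one MatchData per match.
matches = str.to_enum(:scan, re).map { Regexp.last_match }
offsets = matches.map { |m| m.offset(0) }

p offsets
```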

Ah...thanks for the clarification David. I was mistaken.

Sorry for the confusion Kevin.

Zach

David A. Black wrote:

···


Awesome, that works great, thank you. I have to wonder why Ruby doesn't have
this built in; it's simple enough to add a method that yields each match as a
MatchData object, as follows:

class MultiRegexp < Regexp
    def matches(str)
        str.scan(self) do
          yield Regexp.last_match
        end
    end
end

str = '<span id="1"> <span>...</span> </span>'
re = MultiRegexp.new('(<(\/?)span>)', true)
re.matches(str) { |i|
    capture = i.captures[0]
    start,stop = i.offset(0)
    puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
}
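
If an actual list of MatchData objects is wanted rather than a block call per match, a small variation on the class above can return an Enumerator when no block is given. This is a sketch under that assumption; the enum_for line is the only addition, and Regexp::IGNORECASE stands in for the true flag used above.

```ruby
class MultiRegexp < Regexp
  # Yields a MatchData for every match in str. Without a block,
  # returns an Enumerator so the matches can be collected with to_a.
  def matches(str)
    return enum_for(:matches, str) unless block_given?
    str.scan(self) { yield Regexp.last_match }
  end
end

str  = '<span id="1"> <span>...</span> </span>'
re   = MultiRegexp.new('(<(\/?)span>)', Regexp::IGNORECASE)
list = re.matches(str).to_a

p list.map { |m| [m[1], m.offset(0)] }
```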

An even nicer alternative would be to add a Regexp::MULTIMATCH constant:

str = '<span id="1"> <span>...</span> </span>'
re = Regexp.new('(<(\/?)span>)', Regexp::MULTIMATCH)
matches = re.match(str)

Just a thought :)

"Zach Dennis" <zdennis@mktec.com> wrote in message
news:41226FE4.4060108@mktec.com...

···


"Austin Ziegler" <halostatue@gmail.com> wrote in message
news:9e7db91104081713254f2eb39e@mail.gmail.com...

This should do what you want.

-austin

str = '<span id="1"> <span>...</span> </span>'
re = /(<(\/?)span>)/i

str.scan(re) # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end

While that works, isn't it ridiculous that one has to resort to a class
method ("Regexp.last_match")? I mean, there should rather be something like

/o/.each( "foo" ) do |md|
  # md is MatchData
end

Or even

/o/.matcher( "foo" ).each do |md|
  # md is MatchData
end

That way Matcher could implement Enumerable.
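
A rough sketch of what such a Matcher could look like. The class is hypothetical (nothing with this name exists in core Ruby), and it assumes a Ruby version where Regexp#match accepts a start position as its second argument.

```ruby
# Hypothetical Matcher class, for illustration only.
class Matcher
  include Enumerable

  def initialize(regexp, string)
    @regexp = regexp
    @string = string
  end

  # Yields each successive MatchData. Because the search is restarted
  # with a position argument, offsets stay relative to the whole string.
  def each
    return enum_for(:each) unless block_given?
    pos = 0
    while pos <= @string.length && (md = @regexp.match(@string, pos))
      yield md
      # Step past empty matches so the loop always advances.
      pos = md.end(0) > md.begin(0) ? md.end(0) : md.end(0) + 1
    end
  end
end

positions = Matcher.new(/o/, "foo").map { |md| md.begin(0) }
p positions
```

Including Enumerable gives map, select, first and friends for free, which is the main attraction of the explicit object over a block-only method.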

Sounds like a candidate for an RCR. Any comments?

    robert

What about $~ ?

bash-2.05b$ ruby a.rb
[[0, 13], [14, 20], [23, 30], [31, 38]]
bash-2.05b$ expand -t2 a.rb
str = '<span id="1"> <span>...</span> </span>'
re = Regexp.new('(<(\/?)span[^\n/]*?>)', true) # match start or end tag
positions = []
str.scan(re) do
  positions << [$~.begin(0), $~.end(0)]
end
p positions
bash-2.05b$

···

On Wednesday 18 August 2004 12:31, Robert Klemme wrote:

--
Simon Strandgaard

There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings. There are probably others, but those
are the ones that come to mind. This *will* break some code,
unfortunately, but that can be mitigated by adding #to_str. IMO,
this will make #gsub much easier to deal with, as you won't have to
resort to either Regexp.last_match or $[0-9] variables to be able to
work with captures. My Regexp.last_match call only presumes that
Regexp.last_match is actually threadsafe, whereas we know that the
ugly Perlish $ variables are threadsafe. I think this is an
acceptable level of incompatibility because of the use of #to_str
and the amount of flexibility that would be gained. As far as I
know, it wouldn't require *that* big a change, because for
Regexp.last_match to work, there must still be a MatchData object
*somewhere*.
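
To illustrate the current situation being described: a minimal sketch of a #gsub block that needs the captures. The sample string and variable names are made up for the example.

```ruby
name = "John Smith"

# The block parameter is only the matched text (a String), so the
# captures have to be fetched from Regexp.last_match (or $1, $2, ...).
swapped = name.gsub(/(\w+) (\w+)/) do |text|
  md = Regexp.last_match
  "#{md[2]}, #{md[1]}"
end

p swapped
```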

What do you think?

-austin

···

On Wed, 18 Aug 2004 19:31:01 +0900, Robert Klemme <bob.news@gmx.net> wrote:

--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca

"Simon Strandgaard" <neoneye@adslhome.dk> wrote in message
news:200408181437.58446.neoneye@adslhome.dk...

What about $~ ?

bash-2.05b$ ruby a.rb
[[0, 13], [14, 20], [23, 30], [31, 38]]
bash-2.05b$ expand -t2 a.rb
str = '<span id="1"> <span>...</span> </span>'
re = Regexp.new('(<(\/?)span[^\n/]*?>)', true) # match start or end tag
positions = []
str.scan(re) do
  positions << [$~.begin(0), $~.end(0)]
end
p positions
bash-2.05b$

This has the same problem, only that in this case you don't use a class
method but a global variable. Neither of them is connected in any way to
the regexp you use other than through a hidden side effect of the matching
process. I'd prefer a more explicit connection, similar to the one I suggested.
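
A small sketch of why the implicit connection can be fragile: any other match performed inside the block replaces $~, so the MatchData for the scan has to be captured before anything else runs. The strings here are invented for the example.

```ruby
seen = []
"a1 b2".scan(/[a-z]\d/) do
  before = $~[0]   # match data set implicitly by scan
  "zzz" =~ /z/     # an unrelated match inside the block replaces $~
  after = $~[0]
  seen << [before, after]
end

p seen
```

The scan itself still continues correctly, but code in the block that reads $~ (or Regexp.last_match) after the unrelated match no longer sees the scan's match.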

Kind regards

    robert

[snip]

Agreed, this would be nice. I think I saw an RCR about it a long time
ago (but I cannot locate it).

By the way, my Ruby regexp engine does this: it yields MatchData instead of a String.
http://raa.ruby-lang.org/project/regexp/

···

On Wednesday 18 August 2004 16:55, Austin Ziegler wrote:

There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings.

--
Simon Strandgaard

Austin Ziegler wrote:

There's a simple solution, and I'll probably open an RCR about this
if others agree with it. String#scan, #sub, and #gsub should yield
MatchData objects, not Strings.

I agree with this, and it seems that matz hasn't done it yet only because of backwards compatibility.

I'm referring to this posting of his:

http://groups.google.com/groups?selm=1061229894.060091.14659.nullmailer%40picachu.netlab.jp

What do you think?

I strongly agree with this. It's the way it should have been from the beginning. #to_str sounds like a way that shouldn't break too much code, and Ruby could issue a migration warning when it is called.

Rite was said to sacrifice compatibility for the sake of more elegance, so now might be a good time for switching.

Regards,
Florian Gross

"Austin Ziegler" <halostatue@gmail.com> wrote in message
news:9e7db911040818075512cd5a01@mail.gmail.com...

I like the functionality very much, but I'd prefer to *not* change the
behavior of String#scan, #sub, and #gsub. I'd rather have Regexp#scan(str,
&block), Regexp#sub(str, replace=nil, &block) and Regexp#gsub(str,
replace=nil, &block) that yield MatchData if there is a block. There might
be other names, but since the behavior is quite similar to those methods in
String these names are probably good. The only drawback I can see is that
they might cause confusion ("Which were the ones that yielded MatchData?"),
but IMHO people can cope with this, especially since old behavior does not
change. (Personally I would find it easy to remember that Regexp <->
MatchData and String <-> String or Array of String.)
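
A sketch of how that could behave, written as a standalone helper module so that no core class is changed. The module name MatchScan and its method signatures are assumptions made for illustration; they are not the proposed Regexp methods themselves.

```ruby
# Hypothetical helper module standing in for the proposed Regexp methods.
module MatchScan
  # Like String#scan with a block, but yields MatchData objects.
  # Without a block, returns the MatchData objects in an array.
  def self.scan(regexp, str)
    found = []
    str.scan(regexp) { found << Regexp.last_match }
    return found unless block_given?
    found.each { |md| yield md }
    str
  end

  # Like String#gsub with a block, but yields MatchData objects.
  def self.gsub(regexp, str)
    str.gsub(regexp) { yield Regexp.last_match }
  end
end

swapped = MatchScan.gsub(/(\w)(\d)/, "a1 b2") { |md| md[2] + md[1] }
starts  = MatchScan.scan(/\w\d/, "a1 b2").map { |md| md.begin(0) }

p swapped, starts
```

String#scan and String#gsub keep their existing behavior here, which is the point of the suggestion.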

Kind regards

    robert

···

On Wed, 18 Aug 2004 19:31:01 +0900, Robert Klemme <bob.news@gmx.net> wrote:

Here is the RCR I will be submitting. There is a server error on
rcrchive that prevents me from submitting it there.

Make String#scan, #gsub, and #sub yield MatchData objects
backwards compatibility [x]

Abstract:
A "least-break" change to <code>String#scan</code>,
<code>#gsub</code>, and <code>#sub</code> to provide the MatchData to
attached code blocks.

Problem:
<code>String#scan</code>, <code>#gsub</code>, and <code>#sub</code>
yield the string value of the matched regular expression to a provided
block, which is of very limited value. Currently, we must rely upon
either ugly numeric match variables (<code>$1</code> - <code>$9</code>,
etc.) or a class method (<code>Regexp.last_match</code>) to
obtain the match.

<pre>str = '<span id="1"> <span>...</span> </span>'
re = /(<(\/?)span>)/i

str.scan(re)
  # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end</pre>

Proposal:
<code>String#scan</code>, <code>#sub</code>, and <code>#gsub</code>
yield MatchData objects instead of Strings. I think that this could be
achieved while breaking the least amount of code by adding a #to_str
implementation to MatchData.

Analysis:
I have written code as noted in the problem section; it feels
unnecessarily complex and fragile. This change will work in all cases
where a single string is provided; it will require a change to code
that deals with array values (e.g., when String#scan is given a pattern
with groups, because of the use of rb_reg_nth_match in scan_once);
switching to the use of MatchData#captures by the developers will work
just fine.

Implementation:
I *think* that the changes look something like this:
<pre>
--- re.c.old 2004-08-22 00:24:09 Eastern Daylight Time
+++ re.c 2004-08-22 00:18:50 Eastern Daylight Time

@@ -2320,6 +2320,7 @@
     rb_define_method(rb_cMatch, "pre_match", rb_reg_match_pre, 0);
     rb_define_method(rb_cMatch, "post_match", rb_reg_match_post, 0);
     rb_define_method(rb_cMatch, "to_s", match_to_s, 0);
+ rb_define_method(rb_cMatch, "to_str", match_to_s, 0);
     rb_define_method(rb_cMatch, "inspect", rb_any_to_s, 0); /* in object.c */
     rb_define_method(rb_cMatch, "string", match_string, 0);
}

--- string.c.old 2004-08-22 00:24:10 Eastern Daylight Time
+++ string.c 2004-08-22 00:20:35 Eastern Daylight Time

@@ -1928,7 +1928,7 @@

        if (iter) {
            rb_match_busy(match);
- repl = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
+ repl = rb_obj_as_string(rb_yield(match));
            rb_backref_set(match);
        }
        else {
@@ -2043,7 +2043,7 @@
        regs = RMATCH(match)->regs;
        if (iter) {
            rb_match_busy(match);
- val = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
+ val = rb_obj_as_string(rb_yield(match));
            rb_backref_set(match);
        }
        else {
@@ -4164,15 +4164,7 @@
        else {
            *start = END(0);
        }
- if (regs->num_regs == 1) {
- return rb_reg_nth_match(0, match);
- }
- result = rb_ary_new2(regs->num_regs);
- for (i=1; i < regs->num_regs; i++) {
- rb_ary_push(result, rb_reg_nth_match(i, match));
- }
-
- return result;
+ return match;
     }
     return Qnil;
}
</pre>

I'm not 100% sure that this is right, and I haven't tested it. The
equivalent Ruby code would be (note: this code appears to work, but
it does cause problems with irb):

<pre>class MatchData
  def to_str
    self.to_s
  end
end

class String
  alias_method :old_scan, :scan
  alias_method :old_gsub!, :gsub!
  alias_method :old_sub!, :sub!

  def scan(pattern)
    if block_given?
      old_scan(pattern) { yield Regexp.last_match }
    else
      old_scan(pattern)
    end
  end

  def gsub(pattern, repl = nil, &block)
    s = self.dup
    s.gsub!(pattern, repl, &block)
    s
  end

  def gsub!(pattern, repl = nil)
    if block_given? and repl.nil?
      old_gsub!(pattern) { yield Regexp.last_match }
    elsif repl.nil?
      old_gsub!(pattern)
    else
      old_gsub!(pattern, repl)
    end
  end

  def sub(pattern, repl = nil, &block)
    s = self.dup
    s.sub!(pattern, repl, &block)
    s
  end

  def sub!(pattern, repl = nil)
    if block_given? and repl.nil?
      old_sub!(pattern) { yield Regexp.last_match }
    elsif repl.nil?
      old_sub!(pattern)
    else
      old_sub!(pattern, repl)
    end
  end
end</pre>

This has been resolved. This is now RCR 276.

http://rcrchive.net/rcr/RCR/RCR276

-austin

···

On Sun, 22 Aug 2004 01:00:39 -0400, Austin Ziegler <halostatue@gmail.com> wrote:

Here is the RCR I will be submitting. There is a server error on
rcrchive that prevents me from submitting it there.

--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca

Hi,

At Sun, 22 Aug 2004 14:00:45 +0900,
Austin Ziegler wrote in [ruby-talk:110110]:

Proposal:
<code>String#scan</code>, <code>#sub</code>, and <code>#gsub</code>
yield MatchData objects instead of Strings. I think that this could be
achieved while breaking the least amount of code by adding a #to_str
implementation to MatchData.

#to_str doesn't solve everything. MatchData#[] returns a matched
portion for sub-patterns, whereas String#[] returns a byte at
the position.
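
A short sketch of that difference. Note one assumption about versions: in current Ruby, String#[] with an integer returns a one-character string rather than a byte value, but the indexing mismatch is the same.

```ruby
md = /(b)(c)/.match("abcd")

by_group = md[1]        # first capture group of the match
by_index = md.to_s[1]   # second character of the matched text "bc"

p by_group, by_index
```

So code that indexes the yielded value with [] would silently change meaning if it received a MatchData instead of a String, even with #to_str defined.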

···

--
Nobu Nakada

"Austin Ziegler" <halostatue@gmail.com> wrote in message
news:9e7db91104082209472cd80541@mail.gmail.com...


Thanks for including my comment. I was about to add it myself but saw it just
in time. :)

    robert

Agreed. It is also 100% incompatible for #scan with groups in the
regexp (e.g., "foobar".scan(/(..)(.)/) will yield [["fo", "o"], ["ba",
"r"]]). This is the argument for Regexp#scan instead of modifying
String#scan. However, this is something that I believe should be
changed. An alternative is to yield both the normal values and the
match -- but that itself will be incompatible with #scan and most
current uses of #gsub and #sub that use the match value.
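
For illustration, a minimal sketch of the two shapes #scan produces today; the variable names are invented for the example.

```ruby
plain   = "foobar".scan(/.../)       # no groups: array of strings
grouped = "foobar".scan(/(..)(.)/)   # groups: array of capture arrays

# With groups, the block also receives an array per match.
yielded = []
"foobar".scan(/(..)(.)/) { |pair| yielded << pair.class }

p plain, grouped, yielded
```

Existing code written against the grouped form expects arrays, which is why yielding a MatchData there cannot be papered over with #to_str.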

Yet another alternative is to add an optional parameter in all cases.
String#gsub currently expects a regexp and a replace pattern OR a
regexp and a block. #gsub could be modified such that when it gets a
regexp, a "boolean", and a block, it yields something different. This
could be, for example:

  String#gsub(pattern, true) { |match_data| ... }
  String#gsub(pattern) { |string| ... }

I would actually rather see the opposite form, if we do this:

  String#gsub(pattern, true) { |string| ... }
  String#gsub(pattern) { |match_data| ... }

This would encourage the use of the new form. By doing it this way, a
transition period can be introduced for this (e.g., in 1.8.3 it may
warn that the current replace will be changed to yield a match_data
instead of a string; in 1.9 it yields a match_data instead of a
string).

I have *not* analysed code out there that uses #gsub/#scan/#sub, but I
think that this is an ideal change.

-austin (I'm also adding this to the discussion on RCR276)

···

On Mon, 23 Aug 2004 07:33:18 +0900, nobu.nokada@softhome.net <nobu.nokada@softhome.net> wrote:


--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca