[RCR] Global Regexp Match Mechanism (//g)

#scan doesn’t solve the ultimate problem – being able to backtrack and rescan from an earlier position (a la Perl’s pos() function). Scan would work “OK” if it returned an array of MatchData instead of Strings. BTW, my posted code almost works. I meant to say that I wanted to do that without doing “foostr = md.post_match”.

#scan creates an intermediate array that doesn’t help in a backtracking situation.

-a


austin ziegler

Hi,

#scan doesn’t solve the ultimate problem – being able to
backtrack and rescan from an earlier position (a la Perl’s
pos() function). Scan would work “OK” if it returned an array
of MatchData instead of Strings. BTW, my posted code almost
works. I meant to say that I wanted to do that without doing
“foostr = md.post_match”.

String#index takes an optional argument that specifies the
search position.

    pos = 0
    while foostr.index(/foo/, pos)
      puts $&
      pos = $~.end(0)
    end

The optional argument to Regexp#match that somebody proposed
would be nice too.

    pos = 0
    while md = /foo/.match(foostr, pos)
      puts md.to_s
      pos = md.end(0)
    end
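
A note on the loop above: Regexp#match in current Ruby does accept a start position as its second argument, so the idea can be tried directly. The sketch below is only an illustration and not code from this thread; the helper name each_match is made up. It collects MatchData objects rather than Strings, which is what #scan was missing, and it steps past empty matches so the loop always advances.

```ruby
# Hypothetical helper (not from the thread): yields a MatchData for every
# match of +re+ in +str+, resuming each search where the last match ended.
def each_match(re, str)
  return enum_for(:each_match, re, str) unless block_given?

  pos = 0
  while pos <= str.length && (md = re.match(str, pos))
    yield md
    # An empty match would otherwise leave pos unchanged and loop forever.
    pos = md.end(0) > md.begin(0) ? md.end(0) : md.end(0) + 1
  end
end

matches = each_match(/fo+/, "foo bar fooo").to_a
summary = matches.map { |md| [md[0], md.begin(0), md.post_match] }
# summary => [["foo", 0, " bar fooo"], ["fooo", 8, ""]]
```

Because each element is a full MatchData, begin, end, pre_match and post_match stay available for every hit, and a caller can restart from any earlier md.end(0) to rescan.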

At Fri, 13 Dec 2002 04:31:12 +0900, Austin Ziegler wrote:


Nobu Nakada

nobu.nokada@softhome.net writes:

#scan doesn’t solve the ultimate problem – being able to
backtrack and rescan from an earlier position (a la Perl’s
pos() function).

[…]

String#index takes an optional argument that specifies the
search position.

    pos = 0
    while foostr.index(/foo/, pos)
      puts $&
      pos = $~.end(0)
    end

The problem with String#index is:

(a) if ruby is not run in ASCII mode, ruby must scan the whole
    string up to 'pos' to find the correct byte offset
    (e.g. utf8_startpos() in regex.c)

(b) there is no way to anchor the regex at 'pos'

        "abcd".index(/\Abc/, 1) -> nil
        "abcd".index(/^bc/, 1)  -> nil

This makes String#index not suitable for answering simple questions
efficiently – e.g. does a given regexp match at a given offset into
the string. So String#index is not good for lexical analysis
applications.

The optional argument to Regexp#match that somebody proposed
would be nice too.

    pos = 0
    while md = /foo/.match(foostr, pos)
      puts md.to_s
      pos = md.end(0)
    end

Yes, I proposed this before I understood point (a) above. The
proposed change turns Regexp#match and String#index into almost the
same thing.

Because of these problems, an API like Perl’s pos() and \G is
desirable. For example:

(1) Have the string remember its last end-of-match position (byte
    and offset).  This fixes problem (a) above.
(2) In regexps, \G matches this position.  This fixes problem (b).
(3) Have String#gpos (or better name) set/get the end-of-match
    position based on a character index, for convenience.

This begins to look a lot like strscan, which will be part of ruby
1.8. However, because strscan is not part of String, it can not know
when the string is modified and must freeze the string before
operating on it (otherwise it risks having its byte offsets be
incorrect when the string is modified).

Freezing the string is inconvenient in my application. I examine a
string in detail before deciding whether to append more data to it
(String#<<) from a file or start a new string.
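
For readers who have not used strscan: the sketch below shows the pos-style scanning and backtracking discussed here, using StringScanner from the strscan library that ships with current Ruby. It is an illustration only, not code from this thread. Whether the scanner freezes its string depends on the strscan version, so this example deliberately never modifies the string.

```ruby
require 'strscan'

scanner = StringScanner.new("foo123 bar")

word  = scanner.scan(/[a-z]+/)  # matches only at the current scan position
mark  = scanner.pos             # remember where the digits start
num   = scanner.scan(/\d+/)

scanner.pos = mark              # backtrack, much like assigning to Perl's pos()
again = scanner.scan(/\d+/)     # rescan the same span

miss  = scanner.scan(/[a-z]+/)  # nil: the next character is a space
```

Every scan is implicitly anchored at the scan pointer, which is the behaviour problem (b) asks for.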

At Fri, 13 Dec 2002 04:31:12 +0900, Austin Ziegler wrote:


Don’t send mail to Donald_Schaefer@hole.lickey.com
The address is there for spammers to harvest.

Hi,

String#index takes an optional argument that specifies the
search position.

    pos = 0
    while foostr.index(/foo/, pos)
      puts $&
      pos = $~.end(0)
    end

The problem with String#index is:

(a) if ruby is not run in ASCII mode, ruby must scan the whole
    string up to 'pos' to find the correct byte offset
    (e.g. utf8_startpos() in regex.c)

I know, and it has improved in 1.7.

(b) there is no way to anchor the regex at 'pos'

        "abcd".index(/\Abc/, 1) -> nil
        "abcd".index(/^bc/, 1)  -> nil

        "abcd".index(/\Gbc/, 1) -> 1
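
To make the \G answer concrete, here is a minimal sketch of the "does this regexp match at this offset" question from earlier in the thread. The helper name match_at? is hypothetical and not part of Ruby or of this thread; it simply prefixes \G to the pattern and asks String#index to start at the given character position.

```ruby
# Hypothetical helper: true if +re+ matches exactly at character offset +pos+.
def match_at?(str, re, pos)
  # \G anchors the match at the position where the search starts.
  anchored = Regexp.new("\\G(?:#{re.source})", re.options)
  str.index(anchored, pos) == pos
end

at_one  = match_at?("abcd", /bc/, 1)   # => true
at_zero = match_at?("abcd", /bc/, 0)   # => false

# The anchors from problem (b) still refer to the string or line start:
with_A     = "abcd".index(/\Abc/, 1)   # => nil
with_caret = "abcd".index(/^bc/, 1)    # => nil
```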

Because of these problems, an API like Perl’s pos() and \G is
desirable. For example:

(1) Have the string remember its last end-of-match position (byte
    and offset).  This fixes problem (a) above.

But that is not thread-safe.

(2) In regexps, \G matches this position.  This fixes problem (b).

It already does.

(3) Have String#gpos (or better name) set/get the end-of-match
    position based on a character index, for convenience.

It doesn’t seem a good interface to me.

This begins to look a lot like strscan, which will be part of ruby
1.8. However, because strscan is not part of String, it can not know
when the string is modified and must freeze the string before
operating on it (otherwise it risks having its byte offsets be
incorrect when the string is modified).

Freezing the string is inconvenient in my application. I examine a
string in detail before deciding whether to append more data to it
(String#<<) from a file or start a new string.

Modification of target string will cause the character boundary
issue even with your String#gpos. It must be recalculated.

At Fri, 13 Dec 2002 14:13:07 +0900, Matt Armstrong wrote:


Nobu Nakada

nobu.nokada@softhome.net writes:

(b) there is no way to anchor the regex at 'pos'

        "abcd".index(/\Abc/, 1) -> nil
        "abcd".index(/^bc/, 1)  -> nil

        "abcd".index(/\Gbc/, 1) -> 1

Great! I didn’t know you could do that.

Freezing the string is inconvenient in my application. I examine a
string in detail before deciding whether to append more data to it
(String#<<) from a file or start a new string.

Modification of target string will cause the character boundary
issue even with your String#gpos. It must be recalculated.

It is possible to be fancy and avoid recalculation of byte offset B of
character position C. E.g. if you insert 10 bytes before B, increment
B by 10.
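
A rough sketch of that bookkeeping, purely as an illustration: the class TrackedCursor and its methods are hypothetical and exist nowhere in Ruby. It remembers the byte offset of a character position and shifts it when text is inserted strictly before that offset, instead of rescanning the string.

```ruby
# Hypothetical illustration only: keeps a byte offset valid across inserts.
class TrackedCursor
  attr_reader :byte_offset

  def initialize(str, char_pos)
    @str = str
    # Computed once by scanning the characters before char_pos.
    @byte_offset = str[0, char_pos].bytesize
  end

  # Insert +text+ at character index +char_index+ and adjust the offset.
  def insert(char_index, text)
    insert_byte = @str[0, char_index].bytesize
    @str.insert(char_index, text)
    # Only text inserted before the remembered offset moves it.
    @byte_offset += text.bytesize if insert_byte < @byte_offset
  end
end

str = "h\u00e9llo w\u00f6rld".dup
cursor = TrackedCursor.new(str, 6)    # character 6 is "w"
before = cursor.byte_offset           # => 7, since "é" takes two bytes
cursor.insert(0, "\u00a1\u00a1")      # four bytes inserted in front
after = cursor.byte_offset            # => 11
char_at_cursor = str.byteslice(after, 1)
```

Note that every modifying operation has to go through code like insert for the offset to stay correct.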

But your point about thread safety is good. String#index together
with \G is a good solution. Though Regexp#match with a character
offset may be handy too, since it returns the MatchData.
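
As an illustration of the lexical-analysis use case, here is a small tokenizer sketch. It assumes a Ruby in which Regexp#match accepts a start offset, as current Ruby does; the token table and the tokenize method are made up for this example. Each pattern starts with \G so it can only match at the current position, and the returned MatchData supplies both the token text and the next position.

```ruby
# Hypothetical token table; every pattern is anchored with \G.
TOKEN_PATTERNS = {
  number: /\G\d+/,
  ident:  /\G[a-z_]\w*/i,
  op:     /\G[-+*\/=]/,
  space:  /\G\s+/
}.freeze

def tokenize(src)
  tokens = []
  pos = 0
  while pos < src.length
    type = md = nil
    TOKEN_PATTERNS.each do |name, re|
      md = re.match(src, pos)   # match attempted only at pos because of \G
      if md
        type = name
        break
      end
    end
    raise ArgumentError, "unexpected character at offset #{pos}" unless md

    tokens << [type, md[0]] unless type == :space
    pos = md.end(0)             # continue where this token ended
  end
  tokens
end

tokens = tokenize("x = 42 + y1")
# tokens => [[:ident, "x"], [:op, "="], [:number, "42"], [:op, "+"], [:ident, "y1"]]
```

No substring such as md.post_match is ever created; the original string is scanned in place.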

Hi,

Freezing the string is inconvenient in my application. I examine a
string in detail before deciding whether to append more data to it
(String#<<) from a file or start a new string.

Modification of target string will cause the character boundary
issue even with your String#gpos. It must be recalculated.

It is possible to be fancy and avoid recalculation of byte offset B of
character position C. E.g. if you insert 10 bytes before B, increment
B by 10.

It means all destructive String methods must pay that cost. It
depends on how often index is called compared with the other
methods, but I doubt that it is acceptable.

But your point about thread safety is good. String#index together
with \G is a good solution. Though Regexp#match with a character
offset may be handy too, since it returns the MatchData.

I also want this extension.

Index: re.c
RCS file: /cvs/ruby/src/ruby/re.c,v
retrieving revision 1.86
diff -u -2 -p -r1.86 re.c
--- re.c 12 Dec 2002 09:17:32 -0000 1.86
+++ re.c 12 Dec 2002 09:31:57 -0000
@@ -1130,10 +1130,18 @@ rb_reg_match2(re)
 
 static VALUE
-rb_reg_match_m(re, str)
-    VALUE re, str;
+rb_reg_match_m(argc, argv, re)
+    int argc;
+    VALUE *argv;
+    VALUE re;
 {
-    VALUE result = rb_reg_match(re, str);
+    VALUE str, initpos, result;
+    long pos = 0;
 
-    if (NIL_P(result)) return Qnil;
+    if (rb_scan_args(argc, argv, "11", &str, &initpos) == 2) {
+        pos = NUM2LONG(initpos);
+    }
+    if (NIL_P(str)) return Qnil;
+    StringValue(str);
+    if (rb_reg_search(re, str, pos, 0) < 0) return Qnil;
     result = rb_backref_get();
     rb_match_busy(result);
@@ -1586,5 +1683,5 @@ Init_Regexp()
     rb_define_method(rb_cRegexp, "===", rb_reg_match, 1);
     rb_define_method(rb_cRegexp, "~", rb_reg_match2, 0);
-    rb_define_method(rb_cRegexp, "match", rb_reg_match_m, 1);
+    rb_define_method(rb_cRegexp, "match", rb_reg_match_m, -1);
     rb_define_method(rb_cRegexp, "to_s", rb_reg_to_s, 0);
     rb_define_method(rb_cRegexp, "inspect", rb_reg_inspect, 0);

At Sat, 14 Dec 2002 01:25:21 +0900, Matt Armstrong wrote:


Nobu Nakada

nobu.nokada@softhome.net writes:

Though Regexp#match with a character offset may be handy too, since
it returns the MatchData.

I also want this extension.

Index: re.c
RCS file: /cvs/ruby/src/ruby/re.c,v
retrieving revision 1.86
diff -u -2 -p -r1.86 re.c
--- re.c 12 Dec 2002 09:17:32 -0000 1.86
+++ re.c 12 Dec 2002 09:31:57 -0000
@@ -1130,10 +1130,18 @@ rb_reg_match2(re)
 
 static VALUE
-rb_reg_match_m(re, str)
-    VALUE re, str;
+rb_reg_match_m(argc, argv, re)

I applied this to the CVS version of Ruby and like it. I have some
code that could be more efficient with this capability.

Please commit it.

:-)

Or should an official RCR be filed?

Hi,

In message "Re: [RCR] Global Regexp Match Mechanism (//g)" on 02/12/15, Matt Armstrong <matt@lickey.com> writes:

-rb_reg_match_m(re, str)
-    VALUE re, str;
+rb_reg_match_m(argc, argv, re)

I applied this to the CVS version of Ruby and like it. I have some
code that could be more efficient with this capability.

Please commit it.

:-)

Or should an official RCR be filed?

It’s not ignored. But I need to think about it twice before changing
anything. It’s a lesson learned from 9 years of Ruby development.
Discussions are welcome.

						matz.