No way of looking for a regexp match starting from a particular point in a string?

Kenneth_McDonald · 3 June 2007 03:59

I'm probably just missing something obvious, but I haven't found a way to match a regular expression against only part of a string, in particular only past a certain point of a string, as a way of finding successive matches. Of course, one could do a match against a string, take the substring past that match and do a match against the substring, and so on, to find all of the matches for the string, but that could be very expensive for very large strings.

I'm aware of the String.scan method, but that doesn't work for me because it doesn't return MatchData instances.

What I want is just something like regexp.match(string, n), where the regexp starts looking for a match at or after position n in the string.

Thanks,
Ken

Nobuyoshi_Nakada1 · 3 June 2007 04:46

Hi,

At Sun, 3 Jun 2007 12:59:24 +0900,
Kenneth McDonald wrote in [ruby-talk:254054]:

What I want is just something like regexp.match(string, n), where the
regexp starts looking for a match at or after position n in the string.

string.index(regexp, n)

···

--
Nobu Nakada

Harry3 · 3 June 2007 04:46

You could match the string but ignore the first part of the match.

str = "abcdefghabcehjjjuabcfjkiabcgdfg"
str =~ /(abc.)/
p $1 # abcd
str =~ /a.*ju(abc.)/
p $1 #abcf

Harry

···

On 6/3/07, Kenneth McDonald <kenneth.m.mcdonald@sbcglobal.net> wrote:

What I want is just something like regexp.match(string, n), where the
regexp starts looking for a match at or after position n in the string.

Thanks,
Ken

--

A Look into Japanese Ruby List in English

Patrick_Hurley1 · 3 June 2007 04:48

I don't know of anything obvious, but I would probably do something a
little more like:

class String
  def match_each(exp)
    str = self
    while md = str.match(exp)
      yield md
      str = md.post_match
    end
  end
end

foo = "foo bar foo bar foo"
foo.match_each(/[oa][or]/) do |md|
  puts "Found: #{md}"
end

# pth
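
One thing to keep in mind with the post_match loop above: each MatchData is relative to the remaining substring, so md.begin(0) is not a position in the original string. A minimal sketch that tracks a running offset follows; the helper name each_match_with_offset is made up for illustration and is not part of Ruby or of the original post.

```ruby
# Hypothetical helper, not from the original post: the same post_match loop,
# but it also yields the absolute position of each match in the full string.
def each_match_with_offset(str, exp)
  offset = 0
  rest = str
  while (md = rest.match(exp))
    yield md, offset + md.begin(0)
    consumed = md.end(0)
    break if consumed.zero?   # stop if the match consumed nothing from the remainder
    offset += consumed
    rest = md.post_match
  end
end

positions = []
each_match_with_offset("foo bar foo bar foo", /[oa][or]/) do |md, pos|
  positions << [md[0], pos]
end
p positions   # [["oo", 1], ["ar", 5], ["oo", 9], ["ar", 13], ["oo", 17]]
```

This still builds a new substring per match, which is the cost Ken was worried about for very large strings.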


Edwin_Fine · 3 June 2007 05:02

Kenneth McDonald wrote:

I'm probably just missing something obvious, but I haven't found a way
to match a regular expression against only part of a string, in
particular only past a certain point of a string, as a way of finding
successive matches. Of course, one could do a match against a string,
take the substring past that match and do a match against the substring,
and so on, to find all of the matches for the string, but that could be
very expensive for very large strings.

I'm aware of the String.scan method, but that doesn't work for me
because it doesn't return MatchData instances.

What I want is just something like regexp.match(string, n), where the
regexp starts looking for a match at or after position n in the string.

Thanks,
Ken

How about this?

def match(s, re, n)
  /(?:.{#{n}})(#{re})/.match(s)
end

irb(main):043:0> p s
"abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh "
irb(main):044:0> p match(s, /abd/, 10).begin(1)
16
irb(main):045:0> p match(s, /abd/, 20).begin(1)
24
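
A caveat with the .{n} skip: by default "." does not match a newline, so the trick can fail on multi-line strings. A possible variant with the /m flag is sketched below; the name match_from and the sample string are assumptions for illustration only.

```ruby
# Hypothetical variant of the helper above. The /m flag lets "." match
# newlines, so the n-character skip also works on multi-line strings.
def match_from(s, re, n)
  /.{#{n}}(#{re})/m.match(s)
end

s = "abd\nabd\nabd\n"
md = match_from(s, /abd/, 2)
p md.begin(1)   # 4
p md[1]         # "abd"
```

The interpolated regexp keeps its own flags, so the outer /m only affects the skip part.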

···

--
Posted via http://www.ruby-forum.com/.

Logan_Capaldo · 3 June 2007 12:42

I'm probably just missing something obvious, but I haven't found a way
to match a regular expression against only part of a string, in
particular only past a certain point of a string, as a way of finding
successive matches. Of course, one could do a match against a string,
take the substring past that match and do a match against the substring,
and so on, to find all of the matches for the string, but that could be
very expensive for very large strings.

I'm aware of the String.scan method, but that doesn't work for me
because it doesn't return MatchData instances.

What I want is just something like regexp.match(string, n), where the
regexp starts looking for a match at or after position n in the string.

require 'strscan'
scanner = StringScanner.new(string)
scanner.pos = n
if scanner.scan(regexp)
  p scanner[1]
  p scanner.matched
  p scanner.pos
end

It's in the stdlib. (Note: it doesn't actually give you a MatchData, or
set $~, but off the top of my head I can't think of anything that a
MatchData can do that the StringScanner can't.)
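
Note that StringScanner#scan only matches at the current position; to search forward from pos, scan_until is the usual choice. A small sketch, using Harry's sample string from earlier in the thread as made-up input:

```ruby
require 'strscan'

string  = "abcdefghabcehjjjuabcfjkiabcgdfg"   # sample data borrowed from the thread
scanner = StringScanner.new(string)
scanner.pos = 10

anchored = scanner.scan(/abc(.)/)         # nil: "abc" does not start exactly at index 10
found    = scanner.scan_until(/abc(.)/)   # searches forward from pos

p anchored          # nil
p found             # "cehjjjuabcf" (everything from pos through the end of the match)
p scanner.matched   # "abcf"
p scanner[1]        # "f"
p scanner.pos       # 21
```

A failed scan leaves pos unchanged, which is why the scan_until call still starts from index 10.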

···

On Sun, Jun 03, 2007 at 12:59:24PM +0900, Kenneth McDonald wrote:

Thanks,
Ken

Forum · 4 June 2007 10:19

Hmm, apart from using #scan and #index with $~ as indicated, I do not
think that there is a performance penalty if you do

rg.match(string[n..-1])

Cheers
Robert

···

On 6/3/07, Kenneth McDonald <kenneth.m.mcdonald@sbcglobal.net> wrote:

I'm probably just missing something obvious, but I haven't found a way
to match a regular expression against only part of a string, in
particular only past a certain point of a string, as a way of finding
successive matches. Of course, one could do a match against a string,
take the substring past that match and do a match against the substring,
and so on, to find all of the matches for the string, but that could be
very expensive for very large strings.

I'm aware of the String.scan method, but that doesn't work for me
because it doesn't return MatchData instances.

What I want is just something like regexp.match(string, n),

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

Patrick_Hurley1 · 3 June 2007 04:56

I think he wanted MatchData objects. The String#index method returns
the index (numeric position of the match). But if all you want are
captures, then index is a good solution.

pth

···

On 6/3/07, Nobuyoshi Nakada <nobu@ruby-lang.org> wrote:

Hi,

At Sun, 3 Jun 2007 12:59:24 +0900,
Kenneth McDonald wrote in [ruby-talk:254054]:
> What I want is just something like regexp.match(string, n), where the
> regexp starts looking for a match at or after position n in the string.

string.index(regexp, n)

--
Nobu Nakada

Harry3 · 3 June 2007 05:20

If you want to specify the point in the string by number, you could do this.

str = "abcdefghabcehjjjuabcfjkiabcgdfg"
str =~ /.{10}(abc.).*(abc.)/
p $1 #abcf
p $2 #abcg

Harry

···

On 6/3/07, Harry Kakueki <list.push@gmail.com> wrote:

On 6/3/07, Kenneth McDonald <kenneth.m.mcdonald@sbcglobal.net> wrote:
>
> What I want is just something like regexp.match(string, n), where the
> regexp starts looking for a match at or after position n in the string.
>
> Thanks,
> Ken
>

You could match the string but ignore the first part of the match.

str = "abcdefghabcehjjjuabcfjkiabcgdfg"
str =~ /(abc.)/
p $1 # abcd
str =~ /a.*ju(abc.)/
p $1 #abcf

Harry

--

A Look into Japanese Ruby List in English

Kenneth_McDonald · 3 June 2007 05:20

Edwin Fine wrote:

Kenneth McDonald wrote:


I'm probably just missing something obvious, but I haven't found a way
to match a regular expression against only part of a string, in
particular only past a certain point of a string, as a way of finding
successive matches. Of course, one could do a match against a string,
take the substring past that match and do a match against the substring,
and so on, to find all of the matches for the string, but that could be
very expensive for very large strings.

I'm aware of the String.scan method, but that doesn't work for me
because it doesn't return MatchData instances.

What I want is just something like regexp.match(string, n), where the
regexp starts looking for a match at or after position n in the string.

Thanks,
Ken

How about this?

def match(s, re, n)
  /(?:.{#{n}})(#{re})/.match(s)
end

irb(main):043:0> p s
"abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh abdefgh "
irb(main):044:0> p match(s, /abd/, 10).begin(1)
16
irb(main):045:0> p match(s, /abd/, 20).begin(1)
24

That's clever. Obscure, but clever :-). I wonder if the regexp engine is clever enough to turn a match like .{n} into a constant time operation?

Thanks,
Ken

Forum · 4 June 2007 10:20

My bad, how stupid. Am I thinking in C???
Robert

···

On 6/4/07, Robert Dober <robert.dober@gmail.com> wrote:

rg.match(string[n..-1])

7rans · 4 June 2007 10:54

How can that be? You have to create a whole new String. If that can be
avoided in the internal implementation then adding an optional offset
index to #match is not an unreasonable idea.

T.
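
For readers on later Ruby versions: as far as I know, Regexp#match (and String#match) accept an optional start position from Ruby 1.9 onward, which is essentially the regexp.match(string, n) call Ken asked for. A rough sketch of walking successive matches that way; the helper name each_match_from is hypothetical.

```ruby
# Hypothetical helper: walks successive matches using the optional position
# argument of Regexp#match (Ruby 1.9 and later), without slicing the string.
def each_match_from(string, regexp, pos = 0)
  while (md = regexp.match(string, pos))
    yield md
    # Move past this match; step one extra character after an empty match.
    pos = md.end(0) == md.begin(0) ? md.end(0) + 1 : md.end(0)
  end
end

starts = []
each_match_from("abcdefghabcehjjjuabcfjkiabcgdfg", /abc./, 10) { |md| starts << md.begin(0) }
p starts   # [17, 24]
```

The positions reported by each MatchData are relative to the whole string, since no substring is created.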

···

On Jun 4, 6:19 am, "Robert Dober" <robert.do...@gmail.com> wrote:

On 6/3/07, Kenneth McDonald <kenneth.m.mcdon...@sbcglobal.net> wrote:
> I'm probably just missing something obvious, but I haven't found a way
> to match a regular expression against only part of a string, in
> particular only past a certain point of a string, as a way of finding
> successive matches. Of course, one could do a match against a string,
> take the substring past that match and do a match against the substring,
> and so on, to find all of the matches for the string, but that could be
> very expensive for very large strings.

> I'm aware of the String.scan method, but that doesn't work for me
> because it doesn't return MatchData instances.

> What I want is just something like regexp.match(string, n),

Hmm apart of using #scan and #index with $~ as indicated, I do not
think that there is a performance penalty if you do

rg.match(string[n..-1])

Nobuyoshi_Nakada1 · 3 June 2007 05:30

Hi,

At Sun, 3 Jun 2007 13:56:05 +0900,
Patrick Hurley wrote in [ruby-talk:254059]:

I think he wanted MatchData objects. The String#index method returns
the index (numeric position of the match). But if all you want are
captures, then index is a good solution.

String#index also sets $~.
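
A minimal sketch of that point: String#index with a start offset returns the position, and the MatchData for that search is then available through $~ (also reachable as Regexp.last_match). The sample string and variable names are only for illustration.

```ruby
str = "abcdefghabcehjjjuabcfjkiabcgdfg"   # Harry's sample string from earlier in the thread
re  = /abc(.)/

pos = str.index(re, 10)     # start searching at index 10
md  = Regexp.last_match     # same object as $~, set by String#index

p pos           # 17
p md[0]         # "abcf"
p md[1]         # "f"
p md.begin(0)   # 17, offsets are relative to the whole string
```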

···

--
Nobu Nakada

Forum · 4 June 2007 11:28

> On 6/3/07, Kenneth McDonald <kenneth.m.mcdon...@sbcglobal.net> wrote:
> > I'm probably just missing something obvious, but I haven't found a way
> > to match a regular expression against only part of a string, in
> > particular only past a certain point of a string, as a way of finding
> > successive matches. Of course, one could do a match against a string,
> > take the substring past that match and do a match against the substring,
> > and so on, to find all of the matches for the string, but that could be
> > very expensive for very large strings.
>
> > I'm aware of the String.scan method, but that doesn't work for me
> > because it doesn't return MatchData instances.
>
> > What I want is just something like regexp.match(string, n),
>
> Hmm apart of using #scan and #index with $~ as indicated, I do not
> think that there is a performance penalty if you do
>
> rg.match(string[n..-1])

How can that be? You have to create a whole new String.

Beating a dead man, Tom? As mentioned, I had a terrible slip into C in my
reasoning, no idea why.

···

On 6/4/07, Trans <transfire@gmail.com> wrote:

On Jun 4, 6:19 am, "Robert Dober" <robert.do...@gmail.com> wrote:
If that can be avoided in the internal implementation then adding an optional offset
index to #match is not an unreasonable idea.

T.

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

Patrick_Hurley1 · 3 June 2007 05:48

I should have known never to question Nobu Nakada :-). I always forget
about those variables.

Thanks
pth

···

On 6/3/07, Nobuyoshi Nakada <nobu@ruby-lang.org> wrote:

Hi,

At Sun, 3 Jun 2007 13:56:05 +0900,
Patrick Hurley wrote in [ruby-talk:254059]:
> I think he wanted MatchData objects. The String#index method returns
> the index (numeric position of the match). But if all you want are
> captures, then index is a good solution.

String#index also sets $~.

--
Nobu Nakada

Robert_K1 · 3 June 2007 07:30

But then you can also use String#scan:

irb(main):002:0> "ababb".scan(/(a)b+/) {p $~}
#<MatchData:0x7ff94618>
#<MatchData:0x7ff94578>
=> "ababb"
irb(main):003:0> "ababb".scan(/(a)b+/) {p $~.to_a}
["ab", "a"]
["abb", "a"]
=> "ababb"

Ken, why do you need MatchData objects?

Kind regards

robert
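
Building on the scan-with-block example above, the MatchData objects themselves can be collected for later use. A small sketch, with the array name matches chosen only for illustration:

```ruby
matches = []
"ababb".scan(/(a)b+/) { matches << Regexp.last_match }

matches.each do |md|
  p [md[0], md[1], md.begin(0)]
end
# prints ["ab", "a", 0] then ["abb", "a", 2]
```

Each block call sees a separate MatchData, so the collected objects keep their own offsets.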

···

On 03.06.2007 07:30, Nobuyoshi Nakada wrote:

Hi,

At Sun, 3 Jun 2007 13:56:05 +0900,
Patrick Hurley wrote in [ruby-talk:254059]:

I think he wanted MatchData objects. The String#index method returns
the index (numeric position of the match). But if all you want are
captures, then index is a good solution.

String#index also sets $~.

Devin_Mullins · 3 June 2007 08:26

Nobuyoshi Nakada wrote:

String#index also sets $~.

For that matter, so does String#scan.

Robert_K1 · 4 June 2007 11:50

Robert, actually string[n..-1] is cheaper than you might assume: I believe the new string shares the char buffer with the old string, so you basically just get a new String object with a different offset - the large bit (the char data) is not copied.

Kind regards

robert


Rick_DeNatale1 · 3 June 2007 15:00

Hence:
irb(main):001:0> "abcdefabc".scan(/abc/) {puts "#{$~.inspect}, #{$~}"}
#<MatchData:0xb7b0220c>, abc
#<MatchData:0xb7b021e4>, abc
=> "abcdefabc"

···

On 6/3/07, Devin Mullins <twifkak@comcast.net> wrote:

Nobuyoshi Nakada wrote:
> String#index also sets $~.
For that matter, so does String#scan.

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

Forum · 4 June 2007 12:06

I am afraid that this is not true anymore when the slice is passed as
a formal parameter; the data has to be copied:

irb(main):011:0> def change(x)
irb(main):012:1> x << "changed"
irb(main):013:1> end
=> nil
irb(main):014:0> a="abcdef"
=> "abcdef"
irb(main):015:0> change(a[1..2])
=> "bcchanged"
irb(main):016:0> a
=> "abcdef"

Cheers
Robert

···

On 6/4/07, Robert Klemme <shortcutter@googlemail.com> wrote:

On 04.06.2007 13:28, Robert Dober wrote:

Robert, actually string[n..-1] is cheaper than you might assume: I
believe the new string shares the char buffer with the old string, so
you basically just get a new String object with a different offset - the
large bit (the char data) is not copied.

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw
