Finding positions in a string

I have a long string and I need to know the starting position of all
substrings matching a particular sequence. Most importantly, it needs
to be fast. Secondly, it would be nice if it was somewhat concise.

Here's the method:

def substring_positions(substring, string)
## fast, concise method??
end

my_substring = 'this'
my_string = 'this string this is a string this is it'

substring_positions(my_substring, my_string) # -> should return
[0, 12, 29]

This seems trivial to do, but looking at StringScanner, String#match,
and String#scan, nothing simple comes to mind. There must be a one-
liner somewhere for this kind of thing. I checked through facets and
didn't see anything...

Thanks

Hi --

I have a long string and I need to know the starting position of all
substrings matching a particular sequence. Most importantly, it needs
to be fast. Secondly, it would be nice if it was somewhat concise.

Here's the method:

def substring_positions(substring, string)
## fast, concise method??

Would you settle for concise? :slight_smile:

end

my_substring = 'this'
my_string = 'this string this is a string this is it'

substring_positions(my_substring, my_string) # -> should return
[0, 12, 29]

This seems trivial to do, but looking at StringScanner, String#match,
and String#scan, nothing simple comes to mind. There must be a one-
liner somewhere for this kind of thing. I checked through facets and
didn't see anything...

I'll get the ball rolling, though I doubt it's very efficient:

require 'enumerator'
def substring_positions(substring, string)
   string.enum_for(:scan, substring).map { $~.offset(0)[0] }
end

David

···

On Tue, 24 Jul 2007, bwv549 wrote:

--
* Books:
   RAILS ROUTING (new! http://www.awprofessional.com/title/0321509242\)
   RUBY FOR RAILS (http://www.manning.com/black\)
* Ruby/Rails training
     & consulting: Ruby Power and Light, LLC (http://www.rubypal.com)

Well, here's what I have so far:

  def substring_positions(substring, string)
    substring_len = substring.size
    s = StringScanner.new(string)
    regexp = Regexp.new(Regexp.escape(substring))
    positions = []
    while s.scan_until(regexp)
      positions << (s.pos - substring_len)
    end
    positions
  end

In theory, it looks fairly fast, but it's not all that concise...

Anyone have anything better?

Thanks

# substring_positions(my_substring, my_string) # -> should return
# [0, 12, 29]

···

From: bwv549 [mailto:jtprince@gmail.com]
#
# This seems trivial to do, but looking at StringScanner, String#match,
# and String#scan, nothing simple comes to mind. There must be a one-
# liner somewhere for this kind of thing.

not a raw one-liner, but you can create one :wink:

root@pc4all:~# cat -n test.rb
     1 class String
     2 def scan_p ss
     3 a = []
     4 self.scan(ss){a << Regexp.last_match.begin(0)}
     5 a
     6 end
     7 end
     8
     9 p "this string this is a string this is it".scan_p("this")
    10
root@pc4all:~# ruby test.rb
[0, 12, 29]
root@pc4all:~#

i think the keyword is Matchdata.

kind regards -botp

Another idea is to use String#index() with the optional second minimum index parameter which you keep incrementing.

James Edward Gray II

···

On Jul 23, 2007, at 3:54 PM, bwv549 wrote:

I have a long string and I need to know the starting position of all
substrings matching a particular sequence. Most importantly, it needs
to be fast. Secondly, it would be nice if it was somewhat concise.

Here's the method:

def substring_positions(substring, string)
## fast, concise method??
end

my_substring = 'this'
my_string = 'this string this is a string this is it'

substring_positions(my_substring, my_string) # -> should return
[0, 12, 29]

require 'enumerator'
def substring_positions(substring, string)
   string.enum_for(:scan, substring).map { $~.offset(0)[0] }
end

That's the one-line magic I knew was hiding out there. I will use it
and see if it is acceptably fast. Thanks

> I have a long string and I need to know the starting position of all
> substrings matching a particular sequence. Most importantly, it needs
> to be fast. Secondly, it would be nice if it was somewhat concise.

<snip solutions>

Another idea is to use String#index() with the optional second
minimum index parameter which you keep incrementing.

James Edward Gray II

Indeed.

d-o-ibook-g4:~ d-o$ uname -a
Darwin d-o-ibook-g4.local 8.8.0 Darwin Kernel Version 8.8.0: Fri Sep
8 17:18:57 PDT 2006; root:xnu-792.12.6.obj~1/RELEASE_PPC Power
Macintosh powerp

d-o-ibook-g4:~ d-o$ cat bs.rb
require 'benchmark'
require 'enumerator'

COUNT=1_000_000

class String
  def scan_p ss
    a =
    self.scan(ss){a << Regexp.last_match.begin(0)}
    a
  end
end

substr = Proc.new do |s,seq|
pos=0
index=

while((i=s.index(seq,pos))!=nil)
  index<<i
  pos=i+seq.length
end

index
end

substr_positions = Proc.new do |string,substring|
   string.enum_for(:scan, substring).map { $~.offset(0)[0] }
end

str="assbasscassmassthatassbass"
seq="ss"

Benchmark.bm do |x|
  x.report("with index(): ") do; 1.upto(COUNT) do;
substr.call(seq,str); end end
  x.report("with enum_for()/map(): ") do; 1.upto(COUNT) do;
substr_positions.call(str,seq); end end
  x.report("with scan(): ") do; 1.upto(COUNT) do; str.scan_p(seq); end
end
end

d-o-ibook-g4:~ d-o$ ruby bs.rb
      user system total real
with index(): 10.380000 0.060000 10.440000 ( 10.670129)
with enum_for()/map(): 88.740000 0.910000 89.650000 ( 98.726685)
with scan(): 57.620000 0.590000 58.210000 ( 65.996496)

···

On Jul 23, 9:10 pm, James Edward Gray II <ja...@grayproductions.net> wrote:

On Jul 23, 2007, at 3:54 PM, bwv549 wrote:

# On Jul 23, 9:10 pm, James Edward Gray II <ja...@grayproductions.net>
# wrote:
# > Another idea is to use String#index() with the optional second
# > minimum index parameter which you keep incrementing.

···

From: Skye Shaw!@#$ [mailto:skye.shaw@gmail.com]
#

cool and simple indeed.

# substr = Proc.new do |s,seq|
# pos=0
# index=[]
   ^^^^^
  rename. it might confuse us nubies :slight_smile:

#
# while((i=s.index(seq,pos))!=nil)

simplify. while i=s.index(seq,pos)

# index<<i
# pos=i+seq.length
           ^^^^^
       this does not change. init this outside loop.

# end
#
# index
# end

hmmm, even the recursive index method is not bad..

root@pc4all:~# cat -n test.rb
     1 require 'benchmark'
     2 require 'enumerator'
     3
     4 COUNT=100_000
     5
     6 class String
     7 def scan_p ss
     8 a = []
     9 self.scan(ss){a << Regexp.last_match.begin(0)}
    10 a
    11 end
    12
    13 # recursive scan using index
    14 def scan_p2 ss, pos = 0
    15 return [] unless i = index(ss,pos)
    16 [i] + scan_p2(ss, i + ss.length)
    17 end
    18 end
    19
    20 class String
    21 def scan_i seq
    22 pos=0
    23 ndx=[]
    24 slen = seq.length
    25 while i=index(seq,pos)
    26 ndx << i
    27 pos = i + slen
    28 end
    29 ndx
    30 end
    31 end
    32
    33 class String
    34 #substr_positions = Proc.new do |string,substring|
    35 def scan_enum substring
    36 enum_for(:scan, substring).map { $~.offset(0)[0] }
    37 end
    38 end
    39
    40 str="assbasscassmassthatassbass"
    41 seq="ss"
    42
    43 Benchmark.bm do |x|
    44 x.report("with index(): ") do; 1.upto(COUNT) do;
    45 str.scan_i(seq); end end
    46 x.report("with enum_for()/map(): ") do; 1.upto(COUNT) do;
    47 str.scan_enum(seq); end end
    48 x.report("with scan_p(): ") do; 1.upto(COUNT) do; str.scan_p(seq); end end
    49 x.report("with scan_p2(): ") do; 1.upto(COUNT) do; str.scan_p2(seq); end end
    50
    51 end
    52
    53
root@pc4all:~# ruby test.rb
                           user system total real
with index(): 4.760000 0.010000 4.770000 ( 4.774163)
with enum_for()/map(): 22.100000 0.010000 22.110000 ( 22.139630)
with scan_p(): 16.680000 0.010000 16.690000 ( 16.702391)
with scan_p2(): 9.940000 0.010000 9.950000 ( 9.955084)
root@pc4all:~#

not bad at 100k loops :slight_smile:

root@pc4all:~# cat /proc/cpuinfo | egrep -i '(mhz|name)'
model name : Pentium II (Klamath)
cpu MHz : 300.061
root@pc4all:~# cat /proc/meminfo | grep -i mem
MemTotal: 255616 kB
MemFree: 7920 kB

kind regards -botp