String.scan - catching overlapping patterns with lookahead

Hi all.

I wrote a small script to count the number of occurrences of one string
in another string (including somewhat overlapping occurrences). Then I
found out about the .scan method, which speeded up some things, but
unfortunately introduced the 'banana problem' -> ("banana".scan("ana")
returns only one "ana")

(This was the script I had before i found out about scan:)

···

------
$proteasome.each {|x|
i = VIRUS #virus length
begin
i = virus[0,i + 4].rindex(x)
count_prot += 1 if i != nil
end until i == nil
}
------

proteasome is an array filled with 12 strings ("00010" , "00100" ,
"10110" etc,
virus is a long string of 1's and 0's (~3000 in total).

Initially, I changed the following, which speeded up the whole program
about 25% but suffered from the banana problem (most of the program's
energy goes into the above algorithm)

------
$proteasome.each {|x|
virus.scan(x) {
count_prot += 1
}
}
------

So I changed it into the following, and then optimized a little bit
(first did virus.scan(/(?=#{x})/ which is a little slower then what I'm
doing now)

------
$proteasome.each {|x|
virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
count_prot += 1
}
}
------

The sad thing is, the above is comparable in speed with the script I
had before I found out about the scan method. Did I miss any obvious
optimizations in the scanning/regexp method?

Thanks!,
Boris

"Ardanwen" <bvschmid@gmail.com> schrieb im Newsbeitrag
news:1103202699.721826.143490@f14g2000cwb.googlegroups.com...

So I changed it into the following, and then optimized a little bit
(first did virus.scan(/(?=#{x})/ which is a little slower then what I'm
doing now)

------
$proteasome.each {|x|
virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
count_prot += 1
}
}
------

The sad thing is, the above is comparable in speed with the script I
had before I found out about the scan method. Did I miss any obvious
optimizations in the scanning/regexp method?

Do you have any metacharacters in the strings contained in $proteasome?
In that case you should use Regexp.escape.

Although it might be a small optimization, you could split your search
string at another position than the first. For example, if you search for
"foofii" then you can split after the second "o" without loosing anything.
In fact you could even skip the lookahead for "foofii" completely, because
if it matches the earliest next match is after the complete string.
However there's a certain overhead involved in finding this position so it
might not be worth it if the strings you search through are short. If you
search through huge piles of data then it might be worthwile.

See also String-searching algorithm - Wikipedia

Kind regards

    robert

Are you sure about the benchmarks? I got results that suggest that
the plain scan version is much faster, assuming I've adapted your
original code correctly:

   require 'benchmark'
   include Benchmark

   str = "bananaaananaaana" * 100
   n = 5000

   def boris(str)
      c = 0
      i = str.size
      begin
        i = str[0,i + 2].rindex("ana")
        c += 1 if i != nil
      end until i == nil
      c
   end

   def david(str)
     str.scan(/(?=ana)/).size
   end

   bm do |x|
     x.report { n.times { boris(str) } }
     x.report { n.times { david(str) } }
   end

Results:

   $ ruby ban.rb
         user system total real
    27.460000 0.490000 27.950000 ( 28.048449)
     8.790000 0.020000 8.810000 ( 8.800876)

David

···

On Thu, 16 Dec 2004, Ardanwen wrote:

Hi all.

I wrote a small script to count the number of occurrences of one string
in another string (including somewhat overlapping occurrences). Then I
found out about the .scan method, which speeded up some things, but
unfortunately introduced the 'banana problem' -> ("banana".scan("ana")
returns only one "ana")

(This was the script I had before i found out about scan:)
------
$proteasome.each {|x|
i = VIRUS #virus length
begin
i = virus[0,i + 4].rindex(x)
count_prot += 1 if i != nil
end until i == nil
}
------

proteasome is an array filled with 12 strings ("00010" , "00100" ,
"10110" etc,
virus is a long string of 1's and 0's (~3000 in total).

Initially, I changed the following, which speeded up the whole program
about 25% but suffered from the banana problem (most of the program's
energy goes into the above algorithm)

------
$proteasome.each {|x|
virus.scan(x) {
count_prot += 1
}
------

So I changed it into the following, and then optimized a little bit
(first did virus.scan(/(?=#{x})/ which is a little slower then what I'm
doing now)

------
$proteasome.each {|x|
virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
count_prot += 1
}
------

The sad thing is, the above is comparable in speed with the script I
had before I found out about the scan method. Did I miss any obvious
optimizations in the scanning/regexp method?

--
David A. Black
dblack@wobblini.net

"David A. Black" <dblack@wobblini.net> schrieb im Newsbeitrag
news:Pine.LNX.4.61.0412160543050.23221@wobblini...

> Hi all.
>
> I wrote a small script to count the number of occurrences of one

string

> in another string (including somewhat overlapping occurrences). Then I
> found out about the .scan method, which speeded up some things, but
> unfortunately introduced the 'banana problem' -> ("banana".scan("ana")
> returns only one "ana")
>
>
> (This was the script I had before i found out about scan:)
> ------
> $proteasome.each {|x|
> i = VIRUS #virus length
> begin
> i = virus[0,i + 4].rindex(x)
> count_prot += 1 if i != nil
> end until i == nil
> }
> ------
>
> proteasome is an array filled with 12 strings ("00010" , "00100" ,
> "10110" etc,
> virus is a long string of 1's and 0's (~3000 in total).
>
>
> Initially, I changed the following, which speeded up the whole program
> about 25% but suffered from the banana problem (most of the program's
> energy goes into the above algorithm)
>
> ------
> $proteasome.each {|x|
> virus.scan(x) {
> count_prot += 1
> }
> }
> ------
>
>
> So I changed it into the following, and then optimized a little bit
> (first did virus.scan(/(?=#{x})/ which is a little slower then what

I'm

> doing now)
>
> ------
> $proteasome.each {|x|
> virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
> count_prot += 1
> }
> }
> ------
>
> The sad thing is, the above is comparable in speed with the script I
> had before I found out about the scan method. Did I miss any obvious
> optimizations in the scanning/regexp method?

Are you sure about the benchmarks? I got results that suggest that
the plain scan version is much faster, assuming I've adapted your
original code correctly:

   require 'benchmark'
   include Benchmark

   str = "bananaaananaaana" * 100
   n = 5000

   def boris(str)
      c = 0
      i = str.size
      begin
        i = str[0,i + 2].rindex("ana")
        c += 1 if i != nil
      end until i == nil
      c
   end

   def david(str)
     str.scan(/(?=ana)/).size
   end

   bm do |x|
     x.report { n.times { boris(str) } }
     x.report { n.times { david(str) } }
   end

Results:

   $ ruby ban.rb
         user system total real
    27.460000 0.490000 27.950000 ( 28.048449)
     8.790000 0.020000 8.810000 ( 8.800876)

David

As far as I remember he didn't do the size trick your used. That might be
one point.

    robert

···

On Thu, 16 Dec 2004, Ardanwen wrote:

First time trying to use ruby's benchmark module. Doing something
wrong,

···

----------
boris:falcon:~/SPEC:ruby bench-search.rb
user system total real
../benchmark.rb:148:in `times': undefined method `times' for
Process:Module (NameError)
----------
so I switched back to linux's "time bench-search.rb".

Re-rewrote your program to something that resembles mine more again,
and then
they're similarly slow again (n=50000).

boris:falcon:~/SPEC:time bench-search.rb 7.430u 0.060s 0:07.99 93.7%
0+0k 0+0io 359pf+0w (david)
boris:falcon:~/SPEC:time bench-search.rb 7.480u 0.010s 0:07.54 99.3%
0+0k 0+0io 348pf+0w (boris)

Not sure why you find something so different. Gutfeeling says it might
be because of your search size (3 chars instead of 5 chars, or the
specific search- and searched-string you use. Doing some testing right
now (n = 50000 though).

Now with a 5char nonchanging pattern, changing searched-string
[syntax: program pattern randomseed]
boris:falcon:~/SPEC:time boris.rb 11111 1 8.210u 0.140s 0:08.43 99.0%
boris:falcon:~/SPEC:time boris.rb 11111 1 8.320u 0.100s 0:08.57 98.2%
boris:falcon:~/SPEC:time boris.rb 11111 2 7.920u 0.030s 0:07.98 99.6%
boris:falcon:~/SPEC:time boris.rb 11111 2 7.980u 0.020s 0:07.99 100.1%
boris:falcon:~/SPEC:time boris.rb 11111 3 4.760u 0.010s 0:04.77 100.0%
boris:falcon:~/SPEC:time boris.rb 11111 3 4.770u 0.020s 0:04.82 99.3%
boris:falcon:~/SPEC:time david.rb 11111 1 7.780u 0.000s 0:07.79 99.8%
boris:falcon:~/SPEC:time david.rb 11111 1 7.840u 0.000s 0:07.91 99.1%
boris:falcon:~/SPEC:time david.rb 11111 2 7.530u 0.000s 0:07.58 99.3%
boris:falcon:~/SPEC:time david.rb 11111 2 7.440u 0.000s 0:07.51 99.0%
boris:falcon:~/SPEC:time david.rb 11111 3 5.450u 0.000s 0:05.42 100.5%
boris:falcon:~/SPEC:time david.rb 11111 3 5.510u 0.000s 0:05.60 98.3%

Seems like searched-string pattern matters. Now look at search-string.
Specific search string you use matters quite a lot.
boris:falcon:~/SPEC:time boris.rb 10101 1 9.520u 0.000s 0:09.52 100.0%
boris:falcon:~/SPEC:time boris.rb 10101 2 9.290u 0.010s 0:09.24 100.6%
boris:falcon:~/SPEC:time boris.rb 10101 3 8.620u 0.030s 0:08.70 99.4%
boris:falcon:~/SPEC:time david.rb 10101 1 9.420u 0.000s 0:09.51 99.0%
boris:falcon:~/SPEC:time david.rb 10101 2 8.710u 0.000s 0:08.78 99.2%
boris:falcon:~/SPEC:time david.rb 10101 3 8.540u 0.010s 0:08.79 97.2%

So not sure if that's enough to explain the difference between 1's and
0's and banana's. Another factor is the size of the search string, but
I grow a bit weary of testing :).

Anyway, thanks for testing. Time to look at that search-string link.

bench-search.rb
----------------------------------------
#! /usr/bin/env ruby
FIVE =
["00000","00001","00010","00011","00100","00101","00110","00111","01000","01001","01010","01011","01100","01101","01110","01111","10000","10001","10010","10011","10100","10101","10110","10111","11000","11001","11010","11011","11100","11101","11110","11111"]

str = ""
begin
str << "#{rand(2)}"
end until str.length == 1000

srand(1)

n = 50000

def boris(str)
pattern = FIVE[rand(32)]
c = 0
i = 1000
begin
i = str[0,i + 4].rindex(pattern)
c += 1 if i != nil
end until i == nil
c
end

def david(str)
pattern = FIVE[rand(32)]
str.scan(/(?=#{pattern})/).size
end

n.times { boris(str) }
#n.times { david(str) }
-----------------------------------------

boris.rb
------------------------
#! /usr/bin/env ruby

# This bit makes a string length 1000 filled with random 0's and 1's
srand(ARGV[1].to_i)
str = ""
begin
str << "#{rand(2)}"
end until str.length == 1000

# This is the method.
def boris(str)
c = 0
i = 1000
begin
i = str[0,i + 4].rindex(ARGV[0])
c += 1 if i != nil
end until i == nil
c
end

# Here we run it.
50000.times { boris(str) }
--------------------------

david.rb
------------
#! /usr/bin/env ruby

# This bit makes a string length 1000 filled with random 0's and 1's
srand(ARGV[1].to_i)
str = ""
begin
str << "#{rand(2)}"
end until str.length == 1000

# This is the method.
def david(str)
str.scan(/(?=#{ARGV[0]})/).size
end
# Here we run it.
50000.times { david(str) }
---------------------

Ugh. I need to learn how to preserve indentation in google groups :P.
New to newsgroups in general, I'll go look for a faq. For google groups
users, the original (but sadly still unindented code) is under 'show
options' 'show original'. For some reason, a part changes and becomes
blue in the normal google view.