Count substrings in string, scan too slow

Danny_Challis · 24 June 2010 15:04

Hello everyone,
I need to count the number of times a substring occurs in a string.
I am currently doing this using the scan method, but it is simply too
slow. I feel there should be a faster way to do this since the scan
method is really designed for more advanced things than this. I do not
need to do regex matching or to process the matches, just count
substrings. So what I want is something like this:

s = "you like to play with your yo-yo"
s.magical_count_method("yo") => 4

Once again, what I'm really looking for is something fast. I've tried
using external linux commands such as awk, but that was much much
slower. Any ideas?
Thanks,

Danny.

--
Posted via http://www.ruby-forum.com/.

Jesus_Gabriel_y_Gala · 24 June 2010 15:17

I don't know how slow scan is for you. An implementation using
String#index and a loop is a little faster, but not by much:

require 'benchmark'

TIMES = 100_000
s = "you like to play with your yo-yo"

Benchmark.bmbm do |x|
  x.report("scan") do
    TIMES.times do
      s.scan("yo").size
    end
  end
  x.report("while") do
    TIMES.times do
      index = -1
      count = 0
      while (index = s.index("yo", index+1))
        count += 1
      end
      count
    end
  end
end

$ ruby scan_vs_while.rb
Rehearsal -----------------------------------------
scan    0.560000   0.020000   0.580000 (  0.585972)
while   0.440000   0.060000   0.500000 (  0.492969)
-------------------------------- total: 1.080000sec

            user     system      total        real
scan    0.510000   0.010000   0.520000 (  0.519078)
while   0.470000   0.020000   0.490000 (  0.493562)

I don't know if this is enough for you; probably not.

Jesus.


Dave_Baldwin · 24 June 2010 15:49

Anything written in Ruby may not beat the underlying library functions, as they are written in C.

I have vague recollections of a Ruby Quiz being based on something like this.
Dave.


Danny_Challis · 24 June 2010 15:45

Thanks Jesus,
This method actually decreased the runtime by quite a bit, so thanks
for the help! However, I still need something even faster if it exists,
so any other ideas would be appreciated. I may have to just implement
this part in C or something.

Danny.


Robert_K1 · 24 June 2010 16:05

I took the liberty to extend the benchmark a bit:

https://gist.github.com/rklemme/451622

gistfile2.txt

18:00:04 Temp$ allruby sc.rb
CYGWIN_NT-5.1 padrklemme1 1.7.5(0.225/5/3) 2010-04-12 19:07 i686 Cygwin
========================================
ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-cygwin]
Rehearsal ----------------------------------------------
scan         2.766000   0.000000   2.766000 (  2.786000)
scan ++      4.656000   0.000000   4.656000 (  4.668000)
scan re      2.688000   0.000000   2.688000 (  2.696000)
scan re ++   4.531000   0.000000   4.531000 (  4.547000)
while        1.094000   0.000000   1.094000 (  1.135000)

sc.rb

require 'benchmark'

TIMES = 100_000
s = "you like to play with your yo-yo"

Benchmark.bmbm do |x|
 x.report("scan") do
   TIMES.times do
       count = s.scan("yo").size
       raise count unless count == 4

I would have expected regexp to be faster...

Cheers

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Jesus_Gabriel_y_Gala · 24 June 2010 15:50

I suppose that if you implement a C method that does what I did in
Ruby, that would be faster.
I mean doing the loop in C and calling String#index from there.

Jesus.


Jesus_Gabriel_y_Gala · 24 June 2010 16:16

This thing about adding the length of the match can be argued
depending on the requirements, I think.
What would you expect from:

"yoyoyoyo".magical_count_method("yoyo")

2 or 3?

If you add the length to the index you get 2. If you add 1, you get 3.

irb(main):018:0> s = "yoyoyoyo"
=> "yoyoyoyo"
irb(main):019:0> count = 0
=> 0
irb(main):020:0> len = s.length
=> 8
irb(main):021:0> search = "yoyo"
=> "yoyo"
irb(main):023:0> len = search.length
=> 4
irb(main):024:0> index = -len
=> -4
irb(main):025:0> while (index = s.index(search, index + len))
irb(main):026:1> count += 1
irb(main):027:1> end
=> nil
irb(main):028:0> count
=> 2

irb(main):029:0> count = 0
=> 0
irb(main):030:0> index = -1
=> -1
irb(main):031:0> while (index = s.index(search, index + 1))
irb(main):032:1> count += 1
irb(main):033:1> end
=> nil
irb(main):034:0> count
=> 3

So, I don't know. Of course, if the requirement is to get 2 from the
above situation, adding the length is better.
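The two irb sessions above can be folded into one helper. This is only a minimal sketch: the name count_substring and the overlapping flag are made up for illustration, and the empty-pattern guard is an assumption added so the loop cannot run forever.

```ruby
# Count occurrences of sub in str with String#index.
# overlapping = false steps past the whole match (adds the length);
# overlapping = true steps a single character (adds 1).
def count_substring(str, sub, overlapping = false)
  return 0 if sub.empty?            # assumption: treat an empty pattern as no match
  step  = overlapping ? 1 : sub.length
  count = 0
  index = 0
  while (index = str.index(sub, index))
    count += 1
    index += step
  end
  count
end

puts count_substring("yoyoyoyo", "yoyo")        # non-overlapping count
puts count_substring("yoyoyoyo", "yoyo", true)  # overlapping count
```

With the non-overlapping default this gives 2 for the "yoyoyoyo" example, and 3 when the flag is set, matching the sessions above.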

Also worth noting is that the block versions of scan are slower because
they have to call a block for each match.
I think I've read that the String#index method uses Rabin-Karp. It
would be interesting to compare this to a Boyer-Moore implementation.
Of course it will depend on the input data, if it's near the best or
worst case for each, but anyway.

Jesus.


botp1 · 24 June 2010 16:35

you don't like strscan ?
best regards -botp
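For readers who have not used strscan, counting with it might look roughly like the sketch below. The helper name count_with_strscan is hypothetical, and the pattern is escaped and wrapped in a Regexp on the assumption that a literal substring is wanted.

```ruby
require 'strscan'

# Count non-overlapping occurrences of sub in str with StringScanner.
# scan_until advances the scan pointer to just past each match and
# returns nil once nothing more is found.
def count_with_strscan(str, sub)
  return 0 if sub.empty?                        # assumption: empty pattern counts as zero
  scanner = StringScanner.new(str)
  pattern = Regexp.new(Regexp.escape(sub))      # literal match, no regex metacharacters
  count = 0
  count += 1 while scanner.scan_until(pattern)
  count
end

puts count_with_strscan("you like to play with your yo-yo", "yo")
```

Because the pointer lands after each match, this counts non-overlapping occurrences only.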


Forum · 29 June 2010 20:19

I too took the liberty of changing the benchmark, and I found a strange
way to beat the "while" version, though only by a little.

https://gist.github.com/RobertDober/457751

scountbench.rb

require 'benchmark'

TIMES = 4_000
s = "you like to play with your yo-yo" * 100
Count = 400

def check!
  abort "count not #{Count} but #{@count}" unless @count == Count
end


--
The best way to predict the future is to invent it.
-- Alan Kay

Danny_Challis · 24 June 2010 17:03

I'm looking for non-overlapping matches (so 2 in your example).
I modified your code to do this as you showed, and it works
fine. I was thinking of trying a Boyer-Moore implementation, but I
suspect if I implement this manually in Ruby it will be much slower.
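For anyone who wants to test that suspicion, here is a rough pure-Ruby sketch. Note that it is the Horspool simplification (bad-character shifts only), not full Boyer-Moore, and the name horspool_count is invented for illustration. It works on bytes, so it assumes both strings share an encoding. Nobody in this thread has measured it.

```ruby
# Boyer-Moore-Horspool count of non-overlapping matches, byte by byte.
def horspool_count(text, pattern)
  t = text.bytes.to_a
  pat = pattern.bytes.to_a
  m = pat.length
  n = t.length
  return 0 if m.zero? || m > n

  # Bad-character table: shift distance keyed by the byte under the
  # last position of the current window (default is the pattern length).
  shift = Array.new(256, m)
  (0...(m - 1)).each { |i| shift[pat[i]] = m - 1 - i }

  count = 0
  pos = 0
  while pos <= n - m
    # Compare the window right to left.
    j = m - 1
    j -= 1 while j >= 0 && t[pos + j] == pat[j]
    if j < 0
      count += 1
      pos += m                        # skip the whole match: non-overlapping
    else
      pos += shift[t[pos + m - 1]]    # jump by the bad-character shift
    end
  end
  count
end

puts horspool_count("you like to play with your yo-yo", "yo")
```

Every comparison here is a Ruby-level array access, which is the overhead the C implementation behind String#index avoids.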


Michael_Fellinger1 · 24 June 2010 17:16

I've just run some benchmarks with strscan, and it's at least in the
same ballpark as the other approaches, unless you're on rubinius, but
then all string processing is really slow on that anyway.

Benchmark with strscan here: sc.rb · GitHub


--
Michael Fellinger
CTO, The Rubyists, LLC

Brabuhr · 24 June 2010 19:02

http://en.literateprograms.org/Boyer-Moore_string_search_algorithm_(Java)

require 'java'
java_import 'BoyerMoore'

  x.report 'boyer_moore' do
    count = BoyerMoore.match("yo", s).size
    check count
  end

$ jruby -v yomark.rb
jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) Client VM 1.6.0_20) [i386-java]
Rehearsal -----------------------------------------------
scan 22.423000 0.000000 22.423000 ( 22.334000)
scan ++ 36.738000 0.000000 36.738000 ( 36.738000)
scan re 19.451000 0.000000 19.451000 ( 19.451000)
scan re ++ 39.222000 0.000000 39.222000 ( 39.222000)
while 22.621000 0.000000 22.621000 ( 22.622000)
strscan 29.075000 0.000000 29.075000 ( 29.076000)
boyer_moore 0.009000 0.000000 0.009000 ( 0.009000)
------------------------------------ total: 169.539000sec

user system total real
scan 18.050000 0.000000 18.050000 ( 18.051000)
scan ++ 35.046000 0.000000 35.046000 ( 35.046000)
scan re 17.807000 0.000000 17.807000 ( 17.807000)
scan re ++ 34.086000 0.000000 34.086000 ( 34.085000)
while 22.089000 0.000000 22.089000 ( 22.089000)
strscan 29.538000 0.000000 29.538000 ( 29.538000)
boyer_moore 0.005000 0.000000 0.005000 ( 0.004000)

$ jruby -v --server --fast yomark.rb
jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) Server VM 1.6.0_20) [i386-java]
yobench.rb:50 warning: Useless use of a variable in void context.
Rehearsal -----------------------------------------------
scan 17.340000 0.000000 17.340000 ( 17.154000)
scan ++ 23.986000 0.000000 23.986000 ( 23.987000)
scan re 15.170000 0.000000 15.170000 ( 15.169000)
scan re ++ 22.805000 0.000000 22.805000 ( 22.806000)
while 12.050000 0.000000 12.050000 ( 12.050000)
strscan 31.396000 0.000000 31.396000 ( 31.396000)
boyer_moore 0.010000 0.000000 0.010000 ( 0.010000)
------------------------------------ total: 122.756999sec

user system total real
scan 15.201000 0.000000 15.201000 ( 15.201000)
scan ++ 23.758000 0.000000 23.758000 ( 23.758000)
scan re 14.770000 0.000000 14.770000 ( 14.770000)
scan re ++ 22.455000 0.000000 22.455000 ( 22.455000)
while 12.182000 0.000000 12.182000 ( 12.182000)
strscan 24.497000 0.000000 24.497000 ( 24.497000)
boyer_moore 0.002000 0.000000 0.002000 ( 0.002000)


botp1 · 25 June 2010 04:01

On Fri, Jun 25, 2010 at 1:16 AM, Michael Fellinger wrote:

I've just run some benchmarks with strscan, and it's at least in the
same ballpark as the other approaches, unless you're on rubinius, but
then all string processing is really slow on that anyway.

Benchmark with strscan here: sc.rb · GitHub

that is not fair for strscan.. you are recreating the object inside the loop

outside loop do:
s=StringScanner.new "some string foo..."
s2=s.dup

inside loop do:
s=s2
.... s.scan_until...
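
One way to reuse a single scanner across iterations is sketched below, with the caveat that it swaps in StringScanner#reset for the dup shown above: plain assignment (s = s2) only makes both names point at the same scanner, so the second pass would start at the end of the string. The three-pass loop and variable names are just for illustration.

```ruby
require 'strscan'

# Build the scanner once, outside the loop.
scanner = StringScanner.new("you like to play with your yo-yo")
pattern = /yo/

counts = Array.new(3) do
  scanner.reset                 # rewind the scan pointer instead of allocating a new scanner
  count = 0
  count += 1 while scanner.scan_until(pattern)
  count
end

p counts
```

Each pass should report the same count, since reset puts the pointer back at the start of the string.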

best regards -botp

Brabuhr · 24 June 2010 19:48

http://en.literateprograms.org/Boyer-Moore_string_search_algorithm_(Java)

require 'java'
java_import 'BoyerMoore'

x.report 'boyer_moore' do
count = BoyerMoore.match("yo", s).size
check count
end

that wasn't the right one

  x.report 'boyer_moore' do
    TIMES.times do
      count = BoyerMoore.match("yo", s).size
      check count
    end
  end

jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) Client VM 1.6.0_20) [i386-java]
Rehearsal -----------------------------------------------
boyer_moore 25.742000 0.000000 25.742000 ( 25.661000)
------------------------------------- total: 25.742000sec

user system total real
boyer_moore 24.869000 0.000000 24.869000 ( 24.869000)

jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) Server VM 1.6.0_20) [i386-java]
Rehearsal -----------------------------------------------
boyer_moore 16.733000 0.000000 16.733000 ( 16.401000)
------------------------------------- total: 16.733000sec

user system total real
boyer_moore 15.970000 0.000000 15.970000 ( 15.971000)

Michael_Fellinger1 · 25 June 2010 07:38

That's not fair for the others, and doesn't make any difference in the
benchmark anyway.

On Fri, Jun 25, 2010 at 1:01 PM, botp <botpena@gmail.com> wrote:

that is not fair for strscan.. you are recreating the object inside the loop

--
Michael Fellinger
CTO, The Rubyists, LLC

botp1 · 25 June 2010 10:00

On Fri, Jun 25, 2010 at 3:38 PM, Michael Fellinger wrote:

That's not fair for the others,

indeed, in general. but if multiple/repeated processes are done on the
same string, then strscan will make very big difference.

and doesn't make any difference in the
benchmark anyway.

which makes me think that it could be possible that ruby strings may be
strscan-ready without added init load

best regards -botp

Charles_Nutter · 29 June 2010 19:13

FYI, a large part of the overhead here is probably the Java calls,
which are a bit slower than Ruby to Ruby calls (plus it's decoding the
"yo" string to UTF-16 each call). For a larger string and fewer calls,
the pure Java BoyerMoore performance would likely benchmark a lot
better than this.

- Charlie

On Thu, Jun 24, 2010 at 2:48 PM, <brabuhr@gmail.com> wrote:

that wasn't the right one

x.report 'boyer_moore' do
TIMES.times do
count = BoyerMoore.match("yo", s).size
check count
end
end

Brabuhr · 30 June 2010 02:04

I had a similar suspicion and had started a modified benchmark doing
fewer loops over larger data, but had to move on to other things.

This gives me a chance to try out the JRuby Mac Installer...

Original benchmark:

jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot(TM) 64-Bit Server VM 1.6.0_20) [x86_64-java]
Rehearsal -----------------------------------------------
scan 8.851000 0.000000 8.851000 ( 8.784000)
scan ++ 14.186000 0.000000 14.186000 ( 14.186000)
scan re 8.594000 0.000000 8.594000 ( 8.594000)
scan re ++ 15.558000 0.000000 15.558000 ( 15.558000)
while 8.102000 0.000000 8.102000 ( 8.101000)
strscan 14.023000 0.000000 14.023000 ( 14.023000)
boyer_moore 7.446000 0.000000 7.446000 ( 7.446000)
------------------------------------- total: 76.760000sec

user system total real
scan 8.157000 0.000000 8.157000 ( 8.157000)
scan ++ 13.953000 0.000000 13.953000 ( 13.953000)
scan re 8.346000 0.000000 8.346000 ( 8.346000)
scan re ++ 15.332000 0.000000 15.332000 ( 15.333000)
while 8.087000 0.000000 8.087000 ( 8.087000)
strscan 14.303000 0.000000 14.303000 ( 14.303000)
boyer_moore 6.885000 0.000000 6.885000 ( 6.885000)

Even with the Ruby to Java call overhead, the Java BoyerMoore is
coming back the fastest on this machine. For comparison:

ruby 1.8.7 (2009-06-12 patchlevel 174) [universal-darwin10.0]
Rehearsal ----------------------------------------------
scan 31.030000 0.020000 31.050000 ( 31.094718)
scan ++ 62.310000 0.900000 63.210000 ( 63.227271)
scan re 31.030000 0.030000 31.060000 ( 31.110528)
scan re ++ 62.820000 0.870000 63.690000 ( 63.718876)
while 26.090000 0.020000 26.110000 ( 26.095308)
strscan 28.440000 0.010000 28.450000 ( 28.485140)
----------------------------------- total: 243.570000sec

user system total real
scan 31.240000 0.020000 31.260000 ( 31.264699)
scan ++ 64.000000 0.860000 64.860000 ( 64.865223)
scan re 31.570000 0.020000 31.590000 ( 31.581045)
scan re ++ 64.180000 0.980000 65.160000 ( 65.401667)
while 26.580000 0.030000 26.610000 ( 26.757658)
strscan 28.730000 0.030000 28.760000 ( 28.831860)

Unfortunately, I do not have 1.9.x on this machine at the moment.

