Hello everyone,
I need to count the number of times a substring occurs in a string.
I am currently doing this using the scan method, but it is simply too
slow. I feel there should be a faster way to do this since the scan
method is really designed for more advanced things than this. I do not
need to do regex matching or to process the matches, just count
substrings. So what I want is something like this:
s = "you like to play with your yo-yo"
s.magical_count_method("yo") => 4
Once again, what I'm really looking for is something fast. I've tried
using external linux commands such as awk, but that was much much
slower. Any ideas?
Thanks,
I don't know how slow scan is for you. An implementation using
String#index and a loop is a little bit faster, but not by much:
require 'benchmark'

TIMES = 100_000
s = "you like to play with your yo-yo"

Benchmark.bmbm do |x|
  x.report("scan") do
    TIMES.times do
      s.scan("yo").size
    end
  end
  x.report("while") do
    TIMES.times do
      index = -1
      count = 0
      while (index = s.index("yo", index+1))
        count += 1
      end
      count
    end
  end
end

            user     system      total        real
scan    0.510000   0.010000   0.520000 (  0.519078)
while   0.470000   0.020000   0.490000 (  0.493562)
I don't know if this is enough for you; probably not.
Jesus.
···
On Thu, Jun 24, 2010 at 5:04 PM, Danny Challis <dannychallis@gmail.com> wrote:
...
Anything written in Ruby may not beat the underlying library functions, as they are written in C.
I have vague recollections of a Ruby Quiz being based on something like this.
Dave.
Thanks Jesus,
This method actually decreased the runtime by quite a bit, so thanks
for the help! However, I still need something even faster if it exists,
so any other ideas would be appreciated. I may have to just implement
this part in C or something.
Danny.
2010/6/24 Jesús Gabriel y Galán <jgabrielygalan@gmail.com>:
...
I suppose that if you implement a C method that does what I did in
Ruby, that would be faster.
I mean doing the loop in C and calling String#index from there.
Jesus.
···
On Thu, Jun 24, 2010 at 5:45 PM, Danny Challis <dannychallis@gmail.com> wrote:
...
This thing about adding the length of the match can be argued
depending on the requirements, I think.
What would you expect from:
"yoyoyoyo".magical_count_method("yoyo")
2 or 3?
If you add the length to the index you get 2. If you add 1, you get 3.
irb(main):018:0> s = "yoyoyoyo"
=> "yoyoyoyo"
irb(main):019:0> count = 0
=> 0
irb(main):020:0> len = s.length
=> 8
irb(main):021:0> search = "yoyo"
=> "yoyo"
irb(main):023:0> len = search.length
=> 4
irb(main):024:0> index = -len
=> -4
irb(main):025:0> while (index = s.index(search, index + len))
irb(main):026:1> count += 1
irb(main):027:1> end
=> nil
irb(main):028:0> count
=> 2
irb(main):029:0> count = 0
=> 0
irb(main):030:0> index = -1
=> -1
irb(main):031:0> while (index = s.index(search, index + 1))
irb(main):032:1> count += 1
irb(main):033:1> end
=> nil
irb(main):034:0> count
=> 3
So, I don't know. Of course, if the requirement is to get 2 from the
above situation, adding the length is better.
Also of note: the block versions of scan are slower because
they have to call a block for each match.
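A rough way to see the difference between the two forms of scan is sketched below. The iteration count and the report labels are arbitrary choices made for this illustration, not the ones used elsewhere in the thread, and the timings will vary by Ruby version and machine.

```ruby
require 'benchmark'

s = "you like to play with your yo-yo"
n = 10_000 # arbitrary; far smaller than TIMES in the thread

# Non-block form: scan builds an array of matches, then we ask for its size.
array_count = s.scan("yo").size

# Block form: scan yields once per match and the block counts by hand.
block_count = 0
s.scan("yo") { block_count += 1 }

puts "array form: #{array_count}, block form: #{block_count}"

Benchmark.bm(12) do |x|
  x.report("scan.size")  { n.times { s.scan("yo").size } }
  x.report("scan block") { n.times { c = 0; s.scan("yo") { c += 1 } } }
end
```

Both forms return the same count; only the per-match work differs.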
I think I've read that the String#index method uses Rabin-Karp. It
would be interesting to compare this to a Boyer-Moore implementation.
Of course it will depend on the input data, if it's near the best or
worst case for each, but anyway.
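For anyone who wants to try that comparison, here is a minimal pure-Ruby sketch of Boyer-Moore-Horspool, a simplified Boyer-Moore variant that uses only the bad-character shift. It is not the BoyerMoore class benchmarked elsewhere in this thread, and the method name horspool_count is made up for this example. It works on bytes and counts non-overlapping matches.

```ruby
# Hypothetical helper: Boyer-Moore-Horspool count of non-overlapping matches.
def horspool_count(text, pattern)
  t   = text.bytes
  pat = pattern.bytes
  m   = pat.length
  return 0 if m.zero? || m > t.length

  # Bad-character table: for every byte of the pattern except the last,
  # how far the window may move when that byte sits under the window's
  # last position. Bytes not in the pattern allow a full shift of m.
  shift = Hash.new(m)
  (0...m - 1).each { |i| shift[pat[i]] = m - 1 - i }

  count = 0
  pos   = 0
  while pos <= t.length - m
    # Compare the window against the pattern from right to left.
    j = m - 1
    j -= 1 while j >= 0 && t[pos + j] == pat[j]
    if j < 0
      count += 1
      pos += m # jump past the whole match, so matches never overlap
    else
      pos += shift[t[pos + m - 1]]
    end
  end
  count
end

puts horspool_count("you like to play with your yo-yo", "yo")
puts horspool_count("yoyoyoyo", "yoyo")
```

With a two-character pattern such as "yo" the shifts are tiny, so a Ruby-level loop like this is unlikely to beat the C-backed String#index; longer patterns give the skip table more room to help.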
Jesus.
···
On Thu, Jun 24, 2010 at 6:05 PM, Robert Klemme <shortcutter@googlemail.com> wrote:
...
I'm looking for non-overlapping matches (so a 2 in your example).
I modified your code to do this for me like you showed and it works
fine. I was thinking of trying a Boyer-Moore implementation, but I
suspect if I implement this manually in Ruby it will be much slower.
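The modified code was not posted. A minimal sketch of what the String#index loop might look like once wrapped in a method follows; the name count_substring and the overlap: keyword are assumptions made for this illustration.

```ruby
# Hypothetical helper: counts occurrences of sub in str with String#index.
# overlap: false resumes the search after the whole match (non-overlapping);
# overlap: true resumes one character after the start of the match.
def count_substring(str, sub, overlap: false)
  return 0 if sub.empty? # an empty pattern would otherwise match at every offset

  step  = overlap ? 1 : sub.length
  count = 0
  index = 0
  while (index = str.index(sub, index))
    count += 1
    index += step
  end
  count
end

puts count_substring("yoyoyoyo", "yoyo")                # 2, non-overlapping
puts count_substring("yoyoyoyo", "yoyo", overlap: true) # 3, overlapping
```

For the string in the original question, both modes give 4, since "yo" cannot overlap itself.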
I've just run some benchmarks with strscan, and it's at least in the
same ballpark as the other approaches, unless you're on Rubinius, but
then all string processing is really slow on that anyway.
                  user     system      total        real
scan         18.050000   0.000000  18.050000 ( 18.051000)
scan ++      35.046000   0.000000  35.046000 ( 35.046000)
scan re      17.807000   0.000000  17.807000 ( 17.807000)
scan re ++   34.086000   0.000000  34.086000 ( 34.085000)
while        22.089000   0.000000  22.089000 ( 22.089000)
strscan      29.538000   0.000000  29.538000 ( 29.538000)
boyer_moore   0.005000   0.000000   0.005000 (  0.004000)
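The strscan code behind these numbers was not posted, so the following is only a guess at what a StringScanner-based counter could look like. The method name strscan_count is hypothetical.

```ruby
require 'strscan'

# Hypothetical helper: counts non-overlapping occurrences using StringScanner.
def strscan_count(str, sub)
  return 0 if sub.empty?

  scanner = StringScanner.new(str)
  pattern = Regexp.new(Regexp.escape(sub)) # match the substring literally
  count = 0
  # scan_until moves the scan pointer to the end of the next match,
  # or returns nil once no further match exists.
  count += 1 while scanner.scan_until(pattern)
  count
end

puts strscan_count("you like to play with your yo-yo", "yo")
```

Because the scanner resumes after each match, this counts non-overlapping occurrences, the same as String#scan.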
$ jruby -v --server --fast yomark.rb
jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java HotSpot(TM) Server VM 1.6.0_20) [i386-java]
yobench.rb:50 warning: Useless use of a variable in void context.
Rehearsal -----------------------------------------------
scan         17.340000   0.000000  17.340000 ( 17.154000)
scan ++      23.986000   0.000000  23.986000 ( 23.987000)
scan re      15.170000   0.000000  15.170000 ( 15.169000)
scan re ++   22.805000   0.000000  22.805000 ( 22.806000)
while        12.050000   0.000000  12.050000 ( 12.050000)
strscan      31.396000   0.000000  31.396000 ( 31.396000)
boyer_moore   0.010000   0.000000   0.010000 (  0.010000)
------------------------------------ total: 122.756999sec

                  user     system      total        real
scan         15.201000   0.000000  15.201000 ( 15.201000)
scan ++      23.758000   0.000000  23.758000 ( 23.758000)
scan re      14.770000   0.000000  14.770000 ( 14.770000)
scan re ++   22.455000   0.000000  22.455000 ( 22.455000)
while        12.182000   0.000000  12.182000 ( 12.182000)
strscan      24.497000   0.000000  24.497000 ( 24.497000)
boyer_moore   0.002000   0.000000   0.002000 (  0.002000)
···
On Thu, Jun 24, 2010 at 1:16 PM, Michael Fellinger <m.fellinger@gmail.com> wrote:
...
FYI, a large part of the overhead here is probably the Java calls,
which are a bit slower than Ruby to Ruby calls (plus it's decoding the
"yo" string to UTF-16 each call). For a larger string and fewer calls,
the pure Java BoyerMoore performance would likely benchmark a lot
better than this.
- Charlie
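The point about per-call overhead can be illustrated without the Java class, which is not shown in this thread. The sketch below uses plain String#scan instead, and the repeat factor of 10_000 is an arbitrary choice. It compares many searches of the short string against one search of a long string holding the same text.

```ruby
require 'benchmark'

s   = "you like to play with your yo-yo"
big = s * 10_000 # the same text repeated into one long string

many_small = 0
one_big    = 0

Benchmark.bm(14) do |x|
  # 10_000 separate calls, each paying the fixed call overhead.
  x.report("10_000 calls") { 10_000.times { many_small += s.scan("yo").size } }
  # A single call over the long string, paying that overhead once.
  x.report("1 call")       { one_big = big.scan("yo").size }
end

puts "many small: #{many_small}, one big: #{one_big}"
```

Both approaches find the same number of matches, so any timing gap comes from the fixed cost of each call rather than from the search itself.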
···
On Thu, Jun 24, 2010 at 2:48 PM, <brabuhr@gmail.com> wrote:
that wasn't the right one
x.report 'boyer_moore' do
  TIMES.times do
    count = BoyerMoore.match("yo", s).size
    check count
  end
end

                  user     system      total        real
scan         31.240000   0.020000  31.260000 ( 31.264699)
scan ++      64.000000   0.860000  64.860000 ( 64.865223)
scan re      31.570000   0.020000  31.590000 ( 31.581045)
scan re ++   64.180000   0.980000  65.160000 ( 65.401667)
while        26.580000   0.030000  26.610000 ( 26.757658)
strscan      28.730000   0.030000  28.760000 ( 28.831860)
Unfortunately, I do not have 1.9.x on this machine at the moment.
···
On Tue, Jun 29, 2010 at 3:13 PM, Charles Oliver Nutter <headius@headius.com> wrote:
...