Hello,
What is the best approach to searching a string for another string?
For instance, I have:
url1 = 'http://www.url.com'
url2 = 'http://www.url.com/page'
If part of url1 is in url2, like above, I'd like to declare it a
match. I'm sure this happens using a regular expression, but my
experience is limited with them.
The other problem is that I'm not going to be looking for just one
url1, but I have an entire database table full of those to compare to
an entire database table of url2.
Any thoughts on approaching this problem are appreciated.
Thanks
Clint
I can't think of why that wouldn't work. Thank you.
Clint
···
On 4/4/06, dblack@wobblini.net <dblack@wobblini.net> wrote:
Hi --
On Wed, 5 Apr 2006, Clint Pidlubny wrote:
> Hello,
>
> What is the best approach to searching a string for another string?
>
> For instance, I have:
>
> url1 = 'http://www.url.com'
> url2 = 'http://www.url.com/page'
>
> If part of url1 is in url2, like above, I'd like to declare it a
> match. I'm sure this happens using a regular expression, but my
> experience is limited with them.
>
> The other problem is that I'm not going to be looking for just one
> url1, but I have an entire database table full of those to compare to
> an entire database table of url2.
>
> Any thoughts on approaching this problem are appreciated.
It's not a complete answer, but in case it helps: String has an
include? method:
url2.include?(url1) => true
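
For the table-against-table part of the question, a minimal sketch of how include? could be applied to every pair. The two arrays are made-up sample data standing in for the database tables, and substring_matches is a hypothetical helper name, not something from the thread:

```ruby
# Sample data standing in for the two database tables (assumed values).
sites = ['http://www.url.com', 'http://www.example.org']
pages = ['http://www.url.com/page', 'http://www.other.net/index.html']

# Returns [site, page] pairs where the site string occurs inside the page string.
# This is a plain nested loop, so the work grows with sites.size * pages.size.
def substring_matches(sites, pages)
  matches = []
  pages.each do |page|
    sites.each do |site|
      matches << [site, page] if page.include?(site)
    end
  end
  matches
end

matches = substring_matches(sites, pages)
matches.each { |site, page| puts "#{site} matches #{page}" }
```

Note that include? matches anywhere in the string. If a match should only count when url2 begins with url1, String#start_with? is the stricter check.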
David
--
David A. Black (dblack@wobblini.net)
Ruby Power and Light, LLC (http://www.rubypowerandlight.com)
"Ruby for Rails" chapters now available
from Manning Early Access Program! Ruby for Rails
Using String#include? is much faster than regexp matching. Here are some
benchmarks. I didn't test this with Oniguruma though, but I su
-- START CODE --
require 'benchmark'
url = "http://www.url.com"
url2 = "http://www.url.com/page"
Benchmark.bm{ |x|
x.report{ 100000.times { url2.include?( url ) } }
x.report{ 100000.times { url2 =~ /#{url}/ } }
}
-- END CODE ---
Benchmark Windows ruby 1.8.4 (2005-12-24) [i386-mswin32]
C:\source\projects\ruby\strings>ruby temp.rb
user system total real
0.080000 0.000000 0.080000 ( 0.080000)
1.722000 0.130000 1.852000 ( 1.873000)
Benchmark Linux ruby 1.8.4 (2005-12-24) [i686-linux]
zdennis@lima:~$ ruby-1.8.4 temp.rb
user system total real
0.100000 0.000000 0.100000 ( 0.119403)
1.570000 0.040000 1.610000 ( 1.760446)
Benchmark Linux ruby 1.8.3 (2005-06-23) [i486-linux]
zdennis@lima:~$ ruby temp.rb
user system total real
0.160000 0.030000 0.190000 ( 0.209436)
1.720000 0.080000 1.800000 ( 2.021754)
Benchmark Linux ruby 1.8.2 (2005-04-11) [i386-linux]
zdennis@jboss:~$ ruby temp.rb
user system total real
0.000000 0.000000 0.000000 ( 0.246239)
0.000000 0.000000 0.000000 ( 1.401049)
Zach
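
One side note on the regexp variant in the benchmark, as a small sketch rather than anything stated in the thread: interpolating a URL straight into a pattern leaves each "." as a wildcard, so the two approaches do not test quite the same thing. Regexp.escape closes that gap. The lookalike string is an invented example:

```ruby
url = 'http://www.url.com'

unescaped = /#{url}/                 # each "." matches any character
escaped   = /#{Regexp.escape(url)}/  # each "." matches only a literal dot

# Invented lookalike: an "X" where the first dot of the URL should be.
lookalike = 'http://wwwXurl.com/page'

unescaped_hit = !(lookalike =~ unescaped).nil?
escaped_hit   = !(lookalike =~ escaped).nil?
included      = lookalike.include?(url)

puts "unescaped regexp: #{unescaped_hit}"
puts "escaped regexp:   #{escaped_hit}"
puts "include?:         #{included}"
```

With the escaped pattern, the regexp agrees with include? on this input.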
···
On Wed, 5 Apr 2006, Clint Pidlubny wrote:
Excellent info Zach. Very relevant for me. I'll have thousands of
links to do this with.
Thanks again,
Clint
Hi,
$ cat str_inc_bench.rb
require 'benchmark'
url = "http://www.url.com/"
url2 = "http://www.url.com/page"
urlrx = /#{url}/
Benchmark.bm{ |x|
x.report{ 100000.times { url2.include?( url ) } }
x.report{ 100000.times { url2 =~ urlrx } }
x.report{ 100000.times { url2 =~ /#{url}/ } }
}
$ ruby -v str_inc_bench.rb
ruby 1.8.4 (2005-12-24) [i686-linux]
user system total real
0.070000 0.000000 0.070000 ( 0.071435)
0.130000 0.000000 0.130000 ( 0.130016)
1.130000 0.020000 1.150000 ( 1.182629)
So, regular expression matching itself is not that much slower than String#include?.
What makes "url2 =~ /#{url}/" slow is the creation of so many Regexp objects.
I just wanted to point that out.
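
Following on from that, a rough sketch of building the pattern once and reusing it for thousands of links. It uses Regexp.union, which escapes string arguments and combines them into a single alternation. The arrays are made-up sample data, not values from the thread:

```ruby
# Sample data standing in for the two tables (assumed values).
sites = ['http://www.url.com', 'http://www.example.org']
pages = ['http://www.url.com/page',
         'http://www.other.net/',
         'http://www.example.org/a/b']

# Build one Regexp up front, outside any loop, so no Regexp objects
# are created per comparison.
site_pattern = Regexp.union(sites)

# Pages that contain at least one of the sites.
matched_pages = pages.select { |page| page =~ site_pattern }

# For each page, the site text that matched (nil when none did).
matched_sites = pages.map { |page| page[site_pattern] }

puts matched_pages
```

Whether this beats a loop of include? calls depends on the data, so it is worth benchmarking against the real tables.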
Dominik
···
On Thu, 06 Apr 2006 01:23:31 +0200, Clint Pidlubny <clint.pidlubny@gmail.com> wrote:
Ruby has some very subtle optimizations for Regexps too:
# ruby 1.8.4 (2006-03-20) [powerpc-darwin8.5.0]
GC.disable
n = ObjectSpace.each_object(Regexp){}
def foo; /abc/ end
# Note I didn't call foo.
ObjectSpace.each_object(Regexp){} - n #=> 1
1000.times {foo}
ObjectSpace.each_object(Regexp){} - n #=> 1
It is always nice to see simple optimizations like this.
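
A different way to observe the same thing, as a small sketch using object identity instead of ObjectSpace counts. The method names are invented for illustration. It also shows the /o flag, which performs the interpolation only once:

```ruby
# A literal without interpolation: the same Regexp object on every call.
def literal_pattern
  /abc/
end

# A literal with interpolation: a new Regexp object on every call.
def interpolated_pattern(text)
  /#{text}/
end

# With /o the interpolation happens only on the first evaluation,
# and that Regexp is reused afterwards, whatever text is passed later.
def once_pattern(text)
  /#{text}/o
end

same_literal      = literal_pattern.equal?(literal_pattern)
same_interpolated = interpolated_pattern('abc').equal?(interpolated_pattern('abc'))
same_once         = once_pattern('abc').equal?(once_pattern('xyz'))

puts "literal reused:      #{same_literal}"
puts "interpolated reused: #{same_interpolated}"
puts "/o reused:           #{same_once}"
```

The /o behaviour is easy to trip over: later arguments are ignored, so it only suits a pattern that never changes during the run.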
Brian.