I have a little script that takes a list of keyword sets. Each set has exactly
two keywords, and for each set the script creates a regular expression
like this:
Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
Then I match it against a string that contains a long text fetched from a web page.
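To make the pattern concrete, here is a small sketch with made-up keywords (`foo` and `bar` are placeholders, not values from my database). Note that inside a double-quoted Ruby string `\.` is just `.`, so the resulting pattern is `key1.*key2|key2.*key1`. The `Regexp.escape` part is an optional precaution I am adding for illustration, in case a keyword contains regex metacharacters.

```ruby
# Hypothetical keywords, for illustration only.
key1 = "foo"
key2 = "bar"

# Same construction as above; "\." in a double-quoted string is a plain ".".
r = Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
puts r.source   # foo.*bar|bar.*foo

# Optional: escape the keywords so characters like "+" or "?" match literally.
k1 = Regexp.escape(key1)
k2 = Regexp.escape(key2)
safe = Regexp.new("#{k1}.*#{k2}|#{k2}.*#{k1}")

# =~ returns the index of the first match, or nil when nothing matches.
hit  = ("xx bar yy foo zz" =~ r)   # second alternative matches at index 3
miss = ("only foo here" =~ r)      # nil, because "bar" never appears
puts hit.inspect
puts miss.inspect
```

Since `.` does not match a newline by default, both keywords have to appear on the same line of the text for either alternative to match.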
Here is a more complete pseudo-code:
#########################################
long_text = get_web_page(url)
keyword_hash = load_keyword_array_from_database
keyword_hash.each_pair { |id, value|
  key1 = value[0]
  key2 = value[1]
  r = Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
  return id if long_text =~ r
}
return -1
###########################################
Now this code works perfectly. The problem is that keyword_hash has more
than 300 elements, and running this code can take between 50 and 120 seconds
per page. Since I am processing more than 1000 pages with this code, it takes
forever.
I solved this problem by replacing the regular expression match with
r1 = Regexp.new("#{key1}\.*#{key2}")
r2 = Regexp.new("#{key2}\.*#{key1}")
return id if long_text =~ r1 or long_text =~ r2
I simply moved the "or" outside the regular expression, and the time per page
dropped from 50~120 seconds to about 0.40 seconds.
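For anyone who wants to try the comparison, here is a minimal, self-contained sketch of the two variants. Everything in it is an assumption made for illustration: the keyword pairs and the text are synthetic, and `find_combined` and `find_split` are names I made up for this example, not my real code.

```ruby
require 'benchmark'

# Synthetic stand-ins for the real data: 300 keyword pairs that never match,
# and a text made of a few hundred short lines.
keyword_hash = {}
300.times { |i| keyword_hash[i] = ["alpha#{i}", "omega#{i}"] }
long_text = "lorem ipsum dolor sit amet alpha omega\n" * 500

# Variant 1: the alternation lives inside a single regular expression.
def find_combined(text, keywords)
  keywords.each_pair do |id, (key1, key2)|
    r = Regexp.new("#{key1}.*#{key2}|#{key2}.*#{key1}")
    return id if text =~ r
  end
  -1
end

# Variant 2: two regular expressions joined by a Ruby-level "or".
def find_split(text, keywords)
  keywords.each_pair do |id, (key1, key2)|
    r1 = Regexp.new("#{key1}.*#{key2}")
    r2 = Regexp.new("#{key2}.*#{key1}")
    return id if text =~ r1 || text =~ r2
  end
  -1
end

# Print user, system, total and real time for each variant.
Benchmark.bm(8) do |x|
  x.report("normal:") { find_combined(long_text, keyword_hash) }
  x.report("fast:")   { find_split(long_text, keyword_hash) }
end
```

Because the data is synthetic, this will not reproduce my numbers, and the gap between the two variants will likely depend on the Ruby version and on the text being searched.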
Using the Benchmark class and running some tests I got:
normal: 0 0
27.688000 0.015000 27.703000 ( 27.765000 )
fast:
0.469000 0.000000 0.484000 (0.954000)
The difference in speed is enormous.
Is this expected when using regular expressions?
Regards,
Horacio