[QUIZ] Banned Words (#9)

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.grayproductions.net/ruby_quiz/

3. Enjoy!

···

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Fredrik Jagenheim

At work we discovered that they installed a spam filter that throws away e-mail
that it considers spam. It doesn't use any Bayes Filter, it simply checks for
certain words that it considers 'banned'. One word we discovered was 'sex',
which is a Swedish word for the number six. So the phrase "I'll be home at six
o'clock." will be classified as spam, thrown away and never seen.

The Ruby Quiz I propose is to figure out which words are banned. Since the
filter is a black box, we can only find out which words they are by sending
email through it. The real problem is to find out how to do it with as *few*
emails as possible.

Of course I don't want the ruby community to do a Denial-of-Service on my
employer's mailserver, so do it as a local filter; perhaps something like this:

  # by JEG2
  
  #
  # A simple class for managing a filter that prevents to use
  # of a given _banned_words_ list.
  #
  class LanguageFilter
    #
    # Create a new LanguageFilter object that will
    # disallow _banned_words_.
    # Accepts a list of words, arrays of words,
    # or a combination of the two.
    #
    def initialize( *banned_words )
      @banned_words = banned_words.flatten.sort
      @clean_calls = 0
    end
    
    # A count of the calls to <i>clean?</i>.
    attr_reader :clean_calls
    
    #
    # Test if provided _text_ is allowable by this filter.
    # Returns *false* if _text_ contains _banned_words_,
    # *true* if it does not.
    #
    def clean?( text )
      @clean_calls += 1
      @banned_words.each do |word|
        return false if text =~ /\b#{word}\b/
      end
      true
    end
    
    #
    # Verify a _suspect_words_ list against the actual
    # _banned_words_ list.
    # Returns *false* if the two lists are not identical or
    # *true* if the lists do match.
    # Accepts a list of words, arrays of words,
    # or a combination of the two.
    #
    def verify( *suspect_words )
      suspect_words.flatten.sort == @banned_words
    end
  end
  
  filter = LanguageFilter.new "six"
  
  filter.clean?("I'll be home at six.") # => false
  filter.clean?("Do not taunt the happy fun ball!") # => true
  
  filter.verify("ball") # => false
  filter.verify("six") # => true
  
  filter.clean_calls # => 2

So, figure out how to find the hidden words, using as few calls to
LanguageFilter#clean? as possible.

Which algorithms do you find are effective when many words are blocked (like
10%) and which are effective when very few are blocked(1 in 20000)?

All solutions should do better than this: :wink:

  dict = ["foo", "bar", "six", "baz"]
  filter = LanguageFilter.new "six"
  
  p dict - (dict.dup.delete_if { |word| not filter.clean?(word) })

A teaser. Or possibly, a goal to shoot for.

  3930 words, 29 banned words found
  Correct? true
  Time taken: 0.16201 seconds
  Calls: 427
  ...

James Edward Gray II

···

On Nov 26, 2004, at 10:11 AM, Ruby Quiz wrote:

So, figure out how to find the hidden words, using as few calls to
LanguageFilter#clean? as possible.

Hello Group,

I propose using

  def clean?( text )
    @clean_calls += 1
    @banned_words.each do |word|
      return false if text.include?(word)
    end
    true
  end

in the test filter, because my solution does not work with swedish characters. Trying it with wswedish gives errors when using the regex and no errors when using include?, so I think there must be something weird going on in the regex.

James, be shure to include this kind of words in your test-cases, or specify that it is not needed. I do not want to delve deeper into this.

Regards,

Brian

PS:
Test 1:
Testing with 121430 words and 3 banned words
Took 0.564248085021973 seconds
Used 567 calls
Verified as correct

Test 2
Testing with 121430 words and 126 banned words
Took 6.52887606620789 seconds
Used 5545 calls
Verified as correct

Test 3
Testing with 3930 words and 28 banned words
Took 0.0872080326080322 seconds
Used 497 calls
Verified as correct

···

On Sat, 27 Nov 2004 01:11:47 +0900 Ruby Quiz <james@grayproductions.net> wrote:

Of course I don't want the ruby community to do a Denial-of-Service on my
employer's mailserver, so do it as a local filter; perhaps something like this:
[snip]

--
Brian Schröder
http://www.brian-schroeder.de/

Which filter class is the latest? Or should we use the standard
filter, and cope with the fact that it will give false positivies?

I noticed that the current filter gives several false positives, not
only for 'special' characters, but also for apostrophes; say that the
forbidden word is Clara, I will get a positive on both Clara and
Clara's, making it impossible to use verify to check my list...

I decided to 'cheat' by cleaning up my dictionary file from all the
'forbidden' characters and now I average 280ish with a dictionary
containing 127367 words and 10 forbidden words.

//F

Btw, as a sidenote I can say that the spamfilter we have at work have
the exact same problem. It doesn't know swedish characters, so when it
looks at 'medsända' it finds 'meds' which is a forbidden word...

Hello Grouo

This was a short quiz. It was not possible for me to get a lot better than my first try. (Until I just now read the other solution :frowning: )

Even though it was not too complicated I again spend too much time on it. If I'd live in the states, I'd definitly sue James for stealing my time;)

The main idea is, that given a list of words, I check if it contains a word that is banned. If the whole chunk passes, I can forget about all words in this chunk and return the empty list.

Otherwise I split the list into equally sized chunks and continue with each chunk. If a chunk that contains only one word is tested and matches, I have found a match.

Here is the algorithm implemented for splitting into two slices.

def find_words(filter, words)
  return [] if words.empty? or filter.clean?(words.join(' '))
  return words if words.length == 1
  return find_words(filter, words[0...words.length / 2]) +
           find_words(filter, words[words.length / 2..-1])
end

The fastest algorithm of this type is the above for n = 3. Here is the generic version.

def find_words_n(filter, words, n = 2)
  return [] if words.empty? or filter.clean?(words.join(' '))
  return words if words.length == 1
  n = words.length if n > words.length
  slices = Array.new(n) { | i | i * words.length / n } << words.length
  slices[0..-2].zip(slices[1..-1]).inject([]) do | result, (low, high) |
                     result + find_words_n(filter, words[low...high], n)
  end
end

Then I tried to find the optimal solution on a more formal basis. My musings are can be read here:
http://ruby.brian-schroeder.de/quiz/detect_words/content/content.html
http://ruby.brian-schroeder.de/quiz/detect_words/content.ps

This resulted in a huge lookup table where I have the optimal split factor for a given number of words and expected probabilty.

After reading the solution by Wayne Vucenic I understood that one can save some calls, if we now that we had a match in wordlist w and no match in the first parts of the wordlist. The last part then neccessarily has to match.

This is implemented here

def find_words(filter, words, filter_mask = true)
  return [] if words.empty? or (filter_mask and filter.clean?(words.join(' ')))
  return words if words.length == 1
  result = find_words(filter, words[0...words.length / 2])
  result + find_words(filter, words[words.length / 2..-1], !result.empty?)
end

And for n > 2:

def find_words_n(filter, words, n = 2, filter_mask = true)
  return [] if words.empty? or (filter_mask and filter.clean?(words.join(' ')))
  return words if words.length == 1
  n = words.length if n > words.length
  slices = Array.new(n) { | i | i * words.length / n } << words.length
  slices = slices[0..-2].zip(slices[1..-1])
  result = slices[0..-2].inject([]) do | result, (low, high) | result + find_words_n(filter, words[low...high], n) end
  result + find_words_n(filter, words[slices[-1][0]...slices[-1][1]], n, !result.empty?)
end

Also I implemented a solution, that saves calls depending on the information that also superwords of banned words are banned. E.g. if sex is banned, sexual is also banned. It turned out that the specs did not include this behaviour, so the solution can be found under the unsuccessfull tries.

The full story (All sources, results, etc) is here:
http://ruby.brian-schroeder.de/quiz/detect_words/

regards,

Brian

···

--
Brian Schröder
http://www.brian-schroeder.de/

My solution:

1 Send mails with N words (see @steps)
2 delete all words of the clean mails
3 shuffle word list
4 repeat with a smaller N

···

###########
class JannisHardersAlgorithm < RQ9Algorithm
     def initialize()
         @steps = [110000,700000,300000,200000,110000,
                     70000,30000,20000,11000,7000,
                     3000,2000,1100,700,300,200,110,
                     70,30,20,11,7,3,2,1]
     end
     def run()
         wordsLeft = @words.dup

         @steps.each do |count|
             if count*6 > @words.length
                 next
             end

             position = 0
             while position < wordsLeft.length
                 testMail = wordsLeft[position .. position+count-1].join(" ")
                 if @filter.clean? testMail
                     (position .. position+count-1).each do |i|
                         wordsLeft[i]=nil
                     end
                 end
                 position += count
             end
             wordsLeft.compact! # delete nil words
             wordsLeft = wordsLeft.sort_by{rand} #shuffle
         end
         wordsLeft
     end
end
##########
Results :
Test: (15000,2,5,true) # 0.1%
Runs: 5
Words: 15000
Banned Words: 2
Average
   Seconds: 0.1430744
   Mails: 47
Verified: 5/5
Rebuilt Banned Words

Test: (15000,15,5,true) # 1%
Runs: 5
Words: 15000
Banned Words: 15
Average
   Seconds: 0.833412
   Mails: 260
Verified: 5/5
Rebuilt Banned Words

Test: (15000,150,5,true) # 10%
Runs: 5
Words: 15000
Banned Words: 150
Average
   Seconds: 9.6565792
   Mails: 1786
Verified: 5/5
Rebuilt Banned Words

Some comments on the quiz:

If we have no information on the banned word list at all,
so every subset of the dictionary would be equally likely,
then we could not do better than brute force, i.e. testing
every word. This is because every query gives us one bit
of information, and for n words in the dictionary, there
are 2^n possible subsets, which have to be distinguished
by n bits of information.

So we will have to use some kind of additional knowledge
on the banned word list. Several variants are possible.
We could get some info on the probability distribution of
the list, or we may have an upper (and maybe a lower) limit
on the number of words in it. Then it is also not clear
what "as few emails as possible" means. This could mean
the maximal number of emails, or the expected number of
emails, or the minimization of some other cost functional.

So, it is possible to choose. Here is the choice I made
and analyzed: I assume that I know the dictionary size
and the probability with which each word is chosen for
the banned word list. Here is the part of ruby code I
use to create the banned word list

   def initialize(size, prob)
     @size = size
     if prob <= 0 or prob >= 1 # sanity check
       throw "illegal probability in LanguageFilter::initialize"
     end

     @badWords = []
     @size.times {|k| @badWords.push(k) if rand < prob}
   end

Note, that I use not words but an index of the words (so I
can use numbers 1...n instead of a word list). My object is
to minimize the expected number of email calls in this
scenario. Now, if I have such a word list with n words, I
can send an email of size k. If it does not get blocked, my
problem reduces to the same minimization problem with a\
smaller size n - k. If the email does get blocked, my
problem reduces to two smaller problems, one of size k
and one of size n - k. In the word list of size k, I do
have slightly more information then before: I know that
there is at least one blocked word in it (this does not
make much of a difference for large k, but a _lot_ of
difference for small k). Now I can find recursive formulas
for the two kinds of expected values (available upon
request).

I coded most of this, but unfortunately RL caught up with
me, and I will need another day or so until I can present a
full solution.

Michael

···

--
Michael Ulm
R&D Team
ISIS Information Systems Austria
tel: +43 2236 27551-219, fax: +43 2236 21081
e-mail: michael.ulm@isis-papyrus.com
Visit our Website: www.isis-papyrus.com

Hmm, I'm bummed that regex doesn't work. I was trying to use something that would be worldly. Maybe one of the regex gurus will jump in here and correct it for us.

My complaint about String#include? is that banning the word "box" and then passing in a message containing the word "boxer" will break it, right?

James Edward Gray II

···

On Nov 26, 2004, at 2:38 PM, Brian Schröder wrote:

Hello Group,

I propose using

  def clean?( text )
    @clean_calls += 1
    @banned_words.each do |word|
      return false if text.include?(word)
    end
    true
  end

in the test filter, because my solution does not work with swedish characters. Trying it with wswedish gives errors when using the regex and no errors when using include?, so I think there must be something weird going on in the regex.

My results (I use a list of english words NOT random "words"):

10 Tests (different words banned) with:
   121430 Words
        3 Banned
average: 81.7 Mails

10 Tests (different words banned) with:
   121430 Words
      126 Banned
average: 2110.5 Mails

30 Tests (different words banned) with:
     3930 Words
       29 Banned
average: 356.9 Mails

10 Tests (different words banned) with:
   125282 Words(the complete list)
       80 Banned
average: 1426.5 Mails

PS: My code is VERY slow (I'm still trying to optimize it).

- --
Jannis Harder (Germany)

Ah the joys of getting a specification right, eh? :wink:

Test with whatever you are comfortable with. I believe everything that's been proposed so for has some disadvantages.

I will use what I put forth in the quiz when writing the summary, with a dictionary that has no special characters in it, not even apostrophes. In my opinion, this does not affect the focus of the quiz, which is to find/invent an algorithm.

Hope that helps.

James Edward Gray II

···

On Nov 27, 2004, at 9:19 AM, Fredrik Jagenheim wrote:

Which filter class is the latest? Or should we use the standard
filter, and cope with the fact that it will give false positivies?

Interesting. How did you come up with the step numbers?

Regards,

Brian

···

On Mon, 29 Nov 2004 07:33:35 +0900 Jannis Harder <jannis@harderweb.de> wrote:

My solution:

1 Send mails with N words (see @steps)
2 delete all words of the clean mails
3 shuffle word list
4 repeat with a smaller N

###########
class JannisHardersAlgorithm < RQ9Algorithm
     def initialize()
         @steps = [110000,700000,300000,200000,110000,
                     70000,30000,20000,11000,7000,
                     3000,2000,1100,700,300,200,110,
                     70,30,20,11,7,3,2,1]
     end
[snip]

--
Brian Schröder
http://www.brian-schroeder.de/

Michael Ulm wrote:
--snip--

I assume that I know the dictionary size
and the probability with which each word is chosen for
the banned word list. Here is the part of ruby code I
use to create the banned word list

  def initialize(size, prob)
    @size = size
    if prob <= 0 or prob >= 1 # sanity check
      throw "illegal probability in LanguageFilter::initialize"
    end

    @badWords =
    @size.times {|k| @badWords.push(k) if rand < prob}
  end

Note, that I use not words but an index of the words (so I
can use numbers 1...n instead of a word list). My object is
to minimize the expected number of email calls in this
scenario. Now, if I have such a word list with n words, I
can send an email of size k. If it does not get blocked, my
problem reduces to the same minimization problem with a\
smaller size n - k. If the email does get blocked, my
problem reduces to two smaller problems, one of size k
and one of size n - k. In the word list of size k, I do
have slightly more information then before: I know that
there is at least one blocked word in it (this does not
make much of a difference for large k, but a _lot_ of
difference for small k). Now I can find recursive formulas
for the two kinds of expected values (available upon
request).

I coded most of this, but unfortunately RL caught up with
me, and I will need another day or so until I can present a
full solution.

Just used my lunchbreak to implement the last details. I am
afraid the code is not very cleanly written, I totally ignored
speed so far, and I totally cheated in so far that I didn't use
words at all but numbers. Here is the output of two test runs.

6 words of 1000 found after 142 calls (expected 196.20329267521)
11 words of 1000 found after 208 calls (expected 274.947142087926)
27 words of 1000 found after 337 calls (expected 333.688742114315)
23 words of 1000 found after 287 calls (expected 382.652436479999)
51 words of 1000 found after 442 calls (expected 425.491312500002)
64 words of 1000 found after 494 calls (expected 462.720759999999)
73 words of 1000 found after 490 calls (expected 495.858497499998)
64 words of 1000 found after 489 calls (expected 528.357760000002)
104 words of 1000 found after 567 calls (expected 560.076936390001)
101 words of 1000 found after 562 calls (expected 583.697899999996)

11 words of 1000 found after 208 calls (expected 196.20329267521)
24 words of 1000 found after 304 calls (expected 274.947142087926)
23 words of 1000 found after 282 calls (expected 333.688742114315)
38 words of 1000 found after 374 calls (expected 382.652436479999)
48 words of 1000 found after 425 calls (expected 425.491312500002)
63 words of 1000 found after 468 calls (expected 462.720759999999)
76 words of 1000 found after 531 calls (expected 495.858497499998)
87 words of 1000 found after 543 calls (expected 528.357760000002)
94 words of 1000 found after 580 calls (expected 560.076936390001)
91 words of 1000 found after 542 calls (expected 583.697899999996)

================ here is the code, version 0.1 ================
# this holds classes and methods to "solve"
# ruby-talk quiz #9: Banned Words

···

#
# creates the language filter
class LanguageFilter
   def initialize(size, prob)
     @size = size
     if prob <= 0 or prob >= 1 # sanity check
       throw "illegal probability in LanguageFilter::initialize"
     end

     @badWords =
     @size.times {|k| @badWords.push(k) if rand < prob}
     @count = 0
   end

   def countReset
     @count = 0
   end

   def getBad
     @badWords
   end

   # checks if one of the numbers in the array testText is on the
   # badWords list
   def clean?(testText)
     sortedClean?(testText.sort)
   end

   # checks if one of the numbers in the array testText is on the
   # badWords list testText is assumed to be sorted
   def sortedClean?(testText)
     @count += 1
     binStart, binEnd = 0, @badWords.size
     testText.each do |current|
       binStart = binarySearch(current, binStart, binEnd)
       return false if @badWords[binStart] == current
     end
     return true
   end

   def test(testWords)
     [@count, testWords.sort == @badWords]
   end

   private

   def binarySearch(what, searchStart, searchEnd)
     return searchEnd if what > @badWords[searchEnd - 1]
     while searchEnd - searchStart > 1
       testSearch = (searchStart + searchEnd) / 2
       testWord = @badWords[testSearch]
       return testSearch if testWord == what
       if testWord < what
         searchStart = testSearch
       else
         searchEnd = testSearch
       end
     end
     return searchStart
   end
end

class QueryStrategy
   def initialize(prob)
     @prob = prob
     @qrob = 1.0 - prob
     @eNone = [0, 1] # expected value for interval without additional info
     @eOne = [0, 0] # expected values for interval with at least one bad word
     @kNone = [0, 1] # optimal interval query length in interval without info
     @kOne = [0, 0] # optimal interval query length in interval with one bad word
     @size = 2
   end

   def getStrategy(size, noInfo = true, quick = false)
     if size < @size
       return noInfo ? [@kNone[size], @eNone[size]] : [@kOne[size], @eOne[size]]
     end

     qToN = Math::exp(@size * Math::log(@qrob))
     @size.upto(size) do |n|
       # compute F_p(n)
       minExp = n.to_f
       minK = 1
       qToK = 1.0
       1.upto(n - 1) do |k|
         qToK *= @qrob
         thisExp = qToK * @eOne[n - k] + (1 - qToK) * (@eOne[k] + @eNone[n - k])
         if thisExp < minExp
           minK, minExp = k, thisExp
         end
       end
       @kOne[n] = minK
       @eOne[n] = 1.0 + minExp / (1 - qToN)

       # compute E_p(n)
       minExp = n.to_f
       minK = 1
       qToK = 1.0
       1.upto(n) do |k|
         qToK *= @qrob
         thisExp = @eNone[n - k] + (1 - qToK) * @eOne[k]
         if thisExp < minExp
           minK, minExp = k, thisExp
         end
       end
       @kNone[n] = minK
       @eNone[n] = 1 + minExp

       qToN *= @qrob
     end

     @size = size + 1;
     return noInfo ? [@kNone[size], @eNone[size]] : [@kOne[size], @eOne[size]]
   end

   def findWords(filter, size)
     @myWords =
     getWordsRecursively(filter, 0...size, true)
     @myWords
   end

private
   def getWordsRecursively(filter, range, noInfo)
     rangesize = range.end - range.begin
     return if rangesize == 0
     if rangesize == 1
       if noInfo
         @myWords.push(range.begin) unless filter.sortedClean?([range.begin])
       else
         @myWords.push(range.begin)
       end
     else
       thisStrat = getStrategy(rangesize, noInfo)
       testRange = range.begin...(range.begin + thisStrat[0])
       testArray =
       testRange.each {|k| testArray.push(k)}
       if filter.sortedClean?(testArray)
         getWordsRecursively(filter, (range.begin + thisStrat[0])...range.end, noInfo)
       else
         getWordsRecursively(filter, testRange, false)
         getWordsRecursively(filter, (range.begin + thisStrat[0])...range.end, true)
       end
     end
   end
end

# test
testsize = 1000
10.times do |level|
   testprob = 0.01 * (level + 1)

   myfilt = LanguageFilter.new(testsize, testprob)

   strat = QueryStrategy.new(testprob)
   testWords = strat.findWords(myfilt, testsize)
   number, found = myfilt.test(testWords)
   if found
     puts "#{testWords.size} words of #{testsize} found after #{number} calls (expected #{strat.getStrategy(testsize)[1]})"
   else
     puts "word list not found after #{number} calls"
   end
end

--
Michael Ulm
R&D Team
ISIS Information Systems Austria
tel: +43 2236 27551-219, fax: +43 2236 21081
e-mail: michael.ulm@isis-papyrus.com
Visit our Website: www.isis-papyrus.com

James Edward Gray II wrote:

···

On Nov 26, 2004, at 2:38 PM, Brian Schröder wrote:

Hello Group,

I propose using

  def clean?( text )
    @clean_calls += 1
    @banned_words.each do |word|
      return false if text.include?(word)
    end
    true
  end

in the test filter, because my solution does not work with swedish characters. Trying it with wswedish gives errors when using the regex and no errors when using include?, so I think there must be something weird going on in the regex.

Hmm, I'm bummed that regex doesn't work. I was trying to use something that would be worldly. Maybe one of the regex gurus will jump in here and correct it for us.

My complaint about String#include? is that banning the word "box" and then passing in a message containing the word "boxer" will break it, right?

Maybe something like text.split.include?(word)

Hal

Hello Jannis,

if you are using something like /usr/share/dict/wenglish would you mind sharing your test-code, such that we could compare the algorithms on a fixed basis?

Regards,

Brian

···

On Sat, 27 Nov 2004 07:07:46 +0900 Jannis Harder <jannis@harderweb.de> wrote:

My results (I use a list of english words NOT random "words"):

[snip]

--
Brian Schröder
http://ruby.brian-schroeder.de/

Ah, I thought this behaviour was wanted. If this is not wanted my whole optimization was in vain. :frowning:

Regards,

Brian

···

On Sat, 27 Nov 2004 06:10:55 +0900 James Edward Gray II <james@grayproductions.net> wrote:

On Nov 26, 2004, at 2:38 PM, Brian Schröder wrote:

> Hello Group,
>
> I propose using
>
> def clean?( text )
> @clean_calls += 1
> @banned_words.each do |word|
> return false if text.include?(word)
> end
> true
> end
>
> in the test filter, because my solution does not work with swedish
> characters. Trying it with wswedish gives errors when using the regex
> and no errors when using include?, so I think there must be something
> weird going on in the regex.

Hmm, I'm bummed that regex doesn't work. I was trying to use something
that would be worldly. Maybe one of the regex gurus will jump in here
and correct it for us.

My complaint about String#include? is that banning the word "box" and
then passing in a message containing the word "boxer" will break it,
right?

James Edward Gray II

--
Brian Schröder
http://www.brian-schroeder.de/

Below is my solution, formatted to use Jannis' testing classes.

The solution basically recursively calls clean? on the full array of
words, then on the first half and on the second half, then on the
first quarter, the second quarter, the third quarter, etc. Whenever
clean? returns true, it stops dividing that section of the array.
(This description is breadth-first, but the algorithm actually runs
depth-first. Breadth-first is easier to explain, and the final result
is the same.)

I then noticed that we don't really need to make all these calls to
clean?, because we can sometimes deduce what the results will be
without calling clean?. For example, if we call clean? on [0...8]
and the result is false, then we call clean? on [0...4] and the result
is true, then we know that clean? on [4...8] must return false, so we
don't need to make this call.

The ordinary recursion is done by the method findBanned(aWords). When
the above deduction tells us that clean? will return false, we instead
call the method findBannedThereIsOne(aWords). In the above example we
would call findBannedThereIsOne(aWords[4...8]), and it will skip
calling clean? on aWords[4...8].

(Yes, findBannedThereIsOne is an awkward name, but at least it's
better than findBannedAndWeKnowThereIsAtLeastOne)

When findBannedThereIsOne is called with an array with only one
element, it doesn't need to call clean? at all, because we know this
must be a banned word.

So that findBanned and findBannedThereIsOne only need to deal with
arrays whose size is a power of two, the main routine,
runYourAlgorithm, divides the input array into chunks of these sizes.
As I'm typing this, it occurs to me that this code might actually run
faster if runYourAlgorithm didn't do this. I'll give this a try, but
for now I'll post the version that does the chunking.

runYourAlgorithm also handles the case of an empty array, so the other
two routines don't need to handle it.

Here are the results with Jannis' testing classes. For the times in
seconds, keep in mind that these tests were run on an ancient 500MHz
Pentium III with other work going on at the same time.

Runs: 10
Words: 121430
Banned Words: 3
Average
  Seconds: 0.5561
  Mails: 76
Verified: 10/10
Rebuilt Banned Words

Runs: 10
Words: 121430
Banned Words: 126
Average
  Seconds: 254.4113
  Mails: 1948
Verified: 10/10
Rebuilt Banned Words

Runs: 10
Words: 3930
Banned Words: 29
Average
  Seconds: 12.5187
  Mails: 327
Verified: 10/10
Rebuilt Banned Words

Runs: 10
Words: 125282
Banned Words: 80
Average
  Seconds: 139.9058
  Mails: 1319
Verified: 10/10
Rebuilt Banned Words

Runs: 10
Words: 127367
Banned Words: 10
Average
  Seconds: 1.5328
  Mails: 210
Verified: 10/10
Rebuilt Banned Words

Wayne Vucenic
No Bugs Software
"Ruby and C++ Contract Programming in Silicon Valley"

···

======================================

class YourAlgorithm < RQ9Algorithm
  def run()
    runYourAlgorithm(@words)
  end

  # Returns an array containing all banned words from aWords
  def runYourAlgorithm(aWords)
    return [] if aWords.empty?

    # Compute the largest power of two <= aWords.size
    powerOfTwo = 1
    powerOfTwo *= 2 while powerOfTwo * 2 <= aWords.size

    # To simplify the logic in findBanned, we always call it with an
    # array whose size is a power of two.
    aBanned = findBanned(aWords[0...powerOfTwo])

    # If we didn't process all of aWords, recursively do the rest
    if powerOfTwo < aWords.size
      aBanned += runYourAlgorithm(aWords[powerOfTwo..-1])
    end

    aBanned
  end

  # Returns an array containing all banned words from aWords
  # aWords.size is a power of 2, and is > 0
  def findBanned(aWords)
    if aWords.size == 1
      @filter.clean?(aWords[0]) ? [] : aWords
    elsif @filter.clean?(aWords.join(' '))
      []
    else
      iSplit = aWords.size / 2
      if @filter.clean?(aWords[0...iSplit].join(' '))
        # There is at least one banned word in 0..-1, but not in 0...iSplit,
        # so there must be one in iSplit..-1
        findBannedThereIsOne(aWords[iSplit..-1])
      else
        # From the test above we know there is a banned word in 0...iSplit
        findBannedThereIsOne(aWords[0...iSplit]) +
findBanned(aWords[iSplit..-1])
      end
    end
  end

  # Returns an array containing all banned words from aWords
  # aWords.size is a power of 2, and is > 0
  # Our caller has determined there is at least one banned word in aWords
  def findBannedThereIsOne(aWords)
    if aWords.size == 1
      # Since we know there is at least one banned word, and since there is
      # only one word in the array, we know this word is banned without
      # having to call clean?
      aWords
    else
      iSplit = aWords.size / 2
      if @filter.clean?(aWords[0...iSplit].join(' '))
        # There is at least one banned word in 0..-1, but not in 0...iSplit,
        # so there must be one in iSplit..-1
        findBannedThereIsOne(aWords[iSplit..-1])
      else
        # From the test above we know there is a banned word in 0...iSplit
        findBannedThereIsOne(aWords[0...iSplit]) +
findBanned(aWords[iSplit..-1])
      end
    end
  end
end

Yeah, that's probably better, if a little less efficient (I assume).

James Edward Gray II

···

On Nov 26, 2004, at 3:20 PM, Hal Fulton wrote:

Maybe something like text.split.include?(word)

Hmm... "sin" and "single" have pretty different meanings. It would seem unfortunate to group them together.

James Edward Gray II

···

On Nov 26, 2004, at 5:49 PM, Brian Schröder wrote:

Ah, I thought this behaviour was wanted. If this is not wanted my whole optimization was in vain. :frowning:

I uploaded my testing classes with a brute-force example and the
dictionary I'm using (from http://wordlist.sourceforge.net/ , filtered):

http://www.harderweb.de/jannis/rq9.tar.gz

- --
Jannis Harder (Germany)

The solution basically recursively calls clean? on the full array of
words, then on the first half and on the second half, then on the
first quarter, the second quarter, the third quarter, etc. Whenever
clean? returns true, it stops dividing that section of the array.

Below is my solution, which works the same way.

This "Divide and Conquer" approach is very effective when only a few words are banned, but I believe there are better approaches for many banned words. Hopefully, we'll see one submitted...

I then noticed that we don't really need to make all these calls to
clean?, because we can sometimes deduce what the results will be
without calling clean?. For example, if we call clean? on [0...8]
and the result is false, then we call clean? on [0...4] and the result
is true, then we know that clean? on [4...8] must return false, so we
don't need to make this call.

Clever. I didn't think of this.

Here are the results with Jannis' testing classes.

[...]

Words: 121430
Banned Words: 3

[...]

Words: 121430
Banned Words: 126

[...]

Words: 3930
Banned Words: 29

[...]

Words: 125282
Banned Words: 80

[...]

Words: 127367
Banned Words: 10

I consider all of these tests to be in the "very few are blocked" category, from the quiz. Anyone try anything closer to the suggested 10% blocked from the quiz?

James Edward Gray II

#!/usr/bin/env ruby

···

On Nov 28, 2004, at 11:11 AM, Wayne Vucenic wrote:

#
# A simple class for managing a filter that prevents to use
# of a given _banned_words_ list.
#
class LanguageFilter
  #
  # Create a new LanguageFilter object that will
  # disallow _banned_words_.
  # Accepts a list of words, arrays of words,
  # or a combination of the two.
  #
  def initialize( *banned_words )
    @banned_words = banned_words.flatten.sort
    @clean_calls = 0
  end
  
  # A count of the calls to <i>clean?</i>.
  attr_reader :clean_calls
  
  #
  # Test if provided _text_ is allowable by this filter.
  # Returns *false* if _text_ contains _banned_words_,
  # *true* if it does not.
  #
  def clean?( text )
    @clean_calls += 1
    @banned_words.each do |word|
      return false if text =~ /\b#{word}\b/
    end
    true
  end
  
  #
  # Verify a _suspect_words_ list against the actual
  # _banned_words_ list.
  # Returns *false* if the two lists are not identical or
  # *true* if the lists do match.
  # Accepts a list of words, arrays of words,
  # or a combination of the two.
  #
  def verify( *suspect_words )
    suspect_words.flatten.sort == @banned_words
  end
end

# my algorithm
def isolate( list, test )
  if test.clean? list.join(" ")
    Array.new
  elsif list.size == 1
    list
  else
    left, right = list[0...(list.size / 2)], list[(list.size / 2)..-1]
    isolate(left, test) + isolate(right, test)
  end
end

# test code
words = ARGF.read.split " "
filter = LanguageFilter.new words.select { rand <= 0.01 }

start = Time.now
banned = isolate words, filter
time = Time.now - start

puts "#{words.size} words, #{banned.size} banned words found"
puts "Correct? #{filter.verify banned}"
puts "Time taken: #{time} seconds"
puts "Calls: #{filter.clean_calls}"
puts "Words:"
puts banned.map { |word| "\t" + word }

__END__