[QUIZ] Banned Words (#9)

I removed the code to break the array into chunks whose size is a
power of two. This didn't make any significant difference in the
speed of the algorithm, but it does considerably simplify the code.
The method 'findBannedWords' goes away, and 'run' becomes

  def run()
    if @words.empty?
      []
    else
      findBanned(@words)
    end
  end

(And the comments on the other two methods about the size of their
input being a power of two also go away.)

Wayne

James Edward Gray II wrote:

···

On Nov 26, 2004, at 3:20 PM, Hal Fulton wrote:

Maybe something like text.split.include?(word)

Yeah, that's probably better, if a little less efficient (I assume).

Probably is less efficient.

Also it's not quite right, as it doesn't ignore punctuation.
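A tiny illustration of both points, using made-up sample text: plain substring matching gives false positives, and a bare split still misses words wrapped in punctuation or different case.

```ruby
# Made-up examples: substring matching gives false positives, and a bare
# split misses words wrapped in punctuation or different case.
"The bass player was great.".include?("ass")        # => true  (false positive)
"The bass player was great.".split.include?("ass")  # => false (whole words only)
"Darn!".split.include?("darn")                      # => false (punctuation/case leak)
```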

Hal

Hi James,

This "Divide and Conquer" approach is very effective when only a few
words are banned, but I believe there are better approaches for many
banned words...Anyone try anything closer to the suggested
10% blocked from the quiz?

I agree that if we know that 10% of the words are banned, we can do
better than this. If we have 1000 words with 10% banned, there's little
point in calling clean? on 1000, then the first 500, the last 500,
250, 250, 250, 250, 125, ... since there's a high probability that clean?
will return false on almost all of these. Since 10% is 1 in 10, it makes
sense to skip these larger chunks, and start with calling clean? on about
10 words, as there's a roughly equal chance that clean? will return true
as there is that it returns false, so we maximize the amount we learn by
calling clean? with about 10 words.
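A quick sketch of this trade-off, treating each word as independently banned with probability r = 0.1 (only an approximation, since the real draw is without replacement):

```ruby
# P[chunk of k words is clean] under an independence assumption.
r = 0.10
p_clean = ->(k) { (1 - r)**k }

(1..15).each { |k| printf("chunk of %2d: P[clean] = %.3f\n", k, p_clean.(k)) }
# Under this assumption the exact coin-flip point is nearer 7 words
# (0.9**7 is about 0.478) than 10, but both are in the same neighborhood,
# and either beats calling clean? on 500-word chunks.
```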

I've changed my original algorithm to start with chunks of 10, and then proceed
as before: if clean? on a chunk of 10 returns false, subdivide into 5 and 5,
etc. (The one changed routine is copied below.) This results in a moderate
improvement:

Original algorithm
Runs: 10
Words: 1000
Banned Words: 100
Average
  Seconds: 51.9328
  Mails: 596
Verified: 10/10
Rebuilt Banned Words

Start with chunks of 10 words
Runs: 10
Words: 1000
Banned Words: 100
Average
  Seconds: 50.29
  Mails: 514
Verified: 10/10
Rebuilt Banned Words

Of course, this requires knowing in advance the percentage of banned words.
If we didn't know this in advance we could try an adaptive algorithm that
figures out the percentage as it's running. Perhaps first make one pass
dividing by two each time: 1000, 500, 250, 125, 63, 32, 16, 8, 4, 2, 1.
(Starting the chunk at 0 each time: [0...1000], [0...500], etc.)
If there really are 10% banned words, then maybe clean? of 16 would be false,
but clean? of 8 would be true. From this, we _guess_ that the percentage of
banned words is between 1/16 and 1/8. We average these to get 0.09375 or 1
in every 10.6666. Rounding this to 1 in 11, we pick an initial chunk size
of 11, and do as in my algorithm above, but with chunks of 11. But we also
keep track of the percentage of banned words we've seen so far, and if the
corresponding 1-in-N figure moves outside the range of 10.5 to 11.5, we
adjust the chunk size down or up accordingly.

So, we quickly make a guess at the percentage, then as the algorithm runs we
improve our guess, and pick the corresponding chunk size.
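The probing pass described above can be sketched as follows. Here `filter` and `clean?` are assumed to match the quiz API, and plain integer halving (1024, 512, ...) stands in for the rounded-up sequence above:

```ruby
# Test prefixes of halving size until clean? first returns true, then
# average the two bracketing rates to guess the banned-word percentage.
def guess_rate(words, filter)
  size = words.size
  prev = nil
  while size >= 1
    if filter.clean?(words[0...size].join(' '))
      # clean at `size` but dirty at `prev`: the banned-word rate lies
      # between 1/prev and 1/size, so guess their average
      return prev ? (1.0 / prev + 1.0 / size) / 2 : 0.0
    end
    prev = size
    size /= 2
  end
  1.0 # even a single word was dirty: every probe failed
end

# With clean? flipping between sizes 16 and 8, the guess is
# (1/16 + 1/8) / 2 = 0.09375, i.e. about 1 in 10.67, so chunk size 11.
```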

Adaptive algorithms like this sometimes need damping, so that the chunk size
doesn't change wildly if we happen to hit a run of banned words, followed by
a run of non-banned words. But I'm not sure how much damping is needed in
this particular case, as the continually increasing denominator in our
percentage calculations will provide some damping.
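One simple form of damping is an exponential moving average over the observed rate; ALPHA below is an assumed smoothing factor, not something from the original post.

```ruby
# Damp the running banned-word rate with an exponential moving average,
# so one unlucky cluster of banned words does not swing the chunk size
# wildly. ALPHA is an assumed smoothing factor.
ALPHA = 0.2

def update_rate(current_rate, chunk_size, banned_in_chunk)
  observed = banned_in_chunk.to_f / chunk_size
  (1 - ALPHA) * current_rate + ALPHA * observed
end

rate = 0.10                      # initial guess: 1 banned word in 10
rate = update_rate(rate, 10, 5)  # an unlucky chunk: 5 of 10 banned
# rate is now about 0.18: pulled toward 0.5, but only by ALPHA of the gap
```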

Wayne

···

============================================

  def run()
    if @words.empty?
      []
    else
      biggestChunkSize = 10 # Use this for 10%
      aBanned = []
      iStart = 0
      iEnd = iStart + biggestChunkSize
      while iEnd <= @words.size
        aBanned += findBanned(@words[iStart...iEnd])
        iStart = iEnd
        iEnd += biggestChunkSize
      end
      if iStart < @words.size
        aBanned += findBanned(@words[iStart..-1])
      end
      aBanned
    end
  end
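As an aside, the same chunking loop can be written with Ruby's Enumerable#each_slice, which handles the final partial chunk automatically; a sketch assuming the findBanned method from the solution:

```ruby
def run()
  return [] if @words.empty?
  # each_slice yields chunks of 10 plus a final shorter chunk if needed
  @words.each_slice(10).flat_map { |chunk| findBanned(chunk) }
end
```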

I was trying to update your code for the quiz archives based on this message and got confused. Would you please repost your solution as it stands now?

Thanks.

James Edward Gray II

···

On Nov 28, 2004, at 5:36 PM, Wayne Vucenic wrote:

I removed the code to break the array into chunks whose size is a
power of two.
[snip]

Hal Fulton wrote:

[snip]
Also it's not quite right, as it doesn't ignore punctuation.

What about
  ALL_WORDS = /\b[^\s.,:;_!?\/&()"-]+\b/
  text.scan(ALL_WORDS).map{|x| x.downcase}.include?(word)
Even less efficient, but I guess it does always work.
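For illustration, Markus's pattern in action; the sample text and the word checked here are made up:

```ruby
ALL_WORDS = /\b[^\s.,:;_!?\/&()"-]+\b/
text = "Well, DARN it; that darn cat!"
words = text.scan(ALL_WORDS).map { |x| x.downcase }
# words == ["well", "darn", "it", "that", "darn", "cat"]
words.include?("darn")  # => true, despite case and punctuation
```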

Markus

···

On Nov 26, 2004, at 3:20 PM, Hal Fulton wrote:

I tried to implement an algorithm that is perfect regarding the chunk size given the percentage information. This is described in

http://ruby.brian-schroeder.de/quiz/detect_words/content/content.html
or prettier
http://ruby.brian-schroeder.de/quiz/detect_words/content.ps

The problem is that this algorithm was not significantly better than always dividing into three parts. (Surprisingly, it was even slower.) Maybe my math is wrong there; if you spot the bug, it would be very interesting.

Regards,

Brian

···

On Mon, 29 Nov 2004 11:08:24 +0900 Wayne Vucenic <nightphotos@gmail.com> wrote:

Hi James,

I agree that if we know that 10% of the words are banned, we can do
better than this.
[snip]

--
Brian Schröder
http://www.brian-schroeder.de/

Gladly!

Wayne

···

On Mon, 29 Nov 2004 23:31:54 +0900, James Edward Gray II <james@grayproductions.net> wrote:

Would you please repost your solution as it stands now?

----------------------------------------------------------

class YourAlgorithm < RQ9Algorithm
  # Returns an array containing all banned words from @words
  def run()
    if @words.empty?
      []
    else
      findBanned(@words)
    end
  end

  # Returns an array containing all banned words from aWords
  # aWords.size is > 0
  def findBanned(aWords)
    if aWords.size == 1
      @filter.clean?(aWords[0]) ? [] : aWords
    elsif @filter.clean?(aWords.join(' '))
      []
    else
      iSplit = aWords.size / 2
      if @filter.clean?(aWords[0...iSplit].join(' '))
        # There is at least one banned word in 0..-1, but not in 0...iSplit,
        # so there must be one in iSplit..-1
        findBannedThereIsOne(aWords[iSplit..-1])
      else
        # From the test above we know there is a banned word in 0...iSplit
        findBannedThereIsOne(aWords[0...iSplit]) +
          findBanned(aWords[iSplit..-1])
      end
    end
  end

  # Returns an array containing all banned words from aWords
  # aWords.size is > 0
  # Our caller has determined there is at least one banned word in aWords
  def findBannedThereIsOne(aWords)
    if aWords.size == 1
      # Since we know there is at least one banned word, and since there is
      # only one word in the array, we know this word is banned without
      # having to call clean?
      aWords
    else
      iSplit = aWords.size / 2
      if @filter.clean?(aWords[0...iSplit].join(' '))
        # There is at least one banned word in 0..-1, but not in 0...iSplit,
        # so there must be one in iSplit..-1
        findBannedThereIsOne(aWords[iSplit..-1])
      else
        # From the test above we know there is a banned word in 0...iSplit
        findBannedThereIsOne(aWords[0...iSplit]) +
          findBanned(aWords[iSplit..-1])
      end
    end
  end
end

Brian Schröder wrote:

I tried to implement an algorithm that is perfect regarding the chunk size given the percentage information.
[snip]

I believe that you oversimplified the problem, and therefore your math is wrong (or, rather, a too coarse approximation, as your results prove).

From your paper:
N... Number of Words.
n... Number of banned words.
r:=P[word w is banned]=n/N
p:=P[slice of size x passes] = (1-r)^x.

Your formula for p ((1-r)^x) assumes that the probability r is constant,
while in reality it changes with every word you put into your slice.

The probability of a slice of size x passing is the same as the probability
that none of the x words you randomly pick is banned.
Now, while you pick (you pick only non-banned words, because you immediately
stop if you get a banned one), the ratio banned/total
for the _remaining_ set constantly changes.

You can instantly see that your formula cannot be exact if you
try to calculate the probability of a slice of size N passing.
Your formula gives (1-n/N)^N, which is probably quite small, but still
greater than zero. The exact probability is obviously zero (assuming n > 0).

Additionally, the probability r changes every time you find a non-banned slice. If, for example, you test 500 out of 1000 words, and the slice passes, the probability that each remaining word is banned doubles.
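The exact probability Florian describes is hypergeometric; a sketch contrasting it with the (1-r)^x approximation from the paper, using the symbols N, n, x as defined above:

```ruby
# Probability that a slice of x words drawn without replacement from
# N words (n of them banned) is clean: C(N-n, x) / C(N, x), versus the
# (1 - n/N)**x approximation that treats r as constant.
def p_clean_approx(n_total, n_banned, x)
  (1.0 - n_banned.to_f / n_total)**x
end

def p_clean_exact(n_total, n_banned, x)
  # running product avoids big binomial coefficients
  (0...x).reduce(1.0) { |p, i| p * (n_total - n_banned - i) / (n_total - i) }
end

p_clean_exact(1000, 100, 1000)   # => 0.0 exactly, as argued above
p_clean_approx(1000, 100, 1000)  # tiny, but still greater than zero
```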

greetings, Florian Pflug

Hello Florian,

thank you for taking the time to think about my ramblings. I'd love to take
some time to think more about this, but I don't think I'll have it. I'm sure
to have learned something from you here.

best regards,

Brian

···

On Thu, 2 Dec 2004 04:42:46 +0900 "Florian G. Pflug" <fgp@phlo.org> wrote:

Brian Schröder wrote:
> I tried to implement an algorithm that is perfect regarding the chunk size
> given the percentage information.
> [snip]

I believe that you oversimplified the problem, and therefore your math
is wrong (or, rather, a too coarse approximation, as your results prove).
[snip]

--
Brian Schröder
http://www.brian-schroeder.de/