[QUIZ] Scrabble Stems (#12)

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.grayproductions.net/ruby_quiz/

3. Enjoy!

···

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Martin DeMello

In Scrabble parlance, a 'bingo stem' is defined as a set of six
letters that combines with a large fraction of the alphabet to anagram
to valid seven letter words (a 'bingo' is a move using all seven tiles
on your rack). For instance, one of the more prolific stems, SATIRE,
combines with twenty letters, A, B, C, D, E, F, G, H, I, K, L, M, N,
O, P, R, S, T, V, W (forming words like ASTERIA, BAITERS, RACIEST
...).

Write a program that, given a word list and a cutoff n, finds all 6
letter stems that combine with n or more letters, sorted in order of
how many distinct letters they combine with.

When talking of 6 letter stems, do I have to find all the 6 letter
stems that also have a meaning by themselves? (Like Satire for
example). Or may I find even nonsense stems?

···

On Fri, 17 Dec 2004 22:45:39 +0900, Ruby Quiz <james@grayproductions.net> wrote:

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

Gray Soft / Not Found

3. Enjoy!

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Martin DeMello

In Scrabble parlance, a 'bingo stem' is defined as a set of six
letters that combines with a large fraction of the alphabet to anagram
to valid seven letter words (a 'bingo' is a move using all seven tiles
on your rack). For instance, one of the more prolific stems, SATIRE,
combines with twenty letters, A, B, C, D, E, F, G, H, I, K, L, M, N,
O, P, R, S, T, V, W (forming words like ASTERIA, BAITERS, RACIEST
...).

Write a program that, given a word list and a cutoff n, finds all 6
letter stems that combine with n or more letters, sorted in order of
how many distinct letters they combine with.

Ruby Quiz wrote:

In Scrabble parlance, a 'bingo stem' is defined as a set of six
letters that combines with a large fraction of the alphabet to anagram
to valid seven letter words.

Write a program that, given a word list and a cutoff n, finds all 6
letter stems that combine with n or more letters, sorted in order of
how many distinct letters they combine with.

If this quiz is really just about Scrabble, can we assume the input word list will contain only 7-letter words?

If I filter my dictionary (2of4brif.txt) down to just 7-letter words, I can handle this task in less than an hour of wall-clock time on a 2GHz PC. If I allow all words with 7 or more letters, I don't know if I'll be able to finish the processing in under a week's time.

I've certainly learned a few things about Ruby performance tuning along the way.

···

--
Glenn Parker | glenn.parker-AT-comcast.net | <http://www.tetrafoil.com/&gt;

[Ruby Quiz <james@grayproductions.net>, 2004-12-17 14.45 CET]

In Scrabble parlance, a 'bingo stem' is defined as a set of six
letters that combines with a large fraction of the alphabet to anagram
to valid seven letter words (a 'bingo' is a move using all seven tiles
on your rack). For instance, one of the more prolific stems, SATIRE,
combines with twenty letters, A, B, C, D, E, F, G, H, I, K, L, M, N,
O, P, R, S, T, V, W (forming words like ASTERIA, BAITERS, RACIEST
...).

Write a program that, given a word list and a cutoff n, finds all 6
letter stems that combine with n or more letters, sorted in order of
how many distinct letters they combine with.

Here is a possible (I hope correct) solution. SATIRE combines only with 16
letters according to my dictionary.

DICT = "/usr/share/dict/words"
CUTOFF = ARGV[0].to_i

STEMS = {}

File.open(DICT) do |f|
  f.each do |word|
    word.chomp!
    next if word.length != 7
    word.downcase!
    letters = word.split(//).sort!
    uniques = letters.uniq
    word = letters.join
    uniques.each do |letter|
      stem = word.sub(/#{letter}/, "")
      (STEMS[stem] ||= {})[letter] = 1
    end
  end
end

result = STEMS.delete_if { |k,v| v.size < CUTOFF }.
    sort_by { |k,v| v.size }.
    reverse!.
    collect! { |k,v| [k, v.size] }

result.each do |stem, combining| puts "#{stem} #{combining}" end

by Martin DeMello

In Scrabble parlance, a 'bingo stem' is defined as a set of six
letters that combines with a large fraction of the alphabet to anagram
to valid seven letter words (a 'bingo' is a move using all seven tiles
on your rack). For instance, one of the more prolific stems, SATIRE,
combines with twenty letters, A, B, C, D, E, F, G, H, I, K, L, M, N,
O, P, R, S, T, V, W (forming words like ASTERIA, BAITERS, RACIEST
...).

Write a program that, given a word list and a cutoff n, finds all 6
letter stems that combine with n or more letters, sorted in order of
how many distinct letters they combine with.

Here's my solution. My earlier performance problems resulted from several very inefficient attempts at generating unique letter combinations, i.e. list all unique sets of "N letters taken 6 at a time". Reducing the problem to "7 letters taken 6 at a time" made that part trivial, but I kept plugging at the more general problem.

The dictionary I used was 2of4brif.txt, available as part of the 12Dicts package at Download 12dicts-4.0.zip (SCOWL (and friends))

On my 2GHz P4, I get 5319 stems with a cutoff of 3 in 4.062 seconds. A version optimized for 7-letter words (not shown) takes barely over 2 seconds.

If I swap line 47 in for line 48, so that all words of 7 letters or more are processed, I get 167,227 stems with a cutoff of 3 in 2196.453 seconds (~37 min.).

bingo1.rb (1.94 KB)

···

--
Glenn Parker | glenn.parker-AT-comcast.net | <http://www.tetrafoil.com/&gt;

I don't believe the stems meaning or lack thereof is significant.

James Edward Gray II

···

On Dec 18, 2004, at 6:16 AM, Giovanni Intini wrote:

When talking of 6 letter stems, do I have to find all the 6 letter
stems that also have a meaning by themselves? (Like Satire for
example). Or may I find even nonsense stems?

No, all six letter stems count. The "meaningful" rearrangement is just a
handy way to refer to the stem.

martin

···

Giovanni Intini <intinig@gmail.com> wrote:

When talking of 6 letter stems, do I have to find all the 6 letter
stems that also have a meaning by themselves? (Like Satire for
example). Or may I find even nonsense stems?

Ruby Quiz wrote:

In Scrabble parlance, a 'bingo stem' is defined as a set of six
letters that combines with a large fraction of the alphabet to anagram
to valid seven letter words.
Write a program that, given a word list and a cutoff n, finds all 6
letter stems that combine with n or more letters, sorted in order of
how many distinct letters they combine with.

If this quiz is really just about Scrabble, can we assume the input word list will contain only 7-letter words?

I filtered out words less than and more than seven letters.

If I filter my dictionary (2of4brif.txt) down to just 7-letter words, I can handle this task in less than an hour of wall-clock time on a 2GHz PC. If I allow all words with 7 or more letters, I don't know if I'll be able to finish the processing in under a week's time.

I've certainly learned a few things about Ruby performance tuning along the way.

Hmmm, I would say that it doesn't HAVE to take that long. My program runs in under 30 seconds. :wink:

James Edward Gray II

···

On Dec 18, 2004, at 3:42 PM, Glenn Parker wrote:

* Carlos <angus@quovadis.com.ar> [2004-12-19]:

DICT = "/usr/share/dict/words"

....

          stem = word.sub(/#{letter}/, "")

There seems to be a problem with german umlauts in words:

rq:16: premature end of regular expression: /üge/ (RegexpError)
        from rq:15:in `each'
        from rq:15
        from rq:8:in `each'
        from rq:8
        from rq:7:in `open'
        from rq:7

Julius

Carlos wrote:

Here is a possible (I hope correct) solution. SATIRE combines only with 16
letters according to my dictionary.

Very simmilar to my solution: (it finds 18 combining letters for SATIRE using the ENABLE2K word list)

hash = Hash.new {|h, k| h[k] = 0}
File.foreach(ARGV[0] || 'WORD.LST') do |line|
   line.strip!
   if line.size == 7
     letters = line.downcase.scan(/./).sort.join
     7.times do |i|
       hash[letters[0, i] + letters[(i+1)..-1]] |= 1 << (letters[i] - 97)
     end
   end
end

cutoff = (ARGV[1] || '15').to_i
count = {}
hash.each do |k, v|
   v = (v & 0x5555555) + ((v>>1) & 0x5555555)
   v = (v & 0x3333333) + ((v>>2) & 0x3333333)
   v = (v & 0xf0f0f0f) + ((v>>4) & 0xf0f0f0f)
   v = (v & 0x0ff00ff) + ((v>>8) & 0x0ff00ff)
   v = (v & 0x000ffff) + ((v>>16) & 0x000ffff)
   count[k] = v if v >= cutoff
end

count.keys.sort_by {|k| count[k]}.each do |letters|
   printf "%s: (%d) ", letters, count[letters]
   combi = hash[letters]
   26.times do |i|
     print((i+97).chr) if combi[i] == 1
   end
   puts
end

This is my solution for _meaningful_ stems (words in the dictionary):
#Usage: <Dictionary> <Stemsize> <Cutoff>

#On my machine:
#8728 stems
#13969 possible bingos
#1141 stems have more than 6 bingos
#stemsize is 6
#bingosize is 7
#brute force needs 3169957232 comparisons
#I need 226928 hash lookups
#22.270484 sec (searching bingos)
#6.235895 sec (reading dictionary)
#28.506379 sec (total)
#0.00326608375343721 sec / stem

stemsize = (ARGV[1] || 6).to_i
minbingos = (ARGV[2] || 6).to_i

alphabeth=("a".."z").to_a;
stems = []
bingowords = []

results = {}

start1 = Time.new
File.foreach(File.expand_path(ARGV[0]||"~/dict")) do |line|
     chomped = line.chomp.downcase
     stems << chomped if chomped.size == stemsize
     bingowords << chomped if chomped.size == stemsize+1
end

start2 = Time.new

sbingos = bingowords.map{|word| word.split("").sort.join}
sbingohash={}

sbingos.each_index do |index|
     sbingohash[sbingos[index]]=bingowords[index]
end

stems.each do |stem|

     bingosfound = [];
     xbingo = stem.split("")
     alphabeth.each do |char|
         stbingo = (xbingo+[char]).sort.join
         fbingo = sbingohash[stbingo];
         bingosfound << fbingo if fbingo
     end

     if bingosfound.size >= minbingos
         results[stem] = bingosfound

     end

end
out =""
results.to_a.sort_by{|e|-(e[1].size)}.each do |result|
     out << "#{result[0]} #{result[1].size} #{result[1].join","}\n"
end
done = Time.new;

puts "\nOrdered:\n\n"
puts out
puts "#{stems.size} stems\n#{bingowords.size} possible bingos\n#{results.size} stems have more than #{minbingos} bingos"
puts "stemsize is #{stemsize}\nbingosize is #{stemsize+1}"
puts "brute force needs #{stems.size*alphabeth.size*bingowords.size} comparisons"
puts "I need #{stems.size*alphabeth.size} hash lookups"
puts "#{(done-start2).to_s} sec (searching bingos)"
puts "#{(start2-start1).to_s} sec (reading dictionary)"
puts "#{(done-start1).to_s} sec (total)"
puts "#{((done-start1).to_f/stems.size).to_s} sec / stem"

Jannis Harder

"jp6iSZmkLp5ISZlEiW5C".unpack("m")[0].unpack("C*").map{|x|x.chr}.join.
unpack("B*")[0].scan(/.{24}/){i=7;$&.scan(/..../){print\
"\e[3#{i-=1};1;40m ";$&.each_byte{|z|print" #"[z-?0,1]*2}};puts"\e[0m"}

James Edward Gray II wrote:

Hmmm, I would say that it doesn't HAVE to take that long. My program runs in under 30 seconds. :wink:

Thanks, stripping it down to handle only 7-letter words made a big difference. It runs in well under 4 seconds now. :slight_smile:

···

--
Glenn Parker | glenn.parker-AT-comcast.net | <http://www.tetrafoil.com/&gt;

[Julius Plenz <usenet@plenz.com>, 2004-12-19 21.37 CET]

* Carlos <angus@quovadis.com.ar> [2004-12-19]:
> DICT = "/usr/share/dict/words"

....

> stem = word.sub(/#{letter}/, "")

Oops... no need to interpolate. This line should be

    stem = word.sub(letter, "")

as in Martin DeMello's solution.

Thanks.

My Solution for nonsense stems:

#Usage <Dict> <Stemsize> <Cutoff>

#6.965449 sec
#13969 bingos
#43953 stems found (without cutoff)
#202 stems found (with cutoff)
#0.000158474939139535 sec/stem
#6310 stems/sec
#0.000498636194430525 sec/bingo
#2005 bingos/sec

dict = ARGV[0]||"~/dict"
stemsize = (ARGV[1] || 6).to_i
cutoff = (ARGV[1] || 10).to_i
bingosize = stemsize+1
stems = {}
bingos = []
bingocount = 0
start = Time.new
File.foreach(File.expand_path(dict)) do |bingo|
     bingo.chomp!
     if bingo.size == bingosize
         bingocount+=1
         sbingo = bingo.split("").sort.join
         subingo = sbingo.split("").uniq
         subingo.each do |char|
             stem = sbingo.dup
             stem[sbingo.index(char),1]=""
             if stems[stem]
                 stems[stem][char]=1
             else
                 stems[stem]={char=>1}
             end
         end
     end
end
stems = stems.to_a
stemsize = stems.size
stems.delete_if{|x| x[1].size < cutoff}
stems.sort_by{|x|-x[1].size}.each do |stem|
     puts "#{stem[0]} #{stem[1].size} #{stem[1].keys.join}"
end
done = Time.new
puts "STATS",
"#{(done-start).to_f} sec",
"#{bingocount} bingos",
"#{stemsize} stems found (without cutoff)",
"#{stems.size} stems found (with cutoff)",
"#{((done-start).to_f)/stemsize} sec/stem",
"#{(stemsize/((done-start).to_f)).to_i} stems/sec",
"#{((done-start).to_f)/bingocount} sec/bingo",
"#{(bingocount/((done-start).to_f)).to_i} bingos/sec"

Jannis Harder

"jp6iSZmkLp5ISZlEiW5C".unpack("m")[0].unpack("C*").map{|x|x.chr}.join.
unpack("B*")[0].scan(/.{24}/){i=7;$&.scan(/..../){print\
"\e[3#{i-=1};1;40m ";$&.each_byte{|z|print" #"[z-?0,1]*2}};puts"\e[0m"}

From the original post it looked like you could have anything longer than 6.

Here's my solution. It doesn't really have any clever optimizations, but it runs in seconds.

James Edward Gray II

#!/usr/bin/env ruby

unless ARGV.size >= 2 and ARGV[1] =~ /^[1-9]|1\d|2[0-6]$/
  puts "Usage: #$0 WORD_LIST_FILE(S) MINIMUM_STEM_LIMIT"
  exit
end

$limit = ARGV.pop.to_i
$stems = { }

while line = ARGF.gets
  line.chomp!
  line.tr!("^a-zA-Z", "")
  line.downcase!
  
  next unless line.length == 7
  
  word = line.split("")
  word.each_index do |i|
    stem = word.dup
    stem.delete_at(i)
    stem = stem.sort.join
    
    if $stems.include?(stem)
      $stems[stem] << word[i] unless $stems[stem].include?(word[i])
    else
      $stems[stem] = [ word[i] ]
    end
  end
end

$stems.each_pair do |stem, letters|
  next if letters.size < $limit

  puts stem
  puts "\t#{letters.size}: #{letters.sort.join}"
end

In the game of Scrabble, you are limited to seven "tiles". You get 50 points when you can play all seven at once.

Technically, that could be in longer words (using letters already on the board). I didn't get the impression this quiz was concerned with that though.

James Edward Gray II

···

On Dec 18, 2004, at 5:49 PM, Giovanni Intini wrote:

From the original post it looked like you could have anything longer than 6.

My bad, then - I meant six letter stems that form seven letter words
when combined with at least n out of the 26 letters in the alphabet.
There are seven-to-make-eight stems too, but as James noted, the quiz is
not concerned with them (it's a trivial extension, though).

martin

···

Giovanni Intini <intinig@gmail.com> wrote:

From the original post it looked like you could have anything longer than 6.

Can you explain this line?

···

On Mon, 20 Dec 2004 10:36:18 +0900, James Edward Gray II <james@grayproductions.net> wrote:

Here's my solution. It doesn't really have any clever optimizations,
but it runs in seconds.

        line.tr!("^a-zA-Z", "")

James Edward Gray II wrote:

···

On Dec 18, 2004, at 5:49 PM, Giovanni Intini wrote:

From the original post it looked like you could have anything longer than 6.

In the game of Scrabble, you are limited to seven "tiles". You get 50 points when you can play all seven at once.

The quiz alluded to Scrabble, but I have to agree that the phrasing was still ambiguous about the length of the input words.

--
Glenn Parker | glenn.parker-AT-comcast.net | <http://www.tetrafoil.com/&gt;