[SUMMARY] Cryptograms (#13)

by Glenn Parker

Solving a cryptogram by brute force is prohibitively expensive. The maximum
number of possible solutions is 26!, or roughly 4*10^26, so the first challenge
is to pare down the search to something manageable.

Both my solution and Michael's begin with the insight that any word, either a
regular dictionary word or a cryptographic token, can be viewed as a pattern of
repeated and non-repeated characters. For example, "banana" has the pattern
[1 2 3 2 3 2], where the first letter is used exactly once, the second letter
is used three times, and the third letter is used twice. These patterns group
all known words into families. The word "banana" belongs to the same family
as the word "rococo".

All words in a dictionary can be grouped into families according to their
patterns, and each cryptographic token has its own pattern that corresponds
(with any luck), to one of the families from the dictionary. If a token has no
matching family, then it cannot be solved with the given dictionary, so we
won't worry about that case too much.
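As a concrete sketch (the helper name `pattern` is mine, not from either posted solution), the pattern and the family grouping might look like this in Ruby:

```ruby
# Number each letter by the order of its first appearance, so that
# words with the same repetition structure get the same pattern.
def pattern(word)
  seen = {}
  word.chars.map { |c| seen[c] ||= seen.size + 1 }
end

pattern("banana")  # => [1, 2, 3, 2, 3, 2]
pattern("rococo")  # => [1, 2, 3, 2, 3, 2]

# Group a word list into families keyed by pattern.
families = Hash.new { |h, k| h[k] = [] }
%w[banana rococo monkey].each { |w| families[pattern(w)] << w }
families[pattern("xyzyzy")]  # => ["banana", "rococo"]
```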

We start by assuming that one of the cryptographic tokens corresponds to one of
the words in its family. This pairing produces a partial map of input to
output characters. So, if we examine the token "xyzyzy", we might assume that
it is really the word "banana". The partial map that results is x->b y->a z->n,
or

  abcdefghijklmnopqrstuvwxyz
  .......................ban

Note that this mapping will affect all other cryptographic tokens that share
the letters x, y, and z. In fact, it may even solve some of them completely,
e.g. "zyx" becomes "nab". Or, the map may convert another token into a word
that is not in the dictionary, e.g. "zyxxyz" becomes "nabban", which is not in
my dictionary. This is a useful trick that will reduce the size of the search.
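Applying a partial map to other tokens is a one-liner; in this sketch (the helper name is mine), '.' stands in for letters the map does not cover yet:

```ruby
# Translate a token through a partial map; unmapped letters become '.'.
def apply_map(map, token)
  token.chars.map { |c| map[c] || "." }.join
end

map = { "x" => "b", "y" => "a", "z" => "n" }
apply_map(map, "zyx")     # => "nab"
apply_map(map, "zyxxyz")  # => "nabban", not a word, so this branch can be cut
```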

Next we assume that another token can be mapped into a dictionary word from its
family, which produces another partial map that must be combined with the first
map. There are two ways this combination can fail. First, the new map may
have a previously mapped input letter going to a different output letter, so if
we mapped "uvwxyz" to "monkey", the result would be a map where "x" mapped to
both "b" and "k". Second, the new map may have a previously unused input
letter going to an output letter that was already used, so if we mapped
"abcdef" to "monkey", the result would map both "c" and "z" to "n". Failed
mappings also serve to reduce the size of the search.
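Both failure modes can be sketched as a merge that rejects conflicting pairs (again, the names here are my own, not either solution's API):

```ruby
# Merge partial map m2 into m1, failing (nil) when an input letter would
# map to two different outputs, or two inputs would share one output.
def combine(m1, m2)
  merged = m1.dup
  m2.each do |from, to|
    if merged.key?(from)
      return nil if merged[from] != to   # first failure mode
    else
      return nil if merged.value?(to)    # second failure mode
      merged[from] = to
    end
  end
  merged
end

base = { "x" => "b", "y" => "a", "z" => "n" }          # xyzyzy => banana
combine(base, { "u" => "m", "v" => "o", "w" => "n",
                "x" => "k", "y" => "e", "z" => "y" })  # uvwxyz => monkey: nil
```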

For my solution, I used a depth-first search, working through the tokens and
trying every word in each token's family. The tokens are ordered according to
increasing family size, so the tokens with the fewest possible solutions are
examined first. At each level of the recursion, all the words for a token are
applied in sequence to the current map. If the resulting map is valid, I
recurse: the new map is applied to the remaining unsolved tokens to see if
they are already solved or unsolvable. Solved tokens are ignored for the rest
of this branch of the search, and unsolvable tokens are shelved. Then I start
working on the next token with the new map.

The recursion terminates when a complete map is found, or the number of shelved
(unsolvable) tokens exceeds a limit, or every family word has been used for the
last token.
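The recursion described above can be sketched end to end in a few dozen lines. This is a minimal standalone version with my own helper names, and it omits the shelving limit, stopping at the first complete map:

```ruby
# Number each letter by the order of its first appearance.
def pattern(word)
  seen = {}
  word.chars.map { |c| seen[c] ||= seen.size + 1 }
end

# Extend map by pairing token with word; nil on either kind of conflict.
def extend_map(map, token, word)
  merged = map.dup
  token.chars.zip(word.chars) do |from, to|
    if merged.key?(from)
      return nil if merged[from] != to
    else
      return nil if merged.value?(to)
      merged[from] = to
    end
  end
  merged
end

# Depth-first search: try every family word for the first token, then
# recurse on the rest; returns the first complete map found, or nil.
def solve(tokens, families, map = {})
  return map if tokens.empty?
  token, *rest = tokens
  families.fetch(pattern(token), []).each do |word|
    merged = extend_map(map, token, word)
    next unless merged
    found = solve(rest, families, merged)
    return found if found
  end
  nil
end

dict = %w[banana rococo nab monkey]
families = Hash.new { |h, k| h[k] = [] }
dict.each { |w| families[pattern(w)] << w }

# Tokens with the fewest candidate words go first.
tokens = %w[xyzyzy zyx].sort_by { |t| families[pattern(t)].size }
solve(tokens, families)  # => {"z"=>"n", "y"=>"a", "x"=>"b"}
```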

We are also interested in maps that do not yield dictionary words for every
token. This is because cryptograms often contain non-dictionary words, and so
we may be satisfied by a partial solution even when a full solution is
impossible.
Finding partial solutions is more expensive than finding only full solutions,
since the search space can be significantly larger. Aside from the trick of
shelving unsolvable words, partial solutions require us to selectively ignore
tokens that may be "spoiling" the search even though they produce valid maps.
Neither Michael's solution nor my own fully implements this as far as I can
tell.

Michael represents each map as a set of character pairs. Every possible
partial map for each individual token is calculated and forms a set of maps for
that token. Tokens are ordered by the increasing size of their set of maps, on
the assumption that smaller sets are more constraining and thus simplify the
search.

Some clever set operations are used to progressively combine each token's set
of maps into a master set of maps. When mixing in a token's set of maps
produces an empty set of maps, that token is ignored. After all the sets have
been combined, you are left with a set of maps that solve some or all of the
cryptogram.

The weakness in both my approach and Michael's is that tokens are always
mixed into the solution using a single pre-defined order. But the tokens that
are mixed in first can have an overwhelming influence on the final maps that
result. In the worst case, the first token to be mapped can make it impossible
to add any other tokens to the map.

The only solution I know is to add another wrapper around the entire search
process that mutates the order of the token mixing. I've implemented this in
Python, but not in the Ruby version, and I think Michael's solution might be a
better place to start from when pursuing this.

Hi,

my solution!

Two minutes later :(.

I've just finished it (started yesterday evening)

The general idea was to construct a dictionary structure which contains the whole dictionary.
It's a kind of trie with a value '@final' that is true or false depending on whether the node is the end of a word.

Then I have a method (exist) which can tell very quickly whether a word exists in it, using '.' for unknown characters.
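A minimal sketch of that structure: a trie whose nodes carry an @final flag, with '.' as a wildcard in lookups. The class and method names here are my guesses, not necessarily what the attached crypto.rb uses:

```ruby
# Trie node: one child per letter; @final marks the end of a word.
class Trie
  def initialize
    @children = {}
    @final = false
  end

  def insert(word)
    node = self
    word.each_char { |c| node = (node.children[c] ||= Trie.new) }
    node.finish!
  end

  # Does word exist? '.' matches any letter at that position.
  def exist?(word)
    return @final if word.empty?
    rest = word[1..-1]
    if word[0] == "."
      @children.values.any? { |child| child.exist?(rest) }
    else
      @children.key?(word[0]) && @children[word[0]].exist?(rest)
    end
  end

  protected

  attr_reader :children

  def finish!
    @final = true
  end
end

trie = Trie.new
%w[banana bandana].each { |w| trie.insert(w) }
trie.exist?("ban.na")  # => true (matches "banana")
trie.exist?("band")    # => false (a prefix, not a whole word)
```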

The example is solved in a few minutes on my laptop.

Regards.

crypto.rb (7.05 KB)

Late response, with a question about the solvability of these
kinds of problems in general.

I used a similar approach and got similar answers to Glenn's, only
much slower (although mine finishes in finite time :slight_smile: When running
Glenn's program, I get many partial solutions in < 1 sec (including most
that my program finds) - then his program consumes 99% of my CPU
time indefinitely. On my PC, mine runs in ~10 s for crypto1
(partial answer), ~16 s for crypto2 (full correct answer),
and ~80 s for crypto3, producing garbage. (can I buy a
vowel? :slight_smile:

Actually I added the ability to provide a partial map
to help prime the solver - still can't make sense of crypto3.

The main question I have is how one handles non-dictionary
words that match a dictionary word pattern - and then force
other wrong choices. The specific case is "alva", which can
map to "aqua". Once I have "aqua" as plain text, it prevents
the solver from using the letter 'u' in "genius" - making that
word match "denims".

In fact I can envision quotes with non-dictionary words that
could map, wrongly, to a set of all dictionary words - i.e.
the solution with 1 or more "unsolved" words would be more
correct than the solution with no unsolved words...

There's also the situation of "bent, cent, vent, went" - where
the choice of any of them doesn't affect any other words in the
cypher text - providing no help selecting between them.

In both these cases, I don't think backtracking can help - other
than to provide a number of possible solutions that a person
could read and pick from?

Additionally, while I agree that brute force with 26! possible maps
is impractical, after a certain point, would it make sense to
use the brute force approach for the "remaining" letters -
somewhere in the 8-10 letters remaining?

Thanks to all involved in providing this - been reading for a
while, but this is my first try at solving one ... took *way*
more time than I intended, but was lots of fun. Looking forward
to more in the future.

Vance

rq13 (3.23 KB)

[LONG POST that is short on Ruby specific content, apologies]

Solving a cryptogram by brute force is prohibitively expensive. The maximum
number of possible solutions is 26!, or roughly 4*10^26, so the first challenge
is to pare down the search to something manageable.

Strictly fewer than 26!, actually: a character cannot map to itself, so the
count is the number of derangements of 26 letters, roughly 26!/e. But it is
still a rather large number of possible keys. :slight_smile:

All words in a dictionary can be grouped into families according to their
patterns, and each cryptographic token has its own pattern that corresponds
(with any luck), to one of the families from the dictionary. If a token has no
matching family, then it cannot be solved with the given dictionary, so we
won't worry about that case too much.

One thing I did wonder about is why the dictionary we used didn't
include 'i' and 'a'-- both very common words in English, and the likely
solution any time we see a single character token in a cipher text. Some
of my early tests did inject these two words. I forgot to add that back
in and see what the difference was.

Michael represents each map as a set of character pairs.

That was one area I struggled with: representing possible maps and then
how to compare them.

some sort of mapping arithmetic where

(abc => dog) + (abc => peg) = (abc => ..g)
(abc => ..g) + (abc => fog) = (abc => fog)

Here is where I'm going with this:

Every possible mapping can be encoded as a number, from the empty mapping
(abcdefghijklmnopqrstuvwxyz => ..........................) up through full
mappings such as abcdefghijklmnopqrstuvwxyz =>
zyxwvutsrqponmlkjihgfedcba. In any case, it can be represented most
compactly as a 26-digit number in base 27 (26 letters plus the unsolved
token).

27 is not a computer friendly number, but 32 is.

32 = 2 ** 5 = 11111 in base 2 notation

So our mappings would easily work if we convert each character to a base
32 number (some of our number space will go unused) and concatenate. The
entire thing fits in a 130-bit binary number then. I am not going to
draw out entire examples, but the first mapping above (all ....s) would
be 000...0 and a full mapping would be up near 111...1.

Here it is in action with a pair of two letter mappings (ab => de) and
(ab => be) (x and y respectively):

irb(main):033:0> x = 0b0010000101; y = 0b0001000101
=> 69
irb(main):034:0> sprintf("%010b",(x & y))
=> "0000000101"

The resulting string is clearly the map (ab => .e)

The problem is that the example is contrived and falls down when the mappings
are (ab => ce) and (ab => be)-- any case where the binary representations have
some matching bits. (00011 (c) & 00010 (b) = 00010 (b), which is wrong).

Is there a way to overcome this? Maybe it's time to write a quick Mapping class
that provides comparison and combination methods... but it would be nice if there
were some easier way. :slight_smile:
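One hedged way around the shared-bits problem (my own sketch, not from any posted solution): intersect the packed maps five bits at a time, keeping a character's code only where both maps encode exactly the same letter, instead of trusting a raw bitwise AND:

```ruby
# Intersect two packed mappings (5 bits per character): keep a
# character's code only where both maps agree on the same letter.
def intersect_packed(x, y, width)
  result = 0
  width.times do |i|
    a = (x >> (5 * i)) & 0b11111
    b = (y >> (5 * i)) & 0b11111
    result |= (a << (5 * i)) if a == b
  end
  result
end

# The case that broke plain AND: (ab => ce) vs (ab => be).
x = 0b0001100101  # ab => ce
y = 0b0001000101  # ab => be
format("%010b", intersect_packed(x, y, 2))  # => "0000000101", i.e. (ab => .e)
```

This only implements the intersection half of the mapping arithmetic; the refinement case, (abc => ..g) + (abc => fog) = (abc => fog), would still need its own combining rule.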

The weakness in both mine and Michael's approach is that tokens are always
mixed in to the solution using a single pre-defined order. But the tokens that
are mixed in first can have an overwhelming influence on the final maps that
result. In the worst case, the first token to be mapped can make it impossible
to add any other tokens to the map.

The only solution I know is to add another wrapper around the entire search
process that mutates the order of the token mixing. I've implemented this in
Python, but not in the Ruby version, and I think Michael's solution might be a
better place to start from when pursuing this.

I tried longest token first (most likely to have a unique pattern?),
shortest token first (just for kicks), least number of patterns first
(keeping the running list of possible maps to a minimum). I tried just
doing them in order presented, too. One thing I didn't try that I think
may be promising is finding the longest, most unique word and then
sorting the other tokens such that each successive token shares as much
of the mapping as possible with what's already been solved.

But this is always going to be a problem. The other, worse problem, is
settling on a mapping that solves a non-dictionary-word token using a
dictionary word (i.e. "thomas aqua edison").

Anyway, this was great fun. My decrypt.rb solves the cryptogram in the
paper without too much work, so I'm happy. :slight_smile:

-Michael, www.andsoforth.com

···

On Thu, 2005-01-06 at 22:53 +0900, Ruby Quiz wrote:

Solving a cryptogram by brute force is prohibitively expensive. The maximum
number of possible solutions is 26!, or roughly 4*10^26, so the first
challenge is to pare down the search to something manageable.

Both my solution and Michael's begin with the insight that any word, either a
regular dictionary word or a cryptographic token, can be viewed as a pattern
of repeated and non-repeated characters. For example, "banana" has the
pattern [1 2 3 2 3 2], where the first letter is used exactly once, the
second letter is used three times, and the third letter is used twice. These
patterns group all known words into families. The word "banana" belongs to
the same family as the word "rococo".

I'm sorry I didn't have time to give this a go.

You looked at matching words. I'm wondering if you could get as good a go
with reasonable-sized ciphertexts using a letter frequencies method? (i.e.
the most common letters in an English text are E, then T, etc.) Perhaps you
could apply letter frequencies first before moving on to match common words,
even.

Cheers,
Dave

As I told Glenn, I was insanely busy this week and couldn't quite steal the time to try this quiz. Which is a real shame because I think it was a great problem.

However, one of my ideas for pruning was to toss out mappings that didn't put "reasonable" vowels in every word. I'm not sure how well that works in practice though, since I didn't code it up. Just a thought.

James Edward Gray II

···

On Jan 7, 2005, at 3:02 AM, Vance A Heron wrote:

(can I buy a vowel? :slight_smile:

Using letter frequencies really was not something we could do for the
Quiz because our inputs were just a list of dictionary words and a
cryptogram. Going beyond the Quiz parameters, we could generate a list of
frequencies (using texts from gutenberg.org or a web crawler) or we
could find some cryptological authority's list and use that.

Even without using additional inputs, I did consider the frequency
approach since that is a common heuristic a human uses to solve a
substitution or shift cipher. However, the dictionary we were using
returns different frequencies than I expected to see based on my
experience with cryptanalysis (namely, watching "Wheel of Fortune" for
20+ years), so I abandoned this approach.

After thinking about it, I concluded we're getting essentially the same
effect from a different angle using word patterns.

-Michael

···

On Sun, 2005-01-09 at 15:31 +0900, Dave Burt wrote:

You looked at matching words. I'm wondering if you could get as good a go
with reasonable-sized ciphertexts using a letter frequencies method? (i.e.
the most common letters in an English text are E, then T, etc.) Perhaps you
could apply letter frequencies first before moving on to match common words,
even.