Fun with Python and Cryptography

May 28, 2004 | My Projects, Python | By: Mark VandeWettering

I’m a bit of a classic cryptography nut, and I also have had a lot of fun writing bits of Python code. Python is awfully good at sucking up text files and performing statistics and the like, so I thought I would go ahead and describe some of the fun things I’ve been doing.

I’ve developed code to crack fairly sophisticated ciphers before, including both the Playfair and Enigma machines, but I thought it might be fun to work on a simple python script to crack ordinary substitution ciphers using evolutionary computation. You would probably get better results from using a dictionary based attack, but I thought this would be both simple and fun.

The basic idea is simple: the key is a permutation of the letters from A-Z. Spaces, apostrophes, periods and question marks are all signficant, so leave them in. Take a large sample of the english language:

  1. replace all lowercase letters with uppercase letters
  2. replace all repeats of unsignificant chars with a single space
  3. count the number of each trigraph which appears
  4. store out a dictionary which consists of the trigraph and the logarithm (base 2 works well) of their frequency

Why do we need this? We are going to score tentative decodes of our ciphertext by summing the log based two counts of each trigraph which appears in the decode.

In Python, it’s dead simple, you can look at the code for the resulting program gencorpus.py.

Now that we have that, we can use it to score messages. If you are familiar with evolutionary computation, you should find relatively little of the cryptogram.py surprising.

How well does it work? Well, consider the following ciphertext:

OYI XCKMIZZXFP X YRB NRZ OYRO NI NIMI DIRHXPV OYI NIZO RPB IPOIMXPV OYI
IRZO; OYI CFZO NIZOIMP FA ZKDIPBXB EMXBVIZ FHIM OYI BRPLEI, NYXUY XZ
YIMI FA PFEDI NXBOY RPB BIKOY, OFFJ LZ RCFPV OYI OMRBXOXFPZ FA OLMJXZY
MLDI.

A few generations yields the following, not so good decrypt….

Best Decode score = 1271
THE RFYNELLRAS R HUJ KUL THUT KE KENE ZEUCRSV THE KELT USJ ESTENRSV THE
EULT; THE FALT KELTENS AM LYZESJRJ XNRJVEL ACEN THE JUSIXE, KHROH RL
HENE AM SAXZE KRJTH USJ JEYTH, TAAD IL UFASV THE TNUJRTRASL AM TINDRLH
NIZE.

But, if you let it run a bit…

Best Decode score = 2041
THE IMPRESSION I HAL WAS THAT WE WERE CEAVIND THE WEST ANL ENTERIND THE
EAST; THE MOST WESTERN OF SPCENLIL BRILDES OVER THE LANUBE, WHIGH IS
HERE OF NOBCE WILTH ANL LEPTH, TOOK US AMOND THE TRALITIONS OF TURKISH
RUCE.

It then seemed to get stuck in this run (curse you local minima!). A second run produced:

THE IMPRESSION I HAD WAS THAT WE WERE LEAVING THE WEST AND ENTERING THE
EAST; THE MOST WESTERN OF SPLENDID BRIDGES OVER THE DANUBE, WHICH IS
HERE OF NOBLE WIDTH AND DEPTH, TOOK US AMONG THE TRADITIONS OF TURKISH
RULE.

Not bad, considering there was no real dictionary of words. I need to experiment a bit more with the mutation and crossover operations and probabilities: it seems that tweaking these can produce better or much worse convergence. Still, a fun chunk of code to play with. Hope it was a teeny bit thought provoking.