“Privacy is an inherent human right, and a requirement for maintaining the human condition with dignity and respect.”
—Bruce Schneier, cryptographer, 2006
Two methods exist to hack the Vigenère cipher. One method uses a brute-force dictionary attack to try every word in the dictionary file as the Vigenère key, which works only if the key is an English word, such as RAVEN or DESK. The second, more sophisticated method, which was used by the 19th-century mathematician Charles Babbage, works even when the key is a random group of letters, such as VUWFE or PNFJ. In this chapter, we’ll write programs to hack the Vigenère cipher using both methods.
We’ll first use the dictionary attack to hack the Vigenère cipher. The dictionary file dictionary.txt (available on this book’s website at https://www.nostarch.com/crackingcodes/) has approximately 45,000 English words. It takes less than five minutes for my computer to run through all these decryptions for a message the size of a long paragraph. This means that if an English word is used to encrypt a Vigenère ciphertext, the ciphertext is vulnerable to a dictionary attack. Let’s look at the source code for a program that uses a dictionary attack to hack the Vigenère cipher.
Open a new file editor window by selecting File▸New File. Enter the following code into the file editor, and then save it as vigenereDictionaryHacker.py. Be sure to place the detectEnglish.py, vigenereCipher.py, and pyperclip.py files in the same directory as the vigenereDictionaryHacker.py file. Then press F5 to run the program.
vigenere
Dictionary
Hacker.py
1. # Vigenere Cipher Dictionary Hacker
2. # https://www.nostarch.com/crackingcodes/ (BSD Licensed)
3.
4. import detectEnglish, vigenereCipher, pyperclip
5.
6. def main():
7. ciphertext = """Tzx isnz eccjxkg nfq lol mys bbqq I lxcz."""
8. hackedMessage = hackVigenereDictionary(ciphertext)
9.
10. if hackedMessage != None:
11. print('Copying hacked message to clipboard:')
12. print(hackedMessage)
13. pyperclip.copy(hackedMessage)
14. else:
15. print('Failed to hack encryption.')
16.
17.
18. def hackVigenereDictionary(ciphertext):
19. fo = open('dictionary.txt')
20. words = fo.readlines()
21. fo.close()
22.
23. for word in lines:
24. word = word.strip() # Remove the newline at the end.
25. decryptedText = vigenereCipher.decryptMessage(word, ciphertext)
26. if detectEnglish.isEnglish(decryptedText, wordPercentage=40):
27. # Check with user to see if the decrypted key has been found:
28. print()
29. print('Possible encryption break:')
30. print('Key ' + str(word) + ': ' + decryptedText[:100])
31. print()
32. print('Enter D for done, or just press Enter to continue
breaking:')
33. response = input('> ')
34.
35. if response.upper().startswith('D'):
36. return decryptedText
37.
38. if __name__ == '__main__':
39. main()
When you run the vigenereDictionaryHacker.py program, the output should look like this:
Possible encryption break:
Key ASTROLOGY: The recl yecrets crk not the qnks I tell.
Enter D for done, or just press Enter to continue breaking:
>
Possible encryption break:
Key ASTRONOMY: The real secrets are not the ones I tell.
Enter D for done, or just press Enter to continue breaking:
> d
Copying hacked message to clipboard:
The real secrets are not the ones I tell.
The first keyword the program suggests (ASTROLOGY) doesn’t work, so the user presses enter to let the hacking program continue until it finds the correct decryption key (ASTRONOMY).
Because the source code for the vigenereDictionaryHacker.py program is similar to previous hacking programs in this book, I won’t explain it line by line. Briefly, the hackVigenereDictionary() function attempts to use each word in the dictionary file to decrypt the ciphertext, and when the decrypted text looks like English (according to the detectEnglish module), it prints the decryption and prompts the user to quit or continue.
Note that this program uses the readlines() method on file objects returned from open():
20. words = fo.readlines()
Unlike the read() method, which returns the full contents of the file as a single string, the readlines() method returns a list of strings, where each string is a single line from the file. Since there is one word in each line of the dictionary file, the words variable contains a list of every English word from Aarhus to Zurich.
The rest of the program, from lines 23 to 36, works similarly to the transposition cipher–hacking program in Chapter 12. A for loop will iterate over each word in the words list, decrypt the message with the word as the key, and then call detectEnglish.isEnglish() to see whether the result is understandable English text.
Now that we’ve written a program that hacks the Vigenère cipher using a dictionary attack, let’s look at how to hack the Vigenère cipher even when the key is a random group of letters rather than a dictionary word.
Kasiski examination is a process that we can use to determine the length of the Vigenère key used to encrypt a ciphertext. We can then use frequency analysis to break each of the subkeys independently. Charles Babbage was the first person to have broken the Vigenère cipher using this process, but he never published his results. His method was later published by Friedrich Kasiski, an early 20th-century mathematician who became the namesake of the method. Let’s look at the steps involved in Kasiski examination. These are the steps that our Vigenère hacking program will take.
The first step of Kasiski examination is to find every repeated set of at least three letters in the ciphertext. These repeated sequences could be the same letters of plaintext encrypted using the same subkeys of the Vigenère key. For example, if you encrypted the plaintext THE CAT IS OUT OF THE BAG with the key SPILLTHEBEANS, you’d get:
THECATISOUTOFTHEBAG
SPILLTHEBEANSSPILLT
LWMNLMPWPYTBXLWMMLZ
Notice that the letters LWM repeat twice. The reason is that in the ciphertext, LWM is the plaintext word THE encrypted using the same letters of the key—SPI—because the key happens to repeat at the second THE. The number of letters from the beginning of the first LWM to the beginning of the second LWM, which we’ll call the spacing, is 13. This suggests that the key used for this ciphertext is 13 letters long. By just looking at the repeated sequences, you can figure out the length of the key.
However, in most ciphertexts, the key won’t conveniently align with a repeated sequence of letters, or the key might repeat more than once between repeated sequences, meaning that the number of letters between the repeated letters would be equal to a multiple of the key rather than the key itself. To try to address these problems, let’s look at a longer example in which we don’t know what the key is.
When we remove the non-letters in the ciphertext PPQCA XQVEKG YBNKMAZU YBNGBAL JON I TSZM JYIM. VRAG VOHT VRAU C TKSG. DDWUO XITLAZU VAVV RAZ C VKB QP IWPOU, it would look like the string shown in Figure 20-1. The figure also shows the repeated sequences in this string—VRA, AZU, and YBN—and the number of letters between each sequence pair.
Figure 20-1: The repeated sequences in the example string
In this example, there are several potential key lengths. The next step of the Kasiski examination is to calculate all the factors of these counts to narrow down the potential key lengths.
The spacings between the sequences are 8, 8, 24, 32, and 48 in the example. But the factors of the spacings are more important than the spacings.
To see why, look at the message THEDOGANDTHECAT in Table 20-1 and try to encrypt it with the nine-letter key ABCDEFGHI and the three-letter key XYZ. Each key repeats for the length of the message.
Table 20-1: Encrypting THEDOGANDTHECAT with Two Different Keys
Encrypting with ABCDEFGHI |
Encrypting with XYZ |
|
Plaintext message |
THEDOGANDTHECAT |
THEDOGANDTHECAT |
Key (repeating) |
ABCDEFGHIABCDEF |
XYZXYZXYZXYZXYZ |
Ciphertext |
TIGGSLGULTIGFEY |
QFDAMFXLCQFDZYS |
The two keys produce two different ciphertexts, as expected. Of course, the hacker won’t know the original message or the key, but they will see in the TIGGSLGULTIGFEY ciphertext that the sequence TIG appears at index 0 and index 9. Because 9 – 0 = 9, the spacing between these sequences is 9, which would seem to indicate that the original key was a nine-letter key; in this case, that indication is correct.
However, the QFDAMFXLCQFDZYS ciphertext also produces a repeated sequence (QFD) that appears at index 0 and index 9. The spacing between these sequences is also 9, indicating that the key used in this ciphertext was also nine letters long. But we know that the key is only three letters long: XYZ.
The repeated sequences occur when the same letters in the message (THE in our example) are encrypted with the same letters of the key (ABC and XYZ in our example), which happens when the similar letters in the message and key “line up” and encrypt to the same sequence. This alignment can happen at any multiple of the real key length (such as 3, 6, 9, 12, and so on), which is why the three-letter key can produce a repeated sequence with a spacing of 9.
So the possible key length is due not just to the spacing but any factor of that spacing. The factors of 9 are 9, 3, and 1. Therefore, if you find repeated sequences with a spacing of 9, you must consider that the key could be of length 9 or 3. We can ignore 1 because a Vigenère cipher with a one-letter key is just the Caesar cipher.
Step 2 of Kasiski examination involves finding each of the spacings’ factors (excluding 1), as shown in Table 20-2.
Table 20-2: Factors of Each Spacing
Spacing |
Factors |
8 |
2, 4, 8 |
24 |
2, 4, 6, 8, 12, 24 |
32 |
2, 4, 8, 16 |
48 |
2, 4, 6, 8, 12, 24, 48 |
Collectively, the numbers 8, 8, 24, 32, and 48 have the following factors: 2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 8, 8, 8, 8, 12, 12, 16, 24, 24, and 48.
The key is most likely to be the most frequently occurring factors, which you can determine by counting. Because 2, 4, and 8 are the most frequently occurring factors of the spacings, they are the most likely lengths of the Vigenère key.
Now that we have possible lengths of the Vigenère key, we can use this information to decrypt the message one subkey at a time. For this example, let’s assume that the key length is 4. If we’re unable to crack this ciphertext, we can try again assuming the key length is 2 or 8.
Because the key is cycled through to encrypt the plaintext, a key length of 4 would mean that starting from the first letter, every fourth letter in the ciphertext is encrypted using the first subkey, every fourth letter starting from the second letter of the plaintext is encrypted using the second subkey, and so on. Using this information, we’ll form strings from the ciphertext of the letters that have been encrypted by the same subkey. First, let’s identify what every fourth letter in the string would be if we started from different letters. Then we’ll combine the letters into a single string. In these examples, we’ll bold every fourth letter.
Identify every fourth letter starting with the first letter:
PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCV
KBQPIWPOU
Next, we find every fourth letter starting with the second letter:
PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCV
KBQPIWPOU
Then we do the same starting with the third letter and fourth letter until we reach the length of the subkey we’re testing. Table 20-3 shows the combined strings of the bolded letters for each iteration.
Table 20-3: Strings of Every Fourth Letter
Starting with |
String |
First letter |
PAEBABANZIAHAKDXAAAKIU |
Second letter |
PXKNZNLIMMGTUSWIZVZBW |
Third letter |
QQGKUGJTJVVVCGUTUVCQP |
Fourth letter |
CVYMYBOSYRORTDOLVRVPO |
If we guessed the correct key length, each of the four strings we created in the previous section would have been encrypted with one subkey. This means that when a string is decrypted with the correct subkey and undergoes frequency analysis, the decrypted letters are likely to have a high English frequency match score. Let’s see how this process works using the first string, PAEBABANZIAHAKDXAAAKIU, as an example.
First, we decrypt the string 26 times (once for each of the 26 possible subkeys) using the Vigenère decryption function in Chapter 18, vigenereCipher.decryptMessage(). Then we test each decrypted string using the English frequency analysis function in Chapter 19, freqAnalysis.englishFreqMatchScore(). Run the following code in the interactive shell:
>>> import freqAnalysis, vigenereCipher
>>> for subkey in 'ABCDEFGHJIJKLMNOPQRSTUVWXYZ':
... decryptedMessage = vigenereCipher.decryptMessage(subkey,
'PAEBABANZIAHAKDXAAAKIU')
... print(subkey, decryptedMessage,
freqAnalysis.englishFreqMatchScore(decryptedMessage))
...
A PAEBABANZIAHAKDXAAAKIU 2
B OZDAZAZMYHZGZJCWZZZJHT 1
--snip--
Table 20-4 shows the results.
Table 20-4: English Frequency Match Score for Each Decryption
Subkey |
Decryption |
English frequency match score |
'A' |
'PAEBABANZIAHAKDXAAAKIU' |
2 |
'B' |
'OZDAZAZMYHZGZJCWZZZJHT' |
1 |
'C' |
'NYCZYZYLXGYFYIBVYYYIGS' |
1 |
'D' |
'MXBYXYXKWFXEXHAUXXXHFR' |
0 |
'E' |
'LWAXWXWJVEWDWGZTWWWGEQ' |
1 |
'F' |
'KVZWVWVIUDVCVFYSVVVFDP' |
0 |
'G' |
'JUYVUVUHTCUBUEXRUUUECO' |
1 |
'H' |
'ITXUTUTGSBTATDWQTTTDBN' |
1 |
'I' |
'HSWTSTSFRASZSCVPSSSCAM' |
2 |
'J' |
'GRVSRSREQZRYRBUORRRBZL' |
0 |
'K' |
'FQURQRQDPYQXQATNQQQAYK' |
1 |
'L' |
'EPTQPQPCOXPWPZSMPPPZXJ' |
0 |
'M' |
'DOSPOPOBNWOVOYRLOOOYWI' |
1 |
'N' |
'CNRONONAMVNUNXQKNNNXVH' |
2 |
'O' |
'BMQNMNMZLUMTMWPJMMMWUG' |
1 |
'P' |
'ALPMLMLYKTLSLVOILLLVTF' |
1 |
'Q' |
'ZKOLKLKXJSKRKUNHKKKUSE' |
0 |
'R' |
'YJNKJKJWIRJQJTMGJJJTRD' |
1 |
'S' |
'XIMJIJIVHQIPISLFIIISQC' |
1 |
'T' |
'WHLIHIHUGPHOHRKEHHHRPB' |
1 |
'U' |
'VGKHGHGTFOGNGQJDGGGQOA' |
1 |
'V' |
'UFJGFGFSENFMFPICFFFPNZ' |
1 |
'W' |
'TEIFEFERDMELEOHBEEEOMY' |
2 |
'X' |
'SDHEDEDQCLDKDNGADDDNLX' |
2 |
'Y' |
'RCGDCDCPBKCJCMFZCCCMKW' |
0 |
'Z' |
'QBFCBCBOAJBIBLEYBBBLJV' |
0 |
The subkeys that produce decryptions with the closest frequency match to English are most likely to be the real subkey. In Table 20-4, the subkeys 'A', 'I', 'N', 'W', and 'X' result in the highest frequency match scores for the first string. Note that these scores are low in general because there isn’t enough ciphertext to give us a large sample of text, but they work well enough for this example.
The next step is to repeat this process for the other three strings to find their most likely subkeys. Table 20-5 shows the final results.
Table 20-5: Most Likely Subkeys for the Example Strings
Ciphertext string |
Most likely subkeys |
PAEBABANZIAHAKDXAAAKIU |
A, I, N, W, X |
PXKNZNLIMMGTUSWIZVZBW |
I, Z |
QQGKUGJTJVVVCGUTUVCQP |
C |
CVYMYBOSYRORTDOLVRVPO |
K, N, R, V, Y |
Because there are five possible subkeys for the first subkey, two for the second subkey, one for the third subkey, and five for the fourth subkey, the total number of combinations is 50 (which we get from multiplying all the possible subkeys 5 × 2 × 1 × 5). In other words, we need to brute-force 50 possible keys. But this is much better than brute-forcing through 26 × 26 × 26 × 26 (or 456,976) possible keys, our task had we not narrowed down the list of possible subkeys. This difference becomes even greater if the Vigenère key is longer!
To brute-force the key, we’ll try every combination of the likely subkeys. All 50 possible subkey combinations are listed as follows:
The final step in our Vigenère hacking program will be to test all 50 of these decryption keys on the full ciphertext to see which produces readable English plaintext. Doing so should reveal that the key to the “PPQCA XQVEKG…” ciphertext is WICK.
Open a new file editor window by selecting File▸New File. Make sure the detectEnglish.py, freqAnalysis.py, vigenereCipher.py, and pyperclip.py files are in the same directory as the vigenereHacker.py file. Then enter the following code into the file editor and save it as vigenereHacker.py. Press F5 to run the program.
The ciphertext on line 17 in this program is difficult to copy from the book. To avoid typos, copy and paste it from the book’s website at https://www.nostarch.com/crackingcodes/. You can check for any differences between the text in your program and the text of the program in this book by using the online diff tool on the book’s website.
vigenere
Hacker.py
1. # Vigenere Cipher Hacker
2. # https://www.nostarch.com/crackingcodes/ (BSD Licensed)
3.
4. import itertools, re
5. import vigenereCipher, pyperclip, freqAnalysis, detectEnglish
6.
7. LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
8. MAX_KEY_LENGTH = 16 # Will not attempt keys longer than this.
9. NUM_MOST_FREQ_LETTERS = 4 # Attempt this many letters per subkey.
10. SILENT_MODE = False # If set to True, program doesn't print anything.
11. NONLETTERS_PATTERN = re.compile('[^A-Z]')
12.
13.
14. def main():
15. # Instead of typing this ciphertext out, you can copy & paste it
16. # from https://www.nostarch.com/crackingcodes/:
17. ciphertext = """Adiz Avtzqeci Tmzubb wsa m Pmilqev halpqavtakuoi,
lgouqdaf, kdmktsvmztsl, izr xoexghzr kkusitaaf. Vz wsa twbhdg
ubalmmzhdad qz
--snip--
azmtmd'g widt ion bwnafz tzm Tcpsw wr Zjrva ivdcz eaigd yzmbo
Tmzubb a kbmhptgzk dvrvwz wa efiohzd."""
18. hackedMessage = hackVigenere(ciphertext)
19.
20. if hackedMessage != None:
21. print('Copying hacked message to clipboard:')
22. print(hackedMessage)
23. pyperclip.copy(hackedMessage)
24. else:
25. print('Failed to hack encryption.')
26.
27.
28. def findRepeatSequencesSpacings(message):
29. # Goes through the message and finds any 3- to 5-letter sequences
30. # that are repeated. Returns a dict with the keys of the sequence and
31. # values of a list of spacings (num of letters between the repeats).
32.
33. # Use a regular expression to remove non-letters from the message:
34. message = NONLETTERS_PATTERN.sub('', message.upper())
35.
36. # Compile a list of seqLen-letter sequences found in the message:
37. seqSpacings = {} # Keys are sequences; values are lists of int spacings.
38. for seqLen in range(3, 6):
39. for seqStart in range(len(message) - seqLen):
40. # Determine what the sequence is and store it in seq:
41. seq = message[seqStart:seqStart + seqLen]
42.
43. # Look for this sequence in the rest of the message:
44. for i in range(seqStart + seqLen, len(message) - seqLen):
45. if message[i:i + seqLen] == seq:
46. # Found a repeated sequence:
47. if seq not in seqSpacings:
48. seqSpacings[seq] = [] # Initialize blank list.
49.
50. # Append the spacing distance between the repeated
51. # sequence and the original sequence:
52. seqSpacings[seq].append(i - seqStart)
53. return seqSpacings
54.
55.
56. def getUsefulFactors(num):
57. # Returns a list of useful factors of num. By "useful" we mean factors
58. # less than MAX_KEY_LENGTH + 1 and not 1. For example,
59. # getUsefulFactors(144) returns [2, 3, 4, 6, 8, 9, 12, 16].
60.
61. if num < 2:
62. return [] # Numbers less than 2 have no useful factors.
63.
64. factors = [] # The list of factors found.
65.
66. # When finding factors, you only need to check the integers up to
67. # MAX_KEY_LENGTH:
68. for i in range(2, MAX_KEY_LENGTH + 1): # Don't test 1: it's not useful.
69. if num % i == 0:
70. factors.append(i)
71. otherFactor = int(num / i)
72. if otherFactor < MAX_KEY_LENGTH + 1 and otherFactor != 1:
73. factors.append(otherFactor)
74. return list(set(factors)) # Remove duplicate factors.
75.
76.
77. def getItemAtIndexOne(x):
78. return x[1]
79.
80.
81. def getMostCommonFactors(seqFactors):
82. # First, get a count of how many times a factor occurs in seqFactors:
83. factorCounts = {} # Key is a factor; value is how often it occurs.
84.
85. # seqFactors keys are sequences; values are lists of factors of the
86. # spacings. seqFactors has a value like {'GFD': [2, 3, 4, 6, 9, 12,
87. # 18, 23, 36, 46, 69, 92, 138, 207], 'ALW': [2, 3, 4, 6, ...], ...}.
88. for seq in seqFactors:
89. factorList = seqFactors[seq]
90. for factor in factorList:
91. if factor not in factorCounts:
92. factorCounts[factor] = 0
93. factorCounts[factor] += 1
94.
95. # Second, put the factor and its count into a tuple and make a list
96. # of these tuples so we can sort them:
97. factorsByCount = []
98. for factor in factorCounts:
99. # Exclude factors larger than MAX_KEY_LENGTH:
100. if factor <= MAX_KEY_LENGTH:
101. # factorsByCount is a list of tuples: (factor, factorCount).
102. # factorsByCount has a value like [(3, 497), (2, 487), ...].
103. factorsByCount.append( (factor, factorCounts[factor]) )
104.
105. # Sort the list by the factor count:
106. factorsByCount.sort(key=getItemAtIndexOne, reverse=True)
107.
108. return factorsByCount
109.
110.
111. def kasiskiExamination(ciphertext):
112. # Find out the sequences of 3 to 5 letters that occur multiple times
113. # in the ciphertext. repeatedSeqSpacings has a value like
114. # {'EXG': [192], 'NAF': [339, 972, 633], ... }:
115. repeatedSeqSpacings = findRepeatSequencesSpacings(ciphertext)
116.
117. # (See getMostCommonFactors() for a description of seqFactors.)
118. seqFactors = {}
119. for seq in repeatedSeqSpacings:
120. seqFactors[seq] = []
121. for spacing in repeatedSeqSpacings[seq]:
122. seqFactors[seq].extend(getUsefulFactors(spacing))
123.
124. # (See getMostCommonFactors() for a description of factorsByCount.)
125. factorsByCount = getMostCommonFactors(seqFactors)
126.
127. # Now we extract the factor counts from factorsByCount and
128. # put them in allLikelyKeyLengths so that they are easier to
129. # use later:
130. allLikelyKeyLengths = []
131. for twoIntTuple in factorsByCount:
132. allLikelyKeyLengths.append(twoIntTuple[0])
133.
134. return allLikelyKeyLengths
135.
136.
137. def getNthSubkeysLetters(nth, keyLength, message):
138. # Returns every nth letter for each keyLength set of letters in text.
139. # E.g. getNthSubkeysLetters(1, 3, 'ABCABCABC') returns 'AAA'
140. # getNthSubkeysLetters(2, 3, 'ABCABCABC') returns 'BBB'
141. # getNthSubkeysLetters(3, 3, 'ABCABCABC') returns 'CCC'
142. # getNthSubkeysLetters(1, 5, 'ABCDEFGHI') returns 'AF'
143.
144. # Use a regular expression to remove non-letters from the message:
145. message = NONLETTERS_PATTERN.sub('', message.upper())
146.
147. i = nth - 1
148. letters = []
149. while i < len(message):
150. letters.append(message[i])
151. i += keyLength
152. return ''.join(letters)
153.
154.
155. def attemptHackWithKeyLength(ciphertext, mostLikelyKeyLength):
156. # Determine the most likely letters for each letter in the key:
157. ciphertextUp = ciphertext.upper()
158. # allFreqScores is a list of mostLikelyKeyLength number of lists.
159. # These inner lists are the freqScores lists:
160. allFreqScores = []
161. for nth in range(1, mostLikelyKeyLength + 1):
162. nthLetters = getNthSubkeysLetters(nth, mostLikelyKeyLength,
ciphertextUp)
163.
164. # freqScores is a list of tuples like
165. # [(<letter>, <Eng. Freq. match score>), ... ]
166. # List is sorted by match score. Higher score means better match.
167. # See the englishFreqMatchScore() comments in freqAnalysis.py.
168. freqScores = []
169. for possibleKey in LETTERS:
170. decryptedText = vigenereCipher.decryptMessage(possibleKey,
nthLetters)
171. keyAndFreqMatchTuple = (possibleKey,
freqAnalysis.englishFreqMatchScore(decryptedText))
172. freqScores.append(keyAndFreqMatchTuple)
173. # Sort by match score:
174. freqScores.sort(key=getItemAtIndexOne, reverse=True)
175.
176. allFreqScores.append(freqScores[:NUM_MOST_FREQ_LETTERS])
177.
178. if not SILENT_MODE:
179. for i in range(len(allFreqScores)):
180. # Use i + 1 so the first letter is not called the "0th" letter:
181. print('Possible letters for letter %s of the key: ' % (i + 1),
end='')
182. for freqScore in allFreqScores[i]:
183. print('%s ' % freqScore[0], end='')
184. print() # Print a newline.
185.
186. # Try every combination of the most likely letters for each position
187. # in the key:
188. for indexes in itertools.product(range(NUM_MOST_FREQ_LETTERS),
repeat=mostLikelyKeyLength):
189. # Create a possible key from the letters in allFreqScores:
190. possibleKey = ''
191. for i in range(mostLikelyKeyLength):
192. possibleKey += allFreqScores[i][indexes[i]][0]
193.
194. if not SILENT_MODE:
195. print('Attempting with key: %s' % (possibleKey))
196.
197. decryptedText = vigenereCipher.decryptMessage(possibleKey,
ciphertextUp)
198.
199. if detectEnglish.isEnglish(decryptedText):
200. # Set the hacked ciphertext to the original casing:
201. origCase = []
202. for i in range(len(ciphertext)):
203. if ciphertext[i].isupper():
204. origCase.append(decryptedText[i].upper())
205. else:
206. origCase.append(decryptedText[i].lower())
207. decryptedText = ''.join(origCase)
208.
209. # Check with user to see if the key has been found:
210. print('Possible encryption hack with key %s:' % (possibleKey))
211. print(decryptedText[:200]) # Only show first 200 characters.
212. print()
213. print('Enter D if done, anything else to continue hacking:')
214. response = input('> ')
215.
216. if response.strip().upper().startswith('D'):
217. return decryptedText
218.
219. # No English-looking decryption found, so return None:
220. return None
221.
222.
223. def hackVigenere(ciphertext):
224. # First, we need to do Kasiski examination to figure out what the
225. # length of the ciphertext's encryption key is:
226. allLikelyKeyLengths = kasiskiExamination(ciphertext)
227. if not SILENT_MODE:
228. keyLengthStr = ''
229. for keyLength in allLikelyKeyLengths:
230. keyLengthStr += '%s ' % (keyLength)
231. print('Kasiski examination results say the most likely key lengths
are: ' + keyLengthStr + '\n')
232. hackedMessage = None
233. for keyLength in allLikelyKeyLengths:
234. if not SILENT_MODE:
235. print('Attempting hack with key length %s (%s possible keys)...'
% (keyLength, NUM_MOST_FREQ_LETTERS ** keyLength))
236. hackedMessage = attemptHackWithKeyLength(ciphertext, keyLength)
237. if hackedMessage != None:
238. break
239.
240. # If none of the key lengths found using Kasiski examination
241. # worked, start brute-forcing through key lengths:
242. if hackedMessage == None:
243. if not SILENT_MODE:
244. print('Unable to hack message with likely key length(s). Brute-
forcing key length...')
245. for keyLength in range(1, MAX_KEY_LENGTH + 1):
246. # Don't recheck key lengths already tried from Kasiski:
247. if keyLength not in allLikelyKeyLengths:
248. if not SILENT_MODE:
249. print('Attempting hack with key length %s (%s possible
keys)...' % (keyLength, NUM_MOST_FREQ_LETTERS **
keyLength))
250. hackedMessage = attemptHackWithKeyLength(ciphertext,
keyLength)
251. if hackedMessage != None:
252. break
253. return hackedMessage
254.
255.
256. # If vigenereHacker.py is run (instead of imported as a module), call
257. # the main() function:
258. if __name__ == '__main__':
259. main()
When you run the vigenereHacker.py program, the output should look like this:
Kasiski examination results say the most likely key lengths are: 3 2 6 4 12
Attempting hack with key length 3 (27 possible keys)...
Possible letters for letter 1 of the key: A L M
Possible letters for letter 2 of the key: S N O
Possible letters for letter 3 of the key: V I Z
Attempting with key: ASV
Attempting with key: ASI
--snip--
Attempting with key: MOI
Attempting with key: MOZ
Attempting hack with key length 2 (9 possible keys)...
Possible letters for letter 1 of the key: O A E
Possible letters for letter 2 of the key: M S I
Attempting with key: OM
Attempting with key: OS
--snip--
Attempting with key: ES
Attempting with key: EI
Attempting hack with key length 6 (729 possible keys)...
Possible letters for letter 1 of the key: A E O
Possible letters for letter 2 of the key: S D G
Possible letters for letter 3 of the key: I V X
Possible letters for letter 4 of the key: M Z Q
Possible letters for letter 5 of the key: O B Z
Possible letters for letter 6 of the key: V I K
Attempting with key: ASIMOV
Possible encryption hack with key ASIMOV:
ALAN MATHISON TURING WAS A BRITISH MATHEMATICIAN, LOGICIAN, CRYPTANALYST, AND
COMPUTER SCIENTIST. HE WAS HIGHLY INFLUENTIAL IN THE DEVELOPMENT OF COMPUTER
SCIENCE, PROVIDING A FORMALISATION OF THE CON
Enter D for done, or just press Enter to continue hacking:
> d
Copying hacked message to clipboard:
Alan Mathison Turing was a British mathematician, logician, cryptanalyst, and
computer scientist. He was highly influential in the development of computer
--snip--
Let’s look at the source code for the Vigenère hacking program. The hacking program imports many different modules, including a new module named itertools, which you’ll learn more about soon:
1. # Vigenere Cipher Hacker
2. # https://www.nostarch.com/crackingcodes/ (BSD Licensed)
3.
4. import itertools, re
5. import vigenereCipher, pyperclip, freqAnalysis, detectEnglish
In addition, the program sets up several constants on lines 7 to 11, which I’ll explain later when they’re used in the program.
The main() function of the hacking program is similar to the main() functions in previous hacking functions:
14. def main():
15. # Instead of typing this ciphertext out, you can copy & paste it
16. # from https://www.nostarch.com/crackingcodes/:
17. ciphertext = """Adiz Avtzqeci Tmzubb wsa m Pmilqev halpqavtakuoi,
lgouqdaf, kdmktsvmztsl, izr xoexghzr kkusitaaf. Vz wsa twbhdg
ubalmmzhdad qz
--snip--
azmtmd'g widt ion bwnafz tzm Tcpsw wr Zjrva ivdcz eaigd yzmbo
Tmzubb a kbmhptgzk dvrvwz wa efiohzd."""
18. hackedMessage = hackVigenere(ciphertext)
19.
20. if hackedMessage != None:
21. print('Copying hacked message to clipboard:')
22. print(hackedMessage)
23. pyperclip.copy(hackedMessage)
24. else:
25. print('Failed to hack encryption.')
The ciphertext is passed to the hackVigenere() function, which either returns the decrypted string if the hack is successful or the None value if the hack fails. If successful, the program prints the hacked message to the screen and copies it to the clipboard.
The findRepeatSequencesSpacings() function accomplishes the first step of the Kasiski examination by locating all the repeated sequences of letters in the message string and counting the spacings between the sequences:
28. def findRepeatSequencesSpacings(message):
--snip--
33. # Use a regular expression to remove non-letters from the message:
34. message = NONLETTERS_PATTERN.sub('', message.upper())
35.
36. # Compile a list of seqLen-letter sequences found in the message:
37. seqSpacings = {} # Keys are sequences; values are lists of int spacings.
Line 34 converts the message to uppercase and removes any non-letter characters from message using the sub() regular expression method.
The seqSpacings dictionary on line 37 holds repeated sequence strings as its keys and a list with integers representing the number of letters between all the occurrences of that sequence as its values. For example, if we pass the 'PPQCAXQV...' example string as message, the findRepeatSequenceSpacings() function would return {'VRA': [8, 24, 32], 'AZU': [48], 'YBN': [8]}.
The for loop on line 38 checks whether each sequence repeats by finding the sequences in message and calculating the spacings between repeated sequences:
38. for seqLen in range(3, 6):
39. for seqStart in range(len(message) - seqLen):
40. # Determine what the sequence is and store it in seq:
41. seq = message[seqStart:seqStart + seqLen]
Figure 20-2: Values of from depending on the value in
On the first iteration of the loop, the code finds sequences that are exactly three letters long. On the next iteration, it finds sequences exactly four letters long, and then five letters long. You can change what sequence lengths the code searches for by modifying the range(3, 6) call on line 38; however, finding repeated sequences of length three, four, and five seems to work for most ciphertexts. The reason is that these are long enough that repeats in the ciphertext aren’t likely to be coincidence but short enough that repeats are likely to be found. The sequence length the for loop is currently checking is stored in seqLen.
The for loop on line 39 slices message into every possible substring of length seqLen. We’ll use this for loop to determine the start of the slice and slice message into a substring seqLen characters long. For example, if seqLen is 3 and message is 'PPQCAXQ', we would want to start at the first index, which is 0, and slice three characters to get the substring 'PPQ'. Then we would want to go to the next index, which is 1, and slice three characters to get the substring 'PQC'. We need to do this for every index up to the last three characters, which is the index equivalent to len(message) – seqLen. Doing this, you would get the sequences shown in Figure 20-2.
The for loop on line 39 loops through every index up to len(message) – seqLen and assigns the current index to start the substring slice to the variable seqStart. After we have the starting index, line 41 sets the seq variable to the substring slice.
We’ll search through the message for repeats of that slice using the for loop on line 44.
43. # Look for this sequence in the rest of the message:
44. for i in range(seqStart + seqLen, len(message) - seqLen):
45. if message[i:i + seqLen] == seq:
The for loop on line 44 is inside the for loop on line 39 and sets i to be the indexes of every possible sequence of length seqLen in message. These indexes start at seqStart + seqLen, or after the sequence currently in seq, and go up to len(message) - seqLen, which is the last index where a sequence of length seqLen can be found. For example, if message was 'PPQCAXQVEKGYBNKMAZUYBN', seqStart was 11, and seqLen was 3, line 41 would set seq to 'YBN'. The for loop would begin looking at message starting from index 14.
The expression message[i:i + seqLen] on line 45 evaluates to a substring of message, which is compared to seq to check whether the substring is a repeat of seq. If it is, lines 46 to 52 calculate the spacing and add it to the seqSpacings dictionary. On the first iteration, line 45 compares 'KMA' to seq, then 'MAZ' to seq on the next iteration, then 'AZU' to seq on the next, and so on. When i is 19, line 45 finds 'YBN' is equal to seq, and the execution runs lines 46 to 52:
46. # Found a repeated sequence:
47. if seq not in seqSpacings:
48. seqSpacings[seq] = [] # Initialize blank list.
49.
50. # Append the spacing distance between the repeated
51. # sequence and the original sequence:
52. seqSpacings[seq].append(i - seqStart)
Lines 47 and 48 check whether the seq variable exists as a key in seqSpacings. If not, seqSpacings[seq] is set as a key with a blank list as its value.
The number of letters between the sequence at message[i:i + seqLen] and the original sequence at message[seqStart:seqStart+seqLen] is simply i - seqStart. Notice that i and seqStart are the beginning indexes before the colons. So the integer that i - seqStart evaluates to is the number of letters between the two sequences, which we append to the list stored at seqSpacings[seq].
When all these for loops have finished, the seqSpacings dictionary should contain every repeated sequence of length 3, 4, and 5 as well as the number of letters between repeated sequences. The seqSpacings dictionary is returned from findRepeatSequencesSpacings() on line 53:
53. return seqSpacings
Now that you’ve seen how the program performs the first step of the Kasiski examination by finding repeated sequences in the ciphertext and counting the number of letters between them, let’s look at how the program conducts the next step of the Kasiski examination.
Recall that the next step of the Kasiski examination involves finding the factors of the spacings. We’re looking for factors between 2 and MAX_KEY_LENGTH in length. To do this, we’ll create the getUsefulFactors() function, which takes a num parameter and returns a list of only those factors that meet this criteria.
56. def getUsefulFactors(num):
--snip--
61. if num < 2:
62. return [] # Numbers less than 2 have no useful factors.
63.
64. factors = [] # The list of factors found.
Line 61 checks for the special case where num is less than 2. In this case, line 62 returns the empty list because num would have had no useful factors if it were less than 2.
If num is larger than 2, we would need to calculate all the factors of num and store them in a list. At line 64, we create an empty list called factors to store the factors.
The for loop on line 68 loops through the integers from 2 up to MAX_KEY_LENGTH, including the value in MAX_KEY_LENGTH. Remember that because range() causes the for loop to iterate up to but not including the second argument, we pass MAX_KEY_LENGTH + 1 so that MAX_KEY_LENGTH is included. This loop finds all the factors of num.
68. for i in range(2, MAX_KEY_LENGTH + 1): # Don't test 1: it's not useful.
69. if num % i == 0:
70. factors.append(i)
71. otherFactor = int(num / i)
Line 69 tests whether num % i is equal to 0; if it is, we know that i divides num evenly with no remainder, which means i is a factor of num. In this case, line 70 appends i to the list of factors in the factors variable. Because we know that num / i must also divide num evenly, line 71 stores the integer form of it in otherFactor. (Remember that the / operator always evaluates to a float value, such as 21 / 7 evaluating to the float 3.0 instead of the int 3.) If the resulting value is 1, the program doesn’t include it in the factors list, so line 72 checks for this case:
72. if otherFactor < MAX_KEY_LENGTH + 1 and otherFactor != 1:
73. factors.append(otherFactor)
Line 73 appends the value if it isn’t 1. We exclude 1 because if the Vigenère key had a length of 1, the Vigenère cipher would be no different than the Caesar cipher!
Recall that we need to know the most common factor as part of the Kasiski examination because the most common factor will almost certainly be the length of the Vigenère key. However, before we can analyze the frequency of each factor, we’ll need to use the set() function to remove any duplicate factors from the factors list. For example, if getUsefulFactors() was passed 9 for the num parameter, then 9 % 3 == 0 would be True and both i and otherFactor would have been appended to factors. But both i and int(num / i) are equal to 3, so 3 would be appended to the list twice. To prevent duplicate numbers, we can pass the list to set(), which returns a list as a set data type. The set data type is similar to the list data type except a set value can only contain unique values.
You can pass any list value to the set() function to get a set value that doesn’t have any duplicate values in it. Conversely, if you pass a set value to list(), it would return a list value version of the set. To see examples of this, enter the following into the interactive shell:
>>> set([1, 2, 3, 3, 4])
set([1, 2, 3, 4])
>>> spam = list(set([2, 2, 2, 'cats', 2, 2]))
>>> spam
[2, 'cats']
Any repeated list values are removed when a list is converted to a set. Even when a set converted from a list is reconverted to a list, it will still not have any repeated values.
Line 74 passes the list value in factors to set() to remove any duplicate factors:
74. return list(set(factors)) # Remove duplicate factors.
The function getItemAtIndexOne() on line 77 is almost identical to getItemAtIndexZero() in the freqAnalysis.py program you wrote in Chapter 19 (see “Getting the First Member of a Tuple” on page 268):
77. def getItemAtIndexOne(x):
78. return x[1]
This function will be passed to sort() later in the program to sort based on the item at index 1 of the items being sorted.
To find the most common factors, which are the most likely key lengths, we need to write the getMostCommonFactors() function, which begins on line 81.
81. def getMostCommonFactors(seqFactors):
82. # First, get a count of how many times a factor occurs in seqFactors:
83. factorCounts = {} # Key is a factor; value is how often it occurs.
The seqFactors parameter on line 81 takes a dictionary value created using the kasiskiExamination() function, which I’ll explain shortly. This dictionary has strings of sequences as its keys and a list of integer factors as the value of each key. (These are factors of the spacing integers that findRepeatSequencesSpacings() returned previously.) For example, seqFactors could contain a dictionary value that looks something like this:
{'VRA': [8, 2, 4, 2, 3, 4, 6, 8, 12, 16, 8, 2, 4], 'AZU': [2, 3, 4, 6, 8, 12,
16, 24], 'YBN': [8, 2, 4]}
The getMostCommonFactors() function orders the most common factors in seqFactors from the most frequently occurring to the least occurring and returns them as a list of two-integer tuples. The first integer in the tuple is the factor, and the second integer is how many times it appeared in seqFactors.
For example, getMostCommonFactors() might return a list value, such as this:
[(3, 556), (2, 541), (6, 529), (4, 331), (12, 325), (8, 171), (9, 156), (16,
105), (5, 98), (11, 86), (10, 84), (15, 84), (7, 83), (14, 68), (13, 52)]
This list shows that in the seqFactors dictionary that was passed to getMostCommonFactors(), the factor 3 occurred 556 times, the factor 2 occurred 541 times, the factor 6 occurred 529 times, and so on. Note that 3 appears first in the list because it’s the most frequent factor; 13 is the least frequent factor and therefore is last in the list.
For the first step of getMostCommonFactors(), we’ll set up the factorCounts dictionary on line 83, which we’ll use to store the counts of each factor. The key of factorCounts will be the factor, and the values associated with the keys will be the counts of those factors.
Next, the for loop on line 88 loops over every sequence in seqFactors, storing it in a variable named seq on each iteration. The list of factors in seqFactors for seq is stored in a variable named factorList on line 89:
88. for seq in seqFactors:
89. factorList = seqFactors[seq]
90. for factor in factorList:
91. if factor not in factorCounts:
92. factorCounts[factor] = 0
93. factorCounts[factor] += 1
The factors in this list are looped over with a for loop on line 90. If a factor doesn’t exist as a key in factorCounts, it’s added on line 92 with a value of 0. Line 93 increments factorCounts[factor], which is the factor’s value in factorCounts.
For the second step of getMostCommonFactors(), we need to sort the values in the factorCounts dictionary by their count. But because dictionary values aren’t ordered, we must first convert the dictionary to a list of two-integer tuples. (We did something similar in Chapter 19 in the getFrequencyOrder() function in the freqAnalaysis.py module.) We store this list value in a variable named factorsByCount, which starts as an empty list on line 97:
97. factorsByCount = []
98. for factor in factorCounts:
99. # Exclude factors larger than MAX_KEY_LENGTH:
100. if factor <= MAX_KEY_LENGTH:
101. # factorsByCount is a list of tuples: (factor, factorCount).
102. # factorsByCount has a value like [(3, 497), (2, 487), ...].
103. factorsByCount.append( (factor, factorCounts[factor]) )
Then the for loop on line 98 goes through each of the factors in factorCounts and appends this (factor, factorCounts[factor]) tuple to the factorsByCount list only if the factor is less than or equal to MAX_KEY_LENGTH.
After the for loop finishes adding all the tuples to factorsByCount, line 106 sorts the factorsByCount as the final step of the getMostCommonFactors() function.
106. factorsByCount.sort(key=getItemAtIndexOne, reverse=True)
107.
108. return factorsByCount
Because the getItemAtIndexOne function is passed for the key keyword argument and True is passed for the reverse keyword argument, the list is sorted by the factor counts in descending order. Line 108 returns the sorted list in factorsByCount, which should indicate which factors appear most frequently and therefore are most likely to be the Vigenère key lengths.
Before we can figure out what the possible subkeys are for a ciphertext, we need to know how many subkeys there are. That is, we need to know the length of the key. The kasiskiExamination() function on line 111 returns a list of the most likely key lengths for the given ciphertext argument.
111. def kasiskiExamination(ciphertext):
--snip--
115. repeatedSeqSpacings = findRepeatSequencesSpacings(ciphertext)
The key lengths are integers in a list; the first integer in the list is the most likely key length, the second integer the second most likely, and so on.
The first step in finding the key length is to find the spacings between repeated sequences in the ciphertext. This is returned from the function findRepeatSequencesSpacings() as a dictionary with the sequence strings as its keys and a list with the spacings as integers as its values. The function findRepeatSequencesSpacings() was described previously in “Finding Repeated Sequences” on page 294.
Before continuing with the next lines of code, you’ll need to learn about the extend() list method.
When you need to add multiple values to the end of a list, there is an easier way than calling append() inside a loop. The extend() list method can add values to the end of a list, similar to the append() list method. When passed a list, the append() method adds the entire list as one item to the end of another list, like this:
>>> spam = ['cat', 'dog', 'mouse']
>>> eggs = [1, 2, 3]
>>> spam.append(eggs)
>>> spam
['cat', 'dog', 'mouse', [1, 2, 3]]
In contrast, the extend() method adds each item in the list argument to the end of a list. Enter the following into the interactive shell to see an example:
>>> spam = ['cat', 'dog', 'mouse']
>>> eggs = [1, 2, 3]
>>> spam.extend(eggs)
>>> spam
['cat', 'dog', 'mouse', 1, 2, 3]
As you can see, all the values in eggs (1, 2, and 3) are appended to spam as discrete items.
Although repeatedSeqSpacings is a dictionary that maps sequence strings to lists of integer spacings, we actually need a dictionary that maps sequence strings to lists of factors of those integer spacings. (See “Getting Factors of Spacings” on page 283 for the reason why.) Lines 118 to 122 do this:
118. seqFactors = {}
119. for seq in repeatedSeqSpacings:
120. seqFactors[seq] = []
121. for spacing in repeatedSeqSpacings[seq]:
122. seqFactors[seq].extend(getUsefulFactors(spacing))
Line 118 starts with an empty dictionary in seqFactors. The for loop on line 119 iterates over every key, which is a sequence string, in the dictionary repeatedSeqSpacings. For each key, line 120 sets a blank list to be the value in seqFactors.
The for loop on line 121 iterates over all the spacings integers by passing each to a getUsefulFactors() call. Each of the items in the list returned from getUsefulFactors() is added to seqFactors[seq] using the extend() method. When all the for loops are finished, seqFactors should be a dictionary that maps sequence strings to lists of factors of integer spacings. This allows us to have the factors of the spacings, not just the spacings.
Line 125 passes the seqFactors dictionary to the getMostCommonFactors() function and returns a list of two-integer tuples whose first integer represents the factor and whose second integer shows how often that factor appears in seqFactors. Then the tuple gets stored in factorsByCount.
125. factorsByCount = getMostCommonFactors(seqFactors)
But we want the kasiskiExamination() function to return a list of integer factors, not a list of tuples with factors and the count of how often they appeared. Because these factors are stored as the first item of the two-integer tuples list in factorsByCount, we need to pull these factors from the tuples and put them in a separate list.
Lines 130 to 134 store the separate list of factors in allLikelyKeyLengths.
130. allLikelyKeyLengths = []
131. for twoIntTuple in factorsByCount:
132. allLikelyKeyLengths.append(twoIntTuple[0])
133.
134. return allLikelyKeyLengths
The for loop on line 131 iterates over each of the tuples in factorsByCount and appends the tuple’s index 0 item to the end of allLikelyKeyLengths. After this for loop completes, the allLikelyKeyLengths variable should contain all the integer factors in factorsByCount, which gets returned as a list from kasiskiExamination().
Although we now have the ability to find the likely key lengths the message was encrypted with, we need to be able to separate letters from the message that were encrypted with the same subkey. Recall that encrypting 'THEDOGANDTHECAT' with the key 'XYZ' ends up using the 'X' from the key to encrypt the message letters at index 0, 3, 6, 9, and 12. Because these letters from the original English message are encrypted with the same subkey ('X'), the decrypted text should have a letter frequency count similar to English. We can use this information to figure out the subkey.
To pull out the letters from a ciphertext that were encrypted with the same subkey, we need to write a function that can create a string using the 1st, 2nd, or nth letters of a message. After the function has the starting index, the key length, and the message passed to it, the first step is to remove the non-letter characters from message using a regular expression object and its sub() method on line 145.
NOTE
Regular expressions are discussed in “Finding Characters with Regular Expressions” on page 230.
This letters-only string is then stored as the new value in message:
137. def getNthSubkeysLetters(nth, keyLength, message):
--snip--
145. message = NONLETTERS_PATTERN.sub('', message.upper())
Next, we build a string by appending the letter strings to a list and then use join() to merge the list into a single string:
147. i = nth - 1
148. letters = []
149. while i < len(message):
150. letters.append(message[i])
151. i += keyLength
152. return ''.join(letters)
The i variable points to the index of the letter in message that you want to append to the string-building list, which is stored in a variable named letters. The i variable starts with the value nth - 1 on line 147, and the letters variable starts with a blank list on line 148.
The while loop on line 149 continues to run as long as i is less than the length of message. On each iteration, the letter at message[i] is appended to the list in letters. Then i is updated to point to the next letter in the subkey by adding keyLength to i on line 151.
After this loop finishes, line 152 joins the single-letter string values in the letters list into a one string, and this string is returned from getNthSubkeysLetters().
Now that we can pull out letters that were encrypted with the same subkey, we can use getNthSubkeysLetters() to try decrypting with some potential key lengths.
Recall that the kasiskiExamination() function isn’t guaranteed to return the actual length of the Vigenère key but rather a list of several possible lengths sorted in order of most likely to least likely key length. If the code has determined the wrong key length, it will try again using a different key length. The attemptHackWithKeyLength() function does this when passed the ciphertext and the determined key length. If successful, this function returns a string of the hacked message. If the hacking fails, the function returns None.
The hacking code works only on uppercase letters, but we want to return any decrypted string with its original casing, so we need to preserve the original string. To do this, we store the uppercase form of the ciphertext string in a separate variable named ciphertextUp on line 157.
155. def attemptHackWithKeyLength(ciphertext, mostLikelyKeyLength):
156. # Determine the most likely letters for each letter in the key:
157. ciphertextUp = ciphertext.upper()
If we assume the value in the mostLikelyKeyLength is the correct key length, the hacking algorithm calls getNthSubkeysLetters() for each subkey and then brute-forces through the 26 possible letters to find the one that produces decrypted text whose letter frequency most closely matches the letter frequency of English for that subkey.
First, an empty list is stored in allFreqScores on line 160, which will store the frequency match scores returned by freqAnalysis.englishFreqMatchScore():
160. allFreqScores = []
161. for nth in range(1, mostLikelyKeyLength + 1):
162. nthLetters = getNthSubkeysLetters(nth, mostLikelyKeyLength,
ciphertextUp)
The for loop on line 161 sets the nth variable to each integer from 1 to the mostLikelyKeyLength value. Recall that when range() is passed two arguments, the range goes up to, but not including, the second argument. The + 1 is put into the code so the integer value in mostLikelyKeyLength is included in the range object returned.
The letters of the nth subkey are returned from getNthSubkeysLetters() on line 162.
Next, we need to decrypt the letters of the nth subkey with all 26 possible subkeys to see which ones produce English-like letter frequencies. A list of English frequency match scores is stored in a list in a variable named freqScores. This variable starts as an empty list on line 168 and then the for loop on line 169 loops through each of the 26 uppercase letters from the LETTERS string:
168. freqScores = []
169. for possibleKey in LETTERS:
170. decryptedText = vigenereCipher.decryptMessage(possibleKey,
nthLetters)
The possibleKey value decrypts the ciphertext by calling vigenereCipher.decryptMessage() on line 170. The subkey in possibleKey is only one letter, but the string in nthLetters is made up of only the letters from message that would have been encrypted with that subkey if the code has determined the key length correctly.
The decrypted text is then passed to freqAnalysis.englishFreqMatchScore() to see how closely the frequency of the letters in decryptedText matches the letter frequency of regular English. As you learned in Chapter 19, the return value is an integer between 0 and 12: recall that a higher number means a closer match.
Line 171 puts this frequency match score and the key used to decrypt into a tuple and stores it in the keyAndFreqMatchTuple variable. This tuple is appended to the end of freqScores on line 172:
171. keyAndFreqMatchTuple = (possibleKey,
freqAnalysis.englishFreqMatchScore(decryptedText))
172. freqScores.append(keyAndFreqMatchTuple)
After the for loop on line 169 completes, the freqScores list should contain 26 key-and-frequency-match-score tuples: one tuple for each of the 26 subkeys. We need to sort this list so the tuples with the largest English frequency match scores come first. This means that we want to sort the tuples in freqScores by the value at index 1 and in reverse (descending) order.
We call the sort() method on the freqScores list, passing the function value getItemAtIndexOne for the key keyword argument. Note that we’re not calling the function, as you can tell from the lack of parentheses. The value True is passed for the reverse keyword argument to sort in descending order.
174. freqScores.sort(key=getItemAtIndexOne, reverse=True)
Initially, the NUM_MOST_FREQ_LETTERS constant was set to the integer value 4 on line 9. After sorting the tuples in freqScores in reverse order, line 176 appends a list containing only the first three tuples, or the tuples with the three highest English frequency match scores, to allFreqScores. As a result, allFreqScores[0] contains the frequency scores for the first subkey, allFreqScores[1] contains the frequency scores for the second subkey, and so on.
176. allFreqScores.append(freqScores[:NUM_MOST_FREQ_LETTERS])
After the for loop on line 161 completes, allFreqScores should contain a number of list values equal to the integer value in mostLikelyKeyLength. For example, if mostLikelyKeyLength was 3, allFreqScores would be a list of three lists. The first list value holds the tuples for the top three highest matching subkeys for the first subkey of the full Vigenère key. The second list value holds the tuples for the top three highest matching subkeys for the second subkey of the full Vigenère key, and so on.
Originally, if we wanted to brute-force through the full Vigenère key, the number of possible keys would be 26 raised to the power of key length. For example, if the key was ROSEBUD with a length of 7, there would be 267, or 8,031,810,176, possible keys.
But checking the English frequency matching helped determine the four most likely letters for each subkey. Continuing with the ROSEBUD example, this means that we only need to check 47, or 16,384, possible keys, which is a huge improvement over 8 billion possible keys!
Next, we want to print output to the user. To do this, we use print() but pass an argument to an optional parameter we haven’t used before. Whenever the print() function is called, it prints to the screen the string passed to it along with a newline character. To print something else at the end of the string instead of a newline character, we can specify the string for the print() function’s end keyword argument. Enter the following into the interactive shell to see how to use the print() function’s end keyword argument:
>>> def printStuff():
➊ print('Hello', end='\n')
➋ print('Howdy', end='')
➌ print('Greetings', end='XYZ')
print('Goodbye')
>>> printStuff()
Hello
HowdyGreetingsXYZGoodbye
Passing end='\n' prints the string normally ➊. However, passing end='' ➋ or end='XYZ' ➌ replaces the usual newline character, so subsequent print() calls are not displayed on a new line.
At this point, we want to know which letters are the top three candidates for each subkey. If the SILENT_MODE constant was set to False earlier in the program, the code on lines 178 to 184 would print the values in allFreqScores to the screen:
178. if not SILENT_MODE:
179. for i in range(len(allFreqScores)):
180. # Use i + 1 so the first letter is not called the "0th" letter:
181. print('Possible letters for letter %s of the key: ' % (i + 1),
end='')
182. for freqScore in allFreqScores[i]:
183. print('%s ' % freqScore[0], end='')
184. print() # Print a newline.
If SILENT_MODE were set to True, the code in the if statement’s block would be skipped.
We’ve now reduced the number of subkeys to a small enough number that we can brute-force all of them. Next, you’ll learn how to use the itertools.product() function to generate every possible combination of subkeys to brute-force.
Now that we have possible subkeys, we need to put them together to find the whole key. The problem is that, even though we’ve found letters for each subkey, the most likely letter might not actually be the right letter. Instead, the second most likely or third most likely letter might be the right subkey letter. This means we can’t just combine the most likely letters for each subkey into one key: we need to try different combinations of likely letters to find the right key.
The vigenereHacker.py program uses the itertools.product() function to test every possible combination of subkeys.
The itertools.product() function produces every possible combination of items in a list or list-like value, such as a string or tuple. Such a combination of items is called a Cartesian product, which is where the function gets its name. The function returns an itertools product object value, which can also be converted to a list by passing it to list(). Enter the following into the interactive shell to see an example:
>>> import itertools
>>> itertools.product('ABC', repeat=4)
➊ <itertools.product object at 0x02C40170>
>>> list(itertools.product('ABC', repeat=4))
[('A', 'A', 'A', 'A'), ('A', 'A', 'A', 'B'), ('A', 'A', 'A', 'C'), ('A', 'A',
'B', 'A'), ('A', 'A', 'B', 'B'), ('A', 'A', 'B', 'C'), ('A', 'A', 'C', 'A'),
('A', 'A', 'C', 'B'), ('A', 'A', 'C', 'C'), ('A', 'B', 'A', 'A'), ('A', 'B',
'A', 'B'), ('A', 'B', 'A', 'C'), ('A', 'B', 'B', 'A'), ('A', 'B', 'B', 'B'),
--snip--
('C', 'B', 'C', 'B'), ('C', 'B', 'C', 'C'), ('C', 'C', 'A', 'A'), ('C', 'C',
'A', 'B'), ('C', 'C', 'A', 'C'), ('C', 'C', 'B', 'A'), ('C', 'C', 'B', 'B'),
('C', 'C', 'B', 'C'), ('C', 'C', 'C', 'A'), ('C', 'C', 'C', 'B'), ('C', 'C',
'C', 'C')]
Passing 'ABC' and the integer 4 for the repeat keyword argument to itertools.product() returns an itertools product object ➊ that, when converted to a list, has tuples of four values with every possible combination of 'A', 'B', and 'C'. This results in a list that has a total of 34, or 81, tuples.
You can also pass list values to itertools.product() and some values similar to lists, such as range objects returned from range(). Enter the following into the interactive shell to see what happens when lists and objects like lists are passed to this function:
>>> import itertools
>>> list(itertools.product(range(8), repeat=5))
[(0, 0, 0, 0, 0), (0, 0, 0, 0, 1), (0, 0, 0, 0, 2), (0, 0, 0, 0, 3), (0, 0, 0,
0, 4), (0, 0, 0, 0, 5), (0, 0, 0, 0, 6), (0, 0, 0, 0, 7), (0, 0, 0, 1, 0), (0,
0, 0, 1, 1), (0, 0, 0, 1, 2), (0, 0, 0, 1, 3), (0, 0, 0, 1, 4),
--snip--
(7, 7, 7, 6, 6), (7, 7, 7, 6, 7), (7, 7, 7, 7, 0), (7, 7, 7, 7, 1), (7, 7, 7,
7, 2), (7, 7, 7, 7, 3), (7, 7, 7, 7, 4), (7, 7, 7, 7, 5), (7, 7, 7, 7, 6), (7,
7, 7, 7, 7)]
When the range object returned from range(8) is passed to itertools.product(), along with 5 for the repeat keyword argument, it generates a list that has tuples of five values, integers ranging from 0 to 7.
We can’t just pass itertools.product() a list of the potential subkey letters, because the function creates combinations of the same values and each of the subkeys will probably have different potential letters. Instead, because our subkeys are stored in tuples in allFreqScores, we’ll access those letters by index values, which will range from 0 to the number of letters we want to try minus 1. We know that the number of letters in each tuple is equal to NUM_MOST_FREQ_LETTERS because we only stored that number of potential letters in each tuple earlier on line 176. So the range of indexes we’ll need to access is from 0 to NUM_MOST_FREQ_LETTERS, which is what we’ll pass to itertools.product(). We’ll also pass itertools.product() a likely key length as a second argument, creating tuples as long as the potential key length.
For example, if we wanted to try only the first three most likely letters of each subkey (which is determined by NUM_MOST_FREQ_LETTERS) for a key that’s likely five letters long, the first value itertools.product() would produce would be (0, 0, 0, 0, 0). The next value would be (0, 0, 0, 0, 1), then (0, 0, 0, 0, 2), and values would be generated until reaching (2, 2, 2, 2, 2). Each integer in the five-value tuples represents an index to allFreqScores.
The value in allFreqScores is a list that holds the most likely letters of each subkey along with their frequency match scores. To see how this list works, let’s create a hypothetical allFreqScores value in IDLE. For example, allFreqScores might look like this for a six-letter key where we found the four most likely letters for each subkey:
>>> allFreqScores = [[('A', 9), ('E', 5), ('O', 4), ('P', 4)], [('S', 10),
('D', 4), ('G', 4), ('H', 4)], [('I', 11), ('V', 4), ('X', 4), ('B', 3)],
[('M', 10), ('Z', 5), ('Q', 4), ('A', 3)], [('O', 11), ('B', 4), ('Z', 4),
('A', 3)], [('V', 10), ('I', 5), ('K', 5), ('Z', 4)]]
This may look complex, but we can drill down to specific values of the lists and tuples with indexing. When allFreqScores is accessed at an index, it evaluates to a list of tuples of possible letters for a single subkey and their frequency match scores. For example, allFreqScores[0] has a list of tuples for the first subkey along with the frequency match scores of each potential subkey, allFreqScores[1] has a list of tuples for the second subkey and frequency match scores, and so on:
>>> allFreqScores[0]
[('A', 9), ('E', 5), ('O', 4), ('P', 4)]
>>> allFreqScores[1]
[('S', 10), ('D', 4), ('G', 4), ('H', 4)]
You can also access each tuple of the possible letters for each subkey by adding an additional index reference. For example, we would get a tuple of the most likely letter to be the second subkey and its frequency match score if we accessed allFreqScores[1][0], the second most likely letter from allFreqScores[1][1], and so on:
>>> allFreqScores[1][0]
('S', 10)
>>> allFreqScores[1][1]
('D', 4)
Because these values are tuples, we would need to access the first value in the tuple to get just the possible letter without its frequency match score value. Each letter is stored in the first index of the tuples, so we would use allFreqScores[1][0][0] to access the most likely letter of the first subkey, allFreqScores[1][1][0] to access the most likely letter of the second subkey, and so on:
>>> allFreqScores[1][0][0]
'S'
>>> allFreqScores[1][1][0]
'D'
Once you’re able to access potential subkeys in allFreqScores, you need to combine them to find potential keys.
The tuples produced by itertools.product() each represent one key where the position in the tuple corresponds to the first index we access in allFreqScores, and the integers in the tuple represent the second index we access in allFreqScores.
Because we set the NUM_MOST_FREQ_LETTERS constant to 4 earlier, itertools.product(range(NUM_MOST_FREQ_LETTERS), repeat=mostLikelyKeyLength) on line 188 causes the for loop to have a tuple of integers (from 0 to 3) representing the four most likely letters for each subkey for the indexes variable:
188. for indexes in itertools.product(range(NUM_MOST_FREQ_LETTERS),
repeat=mostLikelyKeyLength):
189. # Create a possible key from the letters in allFreqScores:
190. possibleKey = ''
191. for i in range(mostLikelyKeyLength):
192. possibleKey += allFreqScores[i][indexes[i]][0]
We construct full Vigenère keys using indexes, which takes the value of one tuple created by itertools.product() on each iteration. The key starts as a blank string on line 190, and the for loop on line 191 iterates through the integers from 0 up to, but not including, mostLikelyKeyLength for each tuple to construct a key.
As the i variable changes for each iteration of the for loop, the value at indexes[i] is the index of the tuple we want to use in allFreqScores[i]. This is why allFreqScores[i][indexes[i]] evaluates to the correct tuple we want. When we have the correct tuple, we need to access index 0 in that tuple to get the subkey letter.
If SILENT_MODE is False, line 195 prints the key that was created by the for loop on line 191:
194. if not SILENT_MODE:
195. print('Attempting with key: %s' % (possibleKey))
Now that we have a complete Vigenère key, lines 197 to 208 decrypt the ciphertext and check whether the decrypted text is readable English. If it is, the program prints it to the screen for the user to confirm that it is indeed English to check for false positives.
Because decryptedText is in uppercase, lines 201 to 207 build a new string by appending an uppercase or lowercase form of the letters in decryptedText to the origCase list:
197. decryptedText = vigenereCipher.decryptMessage(possibleKey,
ciphertextUp)
198.
199. if detectEnglish.isEnglish(decryptedText):
200. # Set the hacked ciphertext to the original casing:
201. origCase = []
202. for i in range(len(ciphertext)):
203. if ciphertext[i].isupper():
204. origCase.append(decryptedText[i].upper())
205. else:
206. origCase.append(decryptedText[i].lower())
207. decryptedText = ''.join(origCase)
The for loop on line 202 goes through each of the indexes in the ciphertext string, which, unlike ciphertextUp, has the original casing of the ciphertext. If ciphertext[i] is uppercase, the uppercase form of decryptedText[i] is appended to origCase. Otherwise, the lowercase form of decryptedText[i] is appended. The list in origCase is then joined on line 207 to become the new value of decryptedText.
The next lines of code print the decryption output to the user to check whether the key has been found:
210. print('Possible encryption hack with key %s:' % (possibleKey))
211. print(decryptedText[:200]) # Only show first 200 characters.
212. print()
213. print('Enter D if done, anything else to continue hacking:')
214. response = input('> ')
215.
216. if response.strip().upper().startswith('D'):
217. return decryptedText
The correctly cased decrypted text is printed to the screen for the user to confirm it is English. If the user enters 'D', the function returns the decryptedText string.
Otherwise, if none of the decryptions look like English, the hacking has failed and the None value is returned:
220. return None
Finally, all of the functions we’ve defined will be used by the hackVigenere() function, which accepts a ciphertext string as an argument and returns the hacked message (if hacking was successful) or None (if it wasn’t). It starts by getting the likely key lengths with kasiskiExamination():
223. def hackVigenere(ciphertext):
224. # First, we need to do Kasiski examination to figure out what the
225. # length of the ciphertext's encryption key is:
226. allLikelyKeyLengths = kasiskiExamination(ciphertext)
The hackVignere() function’s output depends on whether the program is in SILENT_MODE:
227. if not SILENT_MODE:
228. keyLengthStr = ''
229. for keyLength in allLikelyKeyLengths:
230. keyLengthStr += '%s ' % (keyLength)
231. print('Kasiski examination results say the most likely key lengths
are: ' + keyLengthStr + '\n')
The likely key lengths are printed to the screen if SILENT_MODE is False.
Next, we need to find likely subkey letters for each key length. We’ll do that with another loop that attempts to hack the cipher with each key length we found.
We want the code to continue looping and checking key lengths until it finds a potentially correct key length. When it finds a key length that seems correct, we’ll stop the loop with a break statement.
Similar to how the continue statement is used inside a loop to go back to the start of the loop, the break statement is used inside a loop to immediately exit the loop. When the program execution breaks out of a loop, it immediately moves to the first line of code after the loop ends. We’ll break out of the loop whenever the program finds a potentially correct key and needs to ask the user to confirm that the key is correct.
232. hackedMessage = None
233. for keyLength in allLikelyKeyLengths:
234. if not SILENT_MODE:
235. print('Attempting hack with key length %s (%s possible keys)...'
% (keyLength, NUM_MOST_FREQ_LETTERS ** keyLength))
236. hackedMessage = attemptHackWithKeyLength(ciphertext, keyLength)
237. if hackedMessage != None:
238. break
For each possible key length, the code calls attemptHackWithKeyLength() on line 236. If attemptHackWithKeyLength() does not return None, the hack is successful, and the program execution should break out of the for loop on line 238.
If the hack fails all the possible key lengths that kasiskiExamination() returned, hackedMessage is set to None when the if statement on line 242 executes. In this case, all the other key lengths up to MAX_KEY_LENGTH are tried. If Kasiski examination fails to calculate the correct key length, we can just brute-force through the key lengths with the for loop on line 245:
242. if hackedMessage == None:
243. if not SILENT_MODE:
244. print('Unable to hack message with likely key length(s). Brute-
forcing key length...')
245. for keyLength in range(1, MAX_KEY_LENGTH + 1):
246. # Don't recheck key lengths already tried from Kasiski:
247. if keyLength not in allLikelyKeyLengths:
248. if not SILENT_MODE:
249. print('Attempting hack with key length %s (%s possible
keys)...' % (keyLength, NUM_MOST_FREQ_LETTERS **
keyLength))
250. hackedMessage = attemptHackWithKeyLength(ciphertext,
keyLength)
251. if hackedMessage != None:
252. break
Line 245 starts a for loop that calls attemptHackWithKeyLength() for each value of keyLength (which ranges from 1 to MAX_KEY_LENGTH) as long as it’s not in allLikelyKeyLengths. The reason is that the key lengths in allLikelyKeyLengths have already been tried in the code on lines 233 to 238.
Finally, the value in hackedMessage is returned on line 253:
253. return hackedMessage
Lines 258 and 259 call the main() function if this program was run by itself rather than being imported by another program:
256. # If vigenereHacker.py is run (instead of imported as a module), call
257. # the main() function:
258. if __name__ == '__main__':
259. main()
That’s the full Vigenère hacking program. Whether it’s successful depends on the characteristics of the ciphertext. The closer the original plaintext’s letter frequency is to regular English’s letter frequency and the longer the plaintext, the more likely the hacking program will work.
We can modify a few details if the hacking program doesn’t work. Three constants we set on lines 8 to 10 affect how the hacking program runs:
8. MAX_KEY_LENGTH = 16 # Will not attempt keys longer than this.
9. NUM_MOST_FREQ_LETTERS = 4 # Attempt this many letters per subkey.
10. SILENT_MODE = False # If set to True, program doesn't print anything.
If the Vigenère key is longer than the integer in MAX_KEY_LENGTH on line 8, there is no way the hacking program will find the correct key. If the hacking program fails to hack the ciphertext, try increasing this value and running the program again.
Keep in mind that trying to hack an incorrect key length that is short takes a short amount of time. But if MAX_KEY_LENGTH is set very high and the kasiskiExamination() function mistakenly thinks that the key length could be an enormous integer, the program could spend hours, or even months, attempting to hack the ciphertext using the wrong key lengths.
To prevent this, NUM_MOST_FREQ_LETTERS on line 9 limits the number of possible letters tried for each subkey. By increasing this value, the hacking program tries many more keys, which you might need to do if the freqAnalysis.englishFreqMatchScore() was inaccurate for the original plaintext message, but this also causes the program to slow down. And setting NUM_MOST_FREQ_LETTERS to 26 would cause the program to skip narrowing down the number of possible letters for each subkey entirely!
For both MAX_KEY_LENGTH and NUM_MOST_FREQ_LETTERS, a smaller value is faster to execute but less likely to succeed in hacking the cipher, and a larger value is slower to execute but more likely to succeed.
Finally, to increase the speed of your program, you can set SILENT_MODE to True on line 10 so the program doesn’t waste time printing information to the screen. Although your computer can perform calculations quickly, displaying characters on the screen can be relatively slow. The downside of not printing information is that you won’t know how the program is doing until it has completely finished running.
Hacking the Vigenère cipher requires you to follow several detailed steps. Also, many parts of the hacking program could fail: for example, perhaps the Vigenère key used for encryption is longer than MAX_KEY_LENGTH, or perhaps the English frequency–matching function received inaccurate results because the plaintext doesn’t follow normal letter frequency, or maybe the plaintext has too many words that aren’t in the dictionary file and isEnglish() doesn’t recognize it as English.
As you identify different ways in which the hacking program could fail, you can change the code to handle such cases. But the hacking program in this book does a pretty good job of reducing billions or trillions of possible keys to mere thousands.
However, there is one trick that makes the Vigenère cipher mathematically impossible to break, no matter how powerful your computer or how clever your hacking program is. You’ll learn about this trick, which is called a one-time pad, in Chapter 21.