Note: The second edition of this book is available under the title Cracking Codes with Python
Topics Covered In This Chapter:
· The extend() list method
· The Set data type and set() function
· The itertools.product() function
Alan says, “When we want to sink a convey, we send out an observation plane first. It is ostensibly an observation plane. Of course, to observe is not its real duty—We already know exactly where the convoy is. Its real duty is to be observed—That is, to fly close enough to the convoy that it will be noticed by the lookouts on the ships. The ships will then send out a radio message to the effect that they have been sighted by an Allied observation plane. Then, when we come round and sink them, the Germans will not find it suspicious—At least, not quite so monstrously suspicious that we knew exactly where to go.”
...
Alan says, “Unless we continue to do stunningly idiotic things like sinking convoys in the fog, they will never receive any clear and unmistakable indications that we have broken Enigma.”
“Cryptonomicon” by Neal Stephenson
There are two different methods to hack the Vigenère cipher. The first is a brute-force attack that tries every word in the dictionary file as the Vigenère key. This method will only work if an English word like “RAVEN” or “DESK” was used for the key instead of a random key like “VUWFE” or “PNFJ”. The second is a more sophisticated method that works even if a random key was used. The earliest record of its use was by the mathematician Charles Babbage in the 19th century.
If the Vigenère key is an English word it is very easy to memorize. But never use an English word for the encryption key. This makes your ciphertext vulnerable to a dictionary attack.
A dictionary attack is a brute-force technique where a hacker attempts to decrypt the ciphertext using the words from a dictionary file as the keys. The dictionary.txt dictionary file available on this book’s website (at http://invpy.com/dictionary.txt) has about 45,000 English words. It takes less than 5 minutes for my computer to run through all of these decryptions for a message the size of a long paragraph.
Open a new file editor window by clicking on File ► New Window. Type in the following code into the file editor, and then save it as vigenereDictionaryHacker.py. Press F5 to run the program. Note that first you will need to download the pyperclip.py module and place this file in the same directory as the vigenereDictionaryHacker.py file. You can download this file from http://invpy.com/pyperclip.py.
Source code for vigenereDictionaryHacker.py
1. # Vigenere Cipher Dictionary Hacker
2. # http://inventwithpython.com/hacking (BSD Licensed)
3.
4. import detectEnglish, vigenereCipher, pyperclip
5.
6. def main():
7. ciphertext = """Tzx isnz eccjxkg nfq lol mys bbqq I lxcz."""
8. hackedMessage = hackVigenere(ciphertext)
9.
10. if hackedMessage != None:
11. print('Copying hacked message to clipboard:')
12. print(hackedMessage)
13. pyperclip.copy(hackedMessage)
14. else:
15. print('Failed to hack encryption.')
16.
17.
18. def hackVigenere(ciphertext):
19. fo = open('dictionary.txt')
20. words = fo.readlines()
21. fo.close()
22.
23. for word in words:
24. word = word.strip() # remove the newline at the end
25. decryptedText = vigenereCipher.decryptMessage(word, ciphertext)
26. if detectEnglish.isEnglish(decryptedText, wordPercentage=40):
27. # Check with user to see if the decrypted key has been found.
28. print()
29. print('Possible encryption break:')
30. print('Key ' + str(word) + ': ' + decryptedText[:100])
31. print()
32. print('Enter D for done, or just press Enter to continue breaking:')
33. response = input('> ')
34.
35. if response.upper().startswith('D'):
36. return decryptedText
37.
38. if __name__ == '__main__':
39. main()
When you run this program the output will look like this:
Possible encryption break:
Key ASTROLOGY: The recl yecrets crk not the qnks I tell.
Enter D for done, or just press Enter to continue breaking:
>
Possible encryption break:
Key ASTRONOMY: The real secrets are not the ones I tell.
Enter D for done, or just press Enter to continue breaking:
> d
Copying hacked message to clipboard:
The real secrets are not the ones I tell.
The first keyword it suggests (“ASTROLOGY”) doesn’t quite work, so the user presses Enter to let the hacking program continue until it gets the correct decryption key (“ASTRONOMY”).
20. words = fo.readlines()
File objects returned from open() have a readlines() method. Unlike the read() method which returns the full contents of the file as a single string, the readlines() method will return a list of strings, where each string is a single line from the file. Note that each of the strings in the list will end with a \n newline character (except for possibly the very last string, since the file might not have ended with a newline).
The source code for this program isn’t anything we haven’t seen in previous hacking programs in this book, aside from the new readlines() method. The hackVigenere() function reads in the contents of the dictionary file, uses each word in that file to decrypt the ciphertext, and if the decrypted text looks like readable English it will prompt the user to quit or continue.
As such, we won’t do a line-by-line explanation for this program, and instead continue on with a program that can hack the Vigenère cipher even when the key was not a word that can be found in the dictionary.
Charles Babbage is known to have broken the Vigenère cipher, but he never published his results. Later studies revealed he used a method that was later published by early 20th-century mathematician Friedrich Kasiski.
“Kasiski Examination” is a process used to determine how long the Vigenère key used to encrypt a ciphertext was. After this is determined, frequency analysis can be used to break each of the subkeys.
The first part of Kasiski Examination is to find every repeated set of letters at least three letters long in the ciphertext. These are significant, because they could indicate that they were the same letters of plaintext encrypted with the same subkeys of the key. For example, if the ciphertext is “Ppqca xqvekg ybnkmazu ybngbal jon i tszm jyim. Vrag voht vrau c tksg. Ddwuo xitlazu vavv raz c vkb qp iwpou.” and we remove the non-letters, the ciphertext looks like this:
PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCVKBQPIWPOU
You can see that the sequences VRA, AZU, and YBN repeat in this ciphertext:
PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCVKBQPIWPOU
PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCVKBQPIWPOU
PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCVKBQPIWPOU
After finding the repeated sequences, get a count of the spacing between the sequences. If we count the number of letters between the start of each of these sequences, we find that:
· Between the first and second VRA sequences there are 8 letters.
· Between the second and third VRA sequences there are 24 letters.
· Between the first and third VRA sequences there are 32 letters.
· Between the first and second AZU there are 48 letters.
· Between the first and second YBN there are 8 letters.
So the spacings are 8, 8, 24, 32, and 48. Let’s find the factors of each of these numbers (not including one):
· The factors of 8 are 2, 4, and 8.
· The factors of 24 are 2, 4, 6, 8, 12, and 24.
· The factors of 32 are 2, 4, 8, and 16.
· The factors of 48 are 2, 4, 6, 8, 12, 24, and 48.
So the spacings of 8, 8, 24, 32, and 48 expand to this list of factors: 2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 8, 8, 8, 8, 12, 12, 16, 24, 24, and 48. If we do a count of these factors, we get this:
Table 21-1. Factor count from our “Ppqca xqvekg...” example.
Factor |
Count |
2 |
Appears 4 times. |
4 |
Appears 4 times. |
6 |
Appears 2 times. |
8 |
Appears 4 times. |
12 |
Appears 2 times. |
16 |
Appears 1 time. |
24 |
Appears 2 times. |
48 |
Appears 1 time. |
The factors that have the highest count are the most likely lengths of the Vigenère key. In our example above, these are 2, 4, and 8. The Vigenère key is probably 2, 4, or 8 letters long.
For this example, we will guess that the key length is 4. Next we will want to split up the ciphertext into every 4th letter. This means we want the following underlined letters as a separate string:
Every 4th letter starting with the first letter: PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCVKBQPIWPOU
Every 4th letter starting with the second letter: PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCVKBQPIWPOU
Every 4th letter starting with the third letter: PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCVKBQPIWPOU
Every 4th letter starting with the fourth lettter: PPQCAXQVEKGYBNKMAZUYBNGBALJONITSZMJYIMVRAGVOHTVRAUCTKSGDDWUOXITLAZUVAVVRAZCVKBQPIWPOU
When combined, they become these four strings:
Every 4th letter starting with the first letter: |
PAEBABANZIAHAKDXAAAKIU |
Every 4th letter starting with the second letter: |
PXKNZNLIMMGTUSWIZVZBW |
Every 4th letter starting with the third letter: |
QQGKUGJTJVVVCGUTUVCQP |
Every 4th letter starting with the fourth letter: |
CVYMYBOSYRORTDOLVRVPO |
If our guess from Kasiski Examination was correct and the decryption key was in fact 4 characters long, then the first subkey of the key would have been used to encrypt the characters in the first string above, the second subkey of the key would have been used to encrypt the characters in the second string above, and so on.
Remember, the Vigenère cipher is the same as the Caesar cipher, except it uses multiple subkeys. Kasiski Examination tells us how many subkeys were used for the ciphertext, now we just have to hack each subkey one at a time. Let’s try to hack the first of these four ciphertext strings: PAEBABANZIAHAKDXAAAKIU
We will decrypt this string 26 times, once for each of the 26 possible subkeys, and then see what English frequency match score the decrypted text has. In the table below, the first column is the subkey used to decrypt the PAEBABANZIAHAKDXAAAKIU string. The second column is the returned decrypted text value from vigenereCipher.decryptMessage(subkey, 'PAEBABANZIAHAKDXAAAKIU') where subkey is the subkey from the first column. The third column is the returned value from freqAnalysis.englishFreqMatchScore(decryptedText) where decryptedText is the value from the second column.
Table 21-2. English frequency match score for each decryption.
Subkey |
Text When PAEB… is Decrypted with the Subkey |
English Frequency Match Score |
'A' |
'PAEBABANZIAHAKDXAAAKIU' |
2 |
'B' |
'OZDAZAZMYHZGZJCWZZZJHT' |
1 |
'C' |
'NYCZYZYLXGYFYIBVYYYIGS' |
1 |
'D' |
'MXBYXYXKWFXEXHAUXXXHFR' |
0 |
'E' |
'LWAXWXWJVEWDWGZTWWWGEQ' |
1 |
'F' |
'KVZWVWVIUDVCVFYSVVVFDP' |
0 |
'G' |
'JUYVUVUHTCUBUEXRUUUECO' |
1 |
'H' |
'ITXUTUTGSBTATDWQTTTDBN' |
1 |
'I' |
'HSWTSTSFRASZSCVPSSSCAM' |
2 |
'J' |
'GRVSRSREQZRYRBUORRRBZL' |
0 |
'K' |
'FQURQRQDPYQXQATNQQQAYK' |
1 |
'L' |
'EPTQPQPCOXPWPZSMPPPZXJ' |
0 |
'M' |
'DOSPOPOBNWOVOYRLOOOYWI' |
1 |
'N' |
'CNRONONAMVNUNXQKNNNXVH' |
2 |
'O' |
'BMQNMNMZLUMTMWPJMMMWUG' |
1 |
'P' |
'ALPMLMLYKTLSLVOILLLVTF' |
1 |
'Q' |
'ZKOLKLKXJSKRKUNHKKKUSE' |
0 |
'R' |
'YJNKJKJWIRJQJTMGJJJTRD' |
1 |
'S' |
'XIMJIJIVHQIPISLFIIISQC' |
1 |
'T' |
'WHLIHIHUGPHOHRKEHHHRPB' |
1 |
'U' |
'VGKHGHGTFOGNGQJDGGGQOA' |
1 |
'V' |
'UFJGFGFSENFMFPICFFFPNZ' |
1 |
'W' |
'TEIFEFERDMELEOHBEEEOMY' |
2 |
'X' |
'SDHEDEDQCLDKDNGADDDNLX' |
2 |
'Y' |
'RCGDCDCPBKCJCMFZCCCMKW' |
0 |
'Z' |
'QBFCBCBOAJBIBLEYBBBLJV' |
0 |
The subkeys that produce decryptions with the closest frequency match to English are the ones that are most likely to be the real subkey. In the above decryptions (for the 1st of the four ciphertext strings), 'A', 'I', 'N', 'W', and 'X' are the subkeys that have the highest frequency matches with English. Note that these scores are low in general because there isn’t enough ciphertext to give us a large sample of text, but it still ends up working well.
We need to repeat this 26-decryptions-and-frequency-match for the other three strings to find out their most likely subkeys. After this frequency analysis, we find:
The most likely subkeys for the first string are: |
A, I, N, W, and X |
The most likely subkeys for the second string are: |
I and Z |
The most likely subkey for the third string is: |
C |
The most likely subkeys for the fourth string are: |
K, N, R, V, and Y |
Next we will brute-force the key by trying out every combination of subkey. Because there are 5 possible subkeys for the first subkey, 2 for the second subkey, 1 for the third subkey, and 5 for the fourth subkey, the number of combinations is 5 × 2 × 1 × 5 or 50 possible keys to brute-force through. This is much better than the 26 × 26 × 26 × 26 or 456,976 possible keys we would have to brute-force through if we had not narrowed the list of possible subkeys. This difference becomes even greater if the Vigenère key had been longer!
AICK |
IICK |
NICK |
WICK |
XICK |
AICN |
IICN |
NICN |
WICN |
XICN |
AICR |
IICR |
NICR |
WICR |
XICR |
AICV |
IICV |
NICV |
WICV |
XICV |
AICY |
IICY |
NICY |
WICY |
XICY |
AZCK |
IZCK |
NZCK |
WZCK |
XZCK |
AZCN |
IZCN |
NZCN |
WZCN |
XZCN |
AZCR |
IZCR |
NZCR |
WZCR |
XZCR |
AZCV |
IZCV |
NZCV |
WZCV |
XZCV |
AZCY |
IZCY |
NZCY |
WZCY |
XZCY |
Now it’s just a matter of going through all 50 of these decryption keys for the full ciphertext and seeing which one produces a readable English plaintext. If you do this, you’ll find that the key to the “Ppqca xqvekg…” ciphertext is “WICK”.
Open a new file editor window by clicking on File ► New Window. Type in the following code into the file editor, and then save it as vigenereHacker.py. Press F5 to run the program. Note that first you will need to download the pyperclip.py module and place this file in the same directory as the vigenereHacker.py file. You can download this file from http://invpy.com/pyperclip.py.
The ciphertext in this program may be difficult to copy from the book, but you can copy & paste it from http://invpy.com/vigenereHacking.py. You can see if there are any differences between the text in your program to the text of the program in this book by using the online diff tool at http://invpy.com/hackingdiff.
Source code for vigenereHacker.py
1. # Vigenere Cipher Hacker
2. # http://inventwithpython.com/hacking (BSD Licensed)
3.
4. import itertools, re
5. import vigenereCipher, pyperclip, freqAnalysis, detectEnglish
6.
7. LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
8. SILENT_MODE = False # if set to True, program doesn't print attempts
9. NUM_MOST_FREQ_LETTERS = 4 # attempts this many letters per subkey
10. MAX_KEY_LENGTH = 16 # will not attempt keys longer than this
11. NONLETTERS_PATTERN = re.compile('[^A-Z]')
12.
13.
14. def main():
15. # Instead of typing this ciphertext out, you can copy & paste it
16. # from http://invpy.com/vigenereHacker.py
17. ciphertext = """Adiz Avtzqeci Tmzubb wsa m Pmilqev halpqavtakuoi, lgouqdaf, kdmktsvmztsl, izr xoexghzr kkusitaaf. Vz wsa twbhdg ubalmmzhdad qz hce vmhsgohuqbo ox kaakulmd gxiwvos, krgdurdny i rcmmstugvtawz ca tzm ocicwxfg jf "stscmilpy" oid "uwydptsbuci" wabt hce Lcdwig eiovdnw. Bgfdny qe kddwtk qjnkqpsmev ba pz tzm roohwz at xoexghzr kkusicw izr vrlqrwxist uboedtuuznum. Pimifo Icmlv Emf DI, Lcdwig owdyzd xwd hce Ywhsmnemzh Xovm mby Cqxtsm Supacg (GUKE) oo Bdmfqclwg Bomk, Tzuhvif'a ocyetzqofifo ositjm. Rcm a lqys ce oie vzav wr Vpt 8, lpq gzclqab mekxabnittq tjr Ymdavn fihog cjgbhvnstkgds. Zm psqikmp o iuejqf jf lmoviiicqg aoj jdsvkavs Uzreiz qdpzmdg, dnutgrdny bts helpar jf lpq pjmtm, mb zlwkffjmwktoiiuix avczqzs ohsb ocplv nuby swbfwigk naf ohw Mzwbms umqcifm. Mtoej bts raj pq kjrcmp oo tzm Zooigvmz Khqauqvl Dincmalwdm, rhwzq vz cjmmhzd gvq ca tzm rwmsl lqgdgfa rcm a kbafzd-hzaumae kaakulmd, hce SKQ. Wi 1948 Tmzubb jgqzsy Msf Zsrmsv'e Qjmhcfwig Dincmalwdm vt Eizqcekbqf Pnadqfnilg, ivzrw pq onsaafsy if bts yenmxckmwvf ca tzm Yoiczmehzr uwydptwze oid tmoohe avfsmekbqr dn eifvzmsbuqvl tqazjgq. Pq kmolm m dvpwz ab ohw ktshiuix pvsaa at hojxtcbefmewn, afl bfzdakfsy okkuzgalqzu xhwuuqvl jmmqoigve gpcz ie hce Tmxcpsgd-Lvvbgbubnkq zqoxtawz, kciup isme xqdgo otaqfqev qz hce 1960k. Bgfdny'a tchokmjivlabk fzsmtfsy if i ofdmavmz krgaqqptawz wi 1952, wzmz vjmgaqlpad iohn wwzq goidt uzgeyix wi tzm Gbdtwl Wwigvwy. Vz aukqdoev bdsvtemzh rilp rshadm tcmmgvqg (xhwuuqvl uiehmalqab) vs sv mzoejvmhdvw ba dmikwz. Hpravs rdev qz 1954, xpsl whsm tow iszkk jqtjrw pug 42id tqdhcdsg, rfjm ugmbddw xawnofqzu. Vn avcizsl lqhzreqzsy tzif vds vmmhc wsa eidcalq; vds ewfvzr svp gjmw wfvzrk jqzdenmp vds vmmhc wsa mqxivmzhvl. Gv 10 Esktwunsm 2009, fgtxcrifo mb Dnlmdbzt uiydviyv, Nfdtaat Dmiem Ywiikbqf Bojlab Wrgez avdw iz cafakuog pmjxwx ahwxcby gv nscadn at ohw Jdwoikp scqejvysit xwd "hce sxboglavs kvy zm ion tjmmhzd." Sa at Haq 2012 i bfdvsbq azmtmd'g widt ion bwnafz tzm Tcpsw wr Zjrva ivdcz eaigd yzmbo Tmzubb a kbmhptgzk dvrvwz wa efiohzd."""
18. hackedMessage = hackVigenere(ciphertext)
19.
20. if hackedMessage != None:
21. print('Copying hacked message to clipboard:')
22. print(hackedMessage)
23. pyperclip.copy(hackedMessage)
24. else:
25. print('Failed to hack encryption.')
26.
27.
28. def findRepeatSequencesSpacings(message):
29. # Goes through the message and finds any 3 to 5 letter sequences
30. # that are repeated. Returns a dict with the keys of the sequence and
31. # values of a list of spacings (num of letters between the repeats).
32.
33. # Use a regular expression to remove non-letters from the message.
34. message = NONLETTERS_PATTERN.sub('', message.upper())
35.
36. # Compile a list of seqLen-letter sequences found in the message.
37. seqSpacings = {} # keys are sequences, values are list of int spacings
38. for seqLen in range(3, 6):
39. for seqStart in range(len(message) - seqLen):
40. # Determine what the sequence is, and store it in seq
41. seq = message[seqStart:seqStart + seqLen]
42.
43. # Look for this sequence in the rest of the message
44. for i in range(seqStart + seqLen, len(message) - seqLen):
45. if message[i:i + seqLen] == seq:
46. # Found a repeated sequence.
47. if seq not in seqSpacings:
48. seqSpacings[seq] = [] # initialize blank list
49.
50. # Append the spacing distance between the repeated
51. # sequence and the original sequence.
52. seqSpacings[seq].append(i - seqStart)
53. return seqSpacings
54.
55.
56. def getUsefulFactors(num):
57. # Returns a list of useful factors of num. By "useful" we mean factors
58. # less than MAX_KEY_LENGTH + 1. For example, getUsefulFactors(144)
59. # returns [2, 72, 3, 48, 4, 36, 6, 24, 8, 18, 9, 16, 12]
60.
61. if num < 2:
62. return [] # numbers less than 2 have no useful factors
63.
64. factors = [] # the list of factors found
65.
66. # When finding factors, you only need to check the integers up to
67. # MAX_KEY_LENGTH.
68. for i in range(2, MAX_KEY_LENGTH + 1): # don't test 1
69. if num % i == 0:
70. factors.append(i)
71. factors.append(int(num / i))
72. if 1 in factors:
73. factors.remove(1)
74. return list(set(factors))
75.
76.
77. def getItemAtIndexOne(x):
78. return x[1]
79.
80.
81. def getMostCommonFactors(seqFactors):
82. # First, get a count of how many times a factor occurs in seqFactors.
83. factorCounts = {} # key is a factor, value is how often if occurs
84.
85. # seqFactors keys are sequences, values are lists of factors of the
86. # spacings. seqFactors has a value like: {'GFD': [2, 3, 4, 6, 9, 12,
87. # 18, 23, 36, 46, 69, 92, 138, 207], 'ALW': [2, 3, 4, 6, ...], ...}
88. for seq in seqFactors:
89. factorList = seqFactors[seq]
90. for factor in factorList:
91. if factor not in factorCounts:
92. factorCounts[factor] = 0
93. factorCounts[factor] += 1
94.
95. # Second, put the factor and its count into a tuple, and make a list
96. # of these tuples so we can sort them.
97. factorsByCount = []
98. for factor in factorCounts:
99. # exclude factors larger than MAX_KEY_LENGTH
100. if factor <= MAX_KEY_LENGTH:
101. # factorsByCount is a list of tuples: (factor, factorCount)
102. # factorsByCount has a value like: [(3, 497), (2, 487), ...]
103. factorsByCount.append( (factor, factorCounts[factor]) )
104.
105. # Sort the list by the factor count.
106. factorsByCount.sort(key=getItemAtIndexOne, reverse=True)
107.
108. return factorsByCount
109.
110.
111. def kasiskiExamination(ciphertext):
112. # Find out the sequences of 3 to 5 letters that occur multiple times
113. # in the ciphertext. repeatedSeqSpacings has a value like:
114. # {'EXG': [192], 'NAF': [339, 972, 633], ... }
115. repeatedSeqSpacings = findRepeatSequencesSpacings(ciphertext)
116.
117. # See getMostCommonFactors() for a description of seqFactors.
118. seqFactors = {}
119. for seq in repeatedSeqSpacings:
120. seqFactors[seq] = []
121. for spacing in repeatedSeqSpacings[seq]:
122. seqFactors[seq].extend(getUsefulFactors(spacing))
123.
124. # See getMostCommonFactors() for a description of factorsByCount.
125. factorsByCount = getMostCommonFactors(seqFactors)
126.
127. # Now we extract the factor counts from factorsByCount and
128. # put them in allLikelyKeyLengths so that they are easier to
129. # use later.
130. allLikelyKeyLengths = []
131. for twoIntTuple in factorsByCount:
132. allLikelyKeyLengths.append(twoIntTuple[0])
133.
134. return allLikelyKeyLengths
135.
136.
137. def getNthSubkeysLetters(n, keyLength, message):
138. # Returns every Nth letter for each keyLength set of letters in text.
139. # E.g. getNthSubkeysLetters(1, 3, 'ABCABCABC') returns 'AAA'
140. # getNthSubkeysLetters(2, 3, 'ABCABCABC') returns 'BBB'
141. # getNthSubkeysLetters(3, 3, 'ABCABCABC') returns 'CCC'
142. # getNthSubkeysLetters(1, 5, 'ABCDEFGHI') returns 'AF'
143.
144. # Use a regular expression to remove non-letters from the message.
145. message = NONLETTERS_PATTERN.sub('', message)
146.
147. i = n - 1
148. letters = []
149. while i < len(message):
150. letters.append(message[i])
151. i += keyLength
152. return ''.join(letters)
153.
154.
155. def attemptHackWithKeyLength(ciphertext, mostLikelyKeyLength):
156. # Determine the most likely letters for each letter in the key.
157. ciphertextUp = ciphertext.upper()
158. # allFreqScores is a list of mostLikelyKeyLength number of lists.
159. # These inner lists are the freqScores lists.
160. allFreqScores = []
161. for nth in range(1, mostLikelyKeyLength + 1):
162. nthLetters = getNthSubkeysLetters(nth, mostLikelyKeyLength, ciphertextUp)
163.
164. # freqScores is a list of tuples like:
165. # [(<letter>, <Eng. Freq. match score>), ... ]
166. # List is sorted by match score. Higher score means better match.
167. # See the englishFreqMatchScore() comments in freqAnalysis.py.
168. freqScores = []
169. for possibleKey in LETTERS:
170. decryptedText = vigenereCipher.decryptMessage(possibleKey, nthLetters)
171. keyAndFreqMatchTuple = (possibleKey, freqAnalysis.englishFreqMatchScore(decryptedText))
172. freqScores.append(keyAndFreqMatchTuple)
173. # Sort by match score
174. freqScores.sort(key=getItemAtIndexOne, reverse=True)
175.
176. allFreqScores.append(freqScores[:NUM_MOST_FREQ_LETTERS])
177.
178. if not SILENT_MODE:
179. for i in range(len(allFreqScores)):
180. # use i + 1 so the first letter is not called the "0th" letter
181. print('Possible letters for letter %s of the key: ' % (i + 1), end='')
182. for freqScore in allFreqScores[i]:
183. print('%s ' % freqScore[0], end='')
184. print() # print a newline
185.
186. # Try every combination of the most likely letters for each position
187. # in the key.
188. for indexes in itertools.product(range(NUM_MOST_FREQ_LETTERS), repeat=mostLikelyKeyLength):
189. # Create a possible key from the letters in allFreqScores
190. possibleKey = ''
191. for i in range(mostLikelyKeyLength):
192. possibleKey += allFreqScores[i][indexes[i]][0]
193.
194. if not SILENT_MODE:
195. print('Attempting with key: %s' % (possibleKey))
196.
197. decryptedText = vigenereCipher.decryptMessage(possibleKey, ciphertextUp)
198.
199. if detectEnglish.isEnglish(decryptedText):
200. # Set the hacked ciphertext to the original casing.
201. origCase = []
202. for i in range(len(ciphertext)):
203. if ciphertext[i].isupper():
204. origCase.append(decryptedText[i].upper())
205. else:
206. origCase.append(decryptedText[i].lower())
207. decryptedText = ''.join(origCase)
208.
209. # Check with user to see if the key has been found.
210. print('Possible encryption hack with key %s:' % (possibleKey))
211. print(decryptedText[:200]) # only show first 200 characters
212. print()
213. print('Enter D for done, or just press Enter to continue hacking:')
214. response = input('> ')
215.
216. if response.strip().upper().startswith('D'):
217. return decryptedText
218.
219. # No English-looking decryption found, so return None.
220. return None
221.
222.
223. def hackVigenere(ciphertext):
224. # First, we need to do Kasiski Examination to figure out what the
225. # length of the ciphertext's encryption key is.
226. allLikelyKeyLengths = kasiskiExamination(ciphertext)
227. if not SILENT_MODE:
228. keyLengthStr = ''
229. for keyLength in allLikelyKeyLengths:
230. keyLengthStr += '%s ' % (keyLength)
231. print('Kasiski Examination results say the most likely key lengths are: ' + keyLengthStr + '\n')
232.
233. for keyLength in allLikelyKeyLengths:
234. if not SILENT_MODE:
235. print('Attempting hack with key length %s (%s possible keys)...' % (keyLength, NUM_MOST_FREQ_LETTERS ** keyLength))
236. hackedMessage = attemptHackWithKeyLength(ciphertext, keyLength)
237. if hackedMessage != None:
238. break
239.
240. # If none of the key lengths we found using Kasiski Examination
241. # worked, start brute-forcing through key lengths.
242. if hackedMessage == None:
243. if not SILENT_MODE:
244. print('Unable to hack message with likely key length(s). Brute-forcing key length...')
245. for keyLength in range(1, MAX_KEY_LENGTH + 1):
246. # don't re-check key lengths already tried from Kasiski
247. if keyLength not in allLikelyKeyLengths:
248. if not SILENT_MODE:
249. print('Attempting hack with key length %s (%s possible keys)...' % (keyLength, NUM_MOST_FREQ_LETTERS ** keyLength))
250. hackedMessage = attemptHackWithKeyLength(ciphertext, keyLength)
251. if hackedMessage != None:
252. break
253. return hackedMessage
254.
255.
256. # If vigenereHacker.py is run (instead of imported as a module) call
257. # the main() function.
258. if __name__ == '__main__':
259. main()
When you run the vigenereHacker.py program, the output will look like this:
Kasiski Examination results say the most likely key lengths are: 3 2 6 4 12
Attempting hack with key length 3 (27 possible keys)...
Possible letters for letter 1 of the key: A L M
Possible letters for letter 2 of the key: S N O
Possible letters for letter 3 of the key: V I Z
Attempting with key: ASV
Attempting with key: ASI
Attempting with key: ASZ
Attempting with key: ANV
Attempting with key: ANI
Attempting with key: ANZ
Attempting with key: AOV
Attempting with key: AOI
Attempting with key: AOZ
Attempting with key: LSV
Attempting with key: LSI
Attempting with key: LSZ
Attempting with key: LNV
Attempting with key: LNI
Attempting with key: LNZ
Attempting with key: LOV
Attempting with key: LOI
Attempting with key: LOZ
Attempting with key: MSV
Attempting with key: MSI
Attempting with key: MSZ
Attempting with key: MNV
Attempting with key: MNI
Attempting with key: MNZ
Attempting with key: MOV
Attempting with key: MOI
Attempting with key: MOZ
Attempting hack with key length 2 (9 possible keys)...
Possible letters for letter 1 of the key: O A E
Possible letters for letter 2 of the key: M S I
Attempting with key: OM
Attempting with key: OS
Attempting with key: OI
Attempting with key: AM
Attempting with key: AS
Attempting with key: AI
Attempting with key: EM
Attempting with key: ES
Attempting with key: EI
Attempting hack with key length 6 (729 possible keys)...
Possible letters for letter 1 of the key: A E O
Possible letters for letter 2 of the key: S D G
Possible letters for letter 3 of the key: I V X
Possible letters for letter 4 of the key: M Z Q
Possible letters for letter 5 of the key: O B Z
Possible letters for letter 6 of the key: V I K
Attempting with key: ASIMOV
Possible encryption hack with key ASIMOV:
ALAN MATHISON TURING WAS A BRITISH MATHEMATICIAN, LOGICIAN, CRYPTANALYST, AND COMPUTER SCIENTIST. HE WAS HIGHLY INFLUENTIAL IN THE DEVELOPMENT OF COMPUTER SCIENCE, PROVIDING A FORMALISATION OF THE CON
Enter D for done, or just press Enter to continue hacking:
> d
Copying hacked message to clipboard:
Alan Mathison Turing was a British mathematician, logician, cryptanalyst, and computer scientist. He was highly influential in the development of computer
...skipped for brevity...
his death was accidental. On 10 September 2009, following an Internet campaign, British Prime Minister Gordon Brown made an official public apology on behalf of the British government for "the appalling way he was treated." As of May 2012 a private member's bill was before the House of Lords which would grant Turing a statutory pardon if enacted.
vigenereHacker.py
1. # Vigenere Cipher Hacker
2. # http://inventwithpython.com/hacking (BSD Licensed)
3.
4. import itertools, re
5. import vigenereCipher, pyperclip, freqAnalysis, detectEnglish
6.
7. LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
8. SILENT_MODE = False # if set to True, program doesn't print attempts
9. NUM_MOST_FREQ_LETTERS = 4 # attempts this many letters per subkey
10. MAX_KEY_LENGTH = 16 # will not attempt keys longer than this
11. NONLETTERS_PATTERN = re.compile('[^A-Z]')
The hacking program imports many different modules, including a new module named itertools. The constants will be explained as they are used in the program.
vigenereHacker.py
14. def main():
15. # Instead of typing this ciphertext out, you can copy & paste it
16. # from http://invpy.com/vigenereHacker.py
17. ciphertext = """Adiz Avtzqeci Tmzubb wsa m Pmilqev halpqavtakuoi, lgouqdaf, kdmktsvmztsl, izr xoexghzr kkusitaaf. Vz wsa twbhdg ubalmmzhdad qz
...skipped for brevity...
at Haq 2012 i bfdvsbq azmtmd'g widt ion bwnafz tzm Tcpsw wr Zjrva ivdcz eaigd yzmbo Tmzubb a kbmhptgzk dvrvwz wa efiohzd."""
18. hackedMessage = hackVigenere(ciphertext)
19.
20. if hackedMessage != None:
21. print('Copying hacked message to clipboard:')
22. print(hackedMessage)
23. pyperclip.copy(hackedMessage)
24. else:
25. print('Failed to hack encryption.')
The main() function of the hacking program is similar to the main() functions of previous hacking functions. The ciphertext is passed to the hackVigenere() cipher, which either returns the decrypted string (if the hack was successful) or the None value (if the hack failed). If successful, the hacked message is printed to the screen and copied to the clipboard.
vigenereHacker.py
28. def findRepeatSequencesSpacings(message):
29. # Goes through the message and finds any 3 to 5 letter sequences
30. # that are repeated. Returns a dict with the keys of the sequence and
31. # values of a list of spacings (num of letters between the repeats).
32.
33. # Use a regular expression to remove non-letters from the message.
34. message = NONLETTERS_PATTERN.sub('', message.upper())
35.
36. # Compile a list of seqLen-letter sequences found in the message.
37. seqSpacings = {} # keys are sequences, values are list of int spacings
38. for seqLen in range(3, 6):
The findRepeatSequencesSpacings() locates all the repeated sequences of letters in the message string and counts the spacings (that is, the number of letters) between the sequences. First, line 34 converts the message to uppercase and removes any non-letter characters from message using the sub() regular expression method.
The seqSpacings dictionary will have keys of the sequence strings and values of a list with the integer number of letters between all the occurrences of that sequence in the key. The previous “PPQCAXQV…” example string from earlier in the “Kasiski Examination, Step 1” section, if passed as message, would cause findRepeatSequenceSpacings() to return {'VRA': [8, 24, 32], 'AZU': [48], 'YBN': [8]}.
The code inside line 38’s for loop will find the repeated sequences in message and calculate the spacings. On the first iteration, it will find sequences that are exactly 3 letters long. On the next iteration it will find sequences exactly 4 letters long, and then 5 letters long. (You can change what sequence lengths the code searches for by modifying the range(3, 6) call on line 38, but finding repeated sequences of length 3, 4 and 5 seems to work for most ciphertexts.)
vigenereHacker.py
39. for seqStart in range(len(message) - seqLen):
40. # Determine what the sequence is, and store it in seq
41. seq = message[seqStart:seqStart + seqLen]
The for loop on line 39 makes sure that we iterate over every possible substring of length seqLen in the message string. Line 41 sets the seq variable with the sequence we are looking for. For example, if seqLen is 3 and message is 'PPQCAXQ', we would want to search for the following sequences (notice the indexes at the top of the 'PPQCAXQ' string):
Table 21-3. Values of seq from message depending on the value in seqStart.
Indexes: |
0123456 |
|
On 1st iteration, seqStart is 0: |
'PPQCAXQ' |
(PPQ starts at index 0.) |
On 2nd iteration, seqStart is 1: |
'PPQCAXQ' |
(PQC starts at index 1.) |
On 3rd iteration, seqStart is 2: |
'PPQCAXQ' |
(QCA starts at index 2.) |
On 4th iteration, seqStart is 3: |
'PPQCAXQ' |
(CAX starts at index 3.) |
On 5th iteration, seqStart is 4: |
'PPQCAXQ' |
(AXQ starts at index 4, which is what len(message) - seqLen evaluates to and is the last index.) |
vigenereHacker.py
43. # Look for this sequence in the rest of the message
44. for i in range(seqStart + seqLen, len(message) - seqLen):
45. if message[i:i + seqLen] == seq:
The for loop on line 44 is inside line 39’s for loop and sets i to be the indexes of every possible sequence of length seqLen in message. These indexes start at seqStart + seqLen (that is, after the sequence currently in seq) and go up to len(message) - seqLen (which is the last index where a sequence of length seqLen can be found).
The expression message[i:i + seqLen] will evaluate to the substring of message that gets checked for being a repeat of seq on line 45. If it is, then we need to calculate the spacing and add it to the seqSpacings dictionary. This is done on lines 46 to 52.
vigenereHacker.py
46. # Found a repeated sequence.
47. if seq not in seqSpacings:
48. seqSpacings[seq] = [] # initialize blank list
49.
50. # Append the spacing distance between the repeated
51. # sequence and the original sequence.
52. seqSpacings[seq].append(i - seqStart)
The spacing between the sequence we’ve found at message[i:i + seqLen] and the original sequence at message[seqStart:seqStart+seqLen] is simply i - seqStart. Notice that i and seqStart are the beginning indexes before the colons. So the integer that i - seqStart evaluates to is the spacing between the two sequences, which is appended to the list stored at seqSpacings[seq].
(Lines 47 and 48 guarantee there is a list at this key by checking beforehand if seq exists as a key in seqSpacings. If it does not, then seqSpacings[seq] is set as a key with a blank list as its value.)
vigenereHacker.py
53. return seqSpacings
By the time all these for loops have finished, the seqSpacings dictionary will contain every repeated sequence of length 3, 4, and 5 and their spacings. This dictionary is returned from findRepeatSequencesSpacings() on line 53.
vigenereHacker.py
56. def getUsefulFactors(num):
57. # Returns a list of useful factors of num. By "useful" we mean factors
58. # less than MAX_KEY_LENGTH + 1. For example, getUsefulFactors(144)
59. # returns [2, 72, 3, 48, 4, 36, 6, 24, 8, 18, 9, 16, 12]
60.
61. if num < 2:
62. return [] # numbers less than 2 have no useful factors
63.
64. factors = [] # the list of factors found
The only useful factors for the hacking program’s Kasiski Examination code are of length MAX_KEY_LENGTH and under, not including 1. The getUsefulFactors() takes a num parameter and returns a list of “useful” factors. The function does not necessarily return all the factors of num in this list.
Line 61 checks for the special case where num is less than 2. In this case, line 62 returns the empty list because these numbers have no useful factors.
vigenereHacker.py
66. # When finding factors, you only need to check the integers up to
67. # MAX_KEY_LENGTH.
68. for i in range(2, MAX_KEY_LENGTH + 1): # don't test 1
69. if num % i == 0:
70. factors.append(i)
71. factors.append(int(num / i))
The for loop on line 68 loops through the integers 2 up to MAX_KEY_LENGTH (including the value in MAX_KEY_LENGTH itself, since the second argument to range() is MAX_KEY_LENGTH + 1).
If num % i is equal to 0, then we know that i evenly divides (that is, has 0 remainder) num and is a factor of num. In this case, line 70 appends i to the list of factors in the factors variable. Line 71 also appends num / i (after converting it from a float to an int, since the / operator always evaluates to a float value).
vigenereHacker.py
72. if 1 in factors:
73. factors.remove(1)
The value 1 is not a useful factor, so we remove it from the factors list. (If the Vigenère key had a length of 1, the Vigenère cipher would be no different from the Caesar cipher!)
vigenereHacker.py
74. return list(set(factors))
The factors list might contain duplicates. For example, if getUsefulFactors() was passed 9 for the num parameter, then 9 % 3 == 0 would be True and both i and int(num / i) (both of which evaluate to 3) would have been appended to factors. But we don’t want duplicate numbers to appear in our factors list.
Line 74 passes the list value in factors to set() which returns a set form of the list. The set data type is similar to the list data type, except a set value can only contain unique values. You can pass a list value to the set() function and it will return a set value form of the list. This set value will not have any duplicate values in it. If you pass this set value to list(), it will return a list value version of the set. This is how line 74 removes duplicate values from the factors list. Try typing the following into the interactive shell:
>>> set([1, 2, 3, 3, 4])
set([1, 2, 3, 4])
>>> spam = list(set([2, 2, 2, 'cats', 2, 2]))
>>> spam
[2, 'cats']
>>>
This list(set(factors)) code is an easy way to remove duplicate factors from the factors list. The final list value is then returned from the function.
vigenereHacker.py
77. def getItemAtIndexOne(x):
78. return x[1]
The getItemAtIndexOne() is almost identical to getItemAtIndexZero() from the freqAnalysis.py program in the previous chapter. This function is passed to sort() to sort based on the item at index 1 of the items being sorted. (See the “The Program’s getItemAtIndexZero() Function” section in Chapter 20.)
vigenereHacker.py
81. def getMostCommonFactors(seqFactors):
82. # First, get a count of how many times a factor occurs in seqFactors.
83. factorCounts = {} # key is a factor, value is how often if occurs
84.
85. # seqFactors keys are sequences, values are lists of factors of the
86. # spacings. seqFactors has a value like: {'GFD': [2, 3, 4, 6, 9, 12,
87. # 18, 23, 36, 46, 69, 92, 138, 207], 'ALW': [2, 3, 4, 6, ...], ...}
Remember, we need to know the most common factor of the sequence spacings as a part of the Kasiski Examination because the most common factor is most likely going to be the length of the Vigenère key.
The seqFactors parameter is a dictionary value created in the kasiskiExamination() function, which is explained later. This dictionary has strings of sequences for keys and a list of integer factors for the value of each key. (These are factors of the spacing integers found by findRepeatSequencesSpacings().) For example, seqFactors could contain a dictionary value like {'VRA': [8, 2, 4, 2, 3, 4, 6, 8, 12, 16, 8, 2, 4], 'AZU': [2, 3, 4, 6, 8, 12, 16, 24], 'YBN': [8, 2, 4]}.
The getMostCommonFactors() function will find the most common factors in seqFactors and return a list of two-integer tuples. The first integer in the tuple will be the factor and the second integer will be how many times it was in seqFactors.
For example, getMostCommonFactors() may return a list value such as [(3, 556), (2, 541), (6, 529), (4, 331), (12, 325), (8, 171), (9, 156), (16, 105), (5, 98), (11, 86), (10, 84), (15, 84), (7, 83), (14, 68), (13, 52)]. This means that in the seqFactors dictionary passed to getMostCommonFactors(), the factor 3 showed up 556 times, the factor 2 showed up 541 times, the factor 5 showed up 529 times, and so on. Note that 3 is the most frequent factor in the list and appears first in the list. 13 is the least frequent factor and is last in the list.
vigenereHacker.py
88. for seq in seqFactors:
89. factorList = seqFactors[seq]
90. for factor in factorList:
91. if factor not in factorCounts:
92. factorCounts[factor] = 0
93. factorCounts[factor] += 1
For the first step of getMostCommonFactors() the for loop on line 88 loops over every sequence in seqFactors, storing it in a variable named seq on each iteration. The list of factors in seqFactors for seq is stored in a variable named factorList on line 89.
The factors in this list are looped over with a for loop on line 90. If a factor does not exist as a key in factorCounts, it is added on line 92 with a value of 0. On line 93, factorCounts[factor] (that is, the factor’s value in factorCounts) is incremented.
vigenereHacker.py
95. # Second, put the factor and its count into a tuple, and make a list
96. # of these tuples so we can sort them.
97. factorsByCount = []
98. for factor in factorCounts:
99. # exclude factors larger than MAX_KEY_LENGTH
100. if factor <= MAX_KEY_LENGTH:
101. # factorsByCount is a list of tuples: (factor, factorCount)
102. # factorsByCount has a value like: [(3, 497), (2, 487), ...]
103. factorsByCount.append( (factor, factorCounts[factor]) )
For the second step of getMostCommonFactors(), we need to sort the values in the factorCounts dictionary by their count. But dictionaries do not have an order, so we must first convert the dictionary into a list of two-integer tuples. (We did something similar in Chapter 20 in the getFrequencyOrder() function of the freqAnalaysis.py module.) This list value will be stored in a variable named factorsByCount, which starts as an empty list on line 97.
The for loop on line 98 goes through each of the factors in factorCounts and appends this (factor, factorCounts[factor]) tuple to the factorsByCount list as long as the factor is less than or equal to MAX_KEY_LENGTH.
vigenereHacker.py
105. # Sort the list by the factor count.
106. factorsByCount.sort(key=getItemAtIndexOne, reverse=True)
107.
108. return factorsByCount
After the for loop finishes adding all the tuples to factorsByCount, the last step of getMostCommonFactors() is that the factorsByCount list is sorted on line 106. Because the getItemAtIndexOne function is passed for the key keyword argument and True is passed for the reverse keyword argument, the list is sorted in descending order by the factor counts.
After being sorted, the list in factorsByCount is returned on line 108.
vigenereHacker.py
111. def kasiskiExamination(ciphertext):
112. # Find out the sequences of 3 to 5 letters that occur multiple times
113. # in the ciphertext. repeatedSeqSpacings has a value like:
114. # {'EXG': [192], 'NAF': [339, 972, 633], ... }
115. repeatedSeqSpacings = findRepeatSequencesSpacings(ciphertext)
The kasiskiExamination() function returns a list of the most likely key lengths for the given ciphertext argument. The key lengths are integers in a list, with the first integer in the list being the most likely key length, the second integer the second most likely, and so on.
The first step is to find the spacings between repeated sequences in the ciphertext. This is returned from findRepeatSequencesSpacings() as a dictionary with keys of the sequence strings and values of a list with the spacings as integers.
The extend() list method is very similar to the append() list method. While the append() method adds a single value passed to it to the end of the list, the extend() method will add every item in a list argument to the end of the list. Try typing the following into the interactive shell:
>>> spam = []
>>> eggs = ['cat', 'dog', 'mouse']
>>> spam.extend(eggs)
>>> spam
['cat', 'dog', 'mouse']
>>> spam.extend([1, 2, 3])
>>> spam
['cat', 'dog', 'mouse', 1, 2, 3]
>>>
Notice the difference if you pass a list to the append() list method. The list itself gets appended to the end instead of the values in the list:
>>> spam = []
>>> eggs = ['cat', 'dog', 'mouse']
>>> spam.append(eggs)
>>> spam
[['cat', 'dog', 'mouse']]
>>> spam.append([1, 2, 3])
>>> spam
[['cat', 'dog', 'mouse'], [1, 2, 3]]
>>>
vigenereHacker.py
117. # See getMostCommonFactors() for a description of seqFactors.
118. seqFactors = {}
119. for seq in repeatedSeqSpacings:
120. seqFactors[seq] = []
121. for spacing in repeatedSeqSpacings[seq]:
122. seqFactors[seq].extend(getUsefulFactors(spacing))
While repeatedSeqSpacings is a dictionary that maps sequence strings to lists of integer spacings, we actually need a dictionary that maps sequence strings to lists of factors of those integer spacings. Lines 118 to 122 do this.
Line 118 starts with an empty dictionary in seqFactors. The for loop on line 119 iterates over every key (which is a sequence string) in repeatedSeqSpacings. For each key, line 120 sets a blank list to be the value in seqFactors.
The for loop on line 121 iterates over all the spacing integers, which are each passed to a getUsefulFactors() call. The list returned from getUsefulFactors() has each of its items appended to seqFactors[seq].
When all the for loops are finished, seqFactors is a dictionary that maps sequence strings to lists of factors of integer spacings.
vigenereHacker.py
123. # See getMostCommonFactors() for a description of factorsByCount.
124. factorsByCount = getMostCommonFactors(seqFactors)
The seqFactors dictionary is passed to getMostCommonFactors() on line 124. A list of two-integer tuples (the first integer in the tuple being the factor, the second integer being the count of how often that factor appeared in seqFactors) is returned and stored in factorsByCount.
vigenereHacker.py
126. # Now we extract the factor counts from factorsByCount and
127. # put them in allLikelyKeyLengths so that they are easier to
128. # use later.
129. allLikelyKeyLengths = []
130. for twoIntTuple in factorsByCount:
131. allLikelyKeyLengths.append(twoIntTuple[0])
132.
133. return allLikelyKeyLengths
The kasiskiExamination() function doesn’t return a list of two-integer tuples though, it returns a list of integer factors. These integer factors are in the first item of the two-integer tuples list in factorsByCount, so we need code to pull these integer factors out and put them in a separate list.
This separate list will be stored in allLikelyKeyLengths, which to begin with is set to an empty list on line 129. The for loop on line 130 iterates over each of the tuples in factorsByCount, and appends the tuple’s index 0 item to the end of allLikelyKeyLengths.
After this for loop completes, the allLikelyKeyLengths variable contains all the factor integers that were in factorsByCount. This list is returned from kasiskiExamination().
vigenereHacker.py
137. def getNthSubkeysLetters(n, keyLength, message):
138. # Returns every Nth letter for each keyLength set of letters in text.
139. # E.g. getNthSubkeysLetters(1, 3, 'ABCABCABC') returns 'AAA'
140. # getNthSubkeysLetters(2, 3, 'ABCABCABC') returns 'BBB'
141. # getNthSubkeysLetters(3, 3, 'ABCABCABC') returns 'CCC'
142. # getNthSubkeysLetters(1, 5, 'ABCDEFGHI') returns 'AF'
143.
144. # Use a regular expression to remove non-letters from the message.
145. message = NONLETTERS_PATTERN.sub('', message)
In order to pull out the letters from a ciphertext that were encrypted with the same subkey, we need a function that can create a string for the 1st, 2nd, or “N th” subkey’s letters from a message. The first part of getting these letters is to remove the non-letter characters from message using a regular expression object and its sub() method on line 145. (Regular expressions were first discussed in Chapter 18’s “A Brief Intro to Regular Expressions and the sub() Regex Method”.) This letters-only string is stored as the new value in message.
vigenereHacker.py
147. i = n - 1
148. letters = []
149. while i < len(message):
150. letters.append(message[i])
151. i += keyLength
152. return ''.join(letters)
Next we will build up a string by appending the letter strings to a list and using the join() list method to create the final string value. This approach executes much faster than string concatenation with the + operator. (This approach was first discussed in Chapter 18’s “Building Strings in Python with Lists” section.)
The i variable will point to the index of the letter in message that we want to append to our string-building list. This list is stored in a variable named letters. The i variable starts with the value n - 1 on line 147 and the letters variable starts with a blank list on line 148.
The while loop on line 149 keeps looping while i is less than the length of message. On each iteration the letter at message[i] is appended to the list in letters. Then i is updated to point to the next of the subkey’s letters by adding keyLength to i on line 151.
After this loop finishes, the code on line 152 joins the single-letter string values in the letters list together to form a single string, and this string is returned from getNthSubkeysLetters().
vigenereHacker.py
155. def attemptHackWithKeyLength(ciphertext, mostLikelyKeyLength):
156. # Determine the most likely letters for each letter in the key.
157. ciphertextUp = ciphertext.upper()
Recall that our kasiskiExamination() function isn’t guaranteed to return the one true integer length of the Vigenère key, but rather the function returns a list of several lengths sorted in order of most likely to be the key length. If our code has guessed the wrong key length, then it will have to try again with a different key length. The attemptHackWithKeyLength() function is passed the ciphertext and the key length guess. If successful, this function returns a string of the hacked message. If the hacking fails, the function returns None.
The hacking code works on uppercase letters but the original string will also be needed, so the uppercase form of the ciphertext string will be stored in a separate variable named ciphertextUp.
vigenereHacker.py
158. # allFreqScores is a list of mostLikelyKeyLength number of lists.
159. # These inner lists are the freqScores lists.
160. allFreqScores = []
161. for nth in range(1, mostLikelyKeyLength + 1):
162. nthLetters = getNthSubkeysLetters(nth, mostLikelyKeyLength, ciphertextUp)
If we assume the value in the mostLikelyKeyLength is the correct key length, the hack algorithm calls getNthSubkeysLetters() for each subkey and then brute-forces through the 26 possible letters for each subkey to find the one that produces decrypted text whose letter frequency closest matches the letter frequency of English.
First, an empty list is stored in allFreqScores on line 160. What this list stores will be explained a little later.
The for loop on line 161 sets the nth variable to each integer from 1 to the mostLikelyKeyLength value. (Remember, that when range() is passed two arguments, the range goes up to, but not including, the second argument. The + 1 is put into the code so that the integer value in mostLikelyKeyLength itself is included in the range object returned.)
The letters of the Nth subkey are returned from getNthSubkeysLetters() on line 162.
vigenereHacker.py
164. # freqScores is a list of tuples like:
165. # [(<letter>, <Eng. Freq. match score>), ... ]
166. # List is sorted by match score. Higher score means better match.
167. # See the englishFreqMatchScore() comments in freqAnalysis.py.
168. freqScores = []
169. for possibleKey in LETTERS:
170. decryptedText = vigenereCipher.decryptMessage(possibleKey, nthLetters)
171. keyAndFreqMatchTuple = (possibleKey, freqAnalysis.englishFreqMatchScore(decryptedText))
172. freqScores.append(keyAndFreqMatchTuple)
Next, a list of English frequency match scores is stored in a list in a variable named freqScores. This variable starts as an empty list on line 168 and then the for loop on line 169 loops through each of the 26 uppercase letter from the LETTERS string. The possibleKey value is used to decrypt the ciphertext by calling vigenereCipher.decryptMessage() on line 170. The subkey in possibleKey is only one letter, but the string in nthLetters is made up of only the letters from message that would have been encrypted with that subkey if we’ve guessed the key length correctly.
The decrypted text is then passed to freqAnalysis.englishFreqMatchScore() to see how closely the frequency of the letters in decryptedText matches the letter frequency of regular English. (Remember from the last chapter that the return value will be an integer between 0 and 12, with a higher number meaning a closer match.)
This frequency match score, along with the key used to decrypt, are put into a tuple that is stored in a variable named keyAndFreqMatchTuple on line 171. This tuple is appended to the end of freqScores on line 172.
vigenereHacker.py
173. # Sort by match score
174. freqScores.sort(key=getItemAtIndexOne, reverse=True)
After the for loop on line 169 completes, the freqScores list will contain 26 key-and-frequency-match-score tuples: one tuple for each of the 26 subkeys. We need to sort this so that the tuples with the largest English frequency match scores are first in the list.
This means that we want to sort by the value at index 1 of the tuples in freqScores and in reverse (that is, descending) order. We call the sort() method on the freqScores list, passing the function value getItemAtIndexOne (not calling the function: note the lack of parentheses) for the key keyword argument. The value True is passed for the reverse keyword argument to sort in reverse (that is, descending) order.
vigenereHacker.py
176. allFreqScores.append(freqScores[:NUM_MOST_FREQ_LETTERS])
The NUM_MOST_FREQ_LETTERS constant was set to the integer value 3 on line 9. Once the tuples in freqScores are sorted, a list containing only the first 3 tuples (that is, the tuples with the three highest English frequency match scores) is appended to allFreqScores.
After the for loop on line 161 completes, allFreqScores will contain a number of list values equal to the integer value in mostLikelyKeyLength. (For example, since mostLikelyKeyLength was 3, allFreqScores would be a list of three lists.) The first list value will hold the tuples for the top three highest matching subkeys for the first subkey of the full Vigenère key. The second list value will hold the tuples for the top three highest matching subkeys for the second subkey of the full Vigenère key, and so on.
Originally, if we wanted to brute-force through the full Vigenère key, there would be (26 ^ key length) number of possible keys. For example, if the key was ROSEBUD (with a length of 7) there would be 26 ^ 7 (that is, 8,031,810,176) possible keys.
But by checking the English frequency matching, we’ve narrowed it down to the 4 most likely letters for each subkey, meaning that there are now only (4 ^ key length) possible keys. Using the example of ROSEBUD (with a length of 7) for a Vigenère key, now we only need to check 4 ^ 7 (that is, 16,384) possible keys. This is a huge improvement over 8 billion!
vigenereHacker.py
178. if not SILENT_MODE:
179. for i in range(len(allFreqScores)):
180. # use i + 1 so the first letter is not called the "0th" letter
181. print('Possible letters for letter %s of the key: ' % (i + 1), end='')
182. for freqScore in allFreqScores[i]:
183. print('%s ' % freqScore[0], end='')
184. print() # print a newline
At this point, the user might want to know which letters are in the top three most likely list for each subkey. If the SILENT_MODE constant was set to False, then the code on lines 178 to 184 will print out the values in allFreqScores to the screen.
Whenever the print() function is called, it will print the string passed to it on the screen along with a newline character. If we want something else printed at the end of the string instead of a newline character, we can specify the string for the print() function’s end keyword argument. Try typing the following into the interactive shell:
>>> print('HEllo', end='\n')
HEllo
>>> print('Hello', end='\n')
Hello
>>> print('Hello', end='')
Hello>>> print('Hello', end='XYZ')
HelloXYZ>>>
(The above was typed into the python.exe interactive shell rather than IDLE. IDLE will always add a newline character before printing the >>> prompt.)
The itertools.product() function produces every possible combination of items in a list or list-like value, such as a string or tuple. (Though the itertools.product() function returns a “itertools product” object value, this can be converted to a list by passing it to list().) This combination of things is called a Cartesian product, which is where the function gets its name. Try typing the following into the interactive shell:
>>> import itertools
>>> itertools.product('ABC', repeat=4)
<itertools.product object at 0x02C40170>
>>> list(itertools.product('ABC', repeat=4))
[('A', 'A', 'A', 'A'), ('A', 'A', 'A', 'B'), ('A', 'A', 'A', 'C'), ('A', 'A', 'B', 'A'), ('A', 'A', 'B', 'B'), ('A', 'A', 'B', 'C'), ('A', 'A', 'C', 'A'), ('A', 'A', 'C', 'B'), ('A', 'A', 'C', 'C'), ('A', 'B', 'A', 'A'), ('A', 'B', 'A', 'B'), ('A', 'B', 'A', 'C'), ('A', 'B', 'B', 'A'), ('A', 'B', 'B', 'B'),
...skipped for brevity...
('C', 'B', 'C', 'B'), ('C', 'B', 'C', 'C'), ('C', 'C', 'A', 'A'), ('C', 'C', 'A', 'B'), ('C', 'C', 'A', 'C'), ('C', 'C', 'B', 'A'), ('C', 'C', 'B', 'B'), ('C', 'C', 'B', 'C'), ('C', 'C', 'C', 'A'), ('C', 'C', 'C', 'B'), ('C', 'C', 'C', 'C')]
As you can see, by passing 'ABC' and the integer 4 for the repeat keyword argument, itertools.product() returns an “itertools product” object that, when converted to a list, has tuples of four values with every possible combination of 'A', 'B', and 'C'. (This results in a list with a total of 3 ^ 4 or 81 tuples in it.)
Since range objects returned from range() are also list-like, they can be passed to itertools.product() as well. Try typing the following into the interactive shell:
>>> import itertools
>>> list(itertools.product(range(8), repeat=5))
[(0, 0, 0, 0, 0), (0, 0, 0, 0, 1), (0, 0, 0, 0, 2), (0, 0, 0, 0, 3), (0, 0, 0, 0, 4), (0, 0, 0, 0, 5), (0, 0, 0, 0, 6), (0, 0, 0, 0, 7), (0, 0, 0, 1, 0), (0, 0, 0, 1, 1), (0, 0, 0, 1, 2), (0, 0, 0, 1, 3), (0, 0, 0, 1, 4),
...skipped for brevity...
(7, 7, 7, 6, 6), (7, 7, 7, 6, 7), (7, 7, 7, 7, 0), (7, 7, 7, 7, 1), (7, 7, 7, 7, 2), (7, 7, 7, 7, 3), (7, 7, 7, 7, 4), (7, 7, 7, 7, 5), (7, 7, 7, 7, 6), (7, 7, 7, 7, 7)]
When the range object returned from range(8) is passed to itertools.product() (along with 5 for the repeat keyword argument), the list that is generated has tuples of 5 values, and each value are from the integers 0 to 7.
The itertools.product() function is an easy way to generate a list with every possible combination of some group of values. This is how our hacking program will create integer indexes to test every possible combination of possible subkeys.
vigenereHacker.py
186. # Try every combination of the most likely letters for each position
187. # in the key.
188. for indexes in itertools.product(range(NUM_MOST_FREQ_LETTERS), repeat=mostLikelyKeyLength):
The allFreqScores variable is a list of lists of tuples such that allFreqScores[i] will evaluate to a list of tuples of possible letters for a single subkey. That is, allFreqScores[0] has a list of tuples for the first subkey, allFreqScores[1] has a list of tuples for the second subkey, and so on.
Also, since the NUM_MOST_FREQ_LETTERS constant is set to 4, itertools.product(range(NUM_MOST_FREQ_LETTERS), repeat=mostLikelyKeyLength) will cause the for loop to have a tuple of integers (from 0 to 3) for the indexes variable. If 5 was passed for mostLikelyKeyLength, then the following values would be set to indexes for each iteration:
Table 21-4. Value of indexes on each iteration.
On the 1st iteration, indexes is set to: |
(0, 0, 0, 0, 0) |
On the 2nd iteration, indexes is set to: |
(0, 0, 0, 0, 1) |
On the 3rd iteration, indexes is set to: |
(0, 0, 0, 0, 2) |
On the 4th iteration, indexes is set to: |
(0, 0, 0, 1, 0) |
On the 5th iteration, indexes is set to: |
(0, 0, 0, 1, 1) |
And so on… |
|
vigenereHacker.py
189. # Create a possible key from the letters in allFreqScores
190. possibleKey = ''
191. for i in range(mostLikelyKeyLength):
192. possibleKey += allFreqScores[i][indexes[i]][0]
The full Vigenère key will be constructed from the subkeys in allFreqScores using the indexes supplied by indexes. The key starts off as a blank string on line 190, and the for loop on line 191 will iterate through the integers from 0 up to, but not including, mostLikelyKeyLength.
As the i variable changes for each iteration of the for loop, the value at indexes[i] will be the index of the tuple we want to use in allFreqScores[i]. This is why allFreqScores[i][indexes[i]] evaluates to the correct tuple we want (and the subkey we want is at index 0 in that tuple).
vigenereHacker.py
194. if not SILENT_MODE:
195. print('Attempting with key: %s' % (possibleKey))
If SILENT_MODE is False, the key created by the for loop on line 191 is printed to the screen.
vigenereHacker.py
197. decryptedText = vigenereCipher.decryptMessage(possibleKey, ciphertextUp)
198.
199. if detectEnglish.isEnglish(decryptedText):
200. # Set the hacked ciphertext to the original casing.
201. origCase = []
202. for i in range(len(ciphertext)):
203. if ciphertext[i].isupper():
204. origCase.append(decryptedText[i].upper())
205. else:
206. origCase.append(decryptedText[i].lower())
207. decryptedText = ''.join(origCase)
Now that we have a complete Vigenère key, lines 197 to 208 will decrypt the ciphertext and check if the decrypted text is readable English. If it is, then it is printed to the screen for the user to confirm it is English (since isEnglish() might produce a false positive).
But decryptedText is in all uppercase letters. The code on lines 201 to 207 builds a new string by appending the origCase list with an uppercase or lowercase form of the letters in decryptedText. The for loop on line 202 goes through each of the indexes in the ciphertext string (which, unlike ciphertextUp, has the original casing of the ciphertext). If ciphertext[i] is uppercase, then the uppercase form of decryptedText[i] is appended to origCase. Otherwise, the lowercase form of decryptedText[i] is appended. The list in origCase is then joined together on line 207 to become the new value of decryptedText.
This table shows how the ciphertext and decryptedText values produce the strings that go into origCase:
ciphertext |
Adiz Avtzqeci Tmzubb wsa m Pmilqev halpqavtakuoi |
decryptedText |
ALAN MATHISON TURING WAS A BRITISH MATHEMATICIAN |
''.join(origCase) |
Alan Mathison Turing was a British mathematician |
vigenereHacker.py
209. # Check with user to see if the key has been found.
210. print('Possible encryption hack with key %s:' % (possibleKey))
211. print(decryptedText[:200]) # only show first 200 characters
212. print()
213. print('Enter D for done, or just press Enter to continue hacking:')
214. response = input('> ')
215.
216. if response.strip().upper().startswith('D'):
217. return decryptedText
The correctly-cased decrypted text is printed to the screen for the user to confirm it is English. If the user enters 'D' then the function returns the decryptedText string.
vigenereHacker.py
219. # No English-looking decryption found, so return None.
220. return None
Otherwise, after the for loop on line 188 iterates through all of the possible indexes to use and none of the decryptions look like English, the hacking has failed and the None value is returned.
vigenereHacker.py
223. def hackVigenere(ciphertext):
224. # First, we need to do Kasiski Examination to figure out what the
225. # length of the ciphertext's encryption key is.
226. allLikelyKeyLengths = kasiskiExamination(ciphertext)
Now we define the hackVigenere() function, which calls all of the previous functions. We’ve already defined all the work it will do. Let’s run through the steps it goes through to perform the hacking. The first step is to get the most likely lengths of the Vigenère key based on Kasiski Examination of ciphertext.
vigenereHacker.py
227. if not SILENT_MODE:
228. keyLengthStr = ''
229. for keyLength in allLikelyKeyLengths:
230. keyLengthStr += '%s ' % (keyLength)
231. print('Kasiski Examination results say the most likely key lengths are: ' + keyLengthStr + '\n')
The likely key lengths are printed to the screen if SILENT_MODE is False.
Similar to how the continue statement is used inside of a loop to continue back to the start of the loop, the break statement (which is just the break keyword by itself) is used inside of a loop to immediately exit the loop. When the program execution “breaks out of a loop”, it immediately moves to the first line of code after the loop ends.
vigenereHacker.py
233. for keyLength in allLikelyKeyLengths:
234. if not SILENT_MODE:
235. print('Attempting hack with key length %s (%s possible keys)...' % (keyLength, NUM_MOST_FREQ_LETTERS ** keyLength))
236. hackedMessage = attemptHackWithKeyLength(ciphertext, keyLength)
237. if hackedMessage != None:
238. break
For each possible key length, the code calls the attemptHackWithKeyLength() function on line 236. If attemptHackWithKeyLength() does not return None, then the hack was successful and the program execution should break out of the for loop on line 238.
vigenereHacker.py
240. # If none of the key lengths we found using Kasiski Examination
241. # worked, start brute-forcing through key lengths.
242. if hackedMessage == None:
243. if not SILENT_MODE:
244. print('Unable to hack message with likely key length(s). Brute-forcing key length...')
245. for keyLength in range(1, MAX_KEY_LENGTH + 1):
246. # don't re-check key lengths already tried from Kasiski
247. if keyLength not in allLikelyKeyLengths:
248. if not SILENT_MODE:
249. print('Attempting hack with key length %s (%s possible keys)...' % (keyLength, NUM_MOST_FREQ_LETTERS ** keyLength))
250. hackedMessage = attemptHackWithKeyLength(ciphertext, keyLength)
251. if hackedMessage != None:
252. break
If the hack had failed for all the possible key lengths that kasiskiExamination() returned, hackedMessage will be set to None when the if statement on line 242 executes. In this case, all the other key lengths up to MAX_KEY_LENGTH are tried. If Kasiski Examination failed to calculate the correct key length, then we can just brute-force through the key lengths.
Line 245 starts a for loop that will call attemptHackWithKeyLength() for each value of keyLength (which ranges from 1 to MAX_KEY_LENGTH) as long as it was not in allLikelyKeyLengths. (This is because the key lengths in allLikelyKeyLengths have already been tried in the code on lines 233 to 238.)
vigenereHacker.py
253. return hackedMessage
Finally, the value in hackedMessage is returned on line 253.
vigenereHacker.py
256. # If vigenereHacker.py is run (instead of imported as a module) call
257. # the main() function.
258. if __name__ == '__main__':
259. main()
Lines 258 and 259 call the main() function if this program was run by itself rather than imported by another program.
That’s the full Vigenère hacking program. Whether it is successful or not depends on the characteristics of the ciphertext. Also, the closer the original plaintext’s letter frequency is to regular English’s letter frequency and the longer the plaintext, the more likely our hacking program will work.
Practice exercises can be found at http://invpy.com/hackingpractice21A.
There are a few things we can modify if the hacking program doesn’t work though. There are three constants we set on lines 8 to 10 that affect how our hacking program runs:
vigenereHacker.py
8. MAX_KEY_LENGTH = 16 # will not attempt keys longer than this
If the Vigenère key was longer than the integer in MAX_KEY_LENGTH, there is no possible way the hacking program will find the correct key. However, if we have MAX_KEY_LENGTH set very high and the kasiskiExamination() function mistakenly thinks that the key length could be a very large integer, the program could be spending hours (or days or months) attempting to hack the ciphertext with wrong key lengths.
Trying to hack the wrong key length that is small is not that big of a problem: it will only take seconds or minutes to go through the likely keys of that length. If the hacking program fails to hack the ciphertext, try increasing this value and running the program again.
vigenereHacker.py
9. NUM_MOST_FREQ_LETTERS = 3 # attempts this many letters per subkey
The NUM_MOST_FREQ_LETTERS limits the number of possible letters tried for each subkey. By increasing this value, the hacking program tries many more keys (which is needed if the freqAnalysis.englishFreqMatchScore() was inaccurate for the original plaintext message), but this will also cause the program to slow down. And setting NUM_MOST_FREQ_LETTERS to 26 will cause the program to not narrow down the number of possible letters for each subkey at all!
Table 21-5. Tradeoffs for the MAX_KEY_LENGTH and NUM_MOST_FREQ_LETTERS.
Smaller value: |
Larger value: |
Faster to execute. |
Slower to execute. |
Less likely to hack. |
More likely to hack. |
vigenereHacker.py
10. SILENT_MODE = False # if set to True, program doesn't print attempts
While your computer can perform calculations very fast, displaying characters on the screen is relatively slow. If you want to speed up your program, you can set SILENT_MODE to True so that the program does not waste time printing information to the screen. On the downside, you will not know how the program is doing until it has completely finished running.
Hacking the Vigenère cipher requires several detailed steps to follow. There are also many parts where our hacking program could fail: perhaps the Vigenère key used for encryption was larger in length than MAX_KEY_LENGTH, or perhaps the English frequency matching function got inaccurate results because the plaintext doesn’t follow normal letter frequency, or maybe the plaintext has too many words that aren’t in our dictionary file and isEnglish() doesn’t recognize it as English.
If you identify different ways that the hacking program could fail, you could change the code to become ever more sophisticated to handle these other cases. But the hacking program in this book does a pretty good job at reducing billions or trillions of possible keys to brute-force through to mere thousands.
However, there is one trick to make the Vigenère cipher mathematically impossible to break, no matter how powerful your computer or how clever your hacking program is. We’ll learn about these “one-time pads” in the next chapter.