Chapter 11 - Detecting English Programmatically

11
DETECTING ENGLISH PROGRAMMATICALLY

The gaffer says something longer and more complicated.
After a while, Waterhouse (now wearing his cryptoanalyst hat, searching for meaning midst apparent randomness, his neural circuits exploiting the redundancies in the signal) realizes that the man is speaking heavily accented English.
—Neal Stephenson, Cryptonomicon

Previously, we used the transposition file cipher to encrypt and decrypt entire files, but we haven’t tried writing a brute-force program to hack the cipher yet. Messages encrypted with the transposition file cipher can have thousands of possible keys, which your computer can still easily brute-force, but you’d then have to look through thousands of decryptions to find the one correct plaintext. As you can imagine, this can be a big problem, but there is a work-around.

When the computer decrypts a message using the wrong key, the resulting string is garbage text instead of English text. We can program the computer to recognize when a decrypted message is English. That way, if the computer decrypts using the wrong key, it knows to go on and try the next possible key. Eventually, when the computer tries a key that decrypts to English text, it can stop and bring that key to your attention, sparing you from having to look through thousands of incorrect decryptions.

How Can a Computer Understand English?

A computer can’t understand English, at least, not in the way that human beings understand English. Computers don’t understand math, chess, or human rebellions either, any more than a clock understands lunchtime. Computers just execute instructions one after another. But these instructions can mimic complex behaviors to solve math problems, win at chess, or hunt down the future leaders of the human resistance.

Ideally, what we need to create is a Python function (let’s call it the isEnglish() function) that we can pass a string to and get a return value of True if the string is English text or False if it’s random gibberish. Let’s look at some English text and some garbage text to see what patterns they might have:

Robots are your friends. Except for RX-686. She will try to eat you.
ai-pey e. xrx ne augur iirl6 Rtiyt fhubE6d hrSei t8..ow eo.telyoosEs t

Notice that the English text is made up of words that you would find in a dictionary, but the garbage text isn’t. Because words are usually separated by spaces, one way of checking whether a message string is English is to split the message into smaller strings at each space and to check whether each substring is a dictionary word. To split the message strings into substrings, we can use the Python string method named split(), which checks where each word begins and ends by looking for spaces between characters. (“The split() Method” on page 150 covers this in more detail.) We can then compare each substring to each word in the dictionary using an if statement, as in the following code:

if word == 'aardvark' or word == 'abacus' or word == 'abandon' or word ==
'abandoned' or word == 'abbreviate' or word == 'abbreviation' or word ==
'abdomen' or ...

We could write code like that, but we probably wouldn’t because it would be tedious to type it all out. Fortunately, we can use English dictionary files, which are text files that contain nearly every word in English. I’ll provide you with a dictionary file to use, so we just need to write the isEnglish() function that checks whether the substrings in the message are in the dictionary file.

Not every word exists in our dictionary file. The dictionary file might be incomplete; for example, it might not have the word aardvark. There are also perfectly good decryptions that might have non-English words in them, such as RX-686 in our example English sentence. The plaintext could also be in a different language, but we’ll assume it’s in English for now.

So the isEnglish() function won’t be foolproof, but if most of the words in the string argument are English words, it’s a good bet the string is English text. There’s a very low probability that a ciphertext decrypted using the wrong key will decrypt to English.

You can download the dictionary file we’ll use for this book (which has more than 45,000 words) from https://www.nostarch.com/crackingcodes/. The dictionary text file lists one word per line in uppercase. Open it, and you’ll see something like this:

AARHUS
AARON
ABABA
ABACK
ABAFT
ABANDON
ABANDONED
ABANDONING
ABANDONMENT
ABANDONS
--snip--

Our isEnglish() function will split a decrypted string into individual substrings and check whether each substring exists as a word in the dictionary file. If a certain number of the substrings are English words, we’ll identify that text as English. And if the text is in English, there’s a good chance that we’ll have successfully decrypted the ciphertext with the correct key.

Source Code for the Detect English Module

Open a new file editor window by selecting File▸New File. Enter the following code into the file editor and then save it as detectEnglish.py. Make sure dictionary.txt is in the same directory as detectEnglish.py or this code won’t work. Press F5 to run the program.

detectEnglish.py

1. # Detect English module
2. # https://www.nostarch.com/crackingcodes/ (BSD Licensed)
3.
4. # To use, type this code:
5. #   import detectEnglish
6. #   detectEnglish.isEnglish(someString) # Returns True or False
7. # (There must be a "dictionary.txt" file in this directory with all
8. # English words in it, one word per line. You can download this from
9. # https://www.nostarch.com/crackingcodes/.)
10. UPPERLETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
11. LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'
12.
13. def loadDictionary():
14.     dictionaryFile = open('dictionary.txt')
15.     englishWords = {}
16.     for word in dictionaryFile.read().split('\n'):
17.         englishWords[word] = None
18.     dictionaryFile.close()
19.     return englishWords
20.
21. ENGLISH_WORDS = loadDictionary()
22.
23.
24. def getEnglishCount(message):
25.     message = message.upper()
26.     message = removeNonLetters(message)
27.     possibleWords = message.split()
28.
29.     if possibleWords == []:
30.         return 0.0 # No words at all, so return 0.0
31.
32.     matches = 0
33.     for word in possibleWords:
34.         if word in ENGLISH_WORDS:
35.             matches += 1
36.     return float(matches) / len(possibleWords)
37.
38.
39. def removeNonLetters(message):
40.     lettersOnly = []
41.     for symbol in message:
42.         if symbol in LETTERS_AND_SPACE:
43.             lettersOnly.append(symbol)
44.     return ''.join(lettersOnly)
45.
46.
47. def isEnglish(message, wordPercentage=20, letterPercentage=85):
48.     # By default, 20% of the words must exist in the dictionary file, and
49.     # 85% of all the characters in the message must be letters or spaces
50.     # (not punctuation or numbers).
51.     wordsMatch = getEnglishCount(message) * 100 >= wordPercentage
52.     numLetters = len(removeNonLetters(message))
53.     messageLettersPercentage = float(numLetters) / len(message) * 100
54.     lettersMatch = messageLettersPercentage >= letterPercentage
55.     return wordsMatch and lettersMatch

Sample Run of the Detect English Module

The detectEnglish.py program we’ll write in this chapter won’t run by itself. Instead, other encryption programs will import detectEnglish.py so they can call the detectEnglish.isEnglish() function, which returns True when the string is determined to be English. This is why we don’t give detectEnglish.py a main() function. The other functions in detectEnglish.py are helper functions that the isEnglish() function will call. All the work we’ll do in this chapter will allow any program to import the detectEnglish module with an import statement and use the functions in it.

You’ll also be able to use this module in the interactive shell to check whether an individual string is in English, as shown here:

>>> import detectEnglish
>>> detectEnglish.isEnglish('Is this sentence English text?')
True

In this example, the function determined that the string 'Is this sentence English text?' is indeed in English, so it returns True.

Instructions and Setting Up Constants

Let’s look at the first portion of the detectEnglish.py program. The first nine lines of code are comments that give instructions on how to use this module.

The first nine lines of code are comments that give instructions on how to use this module. They remind users that this module won’t work unless a file named dictionary.txt is in the same directory as detectEnglish.py.

Lines 10 and 11 set up a few variables as constants, which are in uppercase. As you learned in Chapter 5, constants are variables whose values should never be changed after they’re set. UPPERLETTERS is a constant containing the 26 uppercase letters that we set up for convenience and to save time typing. We use the UPPERLETTERS constant to set up LETTERS_AND_SPACE, which contains all the uppercase and lowercase letters of the alphabet as well as the space character, the tab character, and the newline character. Instead of typing out all the uppercase and lowercase letters, we just concatenate UPPERLETTERS with the lowercase letters returned by UPPERLETTERS.lower() and the additional non-letter characters. The tab and newline characters are represented with the escape characters \t and \n.

The Dictionary Data Type

Before we continue with the rest of the detectEnglish.py code, you need to learn more about the dictionary data type to understand how to convert the text in the file into a string value. The dictionary data type (not to be confused with the dictionary file) stores values, which can contain multiple other values just as lists do. In lists, we use an integer index to retrieve items in the list, such as spam[42]. But for each item in the dictionary value, we instead use a key to retrieve a value. Although we can use only integers to retrieve items from a list, the key in a dictionary value can be an integer or a string, such as spam['hello'] or spam[42]. Dictionaries let us organize our program’s data with more flexibility than lists and don’t store items in any particular order. Instead of using square brackets as lists do, dictionaries use braces. For example, an empty dictionary looks like this {}.

NOTE

Keep in mind that dictionary files and dictionary values are completely different concepts that just happen to have similar names. A Python dictionary value can contain multiple other values. A dictionary file is a text file containing English words.

A dictionary’s items are typed as key-value pairs, in which the keys and values are separated by colons. Multiple key-value pairs are separated by commas. To retrieve values from a dictionary, use square brackets with the key between them, similar to when indexing with lists. To try retrieving values from a dictionary using keys, enter the following into the interactive shell:

>>> spam = {'key1': 'This is a value', 'key2': 42}
>>> spam['key1']
'This is a value'

First, we set up a dictionary called spam with two key-value pairs. We then access the value associated with the 'key1' string key, which is another string. As with lists, you can store all kinds of data types in your dictionaries.

Note that, as with lists, variables don’t store dictionary values; instead, they store references to dictionaries. The following example code shows two variables with references to the same dictionary:

>>> spam = {'hello': 42}
>>> eggs = spam
>>> eggs['hello'] = 99
>>> eggs
{'hello': 99}
>>> spam
{'hello': 99}

The first line of code sets up another dictionary called spam, this time with only one key-value pair. You can see that it stores an integer value 42 associated with the 'hello' string key. The second line assigns that dictionary key-value pair to another variable called eggs. You can then use eggs to change the original dictionary value associated with the 'hello' string key to 99. Now both variables, eggs and spam, should return the same dictionary key-value pair with the updated value.

The Difference Between Dictionaries and Lists

Dictionaries are like lists in many ways, but there are a few important differences:

Dictionary items are not in any order. There is no first or last item in a dictionary as there is in a list.
You can’t concatenate dictionaries with the + operator. If you want to add a new item, use indexing with a new key. For example, foo['a new key'] = 'a string'.
Lists only have integer index values that range from 0 to the length of the list minus one, but dictionaries can use any key. If you have a dictionary stored in a variable spam, you can store a value in spam[3] without needing values for spam[0], spam[1], or spam[2].

Adding or Changing Items in a Dictionary

You can also add or change values in a dictionary by using the dictionary keys as indexes. Enter the following into the interactive shell to see how this works:

>>> spam = {42: 'hello'}
>>> print(spam[42])
hello
>>> spam[42] = 'goodbye'
>>> print(spam[42])
goodbye

This dictionary has an existing dictionary string value 'hello' associated with the key 42. We can reassign a new string value 'goodbye' to that key using spam[42] = 'goodbye'. Assigning a new value to an existing dictionary key overwrites the original value associated with that key. For example, when we try to access the dictionary with the key 42, we get the new value associated with it.

And just as lists can contain other lists, dictionaries can also contain other dictionaries (or lists). To see an example, enter the following into the interactive shell:

>>> foo = {'fizz': {'name': 'Al', 'age': 144}, 'moo':['a', 'brown', 'cow']}
>>> foo['fizz']
{'age': 144, 'name': 'Al'}
>>> foo['fizz']['name']
'Al'
>>> foo['moo']
['a', 'brown', 'cow']
>>> foo['moo'][1]
'brown'

This example code shows a dictionary (named foo) that contains two keys 'fizz' and 'moo', each corresponding to a different value and data type. The 'fizz' key holds another dictionary, and the 'moo' key holds a list. (Remember that dictionary values don’t keep their items in order. This is why foo['fizz'] shows the key-value pairs in a different order from what you typed.) To retrieve a value from a dictionary nested within another dictionary, you first specify the key of the larger data set you want to access using square brackets, which is 'fizz' in this example. Then you use square brackets again and enter the key 'name' corresponding to the nested string value 'Al' that you want to retrieve.

Using the len() Function with Dictionaries

The len() function shows you the number of items in a list or the number of characters in a string. It can also show you the number of items in a dictionary. Enter the following code into the interactive shell to see how to use the len() function to count items in a dictionary:

>>> spam = {}
>>> len(spam)
0
>>> spam['name'] = 'Al'
>>> spam['pet'] = 'Zophie the cat'
>>> spam['age'] = 89
>>> len(spam)
3

The first line of this example shows an empty dictionary called spam. The len() function correctly shows the length of this empty dictionary is 0. However, after you introduce the following three values, 'Al', 'Zophie the cat', and 89, into the dictionary, the len() function now returns 3 for the three key-value pairs you’ve just assigned to the variable.

Using the in Operator with Dictionaries

You can use the in operator to see whether a certain key exists in a dictionary. It’s important to remember that the in operator checks keys, not values. To see this operator in action, enter the following into the interactive shell:

>>> eggs = {'foo': 'milk', 'bar': 'bread'}
>>> 'foo' in eggs
True
>>> 'milk' in eggs
False
>>> 'blah blah blah' in eggs
False
>>> 'blah blah blah' not in eggs
True

We set up a dictionary called eggs with some key-value pairs and then check which keys exist in the dictionary using the in operator. The key 'foo' is a key in eggs, so True is returned. Whereas 'milk' returns False because it is a value, not a key, 'blah blah blah' evaluates to False because no such item exists in this dictionary. The not in operator works with dictionary values as well, which you can see in the last command.

Finding Items Is Faster with Dictionaries than with Lists

Imagine the following list and dictionary values in the interactive shell:

>>> listVal = ['spam', 'eggs', 'bacon']
>>> dictionaryVal = {'spam':0, 'eggs':0, 'bacon':0}

Python can evaluate the expression 'bacon' in dictionaryVal a bit faster than 'bacon' in listVal. This is because for a list, Python must start at the beginning of the list and then move through each item in order until it finds the search item. If the list is very large, Python must search through numerous items, a process that can take a lot of time.

But a dictionary, also called a hash table, directly translates where in the computer’s memory the value for the key-value pair is stored, which is why a dictionary’s items don’t have an order. No matter how large the dictionary is, finding any item always takes the same amount of time.

This difference in speed is hardly noticeable when searching short lists and dictionaries. But our detectEnglish module will have tens of thousands of items, and the expression word in ENGLISH_WORDS, which we’ll use in our code, will be evaluated many times when the isEnglish() function is called. Using dictionary values speeds up this process when handling a large number of items.

Using for Loops with Dictionaries

You can also iterate over the keys in a dictionary using for loops, just as you can iterate over the items in a list. Enter the following into the interactive shell:

>>> spam = {'name': 'Al', 'age': 99}
>>> for k in spam:
... print(k, spam[k])
...
Age 99
name Al

To use a for statement to iterate over the keys in a dictionary, start with the for keyword. Set the variable k, use the in keyword to specify that you want to loop over spam and end the statement with a colon. As you can see, entering print(k, spam[k]) returns each key in the dictionary along with its corresponding value.

Implementing the Dictionary File

Now let’s return to detectEnglish.py and set up the dictionary file. The dictionary file sits on the user’s hard drive, but unless we load the text in this file as a string value, our Python code can’t use it. We’ll create a loadDictionary() helper function to do this:

13. def loadDictionary():
14. dictionaryFile = open('dictionary.txt')
15. englishWords = {}

First, we get the dictionary’s file object by calling open() and passing the string of the filename 'dictionary.txt'. Then we name the dictionary variable englishWords and set it to an empty dictionary.

We’ll store all the words in the dictionary file (the file that stores the English words) in a dictionary value (the Python data type). The similar names are unfortunate, but the two are completely different. Even though we could have used a list to store the string values of each word in the dictionary file, we’re using a dictionary instead because the in operator works faster on dictionaries than lists.

Next, you’ll learn about the split() string method, which we’ll use to split our dictionary file into substrings.

The split() Method

The split() string method takes one string and returns a list of several strings by splitting the passed string at each space. To see an example of how this works, enter the following into the interactive shell:

>>> 'My very energetic mother just served us Nutella.'.split()
['My', 'very', 'energetic', 'mother', 'just', 'served', 'us', 'Nutella.']

The result is a list of eight strings, one string for each of the words in the original string. Spaces are dropped from the items in the list, even if there is more than one space. You can pass an optional argument to the split() method to tell it to split on a different string other than a space. Enter the following into the interactive shell:

>>> 'helloXXXworldXXXhowXXXareXXyou?'.split('XXX')
['hello', 'world', 'how', 'areXXyou?']

Notice that the string doesn’t have any spaces. Using split('XXX') splits the original string wherever 'XXX' occurs, resulting in a list of four strings. The last part of the string, 'areXXyou?', isn’t split because 'XX' isn’t the same as 'XXX'.

Splitting the Dictionary File into Individual Words

Let’s return to our source code in detectEnglish.py to see how we split the string in the dictionary file and store each word in a key.

16. for word in dictionaryFile.read().split('\n'):
17. englishWords[word] = None

Let’s break down line 16. The dictionaryFile variable stores the file object of the opened file. The dictionaryFile.read() method call reads the entire file and returns it as one large string value. We then call the split() method on this long string and split on newline characters. Because the dictionary file has one word per line, splitting on newline characters returns a list value made up of each word in the dictionary file.

The for loop at the beginning of the line iterates over each word to store each one in a key. But we don’t need values associated with the keys since we’re using the dictionary data type, so we’ll just store the None value for each key.

None is a type of value that you can assign to a variable to represent the lack of a value. Whereas the Boolean data type has only two values, the NoneType has only one value, None. It’s always written without quotes and with a capital N.

For example, say you had a variable named quizAnswer, which holds a user’s answer to a true-false pop quiz question. If the user skips a question and doesn’t answer it, it makes the most sense to assign quizAnswer to None as a default value rather than to True or False. Otherwise, it might look like the user answered the question when they didn’t. Likewise, function calls that exit by reaching the end of the function and not from a return statement evaluate to None because they don’t return anything.

Line 17 uses the word that is being iterated over as a key in englishWords and stores None as a value for that key.

Returning the Dictionary Data

After the for loop finishes, the englishWords dictionary should have tens of thousands of keys in it. At this point, we close the file object because we’re done reading from it, and then return englishWords:

18. dictionaryFile.close()
19. return englishWords

Then we call loadDictionary() and store the dictionary value it returns in a variable named ENGLISH_WORDS:

21. ENGLISH_WORDS = loadDictionary()

We want to call loadDictionary() before the rest of the code in the detectEnglish module, but Python must execute the def statement for loadDictionary() before we can call the function. This is the reason the assignment for ENGLISH_WORDS comes after the loadDictionary() function’s code.

Counting the Number of English Words in message

Lines 24 through 27 of the program’s code define the getEnglishCount() function, which takes a string argument and returns a float value indicating the ratio of recognized English words to total words. We’ll represent the ratio as a value between 0.0 and 1.0. A value of 0.0 means none of the words in message are English words, and 1.0 means all of the words in message are English words. Most likely, getEnglishCount() will return a float value between 0.0 and 1.0. The isEnglish() function uses this return value to determine whether to evaluate as True or False.

24. def getEnglishCount(message):
25.     message = message.upper()
26.     message = removeNonLetters(message)
27.     possibleWords = message.split()

To code this function, first we create a list of individual word strings from the string in message. Line 25 converts the string to uppercase letters. Then line 26 removes the non-letter characters from the string, such as numbers and punctuation, by calling removeNonLetters(). (You’ll see how this function works later.) Finally, the split() method on line 27 splits the string into individual words and stores them in a variable named possibleWords.

For example, if the string 'Hello there. How are you?' is passed after calling getEnglishCount(), the value stored in possibleWords after lines 25 to 27 execute would be ['HELLO', 'THERE', 'HOW', 'ARE', 'YOU'].

If the string in message is made up of integers, such as '12345', the call to removeNonLetters() would return a blank string, which split() would be called on to return an empty list. In the program, an empty list is the equivalent of zero words being English, which could cause a divide-by-zero error.

Divide-by-Zero Errors

To return a float value between 0.0 and 1.0, we divide the number of words in possibleWords recognized as English by the total number of words in possibleWords. Although this is mostly straightforward, we need to make sure possibleWords is not an empty list. If possibleWords is empty, it means the total number of words in possibleWords is 0.

Because in mathematics dividing by zero has no meaning, dividing by zero in Python results in a divide-by-zero error. To see an example of this error, enter the following into the interactive shell:

>>> 42 / 0
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
42 / 0
ZeroDivisionError: division by zero

You can see that 42 divided by 0 results in a ZeroDivisionError and a message explaining what went wrong. To avoid a divide-by-zero error, we’ll need to make sure the possibleWords list is never empty.

Line 29 checks whether possibleWords is an empty list, and line 30 returns 0.0 if no words are in the list.

29. if possibleWords == []:
30. return 0.0 # No words at all, so return 0.0

This check is necessary to avoid a divide-by-zero error.

Counting the English Word Matches

To produce the ratio of English words to total words, we’ll divide the number of words in possibleWords that are recognized as English by the total number of words in possibleWords. To do this, we need to count the number of recognized English words in possibleWords. Line 32 sets the variable matches to 0. Line 33 uses the for loop to iterate over each word in possibleWords and check whether the word exists in the ENGLISH_WORDS dictionary. If the word exists in the dictionary, the value in matches is incremented on line 35.

32.     matches = 0
33.     for word in possibleWords:
34.         if word in ENGLISH_WORDS:
35.             matches += 1

After the for loop is complete, the number of English words in the string is stored in the matches variable. Remember that we’re relying on the dictionary file to be accurate and complete for the detectEnglish module to work correctly. If a word isn’t in the dictionary text file, it won’t be counted as English, even if it’s a real word. Conversely, if a word is misspelled in the dictionary, words that aren’t English might accidentally be counted as real words.

Right now, the number of the words in possibleWords that are recognized as English and the total number of words in possibleWords are represented by integers. To return a float value between 0.0 and 1.0 by dividing these two integers, we’ll need to change one or the other into a float.

The float(), int(), and str() Functions and Integer Division

Let’s look at how to change an integer into a float because the two values we’ll need to divide to find the ratio are both integers. Python 3 always does regular division regardless of the value type, whereas Python 2 performs integer division when both values in the division operation are integers. Because users might use Python 2 to import detectEnglish.py, we’ll need to pass at least one integer variable to float() to make sure a float is returned when doing division. Doing so ensures that regular division will be performed no matter which version of Python is used. This is an example of making the code backward compatible with previous versions.

Although we won’t use them in this program, let’s review some other functions that convert values into other data types. The int() function returns an integer version of its argument, and the str() function returns a string. To see how these functions work, enter the following into the interactive shell:

>>> float(42)
42.0
>>> int(42.0)
42
>>> int(42.7)
42
>>> int('42')
42
>>> str(42)
'42'
>>> str(42.7)
'42.7'

You can see that the float() function changes the integer 42 into a float value. The int() function can turn the floats 42.0 and 42.7 into integers by truncating their decimal values, or it can turn a string value '42' into an integer. The str() function changes numerical values into string values. These functions are helpful if you need a value’s equivalent to be a different data type.

Finding the Ratio of English Words in the Message

To find the ratio of English words to total words, we divide the number of matches we found by the total number of possibleWords. Line 36 uses the / operator to divide these two numbers:

36. return float(matches) / len(possibleWords)

After we pass the integer matches to the float() function, it returns a float version of that number, which we divide by the length of the possibleWords list.

The only way return float(matches) / len(possibleWords) would lead to a divide-by-zero error is if len(possibleWords) evaluated to 0. The only way that would be possible is if possibleWords were an empty list. However, lines 29 and 30 specifically check for this case and return 0.0 if the list is empty. If possibleWords were set to the empty list, the program execution would never get past line 30, so we can be confident that line 36 won’t cause a ZeroDivisionError.

Removing Non-Letter Characters

Certain characters, such as numbers or punctuation marks, will cause our word detection to fail because words won’t look exactly as they’re spelled in our dictionary file. For example, if the last word in message is 'you.' and we didn’t remove the period at the end of the string, it wouldn’t be counted as an English word because 'you' wouldn’t be spelled with a period in the dictionary file. To avoid such misinterpretation, numbers and punctuation marks need to be removed.

The previously explained getEnglishCount() function calls the function removeNonLetters() on a string to remove any numbers and punctuation characters from it.

39. def removeNonLetters(message):
40.     lettersOnly = []
41.     for symbol in message:
42.         if symbol in LETTERS_AND_SPACE:
43.             lettersOnly.append(symbol)

Line 40 creates a blank list called lettersOnly, and line 41 uses a for loop to loop over each character in the message argument. Next, the for loop checks whether the character exists in the string LETTERS_AND_SPACE. If the character is a number or punctuation mark, it won’t exist in the LETTERS_AND_SPACE string and won’t be added to the list. If the character does exist in the string, it’s added to the end of the list using the append() method, which we’ll look at next.

The append() List Method

When we add a value to the end of a list, we say we’re appending the value to the list. This is done with lists so frequently in Python that there is an append() list method that takes a single argument to append to the end of the list. Enter the following into the interactive shell:

>>> eggs = []
>>> eggs.append('hovercraft')
>>> eggs
['hovercraft']
>>> eggs.append('eels')
>>> eggs
['hovercraft', 'eels']

After creating an empty list named eggs, we can enter eggs.append ('hovercraft') to add the string value 'hovercraft' to this list. Then when we enter eggs, it returns the only value stored in this list, which is 'hovercraft'. If you use append() again to add 'eels' to the end of the list, eggs now returns 'hovercraft' followed by 'eels'. Similarly, we can use the append() list method to add items to the lettersOnly list we created in our code earlier. This is what lettersOnly.append(symbol) on line 43 does in the for loop.

Creating a String of Letters

After finishing the for loop, lettersOnly should be a list of each letter and space character from the original message string. Because a list of one-character strings isn’t useful for finding English words, line 44 joins the character strings in the lettersOnly list into one string and returns it:

44. return ''.join(lettersOnly)

To concatenate the list elements in lettersOnly into one large string, we call the join() string method on a blank string ''. This joins the strings in lettersOnly with a blank string between them. This string value is then returned as the removeNonLetters() function’s return value.

Detecting English Words

When a message is decrypted with the wrong key, it will often produce far more non-letter and non-space characters than are found in a typical English message. Also, the words it produces will often be random and not found in a dictionary of English words. The isEnglish() function can check for both of these issues in a given string.

47. def isEnglish(message, wordPercentage=20, letterPercentage=85):
48.     # By default, 20% of the words must exist in the dictionary file, and
49.     # 85% of all the characters in the message must be letters or spaces
50.     # (not punctuation or numbers).

Line 47 sets up the isEnglish() function to accept a string argument and return a Boolean value of True when the string is English text and False when it’s not. This function has three parameters: message, wordPercentage=20, and letterPercentage=85. The first parameter contains the string to be checked, and the second and third parameters set default percentages for words and letters, which the string must contain in order to be confirmed as English. (A percentage is a number between 0 and 100 that shows how much of something is proportional to the total number of those things.) We’ll explore how to use default arguments and calculate percentages in the following sections.

Using Default Arguments

Sometimes a function will almost always have the same values passed to it when called. Instead of including these for every function call, you can specify a default argument in the function’s def statement.

Line 47’s def statement has three parameters, with default arguments of 20 and 85 provided for wordPercentage and letterPercentage, respectively. The isEnglish() function can be called with one to three arguments. If no arguments are passed for wordPercentage or letterPercentage, then the values assigned to these parameters will be their default arguments.

The default arguments define what percent of the message string needs to be made up of real English words for isEnglish() to determine that message is an English string and what percent of the message needs to be made up of letters or spaces instead of numbers or punctuation marks. For example, if isEnglish() is called with only one argument, the default arguments are used for the wordPercentage (the integer 20) and letterPercentage (the integer 85) parameters, which means 20 percent of the string needs to be made up of English words and 85 percent of the string needs to be made up of letters. These percentages work for detecting English in most cases, but you might want to try other argument combinations in specific cases when isEnglish() needs looser or more restrictive thresholds. In those situations, a program can just pass arguments for wordPercentage and letterPercentage instead of using the default arguments. Table 11-1 shows function calls to isEnglish() and what they’re equivalent to.

Table 11-1: Function Calls with and without Default Arguments

Function call	Equivalent to
isEnglish('Hello')	isEnglish('Hello', 20, 85)
isEnglish('Hello', 50)	isEnglish('Hello', 50, 85)
isEnglish('Hello', 50, 60)	isEnglish('Hello', 50, 60)
isEnglish('Hello', letterPercentage=60)	isEnglish('Hello', 20, 60)

For instance, the third example in Table 11-1 shows that when the function is called with the second and third parameters specified, the program will use those arguments, not the default arguments.

Calculating Percentages

Once we know the percentages our program will use, we’ll need to calculate the percentages for the message string. For example, the string value 'Hello cat MOOSE fsdkl ewpin' has five “words,” but only three are English. To calculate the percentage of English words in this string, you divide the number of English words by the total number of words and multiply the result by 100. The percentage of English words in 'Hello cat MOOSE fsdkl ewpin' is 3 / 5 * 100, which is 60 percent. Table 11-2 shows a few examples of calculated percentages.

Table 11-2: Calculating Percentages of English Words

Number of English words	Total number of words	Ratio of English words	× 100	=	Percentage
3	5	0.6	× 100	=	60
6	10	0.6	× 100	=	60
300	500	0.6	× 100	=	60
32	87	0.3678	× 100	=	36.78
87	87	1.0	× 100	=	100
0	10	0	× 100	=	0

The percentage will always be between 0 percent (meaning no words are English) and 100 percent (meaning all of the words are English). Our isEnglish() function will consider a string English if at least 20 percent of the words exist in the dictionary file and 85 percent of the characters in the string are letters or spaces. This means the message will still be detected as English even if the dictionary file isn’t perfect or if some words in the message are something other than what we define as English words.

Line 51 calculates the percentage of recognized English words in message by passing message to getEnglishCount(), which does the division and returns a float between 0.0 and 1.0:

51. wordsMatch = getEnglishCount(message) * 100 >= wordPercentage

To get a percentage from this float, multiply it by 100. If the resulting number is greater than or equal to the wordPercentage parameter, True is stored in wordsMatch. (Recall that the >= comparison operator evaluates expressions to a Boolean value.) Otherwise, False is stored in wordsMatch.

Lines 52 to 54 calculate the percentage of letter characters in the message string by dividing the number of letter characters by the total number of characters in message.

52.     numLetters = len(removeNonLetters(message))
53.     messageLettersPercentage = float(numLetters) / len(message) * 100
54.     lettersMatch = messageLettersPercentage >= letterPercentage

Earlier in the code, we wrote the removeNonLetters() function to find all the letter and space characters in a string, so we can just reuse it. Line 52 calls removeNonLetters(message) to get a string of just the letter and space characters in message. Passing this string to len() should return the total number of letter and space characters in message, which we store as an integer in the numLetters variable.

Line 53 determines the percentage of letters by getting a float version of the integer in numLetters and dividing it by len(message). The return value of len(message) will be the total number of characters in message. As discussed previously, the call to float() is made to make sure that line 53 performs regular division instead of integer division just in case the programmer who imports the detectEnglish module is running Python 2.

Line 54 checks whether the percentage in messageLettersPercentage is greater than or equal to the letterPercentage parameter. This expression evaluates to a Boolean value that is stored in lettersMatch.

We want isEnglish() to return True only if both the wordsMatch and lettersMatch variables contain True. Line 55 combines these values into an expression using the and operator:

55. return wordsMatch and lettersMatch

If both the wordsMatch and lettersMatch variables are True, isEnglish() will declare the message is English and return True. Otherwise, isEnglish() will return False.

Summary

The transposition file cipher is an improvement over the Caesar cipher because it can have hundreds or thousands of possible keys for messages instead of just 26 different keys. Even though a computer has no problem decrypting a message with thousands of potential keys, we need to write code that can determine whether a decrypted string is valid English and therefore the original message.

In this chapter, we created an English-detecting program using a dictionary text file to create a dictionary data type. The dictionary data type is useful because it can contain multiple values just as a list does. However, unlike with a list, you can index values in a dictionary using string values as keys instead of only integers. Most of the tasks you can do with a list you can also do with a dictionary, such as passing it to len() or using the in and not in operators on it. However, the in operator executes on a very large dictionary value much faster than on a very large list. This proved particularly useful for us because our dictionary data contained thousands of values that we needed to sift through quickly.

This chapter also introduced the split() method, which can split strings into a list of strings, and the NoneType data type, which has only one value: None. This value is useful for representing a lack of a value.

You learned how to avoid divide-by-zero errors when using the / operator; convert values into other data types using the int(), float(), and str() functions; and use the append() list method to add a value to the end of a list.

When you define functions, you can give some of the parameters default arguments. If no argument is passed for these parameters when the function is called, the program uses the default argument value, which can be a useful shortcut in your programs. In Chapter 12, you’ll learn to hack the transposition cipher using the English detection code!

PRACTICE QUESTIONS

Answers to the practice questions can be found on the book’s website at https://www.nostarch.com/crackingcodes/.

What does the following code print?

spam = {'name': 'Al'}
print(spam['name'])
What does this code print?

spam = {'eggs': 'bacon'}
print('bacon' in spam)
What for loop code would print the values in the following spam dictionary?

spam = {'name': 'Zophie', 'species':'cat', 'age':8}
What does the following line print?

print('Hello, world!'.split())
What will the following code print?

def spam(eggs=42):
print(eggs)
spam()
spam('Hello')
What percentage of words in this sentence are valid English words?

"Whether it's flobulllar in the mind to quarfalog the slings and
arrows of outrageous guuuuuuuuur."

11DETECTING ENGLISH PROGRAMMATICALLY