Detecting English Programmatically

Topics Covered In This Chapter:

·         Dictionaries

·         The split() Method

·         The None Value

·         "Divide by Zero" Errors

·         The float(), int(), and str() Functions and Python 2 Division

·         The append() List Method

·         Default Arguments

·         Calculating Percentage

The gaffer says something longer and more complicated. After a while, Waterhouse (now wearing his cryptoanalyst hat, searching for meaning midst apparent randomness, his neural circuits exploiting the redundancies in the signal) realizes that the man is speaking heavily accented English.

“Cryptonomicon” by Neal Stephenson

A message encrypted with the transposition cipher can have thousands of possible keys. Your computer can still easily brute-force this many keys, but you would then have to look through thousands of decryptions to find the one correct plaintext. This is a big problem for the brute-force method of cracking the transposition cipher.

When the computer decrypts a message with the wrong key, the resulting plaintext is garbage text. We need to program the computer to be able to recognize if the plaintext is garbage text or English text. That way, if the computer decrypts with the wrong key, it knows to go on and try the next possible key. And when the computer tries a key that decrypts to English text, it can stop and bring that key to the attention of the cryptanalyst. Now the cryptanalyst won’t have to look through thousands of incorrect decryptions.

How Can a Computer Understand English?

It can’t. At least, not in the way that human beings like you or me understand English. Computers don’t really understand math, chess, or lethal military androids either, any more than a clock understands lunchtime. Computers just execute instructions one after another. But these instructions can mimic very complicated behaviors that solve math problems, win at chess, or hunt down the future leaders of the human resistance.

Ideally, what we need is a Python function (let’s call it isEnglish()) that has a string passed to it and then returns True if the string is English text and False if it’s random gibberish. Let’s take a look at some English text and some garbage text and try to see what patterns the two have:

Robots are your friends. Except for RX-686. She will try to eat you.

 

ai-pey  e. xrx ne augur iirl6 Rtiyt fhubE6d hrSei t8..ow eo.telyoosEs  t

One thing we can notice is that the English text is made up of words that you could find in a dictionary, but the garbage text is made up of words that you won’t. Splitting up the string into individual words is easy. There is already a Python string method named split() that will do this for us (this method will be explained later). The split() method just sees when each word begins or ends by looking for the space characters. Once we have the individual words, we can test to see if each word is a word in the dictionary with code like this:

if word == 'aardvark' or word == 'abacus' or word == 'abandon' or word == 'abandoned' or word == 'abbreviate' or word == 'abbreviation' or word == 'abdomen' or …

We can write code like that, but we probably shouldn’t. The computer won’t mind running through all this code, but you wouldn’t want to type it all out. Besides, somebody else has already typed out a text file full of nearly all English words. These text files are called dictionary files. So we just need to write a function that checks if the words in the string exist somewhere in that file.

Remember, a dictionary file is a text file that contains a large list of English words. A dictionary value is a Python value that has key-value pairs.

Not every word will exist in our “dictionary file”. Maybe the dictionary file is incomplete and doesn’t have the word, say, “aardvark”. There are also perfectly good decryptions that might have non-English words in them, such as “RX-686” in our above English sentence. (Or maybe the plaintext is in a different language besides English. But we’ll just assume it is in English for now.)

And garbage text might just happen to have an English word or two in it by coincidence. For example, it turns out the word “augur” means a person who tries to predict the future by studying the way birds are flying. Seriously.

So our function will not be foolproof. But if most of the words in the string argument are English words, it is a good bet that the string is English text. It is very unlikely that a ciphertext will decrypt to English if decrypted with the wrong key.

The dictionary text file will have one word per line in uppercase. It will look like this:

AARHUS

AARON

ABABA

ABACK

ABAFT

ABANDON

ABANDONED

ABANDONING

ABANDONMENT

ABANDONS

…and so on. You can download this entire file (which has over 45,000 words) from http://invpy.com/dictionary.txt.

Our isEnglish() function will have to split up a decrypted string into words, check if each word is in a file full of thousands of English words, and if a certain amount of the words are English words, then we will say that the text is in English. And if the text is in English, then there’s a good bet that we have decrypted the ciphertext with the correct key.

And that is how the computer can understand if a string is English or if it is gibberish.
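The plan above can be sketched in a few lines. This is only a rough illustration and not the detectEnglish.py module built later in this chapter: the tiny KNOWN_WORDS set, the looksLikeEnglish() name, and the 0.5 threshold are all made up for the example, and the sketch assumes Python 3 division. A real program would load its words from the dictionary file.

```python
# A toy stand-in for the dictionary file (hypothetical, for illustration only).
KNOWN_WORDS = {'ROBOTS', 'ARE', 'YOUR', 'FRIENDS', 'EXCEPT', 'FOR', 'SHE',
               'WILL', 'TRY', 'TO', 'EAT', 'YOU'}

def looksLikeEnglish(text, threshold=0.5):
    # Split on whitespace, then strip common punctuation from the ends of each word.
    words = [w.strip('.,!?').upper() for w in text.split()]
    if not words:
        return False  # no words at all, so it cannot be English
    matches = sum(1 for w in words if w in KNOWN_WORDS)
    # True if the fraction of known words meets the (made-up) threshold.
    return matches / len(words) >= threshold

english = 'Robots are your friends. Except for RX-686. She will try to eat you.'
garbage = 'ai-pey  e. xrx ne augur iirl6 Rtiyt fhubE6d hrSei t8..ow eo.telyoosEs  t'
print(looksLikeEnglish(english))
print(looksLikeEnglish(garbage))
```

The first call finds that 12 of the 13 words are known (only “RX-686” is not), while the second finds no known words at all.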

Practice Exercises, Chapter 12, Set A

Practice exercises can be found at http://invpy.com/hackingpractice12A.

The Detect English Module

The detectEnglish.py program that we write in this chapter isn’t a program that runs by itself. Instead, it will be imported by our encryption programs so that they can call the detectEnglish.isEnglish() function. This is why we don’t give detectEnglish.py a main() function. The other functions in the program are all provided for isEnglish() to call.

Source Code for the Detect English Module

Open a new file editor window by clicking on File ► New Window. Type in the following code into the file editor, and then save it as detectEnglish.py. Press F5 to run the program.

Source code for detectEnglish.py

 1. # Detect English module

 2. # http://inventwithpython.com/hacking (BSD Licensed)

 3.

 4. # To use, type this code:

 5. #   import detectEnglish

 6. #   detectEnglish.isEnglish(someString) # returns True or False

 7. # (There must be a "dictionary.txt" file in this directory with all English

 8. # words in it, one word per line. You can download this from

 9. # http://invpy.com/dictionary.txt)

10. UPPERLETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

11. LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'

12.

13. def loadDictionary():

14.     dictionaryFile = open('dictionary.txt')

15.     englishWords = {}

16.     for word in dictionaryFile.read().split('\n'):

17.         englishWords[word] = None

18.     dictionaryFile.close()

19.     return englishWords

20.

21. ENGLISH_WORDS = loadDictionary()

22.

23.

24. def getEnglishCount(message):

25.     message = message.upper()

26.     message = removeNonLetters(message)

27.     possibleWords = message.split()

28.

29.     if possibleWords == []:

30.         return 0.0 # no words at all, so return 0.0

31.

32.     matches = 0

33.     for word in possibleWords:

34.         if word in ENGLISH_WORDS:

35.             matches += 1

36.     return float(matches) / len(possibleWords)

37.

38.

39. def removeNonLetters(message):

40.     lettersOnly = []

41.     for symbol in message:

42.         if symbol in LETTERS_AND_SPACE:

43.             lettersOnly.append(symbol)

44.     return ''.join(lettersOnly)

45.

46.

47. def isEnglish(message, wordPercentage=20, letterPercentage=85):

48.     # By default, 20% of the words must exist in the dictionary file, and

49.     # 85% of all the characters in the message must be letters or spaces

50.     # (not punctuation or numbers).

51.     wordsMatch = getEnglishCount(message) * 100 >= wordPercentage

52.     numLetters = len(removeNonLetters(message))

53.     messageLettersPercentage = float(numLetters) / len(message) * 100

54.     lettersMatch = messageLettersPercentage >= letterPercentage

55.     return wordsMatch and lettersMatch

How the Program Works

detectEnglish.py

 1. # Detect English module

 2. # http://inventwithpython.com/hacking (BSD Licensed)

 3.

 4. # To use, type this code:

 5. #   import detectEnglish

 6. #   detectEnglish.isEnglish(someString) # returns True or False

 7. # (There must be a "dictionary.txt" file in this directory with all English

 8. # words in it, one word per line. You can download this from

 9. # http://invpy.com/dictionary.txt)

These comments at the top of the file give instructions to programmers on how to use this module. They give the important reminder that if there is no file named dictionary.txt in the same directory as detectEnglish.py then this module will not work. If the user doesn’t have this file, the comments tell them they can download it from http://invpy.com/dictionary.txt.

detectEnglish.py

10. UPPERLETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

11. LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'

Lines 10 and 11 set up a few variables that are constants, which is why they have uppercase names. UPPERLETTERS is a variable containing the 26 uppercase letters, and LETTERS_AND_SPACE contains these letters (and the lowercase letters returned from UPPERLETTERS.lower()) as well as the space character, the tab character, and the newline character. The tab and newline characters are represented with the escape characters \t and \n.

detectEnglish.py

13. def loadDictionary():

14.     dictionaryFile = open('dictionary.txt')

The dictionary file sits on the user’s hard drive, but we need to load the text in this file as a string value so our Python code can use it. First, we get a file object by calling open() and passing the string of the filename 'dictionary.txt'. Before we continue with the loadDictionary() code, let’s learn about the dictionary data type.

Dictionaries and the Dictionary Data Type

The dictionary data type has values which can contain multiple other values, just like lists do. In list values, you use an integer index value to retrieve items in the list, like spam[42]. For each item in the dictionary value, there is a key used to retrieve it. (Values stored inside lists and dictionaries are also sometimes called items.) The key can be an integer or a string value, like spam['hello'] or spam[42]. Dictionaries let us organize our program’s data with even more flexibility than lists.

Instead of typing square brackets like list values, dictionary values (or simply, dictionaries) use curly braces. Try typing the following into the interactive shell:

>>> emptyList = []

>>> emptyDictionary = {}

>>> 

A dictionary’s values are typed out as key-value pairs, which are separated by colons. Multiple key-value pairs are separated by commas. To retrieve values from a dictionary, just use square brackets with the key in between them (just like indexing with lists). Try typing the following into the interactive shell:

>>> spam = {'key1':'This is a value', 'key2':42}

>>> spam['key1']

'This is a value'

>>> spam['key2']

42

>>> 

It is important to know that, just as with lists, variables do not store dictionary values themselves, but references to dictionaries. The example code below has two variables with references to the same dictionary:

>>> spam = {'hello': 42}

>>> eggs = spam

>>> eggs['hello'] = 99

>>> eggs

{'hello': 99}

>>> spam

{'hello': 99}

>>> 

Adding or Changing Items in a Dictionary

You can add or change values in a dictionary with indexes as well. Try typing the following into the interactive shell:

>>> spam = {42:'hello'}

>>> print(spam[42])

hello

>>> spam[42] = 'goodbye'

>>> print(spam[42])

goodbye

>>> 

And just like lists can contain other lists, dictionaries can also contain other dictionaries (or lists). Try typing the following into the interactive shell:

>>> foo = {'fizz': {'name': 'Al', 'age': 144}, 'moo':['a', 'brown', 'cow']}

>>> foo['fizz']

{'age': 144, 'name': 'Al'}

>>> foo['fizz']['name']

'Al'

>>> foo['moo']

['a', 'brown', 'cow']

>>> foo['moo'][1]

'brown'

>>> 

Practice Exercises, Chapter 12, Set B

Practice exercises can be found at http://invpy.com/hackingpractice12B.

Using the len() Function with Dictionaries

The len() function can tell you how many items are in a list or how many characters are in a string, but it can also tell you how many items are in a dictionary as well. Try typing the following into the interactive shell:

>>> spam = {}

>>> len(spam)

0

>>> spam['name'] = 'Al'

>>> spam['pet'] = 'Zophie the cat'

>>> spam['age'] = 89

>>> len(spam)

3

>>> 

Using the in Operator with Dictionaries

The in operator can also be used to see if a certain key exists in a dictionary. It is important to remember that the in operator checks if a key exists in the dictionary, not a value. Try typing the following into the interactive shell:

>>> eggs = {'foo': 'milk', 'bar': 'bread'}

>>> 'foo' in eggs

True

>>> 'blah blah blah' in eggs

False

>>> 'milk' in eggs

False

>>> 'bar' in eggs

True

>>> 'bread' in eggs

False

>>> 

The not in operator works with dictionary values as well.

Using for Loops with Dictionaries

You can also iterate over the keys in a dictionary with for loops, just like you can iterate over the items in a list. Try typing the following into the interactive shell:

>>> spam = {'name':'Al', 'age':99}

>>> for k in spam:

...   print(k)

...   print(spam[k])

...   print('==========')

...

age

99

==========

name

Al

==========

>>> 

Practice Exercises, Chapter 12, Set C

Practice exercises can be found at http://invpy.com/hackingpractice12C.

The Difference Between Dictionaries and Lists

Dictionaries are like lists in many ways, but there are a few important differences:

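1.      Dictionary items have no position numbers. You cannot ask a dictionary for its “first” or “last” item by index like you can with a list. (Versions of Python before 3.7 do not keep dictionary items in any particular order; Python 3.7 and later keep them in the order the keys were added.)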

2.      Dictionaries do not have concatenation with the + operator. If you want to add a new item, you can just use indexing with a new key. For example, foo['a new key'] = 'a string'

3.      Lists only have integer index values that range from 0 to the length of the list minus one. But dictionaries can have any key. If you have a dictionary stored in a variable spam, then you can store a value in spam[3] without needing values for spam[0], spam[1], or spam[2] first.
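The differences in points 2 and 3 can be seen in a short script. This is just a sketch; the variable names (listError, plusWorks) are made up for the example.

```python
spam = {}
spam[3] = 'three'                # no need for keys 0, 1, or 2 to exist first
spam['a new key'] = 'a string'   # adding an item is just indexing with a new key
print(spam)

eggs = []
try:
    eggs[3] = 'three'            # a list index must already exist
except IndexError as err:
    listError = str(err)
    print('List error:', listError)

try:
    combined = {'a': 1} + {'b': 2}   # dictionaries have no + concatenation
except TypeError:
    plusWorks = False
else:
    plusWorks = True
print('Can dictionaries be added with +?', plusWorks)
```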

Finding Items is Faster with Dictionaries Than Lists

detectEnglish.py

15.     englishWords = {}

In the loadDictionary() function, we will store all the words in the “dictionary file” (as in, a file that has all the words in an English dictionary book) in a dictionary value (as in, the Python data type.) The similar names are unfortunate, but they are two completely different things.

We could have also used a list to store the string values of each word from the dictionary file. The reason we use a dictionary is because the in operator works faster on dictionaries than lists. Imagine that we had the following list and dictionary values:

>>> listVal = ['spam', 'eggs', 'bacon']

>>> dictionaryVal = {'spam':0, 'eggs':0, 'bacon':0}

Python can evaluate the expression 'bacon' in dictionaryVal a little bit faster than 'bacon' in listVal. The reason is technical and you don’t need to know it for the purposes of this book (but you can read more about it at http://invpy.com/listvsdict). This faster speed doesn’t make that much of a difference for lists and dictionaries with only a few items in them like in the above example. But our detectEnglish module will have tens of thousands of items, and the expression word in ENGLISH_WORDS will be evaluated many times when the isEnglish() function is called. The speed difference really adds up for the detectEnglish module.
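If you would like to see the speed difference for yourself, the standard timeit module can measure it. The sketch below is not part of detectEnglish.py; it uses 50,000 made-up strings in place of real dictionary words and looks for the last one, which is the slowest case for the list. The exact times will differ from computer to computer.

```python
import timeit

# Made-up stand-ins for the words in a dictionary file.
words = [str(i) for i in range(50000)]
wordList = words
wordDict = {w: None for w in words}

target = '49999'  # the last item, so the list has to check every item before it

# Time 100 lookups in each data structure.
listTime = timeit.timeit(lambda: target in wordList, number=100)
dictTime = timeit.timeit(lambda: target in wordDict, number=100)

print('list:', listTime, 'seconds')
print('dict:', dictTime, 'seconds')
```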

The split() Method

The split() string method returns a list of several strings. When it is called with no arguments, the “split” between each string occurs wherever there is whitespace (one or more spaces, tabs, or newlines). For an example of how the split() string method works, try typing this into the shell:

>>> 'My very energetic mother    just served us Nutella.'.split()

['My', 'very', 'energetic', 'mother', 'just', 'served', 'us', 'Nutella.']

>>> 

The result is a list of eight strings, one string for each of the words in the original string. The spaces are dropped from the items in the list (even if there is more than one space). You can pass an optional argument to the split() method to tell it to split on a different string instead of whitespace. Try typing the following into the interactive shell:

>>> 'helloXXXworldXXXhowXXXareXXyou?'.split('XXX')

['hello', 'world', 'how', 'areXXyou?']

>>> 

 

 

detectEnglish.py

16.     for word in dictionaryFile.read().split('\n'):

Line 16 is a for loop that will set the word variable to each value in the list dictionaryFile.read().split('\n'). Let’s break this expression down. dictionaryFile is the variable that stores the file object of the opened file. The dictionaryFile.read() method call will read the entire file and return it as a very large string value. On this string, we will call the split() method and split on newline characters. This split() call will return a list value made up of each word in the dictionary file (because the dictionary file has one word per line.)

This is why the expression dictionaryFile.read().split('\n') will evaluate to a list of string values. Since the dictionary text file has one word on each line, the strings in the list that split() returns will each have one word.
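To try this expression without a real dictionary.txt, the sketch below uses io.StringIO from the standard library, which behaves like an opened text file. The three words and the fakeFile name are only for illustration.

```python
import io

# io.StringIO acts like a file object returned by open(), but reads from a string.
fakeFile = io.StringIO('AARHUS\nAARON\nABABA\n')

words = fakeFile.read().split('\n')
fakeFile.close()

print(words)  # ['AARHUS', 'AARON', 'ABABA', '']
```

Notice the blank string at the end of the list. It appears because the text ends with a newline character, so there is nothing after the last split. If a dictionary file ends the same way, the loop on line 16 would simply store one extra blank key.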

The None Value

None is a special value that you can assign to a variable. The None value represents the lack of a value. None is the only value of the data type NoneType. (Just like how the Boolean data type has only two values, the NoneType data type has only one value, None.) It can be very useful to use the None value when you need a value that means “does not exist”. The None value is always written without quotes and with a capital “N” and lowercase “one”.

For example, say you had a variable named quizAnswer which holds the user's answer to some True-False pop quiz question. You could set quizAnswer to None if the user skipped the question and did not answer it. Using None would be better because if you set it to True or False before assigning the value of the user's answer, it may look like the user gave an answer for the question even though they didn't.

Calls to functions that do not return anything (that is, they exit by reaching the end of the function and not from a return statement) will evaluate to None.
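Both uses of None can be seen in a short script. The sayHello() function and the status variable are made up for this example.

```python
def sayHello():
    print('Hello!')   # this function has no return statement

result = sayHello()
print(result is None)   # True
print(type(None))       # <class 'NoneType'>

quizAnswer = None       # the user has not answered the question yet
if quizAnswer is None:
    status = 'skipped'
else:
    status = 'answered'
print(status)
```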

detectEnglish.py

17.         englishWords[word] = None

In our program, we only use a dictionary for the englishWords variable so that the in operator can find keys in it. We don’t care what is stored for each key, so we will just use the None value. The for loop that starts on line 16 will iterate over each word in the dictionary file, and line 17 will use that word as a key in englishWords with None stored for that key.

Back to the Code

detectEnglish.py

18.     dictionaryFile.close()

19.     return englishWords

After the for loop finishes, the englishWords dictionary will have tens of thousands of keys in it. At this point, we close the file object since we are done reading from it and then return englishWords.

detectEnglish.py

21. ENGLISH_WORDS = loadDictionary()

Line 21 calls loadDictionary() and stores the dictionary value it returns in a variable named ENGLISH_WORDS. We want to call loadDictionary() before the rest of the code in the detectEnglish module, but Python has to execute the def statement for loadDictionary() before we can call the function. This is why the assignment for ENGLISH_WORDS comes after the loadDictionary() function’s code.

detectEnglish.py

24. def getEnglishCount(message):

25.     message = message.upper()

26.     message = removeNonLetters(message)

27.     possibleWords = message.split()

The getEnglishCount() function will take one string argument and return a float value indicating the proportion of recognized English words in it. The value 0.0 will mean none of the words in message are English words and 1.0 will mean all of the words in message are English words, but most likely getEnglishCount() will return something in between 0.0 and 1.0. The isEnglish() function will use this return value as part of whether it returns True or False.

First we must create a list of individual word strings from the string in message. Line 25 will convert it to uppercase letters. Then line 26 will remove the non-letter characters from the string, such as numbers and punctuation, by calling removeNonLetters(). (We will see how this function works later.) Finally, the split() method on line 27 will split up the string into individual words that are stored in a variable named possibleWords.

So if the string 'Hello there. How are you?' was passed when getEnglishCount() was called, the value stored in possibleWords after lines 25 to 27 execute would be ['HELLO', 'THERE', 'HOW', 'ARE', 'YOU'].
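The three steps can be followed one at a time. The sketch below copies LETTERS_AND_SPACE and removeNonLetters() from the listing so that it runs by itself; the step1, step2, and step3 names are made up to show each intermediate value.

```python
LETTERS_AND_SPACE = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' + 'abcdefghijklmnopqrstuvwxyz' + ' \t\n'

def removeNonLetters(message):
    # Same logic as lines 39 to 44 of detectEnglish.py.
    lettersOnly = []
    for symbol in message:
        if symbol in LETTERS_AND_SPACE:
            lettersOnly.append(symbol)
    return ''.join(lettersOnly)

message = 'Hello there. How are you?'
step1 = message.upper()          # line 25
step2 = removeNonLetters(step1)  # line 26
step3 = step2.split()            # line 27

print(step1)  # HELLO THERE. HOW ARE YOU?
print(step2)  # HELLO THERE HOW ARE YOU
print(step3)  # ['HELLO', 'THERE', 'HOW', 'ARE', 'YOU']
```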

detectEnglish.py

29.     if possibleWords == []:

30.         return 0.0 # no words at all, so return 0.0

If the string in message was something like '12345', all of these non-letter characters would have been taken out of the string returned from removeNonLetters(). The call to removeNonLetters() would return the blank string, and when split() is called on the blank string, it will return an empty list.

Line 29 does a special check for this case, and returns 0.0. This is done to avoid a “divide-by-zero” error (which is explained later on).

detectEnglish.py

32.     matches = 0

33.     for word in possibleWords:

34.         if word in ENGLISH_WORDS:

35.             matches += 1

The float value that is returned from getEnglishCount() ranges between 0.0 and 1.0. To produce this number, we will divide the number of the words in possibleWords that are recognized as English by the total number of words in possibleWords.

The first part of this is to count the number of recognized English words in possibleWords, which is done on lines 32 to 35. The matches variable starts off as 0. The for loop on line 33 will loop over each of the words in possibleWords, and checks if the word exists in the ENGLISH_WORDS dictionary. If it does, the value in matches is incremented on line 35.

Once the for loop has completed, the number of English words is stored in the matches variable. Note that technically this is only the number of words that are recognized as English because they existed in our dictionary text file. As far as the program is concerned, if the word exists in dictionary.txt, then it is a real English word. And if it doesn’t exist in the dictionary file, it is not an English word. We are relying on the dictionary file to be accurate and complete in order for the detectEnglish module to work correctly.

“Divide by Zero” Errors

detectEnglish.py

36.     return float(matches) / len(possibleWords)

Returning a float value between 0.0 and 1.0 is a simple matter of dividing the number of recognized words by the total number of words.

However, whenever we divide numbers using the / operator in Python, we should be careful not to cause a “divide-by-zero” error. In mathematics, dividing by zero has no meaning. If we try to get Python to do it, it will result in an error. Try typing the following into the interactive shell:

>>> 42 / 0

Traceback (most recent call last):

  File "<pyshell#0>", line 1, in <module>

    42 / 0

ZeroDivisionError: division by zero

>>> 

But a divide by zero can’t possibly happen on line 36. The only way it could is if len(possibleWords) evaluated to 0. And the only way that would be possible is if possibleWords were the empty list. However, our code on lines 29 and 30 specifically checks for this case and returns 0.0. So if possibleWords had been set to the empty list, the program execution would have never gotten past line 30 and line 36 would not cause a “divide-by-zero” error.
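The same “check first, then divide” pattern can be written as a small helper. The fractionOf() function below is hypothetical and is not part of detectEnglish.py; it only illustrates the guard that lines 29 and 30 provide.

```python
def fractionOf(part, whole):
    # Return part / whole as a float, or 0.0 when there is nothing to divide by.
    if whole == 0:
        return 0.0
    return float(part) / whole

print(fractionOf(3, 5))   # 0.6
print(fractionOf(0, 0))   # 0.0 instead of an error

# Without the guard, Python raises an error that we would have to handle:
try:
    42 / 0
except ZeroDivisionError:
    caught = True
    print('Caught a divide-by-zero error.')
```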

The float(), int(), and str() Functions and Integer Division

detectEnglish.py

36.     return float(matches) / len(possibleWords)

The value stored in matches is an integer. However, we pass this integer to the float() function which returns a float version of that number. Try typing the following into the interactive shell:

>>> float(42)

42.0

>>> 

The int() function returns an integer version of its argument, and the str() function returns a string. Try typing the following into the interactive shell:

>>> float(42)

42.0

>>> int(42.0)

42

>>> int(42.7)

42

>>> int("42")

42

>>> str(42)

'42'

>>> str(42.7)

'42.7'

>>> 

The float(), int(), and str() functions are helpful if you need a value’s equivalent in a different data type. But you might be wondering why we pass matches to float() on line 36 in the first place.

The reason is to make our detectEnglish module work with Python 2. Python 2 will do integer division when both values in the division operation are integers. This means that the result will be rounded down. So using Python 2, 22 / 7 will evaluate to 3. However, if one of the values is a float, Python 2 will do regular division: 22.0 / 7 will evaluate to 3.142857142857143. This is why line 36 calls float(). This is called making the code backwards compatible with previous versions.

Python 3 always does regular division no matter if the values are floats or ints.
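The short script below shows the Python 3 behavior. It also uses the // operator, which is not used in this chapter’s program: in Python 3, // is the operator that does integer division and rounds down, like Python 2’s / does with two integers.

```python
regular = 22 / 7            # regular division in Python 3
rounded = 22 // 7           # integer division, rounded down
withFloat = float(22) / 7   # regular division in both Python 2 and Python 3

print(regular)    # 3.142857142857143
print(rounded)    # 3
print(withFloat)  # 3.142857142857143
```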

Practice Exercises, Chapter 12, Set D

Practice exercises can be found at http://invpy.com/hackingpractice12D.

Back to the Code

detectEnglish.py

39. def removeNonLetters(message):

40.     lettersOnly = []

41.     for symbol in message:

The previously explained getEnglishCount() function calls the removeNonLetters() function to return a string that is the passed argument, except with all the numbers and punctuation characters removed.

The code in removeNonLetters() starts with a blank list and loops over each character in the message argument. If the character exists in the LETTERS_AND_SPACE string, then it is added to the end of the list. If the character is a number or punctuation mark, then it won’t exist in the LETTERS_AND_SPACE string and won’t be added to the list.


 

The append() List Method

detectEnglish.py

42.         if symbol in LETTERS_AND_SPACE:

43.             lettersOnly.append(symbol)

Line 42 checks if symbol (which is set to a single character on each iteration of line 41’s for loop) exists in the LETTERS_AND_SPACE string. If it does, then it is added to the end of the lettersOnly list with the append() list method.

If you want to add a single value to the end of a list, you could put the value in its own list and then use list concatenation to add it. Try typing the following into the interactive shell, where the value 42 is added to the end of the list stored in spam:

>>> spam = [2, 3, 5, 7, 9, 11]

>>> spam

[2, 3, 5, 7, 9, 11]

>>> spam = spam + [42]

>>> spam

[2, 3, 5, 7, 9, 11, 42]

>>> 

When we add a value to the end of a list, we say we are appending the value to the list. This is done with lists so frequently in Python that there is an append() list method which takes a single argument to append to the end of the list. Try typing the following into the shell:

>>> eggs = []

>>> eggs.append('hovercraft')

>>> eggs

['hovercraft']

>>> eggs.append('eels')

>>> eggs

['hovercraft', 'eels']

>>> eggs.append(42)

>>> eggs

['hovercraft', 'eels', 42]

>>> 

For technical reasons, using the append() method is faster than putting a value in a list and adding it with the + operator. The append() method modifies the list in-place to include the new value. You should always prefer the append() method for adding values to the end of a list.
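One way to see the “in-place” difference is to keep a second variable that refers to the same list. In the sketch below (the variable names are made up), append() changes the one list that both variables refer to, while the + operator builds a brand new list and leaves the original untouched.

```python
spam = [2, 3, 5]
alias = spam                  # alias and spam refer to the same list
spam.append(7)                # modifies that list in place
print(alias)                  # [2, 3, 5, 7]
print(alias is spam)          # True

eggs = [2, 3, 5]
other = eggs                  # other and eggs refer to the same list
eggs = eggs + [7]             # creates a new list and stores it in eggs
print(other)                  # [2, 3, 5]
print(other is eggs)          # False
```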

detectEnglish.py

44.     return ''.join(lettersOnly)

After line 41’s for loop is done, only the letter and space characters are in the lettersOnly list. To make a single string value from this list of strings, we call the join() string method on a blank string. This will join the strings in lettersOnly together with a blank string (that is, nothing) between them. This string value is then returned as removeNonLetters()’s return value.
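Here is a small example of join() on its own, using made-up values. It also shows that split() can undo a join() that used a space as the separator.

```python
letters = ['H', 'e', 'l', 'l', 'o']
word = ''.join(letters)                      # nothing between the strings
print(word)                                  # Hello

sentence = ' '.join(['How', 'are', 'you'])   # one space between the strings
print(sentence)                              # How are you

roundTrip = sentence.split()                 # back to a list of words
print(roundTrip)                             # ['How', 'are', 'you']
```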

Default Arguments

detectEnglish.py

47. def isEnglish(message, wordPercentage=20, letterPercentage=85):

48.     # By default, 20% of the words must exist in the dictionary file, and

49.     # 85% of all the characters in the message must be letters or spaces

50.     # (not punctuation or numbers).

The isEnglish() function will accept a string argument and return a Boolean value that indicates whether or not it is English text. But when you look at line 47, you can see it has three parameters. The second and third parameters (wordPercentage and letterPercentage) have equal signs and values next to them. These are called default arguments. Parameters that have default arguments are optional. If the function call does not pass an argument for these parameters, the default argument is used instead.

If isEnglish() is called with only one argument, the default arguments are used for the wordPercentage (the integer 20) and letterPercentage (the integer 85) parameters. Table 12-1 shows function calls to isEnglish(), and what they are equivalent to:

Table 12-1. Function calls with and without default arguments.

Function Call                              Equivalent To

isEnglish('Hello')                         isEnglish('Hello', 20, 85)

isEnglish('Hello', 50)                     isEnglish('Hello', 50, 85)

isEnglish('Hello', 50, 60)                 isEnglish('Hello', 50, 60)

isEnglish('Hello', letterPercentage=60)    isEnglish('Hello', 20, 60)

 

When isEnglish() is called with no second and third argument, the function will require that 20% of the words in message are English words that exist in the dictionary text file and 85% of the characters in message are letters. These percentages work for detecting English in most cases. But sometimes a program calling isEnglish() will want looser or more restrictive thresholds. If so, a program can just pass arguments for wordPercentage and letterPercentage instead of using the default arguments.
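You can experiment with default arguments without the dictionary file by using a stand-in function. The showThresholds() function below is hypothetical: it has the same parameters as isEnglish(), but it only returns the values its parameters ended up with.

```python
def showThresholds(message, wordPercentage=20, letterPercentage=85):
    # Return the parameter values so we can see which defaults were used.
    return (message, wordPercentage, letterPercentage)

print(showThresholds('Hello'))                       # ('Hello', 20, 85)
print(showThresholds('Hello', 50))                   # ('Hello', 50, 85)
print(showThresholds('Hello', 50, 60))               # ('Hello', 50, 60)
print(showThresholds('Hello', letterPercentage=60))  # ('Hello', 20, 60)
```

The last call names the parameter it wants to set, which lets it skip wordPercentage and leave that parameter at its default.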

Calculating Percentage

A percentage is a number between 0 and 100 that shows how much of something there is proportional to the total number of those things. In the string value 'Hello cat MOOSE fsdkl ewpin' there are five “words” but only three of them are English words. To calculate the percentage of English words, you divide the number of English words by the total number of words and multiply by 100. The percentage of English words in 'Hello cat MOOSE fsdkl ewpin' is 3 / 5 * 100, which is 60.

Table 12-2 shows some percentage calculations:

Table 12-2. Some percentage calculations.

Number of English Words    Total Number of Words    English Words / Total    * 100    =    Percentage

3                          5                        0.6                      * 100    =    60

6                          10                       0.6                      * 100    =    60

300                        500                      0.6                      * 100    =    60

32                         87                       0.3678                   * 100    =    36.78

87                         87                       1.0                      * 100    =    100

0                          10                       0                        * 100    =    0

 

The percentage will always be between 0% (meaning none of the words) and 100% (meaning all of the words). Our isEnglish() function will consider a string to be English if at least 20% of the words are English words that exist in the dictionary file and 85% of the characters in the string are letters (or spaces).
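The percentage calculation can be checked with a short script. The percentage() function and the rows list are made up for this example, and the results are rounded to two decimal places for display.

```python
def percentage(part, total):
    # Divide, then multiply by 100. float() keeps the division regular in Python 2.
    return float(part) / total * 100

# (number of English words, total number of words)
rows = [(3, 5), (6, 10), (300, 500), (32, 87), (87, 87), (0, 10)]

for englishWords, totalWords in rows:
    print(englishWords, totalWords, round(percentage(englishWords, totalWords), 2))
```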

detectEnglish.py

51.     wordsMatch = getEnglishCount(message) * 100 >= wordPercentage

Line 51 calculates the percentage of recognized English words in message by passing message to getEnglishCount(), which does the division for us and returns a float between 0.0 and 1.0. To get a percentage from this float, we just have to multiply it by 100. If this number is greater than or equal to the wordPercentage parameter, then True is stored in wordsMatch. (Remember, the >= comparison operator evaluates expressions to a Boolean value.) Otherwise, False is stored in wordsMatch.

detectEnglish.py

52.     numLetters = len(removeNonLetters(message))

53.     messageLettersPercentage = float(numLetters) / len(message) * 100

54.     lettersMatch = messageLettersPercentage >= letterPercentage

Lines 52 to 54 calculate the percentage of letter characters in the message string. To determine the percentage of letter (and space) characters in message, our code must divide the number of letter characters by the total number of characters in message. Line 52 calls removeNonLetters(message). This call will return a string that has the number and punctuation characters removed from the string. Passing this string to len() will return the number of letter and space characters that were in message. This integer is stored in the numLetters variable.

Line 53 determines the percentage of letters by getting a float version of the integer in numLetters and dividing this by len(message). The return value of len(message) will be the total number of characters in message. (The call to float() was made so that if the programmer who imports our detectEnglish module is running Python 2, the division done on line 53 will always be regular division instead of integer division.)

Line 54 checks if the percentage in messageLettersPercentage is greater than or equal to the letterPercentage parameter. This expression evaluates to a Boolean value that is stored in lettersMatch.

detectEnglish.py

55.     return wordsMatch and lettersMatch

We want isEnglish() to return True only if both the wordsMatch and lettersMatch variables contain True, so we put them in an expression with the and operator. If both the wordsMatch and lettersMatch variables are True, then isEnglish() will declare that the message argument is English and return True. Otherwise, isEnglish() will return False.
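One case is worth knowing about: if message is the blank string, len(message) on line 53 is 0 and the division raises a “divide-by-zero” error. The sketch below shows one possible guard. It is not part of the chapter’s detectEnglish.py: the isEnglishSafe() name is made up, returning False for a blank message is an assumption, and a tiny ENGLISH_WORDS dictionary stands in for the real dictionary file so the example runs by itself.

```python
LETTERS_AND_SPACE = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' + 'abcdefghijklmnopqrstuvwxyz' + ' \t\n'

# Toy stand-in for the words loaded from dictionary.txt (illustration only).
ENGLISH_WORDS = {'IS': None, 'THIS': None, 'SENTENCE': None,
                 'ENGLISH': None, 'TEXT': None}

def removeNonLetters(message):
    lettersOnly = []
    for symbol in message:
        if symbol in LETTERS_AND_SPACE:
            lettersOnly.append(symbol)
    return ''.join(lettersOnly)

def getEnglishCount(message):
    possibleWords = removeNonLetters(message.upper()).split()
    if possibleWords == []:
        return 0.0  # no words at all, so return 0.0
    matches = 0
    for word in possibleWords:
        if word in ENGLISH_WORDS:
            matches += 1
    return float(matches) / len(possibleWords)

def isEnglishSafe(message, wordPercentage=20, letterPercentage=85):
    # Assumption: a blank message is treated as "not English".
    if len(message) == 0:
        return False
    wordsMatch = getEnglishCount(message) * 100 >= wordPercentage
    numLetters = len(removeNonLetters(message))
    messageLettersPercentage = float(numLetters) / len(message) * 100
    lettersMatch = messageLettersPercentage >= letterPercentage
    return wordsMatch and lettersMatch

print(isEnglishSafe(''))                                # False
print(isEnglishSafe('12345'))                           # False
print(isEnglishSafe('Is this sentence English text?'))  # True
```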

Practice Exercises, Chapter 12, Set E

Practice exercises can be found at http://invpy.com/hackingpractice12E.

Summary

The dictionary data type is useful because, like a list, it can contain multiple values. However, unlike a list, we can index values in it with string values instead of only integers. Most of the things we can do with lists we can also do with dictionaries, such as passing them to len() or using the in and not in operators on them. In fact, using the in operator on a very large dictionary value executes much faster than using in on a very large list.

The NoneType data type is also a new data type introduced in this chapter. It only has one value: None. This value is very useful for representing a lack of a value.

We can convert values to other data types by using the int(), float(), and str() functions. This chapter brings up “divide-by-zero” errors, which we need to add code to check for and avoid. The split() string method can convert a single string value into a list value of many strings. The split() string method is sort of the reverse of the join() string method. The append() list method adds a value to the end of the list.

When we define functions, we can give some of the parameters “default arguments”. If no argument is passed for these parameters when the function is called, the default argument value is used instead. This can be a useful shortcut in our programs.

The transposition cipher is an improvement over the Caesar cipher because it can have hundreds or thousands of possible keys for messages instead of just 26 different keys. A computer has no problem decrypting a message with thousands of different keys, but to hack this cipher, we need to write code that can determine if a string value is valid English or not.

Since this code will probably be useful in our other hacking programs, we will put it in its own module so it can be imported by any program that wants to call its isEnglish() function. All of the work we’ve done in this chapter is so that any program can do the following:

>>> import detectEnglish

>>> detectEnglish.isEnglish('Is this sentence English text?')

True

>>> 

Now armed with code that can detect English, let’s move on to the next chapter and hack the transposition cipher!