9 TEXT PATTERN MATCHING WITH REGULAR EXPRESSIONS

Most programming languages implement regular expressions, or regexes, because they make it easy to locate particular patterns of text. An understanding of Python’s regexes can prepare you for learning regexes in any programming language and in many word processor applications as well, so digging into this topic is a worthy investment.

LEARNING OBJECTIVES

  • Master the basics of regular expression syntax in the Python programming language.
  • Know how to use qualifiers to describe what characters to match.
  • Know how to use quantifiers to describe the number of characters to match.
  • Be able to resolve the ambiguity between greedy and non-greedy matching using the question mark (?) syntax.
  • Understand how to pass flags such as re.IGNORECASE to the re.compile() function to do case-insensitive matching.
  • Be able to use verbose mode to write larger regexes across multiple lines.
  • Know how to write human-readable regular expressions using the Humre module.

Practice Questions

These questions test your understanding of the particular style of regex that Python uses in its re module.

The Syntax of Regular Expressions

Regular expressions allow you to specify a pattern of text to search for. For example, the characters \d in a regex stand for a decimal numeral between 0 and 9, and adding a numeral, such as 3, in curly brackets ({3}) after a pattern is like saying, “Match this pattern three times.” Further, parentheses can create groups in the regex string that let you grab different portions of the matched text.
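As a minimal sketch of these ideas, the following compiles a pattern, searches a string, and reads the groups from the resulting Match object. The sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text; the date format is an assumption for illustration.
text = 'The order shipped on 2024-03-15.'

# \d{4} matches exactly four digits; each pair of parentheses creates a group.
date_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
match = date_pattern.search(text)

whole_match = match.group()    # the entire matched text
year = match.group(1)          # the text matched by the first group
month_day = (match.group(2), match.group(3))

print(whole_match, year, month_day)
```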

  1. What is the difference between the re.compile() function and the search() method?

  2. How many groups are in the regex (\\d{3})-(\\d{3})-(\\d{4})?

  3. What about in the regex (\\d{3})-(\\d{3}-(\\d{4}))?

  4. Rewrite this regex using a raw string: \\(\\d{3}\\)-(\\d{3})-(\\d{4}).

  5. List four characters that have special meaning in regex strings and must be escaped if you want to literally match them.

  6. Write a regex that uses the alternation syntax to match the word clutter, clue, or club.

  7. Which of the following strings does the regex (A|B)(A|B) match: A, B, AA, AB, BA, or BB?

  8. What is the main difference between the search() method and the findall() method?

  9. If findall() were called on a Pattern object of the regex r'\d{3}-\d{3}-\d{4}', which could it possibly return: ['415-555-9999'] or [('415', '555', '9999')]?

  10. If findall() were called on a Pattern object of the regex r'(\d{3})-(\d{3})-(\d{4})', which could it possibly return: ['415-555-9999'] or [('415', '555', '9999')]?

Qualifier Syntax: What Characters to Match

The qualifiers of a regular expression dictate what characters you’re trying to match. You can specify these using character classes, shorthand character classes, and characters with special meaning in regular expressions. Test your understanding of qualifier syntax.
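To make the three kinds of qualifiers concrete, here is a small sketch that runs a character class and two shorthand character classes over the same string. The sample string is invented for this example.

```python
import re

sample = 'Room 42, Floor B'  # made-up sample text

# A character class in square brackets matches any one of the listed characters.
vowels = re.findall(r'[aeiou]', sample)

# \d is the shorthand character class for a digit.
digits = re.findall(r'\d', sample)

# \W matches any character that is not a letter, digit, or underscore.
non_word = re.findall(r'\W', sample)

print(vowels)    # ['o', 'o', 'o', 'o']
print(digits)    # ['4', '2']
print(non_word)  # [' ', ',', ' ', ' ']
```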

  11. Write a regex with a character class that is equivalent to a|b|c|d.

  12. Write a regex that uses shorthand character classes to match strings like a1z, B3x, and L0L.

  13. Will the regex [a-z] match the string é (an e with an accent mark)?

  14. Will the regex \w match the string é (an e with an accent mark)?

  15. Will the regex \W match the string é (an e with an accent mark)?

  16. Will the regex [A-Z] match the string z?

  17. Will the regex . match the string é (an e with an accent mark)?

  18. Will the regex r'\.' match the string é (an e with an accent mark)?

  19. Name two shorthand character classes that will match the string 5.

Quantifier Syntax: How Many Qualifiers to Match

In a regular expression string, quantifiers follow qualifier characters to dictate how many of them to match. For example, a {3} might follow \d to match exactly three digits. Answer the following questions about quantifier syntax.
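The sketch below shows a few quantifiers in action. It uses re.fullmatch() so that a string counts only if the whole string fits the pattern; the candidate strings are made up for illustration.

```python
import re

# Made-up candidate strings for illustration.
candidates = ['12', '123', '1234']

# {3} means exactly three of the preceding qualifier.
exactly_three = [s for s in candidates if re.fullmatch(r'\d{3}', s)]

# {2,3} means at least two and at most three.
two_to_three = [s for s in candidates if re.fullmatch(r'\d{2,3}', s)]

# ? makes the preceding qualifier optional (zero or one occurrence).
spellings = [s for s in ['color', 'colour', 'colouur']
             if re.fullmatch(r'colou?r', s)]

print(exactly_three)  # ['123']
print(two_to_three)   # ['12', '123']
print(spellings)      # ['color', 'colour']
```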

  20. Which of the following strings does the regex '(A|B?)(A|B)?' match: A, B, AA, AB, BA, or BB?

  21. Write a regex that matches both Cheese? and Cheese.

  22. What string will the regexes X? and X* match that X+ won’t match?

  23. Write a regex that matches the same thing as the regex X{1,}.

  24. Do the regexes X{3,} and XX{2,} and XXX+ match the same strings?

  25. What is the difference between the regexes Ha{3} and (Ha){3}?

  26. Write a regex that matches a dot-com website address. The address should begin with https://, may optionally have www., should include at least one letter or number for the website name, and should end with .com.

  27. In the XKCD comic at https://xkcd.com/1105/, the main character has a license plate made up of a jumble of 1s and capital letter Is: 1I1-III1. Write a regular expression that matches all possible license plates in this style. Such a license plate consists of three 1s or Is, a dash, then four more 1s or Is.

Greedy and Non-Greedy Matching

In ambiguous situations, a greedy match will match the longest string possible. A non-greedy match (also called a lazy match) will match the shortest string possible. Answer the following questions about greedy and non-greedy matching.
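As a small illustration, the same range quantifier can be run in both modes against a string where several match lengths are possible. The digit string here is an arbitrary example.

```python
import re

digits = '123456'  # arbitrary example; two, three, or four digits could match

# The plain quantifier takes as many digits as the range allows.
longest = re.search(r'\d{2,4}', digits).group()

# A question mark after the quantifier takes as few digits as the range allows.
shortest = re.search(r'\d{2,4}?', digits).group()

print(longest)   # 1234
print(shortest)  # 12
```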

  28. Between greedy and non-greedy matching, which is the default behavior of Python regular expressions?

  29. Is greedy/non-greedy matching a feature of qualifier syntax or quantifier syntax?

  30. What does the regex .* mean?

  31. What does the regex .*? mean?

  32. What is the difference between the Pattern object returned by re.compile('.*') and the one returned by re.compile('.*', re.DOTALL)?

Matching at the Start and End of a String

You can use the caret symbol (^) at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign ($) at the end of the regex to indicate that the string must end with this regex pattern. Lastly, you can use ^ and $ together to indicate that the entire string must match the regex; that is, it’s not enough for a match to be made on some subset of the string. Python’s regex syntax also includes matching on word boundaries with \b. A word boundary is the position between a word character (a letter, digit, or underscore) and a non-word character, or the start or end of the string, so punctuation creates a boundary just as whitespace does.
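The following sketch exercises each of these anchors on short, made-up strings. It is only meant to show where each anchor allows a match to occur.

```python
import re

# ^ requires the match to begin at the start of the string.
starts_with_hello = re.search(r'^Hello', 'Hello, world') is not None

# $ requires the match to finish at the end of the string.
ends_with_digit = re.search(r'\d$', 'Room 42') is not None

# ^ and $ together require the whole string to fit the pattern.
all_digits = re.search(r'^\d+$', '123abc') is not None

# \b keeps 'cat' from matching inside the longer word 'concat'.
cat_words = re.findall(r'\bcat\b', 'cat concat cat!')

print(starts_with_hello, ends_with_digit, all_digits)  # True True False
print(cat_words)  # ['cat', 'cat']
```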

  33. Which regex matches the entire string spam: spam, $spam^, or ^spam$?

  34. While \b matches a word boundary, what does \B match?

Case-Insensitive Matching

Normally, regular expressions match text with the exact casing you specify. To make your regex case insensitive, you can pass re.IGNORECASE or re.I as a second argument to re.compile(). Answer the following questions about case-insensitive matching.
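Here is a brief sketch comparing the same pattern compiled with and without the flag. The sample sentence is made up for illustration.

```python
import re

sample = 'Python, PYTHON, and python'  # made-up sample text

# Without a flag, only the exact casing in the pattern matches.
case_sensitive = re.compile(r'python').findall(sample)

# Passing re.IGNORECASE as the second argument matches any casing.
case_insensitive = re.compile(r'python', re.IGNORECASE).findall(sample)

print(case_sensitive)    # ['python']
print(case_insensitive)  # ['Python', 'PYTHON', 'python']
```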

  35. Does Python’s re module do case-insensitive matching by default?

  36. What are the two arguments you can pass to re.compile() that enable case-insensitive matching?

  37. Will a case-insensitive search with the regex ^[A-Z]$ match the string Sinéad?

  38. Does case-insensitive matching have any effect for the regex r'\d+'?

Substituting Strings

The sub() method for Pattern objects accepts two arguments. The first is a string that should replace any matches. The second is the string to search for matches of the regular expression. The sub() method returns a string with the substitutions applied. Answer the following questions about the sub() method.
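A minimal sketch of a sub() call on a Pattern object follows: the replacement string comes first, and the string to search comes second. The phone number is a made-up example.

```python
import re

digit_pattern = re.compile(r'\d')

# First argument: the replacement text. Second argument: the string to search.
censored = digit_pattern.sub('#', 'Call 555-1234 today')

print(censored)  # Call ###-#### today
```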

  39. What are \1, \2, and \3 in regular expressions?

  40. Does the sub() method return a Match object?

  41. What arguments does the sub() method take?

Managing Complex Regexes with Verbose Mode

Matching complicated text patterns might require long, convoluted regular expressions. You can mitigate this complexity by enabling “verbose mode,” which you do by passing the re.VERBOSE flag as the second argument to re.compile(). Answer the following questions about verbose mode.
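For a sense of what this looks like in practice, here is a short sketch of a pattern spread over several lines. The time format and the sample sentence are assumptions chosen for illustration.

```python
import re

# In verbose mode, whitespace inside the pattern string is ignored,
# so each part of the regex can sit on its own line with a note beside it.
time_pattern = re.compile(r'''
    (\d{1,2})   # hour: one or two digits
    :           # separator
    (\d{2})     # minutes: exactly two digits
''', re.VERBOSE)

match = time_pattern.search('Meet at 9:30 today')
time_parts = match.groups()

print(time_parts)  # ('9', '30')
```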

  42. What flag do you pass to re.compile() to enable verbose mode?

  43. How does verbose mode make regular expression strings more readable?

  44. What do verbose mode comments look like?

Humre: A Module for Human-Readable Regexes

The third-party Humre Python module takes the good ideas of verbose mode even further by using human-readable, plain-English names to create readable regex code. Answer the following questions about the Humre module.

  45. What is the return data type of Humre functions?

  46. What does the Humre function exactly(3, 'A') return?

  47. What value does the Humre constant PERIOD have?

  48. What do the Humre functions either(exactly(3, 'A'), exactly(2, 'B')) return?

  49. Name two benefits of Humre over the re module.

Practice Projects

Continue working with regexes as you complete these short projects.

Hashtag-Finding Regex

Create a regex that can find social media hashtags. For the purposes of this project, a “hashtag” pattern begins with a # character followed by one or more alphanumeric characters (letters, numbers, or underscores). Write a function named get_hashtags(sentence) that takes a string argument and returns a list of the hashtags. For example, get_hashtags('Remember to #vote on #electionday.') should return ['#vote', '#electionday'].

Finish the program by asking the user to enter a sentence and then print the hashtags. For example, the running program could look like this:

Enter a sentence:
Remember to #vote on #electionday.
#vote
#electionday

Save this program in a file named hashtagRegex.py.

Price-Finding Regex

Many websites go to great lengths to describe how great their product is without ever telling you the price. I often find myself pressing ctrl-F to search for “$” to get this information. Let’s write a program that immediately finds prices in text using regular expressions.

Create a function named get_price(sentence) that takes a string argument and returns a list of the prices in it. For this project, a price is the dollar sign '$' followed by one or more digits, optionally followed by a period and two more digits. For example, get_price('It was $5.99 but is now on sale for $5.95!!') would return ['$5.99', '$5.95'].

Save this function as a program named priceRegex.py.

Creating a CSV File of PyCon Speakers

Many countries and regions have conferences on Python, called PyCons. The https://pyvideo.org website hosts a collection of recorded talks from various PyCon conferences. Who has given the most PyCon talks? What is the median number of PyCon talks that speakers give? There are several statistics you could gather, but first you need to organize this information into some sort of data structure.

If you select all of the text from https://pyvideo.org/speakers.html and paste it into a text editor, you’ll find a series of speakers followed by the number of talks they’ve given:

    A Bessas 1
    A Bingham 1
    A Cuni 3
    A Garassino 1
    A Jesse Jiryu Davis 13
    A Kanterakis 1
--snip--

You can use this example text for the project if, for some reason, you can’t retrieve the web page. Place the text into a single multiline string by enclosing it with triple quotes. Then, call the splitlines() method and store the returned list of strings in a variable named speakers:

speakers = """    A Bessas 1
    A Bingham 1
    A Cuni 3
--snip--
    Žygimantas Medelis 1""".splitlines()

To put this information into a spreadsheet, you could try formatting it as comma-separated values (CSV), discussed in Chapter 18 of Automate the Boring Stuff with Python.

To do so, you need to write a regex to pass to the re.sub() function. Each speaker line consists of four spaces (which we want to remove), followed by the speaker name, then a space (which we want to replace with a comma) and one or more digits at the end of the line. Write the code that changes each string in the speakers list so that the lines look like this:

A Bessas,1
A Bingham,1
A Cuni,3
A Garassino,1
A Jesse Jiryu Davis,13
A Kanterakis,1
--snip--

The speaker names have different widths, and some include non-English characters. To accommodate this, your regex will need to capture the speaker name in a group with (.*), then store it in the \1 back reference. The number of talks the speaker has given can be a varying number of digits but always comes at the end of the line. So, you can use the $ regex character to match it.

Once you’ve put the entire string in CSV format, you can place the text in a text file and save it as speakers.csv. Excel, Google Sheets, and other spreadsheet applications can then structure the speaker name and number of talks into separate columns to make further sorting and processing easier. Note that some of the speaker names have commas in them, which will make some rows in the CSV file contain more than two columns. This is fine for our purposes.
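The file-writing step could look something like the sketch below. It assumes the lines have already been converted to CSV format; save_lines() is a hypothetical helper, and the demo writes to a temporary directory rather than to speakers.csv so that it doesn't overwrite your project file.

```python
import os
import tempfile

def save_lines(lines, filename):
    """Write each string in lines to filename, one per line (hypothetical helper)."""
    with open(filename, 'w', encoding='utf-8') as file_obj:
        file_obj.write('\n'.join(lines) + '\n')

# Already-converted rows taken from the example output above.
csv_lines = ['A Bessas,1', 'A Bingham,1', 'A Cuni,3']

# Assumed demo location; in the project you would pass 'speakers.csv' instead.
demo_path = os.path.join(tempfile.gettempdir(), 'speakers_demo.csv')
save_lines(csv_lines, demo_path)
```

Passing encoding='utf-8' keeps speaker names with non-English characters intact when the file is written.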

Save this program in a file named pyconSpeakers.py. When you run the program, the speakers.csv file it creates should have a column of speakers and how many talks they’ve given.

Laugh Score

We can scientifically measure how funny a joke is based on the length of the text-based laughing response. For example, a joke that elicits the response “Hahaha” is objectively funnier than a joke that gets only a “Haha” response. A joke that provokes a “HAHAaaHAhhAHAHA” response is a very funny joke. (On a personal note, I’ve never understood humor, and no one has ever said I am funny, but that doesn’t matter now that I have software to understand humor for me.)

Let’s write a function called laugh_score(laugh) that uses a regular expression to identify and measure the length of laughing specified by the laugh string argument. A text-based laugh is defined as beginning with ha, then consisting of any number of consecutive h or a characters. Both lowercase and uppercase characters are acceptable. If there are multiple laughs in a string, count only the first one.

To write the function, you can complete the following template:

import re

def laugh_score(laugh):
    # YOUR CODE GOES HERE

assert laugh_score('abcdefg') == 0
assert laugh_score('h') == 0
assert laugh_score('ha') == 2
assert laugh_score('HA') == 2
assert laugh_score('hahaha') == 6
assert laugh_score('ha ha ha') == 2
assert laugh_score('haaaaa') == 6
assert laugh_score('ahaha') == 4
assert laugh_score('Harry said Hahaha') == 2

Save this function in a file named laughScore.py.

Word Twister—ordW wisterT

Write a program that “twists” the words in a string. For example, calling twist_words('Hello world! How are you? I am fine.') returns 'oHell dworl! wHo ear uyo? I ma efin.' To do so, the sub() method for Pattern objects can move the last letter of every word in a string to the front of the word.

As arguments, the sub() method of a Pattern object accepts a string to replace the matches with and the string to search for matches; the regex itself is the string you pass to re.compile(). Your regex should use \b, which matches word boundaries. For example, the regex \b[AEIOUaeiou]\w*\b would match every word that begins with an uppercase or lowercase vowel.
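To see the vowel example at work, this sketch runs that regex with findall() on a made-up sentence. It only demonstrates word boundaries and is not the solution to the project.

```python
import re

# The example regex from the text: words that begin with a vowel.
vowel_word_pattern = re.compile(r'\b[AEIOUaeiou]\w*\b')

# Made-up sentence for illustration.
found = vowel_word_pattern.findall('An owl ate eight purple grapes under Ursula')

print(found)  # ['An', 'owl', 'ate', 'eight', 'under', 'Ursula']
```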

The regex should also use parentheses to put the last letter of each word in one group and the letters that come before it in another group. This way, the replacement string can include the \1 and \2 back references to reorder these two groups.

Your code needs to be only three lines long:

import re
pattern = re.compile(r'THE_REGEX')
print(pattern.sub(r'REPLACEMENT', 'Hello world! How are you? I am fine.'))

Save this program in a file named wordTwister.py.