Introducing Humre: Human-Readable Regular Expressions
Tue 23 August 2022 Al Sweigart
Regular expressions (aka regexes) are a mini-language for specifying a pattern of text to look for. However, regex syntax is composed of various punctuation marks that can be hard to remember. Humre is a Python module that provides a more human-readable syntax that works better with code editing tools. You can install Humre just like any other Python module by running pip install humre, and the full documentation is available in the git repo's README file.
For example, if you want to find an American phone number, you can use the regex string r'\d{3}-\d{3}-\d{4}' (the r prefix makes it a raw string, so the backslashes are taken literally and don't need to be doubled) to search for three numeric digit characters for the area code (the \d{3}), then a dash (the -), followed by three digits, a dash, and four digits. But r'\d{3}-\d{3}-\d{4}' is not an intuitive way to convey this. Humre fixes this by providing functions and constants with readable names. The equivalent Humre code for an American phone number is:
exactly(3, DIGIT) + '-' + exactly(3, DIGIT) + '-' + exactly(4, DIGIT)
Humre's constants are strings and Humre's functions return strings, so the above expression evaluates to the same string: r'\d{3}-\d{3}-\d{4}'
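Since the result is a plain string, you can hand it straight to the standard re module. Here's a quick check that uses only re; the sample sentence is made up for illustration:

```python
import re

# The string that the Humre expression above evaluates to.
phone_regex = r'\d{3}-\d{3}-\d{4}'

# The sample text is invented for this example.
match = re.search(phone_regex, 'Call me at 415-555-1234 tomorrow.')
found = match.group()
print(found)
```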
Consider this regex code using Python's re module for an American phone number with optional parentheses around the area code:
>>> import re
>>> regexStr = r'(\d{3})|(\(\d{3}\))-\d{3}-\d{4}'
>>> patternObj = re.compile(regexStr)
>>> patternObj.search('My number is (415)-555-5555.')
<re.Match object; span=(13, 27), match='(415)-555-5555'>
The regex string is hard to read, especially with those escaped parentheses mixed in with the parentheses used for grouping. Here is the equivalent Humre code:
>>> from humre import *
>>> regexStr = either(group(exactly(3, DIGIT)), group(OPEN_PAREN, exactly(3, DIGIT), CLOSE_PAREN)) + '-' + exactly(3, DIGIT) + '-' + exactly(4, DIGIT)
>>> regexStr
'(\\d{3})|(\\(\\d{3}\\))-\\d{3}-\\d{4}'
>>> patternObj = compile(regexStr)
>>> patternObj.search('My number is (415)-555-5555.')
<re.Match object; span=(13, 27), match='(415)-555-5555'>
The Humre code produces the same regex string, but with more readable code. Humre is not a reimplementation of a regular expression engine; it's a wrapper that adds readable names to standard regex syntax.
Python's re module has a "verbose mode" that allows you to use multiline strings with comments. This can make your regex strings easier to read, but they are still strings and won't benefit from the features your IDE provides. Because Humre code is ordinary Python code, it has the following advantages:
- Your editor's parentheses matching works.
- Your editor's syntax highlighting works.
- Your editor's linter and type hint tools pick up typos.
- Your editor's autocomplete works.
- Auto-formatter tools like Black can automatically format your regex code.
- Humre handles raw strings/string escaping for you.
- You can put actual Python comments alongside your Humre code.
- Better error messages for invalid regexes.
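For comparison, here's a minimal sketch of the verbose mode mentioned above, using only the standard re module and the phone number pattern from earlier. Note that the comments live inside the string, so your editor treats them as string contents rather than as Python comments:

```python
import re

# Whitespace and "#" comments inside the pattern are ignored in verbose mode.
phone_pattern = re.compile(r'''
    \d{3}   # area code
    -       # separator
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
''', re.VERBOSE)

is_match = phone_pattern.fullmatch('415-555-1234') is not None
```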
The README on Humre's git repository has full documentation, but here's a general introduction. I still recommend all programmers learn standard regex syntax, as you'll need that knowledge to read other people's (non-Humre) code and diagnose any bugs you accidentally write with the Humre module.
Creating Basic Regex Strings with Humre
Humre functions return strings, so instead of remembering the regex syntax for matching between three and five letter X's with 'X{3,5}', you can write the code between(3, 5, 'X'). To avoid having to type the humre. prefix in front of every function call, I recommend importing the Humre module with from humre import *.
Because Humre functions take string arguments and return strings, you can compose them together. For example, if you want to put the 3-to-5-letter-Xs inside of a regex group, you can write group(between(3, 5, 'X')).
Humre functions also accept a variable number of string arguments and concatenate them together for your convenience. The code group('Z', between(3, 5, 'X')) adds a letter Z to the start of the group, and this is equivalent to the code group('Z' + between(3, 5, 'X')).
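To make the idea concrete, here's a minimal sketch of how functions like these could be written. These are simplified stand-ins that I'm defining just for this example, not Humre's actual source code, and they skip the argument validation that Humre performs:

```python
import re

def between(minimum, maximum, regex_str):
    # Stand-in: append the {min,max} quantifier to the regex string.
    return regex_str + '{' + str(minimum) + ',' + str(maximum) + '}'

def group(*regex_parts):
    # Stand-in: concatenate all arguments and wrap them in a capturing group.
    return '(' + ''.join(regex_parts) + ')'

separate_args = group('Z', between(3, 5, 'X'))
one_arg = group('Z' + between(3, 5, 'X'))
print(separate_args)  # (ZX{3,5})
```

Both calls build the same string, which is why passing several arguments is only a convenience and never changes the resulting regex.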
The exception to this is the either() function, which is used for the regex alternation pipe character '|'. If you wanted to match "cat" or "dog" with the regex 'cat|dog', you would need to call Humre's either('cat', 'dog') function. The two string arguments are not concatenated together for either().
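One thing to watch for: the pipe has the lowest precedence in regex syntax, and the interactive shell session later in this post shows either() returning its alternatives without a surrounding group. So an alternation that gets concatenated with other strings needs a group around it. This sketch uses plain regex strings and only the re module to show the difference:

```python
import re

# Without a group, the alternatives are 'cat' and 'dogs'.
ungrouped = 'cat|dog' + 's'

# Wrapping the alternation in a group makes the 's' apply to both words.
grouped = '(cat|dog)' + 's'

ungrouped_matches_cats = re.fullmatch(ungrouped, 'cats') is not None
grouped_matches_cats = re.fullmatch(grouped, 'cats') is not None
```

In Humre terms, that means wrapping the either() call in group(), as in group(either('cat', 'dog')) + 's'.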
Humre also provides constants for common regex strings, such as DIGIT for r'\d' and OPEN_PAREN for r'\('. These manage escaping for you.
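If you're curious what "managing escaping" saves you from, here's a small sketch using only the standard library. The values compared are the ones given above for DIGIT and OPEN_PAREN; the sample string is made up:

```python
import re

# A raw string and a regular string with doubled backslashes are the same value.
same_digit = (r'\d' == '\\d')
same_open_paren = (r'\(' == '\\(')

# An unescaped '(' starts a group, so on its own it is an invalid regex.
try:
    re.compile('(')
    unescaped_is_valid = True
except re.error:
    unescaped_is_valid = False

# The escaped form matches a literal open parenthesis.
literal_match = re.search(r'\(', 'call (415)').group()
```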
A full list of Humre functions and constants is available in the README documentation.
Experimenting with what strings Humre returns is easy to do in the interactive shell if you want to explore what Humre is doing:
>>> from humre import *
>>> exactly(3, 'X')
'X{3}'
>>> zero_or_more('X')
'X*'
>>> group(DIGIT)
'(\\d)'
>>> group(either(exactly(3, DIGIT), 'cat'))
'(\\d{3}|cat)'
>>> either(exactly(3, DIGIT), group('cat'))
'\\d{3}|(cat)'
Humre's Convenience Functions
You don't need to keep typing out quotes and + operators to join strings. Humre's join() function offers an easier way to join strings together. The following humre.join() function and ''.join() method calls are equivalent:
>>> ''.join(['Hello', 'World'])
'HelloWorld'
>>> from humre import *
>>> join('Hello', 'World')
'HelloWorld'
Humre also provides a compile() function which wraps the standard re.compile() function. This is so that you don't need to mix your Humre and re code. The following code produces equivalent pattern objects:
>>> import re
>>> re.compile(r'\d{3}')
re.compile('\\d{3}')
>>> from humre import *
>>> compile(exactly(3, DIGIT))
re.compile('\\d{3}')
Because Humre functions concatenate multiple string arguments together, if you want to pass regex flags (such as re.IGNORECASE) you need to use the flags keyword argument of humre.compile() instead. The following code produces equivalent pattern objects:
>>> import re
>>> re.compile('Hello', re.IGNORECASE)
re.compile('Hello', re.IGNORECASE)
>>> from humre import *
>>> compile('Hello', flags=re.IGNORECASE)
re.compile('Hello', re.IGNORECASE)
Note that this means running from humre import * overwrites Python's built-in compile() function. However, most applications don't call this function. If yours does, you can import Humre with import humre or import individual Humre functions and constants instead.
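If you do use the star import and later need the built-in after all, it's still reachable through the standard builtins module. In this sketch I use from re import compile as a stand-in for the Humre import, since it shadows the built-in name in the same way:

```python
import builtins
from re import compile  # stand-in for the star import: rebinds the name compile

# The shadowing name now builds regex pattern objects.
pattern = compile(r'\d{3}')

# The original built-in is still available under the builtins module.
code_obj = builtins.compile('1 + 2', '<example>', 'eval')
result = eval(code_obj)
```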
Human-Readable Error Messages
Typos in your Humre code give much better error messages than the standard re module does. For example, if you make a typo and ask for between 5 and 3 occurrences of the letter X (instead of between 3 and 5), the re module produces this unwieldy error:
>>> import re
>>> re.compile('X{5,3}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 227, in compile
    return _compile(pattern, flags)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 294, in _compile
    p = _compiler.compile(pattern, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\_compiler.py", line 743, in compile
    p = _parser.parse(p, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 980, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 455, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 672, in _parse
    raise source.error("min repeat greater than max repeat",
re.error: min repeat greater than max repeat at position 2
That's quite a lot of text to read through. The same mistake in Humre provides a more direct, detailed error message:
>>> from humre import *
>>> between(5, 3, 'X')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\github\humre\src\humre\__init__.py", line 439, in between
    raise ValueError('minimum argument ' + str(minimum) + ' must be less than maximum argument ' + str(maximum))
ValueError: minimum argument 5 must be less than maximum argument 3
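If you're working with the re module directly, you can at least trim the noise by catching the exception yourself. This sketch relies only on the documented msg and pos attributes of re.error:

```python
import re

try:
    re.compile('X{5,3}')
except re.error as exc:
    error_message = exc.msg   # the description without the position
    error_position = exc.pos  # index in the pattern where compilation failed
    print(f'Invalid regex: {error_message} (position {error_position})')
```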
Unicode Letter Support
It's common to write [A-Za-z] for a character class to match all letters, but this suffers from the problem of not matching letters with accent marks or non-English letters. Humre's LETTER character class looks like '[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͺ-ͽͿΆΈ-ΊΌΎ-Ρ... and matches all letters. The definition of "letter" in this case is based on Python's isalpha() string method.
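Here's a quick sketch of the problem using only the standard library. The sample word is my own choice for illustration:

```python
import re

word = 'caf\u00e9'  # "café", written with an escape so the accented letter is unambiguous

# The ASCII-only character class rejects the accented letter.
ascii_only = re.fullmatch('[A-Za-z]+', word) is not None

# str.isalpha() treats the accented letter as a letter.
is_alpha = word.isalpha()
```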
Massive Regexes Are Easier with Humre
The benefits of Humre compound as your regular expressions get larger. Let's compare a massive regex used in Python packaging code to its equivalent Humre code.
Here's the massive regular expression in verbose mode:
And here's the equivalent code with Humre:
While both are a lot to take in, the Humre code benefits from not being in a multiline string. This means your IDE can match parentheses, your linter can spot typos, and your automatic code formatting tools can clean up your Humre code for you.
Full Documentation
This is a broad introduction to Humre. The full documentation is available in the git repo's README file.