Introducing Humre: Human-Readable Regular Expressions

Tue 23 August 2022 Al Sweigart

Regular expressions (aka regexes) are a mini-language to specify a pattern of text to look for. However, regex syntax is composed of various punctuation marks that can be hard to remember. Humre is a Python module that gives a more human-readable syntax that works better with code editing tools. You can install Humre just like any other Python module with pip install humre and the full documentation is available in the git repo's README file.

For example, if you want to find an American phone number, you can use the regex string r'\d{3}-\d{3}-\d{4}' (the r prefix makes it a raw string, so you don't have to escape the backslashes yourself) to search for three digit characters for the area code (the \d{3}), then a dash (the -), followed by three digits, another dash, and four digits. But r'\d{3}-\d{3}-\d{4}' is not an intuitive way to convey this. Humre fixes this by providing functions and constants with readable names. The equivalent Humre code for an American phone number is:

exactly(3, DIGIT) + '-' + exactly(3, DIGIT) + '-' + exactly(4, DIGIT)

Humre's constants are strings and Humre's functions return strings, so the above expression evaluates to the same string: r'\d{3}-\d{3}-\d{4}'
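To make the "it's all just strings" point concrete, here is a minimal sketch that uses only the standard library. The exactly() function and DIGIT constant below are simplified stand-ins written for this illustration (an assumption about the general idea, not Humre's actual source), so the snippet runs even without Humre installed.

```python
import re

# Stand-ins for illustration only; Humre's real definitions may differ.
DIGIT = r'\d'

def exactly(quantity, regex_str):
    # Append an exact-count quantifier such as {3} to the given regex string.
    return regex_str + '{' + str(quantity) + '}'

phone_regex = exactly(3, DIGIT) + '-' + exactly(3, DIGIT) + '-' + exactly(4, DIGIT)

# The composed value is an ordinary string, identical to the handwritten regex.
assert phone_regex == r'\d{3}-\d{3}-\d{4}'
# A raw string keeps the backslash as-is; without the r prefix you double it.
assert r'\d' == '\\d'

match = re.search(phone_regex, 'Call 415-555-5555 today.')
print(match.group())  # 415-555-5555
```

Because the result is a plain string, it can be passed to anything that accepts a regex string, including re.search() as shown.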

Consider this regex code using Python's re module for an American phone number with optional parentheses around the area code:

>>> import re
>>> regexStr = r'(\d{3})|(\(\d{3}\))-\d{3}-\d{4}'
>>> patternObj = re.compile(regexStr)
>>> patternObj.search('My number is (415)-555-5555.')
<re.Match object; span=(13, 27), match='(415)-555-5555'>

The regex string is hard to read, especially with those escaped parentheses mixed in with the grouping parentheses. Here's the equivalent Humre code:

>>> from humre import *
>>> regexStr = either(group(exactly(3, DIGIT)), group(OPEN_PAREN, exactly(3, DIGIT), CLOSE_PAREN)) + '-' + exactly(3, DIGIT) + '-' + exactly(4, DIGIT)
>>> regexStr
'(\\d{3})|(\\(\\d{3}\\))-\\d{3}-\\d{4}'
>>> patternObj = compile(regexStr)
>>> patternObj.search('My number is (415)-555-5555.')
<re.Match object; span=(13, 27), match='(415)-555-5555'>

The Humre code produces the same regex string except using more readable code. Humre is not a reimplementation of a regular expression engine; it's a wrapper that adds readable names to standard regex syntax.
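Since Humre hands back a standard regex string, the usual precedence rules of the re module still apply to it. The sketch below uses only the standard library to check one of them: in the pattern above, | binds more loosely than concatenation, so the first alternative is the bare (\d{3}) and the trailing -\d{3}-\d{4} belongs only to the parenthesized alternative. The (?:...) wrapper in the second pattern is a suggested variation of mine, not something taken from Humre's documentation.

```python
import re

pattern_from_article = re.compile(r'(\d{3})|(\(\d{3}\))-\d{3}-\d{4}')

# With parentheses around the area code, the second alternative matches the whole number.
with_parens = pattern_from_article.search('My number is (415)-555-5555.')
assert with_parens.group() == '(415)-555-5555'

# Without them, the first alternative matches on its own: just the area code.
without_parens = pattern_from_article.search('My number is 415-555-5555.')
assert without_parens.group() == '415'

# Wrapping the alternation in a non-capturing group limits the | to the area code.
scoped_pattern = re.compile(r'(?:(\d{3})|(\(\d{3}\)))-\d{3}-\d{4}')
assert scoped_pattern.search('My number is 415-555-5555.').group() == '415-555-5555'
assert scoped_pattern.search('My number is (415)-555-5555.').group() == '(415)-555-5555'
```

In Humre terms, the scoped version would presumably mean wrapping the either(...) call in noncap_group(...), a function that appears later in this article.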

Python's re module has a "verbose mode" that allows you to use multiline strings with comments. This can make your regex strings easier to read, but they are still strings and won't benefit from the advanced features your IDE provides. Because Humre code is ordinary Python code, it has these benefits:

Your editor's parentheses matching works.
Your editor's syntax highlighting works.
Your editor's linter and type hints tool pick up typos.
Your editor's autocomplete works.
Auto-formatter tools like Black can automatically format your regex code.
Humre handles raw strings/string escaping for you.
You can put actual Python comments alongside your Humre code.
You get better error messages for invalid regexes.
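For reference, here is what the verbose mode mentioned above looks like for the simple phone number regex, as a small sketch using only the standard library. The comments live inside the string, which is why editor tooling cannot treat them as code.

```python
import re

# Verbose mode ignores whitespace in the pattern and treats # as a comment start.
phone_pattern = re.compile(r"""
    \d{3}   # area code
    -       # dash
    \d{3}   # first three digits
    -       # dash
    \d{4}   # last four digits
""", re.VERBOSE)

match = phone_pattern.search('Call 415-555-5555 today.')
print(match.group())  # 415-555-5555
```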

The README on Humre's git repository has full documentation, but here's a general introduction. I still recommend all programmers learn standard regex syntax, as you'll need that knowledge to read other people's (non-Humre) code and diagnose any bugs you accidentally write with the Humre module.

Creating Basic Regex Strings with Humre

Humre functions return strings, so instead of remembering the regex syntax for matching between 3 and 5 letter X's with 'X{3,5}', you can write the code between(3, 5, 'X'). To avoid typing the humre. prefix in front of every function call, I recommend importing the Humre module with from humre import *.

Because Humre functions take string arguments and return strings, you can compose them together. For example, if you want to put the 3-to-5-letter-Xs inside of a regex group, you can write group(between(3, 5, 'X')).

Humre functions also accept a variable number of string arguments and concatenate them together for your convenience. The code group('Z', between(3, 5, 'X')) adds a letter Z to the start of the group, and this is equivalent to the code group('Z' + between(3, 5, 'X')).

The exception to this is the either() function, which is used for the regex alternation pipe character '|'. If you wanted to match "cat" or "dog" with the regex 'cat|dog', you would need to call Humre's either('cat', 'dog') function. The two string arguments are not concatenated together for either().
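The difference between concatenating arguments and alternating between them can be sketched with a few lines of plain Python. The between(), group(), and either() functions below are simplified stand-ins written for this example (they are assumptions about the general behavior described above, not Humre's actual source).

```python
import re

# Simplified stand-ins written for this illustration; not Humre's actual source.
def between(minimum, maximum, *regex_strs):
    # Concatenate the pieces, then append a {min,max} quantifier.
    return ''.join(regex_strs) + '{' + str(minimum) + ',' + str(maximum) + '}'

def group(*regex_strs):
    # Concatenate the pieces inside one capturing group.
    return '(' + ''.join(regex_strs) + ')'

def either(*regex_strs):
    # Join the pieces with the alternation pipe instead of concatenating them.
    return '|'.join(regex_strs)

assert between(3, 5, 'X') == 'X{3,5}'
assert group('Z', between(3, 5, 'X')) == group('Z' + between(3, 5, 'X')) == '(ZX{3,5})'
assert either('cat', 'dog') == 'cat|dog'

match = re.fullmatch(group('Z', between(3, 5, 'X')), 'ZXXXX')
print(match.group(1))  # ZXXXX
```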

Humre also provides constants for common regex strings, such as DIGIT for r'\d' and OPEN_PAREN for r'\('. These manage escaping for you.
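For literal text that has no named constant, the standard library's re.escape() function does a similar escaping job. The sketch below uses only the re module (no Humre calls) to show the escaped strings that constants like OPEN_PAREN stand for, going by the values given above.

```python
import re

# re.escape() backslash-escapes regex metacharacters in arbitrary literal text.
assert re.escape('(') == r'\('
assert re.escape(')') == r'\)'

# Build a pattern for a parenthesized area code from escaped literals.
area_code_pattern = re.escape('(') + r'\d{3}' + re.escape(')')
assert area_code_pattern == r'\(\d{3}\)'

match = re.search(area_code_pattern, 'My number is (415)-555-5555.')
print(match.group())  # (415)
```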

A full list of Humre functions and constants is available in the README documentation.

If you want to explore what Humre is doing, it's easy to experiment in the interactive shell and see what strings Humre returns:

>>> from humre import *
>>> exactly(3, 'X')
'X{3}'
>>> zero_or_more('X')
'X*'
>>> group(DIGIT)
'(\\d)'
>>> group(either(exactly(3, DIGIT), 'cat'))
'(\\d{3}|cat)'
>>> either(exactly(3, DIGIT), group('cat'))
'\\d{3}|(cat)'

Humre's Convenience Functions

You don't need to keep typing out quotes and + operators to join strings. Humre's join() function offers an easier way to join strings together. The following humre.join() function and ''.join() method calls are equivalent:

>>> ''.join(['Hello', 'World'])
'HelloWorld'
>>> from humre import *
>>> join('Hello', 'World')
'HelloWorld'

Humre also provides a compile() function which wraps the standard re.compile() function. This is so that you don't need to mix your Humre and re code. The following code produces equivalent pattern objects:

>>> import re
>>> re.compile(r'\d{3}')
re.compile('\\d{3}')
>>> from humre import *
>>> compile(exactly(3, DIGIT))
re.compile('\\d{3}')

Because Humre functions concatenate multiple string arguments together, if you want to pass regex flags (such as re.IGNORECASE) you need to use humre.compile()'s flags keyword argument instead. The following code produces equivalent pattern objects:

>>> import re
>>> re.compile('Hello', re.IGNORECASE)
re.compile('Hello', re.IGNORECASE)
>>> from humre import *
>>> compile('Hello', flags=re.IGNORECASE)
re.compile('Hello', re.IGNORECASE)

Note that this means running from humre import * overwrites Python's built-in compile() function. However, most applications don't call this function. If yours does, you can import Humre with import humre or import individual Humre functions and constants instead.
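Both points can be sketched with the standard library alone. The compile_regex() function below is a hypothetical wrapper in the spirit of humre.compile() (an assumption for illustration, not Humre's actual source); it shows why flags has to be a keyword argument when every positional argument is treated as a piece of the pattern. The last lines show that the built-in compile() stays reachable through the builtins module even after the name has been shadowed.

```python
import builtins
import re

# Hypothetical wrapper in the spirit of humre.compile(); not Humre's actual source.
def compile_regex(*regex_strs, flags=0):
    # Every positional argument is a pattern piece, so flags must be passed by keyword.
    return re.compile(''.join(regex_strs), flags=flags)

pattern = compile_regex('Hel', 'lo', flags=re.IGNORECASE)
print(pattern.search('HELLO world').group())  # HELLO

# Simulate the shadowing that a star import causes in this module.
compile = compile_regex

# The built-in function is still available under its qualified name.
code_obj = builtins.compile('1 + 2', '<string>', 'eval')
print(eval(code_obj))  # 3
```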

Human-Readable Error Messages

Typos in your Humre code give much better error messages than the standard re module does. For example, if you make a typo and ask for between 5 and 3 occurrences of the letter X (instead of between 3 and 5), the re module produces this unwieldy error:

>>> import re
>>> re.compile('X{5,3}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 227, in compile
    return _compile(pattern, flags)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\__init__.py", line 294, in _compile
    p = _compiler.compile(pattern, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\_compiler.py", line 743, in compile
    p = _parser.parse(p, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 980, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 455, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Al\AppData\Local\Programs\Python\Python311\Lib\re\_parser.py", line 672, in _parse
    raise source.error("min repeat greater than max repeat",
re.error: min repeat greater than max repeat at position 2

That's quite a lot of text to read through. The same mistake in Humre provides a more direct, detailed error message:

>>> from humre import *
>>> between(5, 3, 'X')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\github\humre\src\humre\__init__.py", line 439, in between
    raise ValueError('minimum argument ' + str(minimum) + ' must be less than maximum argument ' + str(maximum))
ValueError: minimum argument 5 must be less than maximum argument 3
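If you want to handle these errors in code rather than read them in a traceback, the two styles can be compared with the sketch below. The re.error exception exposes its message and position as the msg and pos attributes. The between() function here is a hypothetical stand-in that mirrors the ValueError shown above; it is not Humre's actual source.

```python
import re

try:
    re.compile('X{5,3}')
except re.error as exc:
    re_message = exc.msg   # the message without the "at position N" suffix
    re_position = exc.pos  # index in the pattern where parsing failed

# Hypothetical stand-in that validates its arguments before building the string.
def between(minimum, maximum, regex_str):
    if minimum > maximum:
        raise ValueError('minimum argument ' + str(minimum) +
                         ' must be less than maximum argument ' + str(maximum))
    return regex_str + '{' + str(minimum) + ',' + str(maximum) + '}'

try:
    between(5, 3, 'X')
except ValueError as exc:
    readable_message = str(exc)

print(re_message)
print(readable_message)
```

Raising the error at the function call, before any regex string exists, is what lets the message talk about the arguments you passed rather than about a position inside a pattern string.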

Unicode Letter Support

It's common to write [A-Za-z] for a character class to match all letters, but this doesn't match letters with accent marks or non-English letters. Humre's LETTER character class looks like '[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͺ-ͽͿΆΈ-ΊΌΎ-Ρ... and matches all letters. The definition of "letter" in this case is based on Python's isalpha() string method.
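The gap in [A-Za-z], and the idea of deriving a character class from isalpha(), can be sketched as follows. This is only an illustration: the LETTER_SKETCH name and the cutoff at code point U+0250 are my own choices to keep the example small, and this is not necessarily how Humre builds its LETTER constant.

```python
import re

# [A-Za-z] only covers unaccented ASCII letters.
assert re.fullmatch('[A-Za-z]+', 'cafe') is not None
assert re.fullmatch('[A-Za-z]+', 'caf\u00e9') is None  # café

# Collect every code point below U+0250 that isalpha() accepts.
# None of these characters is special inside a character class.
letters = ''.join(chr(cp) for cp in range(0x250) if chr(cp).isalpha())
LETTER_SKETCH = '[' + letters + ']'

assert re.fullmatch(LETTER_SKETCH + '+', 'caf\u00e9') is not None   # café
assert re.fullmatch(LETTER_SKETCH + '+', 'na\u00efve') is not None  # naïve
assert re.fullmatch(LETTER_SKETCH + '+', 'abc123') is None          # digits are not letters
```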

Massive Regexes Are Easier with Humre

The benefits of Humre compound as your regular expressions get larger. Let's compare a massive regex used in Python packaging code to its equivalent Humre code.

Here's the massive regular expression in verbose mode:

_version_regex_str = r"""
(?P<version>
    (?:
        # The identity operators allow for an escape hatch that will
        # do an exact string match of the version you wish to install.
        # This will not be parsed by PEP 440 and we cannot determine
        # any semantic meaning from it. This operator is discouraged
        # but included entirely as an escape hatch.
        (?<====)  # Only match for the identity operator
        \s*
        [^\s]*    # We just match everything, except for whitespace
                  # since we are only testing for strict identity.
    )
    |
    (?:
        # The (non)equality operators allow for wild card and local
        # versions to be specified so we have to define these two
        # operators separately to enable that.
        (?<===|!=)            # Only match for equals and not equals
        \s*
        v?
        (?:[0-9]+!)?          # epoch
        [0-9]+(?:\.[0-9]+)*   # release
        (?:                   # pre release
            [-_\.]?
            (a|b|c|rc|alpha|beta|pre|preview)
            [-_\.]?
            [0-9]*
        )?
        (?:                   # post release
            (?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*)
        )?
        # You cannot use a wild card and a dev or local version
        # together so group them with a | and make them optional.
        (?:
            \.\*  # Wild card syntax of .*
            |
            (?:[-_\.]?dev[-_\.]?[0-9]*)?         # dev release
            (?:\+[a-z0-9]+(?:[-_\.][a-z0-9]+)*)? # local
        )?
    )
    |
    (?:
        # The compatible operator requires at least two digits in the
        # release segment.
        (?<=~=)               # Only match for the compatible operator
        \s*
        v?
        (?:[0-9]+!)?          # epoch
        [0-9]+(?:\.[0-9]+)+   # release  (We have a + instead of a *)
        (?:                   # pre release
            [-_\.]?
            (a|b|c|rc|alpha|beta|pre|preview)
            [-_\.]?
            [0-9]*
        )?
        (?:                                   # post release
            (?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*)
        )?
        (?:[-_\.]?dev[-_\.]?[0-9]*)?          # dev release
    )
    |
    (?:
        # All other operators only allow a sub set of what the
        # (non)equality operators do. Specifically they do not allow
        # local versions to be specified nor do they allow the prefix
        # matching wild cards.
        (?<!==|!=|~=)         # We have special cases for these
                              # operators so we want to make sure they
                              # don't match here.
        \s*
        v?
        (?:[0-9]+!)?          # epoch
        [0-9]+(?:\.[0-9]+)*   # release
        (?:                   # pre release
            [-_\.]?
            (a|b|c|rc|alpha|beta|pre|preview)
            [-_\.]?
            [0-9]*
        )?
        (?:                                   # post release
            (?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*)
        )?
        (?:[-_\.]?dev[-_\.]?[0-9]*)?          # dev release
    )
)
"""

And here's the equivalent code with Humre:

from humre import *

SEPARATOR = chars('-_' + PERIOD)
OPT_SEPARATOR = optional(SEPARATOR)

def version_template(fn):
    return ''.join([
        zero_or_more(WHITESPACE),
        optional('v'),
        optional(noncap_group(one_or_more(chars('0-9')), '!')), # epoch
        one_or_more(chars('0-9')), fn(noncap_group(PERIOD, one_or_more(chars('0-9')))), # release
        optional(noncap_group( # pre release
            OPT_SEPARATOR,
            group(either('a', 'b', 'c', 'rc', 'alpha', 'beta', 'pre', 'preview')),
            OPT_SEPARATOR,
            zero_or_more(chars('0-9')),
        )),
        optional(noncap_group( # post release
            either(
                noncap_group('-', one_or_more(chars('0-9'))),
                noncap_group(OPT_SEPARATOR, group_either('post', 'rev', 'r') + OPT_SEPARATOR + zero_or_more(chars('0-9')))
            )
        ))
    ])

EQ_NE_VERSION_TEMPLATE = version_template(zero_or_more)
COMPATIBILITY_VERSION_TEMPLATE = version_template(one_or_more)

DEV_RELEASE = optional(noncap_group(OPT_SEPARATOR, 'dev', OPT_SEPARATOR, zero_or_more(chars('0-9'))))  # dev release

_version_regex_str = named_group('version',
    either(
        noncap_group(
            # The identity operators allow for an escape hatch that will
            # do an exact string match of the version you wish to install.
            # This will not be parsed by PEP 440 and we cannot determine
            # any semantic meaning from it. This operator is discouraged
            # but included entirely as an escape hatch.
            positive_lookbehind('==='), # Only match for the identity operator
            zero_or_more(WHITESPACE),
            zero_or_more(nonchars(WHITESPACE)) # We just match everything, except for whitespace
                                               # since we are only testing for strict identity.
        ),
        noncap_group(
            # The (non)equality operators allow for wild card and local
            # versions to be specified so we have to define these two
            # operators separately to enable that.
            positive_lookbehind(either('==', '!=')), # Only match for equals and not equals
            EQ_NE_VERSION_TEMPLATE,
            # You cannot use a wild card and a dev or local version
            # together so group them with a | and make them optional.
            optional(noncap_group(
                either(
                    PERIOD + ASTERISK, # Wild card syntax of .*
                    DEV_RELEASE +
                    optional(noncap_group(PLUS, one_or_more(chars('a-z0-9')), zero_or_more(noncap_group(SEPARATOR, one_or_more(chars('a-z0-9')))))) # local
                )
            ))
        ),
        noncap_group(
            # The compatible operator requires at least two digits in the
            # release segment.
            positive_lookbehind('~='), # Only match for the compatible operator
            COMPATIBILITY_VERSION_TEMPLATE,
            DEV_RELEASE,
        ),
        noncap_group(
            # All other operators only allow a sub set of what the
            # (non)equality operators do. Specifically they do not allow
            # local versions to be specified nor do they allow the prefix
            # matching wild cards.
            negative_lookbehind(either('==', '!=', '~=')), # We have special cases for these
                                                           # operators so we want to make sure they
                                                           # don't match here.
            EQ_NE_VERSION_TEMPLATE,
            DEV_RELEASE,
        )
    )
)

print(_version_regex_str)

While both are a lot to take in, the Humre code benefits from not being in a multiline string. This means your IDE can match parentheses, your linter can spot typos, and your automatic code formatting tools can clean up your Humre code for you.
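Another benefit visible in the Humre version is that repeated fragments become ordinary Python functions and variables, as version_template(fn) does by taking the quantifier function as an argument. The sketch below isolates that idea for the release segment only. The zero_or_more() and one_or_more() functions are simplified stand-ins (assumptions for illustration, not Humre's source), and release_template() is a hypothetical name.

```python
import re

# Stand-ins for illustration; Humre's own zero_or_more()/one_or_more() may differ.
def zero_or_more(regex_str):
    return regex_str + '*'

def one_or_more(regex_str):
    return regex_str + '+'

def release_template(repeat):
    # Build the release segment, letting the caller choose the quantifier.
    return '[0-9]+' + repeat(r'(?:\.[0-9]+)')

eq_ne_release = release_template(zero_or_more)      # [0-9]+(?:\.[0-9]+)*
compatible_release = release_template(one_or_more)  # [0-9]+(?:\.[0-9]+)+

assert re.fullmatch(eq_ne_release, '1') is not None
assert re.fullmatch(compatible_release, '1') is None        # needs at least two parts
assert re.fullmatch(compatible_release, '1.4') is not None
```

In the verbose-mode string, the same difference is a single * versus + character buried in two otherwise identical copies of the pattern.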