Notes on Using Python Regex Package

Some notes on using regular expressions in Python.

Escape special characters

Some ASCII characters have special meaning in regex. If you want to match them literally in your pattern, you need to escape them. For example, if we want to match (abc) literally, we need to write it as \(abc\). Doing this manually is tedious and error-prone.

A better way is to use re.escape(), which takes a string and returns it with all regex special characters escaped.
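As a minimal sketch of the difference, the snippet below escapes the (abc) example with re.escape() and compares it with the unescaped pattern. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical input: a literal we want to find, and some text to search.
literal = "(abc)"
text = "see (abc) and abc"

# re.escape() backslash-escapes the regex metacharacters for us.
escaped = re.escape(literal)
print(escaped)  # \(abc\)

# The escaped pattern matches the parentheses literally.
matches = re.findall(escaped, text)
print(matches)  # ['(abc)']

# Without escaping, the parentheses form a capture group,
# so the pattern matches every plain "abc" instead.
unescaped_matches = re.findall(literal, text)
print(unescaped_matches)  # ['abc', 'abc']
```

This is most useful when the literal comes from user input or a file, where you cannot know in advance which special characters it contains.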


The meaning of re.ASCII

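re.ASCII only affects which characters the predefined character classes match: with this flag, \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. It doesn’t restrict the searched strings in any way, so a pattern compiled with re.ASCII can still be run against text that contains non-ASCII characters.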
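A small sketch of what this means in practice is shown below. The sample strings are made up, and the non-ASCII characters are written as escapes so the snippet does not depend on the file encoding.

```python
import re

# "\u0661\u0662\u0663" are the Arabic-Indic digits one, two, three.
text = "\u0661\u0662\u0663 123"
cafe = "caf\u00e9"  # "café"

# Default (Unicode) matching for str patterns: \d matches any decimal digit.
unicode_digits = re.findall(r"\d+", text)
# With re.ASCII, \d only matches [0-9].
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

# \w likewise shrinks to [a-zA-Z0-9_] under re.ASCII.
unicode_words = re.findall(r"\w+", cafe)
ascii_words = re.findall(r"\w+", cafe, flags=re.ASCII)

# The flag does not restrict the searched string or literal characters:
# an explicit non-ASCII literal in the pattern still matches.
literal_match = re.search("\u00e9", cafe, flags=re.ASCII)

print(unicode_digits, ascii_digits)
print(unicode_words, ascii_words)
print(literal_match is not None)
```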


Regex search is slow?

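If we have a lot of regex patterns, or we use the same pattern many times, it is better to compile each one once with re.compile() and then call methods such as search() on the compiled object. Module-level functions such as re.search() have to look the pattern up in the re package's internal cache on every call, and calling the compiled pattern directly skips that step, which adds up in tight loops.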
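One common way to organize this is to compile all patterns once at import time and reuse the compiled objects. The sketch below assumes two hypothetical patterns (a date and a simplified email) and a hypothetical helper named extract(); none of these come from a real code base.

```python
import re

# Hypothetical patterns, compiled once when the module is loaded.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # deliberately simplified
}


def extract(line):
    """Return {name: first match} for every pattern that matches the line."""
    found = {}
    for name, pattern in PATTERNS.items():
        # Call the method on the compiled pattern, not re.search(pattern, ...).
        m = pattern.search(line)
        if m:
            found[name] = m.group(0)
    return found


result = extract("2024-01-31 mail from alice@example.com")
print(result)
```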


Using a compiled regex in re.search() is slow

I accidentally passed a compiled regex pattern to re.search() and found that it is slower than passing a normal string pattern. To verify, see the code below:

In [1]: test = 'sdf dfads dfads fsdfdas jkjlajdfsa adf'

In [2]: p = 'sdf'

In [3]: import re

In [4]: %timeit re.search(p, test)
1.17 µs ± 2.49 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: cp = re.compile(p)

In [6]: %timeit cp.search(test)
381 ns ± 0.315 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit re.search(cp, test)
1.75 µs ± 7.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In the above code, in terms of execution speed, cp.search(test) (calling the method on the compiled regex) is the fastest, re.search(p, test) (using the normal string pattern) is second, and re.search(cp, test) (passing the compiled pattern to re.search()) turns out to be the slowest.

In fact, when the pattern p is more complex, the time gap between re.search(p, test) and re.search(cp, test) is even larger.

This has to do with how regex search is implemented in Python's re package. If you call re.search(p, some_str), the re package looks up p in an internal dict that caches compiled patterns, which requires hashing the lookup key. If the pattern hasn’t been stored yet, re compiles the string under the hood and caches the result; if it is already in the cache, re reuses the compiled version.
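The caching can be observed from the outside. The sketch below relies on CPython behavior that is an implementation detail rather than a documented guarantee: compiling the same string twice returns the very same object, and re.purge() empties the cache.

```python
import re

p = "sdf"

# Compile the same string twice. In CPython the second call is served
# from the internal cache, so both names refer to one object.
first = re.compile(p)
second = re.compile(p)
same_object = first is second

# Passing an already compiled pattern back in returns it unchanged.
passthrough = re.compile(first) is first

# re.purge() clears the cache, so the next compile builds a new object.
re.purge()
third = re.compile(p)
rebuilt = third is not first

print(same_object, passthrough, rebuilt)
```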

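The slowness of passing a compiled pattern to re.search() is mainly caused by calculating the hash key. A str caches its hash after the first computation, so repeated lookups with the same string are cheap. With a compiled pattern, every call to a re module function has to calculate the hash key for the pattern again, which takes a noticeable amount of time on each call. There is a more detailed discussion here.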
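To reproduce the comparison outside IPython, a rough sketch using the standard timeit module could look like the following. The repeat counts are arbitrary choices, and the absolute numbers (and possibly the size of the gaps) will vary with the machine and the Python version.

```python
import re
import timeit

test = "sdf dfads dfads fsdfdas jkjlajdfsa adf"
p = "sdf"
cp = re.compile(p)

# The three call styles compared in the IPython session.
candidates = {
    "re.search(p, test)": lambda: re.search(p, test),
    "cp.search(test)": lambda: cp.search(test),
    "re.search(cp, test)": lambda: re.search(cp, test),
}

# Sanity check: all three styles find the same match.
spans = {label: func().span() for label, func in candidates.items()}

number = 50_000  # calls per timing run (arbitrary)
results = {}
for label, func in candidates.items():
    # Take the best of 5 runs to reduce noise from other processes.
    best = min(timeit.repeat(func, number=number, repeat=5))
    results[label] = best / number * 1e9  # nanoseconds per call
    print(f"{label:22s} {results[label]:8.1f} ns per call")
```

The lambda wrappers add a small constant overhead to every variant, so the ratios are more meaningful than the absolute timings.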



