Some notes on using regular expressions in Python.

Escape special characters

Some ASCII characters have special meaning in regex. If you want to match them literally in your pattern, you need to escape them. For example, if we want to match (abc) literally, we need to write it as \(abc\). Doing this manually is tedious and error-prone.

Use re.escape() instead of doing it manually.

Ref

The meaning of re.ASCII

re.ASCII only affects what characters are in a character class. It doesn’t restrict the searched strings in any way.

Ref:

Regex search slow?

If we have a lot of regex patterns, it is better to compile them using re.compile(). It will boost performance significantly.

Ref:

Using compiled regex in re.search() is slow.

I accidentally use compiled regex pattern in re.search() function and find that it is slower than normal string patterns. To verify, see the below code:

In [1]: test = 'sdf dfads dfads fsdfdas jkjlajdfsa adf'

In [2]: p = 'sdf'

In [3]: import re

In [4]: %timeit re.search(p, test)
1.17 µs ± 2.49 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: cp = re.compile(p)

In [6]: %timeit cp.search(test)
381 ns ± 0.315 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit re.search(cp, test)
1.75 µs ± 7.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In the above code, in terms of execution speed, cp.search(test) (using compiled regex) is the fastest, re.search(p, test) (using the normal string) is the second, and re.search(cp, test) (using the compiled regex pattern in re.search) turns out to be the slowest.

In fact, when the pattern p is more complex, the time gap between re.search(p, test) and re.search(cp, test) is even larger.

This has something to do with how regex search is implemented in Python re package. If you use re.search(p, some_str), re package will try to compile the p string under the hood and cache it in an internal dict using its hash values if this pattern hasn’t been stored yet. If the pattern has been stored in the cache, re will use the compiled version.

The slowness for compiled pattern using re.search() is mainly caused by the calculating the has key. Every time you call the re package with this compiled pattern, it will calculate the hash key for this pattern, consuming a lot of time. There is a more detailed discussion here.

References