Python, Catastrophic Regular Expressions and the GIL

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” Jamie Zawinski

The other day at work one of our python processes stopped responding. It was on one of our servers that was responsible for fetching and analyzing the web pages that we might recommend to people.

By the time I got around to debugging this problem, the last log message was over 20 minutes old - and the server had been maxed out at 100% CPU for the whole time. And while it wasn’t dead, it wasn’t responding to any requests, or even sending out any heartbeats. Even worse, in the meantime this problem had repeated itself on another server. Our whole ability to ingest new web pages looked to be going down one server at a time.

Before I killed the process, I connected up to it with GDB to try to get a sense of what was going on. Only one thread was active, and looking at the stack trace showed something like:

(gdb) bt
#0  0x0000000105c62f79 in init_sre ()
#1  0x0000000105c64885 in init_sre ()
#2  0x0000000105c61a75 in init_sre ()
#3  0x0000000105c2bd77 in PyEval_EvalFrameEx ()

A little googling showed that it was stuck processing a regular expression, and after some more extended debugging I narrowed it down to a regular expression that had been recently added to handle some of our content extraction.

Catastrophic Backtracking

The regular expression that was causing the issue here was rather long and convoluted, but the core of the problem was that it had two levels of greedy matching going on.

A really simple example of what I’m talking about would be:

import re
re.search("(a+)+b", "aaaaaaaaaaaaaaaa")

The problem is caused when the regular expression fails to match the string. It starts out greedily getting the longest possible match of the inner (a+) group here, which in this case is all 16 a’s. Since it doesn’t match the regular expression (no ‘b’ and the end) it backtracks to a group of 15 ‘a’s and tries to match again. Since that doesn’t work, it keeps on backtracking, trying all possible different groups of ‘a’s before figuring out that there is no possible segmentation that will match this regular expression.

Since the input string can be segmented at any position, there are O( 2^N ) possible different combinations that the regular expression must try out before failing. On my laptop it takes about 20ms to do the search above with 16 characters - but by doubling the input string size to 32 characters, it takes over 12 minutes to finish. Exponential running time really does suck.

What had happened in our case was that everything worked fine on almost every single web page out there. But eventually a web page came along that triggered this type of catastrophic backtracking and the process locked up. Even worse was that after a couple of minutes of running this regex, our system retried fetching the same web page on a different server - which also promptly locked up.

Python and its GIL

While it totally makes sense that the thread would block given some arbitrary malicious input, it still doesn’t explain why all the other threads were hung.

The reason why is that python has a Global Interpreter Lock that only lets one python thread run at a time. To let other threads have a chance, the python interpreter releases the lock for a thread after running 100 instructions or so. However it can only do this inside pure python code - extensions written in C can’t have the GIL pre-empted like this since the python interpreter doesn’t know when they’ll try to access python interpreter variables. This isn’t usually a big deal, as many C extensions release the GIL before doing long running operations.

However, the regular expression module is a C module that doesn’t release the GIL before running.

This all means that if you have a long running regular expression, like one that is doing catastrophic backtracking - no other code running in another thread will ever get run. No heartbeats, RPC handlers - nothing. The whole process will simply stop responding while all the other threads wait to acquire the GIL from the regular expression which never releases it.

Dealing with catastrophe

There are a couple of ways to mitigate this type of problem.

The simple answer is don’t ever in any circumstance write a regular expression with two levels of greedy matching going on. In the above example there really is no reason not to write the regular expression as a+b instead.

Alternatively you could switch to a regular expression engine that doesn’t exhibit this kind of behaviour. RE2 is an excellent library that uses a finite automata approach to avoid this type of catastrophic backtracking. Using this library on the same 32 character input reduces the running time from 12 minutes to basically instant. The downside being that RE2 doesn’t support all the functionality of the builtin re package.

In the end we just fixed the code (rather than swap out libraries), and the only long term result was two dead python processes and a couple of hours of debugging.