One of the cool new features in py-spy is the ability to profile native Python extensions written in languages like C, C++ or Cython.
Almost all other Python profilers only show program activity that is in pure Python code, and time spent in native code instead shows up against the line of Python that calls the native function. Using native profiling tools like perf can give you a sense of what's going on on the native side of things, but at the expense of losing any visibility into what's happening with the Python function calls.
The big problem with this is that a huge amount of the Python ecosystem is in native extensions. It's a common optimization pattern to rewrite the slowest part of your Python program in a language like Cython or C++ after profiling, and by only being able to profile either the native code or the python code you only get half the picture of what's happening in your python codebase.
With the 0.2 release of py-spy, if you enable native extension profiling with the --native flag you will get a profile containing both python code and C++/Cython code. As an example, here is a flamegraph generated by profiling the lastfm example in my implicit recommendation library:
To illustrate the problem that we're trying to solve, let's take a look at the python callstack at a single point in time when profiling that example program above. Using the py-spy dump command we can get a single stack trace from the python program, which for the main thread looks like:
Thread 14976 (active)
    fit (implicit/als.py:159)
    calculate_similar_movies (movielens.py:73)
    <module> (movielens.py:112)
This library is mostly Cython code though, and the 'fit' call is just at the line where the first Cython function is getting called, so this isn't useful at all for optimizing the extension.
Using native profiling tools like perf can show all the native frames, but the callstack doesn't have any details on what Python functions are getting called then - and it also ends up cluttered with irrelevant information. As an example, here is what the native stack trace looks like at the same point in time:
mkl_blas_avx512_xsaxpy (libmkl_avx512.so)
mkl_blas_saxpy (libmkl_intel_thread.so)
saxpy_ (libmkl_intel_lp64.so)
__pyx_fuse_0__pyx_f_8implicit_4_als_axpy (implicit/_als.cpp:2859)
__pyx_pf_8implicit_4_als_30_least_squares_cg (implicit/_als.cpp:15263)
__pyx_fuse_1_0__pyx_pw_8implicit_4_als_31_least_squares_cg (implicit/_als.cpp:14740)
__pyx_FusedFunction_call (_reordering.cpython-37m-x86_64-linux-gnu.so)
__Pyx_PyObject_Call (implicit/_als.cpp:40418)
__pyx_pf_8implicit_4_als_4least_squares_cg (implicit/_als.cpp:11553)
__pyx_pw_8implicit_4_als_5least_squares_cg (implicit/_als.cpp:11439)
cfunction_call_varargs (Objects/call.c:755)
PyCFunction_Call (Objects/call.c:784)
partial_call_impl.isra.1 (Modules/_functoolsmodule.c:187)
partial_call (Modules/_functoolsmodule.c:230)
_PyObject_FastCallKeywords (Objects/call.c:199)
call_function (Python/ceval.c:4619)
_PyEval_EvalFrameDefault (Python/ceval.c:3139)
_PyEval_EvalCodeWithName (Python/ceval.c:3940)
_PyFunction_FastCallKeywords (Objects/call.c:433)
call_function (Python/ceval.c:4621)
_PyEval_EvalFrameDefault (Python/ceval.c:3110)
_PyEval_EvalCodeWithName (Python/ceval.c:3930)
_PyFunction_FastCallKeywords (Objects/call.c:433)
call_function (Python/ceval.c:4616)
_PyEval_EvalFrameDefault (Python/ceval.c:3139)
_PyEval_EvalCodeWithName (Python/ceval.c:3940)
PyEval_EvalCodeEx (Python/ceval.c:3959)
PyEval_EvalCode (Python/ceval.c:530)
run_mod (Python/pythonrun.c:1036)
PyRun_FileExFlags (Python/pythonrun.c:988)
PyRun_SimpleFileExFlags (Python/pythonrun.c:429)
pymain_run_file (Modules/main.c:428)
pymain_run_filename (Modules/main.c:1627)
pymain_run_python (Modules/main.c:2876)
pymain_main (Modules/main.c:3037)
_Py_UnixMain (Modules/main.c:3073)
__libc_start_main (libc-2.27.so)
The python callstack is lost here, replaced by _PyEval_EvalFrameDefault and other internal interpreter methods instead of the Python function names. The Cython function calls are also mangled, and show the line in the generated .cpp file rather than the line in the original .pyx file.
With the latest release of py-spy, if you dump out a callstack with the
--native option you'll see
the python stack and native stack merged together, and all the Cython calls demangled and replaced
with the line numbers in the original pyx file:
Thread 14976 (active)
    mkl_blas_avx512_xsaxpy (libmkl_avx512.so)
    mkl_blas_saxpy (libmkl_intel_thread.so)
    saxpy_ (libmkl_intel_lp64.so)
    _als_axpy (implicit/_als.pyx:18)
    _least_squares_cg (implicit/_als.pyx:184)
    _least_squares_cg (implicit/_als.pyx:137)
    least_squares_cg (implicit/_als.pyx:132)
    least_squares_cg (implicit/_als.pyx:130)
    fit (implicit/als.py:159)
    calculate_similar_artists (lastfm.py:79)
    <module> (lastfm.py:161)
The rest of this post goes into the details of how we generate this: unwinding the native stack of the other process, merging the native and python stacks together, and how we go about profiling Cython extensions.
All that is needed to profile native extensions is to get the native call stack and then merge in the python callstack. We already have code from the first version of py-spy to get the python stack, so the first challenge here is getting the native stack to merge in.
[Image: gimli, not just a LOTR character. Image from Wikipedia]
To get the native stack I used the StackWalk64 API on Windows and libunwind-ptrace on Linux. Both of these APIs let me unwind the stack of a thread running in another process, and give me the instruction pointer and other CPU registers for every frame being unwound. Using these methods, I gathered a list of instruction pointers for each thread, where each IP corresponds to a frame in the stack trace.
Just having a list of instruction pointers isn't all that helpful by itself though: we have to symbolicate these pointers to get the filename, function and line number of the corresponding source code. Windows ships with the SymFromAddr and SymGetLineFromAddr APIs to do symbolication, but for Linux I ended up using the symbolication provided by the gimli crate. Gimli is a great project with a bunch of potential here: not only am I using their addr2line functionality to resolve symbols, I'm also using their cpp-demangle crate to demangle C++ linker symbols.
With a fully symbolicated native stack frame, all that's left is to merge in the python callstack so that we can replace those internal python interpreter methods.
Merging the native stacks and python stacks is a little tricky unfortunately.
We can't figure out the Python frame information directly from the native stack trace. Inside the Python interpreter, this information is held in the PyFrameObject struct, which contains details like the function name, filename and line number of a Python frame. While we can see native function calls like _PyEval_EvalFrameDefault where the PyFrameObject is passed as the first parameter, in the x86_64 calling convention the first parameter is passed in the RDI register, which we can't retrieve from unwinding the stack. Since we're also not guaranteed to have debugging information for the Python interpreter, this means that there is no reliable way to get the PyFrameObject just by looking at the native stack.
So instead what we're doing is generating both the native stack trace and the python stack trace independently of one another, and then merging them into one coherent stack trace. To merge the stacks we replace the _PyEval_EvalFrame* calls in the native stack with the function from the python stack trace, strip out other irrelevant internal python interpreter calls and for everything else leave the native function call.
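py-spy's actual merge code is written in Rust, but the idea can be sketched in Python. This is a simplified model, with both frame lists ordered innermost-first, each _PyEval_EvalFrame* call standing in for exactly one python frame, and only an illustrative subset of the internal interpreter calls that get stripped:

```python
# Illustrative subset of interpreter plumbing to strip from the output.
INTERNAL_CALLS = {
    "call_function", "_PyEval_EvalCodeWithName", "PyEval_EvalCodeEx",
    "PyEval_EvalCode", "_PyFunction_FastCallKeywords",
    "PyCFunction_Call", "cfunction_call_varargs",
    "_PyObject_FastCallKeywords",
}

def merge_stacks(native_frames, python_frames):
    """Merge a native stack with the python stack of the same thread.

    Both lists are ordered innermost frame first. Each
    _PyEval_EvalFrame* call in the native stack evaluates exactly one
    python frame, so we substitute them in order; other interpreter
    plumbing is dropped, and real native frames are kept as-is.
    """
    python_iter = iter(python_frames)
    merged = []
    for frame in native_frames:
        name = frame.split()[0]  # drop the "(file:line)" suffix
        if name.startswith("_PyEval_EvalFrame"):
            merged.append(next(python_iter))
        elif name in INTERNAL_CALLS:
            continue
        else:
            merged.append(frame)
    return merged
```

Running this over a trimmed-down version of the stacks shown above keeps the MKL frames, substitutes the evaluator frames with the real python functions, and drops the rest.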
The big problem in doing this is figuring out which python stack trace corresponds to each native stack trace, because the thread ids for the python stack trace and the native stack trace aren't always the same. On Unix based systems, Python uses the pthread ID internally rather than the OS thread ID. This difference in thread ids makes it hard to figure out which native callstack matches up with the equivalent python callstack, since there is no common key to join on. While Python 3.8 added support for getting the OS thread id from within python, there isn't an easy way to call this from another program, and we also support profiling Python all the way back to version 2.3. I also wanted to avoid injecting code into the running process (say by calling pthread_self on the native thread) to do the lookup, since I don't want py-spy to modify the running program.
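For illustration, from inside a Python 3.8+ process you can see both ids side by side (py-spy of course has to recover the pthread id externally, without running any code in the target process):

```python
import threading

# CPython's internal thread id: on Unix this is the pthread_t returned
# by pthread_self(), which generally differs from the OS thread id.
pthread_id = threading.get_ident()

# New in Python 3.8: the OS-level thread id (e.g. gettid() on Linux),
# the id that tools like ps, top and perf report.
os_thread_id = threading.get_native_id()
```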
So instead what we're doing is looking up the pthread id from the native stack trace by looking at the registers of the top-level function. After forking a new thread, pthreads sticks the pthread ID into the RBX register of the top level frame of the native stack - and this register is one of the ones that can be retrieved from unwinding. While this is a bit of a hack, it seems to work fairly well and let us join the native and python stacks together.
Having the native thread id also lets us query the OS to figure out whether a thread is idle, so the idle detection code in the new version of py-spy is much better than it used to be. People were literally laughing on twitter about how bad the previous heuristic based method was, so any improvement is a good one =).
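On Linux, that kind of OS-level idle check can be sketched by reading the thread's state letter out of procfs. This is a simplified Python illustration, not py-spy's actual Rust implementation:

```python
import os
import threading

def thread_is_idle(pid, tid):
    """Return True if the OS reports the thread as not running.

    Linux-only sketch: the third field of /proc/<pid>/task/<tid>/stat
    is a state letter ('R' running, 'S' sleeping, 'D' disk wait, ...).
    """
    with open(f"/proc/{pid}/task/{tid}/stat") as f:
        stat = f.read()
    # The comm field is wrapped in parentheses and can itself contain
    # spaces, so parse from the last ')' before reading the state.
    state = stat.rsplit(")", 1)[1].split()[0]
    return state != "R"

# The thread running this check is, by definition, not idle:
idle = None
if os.path.exists("/proc/self/task"):
    idle = thread_is_idle(os.getpid(), threading.get_native_id())
```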
There are also a couple of problems with profiling Cython extensions.
The biggest problem is that symbolication returns the filename and line number of the C or C++
file that the Cython compiler generates rather than the original .pyx file that contains the
actual Cython source code. The best way to overcome this is to use the
emit_linenums=True option during the cythonize call. When this option is set, Cython will add
#line directives to the generated C or C++
file that map back to the correct line and filename in the original Cython file.
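A minimal setup.py sketch passing this option through cythonize might look like the following (the package and file names here are placeholders, not from the implicit library):

```python
# setup.py -- build a Cython extension with #line directives so that
# symbolication maps back to the original .pyx source locations.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="myext",  # placeholder package name
    ext_modules=cythonize(
        ["myext/_core.pyx"],  # placeholder module path
        # Emit C #line directives pointing back at the .pyx file
        emit_linenums=True,
    ),
)
```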
If you haven't used this option to create your Cython extension, py-spy will still try to recover the original filename and line numbers from comments that the Cython compiler leaves in the generated C file. py-spy loads the generated C file, finds the comment block with the original file and line number corresponding to the symbolicated line, and uses that if possible. The big downside here is that the generated C file isn't installed with the extension, so this usually only works if you are running a development version (setup.py develop or pip install -e .) of your extension.
The other problem with Cython is that the function names get mangled. To fix this I wrote a basic Cython demangler that probably has some unresolved issues around fused function calls, but otherwise seems to do the job well enough.
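py-spy's demangler is written in Rust, but the heuristic can be sketched in Python. The format assumed here (an optional __pyx_fuse_<n> wrapper, a function-kind prefix, length-prefixed module path components, then a numeric index followed by the function name, with a guessed length-vs-index disambiguation) is my reading of the symbols above, not an authoritative spec:

```python
import re

# Common Cython function-kind prefixes (python wrapper, impl, c func).
CYTHON_PREFIXES = ("__pyx_pw_", "__pyx_pf_", "__pyx_f_")

def demangle_cython(symbol):
    """Heuristically demangle a Cython symbol, e.g.
    '__pyx_pw_8implicit_4_als_5least_squares_cg' -> 'least_squares_cg'.
    """
    rest = None
    for prefix in CYTHON_PREFIXES:
        idx = symbol.find(prefix)  # fused wrappers precede the prefix
        if idx != -1:
            rest = symbol[idx + len(prefix):]
            break
    if rest is None:
        return symbol  # not a Cython symbol, leave it untouched

    while True:
        m = re.match(r"_?(\d+)(.*)", rest)
        if not m:
            return rest
        n, after = int(m.group(1)), m.group(2)
        # Treat n as a length prefix only if consuming n characters
        # leaves another "_<digit>" component behind; otherwise n was
        # an index and everything after it is the function name.
        if len(after) > n and re.match(r"_\d", after[n:]):
            rest = after[n:]
        else:
            return after
```

On the symbols from the native stack trace above, this recovers the same names py-spy shows, e.g. least_squares_cg and _least_squares_cg.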
There are some limitations with native extensions that I wanted to point out before wrapping this post up:
Since this post is also serving as a launch announcement for the new version of py-spy, I wanted to point out that there are a bunch of other cool new features in py-spy 0.2. One of the big ones is that py-spy now works on FreeBSD, thanks to the work of Artem Khramov. py-spy can also write out more file formats, including raw profile data and speedscope formatted profiles. We also support Python 3.8, have better idle detection, support ARM processors, as well as other tweaks. Finally, this version of py-spy is released under the MIT license.
I think that py-spy is pretty useful myself, but I'd love to hear your thoughts on how it's working out. If there are any features you'd like to see, either thumb up the appropriate issue or create a new one that describes what functionality is missing. I'm going to try to use thumbs up as a voting mechanism to see what missing features in py-spy people are most interested in - the native extension issue had 115 thumbs up, which showed me that this feature was worth finishing up.
Published on 27 September 2019