One of the cool things about the rbspy profiler is that it can profile any running Ruby program, without even pausing the Ruby program that is being profiled. Rbspy is a sampling profiler, and when the --nonblocking argument is passed to it, it will collect each stack trace from the profiled program without pausing it or doing any synchronization. This has the advantage of not slowing down the profiled program at all, but has the disadvantage of leading to a data race between the rbspy profiler and the Ruby program being profiled. In the nonblocking mode, rbspy tries to get an accurate stack trace from the Ruby program while the Ruby program is actively changing the stack by running the code - and since there is no locking happening there is potential for a data race. Amazingly, rbspy still manages to get good results even without doing any synchronization.

Julia Evans wrote an excellent post about whether to pause Ruby during profiling - and concluded that pausing the Ruby program wasn’t always necessary. While running in the nonblocking mode can cause some minor errors and isn’t the default anymore, the profiles generated in the nonblocking mode are still fairly accurate.

This isn’t the case with py-spy, which is a similar sampling profiler for Python programs. When you run py-spy in the nonblocking mode, the profiles generated can be wildly misleading and nonsensical.

This post is talking about what the effects of a data race look like when profiling with py-spy, why this happens with py-spy much more frequently than in rbspy due to differences in the CPython and CRuby interpreter implementations, and why you still might want to consider using the nonblocking in some conditions despite the potential for inaccurate profiles being generated.

What does a data race during profiling look like?

Errors when profiling can be most clearly seen in Python programs calling functions that execute quickly - changing the stack of the python program as it runs, without spending much time in each function call.

As an example of a program that exhibits this behaviour, consider this simple recursive function for calculating a factorial in Python:

def factorial(x):
    if x <= 1:
        return 1
    return x * factorial(x-1)

while True:

When profiling this code with py-spy with the default option to pause the program while collecting each stack trace, you get a flame graph with sensible looking output:

py-spy flamegraph - with pausing the program to collect each sample

Flame graphs represent aggregated stack traces with each rectangle being one frame in the stack trace - and the width of the rectangle being proportional to the amount of time that frame took. This provides a compact representation of the hierarchy between all the aggregated stack traces that were sampled by py-spy.

The flamegraph above is what I’d expect from tracing a recursive factorial function. This output shows that there is one frame per recursive function call, and each frame takes roughly the same amount of time - leading to a flamegraph that looks kind of like a staircase.

However, when calling py-spy with the --nonblocking flag - the results are totally different and give a good visual of what a race condition during profiling looks like:

py-spy flamegraph with the --nonblocking option

The output here is clearly incorrect. Half of the stack frames don’t end up in a <module> entry point, which should be impossible. Likewise, the stack goes 60+ frames deep inside the recursive function, when we know by looking at the program that it should only go 16 frames deep.

The really interesting thing here is that we don’t see the same results with Ruby and rbspy. Profiling the same factorial function written in Ruby with rbspy produces sensible output even with the --nonblocking option set:

rbspy flamegraph when not pausing the program

The output from running rbspy when pausing the program is roughly the same as when running in nonblocking mode on this program. While rbspy has recently changed the default options to pause the program when profiling - since it can still have some issues when profiling in its nonblocking mode, these issues are much smaller than we’ve seen with py-spy. The rbspy project went years before adding the option to pause the Ruby process during profiling, where this was added within a couple of weeks of py-spy’s initial release.

Why is Python so much more affected than Ruby?

Rbspy and py-spy both operate very similarly - and even share a bunch of code for reading information from the profiled process. Given all that, the big question here is why do we see so many more of these type of errors with py-spy than with rbspy?

The answer as far as I can figure out is due to differences in how the CPython and CRuby interpreters manage their internal frame state.

Ruby stores the stack trace for the program as an array on the rb_thread_struct:

typedef struct rb_thread_struct {
    VALUE self;
    rb_vm_t *vm;

    /* execution information */
    VALUE *stack;		/* must free, must mark */
    size_t stack_size;          /* size in word (byte size / sizeof(VALUE)) */


To profile a ruby program, rbspy reads the stack array from the Ruby process with a single syscall using the process_vm_readv system call on Linux, the vm_read call on OSX or the ReadProcessMemory call on Windows. This lets us get the whole stack trace for the thread at once, and then we can use that to get frame information from each VALUE in the stack. Since the frame values change much slower than the stack does, this leads to a fairly accurate picture of the ruby program even with the --nonblocking option.

Python on the other hand uses a linked list to represent each frame with each frame pointing to the previous frame in stack through a f_back member:

struct _frame {
    struct _frame *f_back;      /* previous frame, or NULL */


To get the stack trace, py-spy reads in each frame - and then uses the f_back member to find the next frame. Each frame involves a different syscall to read in the memory from the other process, and since that memory is changing while this stack unwinding is happening we can get garbage output like we saw above.

When you might want to use nonblocking mode anyways

The whole point of not pausing the profiled process is to reduce the performance impact on the profiled program. Typically pausing the program during profiling with py-spy reduces its running speed by around 5% or so, though this depends on several factors like how many threads the program has, how deep the call stack is, and what the sampling rate is. Running in nonblocking mode has a negligible performance impact on the profiled program - which might be desirable when profiling production services.

The errors we’ve seen are mostly in pure python code, where each python function call exits quickly without doing much work. If the function is calling into C extensions, the Python stack trace can be stable enough to get a good stack trace reliably. Likewise if the Python function calls take a longer amount of time - you can still get a good stack trace in nonblocking mode. As an example, writing the factorial function as for loop rather than a recursive function produces decent looking output.

Published on 09 November 2021

Get new posts by email!

Enter your email address to get an email whenever I write a new post:

  • Follow me on: