An exciting feature just landed in the main branch of the Clang compiler. Using the [[clang::musttail]] or __attribute__((musttail)) statement attributes, you can now get guaranteed tail calls in C, C++, and Objective-C.

While tail calls are usually associated with a functional programming style, I am interested in them purely for performance reasons. It turns out that in some cases we can use tail calls to get better code out of the compiler than would otherwise be possible—at least given current compiler technology—without dropping to assembly.

Applying this technique to protobuf parsing has yielded amazing results: we have managed to demonstrate protobuf parsing at over 2GB/s, more than double the previous state of the art. There are multiple techniques that contributed to this speedup, so “tail calls == 2x speedup” is the wrong message to take away. But tail calls are a key part of what made that speedup possible.

In this blog entry I will describe why tail calls are such a powerful technique, how we applied them to protobuf parsing, and how this technique generalizes to interpreters. I think it’s likely that all of the major language interpreters written in C (Python, Ruby, PHP, Lua, etc.) could get significant performance benefits by adopting this technique. The main downside is portability: currently musttail is a nonstandard compiler extension, and while I hope it catches on it will be a while before it spreads widely enough that your system’s C compiler is likely to support it. That said, at build time you can compromise some efficiency for portability if you detect that musttail is not available.

Tail Call Basics

A tail call is any function call that is in tail position, the final action to be performed before a function returns. When tail call optimization occurs, the compiler emits a jmp instruction for the tail call instead of call. This skips over the bookkeeping that would normally allow the callee g() to return back to the caller f(), like creating a new stack frame or pushing the return address. Instead f() jumps directly to g() as if it were part of the same function, and g() returns directly to whatever function called f(). This optimization is safe because f()’s stack frame is no longer needed once the tail call has begun, since it is no longer possible to access any of f()’s local variables.
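For example (a minimal sketch; g() stands in for any function that shares f()’s signature):

int g(int x);

int f(int x) {
  // Guaranteed tail call: Clang must compile this as a jmp to g, not a call.
  __attribute__((musttail)) return g(x);
}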

While this may seem like a run-of-the-mill optimization, it has two very important properties that unlock new possibilities in the kinds of algorithms we can write. First, it reduces the stack memory from O(n) to O(1) when making n consecutive tail calls, which is important because stack memory is limited and stack overflow will crash your program. This means that certain algorithms are not actually safe to write unless this optimization is performed. Second, jmp eliminates the performance overhead of call, such that a function call can be just as efficient as any other branch. These two properties enable us to use tail calls as an efficient alternative to normal iterative control structures like for or while.
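As a quick sketch of what that enables (struct node and sum() are hypothetical; with the guarantee, the walk below runs in constant stack space no matter how long the list is):

struct node { struct node *next; long value; };

long sum(struct node *n, long acc) {
  if (!n) return acc;  // end of the list: acc holds the total
  acc += n->value;
  // The tail call plays the role of a loop iteration; no stack frames pile up.
  __attribute__((musttail)) return sum(n->next, acc);
}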

This is by no means a new idea; it goes back to at least 1977, when Guy Steele wrote an entire paper arguing that procedure calls make for cleaner designs than GOTO, and that tail call optimization can make them just as fast. This was one of the “Lambda Papers” written between 1975 and 1980 that developed many of the ideas underlying Lisp and Scheme.

Tail call optimization is not even new to Clang: like GCC and many other compilers, Clang was already capable of optimizing tail calls. In fact, the musttail attribute in our first example above did not change the output of the compiler at all: Clang would already have optimized the tail call under -O2.

What is new is the guarantee. While compilers will often optimize tail calls successfully, this is best-effort, not something you can rely on. In particular, the optimization will most likely not happen in non-optimized builds:
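Consider the same function with a plain return and no attribute (a sketch): compiled at -O0, Clang will typically emit a full prologue and epilogue and a real call g rather than a jmp.

int g(int x);

int f(int x) {
  return g(x);  // nothing guarantees this is compiled as a tail call
}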

Here the tail call was compiled to an actual call, so we are back to O(n) stack space. This is why we need musttail: unless we can get a guarantee from the compiler that our tail calls will always be optimized, in all build modes, it isn’t safe to write algorithms that use tail calls for iteration. It would be a pretty severe limitation to have code that only works when optimizations are enabled.

The Trouble With Interpreter Loops

Compilers are incredible pieces of technology, but they are not perfect. Mike Pall, author of LuaJIT, decided to write LuaJIT 2.x’s interpreter in assembly rather than C, and he cites this decision as a major factor that explains why LuaJIT’s interpreter is so fast. He later went into more detail about why C compilers struggle with interpreter main loops. His two most central points are:

  • The larger a function is, and the more complex and connected its control flow, the harder it is for the compiler’s register allocator to keep the most important data in registers.
  • When fast paths and slow paths are intermixed in the same function, the presence of the slow paths compromises the code quality of the fast paths.

These observations closely mirror our experiences optimizing protobuf parsing. The good news is that tail calls can help solve both of these problems.

It may seem odd to compare interpreter loops to protobuf parsers, but the nature of the protobuf wire format makes them more similar than you might expect. The protobuf wire format is a series of tag/value pairs, where the tag contains a field number and wire type. This tag acts similarly to an interpreter opcode: it tells us what operation we need to perform to parse this field’s data. Like interpreter opcodes, protobuf field numbers can come in any order, so we have to be prepared to dispatch to any part of the code at any time.
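To make the analogy concrete, here is a sketch of what consuming that tag/value stream looks like (ReadTag(), ParseVarintField(), and the other helpers are hypothetical, not actual protobuf functions):

#include <stdint.h>
#include <stddef.h>

typedef struct Msg Msg;  // hypothetical in-memory message type

// Hypothetical helpers; each returns the new parse position, or NULL on error.
const char *ReadTag(const char *ptr, uint32_t *tag);
const char *ParseVarintField(const char *ptr, const char *end, Msg *msg, uint32_t tag);
const char *ParseFixed32Field(const char *ptr, const char *end, Msg *msg, uint32_t tag);
const char *SkipUnknownField(const char *ptr, const char *end, uint32_t tag);

const char *ParseMessage(const char *ptr, const char *end, Msg *msg) {
  while (ptr < end) {
    uint32_t tag;
    ptr = ReadTag(ptr, &tag);    // tag = (field_number << 3) | wire_type
    if (!ptr) return NULL;       // corrupt tag
    switch (tag >> 3) {          // dispatch on the field number, like an opcode
      case 1:  ptr = ParseVarintField(ptr, end, msg, tag);  break;
      case 2:  ptr = ParseFixed32Field(ptr, end, msg, tag); break;
      /* ...one case per field defined in the message... */
      default: ptr = SkipUnknownField(ptr, end, tag);       break;
    }
    if (!ptr) return NULL;       // wrong wire type, corrupt data, or end of buffer
  }
  return ptr;
}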

The natural way to write such a parser is to have a while loop surrounding a switch statement, and indeed this has been the state of the art in protobuf parsing for basically as long as protobufs have existed. For example, here is some parsing code from the current C++ version of protobuf. If we represent the control flow graphically, we get something like this:

But this is incomplete, because at almost every stage there are things that can go wrong. The wire type could be wrong, or we could see some corrupt data, or we could just hit the end of the current buffer. So the full control flow graph looks more like this.

We want to stay on the fast paths (in blue) as much as possible, but when we hit a hard case we have to execute some fallback code to handle it. These fallback paths are usually bigger and more complicated than the fast paths, touch more data, and often even make out-of-line calls to other functions to handle the more complex cases.

Theoretically, this control flow graph paired with a profile should give the compiler all of the information it needs to generate optimal code. In practice, when a function is this big and connected, we often find ourselves fighting the compiler. It spills an important variable when we want it to keep it in a register. It hoists stack frame manipulation that we want to shrink-wrap around a fallback function invocation. It merges identical code paths that we wanted to keep separate for branch prediction reasons. The experience can end up feeling like trying to play the piano while wearing mittens.

Improving Interpreter Loops With Tail Calls

The analysis above is mainly just a rehash of Mike’s observations about interpreter main loops. But instead of dropping to assembly, as Mike did with LuaJIT 2.x, we found that a tail-call-oriented design could give us the control we needed to get nearly optimal code from C. I worked on this together with my colleague Gerben Stavenga, who came up with much of the design. Our approach is similar to the design of the wasm3 WebAssembly interpreter, which describes this pattern as a “meta machine”.

The code for our 2+GB/s protobuf parser was submitted to upb, a small protobuf library written in C, in pull/310. While it is fully working and passing all protobuf conformance tests, it is not rolled out anywhere yet, and the design has not been implemented in the C++ version of protobuf. But now that musttail is available in Clang (and upb has been updated to use it), one of the biggest barriers to fully productionizing the fast parser has been removed.

Our design does away with a single big parse function and instead gives each operation its own small function. Each function tail calls the next operation in sequence. For example, here is a function to parse a single fixed-width field. (This code is simplified from the actual code in upb; there are many details of our design that I am leaving out of this article but hope to cover in future articles.)

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef void *upb_msg;
struct upb_decstate;
typedef struct upb_decstate upb_decstate;

// The standard set of arguments passed to each parsing function.
// Thanks to x86-64 calling conventions, these will be passed in registers.
#define UPB_PARSE_PARAMS                                          \
  upb_decstate *d, const char *ptr, upb_msg *msg, intptr_t table, \
      uint64_t hasbits, uint64_t data
#define UPB_PARSE_ARGS d, ptr, msg, table, hasbits, data

#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))

const char *fallback(UPB_PARSE_PARAMS);
const char *dispatch(UPB_PARSE_PARAMS);

// Code to parse a 4-byte fixed field that uses a 1-byte tag (field 1-15).
const char *upb_pf32_1bt(UPB_PARSE_PARAMS) {
  // Decode "data", which contains information about this field.
  uint8_t hasbit_index = data >> 24;
  size_t ofs = data >> 48;

  if (UNLIKELY(data & 0xff)) {
    // Wire type mismatch (the dispatch function xor's the expected wire type
    // with the actual wire type, so data & 0xff == 0 indicates a match).
    MUSTTAIL return fallback(UPB_PARSE_ARGS);
  }

  ptr += 1;  // Advance past tag.

  // Store data to message.
  hasbits |= 1ull << hasbit_index;
  memcpy((char*)msg + ofs, ptr, 4);

  ptr += 4;  // Advance past data.

  // Call dispatch function, which will read the next tag and branch to the
  // correct field parser function.
  MUSTTAIL return dispatch(UPB_PARSE_ARGS);
}

For a function this small and simple, Clang gives us code that is basically impossible to beat.

upb_pf32_1bt:                           # @upb_pf32_1bt
        mov     rax, r9
        shr     rax, 24
        bts     r8, rax
        test    r9b, r9b
        jne     .LBB0_1
        mov     r10, r9
        shr     r10, 48
        mov     eax, dword ptr [rsi + 1]
        mov     dword ptr [rdx + r10], eax
        add     rsi, 5
        jmp     dispatch                        # TAILCALL
.LBB0_1:
        jmp     fallback                        # TAILCALL

Note that there is no prologue or epilogue and no register spills; indeed, there is no use of the stack whatsoever. The only exits are the jmps from the two tail calls, and no code is required to forward the parameters, because the arguments are already sitting in the correct registers. Pretty much the only improvement we could hope for is a conditional jump for the tail call, jne fallback, instead of jne followed by jmp.

If you were looking at a disassembly of this code without symbol information, you would have no reason to know that this was an entire function. It could just as easily be a basic block from a larger function. And that, in essence, is exactly what we are doing. We are taking an interpreter loop that is conceptually a big complicated function and programming it block by block, transferring control flow from one to the next via tail calls. We have full control of the register allocation at every block boundary (well, for six registers at least), and by using the same set of parameters for each function, we eliminate the need to shuffle any values around from one call to the next. As long as the function is simple enough to not spill those six registers, we’ve achieved our goal of keeping our most important state in registers throughout all of the fast paths.

We can optimize every instruction sequence independently, and crucially, the compiler will treat each sequence as independent too because they are in separate functions (we can prevent inlining with noinline if necessary). This solves the problem we described earlier where the code from fallback paths would degrade the code quality for fast paths. If we put the slow paths in entirely separate functions from the fast paths, we can be guaranteed that the fast paths will not suffer. The nice assembly sequence we see above is effectively frozen, unaffected by any changes we make to other parts of the parser.

If we apply this pattern to Mike’s example from LuaJIT, we can more or less match his hand-written assembly with only minor code quality defects:

#include <string.h>  // for memcpy

#define PARAMS unsigned RA, void *table, unsigned inst, \
               int *op_p, double *consts, double *regs
#define ARGS RA, table, inst, op_p, consts, regs
typedef void (*op_func)(PARAMS);
void fallback(PARAMS);

#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))

void ADDVN(PARAMS) {
    op_func *op_table = table;
    unsigned RC = inst & 0xff;
    unsigned RB = (inst >> 8) & 0xff;
    unsigned type;
    memcpy(&type, (char*)&regs[RB] + 4, 4);
    if (UNLIKELY(type > -13)) {
        return fallback(ARGS);
    }
    regs[RA] += consts[RC];
    inst = *op_p++;
    unsigned op = inst & 0xff;
    RA = (inst >> 8) & 0xff;
    inst >>= 16;
    MUSTTAIL return op_table[op](ARGS);
}

The resulting assembly is:

ADDVN:                                  # @ADDVN
        movzx   eax, dh
        cmp     dword ptr [r9 + 8*rax + 4], -12
        jae     .LBB0_1
        movzx   eax, dl
        movsd   xmm0, qword ptr [r8 + 8*rax]    # xmm0 = mem[0],zero
        mov     eax, edi
        addsd   xmm0, qword ptr [r9 + 8*rax]
        movsd   qword ptr [r9 + 8*rax], xmm0
        mov     edx, dword ptr [rcx]
        add     rcx, 4
        movzx   eax, dl
        movzx   edi, dh
        shr     edx, 16
        mov     rax, qword ptr [rsi + 8*rax]
        jmp     rax                             # TAILCALL
.LBB0_1:
        jmp     fallback

The only opportunity for improvement I see here, aside from the jne fallback issue mentioned before, is that for some reason the compiler doesn’t want to generate jmp qword ptr [rsi + 8*rax]. Instead it prefers to load into rax and then follow with jmp rax. These are minor code generation issues that could hopefully be fixed in Clang without too much work.

Limitations

One of the biggest caveats with this approach is that these beautiful assembly sequences get catastrophically pessimized if any non-tail calls are present. Any non-tail call forces the creation of a stack frame, and a lot of data spills to the stack.

#include <string.h>  // for memcpy

#define PARAMS unsigned RA, void *table, unsigned inst, \
               int *op_p, double *consts, double *regs
#define ARGS RA, table, inst, op_p, consts, regs
typedef void (*op_func)(PARAMS);
void fallback(PARAMS);

#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))

void ADDVN(PARAMS) {
    op_func *op_table = table;
    unsigned RC = inst & 0xff;
    unsigned RB = (inst >> 8) & 0xff;
    unsigned type;
    memcpy(&type, (char*)&regs[RB] + 4, 4);
    if (UNLIKELY(type > -13)) {
        // When we leave off "return", things get real bad.
        fallback(ARGS);
    }
    regs[RA] += consts[RC];
    inst = *op_p++;
    unsigned op = inst & 0xff;
    RA = (inst >> 8) & 0xff;
    inst >>= 16;
    MUSTTAIL return op_table[op](ARGS);
}

This leads to the very unfortunate:

ADDVN:                                  # @ADDVN
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        push    rax
        mov     r15, r9
        mov     r14, r8
        mov     rbx, rcx
        mov     r12, rsi
        mov     ebp, edi
        movzx   eax, dh
        cmp     dword ptr [r9 + 8*rax + 4], -12
        jae     .LBB0_1
.LBB0_2:
        movzx   eax, dl
        movsd   xmm0, qword ptr [r14 + 8*rax]   # xmm0 = mem[0],zero
        mov     eax, ebp
        addsd   xmm0, qword ptr [r15 + 8*rax]
        movsd   qword ptr [r15 + 8*rax], xmm0
        mov     edx, dword ptr [rbx]
        add     rbx, 4
        movzx   eax, dl
        movzx   edi, dh
        shr     edx, 16
        mov     rax, qword ptr [r12 + 8*rax]
        mov     rsi, r12
        mov     rcx, rbx
        mov     r8, r14
        mov     r9, r15
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        jmp     rax                             # TAILCALL
.LBB0_1:
        mov     edi, ebp
        mov     rsi, r12
        mov     r13d, edx
        mov     rcx, rbx
        mov     r8, r14
        mov     r9, r15
        call    fallback
        mov     edx, r13d
        jmp     .LBB0_2

To avoid this, we tried to follow a discipline of only calling other functions via inlining or tail calls. This can get annoying if an operation has multiple points at which an unusual case can occur that is not an error. For example, when we are parsing protobufs, the fast and common case is that varints are only one byte long, but longer varints are not an error. Handling the unusual case inline can compromise the quality of the fast path if the fallback code is too complicated. But tail calling to a fallback function gives no way of easily resuming the operation once the unusual case is handled, so the fallback function must be capable of pushing forward and completing the operation. This leads to code duplication and complexity.
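For example, a varint field parser written in this style might look roughly like the following (this reuses the macros and declarations from the fixed-field snippet above; parse_varint_1bt() and longvarint_fallback() are hypothetical names, and the real upb code differs in many details):

// Parse a varint field with a 1-byte tag, fast-pathing the 1-byte varint case.
const char *longvarint_fallback(UPB_PARSE_PARAMS);

const char *parse_varint_1bt(UPB_PARSE_PARAMS) {
  size_t ofs = data >> 48;   // field offset, decoded from "data" as before
  ptr += 1;                  // advance past the tag
  uint64_t val = (uint8_t)*ptr;
  if (UNLIKELY(val & 0x80)) {
    // The varint continues past one byte. This is not an error, but there is
    // no cheap way to resume here afterwards, so the fallback has to decode
    // the rest of the varint, store the field, and re-enter dispatch itself.
    MUSTTAIL return longvarint_fallback(UPB_PARSE_ARGS);
  }
  ptr += 1;                  // advance past the 1-byte varint
  memcpy((char*)msg + ofs, &val, 8);  // store the decoded value
  MUSTTAIL return dispatch(UPB_PARSE_ARGS);
}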

Ideally this issue could be solved by adding __attribute__((preserve_most)) to the fallback functions and then calling them normally, without tail calls. The preserve_most attribute makes the callee responsible for preserving nearly all registers, which moves the cost of the register spills to the fallback functions where we want it. We experimented some with this attribute but ran into some mysterious problems that we were not able to get to the bottom of. It may have been an error on our part; revisiting this is future work. [Update: 2023-03-14: it turns out this was a bug in Clang, which was fixed two months ago]
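The shape we had in mind is roughly this (a sketch only; op_with_slow_path() and fallback_preserving() are hypothetical names, and PARAMS, ARGS, and UNLIKELY are the macros from the earlier interpreter snippets):

// preserve_most makes the callee responsible for saving nearly all registers,
// so an ordinary (non-tail) call to the slow path should no longer force the
// fast path to spill its state around the call.
__attribute__((preserve_most)) void fallback_preserving(PARAMS);

void op_with_slow_path(PARAMS) {
  if (UNLIKELY(inst & 0x80000000u)) {  // some unusual, non-error condition
    fallback_preserving(ARGS);         // plain call; our registers survive it...
    // ...so we can resume the operation right here instead of duplicating the
    // rest of it inside the fallback.
  }
  /* ...rest of the fast path... */
}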

The other major limitation is that musttail is not portable. I very much hope that the attribute will catch on, spreading to GCC, Visual C++, and other popular compilers, and even get standardized someday. But that day is far off, so what to do in the meantime?

When musttail is not available, we need to perform at least one true return, without a tail call, for every conceptual loop iteration. We have not yet implemented this fallback in upb, but I expect it will involve a macro that either tail calls to dispatch or just returns, based on the availability of musttail.
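One possible shape for that macro (purely a sketch; HAVE_MUSTTAIL and DISPATCH() are hypothetical names, and UPB_PARSE_ARGS and dispatch() come from the earlier snippet):

// With musttail, each field parser tail calls straight into dispatch();
// without it, each parser returns, and an outer while loop re-dispatches,
// paying one true call/return per field in exchange for portability.
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define HAVE_MUSTTAIL 1
#  endif
#endif

#ifdef HAVE_MUSTTAIL
#  define DISPATCH() __attribute__((musttail)) return dispatch(UPB_PARSE_ARGS)
#else
#  define DISPATCH() return ptr  /* the outer driver loop calls dispatch() */
#endif

Each field parser would then end with DISPATCH(); in the portable configuration the parser truly returns, and an outer loop takes care of reading the next tag and calling into the next field parser.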