Let’s say we’re building an allocator. Good allocators need to serve many threads simultaneously, and as such any lock they take is going to be highly contended. One way to work around this, pioneered by TCMalloc, is to have thread-local caches of blocks (hence, the “TC” - thread cached).
Unfortunately threads can be ephemeral, so book-keeping needs to grow dynamically, and large, complex programs (like the Google Search ranking server) can have tens of thousands of threads, so per-thread cost can add up. Also, any time a thread context-switches and resumes, its CPU cache will contain different cache lines, likely the wrong ones. This is because either another thread doing something completely different executed on that CPU, or the switched thread migrated to execute on a different core.
These days, instead of caching per-thread, TCMalloc uses per-CPU data. This means that book-keeping is fixed, and this is incredibly friendly to the CPU’s cache: in the steady-state, each piece of the data will only ever be read or written to by a single CPU. It also has the amazing property that there are no atomic operations involved in the fast path, because operations on per-CPU data, by definition, do not need to be synchronized with other cores.
This post gives an overview of how to build a CPU-local data structure on modern Linux. The exposition will be for x86, but other than the small bits of assembly you need to write, the technique is architecture-independent.
The Kernel Primitive
Concurrency primitives require cooperating with the kernel, which is responsible for global scheduling decisions on the system. However, making syscalls is quite expensive; to alleviate this, there has been a trend in Linux to use shared memory as a kernelspace/userspace communication channel.
Futexes are the classic “cas-with-the-kernel” syscall (I’m assuming basic knowledge of atomic operations like cas in this article). In the happy path, we just need to cas on some memory to lock a futex, and only make a syscall if we need to go to sleep because of contention. The kernel will perform its own cas on this variable if necessary.
Restartable sequences are another such proto-primitive, which are used for per-CPU programming. The relevant syscall for us, rseq(2), was added in Linux 4.18. Its manpage reads:
A restartable sequence is a sequence of instructions guaranteed to be executed atomically with respect to other threads and signal handlers on the current CPU. If its execution does not complete atomically, the kernel changes the execution flow by jumping to an abort handler defined by userspace for that restartable sequence.
A restartable sequence, or “rseq”, is a special kind of critical section that the kernel guarantees will either execute from start to finish without being preempted, or be aborted. If preemption does happen (because of a signal or whatever), userspace observes this as a jump to a special handler for that critical section. Conceptually, it’s like handling an exception: the critical section plays the role of the try block, and the abort handler plays the role of the catch block.
These critical sections are usually of the following form:
- Read the current CPU index (the rseq mechanism provides a way to do this).
- Index into some data structure and do something to it.
- Complete the operation with a single memory write. This is the “commit”.
All the kernel tells us is that we couldn’t finish successfully. We can always try again, but the critical section needs to be such that executing any prefix of it, up to the commit, has no effect on the data structure. We get no opportunity to perform “partial rollbacks”.
In other words, the critical section must be a transaction.
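The shape above can be modeled in plain Rust, without any kernel involvement. This is only a simulation under stated assumptions: `increment_slot` and its `preempted` callback are hypothetical names, and the callback stands in for the kernel deciding to abort us just before the commit.

```rust
/// Simulated per-CPU increment with the "any prefix has no effect" property.
/// `preempted` is a hypothetical stand-in for the kernel aborting the
/// sequence; it is polled once per attempt, just before the commit.
/// Returns the number of attempts it took to commit.
pub fn increment_slot(
    slots: &mut [u64],
    cpu: usize,
    mut preempted: impl FnMut() -> bool,
) -> u32 {
    let mut attempts = 0;
    loop {
        attempts += 1;
        // Steps 1 and 2: read this CPU's slot and compute the new value
        // locally. Nothing shared has been modified yet.
        let new_value = slots[cpu] + 1;
        if preempted() {
            // Abort: no rollback is needed because nothing was written.
            continue;
        }
        // Step 3: the commit is a single store.
        slots[cpu] = new_value;
        return attempts;
    }
}
```

However many times the attempt is abandoned, the slot is only ever written once, by the final store.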
Enabling rseq
Using rseqs requires turning on support for them on a particular thread; this is what calling rseq(2) (the syscall) accomplishes.
The syscall takes four arguments, in this order: rseq, len, flags, and signature.
The syscall registers “the” Rseq struct for the current thread; there can be at most one per thread. rseq is a pointer to this struct, len should be size_of::<Rseq>(), and signature can be any 32-bit integer (more on this later). For our purposes, we can ignore the flags field on the struct.
The flags argument to the syscall, on the other hand, is used to indicate whether we’re unregistering the struct; this is explained below.
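For reference, here is a sketch of what the Rseq struct can look like in Rust. It assumes the original 32-byte layout of struct rseq from the Linux UAPI headers (newer kernels append further fields); the field names follow this post rather than the kernel headers, which call crit_sec `rseq_cs`, and `RSEQ_LEN` is a name made up for this sketch.

```rust
use std::mem::size_of;

/// Sketch of the original `struct rseq` layout. The kernel requires the
/// struct to be 32-byte aligned.
#[repr(C, align(32))]
pub struct Rseq {
    /// CPU the thread was on when the kernel last updated this struct.
    pub cpu_id_start: u32,
    /// Current CPU index, written by the kernel whenever it resumes us.
    pub cpu_id: u32,
    /// Pointer to the active critical section descriptor, or 0 for none.
    pub crit_sec: u64,
    /// Struct-level flags; ignored in this post.
    pub flags: u32,
}

/// The value to pass as `len` when registering (32 for this layout).
pub const RSEQ_LEN: u32 = size_of::<Rseq>() as u32;
```

The alignment attribute pads the 20 bytes of fields out to 32, which is what makes size_of::<Rseq>() come out right.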
In the interest of exposition, we’ll call the syscall directly. If you’ve never seen how a Linux syscall is done (on x86), you load the syscall number into rax, then up to six arguments in rdi, rsi, rdx, r10, r8, r9 [1]. We only need the first four.
The return value comes out in rax, which is 0 on success, and the negation of an errno code otherwise. In particular, we need to check for EINTR to deal with syscall interruption (every Linux syscall can be interrupted).
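A rough sketch of that calling convention and of the return-value handling follows. All function names here are hypothetical. The raw syscall helper only exists on x86-64 Linux (it is compiled out elsewhere); 334 is the rseq syscall number on x86-64 and 1 is RSEQ_FLAG_UNREGISTER, both taken from the kernel headers.

```rust
/// Syscall number of rseq(2) on x86-64 Linux.
pub const SYS_RSEQ: usize = 334;
/// Value of the syscall's `flags` argument that requests unregistration.
pub const RSEQ_FLAG_UNREGISTER: usize = 1;
/// errno value for an interrupted syscall.
pub const EINTR: i32 = 4;

/// Raw four-argument syscall: number in rax, arguments in rdi, rsi, rdx, r10.
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
pub unsafe fn raw_syscall4(nr: usize, a0: usize, a1: usize, a2: usize, a3: usize) -> isize {
    let ret: isize;
    unsafe {
        std::arch::asm!(
            "syscall",
            inlateout("rax") nr => ret,
            in("rdi") a0,
            in("rsi") a1,
            in("rdx") a2,
            in("r10") a3,
            // The `syscall` instruction itself clobbers rcx and r11.
            lateout("rcx") _,
            lateout("r11") _,
            options(nostack),
        );
    }
    ret
}

/// Splits a raw return value into `Ok(value)` or `Err(errno)`.
/// Linux reports errors as values in the range -4095..=-1.
pub fn decode(ret: isize) -> Result<usize, i32> {
    if (-4095..0).contains(&ret) {
        Err(-ret as i32)
    } else {
        Ok(ret as usize)
    }
}

/// Re-issues a syscall (represented by `f`) for as long as it is interrupted.
pub fn retry_on_eintr(mut f: impl FnMut() -> isize) -> Result<usize, i32> {
    loop {
        match decode(f()) {
            Err(EINTR) => continue,
            other => return other,
        }
    }
}
```

Registration would then amount to passing a closure that calls `raw_syscall4(SYS_RSEQ, ...)` with the struct pointer, its length, the flags, and the signature to `retry_on_eintr`.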
Unregistering, via the syscall’s flags argument, is used to tear down rseq support on the way out of a thread. Generally, rseq will be a thread-local, and registration happens at thread startup. Glibc will do this and has a mechanism for acquiring the rseq pointer. Unfortunately, the glibc I have isn’t new enough to know to do this, so I hacked up something to register my own thread local.
I had the bright idea of putting my Rseq struct in a box, which triggered an interesting bug: when a thread exits, it destroys all of the thread local variables, including the box to hold our Rseq. But if the thread then syscalls to deallocate its stack, when the kernel goes to resume, it will attempt to write the current CPU index to the rseq.cpu_id field.
This presents a problem, because the kernel is probably going to write to a garbage location. This is all but guaranteed to result in a segfault. Debuggers observe this as a segfault on the instruction right after the syscall instruction; I spent half an hour trying to figure out what was causing a call to madvise(2) to segfault.
Hence, we need to wrap our thread local in something that will call rseq(2) to unregister the struct. Putting everything together, we get a thread local whose wrapper type registers the struct when it is created and unregisters it when it is dropped.
Per Rust’s semantics, this will execute the first time we access this thread local, instead of at thread startup. Not ideal, since now we pay for an (uncontended) atomic read every time we touch RSEQ, but it will do.
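The ordering problem can be illustrated without touching the kernel. In the sketch below, `register` and `unregister` are hypothetical stand-ins for the two rseq(2) calls and merely count invocations; `Registration` is likewise a made-up name. The point is that Drop::drop runs before the box is freed, so the struct is unregistered while its memory is still valid.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

pub static REGISTERED: AtomicUsize = AtomicUsize::new(0);
pub static UNREGISTERED: AtomicUsize = AtomicUsize::new(0);

#[allow(dead_code)]
#[repr(C, align(32))]
pub struct Rseq {
    cpu_id_start: u32,
    cpu_id: u32,
    crit_sec: u64,
    flags: u32,
}

// Hypothetical stand-ins for the real rseq(2) calls; they only count.
fn register(_rseq: *mut Rseq) {
    REGISTERED.fetch_add(1, Ordering::SeqCst);
}
fn unregister(_rseq: *mut Rseq) {
    UNREGISTERED.fetch_add(1, Ordering::SeqCst);
}

/// Owns the boxed struct and keeps it registered for as long as it lives.
pub struct Registration {
    rseq: Box<UnsafeCell<Rseq>>,
}

impl Registration {
    fn new() -> Self {
        let rseq = Box::new(UnsafeCell::new(Rseq {
            cpu_id_start: 0,
            cpu_id: 0,
            crit_sec: 0,
            flags: 0,
        }));
        // The box gives the struct a stable address to hand to the kernel.
        register(rseq.get());
        Registration { rseq }
    }

    pub fn as_ptr(&self) -> *mut Rseq {
        self.rseq.get()
    }
}

impl Drop for Registration {
    fn drop(&mut self) {
        // Runs before the `rseq` field (and thus the box) is destroyed.
        unregister(self.rseq.get());
    }
}

thread_local! {
    // Initialized lazily, on the first access from each thread.
    pub static RSEQ: Registration = Registration::new();
}
```

A thread that touches RSEQ once ends up with exactly one registration and one matching unregistration by the time it has been joined.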
Creating a Critical Section
To set up and execute a restartable sequence, we need to assemble a struct that describes it. This struct, which we’ll call CritSec, is also defined by Linux’s syscall ABI. Its start field is the address of the first instruction in the sequence, and len is the length of the sequence in bytes. abort_handler is the address of the abort handler. version must be 0 and we can ignore flags.
Once we have a value of this struct (on the stack or as a constant), we grab RSEQ and atomically store the address of our CritSec to RSEQ.crit_sec. This needs to be atomic because the kernel may decide to look at this pointer from a different CPU core, but it likely will not be contended.
Note that RSEQ.crit_sec should be null before we do this; restartable sequences can’t nest.
Next time the kernel preempts our thread (and later gets ready to resume it), it will look at RSEQ.crit_sec to decide if it preempted a restartable sequence and, if so, jump to the abort handler.
Once we finish our critical section, we must reset RSEQ.crit_sec to null.
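As a sketch of the protocol just described, here is the descriptor struct together with the publish and clear steps. The layout follows struct rseq_cs from the kernel headers, with this post’s field names. `enter`, `leave`, and `contains` are hypothetical helpers, the Release ordering is an assumption, and in the real thing the two stores end up inside the inline assembly rather than in separate functions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Descriptor for one restartable sequence (32 bytes, 32-byte aligned).
#[repr(C, align(32))]
pub struct CritSec {
    pub version: u32,       // must be 0
    pub flags: u32,         // ignored here
    pub start: u64,         // address of the first instruction
    pub len: u64,           // length of the sequence in bytes
    pub abort_handler: u64, // address to jump to on abort
}

impl CritSec {
    /// Roughly the question the kernel asks about a preempted thread:
    /// was its instruction pointer inside [start, start + len)?
    pub fn contains(&self, ip: u64) -> bool {
        ip.wrapping_sub(self.start) < self.len
    }
}

/// Same layout as before, but with an atomic `crit_sec` field.
#[repr(C, align(32))]
pub struct Rseq {
    pub cpu_id_start: u32,
    pub cpu_id: u32,
    pub crit_sec: AtomicU64,
    pub flags: u32,
}

/// Publishes `cs` as the active critical section.
pub fn enter(rseq: &Rseq, cs: &CritSec) {
    // Restartable sequences can't nest, so the slot must be null.
    debug_assert_eq!(rseq.crit_sec.load(Ordering::Relaxed), 0);
    rseq.crit_sec.store(cs as *const CritSec as u64, Ordering::Release);
}

/// Resets the slot to null once the critical section is done.
pub fn leave(rseq: &Rseq) {
    rseq.crit_sec.store(0, Ordering::Release);
}
```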
Labels and Constants, Oh My
There is a wrinkle: we would like for our CritSec value to be a constant, but Rust doesn’t provide us with a way to initialize the start and abort_handler fields directly, since it doesn’t have a way to refer to the labels (jump targets) inside the inline assembly [2]. The simplest way to get around this is to assemble (lol) the CritSec on the stack, with inline assembly. The overhead is quite minimal.
On x86, this is what our boilerplate will look like:
A few things to note:
- Because this is inline assembly, we need to use numeric labels. I’ve chosen labels in the 90s for no particular reason. 90: declares a jump target, and 90f is a forward reference to that instruction address.
- Most of this assembly is just initializing a struct [3]. It’s not until the mov right before 90: (the critical section start) that anything interesting happens.
- Immediately before 92: (the abort handler) is an .int directive that emits the same four-byte signature we passed to rseq(2) into the instruction stream. This must be here, otherwise the kernel will issue a segfault to the thread. This is a very basic control-flow integrity feature.
- We clear RSEQ.crit_sec at the very end.
This is a lot of boilerplate. In an ideal world, we could have something like the following:
Unfortunately, this is very hard to do, because the constraints on restartable sequences are draconian:
- Can’t jump out of the critical section until it completes or aborts. This means you can’t call functions or make syscalls!
- Last instruction must be the commit, which is a memory store operation, not a return.
This means that you can’t have the compiler generating code for you; it might outline things or move things around in ways you don’t want. In something like ASAN mode, it might inject function calls that will completely break the primitive.
This means we must write our critical section in assembly. That assembly also almost unavoidably needs to be part of the boilerplate given above, and it means it can’t participate in ASAN or TSAN instrumentation.
In the interest of exposition, we can build a wrapper over this inline assembly boilerplate that looks something like this:
When I wrote the snippet above, I chose numeric labels in the 90s to avoid potential conflicts with whatever assembly gets pasted here. This is also why I used a leading _ on the names of some of the assembly constraints; these are private to the macro. rseq isn’t, though, since callers will want to access the CPU id in it.
The intent is for the assembly string to be pasted over the // Do something cool here comment, and for the constraints to be tacked on after the boilerplate’s constraints.
But with that we now have access to the full rseq primitive, in slightly sketchy macro form. Let’s use it to build a CPU-local data structure.
A Checkout Desk
Let’s say we have a pool of objects that are needed to perform an allocation, our putative page caches. Let’s say we have the following interface:
get_cache() grabs a cache of pages off the global free list. This requires taking a lock or traversing a lockless linked list, so it’s pretty expensive. return_cache() returns a cache back to the global free list for re-use; it is a similarly expensive operation. Both of these operations are going to be contended like crazy, so we want to memoize them.
To achieve this, we want one slot for every CPU to hold the cache it (or rather, a thread running on it) most recently acquired, so that it can be reused. These slots will have “checkout desk” semantics: if you take a thing, you must put something in its place, even if it’s just a sign that says you took the thing.
Matthew Kulukundis came up with this idea, and he’d totally put this gif in a slide deck about this data structure.
As a function signature, this is what it looks like:
We can then use it like this.
The semantics of PerCpu<T> are that it is an array of nprocs (the number of logical cores on the system) pointers, all initialized to null. checkout() swaps the pointer stored in the current CPU’s slot in the PerCpu<T> with the replacement argument.
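To pin down these semantics, here is a deliberately simplified model. It is not the rseq-based implementation: it swaps in an ordinary atomic exchange and takes the CPU index as an explicit argument, whereas the real checkout() reads the index from the rseq struct and avoids atomics entirely. `PerCpuModel` is a hypothetical name.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Model of the checkout desk: one pointer slot per CPU, all initially null.
pub struct PerCpuModel<T> {
    slots: Vec<AtomicPtr<T>>,
}

impl<T> PerCpuModel<T> {
    pub fn new(nprocs: usize) -> Self {
        PerCpuModel {
            slots: (0..nprocs)
                .map(|_| AtomicPtr::new(ptr::null_mut()))
                .collect(),
        }
    }

    /// Swaps `replacement` into the slot for `cpu` and returns whatever was
    /// there before. Passing null means "I took the thing"; getting null
    /// back means the desk was empty.
    pub fn checkout(&self, cpu: usize, replacement: *mut T) -> *mut T {
        self.slots[cpu].swap(replacement, Ordering::AcqRel)
    }
}
```

A caller would check out with a null replacement, fall back to get_cache() if it gets null back, and later check the cache back in, handing whatever it displaces to return_cache().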
Building the Checkout Desk
The implementation of this type is relatively simple, but the devil is in the details. Naively, you’d think you literally want an array of pointers:
Unfortunately, this is cache-hostile. Depending on how ptrs is aligned in memory, we expect eight CPUs’ checkout pointers to land on the same cache line (eight 8-byte pointers fill one 64-byte line). This means eight separate cores are going to be writing to the same cache line, which is going to result in a lot of cache thrash. This memory wants to be in L1 cache, but will probably wind up mostly in shared L3 cache.
This effect is called “false sharing”, and is a fundamental part of the design of modern processors. We have to adjust for this.
Instead, we want to give each core a full cache line (64 bytes aligned to a 64-byte boundary) for it to store its pointer in. This sounds super wasteful (56 of those bytes will go unused), but this is the right call for a perf-sensitive primitive.
This amount of memory can add up pretty fast (two whole pages of memory for a 128-core server!), so we’ll want to lazily initialize them. Our cache-friendly struct will look more like this:
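One way to express that layout is sketched below. The type and field names are guesses at a reasonable shape rather than a definitive listing, and the constructor takes the CPU count as a parameter so that the sketch does not depend on any particular way of querying it.

```rust
use std::ptr;
use std::sync::atomic::AtomicPtr;

/// One CPU's slot, padded out to a full 64-byte cache line. Only the first
/// 8 bytes are used; the remaining 56 are padding.
#[repr(C, align(64))]
pub struct CacheLine<T> {
    pub ptr: AtomicPtr<T>,
}

/// An array of lazily allocated cache lines, one pointer per CPU. A null
/// entry means that CPU's cache line has not been allocated yet.
pub struct PerCpu<T> {
    pub ptrs: Box<[AtomicPtr<CacheLine<T>>]>,
}

impl<T> PerCpu<T> {
    pub fn with_cpus(nprocs: usize) -> Self {
        let ptrs: Vec<AtomicPtr<CacheLine<T>>> = (0..nprocs)
            .map(|_| AtomicPtr::new(ptr::null_mut()))
            .collect();
        PerCpu { ptrs: ptrs.into_boxed_slice() }
    }
}

/// Memory used by the cache lines once every CPU has allocated its own.
pub const fn fully_populated_bytes(nprocs: usize) -> usize {
    nprocs * 64
}
```

For 128 CPUs this comes to 8192 bytes, which is the two 4 KiB pages mentioned above.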
Initializing it requires finding out how many cores there are on the machine. This is a… fairly platform-specific affair. Rust does offer a “maximum parallelism” query in its standard library, but it is intended as a hint for how many worker threads to spawn, as opposed to a hard upper bound on the number of CPU indices.
Instead, we call get_nprocs_conf(), a GNU libc extension, which is fine since we’re already extremely non-portable.
In code…
(I’m not going to implement Drop for this type. That’s an exercise for the reader.)
Implementing checkout()
Now’s the moment we’ve all been waiting for: writing our restartable sequence. As critical sections go, this one’s pretty simple:
- Index into the ptrs array to get this CPU’s pointer-to-cache-line.
- If that pointer is null, bail out of the rseq and initialize a fresh cache line (and then try again).
- If it’s not null, swap replacement with the value in the cache line.
This code listing is a lot to take in. It can be broken into two parts: the restartable sequence itself, and the allocation fallback if the pointer-to-cache-line happens to be null.
The restartable sequence is super short. It looks at the pointer-to-cache-line, bails if it’s null (this triggers the later part of the function) and then does an xchg between the actual *mut T in the per-CPU cache line, and the replacement.
If the rseq aborts, we just try again. This is short enough that preemption in the middle of the rseq is quite rare. Then, if need_alloc was zeroed, that means we successfully committed, so we’re done.
Otherwise we need to allocate a cache line for this CPU. We’re now outside of the rseq, so we’re back to needing atomics. Many threads might be racing to be the thread that initializes the pointer-to-cache-line; we use a basic cas loop to make sure that we only initialize from null, and if someone beats us to it, we don’t leak the memory we had just allocated. This is an RMW operation, so we want both acquire and release ordering. Atomics 101!
Then, we try again. Odds are good we won’t have migrated CPUs when we execute again, so we won’t need to allocate again. Eventually all of the pointers in the ptrs array will be non-null, so in the steady state this need_alloc case doesn’t need to happen.
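The allocation fallback can be sketched on its own, outside of any rseq. `get_or_init_line` is a hypothetical helper, and it uses a single strong compare_exchange instead of a loop, which is enough here because the slot only ever transitions from null to non-null.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

#[repr(C, align(64))]
pub struct CacheLine<T> {
    pub ptr: AtomicPtr<T>,
}

/// Returns this slot's cache line, allocating it if it is still null.
/// If another thread wins the race to initialize the slot, our allocation
/// is freed instead of leaked and the winner's cache line is returned.
pub fn get_or_init_line<T>(slot: &AtomicPtr<CacheLine<T>>) -> *mut CacheLine<T> {
    let existing = slot.load(Ordering::Acquire);
    if !existing.is_null() {
        return existing;
    }

    let fresh = Box::into_raw(Box::new(CacheLine {
        ptr: AtomicPtr::new(ptr::null_mut()),
    }));

    // Only install `fresh` if the slot is still null. This is an RMW, so it
    // uses acquire-release ordering on success.
    match slot.compare_exchange(ptr::null_mut(), fresh, Ordering::AcqRel, Ordering::Acquire) {
        Ok(_) => fresh,
        Err(winner) => {
            // Somebody beat us to it; don't leak what we just allocated.
            unsafe { drop(Box::from_raw(fresh)) };
            winner
        }
    }
}
```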
Conclusion
This is just a glimpse of what per-CPU concurrent programming looks like. I’m pretty new to it myself, and this post was motivated by building an end-to-end example in Rust. You can read more about how TCMalloc makes use of restartable sequences here.
[1] This is annoyingly different from the function calling convention, which passes arguments in rdi, rsi, rdx, rcx, r8, r9, with the mnemonic “Diana’s silk dress cost $89.” I don’t know a cute mnemonic for the syscall registers.
[2] It’s actually worse than that. You’d think you could do
but this makes the resulting code non-position-independent on x86. What this means is that the code must know at link time what address it will be loaded at, which breaks the position-independent requirement of many modern platforms.
Indeed, this code will produce a linker error like the following:
Not only is .int foo a problem, but so is referring to pointers. Instead we must write
to be able to load the address of pointers at all. This can be worked around if you’re smart; after all, it is possible to put the addresses of functions into static variables and not have the linker freak out. It’s too hard to do in inline assembly tho.
[3] Basically this code, which can’t be properly-expressed in Rust.