2023-03-29 • 5413 words • 60 minutes
• #dark-arts • #assembly • #concurrency • #rust

Atomicless Concurrency

Let’s say we’re building an allocator. Good allocators need to serve many threads simultaneously, and as such any lock they take is going to be highly contended. One way to work around this, pioneered by TCMalloc, is to have thread-local caches of blocks (hence, the “TC” - thread cached).

Unfortunately threads can be ephemeral, so book-keeping needs to grow dynamically, and large, complex programs (like the Google Search ranking server) can have tens of thousands of threads, so per-thread cost can add up. Also, any time a thread context-switches and resumes, its CPU cache will contain different cache lines – likely the wrong ones. This is because either another thread doing something compeltely different executed on that CPU, or the switched thread migrated to execute on a different core.

These days, instead of caching per-thread, TCMalloc uses per-CPU data. This means that book-keeping is fixed, and this is incredibly friendly to the CPU’s cache: in the steady-state, each piece of the data will only ever be read or written to by a single CPU. It also has the amazing property that there are no atomic operations involved in the fast path, because operations on per-CPU data, by definition, do not need to be synchronized with other cores.

This post gives an overview of how to build a CPU-local data structure on modern Linux. The exposition will be for x86, but other than the small bits of assembly you need to write, the technique is architecture-independent.

The Kernel Primitive

Concurrency primitives require cooperating with the kernel, which is responsible for global scheduling decisions on the system. However, making syscalls is quite expensive; to alieviate this, there has been a trend in Linux to use shared memory as a kernelspace/userspace communication channel.

Futexes are the classic “cas-with-the-kernel” syscall (I’m assuming basic knowledge of atomic operations like cas in this article). In the happy path, we just need to cas on some memory to lock a futex, and only make a syscall if we need to go to sleep because of contention. The kernel will perform its own cas on this variable if necessary.

Restartable sequences are another such proto-primitive, which are used for per-CPUuprogramming. The relevant syscall for us, rseq(2), was added in Linux 4.18. Its manpage reads

reference
A restartable sequence is a sequence of instructions guaranteed to be executed atomically with respect to other threads and signal handlers on the current CPU. If its execution does not complete atomically, the kernel changes the execution flow by jumping to an abort handler defined by userspace for that restartable sequence.

A restartable sequence, or “rseq” is a special kind of critical section that the kernel guarantees executes from start to finish without any kind of preemption. If preemption does happen (because of a signal or whatever), userspace observes this as a jump to a special handler for that critical section. Conceptually it’s like handling an exception:

try {
  // Per-CPU code here.
} catch (PremptionException) {
  // Handle having been preempted, which usally just means
  // "try again".
}

C++

These critical sections are usually of the following form:

Read the current CPU index (the rseq mechanism provides a way to do this).
Index into some data structure and do something to it.
Complete the operation with a single memory write. This is the “commit”.

All the kernel tells us is that we couldn’t finish successfully. We can always try again, but the critical section needs to be such that executing any prefix of it, up to the commit, has no effect on the data structure. We get no opportunity to perform “partial rollbacks”.

In other words, the critical section must be a transaction.

Enabling `rseq`

Using rseqs requires turning on support for it for a particular thread; this is what calling rseq(2) (the syscall) accomplishes.

The signature for this syscall looks like this:

// This type is part of Linux's ABI.
#[repr(C, align(32))]
struct Rseq {
  cpu_id_start: u32,
  cpu_id: u32,
  crit_sec: u64,
  flags: u32,
}

// Note: this is a syscall, not an actual Rust function.
fn rseq(rseq: *mut Rseq, len: u32, flags: i32, signature: u32) -> i32;

Related Posts