Better Trait Resolution in Rust

Traits are the core of polymorphism in Rust. Let’s review:

trait Stringer {
  fn string(&self) -> String;
}

impl Stringer for i32 {
  fn string(&self) -> String {
    format!("{self}")
  }
}

// Prints `42`.
println!("{}", 42.string());

Notice that we call the trait method Stringer::string directly on the value in question. This means that traits (at least, those currently in scope) inject their methods into the namespace of anything that implements them.

Now, this isn’t immediately a problem, because Rust’s namespace lookup rules are such that methods inherent to a type are searched for first:

trait Stringer {
  fn string(&self) -> String;
}

struct Woofer;
impl Stringer for Woofer {
  fn string(&self) -> String {
    format!("woof")
  }
}

impl Woofer {
  fn string(&self) -> String {
    format!("bark")
  }
}

// Prints `bark`.
println!("{}", Woofer.string());

This means that traits cannot easily break downstream code by adding new methods, but there are a few possible hazards:

  • If the owner of a type adds a method with the same name as a trait method, it will override direct (i.e., foo.string()) calls to that trait method, even if the type owner is unaware of the trait method.

  • If traits A and B are in scope, and String implements both, and we call str.foo() (which resolves to A::foo()), and later B adds a new method B::foo(), the callsite for String will break. A and B’s owners do not need to be aware of each other for this to happen.

Of course, Rust has a disambiguation mechanism. Given any trait implementation Foo: Bar, we can reference its items by writing <Foo as Bar>::baz. However, this syntax is very unweildy (it doesn’t work with method chaining), so it doesn’t get used. As a result, small evolution hazards can build up in a large codebase.

Those who know me know that I often talk about a syntax that I call foo.Trait::method(), or “qualified method call syntax”. In this post, I want to discuss this syntax in more detail, and some related ideas, and how they factor into type and trait design.

Paths-as-Methods

This idea isn’t new; others have proposed it, and it forms the core of Carbon’s version of trait method calls (you can read more about Carbon’s name lookup story here).

Let’s recreate the original example in Carbon (bear in mind that I am not an expert on this language, and the semantics are still up in the air).

interface Stringer {
  fn String[self: Self]() -> String;
}

external impl i32 as Stringer {
  fn String[self: Self]() -> String { ... }
}

fn Run() {
  Carbon.Print(42.(Stringer.String)())
}
Carbon

Notice 42.(Stringer.String)(): Carbon requires that we qualify the method call, because 42 has the concrete type i32. If this were in a generic context and we had a type variable bounded by Stringer, we could just write x.String(); no ambiguity.

In Carbon, all qualification uses ., so they have to add parens. Because Rust uses :: for qualifying paths, we don’t have this syntactic abiguity, so we can augment the syntax to allow more path expressions after the ..

The current grammar is

MethodCallExpression :
   Expression . PathExprSegment (CallParams?)

PathExprSegment :
   PathIdentSegment (:: GenericArgs)?
Plaintext

That is, exactly one identifier and an optional turbofish. I would like to see this extended to allow any QualifiedPathInExpression after the . and before the parens. This would allow, for example:

  • expr.Trait::method()
  • expr.Self::method()
  • expr.<Foo as Trait>::method()
  • expr.::path::to::Trait::<Args>::method::<MoreArgs>()

These would all be desugared as the equivalent UFCS, taking into account that method call syntax can trigger autoref.

  • Trait::method(expr)
  • Self::method(expr)
  • <Foo as Trait>::method(expr)
  • ::path::to::Trait::<Args>::method::<MoreArgs>(expr)

The method would still need to be valid to have been called via . syntax; I’m not proposing we add the following, even though it is unambiguous.

fn square(x: i32) -> i32 {
  x * x
}

// Would be equivalent to `square(42)`.
42.self::square()
Rust

Trait method callers can now use qualified method call syntax where they might want to use UFCS without issues around wordiness.

Impl Modality

Of course, this isn’t the only idea from Carbon’s interfaces worth stealing; Carbon also has a notion of “external” and “internal” impls; I will call these “impl modes”.

An external impl is like the one we showed above, whose methods can only be found by qualified lookup: foo.(Bar.Baz)(). An internal impl is one which is “part” of a type.

interface Stringer {
  fn String[self: Self]() -> String;
}

class Woofer {
  impl as Stringer {
    fn String[self: Self]() -> String { ... }
  }
}

fn Run() {
  let w: Woofer = ...;
  w.String()  // Unqualified!
}
Carbon

This also implies that we don’t need to import Stringer to call w.String().

There are definitely traits in Rust which fit into these modes.

Clone and Iterator almost always want to be internal. An iterator exists to implement Iterator, and cloning is a fundamental operation. Because both of these traits are in the prelude, it’s not a problem, but it is a problem for traits provided by a non-std crate, like rand::Rng. The lack of a way to do this leads to the proliferation of prelude modules and namespace pollution. (I think that preludes are bad library design.)

On the other hand, something like Debug wants to be external very badly. It almost never makes sense to call foo.fmt(), since that gets called for you by println! and friends; not to mention that all of the std::fmt traits have a method fmt(), making such a call likely to need disambiguation with UFCS. Borrow is similar; it exists to be a bound for things like Cow more than to provide the .borrow() method.

There’s also a third mode, which I will call “extension impls”. These want to inject methods into a type, either to extend it, like itertools, or as part of some framework, like tap. This use of traits is somewhat controversial, but I can sympathize with wanting to have this.

If we have paths-as-methods, we can use this classification to move towards something more like the Carbon model of method lookup, without impacting existing uses.

My strawman is to add a #[mode] attribute to place on trait impls, which allows a caller to select the behavior:

  • #[mode(extension)] is today’s behavior. The impl’s trait must be in scope so that unqualified calls like foo.method() resolve to it.
  • #[mode(internal)] makes it so that foo.method() can resolve to a method from this impl without its trait being in scope1. It can only be applied to impls that are such that you could write a corresponding inherent impl, so things like #[mode(internal)] impl<T> Trait for T { .. } are forbidden.
  • #[mode(external)] makes it so that foo.method() never resolves to a method from this impl. It must be called as Trait::method(foo) or foo.Trait::method().

Every trait would be #[mode(extension)] if not annotated, and it would be easy to migrate to external-by-default across an edition. Similarly, we could change whether a std impl is external vs extension based on the edition of the caller, and provide a cargo fix rewrite to convert from foo.method() to foo.Trait::method().

It may also make sense for traits to be able to specify the default modality of their impls, but I haven’t thought very carefully about this.

Note that moving in the external -> extension -> internal direction is not a breaking change, but moving the other way is.

What about Delegation?

A related feature that I will touch on lightly is delegation; that is, being able to write something like the following:

// crate a
struct Foo { ... }
impl Foo {
  fn boing() -> i32 { ... }
}

// crate b
trait Bar {
  fn boing() -> i32;
}

impl Bar for Foo {
  use Foo::boing;
}
Rust

The use in the impl indicates that we want to re-use Foo::boing to implement <Foo as Bar>::boing. This saves us having to write out a function signature, and results in less work for the compiler because that’s one less function we risk asking LLVM to codegen for us (at scale, this is a Big Deal).

You could imagine using delegation instead of #[mode]:

struct Woofer;
impl Stringer for Woofer {
  fn string(&self) -> String {
    format!("woof")
  }
}

impl Woofer {
  use Stringer::*;
}
Rust

The reason I haven’t gone down this road is because delegation is a very large feature, and doesn’t give us a clean way to express #[mode(external)], which is a big part of what I want. A delegation-compatible way to express this proposal is to not add #[mode(internal)], and add use Trait::method; and use Trait::*; (and no other variations) inside of inherent impl blocks.

Conclusion

I don’t have the personal bandwidth to write RFCs for any of this stuff, but it’s something I talk about a lot as a potential evolution hazard for Rust. I hope that putting these ideas to paper can help make name resolution in Rust more robust.

  1. This needs to carry a bunch of other restrictions, because it’s equivalent to adding inherent methods to the implee. For example, none of the methods can have the same name as a method in any other inherent or internal impl block, and internal impl methods should take lookup priority over extension impl methods during name lookup. 

Atomicless Concurrency

Let’s say we’re building an allocator. Good allocators need to serve many threads simultaneously, and as such any lock they take is going to be highly contended. One way to work around this, pioneered by TCMalloc, is to have thread-local caches of blocks (hence, the “TC” - thread cached).

Unfortunately threads can be ephemeral, so book-keeping needs to grow dynamically, and large, complex programs (like the Google Search ranking server) can have tens of thousands of threads, so per-thread cost can add up. Also, any time a thread context-switches and resumes, its CPU cache will contain different cache lines – likely the wrong ones. This is because either another thread doing something compeltely different executed on that CPU, or the switched thread migrated to execute on a different core.

These days, instead of caching per-thread, TCMalloc uses per-CPU data. This means that book-keeping is fixed, and this is incredibly friendly to the CPU’s cache: in the steady-state, each piece of the data will only ever be read or written to by a single CPU. It also has the amazing property that there are no atomic operations involved in the fast path, because operations on per-CPU data, by definition, do not need to be synchronized with other cores.

This post gives an overview of how to build a CPU-local data structure on modern Linux. The exposition will be for x86, but other than the small bits of assembly you need to write, the technique is architecture-independent.

The Kernel Primitive

Concurrency primitives require cooperating with the kernel, which is responsible for global scheduling decisions on the system. However, making syscalls is quite expensive; to alieviate this, there has been a trend in Linux to use shared memory as a kernelspace/userspace communication channel.

Futexes are the classic “cas-with-the-kernel” syscall (I’m assuming basic knowledge of atomic operations like cas in this article). In the happy path, we just need to cas on some memory to lock a futex, and only make a syscall if we need to go to sleep because of contention. The kernel will perform its own cas on this variable if necessary.

Restartable sequences are another such proto-primitive, which are used for per-CPUuprogramming. The relevant syscall for us, rseq(2), was added in Linux 4.18. Its manpage reads

A restartable sequence is a sequence of instructions guaranteed to be executed atomically with respect to other threads and signal handlers on the current CPU. If its execution does not complete atomically, the kernel changes the execution flow by jumping to an abort handler defined by userspace for that restartable sequence.

A restartable sequence, or “rseq” is a special kind of critical section that the kernel guarantees executes from start to finish without any kind of preemption. If preemption does happen (because of a signal or whatever), userspace observes this as a jump to a special handler for that critical section. Conceptually it’s like handling an exception:

try {
  // Per-CPU code here.
} catch (PremptionException) {
  // Handle having been preempted, which usally just means
  // "try again".
}
C++

These critical sections are usually of the following form:

  1. Read the current CPU index (the rseq mechanism provides a way to do this).
  2. Index into some data structure and do something to it.
  3. Complete the operation with a single memory write. This is the “commit”.

All the kernel tells us is that we couldn’t finish successfully. We can always try again, but the critical section needs to be such that executing any prefix of it, up to the commit, has no effect on the data structure. We get no opportunity to perform “partial rollbacks”.

In other words, the critical section must be a transaction.

Enabling rseq

Using rseqs requires turning on support for it for a particular thread; this is what calling rseq(2) (the syscall) accomplishes.

The signature for this syscall looks like this:

// This type is part of Linux's ABI.
#[repr(C, align(32))]
struct Rseq {
  cpu_id_start: u32,
  cpu_id: u32,
  crit_sec: u64,
  flags: u32,
}

// Note: this is a syscall, not an actual Rust function.
fn rseq(rseq: *mut Rseq, len: u32, flags: i32, signature: u32) -> i32;
Rust

The syscall registers “the” Rseq struct for the current thread; there can be at most one, per thread.

rseq is a pointer to this struct. len should be size_of::<Rseq>(), and signature can be any 32-bit integer (more on this later). For our purposes, we can ignore flags on the struct.

flags on the syscall, on the other hand, is used to indicate whether we’re unregistering the struct; this is explained below.

In the interest of exposition, we’ll call the syscall directly. If you’ve never seen how a Linux syscall is done (on x86), you load the syscall number into rax, then up to six arguments in rdi, rsi, rdx, r10, r8, r91. We only need the first four.

The return value comes out in rax, which is 0 on success, and a negative of an errno code otherwise. In particular, we need to check for EINTR to deal with syscall interruption. (every Linux syscall can be interrupted).

unsafe fn raw_rseq(rseq: *mut Rseq, unregister: bool, signature: u32) {
  // Perform an open-coded Linux syscall.
  loop {
    let mut rax = 334;  // rseq(2) syscall number; x86-specific.
    asm! {
      "syscall",
      inout("rax") rax,
      /* rseq:      */ in("rdi") rseq,
      /* len:       */ in("rsi") mem::size_of::<Rseq>(),
      /* flags:     */ in("rdx") unregister as u64,
      /* signature: */ in("r10") signature,
    }
    match rax {
      0 => break,      // Success, we're done.
      -4 => continue,  // EINTR, try again.
      errno => panic!("error calling rseq(2): {}", -errno),
    }
  }
}
Rust

Note the unregister parameter: this is used to tear down rseq support on the way out of a thread. Generally, rseq will be a thread-local, and registration happens at thread startup. Glibc will do this and has a mechanism for acquiring the rseq pointer. Unfortunately, the glibc I have isn’t new enough to know to do this, so I hacked up something to register my own thread local.

I had the bright idea of putting my Rseq struct in a box, which triggered an interesting bug: when a thread exits, it destroys all of the thread local variables, including the box to hold our Rseq. But if the thread then syscalls to deallocate its stack, when the kernel goes to resume, it will attempt to write the current CPU index to the rseq.cpu_id field.

This presents a problem, because the kernel is probably going to write to a garbage location. This is all but guaranteed to result in a segfault. Debuggers observe this as a segfault on the instruction right after the syscall instruction; I spent half an hour trying to figure out what was causing a call to madvise(2) to segfault.

Hence, we need to wrap our thread local in something that will call rseq(2) to unregister the struct. Putting everything together we get something like this.

fn current_thread_rseq() -> *mut Rseq {
  // This has to be its own struct so we can run a thread-exit destructor.
  pub struct RseqBox(Box<UnsafeCell<Rseq>>);
  impl Drop for RseqBox {
    fn drop(&mut self) {
      unsafe { raw_rseq(self.0.get(), true, RSEQ_SIG); }
    }
  }

  thread_local! {
    static RSEQ: RseqBox = {
      // Has to be in a box, since we need pointer stability.
      let rseq = RseqBox(Box::new(UnsafeCell::new(Rseq {
        cpu_id_start: 0,
        cpu_id: !0,
        crit_sec: 0,
        flags: 0,
      })));

      // Register it!!!
      unsafe { raw_rseq(rseq.0.get(), false, RSEQ_SIG); }
      rseq
    };
  }

  RSEQ.with(|ra| ra.0.get())
}
Rust

Per Rust’s semantics, this will execute the first time we access this thread local, instead of at thread startup. Not ideal, since now we pay for an (uncontended) atomic read every time we touch RSEQ, but it will do.

Creating a Critical Section

To set up and execute a restartable sequence, we need to assemble a struct that describes it. The following struct is also defined by Linux’s syscall ABI:

#[repr(C, align(32))]
struct CritSec {
  version: u32,
  flags: u32,
  start: u64,
  len: u64,
  abort_handler: u64,
}
Rust

start is the address of the first instruction in the sequence, and len is the length of the sequence in bytes. abort_handler is the address of the abort handler. version must be 0 and we can ignore flags.

Once we have a value of this struct (on the stack or as a constant), we grab RSEQ and atomically store the address of our CritSec to RSEQ.crit_sec. This needs to be atomic because the kernel may decide to look at this pointer from a different CPU core, but it likely will not be contended.

Note that RSEQ.crit_sec should be null before we do this; restartable sequences can’t nest.

Next time the kernel preempts our thread (and later gets ready to resume it), it will look at RSEQ.crit_sec to decide if it preempted a restartable sequence and, if so, jump to the abort handler.

Once we finish our critical section, we must reset RSEQ.crit_sec to 0.

Labels and Constants, Oh My

There is a wrinkle: we would like for our CritSec value to be a constant, but Rust doesn’t provide us with a way to initialize the start and abort_handler fields directly, since it doesn’t have a way to refer2 to the labels (jump targets) inside the inline assembly. The simplest way to get around this is to assemble (lol) the CritSec on the stack, with inline assembly. The overhead is quite minimal.

On x86, this is what our boilerplate will look like:

let mut cs = MaybeUninit::<CritSec>::uninit();
let mut ok = 1;

asm! {r"
  // We meed to do `rip`-relative loads so that this code is PIC;
  // otherwise we'll get linker errors. Thus, we can't `mov`
  // directly; we need to compute the address with a `lea`
  // first.

  // Initialize the first two fields to zero.
  mov qword ptr [{_cs}], 0

  // Load `90f` into `cs.start`. Note that this is 'forward
  // reference' to the jump target `90:` below.
  lea {_pc}, [90f + rip]
  mov qword ptr [{_cs} + 8], {_pc}

  // We need to get the difference `91f - 90f` into `cs.len`.
  // To do that, we write `-90f` to it, and then add `91f`.
  neg {_pc}
  mov qword ptr [{_cs} + 16], {_pc}
  lea {_pc}, [91f + rip]
  add qword ptr [{_cs} + 16], {_pc}

  // Same as the first line, but loading `cs.abort_handler`.
  lea {_pc}, [92f + rip]
  mov qword ptr [{_cs} + 24], {_pc}

  // Write `&cs` to `RSEQ.crit_sec`. This turns on
  // restartable sequence handling.
  mov qword ptr [{rseq} + 8], {_cs}

90:
  // Do something cool here (coming soon).

91:
  // Jump over the abort handler.
  jmp 93f

  .int 0x53053053  // The signature!
92:
  // Nothing special, just zero `ok` to indicate this was a failure.
  // This is written this way simply because we can't early-return
  // out of inline assembly.
  xor {_ok:e}, {_ok:e}

93:
  // Clear `RSEQ.crit_sec`, regardless of which exit path
  // we took.
  mov qword ptr [{rseq} + 8], 0
  ",
  _pc = out(reg) _,
  _ok = inout(reg) ok,
  _cs = in(reg) &mut cs as *mut CritSec,
  rseq = in(reg) current_thread_rseq(),
}
Rust

A few things to note:

  1. Because this is inline assembly, we need to use numeric labels. I’ve chosen labels in the 90s for no particular reason. 90: declares a jump target, and 90f is a forward reference to that instruction address.
  2. Most of this assembly is just initalizing a struct3. It’s not until the mov right before 90: (the critical section start) that anything interesting happens.
  3. Immediately before 92: (the abort handler) is an .int directive that emits the same four-byte signature we passed to rseq(2) into the instruction stream. This must be here, otherwise the kernel will issue a segfault to the thread. This is a very basic control-flow integrity feature.
  4. We clear RSEQ.crit_sec at the very end.

This is a lot of boilerplate. In an ideal world, we could have something like the following:

fn run_rseq(cs: extern "C" unsafe fn(u32));
Rust

Unfortunately, this is very hard to do, because the constraints on restartable sequences are draconian:

  • Can’t jump out of the critical section until it completes or aborts. This means you can’t call functions or make syscalls!
  • Last instruction must be the commit, which is a memory store operation, not a return.

This means that you can’t have the compiler generating code for you; it might outline things or move things around in ways you don’t want. In something like ASAN mode, it might inject function calls that will completely break the primitive.

This means we muyst write our critical section in assembly. That assembly also almost unavoidably needs to be part of the boilerplate given above, and it means it can’t participate in ASAN or TSAN instrumentation.

In the interest of exposition, we can build a wrapper over this inline assembly boilerplate that looks something like this:

let result: Result<(), RseqAbort> = rseq! {r"
    // Assembly for our critical section...
  ",
  // Inline asm constraints.
};
Rust

When I wrote the snippet above, I chose numeric labels in the 90s to avoid potential conflicts with whatever assembly gets pasted here. This is also why I used a leading _ on the names of some of the assembly constraints; thise are private to the macro. rseq isn’t, though, since callers will want to access the CPU id in it.

The intent is for the assembly string to be pasted over the // Do something cool here comment, and for the constraints to be tacked on after the boilerplate’s constraints.

But with that we now have access to the full rseq primitive, in slightly sketchy macro form. Let’s use it to build a CPU-local data structure.

A Checkout Desk

Let’s say we have a pool of objects that are needed to perform an allocation, our putative page caches. Let’s say we have the following interface:

impl FreeList {
  unsafe fn get_cache(&self) -> *mut PageCache;
  unsafe fn return_cache(&self, *mut PageCache);
}
Rust

get_cache() grabs a cache of pages off the global free list. This requires taking a lock or traversing a lockless linked list, so it’s pretty expensive. return_cache() returns a cache back to the global free list for re-use; it is a similarly expensive operation. Both of these operations are going to be contended like crazy, so we want to memoize them.

To achieve this, we want one slot for every CPU to hold the cache it (or rather, a thread running on it) most recently acquired, so that it can be reused. These slots will have “checkout desk” semantics: if you take a thing, you must put something in its place, even if it’s just a sign that says you took the thing.

Matthew Kulukundis came up with this idea, and he’d totally put this gif in a slide deck about this data structure.

As a function signature, this is what it looks like:

impl<T> PerCpu<T> {
  fn checkout(&self, replacement: *mut T) -> *mut T;
}
Rust

We can then use it like this.

let free_list: &FreeList = ...;
let per_cpu: &PerCpu<PageCache> = ...;
let iou = 0 as *mut PageCache;

// Check out this CPU's cache pointer, and replace it with
// an IOU note (a null pointer).
let mut cache = per_cpu.checkout(iou);
if cache == iou {
  // If we got an IOU ourselves, this means another thread that
  // was executing on this CPU took the cache and left *us* with
  // a null, so we need to perform the super-expensive operation
  // to acquire a new one.
  cache = free_list.get_cache();
}

// Do stuff with `cache` here. We have unique access to it.
cache.alloc_page(...);

// Return the pointer to the checkout desk.
cache = per_cpu.checkout(cache);
if cache != iou {
  // Usually, we expect to get back the IOU we put into the cache.
  // If we don't, that probably means another thread (or
  // hundreds) are hammering this slot and fighting for page caches.
  // If this happens, we need to throw away the cache.
  free_list.return_cache(cache);
}
Rust

The semantics of PerCpu<T> is that it is an array of nprocs (the number of logical cores on the system) pointers, all initialized to null. checkout() swaps the pointer stored in the current CPU’s slot in the PerCpu<T> with the replacement argument.

Building the Checkout Desk

The implementation of this type is relatively simple, but the devil is in the details. Naively, you’d think you literally want an array of pointers:

pub struct PerCpu<T> {
  ptrs: Box<[*mut T]>,
}

unsafe impl<T> Send for PerCpu<T> {}
unsafe impl<T> Sync for PerCpu<T> {}
Rust

Unfortunately, this is cache-hostile. We expect that (depending on how ptrs is aligned in memory) for eight CPUs’ checkout pointers to be on the same cache line. This means eight separate cores are going to be writing to the same cache line, which is going to result in a lot of cache thrash. This memory wants to be in L1 cache, but will probably wind up mostly in shared L3 cache.

This effect is called “false sharing”, and is a fundamental part of the design of modern processors. We have to adjust for this.

Instead, we want to give each core a full cache line (64 bytes aligned to a 64-byte boundary) for it to store its pointer in. This sounds super wasteful (56 of those bytes will go unused), but this is the right call for a perf-sensitive primitive.

This amount of memory can add up pretty fast (two whole pages of memory for a 128-core server!), so we’ll want to lazilly initialize them. Our cache-friendly struct will look more like this:

pub struct PerCpu<T> {
  ptrs: Box<[AtomicPtr<CacheLine<*mut T>>]>,
}

// This struct wraps a T and forces it to take up an entire cache line.
#[repr(C, align(64))]
struct CacheLine<T>(T);

unsafe impl<T> Send for PerCpu<T> {}
unsafe impl<T> Sync for PerCpu<T> {}
Rust

Initializing it requires finding out how many cores there are on the machine. This is a… fairly platform-specific affair. Rust does offer a “maximum paralellism” query in its standard library, but it is intended as a hint for how many worker threads to spawn, as opposed to a hard upper bound on the number of CPU indices.

Instead, we call get_nprocs_conf(), which is fine since we’re already extremely non-portable already. This is a GNU libc extension.

In code…

impl<T> PerCpu<T> {
  pub fn new() -> Self {
    extern "C" {
      // #include <sys/sysinfo.h>
      //
      // This function returns the maximum number of cores the
      // kernel knows of for the current machine. This function
      // is very expensive to call, so we need to cache it.
      fn get_nprocs_conf() -> i32;
    }

    static mut NPROCS: usize = 0;
    static INIT: Once = Once::new();
    INIT.call_once(|| unsafe {
      NPROCS = get_nprocs_conf() as usize;
    });

    let len = unsafe { NPROCS };
    let mut ptrs = Vec::with_capacity(len);
    for _ in 0..len {
      ptrs.push(AtomicPtr::new(ptr::null_mut()));
    }

    Self { ptrs: ptrs.into_boxed_slice() }
  }
}
Rust

(I’m not going to implement Drop for this type. That’s an exercise for the reader.)

Implementing checkout()

Now’s the moment we’ve all be waiting for: writing our restartable sequence. As critical sections go, this one’s pretty simple:

  1. Index into the ptrs array to get this CPU’s pointer-to-cache-line.
  2. If that pointer is null, bail out of the rseq and initialize a fresh cache line (and then try again).
  3. If it’s not null, swap replacement with the value in the cache line.
impl<T> PerCpu<T> {
  fn checkout(&self, mut replacement: *mut T) -> *mut T {
    // We need to try this operation in a loop, to deal with
    // rseq aborts.
    loop {
      let ptrs = self.ptrs.as_ptr();
      let mut vcpu: i32 = -1;
      let mut need_alloc: i32 = 1;
      let result: Result<(), RseqAbort> = rseq! {r"
        // Load the current CPU number.
        mov {vcpu:e}, dword ptr [{rseq} + 4]

        // Load the `vcpu`th pointer from `ptrs`.
        // On x86, `mov` is atomic. The only threads we might
        // be condending with are those that are trying to
        // initialize this pointer if it's null.
        mov {scratch}, qword ptr [{ptrs} + 8 * {vcpu:r}]

        // If null, exit early and trigger an allocation
        // for this vcpu.
        test {scratch}, {scratch}
        jz 1f

        // Make sure the outer code knows not to allocate
        // a new cache line.
        xor {need_alloc:e}, {need_alloc:e}

        // Commit the checkout by exchanging `replacement`.
        xchg {ptr}, qword ptr [{scratch}]
      1:
        ",
        ptrs = in(reg) ptrs,
        scratch = out(reg) _,
        ptr = inout(reg) replacement,
        vcpu = out(reg) vcpu,
        need_alloc = inout(reg) need_alloc,
      };

      // We got preempted, so it's time to try again.
      if result.is_err() { continue }

      // If we don't need to allocate, we're done.
      if need_alloc == 0 { return replacement }

      // Otherwise, allocate a new cache line and cas it into
      // place. This is Atomics 101, nothing fancy.
      let mut cache_line = Box::new(CacheLine(ptr::null_mut()));
      loop {
        let cas = self.ptrs[vcpu as usize].compare_exchange_weak(
          ptr::null_mut(), cache_line.as_mut(),
          Ordering::AcqRel, Ordering::Relaxed,
        );

        match cas {
          Ok(p) => {
            // Successful allocation.
            debug_assert!(p.is_null());
            // Make sure to stop `cache_line`'s memory
            // from being freed by `Box`'s dtor.
            mem::forget(cache_line);
            break;
          }
          // Try again: this is a spurious failure.
          Err(p) if p.is_null() => continue,
          // Someone got here first; we can just discard
          // `Box`.
          Err(_) => break,
        }
      }
    }
  }
}
Rust

This code listing is a lot to take in. It can be broken into two parts: the restartable sequence itself, and the allocation fallback if the pointer-to-cache-line happens to be null.

The restartable sequence is super short. It looks at the pointer-to-cache-line, bails if its null (this triggers the later part of the function) and then does an xchg between the actual *mut T in the per-CPU cache line, and the replacement.

If the rseq aborts, we just try again. This is short enough that preemption in the middle of the rseq is quite rare. Then, if need_alloc was zeroed, that means we successfully committed, so we’re done.

Otherwise we need to allocate a cache line for this CPU. We’re now outside of the rseq, so we’re back to needing atomics. Many threads might be racing to be the thread that initializes the pointer-to-cache-line; we use a basic cas loop to make sure that we only initialize from null, and if someone beats us to it, we don’t leak the memory we had just allocated. This is an RMW operation, so we want both acquire and release ordering. Atomics 101!

Then, we try again. Odds are good we won’t have migrated CPUs when we execute again, so we won’t need to allocate again. Eventually all of the pointers in the ptrs array will be non-null, so in the steady state this needs_alloc case doesn’t need to happen.

Conclusion

This is just a glimpse of what per-CPU concurrent programming looks like. I’m pretty new to it myself, and this post was motivated by building an end-to-end example in Rust. You can read more about how TCMalloc makes use of restartable sequences here.

  1. This is annoyingly different from the function calling convention, which passes arguments in rdi, rsi, rdx, rcx, r8, r9, with the mnemonic “Diana’s silk dress cost $89.” I don’t know a cute mnemonic for the syscall registers. 

  2. It’s actually worse than that. You’d think you could do

    jmp foo
    pointers:
      .int foo
    
    foo:
      mov qword ptr [rax], pointers
    x86 Assembly

    but this makes the resulting code non-position-independent on x86. What this means is that the code must know at link time what address it will be loaded at, which breaks the position-independent requirement of many modern platforms.

    Indeed, this code will produce a linker error like the following:

    = note: /usr/bin/ld: /home/mcyoung/projects/cpulocal/target/debug/deps/cpulocal-a7eeabaf0b1f2c43.2l48u2rfiak1q1ik.rcgu.o:
          relocation R_X86_64_32 against `.text._ZN8cpulocal15PerCpu$LT$T$GT$8checkout17h42fde3ce3bd0180aE'
          can not be used when making a PIE object; recompile with -fPIE
          collect2: error: ld returned 1 exit status
    Plaintext

    Not only is .int foo a problem, but so is referring to pointers. Instead we must write

    lea rcx, qword ptr [pointers + rip]
    mov qword ptr [rax], rcx
    x86 Assembly

    to be able to load the address of pointers at all. This can be worked around if you’re smart; after all, it is possible to put the addresses of functions into static variables and not have the linker freak out. It’s too hard to do in inline assembly tho. 

  3. Basically this code, which can’t be properly-expressed in Rust.

    let cs = CrtiticalSection {
      version: 0,
      flags: 0,
      start: /* &90f */,
      len: /* &91f - &90f */,
      abort: /* &92f */,
    };
    Rust

3Hz Computer, Hold the Transistors

I’m not really one to brag publicly about expensive toys, but a few weeks ago I managed to get one that’s really something special. It is a Curta Type II, a mechanical digital1 calculator manufactured in Liechtenstein between the 50s and 70s, before solid-state calculators killed them and the likes of slide-rules.

I have wanted one since I was a kid, and I managed to win an eBay auction for one.

The Curta

The Curta Type II (and Solomon the cat)

It’s a funny looking device, somewhere between a peppermill and a scifi grenade. Mine has serial number 544065, for those keeping score, and comes in a cute little bakelite pod (which has left hand thread?!).

I wanna talk about this thing because unlike something like a slide rule, it shares many features with modern computers. It has operations, flags, and registers. Its core primitive is an adder, but many other operations can be built on top of it: it is very much a platform for complex calculations.

I’m the sort of person who read Hacker’s Delight for fun, so I really like simple numerical algorithms. This article is a survey of the operation of a Curta calculator and algorithms you can implement on it, from the perspective of a professional assembly programmer.

Many of the algorithms I’m going to describe here exist online, but I’ve found them to be a bit difficult to wrap my head around, so this article is also intended as a reference card for myself.

Let’s dive in!

A Well-Lubricated ALU

There are two Curta models, Type I and Type II, which primarily differ in the sizes of their registers. I have a Type II, so I will focus on the layout of that one.

The Curta is not a stored program computer like the one you’re reading this article on. An operator needs to manually execute operations. It is as if we had taken a CPU and pared it down to two of its most basic components: a register file and an arithmetic logic unit (ALU).

The Register File

The Curta’s register file consists of three digital registers, each of which contains a decimal integer (i.e., each digit is from 0 to 9, rather than 0 to 1 like on a binary computer):

  • sr, the setting register, is located on the side of the device. The value in sr can be set manually by the operator using a set of knobs on the side of the device. The machine will never write to it, only read from it. It has 11 digits.
  • rr, the results register, is located at the top of the device along the black part of the dial. It is readable and writable by the machine, but not directly modifiable by the operator. It has 15 digits.
  • cr, the counting register, is located next to rr along the silver part of the dial. Like rr, it is only machine-modifiable. It has 8 digits.

sr, set to 1997.

rr is the black dial; cr is the silver one.

There are also two settings on the device that aren’t really registers, but, since they are changed as part of operation, they are a lot like the control registers of a modern computer.

The carriage (there isn’t an abbreviation for this one, so I’ll call it ca) is the upper knurled ring on the machine. It can be set to a value from 0 to 72. To set it, the operator lifts the ring up (against spring tension), twists it, and lets it spring back into the detent for the chosen value. This is a one-hand motion.

There is a small triangle in the middle of the top of the device that points at which of the digits in cr will get incremented.

ca raised and in motion.

Finally, rl, the reversing lever, is a small switch near the back of the device that can be in the up or down position. This is like a flag register: up is cleared, down is set.

rl in the up position.

The Instruction Set

We have all this memory, but the meat of a machine is what it can do. I will provide an instruction set for the Curta to aid in giving rigorous descriptions of operations you can perform with it.

The core operation of the Curta is “add-with-shift-and-increment”. This is a mouthful. At the very top of the machine is the handle, which is analogous to a clock signal pin. Every clockwise turn of this handle executes one of these operations. Internally, this is implemented using a variation on the Leibniz gear, a common feature of mechanical calculators.

The handle in "addition" mode.

This operation is not that complicated, it just does a lot of stuff. It takes the value of sr, left-shifts it (in decimal) by the value in ca, and adds it to rr. Also, it increments CR by 1 shifted by ca. In other words:

rr += sr << ca
cr += 1 << ca
Plaintext

Recall that this is a decimal machine, so << is the same as multiplication by a power of 10, not a power of 2.

Addition can overflow, and it wraps around as expected: adding one to 999_999_999_999_999_999 already in rr will fill it with zeroes.

Pulling the handle up reveals a red ring, indicating the machine is in subtraction mode. This flips the signs of both the rr and cr modifications:

rr -= sr << ca
cr -= 1 << ca
Plaintext

The handle in "subtraction" mode.

The Curta cannot handle negative numbers, so it will instead display the ten’s complement3 of a negative result. For example, subtracting 1 from 0 will produce all-nines.

You can detect when underflow or overflow occurs when the resulting value is unexpectedly larger or smaller than the prior value in rr, respectively. (This trick is necessary on architectures that lack a carry flags register, like RISC-V.)

Setting rl will reverse the sign of the operation done on cr during a turn of the handle. In addition mode, it will cause cr to be subtracted from, while in subtraction mode, it will cause it to be added to. Some complex algorithms make use of this.

Finally, the clearing lever can be used to clear (to zero) sr or rr, independently. It is a small ring-shaped lever that, while the carriage is raised, can be wiped past digits to clear them. Registers cannot be partially cleared.

The clearing lever.

Notation

Let’s give names to all the instructions the operator needs to follow, so we can write some assembly:

  • mr, or Machine Ready!, means to clear/zero every register. All Curta instructions use the term “Machine Ready” to indicate the beginning of a calculation session.
  • pturn is the core addition operation, a “plus turn”.
  • mturn is its subtraction twin, a “minus turn”.
  • set <flag> requests the operator set one of rl or sm.
  • clr <flag> is the opposite of set.
  • zero <reg> request a clear of one of rr or cr using the clearing lever.
  • add <reg>, <imm> requests manual addition of an immediate to sr or ca. This is limited by what mental math we can ask of the operator.
  • copy <reg>, sr requests a copy of the value in rr or cr to sr.
  • wrnp <reg>, <symbol> indicates we need to write down a value in any register to a handy notepad (hence write notepad), marked with <symbol>.
  • rdnp <reg>, <symbol> asks the operator to read a value recorded with wrnp.
  • if <cond>, <label> asks the operator to check a condition (in terms of cr, rr, and sr) and, if true, proceed to the instruction at the given label:. Here’s some examples of conditions we’ll use:
    • rr == 42, i.e., rr equals some constant value.
    • rr.ovflow, i.e., rr overflowed/underflowed due to the most recent pturn/mturn.
    • cr[1] == 9, i.e. cr’s second digit (zero-indexed, not like the physical device!) equals 9.
    • cr[0..ca] < sr[0..ca], i.e., cr, considering only the digits up to the setting of ca, is less than those same digits in sr.
  • goto <label> is like if without a condition.
  • done means we’re done and the result can be read off of rr (or cr).

Note that there is a lot of mental math in some of the conditions. Algorithms on the Curta are aimed to minimize what work the operator needs to do to compute a result, but remember that it is only an ALU: all of the control flow logic needs to be provided by the human operator.

None of this is real code, and it is specifically for the benefit of readers.

Some Algorithms

So, addition and subtraction are easy, because there are hardware instructions for those. There is, however, no direct way to do multiplication or division. Let’s take a look at some of our options.

Given that a Curta is kinda expensive, you can try out an online simulator if you want to follow along. This one is pretty simple and runs in your browser.

Multiplication

The easiest way to do multiplication is by repeated addition; cr helps us check our work.

Given a value like 8364, we can multiply it by 5 like so:

mul_by_5:
  mr
  add   sr, 8364
loop:
    if    cr == 5, end
    pturn
    goto  loop
end:
  done
Curta "Assembly"

Here, we input the larger factor into sr, and then keep turning until cr contains the other factor. The result is 41820:

8364 * 5 == 41820

Of course, this does not work well for complex products, such as squaring 41820. You could sit there and turn the handle forty thousand times if you wanted to, or you might decided that you should get a better hobby, since modern silicon can do this in nanoseconds.

We can speed this up exponentially by making use of the distributive property and the fact that turn can incorporate multiplication by a power of 10.

Consider:

41820 * 41820
= 41820 * (40000 + 1000 + 800 + 20)
= 41820 * 40000 + 41820 * 1000 + 41820 * 800 + 41820 * 20
Plaintext

Each nice round number here can be achieved in cr by use of ca. Our algorithm will look a bit like this:

square:
  mr
  add   sr, 41820
loop:
    // Check if we're done.
    if    cr == 41820, end
  inner:
      // Turn until the first `ca` digits of `cr` and the
      // other factor match.
      if    cr[1..ca] == 41802[1..ca], inner_end
      pturn
      goto  inner
  inner_end:
    // Increment `ca` and repeat until done.
    add   ca, 1 
    goto  loop
end:
  done
Curta "Assembly"

There are two loops. The inner loop runs as many turns as is necessary to get the next prefix of the factor into cr, then incrementing ca to do the next digit, and on and on until cr contains the entire other factor, at which point we can read off the result.

The actual trace of operations (omitting control flow), and the resulting contents of the registers sr/rr/mr/ca at each step, looks something like this:

mr
// 00000000000/000000000000000/00000000/0
add   sr, 41820
// 00000041820/000000000000000/00000000/0
add   ca, 1
// 00000041820/000000000000000/00000000/1
pturn
// 00000041820/000000000418200/00000010/1
pturn
// 00000041820/000000000083640/00000020/1
add   ca, 1
// 00000041820/000000000083640/00000020/2
pturn
// 00000041820/000000005018400/00000120/2
pturn
// 00000041820/000000009200400/00000220/2
pturn
// 00000041820/000000013382400/00000320/2
pturn
// 00000041820/000000017564400/00000420/2
pturn
// 00000041820/000000021746400/00000520/2
pturn
// 00000041820/000000025928400/00000620/2
pturn
// 00000041820/000000030110400/00000720/2
pturn
// 00000041820/000000034292400/00000820/2
add   ca, 1
// 00000041820/000000034292400/00000820/3
pturn
// 00000041820/000000076112400/00001820/3
add   ca, 1
// 00000041820/000000494312400/00011820/4
pturn
// 00000041820/000000912512400/00021820/4
pturn
// 00000041820/000001330712400/00031820/4
pturn
// 00000041820/000001748912400/00041820/4
pturn
Curta "Assembly"

The result can be read off from rr: 1748912400. In the trace, you can see cr get built up digit by digit, making this operation rather efficient.

41820 * 41820 == 1748912400

We can do even better, if we use subtraction. For example, note that 18 = 20 - 2; we can build up 18 in cr by doing only 4 turns rather than nine, according to this formula. Here’s the general algorithm for n * m:

mul:
  mr
  add   sr, n
loop:
    if    cr == m, end
    // Same as before, but if the next digit is large,
    // go into subtraction mode.
    if    m[ca] > 5, by_sub
  inner:
      if    cr[0..ca] == m[0..ca], inner_end
      pturn
      goto  inner
  by_sub:
    // Store the current `ca` position.
    wrnp  ca,   sub_from
    // Find the next small digit (eg. imagine n * 199, we
    // want to find the 1).
  find_small:
    add   ca,   1
    if    m[ca] > 5, find_small
    // Set the digit to one plus the desired value for that
    // digit.
  outer_turns:
    pturn
    if    cr[ca] != m[ca] + 1, outer_turns
    // Store how far we need to re-advance `ca`.
    wrnp  ca,   continue_from
    // Go back to the original `ca` position and enter
    // subtraction mode.
    rdnp  ca,   sub_from
  subs:
  subs_inner:
      // Perform subtractions until we get the value we want.
      if    cr[ca] == m[ca],  subs_end
      mturn
      goto  subs_inner
  subs_end:
    // Advance `ca` and keep going until we're done.
    add   ca,   1
    if    ca != continue_from, subs
    goto  loop
  inner_end:
    add   ca,   1 
    goto  loop
end:
  done
Curta "Assembly"

Although more complicated, if we execute it step by step, we’ll see we get to our answer in fewer turns:

mr
// 00000000000/000000000000000/00000000/0
add   sr, 41820
// 00000041820/000000000000000/00000000/0
add   ca, 1
// 00000041820/000000000000000/00000000/1
pturn
// 00000041820/000000000418200/00000010/1
pturn
// 00000041820/000000000835400/00000020/1
add   ca, 2
// 00000041820/000000000835400/00000020/3
pturn
// 00000041820/000000042656400/00001020/3
pturn
// 00000041820/000000084476400/00002020/3
add   ca, -1
// 00000041820/000000084476400/00002020/2
mturn
// 00000041820/000000080294400/00001920/2
mturn
// 00000041820/000000076112400/00001820/2
add   ca, 2
// 00000041820/000000494312400/00011820/4
pturn
// 00000041820/000000912512400/00021820/4
pturn
// 00000041820/000001330712400/00031820/4
pturn
// 00000041820/000001748912400/00041820/4
pturn
Curta "Assembly"

In exchange for a little overhead, the number of turns drops from 15 to 10. This is the fastest general algorithm, but some techniques from Hacker’s Delight can likely be applied here to make it faster for some products.

Cubes

As a quick note, computing the cube of a number without taking extra notes is easy, so long as the number is already written down somewhere you can already see it. After computing n^2 by any of the methods above, we can do

cube:
  mr
  add   sr,   n
  // Perform a multiplication by `n`, then copy the result
  // into `sr`.
  copy  sr,   rr
  zero  rr
  zero  cr
  // Perform another multiplication by `n`, but now with
  // its square in `sr`.
  done
Curta "Assembly"

This sequence can be repeated over and over to produce higher powers, and is only limited by the size of rr.

Division

Division is way more interesting, because it can be inexact, and thus produces a remainder in addition to the quotient. There are a few different algorithms, but the simplest one is division by repeated subtraction. Some literature calls this “division by breaking down”.

For small numbers, this is quite simple, such as 21 / 4:

div_by_4:
  mr
  add   sr,   21
  pturn
  zero  cr
  zero  sr
  add   sr,   4
  set   rl
loop:
    if    rr.oflow, end
    mturn
    goto  loop
end:
  pturn
  done
Curta "Assembly"

This works by first getting the dividend into rr and resetting the rest of the machine. Then, with rl set, we subtract the divisor from rr until we get overflow, at which point we add to undo the overflow. The quotient will appear in cr: we set rl, so each subtraction increments cr, giving us a count of mturns executed. The remainder appears in rr.

In this case, we get down to 1 before the next mturn underflows; the result of that underflow is to 99...97, the ten’s complement of -3. We then undo the last operation by pturning, getting 5 in cr: this is our quotient. 1 in rr is the remainder.

The same tricks from earlier work here, using ca to make less work, effectively implementing decimal long division of n/m:

div:
  // Set up the registers.
  mr
  add   sr,   n
  pturn
  zero  cr
  zero  sr
  add   sr,   m
  set   rl
  // Move `ca` to be such that the highest digit of
  // `sr` lines up with the highest digit of `rr`.
  add   ca,   log(m) - log(n) + 1
loop:
  // Make subtractive turns until we underflow.
  inner:
    mturn
    if    !rr.ovflow, inner
  // Undo the turn that underflowed by doing an addition.
  // Because `rl` is set, this will also conveniently subtract
  // from `cr`, to remove the extra count from the
  // underflowing turn.
  pturn
  // We're done if this is the last digit we can be subtracting.
  // Otherwise, decrement `ca` and start over.
  if    ca == 0, done
  add   ca,   -1
  goto  loop
end:
  done
Curta "Assembly"

Let’s execute this on 3141592653 / 137, with an instruction trace as before.

mr
// 00000000000/000000000000000/00000000/0
add   sr, 3141592653
// 03141592653/000000000000000/00000000/0
pturn
// 03141592653/000003141592653/00000001/0
zero  cr
// 03141592653/000003141592653/00000000/0
zero  sr
// 00000000000/000003141592653/00000000/0
add   sr,   137
// 00000000137/000003141592653/00000000/0
add   ca,   7
// 00000000137/000003141592653/00000000/7
mturn
// 00000000137/000001771592653/10000000/7
turn
// 00000000137/000000401592653/20000000/7
turn
// 00000000137/999990031592653/30000000/7
pturn
// 00000000137/000000401592653/20000000/7
add   ca,   -1
// 00000000137/000000401592653/20000000/6
mturn
// 00000000137/000000264592653/21000000/6
mturn
// 00000000137/000000127592653/22000000/6
mturn
// 00000000137/999999990592653/23000000/6
pturn
// 00000000137/000000127592653/22000000/6
add ca,   -1
// 00000000137/000000127592653/22000000/5
// More turns...
add ca,   -1
// 00000000137/000000004292653/22900000/4
// More turns...
add ca,   -1
// 00000000137/000000000182653/22930000/3
// ...
add ca,   -1
// 00000000137/000000000045653/22931000/2
// ...
add ca,   -1
// 00000000137/000000000004553/22931300/1
// ...
add ca,   -1
// 00000000137/000000000000443/22931330/0
// ...
done
// 00000000137/000000000000032/22931333/0
Curta "Assembly"

For a quotient this big, you’ll need to work through all eight cr digits, which is a ton of work. At the end, we get a quotient of 22931333 and reminder 32.

3141592653 / 137 == 22931333, rem 32

Unfortunately, we can’t as easily “cheat” with subtraction as we did with multiplication, because we don’t know the value that needs to appear in cr.

Square Roots

Computing square roots by approximation is one of the premiere operations on the Curta. There’s a number of approaches. Newton’s method is the classic, but requires a prior approximation, access to lookup tables, or a lot of multiplication.

A slower, but much more mechanical approach is to use Töpler’s method. This consists of observing that the sum of the first n odd numbers is the square of n. Thus, we can use an approach similar to that for division, only that we now subtract off consecutive odd numbers. Let’s take the square root of 92:

sqrt_of_92:
  mr
  add   sr,   92
  pturn
  zero  cr
  zero  sr
  add   sr,   1
  set   rl
loop:
  mturn
  if    rr.ovflow, end
  add   sr,  
  goto  loop 
end:
  pturn
  done
Curta "Assembly"

We get 9 as our result, but that’s pretty awful precision. We can improve precision by multiplying 92 by a large, even power of ten, and then dividing the result by that power of ten’s square root (half the zeroes).

Unfortunately, this runs into the same problem as naive multiplication: we have to turn the handle a lot. Turning this algorithm into something that can be done exponentially faster is a bit fussier.

One approach (which I found on ) allows us to compute the root by shifting. Several programmers appear to have independently discovered this in the 70s or 80s.

It is based on the so-called “digit-by-digit” algorithm, dating back to at least the time of Napier. Wikipedia provides a good explanation of why this method works. However, I have not been able to write down a proof that this specific version works, since it incorporates borrowing to compute intermediate terms with successive odd numbers in a fairly subtle way. I would really appreciate a proof, if anyone knows of one!

The algorithm is thus, for a radicand n:

sqrt:
  mr
  // Put `ca` as far as it will go, and then enter
  // the radicand as far right as it will go, so you
  // get as many digits as possible to work with.
  add   ca,   8
  add   sr,   n << (8 - log(n))
  pturn
  zero  cr
  zero  sr
  // Put a 1 under the leftmost pair of digits. This
  // assumes a number with an even number of digits.
  add   sr,   1 << (ca - 1)
  set   rl
loop:
  sqrt_loop:
      // Add an odd number (with a bunch of zeros
      // after it.)
      mturn
      if    rr.ovflow,  sqrt_end
      // Increment sr by 2 (again, with a bunch of
      // zeros after it). This gives us our next odd
      // number.
      add   sr,   2 << (ca - 1)
      goto  sqrt_loop
  sqrt_end:
    // Note that we do NOT undo the increment of `sr`
    // that caused overflow, but we do undo the last
    // mturn.
    pturn
    // If `ca` is all the way to the right, we're out of
    // space, so these are all the digits we're getting.
    // Zeroing out `rr` also means we're done.
    if    ca == 1 || rr == 0, end
    // Subtract ONE from the digit in `sr` we were
    // incrementing in the loop. This results in an even
    // number.
    add   sr,   -(1 << (ca - 1))
    // Decrement `ca` and keep cranking. 
    add   ca,   -1
    add   sr,   1 << (ca - 1)
    goto loop
end:
  done
Curta "Assembly"

Let’s compute some digits of sqrt(2). Here’s the instruction trace.

mr
// 00000000000/000000000000000/00000000/0
add   ca,   7
// 00000000000/000000000000000/00000000/7
add   sr,   2 << (8 - log(n))
// 00020000000/000000000000000/00000000/7
pturn
// 00020000000/200000000000000/10000000/7
zero  cr
// 00020000000/200000000000000/00000000/7
zero  sr
// 00000000000/200000000000000/00000000/7
add   sr,   1 << (ca - 2)
// 00010000000/200000000000000/00000000/7
mturn
// 00010000000/100000000000000/10000000/7
add   sr,   2 << (ca - 2)
// 00030000000/100000000000000/10000000/7
mturn
// 00030000000/800000000000000/10000000/7
pturn
// 00030000000/100000000000000/10000000/7
add   sr,   -(1 << (ca - 2))
// 00020000000/100000000000000/10000000/7
add   ca,   -1
// 00020000000/100000000000000/10000000/6
add   sr,   1 << (ca - 2)
// 00021000000/100000000000000/10000000/6
mturn
// 00021000000/079000000000000/11000000/6
add   sr,   2 << (ca - 2)
// 00023000000/079000000000000/11000000/6
mturn
// 00023000000/056000000000000/12000000/6
add   sr,   2 << (ca - 2)
// 00025000000/056000000000000/12000000/6
mturn
// 00025000000/031000000000000/13000000/6
add   sr,   2 << (ca - 2)
// 00027000000/031000000000000/13000000/6
mturn
// 00027000000/004000000000000/14000000/6
add   sr,   2 << (ca - 2)
// 00029000000/004000000000000/14000000/6
mturn
// 00029000000/975000000000000/15000000/6
pturn
// 00029000000/004000000000000/14000000/6
add   sr,   -(1 << (ca - 2))
// 00028000000/004000000000000/14000000/6
add   ca,   -1
// 00028000000/004000000000000/14000000/5
// More of the same...
Curta "Assembly"

Over time, the digits 14121356 will appear in cr. This is the square root (although we do need to place the decimal point; the number of digits before it will be half of what we started with, rounded up).

sqrt(2) ~ 1.4121356

Wrap-up

There’s a quite a few other algorithms out there, but most of them boil down to clever use of lookup tables and combinations of the above techniques. For example, the so-called “rule of 3” is simply performing a multiplication to get a product into rr, and then using it as the dividend to produce a quotient of the form a * b / c in cr.

I hope that these simple numeric algorithms, presented in a style resembling assembly, helps illustrate that programming at such a low level is not hard, but merely requires learning a different bag of tricks.

  1. Although this seems like an oxymoron, it is accurate! The Curta contains no electrical or electronic components, and its registers contain discrete symbols, not continuous values. It is not an analog computer! 

  2. The Curta is a one-indexed machine, insofar as the values engraved on ca are not 0 to 7 but 1 to 8. However, as we all know, zero-indexing is far more convenient. Any place where I say “set ca to n”, I mean the n + 1th detent.

    Doing this avoids a lot of otherwise unnecessary -1s in the prose. 

  3. The ten’s complement of a number x is analogous to the two’s complement (i.e., the value of -x when viewed as an unsigned integer on a binary machine). It is equal to MAX_VALUE - x + 1, where MAX_VALUE is the largest value that x could be. For example, this is 999_999_999_999_999_999 (fifteen nines) for rr