3Hz Computer, Hold the Transistors

I’m not really one to brag publicly about expensive toys, but a few weeks ago I managed to get one that’s really something special. It is a Curta Type II, a mechanical digital1 calculator manufactured in Liechtenstein from the 50s through the 70s, before solid-state calculators killed off both the Curta and the likes of slide rules.

I have wanted one since I was a kid, and I managed to win an eBay auction for one.

The Curta

The Curta Type II (and Solomon the cat)

It’s a funny looking device, somewhere between a peppermill and a scifi grenade. Mine has serial number 544065, for those keeping score, and comes in a cute little bakelite pod (which has left hand thread?!).

I wanna talk about this thing because unlike something like a slide rule, it shares many features with modern computers. It has operations, flags, and registers. Its core primitive is an adder, but many other operations can be built on top of it: it is very much a platform for complex calculations.

I’m the sort of person who read Hacker’s Delight for fun, so I really like simple numerical algorithms. This article is a survey of the operation of a Curta calculator and algorithms you can implement on it, from the perspective of a professional assembly programmer.

Many of the algorithms I’m going to describe here exist online, but I’ve found them to be a bit difficult to wrap my head around, so this article is also intended as a reference card for myself.

Let’s dive in!

A Well-Lubricated ALU

There are two Curta models, Type I and Type II, which primarily differ in the sizes of their registers. I have a Type II, so I will focus on the layout of that one.

The Curta is not a stored program computer like the one you’re reading this article on. An operator needs to manually execute operations. It is as if we had taken a CPU and pared it down to two of its most basic components: a register file and an arithmetic logic unit (ALU).

The Register File

The Curta’s register file consists of three digital registers, each of which contains a decimal integer (i.e., each digit is from 0 to 9, rather than 0 to 1 like on a binary computer):

  • sr, the setting register, is located on the side of the device. The value in sr can be set manually by the operator using a set of knobs on the side of the device. The machine will never write to it, only read from it. It has 11 digits.
  • rr, the results register, is located at the top of the device along the black part of the dial. It is readable and writable by the machine, but not directly modifiable by the operator. It has 15 digits.
  • cr, the counting register, is located next to rr along the silver part of the dial. Like rr, it is only machine-modifiable. It has 8 digits.

sr, set to 1997.

rr is the black dial; cr is the silver one.

There are also two settings on the device that aren’t really registers, but, since they are changed as part of operation, they are a lot like the control registers of a modern computer.

The carriage (there isn’t an abbreviation for this one, so I’ll call it ca) is the upper knurled ring on the machine. It can be set to a value from 0 to 72. To set it, the operator lifts the ring up (against spring tension), twists it, and lets it spring back into the detent for the chosen value. This is a one-hand motion.

There is a small triangle in the middle of the top of the device that points at which of the digits in cr will get incremented.

ca raised and in motion.

Finally, rl, the reversing lever, is a small switch near the back of the device that can be in the up or down position. This is like a flag register: up is cleared, down is set.

rl in the up position.

The Instruction Set

We have all this memory, but the meat of a machine is what it can do. I will provide an instruction set for the Curta to aid in giving rigorous descriptions of operations you can perform with it.

The core operation of the Curta is “add-with-shift-and-increment”. This is a mouthful. At the very top of the machine is the handle, which is analogous to a clock signal pin. Every clockwise turn of this handle executes one of these operations. Internally, this is implemented using a variation on the Leibniz gear, a common feature of mechanical calculators.

The handle in "addition" mode.

This operation is not that complicated; it just does a lot of stuff. It takes the value of sr, left-shifts it (in decimal) by the value in ca, and adds it to rr. It also increments cr by 1 shifted by ca. In other words:

rr += sr << ca
cr += 1 << ca
Plaintext

Recall that this is a decimal machine, so << is the same as multiplication by a power of 10, not a power of 2.

Addition can overflow, and it wraps around as expected: adding one to 999_999_999_999_999 (fifteen nines) already in rr will fill it with zeroes.

Pulling the handle up reveals a red ring, indicating the machine is in subtraction mode. This flips the signs of both the rr and cr modifications:

rr -= sr << ca
cr -= 1 << ca
Plaintext

The handle in "subtraction" mode.

The Curta cannot handle negative numbers, so it will instead display the ten’s complement3 of a negative result. For example, subtracting 1 from 0 will produce all-nines.

You can detect underflow or overflow by noticing that the resulting value is unexpectedly larger or smaller, respectively, than the prior value in rr. (The same trick is necessary on architectures that lack a carry flag, like RISC-V.)
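
To make this concrete, here is a tiny C++ illustration (my own sketch, nothing Curta-specific) of the same trick on a binary machine without a carry flag: after a wrapping addition, the sum coming out smaller than one of the operands tells you a carry happened.

#include <cstdint>
#include <cstdio>

// Detects carry-out from an unsigned add by comparing the wrapped
// result against one of the operands, just like watching rr on the Curta.
bool add_overflowed(uint32_t a, uint32_t b, uint32_t* sum) {
  *sum = a + b;     // wraps modulo 2^32, like rr wraps modulo 10^15
  return *sum < a;  // smaller-than-before means the addition carried
}

int main() {
  uint32_t sum;
  bool carried = add_overflowed(0xFFFFFFFFu, 1, &sum);
  std::printf("sum=%u carried=%d\n", sum, carried);  // sum=0 carried=1
}
C++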

Setting rl will reverse the sign of the operation done on cr during a turn of the handle. In addition mode, it will cause cr to be subtracted from, while in subtraction mode, it will cause it to be added to. Some complex algorithms make use of this.

Finally, the clearing lever can be used to clear (to zero) sr or rr, independently. It is a small ring-shaped lever that, while the carriage is raised, can be wiped past digits to clear them. Registers cannot be partially cleared.

The clearing lever.
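
Since everything below is just sequences of these operations, a rough software model may help. The following C++ sketch is purely my own illustration (not anything official); it models the three registers, the carriage, and the turn operation with the wraparound and rl behavior described above.

#include <cstdint>
#include <cstdio>

// A rough software model of the Curta Type II registers and its single
// hardware operation.
struct Curta {
  uint64_t sr = 0;  // setting register, 11 digits
  uint64_t rr = 0;  // results register, 15 digits
  uint64_t cr = 0;  // counting register, 8 digits
  unsigned ca = 0;  // carriage position, 0 to 7
  bool rl = false;  // reversing lever

  static constexpr uint64_t kRrMod = 1000000000000000;  // 10^15
  static constexpr uint64_t kCrMod = 100000000;         // 10^8

  static uint64_t Pow10(unsigned n) {
    uint64_t p = 1;
    while (n--) p *= 10;
    return p;
  }

  // One turn of the handle; `subtract` means the handle is pulled up
  // into subtraction mode.
  void Turn(bool subtract) {
    uint64_t shifted = sr * Pow10(ca) % kRrMod;
    uint64_t count = Pow10(ca);
    // Subtraction is modeled as adding the ten's complement, which is
    // exactly how the real machine wraps around.
    rr = (rr + (subtract ? kRrMod - shifted : shifted)) % kRrMod;
    bool cr_down = (subtract != rl);  // rl reverses the direction of cr
    cr = (cr + (cr_down ? kCrMod - count : count)) % kCrMod;
  }
};

int main() {
  Curta c;
  c.sr = 8364;
  for (int i = 0; i < 5; i++) c.Turn(/*subtract=*/false);
  std::printf("rr=%llu cr=%llu\n",
              (unsigned long long)c.rr, (unsigned long long)c.cr);
  // Prints: rr=41820 cr=5
}
C++

Five plus-turns with 8364 in sr leave 41820 in rr and a count of 5 in cr, which is exactly the first multiplication we’ll do below.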

Notation

Let’s give names to all the instructions the operator needs to follow, so we can write some assembly:

  • mr, or Machine Ready!, means to clear/zero every register. All Curta instructions use the term “Machine Ready” to indicate the beginning of a calculation session.
  • pturn is the core addition operation, a “plus turn”.
  • mturn is its subtraction twin, a “minus turn”.
  • set <flag> requests the operator set one of rl or sm.
  • clr <flag> is the opposite of set.
  • zero <reg> requests a clear of one of rr or cr using the clearing lever.
  • add <reg>, <imm> requests manual addition of an immediate to sr or ca. This is limited by what mental math we can ask of the operator.
  • copy sr, <reg> requests a copy of the value in rr or cr into sr.
  • wrnp <reg>, <symbol> indicates we need to write down a value in any register to a handy notepad (hence write notepad), marked with <symbol>.
  • rdnp <reg>, <symbol> asks the operator to read a value recorded with wrnp.
  • if <cond>, <label> asks the operator to check a condition (in terms of cr, rr, and sr) and, if true, proceed to the instruction at the given label. Here are some examples of conditions we’ll use:
    • rr == 42, i.e., rr equals some constant value.
    • rr.ovflow, i.e., rr overflowed/underflowed due to the most recent pturn/mturn.
    • cr[1] == 9, i.e. cr’s second digit (zero-indexed, not like the physical device!) equals 9.
    • cr[0..ca] < sr[0..ca], i.e., cr, considering only the digits up to the setting of ca, is less than those same digits in sr.
  • goto <label> is like if without a condition.
  • done means we’re done and the result can be read off of rr (or cr).

Note that there is a lot of mental math in some of the conditions. Algorithms on the Curta are aimed to minimize what work the operator needs to do to compute a result, but remember that it is only an ALU: all of the control flow logic needs to be provided by the human operator.

None of this is real code, and it is specifically for the benefit of readers.

Some Algorithms

So, addition and subtraction are easy, because there are hardware instructions for those. There is, however, no direct way to do multiplication or division. Let’s take a look at some of our options.

Given that a Curta is kinda expensive, you can try out an online simulator if you want to follow along. This one is pretty simple and runs in your browser.

Multiplication

The easiest way to do multiplication is by repeated addition; cr helps us check our work.

Given a value like 8364, we can multiply it by 5 like so:

mul_by_5:
  mr
  add   sr, 8364
loop:
    if    cr == 5, end
    pturn
    goto  loop
end:
  done
Curta "Assembly"

Here, we input the larger factor into sr, and then keep turning until cr contains the other factor. The result is 41820:

8364 * 5 == 41820

Of course, this does not work well for complex products, such as squaring 41820. You could sit there and turn the handle forty thousand times if you wanted to, or you might decide that you should get a better hobby, since modern silicon can do this in nanoseconds.

We can speed this up exponentially by making use of the distributive property and the fact that a turn can incorporate a multiplication by a power of 10 (via ca).

Consider:

41820 * 41820
= 41820 * (40000 + 1000 + 800 + 20)
= 41820 * 40000 + 41820 * 1000 + 41820 * 800 + 41820 * 20
Plaintext

Each nice round number here can be achieved in cr by use of ca. Our algorithm will look a bit like this:

square:
  mr
  add   sr, 41820
loop:
    // Check if we're done.
    if    cr == 41820, end
  inner:
      // Turn until the first `ca` digits of `cr` and the
      // other factor match.
      if    cr[0..ca] == 41820[0..ca], inner_end
      pturn
      goto  inner
  inner_end:
    // Increment `ca` and repeat until done.
    add   ca, 1 
    goto  loop
end:
  done
Curta "Assembly"

There are two loops. The inner loop runs as many turns as necessary to get the next prefix of the factor into cr; the outer loop then increments ca to move on to the next digit, and so on until cr contains the entire other factor, at which point we can read off the result.
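
Before looking at the turn-by-turn trace, here is a small C++ simulation (mine, purely illustrative, and not the notation used in this article) of the same two-loop idea; it also counts how many times the operator has to turn the handle.

#include <cstdint>
#include <cstdio>

// For each decimal digit of the multiplier, turn the handle that many
// times at the matching carriage position, then bump the carriage.
uint64_t MulByTurns(uint64_t sr, uint64_t multiplier, unsigned* turns) {
  uint64_t rr = 0;
  uint64_t shift = 1;  // 10^ca
  *turns = 0;
  while (multiplier > 0) {
    unsigned digit = multiplier % 10;
    for (unsigned i = 0; i < digit; i++) {
      rr += sr * shift;  // one pturn: rr += sr << ca
      ++*turns;
    }
    multiplier /= 10;  // move on to the next digit of the factor
    shift *= 10;       // add ca, 1
  }
  return rr;
}

int main() {
  unsigned turns;
  uint64_t rr = MulByTurns(41820, 41820, &turns);
  std::printf("%llu in %u turns\n", (unsigned long long)rr, turns);
  // Prints: 1748912400 in 15 turns
}
C++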

The actual trace of operations (omitting control flow), and the resulting contents of the registers sr/rr/cr/ca at each step, looks something like this:

mr
// 00000000000/000000000000000/00000000/0
add   sr, 41820
// 00000041820/000000000000000/00000000/0
add   ca, 1
// 00000041820/000000000000000/00000000/1
pturn
// 00000041820/000000000418200/00000010/1
pturn
// 00000041820/000000000836400/00000020/1
add   ca, 1
// 00000041820/000000000083640/00000020/2
pturn
// 00000041820/000000005018400/00000120/2
pturn
// 00000041820/000000009200400/00000220/2
pturn
// 00000041820/000000013382400/00000320/2
pturn
// 00000041820/000000017564400/00000420/2
pturn
// 00000041820/000000021746400/00000520/2
pturn
// 00000041820/000000025928400/00000620/2
pturn
// 00000041820/000000030110400/00000720/2
pturn
// 00000041820/000000034292400/00000820/2
add   ca, 1
// 00000041820/000000034292400/00000820/3
pturn
// 00000041820/000000076112400/00001820/3
add   ca, 1
// 00000041820/000000076112400/00001820/4
pturn
// 00000041820/000000494312400/00011820/4
pturn
// 00000041820/000000912512400/00021820/4
pturn
// 00000041820/000001330712400/00031820/4
pturn
// 00000041820/000001748912400/00041820/4
Curta "Assembly"

The result can be read off from rr: 1748912400. In the trace, you can see cr get built up digit by digit, making this operation rather efficient.

41820 * 41820 == 1748912400

We can do even better, if we use subtraction. For example, note that 18 = 20 - 2; we can build up 18 in cr by doing only 4 turns rather than nine, according to this formula. Here’s the general algorithm for n * m:

mul:
  mr
  add   sr, n
loop:
    if    cr == m, end
    // Same as before, but if the next digit is large,
    // go into subtraction mode.
    if    m[ca] > 5, by_sub
  inner:
      if    cr[0..ca] == m[0..ca], inner_end
      pturn
      goto  inner
  by_sub:
    // Store the current `ca` position.
    wrnp  ca,   sub_from
    // Find the next small digit (eg. imagine n * 199, we
    // want to find the 1).
  find_small:
    add   ca,   1
    if    m[ca] > 5, find_small
    // Set the digit to one plus the desired value for that
    // digit.
  outer_turns:
    pturn
    if    cr[ca] != m[ca] + 1, outer_turns
    // Store how far we need to re-advance `ca`.
    wrnp  ca,   continue_from
    // Go back to the original `ca` position and enter
    // subtraction mode.
    rdnp  ca,   sub_from
  subs:
  subs_inner:
      // Perform subtractions until we get the value we want.
      if    cr[ca] == m[ca],  subs_end
      mturn
      goto  subs_inner
  subs_end:
    // Advance `ca` and keep going until we're done.
    add   ca,   1
    if    ca != continue_from, subs
    goto  loop
  inner_end:
    add   ca,   1 
    goto  loop
end:
  done
Curta "Assembly"

Although more complicated, if we execute it step by step, we’ll see we get to our answer in fewer turns:

mr
// 00000000000/000000000000000/00000000/0
add   sr, 41820
// 00000041820/000000000000000/00000000/0
add   ca, 1
// 00000041820/000000000000000/00000000/1
pturn
// 00000041820/000000000418200/00000010/1
pturn
// 00000041820/000000000836400/00000020/1
add   ca, 2
// 00000041820/000000000835400/00000020/3
pturn
// 00000041820/000000042656400/00001020/3
pturn
// 00000041820/000000084476400/00002020/3
add   ca, -1
// 00000041820/000000084476400/00002020/2
mturn
// 00000041820/000000080294400/00001920/2
mturn
// 00000041820/000000076112400/00001820/2
add   ca, 2
// 00000041820/000000076112400/00001820/4
pturn
// 00000041820/000000494312400/00011820/4
pturn
// 00000041820/000000912512400/00021820/4
pturn
// 00000041820/000001330712400/00031820/4
pturn
// 00000041820/000001748912400/00041820/4
Curta "Assembly"

In exchange for a little overhead, the number of turns drops from 15 to 10. This is the fastest general algorithm, but some techniques from Hacker’s Delight can likely be applied here to make it faster for some products.

Cubes

As a quick note, computing the cube of a number without taking extra notes is easy, so long as the number is written down somewhere you can see it. After computing n^2 by any of the methods above, we can do

cube:
  mr
  add   sr,   n
  // Perform a multiplication by `n`, then copy the result
  // into `sr`.
  copy  sr,   rr
  zero  rr
  zero  cr
  // Perform another multiplication by `n`, but now with
  // its square in `sr`.
  done
Curta "Assembly"

This sequence can be repeated over and over to produce higher powers, and is only limited by the size of rr.

Division

Division is way more interesting, because it can be inexact, and thus produces a remainder in addition to the quotient. There are a few different algorithms, but the simplest one is division by repeated subtraction. Some literature calls this “division by breaking down”.

For small numbers, this is quite simple, such as 21 / 4:

div_by_4:
  mr
  add   sr,   21
  pturn
  zero  cr
  zero  sr
  add   sr,   4
  set   rl
loop:
    if    rr.ovflow, end
    mturn
    goto  loop
end:
  pturn
  done
Curta "Assembly"

This works by first getting the dividend into rr and resetting the rest of the machine. Then, with rl set, we subtract the divisor from rr until we get overflow, at which point we add to undo the overflow. The quotient will appear in cr: we set rl, so each subtraction increments cr, giving us a count of mturns executed. The remainder appears in rr.

In this case, we get down to 1 before the next mturn underflows; the result of that underflow is 99...97, the ten’s complement of -3. We then undo the last operation by pturning, getting 5 in cr: this is our quotient. The 1 in rr is the remainder.
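
Here is the same repeated-subtraction idea as a small C++ sketch (my own model, with the underflow-and-undo step replaced by a check before subtracting; it assumes a nonzero divisor).

#include <cstdint>
#include <cstdio>

// Division by repeated subtraction: subtract the divisor until the next
// turn would underflow, counting turns along the way (the role cr plays
// when rl is set).
void DivByTurns(uint64_t rr, uint64_t divisor,
                uint64_t* quotient, uint64_t* remainder) {
  uint64_t cr = 0;
  while (rr >= divisor) {
    rr -= divisor;  // one mturn
    cr++;           // with rl set, cr counts the subtractions
  }
  *quotient = cr;
  *remainder = rr;
}

int main() {
  uint64_t q, r;
  DivByTurns(21, 4, &q, &r);
  std::printf("%llu rem %llu\n", (unsigned long long)q, (unsigned long long)r);
  // Prints: 5 rem 1
}
C++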

The same tricks from earlier work here, using ca to make less work, effectively implementing decimal long division of n/m:

div:
  // Set up the registers.
  mr
  add   sr,   n
  pturn
  zero  cr
  zero  sr
  add   sr,   m
  set   rl
  // Move `ca` to be such that the highest digit of
  // `sr` lines up with the highest digit of `rr`.
  add   ca,   log(n) - log(m)
loop:
  // Make subtractive turns until we underflow.
  inner:
    mturn
    if    !rr.ovflow, inner
  // Undo the turn that underflowed by doing an addition.
  // Because `rl` is set, this will also conveniently subtract
  // from `cr`, to remove the extra count from the
  // underflowing turn.
  pturn
  // We're done if this is the last digit we can be subtracting.
  // Otherwise, decrement `ca` and start over.
  if    ca == 0, end
  add   ca,   -1
  goto  loop
end:
  done
Curta "Assembly"

Let’s execute this on 3141592653 / 137, with an instruction trace as before.

mr
// 00000000000/000000000000000/00000000/0
add   sr, 3141592653
// 03141592653/000000000000000/00000000/0
pturn
// 03141592653/000003141592653/00000001/0
zero  cr
// 03141592653/000003141592653/00000000/0
zero  sr
// 00000000000/000003141592653/00000000/0
add   sr,   137
// 00000000137/000003141592653/00000000/0
add   ca,   7
// 00000000137/000003141592653/00000000/7
mturn
// 00000000137/000001771592653/10000000/7
mturn
// 00000000137/000000401592653/20000000/7
mturn
// 00000000137/999999031592653/30000000/7
pturn
// 00000000137/000000401592653/20000000/7
add   ca,   -1
// 00000000137/000000401592653/20000000/6
mturn
// 00000000137/000000264592653/21000000/6
mturn
// 00000000137/000000127592653/22000000/6
mturn
// 00000000137/999999990592653/23000000/6
pturn
// 00000000137/000000127592653/22000000/6
add ca,   -1
// 00000000137/000000127592653/22000000/5
// More turns...
add ca,   -1
// 00000000137/000000004292653/22900000/4
// More turns...
add ca,   -1
// 00000000137/000000000182653/22930000/3
// ...
add ca,   -1
// 00000000137/000000000045653/22931000/2
// ...
add ca,   -1
// 00000000137/000000000004553/22931300/1
// ...
add ca,   -1
// 00000000137/000000000000443/22931330/0
// ...
done
// 00000000137/000000000000032/22931333/0
Curta "Assembly"

For a quotient this big, you’ll need to work through all eight cr digits, which is a ton of work. At the end, we get a quotient of 22931333 and remainder 32.

3141592653 / 137 == 22931333, rem 32

Unfortunately, we can’t as easily “cheat” with subtraction as we did with multiplication, because we don’t know the value that needs to appear in cr.

Square Roots

Computing square roots by approximation is one of the premier operations on the Curta. There are a number of approaches. Newton’s method is the classic, but requires a prior approximation, access to lookup tables, or a lot of multiplication.

A slower, but much more mechanical, approach is to use Töpler’s method. This consists of observing that the sum of the first n odd numbers is the square of n. Thus, we can use an approach similar to that for division, except that we now subtract off consecutive odd numbers. Let’s take the square root of 92:

sqrt_of_92:
  mr
  add   sr,   92
  pturn
  zero  cr
  zero  sr
  add   sr,   1
  set   rl
loop:
  mturn
  if    rr.ovflow, end
  add   sr,   2
  goto  loop 
end:
  pturn
  done
Curta "Assembly"

We get 9 as our result, but that’s pretty awful precision. We can improve precision by multiplying 92 by a large, even power of ten, and then dividing the result by that power of ten’s square root (half the zeroes).

Unfortunately, this runs into the same problem as naive multiplication: we have to turn the handle a lot. Turning this algorithm into something that can be done exponentially faster is a bit fussier.

One approach (which I found online) allows us to compute the root by shifting. Several programmers appear to have independently discovered it in the 70s or 80s.

It is based on the so-called “digit-by-digit” algorithm, dating back to at least the time of Napier. Wikipedia provides a good explanation of why this method works. However, I have not been able to write down a proof that this specific version works, since it incorporates borrowing to compute intermediate terms with successive odd numbers in a fairly subtle way. I would really appreciate a proof, if anyone knows of one!
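
For reference, here is a plain C++ version of that digit-by-digit method (my own standalone implementation, not a transcription of the Curta procedure), which you can use to check the machine’s output against.

#include <cstdint>
#include <cstdio>

// Digit-by-digit decimal square root via sums of odd numbers: for each
// digit pair of the radicand, the next root digit is how many of the odd
// numbers (20*root + 1), (20*root + 3), ... we can subtract.
uint64_t IsqrtDigitByDigit(uint64_t n) {
  // Highest power of 100 not exceeding n, so we can peel off digit pairs
  // from the most significant end.
  uint64_t place = 1;
  while (place <= n / 100) place *= 100;

  uint64_t root = 0;
  uint64_t remainder = n;
  while (place > 0) {
    uint64_t odd = 20 * root + 1;
    uint64_t digit = 0;
    while (remainder >= odd * place) {
      remainder -= odd * place;
      odd += 2;
      digit++;
    }
    root = 10 * root + digit;
    place /= 100;
  }
  return root;
}

int main() {
  // sqrt(2), scaled up by 10^14.
  std::printf("%llu\n",
              (unsigned long long)IsqrtDigitByDigit(200000000000000ULL));
}
C++

Feeding it 2 followed by fourteen zeroes reproduces the digits 14142135 that we are about to crank out by hand.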

The algorithm is thus, for a radicand n:

sqrt:
  mr
  // Put `ca` as far as it will go, and then enter
  // the radicand as far right as it will go, so you
  // get as many digits as possible to work with.
  add   ca,   7
  add   sr,   n << (8 - log(n))
  pturn
  zero  cr
  zero  sr
  // Put a 1 under the leftmost pair of digits. This
  // assumes a number with an even number of digits.
  add   sr,   1 << ca
  set   rl
loop:
  sqrt_loop:
      // Add an odd number (with a bunch of zeros
      // after it.)
      mturn
      if    rr.ovflow,  sqrt_end
      // Increment sr by 2 (again, with a bunch of
      // zeros after it). This gives us our next odd
      // number.
      add   sr,   2 << ca
      goto  sqrt_loop
  sqrt_end:
    // Note that we do NOT undo the increment of `sr`
    // that caused overflow, but we do undo the last
    // mturn.
    pturn
    // If `ca` is all the way to the right, we're out of
    // space, so these are all the digits we're getting.
    // Zeroing out `rr` also means we're done.
    if    ca == 0 || rr == 0, end
    // Subtract ONE from the digit in `sr` we were
    // incrementing in the loop. This results in an even
    // number.
    add   sr,   -(1 << ca)
    // Decrement `ca` and keep cranking. 
    add   ca,   -1
    add   sr,   1 << ca
    goto loop
end:
  done
Curta "Assembly"

Let’s compute some digits of sqrt(2). Here’s the instruction trace.

mr
// 00000000000/000000000000000/00000000/0
add   ca,   7
// 00000000000/000000000000000/00000000/7
add   sr,   2 << (8 - log(n))
// 00020000000/000000000000000/00000000/7
pturn
// 00020000000/200000000000000/10000000/7
zero  cr
// 00020000000/200000000000000/00000000/7
zero  sr
// 00000000000/200000000000000/00000000/7
add   sr,   1 << ca
// 00010000000/200000000000000/00000000/7
mturn
// 00010000000/100000000000000/10000000/7
add   sr,   2 << ca
// 00030000000/100000000000000/10000000/7
mturn
// 00030000000/800000000000000/20000000/7
pturn
// 00030000000/100000000000000/10000000/7
add   sr,   -(1 << ca)
// 00020000000/100000000000000/10000000/7
add   ca,   -1
// 00020000000/100000000000000/10000000/6
add   sr,   1 << ca
// 00021000000/100000000000000/10000000/6
mturn
// 00021000000/079000000000000/11000000/6
add   sr,   2 << ca
// 00023000000/079000000000000/11000000/6
mturn
// 00023000000/056000000000000/12000000/6
add   sr,   2 << ca
// 00025000000/056000000000000/12000000/6
mturn
// 00025000000/031000000000000/13000000/6
add   sr,   2 << ca
// 00027000000/031000000000000/13000000/6
mturn
// 00027000000/004000000000000/14000000/6
add   sr,   2 << ca
// 00029000000/004000000000000/14000000/6
mturn
// 00029000000/975000000000000/15000000/6
pturn
// 00029000000/004000000000000/14000000/6
add   sr,   -(1 << ca)
// 00028000000/004000000000000/14000000/6
add   ca,   -1
// 00028000000/004000000000000/14000000/5
// More of the same...
Curta "Assembly"

Over time, the digits 14142135 will appear in cr. This is the square root (although we do need to place the decimal point; the number of digits before it will be half of what we started with, rounded up).

sqrt(2) ~ 1.4142135

Wrap-up

There are quite a few other algorithms out there, but most of them boil down to clever use of lookup tables and combinations of the above techniques. For example, the so-called “rule of 3” is simply performing a multiplication to get a product into rr, and then using it as the dividend to produce a quotient of the form a * b / c in cr.

I hope that these simple numeric algorithms, presented in a style resembling assembly, help illustrate that programming at such a low level is not hard, but merely requires learning a different bag of tricks.

  1. Although this seems like an oxymoron, it is accurate! The Curta contains no electrical or electronic components, and its registers contain discrete symbols, not continuous values. It is not an analog computer! 

  2. The Curta is a one-indexed machine, insofar as the values engraved on ca are not 0 to 7 but 1 to 8. However, as we all know, zero-indexing is far more convenient. Any place where I say “set ca to n”, I mean the n + 1th detent.

    Doing this avoids a lot of otherwise unnecessary -1s in the prose. 

  3. The ten’s complement of a number x is analogous to the two’s complement (i.e., the value of -x when viewed as an unsigned integer on a binary machine). It is equal to MAX_VALUE - x + 1, where MAX_VALUE is the largest value that x could be. For example, this is 999_999_999_999_999 (fifteen nines) for rr.

std::tuple the Hard Way

Let’s talk about C++ templates.

C++ is famous for relegating important functionality, which other languages build into the language itself, to its standard library1. C++11 added a number of very useful class templates intended to make generic programming easier. By far the most complicated is std::tuple<>, which is literally just a tuple.

It turns out that implementing std::tuple<> is complicated. Very, very complicated.

Naively, we think that we can just splat a variadic pack into a struct:

template <typename... Types>
class tuple {
  Types... values;
};

If you click through to Godbolt, you’ll see that it doesn’t compile: this feature doesn’t exist in C++2 (normally, you’d just reach for std::tuple<Types...>, but we need to write std::tuple down somehow). The usual approach is to use some kind of recursive template, which can tend to generate a lot of code.

However, C++ does actually have tuples built into the language, as a C++11 feature… lambdas! As an extra challenge, we’re going to try to minimize the number of templates that the compiler needs to instantiate; std::tuple is famously bad about this and can lead to very poor build performance.

For our tuple library type, we need to solve the following problems:

  • How do we implement std::tuple() and std::tuple(args...)?
  • How do we implement std::apply?
  • How do we implement std::tuple_element?
  • How do we implement std::get?

The Power of [](){}

Alright, let’s back up. In C++11, we got lambdas, which are expressions that expand to anonymous functions. In C++, lambdas are closures, meaning that they capture (“close over”) their environment.

This is a lambda in action:

int x = 5;
auto add = [x] (int y) { return x + y; };
int z = add(8);  // 13
C++

The [x] syntax is the capture list. To represent a lambda, C++ creates an anonymous, one-time-use class. It has the captures as members (whether they be references or values) and provides the necessary operator(). In other words, this is approximately the desugaring:

auto MakeFn(int x) {
  return [x] (int y) {
    return x + y;
  };
}
C++
auto MakeFn(int x) {
  struct _Lambda {
    auto operator()(int y) const {
      return x + y;
    };
    const int x;
  };
  return _Lambda{x};
}
C++

Note the consts in _Lambda. By default, captured values are stored inline but marked const, and the operator() member is also const. We can remove that specifier in both locations with the mutable keyword:

auto MakeFn(int x) {
  return [x] (int y) mutable {
    return x + y; // ^^^^^^^
  };
}
C++
auto MakeFn(int x) {
  struct _Lambda {
    auto operator()(int y) {
      return x + y;
    };
    int x;
  };
  return _Lambda{x};
}
C++

Lambdas can capture anything from their scope. In addition to values, they will capture any types visible from that location. This means that, if constructed in a function template, the generated class will effectively capture that template’s arguments. Thus:

template <typename... Args>
auto CaptureMany(Args... args) {
  return [args...] () { /*whatever*/ };
};
C++

This will create a new anonymous class capturing an arbitrary number of arguments, depending on the parameters passed to CaptureMany(). This will form the core of our tuple type.

Now, let’s stick it into a class.

Lambda-Typed Data Members

We don’t want to leak the lambda into the template parameters of our tuple class, so we need it to be strictly in terms of the class’s template parameters. This is straightforward with decltype.

template <typename... Types>
class Tuple {
 private:
  decltype(TupleLambda(Types{}...)) lambda_;
};
C++

Regardless of what our C++ compiler calls the type, we are able to use it as a field. However, a problem arises when we try to write down the main “in-place” constructor, which consists of the usual forwarding-reference and std::forward boilerplate3:

template <typename... Types>
class Tuple {
 public:
  template <typename... Args>
  Tuple(Args&&... args) : lambda_(
    TupleLambda(std::forward<Args>(args)...)) {}
  // ...
};
C++

The initialization for lambda_ doesn’t work, because the return type of TupleLambda is wrong! The compiler is required to synthesize a new type for every specialization of TupleLambda, and so TupleLambda<Types...>() and TupleLambda<Args...>() return different types!
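
Here is a minimal standalone illustration of the problem (my own example, separate from Tuple): two different specializations of a lambda-returning function template hand back two distinct closure types, even though their bodies are identical.

#include <type_traits>

// Each specialization of a function template that returns a lambda
// returns a distinct closure type.
template <typename... Args>
auto MakeLambda(Args... args) {
  return [args...] {};
}

static_assert(
    !std::is_same_v<decltype(MakeLambda(1)), decltype(MakeLambda(1u))>,
    "different specializations hand back different closure types");
C++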

A New Kind of Initialization

This requires a major workaround. We’d still like to use our lambda, but we need to give it a type that allows us to construct it before calling the constructors of Types.... We can’t use Types..., so we’ll do a switcheroo.

The following is boilerplate for a type that can hold a T in it but which can be constructed before we construct the T.

template <typename T>
class alignas(T) StorageFor {
 public:
  // Constructor does nothing.
  StorageFor() = default;

  // Constructs a T inside of data_.
  template <typename... Args>
  void Init(Args&&... args) {
    new (reinterpret_cast<T*>(&data_)) T(
      std::forward<Args>(args)...);
  }

  // Allow dereferencing a StorageFor into a T, like
  // a smart pointer.
  const T* get() const { return reinterpret_cast<const T*>(&data_); }
  T* get() { return reinterpret_cast<T*>(&data_); }
  const T& operator*() const { return *get(); }
  T& operator*() { return *get(); }
  const T* operator->() const { return get(); }
  T* operator->() { return get(); }
 private:
  char data_[sizeof(T)];
};
C++

There’s a lot going on here. Let’s break it down.

  1. alignas(T) ensures that even though the only member is a char array, the whole StorageFor<T> object is aligned at least as strictly as a T, so the placement-new in Init() produces a correctly-aligned T.
  2. The constructor does nothing; the T within is only constructed when Init() is called with T’s constructor arguments.

  3. Init() forwards its arguments just like our non-functional constructor for Tuple. This time, the arguments get sent into T’s constructor via placement-new. Placement-new is special syntax that allows us to call a constructor directly on existing memory. It’s spelled like this: new (dest) T(args);.

  4. operator*/operator-> turn StorageFor into a smart pointer over T, which will be useful later. The signatures of these functions aren’t important; it’s library boilerplate.

We can use this type like this:

// Create some storage.
StorageFor<std::string> my_string;

// Separately, initialize it using std::string's constructor
// from char[N].
my_string.Init("cool type!");

// Print it out.
absl::PrintF("%s\n", *my_string);

// Destroy it. This must be done manually because StorageFor<T>
// has a trivial destructor.
using ::std::string;
my_string->~string();
C++

How does this help us?

Constructors Inside-Out

StorageFor<Types>... will be the types that our lambda captures, making it possible to give the lambda a consistent type without knowing which arguments we’ll use to initialize the contents.

template <typename... Types>
class Tuple {
 private:
  template <typename... Args>
  static auto TupleLambda(Args... args) {
    return [args...] { /* ??? */ };
  }

  decltype(TupleLambda(StorageFor<Types>{}...)) lambda_ =
    TupleLambda(StorageFor<Types>{}...);
};
C++

But now we’re in another bind: how do we call the constructors? Even with placement-new, we can’t reach into the lambda’s data, and the layout of a lambda is compiler-specific. However, that’s from the outside. What if we accessed the lambda from the inside?

We modify the lambda to itself be generic and take a pack of forwarding references as arguments, which we can then pass into Init():

template <typename... Types>
class Tuple {
 private:
  template <typename... Args>
  static auto TupleLambda(Args... args) {
    return [args...] (auto&&... init_args) {
      (args.Init(std::forward<decltype(init_args)>(init_args)), ...);
    };
  }
  // ...
}
C++

That’s a serious mouthful. Let’s break it down.

  1. [args...] (auto&&... init_args) { declares a generic lambda. This means that there’s an imaginary template <typename... Args> on the operator() of the generated class. Because the argument type is Args&&, and Args is a template parameter of operator(), init_args is a pack of forwarding references. This is a C++14 feature.

  2. Init(std::forward<decltype(init_args)>(init_args)) is a forwarded constructor argument. Nothing new here.

  3. The outer (<expr>, ...) that the Init() call is wrapped in is a pack fold, which uses an operator to fold a pack of values into one. For example, (foo + ...) computes the sum of all elements in a pack. In our case, we’re folding with the comma operator ,. All this does is discard the elements of the pack (which are all void, regardless). This is a C++17 feature4. A small standalone fold example follows below.

Taken together, this causes the constructor of each type in Types... to be run on the respective StorageFor<T> objects captured by the lambda back when TupleLambda() was originally called. The double-nesting of a function-within-a-function can be a bit confusing: TupleLambda() is not what calls T’s constructor!
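
As promised, here is a tiny standalone example of a comma fold (mine, not part of the Tuple code): the folded expression is evaluated once per element of the pack, and the results are discarded.

#include <cstdio>

// Folding a pack over the comma operator runs the expression once per
// pack element, left to right.
template <typename... Args>
void PrintInts(Args... args) {
  (std::printf("%d ", args), ...);
}

int main() {
  PrintInts(1, 2, 3);  // prints: 1 2 3
}
C++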

Actually, this won’t compile because Init() is not const, but the lambda’s operator() is. This is easily fixed by adding the mutable keyword:

template <typename... Types>
class Tuple {
 private:
  template <typename... Args>
  static auto TupleLambda(Args... args) {
    return [args...] (auto&&... init_args) mutable {
      // ...                               ^^^^^^^
    };
  }
  // ...
}
C++

We also need to mark the lambda_ member as mutable so that const member functions can call it. We’ll just need to be careful we don’t actually mutate through it. This is necessary because we cannot (at least until C++23) write to the captures of a lambda and still be able to call it in const contexts:

template <typename... Types>
class Tuple {
 private:
  mutable decltype(TupleLambda(StorageFor<Types>{}...)) lambda_ =
    TupleLambda(StorageFor<Types>{}...);
  // ...
};
C++

Now, our constructor looks like this:

template <typename... Types>
class Tuple {
 public:
  template <typename... Args>
  Tuple(Args&&... args) {
    lambda_(std::forward<Args>(args)...);
  }
  // ...
}
C++

More Constructors!

We have std::tuple(args...), but we still need std::tuple(). But we’ve already used up our one chance to touch the captures of the lambda… we can’t write down a lambda that has both a variadic operator() (many generic arguments) and a niladic operator() (no arguments).

But we can make it take a lambda itself! In this case, all that our “storage lambda” does now is call a callback with a pack of references. Calling lambda_() effectively “unpacks” it:

template <typename... Types>
class Tuple {
 private:
  template <typename... Args>
  static auto TupleLambda(Args... args) {
    return [=] (auto callback) mutable -> decltype(auto) {
      return callback(args...);
    };
  }
  // ...
}
C++

The decltype(auto) bit simply ensures that if callback returns a reference, then so does lambda_. By default, lambdas return auto, which will never deduce a reference (you’d need to write auto&, which conversely cannot deduce a value). Instead of using “auto deduction”, we can use the special decltype(auto) type to request “decltype deduction”, which can deduce both references and non-references. This comes in handy later.
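
As a quick standalone illustration (my own example, not part of Tuple), compare the two deduction modes:

#include <type_traits>

// `(global)` is a parenthesized lvalue, so decltype-style deduction
// sees int&; plain auto deduction strips the reference.
int global = 42;

auto ReturnsCopy() { return (global); }           // deduced return type: int
decltype(auto) ReturnsRef() { return (global); }  // deduced return type: int&

static_assert(std::is_same_v<decltype(ReturnsCopy()), int>);
static_assert(std::is_same_v<decltype(ReturnsRef()), int&>);
C++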

Now we can refactor the two constructors to call lambda_ with different lambda arguments. Our original constructor will pass in the original body of lambda_, which calls Init() with args. The new constructor will simply call Init() with no args.

template <typename... Types>
class Tuple {
 public:
  template <typename... Args>
  Tuple(Args&&... args) {
    lambda_([&] (StorageFor<Types>&... places) {
      (places.Init(std::forward<decltype(args)>(args)), ...);
    });
  }
  Tuple() {
    lambda_([] (StorageFor<Types>&... places) {
      (places.Init(), ...);
    }); 
  }
  // ...
}
C++

We need to implement the destructor too, since StorageFor<T> will not destroy the T we’re squirreling away inside, but this is still really easy:

template <typename... Types>
class Tuple {
 public:
  ~Tuple() {
    lambda_([] (StorageFor<Types>&... places) {
      (places->~Types(), ...);
    }); 
  }
  // ...
}
C++

Copy and move are similar, but require interleaving two calls of lambda_:

template <typename... Types>
class Tuple {
 public:
  Tuple(const Tuple& that) {
    lambda_([&] (StorageFor<Types>&... these) {
      // Carefully take a const&, to make sure we don't call a
      // mutable-ref constructor.
      that.lambda_([&] (const StorageFor<Types>&... those) {
        (new (these.get()) Types(*those), ...);   
      });
    });
  }

  Tuple(Tuple&& that) {
    lambda_([&] (StorageFor<Types>&... these) {
      that.lambda_([&] (StorageFor<Types>&... those) {
        // Avoid std::move to cut down on instantiation.
        (new (these.get()) Types(static_cast<Types&&>(*those)), ...);
      });
    });
  }
  // ...
};
C++

Copy/move assignment are basically identical; I’ll leave those as an exercise!

This gives us our complete set of constructors. We’ll throw in deduction guides5 to avoid needing to implement make_tuple:

template <typename... Types>
Tuple(Types...) -> Tuple<Types...>;
template <typename... Types>
Tuple(const Tuple<Types...>&) -> Tuple<Types...>;

int main() {
  Tuple tup{1, 2, "foo", "bar"};
  Tuple tup2 = tup;
}
C++

This works up until we try to write Tuple tup2 = tup;: overload resolution will incorrectly route to the variadic constructor rather than the copy constructor, so a little bit of SFINAE is needed to grease the compiler’s wheels.

Keeping in the spirit of avoiding extra instantiation logic, we’ll use placement-new inside of a decltype as an ersatz std::enable_if:

template <typename... Args,
          decltype((new (nullptr) Types(std::declval<Args>()), ...))
            = nullptr>
Tuple(Args&&... args) {
  // ...
}
C++

This verifies that we can actually construct each Types from its corresponding Args. Because this is occurring in an unevaluated context, we can safely placement-new on nullptr. All new expressions produce a pointer value, and a comma-fold produces the last value in the fold, so the overall decltype() is T*, where T is the last element of the pack.

This decltype() is the type of a non-type template parameter, which we can default to nullptr, so the user never notices it.
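
Here is a stripped-down, single-type version of the same detection trick (my own sketch, not part of the Tuple code): the decltype in the template parameter list is only well-formed when T is constructible from A, so the first overload silently drops out of overload resolution otherwise.

#include <new>
#include <utility>

// The (int)/(long) parameters only exist to rank the overloads; the
// SFINAE happens in the defaulted non-type template parameter.
template <typename T, typename A,
          decltype(new (nullptr) T(std::declval<A>())) = nullptr>
constexpr bool Constructible(int) { return true; }

template <typename T, typename A>
constexpr bool Constructible(long) { return false; }

static_assert(Constructible<int, double>(0), "int is constructible from double");
static_assert(!Constructible<int, void*>(0), "but not from void*");
C++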

Ok. We have all of our constructors. The code so far is at this footnote: 6.

Onwards to std::apply.

Unpacking, Again

std::apply(f, tup) is a relatively straightforward function: call f by splatting tup’s elements into f as a pack. Because of how we’ve implemented lambda_, this is actually super simple:

template <typename... Types>
class Tuple {
 public:
  template <typename F>
  decltype(auto) apply(F&& f) {
    return lambda_([&] (StorageFor<Types>&... places) -> decltype(auto) {
      return std::invoke(std::forward<F>(f), *places...);
    });
  }
  // ...
};
C++

(We’re possibly returning a reference, so note the decltype(auto)s.)

lambda_ is basically a funny std::apply already, just with the wrong arguments. The *places fixes this up. With some repetition, we can write down const- and &&-qualified overloads. We can even introduce a free function just like the one in the standard library:

template <typename F, typename Tup>
decltype(auto) apply(F&& f, Tup&& t) {
  return std::forward<Tup>(t).apply(std::forward<F>(f));
}
C++

The other unpacking operation, std::get, is trickier. This is usually where things get really hairy, because we need to get the ith type out of the lambda. There are many approaches for doing this, most of which involve recursive templates. I’ll present two approaches that don’t use recursive templates directly, but which can still be a bit slow, build-time-wise.

This is the function we need to implement:

template <typename... Types>
class Tuple {
 public:
  template <size_t i>
  auto& get();
  // ...
};
C++

Cheating with std::make_index_sequence

std::make_index_sequence is a funny type-level function that produces a pack of the integers from 0 to i - 1, given just i. This is usually fast, since most compilers will have intrinsics for doing it without needing to instantiate i templates. For example, in Clang, this is __make_integer_seq, which is used by libc++.
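
A quick standalone demonstration (my own, unrelated to Tuple): std::make_index_sequence<5> expands into the pack 0, 1, 2, 3, 4.

#include <cstddef>
#include <cstdio>
#include <utility>

// Receives the generated pack of indices and prints each one.
template <std::size_t... is>
void PrintIndices(std::index_sequence<is...>) {
  (std::printf("%zu ", is), ...);
}

int main() {
  PrintIndices(std::make_index_sequence<5>{});  // prints: 0 1 2 3 4
}
C++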

Thus, we can turn the problem of implementing get with a single i to implementing get with a pack:

template <typename... Types>
class Tuple {
 public:
  template <size_t i>
  auto& get() {
    return GetImpl(std::make_index_sequence<i>{});
  }
 private:
  template <size_t... less_than_i>
  /* ??? */ GetImpl(std::index_sequence<less_than_i...>);
  // ...
};
C++

We can then use this pack to cook up just the right lambda to grab just the capture we want out of lambda_. Specifically, we want a lambda that picks out its ith argument. Basically we want to write something with arguments like (auto..., auto, auto...), but somehow use the less_than_i pack to control the size of the first argument pack.

We can whip up a class template for this:

template <size_t>
struct Sink {
  template <typename T>
  Sink(T&&) {}
};
C++

Sink<n> is a type that is implicitly convertible from anything, and has a dummy parameter we can key an expansion off of. Hence GetImpl() looks like this:

template <typename... Types>
class Tuple {
 private:
  template <size_t... less_than_i>
  auto& GetImpl(std::index_sequence<less_than_i...>) {
    return lambda_(
      [] (Sink<less_than_i>..., auto& the_one, auto...) -> auto& {
        return *the_one;
      });
  }
  // ...
};
C++

We can then provide the type of the ith element as a member type alias, using decltype:

template <typename Tuple, size_t i>
using TupleType = std::remove_reference_t<
  decltype(std::declval<Tuple>().template get<i>())>;
C++

(The template keyword isn’t doing anything interesting; it’s just for syntactic disambiguation.)

We can, as usual, repeat implementations for const/&& qualifiers.

Cheating Harder with __type_pack_element

If we’re ok being Clang-specific, Clang just gives us a magic type function that selects out of a pack. This means we can implement TupleType in terms of it:

template <typename... Types>
class Tuple {
 private:
  template <size_t i>
  using type = __type_pack_element<i, Types...>;
  // ...
};

template <typename Tuple, size_t i>
using TupleType = typename Tuple::template type<i>;
C++

Then, we can use void* to swindle the type system, since we don’t need to go to any effort to learn the ith type now:

template <typename... Types>
class Tuple {
 public:
  template <size_t i>
  type<i>& get() {
    return lambda_([] (StorageFor<Types>&... places) -> decltype(auto) {
      void* erased[] = {places.get()...};
      return *reinterpret_cast<type<i>*>(erased[i]);
    });
  }
  // ...
};
C++

(We’re returning a reference, so again note the decltype(auto).)

With that we have all of the functions we set out to implement. For kicks, we can add the relevant std specializations to enable structured bindings on our type (along with our get member function):

namespace std {
template <typename... Types>
struct tuple_size<Tuple<Types...>>
    : std::integral_constant<size_t, sizeof...(Types)> {};
template <size_t i, typename... Types>
struct tuple_element<i, Tuple<Types...>> {
  using type = typename Tuple<Types...>::template type<i>;
};
}  // namespace std
C++

Now we can see everything in action:

int main() {
  Tuple tup{1, 2, "foo", "bar", nullptr};
  tup.apply([](auto, auto, auto a, auto b, auto) {
    std::printf("%s %s\n", a, b);
  });

  auto [a, b, c, d, e] = tup;
  std::printf("%d %d\n", a, b);
}
C++

The full code can be found at this footnote: 7.

The Damage

So, the end result is most of an implementation of std::tuple<>. Let’s see how well it builds. We’re going to compile the following code for n from 0 to 150 and measure how long it takes.

tuple t{/* 0 repeated n times */};
t.get<0>();
// ...
t.get<n>();
C++

And here are the results on Clang 11 (what I had on hand) on my Zen 2 machine:

We seem to beat libstdc++ by a factor of around 2, but libc++ appears to have us beat. This is because libc++ makes even more aggressive use of Clang’s intrinsics than we did, allowing them to do significantly better. Interestingly, using the builtin makes us perform worse. I’m actually not sure why this is.

But ultimately, this wasn’t really about beating libc++: it’s about having fun with C++ templates.

  1. Arguably, because WG21, the body that standardizes C++, is bad at language evolution, but that’s not why we’re here. 

  2. The Circle compiler totally laughs in our faces, though, because it has this exact syntax. https://github.com/seanbaxter/circle/tree/master/tuple#circle-tuple 

  3. Basically every in-place constructor in C++ looks like this. It takes a variadic pack as a template parameter, and then takes Args&&... as its arguments. Args&& here is a forwarding reference, which means it is T& or T&& depending on the callsite. This overrides the usual template deduction rules, and is important for making sure that e.g. std::move propagates correctly.

    We cannot write Types&& instead, because that would not be a forwarding reference. T&& refers to a forwarding reference argument only on a function template where T is a parameter of that function and not an enclosing entity. 

  4. If C++17 is too much to ask, polyfilling isn’t too hard. Instead of (<expr>, ...);, we can write (void)(int[]){(<expr>, 0)...};, even if <expr> is a void expression. (<expr>, 0) is still a comma operator call, which discards the result of <expr> as before. The pack expands into an array of integers (a int[]), which we then discard with (void). This still has the behavior of evaluating <expr> once for each element of the pack. 

  5. A deduction guide is a special piece of syntax introduced in C++17 intended to aid deducing the types of constructor calls. When we write std::tuple(a, b, c), the template arguments of std::tuple are deduced. However, the constructor call may not give sufficient information to properly deduce them, because we may be calling a constructor template.

    The syntax looks like this:

    template <args>
    MyType(args) -> MyType<types>;
    C++

    This tells the compiler that when it encounters a call to a constructor of MyType that deduces the given types as its arguments, it should deduce the type after the -> for the template arguments of MyType, which can be arbitrary template argument expressions. 

  6. #include <new> 
    #include <utility> 
    
    template <typename T>
    class alignas(T) StorageFor {
     public:
      StorageFor() = default;
      template <typename... Args>
      void Init(Args&&... args) {
        new (reinterpret_cast<T*>(&data_)) T(
          std::forward<Args>(args)...);
      }
    
      const T* get() const { return reinterpret_cast<const T*>(&data_); }
      T* get() { return reinterpret_cast<T*>(&data_); }
      const T& operator*() const { return *get(); }
      T& operator*() { return *get(); }
      const T* operator->() const { return get(); }
      T* operator->() { return get(); }
     private:
      char data_[sizeof(T)];
    };
    
    template <typename... Types>
    class Tuple {
     public:
      Tuple() {
        lambda_([] (StorageFor<Types>&... places) {
          (places.Init(), ...);
        }); 
      }
    
      template <typename... Args,
                decltype((new (nullptr) Types(std::declval<Args>()), ...))
                  = nullptr>
      Tuple(Args&&... args) {
        lambda_([&] (StorageFor<Types>&... places) {
          (places.Init(std::forward<decltype(args)>(args)), ...);
        });
      }
    
      Tuple(const Tuple& that) {
        lambda_([&] (StorageFor<Types>&... these) {
          that.lambda_([&] (const StorageFor<Types>&... those) {
            (new (these.get()) Types(*those), ...);   
          });
        });
      }
    
      Tuple(Tuple&& that) {
        lambda_([&] (StorageFor<Types>&... these) {
          that.lambda_([&] (StorageFor<Types>&... those) {
            (new (these.get()) Types(static_cast<Types&&>(*those)), ...);
          });
        });
      }
    
      ~Tuple() {
        lambda_([] (StorageFor<Types>&... places) {
          (places->~Types(), ...);
        }); 
      }
    
     private:
      template <typename... Args>
      static auto TupleLambda(Args... args) {
        return [=] (auto callback) mutable -> decltype(auto) {
          return callback(args...);
        };
      }
    
      mutable decltype(TupleLambda(StorageFor<Types>{}...)) lambda_ =
        TupleLambda(StorageFor<Types>{}...);
    };
    
    template <typename... Types>
    Tuple(Types...) -> Tuple<Types...>;
    template <typename... Types>
    Tuple(const Tuple<Types...>&) -> Tuple<Types...>;
    
    int main() {
      Tuple tup{1, 2, "foo", "bar"};
      Tuple tup2 = tup;
    }

  7. #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <new>
    #include <type_traits>
    #include <utility>
    
    template <typename T>
    class alignas(T) StorageFor {
     public:
      StorageFor() = default;
      template <typename... Args>
      void Init(Args&&... args) {
        new (reinterpret_cast<T*>(&data_)) T(
          std::forward<Args>(args)...);
      }
    
      const T* get() const { return reinterpret_cast<const T*>(&data_); }
      T* get() { return reinterpret_cast<T*>(&data_); }
      const T& operator*() const { return *get(); }
      T& operator*() { return *get(); }
      const T* operator->() const { return get(); }
      T* operator->() { return get(); }
     private:
      char data_[sizeof(T)];
    };
    
    
    template <size_t>
    struct Sink {
      template <typename T>
      Sink(T&&) {}
    };
    
    template <typename... Types>
    class Tuple {
      #if USE_CLANG_INTRINSIC
          template <size_t i>
      using type = __type_pack_element<i, Types...>;
      #endif
    
     public:
      Tuple() {
        lambda_([] (StorageFor<Types>&... places) {
          (places.Init(), ...);
        }); 
      }
    
      template <typename... Args,
                decltype((new (nullptr) Types(std::declval<Args>()), ...))
                  = nullptr>
      Tuple(Args&&... args) {
        lambda_([&] (StorageFor<Types>&... places) {
          (places.Init(std::forward<decltype(args)>(args)), ...);
        });
      }
    
      Tuple(const Tuple& that) {
        lambda_([&] (StorageFor<Types>&... these) {
          that.lambda_([&] (const StorageFor<Types>&... those) {
            (new (these.get()) Types(*those), ...);   
          });
        });
      }
    
      Tuple(Tuple&& that) {
        lambda_([&] (StorageFor<Types>&... these) {
          that.lambda_([&] (StorageFor<Types>&... those) {
            (new (these.get()) Types(static_cast<Types&&>(*those)), ...);
          });
        });
      }
    
      ~Tuple() {
        lambda_([] (StorageFor<Types>&... places) {
          (places->~Types(), ...);
        }); 
      }
    
      template <typename F>
      decltype(auto) apply(F&& f) const& {
        return lambda_([&] (const StorageFor<Types>&... places) -> decltype(auto) {
          return std::invoke(std::forward<F>(f), *places...);
        });
      }
      template <typename F>
      decltype(auto) apply(F&& f) & {
        return lambda_([&] (StorageFor<Types>&... places) -> decltype(auto) {
          return std::invoke(std::forward<F>(f), *places...);
        });
      }
      template <typename F>
      decltype(auto) apply(F&& f) const&& {
        return lambda_([&] (const StorageFor<Types>&... places) -> decltype(auto) {
          return std::invoke(std::forward<F>(f),
            static_cast<const Types&&>(*places)...);
        });
      }
      template <typename F>
      decltype(auto) apply(F&& f) && {
        return lambda_([&] (StorageFor<Types>&... places) -> decltype(auto) {
          return std::invoke(std::forward<F>(f),
            static_cast<Types&&>(*places)...);
        });
      }
    
      template <typename F, typename Tup>
      friend decltype(auto) apply(F&& f, Tup&& t) {
        return std::forward<Tup>(t).apply(std::forward<F>(f));
      }
    
      #if USE_CLANG_INTRINSIC
          template <size_t i>
      const type<i>& get() const& {
        return lambda_([] (const StorageFor<Types>&... places) -> decltype(auto) {
          const void* erased[] = {places.get()...};
          return *reinterpret_cast<const type<i>*>(erased[i]);
        });
      }
    
      template <size_t i>
      type<i>& get() & {
        return lambda_([] (StorageFor<Types>&... places) -> decltype(auto) {
          void* erased[] = {places.get()...};
          return *reinterpret_cast<type<i>*>(erased[i]);
        });
      }
    
      template <size_t i>
      const type<i>&& get() const&& {
        return lambda_([] (const StorageFor<Types>&... places) -> decltype(auto) {
          const void* erased[] = {places.get()...};
          return static_cast<const type<i>&&>(
            *reinterpret_cast<const type<i>*>(erased[i]));
        });
      }
    
      template <size_t i>
      type<i>&& get() && {
        return lambda_([] (StorageFor<Types>&... places) -> decltype(auto) {
          void* erased[] = {places.get()...};
          return static_cast<type<i>&&>(
            *reinterpret_cast<type<i>*>(erased[i]));
        });
      }
      
      #else // USE_CLANG_INTRINSIC
    
      template <size_t i>
      const auto& get() const& {
        return GetImpl(*this, std::make_index_sequence<i>{});
      }
    
      template <size_t i>
      auto& get() & {
        return GetImpl(*this, std::make_index_sequence<i>{});
      }
    
      template <size_t i>
      const auto&& get() const&& {
        auto& val = GetImpl(*this, std::make_index_sequence<i>{});
        return static_cast<decltype(val)&&>(val);
      }
    
      template <size_t i>
      auto&& get() && {
        auto& val =  GetImpl(*this, std::make_index_sequence<i>{});
        return static_cast<decltype(val)&&>(val);
      }
    
      #endif // USE_CLANG_INTRINSIC
    
      template <size_t i, typename Tup>
      friend decltype(auto) get(Tup&& t) {
        return std::forward<Tup>(t).template get<i>();
      }
    
     private:
      template <typename... Args>
      static auto TupleLambda(Args... args) {
        return [=] (auto callback) mutable -> decltype(auto) {
          return callback(args...);
        };
      }
    
      template <typename Tup, size_t... less_than_i>
      friend decltype(auto) GetImpl(Tup&& t, std::index_sequence<less_than_i...>) {
        return std::forward<Tup>(t).lambda_(
          [] (Sink<less_than_i>..., auto& the_one, auto...) -> auto& {
            return *the_one;
          });
      }
    
      mutable decltype(TupleLambda(StorageFor<Types>{}...)) lambda_ =
        TupleLambda(StorageFor<Types>{}...);
    };
    
    #if USE_CLANG_INTRINSIC
        template <typename Tuple, size_t i>
    using TupleType = typename Tuple::template type<i>;
    #else
        template <typename Tuple, size_t i>
    using TupleType = std::remove_reference_t<
      decltype(std::declval<Tuple>().template get<i>())>;
    #endif
    
    template <typename... Types>
    Tuple(Types...) -> Tuple<Types...>;
    template <typename... Types>
    Tuple(const Tuple<Types...>&) -> Tuple<Types...>;
    
    namespace std {
    template <typename... Types>
    struct tuple_size<Tuple<Types...>>
        : std::integral_constant<size_t, sizeof...(Types)> {};
    template <size_t i, typename... Types>
    struct tuple_element<i, Tuple<Types...>> {
      using type = TupleType<Tuple<Types...>, i>;
    };
    }  // namespace std
    
    int main() {
      Tuple tup{1, 2, "foo", "bar", nullptr};
      tup.apply([](auto, auto, auto a, auto b, auto) {
        std::printf("%s %s\n", a, b);
      });
    
      auto [a, b, c, d, e] = tup;
      std::printf("%d %d\n", a, b);
    }

The Alkyne GC

Alkyne is a scripting language I built a couple of years ago for generating configuration blobs. Its interpreter is a naive AST walker1 that uses ARC2 for memory management, so it’s pretty slow, and I’ve been gradually writing a new evaluation engine for it.

This post isn’t about Alkyne itself; that’s for another day. For now, I’d like to write down some notes for the GC I wrote3 for it, and more generally provide an introduction to memory allocators (especially those that would want to collude with a GC).

This post is intended for people familiar with the basics of low-level programming, such as pointers and syscalls. Alkyne’s GC is intended to be simple while still having reasonable performance. This means that the design contains all the allocator “tropes,” but none of the hairy stuff.

My hope is readers new to allocators or GCs will come away with an understanding of these tropes and the roles they play in a modern allocator.

Thank you to James Farrell, Manish Goregaokar, Matt Kulukundis, JeanHeyd Meneide, Skye Thompson, and Paul Wankadia for providing feedback on various drafts of this article. This was a tough one to get right. :)

Trailhead

The Alkyne GC is solving a very specific problem, which allows us to limit what it actually needs to do. Alkyne is an “embeddable” language like JavaScript, so its heap is not intended to be big; in fact, for the benefit of memory usage optimizations, it’s ideal to use 32-bit pointers (a 4 gigabyte address space).

The heap needs to be able to manage arbitrarily-large allocations (for lists), and allocations as small as eight bytes (for floats4). Allocation should be reasonably quick, but due to the size of the heap, walking the entire heap is totally acceptable.

Because we’re managing a fixed-size heap, we can simply ask the operating system for a contiguous block of that size up-front using the mmap() syscall. An Alkyne pointer is simply a 32-bit offset into this giant allocation, which can be converted to and from a genuine CPU pointer by adding or subtracting the base address of the heap.

 4GB Heap
 +-------------------------------------------------+
 |                x                                |
 +-------------------------------------------------+
 ^                ^
 base             base + ptr_to_x
Plaintext
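
For concreteness, here is a minimal sketch of that conversion (the GcPtr name is mine; Alkyne’s actual pointer representation differs in its details):

/// A 32-bit Alkyne "pointer": an offset from the heap base.
#[derive(Copy, Clone)]
struct GcPtr(u32);

impl GcPtr {
  /// Converts this offset into a genuine CPU pointer.
  fn to_raw(self, heap_base: *mut u8) -> *mut u8 {
    unsafe { heap_base.add(self.0 as usize) }
  }

  /// Converts a CPU pointer (which must point into the heap) back
  /// into a 32-bit offset.
  fn from_raw(ptr: *mut u8, heap_base: *mut u8) -> GcPtr {
    GcPtr((ptr as usize - heap_base as usize) as u32)
  }
}
Rust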

The OS won’t actually reserve 4GB of memory for us; it will only allocate one system page (4KB) at a time. If we read or write to a particular page in the heap for the first time, the OS will only then find physical RAM to back it5.

Throughout, we’ll be working with this fixed-size heap, and won’t think too hard about where it came from. For our purposes, it is essentially a Box<[u8]>, but we’ll call it a Heap<[u8]> to make it clear that this is memory we got from the operating system. (To be clear, the entire discussion applies just as well to an ordinary gigantic Box<[u8]>.)

The Alkyne language does not have threads, so we can eschew concurrency. This significantly reduces the problems we will need to solve. Most modern allocators and garbage collectors are violently concurrent by nature, and unfortunately, much too advanced for one article. There are links below to fancier GCs you can poke around in.

A Heap of Trouble

To build a garbage collector, we first need an allocator. We could “just”6 use the system heap as a source of pages, but most garbage collectors collude with the allocator, since they will want to use similar data structures. Thus, if we are building a garbage collector, we might as well build the allocator too.

An allocator, or “memory heap” (not to be confused with a min-heap, an unrelated but wicked data structure), services requests for allocations: unique leases of space in the managed heap of various sizes, which last for lifetimes not known until runtime. These allocations may also be called objects, and a heap may be viewed as a general-purpose object pool.

The most common API for a heap is:

trait Allocator {
  // Returns a *unique pointer* managed by this allocator
  // to memory as large as requested, and as aligned
  // as we'd like.
  // 
  // Returns null on failure.
  unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8;
  // Frees a pointer returned by `alloc()`. May be called at
  // most once per pointer.
  unsafe fn free(&mut self, ptr: *mut u8);
}
Rust

Originally the examples were in C++, which I feel is more accessible (lol), but given that Alkyne itself is written in Rust, I felt Rust would make the story flow better.

This is the “malloc” API, which is actually very deficient; ideally, we would do something like Rust’s Allocator, which requires providing size and alignment to both the allocation and deallocation functions.

Unfortunately7, this means I need to explain alignment.

Good Pointers, Evil Pointers, Lawful Pointers, Chaotic Pointers

“Alignment” is a somewhat annoying property of a pointer. A pointer is aligned to N bytes (always a power of 2) if its address is divisible by N. A pointer is “well-aligned” (or just “aligned”) if its address is aligned to the natural alignment of the thing it points to. For ints, this is usually their size; for structs, it is the maximum alignment among the alignments of the fields of that struct.

Performing operations on a pointer requires that it be aligned8. This is annoying because it requires some math. Specifically, we need the following functions:

/// Checks that `ptr` is aligned to an alignment.
fn is_aligned(ptr: Int, align: usize) -> bool {
  ptr & (align - 1) == 0
}

/// Rounds `ptr` down to a multiple of `align`.
fn align_down(ptr: Int, align: usize) -> Int {
  ptr & !(align - 1)
}

/// Rounds `ptr` up to a multiple of `align`.
fn align_up(ptr: Int, align: usize) -> Int {
  // (I always look this one up. >_>)
  align_down(ptr + align - 1, align)
}

/// Computes how much needs to be added to `ptr` to align it.
fn misalign(ptr: Int, align: usize) -> usize {
  align_up(ptr, align) - ptr
}
Rust

(Exercise: prove these formulas.)

For the rest of the article I will assume I have these functions available at any time, for whatever type of integer I’d like (including raw pointers, which are just boutique9 integers).

Also, we will treat the Heap<[u8]> holding our entire heap as being infinitely aligned; i.e., as a pointer it is aligned to every alignment that could matter (in particular it is page-aligned; pages are 4KB, as always). (For an ordinary Box<[u8]>, this is not necessarily true.)

The Trivial Heap

Allocating memory is actually very easy. Arenas are the leanest and meanest in the allocator food chain; they simply don’t bother to free any memory.

This means allocation is just incrementing a cursor indicating where the hitherto-unallocated memory is.

 +-------------------------------------------------+
 | Allocated        | Free                         |
 +-------------------------------------------------+
                    ^
                    cursor
Plaintext

Our allocator is as simple as return ptr++;.

This is straightforward to implement in code:

struct Arena {
  heap: Heap<[u8]>,
  cursor: usize,
}

impl Allocator for Arena {
  unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
    // To get an aligned pointer, we need to burn some "alignment
    // padding". This is one of the places where alignment is
    // annoying. (The heap base is infinitely aligned, so only the
    // cursor's misalignment matters.)
    let needed = size + misalign(self.cursor, align);

    // Check that we're not out of memory.
    if self.heap.len() - self.cursor < needed {
      return ptr::null_mut();
    }

    // Advance the cursor and return the start of the newly
    // allocated section.
    self.cursor += needed;
    &mut self.heap[self.cursor - size] as *mut u8
  }

  unsafe fn free(&mut self, ptr: *mut u8) {
    // ayy lmao
  }
}
Rust

Arenas are very simple, but far from useless! They’re great for holding onto data that exists for the context of a “session”, such as for software that does lots of computations and then exits (a compiler) or software that handles requests from clients, where lots of data lives for the duration of the request and no longer (a webserver).

They are not, however, good for long-running systems. Eventually the heap will be exhausted if objects are not recycled.

Making this work turns out to be hard[citation-needed]. This is the “fundamental theorem” of allocators:

Fundamental “Theorem” of Allocators

Handing out memory is easy. Handing it out repeatedly is hard.

Thankfully, over the last fifty years we’ve mostly figured this out. Allocator designs can get pretty gnarly.

Four Tropes

From here, we will gradually augment our allocator with more features to allow it to service all kinds of requests. For this, we will implement four common allocator features:

  1. Blocks and a block cache.
  2. Free lists.
  3. Block merging and splitting.
  4. Slab allocation.

All four of these are present in some form in most modern allocators.

Blocks

The first thing we should do is to deal in fixed-size blocks of memory of some minimum size. If you ask malloc() for a single byte, it will probably give you like 8 bytes on most systems. No one is asking malloc() for single bytes, so we can quietly round up and not have people care. (Also, Alkyne’s smallest heap objects are eight bytes, anyways.)

Blocks are also convenient, because we can keep per-block metadata on each one, as a header before the user’s data:

#[repr(C)]
struct Block {
  header: Header,
  data: [u8; BLOCK_SIZE],
}
Rust

To allow blocks to be re-used, we can keep a cache of recently freed blocks. The easiest way to do this is with a stack. Note that the heap is now made of Blocks, not plain bytes.

To allocate storage, first we check the stack. If the stack is empty, we revert to being an arena and increment the cursor. To free, we push the block onto the stack, so alloc() can return it on the next call.

struct BlockCache {
  heap: Heap<[Block]>,
  cursor: usize,
  free_stack: Vec<*mut Block>,
}

impl Allocator for BlockCache {
  unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
    // Check that the size isn't too big. We don't need to
    // bother with alignment, because every block is
    // infinitely-aligned, just like the heap itself.
    if size > BLOCK_SIZE {
      return ptr::null_mut();
    }

    // Try to serve a block from the stack.
    if let Some(block) = self.free_stack.pop() {
      return (*block).data.as_mut_ptr();
    }

    // Fall back to arena mode.
    if self.cursor == self.heap.len() {
      return ptr::null_mut();
    }
    self.cursor += 1;
    self.heap[self.cursor - 1].data.as_mut_ptr()
  }

  unsafe fn free(&mut self, ptr: *mut u8) {
    // Use pointer subtraction to find the start of the block.
    let block = ptr.sub(size_of::<Header>()) as *mut Block;
    self.free_stack.push(block);
  }
}
Rust

This allocator has a problem: it relies on the system allocator! Heap came directly from the OS, but Vec talks to malloc() (or something like it). It also adds some pretty big overhead: the Vec needs to be able to resize, since it grows as more and more things are freed. This can lead to long pauses during free() as the vector resizes.

Cutting out the middleman gives us more control over this overhead.

Free Lists

Of course, no one has ever heard of a “free stack”; everyone uses free lists! A free list is the cache idea but implemented as an intrusive linked list.

A linked list is this data structure:

enum Node<T> {
  Nil,
  Cons(Box<(T, Node<T>)>),
  //   ^~~ oops I allocated again
}
Rust

This has the same problem of needing to find an allocator to store the nodes. An intrusive list avoids that by making the nodes part of the elements. The Header we reserved for ourselves earlier is the perfect place to put it:

struct Header {
  /// Pointer to the next and previous blocks in whatever
  /// list the block is in.
  next: *mut Block,
  prev: *mut Block,
}
Rust

In particular, we want to make sure blocks are in doubly-linked lists, which have the property that any element can be removed from them without walking the list.

   list.root
     |
     v
 +-> Block--------------------------+
 |   | next | null | data data data |
 |   +------------------------------+
 +-----/------+
      /       |
     v        |
 +-> Block--------------------------+
 |   | next | prev | data data data |
 |   +------------------------------+
 +-----/------+
      /       |
     v        |
 +-> Block--------------------------+
 |   | next | prev | data data data |
 |   +------------------------------+
 +-----/------+
      /       |
     v        |
     Block--------------------------+
     | null | prev | data data data |
     +------------------------------+
Plaintext

We also introduce a List container type that holds the root node of a list of blocks, to give us a convenient container-like API. This type looks like this:

struct List {
  /// The root is actually a sacrificial block that exists only to
  /// make it possible to unlink blocks in the middle of a list. This
  /// needs to exist so that calling unlink() on the "first" element
  /// of the list has a predecessor to replace itself with.
  root: *mut Block,
}

impl List {
  /// Pushes a block onto the list.
  unsafe fn push(&mut self, block: *mut Block) {
    let block = &mut *block;
    let root = &mut *self.root;
    if !root.header.next.is_null() {
      let first = &mut *root.header.next;
      block.header.next = first;
      first.header.prev = block;
    }
    root.header.next = block;
    block.header.prev = root;
  }

  /// Gets the first element of the list, if any.
  unsafe fn first(&mut self) -> Option<&mut Block> {
    let root = &mut *self.root;
    root.header.next.as_mut()
  }
}
Rust

We should also make it possible to ask a block whether it is in any list, and if so, remove it.

impl Block {
  /// Checks if this block is part of a list.
  fn is_linked(&self) -> bool {
    // Only the prev link is guaranteed to exist; next is
    // null for the last element in a list. Sacrificial
    // nodes will never have prev non-null, and can't be
    // unlinked.
    !self.header.prev.is_null()
  }

  /// Unlinks this block from whatever list it's in.
  unsafe fn unlink(&mut self) {
    assert!(self.is_linked());
    if !self.header.next.is_null() {
      let next = &mut *self.header.next;
      next.header.prev = self.header.prev; 
    }

    // This is why we need the sacrificial node.
    let prev = &mut *self.header.prev;
    prev.header.next = self.header.next;

    self.header.prev = ptr::null_mut();
    self.header.next = ptr::null_mut();
  }
}
Rust

Using these abstractions we can upgrade BlockCache to FreeList. We only need to rename free_stack to free_list and make a small change to alloc():

- if let Some(block) = self.free_stack.pop() {
-   return (*block).data.as_mut_ptr();
+ if let Some(block) = self.free_list.first() {
+   block.unlink();
+   return block.data.as_mut_ptr();
  }
Rust

Hooray for encapsulation!

This is a very early malloc() design, similar to the one described in the K&R C book. It does have a big blind spot: it can only serve up blocks up to a fixed size! It’s also quite wasteful, because all allocations are served the same size of block: the bigger we make the maximum request, the more wasteful alloc(8) gets.

Block Splitting (Alkyne’s Way)

The next step is to use a block splitting/merging scheme, such as the buddy system. Alkyne does not precisely use a buddy system, but it does something similar.

Alkyne does not have fixed-size blocks. Like many allocators, it defines a “page” of memory as the atom that it keeps its data structures. Alkyne defines a page to be 4KB, but other choices are possible: TCMalloc uses 8KB pages.

In Alkyne, pages come together to form contiguous, variable-size reams (get it?). These take the place of blocks.

Page Descriptors

Merging and splitting makes it hard to keep headers at the start of reams, so Alkyne puts them all in a giant array somewhere else. Each page gets its own “header” called a page descriptor, or Pd.

The array of page descriptors lives at the beginning of the heap, and the actual pages follow after that. It turns out that this array has a maximum size, which we can use to pre-compute where the array ends.

Currently, each Pd is 32 bytes, in addition to the 4KB it describes. If we divide 4GB by 32 + 4K, it comes out to around four million pages (4067203 to be precise). Rounded up to the next page boundary, this means that pages begin at the 127102nd 4K boundary after the Heap<[u8]> base address, or an offset of 0x7c1f400 bytes.
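
As a sketch of that pre-computation (the constant names here are mine, and the exact values Alkyne uses may differ), assuming 4KB pages and 32-byte descriptors:

const HEAP_SIZE: usize = 4 << 30; // 4GB of addressable heap.
const PAGE_SIZE: usize = 4096;    // One page per descriptor.
const PD_SIZE: usize = 32;        // Size of a single Pd.

/// The most pages we could ever describe: each page costs PAGE_SIZE
/// bytes of data plus PD_SIZE bytes of descriptor.
const MAX_PAGES: usize = HEAP_SIZE / (PAGE_SIZE + PD_SIZE);

/// Offset from the heap base to the first real page: the whole Pd
/// array, rounded up to the next page boundary.
const PAGES_START: usize =
  (MAX_PAGES * PD_SIZE + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
Rust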

Having them all in a giant array is also very useful, because it means the GC can trivially find every allocation in the whole heap: just iterate the Pd array!

So! This is our heap:

+---------------------------------------+  <-- mmap(2)'d region base
| Pd | Pd | Pd | Pd | Pd | Pd | Pd | Pd | \
+---------------------------------------+ |--- Page Descriptors
| Pd | Pd | Pd | Pd | Pd | Pd | Pd | Pd | |    for every page we can
+---------------------------------------+ |    ever allocate.
: ...                                   : |
+---------------------------------------+ |
| Pd | Pd | Pd | Pd | Pd | Pd | Pd | Pd | /
+---------------------------------------+  <-- Heap base address
| Page 0                                | \    = region + 0x7c1f400
|                                       | |
|                                       | |--- 4K pages corresponding
+---------------------------------------+ |    to the Pds above.
| Page 1                                | |    (not to scale)
|                                       | |
|                                       | |
+---------------------------------------+ |
: ...                                   | |
+---------------------------------------+ |
| Page N                                | |
|                                       | |
|                                       | /
+---------------------------------------+
  (not to scale by a factor of about 4)
Plaintext

Each one of those little Pds looks something like this:

#[repr(C)]
struct Pd {
  gc_bits: u64,
  prev: Option<u32>,
  next: Option<u32>,
  len: u16,
  class: SizeClass,
  // More fields...
}
Rust

prev and next are the intrusive linked list pointers used for the free lists, but now they are indices into the Pd array. The other fields will be used for this and the trope that follows.

Given a pointer into a page, we can get the corresponding Pd by align_down()’ing it to a page boundary, computing the index of the page (relative to the heap base), and then indexing into the Pd array. This process can be reversed to convert a pointer to a Pd into a pointer to its page, so going between the two is very easy.
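
A rough sketch of that correspondence, reusing the constants from above (the free-function names are mine; Alkyne’s real code goes through its PdRef type, described next):

/// Converts a pointer into a page into a pointer to that page's Pd.
unsafe fn pd_for_ptr(heap_base: *mut u8, ptr: *mut u8) -> *mut Pd {
  let page = align_down(ptr as usize - heap_base as usize, PAGE_SIZE);
  let index = (page - PAGES_START) / PAGE_SIZE;
  (heap_base as *mut Pd).add(index)
}

/// Converts a pointer to a Pd back into a pointer to its page.
unsafe fn page_for_pd(heap_base: *mut u8, pd: *mut Pd) -> *mut u8 {
  let index = (pd as usize - heap_base as usize) / PD_SIZE;
  heap_base.add(PAGES_START + index * PAGE_SIZE)
}
Rust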

I won’t cover this here, but Alkyne actually wraps Pd pointers in a special PdRef type that also carries a reference to the Heap; this allows implementing functions like is_linked(), unlink(), and data() directly.

I won’t show how this is implemented, since it’s mostly boilerplate.

Reams of Pages

There is one giant free list that contains all of the reams. Reams use their first page’s descriptor to track all of their metadata, including the list pointers for the free list. The len field additionally tracks how many additional pages are in the ream. gc_bits is set to 1 if the page is in use and 0 otherwise.

To allocate N contiguous pages from the free ream list:

  1. We walk through the free ream list, and pick the first one with at least N pages.
  2. We “split” it: the first N pages are returned to fulfill the request.
  3. The rest of the ream is put back into the free list.
  4. If no such ream exists, we allocate a max-sized ream10 (65536 pages), and split that as above.

In a sense, each ream is an arena that we allocate smaller reams out of; those reams cannot be “freed” back to the ream they came from. Instead, to free a ream, we just stick it back on the main free list.

If we ever run out, we turn back into an arena and initialize the next uninitialized Pd in the big ol’ Pd array.

struct ReamAlloc {
  heap: Heap<[Page]>,
  cursor: usize,
  free_list: List,
}

/// The offset to the end of the maximally-large Pd array.
/// This can be computed ahead of time.
const PAGES_START: usize = ...;

impl Allocator for ReamAlloc {
  unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
    // We don't need to bother with alignment, because every page is
    // already infinitely aligned; we only allocate at the page
    // boundary.
    let page_count = align_up(size, 4096) / 4096;

    // Find the first ream in the list that's big enough.
    // (Making `List` iterable is an easy exercise.)
    for pd in &mut self.free_list {
      if pd.len < page_count - 1 {
        continue
      }
      if pd.len == page_count - 1 {
        // Exact fit; no need to split here.
        pd.unlink();
        return pd.data();
      }

      // We can chop off the *end* of the ream to avoid needing
      // to update any pointers: shrink this ream, then hand out
      // the pages past its new end.
      pd.len -= page_count;
      let new_ream = pd.add(pd.len + 1);
      new_ream.len = page_count - 1;

      return new_ream.data();
    }

    // No ream was big enough: allocate a max-sized ream and split
    // it as above. This is more of the same arena stuff, so it's
    // elided here.
    unimplemented!()
  }
  unsafe fn free(&mut self, ptr: *mut u8) {
    // Find the Pd corresponding to this page's pointer. This
    // will always be a ream's first Pd assuming the user
    // didn't give us a bad pointer.
    let pd = Pd::from_ptr(ptr);
    self.free_list.push(pd);
  }
}
Rust

This presents a problem: over time, reams will shrink and never grow, and eventually there will be nothing left but single pages.

To fix this, we can merge reams (not yet implemented in Alkyne). Thus:

  1. Find two adjacent, unallocated reams.
  2. Unlink the second ream from the free list.
  3. Increase the length of the first ream by the number of pages in the second.

// `reams()` returns an iterator that walks the `Pd` array using
// the `len` fields to find the next ream each time.
for pd in self.reams() {
  if pd.gc_bits != 0 { continue }
  loop {
    let next = pd.add(pd.len + 1);
    if next.gc_bits != 0 { break }
    next.unlink();
    pd.len += next.len + 1;
  }
}
Rust

We don’t need to do anything to the second ream’s Pd; by becoming part of the first ream, it is subsumed. Walking the heap requires using reams’ lengths to skip over currently-invalid Pds, anyways.

We have two options for finding mergeable reams. Either we can walk the entire heap, as above, or, when a ream is freed, we can check that the previous and following reams are mergeable (finding the previous ream would require storing the length of a ream at its first and last Pd).

Which merging strategy we use depends on whether we’re implementing an ordinary malloc-like heap or a garbage collector; in the malloc case, merging on free makes more sense, but merging in one shot makes more sense for Alkyne’s GC (we’ll see why in a bit).

Slabs and Size Classes

A slab allocator is a specialized allocator that allocates a single type of object; they are quite popular in kernels as pools of commonly-used objects. The crux of a slab allocator is that, because everything is the same size, we don’t need to implement splitting and merging. The BlockCache above is actually a very primitive slab allocator.

Our Pd array is also kind of like a slab allocator: instead of mixing the descriptors in with the variably-sized blocks, they all live together with no gaps in between; entire pages are dedicated just to Pds.

The Alkyne page allocator cannot allocate pages smaller than 4K, and making them any smaller increases the relative overhead of a Pd. To cut down on book-keeping, we slab-allocate small objects by defining size classes.

A size class is a size of smaller-than-a-page object that Alkyne will allocate; other sizes are rounded up to the next size class. Entire pages are dedicated to holding just objects of the same size; these are called small object pages, or simply slabs. The size class is tracked with the class field of the Pd.

Each size class has its own free list of partially-filled slabs of that size. For slabs, the gc_bits field becomes a bitset that tracks which slots in the page are currently in use, reducing the overhead for small objects to only a little over a single bit each!

In the diagram below, bits set in the 32-bit, little-endian bitset indicate which slots in the slab (not to scale!) are filled with three-letter words. (The user likes cats.)

  Pd--------------------------------------------+
  | gc_bits: 0b01010011111010100110000010101011 |
  +---------------------------------------------+

 Page--------------------------------------------+
 | cat | ink |     | hat |     | jug |     | fig |
 +-----------------------------------------------+
 |     |     |     |     |     | zip | net |     |
 +-----------------------------------------------+
 |     | aid |     | yes |     | war | cat | van |
 +-----------------------------------------------+
 | can | cat |     |     | rat |     | urn |     |
 +-----------------------------------------------+
Plaintext

Allocating an object is a bit more complicated now, but now we have a really, really short fast path for small objects:

  1. Round up to the next highest size class, or else to the next page boundary.
  2. If it is a slab size class:
     a. Check the pertinent slab free list for a partially-filled slab.
        i. If there isn’t one, allocate a page per the instructions below and initialize it as a slab page.
     b. Find the next available slot with (!gc_bits).trailing_zeros(), and set that bit.
     c. Return page_addr + slab_slot_size * slot.
  3. Else, if it is a single page, allocate from the single-page list.
     a. If there isn’t one, allocate from the ream list as usual.
  4. Else (multiple pages), allocate a ream as usual.

Allocating small objects is very fast, since the slab free lists, if not empty, will always have a spot to fill in gc_bits. Finding the empty spot in the bitset is a few instructions (a not plus a ctz, or equivalent, on most hardware).
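
To make that fast path concrete, here is a rough sketch of the slab branch (helper names like SizeClass::for_size are assumptions of mine, and the "no partially-filled slab" case is elided):

unsafe fn alloc_small(&mut self, size: usize) -> *mut u8 {
  // 1. Round up to the next size class (a power of two, at least 8).
  let class_size = size.next_power_of_two().max(8);
  let class = SizeClass::for_size(class_size); // assumed helper

  // 2a. Grab a partially-filled slab of this class. If there isn't
  // one, we would allocate a fresh page and turn it into a slab
  // (elided here).
  let pd = self.slab_free_lists[class].first().unwrap();

  // 2b. The zero bits of gc_bits are the free slots; take the lowest.
  let slot = (!pd.gc_bits).trailing_zeros() as usize;
  pd.gc_bits |= 1 << slot;

  // A newly-full slab leaves its free list.
  if pd.is_full() {
    pd.unlink();
  }

  // 2c. page_addr + slab_slot_size * slot.
  pd.data().add(class_size * slot)
}
Rust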

Alkyne maintains a separate free list for single free pages to speed up finding such pages to turn into fresh slabs. This also minimizes the need to allocate single pages off of large reams, which limits fragmentation.

Alkyne’s size classes are the powers of two from 8 (the smallest possible object) to 2048. For the classes 8, 16, and 32, which would have more than 64 slots in a page, we use up to 56 bytes on the page itself to extend gc_bits; a page of 8-byte objects can therefore only hold 505 of them instead of the full 512, about a 1% overhead.

Directly freeing an object is now tricky, since we do not know its size a priori.

  1. Round the pointer down to the page boundary that contains it, and obtain that page’s Pd.
  2. If this is a start-of-ream page, stick it into the appropriate free list (single page or ream, depending on the size of the ream).
  3. Else, we can look at class to find the size class, and from that, and the offset of the original pointer into the page, the index of the slot.
  4. Clear the slot’s index in gc_bits.
  5. If the page was full before, place it onto the correct slab free list; if it becomes empty, place it into the page free list.

At this point, we know whether the page just became partially full or empty, and can move it to the correct free list.
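
A sketch of that free path, with the same caveats as the other sketches (in particular, class.size() is an assumed helper):

unsafe fn free(&mut self, ptr: *mut u8) {
  // 1. Find the descriptor for the page containing `ptr`.
  let pd = Pd::from_ptr(ptr);

  // 2. Whole reams just go back onto the right free list.
  if pd.class == SizeClass::Ream {
    if pd.len == 0 {
      self.page_free_list.push(pd);
    } else {
      self.ream_free_list.push(pd);
    }
    return;
  }

  // 3. Otherwise this is a slab object: recover its slot index.
  let offset = ptr as usize - pd.data() as usize;
  let slot = offset / pd.class.size(); // assumed helper
  let was_full = pd.is_full();

  // 4. Clear the slot's bit.
  pd.gc_bits &= !(1 << slot);

  // 5. Move the page to the right list if its state changed.
  if pd.gc_bits == 0 {
    pd.unlink();
    self.page_free_list.push(pd);
  } else if was_full {
    self.slab_free_lists[pd.class].push(pd);
  }
}
Rust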

Size classes are an important allocator optimization. TCMalloc takes this to an extreme: its table of size classes is enormous, and these constants are generated by some crazy script based on profiling data.

Intermission

Before continuing to the GC part of the article, it’s useful to go over what we learned.

A neat thing about this is that most of these tricks are somewhat independent. While giving feedback for an early draft, Matt Kulukundis shared this awesome talk that describes how to build complex allocators out of simple ones, and covers many of the same tropes as we did here. This perspective on allocators actually blew my mind.

Good allocators don’t just use one strategy; they use many, and pick and choose the best one for the job based on expected workloads. For example, Alkyne expects to allocate many small objects; the slab pages were originally only for float objects, but it turned out to simplify a lot of the code to make all small objects slab-allocated.

Even size classes are a deep topic: TCMalloc uses GWP telemetry from Google’s fleet to inform its many tuning parameters, including its comically large tables of size classes.

At this point, we have a pretty solid allocator. Now, let’s get rid of the free function.

Throwing out the Trash

Garbage collection is very different from manual memory management in that frees are performed in batches without cue from the user. There are no calls to free(); instead, we need to figure out which calls to free() we can make on the user’s behalf that they won’t notice (i.e., without quietly freeing pointers the user can still reach, resulting in a use-after-free bug). We need to do this as fast as we can.

Alkyne is a “tracing GC”. Tracing GCs walk the “object graph” from a root set of known-reachable objects. Given an object a, it will trace through any data in the object that it knows is actually a GC pointer. In the object graph, b is reachable from a if one can repeatedly trace through GC pointers to get from a to b.

Alkyne uses tracing to implement garbage collection in a two-step process, commonly called “mark-and-sweep”.

Marking consists of traversing the entire graph from a collection of reachable-by-definition values, such as things on the stack, and recording each object that is visited. Every object not so marked must therefore be definitely unreachable and can be reclaimed; this reclamation is called sweeping.

Alkyne reverses the order of operations somewhat: it “sweeps” first and then marks, i.e., it marks every value as dead and then, as it walks the graph, marks every block as alive. It then rebuilds the free lists to reflect the new marks, allowing the blocks to be reallocated. This is sometimes called “mark and don’t sweep”, but fixing up the free lists is effectively a sweeping step.

Marking and sweeping! (via Wikipedia, CC0)

Alkyne is a “stop-the-world” (STW) GC. It needs to pause all program execution while cleaning out the heap. It is possible to build GCs that do not do this (I believe modern HotSpot GCs very rarely stop the world), but also very difficult. Most GCs are world-stopping to some degree.

One thing we do not touch on is when to sweep. This is a more complicated and somewhat hand-wavy tuning topic that I’m going to quietly sweep under the rug by pointing you to how Go does it.

Heap Armageddon and Resurrection

Delicate freeing of individual objects is quite difficult, but scorching the earth is very easy. To do this, we walk the whole Pd array (see, I said this would be useful!) and blow away every gc_bits. This leaves the heap in a broken state where every pointer appears to be dangling. This is “armageddon”.

To fix this up, we need to “resurrect” any objects we shouldn’t have killed (oops). The roots are objects in the Alkyne interpreter stack11. To mark an object, we convert a pointer to it into a Pd via the page-Pd correspondence, and mark it as alive by “allocating” it.
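
The allocator-side half of marking is small; roughly (again, a sketch with assumed helpers, not Alkyne’s exact code):

/// "Resurrects" the object `ptr` points to by marking it allocated.
unsafe fn mark_alive(&mut self, ptr: *mut u8) {
  let pd = Pd::from_ptr(ptr);

  if pd.class == SizeClass::Ream {
    // Reams live and die as a unit.
    pd.gc_bits = 1;
  } else {
    // Slab objects get their individual slot bit set, exactly as if
    // they had just been allocated.
    let slot = (ptr as usize - pd.data() as usize) / pd.class.size();
    pd.gc_bits |= 1 << slot;
  }
}
Rust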

We then use our knowledge12 of Alkyne objects’ heap layout to find pointers to other objects in the heap (for example, the interpreter knows it’s looking at a list and can just find the list elements within, which are likely pointers themselves). If we trace into an object and find it has already been marked as allocated, we don’t recurse; this avoids infinite recursion when encountering cycles.

It’s a bit hard to give a code example for the tracing itself, because the “mark” part that’s part of the GC is mixed up with interpreter code, so there isn’t much to show in this case. :(

At the end of this process, every reachable object will once again be alive, but anything we couldn’t reach stays dead.

Instant Apocalypse

(Alkyne currently does not make this optimization, but really should.)

Rather than flipping every bit, we flip the global convention for whether 0 or 1 means “alive”, implemented by having a global bool specifying which is which at any given time; this would alternate from sweep to sweep. Thus, killing every living object is now a single operation.

This works if the allocated bit of objects in the free lists is never read, and only ever overwritten with the “alive” value when allocated, so that all of the dead objects suddenly becoming alive isn’t noticed. This does not work with slab-allocated small objects: pages may be in a mixed state where they are partially allocated and partially freed.

We can still make this optimization by adding a second bit that tracks whether the page contains any living objects, using the same convention. This allows delaying the clear of the allocated bits for small objects to when the page is visited, which also marks the whole page as alive.

Pages that were never visited (i.e., still marked as dead) can be reclaimed as usual, ignoring the allocated bits.
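
A sketch of what this might look like (Alkyne doesn’t implement this yet, and the page_alive field is made up):

struct Gc {
  /// Which bit value currently means "alive"; flips every collection,
  /// so condemning the entire heap is a single store.
  alive: u64,
}

impl Gc {
  fn armageddon(&mut self) {
    // Everything marked with the old convention is now "dead".
    self.alive ^= 1;
  }

  unsafe fn mark_page_alive(&self, pd: &mut Pd) {
    // `page_alive` is the hypothetical second bit, tracked with the
    // same flipping convention as the per-object bits.
    if pd.page_alive != self.alive {
      // First visit since armageddon: lazily clear the stale
      // per-object bits, and mark the whole page as alive.
      pd.page_alive = self.alive;
      pd.gc_bits = 0;
    }
  }
}
Rust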

Free List Reconciliation

At this point, no pointers are dangling, but newly emptied out pages are not in the free lists they should be in. To fix this, we can walk over all Pds and put them where they need to go if they’re not full. This is the kinda-but-not-really sweep phase.

The code for this is simpler to show than to explain:

for pd in self.reams() {
  if pd.gc_bits == 0 {
    pd.unlink();
    if pd.len == 0 {
      unsafe { self.page_free_list.push(pd) }
    } else {
      unsafe { self.ream_free_list.push(pd) }
    }
  } else if pd.is_full() {
    // GC can't make a not-full page become full, so we don't
    // need to move it.
  } else {
    // Non-empty, non-full pages cannot be reams.
    debug_assert!(pd.class != SizeClass::Ream);

    pd.unlink();
    unsafe {
      self.slab_free_lists[pd.class].push(pd)
    }
  }
}
Rust

Of course, this will also shuffle around all pages that did not become partially empty or empty while marking. If the “instant apocalypse” optimization is used, this step must still inspect every Pd and modify the free lists.

However, it is a completely separate phase: all it does is find pages that did not survive the previous mark phase. This means that user code can run between the phases, reducing latency. If it turns out to be very expensive to sweep the whole heap, it can even be run less often than mark phases13.

This is also a great chance to merge reams, because we’re inspecting every page anyways; this is why the merging strategy depends on wanting to be a GC’s allocator rather than a normal malloc()/free() allocator.

…and that’s it! That’s garbage collection. The setup of completely owning the layout of blocks in the allocator allows us to cut down significantly on the memory needed to track objects in the heap, while keeping the mark and sweep steps short and sweet. A garbage collector is like any other data structure: you pack a lot of complexity into the invariants to make the actual operations very quick.

Conclusion

Alkyne’s GC is intended to be super simple because I didn’t want to think too hard about it (even though I clearly did lmao). The GC layouts are a whole ‘nother story I have been swirling around in my head for months, which is described here. The choices made there influenced the design of the GC itself.

There are still many optimizations to make, but it’s a simple yet realistic GC design, and I’m pretty happy with it!

A Note on Finalizers (Tools of the Devil!)

Alkyne also does not provide finalizers. A finalizer is the GC equivalent of a destructor: it gets run after the GC declares an object dead. Finalizers complicate a GC significantly by their very nature; they are called in unspecified orders and can witness broken GC state; they can stall the entire program (if they are implemented to run during the GC pause in a multi-threaded GC) or else need to be called with a zombie argument that either can’t escape the finalizer or, worse, must be resurrected if it does!

If finalizers depend on each other, they can’t be run at all, for the same reason an ARC cycle cannot be broken; this weakness of ARC is one of the major benefits of an omniscient GC.

Java’s documentation for Object.finalize() is a wall of text of lies, damned lies, and ghost stories.

I learned earlier (the week before I started writing this article) that Go ALSO has finalizers and that they are similarly cursed. Go does behave somewhat more nicely than Java (finalizers are per-value and avoid zombie problems by unconditionally resurrecting objects with a finalizer).

Further Reading

Here are some other allocators that I find interesting and worth reading about, some of which have inspired elements of Alkyne’s design.

TCMalloc is Google’s crazy thread-caching allocator. It’s really fast and really cool, but I work for Google, so I’m biased. But it uses radix trees! Radix trees are cool!!!

Go has a garbage collector that has well-known performance properties but does not perform any wild optimizations like moving, and is a world-stopping, incremental GC.

Oilpan is the Chronimum renderer’s GC (you know, for DOM elements). It’s actually grafted onto C++ and has a very complex API reflective of the subtleties of GCs as a result.

libboehm is another C/C++ GC written by Hans Boehm, one of the world’s top experts on concurrency.

Orinoco is V8’s GC for the JavaScript heap (i.e., Chronimum’s other GC). It is a generational or moving GC that can defragment the heap over time by moving things around (and updating pointers). It also has a separate sub-GC just for short-lived objects.

Mesh is a non-GC allocator that can do compacting via clever use of mmap(2).

upb_Arena is an arena allocator that uses free lists to allow fusing arenas together. This is part of the μpb Protobuf runtime.

  1. In other words, it uses recursion along a syntax tree, instead of a more efficient approach that compiles the program down to bytecode. 

  2. Automatic Reference Counting is an automatic memory management technique where every heap allocation contains a counter of how many pointers currently point to it; once pointers go out of scope, they decrement the counter; when the counter hits zero the memory is freed.

    This is used by Python and Swift as the core memory management strategy, and provided by C++ and Rust via the std::shared_ptr<T> and Arc<T> types, respectively. 

  3. This is the file. It’s got fairly complete comments, but they’re written for an audience familiar with allocators and garbage collectors. 

  4. This is a tangent, but I should point out that Alkyne does not do “NaN boxing”. This is a technique used by some JavaScript runtimes, like Spidermonkey, which represent dynamically typed values as either ordinary floats, or pointers hidden in the mantissas of 64-bit IEEE 754 signaling NaNs.

    Alkyne instead uses something like V8’s Smi pointer compression, so our heap values are four bytes, not eight. Non-Smi values that aren’t on the stack (which uses a completely different representation) can only exist as elements of lists or objects. Alkyne’s slab allocator design (described below) is focused on trying to minimize the overhead of all floats being in their own little allocations. 

  5. The operating system’s own physical page allocator is actually solving the same problem: given a vast range of memory (in this case, physical RAM), allocate it. The algorithms in this article apply to those, too.

    Operating system allocators can be slightly fussier because they need to deal with virtual memory mappings, but that is a topic for another time. 

  6. As you might expect, these scare-quotes are load-bearing. 

  7. I tried leaving this out of the first draft, and failed. So many things would be simpler without fussing around with alignment. 

  8. Yes yes most architectures can cope with unaligned loads and stores but compilers rather like to pretend that’s not true. 

  9. Boutique means provenance in French. 

  10. Currently Alkyne has a rather small max ream size. A better way to approach this would be to treat the entire heap as one gigantic ream at the start, which is always at the bottom of the free list. 

  11. In GC terms, these are often called “stack roots”. 

  12. The interpreter simply knows this and can instruct the GC appropriately.

    In any tracing GC, the compiler or interpreter must be keenly aware of the layouts of types so that it can generate the appropriate tracing code for each.

    This is why grafting GCs to non-GC’d languages is non-trivial, even though people have totally done it: libboehm and Oilpan are good (albeit sometimes controversial) examples of how this can be done. 

  13. With “instant apocalypse”, this isn’t quite true; after two mark phases, pages from the first mark phase will appear to be alive, since the global “alive” convention has changed twice. Thus, only pages condemned in every other mark phase will be swept; sweeping is most optimal after an odd number of marks.