The Alkyne GC

Alkyne is a scripting language I built a couple of years ago for generating configuration blobs. Its interpreter is a naive AST walker1 that uses ARC2 for memory management, so it’s pretty slow, and I’ve been gradually writing a new evaluation engine for it.

This post isn’t about Alkyne itself; that’s for another day. For now, I’d like to write down some notes on the GC I wrote3 for it, and more generally provide an introduction to memory allocators (especially those that would want to collude with a GC).

This post is intended for people familiar with the basics of low-level programming, such as pointers and syscalls. Alkyne’s GC is intended to be simple while still having reasonable performance. This means that the design contains all the allocator “tropes,” but none of the hairy stuff.

My hope is readers new to allocators or GCs will come away with an understanding of these tropes and the roles they play in a modern allocator.

Thank you to James Farrell, Manish Goregaokar, Matt Kulukundis, JeanHeyd Meneide, Skye Thompson, and Paul Wankadia for providing feedback on various drafts of this article. This was a tough one to get right. :)

Trailhead

The Alkyne GC is solving a very specific problem, which allows us to limit what it actually needs to do. Alkyne is an “embeddable” language like JavaScript, so its heap is not intended to be big; in fact, for the benefit of memory usage optimizations, it’s ideal to use 32-bit pointers (a 4 gigabyte address space).

The heap needs to be able to manage arbitrarily-large allocations (for lists), and allocations as small as eight bytes (for floats4). Allocation should be reasonably quick, but due to the size of the heap, walking the entire heap is totally acceptable.

Because we’re managing a fixed-size heap, we can simply ask the operating system for a contiguous block of that size up-front using the mmap() syscall. An Alkyne pointer is simply a 32-bit offset into this giant allocation, which can be converted to and from a genuine CPU pointer by adding or subtracting the base address of the heap.

 4GB Heap
 +-------------------------------------------------+
 |                x                                |
 +-------------------------------------------------+
 ^                ^
 base             base + ptr_to_x
Plaintext

The OS won’t actually reserve 4GB of memory for us; it will only allocate one system page (4KB) at a time. If we read or write to a particular page in the heap for the first time, the OS will only then find physical RAM to back it5.

Throughout, we’ll be working with this fixed-size heap, and won’t think too hard about where it came from. For our purposes, it is essentially a Box<[u8]>, but we’ll call it a Heap<[u8]> to make it clear this is memory we got from the operating system (though the entire discussion applies just as well to an ordinary gigantic Box<[u8]>).
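
To make the offset representation concrete, here is a rough sketch of converting between heap offsets and real pointers. The names here (RawHeap, GcPtr) are illustrative stand-ins, not Alkyne’s actual types.

/// Illustrative stand-ins; not Alkyne's real types.
struct RawHeap {
  base: *mut u8, // base address of the mmap(2)'d region
  len: usize,    // 4GB
}

/// A 32-bit "pointer": an offset into the heap.
#[derive(Copy, Clone)]
struct GcPtr(u32);

impl RawHeap {
  /// Offset -> real CPU pointer: add the base address.
  unsafe fn to_raw(&self, ptr: GcPtr) -> *mut u8 {
    self.base.add(ptr.0 as usize)
  }

  /// Real CPU pointer -> offset: subtract the base address.
  fn from_raw(&self, raw: *mut u8) -> GcPtr {
    GcPtr((raw as usize - self.base as usize) as u32)
  }
}
Rust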

The Alkyne language does not have threads, so we can eschew concurrency. This significantly reduces the problems we will need to solve. Most modern allocators and garbage collectors are violently concurrent by nature, and unfortunately, much too advanced for one article. There are links below to fancier GCs you can poke around in.

A Heap of Trouble

To build a garbage collector, we first need an allocator. We could “just”6 use the system heap as a source of pages, but most garbage collectors collude with the allocator, since they will want to use similar data structures. Thus, if we are building a garbage collector, we might as well build the allocator too.

An allocator, or “memory heap” (not to be confused with a min-heap, an unrelated but wicked data structure), services requests for allocations: unique leases of space in the managed heap of various sizes, which last for lifetimes not known until runtime. These allocations may also be called objects, and a heap may be viewed as a general-purpose object pool.

The most common API for a heap is:

trait Allocator {
  // Returns a *unique pointer* managed by this allocator
  // to memory as large as requested, and as aligned
  // as we'd like.
  // 
  // Returns null on failure.
  unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8;
  // Frees a pointer returned by `alloc()`. May be called at
  // most once per pointer.
  unsafe fn free(&mut self, ptr: *mut u8);
}
Rust

Originally the examples were in C++, which I feel is more accessible (lol), but given that Alkyne itself is written in Rust, I felt using Rust would make the story flow better.

This is the “malloc” API, which is actually very deficient; ideally, we would do something like Rust’s Allocator, which requires providing size and alignment to both the allocation and deallocation functions.
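
Something in that spirit might look roughly like the sketch below; this is a simplification, not the real std::alloc::Allocator signatures.

// A sketch of a size-and-alignment-aware API, in the spirit of
// Rust's Allocator trait (simplified; not the real signatures).
trait SizedAllocator {
  /// Allocates `size` bytes aligned to `align`.
  unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8;

  /// Frees a pointer previously returned by `alloc`, passing back
  /// the *same* size and alignment so the allocator doesn't have to
  /// remember them itself.
  unsafe fn free(&mut self, ptr: *mut u8, size: usize, align: usize);
}
Rust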

Unfortunately7, this means I need to explain alignment.

Good Pointers, Evil Pointers, Lawful Pointers, Chaotic Pointers

“Alignment” is a somewhat annoying property of a pointer. A pointer is aligned to N bytes (always a power of 2) if its address is divisible by N. A pointer is “well-aligned” (or just “aligned”) if its address is aligned to the natural alignment of the thing it points to. For ints, this is usually their size; for structs, it is the maximum alignment among the alignments of the fields of that struct.

Performing operations on a pointer requires that it be aligned8. This is annoying because it requires some math. Specifically, we need four helper functions:

/// Checks that `ptr` is aligned to an alignment.
fn is_aligned(ptr: Int, align: usize) -> bool {
  ptr & (align - 1) == 0
}

/// Rounds `ptr` down to a multiple of `align`.
fn align_down(ptr: Int, align: usize) -> Int {
  ptr & !(align - 1)
}

/// Rounds `ptr` up to a multiple of `align`.
fn align_up(ptr: Int, align: usize) -> Int {
  // (I always look this one up. >_>)
  align_down(ptr + align - 1, align)
}

/// Computes how much needs to be added to `ptr` to align it.
fn misalign(ptr: Int, align: usize) -> usize {
  align_up(ptr, align) - ptr
}
Rust

(Exercise: prove these formulas.)

For the rest of the article I will assume I have these functions available at any time for whatever type of integer I’d like (including raw pointers, which are just boutique9 integers).

Also, we will treat the Heap<[u8]> holding our entire heap as being infinitely aligned; that is, as a pointer it is aligned to all possible alignments that could matter (in practice, page-aligned to 4KB). (For an ordinary Box<[u8]>, this is not true.)

The Trivial Heap

Allocating memory is actually very easy. Arenas are the leanest and meanest in the allocator food chain; they simply don’t bother to free any memory.

This means allocation is just incrementing a cursor indicating where the hitherto-unallocated memory is.

 +-------------------------------------------------+
 | Allocated        | Free                         |
 +-------------------------------------------------+
                    ^
                    cursor
Plaintext

Our allocator is as simple as return ptr++;.

This is straightforward to implement in code:

struct Arena {
  heap: Heap<[u8]>,
  cursor: usize,
}

impl Allocator for Arena {
  unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
    // To get an aligned pointer, we need to burn some "alignment
    // padding". This is one of the places where alignment is
    // annoying. (The heap base is infinitely aligned, so only the
    // cursor's misalignment matters.)
    let needed = size + misalign(self.cursor, align);

    // Check that we're not out of memory.
    if self.heap.len() - self.cursor < needed {
      return ptr::null_mut();
    }

    // Advance the cursor and cut off the end of the allocated
    // section.
    self.cursor += needed;
    &mut self.heap[self.cursor - size] as *mut u8
  }

  unsafe fn free(&mut self, ptr: *mut u8) {
    // ayy lmao
  }
}
Rust

Arenas are very simple, but far from useless! They’re great for holding onto data that exists for the context of a “session”, such as for software that does lots of computations and then exits (a compiler) or software that handles requests from clients, where lots of data lives for the duration of the request and no longer (a webserver).

They are not, however, good for long-running systems. Eventually the heap will be exhausted if objects are not recycled.

Making this work turns out to be hard[citation-needed]. This is the “fundamental theorem” of allocators:

Fundamental “Theorem” of Allocators

Handing out memory is easy. Handing it out repeatedly is hard.

Thankfully, over the last fifty years we’ve mostly figured this out. Allocator designs can get pretty gnarly.

Four Tropes

From here, we will gradually augment our allocator with more features to allow it to service all kinds of requests. For this, we will implement four common allocator features:

  1. Blocks and a block cache.
  2. Free lists.
  3. Block merging and splitting.
  4. Slab allocation.

All four of these are present in some form in most modern allocators.

Blocks

The first thing we should do is to deal in fixed-size blocks of memory of some minimum size. If you ask malloc() for a single byte, it will probably give you like 8 bytes on most systems. No one is asking malloc() for single bytes, so we can quietly round up and not have people care. (Also, Alkyne’s smallest heap objects are eight bytes, anyways.)

Blocks are also convenient, because we can keep per-block metadata on each one, as a header before the user’s data:

#[repr(C)]
struct Block {
  header: Header,
  data: [u8; BLOCK_SIZE],
}
Rust

To allow blocks to be re-used, we can keep a cache of recently freed blocks. The easiest way to do this is with a stack. Note that the heap is now made of Blocks, not plain bytes.

To allocate storage, first we check the stack. If the stack is empty, we revert to being an arena and increment the cursor. To free, we push the block onto the stack, so alloc() can return it on the next call.

struct BlockCache {
  heap: Heap<[Block]>,
  cursor: usize,
  free_stack: Vec<*mut Block>,
}

impl Allocator for BlockCache {
  unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
    // Check that the size isn't too big. We don't need to
    // bother with alignment, because every block is
    // infinitely-aligned, just like the heap itself.
    if size > BLOCK_SIZE {
      return ptr::null_mut();
    }

    // Try to serve a block from the stack.
    if let Some(block) = self.free_stack.pop() {
      return (*block).data.as_mut_ptr();
    }

    // Fall back to arena mode.
    if self.cursor == self.heap.len() {
      return ptr::null_mut();
    }
    self.cursor += 1;
    self.heap[self.cursor - 1].data.as_mut_ptr()
  }

  unsafe fn free(&mut self, ptr: *mut u8) {
    // Use pointer subtraction to find the start of the block.
    let block = ptr.sub(size_of::<Header>()) as *mut Block;
    self.free_stack.push(block);
  }
}
Rust

This allocator has a problem: it relies on the system allocator! Heap came directly from the OS, but Vec talks to malloc() (or something like it). It also adds some pretty big overhead: the Vec needs to be able to resize, since it grows as more and more things are freed. This can lead to long pauses during free() as the vector resizes.

Cutting out the middleman gives us more control over this overhead.

Free Lists

Of course, no one has ever heard of a “free stack”; everyone uses free lists! A free list is the cache idea but implemented as an intrusive linked list.

A linked list is this data structure:

enum Node<T> {
  Nil,
  Cons(Box<(T, Node<T>)>),
  //   ^~~ oops I allocated again
}
Rust

This has the same problem of needing to find an allocator to store the nodes. An intrusive list avoids that by making the nodes part of the elements. The Header we reserved for ourselves earlier is the perfect place to put it:

struct Header {
  /// Pointer to the next and previous blocks in whatever
  /// list the block is in.
  next: *mut Block,
  prev: *mut Block,
}
Rust

In particular we want to make sure blocks are in doubly-linked lists, which have the property that any element can be removed from them without walking the list.

   list.root
     |
     v
 +-> Block--------------------------+
 |   | next | null | data data data |
 |   +------------------------------+
 +-----/------+
      /       |
     v        |
 +-> Block--------------------------+
 |   | next | prev | data data data |
 |   +------------------------------+
 +-----/------+
      /       |
     v        |
 +-> Block--------------------------+
 |   | next | prev | data data data |
 |   +------------------------------+
 +-----/------+
      /       |
     v        |
     Block--------------------------+
     | null | prev | data data data |
     +------------------------------+
Plaintext

We also introduce a List container type that holds the root node of a list of blocks, to give us a convenient container-like API. This type looks like this:

struct List {
  /// The root is actually a sacrificial block that exists only to
  /// make it possible to unlink blocks in the middle of a list. This
  /// needs to exist so that calling unlink() on the "first" element
  /// of the list has a predecessor to replace itself with.
  root: *mut Block,
}

impl List {
  /// Pushes a block onto the list.
  unsafe fn push(&mut self, block: *mut Block) {
    let block = &mut *block;
    let root = &mut *self.root;
    if !root.header.next.is_null() {
      let first = &mut *root.header.next;
      block.header.next = first;
      first.header.prev = block;
    }
    root.header.next = block;
    block.header.prev = root;
  }

  /// Gets the first element of the list, if any.
  unsafe fn first(&mut self) -> Option<&mut Block> {
    let root = &mut *self.root;
    root.header.next.as_mut()
  }
}
Rust

We should also make it possible to ask a block whether it is in any list, and if so, remove it.

impl Block {
  /// Checks if this block is part of a list.
  fn is_linked(&self) -> bool {
    // Only the prev link is guaranteed to exist; next is
    // null for the last element in a list. Sacrificial
    // nodes will never have prev non-null, and can't be
    // unlinked.
    !self.header.prev.is_null()
  }

  /// Unlinks this linked block from whatever list it's in.
  unsafe fn unlink(&mut self) {
    assert!(self.is_linked());
    if !self.header.next.is_null() {
      let next = &mut *self.header.next;
      next.header.prev = self.header.prev; 
    }

    // This is why we need the sacrificial node.
    let prev = &mut *self.header.prev;
    prev.header.next = self.header.next;

    self.header.prev = ptr::null_mut();
    self.header.next = ptr::null_mut();
  }
}
Rust

Using these abstractions we can upgrade BlockCache to FreeList. We only need to rename free_stack to free_list, and make a one-line change:

- if let Some(block) = self.free_stack.pop() {
+ if let Some(block) = self.free_list.first() {
+   block.unlink();
    return (*block).data.as_mut_ptr();
 }
Rust

Hooray for encapsulation!

This is a very early malloc() design, similar to the one described in the K&R C book. It does have a big blind spot: it can only serve up blocks up to a fixed size! It’s also quite wasteful, because all allocations are served the same size blocks: the bigger we make the maximum request, the more wasteful alloc(8) gets.

Block Splitting (Alkyne’s Way)

The next step is to use a block splitting/merging scheme, such as the buddy system. Alkyne does not precisely use a buddy system, but it does something similar.

Alkyne does not have fixed-size blocks. Like many allocators, it defines a “page” of memory as the atom out of which it builds its data structures. Alkyne defines a page to be 4KB, but other choices are possible: TCMalloc uses 8KB pages.

In Alkyne, pages come together to form contiguous, variable-size reams (get it?). These take the place of blocks.

Page Descriptors

Merging and splitting makes it hard to keep headers at the start of reams, so Alkyne puts them all in a giant array somewhere else. Each page gets its own “header” called a page descriptor, or Pd.

The array of page descriptors lives at the beginning of the heap, and the actual pages follow after that. It turns out that this array has a maximum size, which we can use to pre-compute where the array ends.

Currently, each Pd is 32 bytes, in addition to the 4KB it describes. If we divide 4GB by 32 + 4K, it comes out to around one million pages (1040447 to be precise). Rounded up to the next page boundary, this means that pages begin at the 8129th 4K boundary after the Heap<[u8]> base address, or an offset of 0x1fc1000 bytes.

Having them all in a giant array is also very useful, because it means the GC can trivially find every allocation in the whole heap: just iterate the Pd array!

So! This is our heap:

+---------------------------------------+  <-- mmap(2)'d region base
| Pd | Pd | Pd | Pd | Pd | Pd | Pd | Pd | \
+---------------------------------------+ |--- Page Descriptors
| Pd | Pd | Pd | Pd | Pd | Pd | Pd | Pd | |    for every page we can
+---------------------------------------+ |    ever allocate.
: ...                                   : |
+---------------------------------------+ |
| Pd | Pd | Pd | Pd | Pd | Pd | Pd | Pd | /
+---------------------------------------+  <-- Heap base address
| Page 0                                | \    = region + 0x1fc1000
|                                       | |
|                                       | |--- 4K pages corresponding
+---------------------------------------+ |    to the Pds above.
| Page 1                                | |    (not to scale)
|                                       | |
|                                       | |
+---------------------------------------+ |
: ...                                   | |
+---------------------------------------+ |
| Page N                                | |
|                                       | |
|                                       | /
+---------------------------------------+
  (not to scale by a factor of about 4)
Plaintext

Each one of those little Pds looks something like this:

#[repr(C)]
struct Pd {
  gc_bits: u64,
  prev: Option<u32>,
  next: Option<u32>,
  len: u16,
  class: SizeClass,
  // More fields...
}
Rust

prev and next are the intrusive linked list pointers used for the free lists, but now they are indices into the Pd array. The other fields will be used for this and the trope that follows.

Given a pointer into a page, we can get the corresponding Pd by align_down()‘ing to a page boundary, computing the index of the page (relative to the heap base), and then indexing into the Pd array. This process can be reversed to convert a pointer to a Pd into a pointer to a page, so going between the two is very easy.
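
Concretely, that correspondence might look something like the following sketch; the constants assume the layout computed above, and the function names are illustrative.

const PAGE_SIZE: usize = 4096;
const PD_SIZE: usize = 32;
// Where page 0 begins: the end of the maximally-large Pd array,
// rounded up to a page boundary (the 0x1fc1000 offset from above).
const PAGES_START: usize = 8129 * PAGE_SIZE;

/// Pointer into a page -> index of that page (and of its Pd).
fn page_index(heap_base: usize, ptr: usize) -> usize {
  let offset = ptr - heap_base;
  let page_start = offset & !(PAGE_SIZE - 1); // align_down()
  (page_start - PAGES_START) / PAGE_SIZE
}

/// Index -> address of the corresponding Pd in the descriptor array.
fn pd_addr(heap_base: usize, index: usize) -> usize {
  heap_base + index * PD_SIZE
}

/// Index -> address of the page itself.
fn page_addr(heap_base: usize, index: usize) -> usize {
  heap_base + PAGES_START + index * PAGE_SIZE
}
Rust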

I won’t cover this here, but Alkyne actually wraps Pd pointers in a special PdRef type that also carries a reference to the Heap; this allows implementing functions like is_linked(), unlink(), and data() directly.

I won’t show how this is implemented, since it’s mostly boilerplate.

Reams of Pages

There is one giant free list that contains all of the reams. Reams use their first page’s descriptor to track all of their metadata, including the list pointers for the free list. The len field tracks how many additional pages (beyond the first) are in the ream. gc_bits is set to 1 if the page is in use and 0 otherwise.

To allocate N continuous pages from the free ream list:

  1. We walk through the free ream list, and pick the first one with at least N pages.
  2. We “split” it: the first N pages are returned to fulfill the request.
  3. The rest of the ream is put back into the free list.
  4. If no such ream exists, we allocate a max-sized ream10 (65536 pages), and split that as above.

In a sense, each ream is an arena that we allocate smaller reams out of; those reams cannot be “freed” back to the ream they came from. Instead, to free a ream, we just stick it back on the main free list.

If we ever run out, we turn back into an arena and initialize the next uninitialized Pd in the big ol’ Pd array.

struct ReamAlloc {
  heap: Heap<[Page]>,
  cursor: usize,
  free_list: List,
}

/// The offset to the end of the maximally-large Pd array.
/// This can be computed ahead of time.
const PAGES_START: usize = ...;

impl Allocator for ReamAlloc {
  unsafe fn alloc(&mut self, size: usize, align: usize) -> *mut u8 {
    // We don't need to bother with alignment, because every page is
    // already infinitely aligned; we only allocate at the page
    // boundary.
    let page_count = align_up(size, 4096) / 4096;

    // Find the first page in the list that's big enough.
    // (Making `List` iterable is an easy exercise.)
    for pd in &mut self.free_list {
      if pd.len < page_count - 1 {
        continue
      }
      if pd.len == page_count - 1 {
        // No need to split here.
        pd.unlink();
        return pd.data();
      }

      // We can chop off the *end* of the ream to avoid needing
      // to update any pointers.
      let new_ream = pd.add(pd.len + 1 - page_count);
      new_ream.len = page_count - 1;
      pd.len -= page_count;

      return new_ream.data();
    }

    // Allocate a new ream. This is more of the same arena stuff.
  }
  unsafe fn free(&mut self, ptr: *mut u8) {
    // Find the Pd corresponding to this page's pointer. This
    // will always be a ream's first Pd assuming the user
    // didn't give us a bad pointer.
    let pd = Pd::from_ptr(ptr);
    self.free_list.push(pd);
  }
}
Rust

This presents a problem: over time, reams will shrink and never grow, and eventually there will be nothing left but single pages.

To fix this, we can merge reams (not yet implemented in Alkyne). Thus:

  1. Find two adjacent, unallocated reams.
  2. Unlink the second ream from the free list.
  3. Increase the length of the first ream by the number of pages in the second.

// `reams()` returns an iterator that walks the `Pd` array using
// the `len` fields to find the next ream each time.
for pd in self.reams() {
  if pd.gc_bits != 0 { continue }
  loop {
    let next = pd.add(pd.len + 1);
    if next.gc_bits != 0 { break }
    next.unlink();
    pd.len += next.len + 1;
  }
}
Rust

We don’t need to do anything to the second ream’s Pd; by becoming part of the first ream, it is subsumed. Walking the heap requires using reams’ lengths to skip over currently-invalid Pds, anyways.
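
For reference, a ream-by-ream walk might look something like this sketch; the real iterator is wrapped up in the reams() helper used above.

/// A sketch of walking the heap ream-by-ream. `pds` stands for the
/// initialized prefix of the Pd array (illustrative).
fn for_each_ream(pds: &mut [Pd], mut f: impl FnMut(&mut Pd)) {
  let mut i = 0;
  while i < pds.len() {
    f(&mut pds[i]);
    // `len` counts *additional* pages, so this ream covers `len + 1`
    // descriptors; skipping them also skips anything `f` just merged
    // into this ream.
    i += pds[i].len as usize + 1;
  }
}
Rust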

We have two options for finding mergeable reams. Either we can walk the entire heap, as above, or, when a ream is freed, we can check that the previous and following reams are mergeable (finding the previous ream would require storing the length of a ream at its first and last Pd).

Which merging strategy we use depends on whether we’re implementing an ordinary malloc-like heap or a garbage collector; in the malloc case, merging on free makes more sense, but merging in one shot makes more sense for Alkyne’s GC (we’ll see why in a bit).

Slabs and Size Classes

A slab allocator is a specialized allocator that allocates a single type of object; they are quite popular in kernels as pools of commonly-used objects. The crux of a slab allocator is that, because everything is the same size, we don’t need to implement splitting and merging. The BlockCache above is actually a very primitive slab allocator.

Our Pd array is also kind of like a slab allocator; instead of mixing Pds in with the variably-sized blocks, they all live together with no gaps in between; entire pages are dedicated just to Pds.

The Alkyne page allocator cannot allocate pages smaller than 4K, and making them any smaller increases the relative overhead of a Pd. To cut down on book-keeping, we slab-allocate small objects by defining size classes.

A size class is a size of smaller-than-a-page object that Alkyne will allocate; other sizes are rounded up to the next size class. Entire pages are dedicated to holding just objects of the same size; these are called small object pages, or simply slabs. The size class is tracked with the class field of the Pd.

Each size class has its own free list of partially-filled slabs of that size. For slabs, the gc_bits field becomes a bitset that tracks which slots in the page are currently in-use, reducing the overhead for small objects to only a little over a single bit each!

In the diagram below, bits set in the 32-bit, little-endian bitset indicate which slots in the slab (not to scale!) are filled with three-letter words. (The user likes cats.)

  Pd--------------------------------------------+
  | gc_bits: 0b01010011111010100110000010101011 |
  +---------------------------------------------+

 Page--------------------------------------------+
 | cat | ink |     | hat |     | jug |     | fig |
 +-----------------------------------------------+
 |     |     |     |     |     | zip | net |     |
 +-----------------------------------------------+
 |     | aid |     | yes |     | war | cat | van |
 +-----------------------------------------------+
 | can | cat |     |     | rat |     | urn |     |
 +-----------------------------------------------+
Plaintext

Allocating an object is a bit more complicated now, but in exchange we get a really, really short fast path for small objects:

  1. Round up to the next highest size class, or else to the next page boundary.
  2. If a slab size class:
     a. Check the pertinent slab list for a partially-filled slab.
        i. If there isn’t one, allocate a page per the instructions below and initialize it as a slab page.
     b. Find the next available slot with (!gc_bits).count_trailing_zeros(), and set that bit.
     c. Return page_addr + slab_slot_size * slot.
  3. Else, if a single page, allocate from the single-page list.
     a. If there isn’t one, allocate from the ream list as usual.
  4. Else, multiple pages: allocate a ream as usual.

Allocating small objects is very fast, since the slab free lists, if not empty, will always have a spot to fill in gc_bits. Finding the empty spot in the bitset is a few instructions (a not plus a ctz or equivalent on most hardware).
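
As a sketch, the heart of that fast path (steps 2b and 2c above) might look like this; the parameters page_addr and slot_size are illustrative helpers, not Alkyne’s actual API.

/// A sketch of the small-object fast path, assuming a partially-full
/// slab has already been pulled off its size class's free list.
unsafe fn alloc_in_slab(pd: &mut Pd, page_addr: *mut u8, slot_size: usize) -> *mut u8 {
  // The slab came off a partially-filled list, so at least one bit
  // of gc_bits is clear; grab the first clear one.
  let slot = (!pd.gc_bits).trailing_zeros() as usize;

  // Mark the slot as allocated.
  pd.gc_bits |= 1 << slot;

  page_addr.add(slot * slot_size)
}
Rust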

Alkyne maintains a separate free list for single free pages to speed up finding such pages to turn into fresh slabs. This also minimizes the need to allocate single pages off of large reams, which limits fragmentation.

Alkyne’s size classes are the powers of two from 8 (the smallest possible object) to 2048. For the classes 8, 16, and 32, which would have more than 64 slots in the page, we use up to 56 bytes on the page itself to extend gc_bits; pages of 8-byte objects can only hold 505 objects, instead of the full 512, a 1% overhead.

Directly freeing an object via free() is now tricky, since we do not a priori know the size.

  1. Round the pointer down to a page boundary, and obtain that page’s Pd.
  2. If this is a start-of-ream page, stick it into the appropriate free list (single page or ream, depending on the size of the ream).
  3. Else, we can look at class to find the size class, and from that and the offset of the original pointer into the page, compute the index of the slot.
  4. Clear the slot’s index in gc_bits.
  5. If the page was full before, place it onto the correct slab free list; if it becomes empty, place it into the page free list.

At this point, we know whether the page just became partially full or empty, and can move it to the correct free list.
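
A matching sketch for freeing a small object, mirroring the steps above (again with illustrative helper parameters):

/// A sketch of freeing one slot in a slab page.
unsafe fn free_in_slab(pd: &mut Pd, page_addr: *mut u8, ptr: *mut u8, slot_size: usize) {
  // Recover the slot index from the pointer's offset into the page.
  let slot = (ptr as usize - page_addr as usize) / slot_size;
  let was_full = pd.is_full();

  // Clear the slot's bit in the bitset.
  pd.gc_bits &= !(1 << slot);

  if was_full {
    // The page has room again: put it back on its slab free list.
  } else if pd.gc_bits == 0 {
    // The page is now empty: move it to the single-page free list.
  }
}
Rust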

Size classes are an important allocator optimization. TCMalloc takes this to an extreme; its many size class constants are generated by some crazy script based on profiling data.

Intermission

Before continuing to the GC part of the article, it’s useful to go over what we learned.

A neat thing about this is that most of these tricks are somewhat independent. While giving feedback for an early draft, Matt Kulukundis shared this awesome talk that describes how to build complex allocators out of simple ones, and covers many of the same tropes as we did here. This perspective on allocators actually blew my mind.

Good allocators don’t just use one strategy; they use many, and pick and choose the best one for the job based on expected workloads. For example, Alkyne expects to allocate many small objects; the slab pages were originally only for float objects, but it turned out to simplify a lot of the code to make all small objects be slab-allocated.

Even size classes are a deep topic: TCMalloc uses GWP telemetry from Google’s fleet to inform its many tuning parameters, including its comically large tables of size classes.

At this point, we have a pretty solid allocator. Now, let’s get rid of the free function.

Throwing out the Trash

Garbage collection is very different from manual memory management in that frees are performed in batches without cue from the user. There are no calls to free(); instead, we need to figure out which calls to free() we can make on the user’s behalf that they won’t notice (i.e., without quietly freeing pointers the user can still reach, resulting in a use-after-free bug). We need to do this as fast as we can.

Alkyne is a “tracing GC”. Tracing GCs walk the “object graph” from a root set of known-reachable objects. Given an object a, it will trace through any data in the object that it knows is actually a GC pointer. In the object graph, b is reachable from a if one can repeatedly trace through GC pointers to get from a to b.

Alkyne uses tracing to implement garbage collection in a two-step process, commonly called “mark-and-sweep”.

Marking consists of traversing the entire graph from a collection of reachable-by-definition values, such as things on the stack, and recording each object that is visited. Every object not so marked must therefore be definitely unreachable and can be reclaimed; this reclamation is called sweeping.

Alkyne reverses the order of operations somewhat: it “sweeps” first and then marks, i.e., it marks every value as dead and then, as it walks the graph, marks every block as alive. It then rebuilds the free lists to reflect the new marks, allowing the blocks to be reallocated. This is sometimes called “mark and don’t sweep”, but fixing up the free lists is effectively a sweeping step.

Marking and sweeping! (via Wikipedia, CC0)

Alkyne is a “stop-the-world” (STW) GC. It needs to pause all program execution while cleaning out the heap. It is possible to build GCs that do not do this (I believe modern HotSpot GCs very rarely stop the world), but also very difficult. Most GCs are world-stopping to some degree.

One thing we do not touch on is when to sweep. This is a more complicated and somewhat hand-wavy tuning topic that I’m going to quietly sweep under the rug by pointing you to how Go does it.

Heap Armageddon and Resurrection

Delicate freeing of individual objects is quite difficult, but scorching the earth is very easy. To do this, we walk the whole Pd array (see, I said this would be useful!) and blow away every gc_bits. This leaves the heap in a broken state where every pointer appears to be dangling. This is “armageddon”.

To fix this up, we need to “resurrect” any objects we shouldn’t have killed (oops). The roots are objects in the Alkyne interpreter stack11. To mark an object, we convert a pointer to it into a Pd via the page-Pd correspondence, and mark it as alive by “allocating” it.

We then use our knowledge12 of Alkyne objects’ heap layout to find pointers to other objects in the heap (for example, the interpreter knows it’s looking at a list and can just find the list elements within, which are likely pointers themselves). If we trace into an object and find it has been marked as allocated, we don’t recurse; this avoids infinite recursion when encountering cycles.

It’s a bit hard to give a full code example for this, because the “mark” part of the GC is mixed up with interpreter code, so there isn’t much to show in this case. :(
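
Still, to give a rough idea of the shape, here is a heavily simplified sketch; pd_and_slot_of() and trace_children() are hypothetical stand-ins for the interpreter’s layout knowledge, not real Alkyne functions.

/// A heavily simplified sketch of marking (the enclosing impl and
/// helper types are elided).
fn mark(&mut self, ptr: GcPtr) {
  let (pd, slot) = self.pd_and_slot_of(ptr);

  // Already resurrected: don't recurse, or a cycle would loop forever.
  if pd.gc_bits & (1 << slot) != 0 {
    return;
  }

  // "Allocate" the object again, marking it as alive.
  pd.gc_bits |= 1 << slot;

  // Ask the interpreter which GC pointers live inside this object,
  // and resurrect them too.
  for child in self.trace_children(ptr) {
    self.mark(child);
  }
}
Rust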

At the end of this process, every reachable object will once again be alive, but anything we couldn’t reach stays dead.

Instant Apocalypse

(Alkyne currently does not make this optimization, but really should.)

Rather than flipping every bit, we flip the global convention for whether 0 or 1 means “alive”, implemented by having a global bool specifying which is which at any given time; this would alternate from sweep to sweep. Thus, killing every living object is now a single operation.

This works if the allocated bit of objects in the free lists is never read, and only ever overwritten with the “alive” value when allocated, so that all of the dead objects suddenly becoming alive isn’t noticed. This does not work with slab-allocated small objects: pages may be in a mixed state where they are partially allocated and partially freed.

We can still make this optimization by adding a second bit that tracks whether the page contains any living objects, using the same convention. This allows delaying the clear of the allocated bits for small objects to when the page is visited, which also marks the whole page as alive.

Pages that were never visited (i.e., still marked as dead) can be reclaimed as usual, ignoring the allocated bits.
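
A tiny sketch of the convention flip, which (again) Alkyne does not implement yet; the names are illustrative.

struct Gc {
  // Which bit value currently means "alive"; flips every collection.
  alive: bool,
}

impl Gc {
  /// "Kills" every object in the heap in O(1): any bit still holding
  /// the old convention now reads as dead.
  fn armageddon(&mut self) {
    self.alive = !self.alive;
  }

  /// Marks an object's bit as alive under the current convention.
  fn resurrect(&self, bit: &mut bool) {
    *bit = self.alive;
  }

  /// Checks whether a bit means "alive" right now.
  fn is_alive(&self, bit: bool) -> bool {
    bit == self.alive
  }
}
Rust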

Free List Reconciliation

At this point, no pointers are dangling, but newly emptied out pages are not in the free lists they should be in. To fix this, we can walk over all Pds and put them where they need to go if they’re not full. This is the kinda-but-not-really sweep phase.

The code for this is simpler to show than to explain:

for pd in self.reams() {
  if pd.gc_bits == 0 {
    pd.unlink();
    if pd.len == 0 {
      unsafe { self.page_free_list.push(pd) }
    } else {
      unsafe { self.ream_free_list.push(pd) }
    }
  } else if pd.is_full() {
    // GC can't make a page that wasn't already full become full,
    // so we don't need to move it.
  } else {
    // Non-empty, non-full pages cannot be reams.
    debug_assert!(pd.class != SizeClass::Ream);

    pd.unlink();
    unsafe {
      self.slab_free_lists[pd.class].push(pd)
    }
  }
}
Rust

Of course, this will also shuffle around all pages that did not become partially empty or empty while marking. If the “instant apocalypse” optimization is used, this step must still inspect every Pd and modify the free lists.

However, it is a completely separate phase: all it does is find pages that did not survive the previous mark phase. This means that user code can run between the phases, reducing latency. If it turns out to be very expensive to sweep the whole heap, it can even be run less often than mark phases13.

This is also a great chance to merge reams, because we’re inspecting every page anyways; this is why the merging strategy depends on wanting to be a GC’s allocator rather than a normal malloc()/free() allocator.

…and that’s it! That’s garbage collection. The setup of completely owning the layout of blocks in the allocator allows us to cut down significantly on the memory needed to track objects in the heap, while keeping the mark and sweep steps short and sweet. A garbage collector is like any other data structure: you pack a lot of complexity into the invariants to make the actual operations very quick.

Conclusion

Alkyne’s GC is intended to be super simple because I didn’t want to think too hard about it (even though I clearly did lmao). The GC layouts are a whole ’nother story that has been swirling around in my head for months, which is described here. The choices made there influenced the design of the GC itself.

There are still many optimizations to make, but it’s a really simple but realistic GC design, and I’m pretty happy with it!

A Note on Finalizers (Tools of the Devil!)

Alkyne also does not provide finalizers. A finalizer is the GC equivalent of a destructor: it gets run after the GC declares an object dead. Finalizers complicate a GC significantly by their very nature; they are called in unspecified orders and can witness broken GC state; they can stall the entire program (if they are implemented to run during the GC pause in a multi-threaded GC) or else need to be called with a zombie argument that either can’t escape the finalizer or, worse, must be resurrected if it does!

If finalizers depend on each other, they can’t be run at all, for the same reason an ARC cycle cannot be broken; this weakness of ARC is one of the major benefits of an omniscient GC.

Java’s documentation for Object.finalize() is a wall of text of lies, damned lies, and ghost stories.

I learned earlier (the week before I started writing this article) that Go ALSO has finalizers and that they are similarly cursed. Go does behave somewhat more nicely than Java (finalizers are per-value and avoid zombie problems by unconditionally resurrecting objects with a finalizer).

Further Reading

Here are some other allocators that I find interesting and worth reading about, some of which have inspired elements of Alkyne’s design.

TCMalloc is Google’s crazy thread-caching allocator. It’s really fast and really cool, but I work for Google, so I’m biased. But it uses radix trees! Radix trees are cool!!!

Go has a garbage collector that has well-known performance properties but does not perform any wild optimizations like moving, and is a world-stopping, incremental GC.

Oilpan is the Chronimum renderer’s GC (you know, for DOM elements). It’s actually grafted onto C++ and has a very complex API reflective of the subtleties of GCs as a result.

libboehm is another C/C++ GC written by Hans Boehm, one of the world’s top experts on concurrency.

Orinoco is V8’s GC for the JavaScript heap (i.e., Chronimum’s other GC). It is a generational, moving GC that can defragment the heap over time by moving things around (and updating pointers). It also has a separate sub-GC just for short-lived objects.

Mesh is a non-GC allocator that can do compacting via clever use of mmap(2).

upb_Arena is an arena allocator that uses free lists to allow fusing arenas together. This is part of the μpb Protobuf runtime.

  1. In other words, it uses recursion along a syntax tree, instead of a more efficient approach that compiles the program down to bytecode. 

  2. Automatic Reference Counting is an automatic memory management technique where every heap allocation contains a counter of how many pointers currently point to it; once pointers go out of scope, they decrement the counter; when the counter hits zero the memory is freed.

    This is used by Python and Swift as the core memory management strategy, and provided by C++ and Rust via the std::shared_ptr<T> and Arc<T> types, respectively. 

  3. This is the file. It’s got fairly complete comments, but they’re written for an audience familiar with allocators and garbage collectors. 

  4. This is a tangent, but I should point out that Alkyne does not do “NaN boxing”. This is a technique used by some JavaScript runtimes, like Spidermonkey, which represent dynamically typed values as either ordinary floats, or pointers hidden in the mantissas of 64-bit IEEE 754 signaling NaNs.

    Alkyne instead uses something like V8’s Smi pointer compression, so our heap values are four bytes, not eight. Non-Smi values that aren’t on the stack (which uses a completely different representation) can only exist as elements of lists or objects. Alkyne’s slab allocator design (described below) is focused on trying to minimize the overhead of all floats being in their own little allocations. 

  5. The operating system’s own physical page allocator is actually solving the same problem: given a vast range of memory (in this case, physical RAM), allocate it. The algorithms in this article apply to those, too.

    Operating system allocators can be slightly fussier because they need to deal with virtual memory mappings, but that is a topic for another time. 

  6. As you might expect, these scare-quotes are load-bearing. 

  7. I tried leaving this out of the first draft, and failed. So many things would be simpler without fussing around with alignment. 

  8. Yes yes most architectures can cope with unaligned loads and stores but compilers rather like to pretend that’s not true. 

  9. Boutique means provenance in French. 

  10. Currently Alkyne has a rather small max ream size. A better way to approach this would be to treat the entire heap as one gigantic ream at the start, which is always at the bottom of the free list. 

  11. In GC terms, these are often called “stack roots”. 

  12. The interpreter simply knows this and can instruct the GC appropriately.

    In any tracing GC, the compiler or interpreter must be keenly aware of the layouts of types so that it can generate the appropriate tracing code for each.

    This is why grafting GCs to non-GC’d languages is non-trivial, even though people have totally done it: libboehm and Oilpan are good (albeit sometimes controversial) examples of how this can be done. 

  13. With “instant apocalypse”, this isn’t quite true; after two mark phases, pages from the first mark phase will appear to be alive, since the global “alive” convention has changed twice. Thus, only pages condemned in every other mark phase will be swept; sweeping is most optimal after an odd number of marks. 

Move Constructors Revisited

Almost a year ago I developed the moveit Rust library, which provides primitives for expressing something like C++’s T&& and move constructors while retaining Rust’s so-called “destructive move property”: moving a value transfers ownership, rather than doing a funny copy.

In an earlier blogpost I described the theory behind this library and some of the motivation, which I feel fairly confident about, especially in how constructors (and their use of pinning) are defined.

However, there is a problem.

A Not-so-Quick Recap

The old post is somewhat outdated, since moveit uses different names for a lot of things that are geared to fit in with the rest of Rust.

The core abstraction of moveit is the constructor, which is any type that implements the New trait:

#[must_use]
pub unsafe trait New: Sized {
  /// The type to construct.
  type Output;

  /// Construct a new value in-place using the arguments stored
  /// in `self`.
  unsafe fn new(self, this: Pin<&mut MaybeUninit<Self::Output>>);
}
Rust

A New type is not what is being constructed; rather, it represents a method of construction, resembling a specialized Fn trait. The constructed type is given by the associated type Output.

Types that can be constructed are constructed in place, unlike most Rust types. This is a property shared by constructors in C++, allowing values to record their own address at the moment of creation. Explaining why this is useful is a bit long-winded, but let’s assume this is a thing we want to be able to do. Crucially, we need the output of a constructor to be pinned, which is why the this output parameter is pinned.

Calling a constructor requires creating the output location in advance, so that we can hand it to the constructor in time:

// Create storage for the new value.
let mut storage = MaybeUninit::uninit();

// Pin that storage on the stack; by calling this, we may never move
// `storage` again, even after this goes out of scope.
let mut uninit = Pin::new_unchecked(&mut storage);

// Now we can call the constructor. It's only unsafe because it assumes
// the provided memory is uninitialized.
my_new.new(uninit.as_mut());

// This is now safe, since `my_new` initialized the value, so we can
// do with it what we please.
let result = uninit.map_unchecked_mut(|mp| mp.assume_init_mut());
Rust

However, this is not quite right. Pin<P>’s docs are quite clear that we must ensure that, once we create a Pin<&mut T>, we must call T’s destructor before its memory is re-used; since reuse is unavoidable for stack data, and storage will not do it for us (it’s a MaybeUninit<T>, after all), we must somehow run the destructor separately.

An “Easy” Solution

One trick we could use is to replace storage with some kind of wrapper over a MaybeUninit<T> that calls the destructor for us:

struct EventuallyInit<T>(MaybeUninit<T>);
impl<T> Drop for EventuallyInit<T> {
  fn drop(&mut self) {
    unsafe { ptr::drop_in_place(self.0.assume_init_mut()) }
  }
}
Rust

This works, but isn’t ideal, because now we can’t write down something like a C++ move constructor without running into the classic C++ problem: all objects must be destroyed unconditionally, so now you can have moved-from state. Giving up Rust’s moves-transfer-ownership (i.e. affine) property is bad, but it turns out to be avoidable!

There are also some scary details around panics here that I won’t get into.

&T, &mut T, … &move T?

moveit instead provides a MoveRef<'frame, T> type that tries to capture the notion of what an “owning reference” could mean in Rust. An &move or &own type has been discussed many times, but implementing it in the full generality it would deserve as a language feature runs into some interesting problems due to how Box<T>, the heap allocated equivalent, currently behaves.

We can think of MoveRef<'frame, T> as wrapping the longest-lived &mut T reference pointing to a particular location in memory. The longest-lived part is crucial, since it means that MoveRef is entitled to run its pointee’s destructor:

// Notice parallels with EventuallyInit<T> above.
impl<T: ?Sized> Drop for MoveRef<'_, T> {
  fn drop(&mut self) {
    unsafe { ptr::drop_in_place(self.ptr) }
  }
}
Rust

No reference to the pointee can ever outlive the MoveRef itself, by definition, so this is safe. The owner of a value is that which is entitled to destroy it, and therefore a MoveRef literally owns its pointee. Of course, this means we can move out of it (which was the whole point of the original blogpost).

Because of this, we are further entitled to arbitrarily pin a MoveRef with no consequences: pinning it would consume the unpinned MoveRef (for obvious reasons, MoveRefs cannot be reborrowed) so no unpinned reference may outlive the pinning operation.
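
In code, that pinning operation is little more than a wrapper; here is a sketch in the spirit of MoveRef::into_pin, which shows up again later.

impl<'frame, T: ?Sized> MoveRef<'frame, T> {
  /// A sketch of pinning: consuming the unpinned MoveRef means no
  /// unpinned handle to the pointee survives this call.
  pub fn into_pin(this: Self) -> Pin<MoveRef<'frame, T>> {
    // Sound for the reasons above: MoveRef owns its pointee and
    // cannot be reborrowed, so nothing can observe the value as
    // unpinned afterwards.
    unsafe { Pin::new_unchecked(this) }
  }
}
Rust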

This gives us a very natural solution to the problem above: result should not be a Pin<&mut T>, but rather a Pin<MoveRef<'_, T>>:

let mut storage = MaybeUninit::uninit();
let mut uninit = Pin::new_unchecked(&mut storage);
my_new.new(uninit.as_mut());

// This is now safe, since `my_new` initialized the value, so we can
// do with it what we please.
let result = MoveRef::into_pin(MoveRef::new_unchecked(
  uninit.map_unchecked_mut(|mp| mp.assume_init_mut())
));
Rust

This messy sequence of steps is nicely wrapped up in a macro provided by the library that ensures safe initialization and eventual destruction:

// Allocate storage on the stack, emplace `my_new` onto it, and pin it
// in an owning reference.
moveit!(let result: Pin<MoveRef<T>> = my_new);
Rust

There is also some reasonably complex machinery that allows us to do something like an owning Deref, which I’ll come back to in a bit.

However, there is a small wrinkle that I did not realize when I first designed MoveRef: what happens if I mem::forget a MoveRef?

Undefined Behavior, Obviously

Quashing destruction isn’t new to Rust: we can mem::forget just about anything, leaking all kinds of resources. And that’s ok! Destructors alone cannot be used in type design to avert unsafe catastrophe, a well-understood limitation of the language that we have experience designing libraries around, such as Vec::drain().

MoveRef’s design creates a contradiction:

  • MoveRef is an owning smart pointer, and therefore can be safely pinned, much like Box::into_pin() enables. Constructors, in particular, are designed to generate pinned MoveRefs!
  • Forgetting a MoveRef will cause the pointee destructor to be suppressed, but its storage will still be freed and eventually re-used, a violation of the Pin drop guarantee.

This would appear to mean that a design like MoveRef is not viable at all, and that this sort of “stack box” strategy is always unsound.

What About Box?

What about it? Even though we can trivially create a Pin<Box<i32>> via Box::pin(), this is a red herring. When we mem::forget a Box, we also forget about its storage too. Because its storage has been leaked unrecoverably, we are still, technically, within the bounds of the Pin contract. Only barely, but we’re inside the circle.

Interestingly, the Rust language has to deal with a similar problem; perhaps it suggests a way out?

Drop Flags and Dynamic Ownership Transfer

Carefully crafted Rust code emits some very interesting assembly. I’ve annotated the key portion of the output with a play-by-play below.

#[inline(never)]
pub fn someone_elses_problem(_: Box<i32>) {
  // What goes in here isn't important, it just needs to
  // be an optimizer black-box.
}

pub fn maybe_drop(flag: bool) {
  let x = Box::new(42);
  if flag {
    someone_elses_problem(x)
  }
}
// See Godbolt for the full disassembly.
example::maybe_drop:
  // ...

  // Allocate memory.
  call    __rust_alloc

  // Check if allocation failed; panic if so.
  test    rax, rax
  je      .L.scream

  // Write a 42 to the memory.
  mov     dword ptr [rax], 42

  // Check the flag argument (LLVM decided to put it in rbx). If
  // true, we go free the memory ourselves.
  test    bl, bl
  je      .L.actually_our_problem

  // Otherwise, make it someone else's problem; they get to
  // free the memory for themselves. 
  mov     rdi, rax
  pop     rbx
  jmp     example::someone_elses_problem

  // ...
x86 Assembly

The upshot is that maybe_drop conditions the destructor of x on a flag, which is allocated next to it on the stack. Rust flips this flag when the value is moved into another function, and only runs the destructor when the flag is left alone. In this case, LLVM folded the flag into the bool argument, so this isn’t actually a meaningful perf hit.

These “drop flags” are key to Rust’s ownership model. Since ownership may be transferred dynamically due to reasonably complex control flow, it needs to leave breadcrumbs for itself to figure out whether the value wound up getting moved away or not. This is unique to Rust: in C++, every object is always destroyed, so no such faffing about is necessary.
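
A rough source-level analogue of what the compiler is doing, using Option as a stand-in for the implicit flag (this is not what the compiler literally emits):

pub fn maybe_drop_desugared(flag: bool) {
  // The Option plays the role of the drop flag: `Some` means
  // "we still own the Box and must drop it".
  let mut x = Some(Box::new(42));

  if flag {
    // Transferring ownership flips the "flag" to `None`.
    someone_elses_problem(x.take().unwrap());
  }

  // When `x` falls out of scope, the Box is dropped only if it is
  // still `Some`, just like the flag-guarded branch in the assembly.
}
Rust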

Similarly, moveit can close this soundness hole by leaving itself breadcrumbs to determine if safe code is trying to undermine its guarantees.

In other words: in Rust, it is not sufficient to manage a pointer to manage a memory location; it is necessary to manage an explicit or implicit drop flag as well.

A Flagged MoveRef

We can extend MoveRef to track an explicit drop flag:

pub struct MoveRef<'frame, T: ?Sized> {
  ptr: &'frame mut T,

  // Set to `false` once the destructor runs.
  drop_flag: &'frame Cell<bool>,
}
Rust

Wrapping it in a Cell is convenient and doesn’t cost us anything, since a MoveRef can never be made Send or Sync anyways. Inside of its destructor, we can flip the flag, much like Rust flips a drop flag when transferring ownership to another function:

impl<T: ?Sized> Drop for MoveRef<'_, T> {
  fn drop(&mut self) {
    self.drop_flag.set(false);
    unsafe { ptr::drop_in_place(self.ptr) }
  }
}
Rust

But, how should we use it? The easiest way is to change the definition of moveit!() to construct a flag trap:

let mut storage = MaybeUninit::uninit();
let mut uninit = Pin::new_unchecked(&mut storage);

// Create a *trapped flag*, which I'll describe below.
let trap = TrappedFlag::new();

// Run the constructor as before and construct a MoveRef.
my_new.new(uninit.as_mut());
let result = MoveRef::into_pin(MoveRef::new_unchecked(
  Pin::into_inner_unchecked(uninit).assume_init_mut(),
  trap.flag(),  // Creating a MoveRef now requires
                // passing in a flag in addition to 
                // a reference to the owned value itself.
));
Rust

The trap is a deterrent against forgetting a MoveRef: because the MoveRef’s destructor flips the flag, the trap’s destructor will notice if this doesn’t happen, and take action accordingly.

Note: in moveit, this is actually implemented by having the Slot<T> type carry a reference to the trap, created in the slot!() macro. However, this is not a crucial detail for the design.

An Earth-Shattering Kaboom

The trap is another RAII type that basically looks like this:

pub struct TrappedFlag(Cell<bool>);
impl Drop for TrappedFlag {
  fn drop(&mut self) {
    if self.0.get() { abort() }
  }
}
Rust

The trap is simple: if the contained drop flag is not flipped, it crashes the program. Because moveit!() allocates it on the stack where users cannot mem::forget it, its destructor is guaranteed to run before storage’s destructor runs (although Rust does not guarantee destructors run, it does guarantee their order).

If a MoveRef is forgotten, it won’t have a chance to flip the flag, which the trap will detect. Once the trap’s destructor notices this, it cannot return, either normally or by panic, since this would cause storage to be freed. Crashing the program is the only1 acceptable response.

Some of MoveRef’s functions need to be adapted to this new behavior: for example, MoveRef::into_inner() still needs to flip the flag, since moving out of the MoveRef is equivalent to running the destructor for the purposes of drop flags.
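
A sketch of what that might look like; this is simplified, and the real moveit implementation differs in detail (not least because the real flag is a counter, as mentioned later).

impl<'frame, T> MoveRef<'frame, T> {
  /// Moves the value out, consuming the MoveRef. A sketch only.
  pub fn into_inner(this: Self) -> T {
    // Moving out counts as "running the destructor" as far as the
    // drop flag is concerned, so flip it...
    this.drop_flag.set(false);

    // ...move the value out of the referenced storage...
    let value = unsafe { core::ptr::read(this.ptr) };

    // ...and suppress MoveRef's own Drop, which would otherwise
    // destroy the value a second time.
    core::mem::forget(this);
    value
  }
}
Rust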

A Safer DerefMove

In order for MoveRef to be a proper “new” reference type, and not just a funny smart pointer, we also need a Deref equivalent:

pub unsafe trait DerefMove: DerefMut + Sized {
  /// An "uninitialized" version of `Self`.
  type Uninit: Sized;
  
  /// "Deinitializes" `self`, producing an opaque type that will
  /// destroy the storage of `*self` without calling the pointee
  /// destructor.
  fn deinit(self) -> Self::Uninit;

  /// Moves out of `this`, producing a `MoveRef` that owns its
  /// contents.
  unsafe fn deref_move(this: &mut Self::Uninit)
    -> MoveRef<'_, Self::Target>;
}
Rust

This is the original design for DerefMove, which had a two-phase operation: first deinit() was used to create a destructor-suppressed version of the smart pointer that would only run the destructor for the storage (e.g., for Box, only the call to free()). Then, deref_move() would extract the “inner pointee” out of it as a MoveRef. This had the effect of splitting the smart pointer’s destructor, much like we did above on the stack.

This has a number of usability problems. Not only does it need to be called through a macro, but deinit() isn’t actually safe: failing to call deref_move() is just as bad as calling mem::forget on the result. Further, it’s not clear where to plumb the drop flags through.

After many attempts to graft drop flags onto this design, I replaced it with a completely new interface:

pub unsafe trait DerefMove: DerefMut + Sized {
  /// The "pure storage" form of `Self`, which owns the storage
  /// but not the pointee.
  type Storage: Sized;

  /// Moves out of `this`, producing a [`MoveRef`] that owns
  /// its contents.
  fn deref_move<'frame>(
    self,
    storage: DroppingSlot<'frame, Self::Storage>,
  ) -> MoveRef<'frame, Self::Target>
  where
    Self: 'frame;
}
Rust

Uninit has been given the clearer name of Storage: a type that owns just the storage of the moved-from pointer. The two functions were merged into a single, safe function that performs everything in one step, emitting the storage as an out-parameter.

The new DroppingSlot<T> is like a Slot<T>, but closer to a safe version of the EventuallyInit<T> type from earlier: its contents are not necessarily initialized, but if they are, it destroys them, and it only does so when its drop flag is set.

Box is the most illuminating example of this trait:

unsafe impl<T> DerefMove for Box<T> {
  type Storage = Box<MaybeUninit<T>>;

  fn deref_move<'frame>(
    self,
    storage: DroppingSlot<'frame, Box<MaybeUninit<T>>>,
  ) -> MoveRef<'frame, T>
  where
    Self: 'frame
  {
    // Dismantle the incoming Box into the "storage-only part".
    let this = unsafe {
      Box::from_raw(Box::into_raw(self).cast::<MaybeUninit<T>>())
    };

    // Put the Box into the provided storage area. Note that we
    // don't need to set the drop flag; `DroppingSlot` does
    // that automatically for us.
    let (storage, drop_flag) = storage.put(this);

    // Construct a new MoveRef, converting `storage` from 
    // `&mut Box<MaybeUninit<T>>` into `&mut T`.
    unsafe { MoveRef::new_unchecked(storage.assume_init_mut(), drop_flag) }
  }
}
Rust

MoveRef’s own implementation illustrates the need for the explicit lifetime bound:

unsafe impl<'a, T: ?Sized> DerefMove for MoveRef<'a, T> {
  type Storage = ();

  fn deref_move<'frame>(
    self,
    _: DroppingSlot<'frame, ()>,
  ) -> MoveRef<'frame, T>
  where
    Self: 'frame
  {
    // We can just return directly; this is a mere lifetime narrowing.
    self
  }
}
Rust

Since this is fundamentally a lifetime narrowing, this can only compile if we insist that 'a: 'frame, which is implied by Self: 'frame. Earlier iterations of this design enforced it via a MoveRef<'frame, Self> receiver, which turned out to be unnecessary.

Conclusions

As of writing, I’m still in the process of self-reviewing this change, but at this point I feel reasonably confident that it’s correct; this article is, in part, written to convince myself that I’ve done this correctly.

The new design will also enable me to finally complete my implementation of a constructor and pinning-friendly vector type; this issue came up in part because the vector type needs to manipulate drop flags in a complex way. For this reason, the actual implementation of drop flags uses a counter, not a single boolean.

I doubt this is the last issue I’ll need to chase down in moveit, but for now, we’re ever-closer to true owning references in Rust.

  1. Arguably, running the skipped destructor is also a valid remediation strategy. However, this is incompatible with what the user requested: they asked for the destructor to be suppressed, not for it to be run at a later date. This would be somewhat surprising behavior, which we could warn about for the benefit of unsafe code, but it is ultimately the incorrect choice for non-stack storage, such as a MoveRef referring to the heap. 

Understanding Assembly
Part I: RISC-V

A Turing tarpit is a programming language that is Turing-complete but very painful to accomplish anything in. One particularly notable tarpit is Brainfuck, which has a reputation among beginner and intermediate programmers for being unapproachable and accessible only to the most elite programmers; hence the name, as Wikipedia puts it:

The language’s name is a reference to the slang term brainfuck, which refers to things so complicated or unusual that they exceed the limits of one’s understanding.

Assembly language, the “lowest-level” programming language on any computer, has a similar reputation: difficult, mysterious, and beyond understanding. A Turing tarpit that no programmer would want to have anything to do with.

Although advanced programmers usually stop seeing assembly as mysterious and inaccessible, I feel like it is a valuable topic even for intermediate programmers, and one that can be made approachable and interesting.

This series seeks to be that: assuming you have already been using a compiled language like Rust, C++, or Go, how is assembly relevant to you?

If you’re here to just learn assembly and don’t really care for motivation, you can just skip ahead.

This series is about learning to understand assembly, not write it. I do occasionally write assembly for a living, but I’m not an expert, and I don’t particularly relish it. I do read a ton of assembly, though.

What Is It, Anyways?

As every programmer knows, computers are very stupid. They are very good at following instructions and little else. In fact, the computer is so stupid, it can only process basic instructions serially1, one by one. The instructions are very simple: “add these two values”, “copy this value from here to there”, “go run these instructions over here”.

A computer processor implements these instructions as electronic circuits. At its most basic level, every computer looks like the following program:

size_t program_counter = ...;
Instruction *program = ...;

while (true) {
  Instruction next = program[program_counter];
  switch (next.opcode) {
    // Figure out what you're supposed to be doing and do it.
  }
  program_counter++;
}
C

The array program is your program encoded as a sequence of these “machine instructions” in some kind of binary format. For example, in RISC-V programs, each instruction is a 32-bit integer. This binary format is called machine code.

For example, when a RISC-V processor encounters the value 10814867, its decoding circuitry decides that it means it should take the value in the “register” a0, add 10 to it, and place the result in the register a1.

Decoding Instructions

10814867 seems opaque, but when viewed as binary, we can see how the processor decodes it:

> 0b 000000001010 01010 000 01011 0010011
>    \__________/ \___/ \_/ \___/ \_____/
>     |           |     |   |     |
>     imm         rs1   fn  rd    opcode
Plaintext

opcode describes what sort of instruction this is, and what format it’s in; 0b0010011 means it’s an “immediate arithmetic” instruction, which uses the “I-type” format, given above. fn further specifies what the operation does; 0b000, combined with the value of opcode, means this is an addition instruction.

rs1 is the source: it gives the name of the source register, a0, given by its index, 0b01010, i.e., 10. Similarly, rd specifies the destination, a1, by its index, 0b01011, i.e., 11. Finally, imm is the value to add, 0b000000001010, or 10. The constant value appears immediately in the instruction itself, so it’s called an immediate.

However, if you’re a human programming a computer, writing all of this by hand is… very 60s, and you might prefer to have a textual representation, so you can write this more simply as addi a1, a0, 10.

addi a1, a0, 10 is a single line of assembly: it describes a single instruction in text form. Assembly language is “just” a textual representation of the program’s machine code. Your assembler can convert from text into machine instructions, and a disassembler reverses the process.

The simple nature of these instructions is what makes assembly a sort of Turing tarpit: you only get the most basic operations possible: you’re responsible for building everything else.
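
For instance, even something as small as “multiply by three” has to be assembled out of more primitive steps on machines without a multiplier. A minimal, hypothetical sketch (not taken from any compiler output):

        // Sketch: a1 = a0 * 3, built out of nothing but addition.
        add a1, a0, a0   // a1 = a0 + a0 (that is, a0 * 2)
        add a1, a1, a0   // a1 = a1 + a0 (that is, a0 * 3)
RISC-V Assembly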

On Architectures

There isn’t “an” assembly language. Every computer has a different instruction set architecture, or “ISA”; I use the terms “instruction set”, “architecture”, and “ISA” interchangeably. Each ISA has a corresponding assembly language that describes that ISA’s specific instructions, but they all generally have similar overall structure.

I’m going to focus on three ISAs for ease of exposition, introduced in this order:

  1. RISC-V, a modern and fairly simple instruction set (specifically, the rv32gc variant). That’s Part I.
  2. x86_64, the instruction set of the device you’re reading this on (unless it’s a phone, an Apple M1 laptop, or something like a Nintendo Switch). That’s Part II.
  3. MOS 6502, a fairly ancient ISA still popular in very small microcontrollers. That’s Part III.

We’re starting with RISC-V because it’s a particularly elegant ISA (having been developed for academic work originally), while still being representative of the operations most ISAs offer.

In the future, I may dig into some other, more specialized ISAs.

But Why?

It’s actually very rare to write actual assembly. Thanks to modern (relatively) languages like Rust, C++, and Go, and even things like Haskell and JavaScript, virtually no programmers need to write assembly anymore.

But that’s only because it’s the leading language written by computers themselves. A compiler’s job is, fundamentally, to write the assembly you would have had to write for you. To better understand what a compiler is doing for you, you need to be able to read its output.

At this point, it may be worth looking at my article on linkers as a refresher on the C compilation model.

For example, let’s suppose we have the very simple C program below.

#include <stdio.h>

int square_and_print(int x) {
    x *= x;
    printf("%d\n", x);
    return x;
}
godbolt
square.c

Clang, my C compiler of choice, can turn it directly into a library via clang -c square.c. -c asks the compiler to stop before the link step, outputting the object file square.o. We can ask the compiler to stop even sooner than that by writing clang -S square.c, which will output square.s, the assembly file the compiler produced! For this example, and virtually all others in this post, I’m using a RISC-V target: -target riscv32-unknown-elf -march=rv32gc.

If you build with -Oz to make the code as small as possible (this makes it easiest to see what’s going on, too), you get something like this:

.text
        .file   "square.c"
        .globl  square_and_print
square_and_print:
        addi    sp, sp, -16
        sw      ra, 12(sp)
        sw      s0, 8(sp)
        mul     s0, a0, a0          // !
        lui     a0, %hi(.L.str)
        addi    a0, a0, %lo(.L.str)
        mv      a1, s0
        call    printf              // !
        mv      a0, s0
        lw      s0, 8(sp)
        lw      ra, 12(sp)
        addi    sp, sp, 16
        ret

        .section        .rodata
.L.str:
        .asciz  "%d\n"
square.s

There’s a lot going on! But pay attention to the two lines with a // !: the first is mul s0, a0, a0, which is the multiplication x *= x;. The second is call printf, which is our function call to printf()! I’ll explain what everything else means in short order.

Writing assembly isn’t a crucial skill, but being able to read it is. It’s actually so useful that a website exists for quickly generating the assembly output of a vast library of compilers: the Compiler Explorer, frequently just called “godbolt” after its creator, Matt Godbolt. Being able to compare the output of different compilers can help you understand what they do! Click on the godbolt button in the code fences to open a godbolt for it.

“Low-level” languages like C aren’t the only ones where you can inspect assembly output. Godbolt supports Go: for example, click the godbolt button below.

package sq

import "fmt"

func SquareAndPrint(x int) int {
    x *= x
    fmt.Printf("%d\n", x)
    return x
}

Hopefully this is motivation enough to jump into the language proper. It is very useful to have a godbolt tab open to play around with examples!

Diving In

So, let’s say you do want to read assembly. How do we do that?

Let’s revisit our square.c example above. This time, I’ve added comments explaining what all the salient parts of the code do, including the assembler directives, which are all of the form .blah. Note that the actual compiler output includes way more directives that would get in the way of exposition.

There are a lot of terms below that I haven’t defined yet. I’ll break down what this code does gradually, so feel free to refer back to it as necessary, using this handy-dandy link.

        // This tells the assembler to place all code that
        // follows in the `.text` section, where executable
        // data goes.
        .text

        // This is just metadata that tools can use to figure out
        // how the executable was built.
        .file   "square.c"

        // This asks the assembler to mark `square_and_print`
        // as an externally linkable symbol. Other files that
        // refer to `square_and_print` will be able to find it
        // at link time.
        .globl  square_and_print

square_and_print: // This is a label, which gives this position
                  // in the executable a name that can be
                  // referenced. They're very similar to `goto`
                  // labels from C.
                  //
                  // We'll see more labels later on.


        // This is the function prologue, which "sets up" the
        // function: it allocates stack space and saves the
        // return address, along with other calling-convention
        // fussiness.
        addi    sp, sp, -16
        sw      ra, 12(sp)
        sw      s0, 8(sp)

        // This is our `x *= x;` from before! Notice that the
        // compiler rewrote this to `temp = x * x;` at some
        // point, since the destination register is `s0`.
        mul     s0, a0, a0

        // These two instructions load the address of a string
        // constant; this pattern is specific to RISC-V.
        lui     a0, %hi(.L.str)
        addi    a0, a0, %lo(.L.str)
        
        // This copies the multiplication result into `a1`.
        mv      a1, s0

        // Call to printf!
        call    printf

        // Move `s0` into `a0`, since it's the return value.
        mv      a0, s0

        // This is the function epilogue, which restores state
        // saved in the prologue and de-allocates the stack
        // frame.
        lw      s0, 8(sp)
        lw      ra, 12(sp)
        addi    sp, sp, 16
        
        // We're done; return from the function!
        ret

        // This tells the assembler to place what follows in
        // the `.rodata` section, for read-only constants like
        // strings.
        .section        .rodata

.L.str: // Give our string constant a private name. By convention,
        // .L labels are "private" names emitted by the compiler.

        // Emit an ASCII string into `.rodata` with an extra null
        // terminator at the end: that's what the `z` stands for.
        .asciz  "%d\n"
square.s

The Core Syntax

All assemblers are different, but the core syntax tends to be the same. There are three main kinds of syntax productions:

  • Instructions, which consist of a mnemonic followed by some number of operands, such as addi sp, sp, -16 and call printf above. These are the text encoding of machine code.
  • Labels, which consist of a symbol followed by a colon, like square_and_print: or .L.str:. These are used to let instruction operands refer to locations in the program.
  • Directives, which vary wildly by assembler. GCC-style assembly like that above uses a .directive arg, arg syntax, as seen in .text, .globl, and .asciz. They control the behavior of the assembler in various ways.

An assembler’s purpose is to read the .s file and serialize it as a binary .o file. It’s kind of like a compiler, but it does virtually no interesting work at all, beyond knowing how to encode instructions.

Directives control how this serialization occurs (such as moving around the output cursor); instructions are emitted as-is, and labels refer to locations in the object file. Simple enough, right?

Anatomy of an Instruction

Let’s look at the very first instruction in square_and_print:

        addi sp, sp, -16
        ---- --  --  ---
         |   |   |    |
         |   |   |    immediate operand
         |   |   input operand
         |   output operand
         mnemonic
RISC-V Assembly

The first token is called the mnemonic, which is a painfully terse abbreviation of what the instruction does. In this case, addi means “add with immediate”.

sp is a register. Registers are special variables wired directly into the processor that can be used as operands in instructions. The degree to which only registers are permitted as operands varies by architecture; RISC-V only allows registers, but x86, as we’ll see, does not. Registers come in many flavors, but sp is a GPR, or “general purpose register”; it holds a machine word-sized integer, which in the case of 32-bit RISC-V is… 32-bit2.

RISC-V Registers

One of my absolute favorite parts of RISC-V is how it names its registers. It has 32 GPRs named x0 through x31. However, these registers have so-called “ABI names” that specify the role of each register in the ABI.

The usefulness of these names will be much more apparent when we discuss the calling convention, so feel free to come back to this later.

x0 is called zero, because of its special property: writes to it are ignored, and reads always produce zero. This is handy for encoding certain common operations: for example, it can be used to quickly get a constant value into a register: addi rd, zero, 42.
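
A short illustrative sketch (not from any particular program) of idioms that fall out of having zero around:

        addi a0, zero, 42    // a0 = 0 + 42: load a small constant.
        sub  a1, zero, a1    // a1 = 0 - a1: negate a register in place.
        addi zero, zero, 0   // Result is discarded; this is how `nop` is encoded.
RISC-V Assembly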

x1, x2, x3, and x4 have special roles and generally aren’t used for general computation. The first two are the link register ra, which holds the return address, and sp, the stack pointer.

The latter two are gp and tp, the global pointer and the thread pointer; their roles are somewhat complicated, so we won’t discuss them in this post.

The remaining registers belong to one of three categories: argument registers, saved registers, and temporary registers, named so for their role in calling a function (as described below).

The argument registers are x10 through x17, and use the names a0 through a7. The saved registers are x8, x9, and x18 through x27, called s0 through s11. The temporary registers are x5 through x7 and x28 through x31, called t0 through t6.

As a matter of personal preference, you may notice me reaching for argument registers for most examples.

-16 is an immediate, which is a literal value that is encoded directly into the instruction. The encoding of addi sp, sp, -16 will include the binary representation of -16 (in the case of RISC-V, as a 12-bit integer). The decoding example above shows how immediates are literally encoded immediately in the instruction.

Immediates allow for small but fixed integer arguments to be encoded with high locality to the instruction, which is good for code size and performance.

The first operand in RISC-V is (almost) always the output. addi rd, rs, imm should be read as rd = rs + imm. Virtually all assembler syntax follows this convention, which is called three-address code.

Other kinds of operands exist: for example, call printf refers to the symbol printf. The assembler, which doesn’t actually know where printf is, will emit a small note in the object file that tells the linker to find printf and splat it into the assembly according to some instructions in the note. These notes are called relocations.

The instructions lui a0, %hi(.L.str) and addi a0, a0, %lo(.L.str) use the %lo and %hi operand types, which are specific to RISC-V; they load the low 12 bits and high 20 bits of a symbol’s address into the immediate operand. This is a RISC-V-specific pattern for loading an address into a register, which most assemblers provide with the pseudoinstruction la a0, .L.str (where la stands for “load address”).
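
In other words, the two explicit instructions and the pseudoinstruction are roughly interchangeable; a sketch of the equivalence:

        // What we might write:
        la   a0, .L.str
        // What the assembler roughly emits on 32-bit RISC-V:
        lui  a0, %hi(.L.str)
        addi a0, a0, %lo(.L.str)
RISC-V Assembly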

Most architectures have their own funny architecture-specific operand types to deal with the architecture’s idiosyncrasy.

Types of Instructions

Available instructions tend to be motivated by providing one of three classes of functionality:

  1. A Turing-complete register machine execution environment. This lends to the Turing tarpit nature of assembly: only the absolute minimum in terms of control flow and memory access is provided.
  2. Efficient silicon implementation of common operations on bit strings and integers, ranging from arithmetic to cryptographic algorithms.
  3. Building a secure operating system, hosting virtual machines, and actuating hardware external to the processor, like a monitor, a keyboard, or speakers.

Instructions can be broadly classified into four categories: arithmetic, memory, control flow, and “everything else”. For the last thirty years, the bar for general-purpose architectures has usually been “this is enough to implement a C runtime.”

Arithmetic Instructions

Arithmetic makes up the bulk of the instruction set. This always includes addition, subtraction, and bitwise and, or, and xor, as well as unary not and negation.

In RISC-V, these come in two variants: a three-register version and a two-register, one immediate version. For example, add a0, a1, a2 is the three-register version of addition, while addi a0, a1, 42 is the immediate version. There isn’t a subi though, since you can just use negative immediates with addi.
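
A sketch of the two variants side by side (registers chosen arbitrarily):

        add  a0, a1, a2    // a0 = a1 + a2: three-register form.
        addi a0, a1, 42    // a0 = a1 + 42: immediate form.
        addi a0, a1, -42   // a0 = a1 - 42: a negative immediate instead of a `subi`.
        sub  a0, a1, a2    // a0 = a1 - a2: register subtraction does exist.
RISC-V Assembly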

not and neg are not actual instructions in RISC-V, but pseudoinstructions: not a0, a1 encodes as xori a0, a1, -1, while neg a0, a1 becomes sub a0, zero, a1.

Most instruction sets also have bit shifts, usually in three flavors: left shifts, right shifts, and arithmetic right shifts; arithmetic right shift is defined such that it behaves like division by powers of two on signed integers. RISC-V’s names for these instructions are sll, srl, and sra.
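
A sketch of the three shifts, using the immediate variants slli, srli, and srai, and assuming a0 holds a signed value:

        slli a1, a0, 3   // a1 = a0 << 3: same as multiplying by 8.
        srli a2, a0, 3   // a2 = a0 >> 3, filling the top bits with zeros.
        srai a3, a0, 3   // a3 = a0 >> 3, copying the sign bit: behaves like
                         // signed division by 8 (rounding towards -infinity).
RISC-V Assembly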

Multiplication and division are somewhat rarer, because they are expensive to implement in silicon; smaller devices don’t have them3. Division in particular is very complex to implement in silicon. Instruction sets also differ in their behavior around division by zero: some architectures will fault, similar to a memory error, while others, like RISC-V, simply produce a well-defined result instead of trapping.
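
On targets that do have them (the rv32gc target used in this post includes the M extension), they look like any other three-register arithmetic; a quick sketch:

        mul a0, a1, a2   // a0 = a1 * a2 (low 32 bits of the product).
        div a3, a1, a2   // a3 = a1 / a2, signed; divu is the unsigned version.
        rem a4, a1, a2   // a4 = a1 % a2, signed; remu is the unsigned version.
RISC-V Assembly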

There is usually also a “copy” instruction that moves the value of one register to another, which is kind of like a trivial arithmetic instruction. RISC-V calls this mv a0, a1, but it’s just a pseudoinstruction that expands to addi a0, a1, 0.

Some architectures also offer more exotic arithmetic. This is just a sampler of what’s sometimes available:

  • Bit rotation, which is like a shift but bits that get shifted off end up at the other end of the integer. This is useful for a vast array of numeric algorithms, including ARX ciphers like ChaCha20.
  • Byte reversal, which can be used for changing the endianness of an integer; bit reversal is analogous.
  • Bit extraction, which can be used to form new integers out of bitfields of another.
  • Carry-less multiplication, which is like long multiplication but you don’t bother to carry anything when you add intermediates. This is used to implement Galois/Counter mode encryption.
  • Fused instructions, like xnor and nand.
  • Floating point instructions, usually implementing the IEEE 754 standard.

There is also a special kind of arithmetic instruction called a vector instruction, but I’ll leave those for another time.

Memory Instructions

Load instructions fetch memory from RAM into registers, while store instructions write it back. These instructions are what we use to implement pointers.

They come in all sorts of different sizes: RISC-V has lw, lh, and lb for loading 32-, 16-, and 8-bit values from a location; sw, sh, and sb are their store counterparts. 64-bit RISC-V also provides ld and sd for 64-bit loads and stores.

Load/store instructions frequently take an offset for indexing into memory. lw a1, 4(a0)4 is effectively a1 = a0[4], treating a0 like a pointer.
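
For example, here is a sketch of field accesses through a hypothetical struct pointer held in a0:

        // Roughly: p->y = p->x + p->y, where p is in a0 and points
        // to a struct { int32_t x, y; }.
        lw  a1, 0(a0)    // a1 = p->x
        lw  a2, 4(a0)    // a2 = p->y
        add a1, a1, a2
        sw  a1, 4(a0)    // p->y = a1
RISC-V Assembly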

These instructions frequently have an alignment constraint: the pointer value must (or, at least, should) be divisible by the number of bytes being loaded. RISC-V, for example, mandates that lw only be used on pointers divisible by 4. This constraint simplifies the microarchitecture; even on architectures that don’t mandate it, aligned loads and stores are typically far faster.

This category also includes instructions necessary for implementing atomics, such as lock cmpxchg on x86 and lr/sc on RISC-V. Atomics are fundamentally about changing the semantics of reading and writing from RAM, and thus require special processor support.

Some architectures, like x86, 65816, and very recently, ARM, provide instructions that implement memcpy and its ilk in hardware: in x86, for example, this is called rep movsb.

Control Flow Instructions

Control flow is the secret ingredient that turns our glorified calculator into a Turing tarpit: they allow changing the flow of program execution based on its current state.

Unconditional jumps implement goto: given some label, the j label instruction jumps directly to it. j can be thought of as writing to a special pc register that holds the program counter. RISC-V also provides a dynamic jump, jr, which will jump to the address in a register. Function calls and returns are a special kind of unconditional jump.

Conditional jumps, often called branches, implement if. beq a0, a1, label will jump to label if a0 and a1 contain the same value. RISC-V provides branch instructions for all kinds of comparisons, like bne, blt, and bge.

Conditional and unconditional jumps can be used together to build loops, much like we could in C using if and goto.

For example, to zero a region of memory:

        // Assume a0 is the start of the region, and a1 the
        // number of bytes to zero.

        // Set a1 to the end of the region.
        add a1, a0, a1
loop_start:
        // If a0 == a1, we're done!
        beq a0, a1, loop_done

        // Store a zero byte to `a0` and advance the pointer.
        sb zero, 0(a0)
        addi a0, a0, 1

        // Take it from the top!
        j loop_start
loop_done:
RISC-V Assembly
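
The same ingredients also give us if-style selection; a minimal sketch of computing a maximum in a0:

        // a0 = max(a0, a1), treating both as signed integers.
        bge  a0, a1, max_done   // If a0 >= a1, keep a0.
        mv   a0, a1             // Otherwise, take a1.
max_done:
RISC-V Assembly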

Miscellaneous Instructions

“Everything else” is, well… everything else.

No-op instructions do nothing: nop’s only purpose is to take up space in the instruction stream. They can be used to pad space, provide room for the linker to fix things up later, or implement nop sleds.

Instructions for poking processor state, like csrrw in RISC-V and wrmsr in x86, also belong in this category, as do “hinting” instructions like memory prefetches.

There are also instructions for special control flow: ecall is RISC-V’s “syscall” instruction, which “traps” to the kernel for it to do something; other architectures have similar instructions.
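
For example, here is a hypothetical snippet that issues a write() syscall directly, assuming Linux’s RISC-V syscall ABI (syscall number in a7, arguments in a0 through a2):

        addi a7, zero, 64    // 64 is write()'s syscall number in this ABI.
        addi a0, zero, 1     // File descriptor 1: stdout.
        la   a1, msg         // Pointer to the bytes to write.
        addi a2, zero, 13    // How many bytes to write.
        ecall                // Trap into the kernel.

        .section .rodata
msg:
        .asciz  "Hello, ecall\n"
RISC-V Assembly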

Breakpoint instructions and “fence” instructions belong here, too.

The Calling Convention

Functions are the core abstraction of all of programming. Assembly is no different: we have functions there, too!

Like in any language, functions are passed a list of arguments, perform some work, and return a value. For example, in C:

int identity(int x) {
  return x;
}

// ...

identity(5)  // Returns 5.
C

Unfortunately, there isn’t anything like function call syntax in assembly. As with everything else, we need to do it instruction by instruction. All we do get in most architectures is a call instruction, which sets up a return address somewhere, and a ret instruction, which uses the return address to jump back to where the function was called.

We need some way to pass arguments, return a computed value, and maintain a call stack, so that each function’s return address is kept intact for its ret instruction to consume. We also need this to be universal: if I pull in a library, I should be able to call its functions.

This mechanism is called the calling convention of the platform’s ABI. It’s a convention, because all libraries must respect it in their exposed API for code to work correctly at runtime.

A Function Call in Slow-Mo

At the instruction level, function calls look something like this:

  1. Pre-call setup. The caller sets up the function call arguments by placing them in the appointed locations for arguments. These are usually either registers or locations on the stack.
     a. The caller also saves the caller-saved registers to the stack.

  2. Jump to the function. The caller executes a call instruction (or whatever the function call instruction might be called – virtually all architectures have one). This sets the program counter to the first instruction of the callee.

  3. Function prologue. The callee does some setup before executing its code.
     a. The callee reserves space on the stack in an architecture-dependent manner.
     b. The callee saves the callee-saved registers to this stack space.

  4. Function body. The actual code of the function runs now! This part of the function needs to make sure the return value winds up wherever the return slot for the function is.

  5. Function epilogue. The callee undoes whatever work it did in the prologue, such as restoring saved registers, and executes a ret (or equivalent) instruction to return.

  6. Post-call cleanup. The caller is now executing again; it can unspill any saved state that it needs immediately after the function call, and can retrieve the return value from the return slot.

    In some ABIs, such as C++’s on Linux, this is where the destructors of the arguments get run. (Rust, and C++ on Windows, have callee-destroyed arguments instead.)

When people say that function calls have overhead, this is what they mean. Not only does the call instruction cause the processor to slam the brakes on its pipeline, causing all kinds of work to get thrown away, but state needs to be delicately saved and restored across the function boundary to maintain the illusion of a callstack.

Small functions which don’t need to use as many registers can avoid some of the setup and cleanup, and leaf functions which don’t call any other functions can avoid basically all of it!

Almost all registers in RISC-V are caller-saved, except for ra and the “saved” registers s0 through s11.

Callee-saved registers are convenient, because they won’t be wiped out by function calls. We can actually see the call to printf use this: even though the compiler could have emitted mul a1, a0, a0 and avoided the mv, this is actually less optimal. We need to keep the value around to return, and a1 is caller-saved, so we would have had to spill a1 before calling printf, regardless of whether printf overwrites a1 or not. We would then have to unspill it into a0 before ret. This costs us a hit to RAM. However, by emitting mul s0, a0, a0; mv a1, s0, we speculatively avoid the spill: if printf is compiled such that it never touches s0, the value never leaves registers at all!

Caller-Side

We can see steps 1 and 2 in the call to printf:

        lui     a0, %hi(.L.str)
        addi    a0, a0, %lo(.L.str)
        mv      a1, s0
        call    printf
RISC-V Assembly

In the usual5 RISC-V calling convention, word-sized arguments are passed in the a0 through a7 registers, falling back to passing on the stack if they run out of space. If an argument is too big to fit in a register, it gets passed by reference instead. Arguments that fit into two registers can be split across registers.

We can see this in action above. The first argument, a string, is passed by pointer in a0; lui and addi do the work of actually putting that pointer into a0. The second argument x is passed in a1, copied from s0 where it landed from the earlier mul instruction.

Complex function signatures require much more6 work to set up.

Once we’re done getting arguments into place, we call, which switches execution over to printf’s first instruction. In addition, it stores the return address, specifically, the address of the instruction immediately after the call, into an architecture-specific location. On RISC-V, this is the special register ra.

Callee-Side

Meanwhile, steps 3 and 4 occur in square_and_print’s prologue/epilogue itself:

square_and_print: 
        addi    sp, sp, -16
        sw      ra, 12(sp)
        sw      s0, 8(sp)

        // ...

        lw      s0, 8(sp)
        lw      ra, 12(sp)
        addi    sp, sp, 16
        ret
RISC-V Assembly

addi sp, sp, -16, which we stared at so hard above, grows the stack by 16 bytes. sp holds the stack pointer, which points to the top of the stack at all times. The stack grows downwards (as in most architectures!) and must be aligned to 16-byte boundaries across function calls: even though square_and_print only uses eight of those bytes, the full 16 bytes must be allocated.

The two sw instructions that follow store (or “spill”) the callee-saved registers ra and s0 to the stack. Note that s1 through s11 are not spilled, since square_and_print doesn’t use them!

At this point, the function does its thing, whatever that means. This includes putting the return value in the return slot, which, for a function that returns an int, is in a0. In general, the return slot is passed back to the caller much like arguments are: if it fits in registers, a0 and a1 are used; otherwise, the caller allocates space for it and passes a pointer to the return slot as a hidden argument (in e.g. a0)7.

The epilogue undoes the prologue’s operations in reverse order, unspilling registers and shrinking the stack, followed by ret. On RISC-V, all ret does is jump to the location referred to by the ra register.

Of course, all this work is only necessary to maintain the illusion of a callstack; if square_and_print were a leaf function, it would not need to spill anything at all! This results in an almost trivial function:

int square(int x) {
  return x * x;
}
// `x` is already in a0, and the
// return value needs to wind up
// in a0. EZ!
square:
        mul a0, a0, a0
        ret
RISC-V Assembly

Because leaf functions won’t call other functions, they won’t need to save the caller-saved tX registers, so they can freely use them instead of the sX registers.
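
For instance, a hypothetical leaf function can do all of its work in argument and temporary registers and skip the stack entirely:

// Sketch: int dot2(int a, int b, int c, int d) { return a*b + c*d; }
dot2:
        mul t0, a0, a1   // t0 = a * b; t0 is fair game, since we call no one.
        mul t1, a2, a3   // t1 = c * d.
        add a0, t0, t1   // The return value goes in a0.
        ret              // Nothing was spilled, so there is nothing to restore.
RISC-V Assembly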

The End, for Now

Phew! We’re around six thousand words in, so let’s checkpoint what we’ve learned:

  1. Computers are stupid, but can at least follow extremely basic instructions, which are encoded as binary.

  2. Assembly language is a human-readable version of these basic instructions for a particular computer.

  3. Assembly language programs consist of instructions, labels, and directives.

  4. Each instruction is a mnemonic followed by zero or more operands.

  5. Registers hold values the machine is currently operating on.

  6. Instructions can be broadly categorized as arithmetic, memory, control flow, and “miscellaneous” (plus vector and float instructions, for another time).

  7. The calling convention describes the low-level interface of a general function, consisting of some pre-call setup, and a prologue and epilogue in each function.

That’s all for now. RISC-V is a powerful but reasonably simple ISA. Next time, we’ll dive into the much older, much larger, and much more complex Intel x86.

  1. This is a hilarious lie that is beyond the scope of this post. See, for example, https://en.wikipedia.org/wiki/Superscalar_processor

  2. What’s a machine word, exactly? It really depends on context. Most popular architectures have a straightforward definition: the size of a GPR or the size of a pointer, which are the same.

    This is not true of all architectures, so beware. 

  3. Thankfully, these can be polyfilled using the more ubiquitous instructions described above. Hacker’s Delight contains all of the relevant algorithms, so I won’t reproduce them here. The division polyfills are particularly interesting. 

  4. It’s a bit interesting that we don’t write lw a1, a0[4] in imitation of array syntax. This specific corner of the notation is shockingly diverse across assemblers: in ARM, we write ldr r0, [r1, #offset]; in x86, mov rax, [rdx + offset], or movq offset(%rdx), %rax for AT&T-flavored assemblers (which is surprisingly similar to the RISC-V syntax!); in 6502, lda ($1234, X)

  5. The calling convention isn’t actually determined by the architecture in most cases; that’s why it’s called a convention. The convention on x86 actually differs on Windows and Linux, and is usually also language-dependent; C’s calling convention is usually documented, but C++, Rust, and Go invent their own to handle language-specific fussiness.

    Of course, if you’re writing assembly, you can do whatever you want (though the silicon may be optimized for a particular recommended calling convention).

    RISC-V defines a recommended calling convention for ELF-based targets: https://github.com/riscv-non-isa/riscv-elf-psabi-doc

  6. The following listing shows how all kinds of different arguments are passed. The output isn’t quite what Clang emits, since I’ve cleaned it up for clarity.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdnoreturn.h>
    
    struct Pair {
      uint32_t x, y;
    };
    struct Triple {
      uint32_t x, y, z;
    };
    struct Packed {
      uint8_t x, y, z;
    };
    
    // `noreturn` obviates the
    // {pro,epi}logue in `call_it`.
    noreturn void all_the_args(
      uint32_t a0,
      uint64_t a1a2,
      struct Pair a3a4,
      struct Triple a5_by_ref,
      uint16_t a6,
      struct Packed a7,
      uint32_t on_the_stack,
      struct Triple stack_by_ref
    );
    
    void call_it(void) {
      struct Pair u = {7, 9};
      struct Triple v = {11, 13, 15};
      struct Packed w = {14, 16, 18};
      all_the_args(
        42, -42,  u, v,
         5,   w, 21, v
      );
    }
    call_it:
      // Reserve stack space.
      addi    sp, sp, -48
    
      // Get `&call_it.v` into `a3`.
      lui     a3, %hi(call_it.v)
      addi    a3, a3, %lo(call_it.v)
    
      // Copy contents of `*a3`
      // into `a0...a2`.
      lw      a0, 0(a3)
      lw      a1, 4(a3)
      lw      a2, 8(a3)
    
      // Create two copies of `v`
      // on the stack to pass by
      // reference.
    
      // This is `a5_by_ref`.
      sw      a2, 40(sp)
      sw      a1, 36(sp)
      sw      a0, 32(sp)
    
      // This is `stack_by_ref`.
      sw      a2, 24(sp)
      sw      a1, 20(sp)
      sw      a0, 16(sp)
      
      // Load the argument regs.
      addi    a0, zero, 42
      addi    a1, zero, -42
      addi    a2, zero, -1
      addi    a3, zero, 7
      addi    a4, zero, 9
      // A pointer to `a5_by_ref`!
      addi    a5, sp, 32
      addi    a6, zero, 5
      // Note that `a7` is three
      // packed bytes!
      lui     a7, 289
      addi    a7, a7, 14
    
      // Store `21` on the top of
      // the stack (our "a8")
      addi    t0, zero, 21
      sw      t0, 0(sp)
    
      // Store a pointer to
      // `stack_by_ref` on the 
      // second spot from the
      // stack top (our "a9")
      addi    t0, sp, 16
      sw      t0, 4(sp)
    
      // Call it!
      call    all_the_args
    
    call_it.v:
      // The constant `{11, 13, 15}`.
      .word   11
      .word   13
      .word   15
    RISC-V Assembly

  7. LLVM occasionally does somewhat clueless things around this corner of some ABIs. Given

    typedef struct { char p[100]; } X;
    
    X make_big(int x) {
      return (X) {x};
    } 

    we get the following from Clang:

    // NOTE: Return slot passed in `a0`, `x` passed in `a1`.
    make_big:
            addi    sp, sp, -16
            sw      ra, 12(sp)
            sw      s0, 8(sp)
            sw      s1, 4(sp)
            mv      s0, a1
            mv      s1, a0
            addi    a0, a0, 1
            addi    a2, zero, 99
            mv      a1, zero
            call    memset
            sb      s0, 0(s1)
            lw      s1, 4(sp)
            lw      s0, 8(sp)
            lw      ra, 12(sp)
            addi    sp, sp, 16
            ret
    RISC-V Assembly

    Note that sb s0, 0(s1) stores the input value x into the first element of the big array after calling memset. If we move the store to before the call, we can avoid much silliness, including some unnecessary spills:

    make_big:
            addi    sp, sp, -16
            sw      ra, 12(sp)
            sb      a1, 0(a0)
            addi    a0, a0, 1
            mv      a1, zero
            addi    a2, zero, 99
            call    memset
            lw      ra, 12(sp)
            addi    sp, sp, 16
            ret
    RISC-V Assembly