Move Constructors Revisited

Almost a year ago I developed the moveit Rust library, which provides primitives for expressing something like C++’s T&& and move constructors while retaining Rust’s so-called “destructive move property”: moving a value transfers ownership, rather than doing a funny copy.

In an earlier blogpost I described the theory behind this library and some of the motivation, which I feel fairly confident about, especially in how constructors (and their use of pinning) are defined.

However, there is a problem.

A Not-so-Quick Recap

The old post is somewhat outdated, since moveit uses different names for a lot of things that are geared to fit in with the rest of Rust.

The core abstraction of moveit is the constructor, which are types that implement the New trait:

pub unsafe trait New: Sized {
  /// The type to construct.
  type Output;

  /// Construct a new value in-place using the arguments stored
  /// in `self`.
  unsafe fn new(self, this: Pin<&mut MaybeUninit<Self::Output>>);

A New type is not what is being constructed; rather, it represents a method of construction, resembling a specialized Fn trait. The constructed type is given by the associated type Output.

Types that can be constructed are constructed in place, unlike most Rust types. This is a property shared by constructors in C++, allowing values to record their own address at the moment of creation. Explaining why this is useful is a bit long-winded, but let’s assume this is a thing we want to be able to do. Crucially, we need the output of a constructor to be pinned, which is why the this output parameter is pinned.

Calling a constructor requires creating the output location in advance so that we can make it available to it in time:

// Create storage for the new value.
let mut storage = MaybeUninit::uninit();

// Pin that storage on the stack; by calling this, we may never move
// `storage` again, even after this goes out of scope.
let uninit = Pin::new_unchecked(&mut storage);

// Now we can call the constructor. It's only unsafe because it assumes
// the provided memory is uninitialized.;

// This is now safe, since `my_new` initialized the value, so we can
// do with it what we please.
let result = uninit.map_unchecked_mut(|mp| mp.assume_init_mut());

However, this is not quite right. Pin<P>’s docs are quite clear that we must ensure that, once we create an Pin<&mut T>, we must call T’s destructor before its memory is re-used; since reuse is unavoidable for stack data, and storage will not do it for us (it’s a MaybeUninit<T>, after all), we must somehow run the destructor separately.

An “Easy” Solution

One trick we could use is to replace storage with some kind of wrapper over a MaybeUninit<T> that calls the destructor for us:

struct EventuallyInit<T>(MaybeUninit<T>);
impl<T> Drop for EventuallyInit<T> {
  fn drop(&mut self) {
    unsafe { ptr::drop_in_place(self.0.assume_init_mut()) }

This works, but isn’t ideal, because now we can’t write down something like a C++ move constructor without running into the classic C++ problem: all objects must be destroyed unconditionally, so now you can have moved-from state. Giving up Rust’s moves-transfer-ownership (i.e. affine) property is bad, but it turns out to be avoidable!

There are also some scary details around panics here that I won’t get into.

&T, &mut T, … &move T?

moveit instead provides a MoveRef<'frame, T> type that tries to capture the notion of what an “owning reference” could mean in Rust. An &move or &own type has been discussed many times, but implementing it in the full generality it would deserve as a language feature runs into some interesting problems due to how Box<T>, the heap allocated equivalent, currently behaves.

We can think of MoveRef<'frame, T> as wrapping the longest-lived &mut T reference pointing to a particular location in memory. The longest-lived part is crucial, since it means that MoveRef is entitled to run its pointee’s destructor:

// Notice parallels with EventuallyInit<T> above.
impl<T: ?Sized> Drop for MoveRef<'_, T> {
  fn drop(&mut self) {
    unsafe { ptr::drop_in_place(self.ptr) }

No reference to the pointee can ever outlive the MoveRef itself, by definition, so this is safe. The owner of a value is that which is entitled to destroy it, and therefore a MoveRef literally owns its pointee. Of course, this means we can move out of it (which was the whole point of the original blogpost).

Because of this, we are further entitled to arbitrarily pin a MoveRef with no consequences: pinning it would consume the unpinned MoveRef (for obvious reasons, MoveRefs cannot be reborrowed) so no unpinned reference may outlive the pinning operation.

This gives us a very natural solution to the problem above: result should not be a Pin<&mut T>, but rather a Pin<MoveRef<'_, T>>:

let mut storage = MaybeUninit::uninit();
let uninit = Pin::new_unchecked(&mut storage);;

// This is now safe, since `my_new` initialized the value, so we can
// do with it what we please.
let result = MoveRef::into_pinned(MoveRef::new_unchecked(
  uninit.map_unchecked_mut(|mp| mp.assume_init_mut())

This messy sequence of steps is nicely wrapped up in a macro provided by the library that ensures safe initialization and eventual destruction:

// Allocate storage on the stack, emplace `my_new` onto it, and pin it
// in an owning reference.
moveit!(let result: Pin<MoveRef<T>> = my_new);

There is also some reasonably complex machinery that allows us to do something like an owning Deref, which I’ll come back to in a bit.

However, there is a small wrinkle that I did not realize when I first designed MoveRef: what happens if I mem::forget a MoveRef?

Undefined Behavior, Obviously

Quashing destruction isn’t new to Rust: we can mem::forget just about anything, leaking all kinds of resources. And that’s ok! Destructors alone cannot be used in type design to advert unsafe catastrophe, a well-understood limitation of the language that we have experience designing libraries around, such as Vec::drain().

MoveRef’s design creates a contradiction:

  • MoveRef is an owning smart pointer, and therefore can be safely pinned, much like Box::into_pinned() enables. Constructors, in particular, are designed to generate pinned MoveRefs!
  • Forgetting a MoveRef will cause the pointee destructor to be suppressed, but its storage will still be freed and eventually re-used, a violation of the Pin drop guarantee.

This would appear to mean that a design like MoveRef is not viable at all, and that this sort of “stack box” strategy is always unsound.

What About Box?

What about it? Even though we can trivially create a Pin<Box<i32>> via Box::pin(), this is a red herring. When we mem::forget a Box, we also forget about its storage too. Because its storage has been leaked unrecoverably, we are still, technically, within the bounds of the Pin contract. Only barely, but we’re inside the circle.

Interestingly, the Rust language has to deal with a similar problem; perhaps it suggests a way out?

Drop Flags and Dynamic Ownership Transfer

Carefully crafted Rust code emits some very interesting assembly. I’ve annotated the key portion of the output with a play-by-play below.

pub fn someone_elses_problem(_: Box<i32>) {
  // What goes in here isn't important,it just needs to
  // be an optimizer black-box.

pub fn maybe_drop(flag: bool) {
  let x = Box::new(42);
  if flag {
// See Godbolt widget above for full disassembly.
  // ...

  // Allocate memory.
  call    __rust_alloc

  // Check if allocation failed; panic if so.
  test    rax, rax
  je      .L.scream

  // Write a 42 to the memory.
  mov     dword ptr [rax], 42

  // Check the flag argument (LLVM decided to put it in rbx). If
  // true, we go free the memory ourselves.
  test    bl, bl
  je      .L.actually_our_problem

  // Otherwise, make it someone else's problem; they get to
  // free the memory for themselves. 
  mov     rdi, rax
  pop     rbx
  jmp     example::someone_elses_problem

  // ...
x86 Assembly

The upshot is that maybe_drop conditions the destructor of x on a flag, which is allocated next to it on the stack. Rust flips this flag when the value is moved into another function, and only runs the destructor when the flag is left alone. In this case, LLVM folded the flag into the bool argument, so this isn’t actually a meaningful perf hit.

These “drop flags” are key to Rust’s ownership model. Since ownership may be transferred dynamically due to reasonably complex control flow, it needs to leave breadcrumbs for itself to figure out whether the value wound up getting moved away or not. This is unique to Rust: in C++, every object is always destroyed, so no such faffing about is necessary.

Similarly, moveit can close this soundness hole by leaving itself breadcrumbs to determine if safe code is trying to undermine its guarantees.

In other words: in Rust, it is not sufficient to manage a pointer to manage a memory location; it is necessary to manage an explicit or implicit drop flag as well.

A Flagged MoveRef

We can extend MoveRef to track an explicit drop flag:

pub struct MoveRef<'frame, T> {
  ptr: &'frame mut T,

  // Set to `false` once the destructor runs.
  drop_flag: &'frame Cell<bool>,

Wrapping it in a Cell is convenient and doesn’t cost us anything, since a MoveRef can never be made Send or Sync anyways. Inside of its destructor, we can flip the flag, much like Rust flips a drop flag when transferring ownership to another function:

impl<T: ?Sized> Drop for MoveRef<'_, T> {
  fn drop(&mut self) {
    unsafe { ptr::drop_in_place(self.ptr) }

But, how should we use it? The easiest way is to change the definition of moveit!() to construct a flag trap:

let mut storage = MaybeUninit::uninit();
let uninit = Pin::new_unchecked(&mut storage);

// Create a *trapped flag*, which I'll describe below.
let trap = TrappedFlag::new();

// Run the constructor as before and construct a MoveRef.;
let result = MoveRef::into_pin(MoveRef::new_unchecked(
  trap.flag(),  // Creating a MoveRef now requires
                // passing in a flag in addition to 
                // a reference to the owned value itself.

The trap is a deterrent against forgetting a MoveRef: because the MoveRef’s destructor flips the flag, the trap’s destructor will notice if this doesn’t happen, and take action accordingly.

Note: in moveit, this is actually implemented by having the Slot<T> type carry a reference to the trap, created in the slot!() macro. However, this is not a crucial detail for the design.

An Earth-Shattering Kaboom

The trap is another RAII type that basically looks like this:

pub struct TrappedFlag(Cell<bool>);
impl Drop for TrappedFlag {
  fn drop(&mut self) {
    if self.0.get() { abort() }

The trap is simple: if the contained drop flag is not flipped, it crashes the program. Because moveit!() allocates it on the stack where uses cannot mem::forget it, its destructor is guaranteed to run before storage’s destructor runs (although Rust does not guarantee destructors run, it does guarantee their order).

If a MoveRef is forgotten, it won’t have a chance to flip the flag, which the trap will detect. Once the trap’s destructor notices this, it cannot return, either normally or by panic, since this would cause storage to be freed. Crashing the program is the only1 acceptable response.

Some of MoveRef’s functions need to be adapted to this new behavior: for example, MoveRef::into_inner() still needs to flip the flag, since moving out of the MoveRef is equivalent to running the destructor for the purposes of drop flags.

A Safer DerefMove

In order for MoveRef to be a proper “new” reference type, and not just a funny smart pointer, we also need a Deref equivalent:

pub unsafe trait DerefMove: DerefMut + Sized {
  /// An "uninitialized" version of `Self`.
  type Uninit: Sized;
  /// "Deinitializes" `self`, producing an opaque type that will
  /// destroy the storage of `*self` without calling the pointee
  /// destructor.
  fn deinit(self) -> Self::Uninit;

  /// Moves out of `this`, producing a `MoveRef` that owns its
  /// contents.
  unsafe fn deref_move(this: &mut Self::Uninit)
    -> MoveRef<'_, Self::Target>;

This is the original design for DerefMove, which had a two-phase operation: first deinit() was used to create a destructor-suppressed version of the smart pointer that would only run the destructor for the storage (e.g., for Box, only the call to free()). Then, deref_move() would extract the “inner pointee” out of it as a MoveRef. This had the effect of splitting the smart pointer’s destructor, much like we did above on the stack.

This has a number of usability problems. Not only does it need to be called through a macro, but deinit() isn’t actually safe: failing to call deref_move() is just as bad as calling mem::forget on the result. Further, it’s not clear where to plumb the drop flags through.

After many attempts to graft drop flags onto this design, I replaced it with a completely new interface:

pub unsafe trait DerefMove: DerefMut + Sized {
  /// The "pure storage" form of `Self`, which owns the storage
  /// but not the pointee.
  type Storage: Sized;

  /// Moves out of `this`, producing a [`MoveRef`] that owns
  /// its contents.
  fn deref_move<'frame>(
    storage: DroppingSlot<'frame, Self::Storage>,
  ) -> MoveRef<'frame, Self::Target>
    Self: 'frame;

Uninit has been given the clearer name of Storage: a type that owns just the storage of the moved-from pointer. The two functions were merged into a single, safe function that performs everything in one step, emitting the storage as an out-parameter.

The new DroppingSlot<T> is like a Slot<T>, but closer to a safe version of the EventuallyInit<T> type from earlier: its contents are not necessarily initialized, but if they are, it destroys them, and it only does so when its drop flag is set.

Box is the most illuminating example of this trait:

unsafe impl<T> DerefMove for Box<T> {
  type Storage = Box<MaybeUninit<T>>;

  fn deref_move<'frame>(
    storage: DroppingSlot<'frame, Box<MaybeUninit<T>>>,
  ) -> MoveRef<'frame, T>
    Self: 'frame
    // Dismantle the incoming Box into the "storage-only part".
    let this = unsafe {

    // Put the Box into the provided storage area. Note that we
    // don't need to set the drop flag; `DroppingSlot` does
    // that automatically for us.
    let (storage, drop_flag) = storage.put(this);

    // Construct a new MoveRef, converting `storage` from 
    // `&mut Box<MaybeUninit<T>>` into `&mut T`.
    unsafe { MoveRef::new_unchecked(storage.assume_init_mut(), drop_flag) }

MoveRef’s own implementation illustrates the need for the explicit lifetime bound:

unsafe impl<'a, T: ?Sized> DerefMove for MoveRef<'a, T> {
  type Storage = ();

  fn deref_move<'frame>(
    _: DroppingSlot<'frame, ()>,
  ) -> MoveRef<'frame, T>
    Self: 'frame
    // We can just return directly; this is a mere lifetime narrowing.

Since this is fundamentally a lifetime narrowing, this can only compile if we insist that 'a: 'frame, which is implied by Self: 'frame. Earlier iterations of this design enforced it via a MoveRef<'frame, Self> receiver, which turned out to be unnecessary.


As of writing, I’m still in the process of self-reviewing this change, but at this point I feel reasonably confident that it’s correct; this article is, in part, written to convince myself that I’ve done this correctly.

The new design will also enable me to finally complete my implementation of a constructor and pinning-friendly vector type; this issue came up in part because the vector type needs to manipulate drop flags in a complex way. For this reason, the actual implementation of drop flags actually uses a counter, not a single boolean.

I doubt this is the last issue I’ll need to chase down in moveit, but for now, we’re ever-closer to true owning references in Rust. ◼

  1. Arguably, running the skipped destructor is also a valid remediation strategy. However, this is incompatible with what the user requested: they asked for the destructor to be supressed, not for it to be run at a later date. This would be somewhat surprising behavior, which we could warn about for the benefit of unsafe code, but ultimately the incorrect choice for non-stack storage, such as a MoveRef referring to the heap. 

Understanding Assembly
Part I: RISC-V

A Turing tarpit is a programming language that is Turing-complete but very painful to accomplish anything in. One particularly notable tarpit is Brainfuck, which has a reputation among beginner and intermediate programmers as being unapproachable and only accessible to the most elite programmers hence the name, as Wikipedia puts it:

The language’s name is a reference to the slang term brainfuck, which refers to things so complicated or unusual that they exceed the limits of one’s understanding.

Assembly language, the “lowest-level” programming language on any computer, has a similar reputation: difficult, mysterious, and beyond understanding. A Turing tarpit that no programmer would want to have anything to do with.

Although advanced programmers usually stop seeing assembly as mysterious and inaccessible, I feel like it is a valuable topic even for intermediate programmers, and one that can be made approachable and interesting.

This series seeks to be that: assuming you have already been using a compiled language like Rust, C++, or Go, how is assembly relevant to you?

If you’re here to just learn assembly and don’t really care for motivation, you can just skip ahead.

This series is about learning to understand assembly, not write it. I do occasionally write assembly for a living, but I’m not an expert, and I don’t particularly relish it. I do read a ton of assembly, though.

What Is It, Anyways?

As every programmer knows, computers are very stupid. They are very good at following instructions and little else. In fact, the computer is so stupid, it can only process basic instructions serially1, one by one. The instructions are very simple: “add these two values”, “copy this value from here to there”, “go run these instructions over here”.

A computer processor implements these instructions as electronic circuits. At its most basic level, every computer looks like the following program:

size_t program_counter = ...;
Instruction *program = ...;

while (true) {
  Instruction next = program[program_counter];
  switch (next.opcode) {
    // Figure out what you're supposed to be doing and do it.

The array program is a your program encoded as a sequence of these “machine instructions” in some kind of binary format. For example, in RISC-V programs, each instruction is a 32-bit integer. This binary format is called machine code.

For example, when a RISC-V processor encounters the value 5407443 decoding circuitry decides that it means that it should take the value in the “register” a0, add 10 to it, and place the result in the register a1.

Decoding Instructions

5407443 seems opaque, but when viewed as binary, we can see how the processor decodes it:

0b 000000000101 00101 000 00110 0010011
   \__________/ \___/ \_/ \___/ \_____/
    |           |     |   |     |
    imm         rs1   fn  rd    opcode

opcode describes what sort of instruction this is, and what format it’s in; 0b0010011 means it’s an “immediate arithmetic” instruction, which uses the “I-type” format, given above. fn further specifies what the operation does; 0b000, combined with the value of opcode, means this is an addition instruction.

rs1 is the source: it gives the name of the source register, a0, given by it index, 0b00101, i.e., 10. Similarly, rd specifies the destination a1 by its index 11. Finally, imm is the value to add, 0b000000000101, or 10. The constant value appears immediately in the instruction itself, so it’s called an immediate.

However, if you’re a human programming a computer, writing all of this by hand is… very 60s, and you might prefer to have a textual representation, so you can write this more simply as addi a1, a0, 10.

addi a1, a0, 10 is a single line of assembly: it describes a single instruction in text form. Assembly language is “just” a textual representation of the program’s machine code. Your assembler can convert from text into machine instructions, and a disassembler reverses the process.

The simple nature of these instructions is what makes assembly a sort of Turing tarpit: you only get the most basic operations possible: you’re responsible for building everything else.

On Architectures

There isn’t “an” assembly language. Every computer has a different instruction set architecture, or “ISA”; I use the terms “instruction set”, “architecture”, and “ISA” interchangeably. Each ISA has a corresponding assembly language that describes that ISA’s specific instructions, but they all generally have similar overall structure.

I’m going to focus on three ISAs for ease of exposition, introduced in this order:

  1. RISC-V, a modern and fairly simple instruction set (specifically, the rv32gc variant). That’s Part I.
  2. x86_64, the instruction set of the device you’re reading this on (unless it’s a phone, an Apple M1 laptop, or something like a Nintendo Switch). That’s Part II.
  3. MOS 6502, a fairly ancient ISA still popular in very small microcontrollers. That’s Part III.

We’re starting with RISC-V because it’s a particularly elegant ISA (having been developed for academic work originally), while still being representative of the operations most ISAs offer.

In the future, I may dig into some other, more specialized ISAs.

But Why?

It’s actually very rare to write actual assembly. Thanks to modern (relatively) languages like Rust, C++, and Go, and even things like Haskell and JavaScript, virtually no programmers need to write assembly anymore.

But that’s only because it’s the leading language written by computers themselves. A compiler’s job is, fundamentally, to write the the assembly you would have had to write for you. To better understand what a compiler is doing for you, you need to be able to read its output.

At this point, it may be worth looking at my article on linkers as a refresher on the C compilation model.

For example, let’s suppose we have the very simple C program below.

#include <stdio.h>

int square_and_print(int x) {
    x *= x;
    printf("%d\n", x);
    return x;

Clang, my C compiler of choice, can turn it directly into a library via clang -c square.c. -c asks the compiler to stop before the link step, outputting the object file square.o. We can ask the compiler to stop even sooner than that by writing clang -S square.c, which will output square.s, the assembly file the compiler produced! For this example, and virtually all others in this post, I’m using a RISC-V target: -target riscv32-unknown-elf -march=rv32gc.

If you build with -Oz to make the code as small as possible (this makes it easiest to see what’s going on, too), you get something like this:

        .file   "square.c"
        .globl  square_and_print
        addi    sp, sp, -16
        sw      ra, 12(sp)
        sw      s0, 8(sp)
        mul     s0, a0, a0          // !
        lui     a0, %hi(.L.str)
        addi    a0, a0, %lo(.L.str)
        mv      a1, s0
        call    printf              // !
        mv      a0, s0
        lw      s0, 8(sp)
        lw      ra, 12(sp)
        addi    sp, sp, 16

        .section        .rodata
        .asciz  "%d\n"

There’s a lot going on! But pay attention to the two lines with a // !: the first is mul s0, a0, a0, which is the multiplication x *= x;. The second is call printf, which is our function call to printf()! I’ll explain what everything else means in short order.

Writing assembly isn’t a crucial skill, but being able to read it is. It’s actually so useful, that a website exists for quickly generating the assembly output of a vast library of compilers: the Compiler Explorer, frequently just called “godbolt” after its creator, Matt Godbolt. Being able to compare the output of different compilers can help understand what they do! Click on the godbolt button in the code fences to a godbolt for it.

“Low-level” languages like C aren’t the only ones where you can inspect assembly output. Godbolt supports Go: for example, click the godbolt button below.

package sq

import "fmt"

func SquareAndPrint(x int) int {
    x *= x
    fmt.Printf("%d\n", x)
    return x

Hopefully this is motivation enough to jump into the language proper. It is very useful to have a godbolt tab open to play around with examples!

Diving In

So, let’s say you do want to read assembly. How do we do that?

Let’s revisit our square.c example above. This time, I’ve added comments explaining what all the salient parts of the code do, including the assembler directives, which are all of the form .blah. Note that the actual compiler output includes way more directives that would get in the way of exposition.

There’s a lot of terms below that I haven’t defined yet. I’ll break down what this code does gradually, so feel free to refer back to it as necessary, using this handy-dandy link.

        // This tells the assembler to place all code that
        // follows in the `.text` section, where executable
        // data goes.

        // This is just metadata that tools can use to figure out
        // how the executable was built.
        .file   "square.c"

        // This asks the assembler to mark `square_and_print`
        // as an externally linkable symbol. Other files that
        // refer to `square_and_print` will be able to find it
        // at link time.
        .globl  square_and_print

square_and_print: // This is a label, which gives this position
                  // in the executable a name that can be
                  // referenced. They're very similar to `goto`
                  // labels from C.
                  // We'll see more labels later on.

        // This is the function prologue, which "sets up" the
        // function: it allocates stack space and saves the
        // return address, along with other calling-convention
        // fussiness.
        addi    sp, sp, -16
        sw      ra, 12(sp)
        sw      s0, 8(sp)

        // This is our `x *= x;` from before! Notice that the
        // compiler rewrote this to `temp = x * x;` at some
        // point, since the destination register is `s0`.
        mul     s0, a0, a0

        // These two instructions load the address of a string
        // constant; this pattern is specific to RISC-V.
        lui     a0, %hi(.L.str)
        addi    a0, a0, %lo(.L.str)
        // This copies the multiplication result into `a1`.
        mv      a1, s0

        // Call to printf!
        call    printf

        // Move `s0` into `a0`, since it's the return value.
        mv      a0, s0

        // This is the function epilogue, which restores state
        // saved in the prologue and de-allocates the stack
        // frame.
        lw      s0, 8(sp)
        lw      ra, 12(sp)
        addi    sp, sp, 16
        // We're done; return from the function!

        // This tells the assembler to place what follows in
        // the `.rodata` section, for read-only constants like
        // strings.
        .section        .rodata

.L.str: // Give our string constant a private name. By convention,
        // .L labels are "private" names emitted by the compiler.

        // Emit an ASCII string into `.rodata` with an extra null
        // terminator at the end: that's what the `z` stands for.
        .asciz  "%d\n"

The Core Syntax

All assemblers are different, but the core syntax tends to be the same. There are three main kinds of syntax productions:

  • Instructions, which consist of a mnemonic followed by some number of operands, such as addi sp, sp -16 and call printf above. These are the text encoding of machine code.
  • Labels, which consist of a symbol followed by a colon, like square_and_print: or .L.str:. These are used to let instruction operands refer to locations in the program.
  • Directives, which vary wildly by assembler. GCC-style assembly like that above uses a .directive arg, arg syntax, as seen in .text, .globl, and .asciz. They control the behavior of the assembler in various ways.

An assembler’s purpose is to read the .s file and serialize it as a binary .o file. It’s kind of like a compiler, but it does virtually no interesting work at all, beyond knowing how to encode instructions.

Directives control how this serialization occurs (such as moving around the output cursor); instructions are emitted as-is, and labels refer to locations in the object file. Simple enough, right?

Anatomy of an Instruction

Let’s look at the very first instruction in square_and_print:

        addi sp, sp, -16
        ---- --  --  ---
         |   |   |    |
        mnemonic |   immediate operand
             |  input operand
            output operand
RISC-V Assembly

The first token is called the mnemonic, which is a painfully terse abbreviation of what the instruction does. In this case, addi means “add with immediate”.

sp is a register. Registers are special variables wired directly into the processor that can be used as operands in instructions. The degree to which only registers are permitted as operands varies by architecture; RISC-V only allows registers, but x86, as we’ll see, does not. Registers come in many flavors, but sp is a GPR, or “general purpose register”; it holds a machine word-sized integer, which in the case of 32-bit RISC-V is… 32-bit2.

RISC-V Registers

One of my absolute favorite parts of RISC-V is how it names its registers. It has 32 GPRs named x0 through x31. However, these registers have so-called “ABI names” that specify the role of each register in the ABI.

The usefulness of these names will be much more apparent when we discuss the calling convention, so feel free to come back to this later.

x0 is called zero, because of its special property: writes to it are ignored, and reads always produce zero. This is handy for encoding certain common operations: for example, it can be used to quickly get a constant value= into a register: addi rd, zero, 42.

x1, x2, x3, and x4 have special roles and generally aren’t used for general computation. The first two are the link register ra, which holds the return address, and sp, the stack pointer.

The latter two are gp and tp; the global ppointer and the thread ppointer; their roles are somewhat complicated, so we won’t discuss them in this post.

The remaining registers belong to one of three categories: argument registers, saved registers, and temporary registers, named so for their role in calling a function (as described below).

The argument registers are x10 through x17, and use the names a0 through a7. The saved registers are x8, x9, and x18 through x27, called s0, through s11. The temporary registers are x5 through x7 and x28 through x31, called t0 through t6.

As a matter of personal preference, you may notice me reaching for argument registers for most examples.

-16 is an immediate, which is a literal value that is encoded directly into the instruction. The encoding of addi sp, sp, -16 will include the binary representation of -16 (in the case of RISC-V, as a 12-bit integer). [The decoding example above]{#decoding-instructions} shows how immediates are literally encoded immediately in the instruction.

Immediates allow for small but fixed integer arguments to be encoded with high locality to the instruction, which is good for code size and performance.

The first operand in RISC-V is (almost) always the output. addi, rd, rs, imm should be read as rd = rs + imm. Virtually all assembler syntax follows this convention, which is called the three-address code.

Other kinds of operands exist: for example, call printf refers to the symbol printf. The assembler, which doesn’t actually know where printf is, will emit a small note in the object file that tells the linker to find printf and splat it into the assembly according to some instructions in the note. These notes are called relocations.

The instructions lui a0, %hi(.L.str) and addi a0, a0, %lo(.L.str) use the %lo and %hi operand types, which are specific to RISC-V; they load the low 12 bits and high 20 bits of a symbol’s address into the immediate operand. This is a RISC-V-specific pattern for loading an address into a register, which most assemblers provide with the pseudoinstruction la a0, .L.str (where la stands for “load address”).

Most architectures have their own funny architecture-specific operand types to deal with the architecture’s idiosyncrasy.

Types of Instructions

Available instructions tend to be motivated by providing one of three classes of functionality:

  1. A Turing-complete register machine execution environment. This lends to the Turing tarpit nature of assembly: only the absolute minimum in terms of control flow and memory access is provided.
  2. Efficient silicon implementation of common operations on bit strings and integers, ranging from arithmetic to cryptographic algorithms.
  3. Building a secure operating system, hosting virtual machines, and actuating hardware external to the processor, like a monitor, a keyboard, or speakers.

Instructions can be broadly classified into four categories: arithmetic, memory, control flow, and “everything else”. In the last thirty years, the bar for general-purpose architectures is usually “this is enough to implement a C runtime.”

Arithmetic Instructions

Arithmetic makes up the bulk of the instruction set. This always includes addition, subtraction, and bitwise and, or, and xor, as well as unary not and negation.

In RISC-V, these come in two variants: a three-register version and a two-register, one immediate version. For example, add a0, a1, a2 is the three-register version of addition, while addi a0, a1, 42 is the immediate version. There isn’t a subi though, since you can just use negative immediates with addi.

not and neg are not actual instructions in RISC-V, but pseudoinstructions: not a0, a1 encodes as xori a0, a1, -1, while neg a0, a1 becomes sub a0, zero, a1.

Most instruction sets also have bit shifts, usually in three flavors: left shifts, right shifts, and arithmetic right shifts; arithmetic right shift is defined such that it behaves like division by powers of two on signed integers. RISC-V’s names for these instructions are sll, srl, and sra.

Multiplication and division are somewhat rarer, because they are expensive to implement in silicon; smaller devices don’t have them3. Division in particular is very complex to implement in silicon. Instruction sets usually have different behavior around division by zero: some architectures will fault, similar to a memory error, while some, like RISC-V, produce a well-defined trap value.

There is usually also a “copy” instruction that moves the value of one register to another, which is kind of like a trivial arithmetic instruction. RISC-V calls this mv a0, a1, but it’s just a pseudoinstruction that expands to addi a0, a1, 0.

Some architectures also offer more exotic arithmetic. This is just a sampler of what’s sometimes available:

  • Bit rotation, which is like a shift but bits that get shifted off end up at the other end of the integer. This is useful for a vast array of numeric algorithms, including ARX ciphers like ChaCha20.
  • Byte reversal, which can be used for changing the endianness of an integer; bit reversal is analogous.
  • Bit extraction, which can be used to form new integers out of bitfields of another.
  • Carry-less multiplication, which is like long multiplication but you don’t bother to carry anything when you add intermediates. This is used to implement Galois/Counter mode encryption.
  • Fused instructions, like xnor and nand.
  • Floating point instructions, usually implementing the IEEE 754 standard.

There is also a special kind of arithmetic instruction called a vector instruction, but I’ll leave those for another time.

Memory Instructions

Load instructions fetch memory from RAM into registers, while store instructions write it back. These instructions are what we use to implement pointers.

They come in all sorts of different sizes: RISC-V has lw, lh, and lb for loading 32-, 16-, and 8-bit values from a location; sw, sh, and sb are their store counterparts. 64-bit RISC-V also provides ld and sd for 64-bit loads and stores.

Load/store instructions frequently take an offset for indexing into memory. lw a1, 4(a0)4 is effectively a1 = a0[4], treating a0 like a pointer.

These instructions frequently have an alignment constraint: the pointer value must (or, at least, should) be divisible by the number of bytes being loaded. RISC-V, for example, mandates that lw only be used on pointers divisible by 4. This constraint simplifies the microarchitecture; even on architectures that don’t mandate it, aligned loads and stores are typically far faster.

This category also includes instructions necessary for implementing atomics, such as lock cmpxchg on x86 and lr/sc on RISC-V. Atomics are fundamentally about changing the semantics of reading and writing from RAM, and thus require special processor support.

Some architectures, like x86, 65816, and very recently, ARM, provide instructions that implement memcpy and its ilk in hardware: in x86, for example, this is called rep movsb.

Control Flow Instructions

Control flow is the secret ingredient that turns our glorified calculator into a Turing tarpit: they allow changing the flow of program execution based on its current state.

Unconditional jumps implement goto: given some label, the j label instruction jumps directly to it. j can be thought of as writing to a special pc register that holds the program counter. RISC-V also provides a dynamic jump, jr, which will jump to the address in a register. Function calls and returns are a special kind of unconditional jump.

Conditional jumps, often called branches, implement if. beq a0, a1, label will jump to label if a0 and a1 contain the same value. RISC-V provides branch instructions for all kinds of comparisons, like bne, blt, and bge.

Conditional and unconditional jumps can be used together to build loops, much like we could in C using if and goto.

For example, to zero a region of memory:

        // Assume a0 is the start of the region, and a1 the
        // number of bytes to zero.

        // Set a1 to the end of the region.
        addi a1, a0, a1
        // If a0 == a1, we're done!
        beq a0, a1, loop_done

        // Store a zero byte to `a0` and advance the pointer.
        sb zero, 0(a0)
        addi a0, a0, 1

        // Take it from the top!
        j loop_start
RISC-V Assembly

Miscellaneous Instructions

“Everything else” is, well… everything else.

No-op instructions do nothing: nop’s only purpose is to take up space in the instruction stream. No-op instructions can be used to pad space in the instruction stream, provide space for the linker to fix things up later, or implement nop sleds.

Instructions for poking processor state, like csrrw in RISC-V and wrmsr in x86 also belong in this category, as do “hinting” instructions like memory prefetches.

There are also instructions for special control flow: ecall is RISC-V’s “syscall” instruction, which “traps” to the kernel for it to do something; other architectures have similar instructions.

Breakpoint instructions and “fence” instructions belong here, too.

The Calling Convention

Functions are the core abstraction of all of programming. Assembly is no different: we have functions there, too!

Like in any language, functions are passed a list of arguments, perform some work, and return a value. For example, in C:

int identity(int x) {
  return x;

// ...

identity(5)  // Returns 5.

Unfortunately, there isn’t anything like function call syntax in assembly. As with everything else, we need do it instruction by instruction. All we do get in most architectures is a call instruction, which sets up a return address somewhere, and a ret instruction, which uses the return address to jump to where the function was called.

We need some way to pass arguments, return a computed value, and maintain a call stack, so that each function’s return address is kept intact for its ret instruction to consume. We also need this to be universal: if I pull in a library, I should be able to call its functions.

This mechanism is called the calling convention of the platform’s ABI. It’s a convention, because all libraries must respect it in their exposed API for code to work correctly at runtime.

A Function Call in Slow-Mo

At the instruction level, function calls look something like this:

  1. Pre-call setup. The caller sets up the function call arguments by placing them in the appointed locations for arguments. These are usually either registers or locations on the stack. a. The caller also saves the caller-saved registers to the stack.

  2. Jump to the function. The caller executes a call instruction (or whatever the function call instruction might be called – virtually all architectures have one). This sets the program counter to the first instruction of the callee.

  3. Function prologue. The callee does some setup before executing its code. a. The callee reserves space on the stack in an architecture-dependent manner. b. The callee saves the callee-saved registers to this stack space.

  4. Function body. The actual code of the function runs now! This part of the function needs to make sure the return value winds up wherever the return slot for the function is.

  5. Function epilogue. The callee undoes whatever work it did in the prologue, such as restoring saved registers, and executes a ret (or equivalent) instruction to return.

  6. Post-call cleanup. The caller is now executing again; it can unspill any saved state that it needs immediately after the function call, and can retrieve the return value from the return slot.

    In some ABIs, such as C++’s on Linux, this is where the destructors of the arguments get run. (Rust, and C++ on Windows, have callee-destroyed arguments instead.)

When people say that function calls have overhead, this is what they mean. Not only does the call instruction cause the processor to slam the breaks on its pipeline, causing all kinds of work to get thrown away, but state needs to be delicately saved and restored across the function boundary to maintain the illusion of a callstack.

Small functions which don’t need to use as many registers can avoid some of the setup and cleanup, and leaf functions which don’t call any other functions can avoid basically all of it!

Almost all registers in RISC-V are caller-saved, except for ra and the “saved” registers s0 and s11.

Callee-saved registers are convenient, because they won’t be wiped out by function calls. We can actually see the call to printf use this: even though the compiler could have emitted mul a1, a0, a0 and avoided the mv, this is actually less optimal. We need to keep the value around to return, and a1 is caller-saved, so we would have had to spill a1 before calling printf, regardless of whether printf overwrites a1 or not. We would then have to unspill it into a0 before ret. This costs us a hit to RAM. However, by emitting mul s0, a0, a0; mv a1, s0, we speculatively avoid the spill: if printf is compiled such that it never touches s0, the value never leaves registers at all!


We can see steps 1 and 2 in the call to printf:

        lui     a0, %hi(.L.str)
        addi    a0, a0, %lo(.L.str)
        mv      a1, s0
        call    printf
RISC-V Assembly

Arguments in the usual5 RISC-V calling convention, word-sized arguments are passed in the a0 through a7 registers, falling back to passing on the stack if they run out of space. If the argument is too big to fit in a register, it gets passed by reference instead. Arguments that fit into two registers can be split across registers.

We can see this in action above. The first argument, a string, is passed by pointer in a0; lui and addi do the work of actually putting that pointer into a0. The second argument x is passed in a1, copied from s0 where it landed from the earlier mul instruction.

Complex function signatures require much more6 work to set up.

Once we’re done getting arguments into place, we call, which switches execution over to printf’s first instruction. In addition, it stores the return address, specifically, the address of the instruction immediately after the call, into an architecture-specific location. On RISC-V, this is the special register ra.


Meanwhile, steps 3 and 4 occur in square_and_print’s prologue/epilogue itself:

        addi    sp, sp, -16
        sw      ra, 12(sp)
        sw      s0, 8(sp)

        // ...

        lw      s0, 8(sp)
        lw      ra, 12(sp)
        addi    sp, sp, 16
RISC-V Assembly

addi sp, sp, -16, which we stared at so hard above, grows the stack by 16 bytes. sp holds the stack pointer, which points to the top of the stack at all times. The stack grows downwards (as in most architectures!) and must be aligned to 16-byte boundaries across function calls: even though square_and_print only uses eight of those bytes, the full 16 bytes must be allocated.

The two sw instructions that follow store (or “spill”) the callee-saved registers ra and s0 to the stack. Note that s1 through s11 are not spilled, since square_and_print doesn’t use them!

Th this point, the function does its thing, whatever that means. This includes putting the return value in the return slot, which, for a function that returns an int, is in a0. In general, the return slot is passed back to the caller much like arguments are: if it fits in registers, a0 and a1 are used; otherwise, the caller allocates space for it and passes a pointer to the return slot as a hidden argument (in e.g. a0)7.

The epilogue inverts all operations of the prologue in reverse, unspilling registers and shrinking the stack, followed by ret. On RISC-V, all ret does is jump to the location referred to by the ra register.

Of course, all this work is only necessary to maintain the illusion of a callstack; if square_and_print were a leaf function, it would not need to spill anything at all! This results in an almost trivial function:

int square(int x) {
  return x * x;
// `x` is already in a0, and the
// return value needs to wind up
// in a0. EZ!
        mul a0, a0, a0
RISC-V Assembly

Because leaf functions won’t call other functions, they won’t need to save the caller-saved tX registers, so they can freely use them instead of the sX registers.

The End, for Now

Phew! We’re around six thousand words in, so let’s checkpoint what we’ve learned:

  1. Computers are stupid, but can at least follow extremely basic instructions, which are encoded as binary.

  2. Assembly language is human-readable version of these basic instructions for a particular computer.

  3. Assembly language programs consist of instructions, labels, and directives.

  4. Each instruction is a mnemonic followed by zero or more operands.

  5. Registers hold values the machine is currently operating on.

  6. Instructions can be broadly categorized as arithmetic, memory, control flow, and “miscellaneous” (plus vector and float instructions, for another time).

  7. The calling convention describes the low-level interface of a general function, consisting of some pre-call setup, and a prologue and epilogue in each function.

That’s all for now. RISC-V is a powerful but reasonably simple ISA. Next time, we’ll dive into the much older, much larger, and much more complex Intel x86. ◼

  1. This is a hilarious lie that is beyond the scope of this post. See, for example,

  2. What’s a machine word, exactly? It really depends on context. Most popular architectures has a straight-forward definition: the size of a GPR or the size of a pointer, which are the same.

    This is not true of all architectures, so beware. 

  3. Thankfully, these can be polyfilled using the previous ubiquitous instructions. Hacker’s Delight contains all of the relevant algorithms, so I won’t reproduce them here. The division polyfills are particularly interesting. 

  4. It’s a bit interesting that we don’t write lw a1, a0[4] in imitation of array syntax. This specific corner of the notation is shockingly diverse across assemblers: in ARM, we write ldr r0, [r1, #offset]; in x86, mov rax, [rdx + offset], or movq offset(%rdx), %rax for AT&T-flavored assemblers (which is surprisingly similar to the RISC-V syntax!); in 6502, lda ($1234, X)

  5. The calling convention isn’t actually determined by the architecture in most cases; that’s why it’s called a convention. The convention on x86 actually differs on Windows and Linux, and is usually also language-dependent; C’s calling convention is usually documented, but C++, Rust, and Go invent their own to handle language-specific fussiness.

    Of course, if you’re writing assembly, you can do whatever you want (though the silicon may be optimized for a particular recommended calling convention).

    RISC-V defines a recommended calling convention for ELF-based targets:

  6. The following listing shows how all kinds of different arguments are passed. The output isn’t quite what Clang emits, since I’ve cleaned it up for clarity.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdnoreturn.h>
    struct Pair {
      uint32_t x, y;
    struct Triple {
      uint32_t x, y, z;
    struct Packed {
      uint8_t x, y, z;
    // `noreturn` obviates the
    // {pro,epi}logue in `call_it`.
    noreturn void all_the_args(
      uint32_t a0,
      uint64_t a1a2,
      struct Pair a3a4,
      struct Triple a5_by_ref,
      uint16_t a6,
      struct Packed a7,
      uint32_t on_the_stack,
      struct Triple stack_by_ref
    void call_it(void) {
      struct Pair u = {7, 9};
      struct Triple v = {11, 13, 15};
      struct Packed w = {14, 16, 18};
        42, -42,  u, v,
         5,   w, 21, v
      // Reserve stack space.
      addi    sp, sp, -48
      // Get `&call_it.v` into `a3`.
      lui     a3, %hi(call_it.v)
      addi    a3, a3, %lo(call_it.v)
      // Copy contents of `*a3`
      // into `a0...a2`.
      lw      a0, 0(a3)
      lw      a1, 4(a3)
      lw      a2, 8(a3)
      // Create two copies of `v`
      // on the stack to pass by
      // reference.
      // This is `a5_by_ref`.
      sw      a2, 40(sp)
      sw      a1, 36(sp)
      sw      a0, 32(sp)
      // This is `stack_by_ref`.
      sw      a2, 24(sp)
      sw      a2, 20(sp)
      sw      a0, 16(sp)
      // Load the argument regs.
      addi    a0, zero, 42
      addi    a1, zero, -42
      addi    a2, zero, -1
      addi    a3, zero, 7
      addi    a4, zero, 9
      // A pointer to `a5_by_ref`!
      addi    a5, sp, 32
      addi    a6, zero, 5
      // Note that `a7` is three
      // packed bytes!
      lui     a0, 289
      addi    a7, a0, 14
      // Store `21` on the top of
      // the stack (our "a8")
      addi    t0, zero, 21
      sw      t0, 0(sp)
      // Store a pointer to
      // `stack_by_ref` on the 
      // second spot from the
      // stack top (our "a9")
      addi    t0, sp, 16
      sw      t0, 4(sp)
      // Call it!
      call    all_the_args
      // The constant `{11, 13, 15}`.
      .word   11
      .word   13
      .word   15
    RISC-V Assembly

  7. LLVM occasionally does somewhat clueless things around this corner of some ABIs. Given

    typedef struct { char p[100]; } X;
    X make_big(int x) {
      return (X) {x};

    we get the following from Clang:

    // NOTE: Return slot passed in `a0`, `x` passed in `a1`.
            addi    sp, sp, -16
            sw      ra, 12(sp)
            sw      s0, 8(sp)
            sw      s1, 4(sp)
            mv      s0, a1
            mv      s1, a0
            addi    a0, a0, 1
            addi    a2, zero, 99
            mv      a1, zero
            call    memset
            sb      s0, 0(s1)
            lw      s1, 4(sp)
            lw      s0, 8(sp)
            lw      ra, 12(sp)
            addi    sp, sp, 16
    RISC-V Assembly

    Note that sb s0, 0(s1) stores the input value x into the first element of the big array after calling memset. If we move the store to before, we can avoid much silliness, including some unnecessary spills:

            addi    sp, sp, -16
            sw      ra, 12(sp)
            sb      a1, 0(a0)
            addi    a0, a0, 1
            mv      a1, zero
            addi    a2, zero, 99
            call    memset
            lw      ra, 12(sp)
            addi    sp, sp, 16
    RISC-V Assembly

Everything You Never Wanted To Know About Linker Script

Low level software usually has lots of .cc or .rs files. Even lower-level software, like your cryptography library, probably has .S containing assembly, my least favorite language for code review.

The lowest level software out there, firmware, kernels, and drivers, have one third file type to feed into the toolchain: an .ld file, a “linker script”. The linker script, provided to Clang as -Wl,-T,foo.ld1, is like a template for the final executable. It tells the linker how to organize code from the input objects. This permits extremely precise control over the toolchain’s output.

Very few people know how to write linker script; it’s a bit of an obscure skill. Unfortunately, I’m one of them, so I get called to do it on occasion. Hopefully, this post is a good enough summary of the linker script language that you, too, can build your own binary!

Everything in this post can be found in excruciating detail in GNU ld’s documentation; lld accepts basically the same syntax. There’s no spec, just what your linker happens to accept. I will, however, do my best to provide a more friendly introduction.

No prior knowledge of how toolchains work is necessary! Where possible, I’ve tried to provide historical context on the names of everything. Toolchains are, unfortunately, bound by half a century of tradition. Better to at least know why they’re called that.

Wait, an .S file?

On Windows, assembly files use the sensible .asm extension. POSIX we use the .s extension, or .S when we’d like Clang to run the C preprocessor on them (virtually all hand-written assembly is of the second kind).

I don’t actually have a historical citation2 for .s, other than that it came from the Unix tradition of obnoxiously terse names. If we are to believe that .o stands for “object”, and .a stands for “archive”, then .s must stand for “source”, up until the B compiler replaced them with .b files! See

A final bit of trivia: .C files are obviously different from .c files… they’re C++ files! (Seriously, try it.)

Note: This post is specifically about POSIX. I know basically nothing about MSVC and link.exe other than that they exist. The most I’ve done is helped people debug trivial __declspec issues.

I will also only be covering things specific to linking an executable; linking other outputs, like shared libraries, is beyond this post.

Seriously, What’s a linker?

A linker is but a small part of a toolchain, the low-level programmer’s toolbox: everything you need to go from source code to execution.

The crown jewel of any toolchain is the compiler. The LLVM toolchain, for example, includes Clang, a C/C++3 compiler. The compiler takes source code, such as .cc, and lowers it down to a .s file, an assembly file which textually describes machine code for a specific architecture (you can also write them yourself).

Another toolchain program, the assembler, assembles each .s into a .o file, an object file4. An assembly file is merely a textual representation of an object file; assemblers are not particularly interesting programs.

A third program, the linker, links all of your object files into a final executable or binary, traditionally given the name a.out5.

This three (or two, if you do compile/assemble in one step) phase process is sometimes called the C compilation model. All modern software build infrastructure is built around this model6.

Even More Stages!

Clang, being based on LLVM, actually exposes one stage in between the .cc file and the .s file. You can ask it to skip doing codegen and emit a .ll file filled with LLVM IR, an intermediate between human-writable source code and assembly. The magic words to get this file are clang -S -emit-llvm. (The Rust equivalent is rustc --emit=llvm-ir.)

The LLVM toolchain provides llc, the LLVM compiler, which performs the .ll -> .s step (optionally assembling it, too). lli is an interpreter for the IR. Studying IR is mostly useful for understanding optimization behavior; topic for another day.

The compiler, assembler, and linker are the central components of a toolchain. Other languages, like Rust, usually provide their own toolchain, or just a compiler, reusing the existing C/C++ toolchain. The assembler and linker are language agnostic.

The toolchain also provides various debugging tools, including an interactive debugger, and tools for manipulating object files, such as nm, objdump, objcopy, and ar.

These days, most of this stuff is bundled into a single program, the compiler frontend, which knows how to compiler, assemble, and link, in one invocation. You can ask Clang to spit out .o files with clang -c, and .s files with clang -S.

Trs Nms

The UNIX crowd at Bell Labs was very excited about short, terse names. This tradition survives in Go’s somewhat questionable practice of single-letter variables.

Most toolchain program names are cute contractions. cc is “C compiler”; compilers for almost all other languages follow this convention, like rustc, javac, protoc, and scalac; Clang is just clang, but is perfectly ok being called as cc.

as is “assembler”; ld is “loader” (you’ll learn why sooner). ar is “archiver”, nm is “names”. Other names tend to be a bit more sensible.

Some fifty years ago at Bell Labs, someone really wanted to write a program with more than one .s file. To solve this, a program that could “link” symbol references across object files was written: the first linker.

You can take several .o files and use ar (an archaic tar, basically) to create a library, which always have names like libfoo.a (the lib is mandatory). A static library is just a collection of objects, which can be provided on an as-needed basis to the linker.

The “final link” incorporates several .o files and .a files to produce an executable. It does roughly the following:

  1. Parse all the objects and static libraries and put their symbols into a database. Symbols are named addresses of functions and global variables.
  2. Search for all unresolved symbol references in the .o files and match it up with a symbol from the database, recursively doing this for any code in a .a referenced during this process. This forms a sort of dependency graph between sections. This step is called symbol resolution.
  3. Throw out any code that isn’t referenced by the input files by tracing the dependency graph from the entry-point symbol (e.g., _start on Linux). This step is called garbage collection.
  4. Execute the linker script to figure out how to stitch the final binary together. This includes discovering the offsets at which everything will go.
  5. Resolve relocations, “holes” in the binary that require knowing the final runtime address of the section. Relocations are instructions placed in the object file for the linker to execute.
  6. Write out the completed binary.

This process is extremely memory-intensive; it is possible for colossal binaries, especially ones with tons of debug information, to “fail to link” because the linker exhausts the system’s memory.

We only care about step 4; whole books can be written about the previous steps. Thankfully, Ian Lance Taylor, mad linker scientist and author of gold, has written several excellent words on this topic:

Object Files and Sections

Linkers, fundamentally, consume object files and produce object files; the output is executable, meaning that all relocations have been resolved and an entry-point address (where the OS/bootloader will jump to to start the binary).

It’s useful to be able to peek into object files. The objdump utility is best for this. objdump -x my_object.o will show all headers, telling you what exactly is in it.

At a high level, an object file describes how a program should be loaded into memory. The object is divided into sections, which are named blocks of data. Sections may have file-like permissions, such as allocatable, loadable, readonly, and executable. objdump -h can be used to show the list of sections. Some selected lines of output from objdump on my machine (I’m on a 64-bit machine, but I’ve trimmed leading zeros to make it all fit):

$ objdump -h "$(which clang)"
/usr/bin/clang:     file format elf64-x86-64

Idx Name    Size      VMA       LMA       File off  Algn
 11 .init   00000017  00691ab8  00691ab8  00291ab8  2**2
 12 .plt    00006bb0  00691ad0  00691ad0  00291ad0  2**4
 13 .text   0165e861  00698680  00698680  00298680  2**4
 14 .fini   00000009  01cf6ee4  01cf6ee4  018f6ee4  2**2
 15 .rodata 0018ec68  01cf6ef0  01cf6ef0  018f6ef0  2**4
 24 .data   000024e8  021cd5d0  021cd5d0  01dcc5d0  2**4
 26 .bss    00009d21  021cfac0  021cfac0  01dceab8  2**4

Allocateable (ALLOC) sections must be allocated space by the operating system; if the section is loadable (LOAD), then the operating system must further fill that space with the contents of the section. This process is called loading and is performed by a loader program7. The loader is sometimes called the “dynamic linker”, and is often the same program as the “program linker”; this is why the linker is called ld.

Loading can also be done beforehand using the binary output format. This is useful for tiny microcontrollers that are too primitive to perform any loading. objcopy is useful for this and many other tasks that involve transforming object files.

Some common (POSIX) sections include:

  • .text, where your code lives8. It’s usually a loadable, readonly, executable section.
  • .data contains the initial values of global variables. It’s loadable.
  • .rodata contains constants. It’s loadable and readonly.
  • .bss is an empty allocatable section9. C specifies that uninitialized globals default to zero; this is a convenient way for avoiding storing a huge block of zeros in the executable!
  • Debug sections that are not loaded or allocated; these are usually removed for release builds.

After the linker decides which sections from the .o and .a inputs to keep (based on which symbols it decided it needed), it looks to the linker script how to arrange them in the output.

Let’s write our first linker script!

  /* Define an output section ".text". */
  .text : {
    /* Pull in all symbols in input sections named .text */
    /* Do the same for sections starting with .text.,
       such as */

  /* Do the same for ".bss", ".rodata", and ".data". */
  .bss : { *(.bss); *(.bss.*) }
  .data : { *(.data); *(.data.*) }
  .rodata : { *(.rodata); *(.rodata.*) }
Linker Script

This tells the linker to create a .text section in the output, which contains all sections named .text from all inputs, plus all sections with names like The content of the section is laid out in order: the contents of all .text sections will come before any .text.* sections; I don’t think the linker makes any promises about the ordering between different objects10.

As I mentioned before, parsers for linker script are fussy11: the space in .text : is significant.

Note that the two .text sections are different, and can have different names! The linker generally doesn’t care what a section is named; just its attributes. We could name it code if we wanted to; even the leading period is mere convention. Some object file formats don’t support arbitrary sections; all the sane ones (ELF, COFF, Mach-O) don’t care, but they don’t all spell it the same way; in Mach-O, you call it __text.

Before continuing, I recommend looking at the appendix so that you have a clear path towards being able to run and test your linker scripts!

Input Section Syntax

None of this syntax is used in practice but it’s useful to contextualize the syntax for pulling in a section. The full form of the syntax is

archive:object(section1 section2 ...)

Naturally, all of this is optional, so you can write foo.o or libbar.a:(.text) or :baz.o(.text .data), where the last one means “not part of a library”. There’s even an EXCLUDE_FILE syntax for filtering by source object, and a INPUT_SECTION_FLAGS syntax for filtering by the presence of format-specific flags.

Do not use any of this. Just write *(.text) and don’t think about it too hard. The * is just a glob for all objects.

Each section has an alignment, which is just the maximum of the alignments of all input sections pulled into it. This is important for ensuring that code and globals are aligned the way the architecture expects them to be. The alignment of a section can be set explicitly with

  .super_aligned : ALIGN(16) {
    /* ... */
Linker Script

You can also instruct the linker to toss out sections using the special /DISCARD/ output section, which overrides any decisions made at garbage-collection time. I’ve only ever used this to discard debug information that GCC was really excited about keeping around.

On the other hand, you can use KEEP(*(.text.*)) to ensure no .text sections are discarded by garbage-collection. Unfortunately, this doesn’t let you pull in sections from static libraries that weren’t referenced in the input objects.


Every section has three addresses associated with it. The simplest is the file offset: how far from the start of the file to find the section.

The virtual memory address, or VMA, is where the program expects to find the section at runtime. This is the address that is used by pointers and the program counter.

The load memory address, or LMA, is where the loader (be it a runtime loader or objcpy) must place the code. This is almost always the same as the VMA. Later on, in Using Symbols and LMAs, I’ll explain a place where this is actually useful.

When declaring a new section, the VMA and LMA are both set to the value12 of the location counter, which has the extremely descriptive name .13. This counter is automatically incremented as data is copied from the input

We can explicitly specify the VMA of a section by putting an expression before the colon, and the LMA by putting an expression in the AT(lma) specifier after the colon:

  .text 0x10008000: AT(0x40008000) {
    /* ... */
Linker Script

This will modify the location counter; you could also write it as

  . = 0x10008000;
  .text : AT(0x40008000) {
    /* ... */
Linker Script

Within SECTIONS, the location counter can be set at any point, even while in the middle of declaring a section (though the linker will probably complain if you do something rude like move it backwards).

The location counter is incremented automatically as sections are added, so it’s rarely necessary to fuss with it directly.

Memory Regions and Section Allocation

By default, the linker will simply allocate sections starting at address 0. The MEMORY statement can be used to define memory regions for more finely controlling how VMAs and LMAs are allocated without writing them down explicitly.

A classic example of a MEMORY block separates the address space into ROM and RAM:

  rom (rx)   : ORIGIN = 0x8000,     LENGTH = 16K
  ram (rw!x) : ORIGIN = 0x10000000, LENGTH = 256M
Linker Script

A region is a block of memory with a name and some attributes. The name is irrelevant beyond the scope of the linker script. The attributes in parens are used to specify what sections could conceivably go in that region. A section is compatible if it has any of the attributes before the !, and none which come after the !. (This filter mini-language isn’t very expressive.)

The attributes are the ones we mentioned earlier: rwxal are readonly, read/write, executable, allocated, and loadable14.

When allocating a section a VMA, the linker will try to pick the best memory region that matches the filter using a heuristic. I don’t really trust the heuristic, but you can instead write > region to put something into a specific region. Thus,

  .data {
    /* ... */
  } > ram AT> rom
Linker Script

AT> is the “obvious” of AT() and >, and sets which region to allocate the LMA from.

The origin and length of a region can be obtained with the ORIGIN(region) and LENGTH(region) functions.

Other Stuff to Put In Sections

Output sections can hold more than just input sections. Arbitrary data can be placed into sections using the BYTE, SHORT, LONG and QUAD for placing literal 8, 16, 32, and 64-bit unsigned integers into the section:

  .screams_internally : { LONG(0xaaaaaaaa) }
Linker Script

Numeric literals in linker script may, conveniently, be given the suffixes K or M to specify a kilobyte or megabyte quantity. E.g., 4K is sugar for 4096.


You can fill the unused portions of a section by using the FILL command, which sets the “fill pattern” from that point onward. For example, we can create four kilobytes of 0xaa using FILL and the location counter:

  .scream_page : {
    . += 4K;
Linker Script

The “fill pattern” is used to fill any unspecified space, such as alignment padding or jumping around with the location counter. We can use multiple FILLs to vary the fill pattern, such as if we wanted half the page to be 0x0a and half 0xa0:

  .scream_page : {
    . += 2K;
    . += 2K;
Linker Script

When using one fill pattern for the whole section, you can just write = fill; at the end of the section. For example,

  .scream_page : {
    . += 4K;
  } = 0xaa;
Linker Script

Linker Symbols

Although the linker needs to resolve all symbols using the input .o and .a files, you can also declare symbols directly in linker script; this is the absolute latest that symbols can be provided. For example:

  my_cool_symbol = 5;
Linker Script

This will define a new symbol with value 5. If we then wrote extern char my_cool_symbol;, we can access the value placed by the linker. However, note that the value of a symbol is an address! If you did

extern char my_cool_symbol;

uintptr_t get() {
  return my_cool_symbol;

the processor would be very confused about why you just dereferenced a pointer with address 5. The correct way to extract a linker symbol’s value is to write

extern char my_cool_symbol;

uintptr_t get() {
  return (uintptr_t)&my_cool_symbol;

It seems a bit silly to take the address of the global and use that as some kind of magic value, but that’s just how it works. The exact same mechanism works in Rust, too:

fn get() -> usize {
  extern "C" {
    #[link_name = "my_cool_symbol"]
    static SYM: u8;

  addr_of!(SYM) as usize

The most common use of this mechanism is percolating information not known until link time. For example, a common idiom is

  .text : {
    __text_start = .;
    /* stuff */
    __text_end = .;
Linker Script

This allows initialization code to find the section’s address and length; in this case, the pointer values are actually meaningful!


It’s common practice to lead linker symbols with two underscores, because C declares a surprisingly large class of symbols reserved for the implementation, so normal user code won’t call them. These include names like __text_start, which start with two underscores, and names starting with an underscore and an uppercase letter, like _Atomic.

However, libc and STL headers will totally use the double underscore symbols to make them resistant to tampering by users (which they are entitled to), so beware!

Symbol assignments can even go inside of a section, to capture the location counter’s value between input sections:

  .text : {
    text_middle = .;
Linker Script

Symbol names are not limited to C identifiers, and may contain dashes, periods, dollar signs, and other symbols. They may even be quoted, like "this symbol has spaces", which C will never be able to access as an extern.

There is a mini-language of expressions that symbols can be assigned to. This includes:

  • Numeric literals like 42, 0xaa, and 4K.
  • The location counter, ..
  • Other symbols.
  • The usual set of C operators, such as arithmetic and bit operations. Xor is curiously missing.
  • A handful of builtin functions, described below.

There are some fairly complicated rules around how symbols may be given relative addresses to the start of a section, which are only relevant when dealing with position-independent code:

Functions belong to one of two board categories: getters for properties of sections, memory regions, and other linker structures; and arithmetic. Useful functions include:

  • ADDR, LOADADDR, SIZEOF, and ALIGNOF, which produce the VMA, LMA, size, and alignment of a previously defined section.
  • ORIGIN and LENGTH, which produce the start address and length of a memory region.
  • MAX, MIN are obvious; LOG2CEIL computes the base-2 log, rounded up.
  • ALIGN(expr, align) rounds expr to the next multiple of align. ALIGN(align) is roughly equivalent to ALIGN(., align) with some subtleties around PIC. . = ALIGN(align); will align the location counter to align.

Some other builtins can be found at

A symbol definition can be wrapped in the PROVIDEO() function to make it “weak”, analogous to the “weak symbol” feature found in Clang. This means that the linker will not use the definition if any input object defines it.

Using Symbols and LMAs

As mentioned before, it is extremely rare for the LMA and VMA to be different. The most common situation where this occurs is when you’re running on a system, like a microcontroller, where memory is partitioned into two pieces: ROM and RAM. The ROM has the executable burned into it, and RAM starts out full of random garbage.

Most of the contents of the linked executable are read-only, so their VMA can be in ROM. However, the .data and .bss sections need to lie in RAM, because they’re writable. For .bss this is easy, because it doesn’t have loadable content. For .data, though, we need to separate the VMA and LMA: the VMA must go in RAM, and the LMA in ROM.

This distinction is important for the code that initializes the RAM: while for .bss all it has to do is zero it, for .data, it has to copy from ROM to RAM! The LMA lets us distinguish the copy source and the copy destination.

This has the important property that it tells the loader (usually objcopy in this case) to use the ROM addresses for actually loading the section to, but to link the code as if it were at a RAM address (which is needed for things like PC-relative loads to work correctly).

Here’s how we’d do it in linker script:

  rom : /* ... */
  ram : /* ... */

  /* .text and .rodata just go straight into the ROM. We don't need
     to mutate them ever. */
  .text : { *(.text) } > rom
  .rodata : { *(.rodata) } > rom

  /* .bss doesn't have any "loadable" content, so it goes straight
     into RAM. We could include `AT> rom`, but because the sections
     have no content, it doesn't matter. */
  .bss : { *(.bss) } > ram

  /* As described above, we need to get a RAM VMA but a ROM LMA;
     the > and AT> operators achieve this. */
  .data : { *(.data) } > ram AT> rom

/* The initialization code will need some symbols to know how to
   zero the .bss and copy the initial .data values. We can use the
   functions from the previous section for this! */

bss_start = ADDR(.bss);
bss_end = bss_start + SIZEOF(.bss);

data_start = ADDR(.data);
data_end = data_start + SIZEOF(.data);

rom_data_start = LOADADDR(.data);
Linker Script

Although we would normally write the initialization code in assembly (since it’s undefined behavior to execute C before initializing the .bss and .data sections), I’ve written it in C for illustrative purposes:

#include <string.h>

extern char bss_start[];
extern char bss_end[];
extern char data_start[];
extern char data_end[];
extern char rom_data_start[];

void init_sections(void) {
  // Zero the .bss.
  memset(bss_start, 0, bss_end - bss_start);

  // Copy the .data values from ROM to RAM.
  memcpy(data_start, rom_data_start, data_end - data_start);

Misc Linker Script Features

Linker script includes a bunch of other commands that don’t fit into a specific category:

  • ENTRY() sets the program entry-point, either as a symbol or a raw address. The -e flag can be used to override it. The ld docs assert that there are fallbacks if an entry-point can’t be found, but in my experience you can sometimes get errors here. ENTRY(_start) would use the _start symbol, for example15.
  • INCLUDE "path/to/file.ld" is #include but for linker script.
  • INPUT(foo.o) will add foo.o as a linker input, as if it was passed at the commandline. GROUP is similar, but with the semantics of --start-group.
  • OUTPUT() overrides the usual a.out default output name.
  • ASSERT() provides static assertions.
  • EXTERN(sym) causes the linker to behave as if an undefined reference to sym existed in an input object.

(Other commands are documented, but I’ve never needed them in practice.)

Real Linker Scripts

It may be useful to look at some real-life linker scripts.

If you wanna see what Clang, Rust, and the like all ultimately use, run ld --verbose. This will print the default linker script for your machine; this is a really intense script that uses basically every feature available in linker script (and, since it’s GNU, is very poorly formatted).

The Linux kernel also has linker scripts, which are differently intense, because they use the C preprocessor. For example, the one for amd64:

Tock OS, a secure operating system written in Rust, has some pretty solid linker scripts, with lots of comments: I recommend taking a look to see what a “real” but not too wild linker script looks like. There’s a fair bit of toolchain-specific stuff in there, too, that should give you an idea of what to expect.

Happy linking! ◼

Appendix: A Linker Playground

tl;dr: If you don’t wanna try out any examples, skip this section.

I want you to be able to try out the examples above, but there’s no Godbolt for linker scripts (yet!). Unlike normal code, you can’t just run linker script through a compiler, you’re gonna need some objects to link, too! Let’s set up a very small C project for testing your linker scripts.

Note: I’m assuming you’re on Linux, with x86_64, and using Clang. If you’re on a Mac (even M1), you can probably make ld64 do the right thing, but this is outside of what I’m an expert on.

If you’re on Windows, use WSL. I have no idea how MSCV does linker scripts at all.

First, we want a very simple static library:

int lib_call(const char* str) {
  // Discard `str`, we just want to take any argument.

  // This will go in `.bss`.
  static int count;
  return count++;

Compile extern.c into a static library like so:

clang -c extern.c
ar rc libextern.a extern.o

We can check out that we got something reasonable by using nm. The nm program shows you all the symbols a library or object defines.

$ nm libextern.a
0000000000000000 T lib_call
0000000000000000 b lib_call.count

This shows us the address, section type, and name of each symbol; man nm tells us that T means .text and b means .bss. Capital letters mean that the symbol is exported, so the linker can use it to resolve a symbol reference or a relocation. In C/C++, symbols declared static or in an unnamed namespace are “hidden”, and can’t be referenced outside of the object. This is sometimes called internal vs external linkage.

Next, we need a C program that uses the library:

extern int lib_call(const char* str);

// We're gonna use a custom entrypoint. This code will never run anyways, we
// just care about the linker output.
void run(void) {
  // This will go in `.data`, because it's initialized to non-zero.
  static int data = 5;

  // The string-constant will go into `.rodata`.
  data = lib_call("Hello from .rodata!");

Compile it with clang -c run.c. We can inspect the symbol table with nm as before:

$ nm run.o
                 U lib_call
0000000000000000 T run
0000000000000000 d

As you might guess, d is just .data. However, U is interesting: it’s an undefined symbol, meaning the linker will need to perform a symbol resolution! In fact, if we ask Clang to link this for us (it just shells out to a linker like ld):

$ clang run.o
/usr/bin/ld: /somewhere/crt1.o: in function `_start':
(.text+0x20): undefined reference to `main'
/usr/bin/ld: run.o: in function `run':
run.c:(.text+0xf): undefined reference to `lib_call'

The linker also complains that there’s no main() function, and that some object we didn’t provide called crt1.o wants it. This is the startup code for the C runtime; we can skip linking it with -nostartfiles. This will result in the linker picking an entry point for us.

We can resolve the missing symbol by linking against our library. -lfoo says to search for the library libfoo.a; -L. says to include the current directory for searching for libraries.

clang run.o -L. -lextern -nostartfiles

This gives us our binary, a.out, which we can now objdump:

$ objdump -d -Mintel a.out

a.out:     file format elf64-x86-64

Disassembly of section .text:

0000000000401000 <run>:
  401000:  55                      push   rbp
  401001:  48 89 e5                mov    rbp,rsp
  401004:  48 bf 00 20 40 00 00    movabs rdi,0x402000
  40100b:  00 00 00 
  40100e:  e8 0d 00 00 00          call   401020 <lib_call>
  401013:  89 04 25 00 40 40 00    mov    DWORD PTR ds:0x404000,eax
  40101a:  5d                      pop    rbp
  40101b:  c3                      ret    
  40101c:  0f 1f 40 00             nop    DWORD PTR [rax+0x0]

0000000000401020 <lib_call>:
  401020:  55                      push   rbp
  401021:  48 89 e5                mov    rbp,rsp
  401024:  48 89 7d f8             mov    QWORD PTR [rbp-0x8],rdi
  401028:  8b 04 25 04 40 40 00    mov    eax,DWORD PTR ds:0x404004
  40102f:  89 c1                   mov    ecx,eax
  401031:  83 c1 01                add    ecx,0x1
  401034:  89 0c 25 04 40 40 00    mov    DWORD PTR ds:0x404004,ecx
  40103b:  5d                      pop    rbp
  40103c:  c3                      ret    

Let’s write up the simplest possible linker script for all this:

  .text : { *(.text); *(.text.*) }
  .bss : { *(.bss); *(.bss.*) }
  .data : { *(.data); *(.data.*) }
  .rodata : { *(.rodata); *(.rodata.*) }

Let’s link! We’ll also want to make sure that the system libc doesn’t get in the way, using -nostdlib16.

clang run.o -L. -lextern -nostartfiles -nostdlib -Wl,-T,link.ld

At this point, you can use objdump to inspect a.out at your leisure! You’ll notice there are a few other sections, like .eh_frame. Clang adds these by default, but you can throw them out using /DISCARD/.

It’s worth it to run the examples in the post through the linker using this “playground”. You can actually control the sections Clang puts symbols into using the __attribute__((section("blah"))) compiler extension. The Rust equivalent is #[link_section = "blah"].

  1. Blame GCC for this. -Wl feeds arguments through to the linker, and -T is ld’s linker script input flag. Thankfully, rustc is far more sensible here: -Clink-args=-Wl,-T,foo.ld (when GCC/Clang is your linker frontend). 

  2. Correction, 2022-09-11. I have really been bothered by not knowing if this is actually true, and have periodically asked around about it. I asked Russ Cox, who was actually at Bell Labs back in the day, and he asked Ken Thompson, who confirms: it’s genuinely .s for source, because it was the only source they had back then.

    I am glad I got this from the horse’s mouth. :) 

  3. And many other things, like Objective-C. 

  4. Completely and utterly unrelated to the objects of object-oriented programming. Best I can tell, the etymology is lost to time. 

  5. a.out is also an object file format, like ELF, but toolchains live and die by tradition, so that’s the name given to the linker’s output by default. 

  6. Rust does not compile each .rs file into an object, and its “crates” are much larger than the average C++ translation unit. However, the Rust compiler will nonetheless produce many object files for a single crate, precisely for the benefit of this compilation model. 

  7. Operating systems are loaded by a bootloader. Bootloaders are themselves loaded by other bootloaders, such as the BIOS. At the bottom of the turtles is the mask ROM, which is a tiny bootloader permanently burned into the device. 

  8. No idea on the etymology. This isn’t ASCII text! 

  9. Back in the 50s, this stood for “block started by symbol”. 

  10. Yes, yes, you can write SORT_BY_NAME(*)(.text) but that’s not really something you ever wind up needing.

    See for more information on this. 

  11. You only get /* */ comment syntax because that’s the lowest common denominator. 

  12. Well, . actually gets increased to the alignment of the section first. If you insist on an unaligned section, the syntax is, obviously,

      .unaligned .: {
        /* ... */
    Linker Script

    (That was sarcasm. It must be stressed that this is not a friendly language.) 

  13. This symbol is also available in assembly files. jmp . is an overly-cute idiom for an infinity busy loop. It is even more terse in ARM and RISC-V, where it’s written b . and j ., respectively.

    Personally, I prefer the obtuse clarity of loop_forever: j loop_forever

  14. These are the same characters used to declare a section in assembly. If I wanted to place my code in a section named .crt0 but wanted it to be placed into a readonly, executable memory block, use the the assembler directive .section .crt0, rxal 

  15. Note that the entry point is almost never a function called main(). In the default configuration of most toolchains, an object called crt0.o is provided as part of the libc, which provides a _start() function that itself calls main(). CRT stands for “C runtime”; thus, crt0.o initializes the C runtime.

    This file contains the moral equivalent of the following C code, which varies according to target:

    extern int main(int argc, char** argv);
    noreturn void _start() {
      init_libc();    // Initializes global libc state.
      run_ctors();    // Runs all library constructors.
      int ret = main(get_argc(), get_argv());
      run_dtors();    // Runs all library destructors.
      cleanup_libc(); // Deinitializes the libc.
      exit(ret); // Asks the OS to gracefully destroy the process.

    This behavior can be disabled with -nostartfiles in Clang. The OSDev wiki has some on this topic:

  16. If you include libc, you will get bizarre errors involving something called “gcc_s”. libgcc (and libgcc_s) is GCC’s compiler runtime library. Where libc exposes high-level operations on the C runtime and utilities for manipulating common objects, libgcc provides even lower-level support, including:

    • Polyfills for arithmetic operations not available on the target. For example, dividing two 64-bit integers on most 32-bit targets will emit a reference to the a symbol like __udivmoddi4 (they all have utterly incomprehensible names like this one).
    • Soft-float implementations, i.e., IEEE floats implemented in software for targets without an FPU.
    • Bits of unwinding (e.g. exceptions and panics) support (the rest is in libunwind).
    • Miscellaneous runtime support code, such as the code that calls C++ static initializers.

    Clang’s version, libcompiler-rt, is ABI-compatible with libgcc and provides various support for profiling, sanitizers, and many, many other things the compiler needs available for compiling code.