Designing a SIMD Algorithm from Scratch

Another explainer on a fun, esoteric topic: optimizing code with SIMD (single instruction multiple data, also sometimes called vectorization). Designing a good, fast, portable SIMD algorithm is not a simple matter and requires thinking a little bit like a circuit designer.

Here’s the mandatory performance benchmark graph to catch your eye.

perf perf perf

“SIMD” often gets thrown around as a buzzword by performance and HPC (high performance computing) nerds, but I don’t think it’s a topic that has very friendly introductions out there, for a lot of reasons.

  • It’s not something you will really want to care about unless you think performance is cool.
  • APIs for programming with SIMD in most programming languages are garbage (I’ll get into why).
  • SIMD algorithms are hard to think about if you’re very procedural-programming-brained. A functional programming mindset can help a lot.

This post is mostly about vb64 (which stands for vector base64), a base64 codec I wrote to see for myself if Rust’s std::simd library is any good, but it’s also an excuse to talk about SIMD in general.

What is SIMD, anyways? Let’s dive in.

If you want to skip straight to the writeup on vb64, click here.

Problems with Physics

Unfortunately, computers exist in the real world[citation-needed], and are bound by the laws of nature. SIMD has relatively little to do with theoretical CS considerations, and everything to do with physics.

In the infancy of modern computing, you could simply improve performance of existing programs by buying new computers. This is often incorrectly attributed to Moore’s law (the number of transistors on IC designs doubles every two years). Moore’s law still appears to hold as of 2023, but some time in the last 15 years the Dennard scaling effect broke down. This means that denser transistors eventually mean increased power dissipation density. In simpler terms, we don’t know how to continue to increase the clock frequency of computers without literally liquefying them.

So, since the early aughts, the hot new thing has been bigger core counts. Make your program more multi-threaded and it will run faster on bigger CPUs. This comes with synchronization overhead, since now the cores need to cooperate. All control flow, be it jumps, virtual calls, or synchronization will result in “stall”.

The main causes of stall are branches, instructions that indicate code can take one of two possible paths (like an if statement), and memory operations. Branches include all control flow: if statements, loops, function calls, function returns, even switch statements in C. Memory operations are loads and stores, especially ones that are cache-unfriendly.

Procedural Code Is Slow

Modern compute cores do not execute code line-by-line, because that would be very inefficient. Suppose I have this program:

let a = x + y;
let b = x ^ y;
println!("{a}, {b}");

There’s no reason for the CPU to wait to finish computing a before it begins computing b; it does not depend on a, and while the add is being executed, the xor circuits are idle. Computers say “program order be damned” and issue the add for a and the xor for b simultaneously. This is called instruction-level parallelism, and dependencies that get in the way of it are often called data hazards.

Of course, the Zen 2 in the machine I’m writing this with does not have one measly adder per core. It has dozens and dozens! The opportunities for parallelism are massive, as long as the compiler in your CPU’s execution pipeline can clear any data hazards in the way.

The better the core can do this, the more it can saturate all of the “functional units” for things like arithmetic, and the more numbers it can crunch per unit time, approaching maximum utilization of the hardware. Whenever the compiler can’t do this, the execution pipeline stalls and your code is slower.
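
To make the data-hazard idea concrete, here is a minimal sketch (the function names are made up for illustration). The first sum has one long dependency chain, because every add needs the previous result. The second keeps four independent accumulators, so in principle four adds can be in flight at once. Whether this actually runs faster depends on your compiler and CPU; the sketch only shows the shape of the transformation.

```rust
/// One accumulator: each add depends on the result of the previous add.
fn sum_serial(data: &[u32]) -> u32 {
    let mut acc = 0u32;
    for &x in data {
        acc = acc.wrapping_add(x);
    }
    acc
}

/// Four accumulators: the four adds in each iteration do not depend on
/// each other, so there is no data hazard between them.
fn sum_four_chains(data: &[u32]) -> u32 {
    let mut acc = [0u32; 4];
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        for lane in 0..4 {
            acc[lane] = acc[lane].wrapping_add(chunk[lane]);
        }
    }
    // Combine the partial sums, then handle the leftover elements.
    let mut total = acc[0]
        .wrapping_add(acc[1])
        .wrapping_add(acc[2])
        .wrapping_add(acc[3]);
    for &x in chunks.remainder() {
        total = total.wrapping_add(x);
    }
    total
}
```

Both functions compute the same wrapping sum; only the dependency structure differs.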

Branches stall because they need to wait for the branch condition to be computed before fetching the next instruction (speculative execution is a somewhat iffy workaround for this). Memory operations stall because the data needs to physically arrive at the CPU, and the speed of light is finite in this universe.

Trying to reduce stall by improving opportunities for single-core parallelism is not a new idea. Consider the not-so-humble GPU, whose purpose in life is to render images. Images are vectors of pixels (i.e., color values), and rendering operations tend to be highly local. For example, a convolution kernel for a Gaussian blur will be two or even three orders of magnitude smaller than the final image, lending itself to locality.

Thus, GPUs are built for divide-and-conquer: they provide primitives for doing batched operations, and extremely limited control flow.

“SIMD” is synonymous with “batching”. It stands for “single instruction, multiple data”: a single instruction dispatches parallel operations on multiple lanes of data. GPUs are the original SIMD machines.

Lane-Wise

“SIMD” and “vector” are often used interchangeably. The fundamental unit a SIMD instruction (or “vector instruction”) operates on is a vector: a fixed-size array of numbers that you primarily operate on component-wise. These components are called lanes.

SIMD vectors are usually quite small, since they need to fit into registers. For example, on my machine, the largest vectors are 256 bits wide. This is enough for 32 bytes (a u8x32), 4 double-precision floats (an f64x4), or all kinds of things in between.

some 256-bit vectors

Although this doesn’t seem like much, remember that offloading the overhead of keeping the pipeline saturated by a factor of 4x can translate to that big of a speedup in latency.

One-Bit Lanes

The simplest vector operations are bitwise: and, or, xor. Ordinary integers can be thought of as vectors themselves, with respect to the bitwise operations. That’s literally what “bitwise” means: lanes-wise with lanes that are one bit wide. An i32 is, in this regard, an i1x32.

In fact, as a warmup, let’s look at the problem of counting the number of 1 bits in an integer. This operation is called “population count”, or popcnt. If we view an i32 as an i1x32, popcnt is just a fold or reduce operation:

pub fn popcnt(x: u32) -> u32 {
  let mut bits = [0; 32];
  for (i, bit) in bits.iter_mut().enumerate() {
    *bit = (x >> i) & 1;
  }
  bits.into_iter().fold(0, |total, bit| total + bit)
}

In other words, we interpret the integer as an array of bits and then add the bits together to a 32-bit accumulator. Note that the accumulator needs to be higher precision to avoid overflow: accumulating into an i1 (as with the Iterator::reduce() method) will only tell us whether the number of 1 bits is even or odd.

Of course, this produces… comically bad code, frankly. We can do much better if we notice that we can vectorize the addition: first we add all of the adjacent pairs of bits together, then the pairs of pairs, and so on. This means the number of adds is logarithmic in the number of bits in the integer.

Visually, what we do is we “unzip” each vector, shift one to line up the lanes, add them, and then repeat with lanes twice as big.

first two popcnt merge steps

This is what that looks like in code.

pub fn popcnt(mut x: u32) -> u32 {
  // View x as a i1x32, and split it into two vectors
  // that contain the even and odd bits, respectively.
  let even = x & 0x55555555; // 0x5 == 0b0101.
  let odds = x & 0xaaaaaaaa; // 0xa == 0b1010.
  // Shift odds down to align the bits, and then add them together.
  // We interpret x now as a i2x16. When adding, each two-bit
  // lane cannot overflow, because the value in each lane is
  // either 0b00 or 0b01.
  x = even + (odds >> 1);

  // Repeat again but now splitting even and odd bit-pairs.
  let even = x & 0x33333333; // 0x3 == 0b0011.
  let odds = x & 0xcccccccc; // 0xc == 0b1100.
  // We need to shift by 2 to align, and now for this addition
  // we interpret x as a i4x8.
  x = even + (odds >> 2);

  // Again. The pattern should now be obvious.
  let even = x & 0x0f0f0f0f; // 0x0f == 0b00001111.
  let odds = x & 0xf0f0f0f0; // 0xf0 == 0b11110000.
  x = even + (odds >> 4); // i8x4

  let even = x & 0x00ff00ff;
  let odds = x & 0xff00ff00;
  x = even + (odds >> 8);  // i16x2

  let even = x & 0x0000ffff;
  let odds = x & 0xffff0000;
  // Because the value of `x` is at most 32, although we interpret this as a
  // i32x1 add, we could get away with just one e.g. i16 add.
  x = even + (odds >> 16);

  x // Done. All bits have been added.
}

This still won’t optimize down to a popcnt instruction, of course. The search scope for such a simplification is in the regime of superoptimizers. However, the generated code is small and fast, which is why this is the ideal implementation of popcnt for systems without such an instruction.

It’s especially nice because it is implementable for e.g. u64 with only one more reduction step (remember: it’s O(log n)!), and does not at any point require a full u64 addition.
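
As a rough sketch of that claim, here is what a u64 version might look like. The name popcnt64 is made up, and I have folded each even/odd split into one line by shifting before masking, which is equivalent to masking and then shifting. For simplicity this sketch just uses ordinary u64 adds throughout.

```rust
fn popcnt64(mut x: u64) -> u64 {
    // i1x64 -> i2x32: add adjacent bits.
    x = (x & 0x5555_5555_5555_5555) + ((x >> 1) & 0x5555_5555_5555_5555);
    // i2x32 -> i4x16: add adjacent bit-pairs.
    x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    // i4x16 -> i8x8.
    x = (x & 0x0f0f_0f0f_0f0f_0f0f) + ((x >> 4) & 0x0f0f_0f0f_0f0f_0f0f);
    // i8x8 -> i16x4.
    x = (x & 0x00ff_00ff_00ff_00ff) + ((x >> 8) & 0x00ff_00ff_00ff_00ff);
    // i16x4 -> i32x2.
    x = (x & 0x0000_ffff_0000_ffff) + ((x >> 16) & 0x0000_ffff_0000_ffff);
    // i32x2 -> i64x1: the one extra step compared to the u32 version.
    x = (x & 0x0000_0000_ffff_ffff) + (x >> 32);
    x
}
```

A quick sanity check is to compare it against u64::count_ones() on a handful of inputs.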

Even though this is “just” using scalars, divide-and-conquer approaches like this are the bread and butter of the SIMD programmer.

Scaling Up: Operations on Real Vectors

Proper SIMD vectors provide more sophisticated semantics than scalars do, particularly because there is more need to provide replacements for things like control flow. Remember, control flow is slow!

What’s actually available is highly dependent on the architecture you’re compiling to (more on this later), but the way vector instruction sets are usually structured is something like this.

We have vector registers that are kind of like really big general-purpose registers. For example, on x86, most “high performance” cores (like my Zen 2) implement AVX2, which provides 256 bit ymm vectors. The registers themselves do not have a “lane count”; that is specified by the instructions. For example, the “vector byte add instruction” interprets the register as being divided into eight-bit lanes and adds them. The corresponding x86 instruction is vpaddb, which interprets a ymm as an i8x32.

The operations you usually get are:

  1. Bitwise operations. These don’t need to specify a lane width because it’s always implicitly 1: they’re bitwise.

  2. Lane-wise arithmetic. This is addition, subtraction, multiplication, division (both int and float), and shifts[1] (int only). Lane-wise min and max are also common. These require specifying a lane width. Typically the smallest number of lanes is two or four.

  3. Lane-wise compare. Given a and b, we can create a new mask vector m such that m[i] = a[i] < b[i] (or any other comparison operation). A mask vector’s lanes contain boolean values with an unusual bit-pattern: all-zeros (for false) or all-ones (for true)[2].

    • Masks can be used to select between two vectors: for example, given m, x, and y, you can form a fourth vector z such that z[i] = m[i] ? x[i] : y[i].
  4. Shuffles (sometimes called swizzles). Given a and x, create a third vector s such that s[i] = a[x[i]]. a is used as a lookup table, and x as a set of indices. Out of bounds produces a special value, usually zero. This emulates parallelized array access without needing to actually touch RAM (RAM is extremely slow).

    • Often there is a “shuffle2” or “riffle” operation that allows taking elements from one of two vectors. Given a, b, and x, we now define s as being s[i] = (a ++ b)[x[i]], where a ++ b is a double-width concatenation. How this is actually implemented depends on architecture, and it’s easy to build out of single shuffles regardless.

(1) and (2) are ordinary number crunching. Nothing deeply special about them.

The comparison and select operations in (3) are intended to help SIMD code stay “branchless”. Branchless code is written such that it performs the same operations regardless of its inputs, and relies on the properties of those operations to produce correct results. For example, this might mean taking advantage of identities like x * 0 = 0 and a ^ b ^ a = b to discard “garbage” results.
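
Here is a small scalar model of compare-and-select, using a plain [u8; 4] as a stand-in for a vector. The helper names are hypothetical and this is not how any particular SIMD library spells it; the point is that the all-ones/all-zeros mask pattern lets select be built from bitwise operations alone, with no branch on the data.

```rust
/// Lane-wise `a < b`. Each lane of the result is 0xff (true) or 0x00 (false).
fn lanes_lt(a: [u8; 4], b: [u8; 4]) -> [u8; 4] {
    let mut m = [0u8; 4];
    for i in 0..4 {
        // `true as u8` is 1, and negating 1 with wraparound gives 0xff.
        m[i] = ((a[i] < b[i]) as u8).wrapping_neg();
    }
    m
}

/// Lane-wise `m[i] ? x[i] : y[i]`, using only and, or, and not.
fn select(m: [u8; 4], x: [u8; 4], y: [u8; 4]) -> [u8; 4] {
    let mut z = [0u8; 4];
    for i in 0..4 {
        z[i] = (m[i] & x[i]) | (!m[i] & y[i]);
    }
    z
}

/// Lane-wise minimum, built out of a compare and a select.
fn lanes_min(a: [u8; 4], b: [u8; 4]) -> [u8; 4] {
    select(lanes_lt(a, b), a, b)
}
```

For example, lanes_min([1, 5, 3, 9], [2, 5, 1, 10]) picks the smaller value in each lane without ever branching on the comparison result.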

The shuffles described in (4) are much more powerful than meets the eye.

For example, “broadcast” (sometimes called “splat”) makes a vector whose lanes are all the same scalar, like Rust’s [42; N] array literal. A broadcast can be expressed as a shuffle: create a vector with the desired value in the first lane, and then shuffle it with an index vector of [0, 0, ...].

diagram of a broadcast
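
A scalar model of a shuffle, and a broadcast built on top of it, might look something like this. The eight-lane width and the function names are arbitrary choices for the sketch; out-of-bounds indices produce zero, as described above.

```rust
/// s[i] = a[x[i]], with out-of-bounds indices producing zero.
fn shuffle(a: [u8; 8], x: [u8; 8]) -> [u8; 8] {
    let mut s = [0u8; 8];
    for i in 0..8 {
        s[i] = a.get(x[i] as usize).copied().unwrap_or(0);
    }
    s
}

/// Broadcast: put the value in lane 0, then shuffle with all-zero indices.
fn broadcast(value: u8) -> [u8; 8] {
    let mut a = [0u8; 8];
    a[0] = value;
    shuffle(a, [0; 8])
}
```

With this model, broadcast(42) is the same thing as the array literal [42; 8].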

“Interleave” (also called “zip” or “pack”) takes two vectors a and b and creates two new vectors c and d whose lanes are alternating lanes from a and b. If the lane count is n, then c = [a[0], b[0], a[1], b[1], ...] and d = [a[n/2], b[n/2], a[n/2 + 1], b[n/2 + 1], ...]. This can also be implemented as a shuffle2, with shuffle indices of [0, n, 1, n + 1, ...]. “Deinterleave” (or “unzip”, or “unpack”) is the opposite operation: it interprets a pair of vectors as two halves of a larger vector of pairs, and produces two new vectors consisting of the halves of each pair.

Interleave can also be interpreted as taking a [T; N], transmuting it to a [[T; N/2]; 2], performing a matrix transpose to turn it into a [[T; 2]; N/2], and then transmuting that back to [T; N] again. Deinterleave is the same but it transmutes to [[T; 2]; N/2] first.

diagram of a interleave
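
The same kind of scalar model works for interleave and deinterleave, expressed as shuffle2 calls. Everything here (the names, the four-lane width, usize indices) is an assumption made for the sketch.

```rust
const N: usize = 4;

/// s[i] = (a ++ b)[x[i]], with out-of-bounds indices producing zero.
fn shuffle2(a: [u8; N], b: [u8; N], x: [usize; N]) -> [u8; N] {
    let mut s = [0u8; N];
    for i in 0..N {
        s[i] = match x[i] {
            j if j < N => a[j],
            j if j < 2 * N => b[j - N],
            _ => 0,
        };
    }
    s
}

/// c = [a[0], b[0], a[1], b[1]] and d = [a[2], b[2], a[3], b[3]].
fn interleave(a: [u8; N], b: [u8; N]) -> ([u8; N], [u8; N]) {
    let c = shuffle2(a, b, [0, N, 1, N + 1]);
    let d = shuffle2(a, b, [N / 2, N + N / 2, N / 2 + 1, N + N / 2 + 1]);
    (c, d)
}

/// The inverse: treat c ++ d as a list of pairs and split the pairs apart.
fn deinterleave(c: [u8; N], d: [u8; N]) -> ([u8; N], [u8; N]) {
    let a = shuffle2(c, d, [0, 2, 4, 6]);
    let b = shuffle2(c, d, [1, 3, 5, 7]);
    (a, b)
}
```

Interleaving and then deinterleaving should give back the original two vectors.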

“Rotate” takes a vector a with n lanes and produces a new vector b such that b[i] = a[(i + j) % n], for some chosen integer j. This is yet another shuffle, with indices [j, j + 1, ..., n - 1, 0, 1, ... j - 1].

diagram of a rotate
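
A sketch of rotate as a shuffle, again with made-up names: first build the index vector, then use it to look lanes up.

```rust
/// The shuffle indices for a left rotation by j: [j, j + 1, ..., n - 1, 0, ..., j - 1].
fn rotate_indices(n: usize, j: usize) -> Vec<usize> {
    (0..n).map(|i| (i + j) % n).collect()
}

/// b[i] = a[(i + j) % n], performed as a lookup through the index vector.
fn rotate_left(a: &[u8], j: usize) -> Vec<u8> {
    rotate_indices(a.len(), j).into_iter().map(|i| a[i]).collect()
}
```

This should agree with the standard library’s slice rotate_left().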

Shuffles are worth trying to wrap your mind around. SIMD programming is all about reinterpreting larger-than-an-integer-sized blocks of data as smaller blocks of varying sizes, and shuffling is important for getting data into the right “place”.

Intrinsics and Instruction Selection

Earlier, I mentioned that what you get varies by architecture. This section is basically a giant footnote.

So, there’s two big factors that go into this.

  1. We’ve learned over time which operations tend to be most useful to programmers. x86 might have something that ARM doesn’t because it “seemed like a good idea at the time” but turned out to be kinda niche.
  2. Instruction set extensions are often market differentiators, even within the same vendor. Intel has AVX-512, which provides even more sophisticated instructions, but it’s only available on high-end server chips, because it makes manufacturing more expensive.

Toolchains generalize different extensions as “target features”. Features can be detected at runtime through architecture-specific magic. On Linux, the lscpu command will list what features the CPU advertises that it recognizes, which correlate with the names of features that e.g. LLVM understands. What features are enabled for a particular function affects how LLVM compiles it. For example, LLVM will only emit ymm-using code when compiling with +avx2.

So how do you write portable SIMD code? On the surface, the answer is mostly “you don’t”, but it’s more complicated than that, and for that we need to understand how the later parts of a compiler work.

When a user requests an add by writing a + b, how should I decide which instruction to use for it? This seems like a trick question… just an add right? On x86, even this isn’t so easy, since you have a choice between the actual add instruction, or a lea instruction (which, among other things, preserves the rflags register). This question becomes more complicated for more sophisticated operations. This general problem is called instruction selection.

Because which “target features” are enabled affects which instructions are available, they affect instruction selection. When I went over operations “typically available”, this means that compilers will usually be able to select good choices of instructions for them on most architectures.

Compiling with something like -march=native or -Ctarget-cpu=native gets you “the best” code possible for the machine you’re building on, but it might not be portable[3] to different processors. Gentoo was quite famous for building packages from source on user machines to take advantage of this (not to mention that they loved using -O3, which mostly exists to slow down build times with little benefit).

There is also runtime feature detection, where a program decides which version of a function to call at runtime by asking the CPU what it supports. Code deployed on heterogeneous devices (like cryptography libraries) often makes use of this. Doing this correctly is very hard and something I don’t particularly want to dig deeply into here.
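
Just to give a flavor of it, here is a minimal sketch of runtime dispatch in Rust using the standard library’s is_x86_feature_detected! macro and the #[target_feature] attribute. The function names are hypothetical, and a real library has to worry about much more than this (other architectures, caching the decision, keeping the variants in sync).

```rust
/// Portable fallback.
fn sum_scalar(data: &[u32]) -> u32 {
    data.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

/// Same body, but compiled with AVX2 enabled, so the compiler is allowed
/// to use AVX2 instructions inside this one function.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[u32]) -> u32 {
    data.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

/// Ask the CPU what it supports, then pick a version.
fn sum(data: &[u32]) -> u32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: we just checked that this CPU supports AVX2.
            return unsafe { sum_avx2(data) };
        }
    }
    sum_scalar(data)
}
```

Callers only ever see sum(); which body runs is decided on the machine the program executes on, not the one it was built on.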

The situation is made worse by the fact that in C++, you usually write SIMD code using “intrinsics”, which are special functions with inscrutable names like _mm256_cvtps_epu32 that represent a low-level operation in a specific instruction set (this is a float to unsigned int conversion from AVX-512). Intrinsics are defined by hardware vendors, but don’t necessarily map down to single instructions; the compiler can still optimize these instructions by merging, deduplication, and through instruction selection.

As a result you wind up writing the same code multiple times for different instruction sets, with only minor maintainability benefits over writing assembly.

The alternative is a portable SIMD library, which does some instruction selection behind the scenes at the library level but tries to rely on the compiler for most of the heavy-duty work. For a long time I was skeptical that this approach would actually produce good, competitive code, which brings us to the actual point of this article: using Rust’s portable SIMD library to implement a somewhat fussy algorithm, and measuring performance.

Parsing with SIMD

Let’s design a SIMD implementation for a well-known algorithm. Although it doesn’t look like it at first, the power of shuffles makes it possible to parse text with SIMD. And this parsing can be very, very fast.

In this case, we’re going to implement base64 decoding. To review, base64 is an encoding scheme for arbitrary binary data into ASCII. We interpret a byte slice as a bit vector, and divide it into six-bit chunks called sextets. Then, each sextet from 0 to 63 is mapped to an ASCII character:

  1. 0 to 25 go to 'A' to 'Z'.
  2. 26 to 51 go to 'a' to 'z'.
  3. 52 to 61 go to '0' to '9'.
  4. 62 goes to +.
  5. 63 goes to /.

There are other variants of base64, but the bulk of the complexity is the same for each variant.
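
For reference, here is a sketch of the encoding direction for the standard alphabet described above. The function names are made up; encode3 handles exactly one full group of three bytes and ignores padding entirely.

```rust
/// Map a sextet (0..=63) to its ASCII character in the standard alphabet.
fn sextet_to_ascii(sextet: u8) -> u8 {
    match sextet {
        0..=25 => b'A' + sextet,
        26..=51 => b'a' + (sextet - 26),
        52..=61 => b'0' + (sextet - 52),
        62 => b'+',
        63 => b'/',
        _ => panic!("not a sextet: {sextet}"),
    }
}

/// Encode one full group: 3 bytes (24 bits) become 4 sextets.
fn encode3(bytes: [u8; 3]) -> [u8; 4] {
    // Treat the three bytes as one big-endian 24-bit number.
    let bits = u32::from_be_bytes([0, bytes[0], bytes[1], bytes[2]]);
    [
        sextet_to_ascii((bits >> 18) as u8 & 0x3f),
        sextet_to_ascii((bits >> 12) as u8 & 0x3f),
        sextet_to_ascii((bits >> 6) as u8 & 0x3f),
        sextet_to_ascii(bits as u8 & 0x3f),
    ]
}
```

For instance, the three bytes of "Man" split into the sextets 19, 22, 5, 46, which encode as "TWFu".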

There are a few basic pitfalls to keep in mind.

  1. Base64 is a “big endian” format: specifically, the bits in each byte are big endian. Because a sextet can span only parts of a byte, this distinction is important.

  2. We need to beware of cases where the input length is not divisible by 4; ostensibly messages should be padded with = to a multiple of 4, but it’s easy to just handle messages that aren’t padded correctly.

The length of a decoded message is given by this function:

fn decoded_len(input: usize) -> usize {
  input / 4 * 3 + match input % 4 {
    1 | 2 => 1,
    3 => 2,
    _ => 0,
  }
}

Given all this, the easiest way to implement base64 is something like this.

fn decode(data: &[u8], out: &mut Vec<u8>) -> Result<(), Error> {
  // Tear off at most two trailing =.
  let data = match data {
    [p @ .., b'=', b'='] | [p @ .., b'='] | p => p,
  };

  // Split the input into chunks of at most 4 bytes.
  for chunk in data.chunks(4) {
    let mut bytes = 0u32;
    for &byte in chunk {
      // Translate each ASCII character into its corresponding
      // sextet, or return an error.
      let sextet = match byte {
        b'A'..=b'Z' => byte - b'A',
        b'a'..=b'z' => byte - b'a' + 26,
        b'0'..=b'9' => byte - b'0' + 52,
        b'+' => 62,
        b'/' => 63,
        _ => return Err(Error(...)),
      };

      // Append the sextet to the temporary buffer.
      bytes <<= 6;
      bytes |= sextet as u32;
    }

    // Shift things so the actual data winds up at the
    // top of `bytes`.
    bytes <<= 32 - 6 * chunk.len();

    // Append the decoded data to `out`, keeping in mind that
    // `bytes` is big-endian encoded.
    let decoded = decoded_len(chunk.len());
    out.extend_from_slice(&bytes.to_be_bytes()[..decoded]);
  }

  Ok(())
}

So, what’s the process of turning this into a SIMD version? We want to follow one directive with inexorable, robotic dedication.

Eliminate all branches.

This is not completely feasible, since the input is of variable length. But we can try. There are several branches in this code:

  1. The for chunk in line. This one is the length check: it checks if there is any data left to process.
  2. The for &byte in line. This is the hottest loop: it branches once per input byte.
  3. The match byte line is several branches, to determine which of the five “valid” match arms we land in.
  4. The return Err line. Returning in a hot loop is extra control flow, which is not ideal.
  5. The call to decoded_len contains a match, which generates branches.
  6. The call to Vec::extend_from_slice. This contains not just branches, but potential calls into the allocator. Extremely slow.

(5) is the easiest to deal with. The match is mapping the values 0, 1, 2, 3 to 0, 1, 1, 2. Call this function f. Then, the sequence given by x - f(x) is 0, 0, 1, 1. This just happens to equal x / 2 (or x >> 1), so we can write a completely branchless version of decoded_len like so.

pub fn decoded_len(input: usize) -> usize {
  let mod4 = input % 4;
  input / 4 * 3 + (mod4 - mod4 / 2)
}

That’s one branch eliminated[4]. ✅
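
If you don’t trust the arithmetic, it’s cheap to check. The remainders 0, 1, 2, 3 give x / 2 = 0, 0, 1, 1, so x - x / 2 = 0, 1, 1, 2, which is exactly what the match produces. Here is a sketch with both versions side by side (the suffixed names are just for this comparison):

```rust
/// The original version, with a match on the remainder.
fn decoded_len_match(input: usize) -> usize {
    input / 4 * 3
        + match input % 4 {
            1 | 2 => 1,
            3 => 2,
            _ => 0,
        }
}

/// The branchless version: mod4 - mod4 / 2 maps 0, 1, 2, 3 to 0, 1, 1, 2.
fn decoded_len_branchless(input: usize) -> usize {
    let mod4 = input % 4;
    input / 4 * 3 + (mod4 - mod4 / 2)
}
```

Comparing the two over a range of input lengths is enough to convince yourself they agree, since the behavior only depends on the remainder mod 4.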

The others will not prove so easy. Let’s turn our attention to the innermost loop next, branches (2), (3), and (4).

The Hottest Loop

The superpower of SIMD is that because you operate on so much data at a time, you can unroll the loop so hard it becomes branchless.

The insight is this: we want to load at most four bytes, do something to them, and then spit out at most three decoded bytes. While doing this operation, we may encounter a syntax error so we need to report that somehow.

Here’s some facts we can take advantage of.

  1. We don’t need to figure out how many bytes are in the “output” of the hot loop: our handy branchless decoded_len() does that for us.
  2. Invalid base64 is extremely rare. We want that syntax error to cost as little as possible. If the user still cares about which byte was the problem, they can scan the input for it after the fact.
  3. A is zero in base64. If we’re parsing a truncated chunk, padding it with A won’t change the value[5].

This suggests an interface for the body of the “hottest loop”. We can factor it out as a separate function, and simplify since we can assume our input is always four bytes now.

fn decode_hot(ascii: [u8; 4]) -> ([u8; 3], bool) {
  let mut bytes = 0u32;
  let mut ok = true;
  for byte in ascii {
    let sextet = match byte {
      b'A'..=b'Z' => byte - b'A',
      b'a'..=b'z' => byte - b'a' + 26,
      b'0'..=b'9' => byte - b'0' + 52,
      b'+' => 62,
      b'/' => 63,
      _ => !0,
    };

    bytes <<= 6;
    bytes |= sextet as u32;
    ok &= sextet != !0;
  }

  // This is the `to_be_bytes()` call.
  let [b1, b2, b3, _] = bytes.to_le_bytes();
  ([b3, b2, b1], ok)
}

// In decode()...
for chunk in data.chunks(4) {
  let mut ascii = [b'A'; 4];
  ascii[..chunk.len()].copy_from_slice(chunk);

  let (bytes, ok) = decode_hot(ascii);
  if !ok {
    return Err(Error)
  }

  let len = decoded_len(chunk.len());
  out.extend_from_slice(&bytes[..len]);
}

You’re probably thinking: why not return Option<[u8; 3]>? Returning an enum will make it messier to eliminate the if !ok branch later on (which we will!). We want to write branchless code, so let’s focus on finding a way of producing that three-byte output without needing to do early returns.

Now’s when we want to start talking about vectors rather than arrays, so let’s try to rewrite our function as such.

fn decode_hot(ascii: Simd<u8, 4>) -> (Simd<u8, 4>, bool) {
  unimplemented!()
}

Note that the output is now four bytes, not three. SIMD lane counts need to be powers of two, and that last element will never get looked at, so we don’t need to worry about what winds up there.

The callsite also needs to be tweaked, but only slightly, because Simd<u8, 4> is From<[u8; 4]>.

ASCII to Sextet

Let’s look at the first part of the for byte in ascii loop. We need to map each lane of the Simd<u8, 4> to the corresponding sextet, and somehow signal which ones are invalid. First, notice something special about the match: almost every arm can be written as byte - C for some constant C. The non-range case looks a little silly, but humor me:

let sextet = match byte {
  b'A'..=b'Z' => byte - b'A',
  b'a'..=b'z' => byte - b'a' + 26,
  b'0'..=b'9' => byte - b'0' + 52,
  b'+'        => byte - b'+' + 62,
  b'/'        => byte - b'/' + 63,
  _ => !0,
};

So, it should be sufficient to build a vector offsets that contains the appropriate constant C for each lane, and then let sextets = ascii - offsets;

How can we build offsets? Using compare-and-select.

// A lane-wise version of `x >= start && x <= end`.
fn in_range(bytes: Simd<u8, 4>, start: u8, end: u8) -> Mask<i8, 4> {
  bytes.simd_ge(Simd::splat(start)) & bytes.simd_le(Simd::splat(end))
}

// Create masks for each of the five ranges.
// Note that these are disjoint: for any two masks, m1 & m2 == 0.
let uppers = in_range(ascii, b'A', b'Z');
let lowers = in_range(ascii, b'a', b'z');
let digits = in_range(ascii, b'0', b'9');
let pluses = ascii.simd_eq(Simd::splat(b'+'));
let solidi = ascii.simd_eq(Simd::splat(b'/'));

// If any byte was invalid, none of the masks will select for it,
// so that lane will be 0 in the or of all the masks. This is our
// validation check.
let ok = (uppers | lowers | digits | pluses | solidi).all();

// Given a mask, create a new vector by splatting `value`
// over the set lanes.
fn masked_splat(mask: Mask<i8, 4>, value: i8) -> Simd<i8, 4> {
  mask.select(Simd::splat(value), Simd::splat(0))
}

// Fill the lanes of the offset vector by filling the
// set lanes with the corresponding offset. This is like
// a "vectorized" version of the `match`.
let offsets = masked_splat(uppers,  65)
            | masked_splat(lowers,  71)
            | masked_splat(digits,  -4)
            | masked_splat(pluses, -19)
            | masked_splat(solidi, -16);

// Finally, build the sextets vector.
let sextets = ascii.cast::<i8>() - offsets;

This solution is quite elegant, and will produce very competitive code, but it’s not actually ideal. We need to do a lot of comparisons here: eight in total. We also keep lots of values alive at the same time, which might lead to unwanted register pressure.
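
If you want to poke at this logic without a nightly compiler, here is a scalar model of the same compare-and-select scheme on a plain [u8; 4]. It is only a sketch: masks are modeled as bytes that are 0xff or 0x00, the names are made up, and the offsets are the same constants as above, reinterpreted as u8 and subtracted with wraparound.

```rust
/// Lane-wise `start <= b && b <= end`, as 0xff / 0x00 mask bytes.
fn in_range(bytes: [u8; 4], start: u8, end: u8) -> [u8; 4] {
    bytes.map(|b| ((start <= b && b <= end) as u8).wrapping_neg())
}

/// Returns the four sextets and whether every input byte was valid.
fn ascii_to_sextets(ascii: [u8; 4]) -> ([u8; 4], bool) {
    // One (mask, offset) pair per range. The masks are disjoint.
    let masks = [
        (in_range(ascii, b'A', b'Z'), 65i8),
        (in_range(ascii, b'a', b'z'), 71),
        (in_range(ascii, b'0', b'9'), -4),
        (in_range(ascii, b'+', b'+'), -19),
        (in_range(ascii, b'/', b'/'), -16),
    ];

    let mut offsets = [0u8; 4];
    let mut valid = [0u8; 4];
    for (mask, offset) in masks {
        for i in 0..4 {
            // Masked splat: keep `offset` only in the lanes the mask selects.
            offsets[i] |= mask[i] & offset as u8;
            valid[i] |= mask[i];
        }
    }

    let mut sextets = [0u8; 4];
    for i in 0..4 {
        sextets[i] = ascii[i].wrapping_sub(offsets[i]);
    }
    (sextets, valid.iter().all(|&m| m == 0xff))
}
```

Feeding it *b"TWFu" should give the sextets 19, 22, 5, 46 and a valid flag of true, and any chunk with a byte outside the alphabet should come back with the flag set to false.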

SIMD Hash Table

Let’s look at the byte representations of the ranges. A-Z, a-z, and 0-9 are, as byte ranges, 0x41..0x5b, 0x61..0x7b, and 0x30..0x3a. Notice they all have different high nybbles! What’s more, + and / are 0x2b and 0x2f, so the function byte >> 4 is almost enough to distinguish all the ranges. If we subtract one if byte == b'/', we have a perfect hash for the ranges.

In other words, the value (byte >> 4) - (byte == '/') maps the ranges as follows:

  • A-Z goes to 4 or 5.
  • a-z goes to 6 or 7.
  • 0-9 goes to 3.
  • + goes to 2.
  • / goes to 1.

This is small enough that we could cram a lookup table of values for building the offsets vector into another SIMD vector, and use a shuffle operation to do the lookup.

This is not my original idea; I came across a GitHub issue where an anonymous user points out this perfect hash.

Our new ascii-to-sextet code looks like this:

// Compute the perfect hash for each lane.
let hashes = (ascii >> Simd::splat(4))
  + Simd::simd_eq(ascii, Simd::splat(b'/'))
    .to_int()  // to_int() maps true lanes to -1 and false lanes to 0.
    .cast::<u8>();

// Look up the (negated) offset for each hash and add it to `ascii`.
let sextets = ascii
    // This lookup table holds the negations of the offsets we used to
    // build the `offsets` vector in the previous implementation, placed
    // at the indices that the perfect hash produces. Adding a negated
    // offset (with wraparound) is the same as subtracting the offset.
  + Simd::<i8, 8>::from([0, 16, 19, 4, -65, -65, -71, -71])
    .cast::<u8>()
    .swizzle_dyn(hashes);

There is a small wrinkle here: Simd::swizzle_dyn() requires that the index array be the same length as the lookup table. This is annoying because right now ascii is a Simd<u8, 4>, but that will not be the case later on, so I will simply sweep this under the rug.
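
Since there are only 256 possible bytes, the hash-and-lookup idea is easy to check exhaustively with scalar code. In this sketch the table entry is added to the byte with wraparound, and out-of-range hashes read as zero; that last part is an assumption of the sketch, chosen to mimic a shuffle’s out-of-bounds behavior. The function names are hypothetical.

```rust
/// Indexed by the perfect hash; each entry is added to the ASCII byte.
const TABLE: [i8; 8] = [0, 16, 19, 4, -65, -65, -71, -71];

/// (byte >> 4) - (byte == '/')
fn hash(byte: u8) -> u8 {
    (byte >> 4).wrapping_sub((byte == b'/') as u8)
}

/// Only meaningful for valid base64 bytes; there is no validation here.
fn sextet_via_hash(byte: u8) -> u8 {
    let entry = TABLE.get(hash(byte) as usize).copied().unwrap_or(0);
    byte.wrapping_add(entry as u8)
}

/// The reference answer, straight from the scalar decoder's match.
fn sextet_via_match(byte: u8) -> Option<u8> {
    match byte {
        b'A'..=b'Z' => Some(byte - b'A'),
        b'a'..=b'z' => Some(byte - b'a' + 26),
        b'0'..=b'9' => Some(byte - b'0' + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}
```

Looping over 0..=255 and comparing the two functions wherever the match returns Some confirms that the hash really is perfect for the five ranges.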

Note that we no longer get validation as a side-effect of computing the sextets vector. The same GitHub issue also provides an exact bloom-filter for checking that a particular byte is valid; you can see my implementation here. I’m not sure how the OP constructed the bloom filter, but the search space is small enough that you could have written a little script to brute force it.

Riffling the Sextets

Now comes a much trickier operation: we need to somehow pack all four sextets into three bytes. One way to try to wrap our head around what the packing code in decode_hot() is doing is to pass in the all-ones sextet in one of the four bytes, and see where those ones end up in the return value.

This is not unlike how they use radioactive dyes in biology to track the movement of molecules or cells through an organism.

fn bits(value: u32) -> String {
  let [b1, b2, b3, b4] = value.reverse_bits().to_le_bytes();
  format!("{b1:08b} {b2:08b} {b3:08b} {b4:08b}")
}

fn decode_pack(input: [u8; 4]) {
  let mut output = 0u32;
  for byte in input {
    output <<= 6;
    output |= byte as u32;
  }
  output <<= 8;

  println!("{}\n{}\n", bits(u32::from_be_bytes(input)), bits(output));
}

decode_pack([0b111111, 0, 0, 0]);
decode_pack([0, 0b111111, 0, 0]);
decode_pack([0, 0, 0b111111, 0]);
decode_pack([0, 0, 0, 0b111111]);

// Output:
// 11111100 00000000 00000000 00000000
// 00111111 00000000 00000000 00000000
//
// 00000000 11111100 00000000 00000000
// 11000000 00001111 00000000 00000000
//
// 00000000 00000000 11111100 00000000
// 00000000 11110000 00000011 00000000
//
// 00000000 00000000 00000000 11111100
// 00000000 00000000 11111100 00000000

Bingo. Playing around with the inputs lets us verify which pieces of the bytes wind up where. For example, by passing 0b110000 as input[1], we see that the two high bits of input[1] correspond to the low bits of output[0]. I’ve written the code so that the bits in each byte are printed in little-endian order, so bits on the left are the low bits.

Putting this all together, we can draw a schematic of what this operation does to a general Simd<u8, 4>.

the riffling operation

Now, there’s no single instruction that will do this for us. Shuffles can be used to move bytes around, but we’re dealing with pieces of bytes here. We also can’t really do a shift, since we need bits that are overshifted to move into adjacent lanes.

The trick is to just make the lanes bigger.

Among the operations available for SIMD vectors are lane-wise casts, which allow us to zero-extend, sign-extend, or truncate each lane. So what we can do is cast sextets to a vector of u16, do the shift there and then… somehow put the parts back together?

Let’s see how far shifting gets us. How much do we need to shift things by? First, notice that the order of the bits within each chunk that doesn’t cross a byte boundary doesn’t change. For example, the four low bits of input[1] are in the same order when they become the high bits of output[1], and the two high bits of input[1] are also in the same order when they become the low bits of output[0].

This means we can determine how far to shift by comparing the bit position of the lowest bit of a byte of input with the bit position of the corresponding bit in output.

input[0]’s low bit is the third bit of output[0], so we need to shift input[0] by 2. input[1]’s lowest bit is the fifth bit of output[1], so we need to shift by 4. Analogously, the shifts for input[2] and input[3] turn out to be 6 and 0. In code:

let sextets = ...;
let shifted = sextets.cast::<u16>() << Simd::from([2, 4, 6, 0]);
Rust

So now we have a Simd<u16, 4> that contains the individual chunks that we need to move around, in the high and low bytes of each u16, which we can think of as being analogous to a [[u8; 2]; 4]. For example, shifted[0][0] contains sextets[0], shifted left by two bits. This corresponds to the red segment in the first schematic. The smaller blue segment is given by shifted[1][1], i.e., the high byte of the second u16. It’s already in the right place within that byte, so we want output[0] = shifted[0][0] | shifted[1][1].

This suggests a more general strategy: we want to take two vectors, the low bytes and the high bytes of each u16 in shifted, respectively, and somehow shuffle them so that when or’ed together, they give the desired output.

Look at the schematic again: if we had a vector consisting of [..aaaaaa, ....bbbb, ......cc], we could or it with a vector like [bb......, cccc...., dddddd..] to get the desired result.

One problem: dddddd.. is shifted[3][0], i.e., it’s a low byte. If we change the vector we shift by to [2, 4, 6, 8], though, it winds up in shifted[3][1], since it’s been shifted up by 8 bits: a full byte.

// Split shifted into low byte and high byte vectors.
// Same way you'd split a single u16 into bytes, but lane-wise.
let lo = shifted.cast::<u8>();
let hi = (shifted >> Simd::from([8; 4])).cast::<u8>();

// Align the lanes: we want to get shifted[0][0] | shifted[1][1],
// shifted[1][0] | shifted[2][1], etc.
let output = lo | hi.rotate_lanes_left::<1>();
Rust
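To sanity-check the lane arithmetic without a nightly compiler, here is a scalar sketch of the same steps on plain arrays. The name pack_sextets is made up for illustration and is not part of vb64; each loop iteration stands in for one SIMD lane.

```rust
/// Scalar model of the shift-and-recombine step: four 6-bit values in,
/// three packed bytes out, plus one leftover lane that callers ignore.
/// Hypothetical helper for illustration; not part of vb64.
fn pack_sextets(sextets: [u8; 4]) -> [u8; 4] {
  const SHIFTS: [u32; 4] = [2, 4, 6, 8];

  let mut lo = [0u8; 4];
  let mut hi = [0u8; 4];
  for i in 0..4 {
    // Widen the lane to u16 so bits shifted past bit 7 are kept.
    let shifted = (sextets[i] as u16) << SHIFTS[i];
    lo[i] = shifted as u8; // Truncating cast keeps the low byte.
    hi[i] = (shifted >> 8) as u8;
  }

  let mut out = [0u8; 4];
  for i in 0..4 {
    // Rotating hi left by one lane puts hi[i + 1] into lane i.
    out[i] = lo[i] | hi[(i + 1) % 4];
  }
  out
}
```

For example, the sextets 19, 22, 5, 46 (the characters of "TWFu") should pack to the bytes of "Man" in the first three lanes.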

Et voila, here is our new, totally branchless implementation of decode_hot().

fn decode_hot(ascii: Simd<u8, 4>) -> (Simd<u8, 4>, bool) {
  let hashes = (ascii >> Simd::splat(4))
    + Simd::simd_eq(ascii, Simd::splat(b'/'))
      .to_int()
      .cast::<u8>();

  let sextets = ascii
    + Simd::<i8, 8>::from([0, 16, 19, 4, -65, -65, -71, -71])
      .cast::<u8>()
      .swizzle_dyn(hashes);  // Not quite right yet, see next section.

  let ok = /* bloom filter shenanigans */;

  let shifted = sextets.cast::<u16>() << Simd::from([2, 4, 6, 8]);
  let lo = shifted.cast::<u8>();
  let hi = (shifted >> Simd::splat(8)).cast::<u8>();
  let output = lo | hi.rotate_lanes_left::<1>();

  (output, ok)
}
Rust

The compactness of this solution should not be understated. The simplicity of this solution is a large part of what makes it so efficient, because it aggressively leverages the primitives the hardware offers us.
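The offset lookup at the top of decode_hot() is easy to check one byte at a time. The sketch below is a scalar stand-in, assuming the table entry is added (with wraparound) to the input byte; sextet_of is a hypothetical name, and the validity check behind ok is left out entirely.

```rust
/// Scalar model of the hash-and-offset lookup, for a single byte.
/// Only meaningful for bytes in the standard base64 alphabet.
/// Illustrative only; this is not vb64's code.
fn sextet_of(ascii: u8) -> u8 {
  const OFFSETS: [i8; 8] = [0, 16, 19, 4, -65, -65, -71, -71];

  // A lane-wise comparison yields -1 for a match, so '/' hashes to 1
  // instead of colliding with '+' at 2.
  let is_slash = -((ascii == b'/') as i8);
  let hash = (ascii >> 4).wrapping_add(is_slash as u8);

  // A dynamic swizzle yields 0 for out-of-range indices; mirror that.
  let offset = OFFSETS.get(hash as usize).copied().unwrap_or(0);
  ascii.wrapping_add(offset as u8)
}
```

Running every character of the standard alphabet through it should give back that character's position, 0 through 63.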

Scaling Up

Ok, so now we have to contend with a new aspect of our implementation that’s crap: a Simd<u8, 4> is tiny. That’s not even 128 bits, which are the smallest vector registers on x86. What we need to do is make decode_hot() generic on the lane count. This will allow us to tune the number of lanes to batch together depending on benchmarks later on.

fn decode_hot<const N: usize>(ascii: Simd<u8, N>) -> (Simd<u8, N>, bool)
where
  // This makes sure N is a small power of 2.
  LaneCount<N>: SupportedLaneCount,
{
  let hashes = (ascii >> Simd::splat(4))
    + Simd::simd_eq(ascii, Simd::splat(b'/'))
      .to_int()
      .cast::<u8>();

  let sextets = ascii
    + tiled(&[0, 16, 19, 4, -65, -65, -71, -71])
      .cast::<u8>()
      .swizzle_dyn(hashes);  // Works fine now, as long as N >= 8.

  let ok = /* bloom filter shenanigans */;

  let shifted = sextets.cast::<u16>() << tiled(&[2, 4, 6, 8]);
  let lo = shifted.cast::<u8>();
  let hi = (shifted >> Simd::splat(8)).cast::<u8>();
  let output = lo | hi.rotate_lanes_left::<1>();

  (output, ok)
}

/// Generates a new vector made up of repeated "tiles" of identical
/// data.
const fn tiled<T, const N: usize>(tile: &[T]) -> Simd<T, N>
where
  T: SimdElement,
  LaneCount<N>: SupportedLaneCount,
{
  let mut out = [tile[0]; N];
  let mut i = 0;
  while i < N {
    out[i] = tile[i % tile.len()];
    i += 1;
  }
  Simd::from_array(out)
}
Rust

We have to change virtually nothing, which is pretty awesome! But unfortunately, this code is subtly incorrect. Remember how in the N = 4 case, the result of output had a garbage value that we ignore in its highest lane? Well, now that garbage data is interleaved into output: every fourth lane contains garbage.

We can use a shuffle to delete these lanes, thankfully. Specifically, we want shuffled[i] = output[i + i / 3], which skips every fourth index. So, shuffled[3] = output[4], skipping over the garbage value in output[3]. If i + i / 3 overflows N, that’s ok, because that’s the high quarter of the final output vector, which is ignored anyways. In code:

fn decode_hot<const N: usize>(ascii: Simd<u8, N>) -> (Simd<u8, N>, bool)
where
  // This makes sure N is a small power of 2.
  LaneCount<N>: SupportedLaneCount,
{
  /* snip */

  let decoded_chunks = lo | hi.rotate_lanes_left::<1>();
  let output = swizzle!(N; decoded_chunks, array!(N; |i| i + i / 3));

  (output, ok)
}
Rust

swizzle!() is a helper macro6 for generating generic implementations of std::simd::Swizzle, and array!() is something I wrote for generating generic-length array constants; the closure is called once for each i in 0..N.
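The index formula is easier to trust after seeing it run on a small array. This is a scalar sketch with a made-up name (drop_every_fourth); out-of-range source indices produce 0 here, which is a choice of this sketch rather than a statement about what the swizzle!() helper does.

```rust
/// Scalar model of the compaction shuffle: out[i] = input[i + i / 3].
/// Source indices past the end yield 0 in this sketch; those lanes sit
/// in the high quarter of the vector, which the caller ignores.
fn drop_every_fourth<const N: usize>(input: [u8; N]) -> [u8; N] {
  let mut out = [0u8; N];
  for i in 0..N {
    let src = i + i / 3;
    if src < N {
      out[i] = input[src];
    }
  }
  out
}
```

With N = 8 and garbage marked as 0xEE, the input [1, 2, 3, 0xEE, 4, 5, 6, 0xEE] compacts to six real bytes followed by two ignored lanes.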

So now we can decode 32 base64 bytes in parallel by calling decode_hot::<32>(). We’ll try to keep things generic from here, so we can tune the lane parameter based on benchmarks.

The Outer Loop

Let’s look at decode() again. Let’s start by making it generic on the internal lane count, too.

fn decode<const N: usize>(data: &[u8], out: &mut Vec<u8>) -> Result<(), Error>
where
  LaneCount<N>: SupportedLaneCount,
{
  let data = match data {
    [p @ .., b'=', b'='] | [p @ .., b'='] | p => p,
  };

  for chunk in data.chunks(N) { // N-sized chunks now.
    let mut ascii = [b'A'; N];
    ascii[..chunk.len()].copy_from_slice(chunk);

    let (dec, ok) = decode_hot::<N>(ascii.into());
    if !ok {
      return Err(Error);
    }

    let decoded = decoded_len(chunk.len());
    out.extend_from_slice(&dec[..decoded]);
  }

  Ok(())
}
Rust

What branches are left? There’s still the branch from for chunk in .... It’s not ideal because it can’t do an exact pointer comparison, and needs to do a >= comparison on a length instead.

We call [T]::copy_from_slice, which is super slow because it needs to make a variable-length memcpy call, which can’t be inlined. Function calls are branches! The bounds checks are also a problem.

We branch on ok every loop iteration, still. Not returning early in decode_hot doesn’t win us anything (yet).

We potentially call the allocator in extend_from_slice, and perform another non-inline-able memcpy call.

Preallocating with Slop

The last of these is the easiest to address: we can reserve space in out, since we know exactly how much data we need to write thanks to decoded_len. Better yet, we can reserve some “slop”: i.e., scratch space past where the end of the message would be, so we can perform full SIMD stores, instead of the variable-length memcpy.

This way, in each iteration, we write the full SIMD vector, including any garbage bytes in the upper quarter. Then, the next write is offset 3/4 * N bytes over, so it overwrites the garbage bytes with decoded message bytes. The garbage bytes from the final write get “deleted” by not being included in the final Vec::set_len() that “commits” the memory we wrote to.

fn decode<const N: usize>(data: &[u8], out: &mut Vec<u8>) -> Result<(), Error>
where LaneCount<N>: SupportedLaneCount,
{
  let data = match data {
    [p @ .., b'=', b'='] | [p @ .., b'='] | p => p,
  };

  let final_len = decoded_len(data.len());
  out.reserve(final_len + N / 4);  // Reserve with slop.

  // Get a raw pointer to where we should start writing.
  let mut ptr = out.as_mut_ptr_range().end;
  let start = ptr;

  for chunk in data.chunks(N) { // N-sized chunks now.
    /* snip */

    let decoded = decoded_len(chunk.len());
    unsafe {
      // Do a raw write and advance the pointer.
      ptr.cast::<Simd<u8, N>>().write_unaligned(dec);
      ptr = ptr.add(decoded);
    }
  }

  unsafe {
    // Update the vector's final length.
    // This is the final "commit".
    let len = ptr.offset_from(start);
    out.set_len(len as usize);
  }

  Ok(())
}
Rust

This is safe, because we’ve pre-allocated exactly the amount of memory we need, and where ptr lands is equal to the amount of memory actually decoded. We could also compute the final length of out ahead of time.

Note that if we early return due to if !ok, out remains unmodified, because even though we did write to its buffer, we never execute the “commit” part, so the code remains correct.
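The same write-wide-then-commit idea can be sketched in safe Rust, at the cost of zero-filling the buffer first. Everything here is an assumption made for illustration: the name append_with_slop, the fixed keep count per block, and the use of resize/truncate in place of reserve/set_len.

```rust
/// Appends the first `keep` bytes of each N-byte block to `out`, but
/// always stores whole blocks. Safe stand-in for the raw-pointer loop;
/// hypothetical helper, not vb64's code.
fn append_with_slop<const N: usize>(blocks: &[[u8; N]], keep: usize, out: &mut Vec<u8>) {
  assert!(keep <= N);
  let start = out.len();
  let final_len = start + blocks.len() * keep;

  // Make room for the real data plus whatever the last full-width
  // store spills past it.
  out.resize(final_len + (N - keep), 0);

  let mut at = start;
  for block in blocks {
    // Always a full N-byte store; the next store overwrites the junk tail.
    out[at..at + N].copy_from_slice(block);
    at += keep;
  }

  // The "commit": drop the slop, including the last block's junk.
  out.truncate(final_len);
}
```

The point to notice is that the slop has to cover everything the last full-width store writes beyond the real data, which in this sketch is N - keep bytes.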

Delaying Failure

Next up, we can eliminate the if !ok branches by waiting to return an error until as late as possible: just before the set_len call.

Remember our observation from before: most base64 encoded blobs are valid, so this unhappy path should be very rare. Also, syntax errors cannot cause code that follows to misbehave arbitrarily, so letting it go wild doesn’t hurt anything.

fn decode<const N: usize>(data: &[u8], out: &mut Vec<u8>) -> Result<(), Error>
where LaneCount<N>: SupportedLaneCount,
{
  /* snip */
  let mut error = false;
  for chunk in data.chunks(N) {
    let mut ascii = [b'A'; N];
    ascii[..chunk.len()].copy_from_slice(chunk);

    let (dec, ok) = decode_hot::<N>(ascii.into());
    error |= !ok;

    /* snip */
  }

  if error {
    return Err(Error);
  }

  unsafe {
    let len = ptr.offset_from(start);
    out.set_len(len as usize);
  }

  Ok(())
}
Rust

The branch is still “there”, sure, but it’s out of the hot loop.

Because we never hit the set_len call and commit whatever garbage we wrote, said garbage essentially disappears when we return early, to be overwritten by future calls to Vec::push().
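Here is the deferred-error pattern in isolation, on a deliberately simpler problem: decoding ASCII hex digits. The function hex_nibbles is a hypothetical stand-in and has nothing to do with vb64; it only shows the shape of "accumulate a flag in the loop, check it once afterwards, and roll back instead of committing".

```rust
/// Decodes ASCII hex digits into nibbles, reporting failure only once
/// at the end. On error, `out` is left exactly as it was on entry.
fn hex_nibbles(data: &[u8], out: &mut Vec<u8>) -> Result<(), ()> {
  let start = out.len();
  let mut error = false;

  for &byte in data {
    let digit = (byte as char).to_digit(16);
    // Accumulate the failure instead of returning from inside the loop.
    error |= digit.is_none();
    // Invalid input still produces *some* byte; it is discarded below.
    out.push(digit.unwrap_or(0) as u8);
  }

  if error {
    // Roll back: nothing written by this call stays visible.
    out.truncate(start);
    return Err(());
  }
  Ok(())
}
```

Unlike the decoder above, this sketch still pays for Vec::push() on every byte; only the error handling is meant to carry over.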

Unroll It Harder

Ok, let’s look at the memcpy from copy_from_slice at the start of the hot loop. The loop has already been partly unrolled: it does N iterations with SIMD each step, doing something funny on the last step to make up for the missing data (padding with A).

We can take this a step further by doing an “unroll and jam” optimization. This type of unrolling splits the loop into two parts: a hot vectorized loop and a cold remainder part. The hot loop always handles length N input, and the remainder runs at most once and handles i < N input.

Rust provides a slice method for hand-rolled (lol) unroll-and-jam: [T]::chunks_exact().

fn decode<const N: usize>(data: &[u8], out: &mut Vec<u8>) -> Result<(), Error>
where LaneCount<N>: SupportedLaneCount,
{
  /* snip */
  let mut error = false;
  let mut chunks = data.chunks_exact(N);
  for chunk in &mut chunks {
    // Simd::from_slice() can do a load in one instruction.
    // The bounds check is easy for the compiler to elide.
    let (dec, ok) = decode_hot::<N>(Simd::from_slice(chunk));
    error |= !ok;
    /* snip */
  }

  let rest = chunks.remainder();
  if !rest.is_empty() {
    let mut ascii = [b'A'; N];
    ascii[..rest.len()].copy_from_slice(rest);

    let (dec, ok) = decode_hot::<N>(ascii.into());
    /* snip */
  }

  /* snip */
}
Rust

Splitting into two parts lets us call Simd::from_slice(), which performs a single, vector-sized load.
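For readers who have not used chunks_exact() before, this is the hot/cold split on its own, in stable Rust. The helper padded_blocks is made up for this example and assumes N > 0; it just collects the blocks that the decoder would feed to decode_hot().

```rust
/// Splits `data` into N-byte blocks, padding only the final partial
/// block. Hypothetical helper showing the chunks_exact()/remainder()
/// split; requires N > 0.
fn padded_blocks<const N: usize>(data: &[u8], pad: u8) -> Vec<[u8; N]> {
  let mut blocks = Vec::with_capacity((data.len() + N - 1) / N);

  let mut chunks = data.chunks_exact(N);
  for chunk in &mut chunks {
    // Hot path: every chunk here has exactly N bytes.
    let mut block = [0u8; N];
    block.copy_from_slice(chunk);
    blocks.push(block);
  }

  // Cold path: runs at most once, for the fewer-than-N leftover bytes.
  let rest = chunks.remainder();
  if !rest.is_empty() {
    let mut block = [pad; N];
    block[..rest.len()].copy_from_slice(rest);
    blocks.push(block);
  }
  blocks
}
```

Iterating with `&mut chunks` rather than `chunks` is what keeps the iterator alive so that remainder() can be called after the loop.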

So, How Fast Is It?

At this point, it looks like we’ve addressed every branch that we can, so some benchmarks are in order. I wrote a benchmark that decodes messages of every length from 0 to something like 200 or 500 bytes, and compared it against the baseline base64 implementation on crates.io.

I compiled with -Zbuild-std and -Ctarget-cpu=native to try to get the best results. Based on some tuning, N = 32 was the best length, since it used one YMM register for each iteration of the hot loop.

a performance graph; our code is really good compared to the baseline, but variance is high

So, we have the baseline beat. But what’s up with that crazy heartbeat waveform? You can tell it has something to do with the “remainder” part of the loop, since it correlates strongly with data.len() % 32.

I stared at the assembly for a while. I don’t remember what was there, but I think that copy_from_slice had been inlined and unrolled into a loop that loaded each byte at a time. The moral equivalent of this:

let mut ascii = [b'A'; N];
for (a, b) in Iterator::zip(&mut ascii, chunk) {
  *a = *b;
}
Rust

I decided to try Simd::gather_or(), which is kind of like a “vectorized load”. It wound up producing worse assembly, so I gave up on using a gather and instead wrote a carefully optimized loading function by hand.

Unroll and Jam, Revisited

The idea here is to perform the largest scalar loads Rust offers where possible. The strategy is again unroll and jam: perform u128 loads in a loop and deal with the remainder separately.

The hot part looks like this:

let mut buf = [b'A'; N];

// Load a bunch of big 16-byte chunks. LLVM will lower these to XMM loads.
let ascii_ptr = buf.as_mut_ptr();
let mut write_at = ascii_ptr;
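if slice.len() >= 16 {
  for i in 0..slice.len() / 16 {
    unsafe {
      let word = slice.as_ptr().cast::<u128>().add(i).read_unaligned();
      write_at.cast::<u128>().write_unaligned(word);

      // Advance past the chunk we just wrote, so that write_at ends up
      // pointing at the first unwritten byte of buf.
      write_at = write_at.add(16);
    }
  }
}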
Rust

The cold part seems hard to optimize at first. What’s the least number of unaligned loads you need to do to load 15 bytes from memory? It’s two! You can load a u64 from p, and then another one from p + 7; these loads (call them a and b) overlap by one byte, but we can or them together to merge that byte, so our loaded value is a as u128 | (b as u128 << 56).

A similar trick works if the data to load is between a u32 and a u64. Finally, to load 1, 2, or 3 bytes, we can load p, p + len/2 and p + len-1; depending on whether len is 1, 2, or 3, this will potentially load the same byte multiple times; however, this reduces the number of branches necessary, since we don’t need to distinguish the 1, 2, or 3 cases.

This is the kind of code that’s probably easier to read than to explain.

unsafe {
  let ptr = slice.as_ptr().offset(write_at.offset_from(ascii_ptr));
  let len = slice.len() % 16;

  if len >= 8 {
    // Load two overlapping u64s.
    let lo = ptr.cast::<u64>().read_unaligned() as u128;
    let hi = ptr.add(len - 8).cast::<u64>().read_unaligned() as u128;
    let data = lo | (hi << ((len - 8) * 8));

    let z = u128::from_ne_bytes([b'A'; 16]) << (len * 8);
    write_at.cast::<u128>().write_unaligned(data | z);
  } else if len >= 4 {
    // Load two overlapping u32s.
    let lo = ptr.cast::<u32>().read_unaligned() as u64;
    let hi = ptr.add(len - 4).cast::<u32>().read_unaligned() as u64;
    let data = lo | (hi << ((len - 4) * 8));

    let z = u64::from_ne_bytes([b'A'; 8]) << (len * 8);
    write_at.cast::<u64>().write_unaligned(data | z);
  } else {
    // Load 3 overlapping u8s.

    // For len       1       2       3     ...
    // ... this is  ptr[0]  ptr[0]  ptr[0]
    let lo = ptr.read() as u32;
    // ... this is  ptr[0]  ptr[1]  ptr[1]
    let mid = ptr.add(len / 2).read() as u32;
    // ... this is  ptr[0]  ptr[1]  ptr[2]
    let hi = ptr.add(len - 1).read() as u32;

    let data = lo | (mid << ((len / 2) * 8)) | hi << ((len - 1) * 8);

    let z = u32::from_ne_bytes([b'A'; 4]) << (len * 8);
    write_at.cast::<u32>().write_unaligned(data | z);
  }
}
Rust

I learned this type of loading code while contributing to Abseil: it’s very useful for loading variable-length data for data-hungry algorithms, like a codec or a hash function.
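If the pointer casts make the trick hard to follow, the sketch below models it in safe Rust. It is an approximation under stated assumptions: load_le is a hypothetical stand-in for a single unaligned load, byte order is fixed to little-endian, and an explicit empty case is added so the function is total.

```rust
/// Stand-in for one unaligned little-endian load of W bytes (W <= 16).
fn load_le<const W: usize>(bytes: &[u8]) -> u128 {
  let mut word = [0u8; 16];
  word[..W].copy_from_slice(&bytes[..W]);
  u128::from_le_bytes(word)
}

/// Loads up to 15 bytes into a 16-byte block, padding the rest with `pad`.
fn load_padded(bytes: &[u8], pad: u8) -> [u8; 16] {
  let len = bytes.len();
  assert!(len < 16);

  let data = if len >= 8 {
    // Two overlapping 8-byte loads cover 8..=15 bytes.
    load_le::<8>(bytes) | (load_le::<8>(&bytes[len - 8..]) << ((len - 8) * 8))
  } else if len >= 4 {
    // Two overlapping 4-byte loads cover 4..=7 bytes.
    load_le::<4>(bytes) | (load_le::<4>(&bytes[len - 4..]) << ((len - 4) * 8))
  } else if len >= 1 {
    // Three single-byte loads cover 1..=3 bytes, possibly repeating one.
    let mid = load_le::<1>(&bytes[len / 2..]) << ((len / 2) * 8);
    let hi = load_le::<1>(&bytes[len - 1..]) << ((len - 1) * 8);
    load_le::<1>(bytes) | mid | hi
  } else {
    0
  };

  // Every byte at index `len` and above becomes padding.
  let fill = u128::from_le_bytes([pad; 16]) << (len * 8);
  (data | fill).to_le_bytes()
}
```

The overlapping loads are harmless because the shared bytes carry the same value in both words, so or-ing them together changes nothing.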

Here’s the same benchmark again, but with our new loading code.

a performance graph; our code is even better and the variance is very tight

The results are really, really good. The variance is super tight, and our performance is 2x that of the baseline pretty much everywhere. Success.

Encoding? Web-Safe?

Writing an encoding function is simple enough: first, implement an encode_hot() function that reverses the operations from decode_hot(). The perfect hash from before won’t work, so you’ll need to invent a new one.

Also, the loading/storing code around the encoder is slightly different, too. vb64 implements a very efficient encoding routine too, so I suggest taking a look at the source code if you’re interested.

There is a base64 variant called web-safe base64, that replaces the + and / characters with - and _. Building a perfect hash for these is trickier: you would probably have to do something like (byte >> 4) - (byte == '_' ? '_' : 0). I don’t support web-safe base64 yet, but only because I haven’t gotten around to it.

Conclusion

My library doesn’t really solve an important problem; base64 decoding isn’t a bottleneck… anywhere that I know of, really. But writing SIMD code is really fun! Writing branchless code is often overkill but can give you a good appreciation for what your compilers can and can’t do for you.

This project was also an excuse to try std::simd. I think it’s great overall, and generates excellent code. There’s some rough edges I’d like to see fixed to make SIMD code even simpler, but overall I’m very happy with the work that’s been done there.

This is probably one of the most complicated posts I’ve written in a long time. SIMD (and performance in general) is a complex topic that requires a breadth of knowledge of tricks and hardware, a lot of which isn’t written down. More of it is written down now, though.

  1. Shifts are better understood as arithmetic. They have a lane width, and closely approximate multiplication and division. AVX2 doesn’t even have vector shift or vector division: you emulate it with multiplication. 

  2. The two common representations of true and false, i.e. 1 and 0 or 0xff... and 0, are related by the two’s complement operation.

    For example, if I write uint32_t m = -(a == b);, m will be zero if a == b is false, and all-ones otherwise. This is because applying any arithmetic operation to a bool promotes it to int, so false maps to 0 and true maps to 1. Applying the - sends 0 to 0 and 1 to -1, and it’s useful to know that in two’s complement, -1 is represented as all-ones.

    The all-ones representation for true is useful, because it can be used to implement branchless select very easily. For example,

    int select_if_eq(int a, int b, int x, int y) {
      int mask = -(a == b);
      return (mask & x) | (~mask & y);
    }
    C++

    This function returns x if a == b, and y otherwise. Can you tell why? 

  3. Target features also affect ABI in subtle ways that I could write many, many more words on. Compiling libraries you plan to distribute with weird target feature flags is a recipe for disaster. 

  4. Why can’t we leave this kind of thing to LLVM? Finding this particular branchless implementation is tricky. LLVM is smart enough to fold the match into a switch table, but that’s unnecessary memory traffic to look at the table. (In this domain, unnecessary memory traffic makes our code slower.)

    Incidentally, with the code I wrote for the original decoded_len(), LLVM produces a jump and a lookup table, which is definitely an odd choice? I went down something of a rabbit-hole. https://github.com/rust-lang/rust/issues/118306

    As for getting LLVM to find the “branchless” version of the lookup table? The search space is quite large, and this kind of “general strength reduction” problem is fairly open (keywords: “superoptimizers”). 

  5. To be clear on why this works: suppose that in our reference implementation, we only handle inputs that are a multiple-of-4 length, and are padded with = as necessary, and we treat = as zero in the match. Then, for the purposes of computing the bytes value (before appending it to out), we can assume the chunk length is always 4. 

  6. See vb64/src/util.rs

What is a Matrix? A Miserable Pile of Coefficients!

Linear algebra is undoubtedly the most useful field in all of algebra. It finds applications in all kinds of science and engineering, like quantum mechanics, graphics programming, and machine learning. It is the “most well-behaved” algebraic theory, in that other abstract algebra topics often try to approximate linear algebra, when possible.

For many students, linear algebra means vectors and matrices and determinants, and complex formulas for computing them. Matrices, in particular, come equipped with a fairly complicated, and a fortiori convoluted, multiplication operation.

This is not the only way to teach linear algebra, of course. Matrices and their multiplication appear complicated, but actually are a natural and compact way to represent a particular type of function, i.e., a linear map (or linear transformation).

This article is a short introduction to viewing linear algebra from the perspective of abstract algebra, from which matrices arise as a computational tool, rather than an object of study in and of themselves. I do assume some degree of familiarity with the idea of a matrix.

Linear Spaces

Most linear algebra courses open with a description of vectors in Euclidean space: $\R^n$. Vectors there are defined as tuples of real numbers that can be added, multiplied, and scaled. Two vectors can be combined into a number through the dot product. Vectors come equipped with a notion of magnitude and direction.

However, this highly geometric picture can be counterproductive, since it is hard to apply geometric intuition directly to higher dimensions. It also obscures how this connects to working over a different number system, like the complex numbers.

Instead, I’d like to open with the concept of a linear space, which is somewhat more abstract than a vector space1.

First, we will need a notion of a “coefficient”, which is essentially something that you can do arithmetic with. We will draw coefficients from a designated ground field $K$. A field is a setting for doing arithmetic: a set of objects that can be added, subtracted, multiplied, and divided in the “usual fashion”, along with special $0$ and $1$ values. E.g. $a + 0 = a$, $1a = a$, $a(b + c) = ab + ac$, and so on.

Not only are the real numbers $\R$ a field, but so are the complex numbers $\C$, and the rational numbers $\Q$. If we drop the “division” requirement, we can also include the integers $\Z$, or polynomials with rational coefficients $\Q[x]$, for example.

Having chosen our coefficients $K$, a linear space $V$ over $K$ is another set of objects that can be added and subtracted (and including a special value $0$)2, along with a scaling operation, which takes a coefficient $c \in K$ and one of our objects $v \in V$ and produces a new $cv \in V$.

The important part of the scaling operation is that it’s compatible with addition: if we have $a, b \in K$ and $v, w \in V$, we require that

\begin{gather*}a (v + w) = av + aw \\ (a + b) v = av + bv\end{gather*}
Math

This is what makes a linear space “linear”: you can write equations that look like first-degree polynomials (e.g. $ax + b$), and which can be manipulated like first-degree polynomials.

These polynomials are called linear because their graph looks like a line. There’s no multiplication, so we can’t have $x^2$, but we do have multiplication by a coefficient. This is what makes linear algebra “linear”.

Some examples: $n$-tuples of elements drawn from any field are a linear space over that field, by componentwise addition and scalar multiplication; e.g., $\R^3$. Setting $n = 1$ shows that every field is a linear space over itself.

Polynomials in one variable over some field, $K[x]$, are also a linear space, since polynomials can be added together and scaled by any value in $K$ (since lone coefficients are degree zero polynomials). Real-valued functions also form a linear space over $\R$ in a similar way.

Linear Transformations

A linear map is a function $f: V \to W$ between two linear spaces $V$ and $W$ over $K$ which “respects” the linear structure in a particular way. That is, for any $c \in K$ and $v, w \in V$,

\begin{gather*}f(v + w) = f(v) + f(w) \\ f(cv) = c \cdot f(v)\end{gather*}
Math

We call this type of relationship (respecting addition and scaling) “linearity”. One way to think of this relationship is that $f$ is kind of like a different kind of coefficient, in that it distributes over addition, which commutes with the “ordinary” coefficients from $K$. However, applying $f$ produces a value from $W$ rather than $V$.

Another way to think of it is that if we have a linear polynomial like $p(x) = ax + b$ in $x$, then $f(p(x)) = p(f(x))$. We say that $f$ commutes with all linear polynomials.

The most obvious sort of linear map is scaling. Given any coefficient $c \in K$, it defines a “scaling map”:

\begin{gather*}\mu_c: V \to V \\ v \mapsto cv\end{gather*}
Math

It’s trivial to check this is a linear map, by plugging it into the above equations: it’s linear because scaling is distributive and commutative.

Linear maps are the essential thing we study in linear algebra, since they describe all the different kinds of relationships between linear spaces.

Some linear maps are complicated. For example, a function $\R^2 \to \R^2$ that rotates the plane by some angle $\theta$ is linear, as are operations that stretch or shear the plane. However, they can’t “bend” or “fold” the plane: they are all fairly rigid motions. In the linear space $\Q[x]$ of rational polynomials, multiplication by any polynomial, such as $x$ or $x^2 - 1$, is a linear map. The notion of “linear map” depends heavily on the space we’re in.

Unfortunately, linear maps as they stand are quite opaque, and do not lend themselves well to calculation. However, we can build an explicit representation using a linear basis.

Linear Basis

For any linear space, we can construct a relatively small set of elements such that any element of the space can be expressed as some linear function of these elements.

Explicitly, for any $V$, we can construct a sequence3 $e_i$ such that for any $v \in V$, we can find $c_i \in K$ such that

v = \sum_i c_i e_i.
Math

Such a set $e_i$ is called a basis if it is linearly independent: no one $e_i$ can be expressed as a linear function of the rest. The dimension of $V$, denoted $\dim V$, is the number of elements in any choice of basis. This value does not depend on the choice of basis4.

Constructing a basis for any $V$ is easy: we can do this recursively. First, pick any nonzero element $e_1$ of $V$, and define a new linear space $V/e_1$ where we have identified all elements that differ by a multiple of $e_1$ as equal (i.e., if $v - w = ce_1$, we treat $v$ and $w$ as equal in $V/e_1$).

Then, a basis for $V$ is a basis of $V/e_1$ with $e_1$ added. The construction of $V/e_1$ is essentially “collapsing” the dimension $e_1$ “points” in, giving us a new space where we’ve “deleted” all of the elements that have a nonzero $e_1$ component.

However, this only works when the dimension is finite; more complex methods must be used for infinite-dimensional spaces. For example, the polynomials $\Q[x]$ are an infinite-dimensional space, with basis elements $\{1, x, x^2, x^3, \ldots\}$. In general, for any linear space $V$, it is always possible to arbitrarily choose a basis, although it may be infinite5.

Bases are useful because they give us a concrete representation of any element of $V$. Given a fixed basis $e_i$, we can represent any $w = \sum_i c_i e_i$ by the coefficients $c_i$ themselves. For a finite-dimensional $V$, this brings us back to column vectors: $(\dim V)$-tuples of coefficients from $K$ that are added and scaled componentwise.

\Mat{c_0 \\ c_1 \\ \vdots \\ c_n} \,\underset{\text{given } e_i}{:=}\, \sum_i c_i e_i
Math

The $i$th basis element is represented as the vector whose entries are all $0$ except for the $i$th one, which is $1$. E.g.,

\Mat{1 \\ 0 \\ \vdots \\ 0} \,\underset{\text{given } e_i}{=}\, e_1, \,\,\, \Mat{0 \\ 1 \\ \vdots \\ 0} \,\underset{\text{given } e_i}{=}\, e_2, \,\,\, \ldots
Math

It is important to recall that the choice of basis is arbitrary. From the mathematical perspective, any basis is just as good as any other, although some may be more computationally convenient.

Over $\R^2$, $(1, 0)$ and $(0, 1)$ are sometimes called the “standard basis”, but $(1, 2)$ and $(3, -4)$ are also a basis for this space. One easy mistake to make, particularly when working over the tuple space $K^n$, is to confuse the actual elements of the linear space with the coefficient vectors that represent them. Working with abstract linear spaces eliminates this source of confusion.
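To make that distinction concrete, here is a small worked example, with numbers chosen purely for illustration. Take the element $(5, 0)$ of $\R^2$ and expand it in each of the two bases just mentioned:

\begin{gather*}(5, 0) = 5 \cdot (1, 0) + 0 \cdot (0, 1) \\ (5, 0) = 2 \cdot (1, 2) + 1 \cdot (3, -4)\end{gather*}
Math

In the standard basis its coefficient vector is $\Mat{5 \\ 0}$, while in the second basis it is $\Mat{2 \\ 1}$: the same element, written as two different columns of numbers.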

Representing Linear Transformations

Working with finite-dimensional linear spaces $V$ and $W$, let’s choose bases $e_i$ and $d_j$ for them, and let’s consider a linear map $f: V \to W$.

The powerful thing about bases is that we can more compactly express the information content of $f$. Given any $v \in V$, we can decompose it into a linear function of the basis (for some coefficients), so we can write

f(v) = f\left(\sum_i c_i e_i\right) = \sum_i f(c_i e_i) = \sum_i c_i \cdot f(e_i)
Math

In other words, to specify $f$, we only need to specify what it does to each of the $\dim V$ basis elements. But what’s more, because $W$ also has a basis, we can write

f(e_i) = \sum_j A_{ij} d_j
Math

Putting these two formulas together, we have an explicit closed form for $f(v)$, given the coefficients $A_{ij}$ of $f$, and the coefficients $c_i$ of $v$:

f(v) = \sum_{i,j} c_i A_{ij} d_j
Math

Alternatively, we can express $v$ and $f(v)$ as column vectors, and $f$ as the matrix $A$ with entries $A_{ij}$. The entries of the resulting column vector are given by the above explicit formula for $f(v)$, fixing the value of $j$ in each entry.

\underbrace{\Mat{ A_{0,0} & A_{1,0} & \cdots & A_{n,0} \\ A_{0,1} & A_{1,1} & \cdots & A_{n,1} \\ \vdots & \vdots & \ddots & \vdots \\ A_{0,m} & A_{1,m} & \cdots & A_{n,m} }}_A \, \underbrace{\Mat{c_0 \\ c_1 \\ \vdots \\ c_n}}_v = \underbrace{\Mat{ \sum_i c_i A_{i,0} \\ \sum_i c_i A_{i,1} \\ \vdots \\ \sum_i c_i A_{i,m} }}_{Av}
Math

(Remember, this is all dependent on the choices of bases $e_i$ and $d_j$!)

Behold, we have derived the matrix-vector multiplication formula: the $j$th entry of the result is the dot product of the vector and the $j$th row of the matrix.

But it is crucial to keep in mind that we had to choose bases $e_i$ and $d_j$ to be entitled to write down a matrix for $f$. The values of the coefficients depend on the choice of basis.

If your linear space happens to be $\R^n$, there is an “obvious” choice of basis, but not every linear space over $\R$ is $\R^n$! Importantly, the actual linear algebra does not change depending on the basis6.
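The closed form for $f(v)$ translates almost directly into code. The sketch below assumes integer coefficients to keep the arithmetic exact, and stores a map as a list of images of basis elements, so that a[i][j] is $A_{ij}$; the function name apply is made up for this example.

```rust
/// Applies a linear map to a coefficient vector, using the convention
/// from the text: a[i][j] is A_ij, the d_j-coefficient of f(e_i).
/// Assumes every a[i] has the same length. Illustrative sketch only.
fn apply(a: &[Vec<i64>], c: &[i64]) -> Vec<i64> {
  assert_eq!(a.len(), c.len());
  let m = a.first().map_or(0, |image| image.len());

  let mut out = vec![0; m];
  for (i, image) in a.iter().enumerate() {
    for (j, &a_ij) in image.iter().enumerate() {
      // f(v) = sum over i, j of c_i * A_ij * d_j
      out[j] += c[i] * a_ij;
    }
  }
  out
}
```

For instance, a quarter-turn of the plane sends $e_1$ to $(0, 1)$ and $e_2$ to $(-1, 0)$, so applying it to the coefficients $(2, 3)$ should give $(-3, 2)$.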

Matrix Multiplication

So, where does matrix multiplication come from? An $n \times m$7 matrix $A$ represents some linear map $f: V \to W$, where $\dim V = n$, $\dim W = m$, and appropriate choices of basis ($e_i$, $d_j$) have been made.

Keeping in mind that linear maps are supreme over matrices, suppose we have a third linear space $U$, and a map $g: U \to V$, and let $\ell = \dim U$. Choosing a basis $h_k$ for $U$, we can represent $g$ as a matrix $B$ of dimension $\ell \times n$.

Then, we’d like for the matrix product $AB$ to be the same matrix we’d get from representing the composite map $fg: U \to W$ as a matrix, using the aforementioned choices of bases for $U$ and $W$ (the basis choice for $V$ should “cancel out”).

Recall our formula for f(v)f(v) in terms of its matrix coefficients AijA_{ij} and the coefficients of the input vv, which we call cic_i. We can produce a similar formula for g(u)g(u), giving it matrix coefficients BkiB_{ki}, and coefficients bkb_k for uu. (I appologize for the number of indices and coefficients here.)

f(v)=i,jciAijdjg(u)=k,ibkBkiei\begin{align*}f(v) &= \sum_{i,j} c_i A_{ij} d_j \\ g(u) &= \sum_{k,i} b_k B_{ki} e_i\end{align*}
Math

If we write $f(g(u))$, then $c_i$ is the coefficient $e_i$ is multiplied by; i.e., we fix $i$, and drop it from the summation: $c_i = \sum_k b_k B_{ki}$.

Substituting that into the above formula, we now have something like the following.

\begin{align*}f(g(u)) &= \sum_{i,j} \sum_{k} b_k B_{ki} A_{ij} d_j \\ f(g(u)) &= \sum_{k,j} b_k \left(\sum_{i} A_{ij} B_{ki} \right) d_j &(\star)\end{align*}
Math

In $(\star)$, we’ve rearranged things so that the sum in parentheses is the $(k,j)$th matrix coefficient of the composite $fg$. Because we wanted $AB$ to represent $fg$, it must be an $\ell \times m$ matrix whose entries are

(AB)_{kj} = \sum_{i} A_{ij} B_{ki}
Math

This is matrix multiplication. It arises naturally out of composition of linear maps. In this way, the matrix multiplication formula is not a definition, but a theorem of linear algebra!

Theorem (Matrix Multiplication)

Given an $n \times m$ matrix $A$ and an $\ell \times n$ matrix $B$, both with coefficients in $K$, then $AB$ is an $\ell \times m$ matrix with entries

(AB)_{kj} = \sum_{i} A_{ij} B_{ki}
Math

If the matrix dimension is read as $n \to m$ instead of $n \times m$, the shape requirements are more obvious: two matrices $A$ and $B$ can be multiplied together only when they represent a pair of maps $V \to W$ and $U \to V$.
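One way to convince yourself of the theorem is to check it numerically. The sketch below is not part of the derivation: it stores a matrix as nested `Vec`s using the index convention above (`a[i][j]` holds the coefficient A_{ij}, with `i` running over the input basis and `j` over the output basis), and the names `Matrix`, `apply`, and `product` are made up for illustration.

```rust
/// A matrix in the article's index convention: `a[i][j]` is A_{ij}, where
/// `i` indexes the input basis and `j` indexes the output basis.
type Matrix = Vec<Vec<i64>>;

/// Applies the map represented by `a` to the coefficient vector `c`:
/// the j-th output coefficient is sum_i c_i * A_{ij}.
fn apply(a: &Matrix, c: &[i64]) -> Vec<i64> {
    let m = a[0].len();
    (0..m)
        .map(|j| c.iter().zip(a).map(|(c_i, a_i)| c_i * a_i[j]).sum::<i64>())
        .collect()
}

/// Computes the product AB, with (AB)_{kj} = sum_i A_{ij} * B_{ki}.
fn product(a: &Matrix, b: &Matrix) -> Matrix {
    let m = a[0].len();
    b.iter()
        .map(|b_k| {
            (0..m)
                .map(|j| b_k.iter().zip(a).map(|(b_ki, a_i)| b_ki * a_i[j]).sum::<i64>())
                .collect::<Vec<i64>>()
        })
        .collect()
}

fn main() {
    // f: V -> W with dim V = 2 and dim W = 3.
    let a: Matrix = vec![vec![1, 2, 3], vec![4, 5, 6]];
    // g: U -> V with dim U = 2 and dim V = 2.
    let b: Matrix = vec![vec![1, 1], vec![0, 2]];
    let u = [3, 4];

    // Composing the maps agrees with multiplying the matrices first.
    let composed = apply(&a, &apply(&b, &u));
    let via_product = apply(&product(&a, &b), &u);
    assert_eq!(composed, via_product);
    println!("f(g(u)) = {composed:?}");
}
```

Integer coefficients are used only so that the comparison is exact; nothing in the formula depends on that choice.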

Other Consequences, and Conclusion

The identity matrix is an $n \times n$ matrix:

I_n = \Mat{ 1 \\ & 1 \\ && \ddots \\ &&& 1 }
Math

We want it to be such that for any appropriately-sized matrices $A$ and $B$, it has $AI_n = A$ and $I_n B = B$. Lifted up to linear maps, this means that $I_n$ should represent the identity map $V \to V$, when $\dim V = n$. This map sends each basis element $e_i$ to itself, so the columns of $I_n$ should be the basis vectors, in order:

\Mat{1 \\ 0 \\ \vdots \\ 0} \Mat{0 \\ 1 \\ \vdots \\ 0} \cdots \Mat{0 \\ 0 \\ \vdots \\ 1}
Math

If we shuffle the columns, we’ll get a permutation matrix, which shuffles the coefficients of a column vector. For example, consider this matrix.

\Mat{ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 }
Math

This is similar to the identity, but we’ve swapped the first two columns. Thus, it will swap the first two coefficients of any column vector.
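To see the swap concretely, here is the matrix-vector formula worked out for an arbitrary column vector with coefficients $c_0$, $c_1$, $c_2$ (these names are only for illustration). Each entry of the result is the dot product of the vector with the corresponding row:

```latex
\Mat{ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 } \Mat{ c_0 \\ c_1 \\ c_2 }
= \Mat{ 0 \cdot c_0 + 1 \cdot c_1 + 0 \cdot c_2 \\
        1 \cdot c_0 + 0 \cdot c_1 + 0 \cdot c_2 \\
        0 \cdot c_0 + 0 \cdot c_1 + 1 \cdot c_2 }
= \Mat{ c_1 \\ c_0 \\ c_2 }
```

The first two coefficients trade places and the third is left alone.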

Matrices may seem unintuitive when they’re introduced as a subject of study. Every student encountering matrices for the first time may ask “If they add componentwise, why don’t they multiply componentwise too?”

However, approaching matrices as a computational and representational tool shows that the convoluted-looking matrix multiplication formula is a direct consequence of linearity.

\begin{gather*}f(v + w) = f(v) + f(w) \\ f(cv) = c \cdot f(v)\end{gather*}
Math
  1. In actual modern mathematics, the objects I describe are still called vector spaces, which I think generates unnecessary confusion in this case. “Linear space” is a bit more on the nose for what I’m going for. 

  2. This type of structure (just the addition part) is also called an “abelian group”. 

  3. Throughout, $i$, $j$, and $k$ are indices in some unspecified but ordered indexing set, usually $\{1, 2, ..., n\}$. I will not bother giving this index set a name. 

  4. This is sometimes called the dimension theorem, which is somewhat tedious to prove. 

  5. An example of a messy infinite-dimensional basis is $\R$ considered as a linear space over $\Q$ (in general, every field is a linear space over its subfields). The basis for this space essentially has to be “$1$, and all irrational numbers” except if we include e.g. $e$ and $\pi$ we can’t include $e + \frac{1}{2}\pi$, which is a $\Q$-linear combination of $e$ and $\pi$.

    On the other hand, $\C$ is two-dimensional over $\R$, with basis $\{1, i\}$.

    Incidentally, this idea of “view a field $K$ as a linear space over its subfield $F$” is such a useful concept that the dimension of that space is called the “degree of the field extension $K/F$”, and given the symbol $[K : F]$.

    Thus, $[\R : \Q] = \infty$ and $[\C : \R] = 2$.

  6. You may recall from linear algebra class that two matrices $A$ and $B$ of the same shape are similar if there are two appropriately-sized square matrices $S$ and $R$ such that $SAR = B$. These matrices $S$ and $R$ represent a change of basis, and indicate that the linear maps $A, B: V \to W$ these matrices come from do “the same thing” to elements of $V$.

    Over an algebraically closed field like $\C$ (i.e. all polynomials have solutions), there is an even stronger way to capture the information content of a linear map via Jordan canonicalization, which takes any square matrix $A$ and produces an almost-diagonal square matrix that only depends on the eigenvalues of $A$, which is the same for similar matrices, and thus basis-independent. 

  7. Here, as always, matrix dimensions are given in RC (row-column) order. You can think of this as being “input dimension” to “output dimension”. 

I Wrote A String Type

I write compilers for fun. I can’t help it. Consequently, I also write a lot of parsers. In systems programming, it’s usually a good idea to try to share memory rather than copy it, so my AST types tend to look like this.

pub enum Expr<'src> {
  Int(u32),
  Ident(&'src str),
  // ...
}
Rust

Whenever we parse an identifier, rather than copy its name into a fresh String, we borrow from the input source string. This avoids an extra allocation, an extra copy, and saves a word in the representation. Compilers can be memory-hungry, so it helps to pick a lean representation.

Unfortunately, it’s not so easy for quoted strings. Most strings, like "all my jelly babies", are “literally” in the original source, like an identifier. But strings with escapes aren’t: \n is encoded in the source code with the bytes [0x5c, 0x6e], but the actual “decoded” value of the string literal replaces that escape with a single 0x0a byte.

The usual solution is a Cow<str>. In the more common, escape-less version, we can use Cow::Borrowed, which avoids the extra allocation and copy, and in the escaped version, we decode the escapes into a String and wrap it in a Cow::Owned.

For example, suppose that we’re writing a parser for a language that has quoted strings with escapes. The string "all my jelly babies" can be represented as a byte string that borrows the input source code, so we’d use the Cow::Borrowed variant. This is most strings in any language: escapes tend to be rare.

On the other hand, if we have the string "not UTF-8 \xff", the actual byte string value is different from that in the source code, so it has to be decoded into a fresh buffer.

// Bytes in the source.
hex:   6e 6f 74 20 55 54 46 2d 38 20 5c 78 66 66
ascii: n  o  t     U  T  F  -  8     \  x  f  f

// Bytes represented by the string.
hex:   6e 6f 74 20 55 54 46 2d 38 20 ff
ascii: n  o  t     U  T  F  -  8
Plaintext

Escapes are relatively rare, so most strings processed by the parser do not need to pay for an allocation.
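The borrowed-unless-escaped strategy can be sketched in a few lines. This is an illustrative decoder, not the one from any real parser: the function name `decode` is invented, it works on the bytes between the quotes, and it only understands `\n`, `\\`, and `\xHH`.

```rust
use std::borrow::Cow;

/// Decodes the body of a quoted string (without the surrounding quotes).
/// Only `\n`, `\\`, and `\xHH` are handled in this sketch.
fn decode(src: &str) -> Result<Cow<'_, [u8]>, String> {
    // Fast path: with no backslash, the source bytes *are* the value,
    // so we can borrow them and skip the allocation entirely.
    if !src.contains('\\') {
        return Ok(Cow::Borrowed(src.as_bytes()));
    }

    // Slow path: copy into a fresh buffer, replacing escapes as we go.
    let bytes = src.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] != b'\\' {
            out.push(bytes[i]);
            i += 1;
            continue;
        }
        match bytes.get(i + 1) {
            Some(b'n') => {
                out.push(b'\n');
                i += 2;
            }
            Some(b'\\') => {
                out.push(b'\\');
                i += 2;
            }
            Some(b'x') => {
                let hex = src.get(i + 2..i + 4).ok_or("truncated \\x escape")?;
                let byte = u8::from_str_radix(hex, 16).map_err(|e| e.to_string())?;
                out.push(byte);
                i += 4;
            }
            _ => return Err(format!("unknown escape at byte {i}")),
        }
    }
    Ok(Cow::Owned(out))
}

fn main() {
    for src in ["all my jelly babies", r"not UTF-8 \xff"] {
        match decode(src) {
            Ok(Cow::Borrowed(b)) => println!("borrowed {} bytes", b.len()),
            Ok(Cow::Owned(b)) => println!("allocated {} bytes", b.len()),
            Err(e) => println!("error: {e}"),
        }
    }
}
```

The result is a `Cow<[u8]>` rather than a `Cow<str>` because an escape like `\xff` can produce bytes that are not valid UTF-8.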

However, we still pay for that extra word, since Cow<str> is 24 bytes (unless otherwise specified, all byte counts assume a 64-bit system), which is eight more than our &str. Even worse, this is bigger than the string data itself, which is 11 bytes.

If most of your strings are small (which is not uncommon in an AST parser), you will wind up paying for significant overhead.
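The sizes quoted above are easy to check for yourself. The following is a small sketch (the helper name `string_sizes_in_words` is made up) that reports each type's size in machine words; the values reflect what current versions of rustc produce and are not a documented guarantee for `String` or `Cow<str>`.

```rust
use std::borrow::Cow;
use std::mem::size_of;

/// Returns the sizes of `&str`, `Box<str>`, `String`, and `Cow<str>`,
/// measured in machine words (eight bytes each on a 64-bit target).
fn string_sizes_in_words() -> [usize; 4] {
    let word = size_of::<usize>();
    [
        size_of::<&str>() / word,
        size_of::<Box<str>>() / word,
        size_of::<String>() / word,
        size_of::<Cow<'static, str>>() / word,
    ]
}

fn main() {
    let [str_ref, boxed, string, cow] = string_sizes_in_words();
    println!("&str: {str_ref}, Box<str>: {boxed}, String: {string}, Cow<str>: {cow}");
}
```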

Over the years I’ve implemented various optimized string types to deal with this use-case, in various contexts. I finally got around to putting all of the tricks I know into a library, which I call byteyarn. It advertises the following nice properties.

A Yarn is a highly optimized string type that provides a number of useful properties over String:

  • Always two pointers wide, so it is always passed into and out of functions in registers.
  • Small string optimization (SSO) up to 15 bytes on 64-bit architectures.
  • Can be either an owned buffer or a borrowed buffer (like Cow<str>).
  • Can be upcast to 'static lifetime if it was constructed from a known-static string.

I’d like to share how these properties are achieved through careful layout optimization.

Assumptions

We’re going to start by stating assumptions about how our strings will be used:

  1. Most strings are not mutated most of the time.
  2. Most strings are small.
  3. Most strings are substrings.

Most Strings are Immutable

String is modeled after C++’s std::string, which is a growable buffer that implements amortized linear-time append. This means that if we are appending n bytes to the buffer, we only pay for n bytes of memcpy.

This is a useful but often unnecessary property. For example, Go strings are immutable, and when building up a large string, you are expected to use strings.Builder, which is implemented as essentially a Rust String. Java also has a similar story for strings, which allows for highly compact representations of java.lang.Strings.

In Rust, this kind of immutable string is represented by a Box<str>, which is eight bytes smaller than String. Converting from String to Box<str> is just a call to realloc() to resize the underlying allocation (which is often cheap[1]) from being capacity bytes long to len bytes long.

Thus, this assumption means we only need to store a pointer and a length, which puts our memory footprint floor at 16 bytes.
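As a quick illustration of that conversion, here is a sketch using the standard `String::into_boxed_str` method. The wrapper name `freeze` is invented for this example.

```rust
/// Drops the spare capacity of a growable `String`, leaving only a
/// pointer and a length behind.
fn freeze(s: String) -> Box<str> {
    s.into_boxed_str()
}

fn main() {
    // Build the string in a buffer with plenty of room to grow.
    let mut s = String::with_capacity(64);
    s.push_str("jelly babies");
    println!("capacity while building: {}", s.capacity());

    // Once we are done mutating, shed the capacity field.
    let frozen = freeze(s);
    println!("{frozen:?} is {} bytes long", frozen.len());
}
```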

Most Strings are Substrings

Suppose again that we’re parsing some textual format. Many structural elements will be verbatim references into the textual input. Not only string literals without escapes, but also identifiers.

Box<str> cannot hold borrowed data, because it will always instruct the allocator to free its pointer when it goes out of scope. Cow<str>, as we saw above, allows us to handle maybe-owned data uniformly, but has a minimum 24 byte overhead. This can’t be made any smaller, because a Cow<str> can contain a 24-byte String value.

But, we don’t want to store a capacity. Can we avoid the extra word of overhead in Cow<str>?

Most Strings are Small

Consider a string that is not a substring but which is small. For example, when parsing a string literal like "Hello, world!\n", the trailing \n (bytes 0x5c 0x6e) must be replaced with a newline byte (0x0a). This means we must handle a tiny heap allocation, 14 bytes long, that is smaller than a &str referring to it.

This is worse for single-character[2] strings. The overhead for a Box<str> is large.

  • The Box<str> struct itself has a pointer field (eight bytes), and a length field (also eight bytes). Spelled out to show all the stored bits, the length is 0x0000_0000_0000_0001. That’s a lot of zeroes!
  • The pointer itself points to a heap allocation, which will not be a single byte! Allocators are not in the business of handing out such small pieces of memory. Instead, the allocation is likely costing us another eight bytes!

So, the string "a", whose data is just a single byte, instead takes up 24 bytes of memory.

It turns out that for really small strings we can avoid the allocation altogether, and make effective use of all those zeroes in the len field.

Stealing Bits

Let’s say we want to stick to a budget of 16 bytes for our Yarn type. Is there any extra space left for data in a (*mut u8, usize) pair?

*cracks Fermi estimation knuckles*

A usize is 64 bits, which means that the length of an &str can be anywhere from zero to 18446744073709551615, or around 18 exabytes. For reference, “hundreds of exabytes” is a reasonable ballpark guess for how much RAM exists in 2023 (consider: 4 billion smartphones with 4GB each). More practically, the largest quantity of RAM you can fit in a server blade is measured in terabytes (much more than your measly eight DIMMs on your gaming rig).

If we instead use one fewer bit, 63 bits, this halves the maximum representable memory to nine exabytes. If we take another, it’s now four exabytes. Much more memory than you will ever ever want to stick in a string. Wikipedia asserts that Wikimedia Commons contains around 428 terabytes of media (the articles’ text with history is a measly 10 TB).

Ah, but you say you’re programming for a 32-bit machine (today, this likely means either a low-end mobile phone, an embedded microcontroller, or WASM).

On a 32-bit machine it’s a little bit hairier: Now usize is 32 bits, for a maximum string size of 4 gigabytes (if you remember the 32-bit era, this limit may sound familiar). “Gigabytes” is an amount of memory that you can actually imagine having in a string.

Even then, 1 GB of memory (if we steal two bits) on a 32-bit machine is a lot of data. You can only have four strings that big in a single address space, and every 32-bit allocator in the universe will refuse to serve an allocation of that size. If your strings are comparable in size to the whole address space, you should build your own string type.

The upshot is that every &str contains two bits we can reasonably assume are not used. Free real-estate.[3]

A Hand-Written Niche Optimization

Rust has the concept of niches, or invalid bit-patterns of a particular type, which it uses for automatic layout optimization of enums. For example, references cannot be null, so the pointer bit-pattern of 0x0000_0000_0000_0000 is never used; this bit-pattern is called a “niche”. Consider:

enum Foo<'a, T> {
  First(&'a T),
  Second
}
Rust

An enum of this form will not need any “extra” space to store the value that discriminates between the two variants: if a Foo’s bits are all zero, it’s Foo::Second; otherwise it’s a Foo::First and the payload is formed from Foo’s bit-pattern. This, incidentally, is what makes Option<&T> a valid representation for a “nullable pointer”.

There are more general forms of this: bool is represented as a single byte, of which two bit-patterns are valid; the other 254 potential bit-patterns are niches. In recent versions of Rust, OwnedFd and BorrowedFd have a niche for the all-ones bit-pattern, since POSIX file descriptors are always non-negative ints.
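You can observe niches from the outside by comparing `size_of::<Option<T>>()` with `size_of::<T>()`. The helper below, `none_is_free`, is a made-up name for this sketch. The results for references, `Box`, and the `NonZero` integers are documented guarantees; the results for `bool` and `u32` are simply what current rustc does.

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

/// True if `Option<T>` fits in the same space as `T`, i.e. the compiler
/// found a niche in `T` in which to encode `None`.
fn none_is_free<T>() -> bool {
    size_of::<Option<T>>() == size_of::<T>()
}

fn main() {
    // References and boxes are never null, so null can mean `None`.
    println!("&u8:        {}", none_is_free::<&u8>());
    println!("Box<u8>:    {}", none_is_free::<Box<u8>>());
    // Zero is excluded by construction.
    println!("NonZeroU32: {}", none_is_free::<NonZeroU32>());
    // Only 0 and 1 are valid, leaving 254 spare bit-patterns.
    println!("bool:       {}", none_is_free::<bool>());
    // Every bit-pattern is a valid u32, so `None` needs extra space.
    println!("u32:        {}", none_is_free::<u32>());
}
```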

By stealing two bits off of the length, we have given ourselves four niches, which essentially means we’ll have a hand-written version of something like this enum.

enum Yarn {
  First(*mut u8, u62),
  Second(*mut u8, u62),
  Third(*mut u8, u62),
  Fourth(*mut u8, u62),
}
Rust

For reasons that will become clear later, we will specifically steal the high bits of the length, so that to recover the length, we do two shifts[4] to shift in two high zero bits. Here’s some code that actually implements this for the low-level type our string type will be built on.

#[repr(C)]
#[derive(Copy, Clone)]
struct RawYarn {
  ptr: *mut u8,
  len: usize,
}

impl RawYarn {
  /// Constructs a new RawYarn from raw components: a 2-bit kind,
  /// a length, and a pointer.
  fn from_raw_parts(kind: u8, len: usize, ptr: *mut u8) -> Self {
    assert!(len <= usize::MAX / 4, "no way you have a string that big");

    RawYarn {
      ptr,
      len: (kind as usize & 0b11) << (usize::BITS - 2) | len,
    }
  }

  /// Extracts the kind back out.
  fn kind(self) -> u8 {
    (self.len >> (usize::BITS - 2)) as u8
  }

  /// Extracts the slice out (regardless of kind).
  unsafe fn as_slice(&self) -> &[u8] {
    slice::from_raw_parts(self.ptr, (self.len << 2) >> 2)
  }
}
Rust

Note that I’ve made this type Copy, and some functions take it by value. This is for two reasons.

  1. There is a type of Yarn that is itself Copy, although I’m not covering it in this article.

  2. It is a two-word struct, which means that on most architectures it is eligible to be passed in a pair of registers. Passing it by value in the low-level code helps promote keeping it in registers. This isn’t always possible, as we will see when we discuss “SSO”.
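The bit-twiddling in RawYarn can be exercised on its own, without any pointers involved. Here is a standalone sketch of just the packing arithmetic; the free functions `pack`, `kind_of`, and `len_of` are illustrative names that mirror `from_raw_parts`, `kind`, and the length computation in `as_slice`.

```rust
/// Number of bits to shift a kind by to land it in the top two bits.
const KIND_SHIFT: u32 = usize::BITS - 2;

/// Packs a 2-bit kind and a length into one word.
fn pack(kind: u8, len: usize) -> usize {
    assert!(len <= usize::MAX / 4, "length must fit in the low bits");
    ((kind as usize & 0b11) << KIND_SHIFT) | len
}

/// Recovers the kind from the top two bits.
fn kind_of(packed: usize) -> u8 {
    (packed >> KIND_SHIFT) as u8
}

/// Recovers the length by shifting the kind out and zeroes back in.
fn len_of(packed: usize) -> usize {
    (packed << 2) >> 2
}

fn main() {
    let packed = pack(1, 19);
    println!("packed = {packed:#x}");
    println!("kind = {}, len = {}", kind_of(packed), len_of(packed));
}
```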

Let’s choose kind 0 to mean “this is borrowed data”, and kind 1 to be “this is heap-allocated data”. We can use this to remember whether we need to call a destructor.

pub struct Yarn<'a> {
  raw: RawYarn,
  _ph: PhantomData<&'a str>,
}

const BORROWED: u8 = 0;
const HEAP: u8 = 1;

impl<'a> Yarn<'a> {
  /// Create a new yarn from borrowed data.
  pub fn borrowed(data: &'a str) -> Self {
    let len = data.len();
    let ptr = data.as_ptr().cast_mut();
    Self {
      raw: RawYarn::from_raw_parts(BORROWED, len, ptr),
      _ph: PhantomData,
    }
  }

  /// Create a new yarn from owned data.
  pub fn owned(data: Box<str>) -> Self {
    let len = data.len();
    let ptr = data.as_ptr().cast_mut();
    mem::forget(data);

    Self {
      raw: RawYarn::from_raw_parts(HEAP, len, ptr),
      _ph: PhantomData,
    }
  }

  /// Extracts the data.
  pub fn as_slice(&self) -> &str {
    unsafe {
      // SAFETY: initialized either from uniquely-owned data,
      // or borrowed data of lifetime 'a that outlives self.
      str::from_utf8_unchecked(self.raw.as_slice())
    }
  }
}

impl Drop for Yarn<'_> {
  fn drop(&mut self) {
    if self.raw.kind() == HEAP {
      let dropped = unsafe {
        // SAFETY: This is just reconstituting the box we dismantled
        // in Yarn::owned().
        Box::from_raw(self.raw.as_mut_slice() as *mut [u8])
      };
      // Dropping the box frees the heap allocation.
      drop(dropped);
    }
  }
}

impl RawYarn {
  unsafe fn as_mut_slice(&mut self) -> &mut [u8] {
    // Same thing as as_slice, basically. This is just to give
    // Box::from_raw() above a slice pointer to work with.
    slice::from_raw_parts_mut(self.ptr, (self.len << 2) >> 2)
  }
}
Rust

This gives us a type that strongly resembles Cow<str> with only two-thirds of the bytes (16 instead of 24). We can even write code to extend the lifetime of a Yarn:

impl Yarn<'_> {
  /// Removes the bound lifetime from the yarn, allocating if
  /// necessary.
  pub fn immortalize(mut self) -> Yarn<'static> {
    if self.raw.kind() == BORROWED {
      let copy: Box<str> = self.as_slice().into();
      self = Yarn::owned(copy);
    }

    // We need to be careful that we discard the old yarn, since its
    // destructor may run and delete the heap allocation we created
    // above.
    let raw = self.raw;
    mem::forget(self);
    Yarn::<'static> {
      raw,
      _ph: PhantomData,
    }
  }
}
Rust

The remaining two niches can be put to use for optimizing small strings.

Small String Optimization

C++’s std::string also makes the “most strings are small” assumption. In the libc++ implementation of the standard library, std::strings of up to 22 bytes never hit the heap!

C++ implementations do this by using most of the pointer, length, and capacity fields as a storage buffer for small strings, the so-called “small string optimization” (SSO). In libc++, in SSO mode, a std::string’s length fits in one byte, so the other 23 bytes can be used as storage: 22 bytes of string data plus the NUL terminator that c_str() requires. The capacity isn’t stored at all: an SSO string always has a capacity of 22.

RawYarn still has another two niches, so let’s dedicate one to a “small” representation. In small mode, the kind will be 2, and only the 16th byte will be the length.

This is why we used the two high bits of len for our scratch space: no matter what mode it’s in, we can easily extract these bits[5]. Some of the existing RawYarn methods need to be updated, though.

#[repr(C)]
#[derive(Copy, Clone)]
struct RawYarn {
  ptr: MaybeUninit<*mut u8>,
  len: usize,
}

const SMALL: u8 = 2;

impl RawYarn {
  /// Constructs a new RawYarn from raw components: a 2-bit kind,
  /// a length, and a pointer.
  fn from_raw_parts(kind: u8, len: usize, ptr: *mut u8) -> Self {
    debug_assert!(kind != SMALL);
    assert!(len <= usize::MAX / 4, "no way you have a string that big");

    RawYarn {
      ptr: MaybeUninit::new(ptr),
      len: (kind as usize & 0b11) << (usize::BITS - 2) | len,
    }
  }

  /// Extracts the slice out (regardless of kind).
  unsafe fn as_slice(&self) -> &[u8] {
    let (ptr, adjust) = match self.kind() {
      SMALL => (self as *const Self as *const u8, usize::BITS - 8),
      _ => (self.ptr.assume_init().cast_const(), 0),
    };

    slice::from_raw_parts(ptr, (self.len << 2) >> (2 + adjust))
  }
}
Rust

In the non-SMALL case, we shift twice as before, but in the SMALL case, we need to get the high byte of the len field, so we need to shift down by an additional usize::BITS - 8. No matter what we’ve scribbled on the low bytes of len, we will always get just the length this way.
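To make the SMALL-mode arithmetic concrete, here is a standalone sketch that works on a bare `usize` rather than a real RawYarn. The function names are invented for illustration, and the `0x00ff_ffff` constant just stands in for string data that happens to occupy the low bytes of the word.

```rust
const SMALL: u8 = 2;
/// Shift that moves a byte into the most significant byte of a usize.
const HIGH_BYTE_SHIFT: u32 = usize::BITS - 8;

/// Builds a word whose high byte holds the SMALL kind (top two bits)
/// followed by a 6-bit length; all lower bytes are left as zero.
fn pack_small_len(len: usize) -> usize {
    assert!(len < 64, "length must fit in six bits");
    ((SMALL as usize) << 6 | len) << HIGH_BYTE_SHIFT
}

/// Recovers the kind from the top two bits.
fn kind_of(word: usize) -> u8 {
    (word >> (usize::BITS - 2)) as u8
}

/// Recovers the small length: shift the kind out, then shift the six
/// length bits all the way down, discarding whatever sits below them.
fn small_len_of(word: usize) -> usize {
    (word << 2) >> (2 + HIGH_BYTE_SHIFT)
}

fn main() {
    // Pretend string bytes have been scribbled over the low bytes.
    let word = pack_small_len(13) | 0x00ff_ffff;
    println!("kind = {}, len = {}", kind_of(word), small_len_of(word));
}
```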

We also need to use a different pointer value depending on whether we’re in SMALL mode. This is why as_slice needs to take a reference argument, since the slice data may be directly in self!

Also, ptr is a MaybeUninit now; the reason for this will become clear in the next code listing.

We should also provide a way to construct small strings.

const SSO_LEN: usize = size_of::<usize>() * 2 - 1;

impl RawYarn {
  /// Create a new small yarn. `data` must be valid for `len` bytes
  /// and `len` must be at most `SSO_LEN`.
  unsafe fn from_small(data: *const u8, len: usize) -> RawYarn {
    debug_assert!(len <= SSO_LEN);

    // Create a yarn with an uninitialized pointer value (!!)
    // and a length whose high byte is packed with `small` and
    // `len`.
    let mut yarn = RawYarn {
      ptr: MaybeUninit::uninit(),
      len: (SMALL as usize << 6 | len)
          << (usize::BITS - 8),
    };

    // Memcpy the data to the new yarn.
    // We write directly onto the `yarn` variable. We won't
    // overwrite the high-byte length because `len` will
    // never be >= 16.
    ptr::copy_nonoverlapping(
      data,
      &mut yarn as *mut RawYarn as *mut u8,
      len,
    );

    yarn
  }
}
Rust

The precise maximum size of an SSO string is a bit more subtle than what’s given above, but it captures the spirit. RawYarn::from_small illustrates why the pointer value is hidden in a MaybeUninit: we’re about to overwrite it with garbage, and in that case it won’t be a pointer at all.

We can update our public Yarn type to use the new small representation whenever possible.

impl<'a> Yarn<'a> {
  /// Create a new yarn from borrowed data.
  pub fn borrowed(data: &'a str) -> Self {
    let len = data.len();
    let ptr = data.as_ptr().cast_mut();

    if len <= SSO_LEN {
      return Self {
        raw: unsafe { RawYarn::from_small(ptr, len) },
        _ph: PhantomData,
      }
    }

    Self {
      raw: RawYarn::from_raw_parts(BORROWED, len, ptr),
      _ph: PhantomData,
    }
  }

  /// Create a new yarn from owned data.
  pub fn owned(data: Box<str>) -> Self {
    if data.len() <= SSO_LEN {
      return Self {
        raw: unsafe { RawYarn::from_small(data.as_ptr(), data.len()) },
        _ph: PhantomData,
      }
    }

    let len = data.len();
    let ptr = data.as_ptr().cast_mut();
    mem::forget(data);

    Self {
      raw: RawYarn::from_raw_parts(HEAP, len, ptr),
      _ph: PhantomData,
    }
  }
}
Rust

It’s also possible to construct a Yarn directly from a character now, too!

impl<'a> Yarn<'a> {
  /// Create a new yarn from a single character.
  pub fn from_char(data: char) -> Self {
    let mut buf = [0u8; 4];
    let data = data.encode_utf8(&mut buf);
    Self {
      raw: unsafe { RawYarn::from_small(data.as_ptr(), data.len()) },
      _ph: PhantomData,
    }
  }
}
Rust

(Note that we do not need to update Yarn::immortalize(); why?)

What we have now is a maybe-owned string that does not require an allocation for small strings. However, we still have an extra niche…

String Constants

String constants in Rust are interesting, because we can actually detect them at compile-time[6].

We can use the last remaining niche, 3, to represent data that came from a string constant, which means that it does not need to be boxed to be immortalized.

const STATIC: u8 = 3;

impl<'a> Yarn<'a> {
  /// Create a new yarn from a string constant.
  pub fn from_static(data: &'static str) -> Self {
    let len = data.len();
    let ptr = data.as_ptr().cast_mut();

    if len <= SSO_LEN {
      return Self {
        raw: unsafe { RawYarn::from_small(ptr, len) },
        _ph: PhantomData,
      }
    }

    Self {
      raw: RawYarn::from_raw_parts(STATIC, len, ptr),
      _ph: PhantomData,
    }
  }
}
Rust

This function is identical to Yarn::borrowed, except that data must now have a static lifetime, and we pass STATIC to RawYarn::from_raw_parts().

Because of how we’ve written all of the prior code, this does not require any special support in Yarn::immortalize() or in the low-level RawYarn code.

The actual byteyarn library provides a yarn!() macro that has the same syntax as format!(). This is the primary way in which yarns are created. It has been carefully written so that yarn!("this is a literal") always produces a STATIC string, rather than a heap-allocated string.

An extra niche, as a treat?

Unfortunately, because of how we’ve written it, Option<Yarn> is 24 bytes, a whole word larger than a Yarn. However, there’s still a little gap where we can fit the None variant. It turns out that because of how we’ve chosen the discriminants, len is zero if and only if it is an empty BORROWED string. But this is not the only way to represent an empty string: if the high byte is 0x80, this is an empty SMALL string. If we simply require that no other empty string is ever constructed (by marking RawYarn::from_raw_parts() as unsafe and specifying it should not be passed a length of zero), we can guarantee that len is never zero.

Thus, we can update len to be a NonZeroUsize.

#[repr(C)]
#[derive(Copy, Clone)]
struct RawYarn {
  ptr: MaybeUninit<*mut u8>,
  len: NonZeroUsize,  // (!!)
}

impl RawYarn {
  /// Constructs a new RawYarn from raw components: a 2-bit kind,
  /// a *nonzero* length, and a pointer.
  unsafe fn from_raw_parts(kind: u8, len: usize, ptr: *mut u8) -> Self {
    debug_assert!(kind != SMALL);
    debug_assert!(len != 0);
    assert!(len <= usize::MAX / 4, "no way you have a string that big");

    RawYarn {
      ptr: MaybeUninit::new(ptr),
      len: NonZeroUsize::new_unchecked(
        (kind as usize & 0b11) << (usize::BITS - 2) | len),
    }
  }
}
Rust

NonZeroUsize is a type specially known to the Rust compiler to have a niche bit-pattern of all zeros, which allows Option<Yarn> to be 16 bytes too. This also has the convenient property that the all-zeros bit-pattern for Option<Yarn> is None.
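The effect on Option is easy to measure. The sketch below compares two stand-in structs with the same fields as RawYarn, one using a plain usize length and one using NonZeroUsize; the names `Plain`, `Packed`, and `option_sizes` are made up for this example, and the three-word result for the plain version is what current rustc produces rather than a documented guarantee.

```rust
use std::mem::{size_of, MaybeUninit};
use std::num::NonZeroUsize;

/// Stand-in for RawYarn with an ordinary length: no niche anywhere.
#[allow(dead_code)]
#[repr(C)]
#[derive(Copy, Clone)]
struct Plain {
    ptr: MaybeUninit<*mut u8>,
    len: usize,
}

/// Stand-in for RawYarn with a never-zero length: zero is a niche.
#[allow(dead_code)]
#[repr(C)]
#[derive(Copy, Clone)]
struct Packed {
    ptr: MaybeUninit<*mut u8>,
    len: NonZeroUsize,
}

/// Returns the sizes of `Option<Plain>` and `Option<Packed>` in bytes.
fn option_sizes() -> (usize, usize) {
    (size_of::<Option<Plain>>(), size_of::<Option<Packed>>())
}

fn main() {
    let (plain, packed) = option_sizes();
    println!("Option<Plain>:  {plain} bytes");
    println!("Option<Packed>: {packed} bytes");
}
```

Note that MaybeUninit deliberately has no niche of its own, so the length field is the only place the compiler can find one.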

Conclusion

The byteyarn blurb describes what we’ve built:

A Yarn is a highly optimized string type that provides a number of useful properties over String:

  • Always two pointers wide, so it is always passed into and out of functions in registers.
  • Small string optimization (SSO) up to 15 bytes on 64-bit architectures.
  • Can be either an owned buffer or a borrowed buffer (like Cow<str>).
  • Can be upcast to 'static lifetime if it was constructed from a known-static string.

There are, of course, some trade-offs. Not only do we need the assumptions we made originally to hold, but we also need to care relatively more about memory than about cycle-count performance, since basic operations like reading the length of the string require more math (but no extra branching).

The actual implementation of Yarn is a bit more complicated, partly to keep all of the low-level book-keeping in one place, and partly to offer an ergonomic API that makes Yarn into a mostly-drop-in replacement for Box<str>.

I hope this peek under the hood has given you a new appreciation for what can be achieved by clever layout-hacking.

  1. Allocators rarely serve you memory with precisely the size you asked for. Instead, they will have some notion of a “size class” that allows them to use more efficient allocation techniques, which I have written about.

    As a result, if the size change in a realloc() would not change the size class, it becomes a no-op, especially if the allocator can take advantage of the current-size information Rust provides it. 

  2. Here and henceforth “character” means “32-bit Unicode scalar”. 

  3. Now, you might also point out that Rust and C do not allow an allocation whose size is larger than the pointer offset type (isize and ptrdiff_t, respectively). In practice this means that the high bit is always zero according to the language’s own rules.

    This is true, but we need to steal two bits, and I wanted to demonstrate that this is an extremely reasonable desire. 64-bit integers are so comically large. 

  4. Interestingly, LLVM will compile (x << 2) >> 2 to

    movabs rax,0x3fffffffffffffff
    and    rax,rdi
    ret
    x86 Assembly

    If we want to play the byte-for-byte game, this costs 14 bytes when encoded in the Intel variable-length encoding. You would think that two shifts would result in marginally smaller code, but no, since the input comes in in rdi and needs to wind up in rax.

    On RISC-V, though, it seems to decide that two shifts is in fact cheaper, and will even optimize x & 0x3fff_ffff_ffff_ffff back into two shifts. 

  5. This only works on little endian. Thankfully all computers are little endian. 

  6. Technically, a &'static str may also point to leaked memory. For our purposes, there is no essential difference.