What is a Matrix? A Miserable Pile of Coefficients!

Linear algebra is undoubtedly the most useful field in all of algebra. It finds applications in all kinds of science and engineering, like quantum mechanics, graphics programming, and machine learning. It is the “most well-behaved” algebraic theory, in that other abstract algebra topics often try to approximate linear algebra, when possible.

For many students, linear algebra means vectors and matrices and determinants, and complex formulas for computing them. Matrices, in particular, come equipped with a fairly complicated, and a fortiori convoluted, multiplication operation.

This is not the only way to teach linear algebra, of course. Matrices and their multiplication appear complicated, but actually are a natural and compact way to represent a particular type of function, i.e., a linear map (or linear transformation).

This article is a short introduction to viewing linear algebra from the perspective of abstract algebra, from which matrices arise as a computational tool, rather than an object of study in and of themselves. I do assume some degree of familiarity with the idea of a matrix.

Linear Spaces

Most linear algebra courses open with a description of vectors in Euclidean space: Rn\R^n. Vectors there are defined as tuples of real numbers that can be added, multiplied, and scaled. Two vectors can be combined into a number through the dot product. Vectors come equipped with a notion of magnitude and direction.

However, this highly geometric picture can be counterproductive, since it is hard to apply geometric intuition directly to higher dimensions. It also obscures how this connects to working over a different number system, like the complex numbers.

Instead, I’d like to open with the concept of a linear space, which is somewhat more abstract than a vector space1.

First, we will need a notion of a “coefficient”, which is essentially something that you can do arithmetic with. We will draw coefficients from a designated ground field KK. A field is a setting for doing arithmetic: a set of objects that can be added, subtracted, and multiplied, and divided in the “usual fashion” along with special 00 and 11 values. E.g. a+0=aa + 0 = a, 1a=a1a = a, a(b+c)=ab+aca(b + c) = ab + ac, and so on.

Not only are the real numbers R\R a field, but so are the complex numbers C\C, and the rational numbers Q\Q. If we drop the “division” requirement, we can also include the integers Z\Z, or polynomials with rational coefficients Q[x]\Q[x], for example.

Having chosen our coefficients KK, a linear space VV over KK is another set of objects that can be added and subtracted (and including a special value 00)2, along with a scaling operation, which takes a coefficient cKc \in K and one of our objects vVv \in V and produces a new cvVcv \in V.

The important part of the scaling operation is that it’s compatible with addition: if we have a,bKa, b \in K and v,wVv, w \in V, we require that

a(v+w)=av+aw(a+b)v=av+bv\begin{gather*}a (v + w) = av + aw \\ (a + b) v = av + bv\end{gather*}
Math

This is what makes a linear space “linear”: you can write equations that look like first-degree polynomials (e.g. ax+bax + b), and which can be manipulated like first-degree polynomials.

These polynomials are called linear because their graph looks like a line. There’s no multiplication, so we can’t have x2x^2, but we do have multiplication by a coefficient. This is what makes linear algebra is “linear”.

Some examples: nn-tuples of elements drawn from any field are a linear space over that field, by componentwise addition and scalar multiplication; e.g., R3R^3. Setting n=1n = 1 shows that every field is a linear space over itself.

Polynomials in one variable over some field, K[x]K[x], are also a linear space, since polynomials can be added together and scaled by a any value in KK (since lone coefficients are degree zero polynomials). Real-valued functions also form a linear space over R\R in a similar way.

Linear Transformations

A linear map is a function f:VWf: V \to W between two linear spaces VV and WW over KK which “respects” the linear structure in a particular way. That is, for any cKc\in K and v,wVv, w \in V,

f(v+w)=f(v)+f(w)f(cv)=cf(v)\begin{gather*}f(v + w) = f(v) + f(w) \\ f(cv) = c \cdot f(v)\end{gather*}
Math

We call this type of relationship (respecting addition and scaling) “linearity”. One way to think of this relationship is that ff is kind of like a different kind of coefficient, in that it distributes over addition, which commutes with the “ordinary” coefficients from KK. However, applying ff produces a value from WW rather than VV.

Another way to think of it is that if we have a linear polynomial like p(x)=ax+bp(x) = ax + b in xx, then f(p(x))=p(f(x))f(p(x)) = p(f(x)). We say that ff commutes with all linear polynomials.

The most obvious sort of linear map is scaling. Given any coefficient cKc \in K, it defines a “scaling map”:

μc:VVvcv\begin{gather*}\mu_c: V \to V \\ v \mapsto cv\end{gather*}
Math

It’s trivial to check this is a linear map, by plugging it into the above equations: it’s linear because scaling is distributive and commutative.

Linear maps are the essential thing we study in linear algebra, since they describe all the different kinds of relationships between linear spaces.

Some linear maps are complicated. For example, a function from R2R2\R^2 \to \R^2 that rotates the plane by some angle θ\theta is linear, as are operations that stretch or shear the plane. However, they can’t “bend” or “fold” the plane: they are all fairly rigid motions. In the linear space Q[x]\Q[x] of rational polynomials, multiplication by any polynomial, such as xx or x21x^2 - 1, is a linear map. The notion of “linear map” depends heavily on the space we’re in.

Unfortunately, linear maps as they are quite opaque, and do not lend themselves well to calculation. However, we can build an explicit representation using a linear basis.

Linear Basis

For any linear space, we can construct a relatively small of elements such that any element of the space can be expressed as some linear function of these elements.

Explicitly, for any VV, we can construct a sequence3 eie_i such that for any vVv \in V, we can find ciKc_i \in K such that

v=iciei.v = \sum_i c_i e_i.
Math

Such a set eie_i is called a basis if it is linearly independent: no one eie_i can be expressed as a linear function of the rest. The dimension of VV, denoted dimV\dim V, is the number of elements in any choice of basis. This value does not depend on the choice of basis4.

Constructing a basis for any VV is easy: we can do this recursively. First, pick a random element e1e_1 of VV, and define a new linear space V/e1V/e_1 where we have identified all elements that differ by a factor of e1e_1 as equal (i.e., if vw=ce1v - w = ce_1, we treat vv and ww as equal in V/e1V/e_1).

Then, a basis for VV is a basis of V/e1V/e_1 with e1e_1 added. The construction of V/e1V/e_1 is essentially “collapsing” the dimension e1e_1 “points” in, giving us a new space where we’ve “deleted” all of the elements that have a nonzero e1e_1 component.

However, this only works when the dimension is finite; more complex methods must be used for infinite-dimensional spaces. For example, the polynomials Q[x]\Q[x] are an infinite-dimensional space, with basis elements 1,x,x2,x3,...\\{1, x, x^2, x^3, ...\\}. In general, for any linear space VV, it is always possible to arbitrarily choose a basis, although it may be infinite5.

Bases are useful because they give us a concrete representation of any element of VV. Given a fixed basis eie_i, we can represent any w=icieiw = \sum_i c_i e_i by the coefficients cic_i themselves. For a finite-dimensional VV, this brings us back column vectors: (dimV)(\dim V)-tuples of coefficients from KK that are added and scaled componentwise.

[c0c1cn]:=given eiiciei\Mat{c_0 \\ c_1 \\ \vdots \\ c_n} \,\underset{\text{given } e_i}{:=}\, \sum_i c_i e_i
Math

The iith basis element is represented as the vector whose entries are all 00 except for the iith one, which is 11. E.g.,

[100]=given eie1,[010]=given eie2,...\Mat{1 \\ 0 \\ \vdots \\ 0} \,\underset{\text{given } e_i}{=}\, e_1, \,\,\, \Mat{0 \\ 1 \\ \vdots \\ 0} \,\underset{\text{given } e_i}{=}\, e_2, \,\,\, ...
Math

It is important to recall that the choice of basis is arbitrary. From the mathematical perspective, any basis is just as good as any other, although some may be more computationally convenient.

Over R2\R^2, (1,0)(1, 0) and (0,1)(0, 1) are sometimes called the “standard basis”, but (1,2)(1, 2) and (3,4)(3, -4) are also a basis for this space. One easy mistake to make, particularly when working over the tuple space KnK^n, is to confuse the actual elements of the linear space with the coefficient vectors that represent them. Working with abstract linear spaces eliminates this source of confusion.

Representing Linear Transformations

Working with finite-dimensional linear spaces VV and WW, let’s choose bases eie_i and djd_j for them, and let’s consider a linear map f:VWf: V \to W.

The powerful thing about bases is that we can more compactly express the information content of ff. Given any vVv \in V, we can decompose it into a linear function of the basis (for some coefficients), so we can write

f(v)=f(iciei)=if(ciei)=icif(ei)f(v) = f\left(\sum_i c_i e_i\right) = \sum_i f(c_i e_i) = \sum_i c_i \cdot f(e_i)
Math

In other words, to specify ff, we only need to specify what it does to each of the dimV\dim V basis elements. But what’s more, because WW also has a basis, we can write

f(ei)=jAijdjf(e_i) = \sum_j A_{ij} d_j
Math

Putting these two formulas together, we have an explicit closed form for f(v)f(v), given the coefficients AijA_{ij} of ff, and the coefficients cic_i of vv:

f(v)=i,jciAijdjf(v) = \sum_{i,j} c_i A_{ij} d_j
Math

Alternatively, we can express vv and f(v)f(v) as column vectors, and ff as the AA matrix with entires AijA_{ij}. The entries of the resulting column vector are given by the above explicit formula for f(v)f(v), fixing the value of jj in each entry.

[A0,0A1,0An,0A1,0A1,1An,1A0,mA1,mAn,m]A[c0c1cn]v=[iciAi,0iciAi,1iciAi,m]Av\underbrace{\Mat{ A_{0,0} & A_{1,0} & \cdots & A_{n,0} \\ A_{1,0} & A_{1,1} & \cdots & A_{n,1} \\ \vdots & \vdots & \ddots & \vdots \\ A_{0,m} & A_{1,m} & \cdots & A_{n,m} }}_A \, \underbrace{\Mat{c_0 \\ c_1 \\ \vdots \\ c_n}}_v = \underbrace{\Mat{ \sum_i c_i A_{i,0} \\ \sum_i c_i A_{i,1} \\ \vdots \\ \sum_i c_i A_{i,m} }}_{Av}
Math

(Remember, this is all dependent on the choices of bases eie_i and djd_j!)

Behold, we have derived the matrix-vector multiplication formula: the jjth entry of the result is the dot product of the vector and the jjth row of the matrix.

But it is crucial to keep in mind that we had to choose bases eie_i and djd_j to be entitled to write down a matrix for ff. The values of the coefficients depend on the choice of basis.

If your linear space happens to be Rn\R^n, there is an “obvious” choice of basis, but not every linear space over R\R is Rn\R^n! Importantly, the actual linear algebra does not change depending on the basis6.

Matrix Multiplication

So, where does matrix multiplication come from? An n×mn \times m7 matrix AA represents some linear map f:VWf: V \to W, where dimV=n\dim V = n, dimW=m\dim W = m, and appropriate choices of basis (eie_i, djd_j) have been made.

Keeping in mind that linear maps are supreme over matrices, suppose we have a third linear space UU, and a map g:UVg: U \to V, and let =dimU\ell = \dim U. Choosing a basis hkh_k for UU, we can represent gg as a matrix BB of dimension ×n\ell \times n.

Then, we’d like for the matrix product ABAB to be the same matrix we’d get from representing the composite map fg:UWfg: U \to W as a matrix, using the aforementioned choices of bases for UU and WW (the basis choice for VV should “cancel out”).

Recall our formula for f(v)f(v) in terms of its matrix coefficients AijA_{ij} and the coefficients of the input vv, which we call cic_i. We can produce a similar formula for g(u)g(u), giving it matrix coefficients BkiB_{ki}, and coefficients bkb_k for uu. (I appologize for the number of indices and coefficients here.)

f(v)=i,jciAijdjg(u)=k,ibkBkiei\begin{align*}f(v) &= \sum_{i,j} c_i A_{ij} d_j \\ g(u) &= \sum_{k,i} b_k B_{ki} e_i\end{align*}
Math

If we write f(g(u))f(g(u)), then cic_i is the coefficient eie_i is multiplied by; i.e., we fix ii, and drop it from the summation: ci=kbkBkic_i = \sum_k b_k B_{ki}.

Substituting that into the above formula, we now have something like the following.

f(g(u))=i,jkbkBkiAijdjf(g(u))=k,jbk(iAijBki)dj()\begin{align*}f(g(u)) &= \sum_{i,j} \sum_{k} b_k B_{ki} A_{ij} d_j \\ f(g(u)) &= \sum_{k,j} b_k \left(\sum_{i} A_{ij} B_{ki} \right) d_j &(\star)\end{align*}
Math

In ()(\star), we’ve rearranged things so that the sum in parenthesis is the (k,j)(k,j)th matrix coefficient of the composite fgfg. Because we wanted ABAB to represent fgfg, it must be an ×m\ell \times m matrix whose entries are

(AB)kj=iAijBki(AB)_{kj} = \sum_{i} A_{ij} B_{ki}
Math

This is matrix multiplication. It arises naturally out of composition of linear maps. In this way, the matrix multiplication formula is not a definition, but a theorem of linear algebra!

Theorem (Matrix Multiplication)

Given an n×mn \times m matrix AA and an ×n\ell \times n matrix BB, both with coefficients in KK, then ABAB is an ×m\ell \times m matrix with entires

(AB)kj=iAijBki (AB)_{kj} = \sum_{i} A_{ij} B_{ki}
Math

If the matrix dimension is read as nmn \to m instead of n×mn \times m, the shape requirements are more obvious: two matrices AA and BB can be multiplied together only when they represent a pair of maps VWV \to W and UVU \to V.

Other Consequences, and Conclusion

The identity matrix is an n×nn \times n matrix:

In=[111]I_n = \Mat{ 1 \\ & 1 \\ && \ddots \\ &&& 1 }
Math

We want it to be such that for any appropriately-sized matrices AA and BB, it has AIn=AAI_n = A and InB=BI_n B = B. Lifted up to linear maps, this means that InI_n should represent the identity map VVV \to V, when dimV=n\dim V = n. This map sends each basis element eie_i to itself, so the columns of InI_n should be the basis vectors, in order:

[100][010][001]\Mat{1 \\ 0 \\ \vdots \\ 0} \Mat{0 \\ 1 \\ \vdots \\ 0} \cdots \Mat{0 \\ 0 \\ \vdots \\ 1}
Math

If we shuffle the columns, we’ll get a permutation matrix, which shuffles the coefficients of a column vector. For example, consider this matrix.

[010100001]\Mat{ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 }
Math

This is similar to the identity, but we’ve swapped the first two columns. Thus, it will swap the first two coefficients of any column vector.

Matrices may seem unintuitive when they’re introduced as a subject of study. Every student encountering matrices for the same time may ask “If they add componentwise, why don’t they multiply componentwise too?”

However, approaching matrices as a computational and representational tool shows that the convoluted-looking matrix multiplication formula is a direct consequence of linearity.

f(v+w)=f(v)+f(w)f(cv)=cf(v)\begin{gather*}f(v + w) = f(v) + f(w) \\ f(cv) = c \cdot f(v)\end{gather*}
Math
  1. In actual modern mathematics, the objects I describe are still called vector spaces, which I think generates unnecessary confusion in this case. “Linear space” is a bit more on the nose for what I’m going for. 

  2. This type of structure (just the addition part) is also called an “abelian group”. 

  3. Throughout ii, jj, and kk are indices in some unspecified but ordered indexing set, usually {1,2,...,n}\{1, 2, ..., n\}. I will not bother giving this index set a name. 

  4. This is sometimes called the dimension theorem, which is somewhat tedious to prove. 

  5. An example of a messy infinite-dimensional basis is R\R considered as linear space over Q\Q (in general, every field is a linear space over its subfields). The basis for this space essentially has to be “11, and all irrational numbers” except if we include e.g. ee and π\pi we can’t include e+12πe + \frac{1}{2}\pi, which is a Q\Q-linear combination of ee and π\pi.

    On the other hand, C\C is two-dimensional over R\R, with basis 1,i\\{1, i\\}.

    Incidentally, this idea of “view a field KK as a linear space over its subfield FF” is such a useful concept that it is called the “degree of the field extension K/FK/F”, and given the symbol [K:F][K : F].

    This, [R:Q]=[\R : \Q] = \infty and [C:R]=2[\C : \R] = 2

  6. You may recall from linear algebra class that two matrices AA and BB of the same shape are similar if there are two appropriately-sized square matrices SS and RR such that SAR=BSAR = B. These matrices SS and RR represent a change of basis, and indicate that the linear maps A,B:VWA, B: V \to W these matrices come from do “the same thing” to elements of VV.

    Over an algebraically closed field like C\C (i.e. all polynomials have solutions), there is an even stronger way to capture the information content of a linear map via Jordan canonicalization, which takes any square matrix AA and produces an almost-diagonal square matrix that only depends on the eigenvalues of AA, which is the same for similar matrices, and thus basis-independent. 

  7. Here, as always, matrix dimensions are given in RC (row-column) order. You can think of this as being “input dimension” to “output dimension”. 

I Wrote A String Type

I write compilers for fun. I can’t help it. Consequently, I also write a lot of parsers. In systems programming, it’s usually a good idea to try to share memory rather than reuse it, so as such my AST types tend to look like this.

pub enum Expr<'src> {
  Int(u32)
  Ident(&'src str),
  // ...
}
Rust

Whenever we parse an identifier, rather than copy its name into a fresh String, we borrow from the input source string. This avoids an extra allocation, an extra copy, and saves a word in the representation. Compilers can be memory-hungry, so it helps to pick a lean representation.

Unfortunately, it’s not so easy for quoted strings. Most strings, like "all my jelly babies", are “literally” in the original source, like an identifier. But strings with escapes aren’t: \n is encoded in the source code with the bytes [0x5c, 0x6e], but the actual “decoded” value of a string literal replaces each escape with a single 0x0a.

The usual solution is a Cow<str>. In the more common, escape-less verison, we can use Cow::Borrowed, which avoids the extra allocation and copy, and in the escaped version, we decode the escapes into a String and wrap it in a Cow::Owned.

For example, suppose that we’re writing a parser for a language that has quoted strings with escapes. The string "all my jelly babies" can be represented as a byte string that borrows the input source code, so we’d use the Cow::Borrowed variant. This is most strings in any language: escapes tend to be rare.

For example, if we have the string "not UTF-8 \xff", the actual byte string value is different from that in the source code.

// Bytes in the source.
hex:   6e 6f 74 20 55 54 46 2d 38 20 5c 78 66 66
ascii: n  o  t     U  T  F  -  8     \  x  f  f

// Bytes represented by the string.
hex:   6e 6f 74 20 55 54 46 2d 38 20 ff
ascii: n  o  t     U  T  F  -  8
Plaintext

Escapes are relatively rare, so most strings processed by the parser do not need to pay for an allocation.

However, we still pay for that extra word, since Cow<str> is 24 bytes (unless otherwise specified, all byte counts assume a 64-bit system), which is eight more than our &str. Even worse, this is bigger than the string data itself, which is 11 bytes.

If most of your strings are small (which is not uncommon in an AST parser), you will wind up paying for significant overhead.

Over the years I’ve implemented various optimized string types to deal with this use-case, in various contexts. I finally got around to putting all of the tricks I know into a library, which I call byteyarn. It advertises the following nice properties.

A Yarn is a highly optimized string type that provides a number of useful properties over String:

  • Always two pointers wide, so it is always passed into and out of functions in registers.
  • Small string optimization (SSO) up to 15 bytes on 64-bit architectures.
  • Can be either an owned buffer or a borrowed buffer (like Cow<str>).
  • Can be upcast to 'static lifetime if it was constructed from a known-static string.

I’d like to share how these properties are achieved through careful layout optimization.

Assumptions

We’re going to start by stating assumptions about how our strings will be used:

  1. Most strings are not mutated most of the time.
  2. Most strings are small.
  3. Most strings are substrings.

Most Strings are Immutable

String is modeled after C++’s std::string, which is a growable buffer that implements amortized linear-time append. This means that if we are appending n bytes to the buffer, we only pay for n bytes of memcpy.

This is a useful but often unnecessary property. For example, Go strings are immutable, and when building up a large string, you are expected to use strings.Builder, which is implemented as essentially a Rust String. Java also as a similar story for strings, which allows for highly compact representations of java.lang.Strings.

In Rust, this kind of immutable string is represented by a Box<str>, which is eight bytes smaller than String. Converting from String to Box<str> is just a call to realloc() to resize the underlying allocation (which is often cheap1) from being capacity bytes long to len bytes long.

Thus, this assumption means we only need to store a pointer and a length, which puts our memory footprint floor at 16 bytes.

Most Strings are Substrings

Suppose again that we’re parsing some textual format. Many structural elements will be verbatim references into the textual input. Not only string literals without escapes, but also identifiers.

Box<str> cannot hold borrowed data, because it will always instruct the allocator to free its pointer when it goes out of scope. Cow<str>, as we saw above, allows us to handle maybe-owned data uniformly, but has a minimum 24 byte overhead. This can’t be made any smaller, because a Cow<str> can contain a 24-byte String value.

But, we don’t want to store a capacity. Can we avoid the extra word of overhead in Cow<str>?

Most Strings are Small

Consider a string that is not a substring but which is small. For example, when parsing a string literal like "Hello, world!\n", the trailing \n (bytes 0x5c 0x6e) must be replaced with a newline byte (0x0a). This means we must handle a tiny heap allocation, 14 bytes long, that is smaller than a &str referring to it.

This is worse for single character2 strings. The overhead for a Box<str> is large.

  • The Box<str> struct itself has a pointer field (eight bytes), and a length field (also eight bytes). Spelled out to show all the stored bits, the length is 0x0000_0000_0000_0001. That’s a lot of zeroes!
  • The pointer itself points to a heap allocation, which will not be a single byte! Allocators are not in the business of handing out such small pieces of memory. Instead, the allocation is likely costing us another eight bytes!

So, the string "a", whose data is just a single byte, instead takes up 24 bytes of memory.

It turns out that for really small strings we can avoid the allocation altogether, and make effective use of all those zeroes in the len field.

Stealing Bits

Let’s say we want to stick to a budget of 16 bytes for our Yarn type. Is there any extra space left for data in a (*mut u8, usize) pair?

*cracks Fermi estimation knuckles*

A usize is 64 bits, which means that the length of an &str can be anywhere from zero to 18446744073709551615, or around 18 exabytes. For reference, “hundreds of exabytes” is a reasonable ballpark guess for how much RAM exists in 2023 (consider: 4 billion smartphones with 4GB each). More practically, the largest quantity of RAM you can fit in a server blade is measured in terabytes (much more than your measly eight DIMs on your gaming rig).

If we instead use one less bit, 63 bits, this halves the maximum representable memory to nine exabytes. If we take another, it’s now four exabytes. Much more memory than you will ever ever want to stick in a string. Wikpedia asserts that Wikimedia Commons contains around 428 terabytes of media (the articles’ text with history is a measly 10 TB).

Ah, but you say you’re programming for a 32-bit machine (today, this likely means either a low-end mobile phone, an embedded micro controller, or WASM).

On a 32-bit machine it’s a little bit harrier: Now usize is 32 bits, for a maximum string size of 4 gigabytes (if you remember the 32-bit era, this limit may sound familiar). “Gigabytes” is an amount of memory that you can actually imagine having in a string.

Even then, 1 GB of memory (if we steal two bits) on a 32-bit machine is a lot of data. You can only have four strings that big in a single address space, and every 32-bit allocator in the universe will refuse to serve an allocation of that size. If your strings are comparable in size to the whole address space, you should build your own string type.

The upshot is that every &str contains two bits we can reasonably assume are not used. Free real-estate.3

A Hand-Written Niche Optimization

Rust has the concept of niches, or invalid bit-patterns of a particular type, which it uses for automatic layout optimization of enums. For example, references cannot be null, so the pointer bit-pattern of 0x0000_0000_0000_0000 is never used; this bit-pattern is called a “niche”. Consider:

enum Foo<'a> {
  First(&'a T),
  Second
}
Rust

An enum of this form will not need any “extra” space to store the value that discriminates between the two variants: if a Foo’s bits are all zero, it’s Foo::Second; otherwise it’s a Foo::First and the payload is formed from Foo’s bit-pattern. This, incidentally, is what makes Option<&T> a valid representation for a “nullable pinter”.

There are more general forms of this: bool is represented as a single byte, of which two bit are valid; the other 254 potential bit-patterns are niches. In Recent versions of Rust, RawFd has a niche for the all-ones bit-pattern, since POSIX file descriptors are always non-negative ints.

By stealing two bits off of the length, we have given ourselves four niches, which essentially means we’ll have a hand-written version of something like this enum.

enum Yarn {
  First(*mut u8, u62),
  Second(*mut u8, u62),
  Third(*mut u8, u62),
  Fourth(*mut u8, u62),
}
Rust

For reasons that will become clear later, we will specifically steal the high bits of the length, so that to recover the length, we do two shifts4 to shift in two high zero bits. Here’s some code that actually implements this for the low level type our string type will be built on.

#[repr(C)]
#[derive(Copy, Clone)]
struct RawYarn {
  ptr: *mut u8,
  len: usize,
}

impl RawYarn {
  /// Constructs a new RawYarn from raw components: a 2-bit kind,
  /// a length, and a pointer.
  fn from_raw_parts(kind: u8, len: usize, ptr: *mut u8) -> Self {
    assert!(len <= usize::MAX / 4, "no way you have a string that big");

    RawYarn {
      ptr,
      len: (kind as usize & 0b11) << (usize::BITS - 2) | len,
    }
  }

  /// Extracts the kind back out.
  fn kind(self) -> u8 {
    (self.len >> (usize::BITS - 2)) as u8
  }

  /// Extracts the slice out (regardless of kind).
  unsafe fn as_slice(&self) -> &[u8] {
    slice::from_raw_parts(self.ptr, (self.len << 2) >> 2)
  }
}
Rust

Note that I’ve made this type Copy, and some functions take it by value. This is for two reasons.

  1. There is a type of Yarn that is itself Copy, although I’m not covering it in this article.

  2. It is a two-word struct, which means that on most architectures it is eligible to be passed in a pair of registers. Passing it by value in the low-level code helps promote keeping it in registers. This isn’t always possible, as we will see when we discuss “SSO”.

Let’s chose kind 0 to mean “this is borrowed data”, and kind 1 to be “this is heap-allocated data”. We can use this to remember whether we need to call a destructor.

pub struct Yarn<'a> {
  raw: RawYarn,
  _ph: PhantomData<&'a str>,
}

const BORROWED: u8 = 0;
const HEAP: u8 = 1;

impl<'a> Yarn<'a> {
  /// Create a new yarn from borrowed data.
  pub fn borrowed(data: &'a str) -> Self {
    let len = data.len();
    let ptr = data.as_ptr().cast_mut();
    Self {
      raw: RawYarn::from_raw_parts(BORROWED, len, ptr),
      _ph: PhantomData,
    }
  }

  /// Create a new yarn from owned data.
  pub fn owned(data: Box<str>) -> Self {
    let len = data.len();
    let ptr = data.as_ptr().cast_mut();
    mem::forget(data);

    Self {
      raw: RawYarn::from_raw_parts(HEAP, len, ptr),
      _ph: PhantomData,
    }
  }

  /// Extracts the data.
  pub fn as_slice(&self) -> &str {
    unsafe {
      // SAFETY: initialized either from uniquely-owned data,
      // or borrowed data of lifetime 'a that outlives self.
      str::from_utf8_unchecked(self.raw.as_slice())
    }
  }
}

impl Drop for Yarn<'_> {
  fn drop(&mut self) {
    if self.raw.kind() == HEAP {
      let dropped = unsafe {
        // SAFETY: This is just reconstituting the box we dismantled
        // in Yarn::owned().
        Box::from_raw(self.raw.as_mut_slice())
      };
    }
  }
}

impl RawYarn {
  unsafe fn as_slice_mut(&mut self) -> &mut [u8] {
    // Same thing as as_slice, basically. This is just to make
    // Box::from_raw() above typecheck.
  }
}
Rust

This gives us a type that strongly resembles Cow<str> with only half of the bytes. We can even write code to extend the lifetime of a Yarn:

impl Yarn<'_> {
  /// Removes the bound lifetime from the yarn, allocating if
  /// necessary.
  pub fn immortalize(mut self) -> Yarn<'static> {
    if self.raw.kind() == BORROWED {
      let copy: Box<str> = self.as_slice().into();
      self = Yarn::owned(copy);
    }

    // We need to be careful that we discard the old yarn, since its
    // destructor may run and delete the heap allocation we created
    // above.
    let raw = self.raw;
    mem::forget(self);
    Yarn::<'static> {
      raw,
      _ph: PhantomData,
    }
  }
}
Rust

The remaining two niches can be put to use for optimizing small strings.

Small String Optimization

C++’s std::string also makes the “most strings are small” assumption. In the libc++ implementation of the standard library, std::strings of up to 23 bytes never hit the heap!

C++ implementations do this by using most of the pointer, length, and capacity fields as a storage buffer for small strings, the so-called “small string optimization” (SSO). In libc++, in SSO mode, a std::string’s length fits in one byte, so the other 23 bytes can be used as storage. The capacity isn’t stored at all: an SSO string always has a capacity of 23.

RawYarn still has another two niches, so let’s dedicate one to a “small” representation. In small mode, the kind will be 2, and only the 16th byte will be the length.

This is why we used the two high bits of len for our scratch space: no matter what mode it’s in, we can easily extract these bits5. Some of the existing RawYarn methods need to be updated, though.

#[repr(C)]
#[derive(Copy, Clone)]
struct RawYarn {
  ptr: MaybeUninit<*mut u8>,
  len: usize,
}

const SMALL: u8 = 2;

impl RawYarn {
  /// Constructs a new RawYarn from raw components: a 2-bit kind,
  /// a length, and a pointer.
  fn from_raw_parts(kind: u8, len: usize, ptr: *mut u8) {
    debug_assert!(kind != SMALL);
    assert!(len <= usize::MAX / 4, "no way you have a string that big");

    RawYarn {
      ptr: MaybeUninit::new(ptr),
      len: (kind as usize & 0b11) << (usize::BITS - 2) | len,
    }
  }

  /// Extracts the slice out (regardless of kind).
  unsafe fn as_slice(&self) -> &[u8] {
    let (ptr, adjust) = match self.kind() {
      SMALL => (self as *const Self as *const u8, usize::BITS - 8),
      _ => (self.ptr.assume_init(), 0),
    };

    slice::from_raw_parts(ptr, (self.len << 2) >> (2 + adjust))
  }
}
Rust

In the non-SMALL case, we shift twice as before, but in the SMALL case, we need to get the high byte of the len field, so we need to shift down by an additional usize::BITS - 8. No matter what we’ve scribbled on the low bytes of len, we will always get just the length this way.

We also need to use a different pointer value depending on whether we’re in SMALL mode. This is why as_slice needs to take a reference argument, since the slice data may be directly in self!

Also, ptr is a MaybeUninit now, which will become clear in the next code listing.

We should also provide a way to construct small strings.

const SSO_LEN: usize = size_of::<usize>() * 2 - 1;

impl RawYarn {
  /// Create a new small yarn. `data` must be valid for `len` bytes
  /// and `len` must be smaller than `SSO_LEN`.
  unsafe fn from_small(data: *const u8, len: usize) -> RawYarn {
    debug_assert!(len <= SSO_LEN);

    // Create a yarn with an uninitialized pointer value (!!)
    // and a length whose high byte is packed with `small` and
    // `len`.
    let mut yarn = RawYarn {
      ptr: MaybeUninit::uninit(),
      len: (SMALL as usize << 6 | len)
          << (usize::BITS - 8),
    };

    // Memcpy the data to the new yarn.
    // We write directly onto the `yarn` variable. We won't
    // overwrite the high-byte length because `len` will
    // never be >= 16.
    ptr::copy_nonoverlapping(
      data,
      &mut yarn as *mut RawYarn as *mut u8,
      data,
    );

    yarn
  }
}
Rust

The precise maximum size of an SSO string is a bit more subtle than what’s given above, but it captures the spirit. The RawYarn::from_small illustrates why the pointer value is hidden in a MaybeUninit: we’re above to overwrite it with garbage, and in that case it won’t be a pointer at all.

We can update our public Yarn type to use the new small representation whenever possible.

impl<'a> Yarn<'a> {
  /// Create a new yarn from borrowed data.
  pub fn borrowed(data: &'a str) -> Self {
    let len = data.len();
    let ptr = data.as_ptr().cast_mut();

    if len <= SSO_LEN {
      return Self {
        raw: unsafe { RawYarn::from_small(len, ptr) },
        _ph: PhantomData,
      }
    }

    Self {
      raw: RawYarn::from_raw_parts(BORROWED, len, ptr),
      _ph: PhantomData,
    }
  }

  /// Create a new yarn from owned data.
  pub fn owned(data: Box<str>) -> Self {
    if data.len() <= SSO_LEN {
      return Self {
        raw: unsafe { RawYarn::from_small(data.len(), data.as_ptr()) },
        _ph: PhantomData,
      }
    }

    let len = data.len();
    let ptr = data.as_ptr().cast_mut();
    mem::forget(data);

    Self {
      raw: RawYarn::from_raw_parts(HEAP, len, ptr),
      _ph: PhantomData,
    }
  }
}
Rust

It’s also possible to construct a Yarn directly from a character now, too!

impl<'a> Yarn<'a> {
  /// Create a new yarn from borrowed data.
  pub fn from_char(data: char) -> Self {
    let mut buf = [0u8; 4];
    let data = data.encode_utf8(&mut buf);
    Self {
      raw: unsafe { RawYarn::from_small(len, ptr) },
      _ph: PhantomData,
    }
  }
}
Rust

(Note that we do not need to update Yarn::immortalize(); why?)

What we have now is a maybe-owned string that does not require an allocation for small strings. However, we still have an extra niche…

String Constants

String constants in Rust are interesting, because we can actually detect them at compile-time6.

We can use the last remaining niche, 3, to represent data that came from a string constant, which means that it does not need to be boxed to be immortalized.

const STATIC: u8 = 3;

impl<'a> Yarn<'a> {
  /// Create a new yarn from borrowed data.
  pub fn from_static(data: &'static str) -> Self {
    let len = data.len();
    let ptr = data.as_ptr().cast_mut();

    if len <= SSO_LEN {
      return Self {
        raw: unsafe { RawYarn::from_small(len, ptr) },
        _ph: PhantomData,
      }
    }

    Self {
      raw: RawYarn::from_raw_parts(STATIC, len, ptr),
      _ph: PhantomData,
    }
  }
}
Rust

This function is identical to Yarn::borrowed, except that data most now have a static lifetime, and we pass STATIC to RawYarn::from_raw_parts().

Because of how we’ve written all of the prior code, this does not require any special support in Yarn::immortalize() or in the low-level RawYarn code.

The actual byteyarn library provides a yarn!() macro that has the same syntax as format!(). This is the primary way in which yarns are created. It is has been carefully written so that yarn!("this is a literal") always produces a STATIC string, rather than a heap-allocated string.

An extra niche, as a treat?

Unfortunately, because of how we’ve written it, Option<Yarn> is 24 bytes, a whole word larger than a Yarn. However, there’s still a little gap where we can fit the None variant. It turns out that because of how we’ve chosen the discriminants, len is zero if and only if it is an empty BORROWED string. But this is not the only zero: if the high byte is 0x80, this is an empty SMALL string. If we simply require that no other empty string is ever constructed (by marking RawYarn::from_raw_parts() as unsafe and specifying it should not be passed a length of zero), we can guarantee that len is never zero.

Thus, we can update len to be a NonZeroUsize.

#[repr(C)]
#[derive(Copy, Clone)]
struct RawYarn {
  ptr: MaybeUninit<*mut u8>,
  len: NonZeroUsize,  // (!!)
}

impl RawYarn {
  /// Constructs a new RawYarn from raw components: a 2-bit kind,
  /// a *nonzero* length, and a pointer.
  unsafe fn from_raw_parts(kind: u8, len: usize, ptr: *mut u8) {
    debug_assert!(kind != SMALL);
    debug_assert!(len != 0);
    assert!(len <= usize::MAX / 4, "no way you have a string that big");

    RawYarn {
      ptr: MaybeUninit::new(ptr),
      len: NonZeroUsize::new_unchecked(
        (kind as usize & 0b11) << (usize::BITS - 2) | len),
    }
  }
}
Rust

This is a type especially known to the Rust compiler to have a niche bit-pattern of all zeros, which allows Option<Yarn> to be 16 bytes too. This also has the convenient property that the all zeros bit-pattern for Option<Yarn> is None.

Conclusion

The byteyarn blurb describes what we’ve built:

A Yarn is a highly optimized string type that provides a number of useful properties over String:

  • Always two pointers wide, so it is always passed into and out of functions in registers.
  • Small string optimization (SSO) up to 15 bytes on 64-bit architectures.
  • Can be either an owned buffer or a borrowed buffer (like Cow<str>).
  • Can be upcast to 'static lifetime if it was constructed from a known-static string.

There are, of course, some trade-offs. Not only do we need the assumptions we made originally to hold, but we also need to relatively care more about memory than cycle-count performance, since basic operations like reading the length of the string require more math (but no extra branching).

The actual implementation of Yarn is a bit more complicated, partly to keep all of the low-level book-keeping in one place, and partly to offer an ergonomic API that makes Yarn into a mostly-drop-in replacement for Box<str>.

I hope this peek under the hood has given you a new appreciation for what can be achieved by clever layout-hacking.

  1. Allocators rarely serve you memory with precisely the size you asked for. Instead, they will have some notion of a “size class” that allows them to use more efficient allocation techniques, which I have written about.

    As a result, if the size change in a realloc() would not change the size class, it becomes a no-op, especially if the allocator can take advantage of the current-size information Rust provides it. 

  2. Here and henceforth “character” means “32-bit Unicode scalar”. 

  3. Now, you might also point out that Rust and C do not allow an allocation whose size is larger than the pointer offset type (isize and ptrdiff_t, respectively). In practice this means that the high bit is always zero according to the language’s own rules.

    This is true, but we need to steal two bits, and I wanted to demonstrate that this is an extremely reasonable desire. 64-bit integers are so comically large. 

  4. Interestingly, LLVM will compile (x << 2) >> 2 to

    movabs rax,0x3fffffffffffffff
    and    rax,rdi
    ret
    x86 Assembly

    If we want to play the byte-for-byte game, this costs 14 bytes when encoded in the Intel variable-length encoding. You would think that two shifts would result in marginally smaller code, but no, since the input comes in in rdi and needs to wind up in rax.

    On RISC-V, though, it seems to decide that two shifts is in fact cheaper, and will even optimize x & 0x3fff_ffff_ffff_ffff back into two shifts. 

  5. This only works on little endian. Thankfully all computers are little endian. 

  6. Technically, a &'static str may also point to leaked memory. For our purposes, there is no essential difference. 

A Gentle Introduction to LLVM IR

The other day, I saw this tweet. In it, Andrew Gallant argues that reaching for LLVM IR, instead of assembly, is a useful tool for someone working on performance. Unfortunately, learning material on LLVM is usually aimed at compiler engineers, not generalist working programmers.

Now, I’m a compiler engineer, so my answer is of course you should know your optimizer’s IR. But I do think there’s a legitimate reason to be able to read it, in the same way that being able to read assembly to understand what your processor is doing is a powerful tool. I wrote an introduction to assembly over a year ago (still have to finish the followups… 💀), which I recommend reading first.

Learning LLVM IR is similar, but it helps you understand what your compiler is doing to create highly optimized code. LLVM IR is very popular, and as such well-documented and reasonably well-specified, to the point that we can just treat it as a slightly weird programming language.

In this article, I want to dig into what LLVM IR is and how to read it.

What’s LLVM IR?

“LLVM” is an umbrella name for a number of software components that can be used to build compilers. If you write performance-critical code, you’ve probably heard of it.

Its flagship product is Clang, a high-end C/C++/Objective-C compiler. Clang follows the orthodox compiler architecture: a frontend that parses source code into an AST and lowers it into an intermediate representation, an “IR”; an optimizer (or “middle-end”) that transforms IR into better IR, and a backend that converts IR into machine code for a particular platform.

                               optimizer (opt)
                                    ___
                                   |   v
          .c file  -->  AST  -->  LLVM IR  -->  assembly
                    ^         ^             ^
                 parser    lowering    backend (llc)

         \____________________/  \_____________________/
             Clang Frontend                LLVM

LLVM often also refers to just the optimizer and backend parts of Clang; this is can be thought of as a compiler for the “LLVM language” or “LLVM assembly”. Clang, and other language frontends like Rust, essentially compile to LLVM IR, which LLVM then compiles to machine code.

LLVM IR is well documented and… somewhat stable, which makes it a very good compilation target, since language implementers can re-use the thousands of engineer hours poured into LLVM already. The source of truth for “what is LLVM IR?” is the LangRef.

LLVM IR is also binary format (sometimes called “bitcode”), although we will be working exclusively with its text format (which uses the .ll extension).

LLVM-targeting compilers will have debugging flags to make them emit IR instead of their final output. For Clang, this is e.g. clang++ -S -emit-llvm foo.cc, while for Rust this is rustc --emit=llvm-ir foo.rs. Godbolt will also respect these options and correctly display LLVM IR output.

Back to Basic Blocks

LLVM IR can be quite intimidating to read, since it contains much more ancillary information than an assembly dump. Consider this function:

pub fn square(x: i32) -> i32 {
  x * x
}

If you click on the “Godbolt” widget, it will take you to a Godbolt that lowers it to LLVM IR. Most of that code is just metadata, but it’s really intimidating!

Starting from compiler output will have a steep difficulty curve, because we have to face the full complexity of LLVM IR. For Rust, this will likely mean encountering exception-handling, which is how panics are implemented, and function attributes that forward Rust’s guarantees (e.g. non-null pointers) to LLVM.

Instead, we’ll start by introducing the basic syntax of LLVM IR, and then we’ll tackle reading compiler output.

A Trivial Function

The meat of LLVM IR is function definitions, introduced with a define. There is also declare, which has exactly the same purpose as a function without a body in C: it brings an external symbol into scope.

For example, the following function takes no arguments and returns immediately:

define void @do_nothing() {
  ret void
}
LLVM IR

The return type of the function (void) immediately follows the define keyword; the name of the function starts with an @, which introduces us to the concept of sigils: every user-defined symbol starts with a sigil, indicating what kind of symbol it is. @ is used for global and functions: things you can take the address of (when used as a value, they are always ptr-typed).

The body of a function resembles assembly: a list of labels and instructions. Unlike ordinary assembly, however, there are significant restrictions on the structure of these instructions.

In this case, there is only one instruction: a void-typed return. Unlike most assembly languages, LLVM IR is strongly typed, and requires explicit type annotations almost everywhere.

Here is another trivial function.

define void @do_not_call() {
  unreachable
}
LLVM IR

This function will trigger undefined behavior upon being called: the unreachable instruction represents a codepath that the compiler can assume is never executed; this is unlike e.g. the unimplemented ud2 instruction in x86, which is guaranteed to issue a fault.

This is an important distinction between LLVM IR and an assembly language: some operations are explicitly left undefined to leave room for potential optimizations. For example, LLVM can reason that, because @do_not_call immediately triggers undefined behavior, all calls to @do_not_call are also unreachable (and propagate unreachability from there).

Purely Scalar Code

Let’s start with basic functions that only operate on integers. Consider the following function, that squares a 32-bit integer:

define i32 @square(i32 %x) {
  %1 = mul i32 %x, %x
  ret i32 %1
}
LLVM IR

Now our function takes arguments and has multiple instructions.

The argument is specified as i32 %x. Names a % sigil are sort of like local variables, but with some restrictions that make them more optimization-friendly; as we’ll see later, they’re not really “variable” at all. LLVM sometimes calls them registers; in a sense, LLVM IR is assembly for an abstract machine with an infinite number of registers. I’ll be calling %-prefixed names “registers” throughout this article.

i32 is a primitive integer types. All integer types in LLVM are of the form iN, for any N (even non-multiples of eight). There are no signed or unsigned types; instead, instructions that care about signedness will specify which semantic they use.

The first instruction is a mul i32, which multiples the two i32 operands together, and returns a value; we assign this to the new register %11. The next instruction returns this value.

The other arithmetic operations have the names you expect: add, sub, and, or, xor, shl (shift left). There are two division and remainder instructions, signed (sdiv, srem) and unsigned (udiv, urem). There two shift right instructions, again signed (ashr) and unsigned (lshr).

Exercise for the reader: why are /, %, and >> the only operations with signed and unsigned versions?

We can also convert from one integer type to another using trunc, zext, and sext, which truncate, zero-extend, and sign-extend, respectively (sext and zext are another signed/unsigned pair). For example, if we wanted the square function to never overflow, we could write

define i64 @square(i32 %x) {
  %1 = sext i32 %x to i64
  %2 = mul i64 %1, %1
  ret i64 %2
}
LLVM IR

Here, we cast %x to i64 by sign-extension (since we’ve decided we’re squaring signed integers) and then square the result. trunc and zext both have the same syntax as sext.

“I’ll Be Back”

Of course, interesting functions have control flow. Suppose we want a safe division function: division by zero is UB, so we need to handle it explicitly. Perhaps something like this:

fn safe_div(n: u64, d: u64) -> u64 {
  if d == 0 { return u64::MAX; }
  n / d
}
Rust

We could try doing this using select, LLVM’s “ternary” operation.

define i64 @safe_div(i64 %n, i64 %d) {
  %1 = icmp eq i64 %d, 0
  %2 = udiv i64 %n, %d
  %3 = select i1 %1, i64 -1, i64 %2
  ret i64 %3
}
LLVM IR

However, this has a problem: division by zero is UB2, and select is not short-circuiting: its semantics are closer to that of cmov in x86.

To compile this correctly, need to use the br instruction, which represents a general branch operation3. In C terms, a br i1 %cond, label %a, label %b is equivalent to if (cond) goto a; else goto b;.

This is how we might write that:

define i64 @safe_div(i64 %n, i64 %d) {
  %1 = icmp eq i64 %d, 0
  br i1 %1, label %iszero, label %nonzero

iszero:
  ret i64 -1

nonzero:
  %2 = udiv i64 %n, %d
  ret i64 %2
}
LLVM IR

Now our function has labels, which are used by the br instruction as jump targets.

In the first block, we do the d == 0 check, implemented by an icmp eq instruction. This returns an i1 (the type LLVM uses for booleans). We then pass the result into a br instruction, which jumps to the first label if it’s zero, otherwise to the second if it isn’t.

The second block is the early-return; it returns the “sentinel” value; the third block is self-explanatory.

Each of these blocks is a “basic block”: a sequence of non-control flow operations, plus an instruction that moves control flow away from the block. These blocks form the control flow graph (CFG) of the function.

There are a few other “block terminator” instructions. The one-argument form of br takes a single label, and is a simple unconditional goto. There’s also switch, which is similar to a C switch:

switch i32 %value, label %default [
  i32 0, label %if_zero
  i32 1, label %if_one,
  ; etc
]
LLVM IR

The type of the switch must be an integer type. Although you could represent this operation with a chain of brs, a separate switch instruction makes it easier for LLVM to generate jump tables.

unreachable, which we saw before, is a special terminator that does not trigger control flow per se, but which can terminate a block because reaching it is undefined behavior; it is equivalent to e.g. std::unreachable() in C++.

LLVM Deleted My Code!

The unreachable instruction provides a good example of why LLVM uses a basic block CFG: a naive dead code elimination (DCE) optimization pass can be implemented as follows:

  1. Fill a set with every block that ends in unreachable.
  2. For every block, if its terminator references a block in the unreachable set, delete that label from the terminator. For example, if we have br i1 %c, label %a, label %b, and the unreachable set contains %a, we can replace this with a br label %b.
  3. If every outgoing edge from a block is deleted in (2), replace the terminator with unreachable.
  4. Delete all blocks in the unreachable set.
  5. Repeat from (1) as many times as desired.

Intuitively, unreachables bubble upwards in the CFG, dissolving parts of the CFG among them. Other passes can generate unreachables to represent UB: interplay between this and DCE results in the “the compiler will delete your code” outcome from UB.

The actual DCE pass is much more complicated, since function calls make it harder to decide if a block is “pure” and thus transparently deletable.

But, what if we want to implement something more complicated, like a / b + 1? This expression needs the intermediate result, so we can’t use two return statements as before.

Working around this is not so straightforward: if we try to assign the same register in different blocks, the IR verifier will complain. This brings us to the concept of static single assignment.

Phony! Phony!

LLVM IR is a static single assignment form (SSA) IR. LLVM was actually started at the turn of the century to create a modern SSA optimizer as an academic project. These days, SSA is extremely fashionable for optimizing imperative code.

SSA form means that every register is assigned by at most one instruction per function. Different executions of the same block in the same function may produce different values for particular registers, but we cannot mutate already-assigned registers.

In other words:

  1. Every register is guaranteed to be initialized by a single expression.
  2. Every register depends only on the values of registers assigned before its definition.

This has many useful properties for writing optimizations: for example, within a basic block, every use of a particular register %x always refers to the same value, which makes optimizations like global value numbering and constant-folding much simpler to write, since the state of a register throughout a block doesn’t need to be tracked separately.

In SSA, we reinterpret mutation as many versions of a single variable. Thus, we might lower x += y as

%x.1 = add i32 %x.0, %y.0
LLVM IR

Here, we’ve used a var.n convention to indicate which version of a variable a specific register represents (LLVM does not enforce any naming conventions).

However, when loops enter the mix, it’s not clear how to manage versions. The number of registers in a function is static, but the number of loop iterations is dynamic.

Concretely, how do we implement this function?

fn pow(x: u32, y: u32) -> u32 {
  let mut r = 1;
  for i in 0..y {
    r *= x;
  }
  r
}
Rust

We could try something like this:

define i32 @pow(i32 %x, i32 %y) {
  br label %loop

loop:
  %r = add i32 %r, %x  ; ERROR: Recursive definition.
  %i = add i32 %i, 1   ; ERROR: Recursive definition.
  %done = icmp eq i32 %i, %y
  br i1 %done, label %exit, label %loop

exit:
  ret i32 %r
}
LLVM IR

But there’s a problem! What are the original definitions of %r and %i? The IR verifier will complain that these registers depend directly on themselves, which violates SSA form. What’s the “right” way to implement this function?

One option is to ask LLVM! We’ll implement the function poorly, and let the optimizer clean it up for us.

First, let’s write the function using memory operations, like loads and stores, to implement mutation. We can use the alloca instruction to create statically-sized stack slots; these instructions return a ptr[^clang-codegen].

Clang Makes a Mess, LLVM Cleans It Up

Incidentally, this is how Clang and Rust both generate LLVM IR: stack variables are turned into allocas and manipulated through loads and stores; temporaries are mostly turned into %regss, but the compiler will sometimes emit extra allocas to avoid thinking too hard about needing to create phi instructions.

This is pretty convenient, because it avoids needing to think very hard about SSA form outside of LLVM, and LLVM can trivially eliminate unnecessary allocas. The code I wrote for the codegen of @pow is very similar to what Rust would send to LLVM (although because we used an iterator, there’s a lot of extra junk Rust emits that LLVM has to work to eliminate).

define i32 @pow(i32 %x, i32 %y) {
  ; Create slots for r and the index, and initialize them.
  ; This is equivalent to something like
  ;   int i = 0, r = 1;
  ; in C.
  %r = alloca i32
  %i = alloca i32
  store i32 1, ptr %r
  store i32 0, ptr %i
  br label %loop_start

loop_start:
  ; Load the index and check if it equals y.
  %i.check = load i32, ptr %i
  %done = icmp eq i32 %i.check, %y
  br i1 %done, label %exit, label %loop

loop:
  ; r *= x
  %r.old = load i32, ptr %r
  %r.new = mul i32 %r.old, %x
  store i32 %r.new, ptr %r

  ; i += 1
  %i.old = load i32, ptr %i
  %i.new = add i32 %i.old, 1
  store i32 %i.new, ptr %i

  br label %loop_start

exit:
  %r.ret = load i32, ptr %r
  ret i32 %r.ret
}
LLVM IR

Next, we can pass this into the LLVM optimizer. The command opt, which is part of the LLVM distribution, runs specific optimizer passes on the IR. In our case, we want opt -p mem2reg, which runs a single “memory to register” pass. We can also just run opt --O2 or similar to get similar4 optimizations to the ones clang -O2 runs.

This is the result.

; After running through `opt -p mem2reg`
define i32 @pow(i32 %x, i32 %y) {
start:
  br label %loop_start

loop_start:
  %i.0 = phi i32 [0, %start], [%i.new, %loop]
  %r.0 = phi i32 [1, %start], [%r.new, %loop]
  %done = icmp eq i32 %i.0, %y
  br i1 %done, label %exit, label %loop

loop:
  %r.new = mul i32 %r.0, %x
  %i.new = add i32 %i.0, 1
  br label %loop_start

exit:
  ret i32 %r.0
}
LLVM IR

The allocas are gone, but now we’re faced with a new instruction: phi. “φ node” is jargon from the original SSA paper; the greek letter φ means “phoney”. These instructions select a value from a list based on which basic block we jumped to the block from.

For example, phi i32 [0, %start], [%i.new, %loop] says “this value should be 0 if we came from the start block; otherwise %i.new if it came from %loop”.

Unlike all other instructions, phi can refer to values that are not defined in all blocks that dominate the current block. This lets us have a dynamic number of versions of a variable! Here’s what that looks like in a dynamic execution context.

A block %a is said to dominate a block %b if each of its predecessors is either %a or a block dominated by %a. In other words, every path from the first block to %b passes through %a. In general instructions can only refer to values defined in previous instructions in the current block or values from blocks that dominate it.

  1. %start directly jumps into %loop_start. The first block cannot be a jump target, since it cannot have phi nodes because its predecessors include function’s callsite.

  2. In %loop_start, since we’ve entered from %start, %i.0 and %r.0 are selected to be the first versions of the (platonic) i and r variables, i.e., their initial values; we jump to %loop.

  3. Then, %loop is dominated by %loop_start so we can use %i.0 and %r.0 there directly; these are the *= and += operations. Then we jump back to %loop_start.

  4. Back in %loop_start, the phis now select %i.new and %r.new, so now %i.0 and %r.0 are the second versions of i and r. By induction, the nth execution of %loop_start has the nth versions of i and r.

  5. When we finally get sent to %exit, we can use %r.0 (since %loop_start dominates %r.0), which will be the %yth version of r; this is our return value.

This is a good place to stop and think about what we’ve done so far. SSA, domination, and phis can be hard to wrap your head around, and are not absolutely necessary for reading most IR. However, it is absolutely worth trying to understand, because it captures essential facts about how compilers like to reason about code5.

With phi and br, we can build arbitrarily complicated control flow within a function6.

Types and Aggregates

Now that we have basic scalar functions, let’s review LLVM’s type system.

We’ve seen i32 and its friends; these are arbitrary-bit-with integers. i1 is special because it is used as the boolean type. LLVM optimizations have been known to generate integer types with non-power-of-two sizes.

LLVM also has float and double, and some exotic float types like bfloat; these use their own arithmetic instructions with different options. I’ll pass on them in this explainer; see fadd and friends in the LangRef for more.

We’ve also seen void, which is only used as a return value, and ptr, which is an untyped7 pointer.

We’ve also seen the label pseudo-type, which represents a block label. It does not appear directly at runtime and has limited uses; the token and metadata types are similar.

Arrays are spelled [n x T]; the number must be an integer and the type must have a definite size. E.g., [1024 x i8]. Zero-sized arrays are supported.

Structs are spelled {T1, T2, ...}. E.g., {i64, ptr} is a Rust slice. Struct fields do not have names and are indexed, instead. The form <{...}> is a packed struct, which removes inter-field padding. E.g. #[repr(packed)] compiles down to this.

Vectors are like arrays but spelled <n x T>. These are used to represent types used in SIMD operations. For example, adding two <4 x i32> would lower to an AVX2 vector add on x86. I will not touch on SIMD stuff beyond this, although at higher optimization levels LLVM will merge scalar operations into vector operations, so you may come across them.

Type aliases can be created at file scope with the syntax

%Slice = type {i64, ptr}
LLVM IR

This means that %T can be either a type or a register/label inside of a function, depending on syntactic position.

Operations on Aggregates

The insertvalue and extractvalue can be used with struct or array types to statically access a field. For example,

%MyStruct = type {i32, {[5 x i8], i64}}

; In Rust-like syntax, this is `let v = s.1.0[4];`
%v = extractvalue %MyStruct %s, 1, 0, 4
LLVM IR

insertvalue is the reverse: it produces a copy of the aggregate with a specific field changed. It does not mutate in-place, because SSA forbids that.

; In Rust-like syntax, this is
;   let s2 = { let mut s = s; s2.1.1 = 42; s };
%s2 = insertvalue %MyStruct %s, i64 42, 1, 1
LLVM IR

There are similar operations called insertelement and extractelement work on vectors, but have slightly different syntax and semantics.

Finally, there’s getelementptr, the “pointer arithmetic instruction”, often abbreviated to GEP. A GEP can be used to calculate an offset pointer into a struct. For example,

define ptr @get_inner_in_array(ptr %p, i64 %idx) {
  %q = getelementptr %MyStruct, ptr %p, i64 %idx, i32 1, i32 1
  ret ptr %q
}
LLVM IR

This function takes in a pointer, ostensibly pointing to an array of %MyStructs, and an index. This returns a pointer to the i64 field of the %idxth element of %p.

A few important differences between GEP and extractvalue:

  1. It takes an untyped pointer instead of a value of a particular struct/array type.
  2. There is an extra parameter that specifies an index; from the perspective of GEP, every pointer is a pointer to an array of unspecified bound. When operating on a pointer that does not (at runtime) point to an array, an index operand of 0 is still required. (Alternatively, you can view a pointer to T as being a pointer to a one-element array.)
  3. The index parameters need explicit types.

LLVM provides a helpful8 FAQ on the GEP instruction: https://llvm.org/docs/GetElementPtr.html.

Other Operations

Some other operations are very relevant for reading IR, but don’t fit into any specific category. As always, the LangRef provides a full description of what all of these instructions do.

Function Calls

call, which calls any ptr as a function. For example:

; Arguments are passed in parentheses.
%r = call i32 @my_func(i32 %x)
LLVM IR

Note that this could have been a %reg instead of a @global, which indicates a function pointer call.

Sometimes you will see invoke, which is used to implement “call a function inside of a C++ try {} block”. This is rare in Rust, but can occur in some C++ code.

Function calls are often noisy areas of IR, because they will be very heavily annotated.

Synchronization

The load and store instructions we’ve already seen can be annotated as atomic, which is used to implement e.g. AtomicU32::load in Rust; this requires that an atomic ordering be specified, too. E.g.,

%v = load atomic i32, ptr %p acquire
LLVM IR

The fence operation is a general memory fence operation corresponding to e.g. Rust’s std::sync::atomic::fence function.

cmpxchg provides the CAS (compare-and-swap) primitive. It returns a {T, i1} containing the old value and whether the CAS succeeded. cmpxchg weak implements the spuriously-failing “weak CAS” primitive.

Finally, atomicrmw performs a read-modify-write (e.g., *p = op(*p, val)) atomically. This is used to implement things like AtomicU32::fetch_add and friends.

All of these operations, except for fence, can also be marked as volatile. In LLVM IR, much like in Rust but unlike in C/C++, individual loads and stores are volatile (i.e., have compiler-invisible side-effects). volatile can be combined with atomic operations (e.g. load atomic volatile), although most languages don’t provide access to these (except older C++ versions).

Reinterpret Shenanigans

bitcast is what mem::transmute and reinterpret_cast in Rust and C++, respectively, ultimately compile into. It can convert any non-aggregate type (integers, vectors) to any other type of the same bit width. For example, it can be used to get at the bits of a floating-point value:

%bits = bitcast double %fp to i64
LLVM IR

It also used to be what was used to cast pointer types (e.g. i32* to i8*). Pointers are now all untyped (ptr) so this use is no longer present.

However, bitcast cannot cast between pointer and integer data. For this we must use the inttoptr and ptrtoint9 instructions. These have the same syntax, but interact with the sketchy semantics of pointer-to-integer conversion and pointer provenance. This part of LLVM’s semantics is a bit of an ongoing trashfire; see Ralf Jung’s post for an introduction to this problem.

Intrinsics

There is also a vast collection of LLVM intrinsics, which are specified in the LangRef. For example, if we need a particular built-in memcpy, we can bring it into scope with a declare:

; ptr %dst, ptr %src, i64 %len, i1 %volatile
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)
LLVM IR

All of the LLVM intrinsics are functions that start with llvm.; diving into all of them is far beyond what we can do here.

I’m also leaving out discussion of floating point, SIMD, and exception handling, each of which would require their own articles!

Undefined Behavior

LLVM exists to generate optimized code, and optimizations require that we declare certain machine states “impossible”, so that we can detect when we can simplify what the programmer has said. This is “undefined behavior”.

For example, we’ve already encountered unreachable, which LLVM assumes cannot be executed. Division by zero and accessing memory out of bounds is also undefined.

Most LLVM UB factors through the concept of “poisoned values”. A poison value can be thought of as “taking on every value at once”, whichever is convenient for the current optimization pass with no respect to any other passes. This also means that if optimizations don’t detect a use of poison, it is ok from LLVM’s perspective to give you a garbage value. This is most visible at -O0, which performs minimal optimization.

Using a poison value as a pointer in a load, store, or call must be UB, because LLVM can choose it to be a null pointer. It also can’t be the denominator of a udiv or similar, because LLVM can choose it to be zero, which is UB. Passing poison into a br or a switch is also defined to be UB.

LLVM can perform dataflow analysis to try to determine what operations a poisonous value that was used in a UB way came from, and thus assume those operations cannot produce poison. Because all operations (other than select and phi) with a poison input produce poison, backwards reasoning allows LLVM to propagate UB forward. This is where so-called “time traveling UB” comes from.

Many operations generate poison. For example, in C, signed overflow is UB, so addition lowers to an add nsw (nsw stands for no signed wrap). Instead of wrapping on overflow, the instruction produces poison. There is also an unsigned version of the annotation, nuw.

Many other operations have “less defined” versions, which are either generated by optimizations, or inserted directly by the compiler that invokes LLVM when the language rules allow it (see C above). More examples include:

  • udiv and friends have an exact annotation, which requires that the division have a zero remainder, else poison.
  • getelementptr has an inbounds annotation, which produces poison if the access is actually out of bounds. This changes it from a pure arithmetic operation to one more closely matching C’s pointer arithmetic restrictions. GEP without inbounds corresponds to Rust’s <*mut T>::wrapping_offset() function.
  • Floating point operations marked with nnan and ninf will produce poison instead of a NaN or an infinite value, respectively (or when a NaN or infinity is an argument).

Creating poison is not UB; only using it is. This is weaker than the way UB works in most languages; in C, overflow is instantly UB, but in LLVM overflow that is never “witnessed” is simply ignored. This is a simpler operational semantics for reasoning about the validity of optimizations: UB must often be viewed as a side-effect, because the compiler will generate code that puts the program into a broken state. For example, division by zero will cause a fault in many architectures. This means UB-causing operations cannot always be reordered soundly. Replacing “causes UB” with “produces poison” ensures the vast majority of operations are pure and freely reorderable.

Reading Some Codegen

Let’s go back to our original Rust example!

pub fn square(x: i32) -> i32 {
  x * x
}

This is the output, with metadata redacted and some things moved around for readability.

source_filename = "example.b6eb2c7a6b40b4d2-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; example::square
define i32 @_ZN7example6square17hb32bcde4463f37c3E(i32 %x) unnamed_addr #0 {
start:
  %0 = call { i32, i1 } @llvm.smul.with.overflow.i32(i32 %x, i32 %x)
  %_2.0 = extractvalue { i32, i1 } %0, 0
  %_2.1 = extractvalue { i32, i1 } %0, 1
  %1 = call i1 @llvm.expect.i1(i1 %_2.1, i1 false)
  br i1 %1, label %panic, label %bb1

bb1:
  ret i32 %_2.0

panic:
  ; core::panicking::panic
  call void @_ZN4core9panicking5panic17ha338a74a5d65bf6fE(
    ptr align 1 @str.0,
    i64 33,
    ptr align 8 @alloc_1368addac7d22933d93af2809439e507
  )
  unreachable
}

declare { i32, i1 } @llvm.smul.with.overflow.i32(i32, i32) #1
declare i1 @llvm.expect.i1(i1, i1) #2

; core::panicking::panic
declare void @_ZN4core9panicking5panic17ha338a74a5d65bf6fE(ptr align 1, i64, ptr align 8) unnamed_addr #3

@alloc_9be5c135c0f7c91e35e471f025924b11 = private unnamed_addr constant
  <{ [15 x i8] }>
  <{ [15 x i8] c"/app/example.rs" }>, align 1

@alloc_1368addac7d22933d93af2809439e507 = private unnamed_addr constant
  <{ ptr, [16 x i8] }> <{
    ptr @alloc_9be5c135c0f7c91e35e471f025924b11,
    [16 x i8] c"\0F\00\00\00\00\00\00\00\02\00\00\00\03\00\00\00"
  }>, align 8

@str.0 = internal constant [33 x i8] c"attempt to multiply with overflow"

attributes #0 = { nonlazybind uwtable }
attributes #1 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #2 = { nocallback nofree nosync nounwind willreturn memory(none) }
attributes #3 = { cold noinline noreturn nonlazybind uwtable }
LLVM IR

The main function is @_ZN7example6square17hb32bcde4463f37c3E, which is the mangled name of example::square. Because this code was compiled in debug mode, overflow panics, so we need to generate code for that. The first operation is a call to the LLVM intrinsic for “multiply and tell us if it overflowed”. This returns the equivalent of a (i32, bool); we extract both value out of it with extractvalue. We then pass the bool through @llvm.expect, which is used to tell the optimizer to treat the panicking branch as “cold”. The success branch goes to a return, which returns the product; otherwise, we go to a function that calls core::panicking::panic() to panic the current thread. This function never returns, so we can terminate the block with an unreachable.

The rest of the file consists of:

  • declares for the llvm intrinsics we used.
  • A declare for core::panicking::panic. Any external function we call needs to be declared. This also gives us a place to hang attributes for the function off of.
  • Global constants for a core::panic::Location and a panic message.
  • Attributes for the functions above.

This is a good place to mention attributes: LLVM has all kinds of attributes that can be placed on functions (and function calls) to record optimization-relevant information. For example, @llvm.expect.i1 is annotated as willreturn, which means this function will eventually return; this means that, for example, any UB that comes after the function is guaranteed to occur after finite time, so LLVM can conclude that the code is unreachable despite the call to @llvm.expect.i1. The full set of attributes is vast, but the LangRef documents all of them!

Conclusion

LLVM IR is huge, bigger than any individual ISA, because it is intended to capture every interesting operation. It also has a rich annotation language, so passes can record information for future passes to make use of. Its operational semantics attempt to leave enough space for optimizations to occur, while ensuring that multiple sound optimizations in sequence are not unsound (this last part is a work in progress).

Being able to read assembly reveals what will happen, exactly, when code is executed, but reading IR, before and after optimization, shows how the compiler is thinking about your code. Using opt to run individual optimization passes can also help further this understanding (in fact, “bisecting on passes” is a powerful debugging technique in compiler development).

I got into compilers by reading LLVM IR. Hopefully this article inspires you to learn more, too!

  1. Registers within a function may have a numeric name. They must be defined in order: you must define %0 (either as a register or a label), then %1, then %2, etc. These are often used to represent “temporary results”.

    If a function does not specify names for its parameters, they will be given the names %0, %1, etc implicitly, which affect what the first explicit numeric register name you can use is. Similarly, if the function does not start with a label, it will be implicitly be given the next numeric name.

    This can result in significant confusion, because if we have define void @foo(i32, i32) { ... }, the arguments will be %0 and %1, but if we tried to write %2 = add i32 %0, %1, we would get an extremely confusing parser error, because %2 is already taken as the name of the first block. 

  2. For some reason, the optimizer can’t figure out that the select is redundant? Alive2 (an SMT-solver correctness checker for optimizations) seems to agree this is a valid optimization.

    So I’ve filed a bug. :D 

  3. If you read my assembly article, you’ll recall that there are many branch instructions. On RV, we have beq, bne, bgt, and bge. Later on in the compilation process, after the optimizer runs, LLVM will perform instruction selection (isel) to choose the best machine instruction(s) to implement a particular LLVM instruction (or sequence), which is highly context-dependent: for example, we want to fuse an icmp eq followed by a br on the result into a beq.

    Isel is far outside my wheelhouse, and doing it efficiently and profitably is an active area of academic research. 

  4. Not exactly the same: language frontends like Clang and Rust will perform their own optimizations. For example, I have an open bug for LLVM being unable to convert && into & in some cases; this was never noticed, because Clang performs this optimization while lowering from C/C++ to LLVM, but Rust does not do the equivalent optimization. 

  5. A more intuitive model is used in more modern IRs, like MLIR. In MLIR, you cannot use variables defined in other blocks; instead, each block takes a set of arguments, just like a function call. This is equivalent to phi instructions, except that now instead of selecting which value we want in the target, each predecessor specifies what it wants to send to the target.

    If we instead treat each block as having “arguments”, we can rewrite it in the following fantasy syntax where register names are scoped to their block.

    ;; Not actual LLVM IR! ;;
    
    define i32 @pow(i32 %x, i32 %y) {
      br %loop_start(i32 0, i32 1)
    
    loop_start(i32 %i, i32 %r)
      %done = icmp eq i32 %i.0, %y
      br i1 %done, %exit(i32 %r), %loop(i32 %i, i32 %r)
    
    loop(i32 %i, i32 %r)
      %r.new = mul i32 %r, %x
      %i.new = add i32 %i, 1
      br %loop_start(i32 %i, i32 %r)
    
    exit(i32 %r)
      ret i32 %r
    }
    LLVM IR

  6. What does the CFG look like? LLVM contains “optimization” passes that print the CFG as a file as a .dot file, which can be rendered with the dot command. For @safe_div, we get something like the following.

    This is useful for understanding complex functions. Consider this Rust hex-parsing function.

    // hex.rs
    #[no_mangle]
    fn parse_hex(data: &str) -> Option<u64> {
      let mut val = 0u64;
      for c in data.bytes() {
        let digit = match c {
          b'0'..=b'9' => c,
          b'a'..=b'f' => c - b'a' + 10,
          b'A'..=b'F' => c - b'A' + 10,
          _ => return None,
        };
    
        val = val.checked_mul(16)?;
        val = val.checked_add(digit as u64)?;
      }
      Some(val)
    }
    Rust

    Then, we can generate our CFG with some shell commands.

    $ rustc -O --crate-type lib --emit llvm-ir hex.rs
    $ opt -p dot-cfg -o /dev/null hex.ll
    Writing '.parse_hex.dot'...
    $ dot -Tsvg .parse_hex.dot -o parse_hex.svg
    Shell

    The result is this mess.

    Without optimizations, we get a bigger mess (most optimization passes are various CFG cleanups).

    Exercise: try to trace through what each basic block is doing. You will want to open the SVGs in a separate tab to do that. I recommend following the optimized version, since it is much less noisy.

    Comparing optimized vs. unoptimized is a good way to see how much the compiler does to simplify the stuff the language frontend gives it. At -O0? All allocas. At -O2? No allocas! 

  7. Once upon a time we had typed pointers, like i32*. These turned out to generate more problems than they solved, requiring frequent casts in IR in exchange for mediocre type safety. See https://llvm.org/docs/OpaquePointers.html for a more complete history. 

  8. Sarcasm. 

  9. I quietly judge LLVM for having instructions named inttoptr when int2ptr just reads so much nicer.