Things You Never Wanted To Know About Go Interfaces

Lately I’ve been finding myself writing a bit of Go, and I’ve picked up various fun “layout secrets” that help inform how I write code to minimize hidden allocations, and generally be kind to the optimizer. This article is a series of notes on the topic.

This post is about Go implementation details, which can probably break you at any time if you rely on them. On the other hand, Hyrum’s law is a bitch, so taking your chances may not be that bad. After all, they’re probably never going to be able to properly clean up the mess people made with //go:linkname with runtime symbols…

As with many of my other posts, I’ll assume a basic familiarity with being able to read assembly. I’m using x86 for this post, but it’s worth looking at my RISC-V post for a refresher.

GC Shapes

The most basic Go-specific concept when it comes to type layouts is the shape of a type. This is an implementation detail of Go’s garbage collector that leaks through the unsafe package.

Like in most native programming languages, every Go type has a size (the number of bytes that type takes up in memory) and an alignment (a power of two that every pointer to that type must be divisible by). Go, like most other languages, requires that size be divisible by the alignment: that is, the size is equal to the stride of an array of that type.

The size and alignment of a type can be queried by the intrinsics unsafe.Sizeof and unsafe.Alignof. These are very unwieldy in generic code, so I like to define a couple of helpers1:

func Size[T any]() int {
  var v T
  return int(unsafe.Sizeof(v))
}

func Align[T any]() int {
  var v T
  return int(unsafe.Alignof(v))
}
Go

Together, these two quantities are called the layout of a type (a term common to many native languages). However, the shape of a type also records what pieces thereof contain pointers. This is because memory visible to the GC (such as globals, heap memory, or stack roots) is typed, and the GC needs to know which parts of those types are pointers that it needs to trace through.

Because all pointers have the same size and alignment (4 or 8 bytes depending on the system), the pointer words of a type can be represented as a bitset, one bit for every 4 or 8 bytes in the type. This, in fact, is the representation used by the GC2.
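To make that concrete, consider the following struct on a 64-bit system. It spans four words, and its pointer bitset marks words 0 and 2 (the exact bit order is a runtime detail; the annotation is illustrative):

type Node struct {
	next  *Node   // word 0: pointer
	count int     // word 1: scalar
	name  *string // word 2: pointer
	flags uint64  // word 3: scalar
}
Go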

In particular, this means that whether a field is to be interpreted as an unsafe.Pointer or as a uintptr is a static property of the type. As we will see when we discuss interfaces, this restriction prevents a few layout optimizations.

Slices and Strings

Go is very public about the layout of slices and strings. A slice is

type slice[T] struct {
  data     *T
  len, cap int
}
Go

len and cap are extracted by their eponymous builtins, and data can be obtained using unsafe.SliceData (or &s[0] if the slice is nonempty, but that costs a bounds-check).

A string has the same layout as a []byte, except for a capacity:

type string struct {
  data *byte
  len  int
}
Go

Despite essentially being slices, Go treats strings subtly differently. Strings are comparable, so they can be used as map keys. They are also immutable, which enables a handful of optimizations. Immutability is also why they are comparable: Go made the mistake of not keeping const from C, but they really want map keys to be const.

There is nothing stopping us from aliasing strings to data pointed to by a slice: after all, strings.Builder does it to avoid a copy in String(). We can implement this easily enough with some unsafe:

func StringAlias(data []byte) string {
  return unsafe.String(unsafe.SliceData(data), len(data))
}
Go

Doing this is perfectly safe, so long as data is not mutated while the returned string is accessible. This allows virtually any slice type to be used as a key in a map, with some caveats.

  1. Types which contain alignment padding cannot be used, because Go makes no promises about the contents of padding bytes, so two equal values need not be bytewise equal.

  2. Types which contain pointers will cause those pointers to become unreachable if the only reference is the aliased string; this is because the pointed to data’s shape contains no pointer words.

  3. Incomparable types and interfaces will be compared by address (that is, maps, channels and funcs).
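For example, here is a sketch of keying a map by the contents of an []int32, a type that clears all three caveats (no padding, no pointers, comparable elements):

// Int32Key aliases the slice’s bytes as a string key. The data must not
// be mutated while any map entry keyed by it is still live.
func Int32Key(s []int32) string {
  p := (*byte)(unsafe.Pointer(unsafe.SliceData(s)))
  return unsafe.String(p, 4*len(s))
}

var hist = map[string]int{}

func Count(key []int32) { hist[Int32Key(key)]++ }
Go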

Dynamic Arrays with Reflection

Now, this isn’t the only way to accomplish this: you can create dynamically-sized array types using reflection, like so:

func Slice2Array[T any](s []T) any {
  if s == nil { return nil }

  var v T
  elem := reflect.TypeOf(v)
  array := reflect.ArrayOf(len(s), elem)

  // NOTE: NewAt will return a reflect.Value containing a
  // pointer, not an array!
  refl := reflect.NewAt(array, unsafe.SliceData(s))
  refl = refl.Elem() // Dereference the pointer to get the array.
  return refl.Interface()
}
Go

This will return an any whose type is [len(s)]T. You can even type assert it for static array sizes. This any is suitable for placing into a map[any]T, just as if we had built it with e.g. any([3]byte("foo")).
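A quick sanity check of that claim (the [3]byte conversion from a string needs Go 1.20+):

m := map[any]string{}
m[Slice2Array([]byte("foo"))] = "hello"

k := [3]byte("foo")
fmt.Println(m[k]) // hello
Go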

However, and this is not at all obvious from the code here, calling refl.Interface() will perform a copy of the whole array. Interface() delegates through a few functions until it calls reflect.packEface().

The code for this function (found here) is reproduced below:

package reflect

// packEface converts v to the empty interface.
func packEface(v Value) any {
	t := v.typ()
	var i any
	e := (*abi.EmptyInterface)(unsafe.Pointer(&i))
	// First, fill in the data portion of the interface.
	switch {
	case t.IfaceIndir():
		if v.flag&flagIndir == 0 {
			panic("bad indir")
		}
		// Value is indirect, and so is the interface we're making.
		ptr := v.ptr
		if v.flag&flagAddr != 0 {
			c := unsafe_New(t)
			typedmemmove(t, c, ptr)
			ptr = c
		}
		e.Data = ptr
	case v.flag&flagIndir != 0:
		// Value is indirect, but interface is direct. We need
		// to load the data at v.ptr into the interface data word.
		e.Data = *(*unsafe.Pointer)(v.ptr)
	default:
		// Value is direct, and so is the interface.
		e.Data = v.ptr
	}
	// Now, fill in the type portion. We're very careful here not
	// to have any operation between the e.word and e.typ assignments
	// that would let the garbage collector observe the partially-built
	// interface value.
	e.Type = t
	return i
}
Go

The switch determines precisely how the interface data pointer is computed. It turns out that (almost all) array types return true for t.IfaceIndir(), so the first case is selected, which triggers a copy (that being the call to unsafe_New() followed by a typedmemmove). This copy is to ensure that the value of the resulting interface cannot be mutated.

Now, if only we knew the layout of Go’s interfaces, we might be able to get somewhere here…

The Layout of Go’s Interfaces

Oh, yes, that’s what this article is about. So, if we look at the runtime2.go file in the runtime (yes, that’s what it’s called), nestled among the giant scheduler types for Gs, Ps, and Ms, we’ll find a couple of structs that really elucidate what’s going on:

package runtime

type funcval struct {
	fn uintptr
	// variable-size, fn-specific data here
}

type iface struct {
	tab  *itab
	data unsafe.Pointer
}

type eface struct {
	_type *_type
	data  unsafe.Pointer
}
Go

funcval is the layout of a func(), more on that later. iface is the layout of your “usual” interface, consisting of an itab (an interface table, or what Go calls a vtable) and a pointer to some data. eface is the layout of any (the artist formerly known as interface{}, hence the name: empty interface).

eface having its own layout is an optimization. Because any exists primarily to be dynamically downcast from, storing the type directly cuts out a pointer load when doing a type switch on an any specifically. If we look at what an itab is (which is “just” an abi.ITab):

package abi

// The first word of every non-empty interface type contains an *ITab.
// It records the underlying concrete type (Type), the interface type
// it is implementing (Inter), and some ancillary information.
//
// allocated in non-garbage-collected memory
type ITab struct {
	Inter *InterfaceType
	Type  *Type
	Hash  uint32     // copy of Type.Hash. Used for type switches.
	Fun   [1]uintptr // fun[0]==0 means Type does not implement Inter.
}
Go

Codegen for Interface Operations

An ITab contains the same *Type the value would have as an any, which makes the generated code for a function that upcasts an interface to any very simple3:

package foo

func Upcast(i MyIface) any {
  return i
}
Go
foo.Upcast:
    test    rax, rax
    je      nil
    mov     rax, [rax + 8]
nil:
    ret
x86 Assembly

In the register ABI, the x86 argument (and return) registers are rax, rbx, rcx, rdi, rsi, r8, r9, r10 and r11 (with rdx reserved for passing a closure capture, more on that later; r14 holds a pointer to the currently running G).

The *ITab comes in on rax and the data pointer on rbx. First, we need to check if this is the nil interface, identified by having a nil itab (or type, in the case of any). If it is nil, we just return: rax:rbx already contain the data of a nil any. Otherwise, we load ITab.Type, at offset 8, into rax, and return.

How do interface function calls work?

package foo

type MyIface interface {
  Method(int) int
}

func Call(m MyIface) int {
  return m.Method(42)
}
Go
foo.Call:
    cmp     rsp, [r14 + 16]
    jls     grow
    push    rbp
    mov     rbp, rsp
    add     rsp, -16
    mov     [rsp], rax
    mov     [rsp + 8], rbx
    mov     rcx, [rax + 24]
    mov     rax, rbx
    mov     rbx, 42
    call    rcx
    add     rsp, 16
    pop     rbp
    ret
grow:
    nop
    mov     [rsp + 8], rax
    mov     [rsp + 16], rbx
    call    runtime.morestack_noctxt
    mov     rax, [rsp + 8]
    mov     rbx, [rsp + 16]
    jmp     foo.Call
x86 Assembly

This function seems to be doing a lot more than it actually is. Part of it is that its prologue has to do a call to runtime.morestack_noctxt(), which is simply a call to runtime.morestack that clobbers rdx, the closure capture parameter. The meat of it comes when it loads [rax + 24], the first element of ITab.Fun. It then moves the data pointer in rbx to rax, the argument into rbx, and issues the call.

What about downcasts? A downcast to a concrete type is quite simple: simply compare the type in the interface (either directly or in the *ITab) to a particular statically-known one. Downcasting to an interface (sometimes called a sidecast) is much more complicated, because it essentially requires a little bit of reflection.

package foo

type MyIface interface {
  Method(int) int
}

func Downcast(m any) MyIface {
  return m.(MyIface)
}
Go
foo.Downcast:
    cmp     rsp, [r14 + 16]
    jls     grow
    push    rbp
    mov     rbp, rsp
    add     rsp, -24
    mov     [rsp], rax
    mov     [rsp + 8], rbx
    test    rax, rax
    je      nil
    mov     rcx, [foo..typeAssert0]
    mov     rdx, [rcx]
    mov     rsi, [rax + 16]
hashProbe:
    mov     rdi, rsi
    and     rsi, rdx
    shl     rsi, 4
    mov     r8, [rcx + rsi + 8]
    cmp     rax, r8
    je      found
    lea     rsi, [rdi + 1]
    test    r8, r8
    jnz     hashProbe
    mov     [rsp + 8], rbx
    mov     rbx, rax
    lea     rax, [foo..typeAssert0]
    call    runtime.typeAssert
    mov     rbx, [rsp + 8]
    jmp     done
found:
    mov     rax, [rcx + rsi + 16]
done:
    add     rsp, 24
    pop     rbp
    ret
nil:
    lea     rax, [type:foo.MyIface]
    call    runtime.panicnildottype
grow:
    // Same as it was in foo.Call above.
    jmp     foo.Downcast
x86 Assembly

When we request an interface downcast, the Go compiler synthesizes a symbol of type abi.TypeAssert. Its definition is reproduced below.

package abi

type TypeAssert struct {
	Cache   *TypeAssertCache
	Inter   *InterfaceType
	CanFail bool
}
type TypeAssertCache struct {
	Mask    uintptr
	Entries [1]TypeAssertCacheEntry
}
type TypeAssertCacheEntry struct {
	// type of source value (a *runtime._type)
	Typ uintptr
	// itab to use for result (a *runtime.itab)
	// nil if CanFail is set and conversion would fail.
	Itab uintptr
}
Go

The first thing this function does is check if rax contains 0, i.e., if this is a nil any, and panics if that’s the case (that’s a call to runtime.panicnildottype). It then loads foo..typeAssert0, a synthetic global variable containing an abi.TypeAssert value. It loads the Cache field, as well as the Hash field of the abi.Type attached to the any. It masks off the low bits using typeAssert0.Cache.Mask, and uses that to start probing the very simple open-addressed hash table located in typeAssert0.Cache.Entries.

If it finds a TypeAssertCacheEntry with the type we’re looking for (compared by address), we’ve found it. We load that entry’s Itab value into rax to change the value from being an any to being a MyIface, and we’re done.

If it finds a TypeAssertCacheEntry with a nil Typ pointer, we’re forced to hit the slow path, implemented at runtime.typeAssert(). This dynamically builds an itab by searching the method set of the type inside the any.

This then calls the reflection code in runtime.getitab(), which is what actually performs the messy search through the method set, comparing the names and signatures of methods with those in the interface, to produce an itab at runtime.

Then, it shoves the resulting itab into the global itab cache, which is protected by a global lock! There are lots of scary atomics in this code. There are many places where this can potentially panic, bubbling up a type assertion failure to the user.

When runtime.getitab() returns, runtime.typeAssert() will maybe4 update the type assertion cache, and return the new itab. This allows the code in our function to return directly, without needing to take another trip into the hashProbe loop.
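In Go-flavored pseudocode, the inline cache probe the compiler emits looks something like this. Field names mirror the abi.TypeAssertCache definition above; this is a sketch, not a callable runtime API:

func probe(c *TypeAssertCache, typ, hash uintptr) (itab uintptr, ok bool) {
	// hash is a copy of the dynamic type’s Hash field.
	entries := unsafe.Slice(&c.Entries[0], c.Mask+1)
	for i := hash; ; i++ {
		e := &entries[i&c.Mask]
		switch e.Typ {
		case typ:
			return e.Itab, true // Hit: this becomes the iface’s itab.
		case 0:
			return 0, false // Miss: call runtime.typeAssert.
		}
	}
}
Go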

In theory, PGO could be used to pre-fill the cache, but I couldn’t find any code in the compiler that indicates that this is something they do. In the meantime, you can optimize a hot type assert ahead of time by asserting to a known common type:

func DoSomething(r io.Reader) {
  var rs io.ReadSeeker
  if f, ok := r.(*os.File); ok {
    // Check for a known implementation first. This only costs
    // a pointer comparison with the *abi.Type in the itab.
    rs = f
  } else if f, ok := r.(io.ReadSeeker); ok {
    // Do an interface type assertion. This would eventually
    // learn os.File, but the branch above skips that "warmup"
    // time. It also lets the hardware branch predictor allocate
    // a prediction slot just for os.File.
    rs = f
  } else {
    // ...
  }
} 
Go

Type switches, incidentally, use a very similar caching mechanism for switches that include interface types among the cases.

What Was That About Indirect Interfaces?

Back when we were hacking arrays into existence with reflection, there was some trouble in reflect.Value.Interface(), where it would do a seemingly unnecessary copy.

This is because an interface’s data pointer must be a pointer. If you cram, say, an int into an any, Go will spill it to the heap. This is often called boxing, but the Go runtime refers to it as an “indirect interface”.

package foo

func Int2Any(x int) any {
  return x
}
Go
foo.Int2Any:
  push     rbp
  mov      rbp, rsp
  add      rsp, -8
  call     runtime.convT64
  mov      rbx, rax
  lea      rax, [type:int]
  add      rsp, 8
  pop      rbp
  ret
x86 Assembly

Like many other managed languages, Go will skip boxing very small values by instead returning pointers into some global array.
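We can watch this happen using the eface layout shown earlier. This sketch depends on unspecified runtime behavior (internally, the small-value cache is the runtime.staticuint64s array), so expect it to break someday:

package main

import (
	"fmt"
	"unsafe"
)

// dataWord extracts an any’s data pointer, assuming the eface layout above.
func dataWord(v any) unsafe.Pointer {
	return (*[2]unsafe.Pointer)(unsafe.Pointer(&v))[1]
}

//go:noinline
func box(x int) any { return x } // Forces a runtime.convT64 call.

func main() {
	fmt.Println(dataWord(box(7)) == dataWord(box(7)))         // true: shared static slot
	fmt.Println(dataWord(box(54321)) == dataWord(box(54321))) // false: two heap boxes
}
Go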

Now, this boxing could be avoided: after all, an int is no larger than a pointer, so we could cram it into the data pointer field directly. However, the GC really doesn’t like that: the GC assumes it can trace through any pointer. Now, the GC could treat interfaces differently, and look at the type/itab pointer to determine if the data value is a pointer or a scalar. However, this would add significant complexity to both the representation of shapes, and to the tracing code in the GC, resulting in more branches and slower tracing.

However, if the type being wrapped in an interface happens to be a pointer, it can just use that pointer value directly.

package foo

func Ptr2Any(p *int) any {
  return p
}
Go
foo.Ptr2Any:
  mov      rbx, rax
  lea      rax, [type:*int]
  ret
x86 Assembly

Any type that has the same shape as a pointer will be direct. This includes maps, channels, and funcs. It also includes one-element arrays of such types, such as [1]*int and [1]chan error, and single-field structs of such types. Curiously, this does not include structs which contain a zero-sized field before the pointer-sized field, even though those have the same shape as a pointer.

This means it’s generally not safe to play games with forging an interface out of a pointer to some type: whether that type is indirect in an interface is a subtle implementation detail of the compiler.
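That said, for the array case from earlier we already know both sides of the bargain: [N]T is indirect, so the data word is just a pointer to the array. Here is a sketch of a no-copy Slice2Array built on that knowledge; it is precisely the kind of fragile trick the preceding paragraph warns about:

func Slice2ArrayNoCopy[T any](s []T) any {
	if s == nil {
		return nil
	}
	boxed := Slice2Array(s) // One copy, just to conjure up the [len(s)]T type word.

	type eface struct{ typ, data unsafe.Pointer }
	e := *(*eface)(unsafe.Pointer(&boxed))
	e.data = unsafe.Pointer(unsafe.SliceData(s)) // Alias the slice’s backing array.

	var out any
	*(*eface)(unsafe.Pointer(&out)) = e
	return out
}
Go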

And of course, it’s important to remember that if you want to return a value by interface, you had best hope it can get inlined, so the compiler can promote the heap allocation to the stack.

Function Pointers

The last thing to look at are Go’s function pointers. For the longest time, I assumed they had the same layout as an interface: a pointer to closure data, and a hardware function pointer.

It turns out the layout is weirder: let’s revisit the runtime.funcval we found in runtime2.go earlier.

package runtime

type funcval struct {
	fn uintptr
	// variable-size, fn-specific data here
}
Go

This unusual layout is best understood by looking at the generated assembly.

package foo

func Call(
  f func(int) int,
  x int,
) int {
  return f(x)
}
Go
foo.Call:
    cmp     rsp, [r14 + 16]
    jls     grow
    push    rbp
    mov     rbp, rsp
    add     rsp, -8
    mov     rcx, [rax]
    mov     rdx, rax
    mov     rax, rbx
    call    rcx
    add     rsp, 8
    pop     rbp
    ret
grow:
    // Same as before.
    jmp     foo.Call
x86 Assembly

To call f, first we interpret it as a *funcval and load f.fn into a temporary; that is, we load the first word pointed to by rax (which holds f on function entry). Then, we place f in rdx, the closure context register. The reason for using this extra magic register will become clear shortly. Then, we arrange the rest of the arguments in their usual registers, and call the address stored in f.fn.

Inside of f, captures are accessed by offsetting from rdx. What does one of those closures look like?

package foo

func Capture(x int) func(int) int {
  return func(y int) int {
    return x * y
  }
}
Go
foo.Capture:
    cmp     rsp, [r14 + 16]
    jls     grow
    push    rbp
    mov     rbp, rsp
    add     rsp, -16
    mov     [rsp], rax
    lea     rax, ["type:noalg.struct { F uintptr; X0 int }"]
    call    runtime.newobject
    lea     rcx, foo.Capture.func1
    mov     [rax], rcx
    mov     rcx, [rsp]
    mov     [rax + 8], rcx
    add     rsp, 16
    pop     rbp
    ret
grow:
    // Same as before.
    jmp     foo.Capture

foo.Capture.func1:
    mov     rcx, [rdx + 8]
    imul    rax, rcx
    ret
x86 Assembly

All Capture is doing is allocating a funcval with a single int capture; that’s the { F uintptr; X0 int } in the code above. It then places the address of Capture.func1, which implements the callback, into F, and the argument of Capture into X0.
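Desugared into Go, the object Capture allocates is morally this (illustrative only; a real funcval cannot be spelled in Go):

type captureEnv struct {
	fn uintptr // Code pointer to foo.Capture.func1.
	x  int     // The captured variable: the [rdx + 8] load in func1.
}
Go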

What about when returning a reference to a function? In that case, all that happens is it returns a reference to a global containing the address of the function.

package foo

func Capture(x int) func(int) int {
  return Id
}

func Id(x int) int {
  return x
}
Go
foo.Capture:
    lea     rax, [foo.Id·f]
    ret

foo.Id:
    ret

foo.Id·f:
    .quad foo.Id
x86 Assembly

Because we pass the closure arguments in an extra register not used by regular functions, we don’t need to create a thunk for this case.

Unfortunately, we do need to create a thunk for methods, even methods with a pointer receiver. This is because of the following incompatible constraints:

  1. The receiver pointer for a method must point exactly to the value the method is called on. It can’t be a fixed offset before, because that would create an out-of-bounds pointer, which the GC does not tolerate.

  2. The closure pointer must point to the start of the funcval, not its captures, because adjusting the pointer to point to the captures would cause it to point one-past-the-end of a value, which the GC also does not tolerate!

Thus, even if methods accepted a pointer receiver via rdx, closures and methods disagree about where that pointer should be passed.

Of course, there are adjustments we can make to fix this problem. For example, we could require that all funcval values have at least one capture. No-capture funcvals would have a synthetic _ byte field. This is not unlike how a non-empty struct whose final field is empty will be padded with an extra _ byte field: this is specifically to avoid a pointer to that field being a past-the-end pointer. The cost is that every non-capturing closure costs twice as much binary size.
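That trailing-padding rule is easy to observe with the Size helper from the start of this article:

type Padded struct {
	x int64
	e struct{} // Forces a trailing padding byte, so &v.e is never past-the-end.
}

fmt.Println(Size[Padded]()) // 16 on 64-bit targets, not 8.
Go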

Another fix is to make the GC blind to the pointer in rdx. This will never be the only pointer by which a value is reachable, so it would be safe to replace mov rdx, rax with a lea rdx, [rax + 8]. The GC would never know!

Until then, beware that writing return foo.Method secretly allocates 16 bytes or so. (Aside: I used to sit next to the Go team at Google, and I remember having a conversation with Austin Clements about this. Apparently I misremembered, because until recently I thought Go already implemented this optimization!)
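Concretely, the allocation hides in code as innocuous as this:

type T struct{ n int }

func (t *T) Get() int { return t.n }

func Getter(t *T) func() int {
	// Unless this inlines away, the method value allocates a
	// funcval{thunk, t}: those 16 or so bytes, on the heap.
	return t.Get
}
Go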

Conclusion

If you made it this far, this is probably you right now:

Miguel as a Whimsicott, dizzy with register names.

This isn’t intended to be as polished as most of my articles, but there’s been enough things I’ve come across that I wanted to write this all up for my own reference.

  1. Sizeof and Alignof are intrinsics, so the compiler will turn them into constants. However, they are only constants if the type being measured is not generic, so wrapping them in a function like this doesn’t actually hurt in generic code. 

  2. Except for very large types that would have more words than can be recorded by an array of size abi.MaxPtrmaskBytes. For larger types, we use GC programs! A GC program is an LZ-compressed bitset serving the same purpose as the pointer bitset most smaller types use. See gcprog.go.

    In fact, reflection knows how to create programs on the fly for most types! See reflect/type.go

  3. I will be writing assembly examples in Intel-syntax x86. Go’s assembly syntax is horrible and an impediment to the point I’m making. 

  4. Maybe? Well, the cache will only get updated about 0.1% of the time. This is to amortize the costs of growing the cache. I assume they benchmarked this, and found that the cost of growing the cache makes it only worthwhile when that assertion is getting hammered. 

Nobody Gets Fired for Picking JSON, but Maybe They Should?

JSON is extremely popular but deeply flawed. This article discusses the details of JSON’s design, how it’s used (and misused), and how seemingly helpful “human readability” features cause headaches instead. Crucially, you rarely find JSON-based tools (except dedicated tools like jq) that can safely handle arbitrary JSON documents without a schema—common corner cases can lead to data corruption!

What is JSON?

JSON is famously simple. In fact, you can fit the entire grammar on the back of a business card. It’s so omnipresent in REST APIs that you might assume you already know JSON quite well. It has decimal numbers, quoted strings, arrays with square brackets, and key-value maps (called “objects”) with curly braces. A JSON document consists of any of these constructs: null, 42, and {"foo":"bar"} are all valid JSON documents.

However, the formal definition of JSON is quite complicated. JSON is defined by the IETF document RFC8259 (if you don’t know what the IETF is, it’s the standards body for Internet protocols). However, it’s also normatively defined by ECMA-404, which is from ECMA, the standards body that defines JavaScript[^json.org].

[^json.org]: Of course, some wise guy will probably want to cite json.org. I should underscore: json.org is __NOT__ a standard. It is __NOT__ normative. The documents produced by the IETF and by ECMA, which are international standards organizations that represent the industry, __ARE__ normative. When a browser implementer wants to implement JSON to the letter, they go to ECMA, not to some dude's 90's-ass website.

JavaScript? Yes, JSON (JavaScript Object Notation) is closely linked with JavaScript and is, in fact, (almost) a subset of it. While JSON’s JavaScript ancestry is the main source of its quirks, several other poor design decisions add additional unforced errors.

However, the biggest problem with JSON isn’t any specific design decision but rather the incredible diversity of parser behavior and non-conformance across and within language ecosystems. RFC8259 goes out of its way to call this out:

Note, however, that ECMA-404 allows several practices that this specification recommends avoiding in the interests of maximal interoperability.

The RFC makes many observations regarding interoperability elsewhere in the document. Probably the most glaring—and terrifying—is how numbers work.

Everything is Implementation-Defined

JSON numbers are encoded in decimal, with an optional minus sign, a fractional part after a decimal point, and a scientific notation exponent. This is similar to how many programming languages define their own numeric literals.

Presumably, JSON numbers are meant to be floats, right?

Wrong.

RFC8259 reveals that the answer is, unfortunately, “whatever you want.”

This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.

binary64 is the “standards-ese” name for the type usually known as double or float64. Floats have great dynamic range but often can’t represent exact values. For example, 1.1 isn’t representable as a float because all floats are fractions of the form n / 2^m for integers n and m, but 1.1 = 11/10, which has a factor of 5 in its denominator. The closest float64 value is

2476979795053773 / 2^51 = 1.100000000000000088817841970012523233890533447265625
Plaintext

Of course, you might think to declare “all JSON values map to their closest float64 value”. Unfortunately, this value might not be unique. For example, the value 900000000000.00006103515625 isn’t representable as a float64, and it’s precisely between two exact float64 values. Depending on the rounding mode, this rounds to either 900000000000 or 900000000000.0001220703125.

IEEE 754 recommends “round ties to even” as the default rounding mode, so for almost all software, the result is 900000000000. But remember, floating-point state is a global variable implemented in hardware, and might just happen to be clobbered by some dependency that calls fesetround() or a similar system function.

Data Loss! Data Loss!

You’re probably thinking, “I don’t care about such fussy precision stuff. None of my numbers have any fractional parts”—and that is where you would be wrong. The n part of n / 2^m only has 53 bits available, but int64 values fall outside of that range. This means that for very large 64-bit integers, such as randomly generated IDs, a JSON parser that converts integers into floats results in data loss. Go’s encoding/json package does this, for example.

How often does this actually happen for randomly-generated numbers? We can do a little Monte Carlo simulation to find out.

package main

import (
	"fmt"
	"math"
	"math/big"
	"math/rand"
)

const trials = 5_000_000
func main() {
	var misses int
	var err big.Float
	for range trials {
		x := int64(rand.Uint64())
		y := int64(float64(x)) // Round-trip through binary64.
		if x != y {
			misses++
			err.Add(&err, big.NewFloat(math.Abs(float64(x - y))))
		}
	}

	err.Quo(&err, big.NewFloat(trials))
	fmt.Printf("misses: %d/%d, avg: %f", misses, trials, &err)
}

// Output:
// misses: 4970572/5000000, avg: 170.638499
Go

It turns out that almost all randomly distributed int64 values are affected by round-trip data loss. Roughly, the only numbers that are safe are those with at most 16 digits (although not exactly: 9,999,999,999,999,999, for example, gets rounded up to a nice round 10 quadrillion).
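You can check the edge of that range directly; both of these round-trips silently change the value:

fmt.Println(int64(float64(9_007_199_254_740_993))) // 9007199254740992 (2^53 + 1 rounds down)
fmt.Println(int64(float64(9_999_999_999_999_999))) // 10000000000000000
Go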

How does this affect you? Suppose you have a JSON document somewhere that includes a user ID and a transcript of their private messages with another user. Data loss due to rounding would result in the wrong user ID being associated with the private messages, which could result in leaking PII or incorrect management of privacy consent (such as GDPR requirements).

This isn’t just about your user IDs, mind you. Plenty of other vendors’ IDs are nice big integers, which the JSON grammar can technically accommodate and which random tools will mangle. Some examples:

  • License keys: for example, Adobe uses 24 digits for their serial numbers, which may be tempting to store as an integer.

  • Barcode IDs like the unique serial numbers of medical devices, which are tightly regulated.

  • Visa and Mastercard credit card numbers happen to fit in the “safe” range for binary64, which may lull you into a false sense of security, since they’re so common. But not all credit cards have 16-digit numbers: some now support 19.

These are pretty bad compliance consequences purely due to a data serialization format.

This problem is avoidable with care. After all, Go can parse JSON into any arbitrary type using reflection. For example, if we replace the inner loop of the Monte Carlo simulation with something like the following:

for range trials {
	x := int64(rand.Uint64())
	var v struct{ N int64 }
	json.Unmarshal([]byte(fmt.Sprintf(`{"N":%d}`, x)), &v)
	y := v.N
	if x != y {
		// ...
	}
}
Go

We suddenly see that x == y in every trial. This is because with type information, Go’s JSON library knows exactly what the target precision is. If we were parsing to an any instead of to a struct { N int64 }, we’d be in deep trouble: the outer object would be parsed into a map[string]any, and the N field would become a float64.

This means that your system probably can’t safely handle JSON documents with unknown fields. Tools like jq must be extremely careful about number handling to avoid data loss. This is an easy mistake for third-party tools to make.
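Go’s encoding/json does offer a mitigation for the schemaless case: json.Decoder.UseNumber parses numbers into json.Number (an unparsed string) instead of float64, deferring the precision decision to the caller:

dec := json.NewDecoder(strings.NewReader(`{"N":6574404881820635023}`))
dec.UseNumber()

var v map[string]any
_ = dec.Decode(&v)

n, _ := v["N"].(json.Number).Int64()
fmt.Println(n) // 6574404881820635023, exactly
Go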

But again, float64 isn’t the standard—there is no standard. Some implementations might only have 32-bit floats available, making the problem worse. Some implementations might try to be clever, using a float64 for fractional values and an int64 for integer values; however, this still imposes arbitrary limits on the parsed values, potentially resulting in data loss.

Some implementations such as Python use bignums, so they appear not to have this problem. However, this can lead to a false sense of security where issues are not caught until it’s too late: some database now contains ostensibly valid but non-interoperable JSON.

Protobuf is forced to deal with this in a pretty non-portable way. To avoid data loss, large 64-bit integers are serialized as quoted strings when serializing to JSON. So, instead of writing {"foo":6574404881820635023}, it emits {"foo":"6574404881820635023"}. This solves the data loss issue but does not work with other JSON libraries such as Go’s, producing errors like this one:

json: cannot unmarshal string into Go struct field .N of type int64
Plaintext

Non-Finite Values

The special floating point values Infinity, -Infinity, and NaN are not representable: it’s the wild west as to what happens when you try to serialize the equivalent of {x:1.0/0.0}.

  • Go refuses to serialize, citing json: unsupported value: +Inf.
  • Protobuf serializes it as {"x":"inf"} (or should—it’s unclear which implementations get it right).
  • JavaScript won’t even bother trying: JSON.stringify({x:Infinity}) prints {"x":null}.
  • Python is arguably the worst offender: json.dumps({"x":float("inf")}) prints {"x":Infinity}, which isn’t even valid JSON per RFC8259.

NaN is arguably an even worse offender, because the NaN payload (yes, NaNs have a special payload) is discarded when converting to "nan" or however your library represents it.

Does this affect you? Well, if you’re doing anything with floats, you’re one division-by-zero or overflow away from triggering serialization errors. At best, it’s “benign” data corruption (JavaScript). At worst, when the data is partially user-controlled, it might result in crashes or unparseable output, which is the making of a DoS vector.
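The Go failure mode from the list above is a one-liner to reproduce (assuming encoding/json and math are imported):

_, err := json.Marshal(map[string]float64{"x": math.Inf(1)})
fmt.Println(err) // json: unsupported value: +Inf
Go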

In comparison, Protobuf serialization can’t fail except due to non-UTF-8 string fields or cyclic message references, both of which are far less likely than a NaN popping up in a calculation.

The upshot is that all the parsers end up parsing a bunch of crazy things for the special floating-point values over time because of Postel’s law. RFC8259 makes no effort to provide suggestions for dealing with such real-world situations beyond “tough luck, not interoperable.”

Text Encodings and Invalid Unicode

JSON strings are relatively tame, with some marked (but good) divergence from JavaScript. Specifically, JavaScript, being a language of a certain age (along with Java), uses UTF-16 as its Unicode text encoding. Most of the world has realized this is a bad idea (it doubles the size of ASCII text, which makes up almost all of Internet traffic), so JSON uses UTF-8 instead. RFC8259 actually specifies that the whole document MUST be encoded in UTF-8.

But when we go to read about Unicode characters in §8.2, we are disappointed: it merely says that it’s really great when all quoted strings consist entirely of Unicode characters, which means that unpaired surrogates are allowed. In effect, the spec merely requires that JSON strings be WTF-8: UTF-8 that permits unpaired surrogates.

What’s an unpaired surrogate? It’s any encoded Unicode 32-bit value in the range U+D800 to U+DFFF, which form a gap in the Unicode codepoint range. UTF-8’s variable-length integer encoding can encode them, but their presence in a bytestream makes it invalid UTF-8. WTF-8 is UTF-8 but permitting the appearance of these values.

So, who actually supports parsing (or serializing) these? Consider the document {"x":"\udead"}, which contains an unpaired surrogate, U+DEAD.

  • Go gladly deserializes AND serializes it (Go’s strings are arbitrary byte strings, not UTF-8). However, Go serializes a non-UTF-8 string such as "\xff" as "\ufffd", having replaced the invalid byte with a U+FFFD replacement character (this thing: �); see the demonstration after this list.

  • Most Java parsers seem to follow the same behavior as Go, but there are many different parsers available, and we’ve already learned that different JSON parsers may behave differently.

  • JavaScript and Python similarly gladly parse unpaired surrogates, but they also serialize them back without converting them into U+FFFD.

  • Different Protobuf runtimes may not handle this identically, but the reference C++ implementation (whose JSON codec I wrote!) refuses to parse unpaired surrogates.
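Here is that Go data destruction, demonstrated (a fragment; assumes encoding/json and fmt are imported):

b, _ := json.Marshal("x\xffy") // Not valid UTF-8.
fmt.Println(string(b))         // "x\ufffdy": the bad byte was silently replaced.
Go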

There are other surprising pitfalls around strings: are "x" and "\u0078" the same string? RFC8259 feels the need to call out that they are, for the purposes of checking that object keys are equal. The fact that they feel the need to call it out indicates that this is also a source of potential problems.

Byte Strings

What if I don’t want to send text? A common type of byte blob to send is a cryptographic hash that identifies a document in a content-addressed blobstore, or perhaps a digital signature (an encrypted hash). JSON has no native way of representing byte strings.

You could send a quoted string full of ASCII and \u00NN escapes (for bytes which are not in the ASCII range), but this is wasteful in terms of bandwidth, and has serious interoperability problems (as noted above, Go actively destroys data in this case). You could also encode it as an array of JSON numbers, which is much worse for bandwidth and serialization speed.

What everyone winds up doing, one way or another, is to rely on base64 encoding. Protobuf, for example, encodes bytes fields into base64 strings in JSON. This has the unfortunate side-effect of defeating JSON’s human-readable property: if the blob contains mostly ASCII, a human reader can’t tell.
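When a schema is available, Go’s encoding/json bakes in this exact convention: []byte struct fields are marshaled as padded, standard-alphabet base64 automatically:

type Doc struct {
	Hash []byte `json:"hash"`
}

b, _ := json.Marshal(Doc{Hash: []byte{0xde, 0xad}})
fmt.Println(string(b)) // {"hash":"3q0="}
Go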

Because this isn’t part of JSON, virtually no JSON codec does this decoding for you, particularly because in a schema-less context, there’s nothing to distinguish a byte blob encoded with base64 from an actual textual string that happens to contain valid base64, such as an alphanumeric username.

Compared to other problems, this is more like a paper cut, but it’s unnecessary and adds complexity and interop problems. By the way, did you know there are multiple incompatible Base64 alphabets?

Streaming Doesn’t Work

A less obvious problem with JSON is that it can’t be streamed. Almost all JSON documents are objects or arrays and are therefore incomplete until they reach the closing } or ], respectively. This means you can’t send a stream of JSON documents that form a part of a larger document without some additional protocol for combining them in post-processing.

JSONL is the world’s silliest spec that “solves” this problem in the simplest way possible: a JSONL document is a sequence of JSON documents separated by newlines. JSONL is streamable, but because it’s done in the simplest way possible, it only supports streaming a giant array. You can’t, for example, stream an object field-by-field or stream an array within that object.
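Consuming JSONL from Go is at least easy, since json.Decoder already reads a stream of concatenated documents. In this sketch, input and process are stand-ins for your io.Reader and record handler:

dec := json.NewDecoder(input)
for {
	var doc map[string]any
	if err := dec.Decode(&doc); err == io.EOF {
		break
	} else if err != nil {
		log.Fatal(err)
	}
	process(doc)
}
Go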

Protobuf doesn’t have this problem: in a nutshell, the Protobuf wire format is as if we removed the braces and brackets from the top-level array or object of a document, and made it so that values with the same key get merged. In the wire format, the equivalent of the JSONL document

{"foo": {"x": 1}, "bar": [5, 6]}
{"foo": {"y": 2}, "bar": [7, 8]}
JSON

is automatically “merged” into the single document

{ "foo": { "x": 1, "y": 2 }, "bar": [5, 6] }
JSON

This forms the basis of the “message merge” operation, which is intimately connected to how the wire format was designed. We’ll dive into this fundamental operation in a future article.

Canonicalization Leads to Data Loss

Thanks to RFC7519 and RFC7515, which define JSON Web Tokens (JWT) and JSON Web Signatures (JWS), digitally signing JSON documents is a very common operation. However, digital signatures can only sign specific byte blobs and are sensitive to things that JSON isn’t, such as whitespace and key ordering.

This results in specifications like RFC8785 for canonicalization of JSON documents. This introduces a new avenue by which existing JSON documents, which may happen to contain non-interoperable (or, thanks to non-conforming implementations such as Python’s, outright invalid) JSON, must be manipulated and reformatted by third-party tools. RFC8785 itself references ECMA-262 (the JavaScript standard) for how to serialize numbers, meaning that it’s required to induce data loss for 64-bit numerical values!

Is JSON Fixable?

Plainly? No. JSON can’t be fixed because of how extremely popular it is. Common mistakes are baked into the format. Are comments allowed? Trailing commas? Number formats? Nobody knows!

What tools are touching your JSON? Are they aware of all of the rakes they can step on? Do they emit invalid JSON (like Python does)? How do you even begin to audit that?

Thankfully, you don’t have to use JSON. There are alternatives—BSON, UBJSON, MessagePack, and CBOR are just a few binary formats that try to replicate JSON’s data model. Unfortunately, many of them have their own problems.

Protobuf, however, has none of these problems, because it was designed to fulfill needs JSON couldn’t meet. Using a strongly-typed schema system, like Protobuf, makes all of these problems go away.

The Rust Calling Convention We Deserve

I will often say that the so-called “C ABI” is a very bad one, and a relatively unimaginative one when it comes to passing complicated types effectively. A lot of people ask me “ok, what would you use instead”, and I just point them to the Go register ABI, but it seems most people have trouble filling in the gaps of what I mean. This article explains what I mean in detail.

I have discussed calling conventions in the past, but as a reminder: the calling convention is the part of the ABI that concerns itself with how to pass arguments to and from a function, and how to actually call a function. This includes which registers arguments go in, which registers values are returned out of, what function prologues/epilogues look like, how unwinding works, etc.

This particular post is primarily about x86, but I intend to be reasonably generic (so that what I’ve written applies just as well to ARM, RISC-V, etc). I will assume a general familiarity with x86 assembly, LLVM IR, and Rust (but not rustc’s internals).

The Problem

Today, like many other natively compiled languages, Rust defines an unspecified0 calling convention that lets it call functions however it likes. In practice, Rust lowers to LLVM’s built-in C calling convention, which LLVM’s prologue/epilogue codegen generates calls for.

Rust is fairly conservative: it tries to generate LLVM function signatures that Clang could have plausibly generated. This has two significant benefits:

  1. A good probability that debuggers won’t choke on it. This is not a concern on Linux, though, because DWARF is very general and does not bake-in the Linux C ABI. We will concern ourselves only with ELF-based systems and assume that debuggability is a nonissue.

  2. It is less likely to tickle LLVM bugs due to using ABI codegen that Clang does not exercise. I think that if Rust tickles LLVM bugs, we should actually fix them (a very small number of rustc contributors do in fact do this).

However, we are too conservative. We get terrible codegen for simple functions:

fn extract(arr: [i32; 3]) -> i32 {
  arr[1]
}
Rust
extract:
  mov   eax, dword ptr [rdi + 4]
  ret
x86 Assembly

arr is 12 bytes wide, so you’d think it would be passed in registers, but no! It is passed by pointer! Rust is actually more conservative than what the Linux C ABI mandates, which actually passes the [i32; 3] in registers when extern "C" is requested.

extern "C" fn extract(arr: [i32; 3]) -> i32 {
  arr[1]
}
Rust
extract:
  mov   rax, rdi
  shr   rax, 32
  ret
x86 Assembly

The array is passed in rdi and rsi, with the i32s packed into registers. The function moves rdi into rax, the output register, and shifts the upper half down.

Not only does rustc produce patently bad code for passing things by value, but it also knows how to do it better, if you request a standard calling convention! We could be generating way better code than Clang, but we don’t!

Hereforth, I will describe how to do it.

-Zcallconv

Let’s suppose that we keep the current calling convention for extern "Rust"1, but we add a flag -Zcallconv that sets the calling convention for extern "Rust" when compiling a crate. The supported values will be -Zcallconv=legacy for the current one, and -Zcallconv=fast for the one we’re going to design. We could even let -O set -Zcallconv=fast automatically.

Why keep the old calling convention? Although I did sweep debuggability under the rug, one nice property -Zcallconv=fast will not have is placing arguments in the C ABI order, which means that a reader relying on the “Diana’s silk dress cost $89” mnemonic on x86 will get fairly confused.

I am also assuming we may not even support -Zcallconv=fast for some targets, like WASM, where there is no concept of “registers” and “spilling”. It may not even make sense to enable it for debug builds, because it will produce much worse code with optimizations turned off.

There is also a mild wrinkle with function pointers, and extern "Rust" {} blocks. Because this flag is per-crate, even though functions can advertise which version of extern "Rust" they use, function pointers have no such luxury. However, calling through a function pointer is slow and rare, so we can simply force them to use -Zcallconv=legacy. We can generate a shim to translate calling conventions as needed.

Similarly, we can, in principle, call any Rust function like this:

fn secret_call() -> i32 {
  extern "Rust" {
    fn my_func() -> i32;
  }
  unsafe { my_func() }
}
Rust

However, this mechanism can only be used to call unmangled symbols. Thus, we can simply force #[no_mangle] symbols to use the legacy calling convention.

Bending LLVM to Our Will

In an ideal world, LLVM would provide a way for us to specify the calling convention directly. E.g., this argument goes in that register, this return goes in that one, etc. Unfortunately, adding a calling convention to LLVM requires writing a bunch of C++.

However, we can get away with specifying our own calling convention by using the following procedure.

  1. First, determine, for a given target triple, the maximum number of values that can be passed “by register”. I will explain how to do this below.

  2. Decide how to pass the return value. It will either fit in the output registers, or it will need to be returned “by reference”, in which case we pass an extra ptr argument to the function (tagged with the sret attribute) and the actual return value of the function is that pointer.

  3. Decide which arguments that have been passed by value need to be demoted to being passed by reference. This will be a heuristic, but generally will be approximately “arguments larger than the by-register space”. For example, on x86, this comes out to 176 bytes.

  4. Decide which arguments get passed by register, so as to maximize register space usage. This problem is NP-hard (it’s the knapsack problem) so it will require a heuristic. All other arguments are passed on the stack.

  5. Generate the function signature in LLVM IR. This will be all of the arguments that are passed by register encoded as various non-aggregates, such as i64, ptr, double, and <2 x i64>. What valid choices are for said non-aggregates depends on the target, but the above are what you will generally get on a 64-bit architecture. Arguments passed on the stack will follow the “register inputs”.

  6. Generate a function prologue. This is code to decode each Rust-level argument from the register inputs, so that there are %ssa values corresponding to those that would be present when using -Zcallconv=legacy. This allows us to generate the same code for the body of the function regardless of calling convention. Redundant decoding code will be eliminated by DCE passes.

  7. Generate a function exit block. This is a block that contains a single phi instruction for the return type as it would be for -Zcallconv=legacy. This block will encode it into the requisite output format and then ret as appropriate. All exit paths through the function should br to this block instead of ret-ing.

  8. If a non-polymorphic, non-inline function may have its address taken (as a function pointer), either because it is exported out of the crate or the crate takes a function pointer to it, generate a shim that uses -Zcallconv=legacy and immediately tail-calls the real implementation. This is necessary to preserve function pointer equality.

The main upshot here is that we need to cook up heuristics for figuring out what goes in registers (since we allow reordering arguments to get better throughput). This is equivalent to the knapsack problem; knapsack heuristics are beyond the scope of this article. This should happen early enough that this information can be stuffed into rmeta to avoid needing to recompute it. We may want to use different, faster heuristics depending on -Copt-level. Note that correctness requires that we forbid linking code generated by multiple different Rust compilers, which is already the case, since Rust breaks ABI from release to release.

What Is LLVM Willing to Do?

Assuming we do that, how do we actually get LLVM to pass things in the way we want it to? We need to determine what the largest “by register” passing LLVM will permit is. The following LLVM program is useful for determining this on a particular version of LLVM:

%InputI = type [6 x i64]
%InputF = type [0 x double]
%InputV = type [8 x <2 x i64>]

%OutputI = type [3 x i64]
%OutputF = type [0 x double]
%OutputV = type [4 x <2 x i64>]

define void @inputs({ %InputI, %InputF, %InputV }) {
  %p = alloca [4096 x i8]
  store volatile { %InputI, %InputF, %InputV } %0, ptr %p
  ret void
}

%Output = { %OutputI, %OutputF, %OutputV }
@gOutput = constant %Output zeroinitializer
define %Output @outputs() {
  %1 = load %Output, ptr @gOutput
  ret %Output %1
}
LLVM IR

When you pass an aggregate by-value to an LLVM function, LLVM will attempt to “explode” that aggregate into as many registers as possible. There are distinct register classes on different systems. For example, on both x86 and ARM, floats and vectors share the same register class (kind of2).

The above values are for x863. LLVM will pass six integers and eight SSE vectors by register, and return half as many (3 and 4) by register. Increasing any of the values generates extra loads and stores that indicate LLVM gave up and passed arguments on the stack.

The values for aarch64-unknown-linux are 8 integers and 8 vectors for both inputs and outputs, respectively.

This is the maximum number of registers we get to play with for each class. Anything extra gets passed on the stack.

I recommend that every function have the same number of by-register arguments. So on x86, EVERY -Zcallconv=fast function’s signature should look like this:

declare {[3 x i64], [4 x <2 x i64>]} @my_func(
  i64 %rdi, i64 %rsi, i64 %rdx, i64 %rcx, i64 %r8, i64 %r9,
  <2 x i64> %xmm0, <2 x i64> %xmm1, <2 x i64> %xmm2, <2 x i64> %xmm3,
  <2 x i64> %xmm4, <2 x i64> %xmm5, <2 x i64> %xmm6, <2 x i64> %xmm7,
  ; other args...
)
LLVM IR

When passing pointers, the appropriate i64s should be replaced by ptr, and when passing doubles, they replace <2 x i64>s.

But you’re probably saying, “Miguel, that’s crazy! Most functions don’t pass 176 bytes!” And you’d be right, if not for the magic of LLVM’s very well-specified poison semantics.

We can get away with not doing extra work if every argument we do not use is passed poison. Because poison is equal to “the most convenient possible value at the present moment”, when LLVM sees poison passed into a function via register, it decides that the most convenient value is “whatever happens to be in the register already”, and so it doesn’t have to touch that register!

For example, if we wanted to pass a pointer via rcx, we would generate the following code.

; This is a -Zcallconv=fast-style function.
%Out = type {[3 x i64], [4 x <2 x i64>]}
define %Out @load_rcx(
  i64 %rdi, i64 %rsi, i64 %rdx,
  ptr %rcx, i64 %r8, i64 %r9,
  <2 x i64> %xmm0, <2 x i64> %xmm1,
  <2 x i64> %xmm2, <2 x i64> %xmm3,
  <2 x i64> %xmm4, <2 x i64> %xmm5,
  <2 x i64> %xmm6, <2 x i64> %xmm7
) {
  %load = load i64, ptr %rcx
  %out = insertvalue %Out poison,
                      i64 %load, 0, 0
  ret %Out %out
}

declare ptr @malloc(i64)
define i64 @make_the_call() {
  %1 = call ptr @malloc(i64 8)
  store i64 42, ptr %1
  %2 = call %Out @load_rcx(
    i64 poison, i64 poison, i64 poison,
    ptr %1,     i64 poison, i64 poison,
    <2 x i64> poison, <2 x i64> poison,
    <2 x i64> poison, <2 x i64> poison,
    <2 x i64> poison, <2 x i64> poison,
    <2 x i64> poison, <2 x i64> poison)
  %3 = extractvalue %Out %2, 0, 0
  %4 = add i64 %3, 42
  ret i64 %4
}
LLVM IR
load_rcx:
  mov   rax, qword ptr [rcx]
  ret

make_the_call:
  push  rax
  mov   edi, 8
  call  malloc
  mov   qword ptr [rax], 42
  mov   rcx, rax
  call  load_rcx
  add   rax, 42
  pop   rcx
  ret
x86 Assembly

It is perfectly legal to pass poison to a function, if it does not interact with the poisoned argument in any proscribed way. And as we see, load_rcx() receives its pointer argument in rcx, whereas make_the_call() takes no penalty in setting up the call: loading poison into the other thirteen registers compiles down to nothing4, so it only needs to load the pointer returned by malloc into rcx.

This gives us almost total control over argument passing; unfortunately, it is not total. In an ideal world, the same registers are used for input and output, to allow easier pipelining of calls without introducing extra register traffic. This is true on ARM and RISC-V, but not x86. However, because register ordering is merely a suggestion for us, we can choose to allocate the return registers in whatever order we want. For example, we can pretend the order registers should be allocated in is rdx, rcx, rdi, rsi, r8, r9 for inputs, and rdx, rcx, rax for outputs.

%Out = type {[3 x i64], [4 x <2 x i64>]}
define %Out @square(
  i64 %rdi, i64 %rsi, i64 %rdx,
  ptr %rcx, i64 %r8, i64 %r9,
  <2 x i64> %xmm0, <2 x i64> %xmm1,
  <2 x i64> %xmm2, <2 x i64> %xmm3,
  <2 x i64> %xmm4, <2 x i64> %xmm5,
  <2 x i64> %xmm6, <2 x i64> %xmm7
) {
  %sq = mul i64 %rdx, %rdx
  %out = insertvalue %Out poison,
                      i64 %sq, 0, 1
  ret %Out %out
}

define i64 @make_the_call(i64) {
  %2 = call %Out @square(
    i64 poison, i64 poison, i64 %0,
    i64 poison, i64 poison, i64 poison,
    <2 x i64> poison, <2 x i64> poison,
    <2 x i64> poison, <2 x i64> poison,
    <2 x i64> poison, <2 x i64> poison,
    <2 x i64> poison, <2 x i64> poison)
  %3 = extractvalue %Out %2, 0, 1

  %4 = call %Out @square(
    i64 poison, i64 poison, i64 %3,
    i64 poison, i64 poison, i64 poison,
    <2 x i64> poison, <2 x i64> poison,
    <2 x i64> poison, <2 x i64> poison,
    <2 x i64> poison, <2 x i64> poison,
    <2 x i64> poison, <2 x i64> poison)
  %5 = extractvalue %Out %4, 0, 1

  ret i64 %5
}
LLVM IR
square:
  imul rdx, rdx
  ret

make_the_call:
  push rax
  mov rdx, rdi
  call square
  call square
  mov rax, rdx
  pop rcx
  ret
x86 Assembly

square generates extremely simple code: the input and output register is rdx, so no extra register traffic needs to be generated. Similarly, when we effectively do @square(@square(%0)), there is no setup between the functions. This is similar to code seen on aarch64, which uses the same register sequence for input and output. We can see that the “naive” version of this IR produces the exact same code on aarch64 for this reason.

define i64 @square(i64) {
  %2 = mul i64 %0, %0
  ret i64 %2
}

define i64 @make_the_call(i64) {
  %2 = call i64 @square(i64 %0)
  %3 = call i64 @square(i64 %2)
  ret i64 %3
}
LLVM IR
square:
  mul x0, x0, x0
  ret

make_the_call:
  str x30, [sp, #-16]!
  bl square
  ldr x30, [sp], #16
  b square  // Tail call.
ARM Assembly

Rust Structs and Unions

Now that we’ve established total control on how registers are assigned, we can turn towards maximizing use of these registers in Rust.

For simplicity, we can assume that rustc has already processed the user’s types into basic aggregates and unions; no enums here! We then have to make some decisions about which portions of the arguments to allocate to registers.

First, return values. This is relatively straightforward, since there is only one value to pass. The amount of data we need to return is not the size of the struct. For example, [(u64, u32); 2] measures 32 bytes wide. However, eight of those bytes are padding! We do not need to preserve padding when returning by value, so we can flatten the struct into (u64, u32, u64, u32) and sort by size into (u64, u64, u32, u32). This has no padding and is 24 bytes wide, which fits into the three return registers LLVM gives us on x86. We define the effective size of a type to be the number of non-undef bits it occupies. For [(u64, u32); 2], this is 192 bits, since it excludes the padding. For bool, this is one. For char this is technically 21, but it’s simpler to treat char as an alias for u32.

The reason for counting bits this way is that it permits significant compaction. For example, returning a struct full of bools can simply bit-pack the bools into a single register.

So, a return value is converted to a by-ref return if its effective size is larger than the output register space (on x86, this is three integer registers and four SSE registers, so we get 88 bytes total, or 704 bits).

Argument registers are much harder, because we hit the knapsack problem, which is NP-hard. The following relatively naive heuristic is where I would start, but it can be made infinitely smarter over time.

First, demote to by-ref any argument whose effective size is larger than the total by-register input space (on x86, 176 bytes or 1408 bits). This means we get a pointer argument instead. This is beneficial to do first, since a single pointer might pack better than the huge struct.

Enums should be replaced by the appropriate discriminant-union pair. For example, Option<i32> is, internally, (union { i32, () }, i1), while Option<Option<i32>> is (union { i32, (), () }, i2). Using a small non-power-of-two integer improves our ability to pack things, since enum discriminants are often quite tiny.
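
As a rough model of this lowering (Rust has no i1 type at the source level, so the discriminant is a bool here; all names are hypothetical):

// Option<i32>, lowered into the discriminant-union pair described above.
union OptionI32Payload {
  some: i32,
  none: (),
}

struct LoweredOptionI32 {
  payload: OptionI32Payload, // union { i32, () }
  is_some: bool,             // the i1 discriminant
}

fn lower(x: Option<i32>) -> LoweredOptionI32 {
  match x {
    Some(v) => LoweredOptionI32 {
      payload: OptionI32Payload { some: v },
      is_some: true,
    },
    None => LoweredOptionI32 {
      payload: OptionI32Payload { none: () },
      is_some: false,
    },
  }
}
Rust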

Next, we need to handle unions. Because code is allowed to muck about with a union’s uninitialized bits behind our backs, we need to pass unions as arrays of u8, unless the union has only a single non-empty variant, in which case it is replaced with that variant5.

Now, we can proceed to flatten everything. All of the converted arguments are flattened into their most primitive components: pointers, integers, floats, and bools. Every field should be no larger than the smallest argument register; this may require splitting large types such as u128 or f64.

This big list of primitives is next sorted by effective size, from smallest to largest. We take the largest prefix of this that will fit in the available register space; everything else goes on the stack.
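
Here’s a minimal sketch of that step, assuming the 1408-bit x86 input space from above; none of this is real rustc code:

// Split flattened primitives (each given as its effective size in bits)
// into a by-register prefix and a stack remainder.
fn split_args(mut prims: Vec<u32>) -> (Vec<u32>, Vec<u32>) {
  const INPUT_REGISTER_BITS: u32 = 1408; // 6 GPRs + 8 SSE registers.
  prims.sort(); // smallest effective size first
  let mut used = 0;
  let mut fits = 0;
  while fits < prims.len() && used + prims[fits] <= INPUT_REGISTER_BITS {
    used += prims[fits];
    fits += 1;
  }
  let on_stack = prims.split_off(fits); // everything past the prefix
  (prims, on_stack)
}
Rust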

If part of a Rust-level input is sent to the stack in this way, and that part is larger than a small multiple of the pointer size (e.g., 2x), it is demoted to being passed by pointer-on-the-stack, to minimize memory traffic. Everything else is passed directly on the stack in the order those inputs were before the sort. This helps keep regions that need to be copied relatively contiguous, to minimize calls to memcpy.

The things we choose to pass in registers are allocated to registers in reverse size order, so e.g. first 64-bit things, then 32-bit things, etc. This is the same layout algorithm that repr(Rust) structs use to move all the padding into the tail. Once we get to the bools, those are bit-packed, 64 to a register.
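
The bool packing mirrors the lshr-and-trunc sequences in the example prologue further below. As a sketch (LSB-first order is an assumption; the actual bit order would be an ABI detail):

// Pack up to 64 bools into a single integer register, LSB-first.
fn pack_bools(bools: &[bool]) -> u64 {
  assert!(bools.len() <= 64);
  bools.iter().enumerate()
    .fold(0u64, |acc, (i, &b)| acc | ((b as u64) << i))
}

// The callee-side unpacking: shift and truncate, one bit per bool.
fn unpack_bool(packed: u64, index: u32) -> bool {
  (packed >> index) & 1 != 0
}
Rust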

Here’s a relatively complicated example. My Rust function is as follows:

struct Options {
  colorize: bool,
  verbose_debug: bool,
  allow_spurious_failure: bool,
  retries: u32,
}

trait Context {
  fn check(&self, n: usize, colorize: bool);
}

fn do_thing<'a>(op_count: Option<usize>, context: &dyn Context,
                name: &'a str, code: [char; 6],
                options: Options,
) -> &'a str {
  if let Some(op_count) = op_count {
    context.check(op_count, options.colorize);
  }

  for c in code {
    if let Some((_, suf)) = name.split_once(c) {
      return suf;
    }
  }

  "idk"
}
Rust

The codegen for this function is quite complex, so I’ll only cover the prologue and epilogue. After sorting and flattening, our raw argument LLVM types are something like this:

gprs: i64, ptr, ptr, ptr, i64, i32, i32
xmm0: i32, i32, i32, i32
xmm1: i32, i1, i1, i1, i1
LLVM IR

Everything fits in registers! So, what does the LLVM function look like on x86?

%Out = type {[3 x i64], [4 x <2 x i64>]}
define %Out @do_thing(
  i64 %rdi, ptr %rsi, ptr %rdx,
  ptr %rcx, i64 %r8, i64 %r9,
  <4 x i32> %xmm0, <4 x i32> %xmm1,
  ; Unused.
  <2 x i64> %xmm2, <2 x i64> %xmm3,
  <2 x i64> %xmm4, <2 x i64> %xmm5,
  <2 x i64> %xmm6, <2 x i64> %xmm7
) {
  ; First, unpack all the primitives.
  %r9.0 = trunc i64 %r9 to i32
  %r9.1.i64 = lshr i64 %r9, 32
  %r9.1 = trunc i64 %r9.1.i64 to i32
  %xmm0.0 = extractelement <4 x i32> %xmm0, i32 0
  %xmm0.1 = extractelement <4 x i32> %xmm0, i32 1
  %xmm0.2 = extractelement <4 x i32> %xmm0, i32 2
  %xmm0.3 = extractelement <4 x i32> %xmm0, i32 3
  %xmm1.0 = extractelement <4 x i32> %xmm1, i32 0
  %xmm1.1 = extractelement <4 x i32> %xmm1, i32 1
  %xmm1.1.0 = trunc i32 %xmm1.1 to i1
  %xmm1.1.1.i32 = lshr i32 %xmm1.1, 1
  %xmm1.1.1 = trunc i32 %xmm1.1.1.i32 to i1
  %xmm1.1.2.i32 = lshr i32 %xmm1.1, 2
  %xmm1.1.2 = trunc i32 %xmm1.1.2.i32 to i1
  %xmm1.1.3.i32 = lshr i32 %xmm1.1, 3
  %xmm1.1.3 = trunc i32 %xmm1.1.3.i32 to i1

  ; Next, reassemble them into concrete values as needed.
  %op_count.0 = insertvalue { i64, i1 } poison, i64 %rdi, 0
  %op_count = insertvalue { i64, i1 } %op_count.0, i1 %xmm1.1.0, 1
  %context.0 = insertvalue { ptr, ptr } poison, ptr %rsi, 0
  %context = insertvalue { ptr, ptr } %context.0, ptr %rdx, 1
  %name.0 = insertvalue { ptr, i64 } poison, ptr %rcx, 0
  %name = insertvalue { ptr, i64 } %name.0, i64 %r8, 1
  %code.0 = insertvalue [6 x i32] poison, i32 %r9.0, 0
  %code.1 = insertvalue [6 x i32] %code.0, i32 %r9.1, 1
  %code.2 = insertvalue [6 x i32] %code.1, i32 %xmm0.0, 2
  %code.3 = insertvalue [6 x i32] %code.2, i32 %xmm0.1, 3
  %code.4 = insertvalue [6 x i32] %code.3, i32 %xmm0.2, 4
  %code = insertvalue [6 x i32] %code.4, i32 %xmm0.3, 5
  %options.0 = insertvalue { i32, i1, i1, i1 } poison, i32 %xmm1.0, 0
  %options.1 = insertvalue { i32, i1, i1, i1 } %options.0, i1 %xmm1.1.1, 1
  %options.2 = insertvalue { i32, i1, i1, i1 } %options.1, i1 %xmm1.1.2, 2
  %options = insertvalue { i32, i1, i1, i1 } %options.2, i1 %xmm1.1.3, 3

  ; Codegen as usual.
  ; ...
}
LLVM IR

Above, the !dbg metadata for each argument value should be attached to the instruction that actually materializes it. This ensures that gdb does something halfway intelligent when you ask it to print argument values.

Current rustc, on the other hand, gives LLVM eight pointer-sized parameters for this function, so it winds up spending all six integer registers, plus two values passed on the stack. Not great!

This is not a complete description of what a fully over-engineered calling convention could entail: in some cases, we might know that additional registers are available (such as AVX registers on x86). There are also cases where we might want to split a struct across registers and the stack.

This also isn’t even getting into what returns could look like. Results are often passed up through several layers of functions via ?, which can result in a lot of redundant register moves. Often, a Result is large enough that it doesn’t fit in registers, so each call in the ? stack has to inspect an ok bit by loading it from memory. Instead, a Result return might be implemented as an out-parameter pointer for the error, with the ok variant’s payload and the “is ok” bit returned as an Option<T>. There are some fussy details with Into calls via ?, but the idea is implementable.
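
As a source-level sketch of what this transformation might look like (all names hypothetical; the real version would happen in the ABI, invisible to user code):

use std::mem::MaybeUninit;

struct BigError {
  code: u64,
  detail: [u64; 8], // large, cold payload that won't fit in registers
}

// Source-level signature: fn fallible(x: u64) -> Result<u64, BigError>.
// Transformed: the error goes through an out-parameter, and the ok
// payload plus the "is ok" bit come back in registers as an Option.
fn fallible_abi(x: u64, err_out: &mut MaybeUninit<BigError>) -> Option<u64> {
  if x == 0 {
    err_out.write(BigError { code: 1, detail: [0; 8] });
    return None;
  }
  Some(x * x)
}
Rust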

Optimization-Dependent ABI

Now, because we’re Rust, we’ve also got a trick up our sleeve that C doesn’t (but Go does)! When we’re generating the ABI that all callers will see (for -Zcallconv=fast), we can look at the function body. This means that a crate can advertise the precise ABI (in terms of register-passing) of its functions.

This opens the door to more extreme optimization-based ABIs. We can start by simply throwing out unused arguments: if the function never does anything with a parameter, don’t bother spending registers on it.
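
For example (a hypothetical function; the point is only that the body is visible when the ABI is chosen):

// `_tag` is dead, so a body-aware ABI need not assign it a register;
// callers wouldn't even have to materialize the string.
fn scaled(x: i64, _tag: &str) -> i64 {
  x * 2
}
Rust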

Another example: suppose we know that an &T argument is not retained (a question the borrow checker can answer at this point in the compiler) and is never converted to a raw pointer (or written to memory that a raw pointer is taken of, etc.). We also know that T is fairly small and T: Freeze. Then, we can replace the reference with the pointee itself, passed by value.

The most obvious candidates for this are APIs like HashMap::get(). If the key is something like an i32, we need to spill that integer to the stack and pass a pointer to it! This results in unnecessary, avoidable memory traffic.
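
A sketch of the transformation, as a hypothetical pair of signatures:

use std::collections::HashMap;

// What the programmer writes: the callee only reads *key and never
// retains it or exposes its address.
fn has_key(map: &HashMap<i32, String>, key: &i32) -> bool {
  map.contains_key(key)
}

// What the ABI could pass instead: the i32 itself, in a register. The
// callee can rebuild a reference to a local whenever it needs one.
fn has_key_abi(map: &HashMap<i32, String>, key: i32) -> bool {
  map.contains_key(&key)
}
Rust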

Profile-guided ABI is a step further. We might know that some arguments are hotter than others, which might cause them to be prioritized in the register allocation order.

You could even imagine a case where a function takes a very large struct by reference, but three i64 fields are very hot, so the caller can preload those fields, passing them both by register and via the pointer to the large struct. The callee does not see additional cost: it had to issue those loads anyway. However, the caller probably has those values in registers already, which avoids some memory traffic.
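
Sketching that idea with hypothetical types (the split would be invisible at the source level):

struct Big {
  hot_a: i64,
  hot_b: i64,
  hot_c: i64,
  cold: [u8; 4096],
}

// The caller passes the pointer *and* register copies of the hot fields.
// The callee uses the copies directly and only dereferences `big` for
// cold data, so it pays no extra cost; the caller likely had the hot
// values in registers already.
fn process_abi(big: &Big, hot_a: i64, hot_b: i64, hot_c: i64) -> i64 {
  hot_a + hot_b + hot_c + big.cold[0] as i64
}
Rust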

Instrumentation profiles may even indicate that it makes sense to duplicate whole functions, which are identical except for their ABIs. Maybe they take different arguments by register to avoid costly spills.

Conclusion

This is a bit more advanced (and ranty) than my usual writing, but this is an aspect of Rust that I find really frustrating. We could be doing so much better than C++ ever can (because of its ABI constraints). None of these ideas are new; this is literally how Go does it!

So why don’t we? Part of the reason is that ABI codegen is complex, and as I described above, LLVM gives us very few useful knobs. It’s not a friendly part of rustc, and doing things wrong can have nasty consequences for usability. The other part is a lack of expertise. As of writing, only a handful of people contributing to rustc have the necessary grasp of LLVM’s semantics (and mood swings) to emit the Right Code such that we get good codegen and don’t crash LLVM.

Another reason is compilation time. The more complicated the function signatures, the more prologue/epilogue code we have to generate for LLVM to chew on. But -Zcallconv is intended to be used only with optimizations turned on, so I don’t think this is a meaningful complaint. (Nor do I think the project’s Goodhartization of compilation time as a metric is healthy, but either way, this isn’t ultimately a relevant drawback.)

I, unfortunately, do not have the spare time to dive into fixing rustc’s ABI code, but I do know LLVM really well, and I know that this is a place where Rust has a low bus factor. For that reason, I am happy to provide the Rust compiler team expert knowledge on getting LLVM to do the right thing in service of making optimized code faster.

  1. Or just switch it to the codepath for extern "C" or extern "fastcall" since those are clearly better. We will always need to know how to generate code for the non-extern "Rust" calling conventions. 

  2. It’s Complicated. Passing a double burns a whole <2 x i64> slot. This seems bad, but it can be beneficial: keeping a double in vector registers reduces register traffic, since fp instructions usually use the vector registers (or the fp registers shadow the vector registers, like on ARM). 

  3. On the one hand, you might say this “extended calling convention” isn’t an explicitly supported part of LLVM’s ccc calling convention. On the other hand, Hyrum’s Law cuts both ways: Rust is big enough of an LLVM user that LLVM cannot simply miscompile all Rust programs at this point, and the IR I propose Rust emits is extremely reasonable.

    If Rust causes LLVM to misbehave, that’s an LLVM bug, and we should fix LLVM bugs, not work around them. 

  4. Only on -O1 or higher, bizarrely. At -O0, LLVM decides that all of the poisons must have the same value, so it copies a bunch of registers around needlessly. This seems like a bug? 

  5. There are other cases where we might want to replace a union with one of its variants: for example, there are a lot of cases where Result<&T, Error> is secretly a union { ptr, u32 }, in which case it should be replaced with a single ptr.