Protobuf Tip #3: Enum Names Need Prefixes

Smart people learn from their mistakes. But the real sharp ones learn from the mistakes of others. –Brandon Mull

TL;DR: enums inherit some unfortunate behaviors from C++. Use the Buf lint rules ENUM_VALUE_PREFIX and ENUM_ZERO_VALUE_SUFFIX to avoid this problem (they’re part of the DEFAULT category).

I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think they fit with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog.

C++-Style Enums

Protobuf’s enums define data types that represent a small set of valid values. For example, google.rpc.Code lists status codes used by various RPC frameworks, such as gRPC. Under the hood, every enum is just an int32 on the wire, although codegen backends will generate custom types and constants for the enum to make it easier to use.

Unfortunately, enums were originally designed to match C++ enums exactly, and they inadvertently replicate many of those behaviors.

If you look at the source for google.rpc.Code, and compare it to, say, google.protobuf.FieldDescriptorProto.Type, you will notice a subtle difference:

package google.rpc;
enum Code {
  OK = 0;
  CANCELLED = 1;
  UNKNOWN = 2;
  // ...
}

package google.protobuf;
message FieldDescriptorProto {
  enum Type {
    // 0 is reserved for errors.
    TYPE_DOUBLE = 1;
    TYPE_FLOAT = 2;
    TYPE_INT64 = 3;
    // ...
  }
}
Protobuf

FieldDescriptorProto.Type has values starting with TYPE_, but Code’s values don’t have a CODE_ prefix. This is because the fully-qualified name (FQN) of an enum value doesn’t include the name of the enum. That is, TYPE_DOUBLE actually refers to google.protobuf.FieldDescriptorProto.TYPE_DOUBLE. Thus, OK is not google.rpc.Code.OK, but google.rpc.OK.

This matches the behavior of unscoped C++ enums. C++ is the “reference” implementation, so the language often bends for the sake of the C++ backend.

When generating code, protoc’s C++ backend emits the above as follows:

namespace google::rpc {
enum Code {
  OK = 0,
  CANCELLED = 1,
  UNKNOWN = 2,
  // ...
};
}

namespace google::protobuf {
class FieldDescriptorProto final {
 public:
  enum Type {
    TYPE_DOUBLE = 1,
    TYPE_FLOAT = 2,
    // ...
  };
};
}
C++

And in C++, enums don’t scope their enumerators: you write google::rpc::OK, NOT google::rpc::Code::OK.

If you know C++, you might be thinking, “why didn’t they use enum class?!” Enums were added in proto2, which was developed around 2007-2008, but Google didn’t start using C++11, which introduced enum class, until much, much later.

Now, if you’re a Go or Java programmer, you’re probably wondering why you even care about C++. Both Go and Java do scope enum values to the enum type (although Go does it in a somewhat grody way: rpcpb.Code_OK).
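
For example, here is roughly what those two enums look like from Go (the import paths below are the commonly published Go packages for these protos; treat them as assumptions if your build generates the code elsewhere):

package main

import (
  "fmt"

  codepb "google.golang.org/genproto/googleapis/rpc/code"
  "google.golang.org/protobuf/types/descriptorpb"
)

func main() {
  // Go scopes each value by prefixing it with the enum's name, so the missing
  // CODE_ prefix in the .proto source causes no collisions here.
  c := codepb.Code_OK
  t := descriptorpb.FieldDescriptorProto_TYPE_DOUBLE
  fmt.Println(c, t) // Prints the value names, e.g. "OK TYPE_DOUBLE".
}
Go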

Unfortunately, this affects name collision detection in Protobuf. You can’t write the following code:

package myapi.v1;

enum Stoplight {
  UNSPECIFIED = 0;
  RED = 1;
  YELLOW = 2;
  GREEN = 3;
}

enum Speed {
  UNSPECIFIED = 0;
  SLOW = 1;
  FAST = 2;
}
Protobuf

Because the enum name is not part of the FQN for an enum value, both UNSPECIFIEDs here have the FQN myapi.v1.UNSPECIFIED, so Protobuf complains about duplicate symbols.

Thus, the convention we see in FieldDescriptorProto.Type:

package myapi.v1;

enum Stoplight {
  STOPLIGHT_UNSPECIFIED = 0;
  STOPLIGHT_RED = 1;
  STOPLIGHT_YELLOW = 2;
  STOPLIGHT_GREEN = 3;
}

enum Speed {
  SPEED_UNSPECIFIED = 0;
  SPEED_SLOW = 1;
  SPEED_FAST = 2;
}
Protobuf

Buf provides a lint rule to enforce this convention: ENUM_VALUE_PREFIX. Even though you might think that an enum name will be unique enough, top-level enums bleed their values into the containing package, so the problem spreads across files!

Zero Values

proto3 relies heavily on the concept of “zero values” – all non-message fields that are neither repeated nor optional are implicitly zero if they are not present. Thus, proto3 requires that enums specify a value equal to zero.

By convention, this value shouldn’t be a specific value of the enum, but rather a value representing that no value is specified. ENUM_ZERO_VALUE_SUFFIX enforces this, with a default of _UNSPECIFIED. Of course, there are situations where this might not make sense for you, and a suffix like _ZERO or _UNKNOWN might make more sense.

It may be tempting to have a specific “good default” value for the zero value. Beware though, because that choice is forever. Picking a generic “unknown” as the default reduces the chance you’ll burn yourself.

Why Don’t All of Google’s Protobuf Files Do This?

Name prefixes and zero values also teach us an important lesson: because Protobuf names are forever, it’s really hard to fix style mistakes, especially as we collectively get better at using Protobuf.

google.rpc.Code is intended to be source-compatible with very old existing C++ code, so it throws caution to the wind. FieldDescriptorProto.Type doesn’t have a zero value because proto2, whose wire format doesn’t have the zero-value footgun, doesn’t need one. The lesson isn’t just to use Buf’s linter to try to avoid some of the known pitfalls, but also to remember that even APIs designed by the authors of the language make unfixable mistakes, so unlike in other programming languages, imitating “existing practice” isn’t always the best strategy.

Cheating the Reaper in Go

Even though I am a C++ programmer at heart, Go fascinates me for none of the reasons you think. Go has made several interesting design decisions:

  1. It has virtually no Undefined Behavior [1].

  2. It has very simple GC semantics that they’re mostly stuck with due to design decisions in the surface language.

These things mean that despite Go having a GC, it’s possible to do manual memory management in pure Go and in cooperation with the GC (although without any help from the runtime package). To demonstrate this, we will be building an untyped, garbage-collected arena abstraction in Go which relies on several GC implementation details.

I would never play this kind of game in Rust or C++, because LLVM is extremely intelligent and able to find all kinds of ways to break you over the course of frequent compiler upgrades. On the other hand, although Go does not promise any compatibility across versions for code that imports unsafe, in practice, two forces work against Go doing this:

  1. Go does not attempt to define what is and isn’t allowed: unsafe lacks any operational semantics.

  2. Go prioritizes not breaking the ecosystem; this allows us to assume that Hyrum’s Law will protect certain observable behaviors of the runtime, from which we may infer what can or cannot break easily.

This is in contrast to a high-performance native compiler like LLVM, which has a carefully defined boundary around all UB, allowing them to arbitrarily break programs that cross it (mostly) without fear of breaking the ecosystem.

So, let’s dive in and cheat death.

What Are We Building?

Our goal is to build an arena, which is a data structure for efficient allocation of memory that has the same lifetime. This reduces pressure on the general-purpose allocator by only requesting memory in large chunks and then freeing it all at once.

For a comparison in Go, consider the following program:

package main

import "fmt"

func main() {
  var s []int
  for i := range 1000 {
    prev := cap(s)
    s = append(s, i)
    if cap(s) != prev {
      fmt.Println(cap(s))
    }
  }
}
Go

This program will print successive powers of 2: this is because append is implemented approximately like so:

func append[S ~[]T, T any](a, b S) S {
  // If needed, grow the allocation.
  if cap(a) - len(a) < len(b) {
    // Either double the size, or allocate just enough if doubling is
    // too little.
    newCap := max(2*cap(a), len(a)+len(b))

    // Grow a.
    a2 := make([]T, len(a), newCap)
    copy(a2, a)
    a = a2
  }

  // Increase the length of a to fit b, then write b into the freshly
  // grown region.
  a = a[:len(a)+len(b)]
  copy(a[len(a)-len(b):], b)
  return a
}
Go

For appending small pieces, make is only called O(log n) times, a big improvement over calling it for every call to append. Virtually every programming language’s dynamic array abstraction makes this optimization.

An arena generalizes this concept, but instead of resizing exponentially, it allocates new blocks and vends pointers into them. The interface we want to conform to is as follows:

type Allocator interface {
  Alloc(size, align uintptr) unsafe.Pointer
}
Go

In goes a size and an alignment, and out comes a pointer to fresh memory with that layout. Go does not have user-visible uninitialized memory, so we additionally require that the returned region be zeroed. We also require that align be a power of two.

We can give this a type-safe interface by writing a generic New function:

// New allocates a fresh zero value of type T on the given allocator, and
// returns a pointer to it.
func New[T any](a Allocator) *T {
  var t T
  p := a.Alloc(unsafe.Sizeof(t), unsafe.Alignof(t))
  return (*T)(p)
}
Go

This all feels very fine and dandy to anyone used to hurting themselves with malloc or operator new in C++, but there is a small problem. What happens when we allocate pointer-typed memory into this allocator?

// Allocate a pointer in our custom allocator, and then
// initialize it to a pointer on the Go heap.
p := New[*int](myAlloc)
*p = new(int)

runtime.GC()
**p = 42  // Use after free!
Go

Allocator.Alloc takes a size and an alignment, which is sufficient to describe the layout of any type. For example, on 64-bit systems, int and *int have the same layout: 8 bytes of size, and 8 bytes of alignment.
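
You can check this yourself: on a 64-bit target, the following prints identical layouts for int and *int, even though only one of them matters to the GC.

package main

import (
  "fmt"
  "unsafe"
)

func main() {
  var i int
  var p *int

  // Same size and alignment, but only *int contains a pointer the GC must
  // know about.
  fmt.Println(unsafe.Sizeof(i), unsafe.Alignof(i)) // 8 8
  fmt.Println(unsafe.Sizeof(p), unsafe.Alignof(p)) // 8 8
}
Go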

However, the Go GC (and all garbage collectors, generally) require one additional piece of information, which is somewhere between the layout of a value (how it is placed in memory) and the type of a value (rich information on its structure). To understand this, we need a brief overview on what a GC does.

Mark and Sweep

For a complete overview on how to build a simple GC, take a look at a toy GC I designed some time ago: The Alkyne GC.

A garbage collector’s responsibility is to maintain a memory allocator and an accounting of:

  1. What memory has been allocated.
  2. Whether that memory is still in use.

Memory that is not in use can be reclaimed and marked as unallocated, for re-use.

The most popular way to accomplish this is via a “mark and sweep” architecture. The GC will periodically walk the entire object graph of the program from certain pre-determined roots; anything it finds is “marked” as alive. After a mark is complete, all other memory is “swept”, which means to mark it is unallocated for future re-use, or to return it to the OS, in the case of significant surplus.

The roots are typically entities that are actively being manipulated by the program. In the case of Go, this is anything currently on the stack of some G [2], or anything in a global (of which there is a compile-time-known set).

The marking phase begins with stack scanning, which looks at the stack of each G and locates any pointers contained therein. The Go compiler generates metadata for each function that specifies which stack slots in a function’s frame contain pointers. All of these pointers are live by definition.

These pointers are placed into a queue, and each pointer is traced to its allocation on the heap. If the GC does not know anything about a particular address, it is discarded as foreign memory that does not need to be marked. If it does, each pointer in that allocation is pushed onto the queue if it has not already been marked as alive. The process continues until the queue is empty.
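
As a toy sketch of that queue-driven process (none of these types correspond to the runtime’s real data structures), the mark phase boils down to something like this:

// object stands in for a heap allocation: its shape tells the GC which words
// are pointers, which we model here as an explicit slice.
type object struct {
  marked   bool
  pointers []*object
}

// mark walks the object graph from the roots, marking everything reachable.
// Anything left unmarked afterwards is garbage and can be swept.
func mark(roots []*object) {
  queue := append([]*object(nil), roots...)
  for len(queue) > 0 {
    obj := queue[len(queue)-1]
    queue = queue[:len(queue)-1]
    if obj == nil || obj.marked {
      continue
    }
    obj.marked = true
    queue = append(queue, obj.pointers...)
  }
}
Go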

The critical step here is to take the address of some allocation, and convert it into all of the pointer values within. Go has precise garbage collection, which means that it only treats things declared as pointers in the surface language as pointers: an integer that merely happens to look like an address will not be traced as one. This results in more efficient memory usage, but trades off some more complexity in the GC.

For example, the types *int, map[int]byte, string, struct {A int; B *int} all contain at least one pointer, while int, [1000]byte, struct {X bool; F uintptr} do not. The latter are called pointer-free types.

Go enhances the layout of a type into a shape by adding a bitset that specifies which pointer-aligned, pointer-sized words of the type’s memory region contain a pointer. These are called the pointer bits. For example, here are the shapes of a few Go types on a 64-bit system.

Type               Size/Align   Pointer Bits [3]
byte               1/1          0
int                8/8          0
rune               4/4          0
*int               8/8          1
unsafe.Pointer     8/8          1
string             16/8         10
[]int              24/8         100
[3]string          48/8         101010
map[int]byte       8/8          1
map[int]string     8/8          1
any                16/8         01 [4]
error              16/8         01
func(int) int      8/8          1
runtime.hchan [5]  104/8        0010110011110

In the Go GC, each allocation is tagged with its shape (this is done in a variety of ways: an explicit header on the allocation itself (a “malloc header”), a runtime type stored in the allocation’s runtime.mspan, or another mechanism). When scanning a value, the GC uses this information to determine where the pointers to scan through are.

The most obvious problem with our Allocator.Alloc type is that it does not discriminate shapes, so it cannot allocate memory that contains pointers: the GC will not be able to find the pointers, and will free them prematurely!

In our example where we allocated an *int in our custom allocator, we wind up with a **int on the stack. You would think that Go would simply trace through the first * to find an *int and mark it as being alive, but that is not what happens! Go instead finds a pointer into some chunk that the custom allocator grabbed from the heap, which is missing the pointer bits of its shape!

Why does Go not look at the type of the pointer it steps through? Two reasons.

  1. All pointers in Go are untyped from the runtime’s perspective; every *T gets erased into an unsafe.Pointer. This allows much of the Go runtime to be “generic” without using actual generics.

  2. Pointee metadata can be aggregated, so that each pointer to an object does not have to remember its type at runtime.

The end result for us is that we can’t put pointers on the arena. This makes our New API unsafe, especially since Go does not provide a standard constraint for marking generic parameters as pointer-free: unsurprisingly, they don’t expect most users to care about such a detail.

It is possible to deduce the pointer bits of a type using reflection, but that’s very slow, and the whole point of using arenas is to go fast. As we design our arena, though, it will become clear that there is a safe way to have pointers on it.
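
For a flavor of what that reflection-based deduction might look like, here is a rough sketch that only answers “pointer-free or not”; the real pointer bitmap is per-word, and walking the type like this on every allocation is exactly the kind of slowness arenas are trying to avoid.

import "reflect"

// hasPointers reports whether values of type t contain any words the GC would
// need to trace.
func hasPointers(t reflect.Type) bool {
  switch t.Kind() {
  case reflect.Pointer, reflect.UnsafePointer, reflect.Map, reflect.Chan,
    reflect.Slice, reflect.String, reflect.Interface, reflect.Func:
    return true
  case reflect.Array:
    return t.Len() > 0 && hasPointers(t.Elem())
  case reflect.Struct:
    for i := 0; i < t.NumField(); i++ {
      if hasPointers(t.Field(i).Type) {
        return true
      }
    }
  }
  return false
}
Go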

Designing The Arena

Now that we have a pretty good understanding about what the Go GC is doing, we can go about designing a fast arena structure.

The ideal case is that a call to Alloc is very fast: just offsetting a pointer in the common case. One assumption we can make off the bat is that all memory can be forced to have maximum alignment: most objects are a pointer or larger, and Go does have a maximum alignment for ordinary user types, so we can just ignore the align parameter and always align to, say, 8 bytes. This means that the pointer to the next unallocated chunk will always be well-aligned. Thus, we might come up with a structure like this one:

type Arena struct {
  next      unsafe.Pointer
  left, cap uintptr
}

const (
  // Power of two size of the minimum allocation granularity.
  wordBytes = 8  // Depends on target, this is for 64-bit.
  minWords  = 8
)

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
  // First, round the size up to the alignment of every object in
  // the arena.
  mask := wordBytes - 1
  size = (size + mask) &^ mask
  // Then, replace the size with the size in pointer-sized words.
  // This does not result in any loss of size, since size is now
  // a multiple of the uintptr size.
  words := size / wordBytes

  // Next, check if we have enough space left for this chunk. If
  // there isn't, we need to grow.
  if a.left < words {
    // Pick whichever is largest: the minimum allocation size,
    // twice the last allocation, or the next power of two
    // after words.
    a.cap = max(minWords, a.cap*2, nextPow2(words))
    a.next = unsafe.Pointer(unsafe.SliceData(make([]uintptr, a.cap)))
    a.left = a.cap
  }

  // Allocate the chunk by incrementing the pointer.
  p := a.next
  a.left -= words
  if a.left > 0 {
    a.next = unsafe.Add(a.next, size)
  } else {
    // Beware, offsetting to one-past-the-end is one of the few
    // things explicitly not allowed by Go.
    a.next = nil
  }

  return p
}

// nextPow2 returns the smallest power of two greater than n.
func nextPow2(n uintptr) uintptr {
  return uintptr(1) << bits.Len(uint(n))
}
Go

How fast is this really? Here’s a simple benchmark for it.

func BenchmarkArena(b *testing.B) {
  bench[int](b)
  bench[[2]int](b)
  bench[[64]int](b)
  bench[[1024]int](b)
}

const runs = 100000

var sink any

func bench[T any](b *testing.B) {
  var z T
  n := int64(runs * unsafe.Sizeof(z))
  name := fmt.Sprintf("%v", reflect.TypeFor[T]())

  b.Run(name, func(b *testing.B) {
    b.Run("arena", func(b *testing.B) {
      b.SetBytes(n)
      for b.Loop() {
        a := new(arena.Arena)
        for range runs {
          sink = arena.New[T](a)
        }
      }
    })

    b.Run("new", func(b *testing.B) {
      b.SetBytes(n)
      for b.Loop() {
        for range runs {
          sink = new(T)
        }
      }
    })
  })
}
Go

The focus of this benchmark is to measure the cost of allocating many objects of the same size. The number of times the for b.Loop() loop will execute is unknown, and determined by the benchmarking framework to try to reduce statistical anomalies. This means that if we instead just benchmarked a single allocation, the result would be very sensitive to the number of runs.

We also use b.SetBytes to get a throughput measurement on the benchmark. This is a bit easier to interpret than the gross ns/op the benchmark would otherwise produce. It tells us how much memory each allocator can allocate per unit time.

We want to compare against new, but just writing _ = new(T) will get optimized out, since the resulting pointer does not escape. Writing it to a global is sufficient to convince Go that it escapes.

Here’s the results, abbreviated to show only the bytes per second. All benchmarks were performed on my AMD Ryzen Threadripper 3960X. Larger is better.

BenchmarkArena/int/arena-48         794.84 MB/s
BenchmarkArena/int/new-48           390.59 MB/s
BenchmarkArena/[2]int/arena-48      1263.58 MB/s
BenchmarkArena/[2]int/new-48        528.06 MB/s
BenchmarkArena/[64]int/arena-48     7370.08 MB/s
BenchmarkArena/[64]int/new-48       2865.24 MB/s
BenchmarkArena/[1024]int/arena-48   9889.20 MB/s
BenchmarkArena/[1024]int/new-48     2875.75 MB/s
Console

This is quite nice, and certainly worth pursuing! The performance increase seems to scale up with the amount of memory allocated, for a 2x-4x improvement across different cases.

Now we need to contend with the fact that our implementation is completely broken if we want to have pointers in it.

Not Dropping Memory on the Ground

In (*Arena).Alloc, when we assign a freshly-allocated chunk, we overwrite a.next, which means the GC can reclaim it. But this is fine: as long as pointers into that arena chunk are alive, the GC will not free it, independent of the arena. So it seems like we don’t need to worry about it?

However, the whole point of an arena is to allocate lots of memory that has the same lifetime. This is common for graph data structures, such as an AST or a compiler IR, where the program does a lot of work that allocates heavily and then throws the whole result away at once.

We are not allowed to put pointers in the arena, because they would disappear from the view of the GC and become freed too soon. But if a pointer wants to go on an arena, whatever it points to necessarily needs to outlive the whole arena: it must outlive the part of the arena it is stored in, and the arena is meant to have a single, shared lifetime.

In particular, if we could make it so that holding any pointer returned by Alloc prevents the entire arena from being swept by the GC, the arena can safely contain pointers into itself! Consider this:

  1. We have a pointer p **int. It is allocated on some arena a.

  2. The GC sees our pointer (as a type-erased unsafe.Pointer) and marks its allocation as live.

  3. Somehow, the GC also marks a as alive as a consequence.

  4. Somehow, the GC then marks every chunk a has allocated as alive.

  5. Therefore, the chunk that *p points to is also alive, so *p does not need to be marked directly, and will not be freed early.

The step (3) is crucial. By forcing the whole arena to be marked, any pointers stored in the arena into itself will be kept alive automatically, without the GC needing to know how to scan for them.

So, even though *New[*int](a) = new(int) is still going to result in a use-after-free, *New[*int](a) = New[int](a) would not! This small improvement does not make arenas themselves safe, but a data structure with an internal arena can be completely safe, so long as the only pointers that go into the arena are from the arena itself.
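
For example, here is a sketch of the kind of data structure that becomes safe once the machinery in the next two sections is in place (it uses the New/Arena API from this post): a linked list whose nodes all live on one arena and only ever point at each other.

// node contains a pointer, which is only safe to put on the arena because
// that pointer will itself point into the same arena.
type node struct {
  value int
  next  *node
}

// fromSlice builds an arena-backed list. Keeping any *node alive keeps its
// chunk alive, which keeps the arena alive, which keeps every chunk alive.
func fromSlice(a *arena.Arena, xs []int) *node {
  var head *node
  for i := len(xs) - 1; i >= 0; i-- {
    n := arena.New[node](a)
    n.value, n.next = xs[i], head
    head = n
  }
  return head
}
Go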

How can we make this work? The easy part is (4), which we can implement by adding a []unsafe.Pointer to the arena, and sticking every pointer we allocate into it.

type Arena struct {
  next      unsafe.Pointer
  left, cap uintptr

  chunks []unsafe.Pointer  // New field.
}

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
  // ... snip ...
  if a.left < words {
    // Pick whichever is largest: the minimum allocation size,
    // twice the last allocation, or the next power of two
    // after words.
    a.cap = max(minWords, a.cap*2, nextPow2(words))
    a.next = unsafe.Pointer(unsafe.SliceData(make([]uintptr, a.cap)))
    a.left = a.cap
    a.chunks = append(a.chunks, a.next)
  }
  // ... snip ...
}
Go

The cost of the append is amortized: to allocate n bytes, we only create O(log n) chunks, so the chunks slice itself only needs to grow an additional O(log log n) times. But what does this do to our benchmarks?

BenchmarkArena/int/arena-48         800.08 MB/s
BenchmarkArena/int/new-48           386.81 MB/s
BenchmarkArena/[2]int/arena-48      1236.00 MB/s
BenchmarkArena/[2]int/new-48        520.84 MB/s
BenchmarkArena/[64]int/arena-48     7999.71 MB/s
BenchmarkArena/[64]int/new-48       2706.68 MB/s
BenchmarkArena/[1024]int/arena-48   9998.00 MB/s
BenchmarkArena/[1024]int/new-48     2816.28 MB/s
Console

Seems pretty much the same, which is a good sign.

Back Pointers

Now that the arena does not discard any allocated memory, we can focus on condition (3): making it so that if any pointer returned by Alloc is alive, then so is the whole arena.

Here we can make use of an important property of how Go’s GC works: any pointer into an allocation will keep the whole allocation alive, as well as anything reachable from that pointer. But the chunks we’re allocating are []uintptrs, which will not be scanned. If we could somehow smuggle a single scannable pointer into each chunk, we could stick the pointer a *Arena there, so that when anything Alloc returns is scanned, it would cause a to be marked as alive.

So far, we have been allocating each [N]uintptr chunk using make, but what we would actually like to allocate is struct { A [N]uintptr; P unsafe.Pointer }, where N is some dynamic value.

In its infinite wisdom, the Go standard library actually gives us a dedicated mechanism to do this: reflect.StructOf. This can be used to construct arbitrary anonymous struct types at runtime, which we can then allocate on the heap.

So, instead of calling make, we might call this function:

func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
  chunk := reflect.New(reflect.StructOf([]reflect.StructField{
    {
      Name: "X0",
      Type: reflect.ArrayOf(int(words), reflect.TypeFor[uintptr]()),
    },
    {Name: "X1", Type: reflect.TypeFor[unsafe.Pointer]()},
  })).UnsafePointer()

  // Offset to the end of the chunk, and write a to it.
  end := unsafe.Add(chunk, words * unsafe.Sizeof(uintptr(0)))
  *(**Arena)(end) = a

  return chunk
}
Go

This appears to have a minor but noticeable effect on performance [6].

BenchmarkArena/int/arena-48         763.91 MB/s
BenchmarkArena/int/new-48           385.49 MB/s
BenchmarkArena/[2]int/arena-48      1174.00 MB/s
BenchmarkArena/[2]int/new-48        524.32 MB/s
BenchmarkArena/[64]int/arena-48     7563.54 MB/s
BenchmarkArena/[64]int/new-48       2649.63 MB/s
BenchmarkArena/[1024]int/arena-48   8668.02 MB/s
BenchmarkArena/[1024]int/new-48     2648.10 MB/s
Console

More Optimizations

Looking back at Arena.Alloc, the end of this function has a branch:

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
  // ... snip...

  // Allocate the chunk by incrementing the pointer.
  p := a.next
  a.left -= words
  if a.left > 0 {
    a.next = unsafe.Add(a.next, size)
  } else {
    // Beware, offsetting to one-past-the-end is one of the few
    // things explicitly not allowed by Go.
    a.next = nil
  }

  return p
}
Go

This is the absolute hottest part of allocation, since it is executed every time we call this function. The branch is a bit unfortunate, but it’s necessary, as noted by the comment.

In C++, if we have an array of int with n elements in it, and int* p is a pointer to the start of the array, p + n is a valid pointer, even though it can’t be dereferenced; it points “one past the end” of the array. This is a useful construction, since, for example, you can use it to eliminate a loop induction variable:

// Naive for loop, has an induction variable i.
for (int i = 0; i < n; i++) {
  do_something(p[i]);
}

// Faster: avoids the extra variable increment in the loop
// body for doing p[i].
for (auto end = p + n; p < end; p++) {
  do_something(*p);
}
C++

Go, however, gets very upset if you do this, because it confuses the garbage collector. The GC can’t tell the difference between a one-past-the-end pointer for allocation A, and for the start of allocation B immediately after it. At best this causes memory to stay alive for longer, and at worst it triggers safety interlocks in the GC. The GC will panic if it happens to scan a pointer for an address that it knows has been freed.

But in our code above, every chunk now has an extra element at the very end that is not used for allocation, so we can have a pointer that is one-past-the-end of the [N]uintptr that we are vending memory from.

The updated allocation function would look like this:

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
  // ... snip ...

  // Allocate the chunk by incrementing the pointer.
  p := a.next
  a.next = unsafe.Add(a.next, size)
  a.left -= words

  return p
}
Go

Notably, we do not replace a.left with an end pointer: the if a.left < words check still needs a count of remaining words, so we would end up doing an equivalent subtraction anyway just to make that comparison work.

So how much better is this?

BenchmarkArena/int/arena-48         780.07 MB/s
BenchmarkArena/int/new-48           383.16 MB/s
BenchmarkArena/[2]int/arena-48      1245.73 MB/s
BenchmarkArena/[2]int/new-48        530.39 MB/s
BenchmarkArena/[64]int/arena-48     7684.39 MB/s
BenchmarkArena/[64]int/new-48       2679.94 MB/s
BenchmarkArena/[1024]int/arena-48   8859.99 MB/s
BenchmarkArena/[1024]int/new-48     2611.33 MB/s
Console

Remarkably, not very! This is an improvement on the order of one or two percent. That’s because the branch we deleted is extremely predictable. Because Go’s codegen is relatively mediocre, the effect of highly predictable branches (assuming Go actually schedules the branches correctly 🙄) is quite minor.

Turns out there’s a bigger improvement we can make.

Write Barriers

Here’s the assembly Go generated for this function, heavily abridged, and annotated with the corresponding Go source code.

TEXT (*Arena).Alloc(SB)
  CMPQ    SP, 0x10(R14)
  JBE     moreStack  ; Stack growth prologue.
  PUSHQ   BP
  MOVQ    SP, BP
  SUBQ    $0x58, SP

  ; size = (size + mask) &^ mask
  LEAQ    0x7(BX), DX
  ANDQ    $-0x8, DX
  ; words := size / wordBytes
  MOVQ    DX, SI
  SHRQ    $0x3, DX

  ; if a.left < words
  CMPQ    0x8(AX), DX
  JAE     alloc

  MOVQ    AX, 0x68(SP)
  MOVQ    SI, 0x48(SP)
  MOVQ    DX, 0x40(SP)

  ; nextPow2(words)
  MOVZX   runtime.x86HasPOPCNT(SB), DI
  TESTL   DI, DI
  JE      1f
  XORL    DI, DI
  POPCNTQ DX, DI
  JMP     2f
1:
  MOVQ    DX, AX
  CALL    math/bits.OnesCount(SB)
  MOVQ    0x40(SP), DX
  MOVQ    0x48(SP), SI
  MOVQ    AX, DI
  MOVQ    0x68(SP), AX
2:
  CMPQ    DI, $0x1
  JE      1f
  BSRQ    DX, CX
  MOVQ    $-0x1, DI
  CMOVE   DI, CX
  INCQ    CX
  MOVL    $0x1, DI
  SHLQ    CL, DI
  CMPQ    CX, $0x40
  SBBQ    R8, R8
  ANDQ    R8, DI
  MOVQ    DI, DX
1:
  MOVQ    0x10(AX), CX
  SHLQ    $0x1, CX

  ; a.cap = max(minWords, a.cap*2, nextPow2(words))
  CMPQ    CX, $0x8
  MOVL    $0x8, BX
  CMOVA   CX, BX
  CMPQ    DX, BX
  CMOVA   DX, BX
  MOVQ    BX, 0x10(AX)

  ; a.next = a.allocChunk(a.cap)
  CALL    github.com/mcy/go-arena.(*Arena).allocChunk(SB)
  CMPL    runtime.writeBarrier(SB), $0x0
  JNE     1f
  MOVQ    0x68(SP), DX
  JMP     2f
1:
  CALL    runtime.gcWriteBarrier2(SB)
  MOVQ    AX, 0(R11)
  MOVQ    0x68(SP), DX
  MOVQ    0(DX), R8
  MOVQ    R8, 0x8(R11)
2:
  MOVQ    AX, 0(DX)

  ; a.left = a.cap
  MOVQ    0x10(DX), R8
  MOVQ    R8, 0x8(DX)
  MOVQ    0x28(DX), CX
  MOVQ    0x20(DX), BX
  INCQ    BX
  MOVQ    0x18(DX), R8
  CMPQ    CX, BX
  JAE     2f

  ; a.chunks = append(a.chunks, a.next)
  MOVQ    AX, 0x50(SP)
  MOVQ    R8, AX
  MOVL    $0x1, DI
  LEAQ    0x28f70(IP), SI
  CALL    runtime.growslice(SB)
  MOVQ    0x68(SP), DX
  MOVQ    CX, 0x28(DX)
  CMPL    runtime.writeBarrier(SB), $0x0
  JE      1f
  CALL    runtime.gcWriteBarrier2(SB)
  MOVQ    AX, 0(R11)
  MOVQ    0x18(DX), CX
  MOVQ    CX, 0x8(R11)
1:
  MOVQ    AX, 0x18(DX)
  MOVQ    AX, R8
  MOVQ    0x50(SP), AX
2:
  MOVQ    BX, 0x20(DX)
  CMPL    runtime.writeBarrier(SB), $0x0
  JE      1f
  CALL    runtime.gcWriteBarrier2(SB)
  MOVQ    AX, 0(R11)
  MOVQ    -0x8(R8)(BX*8), CX
  MOVQ    CX, 0x8(R11)
1:
  MOVQ    AX, -0x8(R8)(BX*8)
  MOVQ    DX, AX
  MOVQ    0x40(SP), DX
  MOVQ    0x48(SP), SI

alloc:
  ; p := a.next
  MOVQ    0(AX), CX

  ; a.next = unsafe.Add(a.next, size)
  LEAQ    0(CX)(SI*1), BX
  CMPL    runtime.writeBarrier(SB), $0x0
  JE      1f
  CALL    runtime.gcWriteBarrier2(SB)
  MOVQ    BX, 0(R11)
  MOVQ    0(AX), SI
  MOVQ    SI, 0x8(R11)
1:
  MOVQ    BX, 0(AX)

  ; a.left -= words
  LEAQ    0(CX)(SI*1), BX
  SUBQ    DX, 0x8(AX)

  ; return p
  MOVQ    CX, AX
  ADDQ    $0x58, SP
  POPQ    BP
  RET
x86 Assembly (Go Syntax)

There’s a lot going on in this function, but most of it is a mix of Go not being great at register allocation, and lots of write barriers.

A write barrier is a mechanism for synchronizing ordinary user code with the GC. Go generates code for one any time a non-pointer-free type is stored. For example, writing to a **int, *string, or *[]int requires a write barrier.

Write barriers are implemented as follows:

  1. runtime.writeBarrier is checked to determine whether the write barrier is necessary; it only is while the GC is in the mark phase. Otherwise, the branch is taken to skip the write barrier.

  2. A call to one of the runtime.gcWriteBarrierN functions happens. N is the number of pointers that the GC needs to be informed of.

  3. This function calls runtime.gcWriteBarrier, which returns a buffer into which the pointers the GC now needs to trace through should be written.

  4. The actual store happens.

Why is a write barrier required? Consider the following code.

func alloc(n **int) {
  *n = new(int)
}
Go

This function will call runtime.newobject to allocate eight bytes of memory. The resulting pointer will be returned in rax. This function then stores rax into *n and returns. If we Godbolt this function, we’ll find that it does, in fact, generate a write barrier:

TEXT x.alloc
  CMPQ    SP, 16(R14)
  JLS     growStack
  PUSHQ   BP
  MOVQ    SP, BP
  SUBQ    $16, SP

  MOVQ    AX, main.n+32(SP)

  ; new(int)
  LEAQ    type:int(SB), AX
  CALL    runtime.newobject(SB)

  MOVQ    main.n+32(SP), CX
  TESTB   AL, (CX)

  ; This is the write barrier.
  CMPL    runtime.writeBarrier(SB), $0
  JEQ     skip
  MOVQ    (CX), DX
  CALL    runtime.gcWriteBarrier2(SB)
  MOVQ    AX, (R11)
  MOVQ    DX, 8(R11)
skip:
  MOVQ    AX, (CX)  ; The actual store.

  ADDQ    $16, SP
  POPQ    BP
  RET

growStack:
  NOP
  MOVQ    AX, 8(SP)
  CALL    runtime.morestack_noctxt(SB)
  MOVQ    8(SP), AX
  JMP     x.alloc
x86 Assembly (Go Syntax)

Note that two pointers get written: the pointer returned by new(int), and the old value of *n. This ensures that regardless of where in this function the GC happens to be scanning through *n, it sees both values during the mark phase.

Now, this isn’t necessary if the relevant pointers are already reachable in some other way… which is exactly the case in our arena (thanks to the chunks slice). So the write barrier in the fast path is redundant.

But, how do we get rid of it? There is //go:nowritebarrier, but that’s not allowed outside of a list of packages allowlisted in the compiler. It also doesn’t disable write barriers; it simply generates a diagnostic if any are emitted.

But remember, write barriers only occur when storing pointer-typed memory… so we can just replace next unsafe.Pointer with next uintptr.

type Arena struct {
  next      uintptr // A real pointer!
  left, cap uintptr

  chunks []unsafe.Pointer
}

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
  mask := wordBytes - 1
  size = (size + mask) &^ mask
  words := size / wordBytes

  if a.left < words {
    a.cap = max(minWords, a.cap*2, nextPow2(words))

    p := a.allocChunk(a.cap)
    a.next = uintptr(p)
    a.left = a.cap
    a.chunks = append(a.chunks, p)
  }

  p := a.next
  a.next += size
  a.left -= words

  return unsafe.Pointer(p)
}
Go

go vet hates this, because it doesn’t know that we’re smarter than it is. Does this make the code faster? To make it a little bit more realistic, I’ve written a separate variant of the benchmarks that hammers the GC really hard in a separate G:

go func() {
  for { runtime.GC() }
}()
Go

The result indicates that this is a worthwhile optimization for churn-heavy contexts. Performance is much worse overall, but that’s because the GC is pre-empting everyone. The improvement seems to be on the order of 20% for very small allocations.

# Before
BenchmarkArena/int/arena-48         169.09 MB/s
BenchmarkArena/int/new-48           84.73 MB/s
BenchmarkArena/[2]int/arena-48      309.40 MB/s
BenchmarkArena/[2]int/new-48        120.23 MB/s
BenchmarkArena/[64]int/arena-48     1954.16 MB/s
BenchmarkArena/[64]int/new-48       950.48 MB/s
BenchmarkArena/[1024]int/arena-48   3341.13 MB/s
BenchmarkArena/[1024]int/new-48     1413.26 MB/s

# After
BenchmarkArena/int/arena-48         195.58 MB/s
BenchmarkArena/int/new-48           83.67 MB/s
BenchmarkArena/[2]int/arena-48      352.49 MB/s
BenchmarkArena/[2]int/new-48        120.13 MB/s
BenchmarkArena/[64]int/arena-48     1987.22 MB/s
BenchmarkArena/[64]int/new-48       903.78 MB/s
BenchmarkArena/[1024]int/arena-48   3342.67 MB/s
BenchmarkArena/[1024]int/new-48     1439.99 MB/s
Console

Cutting Out The Heap Entirely

Another source of slowdown is the fact that any time we allocate from the heap, it’s forced to eagerly clear the huge allocated chunk every time, because it contains pointers. If you profile this code, a ton of time is spent in runtime.memclrNoHeapPointers. Because the chunks of memory we allocate are always of a specific size, we can use an array of sync.Pools to amortize the cost of allocating and clearing chunks.

First, we need an entry in this array of pools, one for each size of chunk we allocate. Then, we need to set a finalizer on the arena to reclaim its memory once we’re done. Finally, we can change the contract of Alloc to require the caller to clear the value for us, and change New to take a value as its argument:

func New[T any](a Allocator, v T) *T {
  p := (*T)(a.Alloc(unsafe.Sizeof(v), unsafe.Alignof(v)))
  *p = v
  return p
}
Go

What’s nice about this is that it avoids having to clear memory that is about to be overwritten with a non-zero value anyway.

Putting this all together, it would look like this:

var pools [64]sync.Pool

func init() {
  for i := range pools {
    pools[i].New = func() any {
      return reflect.New(reflect.StructOf([]reflect.StructField{
        {
          Name: "A",
          Type: reflect.ArrayOf(1<<i, reflect.TypeFor[uintptr]()),
        },
        {Name: "P", Type: reflect.TypeFor[unsafe.Pointer]()},
      })).UnsafePointer()
    }
  }
}

func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
  log := bits.TrailingZeros(uint(words))
  chunk := pools[log].Get().(unsafe.Pointer)

  // Offset to the end of the chunk, and write a to it.
  end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
  *(**Arena)(end) = a

  // If this is the first chunk allocated, set a finalizer.
  if a.chunks == nil {
    runtime.SetFinalizer(a, (*Arena).finalize)
  }

  // Place the returned chunk at the offset in a.chunks that
  // corresponds to its log, so we can identify its size easily
  // in the finalizer's loop below.
  a.chunks = append(a.chunks, make([]unsafe.Pointer, log+1-len(a.chunks))...)
  a.chunks[log] = chunk

  return chunk
}

func (a *Arena) finalize() {
  for log, chunk := range a.chunks {
    if chunk == nil {
      continue
    }

    words := uintptr(1) << log
    end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
    *(**Arena)(end) = nil // Make sure that we don't leak the arena.

    pools[log].Put(chunk)
  }
}
Go

How does this perform?

BenchmarkArena/int/arena-48        1260.73 MB/s
BenchmarkArena/int/new-48          712.94 MB/s
BenchmarkArena/[2]int/arena-48     2457.27 MB/s
BenchmarkArena/[2]int/new-48       1167.57 MB/s
BenchmarkArena/[64]int/arena-48    4491.49 MB/s
BenchmarkArena/[64]int/new-48      6800.76 MB/s
BenchmarkArena/[1024]int/arena-48  3992.32 MB/s
BenchmarkArena/[1024]int/new-48    4320.65 MB/s
Console

Well. That’s a surprise. It does much better for small allocations, but it made really big allocations worse! It’s not immediately clear to me why this is, but note that new also got much faster, which tells me that because the allocations from the arena are longer-lived, the GC behaves somewhat differently, causing some of the cost from allocating really large things with new to be amortized.

Whether this optimization makes sense would require some profiling. An alternative is to manually manage arena re-use, by adding a very unsafe Reset() function that causes the arena to behave as if it was just constructed, but keeping all of its allocated chunks. This is analogous to reslicing to zero: x = x[:0].

This is very unsafe because it can lead to the same memory being handed out twice: it’s only OK if nothing allocated before the Reset() is still in use.

Implementing this is very simple.

func (a *Arena) Reset() {
  a.next, a.left, a.cap = 0, 0, 0
}

func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
  log := bits.TrailingZeros(uint(words))
  if len(a.chunks) > log && a.chunks[log] != nil {
    // If we've already allocated a chunk of this size in a previous arena
    // generation, return it.
    //
    // This relies on the fact that an arena never tries to allocate the same
    // size of chunk twice between calls to Reset().
    return a.chunks[log]
  }

  // ... snip ...
}
Go

Then, if we modify our arena benchmark to take advantage of this…

b.Run("arena", func(b *testing.B) {
  b.SetBytes(n)
  a := new(arena.Arena)
  for b.Loop() {
    a.Reset()  // Important!
    for range runs {
      sink = arena.New[T](a)
    }
  }
})
Go

What does the performance look like now?

BenchmarkArena/int/arena-48         2376.01 MB/s
BenchmarkArena/int/new-48           377.64 MB/s
BenchmarkArena/[2]int/arena-48      4314.98 MB/s
BenchmarkArena/[2]int/new-48        530.62 MB/s
BenchmarkArena/[64]int/arena-48     10496.49 MB/s
BenchmarkArena/[64]int/new-48       3959.85 MB/s
BenchmarkArena/[1024]int/arena-48   9735.19 MB/s
BenchmarkArena/[1024]int/new-48     6160.73 MB/s
Console

That’s a massive improvement! There are a couple of reasons this is faster. First, it doesn’t require waiting for the GC to collect old arenas before their memory can be reused. Second, the fast path is very fast, with no synchronization.

On the flipside, this is very dangerous: arena re-use needs to be carefully managed, because you can wind up with unique pointers that aren’t.
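
Here is a minimal sketch of that footgun, using the pool-backed arena and the value-taking New from this section:

a := new(arena.Arena)
p := arena.New(a, 1) // A fresh *int.
a.Reset()
q := arena.New(a, 2) // Re-uses the same chunk, so this is the same address.

// Likely prints "2 2 true": p was silently clobbered by the second New.
fmt.Println(*p, *q, p == q)
Go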

Realloc

Go does not offer an easy mechanism to “reallocate” an allocation, as with realloc() in C. This is because it has no mechanism for freeing pointers explicitly, which is necessary for a reallocation abstraction.

But we already don’t care about safety, so we can offer reallocation on our arena. Now, the reallocation we can offer is quite primitive: if an allocation happens to be the most recent one, we can grow it in place. Otherwise, we just allocate new memory and don’t free the old allocation.

This makes it possible to implement “arena slices” that can be constructed by appending, which will not trigger reallocation on slice growth as long as nothing else gets put on the arena.

Realloc would look something like this:

func (a *Arena) Realloc(
  ptr unsafe.Pointer,
  oldSize, newSize, align uintptr,
) unsafe.Pointer {
  mask := wordBytes - 1
  oldSize = (oldSize + mask) &^ mask
  newSize = (newSize + mask) &^ mask

  if newSize <= oldSize {
    return ptr
  }

  // Check if this is the most recent allocation. If it is,
  // we can grow in-place.
  if a.next - oldSize == uintptr(ptr) {
    // Check if we have enough space available for the
    // requisite extra space.
    need := (newSize - oldSize) / wordBytes
    if a.left >= need {
      // Grow in-place.
      a.left -= need
      return ptr
    }
  }

  // Can't grow in place, allocate new memory and copy to it.
  new := a.Alloc(newSize, align)
  copy(
    unsafe.Slice((*byte)(new), newSize),
    unsafe.Slice((*byte)(ptr), oldSize),
  )

  return new
}
Go

Then, whenever we append to our arena slice, we can call a.Realloc() to grow it. However, this does not work if the slice’s base pointer is not the original address returned by Alloc or Realloc. It is an exercise for the reader to:

  1. Implement a Slice[T] type that uses an arena for allocation (a rough sketch of the growth path is given after this list).

  2. Make this work for any value of ptr within the most recent allocation, not just the base offset. This requires extra book-keeping.
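
To get started on (1), here is a hedged sketch of what the growth path of such a slice might look like, living in the same package as Arena and reusing the Realloc above; bounds checks and a constructor that sets the arena field are left out.

// Slice is an arena-backed growable slice. It only grows cheaply while its
// backing allocation remains the most recent one on its arena.
type Slice[T any] struct {
  arena    *Arena
  data     *T
  len, cap int
}

func (s *Slice[T]) Append(v T) {
  var t T
  if s.len == s.cap {
    newCap := max(8, s.cap*2)
    p := s.arena.Realloc(
      unsafe.Pointer(s.data),
      uintptr(s.cap)*unsafe.Sizeof(t),
      uintptr(newCap)*unsafe.Sizeof(t),
      unsafe.Alignof(t),
    )
    s.data, s.cap = (*T)(p), newCap
  }

  // Write v into the next free slot.
  slot := unsafe.Add(unsafe.Pointer(s.data), uintptr(s.len)*unsafe.Sizeof(t))
  *(*T)(slot) = v
  s.len++
}
Go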

All Together

Here is the entirety of the code that we have developed, not including the reallocation function above.

package arena

import (
  "math/bits"
  "reflect"
  "runtime"
  "sync"
  "unsafe"
)

func New[T any](a *Arena, v T) *T {
  p := (*T)(a.Alloc(unsafe.Sizeof(v), unsafe.Alignof(v)))
  *p = v
  return p
}

type Arena struct {
  next      uintptr // Stored as a uintptr to suppress write barriers; see above.
  left, cap uintptr
  chunks    []unsafe.Pointer
}

const (
  maxAlign uintptr = 8 // Depends on target, this is for 64-bit.
  minWords uintptr = 8
)

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
  // First, round the size up to the alignment of every object in the arena.
  mask := maxAlign - 1
  size = (size + mask) &^ mask
  // Then, replace the size with the size in pointer-sized words. This does not
  // result in any loss of size, since size is now a multiple of the uintptr
  // size.
  words := size / maxAlign

  // Next, check if we have enough space left for this chunk. If there isn't,
  // we need to grow.
  if a.left < words {
    // Pick whichever is largest: the minimum allocation size, twice the last
    // allocation, or the next power of two after words.
    a.cap = max(minWords, a.cap*2, nextPow2(words))
    a.next = uintptr(a.allocChunk(a.cap))
    a.left = a.cap
    // Note: allocChunk records the chunk in a.chunks for us.
  }

  // Allocate the chunk by incrementing the pointer.
  p := unsafe.Pointer(a.next)
  a.next += size
  a.left -= words

  return p
}

func (a *Arena) Reset() {
  a.next, a.left, a.cap = 0, 0, 0
}

var pools [64]sync.Pool

func init() {
  for i := range pools {
    pools[i].New = func() any {
      return reflect.New(reflect.StructOf([]reflect.StructField{
        {
          Name: "X0",
          Type: reflect.ArrayOf(1<<i, reflect.TypeFor[uintptr]()),
        },
        { Name: "X1", Type: reflect.TypeFor[unsafe.Pointer]() },
      })).UnsafePointer()
    }
  }
}

func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
  log := bits.TrailingZeros(uint(words))
  if len(a.chunks) > log && a.chunks[log] != nil {
    // Re-use a chunk from a previous generation (see Reset).
    return a.chunks[log]
  }

  chunk := pools[log].Get().(unsafe.Pointer)

  // Offset to the end of the chunk, and write a to it.
  end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
  *(**Arena)(end) = a

  // If this is the first chunk allocated, set a finalizer.
  if a.chunks == nil {
    runtime.SetFinalizer(a, (*Arena).finalize)
  }

  // Place the returned chunk at the offset in a.chunks that
  // corresponds to its log, so we can identify its size easily
  // in the finalizer's loop below.
  if n := log + 1 - len(a.chunks); n > 0 {
    a.chunks = append(a.chunks, make([]unsafe.Pointer, n)...)
  }
  a.chunks[log] = chunk

  return chunk
}

func (a *Arena) finalize() {
  for log, chunk := range a.chunks {
    if chunk == nil {
      continue
    }

    words := uintptr(1) << log
    end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
    *(**Arena)(end) = nil // Make sure that we don't leak the arena.

    pools[log].Put(chunk)
  }
}

func nextPow2(n uintptr) uintptr {
  return uintptr(1) << bits.Len(uint(n))
}
Go

There are other optimizations that we could make here that I haven’t discussed. For example, arenas could be re-used; once an arena is done, it could be “reset” and placed into a sync.Pool. This arena would not need to go into the GC to request new chunks, re-using the ones previously allocated (and potentially saving on the cost of zeroing memory over and over again).
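
As a sketch of that idea (assuming the Arena above, inside the same package), pooling whole arenas might look like this:

// arenaPool recycles entire arenas, so a hot path never has to ask the GC for
// fresh chunks once the pool is warm.
var arenaPool = sync.Pool{
  New: func() any { return new(Arena) },
}

// withArena hands f a scratch arena and recycles it afterwards. Nothing f
// allocates from the arena may be retained once it returns.
func withArena(f func(*Arena)) {
  a := arenaPool.Get().(*Arena)
  defer func() {
    a.Reset()
    arenaPool.Put(a)
  }()
  f(a)
}
Go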

I did say that this relies very heavily on Go’s internal implementation details. What are the odds that they get broken in the future? Well, the requirement that allocations know their shape is forced by the existence of unsafe.Pointer, and the requirement that a pointer into any part of an allocation keeps the whole thing alive essentially comes from slices being both sliceable and mutable; once a slice escapes to the heap (and thus to multiple goroutines), coordinating copies to shrink a slice would require much more complexity than the current write barrier implementation.

And in my opinion, it’s pretty safe to say that Hyrum’s Law has us covered here. ;)

  1. Go does have some UB. For example, Go assumes that a G’s stack is never read or written to by any other G, except by the GC across a write barrier.

    That said, what UB does exist is very, very difficult to trip on purpose. 

  2. I almost exclusively refer to goroutines as Gs here, since it makes it easy to refer to Ps and Ms as needed. See https://go.dev/src/runtime/HACKING#scheduler-structures

  3. The pointer bits are in big endian order, so the first bit in left-to-right order corresponds to the first word. 

  4. The “itab”, or interface table part of an interface value is not managed by the GC; it is allocated in persistent memory, so even though it is a pointer, it is not a pointer the GC needs to care about. 

  5. Internal implementation of the chan T type, which is implemented as a *runtime.hchan. See https://cs.opensource.google/go/go/+/master:src/runtime/chan.go;l=34;drc=a204ed53d907c3b325e3c2bdd6f847a8f97e90d9

  6. This can be made better by caching the reflect.Types, but that is only a very slight improvement on the order of 1% speedup. Most of the slowdown is because Go is a bit more eager about zeroing allocations of values that contain pointers.

    var types []reflect.Type
    
    func init() {
      // Pre-allocate the whole array. There aren't that many powers
      // of two. Don't need to go beyond 1<<61, since that's about as
      // large of an allocation as Go will service (trying to create
      // a larger array will panic).
      types = make([]reflect.Type, 61)
      for i := range types {
        types[i] = reflect.StructOf([]reflect.StructField{
          {
            Name: "X0",
            Type: reflect.ArrayOf(int(1)<<i, reflect.TypeFor[uintptr]()),
          },
          { Name: "X1", Type: reflect.TypeFor[unsafe.Pointer]() },
        })
      }
    }
    
    func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
      log := bits.TrailingZeros(uint(words))
      chunk := reflect.New(types[log]).UnsafePointer()
    
      // Offset to the end of the chunk, and write a to it.
      end := unsafe.Add(chunk, words * unsafe.Sizeof(uintptr(0)))
      *(**Arena)(end) = a
    
      return chunk
    }
    Go

    However, with this in place, we can be assured that property (3) now holds, so it’s perfectly safe to place arena pointers into arena-allocated memory, so long as it’s across the same arena. 

Protobuf Tip #2: Compress Your Protos!

As a matter of fact, when compression technology came along, we thought the future in 1996 was about voice. We got it wrong. It is about voice, video, and data, and that is what we have today on these cell phones. –Steve Buyer

TL;DR: Compression is everywhere: CDNs, HTTP servers, even in RPC frameworks like Connect. This pervasiveness means that wire size tradeoffs matter less than they used to twenty years ago, when Protobuf was designed.

I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think they fit with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog.

Varints from 1998

Protobuf’s wire format is intended to be relatively small. It makes use of variable-width integers so that smaller values take up less space on the wire. Fixed width integers might be larger on the wire, but often have faster decoding times.
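
For a quick feel of the size tradeoff, Go’s encoding/binary package implements the same base-128 varint encoding that Protobuf uses on the wire:

package main

import (
  "encoding/binary"
  "fmt"
)

func main() {
  buf := make([]byte, binary.MaxVarintLen64)

  // Small values take a single byte; a full 64-bit value takes up to ten.
  fmt.Println(binary.PutUvarint(buf, 1))     // 1
  fmt.Println(binary.PutUvarint(buf, 300))   // 2
  fmt.Println(binary.PutUvarint(buf, 1<<40)) // 6
}
Go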

But what if I told you that doesn’t matter?

See, most internet traffic is compressed. Bandwidth is precious, and CDN operators don’t want to waste time sending big blobs full of zeros. There are many compression algorithms available, but the state of the art for HTTP requests (which dominate much of global internet traffic) is Brotli, an algorithm developed at Google in 2013 and standardized in IETF RFC 7932 in 2016. There is a very good chance that this article was delivered to your web browser as a Brotli-compressed blob.

Using Compression

How compression is applied in your case will vary, but both Connect RPC and gRPC support native compression. For example, Connect has an API for injecting compression providers: https://pkg.go.dev/connectrpc.com/connect#WithCompression.

Connect uses gzip by default, which uses the DEFLATE compression algorithm. Providing your own compression algorithm (such as Brotli) is pretty simple, as shown by this third-party package.

Other services may compress for you transparently. Any competent CDN will likely use Brotli (or gzip or zlib, but probably Brotli) to compress any files it serves for you. (In fact, JavaScript and HTML minimization can often be rendered irrelevant by HTTP compression, too.)

It’s important to remember that Protobuf predates pervasive compression: if it didn’t, it would almost certainly not use variable-width integers for anything. It only uses them because they offer a primitive form of compression in exchange for being slower to decode. If that tradeoff was eliminated, Protobuf would almost certainly only use fixed-width integers on the wire.

How Good Is It Really?

Let’s do some apples-to-apples comparisons. Consider the following Protobuf type.

message Foo {
	repeated int32 varints = 1;
	repeated sfixed32 fixeds = 2;
}
Protobuf

There are two fields that contain essentially the same data, which can be encoded in four different ways: each field can be encoded as an old-style repeated field or as a packed field, and the integers can be encoded as varints or as fixed32 values.

Using Protoscope, we can create some data that exercises these four cases:

# a.pb, repeated varint
1: 1
1: 2
1: 3
# ...

# b.pb, packed varint
1: {
  1 2 3
  # ...
}

# c.pb, repeated fixed32
2: 1i32
2: 2i32
2: 3i32

# d.pb, packed fixed32
2: {
  1i32 2i32 3i32
  # ...
}
Protoscope

Each blob contains the integers from 1 to 1000 encoded in different ways. I’ll compress each one using gzip, zlib, and Brotli, using their default compression levels, and arrange their sizes, in bytes, in the table below.

File   Uncompressed   gzip (DEFLATE)   zlib   Brotli
a.pb   2875           1899             1878   1094
b.pb   1877           1534             1524   885
c.pb   5005           1577             1567   1140
d.pb   4007           1440             1916   1140
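
For reference, the gzip column above can be reproduced with something like the following sketch (zlib is compress/zlib; Brotli needs a third-party encoder); it assumes the Protoscope blobs have been encoded into a.pb and friends.

package main

import (
  "bytes"
  "compress/gzip"
  "fmt"
  "os"
)

func main() {
  data, err := os.ReadFile("a.pb")
  if err != nil {
    panic(err)
  }

  var buf bytes.Buffer
  w := gzip.NewWriter(&buf) // Default compression level.
  w.Write(data)
  w.Close()

  fmt.Println(len(data), buf.Len()) // Uncompressed vs. gzip size, in bytes.
}
Go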

Compression achieves incredible results: Brotli manages to get all of the files down to around 1.1 kB, except for the packed varints, which it gets about 250 bytes smaller! Of course, that’s only because most of the values in that repeated field are small. If the values range from 100000 to 101000, b.pb and d.pb are 3006 and 4007 bytes respectively (see that d.pb’s size is unchanged!), but when compressed with brotli, the lead for b.pb starts to disappear: 1039 bytes vs. 1163 bytes. Now it’s only 120 bytes smaller.

Are Varints Still Better?

Applying compression can often have similar results to replacing everything with varints, but not identical ones: using a varint will likely always be slightly smaller, at least when using state-of-the-art compression like Brotli. But you can pretty much always assume you will be using compression anyway, if only to compress HTTP headers and other ancillary content in your request. Compression is generic and highly optimized—it applies to all data, regardless of schema, and is often far more optimized than application-level codecs like those in a Protobuf library.

Not to mention, you should definitely be compressing any large data blobs you’re storing on disk, too!

As a result, you can usually disregard many encoded-size concerns when making tradeoffs in designing a Protobuf type. Fixed integer types will decode faster, so if decoding speed is important to you but you’re worried about the size on the wire, don’t be. It’s almost certainly already taken care of at a different layer of the stack.