Smart people learn from their mistakes. But the real sharp ones learn from the mistakes of others. –Brandon Mull
TL;DR: enums inherit some unfortunate behaviors from C++. Use the Buf lint rules ENUM_VALUE_PREFIX and ENUM_ZERO_VALUE_SUFFIX to avoid this problem (they’re part of the DEFAULT category).
I’m editing a series of best-practice pieces on Protobuf, a language that I work on which has lots of evil corner cases. These are shorter than what I typically post here, but I think they fit with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog.
Protobuf’s enums define data types that represent a small set of valid values. For example, google.rpc.Code lists status codes used by various RPC frameworks, such as gRPC. Under the hood, every enum is just an int32 on the wire, although codegen backends will generate custom types and constants for the enum to make it easier to use.
Unfortunately, enums were originally designed to match C++ enums exactly, and they inadvertently replicate many of those behaviors.
If you look at the source for google.rpc.Code, and compare it to, say, google.protobuf.FieldDescriptorProto.Type, you will notice a subtle difference:
package google.rpc;

enum Code {
  OK = 0;
  CANCELLED = 1;
  UNKNOWN = 2;
  // ...
}

package google.protobuf;

message FieldDescriptorProto {
  enum Type {
    // 0 is reserved for errors.
    TYPE_DOUBLE = 1;
    TYPE_FLOAT = 2;
    TYPE_INT64 = 3;
    // ...
  }
}
Protobuf
FieldDescriptorProto.Type has values starting with TYPE_, but Code’s values don’t have a CODE_ prefix. This is because the fully-qualified name (FQN) of an enum value doesn’t include the name of the enum. That is, TYPE_DOUBLE actually refers to google.protobuf.FieldDescriptorProto.TYPE_DOUBLE. Thus, OK is not google.rpc.Code.OK, but google.rpc.OK.
This is because it matches the behavior of unscoped C++ enums. C++ is the “reference” implementation, so the language often bends for the sake of the C++ backend.
When generating code, protoc’s C++ backend emits Code as a plain, unscoped C++ enum in the google::rpc namespace. And in C++, unscoped enums don’t scope their enumerators: you write google::rpc::OK, NOT google::rpc::Code::OK.
If you know C++, you might be thinking, “why didn’t they use enum class?!” Enums were added in proto2, which was developed around 2007–2008, but Google didn’t start using C++11, which introduced enum class, until much, much later.
Now, if you’re a Go or Java programmer, you’re probably wondering why you even care about C++. Both Go and Java do scope enum values to the enum type (although Go does it in a somewhat grody way: rpcpb.Code_OK).
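For instance, protoc-gen-go scopes the values by prefixing each constant with the enum type’s name. A rough sketch of the relevant generated declarations for google.rpc.Code (abbreviated, not the exact generated file):

type Code int32

const (
    Code_OK        Code = 0
    Code_CANCELLED Code = 1
    Code_UNKNOWN   Code = 2
    // ...
)

Go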
Unfortunately, this affects name collision detection in Protobuf. You can’t give two enums in the same package a value with the same name, such as UNSPECIFIED: because the enum name is not part of the FQN for an enum value, both UNSPECIFIEDs would have the FQN myapi.v1.UNSPECIFIED, and Protobuf complains about duplicate symbols.

Hence the convention we see in FieldDescriptorProto.Type: prefix every value with the enum’s name, TYPE_DOUBLE rather than just DOUBLE.
Buf provides a lint rule to enforce this convention: ENUM_VALUE_PREFIX. Even though you might think the enum’s own name keeps its values unique, top-level enums bleed their values’ names into the containing package, so two completely unrelated enums, even ones defined in different files, can collide.
proto3 relies heavily on the concept of “zero values”: all non-message fields that are neither repeated nor optional are implicitly zero if they are not present. Thus, proto3 requires that the first value of every enum be zero.
By convention, this value shouldn’t be a meaningful value of the enum, but rather one indicating that no value was specified at all. ENUM_ZERO_VALUE_SUFFIX enforces this, with a default suffix of _UNSPECIFIED. Of course, there are situations where this might not make sense for you, and a suffix like _ZERO or _UNKNOWN might be a better fit.
It may be tempting to have a specific “good default” value for the zero value. Beware though, because that choice is forever. Picking a generic “unknown” as the default reduces the chance you’ll burn yourself.
Name prefixes and zero values also teach us an important lesson: because Protobuf names are forever, it’s really hard to fix style mistakes, especially as we collectively get better at using Protobuf.
google.rpc.Code is intended to be source-compatible with very old existing C++ code, so it throws caution to the wind. FieldDescriptorProto.Type doesn’t have a zero value because proto2, whose wire format doesn’t have the zero-value footgun, doesn’t require one. The lesson isn’t just to use Buf’s linter to avoid the known pitfalls; it’s also to remember that even APIs designed by the authors of the language contain unfixable mistakes, so, unlike in other programming languages, imitating “existing practice” isn’t always the best strategy.
Go has very simple GC semantics that the Go team is mostly stuck with, due to design decisions in the surface language.

This means that despite Go having a GC, it’s possible to do manual memory management in pure Go and in cooperation with the GC (although without any help from the runtime package). To demonstrate this, we will be building an untyped, garbage-collected arena abstraction in Go which relies on several GC implementation details.
I would never play this kind of game in Rust or C++, because LLVM is extremely intelligent and able to find all kinds of ways to break you over the course of frequent compiler upgrades. On the other hand, although Go does not promise any compatibility across versions for code that imports unsafe, in practice, two forces work against Go doing this:
Go does not attempt to define what is and isn’t allowed: unsafe lacks any operational semantics.
Go prioritizes not breaking the ecosystem; this allows us to assume that Hyrum’s Law will protect certain observable behaviors of the runtime, from which we may infer what can or cannot break easily.
This is in contrast to a high-performance native compiler like LLVM, which has a carefully defined boundary around all UB, allowing them to arbitrarily break programs that cross it (mostly) without fear of breaking the ecosystem.
Our goal is to build an arena, which is a data structure for efficient allocation of memory that has the same lifetime. This reduces pressure on the general-purpose allocator by only requesting memory in large chunks and then freeing it all at once.
For a comparison in Go, consider the following program:
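A minimal sketch of such a program (the exact listing may differ): it appends one element at a time and prints the slice’s capacity whenever it changes.

package main

import "fmt"

func main() {
    var s []int
    for i := 0; i < 200; i++ {
        before := cap(s)
        s = append(s, i)
        if cap(s) != before {
            fmt.Println(cap(s))
        }
    }
}

Go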
This program will print successive powers of 2: this is because append is implemented approximately like so:
func append[S ~[]T, T any](a, b S) S {
    // If needed, grow the allocation.
    if cap(a)-len(a) < len(b) {
        // Either double the size, or allocate just enough if doubling is
        // too little.
        newCap := max(2*cap(a), len(a)+len(b))

        // Grow a.
        a2 := make([]T, len(a), newCap)
        copy(a2, a)
        a = a2
    }

    // Increase the length of a to fit b, then write b into the freshly
    // grown region.
    a = a[:len(a)+len(b)]
    copy(a[len(a)-len(b):], b)
    return a
}
Go
For appending small pieces, make is only called O(log n) times, a big improvement over calling it for every call to append. Virtually every programming language’s dynamic array abstraction makes this optimization.
An arena generalizes this concept, but instead of resizing exponentially, it allocates new blocks and vends pointers into them. The interface we want to conform to is as follows:
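A minimal sketch of that interface, matching how Allocator.Alloc is used throughout the rest of this post:

// Allocator hands out raw memory with a caller-specified layout.
type Allocator interface {
    // Alloc returns a pointer to zeroed memory that is at least size bytes
    // long and aligned to align, which must be a power of two.
    Alloc(size, align uintptr) unsafe.Pointer
}

Go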
In goes a size and an alignment; out comes a pointer to fresh memory with that layout. Go does not have user-visible uninitialized memory, so we additionally require that the returned region be zeroed. We also require that align be a power of two.
We can give this a type-safe interface by writing a generic New function:
// New allocates a fresh zero value of type T on the given allocator, and
// returns a pointer to it.
func New[T any](a Allocator) *T {
    var t T
    p := a.Alloc(unsafe.Sizeof(t), unsafe.Alignof(t))
    return (*T)(p)
}
Go
This all feels very fine and dandy to anyone used to hurting themselves with malloc or operator new in C++, but there is a small problem. What happens when we allocate pointer-typed memory into this allocator?
// Allocate a pointer in our custom allocator, and then
// initialize it to a pointer on the Go heap.
p := New[*int](myAlloc)
*p = new(int)

runtime.GC()

**p = 42 // Use after free!
Go
Allocator.Alloc takes a size and an alignment, which is sufficient to describe the layout of any type. For example, on 64-bit systems, int and *int have the same layout: 8 bytes of size, and 8 bytes of alignment.
However, the Go GC (and garbage collectors generally) requires one additional piece of information, which sits somewhere between the layout of a value (how it is placed in memory) and the type of a value (rich information about its structure). To understand what that is, we need a brief overview of what a GC does.
For a complete overview on how to build a simple GC, take a look at a toy GC I designed some time ago: The Alkyne GC.
A garbage collector’s responsibility is to maintain a memory allocator and an accounting of:
What memory has been allocated.
Whether that memory is still in use.
Memory that is not in use can be reclaimed and marked as unallocated, for re-use.
The most popular way to accomplish this is via a “mark and sweep” architecture. The GC will periodically walk the entire object graph of the program from certain pre-determined roots; anything it finds is “marked” as alive. After a mark is complete, all other memory is “swept”, which means marking it as unallocated for future re-use, or returning it to the OS in the case of significant surplus.
The roots are typically entities that are actively being manipulated by the program. In the case of Go, this is anything currently on the stack of some G (goroutine), or anything in a global (of which there is a compile-time-known set).
The marking phase begins with stack scanning, which looks at the stack of each G and locates any pointers contained therein. The Go compiler generates metadata for each function that specifies which stack slots in a function’s frame contain pointers. All of these pointers are live by definition.
These pointers are placed into a queue, and each pointer is traced to its allocation on the heap. If the GC does not know anything about a particular address, it is discarded as foreign memory that does not need to be marked. If it does, each pointer in that allocation is pushed onto the queue if it has not already been marked as alive. The process continues until the queue is empty.
The critical step here is to take the address of some allocation and convert it into all of the pointer values within. Go has precise garbage collection, which means that it only treats things declared as pointers in the surface language as pointers: an integer that happens to look like an address will not keep an allocation alive. This results in more efficient memory usage, but trades off some more complexity in the GC.
For example, the types *int, map[int]byte, string, struct {A int; B *int} all contain at least one pointer, while int, [1000]byte, struct {X bool; F uintptr} do not. The latter are called pointer-free types.
Go enhances the layout of a type into a shape by adding a bitset that specifies which pointer-aligned, pointer-sized words of the type’s memory region contain a pointer. These are called the pointer bits. For example, here are the shapes of a few Go types on a 64-bit system.
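For illustration (written out by hand here, not dumped from the runtime), on a 64-bit target the pointer bits line up with each 8-byte word like so, with the first bit corresponding to the first word:

var (
    _ *int                    // 1 word,  pointer bits: 1
    _ string                  // 2 words, pointer bits: 10 (data pointer, length)
    _ []int                   // 3 words, pointer bits: 100 (data pointer, len, cap)
    _ struct{ A int; B *int } // 2 words, pointer bits: 01
    _ [4]uint64               // 4 words, pointer bits: 0000 (pointer-free)
)

Go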
In the Go GC, each allocation is tagged with its shape (this is done in a variety of ways: through an explicit header on the allocation itself (a “malloc header”), a runtime type stored in the allocation’s runtime.mspan, or another mechanism). When scanning a value, the GC uses this information to determine where the pointers to scan through are.
The most obvious problem with our Allocator.Alloc signature is that it does not discriminate shapes, so it cannot safely allocate memory that contains pointers: the GC will not be able to find those pointers, and will free whatever they point to prematurely!
In our example where we allocated an *int in our custom allocator, we wind up with a **int on the stack. You would think that Go would simply trace through the first * to find an *int and mark it as being alive, but that is not what happens! Go instead finds a pointer into some chunk that the custom allocator grabbed from the heap, which is missing the pointer bits of its shape!
Why does Go not look at the type of the pointer it steps through? Two reasons.
All pointers in Go are untyped from the runtime’s perspective; every *T gets erased into an unsafe.Pointer. This allows much of the Go runtime to be “generic” without using actual generics.
Pointee metadata can be aggregated, so that each pointer to an object does not have to remember its type at runtime.
The end result for us is that we can’t put pointers on the arena. This makes our New API unsafe, especially since Go does not provide a standard constraint for marking generic parameters as pointer-free: unsurprisingly, they don’t expect most users to care about such a detail.
It is possible to deduce the pointer bits of a type using reflection, but that’s very slow, and the whole point of using arenas is to go fast. As we design our arena, though, it will become clear that there is a safe way to have pointers on it.
Now that we have a pretty good understanding about what the Go GC is doing, we can go about designing a fast arena structure.
The ideal case is that a call to Alloc is very fast: just offsetting a pointer in the common case. One assumption we can make off the bat is that all memory can be forced to have maximum alignment: most objects are a pointer or larger, and Go has a maximum alignment for ordinary user types, so we can ignore the align parameter and always align to, say, 8 bytes. This means that the pointer to the next unallocated chunk will always be well-aligned. Thus, we might come up with a structure like this one:
type Arena struct {
    next      unsafe.Pointer
    left, cap uintptr
}

const (
    // Power of two size of the minimum allocation granularity.
    wordBytes = 8 // Depends on target, this is for 64-bit.

    minWords = 8
)

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
    // First, round the size up to the alignment of every object in
    // the arena.
    mask := wordBytes - 1
    size = (size + mask) &^ mask

    // Then, replace the size with the size in pointer-sized words.
    // This does not result in any loss of size, since size is now
    // a multiple of the uintptr size.
    words := size / wordBytes

    // Next, check if we have enough space left for this chunk. If
    // there isn't, we need to grow.
    if a.left < words {
        // Pick whichever is largest: the minimum allocation size,
        // twice the last allocation, or the next power of two
        // after words.
        a.cap = max(minWords, a.cap*2, nextPow2(words))
        a.next = unsafe.Pointer(unsafe.SliceData(make([]uintptr, a.cap)))
        a.left = a.cap
    }

    // Allocate the chunk by incrementing the pointer.
    p := a.next
    a.left -= words
    if a.left > 0 {
        a.next = unsafe.Add(a.next, size)
    } else {
        // Beware, offsetting to one-past-the-end is one of the few
        // things explicitly not allowed by Go.
        a.next = nil
    }

    return p
}

// nextPow2 returns the smallest power of two greater than n.
func nextPow2(n uintptr) uintptr {
    return uintptr(1) << bits.Len(uint(n))
}
Go
How fast is this really? Here’s a simple benchmark for it.
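A sketch of what that benchmark can look like, assuming the Arena and New defined above and the usual testing import (the element size, count, and names here are illustrative assumptions, not the original listing):

// sink forces new(T) to escape so the allocation isn't optimized away.
var sink *[16]byte

func BenchmarkArena(b *testing.B) {
    const perOp = 1000
    b.SetBytes(perOp * 16)
    for b.Loop() {
        a := new(Arena)
        for range perOp {
            _ = New[[16]byte](a)
        }
    }
}

func BenchmarkNew(b *testing.B) {
    const perOp = 1000
    b.SetBytes(perOp * 16)
    for b.Loop() {
        for range perOp {
            sink = new([16]byte)
        }
    }
}

Go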
The focus of this benchmark is to measure the cost of allocating many objects of the same size. The number of times the for b.Loop() loop executes is unknown, and determined by the benchmarking framework to try to reduce statistical anomalies. This means that if we instead benchmarked just a single allocation per iteration, the result would be very sensitive to the number of runs.
We also use b.SetBytes to get a throughput measurement out of the benchmark. This is a bit easier to interpret than the gross ns/op the benchmark would otherwise produce: it tells us how much memory each allocator can allocate per unit time.
We want to compare against new, but just writing _ = new(T) will get optimized out, since the resulting pointer does not escape. Writing it to a global is sufficient to convince Go that it escapes.
Here are the results, abbreviated to show only the bytes per second. All benchmarks were performed on my AMD Ryzen Threadripper 3960X. Larger is better.
This is quite nice, and certainly worth pursuing! The performance increase seems to scale up with the amount of memory allocated, for a 2x-4x improvement across different cases.
Now we need to contend with the fact that our implementation is completely broken if we want to have pointers in it.
In (*Arena).Alloc, when we assign a freshly-allocated chunk, we overwrite a.next, dropping our reference to the previous chunk, which means the GC can reclaim it. But this is fine: as long as pointers into that old chunk are alive, the GC will not free it, independently of the arena. So it seems like we don’t need to worry about it?
However, the whole point of an arena is to allocate lots of memory that has the same lifetime. This is common for graph data structures, such as an AST or a compiler IR, where a program allocates a huge amount of data, works with it for a while, and then throws the whole thing away.
We are not allowed to put pointers in the arena, because they would disappear from the view of the GC and their referents would be freed too soon. But if a pointer wants to go in an arena, whatever it points to must live at least as long as the whole arena: it must live as long as the part of the arena it’s stored in, and the arena is meant to have a single lifetime throughout.
In particular, if we could make it so that holding any pointer returned by Alloc prevents the entire arena from being swept by the GC, the arena can safely contain pointers into itself! Consider this:
We have a pointer p **int. It is allocated on some arena a.
The GC sees our pointer (as a type-erased unsafe.Pointer) and marks its allocation as live.
Somehow, the GC also marks a as alive as a consequence.
Somehow, the GC then marks every chunk a has allocated as alive.
Therefore, the chunk that *p points to is also alive, so *p does not need to be marked directly, and will not be freed early.
Step (3) is crucial. By forcing the whole arena to be marked, any pointers stored in the arena that point back into the arena will be kept alive automatically, without the GC needing to know how to scan for them.
So, even though *New[*int](a) = new(int) is still going to result in a use-after-free, *New[*int](a) = New[int](a) would not! This small improvement does not make arenas themselves safe, but a data structure with an internal arena can be completely safe, so long as the only pointers that go into the arena are from the arena itself.
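In code, the difference between the two situations looks like this:

p := New[*int](a)
*p = New[int](a) // OK: the pointee lives in the same arena as p.

q := New[*int](a)
*q = new(int) // Still a use-after-free: the pointee is on the ordinary Go
              // heap, and nothing the GC can see points to it.

Go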
How can we make this work? The easy part is (4), which we can implement by adding a []unsafe.Pointer to the arena, and sticking every pointer we allocate into it.
type Arena struct {
    next      unsafe.Pointer
    left, cap uintptr

    chunks []unsafe.Pointer // New field.
}

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
    // ... snip ...

    if a.left < words {
        // Pick whichever is largest: the minimum allocation size,
        // twice the last allocation, or the next power of two
        // after words.
        a.cap = max(minWords, a.cap*2, nextPow2(words))
        a.next = unsafe.Pointer(unsafe.SliceData(make([]uintptr, a.cap)))
        a.left = a.cap

        a.chunks = append(a.chunks, a.next)
    }

    // ... snip ...
}
Go
The cost of the append is amortized: to allocate n bytes, we wind up allocating an additional O(log log n) times. But what does this do to our benchmarks?
Now that the arena does not discard any allocated memory, we can focus on condition (3): making it so that if any pointer returned by Alloc is alive, then so is the whole arena.
Here we can make use of an important property of how Go’s GC works: any pointer into an allocation will keep it alive, as well as anything reachable from that pointer. But the chunks we’re allocating are []uintptrs, which will not be scanned. If each chunk somehow contained a single pointer that the GC would scan, we could stick the *Arena pointer a there, so that whenever anything Alloc returns is scanned, a would be marked as alive.
So far, we have been allocating our [N]uintptr chunks using make([]uintptr, N), but what we would actually like to allocate is struct { A [N]uintptr; P unsafe.Pointer }, where N is some dynamic value.
In its infinite wisdom, the Go standard library actually gives us a dedicated mechanism for this: reflect.StructOf. This can be used to construct arbitrary anonymous struct types at runtime, which we can then allocate on the heap.
So, instead of calling make, we might call this function:
func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
    chunk := reflect.New(reflect.StructOf([]reflect.StructField{
        {
            Name: "X0",
            Type: reflect.ArrayOf(int(words), reflect.TypeFor[uintptr]()),
        },
        {Name: "X1", Type: reflect.TypeFor[unsafe.Pointer]()},
    })).UnsafePointer()

    // Offset to the end of the chunk, and write a to it.
    end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
    *(**Arena)(end) = a

    return chunk
}
Go
This appears to have a minor but noticeable effect on performance.
Looking back at Arena.Alloc, the end of this function has a branch:
func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
    // ... snip ...

    // Allocate the chunk by incrementing the pointer.
    p := a.next
    a.left -= words
    if a.left > 0 {
        a.next = unsafe.Add(a.next, size)
    } else {
        // Beware, offsetting to one-past-the-end is one of the few
        // things explicitly not allowed by Go.
        a.next = nil
    }

    return p
}
Go
This is the absolute hottest part of allocation, since it is executed every time we call this function. The branch is a bit unfortunate, but it’s necessary, as noted by the comment.
In C++, if we have an array of int with n elements in it, and int* p is a pointer to the start of the array, p + n is a valid pointer, even though it can’t be dereferenced; it points “one past the end” of the array. This is a useful construction, since, for example, you can use it to eliminate a loop induction variable:
// Naive for loop, has an induction variable i.
for (int i = 0; i < n; i++) {
    do_something(p[i]);
}

// Faster: avoids the extra variable increment in the loop
// body for doing p[i].
for (auto end = p + n; p < end; p++) {
    do_something(*p);
}
C++
Go, however, gets very upset if you do this, because it confuses the garbage collector. The GC can’t tell the difference between a one-past-the-end pointer for allocation A, and for the start of allocation B immediately after it. At best this causes memory to stay alive for longer, and at worst it triggers safety interlocks in the GC. The GC will panic if it happens to scan a pointer for an address that it knows has been freed.
But in our code above, every chunk now has an extra element at the very end that is not used for allocation, so we can have a pointer that is one-past-the-end of the [N]uintptr that we are vending memory from.
The updated allocation function would look like this:
func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
    // ... snip ...

    // Allocate the chunk by incrementing the pointer.
    p := a.next
    a.next = unsafe.Add(a.next, size)
    a.left -= words
    return p
}
Go
Notably, we do not replace a.left with an end pointer: the if a.left < words check needs to know how much space remains, so we would have to perform an equivalent subtraction to answer it anyway.
How much faster does this make the hot path? Remarkably, not very! The improvement is on the order of one or two percentage points, because the branch we deleted is extremely predictable. And because Go’s codegen is relatively mediocre, the effect of removing highly predictable branches (assuming Go actually schedules them correctly 🙄) is quite minor.
Turns out there’s a bigger improvement we can make.
There’s a lot going on in the compiled code for Alloc, but most of it is a mix of Go not being great at register allocation, and lots of write barriers.
A write barrier is a mechanism for synchronizing ordinary user code with the GC. Go generates code for one any time a non-pointer-free type is stored. For example, writing to a **int, *string, or *[]int requires a write barrier.
Write barriers are implemented as follows:
runtime.writeBarrier is checked to determine whether the write barrier is necessary, which is only the case while the GC is in its mark phase. If it isn’t, a branch skips over the write barrier code.
A call to one of the runtime.gcWriteBarrierN functions happens. N is the number of pointers that the GC needs to be informed of.
This function calls runtime.gcWriteBarrier, which returns a buffer into which the pointers that the GC now needs to trace through should be written.
The actual store happens.
A write barrier is required in a case like the following:
func alloc(n **int) {
    *n = new(int)
}
Go
This function calls runtime.newobject to allocate eight bytes of memory; the resulting pointer is returned in rax. The function then stores rax through n and returns. If we Godbolt this function, we’ll find that it does, in fact, generate a write barrier:
TEXT x.alloc
  CMPQ  SP, 16(R14)
  JLS   growStack
  PUSHQ BP
  MOVQ  SP, BP
  SUBQ  $16, SP
  MOVQ  AX, main.n+32(SP)

  ; new(int)
  LEAQ  type:int(SB), AX
  CALL  runtime.newobject(SB)

  MOVQ  main.n+32(SP), CX
  TESTB AL, (CX)

  ; This is the write barrier.
  CMPL  runtime.writeBarrier(SB), $0
  JEQ   skip
  MOVQ  (CX), DX
  CALL  runtime.gcWriteBarrier2(SB)
  MOVQ  AX, (R11)
  MOVQ  DX, 8(R11)

skip:
  MOVQ  AX, (CX)  ; The actual store.
  ADDQ  $16, SP
  POPQ  BP
  RET

growStack:
  NOP
  MOVQ  AX, 8(SP)
  CALL  runtime.morestack_noctxt(SB)
  MOVQ  8(SP), AX
  JMP   x.alloc
x86 Assembly (Go Syntax)
Note that two pointers get written: the pointer returned by new(int), and the old value of *n. This ensures that regardless of where in this function the GC happens to be scanning through *n, it sees both values during the mark phase.
Now, this isn’t necessary if the relevant pointers are already reachable in some other way… which is exactly the case in our arena (thanks to the chunks slice). So the write barrier in the fast path is redundant.
But how do we get rid of it? There is //go:nowritebarrier, but that’s not allowed outside of a list of packages allowlisted in the compiler. It also doesn’t disable write barriers; it simply produces a diagnostic if any are emitted.
But remember, write barriers only occur when storing pointer-typed memory… so we can just replace next unsafe.Pointer with next uintptr.
type Arena struct {
    next      uintptr // A real pointer!
    left, cap uintptr

    chunks []unsafe.Pointer
}

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
    mask := wordBytes - 1
    size = (size + mask) &^ mask
    words := size / wordBytes

    if a.left < words {
        a.cap = max(minWords, a.cap*2, nextPow2(words))
        p := a.allocChunk(a.cap)
        a.next = uintptr(p)
        a.left = a.cap

        a.chunks = append(a.chunks, p)
    }

    p := a.next
    a.next += size
    a.left -= words
    return unsafe.Pointer(p)
}
Go
go vet hates this, because it doesn’t know that we’re smarter than it is. Does this make the code faster? To make things a little more realistic, I’ve written a separate variant of the benchmarks that hammers the GC really hard from a separate G:
go func() {
    for {
        runtime.GC()
    }
}()
Go
The result indicates that this is a worthwhile optimization for churn-heavy contexts. Performance is much worse overall, but that’s because the GC is pre-empting everyone. The improvement seems to be on the order of 20% for very small allocations.
Another source of slowdown is the fact that any time we allocate from the heap, it’s forced to eagerly clear the huge allocated chunk every time, because it contains pointers. If you profile this code, a ton of time is spent in runtime.memclrNoHeapPointers. Because the chunks of memory we allocate are always of a specific size, we can use an array of sync.Pools to amortize the cost of allocating and clearing chunks.
First, we need an entry in this array of pools, one for each size of chunk we allocate. Then, we need to set a finalizer on the arena to reclaim its memory once we’re done with it. Finally, we can change the contract of Alloc to require the caller to clear the returned memory, and change New to take a value as its argument.
What’s nice about this is that it avoids having to clear the memory when a non-zero value is going to be written into it anyway.
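Concretely, the value-taking New (as it appears in the full listing at the end) becomes:

// New allocates space for a T on the arena and writes v into it, so
// Alloc no longer needs to hand back zeroed memory.
func New[T any](a *Arena, v T) *T {
    p := (*T)(a.Alloc(unsafe.Sizeof(v), unsafe.Alignof(v)))
    *p = v
    return p
}

Go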
Putting this all together, it would look like this:
var pools [64]sync.Pool

func init() {
    for i := range pools {
        pools[i].New = func() any {
            return reflect.New(reflect.StructOf([]reflect.StructField{
                {
                    Name: "A",
                    Type: reflect.ArrayOf(1<<i, reflect.TypeFor[uintptr]()),
                },
                {Name: "P", Type: reflect.TypeFor[unsafe.Pointer]()},
            })).UnsafePointer()
        }
    }
}

func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
    log := bits.TrailingZeros(uint(words))
    chunk := pools[log].Get().(unsafe.Pointer)

    // Offset to the end of the chunk, and write a to it.
    end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
    *(**Arena)(end) = a

    // If this is the first chunk allocated, set a finalizer.
    if a.chunks == nil {
        runtime.SetFinalizer(a, (*Arena).finalize)
    }

    // Place the returned chunk at the offset in a.chunks that
    // corresponds to its log, so we can identify its size easily
    // in the finalizer's loop.
    a.chunks = append(a.chunks, make([]unsafe.Pointer, log+1-len(a.chunks))...)
    a.chunks[log] = chunk

    return chunk
}

func (a *Arena) finalize() {
    for log, chunk := range a.chunks {
        if chunk == nil {
            continue
        }

        words := uintptr(1) << log
        end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
        *(**Arena)(end) = nil // Make sure that we don't leak the arena.

        pools[log].Put(chunk)
    }
}
Well. That’s a surprise. It does much better for small allocations, but it made really big allocations worse! It’s not immediately clear to me why this is, but note that new also got much faster, which tells me that because the allocations from the arena are longer-lived, the GC behaves somewhat differently, causing some of the cost from allocating really large things with new to be amortized.
Whether this optimization makes sense would require some profiling. An alternative is to manually manage arena re-use, by adding a very unsafe Reset() function that causes the arena to behave as if it was just constructed, but keeping all of its allocated chunks. This is analogous to reslicing to zero: x = x[:0].
This is very unsafe because it can lead to the same memory being handed out twice: that’s only OK if nothing allocated before the Reset is still in use.
Implementing this is very simple.
func (a *Arena) Reset() {
    a.next, a.left, a.cap = 0, 0, 0
}

func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
    log := bits.TrailingZeros(uint(words))
    if len(a.chunks) > log {
        // If we've already allocated a chunk of this size in a previous arena
        // generation, return it.
        //
        // This relies on the fact that an arena never tries to allocate the same
        // size of chunk twice between calls to Reset().
        return a.chunks[log]
    }

    // ... snip ...
}
Go
Then, if we modify our arena benchmark to take advantage of this…
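For example, reusing a single arena across iterations of the earlier benchmark sketch (again, the shape and names here are assumptions, not the original listing):

func BenchmarkArenaReset(b *testing.B) {
    const perOp = 1000
    b.SetBytes(perOp * 16)

    a := new(Arena) // One long-lived arena for the whole benchmark.
    for b.Loop() {
        a.Reset() // Reuse the chunks allocated by previous iterations.
        for range perOp {
            _ = New(a, [16]byte{})
        }
    }
}

Go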
That’s a massive improvement! There are a couple of reasons this is faster. First, it doesn’t require waiting for the GC to collect old arenas before their memory can be reused. Second, the fast path is very fast, with no synchronization.
On the flipside, this is very dangerous: arena re-use needs to be carefully managed, because you can wind up with unique pointers that aren’t.
Go does not offer an easy mechanism to “reallocate” an allocation, as with realloc() in C. This is because it has no mechanism for freeing pointers explicitly, which is necessary for a reallocation abstraction.
But we already don’t care about safety, so we can offer reallocation on our arena. Now, the reallocation we can offer is quite primitive: if an allocation happens to be the most recent one, we can grow it in place. Otherwise we just allocate new memory and don’t free the old.
This makes it possible to implement “arena slices” that can be constructed by appending, which will not trigger reallocation on slice growth as long as nothing else gets put on the arena.
Realloc would look something like this:
func (a *Arena) Realloc(
    ptr unsafe.Pointer,
    oldSize, newSize, align uintptr,
) unsafe.Pointer {
    mask := wordBytes - 1
    oldSize = (oldSize + mask) &^ mask
    newSize = (newSize + mask) &^ mask
    if newSize <= oldSize {
        return ptr
    }

    // Check if this is the most recent allocation. If it is,
    // we can grow in-place.
    if a.next-oldSize == uintptr(ptr) {
        // Check if we have enough space available for the
        // requisite extra space.
        need := (newSize - oldSize) / wordBytes
        if a.left >= need {
            // Grow in-place: claim the extra words and advance the
            // bump pointer past the grown allocation.
            a.left -= need
            a.next += newSize - oldSize
            return ptr
        }
    }

    // Can't grow in place, allocate new memory and copy to it.
    new := a.Alloc(newSize, align)
    copy(
        unsafe.Slice((*byte)(new), newSize),
        unsafe.Slice((*byte)(ptr), oldSize),
    )
    return new
}
Go
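As a usage sketch (the buffer bookkeeping here is hypothetical and lives with the caller, not the arena), appending a byte to an arena-backed buffer might look like this:

// appendByte pushes b onto an arena-backed buffer, growing it via
// Realloc when it runs out of capacity. n is the current length and
// c the current capacity, both tracked by the caller.
func appendByte(a *Arena, buf unsafe.Pointer, n, c uintptr, b byte) (unsafe.Pointer, uintptr, uintptr) {
    if n == c {
        newCap := max(2*c, 8)
        buf = a.Realloc(buf, c, newCap, 1)
        c = newCap
    }
    *(*byte)(unsafe.Add(buf, n)) = b
    return buf, n + 1, c
}

Go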
Then, whenever we append to our arena slice, we can call a.Realloc() to grow it. However, this does not work if the slice’s base pointer is not the original address returned by Alloc or Realloc. It is an exercise for the reader to:
Implement a Slice[T] type that uses an arena for allocation.
Make this work for any value of ptr within the most recent allocation, not just the base offset. This requires extra book-keeping.
Here is the entirety of the code that we have developed, not including the reallocation function above.
package arena

import (
    "math/bits"
    "reflect"
    "runtime"
    "sync"
    "unsafe"
)

func New[T any](a *Arena, v T) *T {
    p := (*T)(a.Alloc(unsafe.Sizeof(v), unsafe.Alignof(v)))
    *p = v
    return p
}

type Arena struct {
    next      unsafe.Pointer
    left, cap uintptr

    chunks []unsafe.Pointer
}

const (
    maxAlign uintptr = 8 // Depends on target, this is for 64-bit.
    minWords uintptr = 8
)

func (a *Arena) Alloc(size, align uintptr) unsafe.Pointer {
    // First, round the size up to the alignment of every object in the arena.
    mask := maxAlign - 1
    size = (size + mask) &^ mask

    // Then, replace the size with the size in pointer-sized words. This does not
    // result in any loss of size, since size is now a multiple of the uintptr
    // size.
    words := size / maxAlign

    // Next, check if we have enough space left for this chunk. If there isn't,
    // we need to grow.
    if a.left < words {
        // Pick whichever is largest: the minimum allocation size, twice the last
        // allocation, or the next power of two after words.
        a.cap = max(minWords, a.cap*2, nextPow2(words))
        a.next = a.allocChunk(a.cap)
        a.left = a.cap
        // allocChunk records the chunk in a.chunks itself.
    }

    // Allocate the chunk by incrementing the pointer.
    p := a.next
    a.next = unsafe.Add(a.next, size)
    a.left -= words
    return p
}

func (a *Arena) Reset() {
    a.next, a.left, a.cap = nil, 0, 0
}

var pools [64]sync.Pool

func init() {
    for i := range pools {
        pools[i].New = func() any {
            return reflect.New(reflect.StructOf([]reflect.StructField{
                {
                    Name: "X0",
                    Type: reflect.ArrayOf(1<<i, reflect.TypeFor[uintptr]()),
                },
                {Name: "X1", Type: reflect.TypeFor[unsafe.Pointer]()},
            })).UnsafePointer()
        }
    }
}

func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
    log := bits.TrailingZeros(uint(words))
    if len(a.chunks) > log {
        // Already allocated in a previous arena generation; reuse it.
        return a.chunks[log]
    }

    chunk := pools[log].Get().(unsafe.Pointer)

    // Offset to the end of the chunk, and write a to it.
    end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
    *(**Arena)(end) = a

    // If this is the first chunk allocated, set a finalizer.
    if a.chunks == nil {
        runtime.SetFinalizer(a, (*Arena).finalize)
    }

    // Place the returned chunk at the offset in a.chunks that
    // corresponds to its log, so we can identify its size easily
    // in the finalizer's loop.
    a.chunks = append(a.chunks, make([]unsafe.Pointer, log+1-len(a.chunks))...)
    a.chunks[log] = chunk

    return chunk
}

func (a *Arena) finalize() {
    for log, chunk := range a.chunks {
        if chunk == nil {
            continue
        }

        words := uintptr(1) << log
        end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
        *(**Arena)(end) = nil // Make sure that we don't leak the arena.

        pools[log].Put(chunk)
    }
}

func nextPow2(n uintptr) uintptr {
    return uintptr(1) << bits.Len(uint(n))
}
Go
There are other optimizations that we could make here that I haven’t discussed. For example, arenas could be re-used; once an arena is done, it could be “reset” and placed into a sync.Pool. This arena would not need to go into the GC to request new chunks, re-using the ones previously allocated (and potentially saving on the cost of zeroing memory over and over again).
I did say that this relies very heavily on Go’s internal implementation details. What are the odds that they break in the future? Well, the requirement that allocations know their shape is forced by the existence of unsafe.Pointer, and the requirement that a pointer into any part of an allocation keeps the whole thing alive essentially comes from slices being both sliceable and mutable; once a slice escapes to the heap (and thus to multiple goroutines), coordinating copies for shrinking a slice would require much more complexity than the current write barrier implementation.
And in my opinion, it’s pretty safe to say that Hyrum’s Law has us covered here. ;)
Go does have some UB. For example, Go assumes that a G’s stack is never read or written to by any other G, except by the GC across a write barrier.
That said, what UB does exist is very, very difficult to trip on purpose. ↩
The pointer bits are in big endian order, so the first bit in left-to-right order corresponds to the first word. ↩
The “itab”, or interface table part of an interface value is not managed by the GC; it is allocated in persistent memory, so even though it is a pointer, it is not a pointer the GC needs to care about. ↩
This can be made better by caching the reflect.Types, but that is only a very slight improvement on the order of 1% speedup. Most of the slowdown is because Go is a bit more eager about zeroing allocations of values that contain pointers.
var types []reflect.Type

func init() {
    // Pre-allocate the whole array. There aren't that many powers
    // of two. Don't need to go beyond 1<<61, since that's about as
    // large of an allocation as Go will service (trying to create
    // a larger array will panic).
    types = make([]reflect.Type, 61)
    for i := range types {
        types[i] = reflect.StructOf([]reflect.StructField{
            {
                Name: "X0",
                Type: reflect.ArrayOf(int(1)<<i, reflect.TypeFor[uintptr]()),
            },
            {Name: "X1", Type: reflect.TypeFor[unsafe.Pointer]()},
        })
    }
}

func (a *Arena) allocChunk(words uintptr) unsafe.Pointer {
    log := bits.TrailingZeros(uint(words))
    chunk := reflect.New(types[log]).UnsafePointer()

    // Offset to the end of the chunk, and write a to it.
    end := unsafe.Add(chunk, words*unsafe.Sizeof(uintptr(0)))
    *(**Arena)(end) = a

    return chunk
}
Go
However, with this in place, we can be assured that property (3) now holds, so it’s perfectly safe to place arena pointers into arena-allocated memory, so long as it’s across the same arena. ↩
As a matter of fact, when compression technology came along, we thought the future in 1996 was about voice. We got it wrong. It is about voice, video, and data, and that is what we have today on these cell phones. –Steve Buyer
TL;DR: Compression is everywhere: CDNs, HTTP servers, even in RPC frameworks like Connect. This pervasiveness means that wire size tradeoffs matter less than they used to twenty years ago, when Protobuf was designed.
I’m editing a series of best-practice pieces on Protobuf, a language that I work on which has lots of evil corner cases. These are shorter than what I typically post here, but I think they fit with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog.
Protobuf’s wire format is intended to be relatively small. It makes use of variable-width integers so that smaller values take up less space on the wire. Fixed width integers might be larger on the wire, but often have faster decoding times.
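To make that concrete, here’s a small sketch using the protowire package from google.golang.org/protobuf to compare the two encodings (the values chosen are arbitrary): small values take one or two bytes as varints, while fixed32 always takes four.

package main

import (
    "fmt"

    "google.golang.org/protobuf/encoding/protowire"
)

func main() {
    for _, v := range []uint64{1, 300, 1_000_000_000} {
        varint := protowire.AppendVarint(nil, v)
        fixed := protowire.AppendFixed32(nil, uint32(v))
        fmt.Printf("%d: varint = %d bytes, fixed32 = %d bytes\n", v, len(varint), len(fixed))
    }
}

Go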
But what if I told you that doesn’t matter?
See, most internet traffic is compressed. Bandwidth is precious, and CDN operators don’t want to waste time sending big blobs full of zeros. There are many compression algorithms available, but the state of the art for HTTP traffic (which makes up much of global internet traffic) is Brotli, an algorithm developed at Google in 2013 and standardized as IETF RFC 7932 in 2016. There is a very good chance that this article was delivered to your web browser as a Brotli-compressed blob.
How compression is applied in your case will vary, but both Connect RPC and gRPC support native compression. For example, Connect has an API for injecting compression providers: https://pkg.go.dev/connectrpc.com/connect#WithCompression.
Connect uses gzip by default, which uses the DEFLATE compression algorithm. Providing your own compression algorithm (such as Brotli) is pretty simple, as shown by this third-party package.
Other services may compress for you transparently. Any competent CDN will likely use Brotli (or gzip or zlib, but probably Brotli) to compress any files it serves for you. (In fact, JavaScript and HTML minification can often be rendered irrelevant by HTTP compression, too.)
It’s important to remember that Protobuf predates pervasive compression: if it didn’t, it would almost certainly not use variable-width integers for anything. It only uses them because they offer a primitive form of compression in exchange for being slower to decode. If that tradeoff was eliminated, Protobuf would almost certainly only use fixed-width integers on the wire.
Consider a message with two fields that contain essentially the same data, which can be encoded in four different ways: as old-style (unpacked) repeated fields or as packed fields, with the integers encoded either as varints or as fixed32 values.
Using Protoscope, we can create some data that exercises these four cases. Each blob contains the integers from 1 to 1000, encoded in one of the four ways. I’ll compress each one using gzip, zlib, and Brotli at their default compression levels, and arrange their sizes, in bytes, in the table below.
File    Uncompressed    gzip (DEFLATE)    zlib    Brotli
a.pb    2875            1899              1878    1094
b.pb    1877            1534              1524    885
c.pb    5005            1577              1567    1140
d.pb    4007            1440              1916    1140
Compression achieves incredible results: Brotli manages to get all of the files down to around 1.1 kB, except for the packed varints, which it gets about 250 bytes smaller! Of course, that’s only because most of the values in that repeated field are small. If the values instead range from 100000 to 101000, b.pb and d.pb are 3006 and 4007 bytes respectively (note that d.pb’s size is unchanged!), but when compressed with Brotli, the lead for b.pb starts to disappear: 1039 bytes vs. 1163 bytes. Now it’s only about 120 bytes smaller.
Applying compression can often have similar results to replacing everything with varints, but not exactly: using a varint will likely always be slightly smaller, at least when using state-of-the-art compression like Brotli. But you can pretty much always assume you will be using compression, such as to compress HTTP headers and other ancillary content in your request. Compression is generic and highly optimized—it applies to all data, regardless of schema, and is often far more optimized than application-level codecs like those in a Protobuf library.
Not to mention, you should definitely be compressing any large data blobs you’re storing on disk, too!
As a result, you can usually disregard encoded-size concerns when making tradeoffs in designing a Protobuf type. Fixed integer types decode faster, so if decoding speed is important to you but you’re worried about size on the wire, don’t be. It’s almost certainly already taken care of at a different layer of the stack.