Default Methods in Go

Go’s interfaces are very funny. Rather than being explicitly implemented, like in Java or Rust, they are simply a collection of methods (a “method set”) that the concrete type must happen to have. This is called structural typing, which is the opposite of nominal typing.

Go interfaces are very cute, but this conceptual simplicity leads to a lot of implementation problems (a theme with Go, honestly). It removes a lot of intentionality from implementing interfaces, and there is no canonical way to document that A satisfies1 B, nor can you avoid conforming to interfaces, especially if one forces a particular method on you.

It also has very quirky results for the language runtime. To cast an interface value to another interface type (via the type assertion syntax a.(B)), the runtime essentially has to use reflection to go through the method set of the concrete type of a. I go into detail on how this is implemented here.

Because of their structural nature, this also means that you can’t add new methods to an interface without breaking existing code, because there is no way to attach default implementations to interface methods. This results in very silly APIs because someone screwed up an interface.

flag.Value is a Mess

For example, in the standard library’s package flag, the interface flag.Value represents a value which can be parsed as a CLI flag. It looks like this:

type Value interface {
  // Get a string representation of the value.
  String() string

  // Parse a value from a string, possibly returning an error.
  Set(string) error
}
Go

flag.Value also has an optional method, which is only specified in the documentation. If the concrete type happens to provide IsBoolFlag() bool, it will be queried to determine whether the flag should have bool-like behavior. Essentially, this means that something like this exists in the flag package:

var isBool bool
if b, ok := value.(interface{ IsBoolFlag() bool }); ok {
  isBool = b.IsBoolFlag()
}
Go

The flag package already uses reflection, but you can see how it might be a problem if this interface-to-interface cast happens regularly, even taking into account Go’s caching of cast results.

There is also flag.Getter, which exists because they messed up and didn’t provide a way for a flag.Value to unwrap into the value it contains. For example, if a flag is defined with flag.Int, and then that flag is looked up with flag.Lookup, there’s no straightforward way to get the int out of the returned flag.Value.

Instead, you have to side-cast to flag.Getter:

type Getter interface {
  Value

  // Returns the value of the flag.
  Get() any
}
Go

As a result, flag.Lookup("...").(flag.Getter) needs to do a lot more work than if flag.Value had just added Get() any, with a default return value of nil.

It turns out that there is a rather elegant workaround for this.

Struct Embeddings

Go has this quirky feature called embedding, where a field in a struct is declared without a name:

type (
  A int
  B struct{
    A
    Foo int
  }
)
Go

The A-typed embedded field behaves as if we had declared the field A A, but selectors on var b B will search in A if they do not match something at the B level. For example, if A has a method Bar, and B does not, b.Bar() will resolve to b.A.Bar(). However, if A has a method Foo, b.Foo resolves to the field b.Foo, not the method b.A.Foo, because B declares a field named Foo.
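
A minimal example of these resolution rules (the method names are illustrative):

```go
package main

import "fmt"

type A int

func (A) Bar() string { return "A.Bar" }
func (A) Foo() string { return "A.Foo" }

type B struct {
	A
	Foo int // shadows A's Foo method at the B level
}

func main() {
	b := B{A: 1, Foo: 42}
	fmt.Println(b.Bar())   // promoted: resolves to b.A.Bar()
	fmt.Println(b.Foo)     // the field wins: 42, not A.Foo
	fmt.Println(b.A.Foo()) // the method is still reachable explicitly
}
```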

Importantly, any methods from A which B does not already have will be added to B’s method set. So this works:

type (
  A int
  B struct{ A }
  C interface {
    Foo()
    Bar()
  }
)

func (A) Foo() {}
func (B) Bar() {}

var _ C = B{}  // B satisfies C.
Go

Now, suppose that we were trying to add Get() any to flag.Value, and that we had also defined flag.ValueDefaults, a type that all satisfiers of flag.Value must embed. Then, we can write the following:

type Value interface{
  String() string
  Set(string) error

  Get() any  // New method.
}

type ValueDefaults struct{}

func (ValueDefaults) Get() any { return nil }
Go

Then, no code change is required for all clients to pick up the new implementation of Get().
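
Here is a sketch of what that looks like from the client's side; Value and ValueDefaults are hypothetical stand-ins defined locally, since the real flag package has no ValueDefaults:

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical versions of flag.Value and flag.ValueDefaults.
type Value interface {
	String() string
	Set(string) error
	Get() any // method added after the fact
}

type ValueDefaults struct{}

func (ValueDefaults) Get() any { return nil }

// A client type written before Get() existed. Because it embeds
// ValueDefaults, it picks up the default Get() with no code change.
type countValue struct {
	ValueDefaults
	n int
}

func (c *countValue) String() string { return strconv.Itoa(c.n) }
func (c *countValue) Set(s string) error {
	n, err := strconv.Atoi(s)
	c.n = n
	return err
}

func main() {
	var v Value = &countValue{}
	v.Set("3")
	fmt.Println(v.String(), v.Get()) // 3 <nil>
}
```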

Required Embeds

Now, this only works if we had required in the first place that anyone satisfying flag.Value embeds flag.ValueDefaults. How can we force that?

A little-known Go feature is that interfaces can have unexported methods. The way these work, for the purposes of interface conformance, is that exported methods are matched just by their name, but unexported methods must match both name and package.

So, if we have an interface like interface { foo() }, then foo will only match methods defined in the same package in which this interface expression appears. This is useful for preventing satisfaction of interfaces.
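
A minimal sketch of this sealing pattern (names are illustrative):

```go
package main

import "fmt"

// sealed can only be satisfied by types declared in this package:
// its unexported method seal() matches on name *and* package.
type sealed interface {
	Do() string
	seal() // unexported: other packages cannot provide this
}

type impl struct{}

func (impl) Do() string { return "ok" }
func (impl) seal()      {}

func main() {
	var s sealed = impl{} // fine: impl is in the same package
	fmt.Println(s.Do())
	// In any other package, a type with methods Do() and seal()
	// still would not satisfy sealed: its seal() is a different,
	// package-qualified method.
}
```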

However, there is a loophole: embedding inherits the entire method set, including unexported methods. Therefore, we can enhance Value to account for this:

type Value interface{
  String() string
  Set(string) error

  Get() any  // New method.

  value() // Unexported!
}

type ValueDefaults struct{}

func (ValueDefaults) Get() any { return nil }
func (ValueDefaults) value() {}
Go

Now, it’s impossible for any type defined outside of this package to satisfy flag.Value, without embedding flag.ValueDefaults (either directly or through another embedded flag.Value).

Exported Struct Fields

Now, another problem is that you can’t control the name of embedded fields: if the embedded type is Foo, the field’s name is Foo. However, the name is not taken from the underlying type itself; it will pick up the name of a type alias. So, if you want to unexport the defaults struct, you can simply write:

type MyValue struct {
  valueDefaults

  // ...
}

type valueDefaults = flag.ValueDefaults
Go

This also has the side-effect of hiding all of ValueDefaults’ methods from MyValue’s documentation, despite the fact that exported fields and methods are still selectable and callable by other packages (including via interfaces). As far as I can tell, this is simply a bug in godoc, since this behavior is not documented.

What About Same-Name Methods?

There is still a failure mode: a user type satisfying flag.Value might happen to define its own Get method with a different signature. In this case, that Get takes precedence over the promoted default, and changes to flag.Value will break those users.
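
A sketch of the failure, again using local stand-ins for the hypothetical Value and ValueDefaults:

```go
package main

import "fmt"

type Value interface {
	String() string
	Set(string) error
	Get() any
	value()
}

type ValueDefaults struct{}

func (ValueDefaults) Get() any { return nil }
func (ValueDefaults) value()   {}

// A pre-existing user type that already had its own Get with a
// different signature. That Get shadows the promoted default, so
// the type silently stops satisfying Value.
type clashing struct{ ValueDefaults }

func (clashing) String() string   { return "" }
func (clashing) Set(string) error { return nil }
func (clashing) Get() string      { return "oops" } // wrong signature

func main() {
	_, ok := any(clashing{}).(Value)
	fmt.Println(ok) // false: the user's Get() string wins
}
```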

There are two workarounds:

  1. Tell people not to define methods on their satisfying type, and if they do, they’re screwed. Because satisfying flag.Value is now explicit, this is not too difficult to ask for.

  2. Pick a name for new methods that is unlikely to collide with anything.

Unfortunately, this runs into a big issue with structural typing, which is that it is very difficult to avoid making mistakes when making changes, due to the lack of intent involved. A similar problem occurs with C++ templates, where the interfaces defined by concepts are implicit, and can result in violating contract expectations.

Go has historically been relatively cavalier about this kind of issue, so I think that breaking people based on this is fine.

And of course, you cannot retrofit a defaults struct into an interface; you have to define it from day one.

Using Defaults

Now that we have defaults, we can also enhance flag.Value with bool flag detection:

type Value interface{
  String() string
  Set(string) error

  Get() any
  IsBoolFlag() bool

  value() // Unexported!
}

type ValueDefaults struct{}

func (ValueDefaults) Get() any { return nil }
func (ValueDefaults) IsBoolFlag() bool { return false }
func (ValueDefaults) value() {}
Go

Now IsBoolFlag is more than just a random throw-away comment on a type.

We can also use defaults to speed up side casts. Many functions around the io package will cast an io.Reader into an io.Seeker or io.ReadAt to perform more efficient I/O.

In a hypothetical world where we had defaults structs for all the io interfaces, we can enhance io.Reader with a ReadAt default method that by default returns an error.

type Reader interface {
  Read([]byte) (int, error)
  ReadAt([]byte, int64) (int, error)

  reader()
}

type ReaderDefaults struct{}

func (ReaderDefaults) ReadAt([]byte, int64) (int, error) {
  return 0, errors.ErrUnsupported
}
func (ReaderDefaults) reader() {}
Go

We can do something similar for io.Seeker, but because it’s a rather general interface, it’s better to keep io.Seeker as-is. So, we can add a conversion method:

type Reader interface {
  Seeker() io.Seeker

  // ...
}

type ReaderDefaults struct{}

func (ReaderDefaults) Seeker() io.Seeker {
  return nil
}
Go

Here, Reader.Seeker() converts to an io.Seeker, returning nil if that’s not possible. How is this faster than r.(io.Seeker)? Well, consider what this would look like in user code:

type MyIO struct{
  io.ReaderDefaults
  // ...
}

func (m *MyIO) Read(b []byte) (int, error) { /* ... */ }
func (m *MyIO) Seek(offset int64, whence int) (int64, error) { /* ... */ }

func (m *MyIO) Seeker() io.Seeker {
  return m
}
Go

Calling r.Seeker(), if r is an io.Reader containing a MyIO, lowers to the following machine code:

  1. Load the function pointer for Seeker out of r’s itab.
  2. Perform an indirect jump on that function pointer.
  3. Inside of (*MyIO).Seeker, load a pointer to the itab symbol go:itab.*MyIO,io.Seeker and m into the return slots.
  4. Return.

The main cost of this conversion is the indirect jump, compared to, at minimum, hitting a hashmap lookup loop for the cache for r.(io.Seeker).

Does this performance matter? Not for I/O interfaces, probably, but it can matter for some uses!

Shouldn’t This Be a Language Feature?

Yes, it should, but here we are. Making it a language feature, though, has a few rather unfortunate quirks that we need to keep in mind.

Suppose we can define defaults on interface methods somehow, like this:

type Foo interface{
  Bar()
  Baz()
}

func (f Foo) Baz() {
  // ...
}
Go

Then, any type which provides Bar() automatically satisfies Foo. Suppose MyFoo satisfies Foo, but does not provide Baz. Then we have a problem:

var x MyFoo
x.Baz()       // Error!
Foo(x).Baz()  // Ok!
Go

Now, we might consider looking past that, but it becomes a big problem with reflection. If we passed Foo(x) into reflect.ValueOf, the resulting any conversion would discard the defaulted method, meaning that it would not be findable by reflect.Value.MethodByName(). Oops.

So we need to somehow add Baz to MyFoo’s method set. Maybe we say that if MyFoo is ever converted into Foo, it gets the method. But this doesn’t work, because the compiler might not be able to see through something like any(MyFoo{...}).(Foo). This means that Baz must be applied unconditionally. But, now we have the problem that if we have another interface interface { Bar(); Baz(int) }, MyFoo would need to receive incompatible signatures for Baz.

Again, we’re screwed by the non-intentionality of structural typing.

Missing Methods

Ok, let’s forget about default method implementations; they don’t seem to be workable. What if we make some methods optional, like IsBoolFlag() earlier? Let’s invent some syntax for it.

type Foo interface {
  Bar()
  ?Baz() // Optional method.
}
Go

Then, suppose that MyFoo provides Bar but not Baz (or Baz with the wrong signature). Then, the entry in the itab for Baz would contain a nil function pointer, such that x.Baz() panics! To determine if Baz is safe to call, we would use the following idiom:

if x.Baz != nil {
  x.Baz()
}
Go
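
Until such a feature exists, today's moral equivalent of `if x.Baz != nil` is a one-off side cast to an anonymous interface:

```go
package main

import "fmt"

type Foo interface {
	Bar()
}

type withBaz struct{}

func (withBaz) Bar()        {}
func (withBaz) Baz() string { return "optional Baz" }

type withoutBaz struct{}

func (withoutBaz) Bar() {}

// callBaz probes for the optional method at runtime.
func callBaz(x Foo) string {
	if b, ok := x.(interface{ Baz() string }); ok {
		return b.Baz()
	}
	return "no Baz"
}

func main() {
	fmt.Println(callBaz(withBaz{}))    // optional Baz
	fmt.Println(callBaz(withoutBaz{})) // no Baz
}
```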

The compiler is already smart enough to elide construction of funcvals for cases like this, although it does mean that x.Func in general, for an interface value x, requires an extra cmov or similar to make sure that x.Func is nil when it’s a missing method.

All of the use cases described above would work Just Fine using this construction, though! However, we run into the same issue that Foo(x) appears to have a larger method set than x. It is not clear if Foo(x) should conform to interface { Bar(); Baz() }, where Baz is required. My intuition would be no: Foo is a strictly weaker interface. Perhaps it might be necessary to avoid the method access syntax for optional methods, but that’s a question of aesthetics.

This idea of having nulls in place of function pointers in a vtable is not new, but to my knowledge is not used especially widely. It would be very useful in C++, for example, to be able to determine if no implementation was provided for a non-pure virtual function. However, the nominal nature of C++’s virtual functions does not make this as big of a need.

Another alternative is to store related interfaces’ itabs inside an itab. Suppose that we invent the syntax A<- within an interface{} to indicate that that interface will likely get cast to A. For example:

type (
  A interface{
    Foo()
  }

  B interface{
    Bar()

    A<-
  }
)
Go

Satisfying B does not require satisfying A. However, the A<- must be part of the public API, because an interface{ Bar() } cannot be used in place of an interface{ A<- }.

Within B’s itab, after all of the methods, there is a pointer to an itab for A, if the concrete type for this itab also happens to satisfy A. Then, a cast from B to A is just loading a pointer from the itab. If the cast would fail, the loaded pointer will be nil.

I had always assumed that Go did an optimization like this for embedding interfaces, but no! Any inter-interface conversion, including upcasts, goes through the whole type assertion machinery! Of course, Go cannot hope to generate an itab for every possible subset of the method set of an interface (exponential blow-up), but it’s surprising that they don’t do this for embedded interfaces, which are Go’s equivalent of superinterfaces (present in basically every language with interfaces).

Using this feature, we can update flag.Value to look like this:

type Value interface {
  String() string
  Set(string) error

  BoolValue<-
  Getter<-
}

type BoolValue interface {
  Value
  IsBoolFlag() bool
}

type Getter interface {
  Value
  Get() any
}
Go

Unfortunately, because A<- changes the ABI of an interface, it does not seem possible to actually add this to existing interfaces, because the following code is valid:

type A interface {
  Foo()
}

var (
  a A
  b interface{ Foo() }
)

a, b = b, a
Go

Even though this fix seems really clean, it doesn’t work! The only way it could work is if PGO determines that a particular interface conversion A to B happens a lot, and updates the ABI of all interfaces with the method set of A, program-globally, to contain a pointer to a B itab if available.

Conclusion

Go’s interfaces are pretty bad: in my opinion, a feature that looks good on a slide, but which results in a lot of mess due to its granular and intention-less nature. We can sort of patch over it with embeds, but there are still problems.

Due to how method sets work in Go, it’s very hard to “add” methods through an interface, and honestly at this point, any interface mechanism that makes it impossible (or expensive) to add new functions is going to be a huge problem.

Missing methods seems like the best way out of this problem, but for now, we can stick to the janky embedded structs.

  1. Go uses the term “implements” to say that a type satisfies an interface. I am instead intentionally using the term “satisfies”, because it makes the structural, passive nature of implementing an interface clearer. This is also more in-line with interfaces’ use as generic constraints.

    Swift uses the term “conform” instead, which I am avoiding for this reason. 

Parsing Protobuf Like Never Before

Historically I have worked on many projects related to high-performance Protobuf, be that on the C++ runtime, on the Rust runtime, or on integrating UPB, the fastest Protobuf runtime, written by my colleague Josh Haberman. I generally don’t post directly about my current job, but my most recent toy-turned-product is something I’m very excited to write about: hyperpb.

Here’s how we measure up against other Go Protobuf parsers. This is a subset of my benchmarks, since the benchmark suite contains many dozens of specimens. This was recorded on an AMD Zen 4 machine.

Throughput for various configurations of hyperpb (colored bars) vs. competing parsers (grey bars). Each successive hyperpb includes all previous optimizations, corresponding to zerocopy mode, arena reuse, and profile-guided optimization. Bigger is better.

Traditionally, Protobuf backends would generate parsers by generating source code specialized to each type. Naively, this would give the best performance, because everything would be “right-sized” to a particular message type. Unfortunately, now that we know better, there are a bunch of drawbacks:

  1. Every type you care about must be compiled ahead-of-time. Tricky for when you want to build something generic over schemas your users provide you.
  2. Every type contributes to a cost on the instruction cache, meaning that if your program parses a lot of different types, it will essentially flush your instruction cache any time you enter a parser. Worse still, if a parse involves enough types, the parser itself will hit instruction decoding throughput issues.

These effects are not directly visible in normal workloads, but other side-effects are visible: for example, giant switches on field numbers can turn into chains of branch instructions, meaning that higher-numbered fields will be quite slow. Even binary-searching on field numbers isn’t exactly ideal. However, we know that every Protobuf codec ever emits fields in index order (i.e., declaration order in the .proto file), which is a data conditioning fact we don’t take advantage of with a switch.

UPB solves this problem. It is a small C kernel for parsing Protobuf messages, which is completely dynamic: a UPB “parser” is actually a collection of data tables that are evaluated by a table-driven parser. In other words, a UPB parser is actually configuration for an interpreter VM, which executes Protobuf messages as its bytecode. UPB also contains many arena optimizations to improve allocation throughput when parsing complex messages.

hyperpb is a brand new library, written in the most cursed Go imaginable, which brings many of the optimizations of UPB to Go, and many new ones, while being tuned to Go’s own weird needs. The result leaves the competition in the dust in virtually every benchmark, while being completely runtime-dynamic. This means it’s faster than Protobuf Go’s own generated code, and vtprotobuf (a popular but non-conforming1 parser generator for Go).

This post is about some of the internals of hyperpb. I have also prepared a more sales-ey writeup, which you can read on the Buf blog.

Why Reinvent UPB?

UPB is awesome. It can slot easily into any language that has C FFI, which is basically every language ever.

Unfortunately, Go’s C FFI is really, really bad. It’s hard to overstate how bad cgo is. There isn’t a good way to cooperate with C on memory allocation (C can’t really handle Go memory without a lot of problems, due to the GC). Having C memory get cleaned up by the GC requires finalizers, which are very slow. Calling into C is very slow, because Go pessimistically assumes that C requires a large stack, and also calling into C does nasty things to the scheduler.

All of these things can be worked around, of course. For a while I considered compiling UPB to assembly, and rewriting that assembly into Go’s awful assembly syntax2, and then having Go assemble UPB out of that. This presents a few issues though, particularly because Go’s assembly calling convention is still in the stone age3 (arguments are passed on the stack), and because we would still need to do a lot of work to get UPB to match the protoreflect API.

Go also has a few… unique qualities that make writing a Protobuf interpreter an interesting challenge with exciting optimization opportunities.

First, of course, is the register ABI, which on x86_64 gives us a whopping nine argument and return registers, meaning that we can simply pass the entire parser state in registers all the time.

Second is that Go does not have much UB to speak of, so we can get away with a lot of very evil pointer crimes that we could not in C++ or Rust.

Third is that Protobuf Go has a robust reflection system that we can target if we design specifically for it.

Also, the Go ecosystem seems much more tolerant of less-than-ideal startup times (because the language loves life-before-main due to init() functions), so unlike UPB, we can require that the interpreter’s program be generated at runtime, meaning that we can design for online PGO. In other words, we have the perfect storm to create the first-ever Protobuf JIT compiler (which we also refer to as “online PGO” or “real-time PGO”).

hyperpb’s API

Right now, hyperpb’s API is very simple. There are hyperpb.Compile* functions that accept some representation of a message descriptor and return a *hyperpb.MessageType, which implements the protoreflect type APIs. This can be used to allocate a new *hyperpb.Message, which you can shove into proto.Unmarshal and do reflection on the result. However, you can’t mutate *hyperpb.Messages currently, because the main use-cases I am optimizing for are read-only. All mutations panic instead.

The hero use-case, using Buf’s protovalidate library, uses reflection to execute validation predicates. It looks like this:

// Compile a new message type, deserializing an encoded FileDescriptorSet.
msgType := hyperpb.CompileForBytes(schema, "my.api.v1.Request")

// Allocate a new message of that type.
msg := hyperpb.NewMessage(msgType)

// Unmarshal like you would any other message, using proto.Unmarshal.
if err := proto.Unmarshal(data, msg); err != nil {
    // Handle parse failure.
}

// Validate the message. Protovalidate uses reflection, so this Just Works.
if err := protovalidate.Validate(msg); err != nil {
    // Handle validation failure.
}
Go

We tell users to make sure to cache the compilation step because compilation can be arbitrarily slow: it’s an optimizing compiler! This is not unlike the same warning on regexp.Compile, which makes it easy to teach users how to use this API correctly.
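
The caching itself can be sketched with sync.OnceValue; compileSchema here is a stand-in for the real compile step, not hyperpb API:

```go
package main

import (
	"fmt"
	"sync"
)

// compileSchema stands in for an expensive, arbitrarily slow
// compilation step (like hyperpb.CompileForBytes).
var compileCalls int

func compileSchema(name string) string {
	compileCalls++
	return "compiled:" + name
}

// Cache the compiled type once, at package level, exactly as one
// would cache a regexp.MustCompile result.
var requestType = sync.OnceValue(func() string {
	return compileSchema("my.api.v1.Request")
})

func main() {
	// Every caller shares the single compiled artifact.
	for i := 0; i < 3; i++ {
		_ = requestType()
	}
	fmt.Println(compileCalls) // 1
}
```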

In addition to the main API, there’s a bunch of performance tuning knobs for the compiler, for unmarshaling, and for recording profiles. Types can be recompiled using a recorded profile to be more optimized for the kinds of messages that actually come on the wire. hyperpb PGO4 affects a number of things that we’ll get into as I dive into the implementation details.

The Guts

Most of the core implementation lives under internal/tdp. The main components are as follows:

  1. tdp, which defines the “object code format” for the interpreter. This includes definitions for describing types and fields to the parser.
  2. tdp/compiler, which contains all of the code for converting a protoreflect.MessageDescriptor into a tdp.Library, which contains all of the types relevant to a particular parsing operation.
  3. tdp/dynamic defines what dynamic message types look like. The compiler does a bunch of layout work that gets stored in tdp.Type values, which a dynamic.Message interprets to find the offsets of fields within itself.
  4. tdp/vm contains the core interpreter implementation, including the VM state that is passed in registers everywhere. It also includes hand-optimized routines for parsing varints and validating UTF-8.
  5. tdp/thunks defines archetypes, which are classes of fields that all use the same layout and parsers. This corresponds roughly to a (presence, kind) pair, but not exactly. There are around 200 different archetypes.

This article won’t be a deep-dive into everything in the parser, and even this excludes large portions of hyperpb. For example, the internal/arena package is already described in a different blogpost of mine. I recommend taking a look at that to learn about how we implement a GC-friendly arena for hyperpb.

Instead, I will give a brief overview of how the object code is organized and how the parser interprets it. I will also go over a few of the more interesting optimizations we have.

tdp.Type

Every MessageDescriptor that is reachable from the root message (either as a field or as an extension) becomes a tdp.Type. This contains the dynamic size of the corresponding message type, a pointer to the type’s default parser (there can be more than one parser for a type), and a variable number of tdp.Field values. These specify the offset of each field and provide accessor thunks for actually extracting the value of the field.

A tdp.TypeParser is what the parser VM interprets alongside encoded Protobuf data. It contains all of the information needed for decoding a message in compact form, including tdp.FieldParsers for each of its fields (and extensions), as well as a hashtable for looking up a field by tag, which is used by the VM as a fallback.

The tdp.FieldParsers each contain:

  1. The same offset information as a tdp.Field.
  2. The field’s tag, in a special format.
  3. A function pointer that gets called to parse the field.
  4. The next field(s) to try parsing after this one is parsed.

Each tdp.FieldParser actually corresponds to a possible tag on a record for this message. Some fields have multiple different tags: for example, a repeated int32 can have a VARINT-type tag for the repeated representation, and a LEN-type tag for the packed representation.

Each field specifies which fields to try next. This allows the compiler to perform field scheduling, by carefully deciding which order to try fields in based both on their declaration order and a rough estimation of their “hotness”, much like branch scheduling happens in a program compiler. This avoids almost all of the work of looking up the next field in the common case, because we have already pre-loaded the correct guess.

I haven’t managed to nail down a good algorithm for this yet, but I am working on a system for implementing a type of “branch prediction” for PGO, that tries to provide better predictions for the next fields to try based on what has been seen before.

The offset information for a field is more than just a memory offset. A tdp.Offset includes a bit offset, for fields which request allocation of individual bits in the message’s bitfields. These are used to implement the hasbits of optional fields (and the values of bool fields). It also includes a byte offset for larger storage. However, this byte offset can be negative, in which case it’s actually an offset into the cold region.

In many messages, most fields won’t be set, particularly extensions. But we would like to avoid having to allocate memory for the very rare (i.e., “cold”) fields. For this, a special “cold region” exists in a separate allocation from the main message, which is referenced via a compressed pointer. If a message happens to need a cold field set, it takes a slow path to allocate a cold region only if needed. Whether a field is cold is a dynamic property that can be affected by PGO.

The Parser VM

The parser is designed to make maximal use of Go’s generous ABI without spilling anything to the stack that isn’t absolutely necessary. The parser state consists of eight 64-bit integers, split across two types: vm.P1 and vm.P2. Unfortunately, these can’t be merged due to a compiler bug, as documented in vm/vm.go.

Every parser function takes these two structs as its first two arguments, and returns them as its first two results. This ensures that register allocation tries its darnedest to keep those eight integers in the first eight argument registers, even across calls. This leads to the common idiom of

var n int
p1, p2, n = DoSomething(p1, p2)
Go

Overwriting the parser state like this ensures that future uses of p1 and p2 use the values that DoSomething places in registers for us.

I spent a lot of time and a lot of profiling catching all of the places where Go would incorrectly spill parser state to the stack, which would result in stalls. I found quite a few codegen bugs in the process. Particularly notable (and shocking!) is #73589. Go has somehow made it a decade without a very basic pointer-to-SSA lifting pass (for comparison, this is a heavy-lifting cleanup pass (mem2reg) in LLVM).

The core loop of the VM goes something like this:

  1. Are we out of bytes to parse? If so, pop a parser stack frame5. If we popped the last stack frame, parsing is done; return success.
  2. Parse a tag. This does not fully decode the tag, because tdp.FieldParsers contain a carefully-formatted, partially-decoded tag to reduce decoding work.
  3. Check if the next field we would parse matches the tag.
    1. If yes, call the function pointer tdp.Field.Parser; update the current field to tdp.Field.NextOk; goto 1.
    2. If no, update the current field to tdp.Field.NextErr; goto 3.
    3. If no “enough times”, fall through.
  4. Slow path: hit tdp.Field.Tags to find the matching field for that tag.
    1. If matched, go to 3a.
    2. If not, this is an unknown field; put it into the unknown field set; parse a tag and goto 4.

Naturally, this is implemented as a single function whose control flow consists exclusively of ifs and gotos, because getting Go to generate good control flow otherwise proved too hard.

Now, you might be wondering why the hot loop for the parser includes calling a virtual function. Conventional wisdom holds that virtual calls are slow. After all, the actual virtual call instruction is quite slow, because it’s an indirect branch, meaning that it can easily stall the CPU. However, it’s actually much faster than the alternatives in this case, due to a few quirks of our workload and how modern CPUs are designed:

  1. Modern CPUs are not great at traversing complex “branch mazes”. This means that selecting one of ~100 alternatives using branches, even if they are well-predicted and you use unrolled binary search, is still likely to result in frequent mispredictions, and is an obstacle to other JIT optimizations in the processor’s backend.
  2. Predicting a single indirect branch with dozens of popular targets is something modern CPUs are pretty good at. Chips and Cheese have a great writeup on the indirect prediction characteristics of Zen 4 chips.

In fact, the “optimized” form of a large switch is a jump table, which is essentially an array of function pointers. Rather than doing a large number of comparisons and direct branches, a jump table turns a switch into a load and an indirect branch.

This is great news for us, because it means we can make use of a powerful assumption about most messages: most messages only feature a handful of field archetypes. How often is it that you see a message which has more in it than int32, int64, string, and submessages? In effect, this allows us to have a very large “instruction set”, consisting of all of the different field archetypes, but a particular message only pays for what it uses. The fewer archetypes it uses at runtime, the better the CPU can predict this indirect jump.

On the other hand, we can just keep adding archetypes over time to specialize for common parse workloads, which PGO can select for. Adding new archetypes that are not used by most messages does not incur a performance penalty.
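
A miniature sketch of this table-driven dispatch (nothing here is hyperpb's actual code): each field record carries a preloaded parser function pointer, so the hot loop is a load plus one indirect call rather than a big switch.

```go
package main

import "fmt"

// A tiny "instruction set" of field archetypes.
type fieldParser func(data byte) string

var archetypes = map[string]fieldParser{
	"varint": func(b byte) string { return fmt.Sprintf("varint(%d)", b) },
	"fixed":  func(b byte) string { return fmt.Sprintf("fixed(%d)", b) },
}

// A field record, with its parser chosen at "compile" time.
type field struct {
	parse fieldParser
}

func main() {
	fields := []field{
		{parse: archetypes["varint"]},
		{parse: archetypes["fixed"]},
	}
	for i, f := range fields {
		fmt.Println(f.parse(byte(i))) // one indirect call per field
	}
}
```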

Other Optimizations

We’ve already discussed the hot/cold split, and briefly touched on the message bitfields used for bools and hasbits. I’d like to mention a few other cool optimizations that help cover all our bases, as far as high-performance parsing goes.

Zerocopy Strings

The fastest memcpy implementation is the one you don’t call. For this reason, we try to, whenever possible, avoid copying anything out of the input buffer. strings and bytes are represented as zc.Ranges, which are a packed pair of offset+length in a uint64. Protobuf is not able to handle lengths greater than 2GB properly, so we can assume that this covers all the data we could ever care about. This means that a bytes field is 8 bytes, rather than 24, in our representation.
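
The packing can be sketched like this; zcRange and its methods are illustrative, not hyperpb's actual API:

```go
package main

import "fmt"

// zcRange packs an offset and a length into one uint64, so a
// string field costs 8 bytes instead of a 24-byte slice header.
type zcRange uint64

func newRange(offset, length uint32) zcRange {
	return zcRange(uint64(offset)<<32 | uint64(length))
}

// bytes resolves the range against the original input buffer,
// without copying anything.
func (r zcRange) bytes(src []byte) []byte {
	off, n := uint32(r>>32), uint32(r)
	return src[off : off+n]
}

func main() {
	buf := []byte("....hello....")
	r := newRange(4, 5)
	fmt.Printf("%s\n", r.bytes(buf)) // hello
}
```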

Zerocopy is also used for packed fields. For example, a repeated double will typically be encoded as a LEN record. The number of float64s in this record is equal to its length divided by 8, and the float64s are already encoded in IEEE754 format for us. So we can just retain the whole repeated field as a zc.Range. Of course, we need to be able to handle cases where there are multiple disjoint records, so the backing repeated.Scalars can also function as a 24-byte arena slice. Being able to switch between these modes gracefully is a delicate and carefully-tested part of the repeated field thunks.

Surprisingly, we also use zerocopy for varint fields, such as repeated int32. Varints are variable-length, so we can’t just index directly into the packed buffer to get the nth element… unless all of the elements happen to be the same size. In the case that every varint is one byte (so, between 0 and 127), we can zerocopy the packed field. This is a relatively common scenario, too, so it results in big savings6. We already count the number of varints in the packed field in order to preallocate space for it, so this doesn’t add extra cost. This counting is very efficient because I have manually vectorized the loop.

Repeated Preloads

PGO records the median size of each repeated/map field, and that is used to calculate a “preload” for each repeated field. Whenever the field is first allocated, it is pre-allocated using the preload to try to right-size the field with minimal waste.

Using the median ensures that large outliers don’t result in huge memory waste; instead, this guarantees that at least 50% of repeated fields will only need to allocate from the arena once. Packed fields don’t use the preload, since in the common case only one record appears for packed fields. This mostly benefits string- and message-typed repeated fields, which can’t be packed.

Map Optimizations

We don’t use Go’s built-in map, because it has significant overhead in some cases: in particular, it has to support Go’s mutation-during-iteration semantics, as well as deletion. Although both are Swisstables7 under the hood, my implementation can afford to take a few shortcuts, and it can use arena-managed memory. swiss.Tables are used both for the backing store of map fields, and for maps inside of tdp.Types.

Currently, the hash used is the variant of fxhash used by the Rust compiler. It greatly outperforms Go’s maphash for integers, but maphash is better for larger strings. I may switch to maphash for large strings at some point, but it hasn’t been a priority.

Arena Reuse

Hitting the Go allocator is always going to be a little slow, because it’s a general-case allocator. Ideally, we should learn the estimated memory requirements for a particular workload, and then allocate a single block of that size for the arena to portion out.

The best way to do this is via arena reuse. In the context of a service, each request puts a bounded lifetime on the messages it parses. Once that lifetime is over (the request is complete), the message is discarded. This gives the programmer an opportunity to reset the backing arena, so that it keeps its largest memory block for re-allocation.

You can show that over time, this will cause the arena to never hit the Go allocator. If the largest block is too small for a message, a block twice as large will wind up getting allocated. Messages that use the same amount of memory will keep doubling the largest block, until the largest block is large enough to fit the whole message. Memory usage will be at worst 2x the size of this message. Note that, thanks to extensive use of zero-copy optimizations, we can often avoid allocating memory for large portions of the message.

Of course, arena reuse poses a memory safety danger if the previously allocated message is kept around after the arena is reset. For this reason, it’s not the default behavior. Even so, using arena resets is a double-digit percentage improvement.

Oneof Unions

Go does not properly support unions, because the GC does not keep the necessary book-keeping to distinguish a memory location that may be an integer or a pointer at runtime. Instead, this gets worked around using interfaces, which are always a pointer to some memory. Go’s GC can handle untyped pointers just fine, so this just works.

The generated API for Protobuf Go uses interface values for oneofs. This API is… pretty messy to use, unfortunately, and triggers unnecessary allocations (much like optional fields do in the open API).

However, my arena design (read about it here) makes it possible to store arena pointers on the arena as if they are integers, since the GC does not need to scan through arena memory. Thus, our oneofs are true unions, like in C++.

Conclusion

hyperpb is really exciting because its growing JIT capabilities offer an improvement in the state of the art over UPB. It’s also been a really fun challenge working around Go compiler bugs to get the best assembly possible. The code is already so well-optimized that re-building the benchmarks with the Go compiler’s own PGO mode (based on a profile collected from the benchmarks) didn’t really seem to move the needle!

I’m always working on making hyperpb better (I get paid for it!) and I’m always excited to try new optimizations. If you think of something, file an issue! I have meticulously commented most things within hyperpb, so it should be pretty easy to get an idea of where things are if you want to contribute.

I would like to write more posts diving into some of the weeds of the implementation. I can’t promise anything, but there’s lots to talk about. For now… have fun source-diving!

There’s a lot of other things we could be doing: for example, we could be using SIMD to parse varints, we could have smarter parser scheduling, we could be allocating small submessages inline to improve locality… there’s still so much we can do!

And most importantly, I hope you’ve learned something new about performance optimization!

  1. vtprotobuf gets a lot of things wrong that make it beat us in like two benchmarks, because it’s so sloppy. For example, vtprotobuf believes that it’s ok to not validate UTF-8 strings. This is non-conforming behavior. It also believes that map entries’ fields are always in order and always populated, meaning that valid Protobuf messages containing maps can be parsed incorrectly. This sloppiness is unacceptable, which is why hyperpb goes to great lengths to implement all of Protobuf correctly. 

  2. Never gonna let Rob live that one down. Of all of Rob’s careless design decisions, the assembler is definitely one of the least forgivable ones. 

  3. There are only really two pieces of code in hyperpb that could benefit from hand-written assembly: varint decoding and UTF-8 validation. Both of these vectorize well; however, ABI0 is so inefficient that no hand-written implementation will be faster.

    If I do wind up doing this, it will require a build tag like hyperasm, along with something like -gcflags=buf.build/go/hyperpb/internal/asm/...=-+ to treat the assembly implementations as part of the Go runtime, allowing the use of ABIInternal. But even then, this only speeds up parsing of large (>2 byte) varints. 

  4. This is PGO performed by hyperpb itself; this is unrelated to gc’s own PGO mode, which seems to not actually make hyperpb faster. 

  5. Yes, the parser manages its own stack separate from the goroutine stack. This ensures that nothing in the parser has to be reentrant. The only time the stack is pushed to is when we “recurse” into a submessage. 

  6. Large packed repeated fields are where the biggest wins are for us. Being able to zero-copy large packed int32 fields full of small values allows us to eliminate all of the overhead that the other runtimes are paying for; we also choose different parsing strategies depending on the byte-to-varint ratio of the record.

    Throughput for various repeated field benchmarks. This excludes the repeated fixed32 benchmarks, since those achieve such high throughputs (~20 Gbps) that they make the chart unreadable.

    These optimizations account for the performance difference between descriptor/#00 and descriptor/#01 in the first benchmark chart. The latter is a FileDescriptorSet containing SourceCodeInfo, Protobuf’s janky debuginfo format. It is dominated by repeated int32 fields.

    NB: This chart is currently missing the Y-axis; I need to have it re-made. 

  7. Map parsing performance has been a bit of a puzzle. vtprotobuf cheats by rejecting some valid map entry encodings, such as (in Protoscope) {1: {"key"}} (value is implied to be ""), while mis-parsing others, such as {2: {"value"} 1: {"key"}} (fields can go in any order), since they don’t actually validate the field numbers like hyperpb does.

    Here’s where the benchmarks currently stand for maps:

    Throughput for various map parsing benchmarks.

    Maps, I’m told, are not very popular in Protobuf, so they’re not something I have tried to optimize as hard as packed repeated fields. 

The Best C++ Library

It’s no secret that my taste in programming languages is very weird for a professional programming language enthusiast. Several of my last few posts are about Go, broadly regarded as the programming language equivalent of eating plain oatmeal for breakfast.

To make up for that, I’m going to write about the programming language equivalent of diluting your morning coffee with Everclear. I am, of course, talking about C++.

If you’ve ever had the misfortune of doing C++ professionally, you’ll know that the C++ standard library is really bad. Where to begin?

Well, the associative containers are terrible. Due to bone-headed API decisions, std::unordered_map MUST be a closed-addressing, array-of-linked-lists map, not a Swisstable, despite closed addressing being an outdated technology. std::map, which is not what you usually want, must be a red-black tree. It can’t be a B-tree, which is what every sensible language provides for its ordered map.

std::optional is a massive pain in the ass to use, and is full of footguns, like operator*. std::variant is also really annoying to use. std::filesystem is full of sharp edges. And where are the APIs for signals?

Everything is extremely wordy. std::hardware_destructive_interference_size could have been called std::cache_line. std::span::subspan could have used operator[]. The standard algorithms are super wordy, because they deal with iterator pairs. Oh my god, iterator pairs. They added std::ranges, which do not measure up to Rust’s Iterator at all!

I’m so mad about all this! The people in charge of C++ clearly, actively hate their users!1 They want C++ to be as hard and unpleasant as possible to use. Many brilliant people that I am lucky to consider friends and colleagues, including Titus Winters, JeanHeyd Meneide, Matt Fowles-Kulukundis, and Andy Soffer, have tried and mostly failed2 to improve the language.

This is all to say that I believe C++ in its current form is unfixable. But that’s only due to the small-mindedness of a small cabal based out of Redmond. What if we could do whatever we wanted? What if we used C++’s incredible library-building language features to build a brand-new language?

For the last year-or-so I’ve been playing with a wild idea: what would C++ look like if we did it over again? Starting from an empty C++20 file with no access to the standard library, what can we build in its place?

Starting Over

Titus started Abseil while at Google, whose namespace, absl, is sometimes said to stand for “a better standard library”3. To me, Abseil is important because it was an attempt to work with the existing standard library and make it better, while retaining a high level of implementation quality that a C++ shop’s home-grown utility library won’t have, and a uniformity of vision that Boost is too all-over-the-place to achieve.

Rather than trying to coexist with the standard library, I want to surpass it. As a form of performance art, I want to discover what the standard library would look like if we designed it today, in 2025.

In this sense, I want to build something that isn’t just better. It should be the C++ standard library from the best possible world. It is the best possible library. This is why my library’s namespace is best.

In general, I am trying not to directly copy either what C++, or Abseil, or Rust, or Go did. However, each of them has really interesting ideas, and the best library probably lies in some middle-ground somewhere.

The rest of this post will be about what I have achieved with best so far, and where I want to take it. You can look at the code here.

Building a Foundation

We’re throwing out everything, and that includes <type_traits>. This is a header which shows its age: alias templates weren’t added until C++14, and variable templates were added in C++17. As a result, many things that really ought to be concepts have names like std::is_same_v. All of these now have concept equivalents in <concepts>.

I have opted to try to classify type traits into separate headers to make them easier to find. They all live under //best/meta/traits, and they form the leaves of the dependency graph.

For example, arrays.h contains all of the array traits, such as best::is_array, best::un_array (to remove an array extent), and best::as_array, which applies an extent to a type T, such that best::as_array<T, 0> is not an error.

types.h contains very low-level metaprogramming helpers, such as:

  • best::id and best::val, the identity traits for type- and value-kinded traits.
  • best::same<...>, which returns whether an entire pack of types is all equal.
  • best::lie, our version of std::declval.
  • best::select, our std::conditional_t.
  • best::abridge, a “symbol compression” mechanism for shortening the names of otherwise huge symbols.

funcs.h provides best::tame, which removes the qualifiers from an abominable function type. quals.h provides best::qualifies_to, necessary for determining if a type is “more const” than another. empty.h provides a standard empty type that interoperates cleanly with void.

On top of the type traits is the metaprogramming library //best/meta, which includes generalized constructibility traits in init.h (e.g., to check that you can, in fact, initialize a T& from a T&&). tlist.h provides a very general type-level heterogeneous list abstraction; a parameter pack as-a-type.

The other part of “the foundation” is //best/base, which mostly provides access to intrinsics, portability helpers, macros, and “tag types” such as our versions of std::in_place. For example, macro.h provides BEST_STRINGIFY(), port.h provides BEST_HAS_INCLUDE(), and hint.h provides best::unreachable().

guard.h provides our version of the Rust ? operator, which is not an expression because statement expressions are broken in Clang.

Finally, within //best/container we find best::object, a special type for turning any C++ type into an object (i.e., a type that you can form a reference to). This is useful for manipulating any type generically, without tripping over the assign-through semantics of references. For example, best::object<T&> is essentially a pointer.

“ADT” Containers

On top of this foundation we build the basic algebraic data types of best: best::row and best::choice, which replace std::tuple and std::variant.

best::row<A, B, C> is a heterogeneous collection of values, stored inside of best::objects. This means that best::row<int&> has natural rebinding, rather than assign-through, semantics.

Accessing elements is done with at(): my_row.at<0>() returns a reference to the first element. Getting the first element is so common that you can also use my_row.first(). Using my_row.object<0>() will return a reference to a best::object instead, which can be used for rebinding references. For example:

int x = 0, y = 0;
best::row<int&> a{x};
a.at<0>() = 42;     // Writes to x.
a.object<0>() = y;  // Rebinds a.0 to y.
a.at<0>() = 2*x;    // Writes to y.
C++

There is also second() and last(), for the other two most common elements to access.

best::row is named so in reference to database rows: it provides many operations for slicing and dicing that std::tuple does not.

For example, in addition to extracting single elements, it’s also possible to access contiguous subsequences, using best::bounds: a.at<best::bounds{.start = 1, .end = 10}>()! There are also a plethora of mutation operations:

  • a + b concatenates tuples, copying or moving as appropriate (a + BEST_MOVE(b) will move out of the elements of b, for example).
  • a.push(x) returns a copy of a with x appended, while a.insert<n>(x) does the same at an arbitrary index.
  • a.update<n>(x) replaces the nth element with x, potentially of a different type.
  • a.remove<n>() deletes the nth element, while a.erase<...>() deletes a contiguous range.
  • a.splice<best::bounds{...}>(...) splices a row into another row, offering a general replace/delete operation that all of the above operations are implemented in terms of.
  • gather() and scatter() are even more general, allowing for non-contiguous indexing.

Meanwhile, std::apply is a method now: a.apply(f) calls f with a’s elements as its arguments. a.each(f) is similar, but instead expands to n unary calls of f, one with each element.

And of course, best::row supports structured bindings.

Meanwhile, best::choice<A, B, C> contains precisely one value from various types. There is an underlying best::pun<A, B, C> type that implements a variadic untagged union that works around many of C++’s bugs relating to unions with members of non-trivial type.

The most common way to operate on a choice is to match on it:

best::choice<int, int*, void> z = 42;
int result = z.match(
  [](int x) { return x; },
  [](int* x) { return *x; },
  [] { return 0; }
);
C++

Which case gets called here is chosen by overload resolution, allowing us to write a default case as [](auto&&) { ... }.

Which variant is currently selected can be checked with z.which(), while specific variants can be accessed with z.at(), just like a best::row, except that it returns a best::option<T&>.

best::choice is what all of the other sum types, like best::option and best::result, are built out of. All of the clever layout optimizations live here.

Speaking of best::option<T>, that’s our option type. It’s close in spirit to what Option<T> is in Rust. best has a generic niche mechanism that user types can opt into, allowing best::option<T&> to be the same size as a pointer, using nullptr for the best::none variant.

best::option provides the usual transformation operations: map, then, filter. Emptiness can be checked with is_empty() or has_value(). You can even pass a predicate to has_value() to check the value with, if it’s present: x.has_value([](auto& x) { return x == 42; }).

The value can be accessed using operator* and operator->, like std::optional; however, this operation is checked, instead of causing UB if the option is empty. value_or() can be used to unwrap with a default; the default can be any number of arguments, which are used to construct the default, or even a callback. For example:

best::option<Foo> x;

// Pass arguments to the constructor.
do_something(x.value_or(args, to, foo));

// Execute arbitrary logic if the value is missing.
do_something(x.value_or([] {
  return Foo(...);
}));
C++

best::option<void> also Just Works (in fact, best::option<T> is a best::choice<void, T> internally), allowing for truly generic manipulation of optional results.

best::result<T, E> is, unsurprisingly, the analogue of Rust’s Result<T, E>. Because it’s a best::choice internally, best::result<void, E> works as you might expect, and is a common return value for I/O operations.

It’s very similar to best::option, including offering operator-> for accessing the “ok” variant. This enables succinct idioms:

if (auto r = fallible()) {
  r->do_something();
} else {
  best::println("{}", *r.err());
}
C++

r.ok() and r.err() return best::options containing references to the ok and error variants, depending on which is actually present; meanwhile, a best::option can be converted into a best::result using ok_or() or err_or(), just like in Rust.

best::results are constructed using best::ok and best::err. For example:

best::result<Foo, Error> r = best::ok(args, to, foo);
C++

These internally use best::args, a wrapper over best::row that represents a “delayed initialization” that can be stored in a value. It will implicitly convert into any type that can be constructed from its elements. For example:

Foo foo = best::args(args, to, foo);  // Calls Foo::Foo(args, to, foo).
C++

Also, every one of the above types is a structural type, meaning it can be used for non-type template parameters!

Memory and Pointers

Of course, all of these ADTs need to be built on top of pointer operations, which is where //best/memory comes in. best::ptr<T> is a generalized pointer type that provides many of the same operations as Rust’s raw pointers, including offsetting, copying, and indexing. Like Rust pointers, best::ptr<T> can be a fat pointer, i.e., it can carry additional metadata on top of the pointer. For example, best::ptr<int[]> remembers the size of the array.

Providing metadata for a best::ptr is done through a member alias called BestPtrMetadata. This alias should be private, which best is given access to by befriending best::access. Types with custom metadata will usually not be directly constructible (because they are of variable size), and must be manipulated exclusively through types like best::ptr.

Specifying custom metadata allows specifying what the pointer dereferences to. For example, best::ptr<int[]> dereferences to a best::span<int>, meaning that all the span operations are accessible through operator->: for example, my_array_ptr->first().

Most of this may seem a bit over-complicated, since ordinary C++ raw pointers and references are fine for most uses. However, best::ptr is the foundation upon which best::box<T> is built. best::box<T> is a replacement for std::unique_ptr<T> that fixes its const correctness and adds Rust Box-like helpers. best::box<T[]> also works, but unlike std::unique_ptr<T[]>, it remembers its size, just like best::ptr<T[]>.

best::box is parameterized by its allocator, which must satisfy best::allocator, a much less insane API than what std::allocator offers. best::malloc is a singleton allocator representing the system allocator.

best::span<T>, mentioned before, is the contiguous memory abstraction, replacing std::span. Like std::span, best::span<T, n> is a fixed-length span of n elements. Unlike std::span, the second parameter is a best::option<size_t>, not a size_t that uses -1 as a sentinel.

best::span<T> tries to approximate the API of Rust slices, providing indexing, slicing, splicing, search, sort, and more. Naturally, it’s also iterable, both forwards and backwards, and provides splitting iterators, just like Rust.

Slicing and indexing is always bounds-checked. Indexing can be done with size_t values, while slicing uses a best::bounds:

best::span<int> xs = ...;
auto x = xs[5];
auto ys = xs[{.start = 1, .end = 6}];
C++

best::bounds is a generic mechanism for specifying slicing bounds, similar to Rust’s range types. You can specify the start and end (exclusive), like x..y in Rust. You can also specify an inclusive end using .inclusive_end = 5, equivalent to Rust’s x..=y. And you can specify a count, like C++’s slicing operations prefer: {.start = 1, .count = 5}. best::bounds itself provides all of the necessary helpers for performing bounds checks and crashing with a nice error message. best::bounds is also iterable, as we’ll see shortly.

best::layout is a copy of Rust’s Layout type, providing similar helpers for performing C++-specific size and address calculations.

Iterators

C++ iterator pairs suck. C++ ranges suck. best provides a new paradigm for iteration that is essentially just Rust Iterators hammered into a C++ shape. This library lives in //best/iter.

To define an iterator, you define an iterator implementation type, which must define a member function named next() that returns a best::option:

class my_iter_impl final {
  public:
    best::option<int> next();
};
C++

This type is an implementation detail; the actual iterator type is best::iter<my_iter_impl>. best::iter provides all kinds of helpers, just like Iterator, for adapting the iterator or consuming items out of it.

Iterators can override the behavior of some of these adaptors to be more efficient, such as for making count() constant-time rather than linear. Iterators can also offer extra methods if they define the member alias BestIterArrow; for example, the iterators for best::span have a ->rest() method for returning the part of the slice that has not been yielded by next() yet.

One of the most important extension points is size_hint(), analogous to Iterator::size_hint(), for right-sizing containers that the iterator is converted to, such as a best::vec.

And of course, best::iter provides begin/end so that it can be used in a C++ range-for loop, just like C++20 ranges do. best::int_range<I>4, which best::bounds is an instantiation of, is also an iterator, and can be used much like Rust ranges would:

for (auto i : best::int_range<int>{.start = 1, .count = 200}) {
  // ...
}
C++

best::int_range will carefully handle all of the awkward corner cases around overflow, such as best::int_range<uint8_t>{.end_inclusive = 255}.

Heap Containers

Iterators bring us to the most complex container type that’s checked in right now, best::vec. Not only can you customize its allocator type, but you can also customize its small vector optimization type.

In libc++, std::strings of at most 23 bytes are stored inline, meaning that the string’s own storage, rather than heap storage, is used to hold them. best::vec generalizes this, by allowing any trivially copyable type to be inlined. Thus, a best::vec<int> will hold at most five ints inline, on 64-bit targets.

best::vec mostly copies the APIs of std::vector and Rust’s Vec. Indexing and slicing works the same as with best::span, and all of the best::span operations can be accessed through ->, allowing for things like my_vec->sort(...).

I have an active (failing) PR which adds best::table<K, V>, a general hash table implementation that can be used as either a map or a set. Internally it’s backed by a Swisstable5 implementation. Its API resembles none of std::unordered_map, absl::flat_hash_map, or Rust’s HashMap. Instead, everything is done through a general entry API, similar to that of Rust, but optimized for clarity and minimizing hash lookups. I want to get it merged soonish.

Beyond best::table, I plan to add at least the following containers:

  • best::tree, a btree map/set with a similar API.
  • best::heap, a simple min-heap implementation.
  • best::lru, a best::table with a linked list running through it for in-order iteration and oldest-member eviction.
  • best::ring, a ring buffer like VecDeque.
  • best::trie, a port of my twie crate.

Possible other ideas: Russ’s sparse array, splay trees, something like Java’s EnumMap, bitset types, and so on.

Text Handling

best’s string handling is intended to resemble Rust’s as much as possible; it lives within //best/text. best::rune is the Unicode scalar type: its value is always within the valid range for a Unicode scalar, except that unpaired surrogates are also permitted (which encodings like WTF-8 need). It offers a number of relatively simple character operations, but I plan to extend it to all kinds of character classes in the future.

best::str is our replacement for std::string_view, close to Rust’s str: a sequence of valid UTF-8 bytes, with all kinds of string manipulation operations, such as rune search, splitting, indexing, and so on.

best::rune and best::str use compiler extensions to ensure that when constructed from literals, they’re constructed from valid literals. This means that the following won’t compile!

best::str invalid = "\xFF";
C++

best::str is a best::span under the hood, which can be accessed and manipulated the same way that a &str’s underlying &[u8] can be in Rust.

best::strbuf is our std::string equivalent. There isn’t very much to say about it, because it works just like you’d expect, and provides a Rust String-like API.

Where this library really shines is that everything is parametrized over encodings. best::str is actually a best::text<best::utf8>; best::str16 is then best::text<best::utf16>. You can write your own text encodings, too, so long as they are relatively tame and you provide rune encode/decode for them; best::encoding is the concept they must satisfy.

best::text is always validly encoded; however, sometimes, that’s not possible. For this reason we have best::pretext, which is “presumed validly encoded”; its operations can fail or produce replacement characters if invalid code units are found. There is no best::pretextbuf; instead, you would generally use something like a best::vec<uint8_t>.

Unlike in C++, the fact that a best::textbuf is a best::vec under the hood is part of the public interface, which allows for cheap conversions and, of course, gives us best::vec’s small vector optimization for free.

best provides the following encodings out of the box: best::utf8, best::utf16, best::utf32, best::wtf8, best::ascii, and best::latin1.

Formatting

//best/text:format provides a Rust format!()-style text formatting library. It’s as easy as:

auto str = best::format("my number: 0x{:08x}", n);
C++

Through the power of compiler extensions and constexpr, the format is actually checked at compile time!

The available formats are the same as Rust’s, including the {} vs {:?} distinction. But it’s actually way more flexible. You can use any ASCII letter, and types can provide multiple custom formatting schemes using letters. By convention, x, X, b, and o all mean numeric bases. q will quote strings, runes, and other text objects; p will print pointer addresses.

The special format {:!} “forwards from above”; when used in a formatting implementation, it uses the format specifier the caller used. This is useful for causing formats to be “passed through”, such as when printing lists or best::option.

Any type can be made formattable by providing a friend template ADL extension (FTADLE) called BestFmt. This is analogous to implementing a trait like fmt::Debug in Rust, however, all formatting operations use the same function; this is similar to fmt.Formatter in Go.

The best::formatter type, which gets passed into BestFmt, is similar to Rust’s Formatter. Beyond being a sink, it also exposes information on the specifier for the formatting operation via current_spec(), and helpers for printing indented lists and blocks.

BestFmtQuery is a related FTADLE that is called to determine what the valid format specifiers for this type are. This allows the format validator to reject formats that a type does not support, such as formatting a best::str with {:x}.

best::format returns (or appends to) a best::strbuf; best::println and best::eprintln can be used to write to stdout and stderr.

Reflection

Within the metaprogramming library, //best/meta:reflect offers a basic form of reflection. It’s not C++26 reflection, because that’s wholly overkill. Instead, it provides a method for introspecting the members of structs and enums.

For example, suppose that we want to have a default way of formatting arbitrary aggregate structs. The code for doing this is actually devilishly simple:

void BestFmt(auto& fmt, const best::is_reflected_struct auto& value) {
  // Reflect the type of the struct.
  auto refl = best::reflect<decltype(value)>;
  // Start formatting a "record" (key-value pairs).
  auto rec = fmt.record(refl.name());

  // For each field in the struct...
  refl.each([&](auto field) {
    // Add a field to the formatting record...
    rec.field(
      field.name(),  // ...whose name is the field's name...
      value->*field  // ...and with the appropriate value.
    );
  });
}
C++

best::reflect provides access to the fields (or enum variants) of a user-defined type that opts itself in by providing the BestReflect FTADLE, which tells the reflection framework what the fields are. The simplest version of this FTADLE looks like this:

friend constexpr auto BestReflect(auto& mirror, MyStruct*) {
  return mirror.infer();
}
C++

best::mirror is essentially a “reflection builder” that offers fine-grained control over what reflection actually shows of a struct. This allows for hiding fields, or attaching tags to specific fields, which generic functions can then introspect using best::reflected_field::tags().

The functions on best::reflected_type allow iterating over and searching for specific fields (or enum variants); these best::reflected_fields provide metadata about a field (such as its name) and allow accessing it, with the same syntax as a pointer-to-member: value->*field.

Explaining the full breadth (and implementation tricks) of best::reflect would be a post of its own, so I’ll leave it at that.

Unit Tests and Apps

best provides a unit testing framework under //best/test, like any good standard library should. To define a test, you define a special kind of global variable:

best::test MyTest = [](best::test& t) {
  // Test code.
};
C++

This is very similar to a Go unit test, which defines a function that starts with Test and takes a *testing.T as its argument. The best::test& value offers test assertions and test failures. Through the power of looking at debuginfo, we can extract the name MyTest from the binary, and use that as the name of the test directly.

That’s right, this is a C++ test framework with no macros at all!

Meanwhile, at //best/cli we can find a robust CLI parsing library, in the spirit of #[derive(clap::Parser)] and other similar Rust libraries. The way it works is you first define a reflectable struct, whose fields correspond to CLI flags. A very basic example of this can be found in test.h, since test binaries define their own flags:

struct test::flags final {
  best::vec<best::strbuf> skip;
  best::vec<best::strbuf> filters;

  constexpr friend auto BestReflect(auto& m, flags*) {
    return m.infer()
      .with(best::cli::app{.about = "a best unit test binary"})
      .with(&flags::skip,
            best::cli::flag{
              .arg = "FILTER",
              .help = "Skip tests whose names contain FILTER",
            })
      .with(&flags::filters,
            best::cli::positional{
              .name = "FILTERS",
              .help = "Include only tests whose names contain FILTER",
            });
  }
};
C++

Using best::mirror::with, we can apply tags to the individual fields that describe how they should be parsed and displayed as CLI flags. A more complicated, full-featured example can be found at toy_flags.h, which exercises most of the CLI parser’s features.

best::parse_flags<MyFlags>(...) can be used to parse a particular flag struct from program inputs, independent of the actual argv of the program. A best::cli contains the actual parser metadata, but this is not generally user-accessible; it is constructed automatically using reflection.

Streamlining top-level app execution can be done using best::app, which fully replaces the main() function. Defining an app is very similar to defining a test:

best::app MyApp = [](MyFlags& flags) {
  // Do something cool!
};
C++

This will automatically record the program inputs, run the flag parser for MyFlags (printing --help and exiting, when requested), and then call the body of the lambda.

The lambda can return void, an int (as an exit code), or even a best::result, as in Rust. best::app is also where the argv of the program can be requested by other parts of the program.

What’s Next?

There’s still a lot of stuff I want to add to best. There are no synchronization primitives: no atomics, locks, or channels. There’s no I/O; I have a work-in-progress PR to add best::path and best::file. I’d like to write my own math library, best::rc (reference-counting), and portable SIMD. There are also some other OS APIs I want to build, such as signals and subprocesses. I want to add a robust PRNG, time APIs, networking, and stack symbolization.

Building the best C++ library is a lot of work, not least because C++ is a very tricky language and writing exhaustive tests is tedious. But it manages to make C++ fun for me again!

I would love to see contributions some day. I don’t expect anyone to actually use this, but to me, it proves C++ could be so much better.

  1. They are also terrible people

  2. I will grant that JeanHeyd has made significant progress where many people believed it was impossible. He appears to have the indomitable willpower of a shōnen protagonist. 

  3. I have heard an apocryphal story that the namespace was going to be abc or abcl, because it was “Alphabet’s library”. This name was ultimately shot down by the office of the CEO, or so the legend goes. 

  4. This may get renamed to best::interval or even best::range. We’ll see! 

  5. The fourth time I’ve written one in my career, lmao. I also wrote a C implementation at one point. My friend Matt has an excellent introduction to the Swisstable data structure.