The Best C++ Library

It’s no secret that my taste in programming languages is very weird for a professional programming language enthusiast. Several of my last few posts are about Go, broadly regarded as the programming language equivalent of eating plain oatmeal for breakfast.

To make up for that, I’m going to write about the programming language equivalent of diluting your morning coffee with Everclear. I am, of course, talking about C++.

If you’ve ever had the misfortune of doing C++ professionally, you’ll know that the C++ standard library is really bad. Where to begin?

Well, the associative containers are terrible. Due to bone-headed API decisions, std::unordered_map MUST be a closed-addressing, array-of-linked-lists map, not a Swisstable, despite closed addressing being an outdated technology. std::map, which is not what you usually want, must be a red-black tree. It can’t be a B-tree, which is what every sensible language provides for its ordered map.

std::optional is a massive pain in the ass to use, and is full of footguns, like operator*. std::variant is also really annoying to use. std::filesystem is full of sharp edges. And where are the APIs for signals?

Everything is extremely wordy. std::hardware_destructive_interference_size could have been called std::cache_line. std::span::subspan could have used operator[]. The standard algorithms are super wordy, because they deal with iterator pairs. Oh my god, iterator pairs. They added std::ranges, which do not measure up to Rust’s Iterator at all!

I’m so mad about all this! The people in charge of C++ clearly, actively hate their users!1 They want C++ to be as hard and unpleasant as possible to use. Many brilliant people that I am lucky to consider friends and colleagues, including Titus Winters, JeanHeyd Meneide, Matt Fowles-Kulukundis, and Andy Soffer, have tried and mostly failed2 to improve the language.

This is all to say that I believe C++ in its current form is unfixable. But that’s only due to the small-mindedness of a small cabal based out of Redmond. What if we could do whatever we wanted? What if we used C++’s incredible library-building language features to build a brand-new language?

For the last year-or-so I’ve been playing with a wild idea: what would C++ look like if we did it over again? Starting from an empty C++20 file with no access to the standard library, what can we build in its place?

Starting Over

Titus started Abseil while at Google, whose namespace, absl, is sometimes said to stand for “a better standard library”3. To me, Abseil is important because it was an attempt to work with the existing standard library and make it better, while retaining a high level of implementation quality that a C++ shop’s home-grown utility library won’t have, and a uniformity of vision that Boost is too all-over-the-place to achieve.

Rather than trying to coexist with the standard library, I want to surpass it. As a form of performance art, I want to discover what the standard library would look like if we designed it today, in 2025.

In this sense, I want to build something that isn’t just better. It should be the C++ standard library from the best possible world. It is the best possible library. This is why my library’s namespace is best.

In general, I am trying not to directly copy either what C++, or Abseil, or Rust, or Go did. However, each of them has really interesting ideas, and the best library probably lies in some middle-ground somewhere.

The rest of this post will be about what I have achieved with best so far, and where I want to take it. You can look at the code here.

Building a Foundation

We’re throwing out everything, and that includes <type_traits>. This is a header which shows its age: alias templates weren’t added until C++14, and variable templates were added in C++17. As a result, many things that really ought to be concepts have names like best::is_same_v. All of these now have concept equivalents in <concepts>.

I have opted to try to classify type traits into separate headers to make them easier to find. They all live under //best/meta/traits, and they form the leaves of the dependency graph.

For example, arrays.h contains all of the array traits, such as best::is_array, best::un_array (to remove an array extent), and best::as_array, which applies an extent to a type T, such that best::as_array<T, 0> is not an error.

types.h contains very low-level metaprogramming helpers, such as:

  • best::id and best::val, the identity traits for type- and value-kinded traits.
  • best::same<...>, which returns whether an entire pack of types is all equal.
  • best::lie, our version of std::declval.
  • best::select, our std::conditional_t.
  • best::abridge, a “symbol compression” mechanism for shortening the names of otherwise huge symbols.

funcs.h provides best::tame, which removes the qualifiers from an abominable function type. quals.h provides best::qualifies_to, necessary for determining if a type is “more const” than another. empty.h provides a standard empty type that interoperates cleanly with void.

On top of the type traits is the metaprogramming library //best/meta, which includes generalized constructibility traits in init.h (e.g., to check that you can, in fact, initialize a T& from a T&&). tlist.h provides a very general type-level heterogeneous list abstraction; a parameter pack as-a-type.

The other part of “the foundation” is //best/base, which mostly provides access to intrinsics, portability helpers, macros, and “tag types” such as our versions of std::in_place. For example, macro.h provides BEST_STRINGIFY(), port.h provides BEST_HAS_INCLUDE(), and hint.h provides best::unreachable().

guard.h provides our version of the Rust ? operator, which is not an expression because statement expressions are broken in Clang.

Finally, within //best/container we find best::object, a special type for turning any C++ type into an object (i.e., a type that you can form a reference to). This is useful for manipulating any type generically, without tripping over the assign-through semantics of references. For example, best::object<T&> is essentially a pointer.

“ADT” Containers

On top of this foundation we build the basic algebraic data types of best: best::row and best::choice, which replace std::tuple and std::variant.

best::row<A, B, C> is a heterogeneous collection of values, stored inside of best::objects. This means that best::row<int&> has natural rebinding, rather than assign-through, semantics.

Accessing elements is done with at(): my_row.at<0>() returns a reference to the first element. Getting the first element is so common that you can also use my_row.first(). Using my_row.object<0>() will return a reference to a best::object instead, which can be used for rebinding references. For example:

int x = 0, y = 0;
best::row<int&> a{x};
a.at<0>() = 42;     // Writes to x.
a.object<0>() = y;  // Rebinds a.0 to y.
a.at<0>() = 2*x;    // Writes to y.
C++

There are also second() and last(), for the other two most common elements to access.

best::row is named so in reference to database rows: it provides many operations for slicing and dicing that std::tuple does not.

For example, in addition to extracting single elements, it’s also possible to access contiguous subsequences, using best::bounds: a.at<best::bounds{.start = 1, .end = 10}>()! There are also a plethora of mutation operations:

  • a + b concatenates tuples, copying or moving as appropriate (a + BEST_MOVE(b) will move out of the elements of b, for example).
  • a.push(x) returns a copy of a with x appended, while a.insert<n>(x) does the same at an arbitrary index.
  • a.update<n>(x) replaces the nth element with x, potentially of a different type.
  • a.remove<n>() deletes the nth element, while a.erase<...>() deletes a contiguous range.
  • a.splice<best::bounds{...}>(...) splices a row into another row, offering a general replace/delete operation that all of the above operations are implemented in terms of.
  • gather() and scatter() are even more general, allowing for non-contiguous indexing.

Meanwhile, std::apply is a method now: a.apply(f) calls f with a’s elements as its arguments. a.each(f) is similar, but instead expands to n unary calls of f, one with each element.

And of course, best::row supports structured bindings.

Meanwhile, best::choice<A, B, C> contains precisely one value from various types. There is an underlying best::pun<A, B, C> type that implements a variadic untagged union that works around many of C++’s bugs relating to unions with members of non-trivial type.

The most common way to operate on a choice is to match on it:

best::choice<int, int*, void> z = 42;
int result = z.match(
  [](int x) { return x; },
  [](int* x) { return *x; },
  [] { return 0; }
);
C++

Which case gets called here is chosen by overload resolution, allowing us to write a default case as [](auto&&) { ... }.

Which variant is currently selected can be checked with z.which(), while specific variants can be accessed with z.at(), just like a best::row, except that it returns a best::option<T&>.

best::choice is what all of the other sum types, like best::option and best::result, are built out of. All of the clever layout optimizations live here.

Speaking of best::option<T>, that’s our option type. It’s close in spirit to what Option<T> is in Rust. best has a generic niche mechanism that user types can opt into, allowing best::option<T&> to be the same size as a pointer, using nullptr for the best::none variant.

best::option provides the usual transformation operations: map, then, filter. Emptiness can be checked with is_empty() or has_value(). You can even pass a predicate to has_value() to check the value with, if it’s present: x.has_value([](auto& x) { return x == 42; }).

The value can be accessed using operator* and operator->, like std::optional; however, this operation is checked, instead of causing UB if the option is empty. value_or() can be used to unwrap with a default; the default can be any number of arguments, which are used to construct the default, or even a callback. For example:

best::option<Foo> x;

// Pass arguments to the constructor.
do_something(x.value_or(args, to, foo));

// Execute arbitrary logic if the value is missing.
do_something(x.value_or([] {
  return Foo(...);
}));
C++

best::option<void> also Just Works (in fact, best::option<T> is a best::choice<void, T> internally), allowing for truly generic manipulation of optional results.

best::result<T, E> is, unsurprisingly, the analogue of Rust’s Result<T, E>. Because it’s a best::choice internally, best::result<void, E> works as you might expect, and is a common return value for I/O operations.

It’s very similar to best::option, including offering operator-> for accessing the “ok” variant. This enables succinct idioms:

if (auto r = fallible()) {
  r->do_something();
} else {
  best::println("{}", *r.err());
}
C++

r.ok() and r.err() return best::options containing references to the ok and error variants, depending on which is actually present; meanwhile, a best::option can be converted into a best::result using ok_or() or err_or(), just like in Rust.

best::results are constructed using best::ok and best::err. For example:

best::result<Foo, Error> r = best::ok(args, to, foo);
C++

These internally use best::args, a wrapper over best::row that represents a “delayed initialization” that can be stored in a value. It will implicitly convert into any type that can be constructed from its elements. For example:

Foo foo = best::args(args, to, foo);  // Calls Foo::Foo(args, to, foo).
C++

Also, every one of the above types is a structural type, meaning it can be used for non-type template parameters!

Memory and Pointers

Of course, all of these ADTs need to be built on top of pointer operations, which is where //best/memory comes in. best::ptr<T> is a generalized pointer type that provides many of the same operations as Rust’s raw pointers, including offsetting, copying, and indexing. Like Rust pointers, best::ptr<T> can be a fat pointer, i.e., it can carry additional metadata on top of the pointer. For example, best::ptr<int[]> remembers the size of the array.

Providing metadata for a best::ptr is done through a member alias called BestPtrMetadata. This alias should be private; best is given access to it by befriending best::access. Types with custom metadata will usually not be directly constructible (because they are of variable size), and must be manipulated exclusively through types like best::ptr.

Specifying custom metadata allows specifying what the pointer dereferences to. For example, best::ptr<int[]> dereferences to a best::span<int>, meaning that all the span operations are accessible through operator->: for example, my_array_ptr->first().

Most of this may seem a bit over-complicated, since ordinary C++ raw pointers and references are fine for most uses. However, best::ptr is the foundation upon which best::box<T> is built. best::box<T> is a replacement for std::unique_ptr<T> that fixes its const correctness and adds Rust Box-like helpers. best::box<T[]> also works, but unlike std::unique_ptr<T[]>, it remembers its size, just like best::ptr<T[]>.

best::box is parameterized by its allocator, which must satisfy best::allocator, a much less insane API than what std::allocator offers. best::malloc is a singleton allocator representing the system allocator.

best::span<T>, mentioned before, is the contiguous memory abstraction, replacing std::span. Like std::span, best::span<T, n> is a fixed-length span of n elements. Unlike std::span, the second parameter is a best::option<size_t>, not a size_t that uses -1 as a sentinel.

best::span<T> tries to approximate the API of Rust slices, providing indexing, slicing, splicing, search, sort, and more. Naturally, it’s also iterable, both forwards and backwards, and provides splitting iterators, just like Rust.

Slicing and indexing is always bounds-checked. Indexing can be done with size_t values, while slicing uses a best::bounds:

best::span<int> xs = ...;
auto x = xs[5];
auto ys = xs[{.start = 1, .end = 6}];
C++

best::bounds is a generic mechanism for specifying slicing bounds, similar to Rust’s range types. You can specify the start and end (exclusive), like x..y in Rust. You can also specify an inclusive end using .inclusive_end = 5, equivalent to Rust’s x..=y. And you can specify a count, like C++’s slicing operations prefer: {.start = 1, .count = 5}. best::bounds itself provides all of the necessary helpers for performing bounds checks and crashing with a nice error message. best::bounds is also iterable, as we’ll see shortly.

best::layout is a copy of Rust’s Layout type, providing similar helpers for performing C++-specific size and address calculations.

Iterators

C++ iterator pairs suck. C++ ranges suck. best provides a new paradigm for iteration that is essentially just Rust Iterators hammered into a C++ shape. This library lives in //best/iter.

To define an iterator, you define an iterator implementation type, which must define a member function named next() that returns a best::option:

class my_iter_impl final {
  public:
    best::option<int> next();
};
C++

This type is an implementation detail; the actual iterator type is best::iter<my_iter_impl>. best::iter provides all kinds of helpers, just like Iterator, for adapting the iterator or consuming items out of it.

Iterators can override the behavior of some of these adaptors to be more efficient, such as for making count() constant-time rather than linear. Iterators can also offer extra methods if they define the member alias BestIterArrow; for example, the iterators for best::span have a ->rest() method for returning the part of the slice that has not been yielded by next() yet.

One of the most important extension points is size_hint(), analogous to Iterator::size_hint(), for right-sizing containers that the iterator is converted to, such as a best::vec.

And of course, best::iter provides begin/end so that it can be used in a C++ range-for loop, just like C++20 ranges do. best::int_range<I>4, which best::bounds is an instantiation of, is also an iterator, and can be used much like Rust ranges would:

for (auto i : best::int_range<int>{.start = 1, .count = 200}) {
  // ...
}
C++

best::int_range will carefully handle all of the awkward corner cases around overflow, such as best::int_range<uint8_t>{.inclusive_end = 255}.

Heap Containers

Iterators bring us to the most complex container type that’s checked in right now, best::vec. Not only can you customize its allocator type, but you can also customize its small vector optimization type.

In libc++, std::strings of at most 23 bytes are stored inline, meaning that the string’s own storage, rather than heap storage, is used to hold them. best::vec generalizes this, by allowing any trivially copyable type to be inlined. Thus, a best::vec<int> will hold at most five ints inline, on 64-bit targets.

best::vec mostly copies the APIs of std::vector and Rust’s Vec. Indexing and slicing works the same as with best::span, and all of the best::span operations can be accessed through ->, allowing for things like my_vec->sort(...).

I have an active (failing) PR which adds best::table<K, V>, a general hash table implementation that can be used as either a map or a set. Internally it’s backed by a Swisstable5 implementation. Its API resembles none of std::unordered_map, absl::flat_hash_map, or Rust’s HashMap. Instead, everything is done through a general entry API, similar to that of Rust, but optimized for clarity and minimizing hash lookups. I want to get it merged soonish.

Beyond best::table, I plan to add at least the following containers:

  • best::tree, a B-tree map/set with a similar API.
  • best::heap, a simple min-heap implementation.
  • best::lru, a best::table with a linked list running through it for in-order iteration and oldest-member eviction.
  • best::ring, a ring buffer like VecDeque.
  • best::trie, a port of my twie crate.

Possible other ideas: Russ’s sparse array, splay trees, something like Java’s EnumMap, bitset types, and so on.

Text Handling

best’s string handling is intended to resemble Rust’s as much as possible; it lives within //best/text. best::rune is the Unicode scalar type: its value is always within the valid range for a Unicode scalar, except that it also admits the unpaired surrogates. It offers a number of relatively simple character operations, but I plan to extend it to all kinds of character classes in the future.

best::str is our replacement for std::string_view, close to Rust’s str: a sequence of valid UTF-8 bytes, with all kinds of string manipulation operations, such as rune search, splitting, indexing, and so on.

best::rune and best::str use compiler extensions to ensure that when constructed from literals, they’re constructed from valid literals. This means that the following won’t compile!

best::str invalid = "\xFF";
C++

best::str is a best::span under the hood, which can be accessed and manipulated much like the &[u8] underlying a Rust &str.

best::strbuf is our std::string equivalent. There isn’t very much to say about it, because it works just like you’d expect, and provides a Rust String-like API.

Where this library really shines is that everything is parametrized over encodings. best::str is actually a best::text<best::utf8>; best::str16 is then best::text<best::utf16>. You can write your own text encodings, too, so long as they are relatively tame and you provide rune encode/decode for them. best::encoding is the concept that captures this.

best::text is always validly encoded; however, sometimes, that’s not possible. For this reason we have best::pretext, which is “presumed validly encoded”; its operations can fail or produce replacement characters if invalid code units are found. There is no best::pretextbuf; instead, you would generally use something like a best::vec<uint8_t>.

Unlike in C++, the fact that a best::textbuf is a best::vec under the hood is part of the public interface, allowing for cheap conversions; and, of course, we get best::vec’s small vector optimization for free.

best provides the following encodings out of the box: best::utf8, best::utf16, best::utf32, best::wtf8, best::ascii, and best::latin1.

Formatting

//best/text:format provides a Rust format!()-style text formatting library. It’s as easy as:

auto str = best::format("my number: 0x{:08x}", n);
C++

Through the power of compiler extensions and constexpr, the format is actually checked at compile time!

The available formats are the same as Rust’s, including the {} vs {:?} distinction. But it’s actually way more flexible. You can use any ASCII letter, and types can provide multiple custom formatting schemes using letters. By convention, x, X, b, and o all mean numeric bases. q will quote strings, runes, and other text objects; p will print pointer addresses.

The special format {:!} “forwards from above”; when used in a formatting implementation, it uses the format specifier the caller used. This is useful for causing formats to be “passed through”, such as when printing lists or best::option.

Any type can be made formattable by providing a friend template ADL extension (FTADLE) called BestFmt. This is analogous to implementing a trait like fmt::Debug in Rust, however, all formatting operations use the same function; this is similar to fmt.Formatter in Go.

The best::formatter type, which gets passed into BestFmt, is similar to Rust’s Formatter. Beyond being a sink, it also exposes information on the specifier for the formatting operation via current_spec(), and helpers for printing indented lists and blocks.

BestFmtQuery is a related FTADLE that is called to determine what the valid format specifiers for this type are. This allows the format validator to reject formats that a type does not support, such as formatting a best::str with {:x}.

best::format returns (or appends to) a best::strbuf; best::println and best::eprintln can be used to write to stdout and stderr.

Reflection

Within the metaprogramming library, //best/meta:reflect offers a basic form of reflection. It’s not C++26 reflection, because that’s wholly overkill. Instead, it provides a method for introspecting the members of structs and enums.

For example, suppose that we want to have a default way of formatting arbitrary aggregate structs. The code for doing this is actually devilishly simple:

void BestFmt(auto& fmt, const best::is_reflected_struct auto& value) {
  // Reflect the type of the struct.
  auto refl = best::reflect<decltype(value)>;
  // Start formatting a "record" (key-value pairs).
  auto rec = fmt.record(refl.name());

  // For each field in the struct...
  refl.each([&](auto field) {
    // Add a field to the formatting record...
    rec.field(
      field.name(),   // ...whose name is the field's name...
      value->*field   // ...and with the appropriate value.
    );
  });
}
C++

best::reflect provides access to the fields (or enum variants) of a user-defined type that opts itself in by providing the BestReflect FTADLE, which tells the reflection framework what the fields are. The simplest version of this FTADLE looks like this:

friend constexpr auto BestReflect(auto& mirror, MyStruct*) {
  return mirror.infer();
}
C++

best::mirror is essentially a “reflection builder” that offers fine-grained control over what reflection actually shows of a struct. This allows for hiding fields, or attaching tags to specific fields, which generic functions can then introspect using best::reflected_field::tags().

The functions on best::reflected_type allow iterating over and searching for specific fields (or enum variants); these best::reflected_fields provide metadata about a field (such as its name) and allow accessing it, with the same syntax as a pointer-to-member: value->*field.

Explaining the full breadth (and implementation tricks) of best::reflect would be a post of its own, so I’ll leave it at that.

Unit Tests and Apps

best provides a unit testing framework under //best/test, like any good standard library should. To define a test, you define a special kind of global variable:

best::test MyTest = [](best::test& t) {
  // Test code.
};
C++

This is very similar to a Go unit test, which defines a function that starts with Test and takes a *testing.T as its argument. The best::test& value offers test assertions and test failures. Through the power of looking at debuginfo, we can extract the name MyTest from the binary, and use that as the name of the test directly.

That’s right, this is a C++ test framework with no macros at all!

Meanwhile, at //best/cli we can find a robust CLI parsing library, in the spirit of #[derive(clap::Parser)] and other similar Rust libraries. The way it works is you first define a reflectable struct, whose fields correspond to CLI flags. A very basic example of this can be found in test.h, since test binaries define their own flags:

struct test::flags final {
  best::vec<best::strbuf> skip;
  best::vec<best::strbuf> filters;

  constexpr friend auto BestReflect(auto& m, flags*) {
    return m.infer()
      .with(best::cli::app{.about = "a best unit test binary"})
      .with(&flags::skip,
            best::cli::flag{
              .arg = "FILTER",
              .help = "Skip tests whose names contain FILTER",
            })
      .with(&flags::filters,
            best::cli::positional{
              .name = "FILTERS",
              .help = "Include only tests whose names contain FILTER",
            });
  }
};
C++

Using best::mirror::with, we can apply tags to the individual fields that describe how they should be parsed and displayed as CLI flags. A more complicated, full-featured example can be found at toy_flags.h, which exercises most of the CLI parser’s features.

best::parse_flags<MyFlags>(...) can be used to parse a particular flag struct from program inputs, independent of the actual argv of the program. A best::cli contains the actual parser metadata, but this is not generally user-accessible; it is constructed automatically using reflection.

Streamlining top-level app execution can be done using best::app, which fully replaces the main() function. Defining an app is very similar to defining a test:

best::app MyApp = [](MyFlags& flags) {
  // Do something cool!
};
C++

This will automatically record the program inputs, run the flag parser for MyFlags (printing --help and exiting, when requested), and then call the body of the lambda.

The lambda can either return void, an int (as an exit code), or even a best::result, like in Rust. best::app is also where the argv of the program can be requested by other parts of the program.

What’s Next?

There’s still a lot of stuff I want to add to best. There are no synchronization primitives yet: no atomics, locks, or channels. There’s no I/O; I have a work-in-progress PR to add best::path and best::file. I’d like to write my own math library, best::rc (reference-counting), and portable SIMD. There are also some other OS APIs I want to build, such as signals and subprocesses. I want to add a robust PRNG, time APIs, networking, and stack symbolization.

Building the best C++ library is a lot of work, not least because C++ is a very tricky language and writing exhaustive tests is tedious. But it manages to make C++ fun for me again!

I would love to see contributions some day. I don’t expect anyone to actually use this, but to me, it proves C++ could be so much better.

  1. They are also terrible people

  2. I will grant that JeanHeyd has made significant progress where many people believed it was impossible. He appears to have the indomitable willpower of a shōnen protagonist. 

  3. I have heard an apocryphal story that the namespace was going to be abc or abcl, because it was “Alphabet’s library”. This name was ultimately shot down by the office of the CEO, or so the legend goes. 

  4. This may get renamed to best::interval or even best::range. We’ll see! 

  5. The fourth time I’ve written one in my career, lmao. I also wrote a C implementation at one point. My friend Matt has an excellent introduction to the Swisstable data structure. 

What's //go:nosplit for?

Most people don’t know that Go has special syntax for directives. Unfortunately, it’s not real syntax, it’s just a comment. For example, //go:noinline causes the next function declaration to never get inlined, which is useful for changing the inlining cost of functions that call it.

There are four types of directives:

  1. The ones documented in gc’s doc comment. This includes //go:noinline and //line.

  2. The ones documented elsewhere, such as //go:build and //go:generate.

  3. The ones documented in runtime/HACKING.md, which can only be used if the -+ flag is passed to gc. This includes //go:nowritebarrier.

  4. The ones not documented at all, whose existence can be discovered by searching the compiler’s tests. These include //go:nocheckptr, //go:nointerface, and //go:debug.

We are most interested in a directive of the first type, //go:nosplit. According to the documentation:

The //go:nosplit directive must be followed by a function declaration. It specifies that the function must omit its usual stack overflow check. This is most commonly used by low-level runtime code invoked at times when it is unsafe for the calling goroutine to be preempted.

What does this even mean? Normal program code can use this annotation, but its behavior is poorly specified. Let’s dig in.

Go Stack Growth

Go allocates very small stacks for new goroutines, which grow their stack dynamically. This allows a program to spawn a large number of short-lived goroutines without spending a lot of memory on their stacks.

This means that it’s very easy to overflow the stack. Every function knows how large its stack frame is, and runtime.g, the goroutine struct, contains the end position of the stack; if the stack pointer is less than it (the stack grows downward), control passes to runtime.morestack, which effectively preempts the goroutine while its stack is resized.

In effect, every Go function has the following code around it:

TEXT    .f(SB), ABIInternal, $24-16
  CMPQ    SP, 16(R14)
  JLS     grow
  PUSHQ   BP
  MOVQ    SP, BP
  SUBQ    $16, SP
  // Function body...
  ADDQ    $16, SP
  POPQ    BP
  RET
grow:
  MOVQ    AX, 8(SP)
  MOVQ    BX, 16(SP)
  CALL    runtime.morestack_noctxt(SB)
  MOVQ    8(SP), AX
  MOVQ    16(SP), BX
  JMP     .f(SB)
x86 Assembly (Go Syntax)

Note that r14 holds a pointer to the current runtime.g, and the stack limit is the third word-sized field (runtime.g.stackguard0) in that struct, hence the offset of 16. If the stack is about to be exhausted, it jumps to a special block at the end of the function that spills all of the argument registers, traps into the runtime, and, once that’s done, unspills the arguments and re-starts the function.

Note that arguments are spilled before adjusting rsp, which means that the arguments are written to the caller’s stack frame. This is part of Go’s ABI; callers must allocate space at the top of their stack frames for any function that they call to spill all of its registers for preemption1.

Preemption is not reentrant, which means that functions that are running in the context of a preempted G or with no G at all must not be preempted by this check.

Nosplit Functions

The //go:nosplit directive marks a function as “nosplit”, or a “non-splitting function”. “Splitting” has nothing to do with what this directive does.

Segmented Stacks

In the bad old days, Go’s stacks were split up into segments, where each segment ended with a pointer to the next, effectively replacing the stack’s single array with a linked list of such arrays.

Segmented stacks were terrible. Instead of triggering a resize, these prologues were responsible for updating rsp to the next (or previous) block by following this pointer, whenever the current segment bottomed out. This meant that if a function call happened to be on a segment boundary, it would be extremely slow in comparison to other function calls, due to the significant work required to update rsp correctly.

This meant that unlucky sizing of stack frames meant sudden performance cliffs. Fun!

Go has since figured out that segmented stacks are a terrible idea. In the process of implementing a correct GC stack scanning algorithm (which it did not have for many stable releases), it also gained the ability to copy the contents of a stack from one location to another, updating pointers in such a way that user code wouldn’t notice.

This stack splitting code is where the name “nosplit” comes from.

A nosplit function does not load and branch on runtime.g.stackguard0, and simply assumes it has enough stack. This means that nosplit functions will not preempt themselves, and, as a result, are noticeably faster to call in a hot loop. Don’t believe me?

//go:noinline
func noinline(x int) {}

//go:nosplit
func nosplit(x int) { noinline(x) }
func yessplit(x int) { noinline(x) }

func BenchmarkCall(b *testing.B) {
  b.Run("nosplit", func(b *testing.B) {
    for b.Loop() { nosplit(42) }
  })
  b.Run("yessplit", func(b *testing.B) {
    for b.Loop() { yessplit(42) }
  })
}
Go

If we profile this and pull up the timings for each function, here’s what we get:

390ms      390ms           func nosplit(x int) { noinline(x) }
 60ms       60ms   51fd80:     PUSHQ BP
 10ms       10ms   51fd81:     MOVQ SP, BP
    .          .   51fd84:     SUBQ $0x8, SP
 60ms       60ms   51fd88:     CALL .noinline(SB)
190ms      190ms   51fd8d:     ADDQ $0x8, SP
    .          .   51fd91:     POPQ BP
 70ms       70ms   51fd92:     RET

440ms      490ms           func yessplit(x int) { noinline(x) }
 50ms       50ms   51fda0:     CMPQ SP, 0x10(R14)
 20ms       20ms   51fda4:     JBE 0x51fdb9
    .          .   51fda6:     PUSHQ BP
 20ms       20ms   51fda7:     MOVQ SP, BP
    .          .   51fdaa:     SUBQ $0x8, SP
 10ms       60ms   51fdae:     CALL .noinline(SB)
200ms      200ms   51fdb3:     ADDQ $0x8, SP
    .          .   51fdb7:     POPQ BP
140ms      140ms   51fdb8:     RET
    .          .   51fdb9:     MOVQ AX, 0x8(SP)
    .          .   51fdbe:     NOPW
    .          .   51fdc0:     CALL runtime.morestack_noctxt.abi0(SB)
    .          .   51fdc5:     MOVQ 0x8(SP), AX
    .          .   51fdca:     JMP .yessplit(SB)
x86 Assembly (Go Syntax)

The time spent at each instruction (for the whole benchmark, where I made sure each test case ran the same number of iterations with -benchtime Nx) is comparable for all of the instructions these functions share, but an additional ~2% cost is incurred for the stack check.

This is a very artificial setup, because the g struct is always in L1 in the yessplit benchmark due to the fact that no other memory operations occur in the loop. However, for very hot code that needs to saturate the cache, this can have an outsized effect due to cache misses. We can enhance this benchmark by adding an assembly function that executes clflush [r14], which causes the g struct to be ejected from all caches.

TEXT .clflush(SB)
  CLFLUSH (R14)  // Eject the pointee of r14 from all caches.
  RET
x86 Assembly (Go Syntax)

If we add a call to this function to both benchmark loops, we see the staggering cost of a cold fetch from RAM show up in every function call: 120.1 nanoseconds for BenchmarkCall/nosplit, versus 332.1 nanoseconds for BenchmarkCall/yessplit. The 200 nanosecond difference is a fetch from main memory. An L1 miss is about 15 times less expensive, so if the g struct merely gets kicked out of L1, you’re paying about 15 nanoseconds, or about two map lookups!

Despite the language resisting an inlining directive, which programmers would place everywhere without knowing what it does, it does provide something worse that makes code noticeably faster: nosplit.

But It’s Harmless…?

Consider the following program2:

//go:nosplit
func x(y int) { x(y+1) }
Go

Naturally, you would expect this to instantly overflow the stack at runtime. Instead, we get a really scary linker error:

x.x: nosplit stack over 792 byte limit
x.x<1>
    grows 24 bytes, calls x.x<1>
    infinite cycle
Console

The Go linker contains a check to verify that no chain of nosplit functions calling nosplit functions can overflow a small window of extra stack, which is where the stack frames of nosplit functions live if they go past stackguard0.

Every stack frame contributes some stack use (for the return address, at minimum), so the number of nosplit functions you can chain before you get this error is limited. And because every function needs to allocate space for all of its callees to spill their arguments if necessary, you can hit this limit very fast if every one of these functions uses every available argument register (ask me how I know).
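To make the chain accounting concrete, here is a sketch (the function names are mine) of the kind of nosplit call chain the linker analyzes. Each frame in the chain contributes to the total; this tiny one stays well under the window from the error above and links fine.

```go
package main

import "fmt"

// A chain of nosplit functions. The linker sums the frame sizes along this
// chain; if the total exceeded the small nosplit stack window (the 792-byte
// limit in the error above), the program would fail to *link*, not crash
// at runtime.

//go:nosplit
func leaf(x int) int { return x + 1 }

//go:nosplit
func mid(x int) int { return leaf(x) + 1 }

//go:nosplit
func root(x int) int { return mid(x) + 1 }

func main() {
	fmt.Println(root(0)) // 3
}
```

Each additional link in the chain eats into the window, which is why long nosplit chains (or wide argument lists) trip the error so easily.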

Also, turning on fuzzing instruments the code by inserting calls to nosplit functions in the fuzzer runtime around branches, meaning that turning on fuzzing can cause previously-fine code to no longer link. Stack usage also varies slightly by architecture, meaning that code which builds on one architecture can fail to link on others (most visible when going from 32-bit to 64-bit).

There is no easy way to control directives using build tags (two poorly-designed features collide), so you cannot just “turn off” performance-sensitive nosplits for debugging, either.

For this reason, you must be very very careful about using nosplit for performance.

Virtual Nosplit Functions

Excitingly, nosplit functions whose addresses are taken do not have special codegen, allowing us to defeat the linker stack check by using virtual function calls.

Consider the following program:

package main

var f func(int)

//go:nosplit
func x(y int) { f(y+1) }

func main() {
  f = x
  f(0)
}
Go

This will quickly exhaust the main G’s tiny stack and segfault in the most violent way imaginable, preventing the runtime from printing a debug trace. All this program outputs is signal: segmentation fault.

This is probably a bug.

Other Side Effects

It turns out that nosplit has various other fun side effects that are not documented anywhere. The main one is that it contributes to whether a function is considered “unsafe” by the runtime.

Consider the following program:

package main

import (
  "fmt"
  "os"
  "runtime"
  "time"
)

func main() {
  for range runtime.GOMAXPROCS(0) {
    go func() {
      for {}
    }()
  }
  time.Sleep(time.Second) // Wait for all the other Gs to start.

  fmt.Println("Hello, world!")
  os.Exit(0)
}
Go

This program makes sure that every P becomes bound to a G that loops forever, meaning those Gs will never trap into the runtime. Thus, you would expect this program to hang forever, never printing its result and exiting. But that’s not what happens.

Thanks to asynchronous preemption, the scheduler will detect Gs that have been running for too long, and preempt their M by sending a signal to it (due to happenstance, this is SIGURG of all things).

However, asynchronous preemption is only possible when the M stops due to the signal at a safe point, as determined by runtime.isAsyncSafePoint. It includes the following block of code:

	up, startpc := pcdatavalue2(f, abi.PCDATA_UnsafePoint, pc)
	if up == abi.UnsafePointUnsafe {
		// Unsafe-point marked by compiler. This includes
		// atomic sequences (e.g., write barrier) and nosplit
		// functions (except at calls).
		return false, 0
	}
Go

If we chase down where this value is set, we’ll find that it is set explicitly for write barrier sequences, for any function that is “part of the runtime” (as defined by being built with the -+ flag), and for any nosplit function.

With a small modification of hoisting the go body into a nosplit function, the following program will run forever: it will never wake up from time.Sleep.

package main

import (
  "fmt"
  "os"
  "runtime"
  "time"
)

//go:nosplit
func forever() {
  for {}
}

func main() {
  for range runtime.GOMAXPROCS(0) {
    go forever()
  }
  time.Sleep(time.Second) // Wait for all the other Gs to start.

  fmt.Println("Hello, world!")
  os.Exit(0)
}
Go

Even though there is work to do, every P is bound to a G that will never reach a safe point, so there will never be a P available to run the main goroutine.

This represents another potential danger of using nosplit functions: those that do not call preemptable functions must terminate promptly, or risk livelocking the whole runtime.

Conclusion

I use nosplit a lot, because I write high-performance, low-latency Go. This is a very insane thing to do, which has caused me to slowly generate bug reports whenever I hit strange corner cases.

For example, there are many cases where spill regions are allocated for functions that never use them: a function that only calls nosplit functions still allocates space for those callees to spill their arguments, which they never do.3

This is a documented Go language feature which:

  1. Isn’t very well-documented (the async preemption behavior certainly isn’t)!
  2. Has very scary optimization-dependent build failures.
  3. Can cause livelock and mysterious segfaults.
  4. Can be used in user programs that don’t import "unsafe"!
  5. And it makes code faster!

I’m surprised such a massive footgun exists at all, buuuut it’s a measurable benchmark improvement for me, so it’s impossible to tell if it’s bad or not.

  1. The astute reader will observe that because preemption is not reentrant, only one of these spill regions will be in use at a time in a G. This is a known bug in the ABI, and is essentially a bodge to enable easy adoption of passing arguments by register, without needing to update all of the parts of the runtime that expect arguments to be spilled to the stack, as was the case in the slow old days when Go’s ABI on every platform was “i386-unknown-linux but worse”, i.e., arguments went on the stack and made the CPU’s store queue sad.

    I recently filed a bug about this that boils down to “add a field to runtime.g to use a spill space”, which seems to me to be simpler than the alternatives described in the ABIInternal spec. 

  2. Basically every bug report I write starts with these four words and it means you’re about to see the worst program ever written. 

  3. The spill area is also used for spilling arguments across calls, but in this case, it is not necessary for the caller to allocate it for a nosplit function. 

Protobuf Tip #7: Scoping It Out

You’d need a very specialized electron microscope to get down to the level to actually see a single strand of DNA. – Craig Venter

TL;DR: buf convert is a powerful tool for examining wire format dumps: it converts them to JSON so you can use existing JSON analysis tooling. protoscope can be used for lower-level analysis, such as debugging messages that have been corrupted.

I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think they fit with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog.

JSON from Protobuf?

JSON’s human-readable syntax is a big reason why it’s so popular, possibly second only to built-in support in browsers and many languages. It’s easy to examine any JSON document using tools like online prettifiers and the inimitable jq.

But Protobuf is a binary format! This means that you can’t easily use jq-like tools with it… or can you?

Transcoding with buf convert

The Buf CLI offers a utility for transcoding messages between the three Protobuf encoding formats: the wire format, JSON, and textproto; it also supports YAML. This is buf convert, and it’s very powerful.

To perform a conversion, we need four inputs:

  1. A Protobuf source to get types out of. This can be a local .proto file, an encoded FileDescriptorSet, or a remote BSR module.
    • If not provided, but run in a directory that is within a local Buf module, that module will be used as the Protobuf type source.
  2. The name of the top-level type for the message we want to transcode, via the --type flag.
  3. The input message, via the --from flag.
  4. A location to output to, via the --to flag.

buf convert supports input and output redirection, making it usable as part of a shell pipeline. For example, consider the following Protobuf code in our local Buf module:

// my_api.proto
syntax = "proto3";
package my.api.v1;

message Cart {
  int32 user_id = 1;
  repeated Order orders = 2;
}

message Order {
  fixed64 sku = 1;
  string sku_name = 2;
  int64 count = 3;
}
Protobuf

Then, let’s say we’ve dumped a message of type my.api.v1.Cart from a service to debug it. And let’s say… well, you can’t just cat it.

$ cat dump.pb | xxd -ps
08a946121b097ac8e80400000000120e76616375756d20636c65616e6572
18011220096709b519000000001213686570612066696c7465722c203220
7061636b1806122c093aa8188900000000121f69736f70726f70796c2061
6c636f686f6c203730252c20312067616c6c6f6e1802
Console

However, we can use buf convert to turn it into some nice JSON. We can then pipe it into jq to format it.

$ buf convert --type my.api.v1.Cart --from dump.pb --to -#format=json | jq
{
  "userId": 9001,
  "orders": [
    {
      "sku": "82364538",
      "skuName": "vacuum cleaner",
      "count": "1"
    },
    {
      "sku": "431294823",
      "skuName": "hepa filter, 2 pack",
      "count": "6"
    },
    {
      "sku": "2300094522",
      "skuName": "isopropyl alcohol 70%, 1 gallon",
      "count": "2"
    }
  ]
}
Console

Now you have the full expressivity of jq at your disposal. For example, we could pull out the user ID for the cart:

$ function buf-jq() { buf convert --type "$1" --from "$2" --to -#format=json | jq "$3"; }
$ buf-jq my.api.v1.Cart dump.pb '.userId'
9001
Console

Or we can extract all of the SKUs that appear in the cart:

$ buf-jq my.api.v1.Cart dump.pb '[.orders[].sku]'
[
  "82364538",
  "431294823",
  "2300094522"
]
Console

Or we could try calculating how many items are in the cart, total:

$ buf-jq my.api.v1.Cart dump.pb '[.orders[].count] | add'
"162"
Console

Wait. That’s wrong. The answer should be 9. This illustrates one pitfall to keep in mind when using jq with Protobuf. Protobuf will sometimes serialize numbers as quoted strings (the C++ reference implementation only does this when they’re integers outside of the IEEE754 representable range, but Go is somewhat lazier, and does it for all 64-bit values).

You can test whether an int64 x is in the representable float64 range with this very simple check: int64(float64(x)) == x. See https://go.dev/play/p/T81SbbFg3br. The equivalent check in C++ is much more complicated.
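As a quick sanity check of that round-trip test (a standalone sketch; fitsInFloat64 is my name for the helper), the boundary sits at 2^53, the end of the contiguous run of integers that float64 represents exactly:

```go
package main

import "fmt"

// fitsInFloat64 reports whether x survives a round-trip through float64.
func fitsInFloat64(x int64) bool {
	return int64(float64(x)) == x
}

func main() {
	fmt.Println(fitsInFloat64(1 << 53))   // true: 2^53 is exactly representable
	fmt.Println(fitsInFloat64(1<<53 + 1)) // false: rounds back down to 2^53
}
```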

This means we need to use the tonumber conversion function:

$ buf-jq my.api.v1.Cart dump.pb '[.orders[].count | tonumber] | add'
9
Console

jq’s whole deal is JSON, so it brings with it all of JSON’s pitfalls. This is notable for Protobuf when trying to do arithmetic on 64-bit values. As we saw above, Protobuf serializes integers outside of the 64-bit float representable range as strings (and in some runtimes, all 64-bit values, even those inside it).

For example, if you have a repeated int64 that you want to sum over, it may produce incorrect answers due to floating-point rounding. For notes on conversions in jq, see https://jqlang.org/manual/#identity.
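To see why sums specifically go wrong, here is a small Go sketch of the failure mode a JSON tool hits when large counts are handled as floats:

```go
package main

import "fmt"

func main() {
	big := int64(1) << 60 // exactly representable (a power of two)

	// Summing as float64, the way a JSON tool would: the +1 is smaller than
	// the gap between adjacent float64 values near 2^60 (which is 256), so
	// it is silently lost to rounding.
	sum := float64(big) + 1

	fmt.Println(int64(sum) == big+1) // false: the sum lost the 1
	fmt.Println(big + 1 - int64(sum)) // 1: the rounding error
}
```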

Disassembling with protoscope

protoscope is a tool provided by the Protobuf team (which I originally wrote!) for decoding arbitrary data as if it were encoded in the Protobuf wire format. This process is called disassembly. It’s designed to work without a schema available, although it doesn’t produce especially clean output.

$ go install github.com/protocolbuffers/protoscope/cmd/protoscope...@latest
$ protoscope dump.pb
1: 9001
2: {
  1: 82364538i64
  2: {"vacuum cleaner"}
  3: 1
}
2: {
  1: 431294823i64
  2: {
    13: 101
    14: 97
    4: 102
    13: 1.3518748403899336e-153   # 0x2032202c7265746ci64
    14: 97
    12:SGROUP
    13:SGROUP
  }
  3: 6
}
2: {
  1: 2300094522i64
  2: {"isopropyl alcohol 70%, 1 gallon"}
  3: 2
}
Console

The field names are gone; only field numbers are shown. This example also reveals an especially glaring limitation of protoscope: it can’t tell the difference between string and message fields, so it guesses according to some heuristics. It was able to grok the first and third orders’ names as strings, but for orders[1].sku_name, it incorrectly guessed that it was a message and produced garbage.

The tradeoff is that not only does protoscope not need a schema, it also tolerates almost any error, making it possible to analyze messages that have been partly corrupted. If we flip a random bit somewhere in orders[0], disassembling the message still succeeds:

$ protoscope dump.pb
1: 9001
2: {`0f7ac8e80400000000120e76616375756d20636c65616e65721801`}
2: {
  1: 431294823i64
  2: {
    13: 101
    14: 97
    4: 102
    13: 1.3518748403899336e-153   # 0x2032202c7265746ci64
    14: 97
    12:SGROUP
    13:SGROUP
  }
  3: 6
}
2: {
  1: 2300094522i64
  2: {"isopropyl alcohol 70%, 1 gallon"}
  3: 2
}
Console

Although protoscope did give up on disassembling the corrupted submessage, it still made it through the rest of the dump.

Like buf convert, we can give protoscope a FileDescriptorSet to make its heuristic a little smarter.

$ protoscope \
  --descriptor-set <(buf build -o -) \
  --message-type my.api.v1.Cart \
  --print-field-names \
  dump.pb
1: 9001                   # user_id
2: {                      # orders
  1: 82364538i64          # sku
  2: {"vacuum cleaner"}   # sku_name
  3: 1                    # count
}
2: {                          # orders
  1: 431294823i64             # sku
  2: {"hepa filter, 2 pack"}  # sku_name
  3: 6                        # count
}
2: {                                      # orders
  1: 2300094522i64                        # sku
  2: {"isopropyl alcohol 70%, 1 gallon"}  # sku_name
  3: 2                                    # count
}
Console

Not only is the second order decoded correctly now, but protoscope shows the name of each field (via --print-field-names). In this mode, protoscope still decodes partially-valid messages.

protoscope also provides a number of other flags for customizing its heuristic in the absence of a FileDescriptorSet. This enables it to be used as a forensic tool for debugging messy data corruption bugs.