2025-03-11 • 3738 words • 31 minutes
• #formats • #parsing • #frontend

The Art of Formatting Code

Every modern programming language needs a formatter to make your code look pretty and consistent. Formatters are source-transformation tools that parse source code and re-print the resulting AST in some canonical form that normalizes whitespace and optional syntactic constructs. They remove the tedium of matching indentation and brace placement to match a style guide.

Go is particularly well-known for providing a formatter as part of its toolchain from day one. It is not a good formatter, though, because it cannot enforce a maximum column width. Later formatters of the 2010s, such as rustfmt and clang-format, do provide this feature, which ensure that individual lines of code don’t get too long.

The reason Go doesn’t do this is because the naive approach to formatting code makes it intractable to do so. There are many approaches to implementing this, which can make it seem like a very complicated layout constraint solving problem.

So what’s so tricky about formatting code? Aren’t you just printing out an AST?

“Just” an AST

An AST¹ (abstract syntax tree) is a graph representation of a program’s syntax. Let’s consider something like JSON, whose naively-defined AST type might look something like this.

enum Json {
  Null,
  Bool(bool),
  Number(f64),
  String(String),
  Array(Vec<Json>),
  Object(HashMap<String, Json>)
}Rust

The AST for the document {"foo": null, "bar": 42} might look something like this:

let my_doc = Json::Object([
  ("foo".to_string(), Json::Null),
  ("bar".to_string(), Json::Number(42)),
].into());Rust

This AST has some pretty major problems. A formatter must not change the syntactic structure of the program (beyond removing things like redundant braces). Formatting must also be deterministic.

First off, Json::Object is a HashMap, which is unordered. So it will immediately discard the order of the keys. Json::String does not retain the escapes from the original string, so "\n" and "\u000a" are indistinguishable. Json::Number will destroy information: JSON numbers can specify values outside of the f64 representable range, but converting to f64 will quantize to the nearest float.

Now, JSON doesn’t have comments, but if it did, our AST has no way to record it! So it would destroy all comment information! Plus, if someone has a document that separates keys into stanzas², as shown below, this information is lost too.

{
  "this": "is my first stanza",
  "second": "line",

  "here": "is my second stanza",
  "fourth": "line"
}JSON

Truth is, the AST for virtually all competent toolchains are much more complicated than this. Here’s some important properties an AST needs to have to be useful.

Retain span information. Every node in the graph remembers what piece of the file it was parsed from.
Retain whitespace information. “Whitespace” typically includes both whitespace characters, and comments.
Retain ordering information. The children of each node need to be stored in ordered containers.

The first point is achieved in a number of ways, but boils down to somehow associating to each token a pair of integers³, identifying the start and end offsets of the token in the input file.

Given the span information for each token, we can then define the span for each node to be the join of its tokens’ spans, namely the start is the min of its constituent tokens’ starts and its end is the max of the ends. This can be easily calculated recursively.

Once we have spans, it’s easy to recover the whitespace between any two adjacent syntactic constructs by calculating the text between them. This approach is more robust than, say, associating each comment with a specific token, because it makes it easier to discriminate stanzas for formatting.

Being able to retrieve the comments between any two syntax nodes is crucial. Suppose the user writes the following Rust code:

let x = false && // HACK: disable this check.
  some_complicated_check();Rust

If we’re formatting the binary expression containing the &&, and we can’t query for comments between the LHS and the operator, or the operator and the RHS, the // HACK comment will get deleted on format, which is pretty bad!

An AST that retains this level of information is sometimes called a “concrete syntax tree”. I do not consider this a useful distinction, because any useful AST must retain span and whitespace information, and it’s kind of pointless to implement the same AST more than once. To me, an AST without spans is incomplete.

Updating Our JSON AST

With all this in mind, the bare minimum for a “good” AST is gonna be something like this.

struct Json {
  kind: JsonKind,
  span: (usize, usize),
}

enum JsonKind {
  Null,
  Bool(bool),
  Number(f64),
  String(String),
  Array(Vec<Json>),
  Object(Vec<(String, Json)>),  // Vec, not HashMap.
}Rust

There are various layout optimizations we can do: for example, the vast majority of strings exist literally in the original file, so there’s no need to copy them into a String; it’s only necessary if the string contains escapes. My byteyarn crate, which I wrote about here, is meant to make handling this case easy. So we might rewrite this to be lifetime-bound to the original file.

struct Json<'src> {
  kind: JsonKind<'src>,
  span: (usize, usize),
}

enum JsonKind<'src> {
  Null,
  Bool(bool),
  Number(f64),
  String(Yarn<'src, str>),
  Array(Vec<Json>),
  Object(Vec<(Yarn<'src, str>, Json)>),  // Vec, not HashMap.
}Rust

But wait, there’s some things that don’t have spans here. We need to include spans for the braces of Array and Object, their commas, and the colons on object keys. So what we actually get is something like this:

struct Span {
  start: usize,
  end: usize,
}

struct Json<'src> {
  kind: JsonKind<'src>,
  span: Span,
}

enum JsonKind<'src> {
  Null,
  Bool(bool),
  Number(f64),
  String(Yarn<'src, str>),

  Array {
    open: Span,
    close: Span,
    entries: Vec<ArrayEntry>,
  },
  Object {
    open: Span,
    close: Span,
    entries: Vec<ObjectEntry>,
  },
}

struct ArrayEntry {
  value: Json,
  comma: Option<Span>,
}

struct ObjectEntry {
  key: Yarn<'src, str>,
  key_span: Span,
  colon: Span,
  value: Json,
  comma: Option<Span>,
}Rust

Implementing an AST is one of my least favorite parts of writing a toolchain, because it’s tedious to ensure all of the details are recorded and properly populated.

“Just” Printing an AST

In Rust, you can easily get a nice recursive print of any struct using the #[derive(Debug)] construct. This is implemented by recursively calling Debug::fmt() on the elements of a struct, but passing modified Formatter state to each call to increase the indentation level each time.

This enables printing nested structs in a way that looks like Rust syntax when using the {:#?} specifier.

Foo {
  bar: 0,
  baz: Baz {
    quux: 42,
  },
}Rust

We can implement a very simple formatter for our JSON AST by walking it recursively.

fn fmt(out: &mut String, json: &Json, file: &str, indent: usize) {
  match &json.kind {
    Json::Null | Json::Bool(_) | Json::Number(_) | Json::String(_) => {
      // Preserve the input exactly.
      out.push_str(file[json.span.start..json.span.end]);
    }

    Json::Array { entries, .. } => {
      out.push('[');
      for entry in entries {
        out.push('\n');
        for _ in indent*2+2 {
          out.push(' ');
        }
        fmt(out, &entry.value, file, indent + 1)
        if entry.comma.is_some() {
          out.push(',');
        }
      }
      out.push('\n');
      for _ in indent*2 {
        out.push(' ');
      }
      out.push(']');
    }

    Json::Object { entries, .. } => {
      out.push('{');
      for entry in entries {
        out.push('\n');
        for _ in indent*2+2 {
          out.push(' ');
        }

        // Preserve the key exactly.
        out.push_str(file[entry.key_span.start..entry.key_span.end]);

        out.push_str(": ");
        fmt(out, &entry.value, file, indent + 1)
        if entry.comma.is_some() {
          out.push(',');
        }
      }
      out.push('\n');
      for _ in indent*2 {
        out.push(' ');
      }
      out.push('}');
    }
  }
}Rust

This is essentially what every JSON serializer’s “pretty” mode looks like. It’s linear, it’s simple. But it has one big problem: small lists.

If I try to format the document {"foo": []} using this routine, the output will be

{
  "foo": [
  ]
}JSON

This is pretty terrible, but easy to fix by adding a special case:

Json::Array { entries, .. } => {
  if entries.is_empty() {
    out.push_str("[]");
    return
  }

  // ...
}Rust

Unfortunately, this doesn’t handle the similar case of a small but non-empty list. {"foo": [1, 2]} formats as

{
  "foo": [
    1,
    2
  ]
}JSON

Really, we’d like to keep "foo": [1, 2] on one line. And now we enter the realm of column wrapping.

How Wide Is a Codepoint?

The whole point of a formatter is to work with monospaced text, which is text formatted using a monospaced or fixed-width typeface, which means each character is the same width, leading to the measure of the width of lines in columns.

So how many columns does the string cat take up? Three, pretty easy. But we obviously don’t want to count bytes, this isn’t 1971. If we did, кішка, when UTF-8 encoded, it would be 10, rather than 5 columns wide. So we seem to want to count Unicode characters instead?

Oh, but what is a Unicode character? Well, we could say that you’re counting Unicode scalar values (what Rust’s char and Go’s rune) types represent. Or you could count grapheme clusters (like Swift’s Character).

But that would give wrong answers. CJK languages’ characters, such as 猫, usually want to be rendered as two columns, even in monospaced contexts. So, you might go to Unicode and discover UAX#11, and attempt to use it for assigning column widths. But it turns out that the precise rules that monospaced fonts use are not written down in a single place in Unicode. You would also discover that some scripts, such as Arabic, have complex ligature rules that mean that the width of a single character depends on the characters around it.

This is a place where you should hunt for a library. unicode_width is the one for Rust. Given that Unicode segmentation is a closely associated operation to width, segmentation libraries are a good place to look for a width calculation routine.

But most such libraries will still give wrong answers, because of tabs. The tab character U+0009 CHARACTER TABULATION’s width depends on the width of all characters before it, because a tab is as wide as needed to reach the next tabstop, which is a column position an integer multiple of the tab width (usually 2, 4, or, on most terminals, 8).

With a tab width of 4, "\t", "a\t", and "abc\t" are all four columns wide. Depending on the context, you will either want to treat tabs as behaving as going to the next tabstop (and thus being variable width), or having a fixed width. The former is necessary for assigning correct column numbers in diagnostics, but we’ll find that the latter is a better match for what we’re doing.

The reason for being able to calculate the width of a string is to enable line wrapping. At some point in the 2010s, people started writing a lot of code on laptops, where it is not easy to have two editors side by side on the small screen. This removes the motivation to wrap all lines at 80 columns⁴, which in turn results in lines that tend to get arbitrarily long.

Line wrapping helps ensure that no matter how wide everyone’s editors are, the code I have to read fits on my very narrow editors.

Accidentally Quadratic

A lot of folks’ first formatter recursively formats a node by formatting its children to determine if they fit on one line or not, and based on that, and their length if they are single-line, determine if their parent should break.

This is a naive approach, which has several disadvantages. First, it’s very easy to accidentally backtrack, trying to only break smaller and smaller subexpressions until things fit on one line, which can lead to quadratic complexity. The logic for whether a node can break is bespoke per node and that makes it easy to make mistakes.

Consider formatting {"foo": [1, 2]}. In our AST, this will look something like this:

Json {
  kind: JsonKind::Object {
    open: Span { start: 0, end: 1 },
    close: Span { start: 14, end: 15 },
    entries: vec![ObjectEntry {
      key: "foo",
      key_span: Span { start: 1, end: 4 },
      colon: Span { start: 4, end: 5 },
      value: Json {
        kind: JsonKind::Array {
          span: Span { start: 8, end: 9 },
          span: Span { start: 13, end: 14 },
          entries: vec![
            ArrayEntry {
              value: Json {
                kind: JsonKind::Number(1.0),
                span: Span { start: 9, end: 10 },
              },
              comma: Some(Span { start: 10, end: 11 }),
            },
            ArrayEntry {
              value: Json {
                kind: JsonKind::Number(s.0),
                span: Span { start: 12, end: 13 },
              },
              comma: None,
            },
          ],
        },
        span: Span { start: 8, end: 14 },
      },
      comma: None,
    }],
  },
  span: Span { start: 0, end: 15 },
}Rust

To format the whole document, we need to know the width of each field in the object to decide whether the object fits on one line. To do that, we need to calculate the width of each value, and add to it the width of the key, and the width of the : separating them.

How can this be accidentally quadratic? If we simply say “format this node” to obtain its width, that will recursively format all of the children it contains without introducing line breaks, performing work that is linear in how many transitive children that node contains. Having done this, we can now decide if we need to introduce line breaks or not, which increases the indentation at which the children are rendered. This means that the children cannot know ahead of time how much of the line is left for them, so we need to recurse into formatting them again, now knowing the indentation at which the direct children are rendered.

Thus, each node performs work equal to the number of nodes beneath it. This has resulted in many slow formatters.

Now, you could be more clever and have each node be capable of returning its width based on querying its children’s width directly, but that means you need to do complicated arithmetic for each node that needs to be synchronized with the code that actually formats it. Easy to make mistakes.

The solution is to invent some kind of model for your document that specifies how lines should be broken if necessary, and which tracks layout information so that it can be computed in one pass, and then used in a second pass to figure out whether to actually break lines or not.

This is actually how HTML works. The markup describes constraints on the layout of the content, and then a layout engine, over several passes, calculates sizes, solves constraints, and finally produces a raster image representing that HTML document. Following the lead of HTML, we can design…

A DOM for Your Code

The HTML DOM is a markup document: a tree of tags where each tag has a type, such as , <a>, <hr>, or , properties, such as <a href=...>, and content consisting of nested tags (and bare text, which every HTML engine just handles as a special kind of tag), such as Hello World!.

We obviously want to have a tag for text that should be rendered literally. We also want a tag for line breaks that is distinct from the text tag, so that they can be merged during rendering. It might be good to treat text tags consisting of just whitespace, such as whitespace, specially: two newlines \n\n are a blank line, but we might want to merge consecutive blank lines. Similarly, we might want to merge consecutive spaces to simplify generating the DOM.

Consider formatting a language like C++, where a function can have many modifiers on it that can show up in any order, such as inline, virtual, constexpr, and explicit. We might want to canonicalize the order of these modifiers. We don’t want to accidentally wind up printing inline constexpr Foo() because we printed an empty string for virtual. Having special merging for spaces means that all entities are always one space apart if necessary. This is a small convenience in the DOM that multiplies to significant simplification when lowering from AST to DOM.

Another useful tag is something like <indent by=" ">, which increases the indentation level by some string (or perhaps simply a number of spaces; the string just makes supporting tabs easier) for the tags inside of it. This allows control of indentation in a carefully-scoped manner.

Finally, we need some way to group tags that are candidates for “breaking”: if the width of all of the tags inside of a <group> is greater than the maximum width that group can have (determined by indentation and any elements on the same line as that group), we can set that group to “broken”, and… well, what should breaking do?

We want breaking to not just cause certain newlines (at strategic locations) to appear, but we also want it to cause an indentation increase, and in languages with trailing commas like Rust and Go, we want (or in the case of Go, need) to insert a trailing comma only when broken into multiple lines. We can achieve this by allowing any tag to be conditioned on whether the enclosing group is broken or not.

Taken all together, we can render the AST for our {"foo": [1, 2]} document into this DOM, according to the tags we’ve described above.

<group>
  <text s="{" />
  <text s="\n" if=broken />
  <indent by="  ">
    <text s='"foo"' />
    <text s=":" />
    <text s=" " />
    <group>
      <text s="[" />
      <text s="\n" if=broken />
      <indent by="  ">
        <text s="1" />
        <text s="," />
        <text s=" " if=flat />
        <text s="\n" if=broken />
        <text s="2" />
      </indent>
      <text s="\n" if=broken />
      <text s="]"/>
    </group>
  </indent>
  <text s="\n" if=broken />
  <text s="}" />
</group>XML

Notice a few things: All of the newlines are set to appear only if=broken. The space between the two commas only appears if the enclosing group is not broken, that is if=flat. The groups encompass everything that can move due to a break, which includes the outer braces. This is necessary because if that brace is not part of the group, and it is the only character past the line width limit, it will not cause the group to break.

Laying Out Your DOM

The first pass is easy: it measures how wide every node is. But we don’t know whether any groups will break, so how can we measure that without calculating breaks, which depend on indentation, and the width of their children, and…

This is one tricky thing about multi-pass graph algorithms (or graph algorithms in general): it can be easy to become overwhelmed trying to factor the dependencies at each node so that they are not cyclic. I struggled with this algorithm, until I realized that the only width we care about is the width if no groups are ever broken.

Consider the following logic: if a group needs to break, all of its parents must obviously break, because the group will now contain a newline, so its parents must break no matter what. Therefore, we only consider the width of a node when deciding if a group must break intrinsically, i.e., because all of its children decided not to break. This can happen for a document like the following, where each inner node is quite large, but not large enough to hit the limit.

[
  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
  [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
]JSON

Because we prefer to break outer groups rather than inner groups, we can measure the “widest a single line could be” in one pass, bottom-up: each node’s width is the sum of the width of its children, or its literal contents for <text> elements. However, we must exclude all text nodes that are if=broken, because they obviously do not contribute to the single-line length. We can also ignore indentation because indentation never happens in a single line.

However, this doesn’t give the full answer for whether a given group should break, because that depends on indentation and what nodes came before on the same line.

This means we need to perform a second pass: having laid everything out assuming no group is broken, we must lay things out as they would appear when we render them, taking into account breaking. But now that we know the maximum width of each group if left unbroken, we can make breaking decisions.

As we walk the DOM, we keep track of the current column and indentation value. For each group, we decide to break it if either:

Its width, plus the current column value, exceeds the maximum column width.
It contains any newlines, something that can be determined in the first pass.

The first case is why we can’t actually treat tabs as if they advance to a tabstop. We cannot know the column at which a node will be placed at the time that we measure its width, so we need to assume the worst case.

Whenever we hit a newline, we update the current width to the width induced by indentation, simulating a newline plus indent. We also need to evaluate the condition, if present, on each tag now, since by the time we inspect a non-group tag, we have already made a decision as to whether to break or not.

Render It!

Now that everything is determined, rendering is super easy: just walk the DOM and print out all the text nodes that either have no condition or whose condition matches the innermost group they’re inside of.

And, of course, this is where we need to be careful with indentation: you don’t want to have lines that end in whitespace, so you should make sure to not print out any spaces until text is written after a newline. This is also a good opportunity to merge adjacent only-newlines text blocks. The merge algorithm I like is to make sure that when n and m newline blocks are adjacent, print max(n, m) newlines. This ensures that a DOM node containing \n\n\n is respected, while deleting a bunch of \ns in a row that would result in many blank lines.

What’s awesome about this approach is that the layout algorithm is highly generic: you can re-use it for whatever compiler frontend you like, without needing to fuss with layout yourself. There is a very direct conversion from AST to DOM, and the result is very declarative.

More Complicated: YAML

YAML is a superset of JSON that SREs use to write sentient configuration files. It has a funny list syntax that we might want to use for multi-line lists, but we might want to keep JSON-style lists for short ones.

A document of nested lists might look something like this:

- [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
- [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]YAML

How might we represent this in the DOM? Starting from our original JSON document {"foo": [1, 2]}, we might go for something like this:

<group>
  <text s="{" if=flat />
  <indent by="  ">
    <text s='"foo"' />
    <text s=":" />
    <text s=" " />
    <group>
      <text s="[" if=flat />
      <text s="\n" if=broken />
      <text s="- " if=broken />
      <indent by="  ">
        <text s="1" />
      </indent>
      <text s="," if=flat />
      <text s=" " if=flat />
      <text s="\n" if=broken />
      <text s="- " if=broken />
      <indent by="  ">
        <text s="2" />
      </indent>
      <text s="\n" if=broken />
      <text s="]" if=flat />
    </group>
  </indent>
  <text s="\n" if=broken />
  <text s="}" if=flat />
</group>XML

Here, we’ve made the [] and the comma only appear in flat mode, while in broken mode, we have a - prefix for each item. The inserted newlines have also changed somewhat, and the indentation blocks have moved: now only the value is indented, since YAML allows the -s of list items to be at the same indentation level as the parent value for lists nested in objects. (This is a case where some layout logic is language-specific, but now the output is worrying about declarative markup rather than physical measurements.)

There are other enhancements you might want to make to the DOM I don’t describe here. For example, comments want to be word-wrapped, but you might not know what the width is until layout happens. Having a separate tag for word-wrapped blocks would help here.

Similarly, a mechanism for “partial breaks”, such as for the document below, could be implemented by having a type of line break tag that breaks if the text that follows overflows the column, which can be easily implemented by tracking the position of the last such break tag.

{
  "foo": ["very", "long", "list",
          "of", "strings"]
}JSON

Using This Yourself

I think that a really good formatter is essential for any programming language, and I think that a high-quality library that does most of the heavy-lifting is important to make it easier to demand good formatters.

So I wrote a Rust library. I haven’t released it on crates.io because I don’t think it’s quite at the state I want, but it turns out that the layout algorithm is very simple, so porting this to other languages should be EZ.

Now you have no excuse. :D

Everyone pronounces this acronym “ay ess tee”, but I have a friend who really like to say ast, rhyming with mast, so I’m making a callout post my twitter dot com. ↩
In computing, a group of lines not separated by blank lines is called a stanza, in analogy to the stanzas of a poem, which are typeset with no blank lines between the lines of the stanza. ↩

You could also just store a string, containing the original text, but storing offsets is necessary for diagnostics, which is the jargon term for a compiler error. Compiler errors are recorded using an AST node as context, and to report the line at which the error occurred, we need to be able to map the node back to its offset in the file.

Once we have the offset, we can calculate the line in $O(\log n)$ time using binary search. Having pre-computed an array of the offset of each \n byte in the input file, binary search will tell us the index and offset of the \n before the token; this index is the zero-indexed line number, and the string from that \n to the offset can be used to calculate the column.

use unicode_width::UnicodeWidthStr;

/// Returns the index of each newline. Can be pre-computed and re-used
/// multiple times.
fn newlines(file: &str) -> Vec<usize> {
  file.bytes()
      .enumerate()
      .filter_map(|(i, b)| (b == b'\n').then_some(i+1))
}

/// Returns the line and column of the given offset, given the line
/// tarts of the file.
fn location(
  file: &str,
  newlines: &[usize],
  offset: usize,
) -> (usize, usize) {
  match newlines.binary_search(offset) {
    // Ok means that offset refers to a newline, so this means
    // we want to return the width of the line that it ends as
    // the column.
    //
    // Err means that this is after the nth newline, except Err(0),
    // which means it is before the first one.
    Ok(0) | Err(0) => (1, file[..offset].width()),
    Ok(n) => (n+1, file[newlines[n-1]..offset].width()),
    Err(n) => (n+2, file[newlines[n]..offset].width()),
  }
}Rust

↩

The Rust people keep trying to convince me that it should be 100. They are wrong. 80 is perfect. They only think they need 100 because they use the incorrect tab width of four spaces, rather than two. This is the default for clang-format and it’s perfect. ↩

2024-12-16 • 3570 words • 29 minutes
• #dark-arts • #go

Go's Weird Little Iterators

A second post on Go silliness (Miguel, aren’t you a C++ programmer?): in 1.23, Go finally added custom iterators. Now, back when I was at Google and involved in the Go compiler as “the annoying Rust guy who gets lunch with us”, there were proposals suggesting adding something like this, implemented as either an interface or a func:

type Iter[T any] = func() (T, bool)Go

This is not what Go did. No, Go did something really weird. And the implementation is incredible.

What’s an Iterator?

An iterator, in the context of programming language design, is a special type of value that can be used to walk through a sequence of values, without necessarily materializing the sequence as whatever the language’s array type is.

But, a proper iterator must fit with the language’s looping construct. An iterable type is one which can be used in a for-each loop, such as C++’s for (T x : y) or Python’s for x in y (modern languages usually only have a for-each loop as their only for loop, because C-style for loops are not in anymore).

C++ Iterator Pairs

Every language defines a desugaring that defines how custom iteration works in term of the more primitive loops. For example, in C++, when we write for (T x : y) { ... } (called a range-based for loop, added in C++11), desugars as follows¹:

auto&& __range = y;
auto __begin = begin(__range); // ADL
auto __end = end(__range);     // ADL
for (; __begin != __end; ++__begin) {
  T x = *__begin;
  // ...
}C++

break, continue, and return inside of the loop body require no special handling: they Just Work, because this is just a plain ol for loop.

This begin and end weirdness is because, if the iterator backs an actual array, begin and end can just be pointers to the first element and one-past-the-end and this will Just Work. Before C++11, the convention for C++ iterators was to construct types that imitated pointers; you would usually write loops over non-array types like this:

for (auto it = things.begin(); it != things.end(); ++it) {
  whatever(*it);
}C++

C++ simply codified common (if gross) practice. It is very tedious to implement C++ iterators, though. You need to provide a dummy end iterator, you need to provide some kind of comparison operator, and iterators that don’t return a reference out of operator*() are… weird.

Begin and end can be different types (which is how C++20 ranges pretend to be iterable), but being able to query done-ness separately from the next value makes implementation annoying: it means that an iterator that has not begun iteration (i.e., ++ has not been executed yet, because it occurs in the loop’s latch, not its header²) needs to do extra work to answer != end, which usually means an extra bool to keep track of whether iteration has started or not.

Here’s what writing an iterator (that is also an iterable usable in a range for-loop) over the non-zero elements of a std::span<const int> might look like.

struct NonZero {
  std::span<const int> ints;

  auto begin() { return *this; }
  auto end() { return sentinel{}; }

  bool operator==(sentinel) {
    while (!ints.empty()) {
      if (ints[0] == 0) ints = ints.subspan(1);
    }
    return ints.empty();
  }
  bool operator!=(sentinel s) { return !(*this == s); }

  NonZero& operator++() {
    ints = ints.subspan(1);
    return *this;
  }
  NonZero operator++(int) {
    auto prev = *this;
    ++*this;
    return prev;
  }

  const int& operator*() const { return ints[0]; }

 private:
  struct sentinel{};
};C++

In this case, operator== is not const, which is a bit naughty. Purists might argue that this type should have a constructor, which adjusts ints to point to the first non-zero element on construction, and operator++ to perform the mutation. That would look like this:

class NonZero {
 public:
  NonZero(std::span<const int> ints) : ints_(ints) {
    skip_zeros();
  }

  auto begin() { return *this; }
  auto end() { return sentinel{}; }

  bool operator==(sentinel) const { return ints.empty(); }
  bool operator!=(sentinel s) const { return !(*this == s); }

  NonZero& operator++() {
    skip_zeros();
    ints = ints.subspan(1);
    return *this;
  }
  NonZero operator++(int) {
    auto prev = *this;
    ++*this;
    return prev;
  }

  const int& operator*() const { return ints[0]; }

 private:
  struct sentinel{};
  void skip_zeros() {
    while (!ints.empty() && ints[0] == 0) {
      ints = ints.subspan(1);
    }
  }
  std::span<const int> ints_;
};C++

std::sentinel_for (C++’s iterator concepts are terribly named) really wants operator== to be const, but I could have also just marked ints as mutable to avoid that. It it’s not already clear, I really dislike this pattern. See here for some faffing about with C++ iterators on my part.

Java Also Got This Wrong

At least Java provides a standard iterable interface, thankfully.

package java.util;

public interface Iterable<E> {
  Iterator<E> iterator();
}

public interface Iterator<E> {
  boolean hasNext();
  E next();
}Java

The desugaring of for (T x : y) { ... } is then:

for (var $iter = x.iterator(); $iter.hasNext();) {
  T x = $iter.next();
}Java

Do you see the problem here? Although Java now provides a standard interface, doesn’t require annoying equality comparisons, and doesn’t require an end value, these things are still a pain to implement! You still need to be able to query if you’re done before you’ve had a chance to step through the iterator.

Like before, suppose we have an int[], and we want to yield every non-zero value in it. How do we construct an iterator for that?

int[] xs = null;
var it = new Iterator<Integer>() {
  int[] array = xs;
  int idx;

  public boolean hasNext() {
    for (; !done() && array[idx] == 0; idx++) {}
    return !done();
  }

  public Integer next() {
    if (!hasNext()) throw new NoSuchElementException();
    return array[idx++];
  }

  private boolean done() {
    return array == null || idx == array.length;
  }
};Java

What a pain. Java’s anonymous classes being wordy aside, it’s annoying and error-prone to do this: it’s tempting to accidentally implement hasNext by simply checking if the array is empty. (Aside, I hate that xs.length throws on null arrays. Just return zero like in Go, c’mon).

Also, it’s no a single-abstract-method interface, so I can’t use a lambda to create an iterator.

At least break, continue, and return Just Work, because the underlying operation is a for loop like before.

Rust Does It Better

Rust also has a standard iterable interface.

// mod core::iter

pub trait IntoIterator {
  type Item;
  type Iter: Iterator<Item=Self::Item>;

  fn into_iter() -> Self::Iter;
}

pub trait Iterator {
  type Item;
  fn next() -> Option<Self::Item>;
}Rust

The desugaring for for x in y { ... } is reasonably straightforward, like in Java:

let mut __it = IntoIterator::into_iter(y);
while let Some(x) = __it.next() {
  // ...
}Rust

This is so straightforward that it’s not so unusual to write it yourself, when you don’t plan on consuming the entire iterator. Alternatively, you can partially iterate over an iterator by taking a mutable reference to it. This is useful for iterators that can yield their remainder.

let mut it = my_slice.chunks_exact(8);
for chunk in &mut it {
  do_thing(chunk);
}
do_thing(it.remainder());Rust

break, continue, and return work in the obvious way.

The interface solves the problems C++ and Java had very cleanly: next both computes the next item and whether the iterator has more elements. Rust even allows iterators to resume yielding Some after yielding None, although few algorithms will make use of this.

Implementing the non-zero iterator we’ve been writing so far is quite simple:

let mut ints: &[i32] = ...;
let it = std::iter::from_fn(move || {
  while ints.get(0) == Some(0) {
    ints = &ints[1..];
  }
  let item = ints.get(0);
  ints = &ints[1..];
  item
});Rust

However, this can be written far more simply³ using iterator combinators:

let ints = &[i32] = ...;
let it = ints.iter().copied().filter(|x| x != 0);Rust

It requires a little bit of effort to implement some iterators, but most of the common cases are easy to put together with composition.

Python iterators are basically the same thing, but there’s no interface to implement (because Python doesn’t believe in type safety). Lua iterators are similar. The Rust pattern of a function that returns the next item (or a special end-of-sequence value) is relatively popular because of this simplicity and composability, and because they can model a lot of iteration strategies.

So, What Did Go Do?

Well. Go has a range for syntax like many other languages. The syntax looks like this:

for x := range y {
  // ...
}Go

The x can be a list of places, and the := can be plain assignment, =. You can also write for range y { ... } if the iteration values aren’t needed.

The behavior of this construct, like many others in Go, depends explicitly on the type after range. Each range iteration can yield zero or more values; the

These are:

For []T, [n]T, and *[n]T, each step yields an index of the slice and the value at that offset, in order.
For map[K]V, each step yields a key and a value, in a random order.

For <- chan T, it desugars into

for {
 x, ok := <-y
 if !ok { break }
 // ...
}Go

Starting in Go 1.22, ranging on an integer type would desugar into
```
for x := 0; x < y; i++ {
 // ...
}
```
Go

All of these desugars are essentially still just loops, so break, continue, goto, and return all work as expected.

But, how do custom types, like weird map types, implement iteration? The usual⁴ implementation is sync.Map.Range, which looks like this:

func (*Map) Range(yield func(key, value any) bool)Go

This function will call yield for each element in the map. If the function returns false, iteration will stop. This pattern is not uncommon, but sometimes libraries omit the bool return (like container/ring.Ring.Do). Some, like filepath.WalkDir, have a more complex interface involving errors.

This is the template for what became rangefuncs, a mechanism for using the for-range syntax with certain function values.

Rangefuncs

The word “rangefunc” does not appear in Go’s specification. It is a term used to refer to them in some documentation, within the compiler, and in the runtime.

A rangefunc is any function with one of the following signatures:

func(yield func() bool)
func(yield func(V) bool)
func(yield func(K, V) bool)

They work like sync.Map.Range does: the function calls yield (hereafter simply called “the yield”) for each element, and stops early if yield returns false. The iter package contains types for the second and third of these:

package iter

type Seq[V any] func(yield func(V) bool)
type Seq2[K, V any] func(yield func(K, V) bool)Go

For example, the slices package provides an adaptor for converting a slice into an iterator that ranges over it.

package slices

// All returns an iterator over index-value pairs in the slice
// in the usual order.
func All[Slice ~[]E, E any](s Slice) iter.Seq2[int, E] {
	return func(yield func(int, E) bool) {
		for i, v := range s {
			if !yield(i, v) {
				return
			}
		}
	}
}Go

So. These things are actually pretty nuts. They break my brain somewhat, because this is the opposite of how iterators usually work. Go calls what I’ve described all the other languages do a “pull iterator”, whereas rangefuncs are “push iterators”.

They have a few obvious limitations. For one, you can’t do smart sizing like with Rust or C++ iterators⁵. Another is that you can’t easily “pause” iteration.

But they do have one advantage, which I think is the real reason Go went to so much trouble to implement them (and yes, I will dig into how insane that part is). Using push iterators by default means that users “only” need to write an ordinary for loop packaged into a function. Given that Go makes major performance sacrifices in order to be easy to learn⁶, trying to make it so that an iterator packages the actual looping construct it represents makes quite a bit of sense.

Rangefuncs are actually really cool in some respects, because they enable unusual patterns. For example, you can use a rangefunc to provide RAII blocks.

func Open(fs fs.FS, path string) iter.Seq2[fs.File, error] {
  return func(yield func(fs.File, error) bool) {
    f, err := fs.Open(path)
    if err {
      yield(nil, err)
      return
    }

    defer f.Close()
    yield(f, nil)
  }
}

for f, err := range Open(os.DirFS("/"), "etc/passwd") {
  if err != nil {
    return nil
  }

  // ...
}Go

Being a block that you can put an epilog onto after yielding a single element is quite powerful! You can also use a nilary rangefunc to simply create a block that you can break out of, instead of having to use goto.

func Once() func(func() bool) {
  return func(y func() bool) { y() }
}

for range Once() {
  if canDo() {
    break
  }

  do()
}Go

So wait. You can return out of rangefunc loops. That means that… Go has non-local returns?!

Go Now Has Non-Local Returns

The desugaring for rangefuncs is very complicated. This is because break, continue, goto, and return all work in a rangefunc! How does this work? Let’s Godbolt it.

Let’s start with something really basic: a loop body that just calls a function.

package x

import "iter"

func run(s iter.Seq[int]) {
  for x := range s {
    sink(x)
  }
}

func sink(int)Go

This produces the following assembly output (which I’ve reformatted into Intel syntax, and removed some extraneous ABI things, including a writer barrier where (*) is below).

x.run:
    push    rbp
    mov     rbp, rsp
    add     rsp, -24
    mov     [rsp + 40], rax
    lea     rax, [type:int]
    call    runtime.newobject
    mov     [rsp + 16], rax
    mov     [rax], internal/abi.RF_READY
    lea     rax, ["type:noalg.struct { F uintptr; X0 *int }"]
    call    runtime.newobject
    lea     rcx, x.run-range1
    mov     [rax], rcx  // (*)
    mov     rcx, [rsp + 16]
    mov     [rax + 8], rcx
    mov     rdx, [rsp + 40]
    mov     rbx, [rdx]
    call    rbx
    mov     rcx, [rsp + 16]
    cmp     [rcx], internal/abi.RF_PANIC
    jeq     panic
    mov     [rcx], internal/abi.RF_EXHAUSTED
    add     rsp, 24
    pop     rbp
    ret
panic:
    mov     rax, internal/abi.RF_MISSING_PANIC
    call    runtime.panicrangestate

x.run-range1:
    push    rbp
    mov     rbp, rsp
    add     rsp, -24
    mov     [rsp + 8], rdx
    mov     rcx, [rdx + 8]
    mov     rdx, [rcx]
    cmp     qword ptr [rdx], internal/abi.RF_READY
    jne     panic2
    mov     [rsp + 16], rcx
    mov     qword ptr [rcx], internal/api.RF_PANIC
    call    x.sink
    mov     rcx, [rsp + 16]
    mov     qword ptr [rcx], internal/abi.RF_READY
    mov     rax, 1
    add     rsp, 24
    pop     rpb
    ret
panic2:
    mov     rax, rdx
    call    runtime.panicrangestatex86 Assembly

This is a lot to take in, but if we look carefully, we decompile this function into a Go function:

import (
  "internal/abi"
  "runtime"
)

func run(s iter.Seq[int]) {
  __state := abi.RF_PANIC
  s(func(v int) bool {
    if __state != abi.RF_READY {
      runtime.panicrangestate(*state)
    }
    __state = abi.RF_PANIC
    sink(v)  // Loop body
    __state = abi.RF_READY
    return true
  })
  __state = abi.RF_EXHAUSTED
}Go

Go will actually enforce invariants on the yield it synthesizes in a range for, in order to catch buggy code. In particular, __state escapes because s is an arbitrary function, so it gets spilled to the heap.

So, what happens when the loop body contains a break? Consider:

package x

import "iter"

func run(s iter.Seq[int]) {
  for x := range s {
    if sink(x) { break }
  }
}

func sink(int)Go

I’ll spare you the assembly listing, since it’s very similar, so I’ll just reverse-engineer the output directly:

import (
  "internal/abi"
  "runtime"
)

func run(s iter.Seq[int]) {
  __state := abi.RF_PANIC
  s(func(v int) bool {
    if __state != abi.RF_READY {
      runtime.panicrangestate(*state)
    }
    __state = abi.RF_PANIC
    if sink(v) {
      __state = abi.RF_DONE
      return false
    }
    __state = abi.RF_READY
    return true
  })
  __state = abi.RF_EXHAUSTED
}Go

Non-local returns are much more complicated. Consider:

package x

import "iter"

func run(s iter.Seq[int]) int {
  for x := range s {
    if sink(x) { return x }
  }
  return -1
}

func sink(int)Go

The resulting assembly is something like this, with some irrelevant code, such as write barriers, removed:

x.run:
    push    rbp
    mov     rbp, rsp
    add     rsp, -40
    mov     [rsp + 56], rax
    lea     rax, [type:int]
    call    runtime.newobject
    mov     [rsp + 24], rax
    lea     rax, [type:int]
    call    runtime.newobject
    mov     [rsp + 32], rax
    lea     rax, [type:int]
    call    runtime.newobject
    mov     [rsp + 16], rax
    mov     [rax], internal/abi.RF_READY
    lea     rax, ["type:noalg.struct { F uintptr; X0 *int; X1 *int; X2 *int }"]
    call    runtime.newobject
    lea     rcx, [x.run-range1]
    mov     [rax], rcx
    mov     rcx, [rsp + 16]
    mov     rbx, [rsp + 24]
    mov     rsi, [rsp + 32]
    mov     [rax + 8], rcx
    mov     [rax + 16], rbx
    mov     [rax + 24], rsi
    mov     rdx, [rsp + 56]
    mov     rdi, [rdx]
    call    rdi
    mov     rcx, [rsp + 16]
    cmp     qword ptr [rcx], internal/abi.RF_PANIC
    jeq     panic
    mov     [rcx], internal/abi.RF_EXHAUSTED
    mov     rcx, [rsp + 32]
    cmp     qword ptr [rcx], -1
    jne     resume
    mov     rcx, [rsp + 32]
    mov     rax, [rcx]
    add     rsp, 40
    pop     rbp
    ret
resume:
    mov     rcx, [rsp + 32]
    mov     qword ptr [rcx], -1
    mov     rax, -1
    add     rsp, 40
    pop     rbp
    ret
panic:
    mov     rax, internal/abi.RF_MISSING_PANIC
    call    runtime.panicrangestate

x.run-range1
    push    rbp
    mov     rbp, rsp
    add     rsp, -40
    mov     [rsp + 8], rdx
    mov     rcx, [rdx + 8]
    mov     rbx, [rcx]
    mov     rsi, [rdx + 16]
    mov     rdx, [rdx + 24]
    cmp     rbx, internal/abi.RF_READY
    jne     panic2
    mov     [rsp + 56], rax
    mov     [rsp + 16], rcx
    mov     [rsp + 24], rsi
    mov     [rsp + 32], rdx
    mov     qword ptr [rcx], internal/abi.RF_PANIC
    call    x.sink
    test    al, al
    jeq     cont
    mov     rcx, [rsp + 56]
    mov     rdx, [rsp + 24]
    mov     [rdx], rcx
    mov     rcx, [rsp + 32]
    mov     qword ptr [rcx], -1
    mov     rcx, [rsp + 16]
    mov     qword ptr [rcx], internal/abi.RF_DONE
    xor     eax, eax
    add     rsp, 40
    pop     rbp
    ret
cont:
    mov     rcx, [rsp + 16]
    mov     qword ptr [rcx], internal/abi.RF_READY
    mov     rax, 1
    pop     rpb
    ret
panic:
    mov     rax, rbx
    call    runtime.panicrangestatex86 Assembly

Try to reverse engineer this yourself, if you like! If you write this out as Go, here’s what you get:

import (
  "internal/abi"
  "runtime"
)

func run(s iter.Seq[int]) (__ret int) {
  var __next int
  __state := abi.RF_PANIC
  s(func(v int) bool {
    if __state != abi.RF_READY {
      runtime.panicrangestate(*state)
    }
    __state = abi.RF_PANIC
    if sink(v) {
      __state = abi.RF_DONE
      __next = -1
      __ret = v
      return false
    }
    __state = abi.RF_READY
    return true
  })
  __state = abi.RF_EXHAUSTED
  if __next == -1 {
    return
  }

  return -1
}Go

The reason __next is an int is because it is also used when exiting the loop via goto or a break/continue with label. It specifies where to jump to after the call into the rangefunc returns. Each potential control flow out of the loop is assigned some negative number.

The precise details of the lowering have been exquisitely documented by Russ Cox and David Chase, the primary implementers of the feature.

You might be curious what runtime.panicrangestate does. It’s pretty simple, and it lives in runtime/panic.go:

package runtime

//go:noinline
func panicrangestate(state int) {
	switch abi.RF_State(state) {
	case abi.RF_DONE:
		panic(rangeDoneError)
	case abi.RF_PANIC:
		panic(rangePanicError)
	case abi.RF_EXHAUSTED:
		panic(rangeExhaustedError)
	case abi.RF_MISSING_PANIC:
		panic(rangeMissingPanicError)
	}
	throw("unexpected state passed to panicrangestate")
}Go

If you visit this function in runtime/panic.go, you will be greeted by this extremely terrifying comment from Russ Cox immediately after it.

// deferrangefunc is called by functions that are about to
// execute a range-over-function loop in which the loop body
// may execute a defer statement. That defer needs to add to
// the chain for the current function, not the func literal synthesized
// to represent the loop body. To do that, the original function
// calls deferrangefunc to obtain an opaque token representing
// the current frame, and then the loop body uses deferprocat
// instead of deferproc to add to that frame's defer lists.
//
// The token is an 'any' with underlying type *atomic.Pointer[_defer].
// It is the atomically-updated head of a linked list of _defer structs
// representing deferred calls. At the same time, we create a _defer
// struct on the main g._defer list with d.head set to this head pointer.
//
// The g._defer list is now a linked list of deferred calls,
// but an atomic list hanging off:
//
// (increasingly terrifying discussion of concurrent data structures)Go

This raises one more thing that works in range funcs, seamlessly: defer. Yes, despite the yield executing multiple call stacks away, possibly on a different goroutine… defer still gets attached to the calling function.

Go Now Has Non-Local Defer

The way defer works is that each G (the goroutine struct, runtime.g) holds a linked list of defer records, of type _defer. Each call to defer sticks one of these onto this list. On function return, Go calls runtime.deferreturn(), which essentially executes and pops defers off of the list until it finds one whose stack pointer is not the current function’s stack pointer (so, it must belong to another function).

Rangefuncs throw a wrench in that mix: if myFunc.range-n defers, that defer has to be attached to myFunc’s defer records somehow. So the list must have a way of inserting in the middle.

This is what this comment is about: when defer occurs in the loop body, that defer gets attached to a defer record for that function, using a token that the yield captures; this is later canonicalized when walking the defer list on the way out of myFunc. Because the yield can escape onto another goroutine, this part of the defer chain has to be atomic.

Incredibly, this approach is extremely robust. For example, if we spawn the yield as a goroutine, and carefully synchronize between that and the outer function, we can force the runtime to hard-crash when defering to a function that has returned.

package main

import (
	"fmt"
	"sync"
)

func bad() (out func()) {
	var w1, w2 sync.WaitGroup
	w1.Add(1)
	w2.Add(1)

	out = w2.Done
	defer func() { recover() }()
	iter := func(yield func() bool) {
		go yield()
		w1.Wait() // Wait to enter yield().
    // This panics once w1.Done() executes, because
    // we exit the rangefunc while yield() is still
    // running. The runtime incorrectly attributes
    // this to recovering in the rangefunc.
	}

	for range iter {
		w1.Done() // Allow the outer function to exit the loop.
		w2.Wait() // Wait for bad() to return.
		defer fmt.Println("bang")
	}

  return nil // Unreachable
}

func main() {
	resume := bad()
  resume()
  select {}  // Block til crash.
}Go

This gets us fatal error: defer after range func returned. Pretty sick! It accomplishes this by poisoning the token the yield func uses to defer.

I have tried various other attempts at causing memory unsafety with rangefuncs, but Go actually does a really good job of avoiding this. The only thing I’ve managed to do that’s especially interesting is to tear the return slot on a function without named returns, but that’s no worse than tearing any other value (which is still really bad, because you can tear interface values, but it’s not worse).

Pull Iterators and Coroutines

Of course we’re not done. Go provides a mechanism for converting push iterators into pull iterators. Essentially, there is a function that looks like this:

package iter

func Pull[V any](seq Seq[V]) (next func() (V, bool), stop func()) {
  yield := make(chan struct{value V; ok bool})
  pull := make(chan struct{}{})
  go func() {
    seq(func(v V) bool {
      _, ok := <-pull
      if !ok {
        return false
      }
      yield <- struct{value V; ok bool}{v, true}
    })

    close(yield)
  }()

  next = func() (V, bool) {
    pull <- struct{}{}
    return <-yield
  }
  stop = func() { close(pull) }
  return
}Go

Essentially, you can request values with next(), and stop() can be used if you finish early. But also, this spawns a whole goroutine and uses channels to communicate and synchronize, which feels very unnecessary.

The implementation doesn’t use goroutines. It uses coroutines.

Giving Up on Goroutines

Spawning a goroutine is expensive. Doing so expends scheduler and memory resources. It’s overkill for a helper like this (ironic, because the original premise of Go was that goroutines would be cheap enough to allocate willy-nilly).

Go instead implements this using “coroutines”, a mechanism for concurrency without parallelism. This is intended to make context switching very cheap, because it does not need to go through the scheduler: instead, it uses cooperative multitasking.

The coroutine interface is something like the following. My “userland” implementation will not be very efficient, because it relies on the scheduler to transfer control. The goroutines may run on different CPUs, so synchronization is necessary for communication, even if they are not running concurrently.

package coro

import (
  "runtime"
  "sync"
)

type Coro struct {
  m sync.Mutex
}

func New(f func()) *Coro {
  c := new(Coro)
  c.m.Lock()
  go func() {
    c.m.Lock()
    f()
    c.m.Unlock()
  }
  return c
}

func (c *Coro) Resume() {
  c.m.Unlock()
  c.m.Lock()
}Go

When we create a coroutine with coro.New(), it spawns a goroutine that waits on a mutex. Another goroutine can “take its place” as the mutex holder by calling c.Resume(), which allows the coroutine spawned by coro.New to resume and enter f().

Using the coroutine as a rendezvous point, two goroutines can perform concurrent work: in the case of iter.Pull, one can be deep inside of whatever loops the iterator wants to do, and the other can request values.

Here’s what using my coro.Coro to implement iter.Pull might look like:

package iter

func Pull[V any](seq Seq[V]) (next func() (V, bool), stop func()) {
  var (
    done bool
    v, z V
  )

  c := coro.New(func() {
    s(func(v1 V) bool {
      c.Resume()  // Wait for a request for a value.
      if done {
        // This means we resumed from stop(). Break out of the
        // loop.
        return false
      }
      v = v1
    })
    if !done {
      // Yield the last value.
      c.Resume()
    }

    v = z
    done = true
  })

  next = func() (V, bool) {
    if done { return z, false }

    c.Resume()      // Request a value.
    return v, true  // Return it.
  }

  stop = func() {
    if done { return }

    done = true // Mark iteration as complete.
    c.Resume()  // Resume the iteration goroutine to it can exit.
  }

  return next, stop
}Go

If you look at the implementation in iter.go, it’s basically this, but with a lot of error checking and race detection, to prevent misuse, such as if next or stop escape to other goroutines.

Now, the main thing that runtime support brings here is that Resume() is immediate: it does not go to the scheduler, which might not decide to immediately run the goroutine that last called Resume() for a variety of reasons (for example, to ensure wakeup fairness). Coroutines sidestep fairness, by making Resume() little more than a jump to the last Resume() (with registers fixed up accordingly).

This is not going to be that cheap: a goroutine still needs to be allocated, and switching needs to poke and prod the underlying Gs a little bit. But it’s a cool optimization, and I hope coroutines eventually make their way into more things in Go, hopefully as a language or sync primitive.

Conclusion

Congratulations, you have survived over 3000 words of me going on about iterators. Go’s push iterators are a unique approach to a common language design problem (even if it took a decade for them to materialize).

I encountered rangefuncs for the first time earlier this year and have found them absolutely fascinating, both from a “oh my god they actually did that” perspective and from a “how do we express iteration” perspective. I don’t think the result was perfect by any means, and it is unsuitable for languages that need the performance you can only get from pull iterators. I think they would be a great match for a language like Python or Java, though.

I’d like to thank David Chase, an old colleague, for tolerating my excited contrived questions about the guts of this feature.

Ugh, ok. This is the C++20 desugaring, and there are cases where we do not just call std::begin(). In particular, array references and class type references with .begin() and .end() do not call std::begin() and are open-coded. This means that you can’t use ADL to override these types’ iterator. ↩
In compiler jargon, a loop is broken up into three parts: the header, which is where the loop is entered, the body, which is one step of iteration, and the latch, which is the part that jumps back to the start of the body. This is where incrementation in a C-style for loop happens. ↩
And with better performance. Rust’s iterators can provide a size hint to help size containers before a call to collect(), via the FromIterator trait. ↩
Some people observed that you can use a channel as a custom iterator, by having a parallel goroutine run a for loop to feed the channel. Do not do this. It is slow: it has to transit each element through the heap, forcing anything it points to escape. It takes up an extra M and a P in the scheduler, and requires potentially allocating a stack for a G. It’s probably faster to just build a slice and return that, especially for small iterations. ↩

For this reason, I wish that Go had instead defined something along these lines.

package iter

type Seq[V any] interface {
  Iterate(yield func(V) bool)
}Go

This is functionally identical to what they did, but it would have permitted future extensions such as the following interface:

package iter

type SizedSeq[V any] interface {
  Seq[V]

  SizeHint() (lower, upper int64)
}Go

This would mean that slices.Collect could be enhanced into something like this.

package slices

func Collect[E any](seq iter.Seq[E]) []E {
  var out []E
  if sized, ok := seq.(iter.SizedSeq[E]); ok {
    lower, _ := sized.SizeHint()
    out = make([]E, 0, lower)
  }

  for v := range seq {
    out = append(v)
  }
  return out
}Go

I don’t think there’s an easy way to patch this up, at this point. ↩

Disclaimer: I am not going to dig into Go’s rationale for rangefuncs. Knowing how the sausage is made, most big Go proposals are a mix of understandable reasoning and less reasonable veiled post-hoc justification to compensate for either Google planning/approvals weirdness or because the design was some principal engineer’s pony. This isn’t even a Go thing, it’s a Google culture problem. I say this as the architect of Protobuf Editions, the biggest change to Protobuf since Rob’s misguided proto3⁷ experiment. I have written this kind of language proposal, on purpose, because bad culture mandated it.

The purpose of a system is what it does. It is easier to understand a system by observing its response to stimuli, rather than what it says on the tin. So let’s use that lens.

Go wants to be easy to learn. It intended to replace C++ at Google (lol, lmao), which, of course, failed disastrously, because performance of the things already written in C++ is tied to revenue. They have successfully pivoted to being an easy-to-learn language that makes it easy to onboard programmers regardless of what they already use, as opposed to onboarding them to C++.

This does not mean that Go is user-friendly. In fact, user-friendliness is clearly not a core value. Rob and his greybeard crowd didn’t seem to care about the human aspect of interacting with a toolchain, so Go tooling rarely provides good diagnostics, nor did the language, until the last few years, try to reduce toil. After all, if it is tedious to use but simple, that does make it easy to onboard new programmers.

Rust is the opposite: it is very difficult to learn with a famously steep learning curve; however, it is very accessible, because the implementors have sanded down every corner and sharp edge using diagnostics, error messages, and tooling. C++ is neither of these things. It is very difficult to learn, and most compilers are pretty unhelpful (if they diagnose anything at all).

I think that Go has at least realized the language can be a pain to use in some situations, which is fueled in part by legitimate UX research. This is why Go has generics and other recent advanced language features, like being able to use the for syntax with integers or with custom iterators.

I think that rangefuncs are easy to learn in the way Go needs them to be. If you expect more users to want to write rangefuncs than users want to write complicated uses of rangefuncs, I think push iterators are the easiest to learn how to use.

I think this is a much more important reason for all the trouble that rangefuncs generate for the compiler and runtime than, say, compatibility with existing code; I have not seen many cases in the wild or in the standard library that conform to the rangefunc signatures. ↩
But please don’t use proto3. I’m telling you that as the guy who maintained the compiler. Just don’t. ↩

2024-12-12 • 2804 words • 23 minutes
• #dark-arts • #assembly • #go

Things You Never Wanted To Know About Go Interfaces

Lately I’ve been finding myself writing a bit of Go, and I’ve picked up various fun “layout secrets” that help inform how I write code to minimize hidden allocations, and generally be kind to the optimizer. This article is a series of notes on the topic.

This post is about Go implementation details, so they can probably break you at any time if you rely on it. On the other hand, Hyrum’s law is a bitch, so taking your chances may not be that bad. After all, they’re probably never going to be able to properly clean up the mess people made with //go:linkname with runtime symbols…

As with many of my other posts, I’ll assume a basic familiarity with being able to read assembly. I’m using x86 for this post, but it’s worth looking at my RISC-V post for a refresher.

GC Shapes

The most basic Go-specific concept when it comes to type layouts is the shape of a type. This is an implementation detail of Go’s garbage collector that leaks through the unsafe package.

Like in most native programming languages, every Go type has a size (the number of bytes that type takes up in memory) and an alignment (a power of two that every pointer to that type must be divisible by). Go, like most other languages, requires that size be divisible by the alignment: that is, the size is equal to the stride of an array of that type.

The size an alignment of a type can be queried by the intrinsics unsafe.Sizeof and unsafe.Alignof. These are very unwieldy in generic code, so I like to define a couple of helpers¹:

func Size[T any]() int {
  var v T
  return int(unsafe.Sizeof(v))
}

func Align[T any]() int {
  var v T
  return int(unsafe.Alignof(v))
}Go

Together, these two quantities are called the layout of a type (a term common to many native languages). However, the shape of a type also records what pieces thereof contain pointers. This is because memory visible to the GC (such as globals, heap memory, or stack roots) is typed, and the GC needs to know which parts of those types are pointers that it needs to trace through.

Because all pointers have the same size and alignment (4 or 8 bytes depending on the system) the pointer words of a type can be represented as a bitset, one bit for every 4 or 8 bytes in the type. This, in fact, is the representation used by the GC².

In particular, this means that whether a field is to be interpreted as an unsafe.Pointer or as a uintptr is a static property of the type. As we will see when we discuss interfaces, this restriction prevents a few layout optimizations.

Slices and Strings

Go is very public about the layout of slices and strings. A slice is

type slice[T] struct {
  data     *T
  len, cap int
}Go

len and cap are extracted by their eponymous builtins, and data can be obtained using unsafe.SliceData (or &s[0] if the slice is nonempty, but that costs a bounds-check).

A string has the same layout as a []byte, except for a capacity:

type string struct {
  data *byte
  len  int
}Go

Despite essentially being slices, Go treats strings subtly differently. Strings are comparable, so they can be used as map keys. They are also immutable, which enables a handful of optimizations. Immutability is also why they are comparable: Go made the mistake of not keeping const from C, but they really want map keys to be const.

There is nothing stopping us from aliasing strings to data pointed to by a slice: after all, strings.Builder does it to avoid a copy in String(). We can implement this easily enough with some unsafe:

func StringAlias(data []byte) string {
  return unsafe.String(unsafe.SliceData(data), len(data))
}Go

Doing this is perfectly safe, so long as data is not mutated while the returned string is accessible. This allows virtually any slice type to be used as a key in a map, with some caveats.

Types which contain alignment padding cannot be used, because Go does not promise that it zeros memory returned by new.
Types which contain pointers will cause those pointers to become unreachable if the only reference is the aliased string; this is because the pointed to data’s shape contains no pointer words.
Incomparable types and interfaces will be compared by address (that is, maps, channels and funcs).

Dynamic Arrays with Reflection

Now, this isn’t the only to accomplish this: you can create dynamically-sized array types using reflection, like so:

func Slice2Array[T any](s []T) any {
  if s == nil { return nil }

  var v T
  elem := reflect.TypeOf(v)
  array := reflect.ArrayOf(len(s), elem)

  // NOTE: NewAt will return a reflect.Value containing a
  // pointer, not an array!
  refl := reflect.NewAt(array, unsafe.SliceData(s))
  refl = refl.Elem() // Dereference to get a pointer-to-array.
  return refl.Interface()
}Go

This will return an any whose type is [len(s)]T. You can even type assert it for static array sizes. This any is suitable for placing into a map[any]T, just as if we had built it with e.g. any([...]byte("foo"))

However, and this is not at all obvious from the code here, calling refl.Interface() will perform a copy of the whole array. Interface() delegates through a few functions until it calls reflect.packEface().

The code this function (found here) is reproduced below:

package reflect

// packEface converts v to the empty interface.
func packEface(v Value) any {
	t := v.typ()
	var i any
	e := (*abi.EmptyInterface)(unsafe.Pointer(&i))
	// First, fill in the data portion of the interface.
	switch {
	case t.IfaceIndir():
		if v.flag&flagIndir == 0 {
			panic("bad indir")
		}
		// Value is indirect, and so is the interface we're making.
		ptr := v.ptr
		if v.flag&flagAddr != 0 {
			c := unsafe_New(t)
			typedmemmove(t, c, ptr)
			ptr = c
		}
		e.Data = ptr
	case v.flag&flagIndir != 0:
		// Value is indirect, but interface is direct. We need
		// to load the data at v.ptr into the interface data word.
		e.Data = *(*unsafe.Pointer)(v.ptr)
	default:
		// Value is direct, and so is the interface.
		e.Data = v.ptr
	}
	// Now, fill in the type portion. We're very careful here not
	// to have any operation between the e.word and e.typ assignments
	// that would let the garbage collector observe the partially-built
	// interface value.
	e.Type = t
	return i
}Go

The switch determines precisely how the interface data pointer is computed. It turns out that (almost all) array types return true for t.IfaceIndr(), so the first case is selected, which triggers a copy (that being the call to unsafe_New() followed by a typedmemmove). This copy is to ensure that the value of the resulting interface cannot be mutated.

Now, if only we knew the layout of Go’s interfaces, we might be able to get somewhere here…

The Layout of Go’s Interfaces

Oh, yes, that’s what this article is about. So, if we look at the runtime2.go file in the runtime (yes, that’s what it’s called), nestled among the giant scheduler types for Gs, Ps, and Ms, we’ll find a couple of structs that really elucidate what’s going on:

package runtime

type funcval struct {
	fn uintptr
	// variable-size, fn-specific data here
}

type iface struct {
	tab  *itab
	data unsafe.Pointer
}

type eface struct {
	_type *_type
	data  unsafe.Pointer
}Go

funcval is the layout of a func(), more on that later. iface is the layout of your “usual” interface, consisting of an itab (an interface table, or what Go calls a vtable) and a pointer to some data. eface is the layout of any (the artist formerly known as interface{}, hence the name: empty interface).

eface having its own layout is an optimization. Because any exists to be downcast from dynamically, storing the type directly cuts out a pointer load when doing a type switch on an any specifically. If we look at what an itab is (which is “just” an abi.ITab):

package abi

// The first word of every non-empty interface type contains an *ITab.
// It records the underlying concrete type (Type), the interface type
// it is implementing (Inter), and some ancillary information.
//
// allocated in non-garbage-collected memory
type ITab struct {
	Inter *InterfaceType
	Type  *Type
	Hash  uint32     // copy of Type.Hash. Used for type switches.
	Fun   [1]uintptr // fun[0]==0 means Type does not implement Inter.
}Go

Codegen for Interface Operations

An ITab contains the same type it would have as an any, which makes the generated code for a function that upcasts an interface to any very simple³:

package foo

func Upcast(i MyIface) any {
  return i
}Go

foo.F:
    test    rax, rax
    jeq     nil
    mov     rax, [rax + 8]
nil:
    retx86 Assembly

In the register ABI, the x86 argument (and return) registers are rax, rbx, rcx, rdi, rsi, r8, r9, r10 and r11 (with rdx reserved for passing a closure capture, more on that later; r14 holds a pointer to the currently running G).

The *ITab comes in on rax and the data pointer on rbx. First, we need to check if this is the nil interface, identified by having a nil itab (or type, in the case of any). If it is nil, we just return: rax:rbx already contain the data of a nil any. Otherwise, we load ITab.Type, at offset 8, into rax, and return.

How do interface function calls work?

package foo

type MyIface interface {
  Method(int) int
}

func Call(m MyIface) int {
  return m.Method(42)
}Go

foo.Call:
    cmp     rsp, [r14 + 16]
    jls     grow
    push    rbp
    mov     rsp, rbp
    add     rsp, -16
    mov     [rsp], rax
    mov     [rsp + 8], rbx
    mov     rcx, [rax + 24]
    mov     rax, rbx
    mov     rbx, 42
    call    rcx
    add     rsp, 16
    pop     rbp
    ret
grow:
    nop
    mov     [rsp + 8], rax
    mov     [rsp + 16], rbx
    call    runtime.morestack_noctxt
    mov     rax, [rsp + 8]
    mov     rbx, [rsp + 16]
    jmp     foo.Callx86 Assembly

This function seems to be doing a lot more than it actually is. Part of it is that its prologue has to do a call to runtime.morestack_noctxt(), which is simply a call to runtime.morestack that clobbers rdx, the closure capture parameter. The meat of it comes when it loads [rax + 24], the first element of ITab.Fun. It then moves the data pointer in rbx to rax, the argument into rbx, and issues the call.

What about upcasts? An upcast to a concrete type is quite simple: simply compare the type in the interface (either directly or in the *ITab) to a particular statically-known one. Downcasting to an interface (sometimes called a sidecast) is much more complicated, because it essentially requires a little bit of reflection.

package foo

type MyIface interface {
  Method(int) int
}

func Downcast(m any) MyIface {
  return m.(MyIface)
}Go

foo.Downcast:
    cmp     rsp, [r14 + 16]
    jls     grow
    push    rpb
    mov     rbp, rsp
    add     rsp, -24
    mov     [rsp], rax
    mov     [rsp + 8], rbx
    test    rax, rax
    jeq     nil
    mov     rcx, [foo..typeAssert0]
    mov     rdx, [rcx]
    mov     rsi, [rax + 16]
hashProbe:
    mov     rdi, rsi
    and     rsi, rdx
    shl     rsi, 4
    mov     r8, [rcx + rsi + 8]
    cmp     rax, r8
    jeq     found
    lea     rsi, [rdi + 1]
    test    r8, r8
    jnz     hashProbe
    mov     [rsp + 8], rbx
    mov     rbx, rax
    leq     rax, [foo..typeAssert0]
    call    runtime.typeAssert
    mov     rbx, [rsp + 8]
    jmp     done
found:
    mov     rax, [rcx + rsi + 16]
done:
    add     rsp, 24
    pop     rpb
    ret
nil:
    lea     rax, [type:foo.MyIface]
    call    runtime.panicnildottype
grow:
    // Same as it was in foo.Call above.
    jmp     foo.Downcastx86 Assembly

When we request an interface downcast, the Go compiler synthesizes a symbol of type abi.TypeAssert. Its definition is reproduced below.

package abi

type TypeAssert struct {
	Cache   *TypeAssertCache
	Inter   *InterfaceType
	CanFail bool
}
type TypeAssertCache struct {
	Mask    uintptr
	Entries [1]TypeAssertCacheEntry
}
type TypeAssertCacheEntry struct {
	// type of source value (a *runtime._type)
	Typ uintptr
	// itab to use for result (a *runtime.itab)
	// nil if CanFail is set and conversion would fail.
	Itab uintptr
}Go

The first thing this function does is check if rax contains 0, i.e., if this is a nil any, and panics if that’s the case (that’s a call to runtime.panicnildottype). It then loads foo..typeAssert0, a synthetic global variable containing an abi.TypeAssert value. It loads the Cache field, as well as the Hash field of the abi.Type attached to the any. It masks off the low bits using typeAssert0.Cache.Mask, and uses that to start probing the very simple open-addressed hash table located in typeAssert0.Cache.Entries.

If it finds a TypeAssertCacheEntry with the type we’re looking for (compared by address), we’ve found it. We load that entry’s Itab value into rax to change the value from being an any to being a MyIface, and we’re done.

If it finds a TypeAssertCacheEntry with a nil Typ pointer, we’re forced to hit the slow path, implemented at runtime.typeAssert(). This dynamically builds an itab by searching the method set of the type inside the any.

This then calls the reflection code in runtime.getitab(), which is what actually performs the messy search through the method set, comparing the names and signatures of methods with those in the interface, to produce an itab at runtime.

Then, it shoves this the resulting itab into the global itab cache, which is protected by a global lock! There are lots of scary atomics in this code. There are many places where this can potentially panic, bubbling up a type assertion failure to the user.

When runtime.getitab() returns, runtime.typeAssert() will maybe⁴ update the type assertion cache, and return the new itab. This allows the code in our function to return directly, without needing to take another trip into the hashProbe loop.

In theory, PGO could be used to pre-fill the cache, but I couldn’t find any code in the compiler that indicates that this is something they do. In the meantime, you can optimize a hot type assert ahead of time by asserting to a known common type:

func DoSomething(r io.Reader) {
  var rs io.ReadSeeker
  if f, ok := r.(*os.File); ok {
    // Check for a known implementation first. This only costs
    // a pointer comparison with the *abi.Type in the itab.
    rs = f
  } else if f, ok := r.(io.ReadSeeker); ok {
    // Do an interface type assertion. This would eventually
    // learn os.File, but the branch above skips that "warmup"
    // time. It also lets the hardware branch predictor allocate
    // a prediction slot just for os.File.
    rs = f
  } else {
    // ...
  }
} Go

Type switches, incidentally, use a very similar caching mechanism for switches that include interface types among the cases.

What Was That About Indirect Interfaces?

Back when we were hacking arrays into existence with reflection, there was some trouble in reflect.Value.Interface(), where it would do a seemingly unnecessary copy.

This is because an interface’s data pointer must be a pointer. If you cram, say, an int into an any, Go will spill it to the heap. This is often called boxing, but the Go runtime refers to it as an “indirect interface”.

package foo

func Int2Any(x int) any {
  return x
}Go

foo.Int2Any:
  push     rbp
  mov      rbp, rsp
  add      rsp, -8
  call     runtime.convT64
  move     rbx, rax
  lea      rax, [type:int]
  add      rsp, 8
  pop      rbp
  retx86 Assembly

Like many other managed languages, Go will skip boxing very small values by instead returning pointers into some global array.

Now, this boxing could be avoided: after all, an int is no larger than a pointer, so we could cram it into the data pointer field directly. However, the GC really doesn’t like that: the GC assumes it can trace through any pointer. Now, the GC could treat interfaces differently, and look at the type/itab pointer to determine if the data value pointer or a scalar. However, this would add significant complexity to both the representation of shapes, and to the tracing code in the GC, resulting in more branches and slower tracing.

However, if the type being wrapped in an interface happens to be a pointer, it can just use that pointer value directly.

package foo

func Int2Any(x int) any {
  return x
}Go

foo.Int2Any:
  move     rbx, rax
  lea      rax, [type:int]
  retx86 Assembly

Any type that has the same shape as a pointer will be indirect. This includes maps, channels, and funcs. It also includes one element arrays of such types, such as [1]*int and [1]chan error, and single-field structs of such types. Curiously, this does not include structs which contain a zero-sized field before the pointer-sized field, even though those have the same shape as a pointer.

This means it’s generally not safe to play games with forging an interface out of a pointer to some type: whether that type is indirect in an interface is a subtle implementation detail of the compiler.

And of course, it’s important to remember that if you want to return a value by interface, you had best hope it can get inlined, so the compiler can promote the heap allocation to the stack.

Function Pointers

The last thing to look at are Go’s function pointers. For the longest time, I assumed they had the same layout as an interface: a pointer to closure data, and a hardware function pointer.

It turns out the layout is weirder: let’s revisit the runtime.funcval we found in runtime2.go earlier.

package runtime

type funcval struct {
	fn uintptr
	// variable-size, fn-specific data here
}Go

This unusual layout is best understood by looking at the generated assembly.

package foo

func Call(
  f func(int) int,
  x int,
) int {
  return f(x)
}Go

foo.Call:
    cmp     rsp, [r14 + 16]
    jls     grow
    push    rpb
    mov     rpb, rsp
    add     rsp, -8
    mov     rcx, [rax]
    mov     rdx, rax
    mov     rax, rbx
    call    rcx
    add     rsp, 8
    pop     rbp
    ret
grow:
    // Same as before.
    jmp     foo.Callx86 Assembly

To call f, first we interpret it as a *funcval and load f.fn into a temporary. That is, the first word pointed to by rax (which holds f on function entry). Then, we place f in rdx, the closure context register. The reason for using this extra magic register will become clear shorter. Then, we arrange the rest of the arguments in their usual registers, and we jump to the address stored in f.fn.

Inside of f, captures are accessed by offsetting from rdx. What does one of those closures look like?

package foo

func Capture(x int) func(int) int {
  return func(y int) int {
    return x * y
  }
}Go

foo.Capture:
    cmp     rsp, [r14 + 16]
    jls     grow
    push    rpb
    mov     rpb, rsp
    add     rsp, -16
    mov     [rsp], rax
    lea     rax, ["type:noalg.struct { F uintptr; X0 int }"]
    call    runtime.newobject
    lea     rcx, foo.Capture.func1
    mov     [rax], rcx
    mov     rcx, [rsp]
    mov     [rax + 8], rcx
    add     rsp, 16
    pop     rbp
    ret
grow:
    // Same as before.
    jmp     foo.Capture

foo.Capture.func1:
    mov     rcx, [rdx + 8]
    imul    rax, rcx
    retx86 Assembly

All Capture is doing is allocating a funcval with a single int capture; that’s the { F uintptr; X0 int } in the code above. It then places the address of Capture.func1, which implements the callback, into F, and the argument of Capture into X0.

What about when returning a reference to a function? In that case, all that happens is it returns a reference to a global containing the address of the function.

package foo

func Capture(x int) func(int) int {
  return Id
}

func Id(x int) int {
  return x
}Go

foo.Capture:
    lea     rax, [foo.Id·f]
    ret

foo.Id:
    ret

foo.Id·f:
    .quad foo.Idx86 Assembly

Because we pass the closure arguments in an extra register not used by regular functions, we don’t need to create a thunk for this case.

Unfortunately, we do need to create a thunk for methods, even methods with a pointer receiver. This is because of the following incompatible constraints:

The receiver pointer for a method must point exactly to the value the method is called on. It can’t be a fixed offset before, because that would create an out-of-bounds pointer, which the GC does not tolerate.
The closure pointer must point to the start of the funcval, not its captures, because adjusting the pointer to point to the captures would cause it to point one-past-the-end of a value, which the GC also does not tolerate!

Thus, even if methods accepted a pointer receiver via rdx, closures and methods disagree about where that pointer should be passed.

Of course, there are adjustments we can make to fix this problem. For example, we could require that all funcval values have at least one capture. No-capture funcvals would have a synthetic _ byte field. This is not unlike how a non-empty struct whose final field is empty will be padded with an extra _ byte field: this is specifically to avoid a pointer to that field being a past-the-end pointer. The cost is that every non-capturing closure costs twice as much binary size.

Another fix is to make the GC blind to the pointer in rdx. This will never be the only pointer by which a value is reachable, so it would be safe to replace mov rdx, rax with a lea rdx, [rax + 8]. The GC would never know!

Until then, beware that writing return foo.Method secretly allocates 16 bytes or so. (Aside: I used to sit next to the Go team at Google, and I remember having a conversation with Austin Clements about this. Apparently I misremembered, because until recently I thought Go already implemented this optimization!)

Conclusion

If you made it this far this is probably you right now:

Miguel as a Whimsicott, dizzy with register names.

This isn’t intended to be as polished as most of my articles, but there’s been enough things I’ve come across that I wanted to write this all up for my own reference.

Sizeof and Alignof are intrinsics, so the compiler will turn them into constants. However, they are only constants if the type being measured is not generic, so wrapping them in a function like this doesn’t actually hurt in generic code. ↩
Except for very large types that would have more words than can be recorded by an array of size abi.MaxPtrmaskBytes. For larger types, we use GC programs! A GC program is an LZ-compressed bitset serving the same purpose as the pointer bitset most smaller types use. See gcprog.go.

In fact, reflection knows how to create programs on the fly for most types! See reflect/type.go. ↩
I will be writing assembly examples in Intel-syntax x86. Go’s assembly syntax is horrible and an impediment to the point I’m making. ↩
Maybe? Well, the cache will only get updated about 0.1% of the time. This is to amortize the costs of growing the cache. I assume they benchmarked this, and found that the cost of growing the cache makes it only worthwhile when that assertion is getting hammered. ↩

< Prev • Next >