What the Hell Is a Target Triple?

Cross-compiling is taking a computer program and compiling it for a machine that isn’t the one hosting the compilation. Although historically compilers would only compile for the host machine, this is considered an anachronism: all serious native compilers are now cross-compilers.

After all, you don’t want to be building your iPhone app on literal iPhone hardware.

Different compilers have different mechanisms for classifying and identifying targets. A target is a platform that the compiler can produce executable code for. However, due to the runaway popularity of LLVM, virtually all compilers now use target triples. You may have already encountered one, such as the venerable x86_64-unknown-linux, or the evil x86_64-pc-windows. This system is convoluted and almost self-consistent.

But what is a target triple, and where did they come from?

Stupid GCC Conventions

So if you go poking around the Target Triplet page on OSDev, you will learn both true and false things about target triples, because this page is about GCC, not native compilers in general.

Generally, there is no “ground truth” for what a target triple is. There isn’t some standards body that assigns these names. But as we’ll see, LLVM is the trendsetter.

If you run the following command you can learn the target triple for your machine:

$ gcc -dumpmachine
x86_64-linux-gnu
Console

Now if you’re at all familiar with any system that makes pervasive use of target triples, you will know that this is not a target triple, because this target’s name is x86_64-unknown-linux-gnu, which is what both clang and rustc call it:

$ clang -dumpmachine
x86_64-pc-linux-gnu
$ rustc -vV | grep host
host: x86_64-unknown-linux-gnu
Console

Oh no.

Well, GCC is missing the pc or unknown component, and that’s specifically a GCC thing; it allows omitting parts of the triple in such a way that is unambiguous. And target triples are a GCC invention, so perhaps it’s best to start by assessing GCC’s beliefs.

According to GCC, a target triple is a string of the form <machine>-<vendor>-<os>. The “machine” part unambiguously identifies the architecture of the system. Practically speaking, this is the assembly language that the compiler will output at the end. The “vendor” part is essentially irrelevant, and mostly is of benefit for sorting related operating systems together. Finally, the “os” part identifies the operating system that this code is being compiled for. The main thing this identifies for a compiler is the executable format: COFF/PE for Windows, Mach-O for Apple’s operating systems, ELF for Linux and friends, and so on (this, however, is an oversimplification).

But you may notice that x86_64-unknown-linux-gnu has an extra, fourth entry1, which plays many roles but is most often called the target’s “ABI”. For linux, it identifies the target’s libc, which has consequences for code generation of some language features, such as thread locals and unwinding. It is optional, since many targets only have one ABI.
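
If you wanted to model these components in code, a deliberately naive sketch might look like the following (the type and field names here are mine, not anyone’s API; as we’ll see by the end, actually parsing real-world triples into this shape is far messier than it looks):

/// A naive model of the components described above.
struct Triple {
  arch: String,        // "x86_64"
  vendor: String,      // "unknown"
  os: String,          // "linux"
  env: Option<String>, // "gnu": the optional fourth "ABI" entry
}
Rust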

Cross Compiling with GCC

A critical piece of history here is to understand the really stupid way in which GCC does cross compiling. Traditionally, each GCC binary would be built for one target triple. The full name of a GCC binary would include the triple, so when cross-compiling, you would compile with x86_64-unknown-linux-gcc, link with x86_64-unknown-linux-ld, and so on (here, gcc is not the fourth ABI component of a triple; it’s just one of the tools in the x86_64-unknown-linux toolchain).

Nobody with a brain does this2. LLVM and all cross compilers that follow it instead put all of the backends in one binary, and use a compiler flag like --target to select the backend.

But regardless, this is where target triples come from, and why they look the way they look: they began as prefixes for the names of binaries in autoconf scripts.

But GCC is ancient technology. In the 21st century, LLVM rules all native compilers.

Names in the Ancient Language

LLVM’s target triple list is the one that should be regarded as “most official”, for a few reasons:

  1. Inertia. Everyone and their mother uses LLVM as a middleend and backend, so its naming conventions bubble up into language frontends like clang, rustc, swiftc, icc, and nvcc.

  2. Upstream work by silicon and operating system vendors. LLVM is what people get hired to work on for the most part, not GCC, so its platform-specific conventions often reflect the preferences of vendors.

These are in no small part because Apple, Google, and Nvidia have armies of compiler engineers contributing to LLVM.

The sources for “official” target triples are many. Generally, I would describe a target triple as “official” when:

  1. A major compiler (so, clang or rustc) uses it. Rust does a way better job than LLVM of documenting their targets, so I prefer to give it deference. You can find Rust’s official triples here.

  2. A platform developer (e.g., a hardware manufacturer, OS vendor) distributes a toolchain with a target triple in the arch-vendor-os format.

So, what are the names in class (1)? LLVM does not really go out of its way to provide such a list. But we gotta start somewhere, so source-diving it is.

We can dig into Triple.cpp in LLVM’s target triple parser. It lists all of the names LLVM recognizes for each part of a triple. Looking at Triple::parseArch(), we have the following names, including many, many aliases. The first item on the right column is LLVM’s preferred name for the architecture, as indicated by Triple::getArchTypeName().

Architecture Possible Names
Intel x86 (32-bit) i386, i486, i586, i686, i786, i886, i986
Intel x86 (64-bit) x86_64, amd64, x86_64h3
ARM (32-bit) arm, xscale, …
ARM (32-bit, big-endian) armeb, xscaleeb, …
ARM (64-bit) aarch64, aarch64e, aarch64ec, arm64, …
ARM (64-bit, big-endian) aarch64_be, …
ARM (64-bit, ILP324) aarch64_32, arm64_32, …
ARM Thumb thumb, …
ARM Thumb (big-endian) thumbeb, …
IBM PowerPC5 (32-bit) powerpc, powerpcspe, ppc, ppc32
IBM PowerPC (little-endian) powerpcle, ppcle, ppc32le
IBM PowerPC (64-bit) powerpc64, ppu, ppc64
IBM PowerPC (64-bit, little-endian) powerpc64le, ppc64le
MIPS (32-bit) mips, mipseb, mipsallegrex, mipsisa32r6, mipsr6
MIPS (32-bit, little-endian) mipsel, mipsallegrexel, mipsisa32r6el, mipsr6el
MIPS (64-bit) mips64, mips64eb, mipsn32, mipsisa64r6, mips64r6, mipsn32r6
MIPS (64-bit, little-endian) mips64el, mipsn32el, mipsisa64r6el, mips64r6el, mipsn32r6el
RISC-V (32-bit) riscv32
RISC-V (64-bit) riscv64
IBM z/Architecture s390x6, systemz
SPARC sparc
SPARC (little-endian) sparcel
SPARC (64-bit) sparcv9, sparc64
WebAssembly (32-bit) wasm32
WebAssembly (64-bit) wasm64
Loongson (32-bit) loongarch32
Loongson (64-bit) loongarch64
Radeon R600 r600
AMD GCN amdgcn
Qualcomm Hexagon hexagon
Nvidia PTX7 (32-bit) nvptx
Nvidia PTX (64-bit) nvptx64
AMD IL8 (32-bit) amdil
AMD IL (64-bit) amdil64
Direct-X IL dxil, …
HSAIL (32-bit) hsail
HSAIL (64-bit) hsail64
Khronos SPIR (32-bit) spir
Khronos SPIR (64-bit) spir64
Khronos SPIR-V spirv, …
Khronos SPIR-V (32-bit) spirv32, …
Khronos SPIR-V (64-bit) spirv64, …
Android RenderScript (32-bit) renderscript32
Android RenderScript (64-bit) renderscript64
Movidius SHAVE shave
Atmel AVR avr
Motorola 68k m68k
Argonaut ARC arc
Texas Instruments MSP430 msp430
Tensilica Xtensa xtensa
C-SKY csky
OpenASIP tce
OpenASIP (little-endian) tcele
Myricom Lanai lanai
XMOS xCore xcore
Kalimba9 kalimba
VE9 ve

Here we begin to see that target triples are not a neat system. They are hell. Where a list of architecture names contains a “…”, it means that LLVM accepts many more names.

The problem is that architectures often have versions and features, which subtly change how the compiler generates code. For example, when compiling for an x86_64, we may want to specify that we want AVX512 instructions to be used. On LLVM, you might do that with -mattr=+avx512. Every architecture has a subtly-different way of doing this, because every architecture had a different GCC! Each variant of GCC would put different things behind -mXXX flags (-m for “machine”), meaning that the interface is not actually that uniform. The meanings of -march, -mcpu, -mtune, and -mattr thus vary wildly for this reason.

Because LLVM is supposed to replace GCC (for the most part), it replicates a lot of this wacky behavior.

So uh, we gotta talk about 32-bit ARM architecture names.

ARMTargetParser.cpp

There is a hellish file in LLVM dedicated to parsing ARM architecture names. Although members of the ARM family have many configurable features (which you can discover with llc -march aarch64 -mattr help10), the name of the architecture is somewhat meaningful, and can have many options, mostly relating to the many versions of ARM that exist.

How bad is it? Well, we can look at all of the various ARM targets that rustc supports with rustc --print target-list:

$ rustc --print target-list | grep -P 'arm|aarch|thumb' \
  | cut -d- -f1 | sort | uniq
aarch64
aarch64_be
arm
arm64_32
arm64e
arm64ec
armeb
armebv7r
armv4t
armv5te
armv6
armv6k
armv7
armv7a
armv7k
armv7r
armv7s
armv8r
thumbv4t
thumbv5te
thumbv6m
thumbv7a
thumbv7em
thumbv7m
thumbv7neon
thumbv8m.base
thumbv8m.main
Console

Most of these are 32-bit ARM versions, with profile information attached. These correspond to the names given here. Why does ARM stick version numbers in the architecture name, instead of using -mcpu like you would on x86 (e.g. -mcpu alderlake)? I have no idea, because ARM is not my strong suit. It’s likely because of how early ARM support was added to GCC.

Internally, LLVM calls these “subarchitectures”, although ARM gets special handling because there’s so many variants. SPIR-V, Direct X, and MIPS all have subarchitectures, so you might see something like dxilv1.7 if you’re having a bad day.

Of course, LLVM’s ARM support also sports some naughty subarchitectures that are not part of this system, with made-up names.

  • arm64e is an Apple thing, which is an enhancement of aarch64 present on some Apple hardware, which adds their own flavor of pointer authentication and some other features.

  • arm64ec is a completely unrelated Microsoft invention that is essentially “aarch64 but with an x86_64-ey ABI” to make x86_64 emulation on what would otherwise be an aarch64-pc-windows-msvc target somewhat more amenable.

Why did the Windows people invent a whole other ABI instead of making things clean and simple like Apple did with Rosetta on ARM MacBooks? I have no idea, but http://www.emulators.com/docs/abc_arm64ec_explained.htm contains various excuses, none of which I am impressed by. My read is that their compiler org was just worse at life than Apple’s, which is not surprising, since Apple does compilers better than anyone else in the business.

Actually, since we’re on the topic of the names of architectures, I have a few things I need to straighten out.

Made Up Names of Architectures

x86 and ARM both seem to attract a lot of people making up nicknames for them, which leads to a lot of confusion in:

  1. What the “real” name is.

  2. What name a particular toolchain wants.

  3. What name you should use in your own cosmopolitan tooling.

Let’s talk about the incorrect names people like to make up for them. Please consider the following a relatively normative reference on what people call these architectures, based on my own experience with many tools.

When we say “x86” unqualified, in 2025, we almost always mean x86_64, because 32-bit x86 is dead. If you need to talk about 32-bit x86, you should either say “32-bit x86”, “protected mode”11, or “i386” (the first Intel microarchitecture that implemented 32-bit protected mode)12. You should not call it x86_32 or just x86.

You might also call it IA-32 for Intel Architecture 32, (or ia32), but nobody calls it that and you risk confusing people with ia64, or IA-64, the official name of Intel’s failed general-purpose VLIW architecture, Itanium, which is in no way compatible with x86. ia64 was what GCC and LLVM named Itanium triples with. Itanium support was drowned in a bathtub during the Obama administration, so it’s not really relevant anymore. Rust has never had official Itanium support.

32-bit x86 is extremely not called “x32”; this is what Linux used to call its x86 ILP324 variant before it was removed (which, following the ARM names, would have been called x86_64_32).

There are also many fictitious names for 64-bit x86, which you should avoid unless you want the younger generation to make fun of you. amd64 refers to AMD’s original implementation of long mode in their K8 microarchitecture, first shipped in their Athlon 64 product. AMD still makes the best x86 chips (I am writing this on a machine socketed with a Zen2 Threadripper), sure, but calling it amd64 is silly and also looks a lot like arm64, and I am honestly kinda annoyed at how much Go code I’ve seen with files named fast_arm64.s and fast_amd64.s. Debian also uses amd64/arm64, which makes browsing packages kind of annoying.

On that topic, you should absolutely not call 64-bit mode k8, after the AMD K8. Nobody except for weird computer taxonomists like me know what that is. But Bazel calls it that, and it’s really irritating13.

You should also not call it x64. Although LLVM does accept amd64 for historical purposes, no one calls it x64 except for Microsoft. And even though it is fairly prevalent on Windows, I absolutely give my gamedev friends a hard time when they write x64.

On the ARM side, well. Arm14 has a bad habit of not using consistent naming for 64-bit ARM, since they used both AArch64 and ARM64 for it. However, in compiler land, aarch64 appears to be somewhat more popular.

You should also probably stick to the LLVM names for the various architectures, instead of picking your favorite Arm Cortex name (like cortex_m0).

Vendors and Operating Systems

The worst is over. Let’s now move on to examining the rest of the triple: the platform vendor, and the operating system.

The vendor is intended to identify who is responsible for the ABI definition for that target. Although it provides little to no value to the compiler itself, it does help to sort related targets together. Sort of.

Returning to llvm::Triple, we can examine Triple::VendorType. Vendors almost always correspond to companies which develop operating systems or other platforms that code runs on, with some exceptions.

We can also get the vendors that rustc knows about with a handy dandy command:

rustc --print target-list | grep -P '\w+-\w+-' | cut -d- -f2 | sort | uniq
Console

The result is this. This is just a representative list; I have left off a few that are not going to be especially recognizable.

Vendor Name Example Triple
Vendor Unknown15 unknown x86_64-unknown-linux
“PC” pc x86_64-pc-windows-msvc
Advanced Micro Devices Inc. amd amdgcn-amd-gfx906
Apple Inc. apple aarch64-apple-ios-sim
Intel Corporation intel i386-intel-elfiamcu
IBM Corporation ibm powerpc64-ibm-aix
Mesa3D Project mesa amdgcn-mesa-mesa3d
MIPS Technologies LLC mti mips-mti-none-elf
Nintendo nintendo armv6k-nintendo-3ds
Nvidia Corporation nvidia nvptx64-nvidia-cuda
Sony Interactive Entertainment scei, sie, sony x86_64-sie-ps5
Sun Microsystems sun sparcv9-sun-solaris
SUSE S. A. suse aarch64-suse-linux
Red Hat, Inc redhat x86_64-redhat-linux
Universal Windows Platform uwp aarch64-uwp-windows-msvc

Most vendors are the names of organizations that produce hardware or operating systems. For example suse and redhat are used for those organizations’ Linux distributions, as a funny branding thing. Some vendors are projects, like the mesa vendor used with the Mesa3D OpenGL implementation’s triples.

The unknown vendor is used for cases where the vendor is not specified or just not important. For example, the canonical Linux triple is x86_64-unknown-linux… although one could argue it should be x86_64-torvalds-linux. It is not uncommon for companies that sell/distribute Linux distributions to have their own target triples, as do SUSE and sometimes RedHat. Notably, there are no triples with a google vendor, even though aarch64-linux-android and aarch64-unknown-fuchsia should really be called aarch64-google-linux-android and aarch64-google-fuchsia. The target triple system begins to show cracks here.

The pc vendor is a bit weirder, and is mostly used by Windows targets. The standard Windows target is x86_64-pc-windows-msvc, but really it should have been x86_64-microsoft-windows-msvc. This is likely complicated by the fact that there is also a x86_64-pc-windows-gnu triple, which is for MinGW code. This platform, despite running on Windows, is not provided by Microsoft, so it would probably make more sense to be called x86_64-unknown-windows-gnu.

But not all Windows targets are pc! UWP apps use a different triple, that replaces the pc with uwp. rustc provides targets for Windows 7 backports that use a win7 “vendor”.

Beyond Operating Systems

The third (or sometimes second, ugh) component of a triple is the operating system, or just “system”, since it’s much more general than that. The main thing that compilers get from this component relates to generating code to interact with the operating system (e.g. SEH on Windows) and various details related to linking, such as object file format and relocations.

It’s also used for setting defines like __linux__ in C, which user code can use to determine what to do based on the target.

We’ve seen linux and windows, but you may have also seen x86_64-apple-darwin. Darwin?

The operating system formerly known as Mac OS X (now macOS16) is a POSIX operating system. The POSIX substrate that all the Apple-specific things are built on top of is called Darwin. Darwin is a free and open source operating system based on Mach, a research kernel whose name survives in Mach-O, the object file format used by all Apple products.

All of the little doodads Apple sells use the actual official names of their OSes, like aarch64-apple-ios. For, you know, iOS. On your iPhone. Built with Xcode on your iMac.

none is a common value for this entry, which usually means a free-standing environment with no operating system. The object file format is usually specified in the fourth entry of the triple, so you might see something like riscv32imc-unknown-none-elf.

Sometimes the triple refers not to an operating system, but to a complete hardware product. This is common with game console triples, which have “operating system” names like ps4, psvita, 3ds, and switch. (Both Sony and Nintendo use LLVM as the basis for their internal toolchains; the Xbox toolchain is just MSVC).

ABI! ABI!

The fourth entry of the triple (and I repeat myself, yes, it’s still a triple) represents the binary interface for the target, when it is ambiguous.

For example, Apple targets never have this, because on an Apple platform, you just shut up and use CoreFoundation.framework as your libc. Except this isn’t true, because of things like x86_64-apple-ios-sim, the iOS simulator running on an x86 host.

On the other hand, Windows targets will usually specify -msvc or -gnu, to indicate whether they are built to match MSVC’s ABI or MinGW. Linux targets will usually specify the libc vendor in this position: -gnu for glibc, -musl for musl, -newlib for newlib, and so on.

This doesn’t just influence the calling convention; it also influences how language features, such as thread locals and dynamic linking, are handled. This usually requires coordination with the target libc.

On ARM free-standing (armxxx-unknown-none) targets, -eabi specifies the ARM EABI, which is a standard embedded ABI for ARM. -eabihf is similar, but indicates that no soft float support is necessary (hf stands for hardfloat). (Note that Rust does not include a vendor with these architectures, so they’re more like armv7r-none-eabi).

A lot of jankier targets use the ABI portion to specify the object file, such as the aforementioned riscv32imc-unknown-none-elf.

WASM Targets

One last thing to note are the various WebAssembly targets, which completely ignore all of the above conventions. Their triples often only have two components (they are still called triples, hopefully I’ve made that clear by now). Rust is a little bit more on the forefront here than clang (and anyways I don’t want to get into Emscripten) so I’ll stick to what’s going on in rustc.

There’s a few variants. wasm32-unknown-unknown (here using unknown instead of none as the system, oops) is a completely bare WebAssembly runtime where none of the standard library that needs to interact with the outside world works. This is essentially for building WebAssembly modules to deploy in a browser.

There are also the WASI targets, which provide a standard ABI for talking to the host operating system. These are less meant for browsers and more for people who are using WASI as a security boundary. These have names like wasm32-wasip1, which, unusually, lack a vendor! A “more correct” formulation would have been wasm32-unknown-wasip1.

Aside on Go

Go does the correct thing and distributes a cross compiler. This is well and good.

Unfortunately, they decided to be different and special and do not use the target triple system for naming their targets. Instead, you set the GOARCH and GOOS environment variables before invoking gc. This pair is sometimes written with a slash between them, such as linux/amd64.

Thankfully, they at least provide documentation for a relevant internal package here, which offers the names of various GOARCH and GOOS values.

They use completely different names from everyone else for a few things, which is guaranteed to trip you up. They call the 32- and 64-bit variants of x86 386 (note the lack of leading i) and amd64. They call 64-bit ARM arm64, instead of aarch64. They call little-endian MIPSes mipsle instead of mipsel.

They also call 32-bit WebAssembly wasm instead of wasm32, which is a bit silly, and they use js/wasm as their equivalent of wasm32-unknown-unknown, which is very silly.

Android is treated as its own operating system, android, rather than being linux with a particular ABI; their system also can’t account for ABI variants in general, since Go originally wanted to not have to link any system libraries, something that does not actually work.

If you are building a new toolchain, don’t be clever by inventing a cute target triple convention. All you’ll do is annoy people who need to work with a lot of different toolchains by being different and special.

Inventing Your Own Triples

Realistically, you probably shouldn’t. But if you must, you should probably figure out what you want out of the triple.

Odds are there isn’t anything interesting to put in the vendor field, so you will save people a lot of pain by picking unknown. But do include a vendor component, to avoid pain for people in the future.

You should also avoid inventing a new name for an existing architecture. Don’t name your hobby operating system’s triple amd64-unknown-whatever, please. And you definitely don’t want to have an ABI component. One ABI is enough.

If you’re inventing a triple for a free-standing environment, but want to specify something about the hardware configuration, you’re probably gonna want to use -none-<abi> for your system. For some firmware use-cases, though, the system entry is a better place, such as for the UEFI triples. Although, I have unfortunately seen both x86_64-unknown-uefi and x86_64-pc-none-uefi in the wild.

And most importantly: this system was built up organically. Disabuse yourself now of the idea that the system is consistent and that target triples are easy to parse. Trying to parse them will make you very sad.

  1. And no, a “target quadruple” is not a thing and if I catch you saying that I’m gonna bonk you with an Intel optimization manual. 

  2. I’m not sure why GCC does this. I suspect that it’s because computer hard drives used to be small and a GCC with every target would have been too large to cram into every machine. Maybe it has some UNIX philosophy woo mixed into it.

    Regardless, it’s really annoying and thankfully no one else does this because cross compiling shouldn’t require hunting down a new toolchain for each platform. 

  3. This is for Apple’s later-gen x86 machines, before they went all-in on ARM desktop. 

  4. ILP32 means that the int, long, and pointer types in C are 32-bit, despite the architecture being 64-bit. This allows writing programs that are small enough to jive in a 32-bit address space, while taking advantage of fast 64-bit operations. It is a bit of a frankentarget. Also existed once as a process mode on x86_64-unknown-linux by the name of x32.

  5. Not to be confused with POWER, an older IBM CPU. 

  6. This name is Linux’s name for IBM’s z/Architecture. See https://en.wikipedia.org/wiki/Linux_on_IBM_Z#Hardware

  7. Not a real chip; refers to Nvidia’s PTX IR, which is what CUDA compiles to. 

  8. Similar to PTX; an IR used by AMD for graphics. See https://openwall.info/wiki/john/development/AMD-IL

  9. No idea what this is, and Google won’t help me.

  10. llc is the LLVM compiler, which takes LLVM IR as its input. Its interface is much more regular than clang’s because it’s not intended to be a substitute for GCC the way clang is. 

  11. Very kernel-hacker-brained name. It references the three processor modes of an x86 machine: real mode, protected mode, long mode, which correspond to 16-, 32-, and 64-bit modes. There is also a secret fourth mode called unreal mode, which is just what happens when you come down to real mode from protected mode after setting up a protected mode GDT.

    If you need to refer to real mode, call it “real mode”. Don’t try to be clever by calling it “8086” because you are almost certainly going to be using features that were not in the original Intel 8086. 

  12. I actually don’t like this name, but it’s the one LLVM uses so I don’t really get to complain. 

  13. Bazel also calls 32-bit x86 piii, which stands for, you guessed it, “Pentium III”. Extremely unserious. 

  14. The intellectual property around ARM, the architecture family, is owned by the British company Arm Holdings. Yes, the spelling difference is significant.

    Relatedly, ARM is not an acronym, and is sometimes styled in all-lowercase as arm. The distant predecessor of Arm Holdings is Acorn Computers. Their first computer, the Acorn Archimedes, contained a chip whose target triple name today might have been armv1. Here, ARM was an acronym, for Acorn RISC Machine. Wikipedia alleges without citation that the name was at one point changed to Advanced RISC Machine at the behest of Apple, but I am unable to find more details.

  15. “You are not cool enough for your company to be on the list.” 

  16. Which I pronounce as one word, “macos”, to drive people crazy. 

Protobuf Tip #1: Field Names Are Forever

I wake up every morning and grab the morning paper. Then I look at the obituary page. If my name is not on it, I get up. –Ben Franklin

TL;DR: Don’t rename fields. Even though there are a slim number of cases where you can get away with it, it’s rarely worth doing, and is a potential source of bugs.

I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think it fits with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog.

Names and Tags

Protobuf message fields have field tags that are used in the binary wire format to discriminate fields. This means that the wire format serialization does not actually depend on the names of the fields. For example, the following messages will use the exact same serialization format.

message Foo {
  string bar = 1;
}

message Foo2 {
  string bar2 = 1;
}
Protobuf

In fact, the designers of Protobuf intended for it to be feasible to rename an in-use field. However, they were not successful: it can still be a breaking change.

Schema Consumers Need to Update

If your schema is public, the generated code will change. For example, renaming a field from first_name to given_name will cause the corresponding Go accessor to change from FirstName to GivenName, potentially breaking downstream consumers.

Renaming a field to a “better” name is almost never a worthwhile change, simply because of this breakage.

JSON Serialization Breaks

Wire format serialization doesn’t look at names, but JSON does! This means that Foo and Foo2 above serialize as {"bar":"content"} and {"bar2":"content"} respectively, making them non-interchangeable.

This can be partially mitigated by using the [json_name = "..."] option on a field. However, this doesn’t actually work, because many Protobuf runtimes’ JSON codecs will accept both the name set in json_name, and the specified field name. So string given_name = 1 [json_name = "firstName"]; will allow deserializing from a key named given_name, but not first_name like it used to. This is still a breaking protocol change!

This is a place where Protobuf could have done better—if json_name had been a repeated string, this wire format breakage would have been avoidable. However, for reasons given below, renames are still a bad idea.

Reflection!

Even if you could avoid source and JSON breakages, the names are always visible to reflection. Although it’s very hard to guard against reflection breakages in general (since it can even see the order fields are declared in), this is one part of reflection that can be especially insidious—for example, if callers choose to sort fields by name, or if some middleware is using the name of a field to identify its frequency, or logging/redaction needs.

Don’t change the name, because reflection means you can’t know what’ll go wrong!

But I Really Have To!

There are valid reasons for wanting to rename a field, such as expanding its scope. For example, first_name and given_name are not the same concept: in the Sinosphere, as well as in Hungary, the first name in a person’s full name is their family name, not their given name.

Or maybe a field that previously referred to a monetary amount, say cost_usd, is being updated to not specify the currency:

message Before {
  sint64 cost_usd = 1;
}

message After {
  enum Currency {
    CURRENCY_UNSPECIFIED = 0;
    CURRENCY_USD = 1;
    CURRENCY_EUR = 2;
    CURRENCY_JPY = 3;
    CURRENCY_USD_1000TH = 4; // 0.1 cents.
  }

  sint64 cost = 1;
  Currency currency = 2;
}
Protobuf

In cases like this, renaming the field is a terrible idea. Setting aside source code or JSON breakage, the new field has completely different semantics. If an old consumer, expecting a price in USD, receives a new wire format message serialized from {"cost":990,"currency":"CURRENCY_USD_1000TH"}, it will incorrectly interpret the price as 990 USD, rather than 0.99 USD. That’s a disastrous bug!

Instead, the right plan is to add cost and currency side-by-side with cost_usd. Then, readers should first check for cost_usd when reading cost, and take that to imply that currency is CURRENCY_USD (it’s also worth generating an error if cost and cost_usd are both present).
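
To make the read-side rule concrete, here is a sketch in Rust against a hand-written stand-in for the generated message (the real generated API depends on your Protobuf runtime; Option here just models field presence):

#[derive(Clone, Copy)]
enum Currency {
  Usd,
  Eur,
  Jpy,
  Usd1000th,
}

/// A hand-rolled stand-in for the generated `After` message.
struct Payment {
  cost_usd: Option<i64>, // deprecated, but still honored
  cost: Option<i64>,
  currency: Option<Currency>,
}

/// Resolves the cost, honoring the legacy field and erroring if both the
/// old and new fields are set.
fn effective_cost(p: &Payment) -> Result<(i64, Currency), &'static str> {
  match (p.cost_usd, p.cost) {
    (Some(_), Some(_)) => Err("both cost_usd and cost are set"),
    (Some(usd), None) => Ok((usd, Currency::Usd)),
    (None, Some(cost)) => {
      Ok((cost, p.currency.ok_or("cost is set but currency is missing")?))
    }
    (None, None) => Err("no cost set"),
  }
}
Rust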

cost_usd can then be marked as [deprecated = true]. It is even possible to delete cost_usd in some cases, such as when you control all readers and writers — but if you don’t, the risk is very high. Plus, you kind of need to be able to re-interpret cost_usd as the value of cost in perpetuity.

If you do wind up deleting them, make sure to reserve the field’s number and name, to avoid accidental re-use.

reserved 1;
reserved "cost_usd";
Protobuf

But try not to. Renaming fields is nothing but tears and pain.

The Art of Formatting Code

Every modern programming language needs a formatter to make your code look pretty and consistent. Formatters are source-transformation tools that parse source code and re-print the resulting AST in some canonical form that normalizes whitespace and optional syntactic constructs. They remove the tedium of matching indentation and brace placement to match a style guide.

Go is particularly well-known for providing a formatter as part of its toolchain from day one. It is not a good formatter, though, because it cannot enforce a maximum column width. Later formatters of the 2010s, such as rustfmt and clang-format, do provide this feature, which ensures that individual lines of code don’t get too long.

The reason Go doesn’t do this is because the naive approach to formatting code makes it intractable to do so. There are many approaches to implementing this, which can make it seem like a very complicated layout constraint solving problem.

So what’s so tricky about formatting code? Aren’t you just printing out an AST?

“Just” an AST

An AST1 (abstract syntax tree) is a graph representation of a program’s syntax. Let’s consider something like JSON, whose naively-defined AST type might look something like this.

enum Json {
  Null,
  Bool(bool),
  Number(f64),
  String(String),
  Array(Vec<Json>),
  Object(HashMap<String, Json>)
}
Rust

The AST for the document {"foo": null, "bar": 42} might look something like this:

let my_doc = Json::Object([
  ("foo".to_string(), Json::Null),
  ("bar".to_string(), Json::Number(42)),
].into());
Rust

This AST has some pretty major problems. A formatter must not change the syntactic structure of the program (beyond removing things like redundant braces). Formatting must also be deterministic.

First off, Json::Object is a HashMap, which is unordered. So it will immediately discard the order of the keys. Json::String does not retain the escapes from the original string, so "\n" and "\u000a" are indistinguishable. Json::Number will destroy information: JSON numbers can specify values outside of the f64 representable range, but converting to f64 will quantize to the nearest float.

Now, JSON doesn’t have comments, but if it did, our AST has no way to record it! So it would destroy all comment information! Plus, if someone has a document that separates keys into stanzas2, as shown below, this information is lost too.

{
  "this": "is my first stanza",
  "second": "line",

  "here": "is my second stanza",
  "fourth": "line"
}
JSON

Truth is, the ASTs for virtually all competent toolchains are much more complicated than this. Here are some important properties an AST needs to have to be useful.

  1. Retain span information. Every node in the graph remembers what piece of the file it was parsed from.

  2. Retain whitespace information. “Whitespace” typically includes both whitespace characters, and comments.

  3. Retain ordering information. The children of each node need to be stored in ordered containers.

The first point is achieved in a number of ways, but boils down to somehow associating to each token a pair of integers3, identifying the start and end offsets of the token in the input file.

Given the span information for each token, we can then define the span for each node to be the join of its tokens’ spans, namely the start is the min of its constituent tokens’ starts and its end is the max of the ends. This can be easily calculated recursively.

Once we have spans, it’s easy to recover the whitespace between any two adjacent syntactic constructs by calculating the text between them. This approach is more robust than, say, associating each comment with a specific token, because it makes it easier to discriminate stanzas for formatting.
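
As a quick sketch, both operations fall out of a bare-bones span type of byte offsets (the same shape as the Span we’ll define for the JSON AST below; function names here are mine):

#[derive(Clone, Copy)]
struct Span {
  start: usize,
  end: usize,
}

impl Span {
  /// The join of two spans: the smallest span covering both.
  fn join(self, other: Span) -> Span {
    Span {
      start: self.start.min(other.start),
      end: self.end.max(other.end),
    }
  }
}

/// The raw text (whitespace and comments included) between two adjacent
/// syntactic constructs.
fn between<'src>(file: &'src str, before: Span, after: Span) -> &'src str {
  &file[before.end..after.start]
}
Rust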

Being able to retrieve the comments between any two syntax nodes is crucial. Suppose the user writes the following Rust code:

let x = false && // HACK: disable this check.
  some_complicated_check();
Rust

If we’re formatting the binary expression containing the &&, and we can’t query for comments between the LHS and the operator, or the operator and the RHS, the // HACK comment will get deleted on format, which is pretty bad!

An AST that retains this level of information is sometimes called a “concrete syntax tree”. I do not consider this a useful distinction, because any useful AST must retain span and whitespace information, and it’s kind of pointless to implement the same AST more than once. To me, an AST without spans is incomplete.

Updating Our JSON AST

With all this in mind, the bare minimum for a “good” AST is gonna be something like this.

struct Json {
  kind: JsonKind,
  span: (usize, usize),
}

enum JsonKind {
  Null,
  Bool(bool),
  Number(f64),
  String(String),
  Array(Vec<Json>),
  Object(Vec<(String, Json)>),  // Vec, not HashMap.
}
Rust

There are various layout optimizations we can do: for example, the vast majority of strings exist literally in the original file, so there’s no need to copy them into a String; it’s only necessary if the string contains escapes. My byteyarn crate, which I wrote about here, is meant to make handling this case easy. So we might rewrite this to be lifetime-bound to the original file.

struct Json<'src> {
  kind: JsonKind<'src>,
  span: (usize, usize),
}

enum JsonKind<'src> {
  Null,
  Bool(bool),
  Number(f64),
  String(Yarn<'src, str>),
  Array(Vec<Json<'src>>),
  Object(Vec<(Yarn<'src, str>, Json<'src>)>),  // Vec, not HashMap.
}
Rust

But wait, there’s some things that don’t have spans here. We need to include spans for the braces of Array and Object, their commas, and the colons on object keys. So what we actually get is something like this:

struct Span {
  start: usize,
  end: usize,
}

struct Json<'src> {
  kind: JsonKind<'src>,
  span: Span,
}

enum JsonKind<'src> {
  Null,
  Bool(bool),
  Number(f64),
  String(Yarn<'src, str>),

  Array {
    open: Span,
    close: Span,
    entries: Vec<ArrayEntry<'src>>,
  },
  Object {
    open: Span,
    close: Span,
    entries: Vec<ObjectEntry<'src>>,
  },
}

struct ArrayEntry<'src> {
  value: Json<'src>,
  comma: Option<Span>,
}

struct ObjectEntry<'src> {
  key: Yarn<'src, str>,
  key_span: Span,
  colon: Span,
  value: Json<'src>,
  comma: Option<Span>,
}
Rust

Implementing an AST is one of my least favorite parts of writing a toolchain, because it’s tedious to ensure all of the details are recorded and properly populated.

“Just” Printing an AST

In Rust, you can easily get a nice recursive print of any struct using the #[derive(Debug)] construct. This is implemented by recursively calling Debug::fmt() on the elements of a struct, but passing modified Formatter state to each call to increase the indentation level each time.

This enables printing nested structs in a way that looks like Rust syntax when using the {:#?} specifier.
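
For example, a toy pair of structs like this (hypothetical, chosen to produce the output shown below):

#[derive(Debug)]
struct Foo {
  bar: i32,
  baz: Baz,
}

#[derive(Debug)]
struct Baz {
  quux: i32,
}

fn main() {
  let foo = Foo { bar: 0, baz: Baz { quux: 42 } };
  println!("{:#?}", foo);
}
Rust

prints something like the following (modulo indentation width):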

Foo {
  bar: 0,
  baz: Baz {
    quux: 42,
  },
}
Rust

We can implement a very simple formatter for our JSON AST by walking it recursively.

fn fmt(out: &mut String, json: &Json, file: &str, indent: usize) {
  match &json.kind {
    JsonKind::Null | JsonKind::Bool(_) | JsonKind::Number(_) | JsonKind::String(_) => {
      // Preserve the input exactly.
      out.push_str(&file[json.span.start..json.span.end]);
    }

    JsonKind::Array { entries, .. } => {
      out.push('[');
      for entry in entries {
        out.push('\n');
        for _ in 0..indent * 2 + 2 {
          out.push(' ');
        }
        fmt(out, &entry.value, file, indent + 1);
        if entry.comma.is_some() {
          out.push(',');
        }
      }
      out.push('\n');
      for _ in 0..indent * 2 {
        out.push(' ');
      }
      out.push(']');
    }

    JsonKind::Object { entries, .. } => {
      out.push('{');
      for entry in entries {
        out.push('\n');
        for _ in 0..indent * 2 + 2 {
          out.push(' ');
        }

        // Preserve the key exactly.
        out.push_str(&file[entry.key_span.start..entry.key_span.end]);

        out.push_str(": ");
        fmt(out, &entry.value, file, indent + 1);
        if entry.comma.is_some() {
          out.push(',');
        }
      }
      out.push('\n');
      for _ in 0..indent * 2 {
        out.push(' ');
      }
      out.push('}');
    }
  }
}
Rust

This is essentially what every JSON serializer’s “pretty” mode looks like. It’s linear, it’s simple. But it has one big problem: small lists.

If I try to format the document {"foo": []} using this routine, the output will be

{
  "foo": [
  ]
}
JSON

This is pretty terrible, but easy to fix by adding a special case:

JsonKind::Array { entries, .. } => {
  if entries.is_empty() {
    out.push_str("[]");
    return
  }

  // ...
}
Rust

Unfortunately, this doesn’t handle the similar case of a small but non-empty list. {"foo": [1, 2]} formats as

{
  "foo": [
    1,
    2
  ]
}
JSON

Really, we’d like to keep "foo": [1, 2] on one line. And now we enter the realm of column wrapping.

How Wide Is a Codepoint?

The whole point of a formatter is to work with monospaced text: text set in a monospaced (fixed-width) typeface, where every character occupies the same width, which means the length of a line can be measured in columns.

So how many columns does the string cat take up? Three, pretty easy. But we obviously don’t want to count bytes; this isn’t 1971. If we did, кішка, which encodes to ten bytes of UTF-8, would be 10 columns wide rather than 5. So we seem to want to count Unicode characters instead?

Oh, but what is a Unicode character? Well, we could say that you’re counting Unicode scalar values (what Rust’s char and Go’s rune types represent). Or you could count grapheme clusters (like Swift’s Character).

But that would give wrong answers. CJK languages’ characters, such as 猫, usually want to be rendered as two columns, even in monospaced contexts. So, you might go to Unicode and discover UAX#11, and attempt to use it for assigning column widths. But it turns out that the precise rules that monospaced fonts use are not written down in a single place in Unicode. You would also discover that some scripts, such as Arabic, have complex ligature rules that mean that the width of a single character depends on the characters around it.

This is a place where you should hunt for a library. unicode_width is the one for Rust. Given that Unicode segmentation is a closely associated operation to width, segmentation libraries are a good place to look for a width calculation routine.

But most such libraries will still give wrong answers, because of tabs. The tab character U+0009 CHARACTER TABULATION’s width depends on the width of all characters before it, because a tab is as wide as needed to reach the next tabstop, which is a column position an integer multiple of the tab width (usually 2, 4, or, on most terminals, 8).

With a tab width of 4, "\t", "a\t", and "abc\t" are all four columns wide. Depending on the context, you will either want to treat tabs as behaving as going to the next tabstop (and thus being variable width), or having a fixed width. The former is necessary for assigning correct column numbers in diagnostics, but we’ll find that the latter is a better match for what we’re doing.
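
Here is a sketch of the tabstop-advancing interpretation (ASCII-only; everything that isn’t a tab is counted as one column, which is where a real width library would come in):

/// Counts columns, advancing each tab to the next multiple of `tab_width`.
fn columns(text: &str, tab_width: usize) -> usize {
  let mut col = 0;
  for c in text.chars() {
    if c == '\t' {
      col += tab_width - (col % tab_width);
    } else {
      col += 1; // naive; see unicode_width for the general case
    }
  }
  col
}

fn main() {
  // With a tab width of 4, all of these are four columns wide.
  assert_eq!(columns("\t", 4), 4);
  assert_eq!(columns("a\t", 4), 4);
  assert_eq!(columns("abc\t", 4), 4);
}
Rust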

The reason for being able to calculate the width of a string is to enable line wrapping. At some point in the 2010s, people started writing a lot of code on laptops, where it is not easy to have two editors side by side on the small screen. This removes the motivation to wrap all lines at 80 columns4, which in turn results in lines that tend to get arbitrarily long.

Line wrapping helps ensure that no matter how wide everyone’s editors are, the code I have to read fits on my very narrow editors.

Accidentally Quadratic

A lot of folks’ first formatter recursively formats a node by formatting its children to determine if they fit on one line or not, and based on that, and their length if they are single-line, determine if their parent should break.

This is a naive approach, which has several disadvantages. First, it’s very easy to accidentally backtrack, trying to only break smaller and smaller subexpressions until things fit on one line, which can lead to quadratic complexity. The logic for whether a node can break is bespoke per node and that makes it easy to make mistakes.

Consider formatting {"foo": [1, 2]}. In our AST, this will look something like this:

Json {
  kind: JsonKind::Object {
    open: Span { start: 0, end: 1 },
    close: Span { start: 14, end: 15 },
    entries: vec![ObjectEntry {
      key: "foo",
      key_span: Span { start: 1, end: 6 },
      colon: Span { start: 6, end: 7 },
      value: Json {
        kind: JsonKind::Array {
          open: Span { start: 8, end: 9 },
          close: Span { start: 13, end: 14 },
          entries: vec![
            ArrayEntry {
              value: Json {
                kind: JsonKind::Number(1.0),
                span: Span { start: 9, end: 10 },
              },
              comma: Some(Span { start: 10, end: 11 }),
            },
            ArrayEntry {
              value: Json {
                kind: JsonKind::Number(2.0),
                span: Span { start: 12, end: 13 },
              },
              comma: None,
            },
          ],
        },
        span: Span { start: 8, end: 14 },
      },
      comma: None,
    }],
  },
  span: Span { start: 0, end: 15 },
}
Rust

To format the whole document, we need to know the width of each field in the object to decide whether the object fits on one line. To do that, we need to calculate the width of each value, and add to it the width of the key, and the width of the : separating them.

How can this be accidentally quadratic? If we simply say “format this node” to obtain its width, that will recursively format all of the children it contains without introducing line breaks, performing work that is linear in how many transitive children that node contains. Having done this, we can now decide if we need to introduce line breaks or not, which increases the indentation at which the children are rendered. This means that the children cannot know ahead of time how much of the line is left for them, so we need to recurse into formatting them again, now knowing the indentation at which the direct children are rendered.

Thus, each node performs work equal to the number of nodes beneath it. This has resulted in many slow formatters.

Now, you could be more clever and have each node be capable of returning its width based on querying its children’s width directly, but that means you need to do complicated arithmetic for each node that needs to be synchronized with the code that actually formats it. Easy to make mistakes.

The solution is to invent some kind of model for your document that specifies how lines should be broken if necessary, and which tracks layout information so that it can be computed in one pass, and then used in a second pass to figure out whether to actually break lines or not.

This is actually how HTML works. The markup describes constraints on the layout of the content, and then a layout engine, over several passes, calculates sizes, solves constraints, and finally produces a raster image representing that HTML document. Following the lead of HTML, we can design…

A DOM for Your Code

The HTML DOM is a markup document: a tree of tags where each tag has a type, such as <p>, <a>, <hr>, or <strong>, properties, such as <a href=...>, and content consisting of nested tags (and bare text, which every HTML engine just handles as a special kind of tag), such as <p>Hello <em>World</em>!</p>.

We obviously want to have a tag for text that should be rendered literally. We also want a tag for line breaks that is distinct from the text tag, so that they can be merged during rendering. It might be good to treat text tags consisting of just whitespace specially: two newlines \n\n are a blank line, but we might want to merge consecutive blank lines. Similarly, we might want to merge consecutive spaces to simplify generating the DOM.

Consider formatting a language like C++, where a function can have many modifiers on it that can show up in any order, such as inline, virtual, constexpr, and explicit. We might want to canonicalize the order of these modifiers. We don’t want to accidentally wind up printing inline constexpr Foo() because we printed an empty string for virtual. Having special merging for spaces means that all entities are always one space apart if necessary. This is a small convenience in the DOM that multiplies to significant simplification when lowering from AST to DOM.

Another useful tag is something like <indent by=" ">, which increases the indentation level by some string (or perhaps simply a number of spaces; the string just makes supporting tabs easier) for the tags inside of it. This allows control of indentation in a carefully-scoped manner.

Finally, we need some way to group tags that are candidates for “breaking”: if the width of all of the tags inside of a <group> is greater than the maximum width that group can have (determined by indentation and any elements on the same line as that group), we can set that group to “broken”, and… well, what should breaking do?

We want breaking to not just cause certain newlines (at strategic locations) to appear, but we also want it to cause an indentation increase, and in languages with trailing commas like Rust and Go, we want (or in the case of Go, need) to insert a trailing comma only when broken into multiple lines. We can achieve this by allowing any tag to be conditioned on whether the enclosing group is broken or not.
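
In Rust, the tags described above might be modeled as something like this (a sketch of my own; the names are not borrowed from any particular library):

enum Doc {
  /// Literal text, rendered as-is.
  Text(String),
  /// A line break, kept distinct from text so adjacent breaks can be merged.
  Newline,
  /// Increases the indentation level for its children.
  Indent { by: String, children: Vec<Doc> },
  /// A candidate for breaking: rendered flat if it fits, broken otherwise.
  Group(Vec<Doc>),
  /// Rendered only if the innermost enclosing group's broken-ness matches.
  If { broken: bool, then: Box<Doc> },
}
Rust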

Taken all together, we can render the AST for our {"foo": [1, 2]} document into this DOM, according to the tags we’ve described above.

<group>
  <text s="{" />
  <text s="\n" if=broken />
  <indent by="  ">
    <text s='"foo"' />
    <text s=":" />
    <text s=" " />
    <group>
      <text s="[" />
      <text s="\n" if=broken />
      <indent by="  ">
        <text s="1" />
        <text s="," />
        <text s=" " if=flat />
        <text s="\n" if=broken />
        <text s="2" />
      </indent>
      <text s="\n" if=broken />
      <text s="]"/>
    </group>
  </indent>
  <text s="\n" if=broken />
  <text s="}" />
</group>
XML

Notice a few things: All of the newlines are set to appear only if=broken. The space after the comma only appears if the enclosing group is not broken, that is, if=flat. The groups encompass everything that can move due to a break, which includes the outer braces. This is necessary because if that brace is not part of the group, and it is the only character past the line width limit, it will not cause the group to break.

Laying Out Your DOM

The first pass is easy: it measures how wide every node is. But we don’t know whether any groups will break, so how can we measure that without calculating breaks, which depend on indentation, and the width of their children, and…

This is one tricky thing about multi-pass graph algorithms (or graph algorithms in general): it can be easy to become overwhelmed trying to factor the dependencies at each node so that they are not cyclic. I struggled with this algorithm, until I realized that the only width we care about is the width if no groups are ever broken.

Consider the following logic: if a group needs to break, all of its parents must obviously break, because the group will now contain a newline, so its parents must break no matter what. Therefore, we only consider the width of a node when deciding if a group must break intrinsically, i.e., because all of its children decided not to break. This can happen for a document like the following, where each inner node is quite large, but not large enough to hit the limit.

[
  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
  [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
]
JSON

Because we prefer to break outer groups rather than inner groups, we can measure the “widest a single line could be” in one pass, bottom-up: each node’s width is the sum of the width of its children, or its literal contents for <text> elements. However, we must exclude all text nodes that are if=broken, because they obviously do not contribute to the single-line length. We can also ignore indentation because indentation never happens in a single line.
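
In terms of the hypothetical Doc type sketched earlier, that first pass is a bottom-up fold (a real implementation would cache the result per node rather than recompute it):

/// The width of a node if no enclosing group breaks. Text that only
/// appears when broken contributes nothing, and indentation is ignored,
/// since it never appears on a single line.
fn flat_width(doc: &Doc) -> usize {
  match doc {
    Doc::Text(s) => s.chars().count(), // use a real width library here
    // An unconditional newline forces the enclosing group to break anyway.
    Doc::Newline => 0,
    Doc::Indent { children, .. } | Doc::Group(children) => {
      children.iter().map(flat_width).sum()
    }
    Doc::If { broken: true, .. } => 0,
    Doc::If { broken: false, then } => flat_width(then),
  }
}
Rust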

However, this doesn’t give the full answer for whether a given group should break, because that depends on indentation and what nodes came before on the same line.

This means we need to perform a second pass: having laid everything out assuming no group is broken, we must lay things out as they would appear when we render them, taking into account breaking. But now that we know the maximum width of each group if left unbroken, we can make breaking decisions.

As we walk the DOM, we keep track of the current column and indentation value. For each group, we decide to break it if either:

  1. Its width, plus the current column value, exceeds the maximum column width.

  2. It contains any newlines, something that can be determined in the first pass.

The first case is why we can’t actually treat tabs as if they advance to a tabstop. We cannot know the column at which a node will be placed at the time that we measure its width, so we need to assume the worst case.

Whenever we hit a newline, we update the current width to the width induced by indentation, simulating a newline plus indent. We also need to evaluate the condition, if present, on each tag now, since by the time we inspect a non-group tag, we have already made a decision as to whether to break or not.

Render It!

Now that everything is determined, rendering is super easy: just walk the DOM and print out all the text nodes that either have no condition or whose condition matches the innermost group they’re inside of.

And, of course, this is where we need to be careful with indentation: you don’t want to have lines that end in whitespace, so you should make sure to not print out any spaces until text is written after a newline. This is also a good opportunity to merge adjacent only-newlines text blocks. The merge algorithm I like is to make sure that when n and m newline blocks are adjacent, print max(n, m) newlines. This ensures that a DOM node containing \n\n\n is respected, while deleting a bunch of \ns in a row that would result in many blank lines.
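
Putting the last two passes together, here is a minimal sketch over the hypothetical Doc type that makes the breaking decision for each group and renders it in the same walk; it leaves out the trailing-whitespace and newline-merging details described above:

fn render(
  doc: &Doc,
  out: &mut String,
  col: &mut usize,     // current column
  indent: &mut String, // current indentation prefix
  max: usize,          // maximum line width
  broken: bool,        // is the innermost enclosing group broken?
) {
  match doc {
    Doc::Text(s) => {
      out.push_str(s);
      *col += s.chars().count();
    }
    Doc::Newline => {
      out.push('\n');
      out.push_str(indent);
      *col = indent.chars().count();
    }
    Doc::Indent { by, children } => {
      indent.push_str(by);
      for child in children {
        render(child, out, col, indent, max, broken);
      }
      indent.truncate(indent.len() - by.len());
    }
    Doc::Group(children) => {
      // Break if the group contains a hard newline, or if it cannot fit
      // flat in the space left on this line.
      let must_break = children.iter().any(contains_newline)
        || *col + children.iter().map(flat_width).sum::<usize>() > max;
      for child in children {
        render(child, out, col, indent, max, must_break);
      }
    }
    Doc::If { broken: cond, then } => {
      if *cond == broken {
        render(then, out, col, indent, max, broken);
      }
    }
  }
}

/// Whether a node contains a newline that appears even when flat.
fn contains_newline(doc: &Doc) -> bool {
  match doc {
    Doc::Newline => true,
    Doc::Text(_) => false,
    Doc::Indent { children, .. } | Doc::Group(children) => {
      children.iter().any(contains_newline)
    }
    Doc::If { broken: true, .. } => false,
    Doc::If { broken: false, then } => contains_newline(then),
  }
}
Rust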

What’s awesome about this approach is that the layout algorithm is highly generic: you can re-use it for whatever compiler frontend you like, without needing to fuss with layout yourself. There is a very direct conversion from AST to DOM, and the result is very declarative.

More Complicated: YAML

YAML is a superset of JSON that SREs use to write sentient configuration files. It has a funny list syntax that we might want to use for multi-line lists, but we might want to keep JSON-style lists for short ones.

A document of nested lists might look something like this:

- [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
- [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
YAML

How might we represent this in the DOM? Starting from our original JSON document {"foo": [1, 2]}, we might go for something like this:

<group>
  <text s="{" if=flat />
  <indent by="  ">
    <text s='"foo"' />
    <text s=":" />
    <text s=" " />
    <group>
      <text s="[" if=flat />
      <text s="\n" if=broken />
      <text s="- " if=broken />
      <indent by="  ">
        <text s="1" />
      </indent>
      <text s="," if=flat />
      <text s=" " if=flat />
      <text s="\n" if=broken />
      <text s="- " if=broken />
      <indent by="  ">
        <text s="2" />
      </indent>
      <text s="\n" if=broken />
      <text s="]" if=flat />
    </group>
  </indent>
  <text s="\n" if=broken />
  <text s="}" if=flat />
</group>
XML

Here, we’ve made the [] and the comma only appear in flat mode, while in broken mode, we have a - prefix for each item. The inserted newlines have also changed somewhat, and the indentation blocks have moved: now only the value is indented, since YAML allows the -s of list items to be at the same indentation level as the parent value for lists nested in objects. (This is a case where some layout logic is language-specific, but the code emitting the DOM only worries about declarative markup rather than physical measurements.)

There are other enhancements you might want to make to the DOM I don’t describe here. For example, comments want to be word-wrapped, but you might not know what the width is until layout happens. Having a separate tag for word-wrapped blocks would help here.

Similarly, a mechanism for “partial breaks”, such as for the document below, could be implemented by having a type of line break tag that breaks if the text that follows overflows the column, which can be easily implemented by tracking the position of the last such break tag.

{
  "foo": ["very", "long", "list",
          "of", "strings"]
}
JSON

Using This Yourself

I think that a really good formatter is essential for any programming language, and I think that a high-quality library that does most of the heavy-lifting is important to make it easier to demand good formatters.

So I wrote a Rust library. I haven’t released it on crates.io because I don’t think it’s quite at the state I want, but it turns out that the layout algorithm is very simple, so porting this to other languages should be EZ.

Now you have no excuse. :D

  1. Everyone pronounces this acronym “ay ess tee”, but I have a friend who really likes to say ast, rhyming with mast, so I’m making a callout post on my twitter dot com. 

  2. In computing, a group of lines not separated by blank lines is called a stanza, in analogy to the stanzas of a poem, which are typeset with no blank lines between the lines of the stanza. 

  3. You could also just store a string, containing the original text, but storing offsets is necessary for diagnostics, which is the jargon term for a compiler error. Compiler errors are recorded using an AST node as context, and to report the line at which the error occurred, we need to be able to map the node back to its offset in the file.

    Once we have the offset, we can calculate the line in O(log n) time using binary search. Having pre-computed an array of the line-start offsets of the input file (the position just past each \n byte), binary search tells us which line the offset falls on: that index is the zero-indexed line number, and the string from that line’s start to the offset can be used to calculate the column.

    use unicode_width::UnicodeWidthStr;
    
    /// Returns the index of each newline. Can be pre-computed and re-used
    /// multiple times.
    fn newlines(file: &str) -> Vec<usize> {
      file.bytes()
          .enumerate()
          .filter_map(|(i, b)| (b == b'\n').then_some(i + 1))
          .collect()
    }
    
    /// Returns the line and column of the given offset, given the line
    /// starts of the file.
    fn location(
      file: &str,
      newlines: &[usize],
      offset: usize,
    ) -> (usize, usize) {
      // The number of line starts at or before `offset` gives the
      // zero-indexed line number. binary_search tells us where `offset`
      // sits among them:
      //
      // Ok(n) means `offset` is exactly the start of line n + 1
      // (zero-indexed); Err(n) means it falls somewhere within line n.
      let line = match newlines.binary_search(&offset) {
        Ok(n) => n + 1,
        Err(n) => n,
      };
      let line_start = if line == 0 { 0 } else { newlines[line - 1] };
      (line + 1, file[line_start..offset].width())
    }
    Rust

  4. The Rust people keep trying to convince me that it should be 100. They are wrong. 80 is perfect. They only think they need 100 because they use the incorrect tab width of four spaces, rather than two. This is the default for clang-format and it’s perfect.