Rust generics have an emergent ABI problem. Consider this function from the standard library:
impl<K> HashSet<K> {
    pub fn contains<Q>(&self, k: &Q) -> bool
    where
        K: Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    { /* ... */ }
}
HashSet::contains takes a reference to a value that can be used to look up an
element of the set; often, Q will just be K. Then, consider the following
code:
fn pred(x: i32) {
    let set = HashSet::from([1, 2, 3]);
    println!("{}", set.contains(&x));
}
If contains() is not inlined, we will need to spill x to the stack so that
we can pass a pointer to it into contains(). Collection functions are
frequently inlined, but this isn’t always the case, particularly when
optimizing for code size.
However, the reference isn’t actually needed because HashSet::contains doesn’t
retain the address of x, so the compiler could generate code for contains()
to take its argument by register rather than by pointer. But, when can the
compiler make this optimization?
In this article I want to describe a class of compiler optimizations I call “ABI optimization” or “optimization-dependent ABI”. I will sketch the general concept, and then work through an example: promoting pointer arguments to register arguments in Rust.
The C Compilation Model
C is unusual among popular languages today in that it has separate
interface and implementation files. .h files contain types and function
interfaces, while .c files contain the bodies of those functions. .h files
can be included by other .h files and by .c files, but .c files cannot be
included by anyone.
What this means is that each compilation action consists of compiling a single
.c file, which consumes declarations for all of its dependencies, and
implementations of all the functions it defines. Each .c file defines a single
“translation unit”.
LLVM is built around this compilation model: each .ll file (an LLVM module,
equivalent to a C translation unit) consists of declarations copied from
dependencies, and definitions for functions defined in the translation unit. The
optimizer operates on a single module, and thus can’t see the implementations of
dependencies, even if they’ve already been analyzed by another LLVM invocation.
This is a barrier to many optimizations, but a particularly tricky one is that
LLVM can’t change function signatures. Suppose that a.ll is a dependency of
b.ll:
; a.ll
define i32 @foo(ptr %0) {
start:
  %1 = load i32, ptr %0
  ret i32 %1
}

; b.ll
declare i32 @foo(ptr)

define i32 @bar(i32 %0) {
start:
  %1 = alloca i32, align 4
  store i32 %0, ptr %1
  %2 = call i32 @foo(ptr %1)
  %3 = add i32 %2, 42
  ret i32 %3
}
These files are optimized in parallel. We would like to optimize foo() into
define i32 @foo(i32 %0) {
start:
  ret i32 %0
}
so that the caller of foo() is required to perform the dereference. However,
because the files are optimized independently, the LLVM invocation compiling
b.ll won’t be aware that this optimization occurred, so we get an argument
type mismatch and, thus, undefined behavior.
However, this is not how Rust compiles code. Rust doesn’t have headers, so the
compiler generates binary “header” files (.rmeta files) which downstream
compilation jobs can use to import functions across crates. This means that a
Rust build has to build crates in topological order (at least, it needs to run
enough of the compiler to generate the “headers”). This is a channel for past
invocations of the compiler to communicate with future invocations within the
same build.
Summarized:
- C allows all compilation jobs to occur in parallel, but any information shared between compilation units needs to be written by humans.
- Rust requires compilation of a crate to wait for all dependency crates to emit compiler-generated header files, but the compiler can advertise decisions it made for dependency crates to downstream users.
This idea is not unique to Rust. Go also requires all packages a package
depends on to be built before it attempts to build the current package
(its generated header files are called “export data”, or .x files).
Because the compiler’s decisions can be recorded in .rmeta files, this
suggests a way that we could share optimization information across crates.
Optimization Remarks
Let’s say that we’re building an optimization or analysis X that only works if findings from dependencies are known. Then, we can implement this within the Rust compilation model like so:
- An `.rmeta` section is added that specifies the results of the analysis X; absence of this section means the analysis did not occur.
- When compiling a function in a crate, we can refer to the findings for all functions that it calls directly, because all those functions must have been analyzed in a dependency crate, and are thus in that crate’s `.rmeta`. We perform the analysis using this as input.
- The result of this analysis is included in the `.rmeta` output for that function.
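As a sketch of this scheme (the type and function names here are hypothetical, not real rustc internals), the remark section could be modeled as a map from symbol names to per-function findings; a crate's analysis consumes the merged remarks of its dependencies and emits its own:

```rust
use std::collections::HashMap;

// Hypothetical remark for one function, for an escape analysis X:
// one flag per pointer argument.
#[derive(Clone, Debug, PartialEq)]
struct EscapeRemark {
    arg_escapes: Vec<bool>, // true if the address of argument i may escape
}

// Remarks loaded from every dependency crate's .rmeta, keyed by symbol.
// A missing key means the function was never analyzed and must be assumed
// to escape all of its arguments.
type Remarks = HashMap<String, EscapeRemark>;

// When compiling the current crate, consult dependency remarks and record
// our own results for downstream crates to read.
fn analyze_current_crate(deps: &Remarks) -> Remarks {
    let mut out = deps.clone();
    // Suppose the current crate defines `bar`, whose only use of its
    // argument is to pass it to `foo`; then `bar` escapes the argument
    // iff `foo` does (or `foo` was never analyzed).
    let bar_escapes = deps
        .get("foo")
        .map_or(true, |r| r.arg_escapes.iter().any(|&e| e));
    out.insert("bar".into(), EscapeRemark { arg_escapes: vec![bar_escapes] });
    out
}
```

The important property is the topological one: by the time `bar`'s crate is compiled, `foo`'s remark is guaranteed to be present (or conservatively absent).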
Note that there are some cases where we can call functions that aren’t defined
in our dependencies, via function pointers or dyn objects. It is also possible
to call functions via extern "Rust" {}:
mod hidden {
    #[no_mangle]
    fn foo() {}
}

fn bar() {
    extern "Rust" {
        fn foo();
    }
    unsafe { foo() }
}
However, such functions must have a stable symbol name, established via
#[no_mangle] or a similar attribute. The vast majority of Rust functions
cannot be called through an extern block, because their mangled symbol names
are not stable.
So, what does scalar argument promotion look like in this framework?
Escape Analysis with Remarks
Our goal is to convert functions that take x: &i32 into functions that take
x: i32 instead. This optimization is correct when:
1. The type `&T` is such that `T` does not contain an `UnsafeCell`; i.e., this guarantees that `&T` does not need to be reloaded, because that memory location can’t be mutated: for these types, `&T` is “transparent”.
2. The address of `&T` does not escape, which is true according to a dataflow analysis like the following:
   a. The lifetime of `&'a T` is contravariant in the function’s corresponding fn pointer type. For example, `for<'a> fn(&'a T)` is contravariant in `'a`, but neither `for<'a> fn(&'a T, &mut &'a T)` nor `for<'a> fn(&'a T) -> &'a T` is.
   b. The reference is never converted into a raw pointer. We can’t keep track of whether the program depends on the address of a raw pointer, so we shouldn’t change it.
   c. The reference is not passed into any function pointer, trait object, or `extern`-declared function.
   d. Any other function that the reference is passed into must not escape it, per rules (a)-(d).
The rule (2d) is where remarks come in: by induction, we’ve already run this analysis on all dependencies, so we can check it for any statically-dispatched calls. Requirement (2a) is quite strict: it can be relaxed by doing more extensive dataflow analysis and is only here to simplify the statement of the requirements. A more complete discussion of this type of escape analysis can be found in the Go compiler’s implementation.
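To make the rules concrete, here are a few hypothetical functions and how the analysis above would classify them (a sketch; these are not the compiler's own test cases):

```rust
// Promotable: the analysis sees only a dereference. x's address does not
// escape, so x could be passed in a register as a plain i32.
fn reads(x: &i32) -> i32 {
    *x + 1
}

// Violates rule (2a): the signature is for<'a> fn(&'a i32) -> &'a i32,
// so 'a is not contravariant, and the caller can observe the reference
// (and its address) after the call returns.
fn returns(x: &i32) -> &i32 {
    x
}

// Violates rule (2b): the reference is converted to a raw pointer, so the
// program may now depend on x's address.
fn observes_address(x: &i32) -> usize {
    x as *const i32 as usize
}
```

Only `reads` would be eligible for promotion; the other two must keep their pointer ABI.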
This analysis needs to occur on MIR, so that it has access to Rust-specific
semantics and so that its results can be emitted into .rmeta for use in rule
(2d). Of course, in an unoptimized build this step can be skipped for speed,
and functions that are missing escape information can be assumed to escape all
their arguments.
This information alone can already improve LLVM’s output quality: pointer
arguments that do not escape can be annotated as nocapture in a declare when
generating LLVM IR.
Scalar Argument Promotion
Now that we have escape information for all functions we’re generating LLVM IR for, we add an additional annotation which indicates that the ABI of an argument was promoted from “by pointer” to “by register”.
Concretely, suppose we have fn foo(x: &i32) which we know does not escape
x’s address. Then:
- Instead of emitting `define void @foo(ptr)`, we emit `define void @foo(i32)`.
- We add a prologue to the IR of `foo()` that creates an `alloca i32`, which we write the `i32` argument to; this pointer can be used in place of the original reference argument. LLVM will obliterate this alloca for us during optimization, so generating ugly code is fine in this case.
- At every static call to `foo()`, we place a load immediately before the call, to load the value we would have passed into `foo()` by pointer into a register. Again, LLVM will clean this up, particularly if we’re loading from an alloca created in the current function.
- We generate a thunk `define void @foo.thunk(ptr)` that simply loads the pointer argument and tail-calls into `foo()`. This thunk is used for creating ABI-compatible function pointers; this allows `let f: fn(&i32) = foo;` to continue to work.
For example, suppose we have the following Rust code:
// a.rs
fn foo(x: &i32) -> i32 {
    *x + 42
}

// b.rs
fn bar(x: i32) -> i32 {
    foo(&(x * 2))
}
This would have been codegened into something like this:
; a.ll
define i32 @foo(ptr %x) {
start:
  %1 = load i32, ptr %x
  %2 = add i32 %1, 42
  ret i32 %2
}

; b.ll
declare i32 @foo(ptr)

define i32 @bar(i32 %x) {
start:
  %1 = mul i32 %x, 2
  %2 = alloca i32, align 4
  store i32 %1, ptr %2
  %3 = call i32 @foo(ptr %2)
  ret i32 %3
}
But if we apply this optimization, we get something like this instead:
; a.ll
define i32 @foo(i32 %x) {
start:
  ; Store %x into a stack slot, so that none of the existing code
  ; needs to be changed. This is equivalent to telling LLVM "this
  ; argument doesn't have an address; optimize as appropriate".
  %1 = alloca i32, align 4
  store i32 %x, ptr %1
  %2 = load i32, ptr %1
  %3 = add i32 %2, 42
  ret i32 %3
}

; This function is for creating function pointers to `foo`.
define i32 @foo.thunk(ptr %x) {
start:
  %1 = load i32, ptr %x
  %2 = tail call i32 @foo(i32 %1)
  ret i32 %2
}
; b.ll
declare i32 @foo(i32) ; i32, not ptr! We know we can do this
                      ; because of optimization remarks.

define i32 @bar(i32 %x) {
start:
  %1 = mul i32 %x, 2
  %2 = alloca i32, align 4
  store i32 %1, ptr %2
  ; Load the value back, because we need to pass it by register.
  ; LLVM will see that we're loading something we just stored
  ; to an `alloca`, which means it can replace %3 below with %1.
  %3 = load i32, ptr %2
  %4 = call i32 @foo(i32 %3)
  ret i32 %4
}
Although it looks like we’ve produced worse codegen, this is actually very
beneficial to LLVM, because within a function it can easily see that the memory
operations we’re doing to spill and reload %x from the stack are redundant.
The upshot is that foo() takes i32 instead of ptr at the LLVM level, an
optimization LLVM could not have performed. After optimization, this will result
in reduced memory traffic to the stack!
Conclusion
This kind of optimization is particularly important for old but entrenched APIs, where it might be possible to provide a more performant ABI without changing the function signature. Other ABI optimizations to explore could include:
- Struct exploding: Rust passes structs by pointer if they are larger than two registers, which might not be ideal in some cases.
- Argument reordering: passing “hotter” arguments in earlier slots means they can be passed by register instead of on the stack.
- Profiling-inspired ABIs: the compiler could use profiling information to modify an ABI to minimize memory traffic or improve register pressure.
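The first of these, struct exploding, can be sketched at the source level like so (a manual transformation the compiler would apply behind the scenes; the names here are hypothetical, and the "by pointer" claim assumes a target like x86-64 SysV, where aggregates larger than two registers are passed in memory):

```rust
// A struct bigger than two registers: under the current Rust ABI on such
// targets, a Vec3 argument is passed indirectly, by pointer.
#[derive(Clone, Copy)]
struct Vec3 {
    x: f64,
    y: f64,
    z: f64,
}

fn length_squared(v: Vec3) -> f64 {
    v.x * v.x + v.y * v.y + v.z * v.z
}

// The "exploded" ABI the compiler could choose instead: each field travels
// in its own register, so the caller never materializes Vec3 in memory.
fn length_squared_exploded(x: f64, y: f64, z: f64) -> f64 {
    x * x + y * y + z * z
}
```

As with scalar promotion, a thunk with the original by-pointer ABI would be needed wherever a function pointer to `length_squared` is created.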
Optimization-dependent ABI is not a concept I’ve seen languages that target LLVM take advantage of, and it is a potential avenue for getting another edge on C and C++ for “ordinary” scalar code.