Optimization-Dependent Rust ABI

Rust generics have an emergent ABI problem. Consider this function from the standard library:

impl<K> HashSet<K> {
  pub fn contains<Q>(&self, k: &Q) -> bool
  where
      K: Borrow<Q>,
      Q: Hash + Eq + ?Sized,
}

HashSet::contains takes a reference to any value that can be used to look up an element of the set; often, Q will just be K. Then, consider the following code:

use std::collections::HashSet;

fn pred(x: i32) {
  let set = HashSet::from([1, 2, 3]);
  println!("{}", set.contains(&x));
}

If contains() is not inlined, we will need to spill x to the stack so that we can pass a pointer to it into contains(). Collection functions are frequently inlined, but this isn’t always the case, particularly when optimizing for code size.

However, the reference isn’t actually needed: HashSet::contains doesn’t retain the address of x, so the compiler could generate code for contains() that takes its argument in a register rather than behind a pointer. But when can the compiler make this optimization?
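
To build intuition, here is roughly what that transformation would look like if written out at the source level (a sketch only; the hypothetical free functions stand in for compiler-generated code):

use std::collections::HashSet;

// Today: x is passed as a pointer, so a non-inlined call forces the
// caller to spill x to the stack just to take its address.
fn contains_by_ref(set: &HashSet<i32>, x: &i32) -> bool {
  set.contains(x)
}

// After promotion: x arrives in a register; the callee can put it in
// a stack slot locally if it ever needs an address.
fn contains_by_value(set: &HashSet<i32>, x: i32) -> bool {
  set.contains(&x)
}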

In this article, I want to describe a class of compiler optimizations I call “ABI optimization”, or “optimization-dependent ABI”. I will sketch the general concept, and then work through an example: promoting pointer arguments to register arguments in Rust.

The C Compilation Model

C code is unusual among popular languages today in that it has separate interface and implementation files: .h files contain types and function interfaces, while .c files contain the bodies of those functions. .h files can be included by other .h files and by .c files, but .c files cannot be included by anything.

What this means is that each compilation action consists of compiling a single .c file, which consumes declarations for all of its dependencies and contains implementations of all the functions it defines. Each .c file defines a single “translation unit”.

LLVM is built around this compilation model: each .ll file (an LLVM module, equivalent to a C translation unit) consists of declarations copied from dependencies, and definitions for functions defined in the translation unit. The optimizer operates on a single module, and thus can’t see the implementations of dependencies, even if they’ve already been analyzed by another LLVM invocation.

This is a barrier to many optimizations, but a particularly tricky one is that LLVM can’t change function signatures. Suppose that a.ll is a dependency of b.ll:

; a.ll
define i32 @foo(ptr %0) {
start:
  %1 = load i32, ptr %0
  ret i32 %1
}

; b.ll
declare i32 @foo(ptr)

define i32 @bar(i32 %0) {
start:
  %1 = alloca i32, align 4
  store i32 %0, ptr %1
  %2 = call i32 @foo(ptr %1)
  %3 = add i32 %2, 42
  ret i32 %3
}

These files are optimized separately, possibly in parallel. We would like to optimize foo() into

define i32 @foo(i32 %0) {
start:
  ret i32 %0
}

so that the caller of foo() is required to perform the dereference. However, because the two files are optimized in isolation, the LLVM invocation compiling b.ll won’t be aware that this optimization occurred, so we would get an argument type mismatch and, thus, undefined behavior.

However, this is not how Rust compiles code. Rust doesn’t have headers; instead, the compiler generates binary “header” files (.rmeta files), which downstream compilation jobs can use to import functions from other crates. This means that a Rust build has to compile crates in topological order (at least, it needs to run enough of the compiler to generate the “headers”). This is a channel through which past invocations of the compiler can communicate with future invocations within the same build.

This idea is not unique to Rust. Go also requires that all packages a package depends on be compiled before it attempts to build the current package (its generated header files are called “export data”, or .x files).

Because the compiler’s decisions can be recorded in .rmeta files, this suggests a way that we could share optimization information across crates.

Optimization Remarks

Let’s say that we’re building an optimization or analysis X that only works if findings from dependencies are known. Then, we can implement this within the Rust compilation model like so:

  1. An .rmeta section is added that specifies the results of the analysis X; absence of this section means the analysis did not occur.
  2. When compiling a function in a crate, we can refer to the findings from all functions that it calls directly, because all those functions must have been analyzed in a dependency crate, and are thus in that crate’s .rmeta. We perform the analysis using this as input.
  3. The result of this analysis is included in the .rmeta output for that function.
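
To make this concrete, here is a minimal sketch of how such per-function findings might be modeled; every type and name below is hypothetical, and nothing like it exists in rustc’s actual .rmeta format:

use std::collections::HashMap;

// Hypothetical stand-in for the compiler's stable identifier for an item.
type DefId = u64;

// Per-argument finding recorded by the analysis.
enum ArgEscape {
  NoEscape, // the address of this argument never escapes the callee
  Escapes,  // the address may escape
}

// The hypothetical .rmeta section: one entry per analyzed function.
// A function absent from the map was never analyzed.
struct EscapeRemarks {
  by_function: HashMap<DefId, Vec<ArgEscape>>,
}

impl EscapeRemarks {
  // Conservative query used when compiling a caller: any argument we
  // have no remark for is assumed to escape.
  fn arg_escapes(&self, callee: DefId, arg: usize) -> bool {
    match self.by_function.get(&callee) {
      None => true, // no remark: assume the worst
      Some(args) => !matches!(args.get(arg), Some(ArgEscape::NoEscape)),
    }
  }
}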

Note that there are some cases where we can call functions that aren’t defined in our dependencies, via function pointers or dyn objects. It is also possible to call functions via extern "Rust" {}:

mod hidden {
  #[no_mangle]
  fn foo() {}
}

fn bar() {
  extern "Rust" {
    fn foo();
  }

  unsafe { foo() }
}

However, these functions must have a stable symbol name, given via #[no_mangle] or a similar attribute. The vast majority of Rust functions have unstable, mangled symbol names, and cannot be called through an extern block like this.

So, what does scalar argument promotion look like in this framework?

Escape Analysis with Remarks

Our goal is to convert functions that take x: &i32 into functions that take x: i32 instead. This optimization is correct when:

  1. The type T does not contain an UnsafeCell. This guarantees that &T never needs to be reloaded, because the memory it points to can’t be mutated while the reference is live: for these types, &T is “transparent”.
  2. The address of &T does not escape, which is true according to a dataflow analysis like the following:
     a. The lifetime of &'a T is contravariant in the function’s corresponding fn pointer type. For example, 'a is contravariant in for<'a> fn(&'a T), but not in for<'a> fn(&'a T, &mut &'a T) or in for<'a> fn(&'a T) -> &'a T.
     b. The reference is never converted into a raw pointer. We can’t keep track of whether the program depends on the address of a raw pointer, so we shouldn’t change it.
     c. The reference is not passed into any function pointer, trait object, or extern-declared function.
     d. Any other function that the reference is passed into must not escape it, per rules (a)-(d).

Rule (2d) is where remarks come in: by induction, we’ve already run this analysis on all of our dependencies, so we can check it for any statically-dispatched call. Requirement (2a) is quite strict: it is only here to simplify the statement of the requirements, and can be relaxed by doing more extensive dataflow analysis. A more complete discussion of this type of escape analysis can be found in the Go compiler’s implementation.
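
To illustrate, here is how these rules would classify a few simple functions (examples of my own construction, not drawn from any compiler):

// Qualifies for promotion: x's address never leaves the function.
fn deref(x: &i32) -> i32 {
  *x + 1
}

// Violates (2a): 'a also appears in the return type, so
// for<'a> fn(&'a i32) -> &'a i32 is not contravariant in 'a.
fn id(x: &i32) -> &i32 {
  x
}

// Violates (2b): the reference becomes a raw pointer, so the program
// may now observe x's address.
fn addr(x: &i32) -> usize {
  x as *const i32 as usize
}

// Violates (2c): x is passed to a function pointer we cannot analyze.
fn call(f: fn(&i32), x: &i32) {
  f(x)
}

// Decided by (2d): this escapes x only if the remark recorded for
// deref() says deref() escapes its argument. It doesn't, so this
// function qualifies too.
fn forward(x: &i32) -> i32 {
  deref(x)
}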

This analysis needs to run on MIR, both so that its results can be emitted into .rmeta for use in rule (2d), and so that it has access to Rust-specific semantics. Of course, in an unoptimized build this step can be skipped for speed, and functions that are missing escape information can be assumed to escape all of their arguments.

Also, this information alone can be used to improve LLVM’s output quality: pointer arguments that do not escape can be annotated as nocapture in a declare when generating LLVM IR.

Scalar Argument Promotion

Now that we have escape information for all functions we’re generating LLVM IR for, we add an additional annotation which indicates that the ABI of an argument was promoted from “by pointer” to “by register”.

Concretely, suppose we have fn foo(x: &i32) which we know does not escape x’s address. Then:

  1. Instead of emitting define void @foo(ptr), we emit define void @foo(i32).
  2. We add a prologue to the IR of foo() that creates an alloca i32 and writes the i32 argument to it; this pointer can then be used in place of the original reference argument. LLVM will obliterate this alloca for us during optimization, so generating ugly code is fine in this case.
  3. At every static call to foo(), we place a load immediately before the call, loading the value that would have been passed by pointer into a register. Again, LLVM will clean this up, particularly if we’re loading from an alloca created in the current function.
  4. We generate a thunk define void @foo.thunk(ptr) that simply loads the pointer argument and tail calls into foo(). This thunk is used for creating ABI-compatible function pointers; this allows let f: fn(&i32) = foo; to continue to work.

For example, suppose we have the following Rust code:

// a.rs
fn foo(x: &i32) -> i32 {
  *x + 42
}

// b.rs
fn bar(x: i32) -> i32 {
  foo(&(x * 2))
}

Without the optimization, this would be codegened into something like this:

; a.ll
define i32 @foo(ptr %x) {
start:
  %1 = load i32, ptr %x
  %2 = add i32 %1, 42
  ret i32 %2
}

; b.ll
declare i32 @foo(ptr)

define i32 @bar(i32 %x) {
start:
  %1 = mul i32 %x, 2
  %2 = alloca i32, align 4
  store i32 %1, ptr %2
  %3 = call i32 @foo(ptr %2)
  ret i32 %3
}

But if we applied the optimization, we’d get something like this instead:

; a.ll
define i32 @foo(i32 %x) {
start:
  ; Store %x into a stack slot, so that none of the existing code
  ; needs to be changed. This is equivalent to telling LLVM "this
  ; argument doesn't have an address, optimize as appropriate".
  %1 = alloca i32, align 4
  store i32 %x, ptr %1

  %2 = load i32, ptr %1
  %3 = add i32 %2, 42
  ret i32 %3
}

; This function is for creating function pointers to `foo`.
define i32 @foo.thunk(ptr %x) {
start:
  %1 = load i32, ptr %x
  %2 = tail call i32 @foo(i32 %1)
  ret i32 %2
}

; b.ll
declare i32 @foo(i32)  ; i32, not ptr! We know we can do this
                       ; because of optimization remarks.

define i32 @bar(i32 %x) {
start:
  %1 = mul i32 %x, 2
  %2 = alloca i32, align 4
  store i32 %1, ptr %2

  ; Load the value back, because we need to pass it by register.
  ; LLVM will see we're loading something that we just stored in
  ; an `alloca`, which means it can replace %3 below with %1.
  %3 = load i32, ptr %2

  %4 = call i32 @foo(i32 %3)
  ret i32 %4
}

Although it looks like we’ve produced worse codegen, this is actually very beneficial to LLVM, because within a single function it can easily see that the memory operations we’re doing to spill and unspill %x from the stack are redundant.

The upshot is that foo() takes an i32 instead of a ptr at the LLVM level, an optimization LLVM could not have performed by itself. After optimization, this will result in reduced memory traffic to the stack!

Conclusion

This kind of optimization is particularly important for old but entrenched APIs, where it might be possible to provide a more performant ABI without changing the function signature. There are other ABI optimizations in the same vein that would be worth exploring.

Optimization-dependent ABI is not a concept I’ve seen languages that target LLVM take advantage of, and it is a potential avenue for getting another edge on C and C++ for “ordinary” scalar code.
