Thoughts on std::simd

I recently wrote an article about parsing base64 with SIMD. In the process, I used it as an opportunity to evaluate Rust’s std::simd portable SIMD library.

I went into this with fairly low expectations, taking it as a given that the codegen would not be as good as with direct use of intrinsics; LLVM is historically “not good” at optimizing intrinsics code, and std::simd code would not be using intrinsics at all.

That said, after using std::simd I think this is a really good approach. It produces great code (after tripping over some random performance footguns). Most of its issues stem from LLVM having mediocre support for some of the things Rust wants to do, or Rust emitting LLVM IR that results in selection failures, which are both fairly fixable. There are also some rough edges in the API that I think should be fixed to make the code more readable.

After spending a lot of time complaining to Jubilee, the lead for portable SIMD, we’ve agreed I should write up my findings and feedback, which is what this post is going to be.

This is not my usual post, since it’s aimed at people with a lot of background, specifically other compiler people. I will be referring to general compiler and LLVM-specific concepts (like LLVM IR), API design, and HPC topics without much introduction. I am writing this based on my own extensive experience writing vectorized code, compiler optimizations, and subtle algorithms that require extensive comments; that experience is what informs the discussion here.

With that out of the way, let’s dive into the topics in no particular order.

Do We Need a Simd Type?

One question worth looking at is whether having Simd<T, N> be distinct from [T; N] is worthwhile at all. The main reason is layout: on some weird platforms, like PPC, things like [4 x i32] and <4 x i32> have the same size but different alignment. There may also be endianness problems that I am not aware of; I haven’t bothered to survey this but I wouldn’t be surprised if on some BE architectures, you have

%x = extractelement <4 x i32> %v, i32 0

; Equivalent to...
%a = bitcast <4 x i32> %v to [4 x i32]
%x = extractvalue [4 x i32] %a, 3

I.e. the lanes are also in big-endian order. However, currently, Rust does not respect this:

let mut vec = Simd::from_array([1, 2, 3, 4]);
vec.as_mut_array()[0] = 42;

// Does this print [42, 2, 3, 4] or [1, 2, 3, 42]? Depends
// on whether the lane layout matches [i32; 4].
eprintln!("{vec:?}");

So we only really care about alignment (or, as_mut_array() has to be removed). It may be worth interrogating whether alignment matters: the big three modern CPU architectures, x86, ARM, and RISC-V, do not have “overaligned” vectors, so it’s basically a question of whether Rust emits only partially-aligned loads for vectors on “small market share” architectures. I strongly suspect this does not matter for most code, so overaligning via an AlignAsIfSimd<[i32; 4]> type might be good enough?
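
Here is a minimal sketch of what such a wrapper could look like, assuming a fixed 16-byte target; a real version would need the alignment to track the wrapped array’s size:

// Bump the alignment of a plain array up to what a vector register would want.
#[repr(align(16))]
#[derive(Copy, Clone, Debug)]
pub struct AlignAsIfSimd<A>(pub A);

// [i32; 4] is 16 bytes with 4-byte alignment; the wrapper raises that to 16.
const _: () = assert!(std::mem::align_of::<AlignAsIfSimd<[i32; 4]>>() == 16);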

The reason to ask this question is because arrays are Really Nice, syntactically. It would be very convenient to get to re-use arrays’ [x, y, z] and [x; N] syntax for free, and for existing code that uses [T; N] to be able to use the same vocabulary as SIMD code with minimal cognitive overhead.

However, some other design proposals here become problematic, particularly From impls: for example, it is desirable for a scalar to Into-convert into a vector of any size via a broadcast, but i32: Into<[i32; 4]> feels problematic outside of the context of SIMD.

I generally lean against having arrays be “the” SIMD type for this reason more than any other: it is useful to change the API available for an array-like type depending on whether it is a “list of things” or a “parallel vector of integers”.

What’s the Right Length?

The Simd<T, N> type has an “array-like length”, i.e., it’s in units of array elements. However, most SIMD intrinsics have a “register-like length”, where they specify the number of bits in the vector and don’t specify the type at all. For example, __m256 represents an abstract YMM register in x86 intrinsics code.

Both interpretations of the length are useful. For example, it is useful to have a Simd<u64, 4> and then cast it down to Simd<u32, 4> to get a vector of all the lower halves of the lanes in the u64 vector, but it is also useful to be able to cast it to a Simd<u32, 8>, which interprets each u64 lane as a pair of u32 lanes.

This kind of “size-preserving transmute” is very common in SIMD, particularly because some architectures’ intrinsics (Intel in particular) make it easy, since the lane width is specified on the intrinsic.
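
For concreteness, here are both operations written against today’s nightly std::simd; Simd<u64, 4> and Simd<u32, 8> occupy the same 256 bits, so the transmute is legal:

#![feature(portable_simd)]
use std::simd::prelude::*;

fn two_views(v: Simd<u64, 4>) -> (Simd<u32, 4>, Simd<u32, 8>) {
  // Lanewise cast: four lanes, each the truncated low half of a u64 lane.
  let low_halves: Simd<u32, 4> = v.cast();
  // Size-preserving transmute: the same 256 bits viewed as eight u32 lanes.
  let as_u32x8: Simd<u32, 8> = unsafe { std::mem::transmute(v) };
  (low_halves, as_u32x8)
}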

I think that in an ideal world, it should be possible to specify both length styles. Here’s more-or-less how I’d approach it, as an API.

// mod std::simd

#[repr(simd)]
pub struct Simd<T: SimdElement, N: Length>(N::Array<T>);

impl<T: SimdElement, N: Length> Simd<T, N> {
  pub fn from_lanes(lanes: Simd<T, N::AsLanes<T>>) -> Self;
  pub fn from_bits(lanes: Simd<T, N::AsBits<T>>) -> Self;

  pub fn into_lanes(self) -> Simd<T, N::AsLanes<T>>;
  pub fn into_bits(self) -> Simd<T, N::AsBits<T>>;
}

pub trait Length: Sealed {
  type Array<T: SimdElement>: Copy;
  type AsLanes<T: SimdElement>: Length;
  type AsBits<T: SimdElement>: Length;
}

pub struct Lanes<const N: usize>;
pub struct Bits<const N: usize>;

The units are now part of the type. I think that making the type independent of the units is probably possible, but I don’t think it’s desirable. Suppose we pick the unit in the realized type to be the lane count. Then suppose we want a way to transmute a Simd<u32, Lanes<8>> into a Simd<u64, Lanes<4>>. How do we express this API generically? If we pick bit count as “fundamental”, we have the same trouble trying to specify Simd::cast().

Having two distinct length types may lead to problems around unification, but I think this requires further study. In particular, it is important to note that we need both the into_* and from_* directions, so we can make programs like the following type check:

fn frob<T, N>(x: Simd<T, N>) -> Simd<T, N::AsBits<T>>
where T: SimdElement, N: Length,
{
  x.into_bits()
}

fn do_it<T, const N: usize>(x: Simd<T, Lanes<N>>) -> Simd<T, Lanes<N>>
where T: SimdElement, Lanes<N>: Length,
{
  Simd::from_bits(frob(x))
}

If we instead tried to do frob(x).into_lanes(), we would get something like Simd<T, Lanes<N>::AsBits<T>::AsLanes<T>>, which the trait solver has no way to cancel out to just Simd<T, Lanes<N>>. Simd::from_bits allows us to “undo” this.

Mind, I think that most people are not going to want to write out a type like Simd<T, Lanes<N>> all the time, so I think some convenience aliases are in order. I think one option is something like this.

// The core data type.
#[repr(simd)]
pub struct Simd<T: SimdElement, N: Length>(N::Array<T>);

// Array-like SIMD types.
pub type Array<T, const N: usize> = Simd<T, Lanes<N>>;

// "Very large integer" SIMD types, e.g. v128<i32>
pub type v64<T> = Simd<T, Bits<64>>;
pub type v128<T> = Simd<T, Bits<128>>;
pub type v256<T> = Simd<T, Bits<256>>;
pub type v512<T> = Simd<T, Bits<512>>;

Another option is a Simd! macro of some kind to make it easier to name different SIMD types. There are definitely bikeshedding opportunities.
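
One possible shape for such a macro, purely as an illustration (both the invocation syntax and the expansion are up for bikeshedding):

// A type-position macro that names a vector by lane count or by bit width.
macro_rules! Simd {
  ($t:ty; $n:literal lanes) => { Simd<$t, Lanes<$n>> };
  ($t:ty; $n:literal bits)  => { Simd<$t, Bits<$n>> };
}

// Usage, in type position:
// fn load(p: *const u8) -> Simd![u8; 256 bits] { /* ... */ }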

Also, we need an answer for cases where the length in Bits<N> is not cleanly divided by the element size: e.g., Simd<u8, Bits<15>>, Simd<u64, Bits<32>>. There are three choices for the semantics here:

  1. Monomorphization error.
  2. Round the number of lanes down.
  3. Round the number of lanes up.

This is related to, but not exactly the same as, the question of whether things like Simd<u64, Lanes<3>> should be supported. I think it’s hard to tell right now which of these is the best option, but my gut tells me to round down.

Also, there is a question of what Simd<T, Bits<N>>::cast() should mean. Our options are either “size-preserving transmute” or “it’s not available at all”. I don’t think either is wrong; I think that the transmute operation is useful, but it should not have the same name to avoid surprises in generic code.

SIMD Function Name Lookup

Right now, you’re expected to import a prelude of traits, such as SimdUint, to use most SIMD operations. I am not a fan of this type of API for a few reasons:

  1. It’s hard to discover what functions are available without digging through many traits.
  2. Namespace pollution from use std::simd::prelude::*; is not ideal, IMO.
  3. The APIs of functions that are introduced through traits are kind of messy.

There is effectively no reason for these to be implemented as traits, because all of these functions are manually lowered by the compiler. For example, this is the implementation of SimdInt::cast():

#[inline]
fn cast<T: SimdCast>(self) -> Self::Cast<T> {
  // Safety: supported types are guaranteed by SimdCast
  unsafe { intrinsics::simd_as(self) }
}

The compiler directly lowers the simd_as intrinsic to an LLVM trunc, sext, or zext as needed. In my mind, there’s no particular reason this couldn’t be written as

impl<...> Simd<T, Lanes<N>> {
  #[inline]
  fn cast<U>(self) -> Simd<U, Lanes<N>>
  where U: SimdCastFrom<T>
  {
    // Safety: supported types are guaranteed by SimdCastFrom
    unsafe { intrinsics::simd_as(self) }
  }
}

I think there is an interesting question to ask here: whether a separate cast for pointee types (e.g., what <Simd<*mut i32, 4> as SimdMutPtr>::cast::<u32>() does today) is needed at all, since you could write vec.cast::<*mut u32>() instead. Currently, std::simd has distinct pointer cast intrinsics for SIMD, but I don’t believe there is a technical necessity here (integer casts lower as described, pointer casts are no-ops because LLVM only has the ptr type).

In this way, things like SimdFloat and SimdInt are replaced with marker traits that guard against invalid calls to intrinsics, but all of the actual callable methods are inherents of Simd.
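
As a sketch of what that arrangement could look like (names are illustrative; SimdFloat here becomes a marker on element types rather than a trait that carries methods):

// Marker trait on element types; it only gates which intrinsics may be called.
pub trait SimdFloat: SimdElement + Sealed {}   // implemented for f32 and f64

impl<T: SimdElement, N: Length> Simd<T, N> {
  pub fn sqrt(self) -> Self
  where
    T: SimdFloat,
  {
    // Intrinsics call.
  }
}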

There are also a handful of functions that would otherwise clash with trait methods from built-in types. For example, lanewise-eq is simd_eq, because if it was named eq it would be ambiguous with PartialEq::eq.

But, if instead simd_eq were inherent, this would not be a problem:

impl<...> Simd<T, N> {
  pub fn eq(self, other: Self) -> Mask<T::Mask, N>
  where T: PartialEq
  {
    // Intrinsics call.
  }
}

Inherent methods always win against trait methods, so a.eq(b) could coexist with a == b. There is definitely an argument to be made that shadowing well-known trait methods in a standard API is kinda rude, although I counter that no one ever means PartialEq::eq() when they want to compare two SIMD vectors.
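
Here is roughly how call sites would read under that arrangement, as a hypothetical sketch using the Lanes-based Simd from earlier sections:

fn any_lane_equal(a: Simd<u32, Lanes<4>>, b: Simd<u32, Lanes<4>>) -> bool {
  // Inherent lanewise comparison, then reduce the resulting mask.
  a.eq(b).any()
}

fn vectors_identical(a: Simd<u32, Lanes<4>>, b: Simd<u32, Lanes<4>>) -> bool {
  // PartialEq: whole-vector equality, as today.
  a == b
}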

Gather/Scatter

The current state of gather/scatter isn’t very good. Rust lowers these to LLVM’s @llvm.masked.gather and @llvm.masked.scatter intrinsics over a <n x ptr> vector of pointers, which produce mediocre code, since LLVM has trouble remembering what relationship those pointers had to each other.

I think the current gather/scatter functions should be removed, and replaced with the following more basic primitives.

impl<...> Simd<*mut T, N> {
  pub unsafe fn read(self) -> Simd<T, N>
  where T: SimdElement + Default
  {
    self.read_masked(Mask::splat(true), Simd::splat(T::default()))
  }

  pub unsafe fn read_masked(self, mask: Mask<isize, N>, default: Simd<T, N>)
  -> Simd<T, N>
  where T: SimdElement
  {
    // Intrinsics call.
  }

  pub unsafe fn write(self, value: Simd<T, N>)
  where T: SimdElement
  {
    self.write_masked(value, Mask::splat(true))
  }

  pub unsafe fn write_masked(self, value: Simd<T, N>, mask: Mask<isize, N>)
  where T: SimdElement
  {
    // Intrinsics call.
  }
}

Existing gather/scatter operations all lower into something resembling these anyway, and they should be fairly uncommon operations, so it seems better to provide just the primitive and wait to see what users wind up requesting.

What many users actually want when they do a gather is to load localized data in some shuffled order, which is best realized as a load + shuffle.

However, Simd::from_slice() requires that the input slice be no smaller than the vector being loaded, so it can always lower to a single load instruction. This is fine, but there probably needs to be a Simd::prefix_from_slice() that loads less than a full vector from memory. Highly efficient techniques exist for implementing such functions, which perform only O(log n) loads and branches; see vb64’s padded load implementation.
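
As a sketch of the flavor of those techniques (this is not vb64’s actual code, just an illustration assuming the slice is shorter than one 16-byte vector):

// Copy a short slice into a zero-padded buffer using power-of-two-sized chunks,
// so the number of copies and branches is O(log n) rather than O(n).
fn load_prefix(data: &[u8]) -> [u8; 16] {
  debug_assert!(data.len() < 16);
  let mut out = [0u8; 16];
  let mut copied = 0;
  // Greedy binary decomposition of the length: at most four fixed-size copies,
  // each of which typically compiles down to a single load/store pair.
  for chunk in [8usize, 4, 2, 1] {
    if data.len() - copied >= chunk {
      out[copied..copied + chunk].copy_from_slice(&data[copied..copied + chunk]);
      copied += chunk;
    }
  }
  out
}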

Once users have loaded the localized data into a vector, they can shuffle it as needed. The API should try to steer users away from general gather/scatter where possible, since it’s a performance footgun.

Arguments To SIMD Operations

Certain operations are very, very wordy, because SIMD operations only operate on SIMD vectors. For example, I have a helper function like this:

fn mask_splat<T, const N: usize>(mask: Mask<T::Mask, N>, val: T) -> Simd<T, N>
where
  T: SimdElement + Default,
  LaneCount<N>: SupportedLaneCount,
{
  mask.select(Simd::splat(val), Simd::splat(Default::default()))
}

Ideally, though, it would have been good to be able to just write mask.select(val, 0) and be done with it.

I think the best approach is to observe that Simd::splat is the canonical conversion from scalar to vector. For a lot of operations, mixed vector-and-scalar arguments, where the scalars are implicitly splatted, are a classical convention that makes a lot of sense. The most well-known example is scalar multiplication, e.g. Simd<f32, 4> * f32, but I think it’s a valid conversion for virtually every SIMD operation.

In other words, I think that Simd<T, N> should be From<N::Array<T>> and From<T>, and every function on Simd that takes a Simd<T, N> should instead take an impl Into<Simd<T, N>>.

I think this is very important for readability: redundant calls to Simd::splat get in the way of the essential character of the code.
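
Concretely, here is roughly how Mask::select’s signature could change under this proposal; this is a sketch only, with bounds mirroring the current std::simd select:

impl<T: MaskElement, const N: usize> Mask<T, N>
where
  LaneCount<N>: SupportedLaneCount,
{
  pub fn select<U>(
    self,
    if_true: impl Into<Simd<U, N>>,
    if_false: impl Into<Simd<U, N>>,
  ) -> Simd<U, N>
  where
    U: SimdElement<Mask = T>,
  {
    // Splat any scalar arguments up to full vectors, then select lanewise.
  }
}

// With i32: Into<Simd<i32, 4>>, the mask_splat helper above collapses to just
// `mask.select(val, 0)`.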

Separately, I think that it would be good to expand the scope of existing “array construction” primitives. The following function, for example, makes it easier to build e.g. shuffle index vectors, which are commonly “striped”.

impl<T: Clone> [T] {
  fn tile<const N: usize>(&self) -> [T; N] {
    // An empty pattern would leave lanes uninitialized, so rule it out.
    assert!(!self.is_empty());
    let mut uninit: [MaybeUninit<T>; N] = /* ... */;
    for (a, b) in uninit.iter_mut().zip(self.iter().cycle()) {
      *a = MaybeUninit::new(b.clone());
    }
    uninit.map(|x| unsafe { x.assume_init() })
  }
}

This would make it really easy to write e.g. table.swizzle_dyn([x, y, z].tile()).

Shuffling Operations

My understanding is that the Swizzle and Swizzle2 traits, as they exist, are going away. I think that’s a good step forward, but there are a few issues with shuffles worth considering.

My pony for shuffles is that we can have a single operation like this:

impl<...> Simd<T, Lanes<N>> {
  pub fn lookup_in_table<U, M, const K: usize>(
    self,
    table: [Simd<U, M>; K],
  ) -> Simd<U, Lanes<N>>
  where
    T: SimdUint,
    U: SimdElement,
    M: Length
  {
    // Intrinsics call.
  }
}

Essentially the intent here is that self is a vector of N indices, which we use to select N values from the provided lookup table. This generalizes Swizzle, Swizzle2, and swizzle_dyn. The semantics are: [Simd<U, M>; K] is interpreted as one flat array of U with M * K elements; elements of self are indices into that flat array; out-of-bounds indices produce zero (or an equivalent default).
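
As a scalar reference model of those semantics (purely illustrative; u8 elements and usize indices for simplicity):

fn lookup_in_table_ref<const N: usize, const M: usize, const K: usize>(
  indices: [usize; N],
  table: [[u8; M]; K],
) -> [u8; N] {
  let mut out = [0u8; N];
  for (o, &idx) in out.iter_mut().zip(indices.iter()) {
    // The table is treated as one flat [u8; M * K]; out-of-bounds indices
    // produce zero.
    *o = if idx < M * K { table[idx / M][idx % M] } else { 0 };
  }
  out
}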

The “main” issue here is that we really want this to lower to an LLVM shufflevector when possible, since that can be optimized better, but it requires that the argument be an immediate. I don’t know how good MIR-opt is at constant folding these days.

Part of the problem is that LLVM does not have a general “shuffle with dynamic index vector” intrinsic, and instead relegates users to the architecture-specific shuffles. Currently, the Rust standard library does selection itself, rather than letting the compiler do it, which is part of the problem.

I’m not sure what the cleanest way out of this is right now. One way to achieve it might be to introduce something like

#![feature(adt_const_params)]
#![feature(generic_const_exprs)]

pub struct Indices<const ARRAY: &'static [usize]>;

impl<const ARRAY: &'static [usize]> Indices<ARRAY> {
  fn lookup_in_table<U, const M: usize, const K: usize>(
    self,
    table: [Simd<U, M>; K],
  ) -> Simd<U, { ARRAY.len() }>
  where
    U: SimdElement,
    LaneCount<M>: SupportedLaneCount,
    LaneCount<{ ARRAY.len() }>: SupportedLaneCount,
  {
    todo!()
  }
}

Then, you could write

simd::Indices::<{ &[1, 2, 3, 4] }>.lookup_in_table([x, y, z])

This ensures that the index vector is an immediate, which simplifies selection.

In my opinion, this shouldn’t even have to exist, and LLVM should be able to “just deal with it” by offering an @llvm.vector.shuffle.vNT intrinsic that it knows it can promote to a shufflevector after it’s constant-folded and GVN’d a bit. But fixing that seems like a long-term issue, and at least offering something with a very similar interface to the “ideal” one as a stopgap seems like the best direction.

Conclusion

That’s all I’ve got. Not my usual style of post, but I felt this was something I needed to write up.
