I recently wrote an article about parsing base64 with SIMD. In the
process, I used it as an opportunity to evaluate Rust’s
std::simd portable SIMD
library.
I went into this with fairly low expectations, taking it as a given that the
codegen would not be as good as with direct use of intrinsics; LLVM is
historically “not good” at optimizing intrinsics code, and std::simd code
would not be using intrinsics at all.
That said, after using std::simd I think this is a really good approach. It
produces great code (after tripping over some random performance footguns). Most
of its issues stem from LLVM having mediocre support for some of the things Rust
wants to do, or Rust emitting LLVM IR that results in selection failures, which
are both fairly fixable. There are also some rough edges in the API that I think
should be fixed to make the code more readable.
After spending a lot of time complaining to Jubilee, the lead for portable SIMD, we’ve agreed I should write up my findings and feedback, which is what this post is going to be.
This is not my usual post, since it’s aimed at people with a lot of background, specifically other compiler people. I will be referring to general compiler and LLVM-specific concepts (like LLVM IR), API design, and HPC topics without much introduction. I am writing this based on my own extensive experience writing vectorized code, compiler optimizations, and subtle algorithms that require extensive comments; that experience is what informs the discussion here.
With that out of the way, let’s dive into the topics in no particular order.
Do We Need a Simd Type?
One question worth looking at is whether having Simd<T, N> be distinct from
[T; N] is worthwhile at all. The main reason is layout: on some weird
platforms, like PPC, things like [4 x i32] and <4 x i32> have the same size
but different alignment. There may also be endianness problems that I am not
aware of; I haven’t bothered to survey this but I wouldn’t be surprised if on
some BE architectures, you have
%x = extractelement <4 x i32> %v, i32 0
; Equivalent to...
%a = bitcast <4 x i32> %v to [4 x i32]
%x = extractvalue [4 x i32] %a, 3
I.e. the lanes are also in big-endian order. However, currently, Rust does not respect this:
let mut vec = Simd::from_array([1, 2, 3, 4]);
vec.as_mut_array()[0] = 42;
// Does this print [42, 2, 3, 4] or [1, 2, 3, 42]? Depends
// on whether the lane layout matches [i32; 4].
eprintln!("{vec:?}");
So we only really care about alignment (or, as_mut_array() has to be removed).
It may be worth interrogating whether alignment matters: the big three modern
CPU architectures, x86, ARM, and RISC-V, do not have “overaligned” vectors, so
it’s basically a question of whether Rust emits only partially-aligned loads for
vectors on “small market share” architectures. I strongly suspect this does not
matter for most code, so overaligning via an AlignAsIfSimd<[i32; 4]> type might
be good enough?
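As a rough illustration of what I mean (AlignAsIfSimd is a hypothetical name from above, and the 16-byte figure assumes a 128-bit vector; a real version would derive the alignment from the element type and lane count):
#[repr(C, align(16))]
pub struct AlignAsIfSimd<A>(pub A);

fn main() {
  // Same layout as [i32; 4], but carrying vector alignment.
  let v = AlignAsIfSimd([1i32, 2, 3, 4]);
  assert_eq!(core::mem::align_of_val(&v), 16);
}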
The reason to ask this question is that arrays are Really Nice,
syntactically. It would be very convenient to get to re-use arrays’ [x, y, z]
and [x; N] syntax for free, and for existing code that uses [T; N] to be
able to use the same vocabulary as SIMD code with minimal cognitive overhead.
However, some other design proposals here become problematic, particularly
From impls: for example, it is desirable for a scalar to Into-convert into
a vector of any size via a broadcast, but i32: Into<[i32; 4]> feels
problematic outside of the context of SIMD.
I generally lean against having arrays be “the” SIMD type for this reason more than any other: it is useful to change the API available for an array-like type depending on whether it is a “list of things” or a “parallel vector of integers”.
What’s the Right Length?
The Simd<T, N> type has an “array-like length”, i.e., it’s in units of array
elements. However, most SIMD intrinsics have a “register-like length”, where
they specify the number of bits in the vector and don’t specify the type at all.
For example, __m256 represents an abstract YMM register in x86 intrinsics
code.
Both interpretations of the length are useful. For example, it is useful to have
a Simd<u64, 4> and then cast it down to Simd<u32, 4> to get a vector of all
the lower halves of the lanes in the u64 vector, but it is also useful to be
able to cast it to a Simd<u32, 8>, which interprets each u64 lane as a pair
of u32 lanes.
This kind of “size-preserving transmute” is very common in SIMD, particularly because some architectures’ intrinsics (Intel in particular) make it easy, since the lane width is specified on the intrinsic.
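For concreteness, here’s roughly how those two operations look with today’s std::simd (a sketch on nightly; the transmute’s lane order assumes a little-endian target):
#![feature(portable_simd)]
use std::simd::prelude::*;

fn main() {
  let v: Simd<u64, 4> = Simd::from_array([0x1_0000_0002, 3, 4, 5]);

  // "Array-like" cast: same lane count, each u64 truncated to its low half.
  let low_halves = v.cast::<u32>();
  assert_eq!(low_halves.to_array(), [2, 3, 4, 5]);

  // "Register-like" reinterpretation: same bit width, twice the lanes.
  let pairs: Simd<u32, 8> = unsafe { std::mem::transmute(v) };
  assert_eq!(pairs.to_array(), [2, 1, 3, 0, 4, 0, 5, 0]);
}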
I think that in an ideal world, it should be possible to specify both length styles. Here’s more-or-less how I’d approach it, as an API.
// mod std::simd
#[repr(simd)]
pub struct Simd<T: SimdElement, N: Length>(N::Array<T>);

impl<T: SimdElement, N: Length> Simd<T, N> {
  pub fn from_lanes(lanes: Simd<T, N::AsLanes<T>>) -> Self;
  pub fn from_bits(bits: Simd<T, N::AsBits<T>>) -> Self;
  pub fn into_lanes(self) -> Simd<T, N::AsLanes<T>>;
  pub fn into_bits(self) -> Simd<T, N::AsBits<T>>;
}

pub trait Length: Sealed {
  type Array<T: SimdElement>: Copy;
  type AsLanes<T: SimdElement>: Length;
  type AsBits<T: SimdElement>: Length;
}

pub struct Lanes<const N: usize>;
pub struct Bits<const N: usize>;
The units are now part of the type. I think that making the type independent of
the units is probably possible, but I don’t think it’s desirable. Suppose we
pick the unit in the realized type to be the lane count. Then suppose we want a
way to transmute a Simd<u32, Lanes<8>> into a Simd<u64, Lanes<4>> (the same 256
bits, reinterpreted). How do we express this API generically? If we pick bit
count as “fundamental”, we have the same trouble trying to specify Simd::cast().
Having two distinct types may lead to problems around unification, but I think
this requires further study to determine potential issues. In particular, it is
worth noting that having both the into_* and from_* conversions matters, so we
can make programs like the following type check:
fn frob<T, N>(x: Simd<T, N>) -> Simd<T, N::AsBits<T>>
where T: SimdElement, N: Length,
{
  x.into_bits()
}

fn do_it<T, const N: usize>(x: Simd<T, Lanes<N>>) -> Simd<T, Lanes<N>>
where T: SimdElement, Lanes<N>: Length,
{
  Simd::from_bits(frob(x))
}
If we instead tried to do frob(x).into_lanes(), we would get something like
Simd<T, Lanes<N>::AsBits<T>::AsLanes<T>>, which the trait solver has no way to
cancel out to just Simd<T, Lanes<N>>. Simd::from_bits allows us to “undo”
this.
Mind, most people are not going to want to write out a type like
Simd<T, Lanes<N>> all the time, so some convenience aliases are in order. One
option is something like this.
// The core data type.
#[repr(simd)]
pub struct Simd<T: SimdElement, N: Length>(N::Array<T>);

// Array-like SIMD types.
pub type Array<T, const N: usize> = Simd<T, Lanes<N>>;

// "Very large integer" SIMD types, e.g. v128<i32>.
pub type v64<T> = Simd<T, Bits<64>>;
pub type v128<T> = Simd<T, Bits<128>>;
pub type v256<T> = Simd<T, Bits<256>>;
pub type v512<T> = Simd<T, Bits<512>>;
Another option is a Simd! macro of some kind to make it easier to name
different SIMD types. There’s definitely bikeshedding opportunities.
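For example, a Simd! macro could accept either unit and expand to the underlying type; this is just a sketch of the shape, not a concrete proposal:
macro_rules! Simd {
  ($T:ty, lanes = $n:literal) => { Simd<$T, Lanes<$n>> };
  ($T:ty, bits = $n:literal) => { Simd<$T, Bits<$n>> };
}

// Both of these name the same 128-bit vector of u32.
type A = Simd!(u32, lanes = 4);
type B = Simd!(u32, bits = 128);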
Also, we need an answer for cases where the length in Bits<N> is not cleanly
divisible by the element type’s width: e.g., Simd<u8, Bits<15>>,
Simd<u64, Bits<32>>. There are three choices for the semantics of this thing:
- Monomorphization error.
- Round the number of lanes down.
- Round the number of lanes up.
This is related to, but not exactly the same as, whether things like
Simd<u64, Lanes<3>> should be supported. I think it’s hard to tell right now
which of these is the best option, but my gut tells me to round down.
Also, there is a question of what Simd<T, Bits<N>>::cast() should mean. Our
options are either “size-preserving transmute” or “it’s not available at all”. I
don’t think either is wrong; I think that the transmute operation is useful, but
it should not have the same name to avoid surprises in generic code.
SIMD Function Name Lookup
Right now, you’re expected to include a prelude for using most SIMD operations,
such as SimdUint. I am not a fan of this type of API for a few reasons:
- It’s hard to discover what functions are available without digging through many traits.
- Namespace pollution from use std::simd::prelude::*; is not ideal, IMO.
- The APIs of functions that are introduced through traits are kind of messy.
There is effectively no reason for these to be implemented as traits, because
all of these functions are manually lowered by the compiler. For example, this
is the implementation of SimdInt::cast():
#[inline]
fn cast<T: SimdCast>(self) -> Self::Cast<T> {
  // Safety: supported types are guaranteed by SimdCast
  unsafe { intrinsics::simd_as(self) }
}
The compiler directly lowers the simd_as intrinsic to an LLVM trunc, sext,
or zext as needed. In my mind, there’s no particular reason this couldn’t be
written as
impl<...> Simd<T, Lanes<N>> {
  #[inline]
  pub fn cast<U>(self) -> Simd<U, Lanes<N>>
  where U: SimdCastFrom<T>
  {
    // Safety: supported casts are guaranteed by SimdCastFrom.
    unsafe { intrinsics::simd_as(self) }
  }
}
There is an interesting question to ask here about whether a separate cast for
pointee types is needed at all (e.g. what
<Simd<*mut i32, 4> as SimdMutPtr>::cast::<u32>() does today), since you could
write vec.cast::<*mut u32>() instead. Currently, std::simd has distinct
pointer cast intrinsics for SIMD, but I don’t believe there is a technical
necessity here (integer casts lower as described, pointer casts are no-ops
because LLVM only has the ptr type).
In this way, things like SimdFloat and SimdInt are replaced with marker
traits that guard against invalid calls to intrinsics, while all of the actual
callable methods become inherent methods on Simd.
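As a sketch of that shape, the hypothetical SimdCastFrom used above would be nothing but an element-type relation; the method itself lives on Simd:
// Purely a marker: no methods, just a whitelist of valid element-type casts.
pub trait SimdCastFrom<T: SimdElement>: SimdElement {}

impl SimdCastFrom<u32> for u64 {}
impl SimdCastFrom<u64> for u32 {}
impl SimdCastFrom<i32> for f32 {}
// ...and so on for every supported source/destination pair.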
There are also a handful of functions that would otherwise clash with trait
methods from built-in types. For example, lanewise-eq is simd_eq, because if
it was named eq it would be ambiguous with PartialEq::eq.
But, if instead simd_eq were inherent, this would not be a problem:
impl<...> Simd<T, N> {
  pub fn eq(self, other: Self) -> Mask<T::Mask, N>
  where T: PartialEq
  {
    // Intrinsics call.
  }
}
Inherent methods always win against trait methods, so a.eq(b) could coexist
with a == b. There is definitely an argument to be made that shadowing well-known
trait methods in a standard API is kinda rude, although I counter that no one
ever means PartialEq::eq() when they want to compare two SIMD vectors.
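Under that proposal, usage would look something like this (hypothetical, since the inherent method does not exist today):
let a = Simd::from_array([1, 2, 3, 4]);
let b = Simd::from_array([1, 0, 3, 0]);

let lanewise = a.eq(b); // inherent Simd::eq: a Mask of per-lane results
let whole = a == b;     // PartialEq::eq via ==: a single bool

assert_eq!(lanewise.to_array(), [true, false, true, false]);
assert!(!whole);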
Gather/Scatter
The current state of gather/scatter isn’t very good. Rust lowers these to
LLVM’s masked gather and scatter operations over a vector of pointers, which
produce mediocre code, since LLVM has trouble remembering what relationship
those pointers had to each other.
I think the current gather/scatter functions should be removed, and replaced with the following more basic primitives.
impl<...> Simd<*mut T, N> {
  pub unsafe fn read(self) -> Simd<T, N>
  where T: SimdElement + Default
  {
    self.read_masked(Mask::splat(true), Simd::splat(Default::default()))
  }

  pub unsafe fn read_masked(self, mask: Mask<isize, N>, default: Simd<T, N>)
    -> Simd<T, N>
  where T: SimdElement
  {
    // Intrinsics call.
  }

  pub unsafe fn write(self, value: Simd<T, N>)
  where T: SimdElement
  {
    self.write_masked(value, Mask::splat(true))
  }

  pub unsafe fn write_masked(self, value: Simd<T, N>, mask: Mask<isize, N>)
  where T: SimdElement
  {
    // Intrinsics call.
  }
}
Existing gather/scatter operations all lower into something resembling these anyways, and they should be a fairly uncommon operation, so it seems better to provide just the primitive and wait to see what users wind up requesting.
What many users actually want when they do a gather is to load localized data in some shuffled order, which is best realized as a load + shuffle.
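In today’s API, that pattern looks something like this (a sketch; the indices here are arbitrary):
use std::simd::{simd_swizzle, u32x4};

// Load the contiguous data once, then permute it in-register, instead of
// gathering table[3], table[0], table[2], table[1] through four pointers.
fn shuffled_load(table: &[u32; 4]) -> u32x4 {
  let v = u32x4::from_array(*table);
  simd_swizzle!(v, [3, 0, 2, 1])
}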
However, Simd::from_slice() requires that the input slice be no smaller than
the vector being loaded, so it can always lower to a single load instruction.
This is fine, but there probably needs to be a Simd::prefix_from_slice() that
loads less than a full vector’s worth of memory. Highly efficient techniques
exist for implementing such functions, which perform a small, bounded number of
loads and branches; see
vb64’s padded load implementation.
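Here is a strawman of the signature, implemented with a plain copy into a zeroed buffer rather than the padded-load tricks a real implementation would use (prefix_from_slice is the hypothetical name from above):
use std::simd::{LaneCount, Simd, SupportedLaneCount};

fn prefix_from_slice<const N: usize>(slice: &[u8]) -> Simd<u8, N>
where
  LaneCount<N>: SupportedLaneCount,
{
  // Copy whatever prefix fits; missing lanes are zero.
  let mut buf = [0u8; N];
  let n = slice.len().min(N);
  buf[..n].copy_from_slice(&slice[..n]);
  Simd::from_array(buf)
}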
Once users have loaded the localized data into a vector, they can shuffle it as needed. The API should try to steer users away from general gather/scatter where possible, since it’s a performance footgun.
Arguments To SIMD Operations
Certain operations are very, very wordy, because SIMD operations only operate on SIMD vectors. For example, I have a helper function like this:
fn mask_splat<T, const N: usize>(mask: Mask<T::Mask, N>, val: T) -> Simd<T, N>
where
  T: SimdElement + Default,
  LaneCount<N>: SupportedLaneCount,
{
  mask.select(Simd::splat(val), Simd::splat(Default::default()))
}
Ideally, though, it would have been good to be able to just write
mask.select(val, 0) and be done with it.
I think the best approach is to observe that Simd::splat is the canonical
conversion from scalar to vector. For a lot of operations, mixed vector and
scalar arguments, where the scalars are implicitly splatted, make a lot of
sense and have classical precedent: the most well-known example is scalar
multiplication, e.g. Simd<f32, 4> * f32, but I think it’s a valid conversion
for virtually every SIMD operation.
In other words, I think that Simd<T, N> should be From<N::Array<T>> and
From<T>, and every function on Simd that takes a Simd<T, N> should instead
take an impl Into<Simd<T, N>>.
I think this is very important for readability: redundant calls to Simd::splat
get in the way of the essential character of the code.
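As a sketch of what this buys, select under this proposal could accept either scalars or vectors, which collapses mask_splat from above into a single call (the bounds here are illustrative, not exact):
impl<M: MaskElement, const N: usize> Mask<M, N>
where
  LaneCount<N>: SupportedLaneCount,
{
  pub fn select<T>(
    self,
    if_true: impl Into<Simd<T, N>>,
    if_false: impl Into<Simd<T, N>>,
  ) -> Simd<T, N>
  where
    T: SimdElement<Mask = M>,
  {
    // As today, after splatting any scalar arguments.
    todo!()
  }
}

// mask_splat(mask, val) then becomes just: mask.select(val, T::default())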
Separately, I think that it would be good to expand the scope of existing “array construction” primitives. The following function, for example, makes it easier to build e.g. shuffle index vectors, which are commonly “striped”.
impl<T: Clone> [T] {
  fn tile<const N: usize>(&self) -> [T; N] {
    let mut uninit: [MaybeUninit<T>; N] = [const { MaybeUninit::uninit() }; N];
    for (a, b) in uninit.iter_mut().zip(self.iter().cycle()) {
      *a = MaybeUninit::new(b.clone());
    }
    // Every element was initialized above (assuming the slice is non-empty).
    unsafe { MaybeUninit::array_assume_init(uninit) }
  }
}
This would make it really easy to write e.g.
table.swizzle_dyn([x, y, z].tile()).
Shuffling Operations
My understanding is that the Swizzle and Swizzle2 traits, as they exist, are
going away. I think that’s a good step forward, but there are a few issues with
shuffles worth considering.
My pony for shuffles is that we can have a single operation like this:
impl<...> Simd<T, Lanes<N>> {
  pub fn lookup_in_table<U, M, const K: usize>(
    self,
    table: [Simd<U, M>; K],
  ) -> Simd<U, Lanes<N>>
  where
    T: SimdUint,
    U: SimdElement,
    M: Length,
  {
    // Intrinsics call.
  }
}
Essentially the intent here is that self is a vector of N indices, which we
use to select N values from the provided lookup table. This generalizes
Swizzle, Swizzle2, and swizzle_dyn. The semantics are: [Simd<U, M>; K]
is interpreted as a flattened array of M::Array::<U>.len() * K elements of U;
each element of self is an index into that array; out-of-bounds indices
produce zero (or equivalent).
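A worked example of those semantics, as I read them, with two 4-lane table vectors of u8 (hypothetical, since lookup_in_table is the proposal above):
let table = [
  Simd::from_array([10u8, 11, 12, 13]),
  Simd::from_array([14u8, 15, 16, 17]),
];
let idx = Simd::from_array([7u8, 0, 3, 200]);

// The table flattens to [10, 11, 12, 13, 14, 15, 16, 17]; 200 is out of
// bounds, so that lane becomes zero.
assert_eq!(idx.lookup_in_table(table).to_array(), [17, 10, 13, 0]);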
The “main” issue here is that we really want this to lower to an LLVM
shufflevector when possible, since that can be optimized better, but it
requires that the argument be an immediate. I don’t know how good MIR-opt is at
constant folding these days.
Part of the problem is that LLVM does not have a general “shuffle with dynamic index vector” intrinsic, and instead relegates users to the architecture-specific shuffles. Currently, the Rust standard library does selection itself, rather than letting the compiler do it, which is part of the problem.
I’m not sure what the cleanest way out of this is right now. One way to achieve it might be to introduce something like
#![feature(adt_const_params)]
#![feature(generic_const_exprs)]
pub struct Indices<const ARRAY: &'static [usize]>;
impl<const ARRAY: &'static [usize]> Indices<ARRAY> {
fn lookup_in_table<U, const M: usize, const K: usize>(
self,
table: [Simd<U, M>; K],
) -> Simd<U, { ARRAY.len() }>
where
U: SimdElement,
LaneCount<M>: SupportedLaneCount,
LaneCount<{ ARRAY.len() }>: SupportedLaneCount,
{
todo!()
}
}
Then, you could write
simd::Indices::<{ &[1, 2, 3, 4] }>.lookup_in_table([x, y, z])
This ensures that the index vector is an immediate, which simplifies selection.
In my opinion, this shouldn’t even have to exist, and LLVM should be able to
“just deal with it” by offering an @llvm.vector.shuffle.vNT intrinsic that it
knows it can promote to a shufflevector after it’s constant-folded and GVN’d a
bit. But fixing that seems like a long-term issue, and at least offering
something with a very similar interface to the “ideal” one as a stopgap seems
like the best direction.
Conclusion
That’s all I’ve got. Not my usual style of post, but I felt this was something I needed to write up.