Silicon designers are bad at designing secure hardware. Embarrassingly so, sometimes. This means that low-level cryptography, as well as code which directly handles key material, often needs to be written in a particularly delicate style called “constant-time”.
“Constant-time” is a bit of a misnomer. It does not mean that the code’s time complexity is O(1) (although this is a closely related property). Rather, it names a style of programming against a threat model for timing side-channel attacks (of which Spectre is the most famous relative), one that ensures key material is not leaked through the microarchitecture of CPUs.
Although constant-time is a powerful countermeasure against the silicon designers leaking our keys, the compiler can still screw us. However, there are magic incantations that can be offered to the compiler to make it behave correctly in many relevant situations.
First: what is constant-time?
The Constant-Time Threat Model
The actual threat model is a series of assumptions about the most advanced attacker we wish to defeat. The assumptions, as applied to a cryptography software library, are as follows:
The attacker has access to the source code of the library, the compiled artifact linking in the library, the toolchain used to build it, and any relevant compiler flags (this is true, for example, for an Internet browser).
The attacker has a complete trace of every program counter value visited by the program. This is not the same as an instruction trace, which will usually also record software-visible architectural state at each instruction. Attackers can often obtain this information by directly timing the software, since the relevant information is mostly branches-taken. This includes instructions executed in a privileged mode, such as within the kernel.
The attacker knows the address of every pointer stored or loaded by the program. That is, each program counter value in the above is annotated with the values of registers containing pointers relevant to that instruction, such as the address operand of a load instruction. This data can be obtained through data cache timing side-channels, of the kind exploited by attacks like Spectre.
Programming against this model defeats virtually all known timing side-channel attacks, although it does not protect against other side-channel attacks, such as thermal and power analysis.
New Footguns
All cryptography libraries that are safe to use in 2025 implement all of their critical code in constant-time. This model has broad consequences for what kinds of programs are allowed.
What’s wrong with this code?
int8_t* key = ...;
for (int i = 0; i < n; ++i) {
if (auto& k = key[i]; k < 0) {
k = -k;
}
}
Because the loop runs n times, we immediately leak n: the attacker can count the number of loop iterations. If n is secret, this is a problem.
We also leak one eighth of the bits in the key: the attacker can see which loop iterations contain a negation instruction; those iterations correspond to bytes which had their sign bit set.
To protect the value of n, we would have to ensure that there is some maximum
value N of n, and that key was allocated to an at-least-N-byte buffer;
the loop would then be over N, making no reference to n. This is a relatively
uncommon situation, since the length of a buffer is almost never a secret in
practice.
Protecting the sign bits is simpler: rather than branching to determine if
the value should be negated, we can mask off the sign bit: key[i] &= 0x7f.
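Putting both fixes together, a constant-time version of the loop might look like the sketch below (N is a hypothetical compile-time upper bound on the key length, for the uncommon case where the length itself must also be protected):
#include <stdint.h>

constexpr int N = 32;  // Hypothetical upper bound on the key length.

void strip_sign_bits(int8_t* key) {
  // Loop over the fixed bound N rather than the secret length, and mask
  // unconditionally instead of branching on the sign bit.
  for (int i = 0; i < N; ++i) {
    key[i] &= 0x7f;
  }
}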
Buffer Comparisons
Many standard library functions are not constant-time. For example, memcmp’s
runtime depends not just on the length of its inputs but on their contents;
most implementations break out early upon encountering an unequal byte:
int memcmp(const char* a, const char* b, int n) {
for (int i = 0; i < n; ++i) {
auto diff = a[i] - b[i];
if (diff != 0) {
return diff;
}
}
return 0;
}
Now suppose we have the following code to verify a signature:
Message msg = ...;
int sig_len = ...;  // Length of the signature, in bytes.
char* expected = compute_signature(msg.body, private_key);
if (memcmp(msg.signature, expected, sig_len) == 0) {
// Assume msg is authentic.
}
The attacker can use this as a signing oracle. By sending the message they
want signed along with a bad signature, they can determine, by timing alone,
the first byte of the signature which is incorrect, and brute-force that byte.
They can then proceed to the next byte, and so on. The maximum number of
queries to forge an n-byte signature is 256 times the number of bytes,
reducing the cost from 256^n to 256·n!
To defeat this attack, we need to use constant-time memcmp. This means that
every byte must be compared, and the comparison must be accumulated without
branching.
The typical implementation is something like this:
bool ct_memeq(const char* a, const char* b, int n) {
char acc = 0;
for (int i = 0; i < n; ++i) {
acc |= a[i] ^ b[i];
}
return acc == 0;
}
If at least one byte differs between a and b, their xor will be non-zero,
and so acc will be nonzero. This function only tests for exact equality,
not lexicographic ordering, but the latter is never an operation you want or
need on keys and other sensitive data.
There are many variants on this implementation, but the xor one is the most popular. Subtraction will achieve a similar result (although signed overflow is UB in C++).
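For comparison, here is what the earlier verification looks like once it is rebuilt on ct_memeq (a sketch; Message, compute_signature, private_key, and sig_len are the hypothetical pieces carried over from the pseudocode above):
// Sketch: the earlier signature check, rebuilt on the constant-time comparison.
bool verify(const Message& msg, int sig_len) {
  char* expected = compute_signature(msg.body, private_key);
  // Every byte is compared no matter where the first mismatch is, so timing
  // reveals nothing about how many leading bytes happened to match.
  return ct_memeq(msg.signature, expected, sig_len);
}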
Constant-Time Select
A similar trick is used to select one of two values. The ternary operator cannot be used because it can be compiled into a branch.
template <typename T>
T ct_select(bool flag, T a, T b) {
auto mask = -T(flag);
return (a & mask) | (b & ~mask);
}
If flag is true, mask is the all-ones representation for T, so a & mask
is a and b & ~mask is 0, so their or is a. If flag is false, the
opposite is true.
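To see the arithmetic concretely, here is a tiny worked check of the idiom with T = uint32_t (a standalone sketch, not part of the library code):
#include <cassert>
#include <cstdint>

int main() {
  uint32_t a = 0xAAAAAAAA, b = 0x55555555;
  uint32_t mask_true = -uint32_t(1);   // flag == true:  0xFFFFFFFF
  uint32_t mask_false = -uint32_t(0);  // flag == false: 0x00000000
  assert(((a & mask_true) | (b & ~mask_true)) == a);    // Selects a.
  assert(((a & mask_false) | (b & ~mask_false)) == b);  // Selects b.
}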
On x86, Clang recognizes this idiom and produces a conditional move instruction, which does not violate the threat model. However, on architectures without conditional move, such as RISC-V, Clang still recognizes the pattern… but produces a branch.
_Z9ct_selectIiET_bS0_S0_:
mov eax, esi
test edi, edi
cmove eax, edx
ret
_Z9ct_selectIiET_bS0_S0_:
bnez a0, .LBB0_2
mv a1, a2
.LBB0_2:
mv a0, a1
ret
You might want to point the finger at the bool, but this actually goes much
deeper than that. This type of unwanted optimization happens all the time
in constant-time code, and preventing it is essential to ensure that the
countermeasure works.
The “easy” solution is to just write the assembly directly, which is what is done for the most performance-critical parts of cryptography implementations, but this is error-prone and not portable.
Thankfully, all modern native compilers provide a hidden feature to block optimizations inimical to security, and what this post is really about: the value barrier.
An Incantation
The value barrier is a special secret syntax construct that instructs the compiler to ignore information it could use to prove the correctness of optimizations.
For example, the reason Clang is able to “defeat” our
ct_select implementation is that it knows from -T(flag)
that mask must be either 0 or -1. If we can hide this fact from Clang,
it will be forced to not emit the branch.
One way we can do this is to “encrypt” mask with a value that Clang cannot
see through, such as a global variable:
template <typename T>
inline T key;
template <typename T>
T ct_select(bool flag, T a, T b) {
auto mask = -T(flag);
mask ^= key<T>;
return (a & mask) | (b & ~mask);
}
Clang cannot know what the value of key will be at runtime, because code in
another translation unit might set it. It must therefore assume it knows nothing
about mask, and that the return value could be an arbitrary bit-mix of a and b.
However, because we never actually set this global, it will always be zero, so
mask ^= key<T>; is a no-op at runtime.
On RISC-V, this does what we need:
_Z9ct_selectIiET_bS0_S0_:
.Lpcrel_hi0:
auipc a3, %pcrel_hi(_Z3keyIiE)
lw a3, %pcrel_lo(.Lpcrel_hi0)(a3)
neg a0, a0
xor a0, a0, a3
xor a1, a1, a2
and a0, a0, a1
xor a0, a0, a2
ret
_Z3keyIiE:
.word 0
Unfortunately, this does perform a load, which is a performance hiccup we’d like to avoid. Also, if a whole-program optimizer (LTO, BOLT, or the like) notices that the global is never written to, it can replace all of its loads with immediate zeros.
Another option is to send the value into an assembly function. Not even LTO can see into assembly functions, and BOLT will not attempt to optimize hand-written assembly.
#include <stdint.h>
asm(R"(
.intel_syntax noprefix
.globl __asm_eyepatch
__asm_eyepatch:
  mov rax, rdi  # The argument arrives in rdi under the SysV ABI.
  ret
.att_syntax prefix
)");
extern "C" uint64_t __asm_eyepatch(uint64_t);
template <typename T>
T ct_select(bool flag, T a, T b) {
auto mask = -T(flag);
mask = T(__asm_eyepatch(mask));
return (a & mask) | (b & ~mask);
}
This works, but now the cost is a non-inlineable function call rather than a load from a global. Not ideal. It’s also, again, not portable across targets. The function call is also treated as having side-effects by the compiler, which impedes desirable optimizations.
But a small modification will eliminate this problem: we can simply use an inline assembly block with zero instructions.
template <typename T>
T ct_select(bool flag, T a, T b) {
auto mask = -T(flag);
asm("" : "+r"(mask));
return (a & mask) | (b & ~mask);
}
This is called the value barrier: an empty assembly block which modifies a single
register-sized value in-place as a no-op. The "+r" constraint indicates that
the assembly takes mask as an input, and outputs a result onto mask, in the
same register.
This has the same effect as the xor-with-key solution (prevents optimizations
that depend on the value of mask) without the load from a global. It’s
architecture-independent, because "" is a valid inline assembly block
regardless of underlying assembly language.
The value barrier is often written as its own function (with a massive comment explaining what it does), to be used at key points in cryptographic algorithms.
template <typename T>
[[nodiscard]] T value_barrier(T x) {
asm("" : "+r"(x));
return x;
}
A use of the value barrier looks like this:
template <typename T>
T ct_select(bool flag, T a, T b) {
auto mask = -T(flag);
mask = value_barrier(mask);
return (a & mask) | (b & ~mask);
}
value_barrier is trivially inlineable and compiles down to zero instructions,
but has a profound effect on optimizations.
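As a concrete example of where it ends up, a constant-time conditional swap, a staple of elliptic-curve code, can be built directly on the barriered mask (a sketch; ct_cond_swap is an illustration, not taken from any particular library):
// Swap a and b if and only if flag is true, without branching on flag.
template <typename T>
void ct_cond_swap(bool flag, T& a, T& b) {
  T mask = -T(flag);           // All-ones if flag, all-zeros otherwise.
  mask = value_barrier(mask);  // Hide that fact from the optimizer.
  T diff = (a ^ b) & mask;     // a ^ b when flag, 0 otherwise.
  a ^= diff;
  b ^= diff;
}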
Dataflow
A better name for the value barrier is the “dataflow barrier”, because that more accurately captures what it does.
If you read my introduction to SSA, you’ll know that every optimizing compiler today puts a heavy emphasis on dataflow analysis. To figure out how to optimize some operation, we look at its inputs’ definitions.
Let’s look at what LLVM sees when we compile ct_select,
setting T = int (I have manually lifted everything into registers).
define i32 @_Z9ct_selectIiET_bS0_S0_(i1 zeroext %flag, i32 %a, i32 %b) {
%1 = zext i1 %flag to i32 ; int(flag)
%2 = sub nsw i32 0, %1 ; mask = 0 - int(flag)
%3 = and i32 %a, %2 ; a & mask
%4 = xor i32 %2, -1 ; mask ^ -1
%5 = and i32 %b, %4 ; b & (mask ^ -1)
%6 = or i32 %3, %5 ; (a & mask) | (b & (mask ^ -1))
ret i32 %6
}
LLVM contains pattern-matching code that recognizes this code (and various
permutations of it) as a “select” idiom: LLVM has an instruction, called
select, that chooses one of two values based on an i1. It is essentially
the C ternary with SSA register arguments.
The pattern-matching code looks for an or whose arguments are both ands,
where one argument to one and is the complement of an argument to the other (i.e., xor with -1).
Call this argument the “mask”.
What LLVM has found is a general “bit-mixing” operation, which selects bits from the other
operands of the ands depending on which bits of the mask are set.
LLVM then wants to prove that the mask is either 0 or -1 (all ones). There
are a number of ways LLVM can discover this, but all of them essentially boil
down to “is the mask a sext i1, i.e., sign-extending a one-bit value”. That
does not occur in this code, but a peephole optimization can rewrite
sub nsw i32 0, %1 into sext i1 %flag to i32, allowing this more complicated
pattern to be detected.
LLVM rewrites this into a select, which on x86 turns into a cmov, but on
RISC-V forces a branch.
All we need to do is prevent LLVM from looking through to the definition of the “mask”. That is precisely what the value barrier accomplishes: it inserts a new SSA register whose value is, at runtime, equivalent to the mask, but which is produced by an instruction that LLVM does not implement dataflow analysis for. That is the value barrier: an intentional hole left in the compiler’s analysis.
In theory, LLVM could see that the inline assembly block is empty and optimize it out. However, by design, it does not, because cryptography depends on it remaining an optimization barrier. In fact, LLVM (and GCC) emit special annotations around inline assembly to stop downstream tools, such as linker optimizations and post-link optimizers like BOLT, from accidentally optimizing sensitive sequences. In an assembly dump from LLVM, those regions look like this:
_Z9ct_selectIiET_bS0_S0_:
mov eax, esi
neg edi
#APP
#NO_APP
xor eax, edx
and eax, edi
xor eax, edx
ret
These are just comments; the actual sensitive regions are recorded elsewhere in the resulting object code.
What Does the Barrier Do?
The programming model for the value barrier is simple: it produces an arbitrary value that, at runtime, happens to be bitwise-identical to its input. The compiler may still make assumptions about the input value, but it cannot connect them to the output of the barrier through dataflow.
In other words, the value barrier is simply a register copy that also severs the dataflow link from the destination to the source operand.
The compiler is still allowed to optimize based on the assumption that this is some unknown concrete value. For example:
int random() {
int x = value_barrier(42);
return x - x;
}
x - x is always zero, so LLVM can optimize away the value barrier and its
input altogether. Critically, the value barrier does not have side effects,
so if its result is not used, the value barrier will be deleted through dead
code elimination. The following does not work:
template <typename T>
T ct_select(bool flag, T a, T b) {
auto mask = -T(flag);
value_barrier(mask); // Unused result warning.
return (a & mask) | (b & ~mask);
}
Because it is side-effect-free, it can also be hoisted out of loops and conditionals, which is actually a desirable optimization.
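For instance, a barrier applied to a loop-invariant value can legally be lifted out of the loop, so it costs nothing per iteration (a sketch; mask_all is just an illustration built on the value_barrier above):
#include <cstdint>

uint64_t mask_all(const uint64_t* p, int n, uint64_t secret_mask) {
  uint64_t acc = 0;
  for (int i = 0; i < n; i++) {
    // value_barrier has no side effects and secret_mask does not depend on i,
    // so the compiler is free to hoist this call out of the loop.
    acc |= p[i] & value_barrier(secret_mask);
  }
  return acc;
}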
Benchmark Black Box
There is a related function that appears in benchmarking code, which is not equivalent to the value barrier:
template <typename T>
void black_box(const T& v) {
asm volatile("" :: "m"(v))
}
This function guarantees that its input is treated as “used” by the compiler, even if the computation that produced it is otherwise side-effect-free. Rather than blocking dataflow, it blocks dead code elimination.
It works because asm volatile must be treated as having observable
side-effects that depend on all of its inputs. This means that it cannot be
deleted, hoisted out of loops, or executed speculatively.
A benchmark black box is most commonly used to force the result of a function
being benchmarked to not be deleted by the compiler, so that the runtime of
that function can be accurately measured. It can be used in place of the value
barrier, because the "m" constraint passes a pointer to the argument into the
assembly block, which could potentially mutate it. However, this is not
guaranteed to work in the same way that the value barrier does: it depends on
whether the surface language (C++ or Rust) considers mutating through that
pointer to be UB.
Using the actual value barrier avoids this altogether.
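For the benchmarking use case itself, the black box typically wraps the result of the code under test, so that the whole loop is not optimized away (a sketch; encrypt_block is a hypothetical function being measured):
#include <chrono>
#include <cstdint>
#include <cstdio>

uint64_t encrypt_block(uint64_t block);  // Hypothetical function under test.

void bench() {
  auto start = std::chrono::steady_clock::now();
  for (uint64_t i = 0; i < 1'000'000; i++) {
    uint64_t out = encrypt_block(i);
    black_box(out);  // Force the result to count as "used".
  }
  auto elapsed = std::chrono::steady_clock::now() - start;
  std::printf("%lld ns for 1M blocks\n",
      (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
}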
Unintended Consequences
The value barrier works very well at blocking undesirable optimizations, but it obstructs a few important ones, making its use problematic in performance-sensitive scenarios.
A particularly notable one is automatic vectorization. Consider this function:
uint64_t sum(uint64_t* p, int n) {
n &= ~31; // Round n down to a multiple of 32.
uint64_t a = 0;
for (int i = 0; i < n; i++) {
a += p[i];
}
return a;
}
When we push this through Clang at -O3 -march=znver5, the loop gets vectorized
and unrolled: each iteration of the machine-code loop handles 32 elements
across four parallel AVX-512 additions.
_Z3sumPyi:
cmp esi, 32
jl .LBB1_1
mov eax, 6661
vpxor xmm0, xmm0, xmm0
xor ecx, ecx
vpxor xmm1, xmm1, xmm1
vpxor xmm2, xmm2, xmm2
vpxor xmm3, xmm3, xmm3
bextr eax, esi, eax
shl rax, 8
.LBB1_3:
vpaddq zmm0, zmm0, zmmword ptr [rdi + rcx]
vpaddq zmm1, zmm1, zmmword ptr [rdi + rcx + 64]
vpaddq zmm2, zmm2, zmmword ptr [rdi + rcx + 128]
vpaddq zmm3, zmm3, zmmword ptr [rdi + rcx + 192]
add rcx, 256
cmp rax, rcx
jne .LBB1_3
vpaddq zmm0, zmm1, zmm0
vpaddq zmm2, zmm3, zmm2
vpaddq zmm0, zmm2, zmm0
vextracti64x4 ymm1, zmm0, 1
vpaddq zmm0, zmm0, zmm1
vextracti128 xmm1, ymm0, 1
vpaddq xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 238
vpaddq xmm0, xmm0, xmm1
vmovq rax, xmm0
vzeroupper
ret
.LBB1_1:
xor eax, eax
ret
Pretty cool, right? But what if we need to stick a value barrier into the loaded values for some reason?
uint64_t sum(uint64_t* p, int n) {
n &= ~31; // Round n down to a multiple of 32.
uint64_t a = 0;
for (int i = 0; i < n; i++) {
a += value_barrier(p[i]);
}
return a;
}
Surprisingly, this seems to completely wreck vectorization, forcing each loop iteration to load only 8 bytes instead of 256.
_Z3sumPyi:
cmp esi, 32
jl .LBB2_1
mov eax, 6661
xor edx, edx
bextr ecx, esi, eax
xor eax, eax
shl rcx, 8
.LBB2_4:
mov rsi, qword ptr [rdi + rdx]
add rax, rsi
add rdx, 8
cmp rcx, rdx
jne .LBB2_4
ret
.LBB2_1:
xor eax, eax
ret
The reason for this is kind of nasty. From LLVM’s perspective, an inline assembly
block is a call asm instruction to a special function whose name is the
inline assembly string. Although the lack of volatile marks this function as
pure, making it reorderable, there is no way to tell LLVM that it can be merged.
If we try to unroll the loop 32 times, we wind up with 32 calls to a pure
function with the signature i64 -> i64, and LLVM doesn’t know that it’s safe
to merge them into an almost identical function call with a signature
<32 x i64> -> <32 x i64>. Because one of the loop operations cannot be
vectorized (namely, this inline assembly block), vectorization fails.
There is no workaround for this. Allowing the inline assembly block to use an
SSE register via the "+x" constraint doesn’t work, because the failure is
not in the assembly constraints: it’s that the call asm instruction does not
support vectorization¹.
Is This Really a Language Feature?
For all intents and purposes, this magic incantation is part of C++ (at least, the dialect that GCC and Clang implement, which are the only compilers that matter for security²).
This is because BoringSSL³, the most widely-deployed cryptography library in the world, relies on this trick. This feature is critical to the security posture of the two biggest orgs that fund LLVM: Google and Apple. Tools which break the value barrier have historically been quickly patched to respect it, usually after drawing the consternation of a professional cryptographer.
The benchmark black box is not quite as critical, but its correctness is a side-effect of the value barrier’s: namely, that the optimizer must never peek into an inline assembly block.
Of course, any professional cryptographer who has to implement constant-time primitives would tell you that this is a crummy workaround, and that we really need actual language-level intrinsics for manipulating values in a constant-time way. The tricky part is specifying semantics for them in a way that makes sure we don’t allow “bad” optimizations, a notoriously difficult problem in compiler IR design.
1. This could be fixed by making asm("") vectorizable, because it is, in the sense that a no-op is vectorizable. However, it’s not immediately clear to me if this is safe, even though it feels like it ought to be.
2. For example: all Internet browsers today are built with Clang, because Chromium only supports Clang builds, and Firefox and Safari use Clang because of Rust and Apple, respectively.
3. The only browsers not shipping BoringSSL are Firefox and Safari; Chromium and all of its derivatives use it. It is also deployed on every Android device (not just the most popular mobile OS: horrible little Android devices, like POS equipment, are everywhere), and, of course, within all of Google’s data centers. Being on every Android device is probably enough to make the claim of “most widely deployed”, but being on 90%+ of consumer equipment, by virtue of Google Chrome being so popular, definitely makes it enough.