x86 SIMD Backends (AVX2 / AVX-512)
VMAF's x86 SIMD paths vectorize the CPU implementations of the core features. All core features (VIF, ADM, Motion) and several additional features (CAMBI, CIEDE, SSIM, MS-SSIM, float_moment) have both AVX2 and AVX-512 kernels under core/src/feature/x86/.
At runtime the dispatcher picks the widest ISA the host CPU supports: AVX-512 → AVX2 → scalar.
Build
AVX2 kernels are compiled unconditionally. AVX-512 kernels are gated by a Meson option because some older toolchains lack the required intrinsics:
To disable the AVX-512 paths (useful when profiling scalar/AVX2 baselines or debugging a codegen regression):
Runtime selection
CPU dispatch happens in the feature extractor once per execution; there is no per-frame overhead. The choice can be inspected through the resolved VmafFeatureExtractor::init entry for each feature.
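The once-per-execution dispatch described above can be sketched as a function pointer resolved at init time. This is a minimal illustration, not the actual extractor code: feature_state, feature_init, row_sum_fn, and the three row_sum_* kernels are hypothetical names, and the two SIMD kernels are stand-ins that simply call the scalar routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: sum one row of floats. */
typedef double (*row_sum_fn)(const float *row, size_t n);

double row_sum_scalar(const float *row, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += row[i];
    return acc;
}

/* Stand-ins: real kernels would use AVX2 / AVX-512 intrinsics. */
double row_sum_avx2(const float *row, size_t n)   { return row_sum_scalar(row, n); }
double row_sum_avx512(const float *row, size_t n) { return row_sum_scalar(row, n); }

/* Hypothetical per-feature state holding the resolved kernel. */
typedef struct {
    row_sum_fn row_sum;
} feature_state;

/* Resolve the kernel once; per-frame code only calls s->row_sum. */
void feature_init(feature_state *s, int has_avx2, int has_avx512)
{
    assert(s != NULL);
    s->row_sum = row_sum_scalar;           /* always-available fallback */
    if (has_avx2)   s->row_sum = row_sum_avx2;
    if (has_avx512) s->row_sum = row_sum_avx512; /* widest ISA wins */
}
```

Because the pointer is written once in feature_init, the per-frame path pays only an indirect call and no CPU feature checks.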
To force a lower ISA for A/B testing, pass the --cpumask flag, which disables instruction sets bit-by-bit (see libvmaf.h):
./build/tools/vmaf --cpumask 16 ... # disable AVX-512 (bit 4)
./build/tools/vmaf --cpumask 24 ... # disable both AVX2 (8) and AVX-512 (16)
./build/tools/vmaf --cpumask 63 ... # disable everything down to scalar (all six bits)
The full bitmask layout: SSE2 (1), SSE3/SSSE3 (2), SSE4.1 (4), AVX2 (8), AVX-512 (16), AVX-512-ICL (32).
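As a rough illustration of how a mask value is composed from that layout, the sketch below ORs the documented bit values together and clears them from a detected flag set. The enum names and apply_cpumask are hypothetical (they are not taken from libvmaf.h), and the detected & ~cpumask rule is an assumption about how the mask is applied.

```c
#include <assert.h>

/* Bit values copied from the layout listed in this section;
 * the identifier names are illustrative only. */
enum isa_bit {
    ISA_SSE2       = 1,
    ISA_SSE3_SSSE3 = 2,
    ISA_SSE41      = 4,
    ISA_AVX2       = 8,
    ISA_AVX512     = 16,
    ISA_AVX512_ICL = 32,
    ISA_ALL        = 63   /* OR of the six bits above */
};

/* Assumed semantics: every bit set in cpumask is removed from the
 * detected flag set before dispatch. */
unsigned apply_cpumask(unsigned detected, unsigned cpumask)
{
    assert((cpumask & ~(unsigned)ISA_ALL) == 0); /* only documented bits */
    return detected & ~cpumask;
}
```

For example, ISA_AVX2 | ISA_AVX512 evaluates to 24, the value used above to disable both AVX2 and AVX-512.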
Design notes
- 64-byte alignment. AVX-512 loads/stores prefer 64-byte-aligned input buffers. Picture allocations go through aligned_alloc(64, …); scratch buffers allocated inside feature extractors do the same.
- Mask registers for loop tails. The AVX-512 kernels use k1-mask stores for width tails that aren't a multiple of 16/32 elements rather than falling back to scalar cleanup loops. This lowers loop overhead and avoids branch-predictor pressure at the tail.
- FMA everywhere. Multiply-accumulate chains use vfmadd*/_mm512_fmadd_ps to land the operation in a single µop with better throughput and better precision than the separate mul+add sequence.
- Frequency downclocking is obsolete on Zen 4/5 and recent Xeons. The AVX-512 power-license throttle that affected Skylake-X is not a concern on AMD Zen 4/5 or Sapphire Rapids+. We do not insert "warm-up" instructions; the first-frame penalty is negligible on target hardware.
- One kernel per (feature, ISA). Naming is <feature>_<isa>.{c,h} (e.g. adm_avx512.c, vif_avx2.c); runtime dispatch is by CPU feature flags.
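The alignment and tail-mask notes above can be sketched in portable C. The helper names (round_up_64, alloc_row_f32, tail_mask16) are hypothetical and do not come from the VMAF sources; the sketch only shows the arithmetic, assuming 16 floats per 512-bit vector. The resulting 16-bit value is the kind of mask that intrinsics such as _mm512_maskz_loadu_ps and _mm512_mask_storeu_ps take as their __mmask16 argument.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round a byte count up to the next multiple of 64. C11 aligned_alloc
 * expects the size to be a multiple of the alignment. */
size_t round_up_64(size_t bytes)
{
    return (bytes + 63u) & ~(size_t)63u;
}

/* Allocate a 64-byte-aligned row of `width` floats (caller frees). */
float *alloc_row_f32(size_t width)
{
    assert(width > 0);
    return aligned_alloc(64, round_up_64(width * sizeof(float)));
}

/* Lane mask for the final partial vector of a row: one bit per valid
 * float. Returns 0 when width is a multiple of 16 (no tail to process). */
uint16_t tail_mask16(size_t width)
{
    unsigned rem = (unsigned)(width % 16u);
    return (uint16_t)((1u << rem) - 1u);
}
```

A row of 35 floats, for instance, has two full vectors and a three-element tail, so the tail mask is 0x0007 and lanes 3 to 15 are never read or written.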
Float VIF convolution (ADR-0504)
The separable Gaussian convolution at the heart of the float VIF path (vif_filter1d_s, _sq_s, _xy_s in core/src/feature/vif_tools.c) accounts for roughly 60 % of float model wall time. As of ADR-0504 the hot inner loops are dispatched to:
| CPU support | Path | Inner loop width |
|---|---|---|
| AVX-512F | convolution_f32_avx512_s / _sq / _xy | 16 floats per FMA |
| AVX2 only | convolution_f32_avx_s / _sq / _xy | 8 floats per FMA |
| Scalar | vif_filter1d_s scalar fallback | 1 float per iteration |
The AVX-512 path lives in core/src/feature/common/convolution_avx512.c rather than under x86/ because the convolution helpers are shared by float_adm, float_motion, and several other extractors (not just VIF).
Numerical note. The AVX-512 path widens the FMA partial-sum tree from 8 to 16 lanes, which changes the rounding at the ULP level relative to the AVX2 path. This is accepted for the float VIF path per ADR-0214: the float path already diverges from the integer path. Netflix golden assertions pass at their declared places tolerance regardless of ISA.
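To see why the lane count affects rounding, the portable model below keeps a configurable number of independent float partial sums, which is roughly how a vector accumulator orders its additions. dot_lanes is a hypothetical helper, not a VMAF function, and it sums the partials sequentially at the end for brevity, whereas a real kernel would typically reduce them in a tree.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with `lanes` independent float partial sums. Element i is
 * accumulated into partial[i % lanes], as a vector of that width would. */
float dot_lanes(const float *a, const float *b, size_t n, size_t lanes)
{
    float partial[16] = {0.0f};   /* supports up to 16 lanes */
    float total = 0.0f;

    assert(lanes >= 1 && lanes <= 16);
    for (size_t i = 0; i < n; i++)
        partial[i % lanes] += a[i] * b[i];
    for (size_t k = 0; k < lanes; k++)   /* sequential final reduction */
        total += partial[k];
    return total;
}
```

With lanes set to 8 and 16 the same inputs are grouped into different partial sums, so the float results can differ in the last bits even though both are valid roundings of the same exact value.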
float_moment AVX-512 (ADR-0987)
float_moment computes two per-frame reductions: the first statistical moment (mean) and the second (mean of squares) over a float-valued picture buffer. The AVX2 path processes 8 floats per inner-loop iteration; the AVX-512 path doubles this to 16 floats per iteration.
| CPU support | Path | Inner loop width |
|---|---|---|
| AVX-512F | compute_1st/2nd_moment_avx512 | 16 floats per iteration |
| AVX2 only | compute_1st/2nd_moment_avx2 | 8 floats per iteration |
| Scalar | compute_1st/2nd_moment | 1 float per iteration |
Reduction strategy. Both functions load a __m512, store it to a 64-byte-aligned temporary, then add each of the 16 lanes sequentially into a double accumulator. This matches the scalar accumulation order more closely than a horizontal _mm512_reduce_add_ps would, which keeps the numerical residual within the 1e-7 relative tolerance tested by test_moment_simd.
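The accumulation order of this strategy can be modelled without intrinsics. In the sketch below, memcpy into an aligned temporary stands in for the vector load plus aligned store, and the function names are hypothetical. Two details are assumptions rather than facts from the source: the square is formed in double precision, and the tail is handled with a scalar loop for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LANES 16   /* floats in one 512-bit vector */

/* Mean of the buffer, accumulated lane by lane into a double. */
double first_moment_model(const float *pic, size_t n)
{
    _Alignas(64) float tmp[LANES];
    double acc = 0.0;
    size_t i = 0;

    assert(n > 0);
    for (; i + LANES <= n; i += LANES) {
        memcpy(tmp, pic + i, sizeof tmp);     /* models load + aligned store */
        for (size_t k = 0; k < LANES; k++)
            acc += tmp[k];                    /* same order as a scalar loop */
    }
    for (; i < n; i++)                        /* tail (assumed scalar here) */
        acc += pic[i];
    return acc / (double)n;
}

/* Mean of squares, using the same lane-sequential accumulation. */
double second_moment_model(const float *pic, size_t n)
{
    _Alignas(64) float tmp[LANES];
    double acc = 0.0;
    size_t i = 0;

    assert(n > 0);
    for (; i + LANES <= n; i += LANES) {
        memcpy(tmp, pic + i, sizeof tmp);
        for (size_t k = 0; k < LANES; k++)
            acc += (double)tmp[k] * (double)tmp[k];
    }
    for (; i < n; i++)
        acc += (double)pic[i] * (double)pic[i];
    return acc / (double)n;
}
```

Since every element is added to the double accumulator in index order, the model visits values in the same sequence as a plain scalar loop over the buffer.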
Adding a new SIMD path
Use the /add-simd-path skill — it scaffolds the <feature>_<isa>.{c,h} pair, the dispatch hook, and a golden-diff test against the scalar implementation.
Debugging a numerical divergence
If an AVX-512 path produces a different score than the scalar path, /cross-backend-diff narrows the delta to a specific feature and scale. Common root causes:
- Float reduction order differs between scalar and SIMD. The fix is to accumulate in double precision inside the SIMD path. Example: commit 24c88a32 (float ADM sum_cube/csf_den_scale).
- Unaligned tail load that runs past the end of the allocation, usually caused by a bug in the mask computation.
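The first root cause is easy to reproduce in isolation. The sketch below, with hypothetical helper names, sums the same float buffer once into a float accumulator and once into a double accumulator; it is a generic demonstration and is not taken from the commit mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate in float: small addends can be lost once the running sum
 * is large, so the result depends on the order of the additions. */
float sum_f32(const float *x, size_t n)
{
    float acc = 0.0f;
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Accumulate in double: the extra mantissa bits absorb the small
 * addends, which makes the result far less order-sensitive. */
double sum_f64(const float *x, size_t n)
{
    double acc = 0.0;
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acc += (double)x[i];
    return acc;
}
```

For a buffer holding 1e8f followed by one hundred 1.0f values, sum_f64 returns exactly 100000100, while sum_f32 falls short of that on targets that evaluate float arithmetic in single precision, because each added 1.0f is below half the spacing between adjacent floats near 1e8.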
References
- Intel Intrinsics Guide — https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
- Agner Fog, Optimizing Assembly — https://www.agner.org/optimize/
- x86 and amd64 instruction reference