x86 SIMD Backends (AVX2 / AVX-512)

VMAF's x86 SIMD paths vectorize the CPU implementations of the core features. Every core feature (VIF, ADM, Motion), plus several additional features (CAMBI, CIEDE, SSIM, MS-SSIM, float_moment), has both AVX2 and AVX-512 kernels under core/src/feature/x86/.

At runtime the dispatcher picks the widest ISA the host CPU supports: AVX-512 → AVX2 → scalar.

Build

AVX2 kernels are compiled unconditionally. AVX-512 kernels are gated by a Meson option because some older toolchains lack the required intrinsics:

meson setup build -Denable_avx512=true   # default is true on x86_64
ninja -C build

To disable the AVX-512 paths (useful when profiling scalar/AVX2 baselines or debugging a codegen regression):

meson setup build -Denable_avx512=false

Runtime selection

CPU dispatch happens in the feature extractor once per execution; there is no per-frame overhead. The choice can be inspected through the resolved VmafFeatureExtractor::init entry for each feature.
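The once-per-execution resolution described above can be sketched as a function pointer chosen at init time. This is a minimal illustration, not the project's code: `example_extractor`, `example_init`, and the `sum_*` kernels are hypothetical names, and the two SIMD kernels are stubs that forward to the scalar loop so the sketch stays portable. The flag values follow the bitmask layout given on this page.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values follow the bitmask layout on this page. */
enum { FLAG_AVX2 = 8, FLAG_AVX512 = 16 };

typedef enum { TIER_SCALAR, TIER_AVX2, TIER_AVX512 } example_tier;
typedef float (*sum_fn)(const float *src, size_t n);

static float sum_scalar(const float *src, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += src[i];
    return acc;
}

/* Stand-ins: a real build would implement these with intrinsics. */
static float sum_avx2_stub(const float *src, size_t n)   { return sum_scalar(src, n); }
static float sum_avx512_stub(const float *src, size_t n) { return sum_scalar(src, n); }

/* Hypothetical extractor state: the kernel is resolved once and reused. */
typedef struct {
    example_tier tier; /* which path was picked, for inspection */
    sum_fn sum;        /* resolved kernel, called per frame */
} example_extractor;

/* Start from scalar and let each wider ISA override the previous choice. */
static void example_init(example_extractor *fex, unsigned cpu_flags)
{
    assert(fex != NULL);
    fex->tier = TIER_SCALAR;
    fex->sum = sum_scalar;
    if (cpu_flags & FLAG_AVX2) {
        fex->tier = TIER_AVX2;
        fex->sum = sum_avx2_stub;
    }
    if (cpu_flags & FLAG_AVX512) {
        fex->tier = TIER_AVX512;
        fex->sum = sum_avx512_stub;
    }
}
```

After `example_init`, per-frame code calls `fex->sum(...)` directly, so no CPU feature test runs inside the frame loop.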

To force a lower ISA for A/B testing, pass the --cpumask flag. Each bit set in the mask disables the corresponding instruction set (see libvmaf.h):

./build/tools/vmaf --cpumask 16 ...     # disable AVX-512 (bit 4)
./build/tools/vmaf --cpumask 24 ...     # disable both AVX2 (8) and AVX-512 (16)
./build/tools/vmaf --cpumask 31 ...     # disable SSE2 through AVX-512 (bits 0-4); add 32 to also mask AVX-512-ICL

The full bitmask layout: SSE2 (1), SSE3/SSSE3 (2), SSE4.1 (4), AVX2 (8), AVX-512 (16), AVX-512-ICL (32).
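The mask arithmetic can be checked with a few lines of C. This sketch assumes the mask is applied as `detected & ~cpumask`, which is one reading of "disables bit-by-bit"; `apply_cpumask` and the `FLAG_*` names are hypothetical, so consult libvmaf.h for the real identifiers.

```c
#include <assert.h>

/* Values copied from the bitmask layout above; names are illustrative. */
enum {
    FLAG_SSE2 = 1,
    FLAG_SSE3_SSSE3 = 2,
    FLAG_SSE41 = 4,
    FLAG_AVX2 = 8,
    FLAG_AVX512 = 16,
    FLAG_AVX512_ICL = 32,
    FLAG_ALL = 63
};

/* Assumed semantics: every bit set in cpumask is cleared from the
 * set of detected CPU flags. */
static unsigned apply_cpumask(unsigned detected, unsigned cpumask)
{
    assert((detected & ~(unsigned)FLAG_ALL) == 0); /* only known bits */
    return detected & ~cpumask;
}
```

Under this assumption, a host with every flag set (63) and `--cpumask 24` keeps 39 (SSE2, SSE3/SSSE3, SSE4.1, and AVX-512-ICL), while `--cpumask 31` leaves only bit 32 set.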

Design notes

64-byte alignment. AVX-512 loads/stores prefer 64-byte-aligned input buffers. Picture allocations go through aligned_alloc(64, …); scratch buffers allocated inside feature extractors do the same.
Mask registers for loop tails. The AVX-512 kernels use k1-mask stores for width tails that aren't a multiple of 16/32 elements rather than falling back to scalar cleanup loops. This lowers loop overhead and avoids branch-predictor pressure at the tail.
FMA everywhere. Multiply-accumulate chains use vfmadd* / _mm512_fmadd_ps so that each multiply-add issues as a single µop, with better throughput than a separate mul+add sequence and one rounding step instead of two.
Frequency downclocking is obsolete on Zen 4/5 and recent Xeons. The AVX-512 power-license throttle that affected Skylake-X is not a concern on AMD Zen 4/5 or Sapphire Rapids+. We do not insert "warm-up" instructions; the first-frame penalty is negligible on target hardware.
One kernel per (feature, ISA). Naming is <feature>_<isa>.{c,h} (e.g. adm_avx512.c, vif_avx2.c); runtime dispatch is by CPU feature flags.
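The 64-byte alignment note can be illustrated with a small allocation helper. This is a sketch under assumptions: `round_up_64` and `alloc_row_f32` are hypothetical names, and the rounding step is there because C11 `aligned_alloc` expects the size to be a multiple of the alignment. The project's actual allocator may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SIMD_ALIGN 64u

/* Round a byte count up to the next multiple of 64. */
static size_t round_up_64(size_t n)
{
    return (n + SIMD_ALIGN - 1u) & ~(size_t)(SIMD_ALIGN - 1u);
}

/* Allocate one row of `width` floats on a 64-byte boundary.
 * Returns NULL on failure; release with free(). */
static float *alloc_row_f32(size_t width)
{
    assert(width > 0);
    size_t bytes = round_up_64(width * sizeof(float));
    return aligned_alloc(SIMD_ALIGN, bytes);
}
```

Rounding the size up also means a full 64-byte vector access at the last partial vector stays inside the allocation, which is a common reason to pad rows.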

Float VIF convolution (ADR-0504)

The separable Gaussian convolution at the heart of the float VIF path (vif_filter1d_s, _sq_s, _xy_s in core/src/feature/vif_tools.c) accounts for roughly 60 % of float model wall time. As of ADR-0504 the hot inner loops are dispatched to:

CPU support	Path	Inner loop width
AVX-512F	`convolution_f32_avx512_s` / `_sq` / `_xy`	16 floats per FMA
AVX2 only	`convolution_f32_avx_s` / `_sq` / `_xy`	8 floats per FMA
Scalar	`vif_filter1d_s` scalar fallback	1 float per iteration

The AVX-512 path lives in core/src/feature/common/convolution_avx512.c rather than under x86/ because the convolution helpers are shared by float_adm, float_motion, and several other extractors (not just VIF).

Numerical note. The AVX-512 path widens the FMA partial-sum tree from 8 to 16 lanes, which changes the rounding at the ULP level relative to the AVX2 path. ADR-0214 accepts this for the float VIF path because the float path already diverges from the integer path. Netflix golden assertions pass at their declared places tolerance regardless of ISA.
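The effect of lane width on rounding can be reproduced without intrinsics. The sketch below is a portable emulation, not the convolution kernel: `lane_sum_f32` is a hypothetical helper that keeps `lanes` independent float partial sums and then reduces them in order, which is the summation shape that changes when a kernel moves from 8 to 16 lanes.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n floats using `lanes` independent float partial sums, then
 * reduce the lanes in index order. lanes must be in 1..16. */
static float lane_sum_f32(const float *x, size_t n, size_t lanes)
{
    float acc[16] = {0.0f};
    assert(lanes >= 1 && lanes <= 16);
    for (size_t i = 0; i < n; i++)
        acc[i % lanes] += x[i];
    float total = 0.0f;
    for (size_t l = 0; l < lanes; l++)
        total += acc[l];
    return total;
}

/* Relative difference between two sums; b is assumed non-zero. */
static float rel_diff(float a, float b)
{
    float d = a > b ? a - b : b - a;
    float m = b < 0.0f ? -b : b;
    return d / m;
}
```

Calling `lane_sum_f32` on the same input with `lanes` set to 8 and then 16 adds the same values in a different association order, so the two results may differ in the last bits while staying well inside a decimal-places tolerance.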

float_moment AVX-512 (ADR-0987)

float_moment computes two per-frame reductions: the first statistical moment (mean) and the second (mean of squares) over a float-valued picture buffer. The AVX2 path processes 8 floats per inner-loop iteration; the AVX-512 path doubles this to 16 floats per iteration.

CPU support	Path	Inner loop width
AVX-512F	`compute_1st/2nd_moment_avx512`	16 floats per iteration
AVX2 only	`compute_1st/2nd_moment_avx2`	8 floats per iteration
Scalar	`compute_1st/2nd_moment`	1 float per iteration

Reduction strategy. Both functions load a __m512, store to a 64-byte aligned temporary, then add each of the 16 lanes sequentially into a double accumulator. This matches the scalar accumulation order more closely than a horizontal _mm512_reduce_add_ps would, keeping the numerical residual within the 1e-7 relative tolerance tested by test_moment_simd.
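The reduction strategy can be emulated in portable C to show why it tracks the scalar result. This is a sketch, not the kernel itself: the `memcpy` into `tmp` stands in for the `__m512` load and aligned store, both function names are hypothetical, and the scalar reference is assumed to accumulate in double in index order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LANES 16

/* Emulated AVX-512 shape: take 16 floats at a time, spill them to a
 * temporary, then add lane 0..15 in order into a double accumulator. */
static double first_moment_lanewise(const float *x, size_t n)
{
    double acc = 0.0;
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        float tmp[LANES];               /* stands in for the aligned temporary */
        memcpy(tmp, x + i, sizeof tmp); /* stands in for vector load + store */
        for (size_t l = 0; l < LANES; l++)
            acc += (double)tmp[l];
    }
    for (; i < n; i++)                  /* remaining elements */
        acc += (double)x[i];
    return n ? acc / (double)n : 0.0;
}

/* Assumed scalar reference: one double accumulator, index order. */
static double first_moment_scalar(const float *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)x[i];
    return n ? acc / (double)n : 0.0;
}
```

Because the lanes are added in index order, this emulation performs the same additions in the same order as the assumed scalar loop. A horizontal reduce would instead pair lanes in a tree and round differently.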

Adding a new SIMD path

Use the /add-simd-path skill — it scaffolds the <feature>_<isa>.{c,h} pair, the dispatch hook, and a golden-diff test against the scalar implementation.

Debugging a numerical divergence

If an AVX-512 path produces a different score than the scalar path, /cross-backend-diff narrows the delta to a specific feature and scale. Common root causes:

Float reduction order differs between scalar and SIMD — the fix is to accumulate in double precision inside the SIMD path. Example: commit 24c88a32 (float ADM sum_cube / csf_den_scale).
Unaligned tail load that wraps past the end of the allocation — usually a bug in the mask computation.
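For the second root cause, the tail mask is the value to check first. The sketch below emulates a 16-lane masked copy in portable C so the mask arithmetic can be tested on any host; `tail_mask16`, `masked_copy16`, and `copy_row` are hypothetical names. A real kernel would pass the same mask as a `__mmask16` to intrinsics such as `_mm512_maskz_loadu_ps` and `_mm512_mask_storeu_ps`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mask with the low `rem` bits set; rem is the number of valid
 * elements in the final partial vector (0..16). */
static uint16_t tail_mask16(size_t rem)
{
    assert(rem <= 16);
    return (uint16_t)((1u << rem) - 1u);
}

/* Emulated masked load + store: lanes whose mask bit is clear are
 * neither read from src nor written to dst. */
static void masked_copy16(float *dst, const float *src, uint16_t mask)
{
    for (unsigned l = 0; l < 16; l++)
        if (mask & (1u << l))
            dst[l] = src[l];
}

/* Copy a row of `width` floats: full vectors first, then one masked tail. */
static void copy_row(float *dst, const float *src, size_t width)
{
    size_t i = 0;
    for (; i + 16 <= width; i += 16)
        masked_copy16(dst + i, src + i, 0xFFFFu);
    if (i < width)
        masked_copy16(dst + i, src + i, tail_mask16(width - i));
}
```

A mask built from the wrong remainder, for example `width` instead of `width - i`, enables lanes beyond the row, which is how a tail access ends up past the end of the allocation.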

References

Intel Intrinsics Guide — https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
Agner Fog, Optimizing Assembly — https://www.agner.org/optimize/
x86 and amd64 instruction reference