ADR-0987: AVX-512 path for float_moment feature extractor

Status: Accepted
Date: 2026-06-03
Deciders: Lusoris
Tags: simd, avx512, performance, float_moment, fork-local

Context

The .workingdir2 SIMD-coverage audit identified float_moment as one of three x86 features still lacking an AVX-512 implementation. The AVX2 path (moment_avx2.c, ADR-0179) processes 8 floats per iteration in 256-bit YMM registers, so on CPUs that expose VMAF_X86_CPU_FLAG_AVX512 its inner loop uses only half of the available 512-bit ZMM width.

float_moment computes two simple reductions per frame: the first and second statistical moments of a float-valued picture. Each pixel contributes to the sum independently of its neighbours, which makes the feature a good candidate for lane widening: going from 8 to 16 floats per loop body halves the number of main-loop iterations, so the CPU issues one wider load, multiply, and store sequence where the AVX2 path issues two.
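For orientation, the two reductions can be sketched in scalar C as below. This is a minimal illustration, not the repository's reference code: the name sketch_moment, its signature, the stride expressed in floats, and the reading of "second moment" as the raw (non-central) mean of squared pixels are all assumptions.

```c
#include <stddef.h>

/*
 * Hypothetical scalar sketch of the two per-frame reductions.
 * order == 1: mean of the pixels; order == 2: mean of the squared pixels
 * (raw second moment, assumed). stride_floats is the row pitch counted in
 * floats, not bytes. The picture is assumed non-empty (w * h > 0).
 */
static double sketch_moment(const float *pic, size_t w, size_t h,
                            size_t stride_floats, int order)
{
    double acc = 0.0; /* double accumulator, pixels added in raster order */

    for (size_t y = 0; y < h; y++) {
        const float *row = pic + y * stride_floats;
        for (size_t x = 0; x < w; x++) {
            float p = row[x];
            /* The square is formed in float, then widened to double. */
            acc += (order == 2) ? (double)(p * p) : (double)p;
        }
    }
    return acc / (double)(w * h);
}
```

For a 2x2 picture holding 1, 2, 3, 4 this yields 2.5 for the first moment and 7.5 for the second.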

The feature was wired with an AVX2 gate in float_moment.c but no AVX-512 branch. This ADR closes that gap.

Decision

We will add core/src/feature/x86/moment_avx512.c and the corresponding header, following the same sequential-lane-widening accumulation pattern established by float_psnr_avx512.c: load 16 floats into a ZMM, store to a 64-byte aligned temporary, then add each lane sequentially into a double accumulator. The dispatch in float_moment.c is extended with a HAVE_AVX512 guard that selects the 16-lane path when VMAF_X86_CPU_FLAG_AVX512 is set, overriding the AVX2 selection.
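The accumulation pattern and the runtime selection described above could look roughly like the following for a single row of the first moment. It is a sketch under stated assumptions: the sketch_* names are hypothetical, the GCC/Clang target attribute and __builtin_cpu_supports stand in for the project's HAVE_AVX512 guard and VMAF_X86_CPU_FLAG_AVX512 check, and the real moment_avx512.c may be organised differently.

```c
#include <stddef.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_X86_AVX512 1
#else
#define SKETCH_X86_AVX512 0
#endif

/* Scalar reference: every pixel is added to a double in index order. */
static double sketch_sum_scalar(const float *px, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += px[i];
    return acc;
}

#if SKETCH_X86_AVX512
/*
 * 16-lane variant: load 16 floats into a ZMM register, spill them to a
 * 64-byte aligned temporary, then add the lanes one by one into the same
 * double accumulator. The additions happen in the same order as in the
 * scalar loop, so the result matches it.
 */
__attribute__((target("avx512f")))
static double sketch_sum_avx512(const float *px, size_t n)
{
    float tmp[16] __attribute__((aligned(64)));
    double acc = 0.0;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_loadu_ps(px + i); /* unaligned 16-float load */
        _mm512_store_ps(tmp, v);            /* aligned store to the temporary */
        for (int k = 0; k < 16; k++)
            acc += tmp[k];
    }
    for (; i < n; i++) /* scalar tail for the last n % 16 pixels */
        acc += px[i];
    return acc;
}
#endif

/* Stand-in for the dispatch: prefer the 16-lane path when the CPU has it. */
static double sketch_sum_dispatch(const float *px, size_t n)
{
#if SKETCH_X86_AVX512
    if (__builtin_cpu_supports("avx512f"))
        return sketch_sum_avx512(px, n);
#endif
    return sketch_sum_scalar(px, n);
}
```

A second-moment variant would presumably square the vector with _mm512_mul_ps(v, v) before the store. On hosts without AVX-512, or on non-x86 targets, the sketch falls back to the scalar loop.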

The existing test_moment_simd.c is extended with four AVX-512 parity cases that compare compute_1st_moment_avx512 and compute_2nd_moment_avx512 against their scalar references, using the same MOMENT_REL_TOL = 1e-7 relative tolerance as the AVX2 tests. The cases are guarded by simd_test_have_avx512(), so they skip cleanly on hosts without AVX-512.
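A relative-tolerance parity check of this kind might be written as follows. The helper name, the choice of the larger magnitude as the scale, and the treatment of two zeros as equal are illustrative assumptions; the actual comparison in test_moment_simd.c and the shared harness may differ.

```c
#include <stdbool.h>

/* Same value as the MOMENT_REL_TOL quoted above; the macro name is hypothetical. */
#define SKETCH_MOMENT_REL_TOL 1e-7

static double sketch_abs(double x)
{
    return x < 0.0 ? -x : x;
}

/*
 * Returns true when `got` is within `rel_tol` of `ref`, relative to the
 * larger of the two magnitudes. Two exact zeros compare equal.
 */
static bool sketch_rel_close(double ref, double got, double rel_tol)
{
    double a = sketch_abs(ref);
    double b = sketch_abs(got);
    double scale = a > b ? a : b;

    if (scale == 0.0)
        return true;
    return sketch_abs(ref - got) <= rel_tol * scale;
}
```

A parity case would then compute the scalar and SIMD scores for the same picture and pass only if sketch_rel_close(scalar, simd, SKETCH_MOMENT_REL_TOL) holds, after first returning early when the AVX-512 availability check fails.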

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Horizontal `_mm512_reduce_add_ps` for the full row at once | Fewer store instructions; may expose more ILP to the OOO engine | Produces a different reduction order than the scalar reference, breaking the `MOMENT_REL_TOL` contract and requiring a separate tolerance justification | Tolerance-bounded-but-not-bit-exact is the documented contract (ADR-0179); the sequential lane-widening pattern used by `float_psnr_avx512.c` stays within the existing contract without introducing a new gate |
| Separate `test_moment_avx512_parity` test file (new file, mirroring `test_motion_avx512_parity.c`) | Matches the motion test's file-per-ISA layout | Adds a new executable to wire into `meson.build`, although the existing `test_moment_simd.c` already covers both x86 and AArch64 in a single file | Adding four cases to the existing file is simpler, less disruptive, and matches the AVX2 addition pattern from ADR-0179 |
| Skip the AVX-512 path entirely (leave AVX2 as the ceiling) | No new code | Leaves a known coverage gap on AVX-512-capable CPUs such as ICX and SPR; contradicts the `.workingdir2` SIMD audit directive | Gap closure is the explicit driver for this PR |
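The reduction-order concern in the first row can be illustrated with a portable sketch that contrasts sequential double accumulation with a float halving tree. The tree below only imitates the style of a horizontal reduce; the exact order produced by `_mm512_reduce_add_ps` depends on the compiler's expansion, so the halving order here is an assumption, and the input values are deliberately extreme rather than typical pixel data.

```c
#include <stddef.h>

/* Sequential order: one double accumulator, lanes added in index order. */
static double sketch_sum_sequential(const float lanes[16])
{
    double acc = 0.0;
    for (size_t i = 0; i < 16; i++)
        acc += lanes[i];
    return acc;
}

/*
 * Halving tree in single precision: lane i absorbs lane i + width for
 * width = 8, 4, 2, 1. Every partial sum is rounded to float, and the
 * pairing differs from the sequential order above.
 */
static float sketch_sum_tree(const float lanes[16])
{
    float t[16];

    for (size_t i = 0; i < 16; i++)
        t[i] = lanes[i];
    for (size_t width = 8; width >= 1; width /= 2)
        for (size_t i = 0; i < width; i++)
            t[i] += t[i + width];
    return t[0];
}
```

With lanes 0, 1, 8, 9 set to 1e8, -1e8, 1, 1 and all other lanes zero, the sequential sum is 2.0 while the float tree returns 0.0, because each 1 is absorbed when it is first added to a value of magnitude 1e8. Real frames are far better conditioned, but the example shows why a changed reduction order needs its own tolerance argument.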

Consequences

Positive: float_moment on AVX-512 CPUs processes 16 floats per inner-loop iteration instead of 8, roughly halving main-loop overhead for HD/UHD frames. The dispatch layer transparently selects the best available path at runtime, so no user-visible API change is needed.
Negative: One additional .c / .h file pair in core/src/feature/x86/, adding minor build-time overhead. The x86_avx512_static_lib grows by one TU.
Neutral / follow-ups:
- psnr_hvs and ssimulacra2_host were identified in the same SIMD audit; they are deferred to separate PRs (each requires its own AVX-512 lane-width analysis and tolerance-bound justification).
- The test_moment_simd executable gains four new test cases; CI wall-time impact is negligible (pure arithmetic, no I/O).

References

ADR-0179: float_moment AVX2 path (original coverage)
ADR-0245: simd_bitexact_test.h shared harness
ADR-0854: simd_test_have_avx512() addition to the shared harness
float_psnr_avx512.c: precedent for sequential-lane accumulation pattern
.workingdir2 SIMD-coverage audit: three x86 gaps identified
req: user directive — implement the AVX-512 path for float_moment