ADR-0179: float_moment SIMD parity (AVX2 + NEON)
- Status: Accepted
- Date: 2026-04-26
- Deciders: Lusoris, Claude (Anthropic)
- Tags: simd, x86, arm64, feature-extractor, fork-local
Context
The fork's SIMD coverage matrix (.workingdir2/analysis/metrics-backends-matrix.md) flagged float_moment as the only fully-scalar row remaining after T5-1c closed. The extractor produces four scores (float_moment_{ref,dis}{1st,2nd}) via two trivial reductions over the picture: a sum and a sum-of-squares. Every other float-domain extractor in the fork (ANSNR, MS-SSIM decimate, PSNR-HVS, MOTION, SSIM, SSIMULACRA 2) already has at least AVX2 + NEON paths. Closing the row removes the last scalar bottleneck on the float side of the matrix and keeps the SIMD-coverage backlog (T7-19..T7-25) moving in priority order.
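For orientation, the two reductions can be sketched in scalar C as below. This is a minimal illustration, not the fork's moment.c: the function names, the stride counted in floats, and the normalisation by w * h are assumptions made for the sketch.

```c
#include <stddef.h>

/* Sketch of the first moment: sum of pixels, normalised by pixel count.
 * Assumption: stride is counted in floats, not bytes. */
static double moment_1st_scalar(const float *pic, int w, int h, size_t stride)
{
    double sum = 0.0;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < w; j++)
            sum += pic[i * stride + j];
    return sum / ((double)w * h);
}

/* Sketch of the second moment: sum of squares, normalised by pixel count.
 * The square is taken in float and only the accumulation is in double. */
static double moment_2nd_scalar(const float *pic, int w, int h, size_t stride)
{
    double sum = 0.0;
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j++) {
            const float p = pic[i * stride + j];
            const float sq = p * p; /* float multiply */
            sum += sq;              /* double accumulate */
        }
    }
    return sum / ((double)w * h);
}
```

Running both functions once on the reference picture and once on the distorted picture would give the four scores named above.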
Decision
We add AVX2 (8-wide) and NEON (4-wide) implementations of compute_1st_moment / compute_2nd_moment under core/src/feature/{x86,arm64}/moment_*.{c,h} and dispatch to them from float_moment.c via function pointers selected at init() from vmaf_get_cpu_flags(). Both paths follow the ansnr_avx2.c pattern: square in float (matching the scalar reference), then accumulate into double, either via a scattered tmp buffer (AVX2) or via lane-pair widening (vcvt_f64_f32, NEON). The contract is tolerance-bounded, not bit-exact: the residual was confirmed below 1e-7 relative on representative inputs, roughly three orders of magnitude tighter than the 1e-4 scale of the production snapshot gate's places=4. Tests: a new test_moment_simd runs four cases per arch (two random seeds, an aligned width, and a tiny edge case to exercise the per-row tail).
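The shape of this decision can be sketched in portable C. The block below is an emulation rather than the fork's kernels: it uses no intrinsics, and SKETCH_CPU_FLAG_WIDE, the function names, and the float-counted stride are hypothetical stand-ins. It shows the float square, the per-lane double accumulators, the per-row tail, and the function-pointer selection.

```c
#include <stddef.h>

typedef double (*moment_fn)(const float *pic, int w, int h, size_t stride);

/* Hypothetical flag bit for this sketch; the fork reads real flags
 * from vmaf_get_cpu_flags(). */
#define SKETCH_CPU_FLAG_WIDE 1u
#define LANES 8 /* mirrors the 8-wide AVX2 path */

/* Scalar reference: float square, double accumulate, pixel by pixel. */
static double second_moment_scalar(const float *pic, int w, int h, size_t stride)
{
    double sum = 0.0;
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j++) {
            const float sq = pic[i * stride + j] * pic[i * stride + j];
            sum += sq;
        }
    }
    return sum / ((double)w * h);
}

/* Portable stand-in for a wide kernel: squares LANES pixels in float,
 * parks them in a temporary buffer, then adds each to its own double
 * accumulator. The summation order differs from the scalar loop, which
 * is why the contract is a tolerance and not bit-exactness. */
static double second_moment_lanes(const float *pic, int w, int h, size_t stride)
{
    double acc[LANES] = {0.0};
    double tail = 0.0;
    for (int i = 0; i < h; i++) {
        const float *row = pic + i * stride;
        int j = 0;
        for (; j + LANES <= w; j += LANES) {
            float tmp[LANES];
            for (int k = 0; k < LANES; k++)
                tmp[k] = row[j + k] * row[j + k];
            for (int k = 0; k < LANES; k++)
                acc[k] += tmp[k];
        }
        for (; j < w; j++) { /* per-row tail: fewer than LANES pixels left */
            const float sq = row[j] * row[j];
            tail += sq;
        }
    }
    double sum = tail;
    for (int k = 0; k < LANES; k++)
        sum += acc[k];
    return sum / ((double)w * h);
}

/* Chosen once at init time; callers only ever see the pointer. */
static moment_fn select_second_moment(unsigned flags)
{
    return (flags & SKETCH_CPU_FLAG_WIDE) ? second_moment_lanes
                                          : second_moment_scalar;
}
```

A width that is not a multiple of LANES (for example 19) exercises both the wide loop and the tail, which is presumably what the tiny edge case in test_moment_simd targets.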
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Bit-exact lane-by-lane reduction (mirror scalar order verbatim) | Byte-equal output | Ignores the established fork pattern (ANSNR, MOTION, MS-SSIM all already accept tolerance); kernel author flagged in moment_avx2.c header that the residual is "well inside the snapshot gate's tolerance" | Pattern parity wins — bit-exactness is unnecessary at places=4 and forecloses obvious wins (lane reductions, FMA on AArch64) |
| AVX-512 path now | Future-proof on Sapphire Rapids / Zen 5 | The 1st/2nd moment are memory-bound on a 4096-pixel reduction; AVX-512 wouldn't deliver a measurable win over AVX2 here | Deferred to T7-21 if profiling motivates it — for now the AVX2 path closes the matrix gap |
| Skip the gap-fill — moment is rarely the hot path | Less code | The matrix audit goal is no scalar rows, both for correctness signal (a SIMD path catches new compiler regressions in the scalar) and for snapshot stability across builds | Gap-fill is part of the audit programme — leaving the row scalar contradicts the matrix's purpose |
Consequences
- Positive: SIMD-coverage matrix now has zero fully-scalar float rows. New test_moment_simd exercises both AVX2 and NEON paths against the scalar reference on every arch-matching CI lane. End-to-end CLI output unchanged at JSON %g precision (verified locally: the diff between the AVX2 dispatcher and the --cpumask scalar fallback rounds to zero).
- Negative: One more dispatch site to maintain when CPU-flag wiring changes; one more pair of files to keep in sync with the scalar reference if Netflix ever updates moment.c.
- Neutral / follow-ups: T7-21 (psnr_hvs AVX-512 audit) and T7-20 (integer_ansnr SIMD parity) remain the next gaps in priority order. No model retraining required, because the extractor's numerical contract is unchanged.
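A tolerance-bounded comparison against the scalar reference can be sketched as a relative-error check. The helper names below are hypothetical and the zero-reference fallback is an assumption of this sketch; the actual assertions in test_moment_simd may differ.

```c
#include <stdbool.h>

/* Local absolute value so the sketch needs no math library. */
static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* True when got is within tol of ref, measured relative to |ref|.
 * Assumption: when ref is exactly zero, fall back to an absolute bound. */
static bool within_rel_tol(double ref, double got, double tol)
{
    const double diff = abs_d(got - ref);
    const double scale = abs_d(ref);
    return scale == 0.0 ? diff <= tol : diff <= tol * scale;
}
```

A parity test would call within_rel_tol(scalar_score, simd_score, 1e-7) for each case, using the 1e-7 relative bound quoted in the Decision section.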
References
- Source: per .workingdir2/analysis/metrics-backends-matrix.md ordering and user direction on 2026-04-26 ("start filling the gaps of metrics-backends-matrix.md (meaning implementing the gaps)").
- Backlog row: T7-19 in .workingdir2/BACKLOG.md.
- Pattern parent: ADR-0125 (MS-SSIM decimate SIMD), ADR-0159 (PSNR-HVS AVX2), ADR-0161 (SSIMULACRA 2 SIMD).
- Related: req (see commit footnote and PR description).