ADR-0844: float_adm AVX2/AVX-512 F2+F3 - double-precision and FP-contraction
- Status: Accepted
- Date: 2026-05-29
- Deciders: Lusoris, Claude (Anthropic)
- Tags: simd, bit-exactness, avx2, avx512, float_adm, build
Context
The fork ships AVX2 and AVX-512 SIMD paths for float_adm at core/src/feature/x86/float_adm_avx2.c and core/src/feature/x86/float_adm_avx512.c. Two defects were identified during an AVX2 audit of the ADM reduction functions against the scalar reference in core/src/feature/float_adm.c:
F2 — Reduction order mismatch (store-to-temp + scalar loop). float_adm_csf_den_scale_avx2, float_adm_sum_cube_avx2, and their AVX-512 twins used the pattern:
```c
_Alignas(32) float tmp[8];
_mm256_store_ps(tmp, vcube);
for (int k = 0; k < 8; k++)
    row_accum += (double)tmp[k];
```
This stores 8 float lanes into a stack array, then accumulates them one-by-one into a double running sum. The cast (double)tmp[k] widens each lane individually before addition, which is numerically correct for each lane, but the accumulation order (lane 0 to lane 7, left-to-right) differs from what a widening-then-horizontal-reduction would produce. The scalar reference in float_adm.c uses adm_sum_cube_s / adm_csf_den_scale_s, which accumulate via a C loop with double intermediates; the lane ordering in the SIMD path did not match the scalar contract required by ADR-0139.
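To make the ordering point concrete, here is a minimal scalar sketch (no intrinsics) of the two reduction orders discussed above. The function names are hypothetical and do not exist in the tree; the sketch only illustrates that adding the same widened lanes in a different order can change the double result.

```c
#include <stddef.h>

/* Sequential order: widen each lane and add it to one running double,
 * lane 0 first. This mirrors the store-to-temp loop shown above. */
double sum_sequential(const float *lanes, size_t n)
{
    double acc = 0.0;
    for (size_t k = 0; k < n; k++)
        acc += (double)lanes[k];
    return acc;
}

/* Grouped order: widen 4 lanes, add them pairwise as (l0 + l1) + (l2 + l3),
 * then add the group sums left to right. n is assumed to be a multiple of 4. */
double sum_grouped(const float *lanes, size_t n)
{
    double total = 0.0;
    for (size_t k = 0; k < n; k += 4) {
        double pair_lo = (double)lanes[k] + (double)lanes[k + 1];
        double pair_hi = (double)lanes[k + 2] + (double)lanes[k + 3];
        total += pair_lo + pair_hi;
    }
    return total;
}
```

With the lanes {0x1p53f, 1, 1, 1, -0x1p53f, 1, 1, 1} and IEEE 754 binary64 evaluation of double (as on x86-64), sum_sequential returns 3.0 and sum_grouped returns 5.0, because 2^53 + 1 rounds back to 2^53 in the running sum. Real ADM inputs are far less extreme, but the same mechanism can flip a last bit.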
F3: Auto-FMA inconsistency in the DWT2 vertical pass. Both float_adm_avx2.c and float_adm_avx512.c are compiled with -mfma (inherited from the shared x86_avx2_static_lib build target). With -mfma enabled, the compiler is permitted to auto-fuse adjacent _mm256_mul_ps + _mm256_add_ps pairs into FMA instructions, producing a single-rounded result. The scalar reference in float_adm.c is compiled without -mfma, so the same source pattern produces a two-rounding mul + add. This creates a latent bit-level divergence between the SIMD and scalar paths on any compiler that performs auto-contraction. GCC and Clang can both do so: -mfma only makes the FMA instructions available, while -ffp-contract decides whether they are used, and GCC defaults to -ffp-contract=fast outside strict ISO C mode. The precedent for isolating this class of defect is already in tree: ssimulacra2 uses a per-TU static library with -ffp-contract=off for the same reason (ADR-0594 documents the HIP analogue; the x86 ssimulacra2 carve-out is in core/src/meson.build lines 694–705 as of master).
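The one-rounding versus two-rounding difference can be reproduced in plain C. The sketch below is illustrative only: both function names are hypothetical, and the fused result is emulated in double rather than produced by an FMA instruction, so it runs on any target.

```c
/* Two roundings: the product is rounded to float, then the sum is rounded.
 * The volatile store keeps the compiler from contracting the pair itself. */
float mul_then_add(float a, float b, float c)
{
    volatile float prod = a * b;
    return prod + c;
}

/* One rounding, emulated without FMA hardware: the product of two floats is
 * exact in double, and for the inputs used here the double sum is exact too,
 * so the final conversion to float is the only rounding step. */
float single_rounded(float a, float b, float c)
{
    double exact = (double)a * (double)b + (double)c;
    return (float)exact;
}
```

For a = b = 1.0f + 0x1p-12f and c = -(1.0f + 0x1p-11f), the exact product is 1 + 2^-11 + 2^-24. Rounding it to float drops the 2^-24 term, so mul_then_add returns 0.0f, while single_rounded returns 0x1p-24f. A compiler that contracts the mul + add pair in one TU but not in the other produces exactly this kind of mismatch.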
Both defects are in-scope under ADR-0139 (SIMD float paths must match scalar bit-for-bit) and ADR-0108 (architectural decision requires this ADR).
Decision
F2 fix: Replace the store-to-temp + scalar-accumulate loop with a widening-to-double approach that is free of float-precision intermediates and matches the scalar reference's double-precision reduction contract:
- AVX2: use _mm256_cvtps_pd on the low 128-bit half (_mm256_castps256_ps128) and the high 128-bit half (_mm256_extractf128_ps(v, 1)) to produce two __m256d vectors of 4 doubles each. Sum each group with hadd_pd4() (a _mm256_hadd_pd + _mm256_extractf128_pd helper), then add the two group sums. No float-precision intermediate store occurs.
- AVX-512: use hsum_ps_to_double(), which extracts four 4-float sub-vectors via _mm512_extractf32x4_ps, widens each to __m256d with _mm256_cvtps_pd, and reduces with hadd_pd4. The four group sums are added left-to-right.
Both helper functions (hadd_pd4 in each TU, hsum_ps_to_double in the AVX-512 TU) are static inline, with NOLINTNEXTLINE suppression citing ADR-0139 to document why outline-calling is unsafe (register allocation changes rounding order on some ABIs).
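The AVX2 steps above can be sketched as follows. This is an assumption-laden illustration, not the code in float_adm_avx2.c: the names hadd_pd4_sketch and hsum8_ps_to_double_sketch are hypothetical, the unaligned load is chosen for convenience, and the target attribute (GCC/Clang only) replaces the ISA flags that the real TU receives from the build system.

```c
#include <immintrin.h>

/* GCC/Clang attribute so the sketch builds without -mavx on the command line. */
#define SKETCH_AVX __attribute__((target("avx")))

/* Hypothetical stand-in for hadd_pd4: returns (v0 + v1) + (v2 + v3). */
SKETCH_AVX static inline double hadd_pd4_sketch(__m256d v)
{
    __m256d pairs = _mm256_hadd_pd(v, v);         /* [v0+v1, v0+v1, v2+v3, v2+v3] */
    __m128d lo = _mm256_castpd256_pd128(pairs);   /* v0+v1 in lane 0 */
    __m128d hi = _mm256_extractf128_pd(pairs, 1); /* v2+v3 in lane 0 */
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi));
}

/* Widen 8 float lanes to two groups of 4 doubles and reduce in double.
 * No float-precision value is stored between the load and the final sum. */
SKETCH_AVX double hsum8_ps_to_double_sketch(const float *src)
{
    __m256 v = _mm256_loadu_ps(src);
    __m256d lo = _mm256_cvtps_pd(_mm256_castps256_ps128(v));
    __m256d hi = _mm256_cvtps_pd(_mm256_extractf128_ps(v, 1));
    return hadd_pd4_sketch(lo) + hadd_pd4_sketch(hi);
}
```

Calling hsum8_ps_to_double_sketch on the lanes 1.0f through 8.0f returns 36.0. The addition order is fixed by the helper: pairwise within each group of four, then low group plus high group.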
F3 fix: Move float_adm_avx2.c and float_adm_avx512.c out of the shared x86_avx2_static_lib / x86_avx512_static_lib and into their own per-TU static libraries (x86_float_adm_avx2_lib, x86_float_adm_avx512_lib) in core/src/meson.build. Each per-TU library adds -ffp-contract=off to its c_args, disabling the compiler's auto-fusion of adjacent mul+add pairs into FMA instructions for the entire TU. The ISA flags (-mavx, -mavx2, -mfma for AVX2; -mavx512f etc. for AVX-512) are preserved — only auto-contraction is disabled. This matches the pattern already established for ssimulacra2 in the same meson.build file.
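A per-TU carve-out of this kind might look like the Meson fragment below. It is a sketch under assumptions: the target name string and the relative source path are guesses derived from the paths quoted in this ADR, and the real definition in core/src/meson.build will carry additional arguments (include directories, dependencies) that are omitted here. Only the AVX2 library is shown because the full AVX-512 flag list is not spelled out above.

```meson
# Hypothetical sketch of the AVX2 carve-out; not copied from the tree.
x86_float_adm_avx2_lib = static_library(
    'x86_float_adm_avx2',                  # assumed target name
    ['feature/x86/float_adm_avx2.c'],      # assumed path relative to core/src
    c_args : [
        '-mavx', '-mavx2', '-mfma',        # ISA flags stay as before
        '-ffp-contract=off',               # no automatic mul+add fusion in this TU
    ],
)
```

Keeping -mfma in the list means explicit FMA intrinsics would still compile; only the compiler's own contraction of separate multiply and add operations is switched off.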
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep store-to-temp loop, document the difference | Zero code delta | Contradicts ADR-0139 bit-exactness contract; visible divergence at --precision max | Rejected — once identified, leaving a known bit-exact violation is a regression of the stated invariant |
| Per-lane scalar double reduction (ADR-0139 pattern) | Matches scalar C promotions exactly; proven in SSIM SIMD | Adds ~40 LoC of aligned temp buffers + scalar inner loops per TU; two levels of indirection | Rejected for F2 — _mm256_cvtps_pd widening is simpler and avoids the stack alloc entirely; for F3, the -ffp-contract=off carve-out is a cleaner build-system fix than restructuring the inner loop |
| Disable SIMD dispatch for float_adm, fall back to scalar | Guaranteed bit-exact, zero residual risk | Reverts a performance gain; upstream ports of float_adm SIMD additions would need to re-enable dispatch | Rejected: the defects are structural (precision model, FP contraction) and fixable without disabling the path |
| Ship AVX2 + AVX-512 + NEON all in one PR | Single review for all ISAs | NEON needs a separate arm64 CI leg; the fork's arm64 NEON SIMD coverage is already ahead of x86 (VIF, ADM); bundling triples review surface | Per project practice: AVX2 + AVX-512 only; NEON follow-up deferred |
Consequences
- Positive: float_adm_csf_den_scale_* and float_adm_sum_cube_* are now numerically consistent between scalar, AVX2, and AVX-512 reduction paths (no float intermediate store, double-precision throughout). The DWT2 vertical pass in both ISA variants is protected against compiler auto-FMA at any optimization level.
- Negative: Two additional static libraries are compiled (x86_float_adm_avx2_lib, x86_float_adm_avx512_lib), adding a small build-time overhead. The -ffp-contract=off flag disables auto-FMA for the entire float_adm_avx*.c TU, not just the DWT2 pass; this is conservative but correct.
- Neutral / follow-ups: core/src/feature/AGENTS.md rebase invariant: any future upstream port that changes float_adm_csf_den_scale_s or float_adm_sum_cube_s must propagate the change to the AVX2 and AVX-512 variants and verify double-precision reduction consistency.
- The hadd_pd4 helper is duplicated across AVX2 and AVX-512 TUs to avoid a cross-TU dependency (each TU has its own ISA flags). A shared header approach is not used because sharing would require a header compiled with neither -mavx2 nor -mavx512f individually, which is unsupported by the current build model.
- Reproducer (for PR description):

  ```shell
  vmaf --cpumask 255 --reference REF --distorted DIST \
    --width 576 --height 324 --pixel_format 420 --bitdepth 8 \
    --feature float_adm --precision max -o scalar.xml
  vmaf --cpumask 4 --reference REF --distorted DIST \
    --width 576 --height 324 --pixel_format 420 --bitdepth 8 \
    --feature float_adm --precision max -o avx2.xml
  diff <(grep float_adm scalar.xml) <(grep float_adm avx2.xml)  # empty
  ```
References
- ADR-0139: SIMD float paths must match scalar bit-for-bit; per-lane double reduction contract.
- ADR-0108: six deep-dive deliverables rule.
- ADR-0594: ssimulacra2 -ffp-contract=off carve-out precedent (HIP analogue).
- Research-0110: prior ADM AVX2 audit (UBSan / CLZ fix); this ADR covers the separate F2/F3 defects.
- Related PR: fix/float-adm-avx2-f2-f3-pr116 (PR #142 on VMAFx/vmafx).
- Source: project decision. AVX2 + AVX-512 SIMD correctness audit of float_adm F2/F3 defects identified during fork ADR-0139 compliance sweep.