ADR-0504: AVX-512F port of float separable convolution scanlines
- Status: Accepted
- Date: 2026-05-18
- Deciders: Lusoris, Claude (Anthropic)
- Tags: simd, performance, build
Context
The float VIF path dominates CPU wall time in the float VMAF model (~60 % of cycles on a representative 1080p run):
- `convolution_f32_avx_s_1d_h_scanline`: 8.48 %
- `convolution_f32_avx_s_1d_v_scanline`: 5.60 %
- `extract.lto_priv.17` (inlined per-scale loop): 35.89 %
All three hot spots route through the two separable scanline helpers in `core/src/feature/common/convolution_avx.c`, which are 256-bit (8 floats per FMA). Modern server and consumer CPUs (Skylake-X, Ice Lake, Sapphire Rapids, Zen 4, and their successors) expose AVX-512F, which allows 16 floats per FMA. That doubles the arithmetic done per instruction in the inner convolution loop while the per-iteration instruction count stays the same.
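To make the width argument concrete, here is a minimal sketch of a horizontal scanline loop that produces 16 outputs per FMA. It is not the code in `convolution_avx.c`: the function name `h_scanline_sketch`, its parameters, and the scalar fallback are assumptions made for this example. The 512-bit path is compiled only when the compiler defines `__AVX512F__` (for example with `-mavx512f`); otherwise the scalar loop handles every output.

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper, not the fork's actual kernel.
 * Computes dst[x] = sum_k filter[k] * src[x + k] for x in [0, n).
 * The caller must guarantee that src holds at least n + taps - 1 floats. */
void h_scanline_sketch(const float *filter, int taps,
                       const float *src, float *dst, ptrdiff_t n)
{
    ptrdiff_t x = 0;

#if defined(__AVX512F__)
    /* Main loop: 16 outputs per iteration, one FMA per filter tap. */
    for (; x + 16 <= n; x += 16) {
        __m512 acc = _mm512_setzero_ps();
        for (int k = 0; k < taps; ++k) {
            __m512 coeff = _mm512_set1_ps(filter[k]);
            /* Unaligned load: src + x + k is not 64-byte aligned in general. */
            __m512 pix = _mm512_loadu_ps(src + x + k);
            acc = _mm512_fmadd_ps(coeff, pix, acc);
        }
        _mm512_storeu_ps(dst + x, acc);
    }
#endif

    /* Scalar tail (or the whole row when AVX-512F is not enabled). */
    for (; x < n; ++x) {
        float acc = 0.0f;
        for (int k = 0; k < taps; ++k)
            acc += filter[k] * src[x + k];
        dst[x] = acc;
    }
}
```

An 8-wide version of the same loop would look identical with `__m256` types and a step of 8, which is why the port is expected to be largely mechanical.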
The float VIF path already diverges from the integer VIF path at the numerical level (ADR-0214). The fork therefore accepts ULP-level differences between SIMD widths on the float path, provided the Netflix golden assertions still pass at their declared `places` tolerance.
Decision
We will create `core/src/feature/common/convolution_avx512.c`, porting the four static scanline helpers and the three public wrappers (`convolution_f32_avx512_s`, `_sq_s`, `_xy_s`) from `__m256` to `__m512` with `_mm512_fmadd_ps`. The dispatch in `vif_tools.c` is updated to prefer AVX-512 when `VMAF_X86_CPU_FLAG_AVX512` is set, falling back to AVX2, then scalar. The new TU is compiled with `-mavx512f -mavx512dq -mavx512bw` inside the existing `x86_avx512_static_lib` in `core/src/meson.build`.
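The fallback order described above can be sketched as a small selection function. This is only an illustration of the priority logic, not the dispatch code in `vif_tools.c`: the `SKETCH_*` flag bits, the `sketch_backend` enum, and `select_convolution_backend` are hypothetical names, and the real flag values live in the fork's CPU-detection header.

```c
/* Hypothetical flag bits, defined here only for the example. */
enum {
    SKETCH_CPU_FLAG_AVX2   = 1u << 0,
    SKETCH_CPU_FLAG_AVX512 = 1u << 1  /* stands in for VMAF_X86_CPU_FLAG_AVX512 */
};

typedef enum {
    SKETCH_SCALAR,
    SKETCH_AVX2,
    SKETCH_AVX512
} sketch_backend;

/* Pick the widest backend the CPU reports: AVX-512, then AVX2, then scalar. */
sketch_backend select_convolution_backend(unsigned cpu_flags)
{
    if (cpu_flags & SKETCH_CPU_FLAG_AVX512)
        return SKETCH_AVX512;
    if (cpu_flags & SKETCH_CPU_FLAG_AVX2)
        return SKETCH_AVX2;
    return SKETCH_SCALAR;
}
```

Because the choice is made from runtime flags, a binary that contains the AVX-512 TU still runs on machines without AVX-512; the 512-bit code is simply never called there.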
The invariants from ADR-0143 are preserved in the new file:
- Scanline helpers carry `static` linkage.
- Inner-loop strides are `ptrdiff_t`.
- Horizontal pass uses `_mm512_loadu_ps` (no alignment guarantee on tmp row interior); vertical pass uses `_mm512_load_ps` (tmp rows are aligned to `MAX_ALIGN` = 64 bytes, which satisfies the 64-byte `_mm512_load_ps` requirement because `vmaf_ceiln(width, 16)` rounds up to a 64-byte boundary for float arrays allocated via `aligned_malloc`).
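The alignment argument in the last invariant is plain arithmetic: 16 floats of 4 bytes each make 64 bytes. A minimal sketch, assuming `vmaf_ceiln(n, m)` rounds `n` up to the next multiple of `m`; `ceiln_sketch` and `tmp_row_stride_bytes` are stand-ins written for this example, not the fork's definitions.

```c
#include <stddef.h>

/* Stand-in for vmaf_ceiln; assumed to round n up to a multiple of m. */
static size_t ceiln_sketch(size_t n, size_t m)
{
    return (n + m - 1) / m * m;
}

/* Row stride in bytes for a float tmp row of the given width in pixels. */
size_t tmp_row_stride_bytes(size_t width)
{
    return ceiln_sketch(width, 16) * sizeof(float);
}
```

If the base pointer is 64-byte aligned and every row stride is a multiple of 64 bytes, every row start is 64-byte aligned, which is what `_mm512_load_ps` requires. Loads at interior offsets of a row carry no such guarantee, hence `_mm512_loadu_ps` in the horizontal pass.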
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep AVX2 only | No code addition | Leaves 40-50 % of float VIF cycles on the table on AVX-512 CPUs | Performance goal not met |
| Highway / simde abstraction library | Single source for all widths | Adds a dependency; diverges from fork's "intrinsics by hand" pattern (MEMORY.md) | Not the fork's SIMD style |
| Dual-issue two AVX2 vectors per iteration | Avoids a new file | Does not reduce instruction count; throughput gain marginal vs AVX-512 | Less clean than a proper 512-bit port |
Consequences
- Positive: expected +40-50 % throughput on the float VIF path on CPUs that report AVX-512 (runtime-gated; safe on non-AVX-512 machines).
- Negative: AVX-512 increases register pressure and may cause frequency throttling on early Skylake-X. The throughput gain is expected to exceed the throttling penalty for the convolution loop (FMA-bound, not memory-bound).
- Neutral: results on AVX-512 CPUs are NOT bit-identical to AVX2 results (wider FMA partial-sum tree → different rounding); this is accepted per ADR-0214 for the float path.
- Follow-ups: if a snapshot regression appears in `testdata/scores_cpu_*.json`, regenerate via `/regen-snapshots` with a commit message citing ADR-0504.
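The non-bit-identical results noted under "Neutral" follow from float addition not being associative: grouping the same terms into a different number of partial sums changes where rounding happens. The sketch below is a generic model of that effect, not the fork's kernel. `sum_with_lanes` and `SKETCH_MAX_LANES` are hypothetical names, and the input in the usage note is deliberately extreme so the difference is visible; on real image data the divergence is expected to stay at the ULP level.

```c
#include <stddef.h>

#define SKETCH_MAX_LANES 16

/* Hypothetical model of a lane-wise reduction, not the fork's kernel:
 * element i is accumulated into lane i % lanes, then the lanes are
 * added together from lane 0 upward. lanes must be in [1, 16]. */
float sum_with_lanes(const float *v, size_t n, size_t lanes)
{
    float acc[SKETCH_MAX_LANES] = {0};
    for (size_t i = 0; i < n; ++i)
        acc[i % lanes] += v[i];

    float total = 0.0f;
    for (size_t l = 0; l < lanes; ++l)
        total += acc[l];
    return total;
}
```

For a 16-element input that is zero except `v[0] = 1`, `v[1] = 16777216` (2^24) and `v[8] = 1`, the 8-lane grouping adds the two ones first and returns 16777218, while the 16-lane grouping adds each one to 2^24 separately, loses both to rounding, and returns 16777216. This assumes IEEE-754 single precision with round-to-nearest-even and no excess intermediate precision.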
References
- ADR-0143 (`0143-port-netflix-f3a628b4-generalized-avx-convolve.md`): invariants inherited by this port.
- ADR-0214: precedent for float-path ULP divergence between SIMD widths.
- Perf profile (`/tmp/perf_findings.md` Win #2, 2026-05-18 session).