ADR-0504: AVX-512F port of float separable convolution scanlines

Status: Accepted
Date: 2026-05-18
Deciders: Lusoris, Claude (Anthropic)
Tags: simd, performance, build

Context

The float VIF path dominates CPU wall time in the float VMAF model (~60 % of cycles on a representative 1080p run). Its three hottest symbols together account for about 50 % of cycles:

convolution_f32_avx_s_1d_h_scanline: 8.48 %
convolution_f32_avx_s_1d_v_scanline: 5.60 %
extract.lto_priv.17 (inlined per-scale loop): 35.89 %

All three hot spots route through the two separable scanline helpers in core/src/feature/common/convolution_avx.c, which operate on 256-bit vectors (8 floats per FMA). Many modern server and consumer CPUs (Skylake-X, Ice Lake, Sapphire Rapids, Zen 4, and their successors) expose AVX-512F, which processes 16 floats per FMA. That doubles the work done per instruction in the inner convolution loop for the same instruction count; the realized speedup is smaller and is estimated under Consequences.
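To make the width difference concrete, here is a minimal sketch of a horizontal scanline loop. It is not the fork's actual helper: the name conv_h_scanline_sketch and its signature are hypothetical, and the 512-bit path is compiled only when the compiler defines __AVX512F__. Without that macro the scalar loop handles every pixel.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: dst[i] = sum over k of filter[k] * src[i + k].
 * src must hold at least n + taps - 1 readable floats. */
static void conv_h_scanline_sketch(const float *filter, int taps,
                                   const float *src, float *dst, ptrdiff_t n)
{
    ptrdiff_t i = 0;
    assert(taps > 0 && n >= 0);
#if defined(__AVX512F__)
    /* 16 output pixels per iteration; one FMA per filter tap. */
    for (; i + 16 <= n; i += 16) {
        __m512 acc = _mm512_setzero_ps();
        for (int k = 0; k < taps; k++) {
            __m512 coeff = _mm512_set1_ps(filter[k]);
            /* Unaligned load: src + i + k is not 64-byte aligned in general. */
            __m512 pix = _mm512_loadu_ps(src + i + k);
            acc = _mm512_fmadd_ps(coeff, pix, acc);
        }
        _mm512_storeu_ps(dst + i, acc);
    }
#endif
    /* Scalar tail (or the whole row when AVX-512F is not enabled). */
    for (; i < n; i++) {
        float acc = 0.0f;
        for (int k = 0; k < taps; k++)
            acc += filter[k] * src[i + k];
        dst[i] = acc;
    }
}
```

An AVX2 version of the same loop would step by 8 pixels with __m256 and _mm256_fmadd_ps, so it needs twice as many iterations of the vector loop for the same row.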

The float VIF path already diverges from the integer VIF path at the numerical level (ADR-0214). The fork therefore accepts ULP-level differences between SIMD widths on the float path, provided Netflix golden assertions still pass at their declared places tolerance.
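The two notions of closeness in the paragraph above can be sketched as follows. The helper names are hypothetical, and the second function assumes that "places" follows the convention of Python's unittest assertAlmostEqual (the difference rounds to zero at the given number of decimal places), which this document does not spell out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to an integer that increases monotonically
 * with the float's value, so subtraction counts representable steps. */
static int64_t float_to_ordered(float x)
{
    int32_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* Negative floats are sign-magnitude; convert to minus the magnitude. */
    return bits < 0 ? (int64_t)INT32_MIN - (int64_t)bits : (int64_t)bits;
}

/* Number of representable floats between a and b (0 means bit-equal,
 * treating -0.0f and +0.0f as the same value). */
static int64_t ulp_distance(float a, float b)
{
    int64_t d;
    assert(a == a && b == b); /* NaN has no meaningful ULP distance */
    d = float_to_ordered(a) - float_to_ordered(b);
    return d < 0 ? -d : d;
}

/* Assumed "places" semantics: |a - b| must be below 0.5 * 10^-places. */
static int agrees_to_places(double a, double b, int places)
{
    double tol = 0.5;
    double diff = a > b ? a - b : b - a;
    for (int i = 0; i < places; i++)
        tol /= 10.0;
    return diff < tol;
}
```

A difference of a few ULPs in an intermediate float is far below a tolerance of, say, four decimal places on a final score, which is why the two criteria can disagree on "identical" yet agree on "passes".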

Decision

We will create core/src/feature/common/convolution_avx512.c, porting the four static scanline helpers and the three public wrappers (convolution_f32_avx512_s, _sq_s, _xy_s) from __m256 to __m512 with _mm512_fmadd_ps. The dispatch in vif_tools.c will prefer AVX-512 when VMAF_X86_CPU_FLAG_AVX512 is set, falling back to AVX2 and then to scalar. The new TU will be compiled with -mavx512f -mavx512dq -mavx512bw inside the existing x86_avx512_static_lib in core/src/meson.build.
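The fallback order can be sketched as a small selection function. Everything here is illustrative: the SKETCH_CPU_FLAG_* bits stand in for the fork's VMAF_X86_CPU_FLAG_* values, and the enum stands in for the function pointers that the real dispatch in vif_tools.c would assign.

```c
#include <assert.h>

/* Hypothetical flag bits; the real values live in the fork's CPU headers. */
enum {
    SKETCH_CPU_FLAG_AVX2 = 1 << 0,
    SKETCH_CPU_FLAG_AVX512 = 1 << 1
};

typedef enum { IMPL_SCALAR, IMPL_AVX2, IMPL_AVX512 } conv_impl;

/* Prefer the widest implementation the running CPU reports. */
static conv_impl select_conv_impl(unsigned cpu_flags)
{
    if (cpu_flags & SKETCH_CPU_FLAG_AVX512)
        return IMPL_AVX512;
    if (cpu_flags & SKETCH_CPU_FLAG_AVX2)
        return IMPL_AVX2;
    return IMPL_SCALAR;
}

/* Human-readable name, e.g. for logging which path was chosen. */
static const char *conv_impl_name(conv_impl impl)
{
    switch (impl) {
    case IMPL_AVX512: return "avx512";
    case IMPL_AVX2:   return "avx2";
    case IMPL_SCALAR: return "scalar";
    }
    assert(!"unknown implementation");
    return "";
}
```

Because the choice is made from runtime flags, a binary that contains the AVX-512 TU still runs on machines without AVX-512: the 512-bit code is simply never called there.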

The invariants from ADR-0143 are preserved in the new file:

Scanline helpers carry static linkage.
Inner-loop strides are ptrdiff_t.
Horizontal pass uses _mm512_loadu_ps, because interior positions of a tmp row carry no alignment guarantee. Vertical pass uses _mm512_load_ps, which requires 64-byte alignment: tmp rows are allocated via aligned_malloc with MAX_ALIGN = 64 bytes, and vmaf_ceiln(width, 16) pads each row to a multiple of 16 floats (64 bytes), so every row start stays on a 64-byte boundary.
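The alignment argument in the last invariant is plain arithmetic and can be checked with a short sketch. The helpers below are hypothetical stand-ins: ceiln_sketch is assumed to mirror vmaf_ceiln, C11 aligned_alloc replaces the fork's aligned_malloc, and float is assumed to be 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ALIGN 64 /* mirrors MAX_ALIGN = 64 from the text */

/* Round n up to the next multiple of m (assumed behaviour of vmaf_ceiln). */
static size_t ceiln_sketch(size_t n, size_t m)
{
    return (n + m - 1) / m * m;
}

/* Row stride in bytes: width padded to 16 floats, i.e. 64 bytes. */
static size_t row_stride_bytes(size_t width)
{
    return ceiln_sketch(width, 16) * sizeof(float);
}

static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Allocate `rows` padded rows; every row start is 64-byte aligned because
 * the base is aligned and the stride is a multiple of 64. */
static float *alloc_tmp_rows(size_t width, size_t rows)
{
    size_t stride = row_stride_bytes(width);
    assert(stride % SKETCH_MAX_ALIGN == 0);
    return aligned_alloc(SKETCH_MAX_ALIGN, stride * rows);
}
```

For a width of 1921 pixels, the padded row holds 1936 floats (7744 bytes, or 121 blocks of 64 bytes), so row r starts at base + r * 7744 and remains valid for an aligned 512-bit load.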

Alternatives considered

Option	Pros	Cons	Why not chosen
Keep AVX2 only	No code addition	Forgoes the expected 40-50 % float VIF throughput gain on AVX-512 CPUs	Performance goal not met
Highway / simde abstraction library	Single source for all widths	Adds a dependency; diverges from fork's "intrinsics by hand" pattern (MEMORY.md)	Not the fork's SIMD style
Dual-issue two AVX2 vectors per iteration	Avoids a new file	Does not reduce instruction count; throughput gain marginal vs AVX-512	Less clean than a proper 512-bit port

Consequences

Positive: expected +40-50 % throughput on the float VIF path on CPUs that report AVX-512 (runtime-gated; safe on non-AVX-512 machines).
Negative: AVX-512 increases register pressure and may cause frequency throttling on early Skylake-X. The throughput gain is expected to exceed the throttling penalty because the convolution loop is FMA-bound, not memory-bound.
Neutral: results on AVX-512 CPUs are NOT bit-identical to AVX2 results (wider FMA partial-sum tree → different rounding); this is accepted per ADR-0214 for the float path.
Follow-ups: if a snapshot regression appears on testdata/scores_cpu_*.json, regenerate via /regen-snapshots with a commit message citing ADR-0504.
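The "Neutral" point rests on float addition not being associative. The sketch below is a scalar emulation, not the fork's code and without any FMA: it sums the same array with 8 and with 16 lane accumulators and then folds the lanes together, showing that lane count alone can change the rounded result. The function name and the fold order are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_LANES 16

/* Sum x[0..n) the way a SIMD loop would: element i goes to accumulator
 * i % lanes, then the upper half of the lanes is folded onto the lower
 * half until one value remains. lanes must be a power of two. */
static float sum_with_lanes(const float *x, size_t n, size_t lanes)
{
    float acc[SKETCH_MAX_LANES] = {0};
    assert(lanes > 0 && lanes <= SKETCH_MAX_LANES);
    assert((lanes & (lanes - 1)) == 0 && n % lanes == 0);

    for (size_t i = 0; i < n; i += lanes)
        for (size_t l = 0; l < lanes; l++)
            acc[l] += x[i + l];

    for (size_t half = lanes / 2; half > 0; half /= 2)
        for (size_t l = 0; l < half; l++)
            acc[l] += acc[l + half];

    return acc[0];
}
```

As a usage example, take 32 floats that are all zero except x[0] = 16777216.0f (2^24) and x[8] = x[24] = 1.0f. With 8 lanes, each 1.0f is added to 2^24 on its own and is rounded away. With 16 lanes, the two 1.0f values first meet in a separate accumulator and survive as 2.0f, so the two sums differ by one ULP at that magnitude.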

References

ADR-0143 (0143-port-netflix-f3a628b4-generalized-avx-convolve.md) — invariants inherited by this port.
ADR-0214 — precedent for float-path ULP divergence between SIMD widths.
Perf profile (/tmp/perf_findings.md Win #2, 2026-05-18 session).