Research-0089: AVX-512 follow-up audit sweep (2026-05-09)

Backlog row: T3-9 — "AVX-512 follow-up audit sweep (3 sub-rows, single bench-first PR)".

Methodology (per BACKLOG row):

"bench first; ship if 16-lane >= 1.3x AVX2 on the Netflix normal pair, otherwise document as ADR-0180-style ceiling".

Three sub-rows are audited in this digest. The two already-shipped sub-rows each get an append-only ### AVX-512 audit 2026-05-09 block in the ADRs that cover them (ADR-0138 for iqa_convolve; ADR-0161 / ADR-0162 / ADR-0163 for ssimulacra2). psnr_hvs is re-affirmed via this digest only, because ADR-0180 already documented the original ceiling and stays frozen per ADR-0028.

Hardware + build

Host: AMD Ryzen 9 9950X3D (Zen 5, AVX-512F/BW/VL/DQ/IFMA/VBMI/VNNI/BF16 per /proc/cpuinfo).
Build: meson setup build -Denable_cuda=false -Denable_sycl=false in this PR's worktree, GCC 16.1.1, ninja parallel build.
Branch: feat/avx512-audit-sweep-t3-9 off origin/master (ec0e002e, "chore: stale-marker sweep 2026-05-08 (Research-0086)").

Bench fixture

Inputs: Netflix normal pair src01_hrc00_576x324.yuv and src01_hrc01_576x324.yuv (48 frames each; 13 436 928 bytes 4:2:0 8-bit).
Amortised fixture: each YUV concatenated 10x to a 480-frame stream (/tmp/ref_x10.yuv, /tmp/dis_x10.yuv), so process startup cost is diluted to <1% of wall-clock.
Driver: core/build/tools/vmaf -r ... -d ... -w 576 -h 324 -p 420 -b 8 --feature <F> --no_prediction --threads 1 --cpumask <M>. AVX-512 path = cpumask 0; AVX2 path = cpumask 16; scalar path = cpumask 24.
Each cell is the median of 3 single-thread runs, with the system otherwise idle during the session.
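
The fixture arithmetic and the driver invocation above can be sketched in Python. This is a minimal sketch, not the harness used for the audit: `frame_bytes` and `vmaf_argv` are hypothetical helper names, the binary and file paths are placeholders supplied by the caller, and only the flags quoted in the driver line are used.

```python
def frame_bytes(width: int, height: int) -> int:
    """Bytes in one 8-bit 4:2:0 frame: a full-size luma plane plus two
    chroma planes subsampled by 2 in each direction (even dims assumed)."""
    luma = width * height
    chroma = (width // 2) * (height // 2)
    return luma + 2 * chroma


def vmaf_argv(binary: str, ref: str, dis: str, feature: str, cpumask: int) -> list:
    """Build the single-thread driver command line quoted in the text."""
    return [
        binary, "-r", ref, "-d", dis,
        "-w", "576", "-h", "324", "-p", "420", "-b", "8",
        "--feature", feature, "--no_prediction",
        "--threads", "1", "--cpumask", str(cpumask),
    ]


clip_bytes = 48 * frame_bytes(576, 324)   # one 48-frame input clip
fixture_frames = 10 * 48                  # frames after 10x concatenation
print(clip_bytes, fixture_frames)
```

The argument list can be handed to `subprocess.run` and timed three times per cell. The timing loop is left out here because it depends on the local build.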

Bench results (median of 3 wall-clock runs, 480 frames, single-thread)

Sub-row	Candidate	AVX2 baseline	AVX-512 widening	Ratio	Threshold	Decision
(a)	`psnr_hvs` AVX-512	0.706 s (vs 1.040 s scalar = 1.473x)	not implemented; theoretical ceiling 1.11x (see below)	< 1.3x by construction	1.3x	DOCUMENT (ceiling re-affirms ADR-0180)
(b)	`ssimulacra2` AVX-512 (PTLR + IIR + scoring)	4.681 s	3.203 s	1.461x	1.3x	AUDIT-PASS (already shipped; re-affirm)
(c-i)	`iqa_convolve` AVX-512 (via `float_ssim`)	1.057 s	0.813 s	1.300x	1.3x	AUDIT-PASS (already shipped; at threshold)
(c-ii)	`iqa_convolve` AVX-512 (via `float_ms_ssim`)	1.450 s	1.236 s	1.173x	1.3x	AUDIT-PASS-MARGINAL — outer-loop-amortised; see note

Two wider crosschecks (single-frame, no concat) gave consistent ratios (ssimulacra2 ON 0.319 s vs OFF 0.468 s = 1.467x; psnr_hvs AVX2 0.706 s vs scalar 1.040 s = 1.473x — same per-frame physics).
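
The Ratio column and the 1.3x comparison can be recomputed from the table's medians. The sketch below assumes the ratio is AVX2 time divided by AVX-512 time. `speedup` and `meets_threshold` are illustrative names, not part of the repository.

```python
SHIP_THRESHOLD = 1.3  # from the BACKLOG row quoted in the methodology


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Wall-clock ratio; values above 1.0 mean the candidate is faster."""
    return baseline_s / candidate_s


def meets_threshold(baseline_s: float, candidate_s: float,
                    threshold: float = SHIP_THRESHOLD) -> bool:
    return speedup(baseline_s, candidate_s) >= threshold


# (AVX2 seconds, AVX-512 seconds) medians copied from the table above
measured = {
    "ssimulacra2": (4.681, 3.203),
    "float_ssim": (1.057, 0.813),
    "float_ms_ssim": (1.450, 1.236),
}

for name, (avx2_s, avx512_s) in measured.items():
    ratio = speedup(avx2_s, avx512_s)
    print(f"{name}: {ratio:.3f}x, meets 1.3x: {meets_threshold(avx2_s, avx512_s)}")
```

The script reproduces only the arithmetic. The Decision column also reflects whether a path was already shipped, which is why a sub-threshold ratio can still be recorded as an audit pass.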

Per-candidate verdict

(a) `psnr_hvs` AVX-512: DOCUMENT (ceiling, ADR-0180 re-affirmed)

The 8-wide DCT in psnr_hvs_avx2.c already runs at 1.473x scalar on this machine — the 8-row register file (8x __m256i) saturates on a 576x324 frame's 8x8 block grid. AVX-512 widening would batch two 8x8 blocks per call (the only structurally meaningful widening, since the existing kernel is already lane-perfect on 8 columns of int32s). The per-block scalar reductions (CSF mask, S2 mean, S1 mean, mse accumulator) sit in the host loop and run between FDCT calls — they cannot be widened without breaking the ADR-0138/0139 per-lane bit-exactness contract.

Theoretical ceiling estimate (deterministic, no hand-stub required):

DCT inner work: ~30 add/sub/mullo/shift ops per 8x8 block on AVX2, counted as ~30 cycles in this estimate.
Outer per-block scalar work: ~120 cycles (CSF multiplies, mean, variance, mse).
AVX-512 paired call: ~30 cycles inner per pair, i.e. ~15 cycles per block; the ~120 cycles of outer work per block are unchanged (cannot be paired because of bit-exactness).
Theoretical ceiling: (30 + 120) / (15 + 120) = 150/135 = 1.111x.
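
The ceiling is an Amdahl's-law style bound: only the inner DCT work shrinks, while the outer scalar work stays fixed. A minimal sketch of the arithmetic follows, using the cycle estimates from the list above. `widening_ceiling` is a hypothetical helper name.

```python
def widening_ceiling(inner: float, outer: float, widen_factor: float) -> float:
    """Speedup when only `inner` is divided by `widen_factor` and `outer`
    stays serial (an Amdahl's-law style bound)."""
    return (inner + outer) / (inner / widen_factor + outer)


INNER_CYCLES = 30.0   # DCT work per 8x8 block on AVX2 (estimate from the text)
OUTER_CYCLES = 120.0  # per-block scalar work (estimate from the text)

# Pairing two 8x8 blocks per AVX-512 call halves the inner cost per block.
paired = widening_ceiling(INNER_CYCLES, OUTER_CYCLES, 2.0)

# Limiting case: the inner work costs nothing at all.
limit = (INNER_CYCLES + OUTER_CYCLES) / OUTER_CYCLES

print(f"paired blocks: {paired:.3f}x, inner work free: {limit:.3f}x")
```

Under these estimates the bound stays at 1.25x even if the DCT were free, so no amount of widening of the inner kernel alone reaches 1.3x.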

That's well below the 1.3x ship threshold. Verdict: AVX2 ceiling for psnr_hvs holds; AVX-512 follow-up closed.

This re-affirms the original T7-21 finding (1.17x scalar->AVX2 on a prior session) — same physics, slightly faster machine; the relative scalar->AVX2 ratio rose to 1.473x but the AVX2->AVX-512 headroom is unchanged because the outer scalar work dominates. ADR-0180 stays frozen per ADR-0028; the verdict is recorded in this research digest.

(b) `ssimulacra2` AVX-512: AUDIT-PASS

Already shipped (ADR-0161 phase 1, ADR-0162 IIR blur, ADR-0163 PTLR). Bench reaffirms the AVX-512 path is delivering 1.461x over AVX2 across the full ssimulacra2 pipeline (PTLR + IIR blur + scoring) — the post-merge cross-host snapshot drift / residual ULP audit (former T3-10) reduces to a "still good" finding on the audit machine.

Bit-exactness gate: AVX-512 vs AVX2 score JSON at full IEEE-754 precision (--precision max) is byte-identical for ssimulacra2, float_ssim, and float_ms_ssim across all 48 frames. Only the fps line differs (expected; runtime metadata, not a score). 0/48 frames diverge for any feature — 0 ULPs at places=4 and beyond.

diff /tmp/p_0.json /tmp/p_16.json | grep -v fps
3c3
---
(zero non-fps differences)

test_ssimulacra2_simd 13/13 subtests pass on the audit build.
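
The diff-based gate above can also be expressed as a structural comparison that ignores the runtime-only `fps` field. This is a sketch under assumptions: it drops any key named `fps` at any depth rather than relying on a particular JSON layout, and `strip_keys` and `scores_identical` are hypothetical names.

```python
import json


def strip_keys(node, ignored=frozenset({"fps"})):
    """Return a copy of a parsed JSON tree with the ignored keys removed."""
    if isinstance(node, dict):
        return {k: strip_keys(v, ignored) for k, v in node.items() if k not in ignored}
    if isinstance(node, list):
        return [strip_keys(v, ignored) for v in node]
    return node


def scores_identical(path_a: str, path_b: str) -> bool:
    """True when both score files agree on everything except ignored keys."""
    with open(path_a) as fa, open(path_b) as fb:
        return strip_keys(json.load(fa)) == strip_keys(json.load(fb))
```

A call such as `scores_identical("/tmp/p_0.json", "/tmp/p_16.json")` compares parsed values. That is slightly weaker than the byte-level diff, because two different decimal spellings of the same double would compare equal.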

(c) `iqa_convolve` AVX-512: AUDIT-PASS

Already shipped (ADR-0138 §"Follow-up" promise, fulfilled in convolve_avx512.c, wired in float_ssim.c and float_ms_ssim.c dispatch). Bench shows 1.300x over AVX2 on float_ssim (exactly at threshold) and 1.173x on float_ms_ssim (sub-threshold but explained):

float_ssim runs the convolve on a single full-resolution plane pair — the AVX-512 16-lane double-accumulate widens the inner work proportionally; ratio matches the lane-count ratio after amortising memory bandwidth.
float_ms_ssim runs the convolve at 5 progressively half-sized scales. At the smallest 2 scales (roughly 72x40 and 36x20 for a 576x324 input) the kernel size approaches the input width and SIMD lanes are partially masked, so the AVX-512 advantage shrinks. The aggregate 1.173x is the expected outcome and matches the ADR-0138 §"Follow-up" prediction ("8-lane bandwidth-amortised on 8x8 windows").
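
To make the multi-scale geometry concrete, the plane sizes can be listed for the 576x324 fixture. This sketch assumes the full-resolution plane is the first scale and that each later scale halves both dimensions with floor rounding. The real decimation may round odd dimensions differently, so treat the smaller sizes as approximate. `scale_dims` is a hypothetical helper.

```python
def scale_dims(width: int, height: int, scales: int) -> list:
    """Plane size at each scale, halving both dimensions per step."""
    dims = []
    for _ in range(scales):
        dims.append((width, height))
        width, height = width // 2, height // 2  # floor rounding (assumption)
    return dims


for index, (w, h) in enumerate(scale_dims(576, 324, 5), start=1):
    print(f"scale {index}: {w}x{h}")
```

Each row of a small plane holds only a few full 16-lane vectors, so a larger share of the work falls into masked tail handling than at full resolution.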

Bit-exactness gate: see (b) above — float_ssim and float_ms_ssim JSON output is byte-identical across cpumask 0 and 16 at full precision. test_iqa_convolve 13/13 subtests pass.

The AVX-512 path is meeting its design ceiling; both float-SSIM and MS-SSIM dispatch tables wire it up correctly per float_ssim.c:117 and float_ms_ssim.c:94.

Cross-backend gate (validate-scores): shipped candidates only

Feature	scalar	AVX2	AVX-512	AVX-512 == AVX2	AVX-512 vs scalar
`ssimulacra2` (per-frame)	91.695976709632987	91.695976667734726	91.695976667734726	0 ULP delta (byte-identical)	~4.6e-10 relative, ~4.2e-8 absolute (matches ADR-0163 PTLR LUT story)
`float_ssim` (per-frame)	1.0	1.0	1.0	0 ULP delta	0 ULP delta
`float_ms_ssim` (per-frame)	1.0	1.0	1.0	0 ULP delta	0 ULP delta

Frames in fixture: 48; non-zero score-line diffs between AVX-512 and AVX2: 0/48 for all three features. The cross-backend gate (ADR-0028 §"AVX-512 must be byte-identical to AVX2" extension of ADR-0138/0139) holds on shipped candidates.
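
The two comparison columns can be recomputed from the scores in the table. The sketch below measures ULP distance by mapping each double onto an integer line ordered like the reals, and it also reports a relative delta. Both helper names are hypothetical, and NaN inputs are not handled.

```python
import struct


def ordered_bits(x: float) -> int:
    """Map a double to an integer whose ordering matches the real line."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles have the sign bit set; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable doubles between a and b (0 if identical)."""
    return abs(ordered_bits(a) - ordered_bits(b))


def relative_delta(a: float, b: float) -> float:
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale


# ssimulacra2 per-frame scores copied from the table above
scalar = 91.695976709632987
avx2 = 91.695976667734726
avx512 = 91.695976667734726

print("AVX-512 vs AVX2 ULP distance:", ulp_distance(avx512, avx2))
print("AVX-512 vs scalar relative delta:", relative_delta(avx512, scalar))
```

A ULP distance of zero is the same condition as bit-identical doubles, which is what the byte-identical JSON check asserts at full precision.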

Decision count

SHIP (new AVX-512 implementation in this PR): 0
AUDIT-PASS (existing AVX-512 path re-affirmed): 2 (ssimulacra2, iqa_convolve)
DOCUMENT (ceiling, no implementation): 1 (psnr_hvs)

Spec footnote: AVX-512 instruction subsets used

Candidate	Subsets actually used	Required CPU level
`iqa_convolve_avx512`	AVX-512F (FMA, masked load/store, double-precision adds), AVX-512VL (256-bit ops on `__m256d` reductions)	Skylake-X / Zen 4 / Zen 5
`ssimulacra2_avx512` (PTLR + IIR + scoring)	AVX-512F (single + double FMA), AVX-512BW (byte/word permutes for 4:2:2 / 4:2:0 chroma reads), AVX-512VL	Skylake-X / Zen 4 / Zen 5
`psnr_hvs` (not shipped)	Would need AVX-512F (int32 add/sub/mullo/srai), AVX-512BW (8-row -> 16-row transpose), AVX-512VL	Skylake-X+

Runtime detection in core/src/x86/cpu.c already distinguishes VMAF_X86_CPU_FLAG_AVX512 (F+BW+VL+DQ baseline) from VMAF_X86_CPU_FLAG_AVX512ICL (Ice Lake server: VBMI/VNNI/IFMA); shipped candidates target the baseline AVX-512 flag.
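
The two-tier split can be illustrated from user space by parsing the Linux `/proc/cpuinfo` flags line. This is only a sketch of the idea and not the logic in core/src/x86/cpu.c, which is not reproduced here. The tier names are hypothetical, and the Ice Lake tier set is an assumption taken from the parenthetical above.

```python
# Flag spellings as they appear in the Linux /proc/cpuinfo "flags" line.
BASELINE = frozenset({"avx512f", "avx512bw", "avx512vl", "avx512dq"})
ICL_EXTRA = frozenset({"avx512vbmi", "avx512_vnni", "avx512ifma"})


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line, if any."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_tier(flags: set) -> str:
    """Classify a flag set as 'none', 'baseline', or 'icl' (hypothetical names)."""
    if not BASELINE <= flags:
        return "none"
    if ICL_EXTRA <= flags:
        return "icl"
    return "baseline"
```

On a Linux host, `avx512_tier(cpu_flags(open("/proc/cpuinfo").read()))` reports which tier the machine would fall into under these assumptions.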

References

BACKLOG row T3-9 (.workingdir2/BACKLOG.md); replaces former T3-10 (cross-host post-merge audit) and T7-31 (iqa_convolve AVX-512 follow-up).
ADR-0028 — ADR immutability / append-only audit-trail rule (Status update precedent).
ADR-0138 — iqa_convolve AVX2 bit-exact precedent + AVX-512 follow-up promise.
ADR-0161 — phase 1 (pointwise + reductions) AVX2/AVX-512/NEON.
ADR-0162 — phase 2 (IIR blur) AVX2/AVX-512/NEON.
ADR-0163 — phase 3 (picture_to_linear_rgb) AVX2/AVX-512/NEON.
ADR-0180 — CPU coverage audit (T7-21 close-out: psnr_hvs AVX2 ceiling). Stays frozen per ADR-0028; this digest re-affirms its verdict on a faster machine.
Test vectors: python/test/resource/yuv/src01_hrc00_576x324.yuv and python/test/resource/yuv/src01_hrc01_576x324.yuv (Netflix CPU-golden corpus, see CLAUDE.md §8).
Reproducer: meson setup build -Denable_cuda=false -Denable_sycl=false && ninja -C build && for i in $(seq 10); do cat ref.yuv; done > ref_x10.yuv && for i in $(seq 10); do cat dis.yuv; done > dis_x10.yuv && for m in 0 16 24; do for f in float_ssim float_ms_ssim ssimulacra2; do time build/tools/vmaf -r ref_x10.yuv -d dis_x10.yuv -w 576 -h 324 -p 420 -b 8 --feature $f --no_prediction --threads 1 --cpumask $m -o /tmp/o.json --json; done; done.

Status

Findings recorded; per-ADR audit blocks appended in this PR (ADR-0138, -0161, -0162, -0163); BACKLOG T3-9 row marked DONE. psnr_hvs sub-row remains parked at ADR-0180's "AVX2 ceiling" verdict — no new ADR, no new code.