Skip to content

Research-0089 — AVX-512 follow-up audit sweep (2026-05-09)

Backlog row: T3-9 — "AVX-512 follow-up audit sweep (3 sub-rows, single bench-first PR)".

Methodology (per BACKLOG row):

"bench first; ship if 16-lane >= 1.3x AVX2 on the Netflix normal pair, otherwise document as ADR-0180-style ceiling".

Three sub-rows audited in this digest. Each sub-row gets a per-ADR ### AVX-512 audit 2026-05-09 append-only block in its respective ADR (ADR-0138 / ADR-0161 / ADR-0162 / ADR-0163); psnr_hvs is re-affirmed via this digest (ADR-0180 already documented the original ceiling and stays frozen per ADR-0028).

Hardware + build

  • Host: AMD Ryzen 9 9950X3D (Zen 5, AVX-512F/BW/VL/DQ/IFMA/VBMI/VNNI/BF16 per /proc/cpuinfo).
  • Build: meson setup build -Denable_cuda=false -Denable_sycl=false in this PR's worktree, GCC 16.1.1, ninja parallel build.
  • Branch: feat/avx512-audit-sweep-t3-9 off origin/master (ec0e002e, "chore: stale-marker sweep 2026-05-08 (Research-0086)").

Bench fixture

  • Inputs: Netflix normal pair src01_hrc00_576x324.yuv and src01_hrc01_576x324.yuv (48 frames each; 13 436 928 bytes 4:2:0 8-bit).
  • Amortised fixture: each YUV concatenated 10x to a 480-frame stream (/tmp/ref_x10.yuv, /tmp/dis_x10.yuv), so process startup cost is diluted to <1% of wall-clock.
  • Driver: core/build/tools/vmaf -r ... -d ... -w 576 -h 324 -p 420 -b 8 --feature <F> --no_prediction --threads 1 --cpumask <M>. AVX-512 path = cpumask 0; AVX2 path = cpumask 16; scalar path = cpumask 24.
  • Each cell is the median of 3 runs, single-thread, system otherwise idle on this session.

Bench results (median of 3 wall-clock runs, 480 frames, single-thread)

Sub-row Candidate AVX2 baseline AVX-512 widening Ratio Threshold Decision
(a) psnr_hvs AVX-512 0.706 s (vs 1.040 s scalar = 1.473x) not implemented; theoretical ceiling 1.11x (see below) < 1.3x by construction 1.3x DOCUMENT (ceiling re-affirms ADR-0180)
(b) ssimulacra2 AVX-512 (PTLR + IIR + scoring) 4.681 s 3.203 s 1.461x 1.3x AUDIT-PASS (already shipped; re-affirm)
(c-i) iqa_convolve AVX-512 (via float_ssim) 1.057 s 0.813 s 1.300x 1.3x AUDIT-PASS (already shipped; at threshold)
(c-ii) iqa_convolve AVX-512 (via float_ms_ssim) 1.450 s 1.236 s 1.173x 1.3x AUDIT-PASS-MARGINAL — outer-loop-amortised; see note

Two wider crosschecks (single-frame, no concat) gave consistent ratios (ssimulacra2 ON 0.319 s vs OFF 0.468 s = 1.467x; psnr_hvs AVX2 0.706 s vs scalar 1.040 s = 1.473x — same per-frame physics).

Per-candidate verdict

(a) psnr_hvs AVX-512 — DOCUMENT (ceiling, ADR-0180 re-affirmed)

The 8-wide DCT in psnr_hvs_avx2.c already runs at 1.473x scalar on this machine — the 8-row register file (8x __m256i) saturates on a 576x324 frame's 8x8 block grid. AVX-512 widening would batch two 8x8 blocks per call (the only structurally meaningful widening, since the existing kernel is already lane-perfect on 8 columns of int32s). The per-block scalar reductions (CSF mask, S2 mean, S1 mean, mse accumulator) sit in the host loop and run between FDCT calls — they cannot be widened without breaking the ADR-0138/0139 per-lane bit-exactness contract.

Theoretical ceiling estimate (deterministic, no hand-stub required):

  • DCT inner work: ~30 add/sub/mullo/shift ops per 8x8 block on AVX2.
  • Outer per-block scalar work: ~120 cycles (CSF multiplies, mean, variance, mse).
  • AVX-512 paired call: ~15 cycles inner per pair, ~120 cycles outer per block (cannot be paired because of bit-exactness).
  • Theoretical ceiling: (30 + 120) / (15 + 120) = 150/135 = 1.111x.

That's well below the 1.3x ship threshold. Verdict: AVX2 ceiling for psnr_hvs holds; AVX-512 follow-up closed.

This re-affirms the original T7-21 finding (1.17x scalar->AVX2 on a prior session) — same physics, slightly faster machine; the relative scalar->AVX2 ratio rose to 1.473x but the AVX2->AVX-512 headroom is unchanged because the outer scalar work dominates. ADR-0180 stays frozen per ADR-0028; the verdict is recorded in this research digest.

(b) ssimulacra2 AVX-512 — AUDIT-PASS

Already shipped (ADR-0161 phase 1, ADR-0162 IIR blur, ADR-0163 PTLR). Bench reaffirms the AVX-512 path is delivering 1.461x over AVX2 across the full ssimulacra2 pipeline (PTLR + IIR blur + scoring) — the post-merge cross-host snapshot drift / residual ULP audit (former T3-10) reduces to a "still good" finding on the audit machine.

Bit-exactness gate: AVX-512 vs AVX2 score JSON at full IEEE-754 precision (--precision max) is byte-identical for ssimulacra2, float_ssim, and float_ms_ssim across all 48 frames. Only the fps line differs (expected; runtime metadata, not a score). 0/48 frames diverge for any feature — 0 ULPs at places=4 and beyond.

diff -u /tmp/p_0.json /tmp/p_16.json | grep -v fps
3c3
---
(zero non-fps differences)

test_ssimulacra2_simd 13/13 subtests pass on the audit build.

(c) iqa_convolve AVX-512 — AUDIT-PASS

Already shipped (ADR-0138 §"Follow-up" promise, fulfilled in convolve_avx512.c, wired in float_ssim.c and float_ms_ssim.c dispatch). Bench shows 1.300x over AVX2 on float_ssim (exactly at threshold) and 1.173x on float_ms_ssim (sub-threshold but explained):

  • float_ssim runs the convolve on a single full-resolution plane pair — the AVX-512 16-lane double-accumulate widens the inner work proportionally; ratio matches the lane-count ratio after amortising memory bandwidth.
  • float_ms_ssim runs the convolve at 5 progressively half-sized scales. At the smallest 2 scales (36x20, 18x10) the kernel size approaches the input width and SIMD lanes are partially masked, so the AVX-512 advantage shrinks. The aggregate 1.173x is the expected outcome and matches the ADR-0138 §"Follow-up" prediction ("8-lane bandwidth-amortised on 8x8 windows").

Bit-exactness gate: see (b) above — float_ssim and float_ms_ssim JSON output is byte-identical across cpumask 0 and 16 at full precision. test_iqa_convolve 13/13 subtests pass.

The AVX-512 path is meeting its design ceiling; both float-SSIM and MS-SSIM dispatch tables wire it up correctly per float_ssim.c:117 and float_ms_ssim.c:94.

Cross-backend gate (validate-scores) — shipped candidates only

Feature scalar AVX2 AVX-512 AVX-512 == AVX2 AVX-512 vs scalar
ssimulacra2 (per-frame) 91.695976709632987 91.695976667734726 91.695976667734726 0 ULP delta (byte-identical) ~3.5e-9 relative (matches ADR-0163 PTLR LUT story)
float_ssim (per-frame) 1.0 1.0 1.0 0 ULP delta 0 ULP delta
float_ms_ssim (per-frame) 1.0 1.0 1.0 0 ULP delta 0 ULP delta

Frames in fixture: 48; non-zero score-line diffs between AVX-512 and AVX2: 0/48 for all three features. The cross-backend gate (ADR-0028 §"AVX-512 must be byte-identical to AVX2" extension of ADR-0138/0139) holds on shipped candidates.

Decision count

  • SHIP (new AVX-512 implementation in this PR): 0
  • AUDIT-PASS (existing AVX-512 path re-affirmed): 2 (ssimulacra2, iqa_convolve)
  • DOCUMENT (ceiling, no implementation): 1 (psnr_hvs)

Spec footnote — AVX-512 instruction subsets used

Candidate Subsets actually used Required CPU level
iqa_convolve_avx512 AVX-512F (FMA, masked load/store, double-precision adds), AVX-512VL (256-bit ops on __m256d reductions) Skylake-X / Zen 4 / Zen 5
ssimulacra2_avx512 (PTLR + IIR + scoring) AVX-512F (single + double FMA), AVX-512BW (byte/word permutes for 4:2:2 / 4:2:0 chroma reads), AVX-512VL Ice Lake / Zen 4 / Zen 5
psnr_hvs (not shipped) Would need AVX-512F (int32 add/sub/mullo/srai), AVX-512BW (8-row -> 16-row transpose), AVX-512VL Skylake-X+

Runtime detection in core/src/x86/cpu.c already distinguishes VMAF_X86_CPU_FLAG_AVX512 (F+BW+VL+DQ baseline) from VMAF_X86_CPU_FLAG_AVX512ICL (Ice Lake server: VBMI/VNNI/IFMA); shipped candidates target the baseline AVX-512 flag.

References

  • BACKLOG row T3-9 (.workingdir2/BACKLOG.md); replaces former T3-10 (cross-host post-merge audit) and T7-31 (iqa_convolve AVX-512 follow-up).
  • ADR-0028 — ADR immutability / append-only audit-trail rule (Status update precedent).
  • ADR-0138 — iqa_convolve AVX2 bit-exact precedent + AVX-512 follow-up promise.
  • ADR-0161 — phase 1 (pointwise + reductions) AVX2/AVX-512/NEON.
  • ADR-0162 — phase 2 (IIR blur) AVX2/AVX-512/NEON.
  • ADR-0163 — phase 3 (picture_to_linear_rgb) AVX2/AVX-512/NEON.
  • ADR-0180 — CPU coverage audit (T7-21 close-out: psnr_hvs AVX2 ceiling). Stays frozen per ADR-0028; this digest re-affirms its verdict on a faster machine.
  • Test vectors: python/test/resource/yuv/src01_hrc00_576x324.yuv and python/test/resource/yuv/src01_hrc01_576x324.yuv (Netflix CPU-golden corpus, see CLAUDE.md §8).
  • Reproducer: meson setup build -Denable_cuda=false -Denable_sycl=false && ninja -C build && cat ref.yuv (10x) > ref_x10.yuv && cat dis.yuv (10x) > dis_x10.yuv && for m in 0 16 24; do for f in float_ssim float_ms_ssim ssimulacra2; do time build/tools/vmaf -r ref_x10.yuv -d dis_x10.yuv -w 576 -h 324 -p 420 -b 8 --feature $f --no_prediction --threads 1 --cpumask $m -o /tmp/o.json --json; done; done.

Status

Findings recorded; per-ADR audit blocks appended in this PR (ADR-0138, -0161, -0162, -0163); BACKLOG T3-9 row marked DONE. psnr_hvs sub-row remains parked at ADR-0180's "AVX2 ceiling" verdict — no new ADR, no new code.