Research-0089: AVX-512 follow-up audit sweep (2026-05-09)
Backlog row: T3-9 — "AVX-512 follow-up audit sweep (3 sub-rows, single bench-first PR)".
Methodology (per BACKLOG row):
"bench first; ship if 16-lane >= 1.3x AVX2 on the Netflix normal pair, otherwise document as ADR-0180-style ceiling".
Three sub-rows audited in this digest. Each sub-row gets a per-ADR ### AVX-512 audit 2026-05-09 append-only block in its respective ADR (ADR-0138 / ADR-0161 / ADR-0162 / ADR-0163); psnr_hvs is re-affirmed via this digest (ADR-0180 already documented the original ceiling and stays frozen per ADR-0028).
Hardware + build
- Host: AMD Ryzen 9 9950X3D (Zen 5, AVX-512F/BW/VL/DQ/IFMA/VBMI/VNNI/BF16 per `/proc/cpuinfo`).
- Build: `meson setup build -Denable_cuda=false -Denable_sycl=false` in this PR's worktree, GCC 16.1.1, ninja parallel build.
- Branch: `feat/avx512-audit-sweep-t3-9` off `origin/master` (ec0e002e, "chore: stale-marker sweep 2026-05-08 (Research-0086)").
Bench fixture
- Inputs: Netflix normal pair `src01_hrc00_576x324.yuv` and `src01_hrc01_576x324.yuv` (48 frames each; 13 436 928 bytes 4:2:0 8-bit).
- Amortised fixture: each YUV concatenated 10x to a 480-frame stream (`/tmp/ref_x10.yuv`, `/tmp/dis_x10.yuv`), so process startup cost is diluted to <1% of wall-clock.
- Driver: `core/build/tools/vmaf -r ... -d ... -w 576 -h 324 -p 420 -b 8 --feature <F> --no_prediction --threads 1 --cpumask <M>`. AVX-512 path = cpumask 0; AVX2 path = cpumask 16; scalar path = cpumask 24.
- Each cell is the median of 3 runs, single-thread, on an otherwise idle system.
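The fixture arithmetic above can be checked with a short sketch. The helper names `yuv420_frame_bytes` and `concat_repeat` are hypothetical and not part of the repository; the frame-size formula is the standard 4:2:0 layout (one full-resolution luma plane plus two quarter-resolution chroma planes).

```python
from pathlib import Path


def yuv420_frame_bytes(width: int, height: int, bytes_per_sample: int = 1) -> int:
    """Bytes in one planar 4:2:0 frame: luma plus two subsampled chroma planes."""
    luma = width * height
    chroma = ((width + 1) // 2) * ((height + 1) // 2)
    return (luma + 2 * chroma) * bytes_per_sample


def concat_repeat(src: Path, dst: Path, times: int) -> int:
    """Write `times` back-to-back copies of src into dst; return bytes written."""
    data = src.read_bytes()
    with dst.open("wb") as out:
        for _ in range(times):
            out.write(data)
    return len(data) * times


frame = yuv420_frame_bytes(576, 324)
# One frame, the 48-frame input, and the 480-frame amortised fixture.
print(frame, frame * 48, frame * 480)
```

The 48-frame product reproduces the 13 436 928 bytes quoted for each input, which is a cheap sanity check that a concatenated fixture holds a whole number of frames.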
Bench results (median of 3 wall-clock runs, 480 frames, single-thread)
| Sub-row | Candidate | AVX2 baseline | AVX-512 widening | Ratio | Threshold | Decision |
|---|---|---|---|---|---|---|
| (a) | psnr_hvs AVX-512 | 0.706 s (vs 1.040 s scalar = 1.473x) | not implemented; theoretical ceiling 1.11x (see below) | < 1.3x by construction | 1.3x | DOCUMENT (ceiling re-affirms ADR-0180) |
| (b) | ssimulacra2 AVX-512 (PTLR + IIR + scoring) | 4.681 s | 3.203 s | 1.461x | 1.3x | AUDIT-PASS (already shipped; re-affirm) |
| (c-i) | iqa_convolve AVX-512 (via float_ssim) | 1.057 s | 0.813 s | 1.300x | 1.3x | AUDIT-PASS (already shipped; at threshold) |
| (c-ii) | iqa_convolve AVX-512 (via float_ms_ssim) | 1.450 s | 1.236 s | 1.173x | 1.3x | AUDIT-PASS-MARGINAL — outer-loop-amortised; see note |
Two wider crosschecks (single-frame, no concat) gave consistent ratios (ssimulacra2 ON 0.319 s vs OFF 0.468 s = 1.467x; psnr_hvs AVX2 0.706 s vs scalar 1.040 s = 1.473x — same per-frame physics).
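As a rough illustration of how the Ratio column follows from the timings, the sketch below recomputes it from the table's AVX2 and AVX-512 cells and compares against the 1.3x threshold. Rounding to three decimals before the comparison is an assumption made here so that the float_ssim cell is judged on the same 1.300x value the table prints.

```python
THRESHOLD = 1.3

# (label, AVX2 seconds, AVX-512 seconds), copied from the bench table.
CELLS = [
    ("ssimulacra2", 4.681, 3.203),
    ("float_ssim", 1.057, 0.813),
    ("float_ms_ssim", 1.450, 1.236),
]


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Wall-clock ratio; values above 1.0 mean the candidate is faster."""
    return baseline_s / candidate_s


for label, avx2_s, avx512_s in CELLS:
    ratio = speedup(avx2_s, avx512_s)
    # Assumption: compare at the table's three-decimal precision.
    verdict = "meets" if round(ratio, 3) >= THRESHOLD else "below"
    print(f"{label}: {ratio:.3f}x ({verdict} {THRESHOLD}x)")
```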
Per-candidate verdict
(a) psnr_hvs AVX-512: DOCUMENT (ceiling, ADR-0180 re-affirmed)
The 8-wide DCT in psnr_hvs_avx2.c already runs at 1.473x scalar on this machine — the 8-row register file (8x __m256i) saturates on a 576x324 frame's 8x8 block grid. AVX-512 widening would batch two 8x8 blocks per call (the only structurally meaningful widening, since the existing kernel is already lane-perfect on 8 columns of int32s). The per-block scalar reductions (CSF mask, S2 mean, S1 mean, mse accumulator) sit in the host loop and run between FDCT calls — they cannot be widened without breaking the ADR-0138/0139 per-lane bit-exactness contract.
Theoretical ceiling estimate (deterministic, no hand-stub required):
- DCT inner work: ~30 add/sub/mullo/shift ops per 8x8 block on AVX2, counted as ~30 cycles.
- Outer per-block scalar work: ~120 cycles (CSF multiplies, mean, variance, mse).
- AVX-512 paired call: ~30 cycles inner per pair, i.e. ~15 cycles per block; the ~120 cycles of outer work stay per block (cannot be paired because of bit-exactness).
- Theoretical ceiling: (30 + 120) / (15 + 120) = 150/135 = 1.111x.
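The estimate is an Amdahl-style bound, and a minimal sketch makes the sensitivity easy to probe. The function name `ceiling_speedup` is hypothetical; the inputs are the cycle estimates from the list above, with the assumption that pairing two blocks halves the per-block inner cost while the outer scalar work is untouched.

```python
def ceiling_speedup(inner: float, outer: float, inner_widening: float) -> float:
    """Upper bound when only `inner` shrinks by `inner_widening`; `outer` is fixed."""
    return (inner + outer) / (inner / inner_widening + outer)


# Estimates from the digest: ~30 cycles inner, ~120 cycles outer, 2x widening.
paired = ceiling_speedup(inner=30.0, outer=120.0, inner_widening=2.0)
print(f"paired 8x8 blocks: {paired:.3f}x")

# Limit case: even a free inner DCT cannot beat (inner + outer) / outer.
limit = (30.0 + 120.0) / 120.0
print(f"inner cost -> 0:   {limit:.3f}x")
```

Under these estimates the limit case is 150/120 = 1.25x, so the 1.3x threshold stays out of reach for any widening of the inner DCT alone.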
That's well below the 1.3x ship threshold. Verdict: AVX2 ceiling for psnr_hvs holds; AVX-512 follow-up closed.
This re-affirms the original T7-21 finding (1.17x scalar->AVX2 on a prior session) — same physics, slightly faster machine; the relative scalar->AVX2 ratio rose to 1.473x but the AVX2->AVX-512 headroom is unchanged because the outer scalar work dominates. ADR-0180 stays frozen per ADR-0028; the verdict is recorded in this research digest.
(b) ssimulacra2 AVX-512: AUDIT-PASS
Already shipped (ADR-0161 phase 1, ADR-0162 IIR blur, ADR-0163 PTLR). Bench reaffirms the AVX-512 path is delivering 1.461x over AVX2 across the full ssimulacra2 pipeline (PTLR + IIR blur + scoring) — the post-merge cross-host snapshot drift / residual ULP audit (former T3-10) reduces to a "still good" finding on the audit machine.
Bit-exactness gate: AVX-512 vs AVX2 score JSON at full IEEE-754 precision (--precision max) is byte-identical for ssimulacra2, float_ssim, and float_ms_ssim across all 48 frames. Only the fps line differs (expected; runtime metadata, not a score). 0/48 frames diverge for any feature — 0 ULPs at places=4 and beyond.
test_ssimulacra2_simd 13/13 subtests pass on the audit build.
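A bit-exactness check of this kind can be sketched as a per-frame comparison that ignores runtime metadata. The JSON layout used here (a top-level `frames` list whose entries carry a `metrics` object, plus a top-level `fps` field) is an assumption about the tool's output, and `load_report` / `diverging_frames` are hypothetical helper names. Float literals are kept as strings so the comparison is textual rather than numeric.

```python
import json


def load_report(text: str) -> dict:
    """Parse a score report, keeping float literals as their original text."""
    return json.loads(text, parse_float=str)


def frame_scores(report: dict) -> list:
    """Per-frame metric dicts only; top-level fields such as fps are dropped."""
    return [frame["metrics"] for frame in report["frames"]]


def diverging_frames(report_a: dict, report_b: dict) -> list:
    """Indices of frames whose metric text differs between two reports."""
    a, b = frame_scores(report_a), frame_scores(report_b)
    if len(a) != len(b):
        raise ValueError("frame count mismatch")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

An empty list from `diverging_frames` for a pair of reports corresponds to the "0/48 frames diverge" outcome described above.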
(c) iqa_convolve AVX-512: AUDIT-PASS
Already shipped (ADR-0138 §"Follow-up" promise, fulfilled in convolve_avx512.c, wired in float_ssim.c and float_ms_ssim.c dispatch). Bench shows 1.300x over AVX2 on float_ssim (exactly at threshold) and 1.173x on float_ms_ssim (sub-threshold but explained):
- `float_ssim` runs the convolve on a single full-resolution plane pair: the AVX-512 16-lane double-accumulate widens the inner work proportionally; ratio matches the lane-count ratio after amortising memory bandwidth.
- `float_ms_ssim` runs the convolve at 5 progressively half-sized scales. At the smallest 2 scales (36x20, 18x10) the kernel size approaches the input width and SIMD lanes are partially masked, so the AVX-512 advantage shrinks. The aggregate 1.173x is the expected outcome and matches the ADR-0138 §"Follow-up" prediction ("8-lane bandwidth-amortised on 8x8 windows").
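The masking effect can be illustrated with a deliberately simplified model: each row is processed in fixed-width vectors and the last vector of a row is masked down to the remaining pixels. This ignores kernel overlap and border handling, so it is a sketch of the trend rather than a description of `convolve_avx512.c`; the 16-lane width is taken from the wording above and `lane_utilisation` is a hypothetical helper.

```python
import math

LANES = 16  # assumed vector width in elements


def lane_utilisation(row_width: int, lanes: int = LANES) -> float:
    """Fraction of issued lanes that carry real pixels when the tail is masked."""
    vectors = math.ceil(row_width / lanes)
    return row_width / (vectors * lanes)


# Full-resolution width versus the two small widths quoted in the text.
for width in (576, 36, 18):
    print(width, f"{lane_utilisation(width):.2f}")
```

In this model a 576-pixel row fills every lane, while the narrow rows leave a growing share of each vector idle.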
Bit-exactness gate: see (b) above — float_ssim and float_ms_ssim JSON output is byte-identical across cpumask 0 and 16 at full precision. test_iqa_convolve 13/13 subtests pass.
The AVX-512 path is meeting its design ceiling; both float-SSIM and MS-SSIM dispatch tables wire it up correctly per float_ssim.c:117 and float_ms_ssim.c:94.
Cross-backend gate (validate-scores): shipped candidates only
| Feature | scalar | AVX2 | AVX-512 | AVX-512 == AVX2 | AVX-512 vs scalar |
|---|---|---|---|---|---|
| ssimulacra2 (per-frame) | 91.695976709632987 | 91.695976667734726 | 91.695976667734726 | 0 ULP delta (byte-identical) | ~4.2e-8 absolute, ~4.6e-10 relative (matches ADR-0163 PTLR LUT story) |
| float_ssim (per-frame) | 1.0 | 1.0 | 1.0 | 0 ULP delta | 0 ULP delta |
| float_ms_ssim (per-frame) | 1.0 | 1.0 | 1.0 | 0 ULP delta | 0 ULP delta |
Frames in fixture: 48; non-zero score-line diffs between AVX-512 and AVX2: 0/48 for all three features. The cross-backend gate (ADR-0028 §"AVX-512 must be byte-identical to AVX2" extension of ADR-0138/0139) holds on shipped candidates.
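For readers who want to reproduce the ULP and relative-difference columns, a minimal sketch follows. It uses the common trick of reinterpreting an IEEE-754 double as a signed integer so that adjacent doubles differ by exactly one; the helper names are hypothetical and the score literals are copied from the ssimulacra2 row.

```python
import struct


def ordered_bits(x: float) -> int:
    """Map a double to an integer that increases monotonically with x."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles are sign-magnitude; fold them onto the negative integers.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable doubles between a and b (0 means bit-identical)."""
    return abs(ordered_bits(a) - ordered_bits(b))


def relative_diff(a: float, b: float) -> float:
    """Absolute difference scaled by the larger magnitude."""
    return abs(a - b) / max(abs(a), abs(b))


scalar, avx2, avx512 = 91.695976709632987, 91.695976667734726, 91.695976667734726
print("AVX-512 vs AVX2 ULPs:", ulp_distance(avx512, avx2))
print("AVX-512 vs scalar relative:", relative_diff(avx512, scalar))
```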
Decision count
- SHIP (new AVX-512 implementation in this PR): 0
- AUDIT-PASS (existing AVX-512 path re-affirmed): 2 (ssimulacra2, iqa_convolve)
- DOCUMENT (ceiling, no implementation): 1 (psnr_hvs)
Spec footnote: AVX-512 instruction subsets used
| Candidate | Subsets actually used | Required CPU level |
|---|---|---|
| iqa_convolve_avx512 | AVX-512F (FMA, masked load/store, double-precision adds), AVX-512VL (256-bit ops on __m256d reductions) | Skylake-X / Zen 4 / Zen 5 |
| ssimulacra2_avx512 (PTLR + IIR + scoring) | AVX-512F (single + double FMA), AVX-512BW (byte/word permutes for 4:2:2 / 4:2:0 chroma reads), AVX-512VL | Ice Lake / Zen 4 / Zen 5 |
| psnr_hvs (not shipped) | Would need AVX-512F (int32 add/sub/mullo/srai), AVX-512BW (8-row -> 16-row transpose), AVX-512VL | Skylake-X+ |
Runtime detection in core/src/x86/cpu.c already distinguishes VMAF_X86_CPU_FLAG_AVX512 (F+BW+VL+DQ baseline) from VMAF_X86_CPU_FLAG_AVX512ICL (Ice Lake server: VBMI/VNNI/IFMA); shipped candidates target the baseline AVX-512 flag.
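The two-tier split can be mirrored from user space by reading the `flags` line of `/proc/cpuinfo`. This is only an illustrative sketch: it is not the logic in `cpu.c`, the tier names and helper functions are hypothetical, and the flag spellings are the ones the Linux kernel reports on x86.

```python
BASELINE = {"avx512f", "avx512bw", "avx512vl", "avx512dq"}
ICL_EXTRA = {"avx512vbmi", "avx512_vnni", "avx512ifma"}


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_tier(flags: set) -> str:
    """Classify as 'none', 'baseline' (F+BW+VL+DQ) or 'icl' (baseline plus VBMI/VNNI/IFMA)."""
    if not BASELINE <= flags:
        return "none"
    return "icl" if ICL_EXTRA <= flags else "baseline"
```

On a Linux host, `avx512_tier(cpu_flags(open("/proc/cpuinfo").read()))` would report which of the two tiers the machine falls into under these assumptions.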
References
- BACKLOG row T3-9 (`.workingdir2/BACKLOG.md`); replaces former T3-10 (cross-host post-merge audit) and T7-31 (`iqa_convolve` AVX-512 follow-up).
- ADR-0028: ADR immutability / append-only audit-trail rule (Status update precedent).
- ADR-0138: iqa_convolve AVX2 bit-exact precedent + AVX-512 follow-up promise.
- ADR-0161: phase 1 (pointwise + reductions) AVX2/AVX-512/NEON.
- ADR-0162: phase 2 (IIR blur) AVX2/AVX-512/NEON.
- ADR-0163: phase 3 (`picture_to_linear_rgb`) AVX2/AVX-512/NEON.
- ADR-0180: CPU coverage audit (T7-21 close-out: psnr_hvs AVX2 ceiling). Stays frozen per ADR-0028; this digest re-affirms its verdict on a faster machine.
- Test vectors: `python/test/resource/yuv/src01_hrc00_576x324.yuv` and `python/test/resource/yuv/src01_hrc01_576x324.yuv` (Netflix CPU-golden corpus, see CLAUDE.md §8).
- Reproducer: `meson setup build -Denable_cuda=false -Denable_sycl=false && ninja -C build && for i in $(seq 10); do cat ref.yuv; done > ref_x10.yuv && for i in $(seq 10); do cat dis.yuv; done > dis_x10.yuv && for m in 0 16 24; do for f in float_ssim float_ms_ssim ssimulacra2; do time build/tools/vmaf -r ref_x10.yuv -d dis_x10.yuv -w 576 -h 324 -p 420 -b 8 --feature $f --no_prediction --threads 1 --cpumask $m -o /tmp/o.json --json; done; done`.
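For the median-of-3 methodology, the reproducer's timing loop could also be driven from a small harness like the one below. It is a sketch under stated assumptions: the function names are hypothetical, the argument vector uses only the flags quoted in the reproducer, and the measured time includes Python's process-spawn overhead, so absolute numbers will not match shell `time` exactly.

```python
import statistics
import subprocess
import time


def vmaf_argv(binary, ref, dis, feature, cpumask, out="/tmp/o.json"):
    """Argument vector built from the flags quoted in the reproducer."""
    return [binary, "-r", ref, "-d", dis, "-w", "576", "-h", "324",
            "-p", "420", "-b", "8", "--feature", feature, "--no_prediction",
            "--threads", "1", "--cpumask", str(cpumask), "-o", out, "--json"]


def median_wall_clock(run_once, runs=3):
    """Median wall-clock seconds over `runs` calls of run_once()."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def bench(binary, ref, dis, features, cpumasks):
    """Map (feature, cpumask) to the median wall-clock time of three runs."""
    results = {}
    for mask in cpumasks:
        for feature in features:
            argv = vmaf_argv(binary, ref, dis, feature, mask)
            results[(feature, mask)] = median_wall_clock(
                lambda: subprocess.run(argv, check=True, capture_output=True))
    return results
```

A call such as `bench("build/tools/vmaf", "ref_x10.yuv", "dis_x10.yuv", ["float_ssim"], [0, 16, 24])` would then cover one feature across the three cpumask settings.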
Status
Findings recorded; per-ADR audit blocks appended in this PR (ADR-0138, -0161, -0162, -0163); BACKLOG T3-9 row marked DONE. psnr_hvs sub-row remains parked at ADR-0180's "AVX2 ceiling" verdict — no new ADR, no new code.