ADR-0180: CPU coverage matrix audit — close 5 stale gaps¶
- Status: Accepted
- Date: 2026-04-26
- Deciders: Lusoris, Claude (Anthropic)
- Tags: simd, audit, fork-local, doc-correction
Context¶
While picking the next CPU SIMD gap-fill from the post-T7-19 metrics-backends matrix, four supposed gaps and one open backlog item all turned out to be either already-shipped work or stale audit notes from earlier matrix snapshots. Filling them as code PRs would have been wasted effort; treating the matrix as authoritative without re-verification would have left the same phantom gaps in the audit dossier indefinitely.
The five items audited:
- T7-22 —
ms_ssimSSIM-per-scale SIMD: backlog said "still scalar"; in factfloat_ms_ssim::initalready dispatchesiqa_ssim_set_dispatchto AVX2 / AVX-512 / NEON variants ofssim_precompute/ssim_variance/ssim_accumulateper ADR-0138 / 0139 / 0140. Empirical: 3.2× wall-clock speedup vs--cpumask 0xfffffffeon Netflix normal pair (576×324, 48 frames). - CAMBI scalar fallback: matrix bullet claimed "no pure-C scalar path"; in fact
cambi.cshipsincrement_range/decrement_range/get_derivative_data_for_rowas scalar defaults ininit, overridden only when AVX2 / AVX-512 / NEON is detected. - motion_v2 NEON: matrix bullet claimed "x86 SIMD but no NEON"; in fact
arm64/motion_v2_neon.cis a 13kB bit-exact NEON port, fully wired in dispatch and meson sources, shipped since 2026-04. - integer
ansnrmatrix row: there is nointeger_ansnrregistered extractor; the row was a misread ofansnr_tools.c(helpers used by ADM, not a standalone feature). Onlyfloat_ansnrships. - T7-21 —
psnr_hvsAVX-512: this is a real audit question, not a phantom gap. The benchmark below settles it.
Decision¶
We close all five items in one audit ADR, update the metrics-backends matrix to reflect ground truth, and document the psnr_hvs AVX-512 verdict empirically rather than by extrapolation.
Verdict on psnr_hvs AVX-512: AVX2 ceiling — do not implement. Wall-clock benchmark on the Netflix normal pair (576×324, 48 frames):
| path | wall-time | speedup vs scalar |
|---|---|---|
psnr_hvs scalar (--cpumask 0xfffffffe) | 0.123 s | — |
psnr_hvs AVX2 (default) | 0.105 s | 1.17× |
The 8-wide AVX2 DCT (od_bin_fdct8x8_avx2) is already bandwidth-amortised — going to 16 lanes (AVX-512) would force a 2-block batch in the host loop, double the register pressure, and deliver no measurable wall-clock benefit because the per-block scalar reductions (variance, mask, error) remain scalar by design (ADR-0138 / 0139 / 0140 keep float reductions per-lane scalar for bit-exactness vs scalar). The kernel-side gain would be more than offset by the host-side complication.
The same conclusion already applies to the deferred float_moment AVX-512 path (ADR-0179 § Alternatives considered): "AVX-512 wouldn't deliver a measurable win over AVX2 here" — the moment reductions are memory-bound on a 4096-pixel frame. AVX-512 closes as AVX2 ceiling for both kernels.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Close each item with its own ADR + matrix edit | One audit decision per ADR — cleaner per-decision history | 5 ADRs for what is one audit pass; bulk of each ADR would be the same "matrix was stale, code is fine" reasoning | Bundling into one audit ADR captures the lesson (re-verify before code work) instead of distributing it across five micro-decisions |
| Implement psnr_hvs AVX-512 anyway | Closes the matrix row to ✓ | 1.17× scalar speedup of AVX2 already shows the DCT is not the bottleneck on practical fixtures; AVX-512 widening to 2-block batch adds host-loop complexity without payoff | Code work without measurable benefit; "AVX2 ceiling" is the correct close-out per the T7-21 backlog description's own escape clause |
| Implement float_moment AVX-512 anyway | Symmetry with AVX2 + NEON pair shipped in PR #122 | Memory-bound reduction on a 4096-pixel frame has no headroom for wider lanes (already documented in ADR-0179 § Alternatives) | Same as psnr_hvs — AVX2 ceiling |
Consequences¶
- Positive: matrix is now ground-truth; downstream backlog reflects actual code reality; no engineering time wasted on AVX-512 ports that wouldn't move the needle. Saves ~2-3 PRs worth of motion that would have closed as "no measurable difference" anyway.
- Negative: confidence in the matrix as a source of truth drops slightly — five stale entries means the matrix audit cadence needs tightening. Mitigation: every PR that closes a matrix gap now updates the matrix in the same diff (ADR-0179 set the precedent; T7-19 closure carries the example).
- Neutral / follow-ups: matrix corrections land in
.workingdir2/analysis/metrics-backends-matrix.md(gitignored — planning surface). Backlog corrections land in.workingdir2/BACKLOG.md: T7-21 → CLOSED (AVX2 ceiling), T7-22 → CLOSED (already done), the cambi/motion_v2/integer_ansnr items had no backlog rows. The remaining real backlog items in the CPU SIMD column are zero; the next gap surface is GPU long-tail (T7-23..T7-25 + everything else not on CUDA / SYCL / Vulkan).
References¶
- Source: user direction 2026-04-26 ("why not fill the cpu gaps across the table?") plus the matrix-vs-code re-verification pass that exposed the staleness.
- Pattern parent: ADR-0179 § "no AVX-512 — kernel is memory-bound" reasoning, applied to two more kernels with the same shape.
- Backlog rows: T7-21, T7-22 in
.workingdir2/BACKLOG.md— both closed by this ADR. - Matrix:
metrics-backends-matrix.md— five rows / bullets corrected. - Files re-verified during audit:
float_ms_ssim.c:88,cambi.c:446-460,arm64/motion_v2_neon.c,integer_motion_v2.c:202-204.