GPU backend kernel coverage audit - 2026-05-30

Scope

Cross-reference of the registered GPU feature extractors against existing parity / smoke tests under core/test/, to identify coverage gaps that would let a regression in a kernel's reduction, separable filter, or per-plane accumulator escape CI.

Method

```shell
# 1. Enumerate registered GPU extractors.
grep -rE '\.name\s*=\s*"[a-z0-9_]+_(cuda|hip|sycl|metal)"' \
  core/src/feature/{cuda,hip,sycl,metal} | sort

# 2. Enumerate test files referencing each.
find core/test -name '*test*cuda*' -o -name '*test*hip*' \
  -o -name '*test*sycl*' -o -name '*test*metal*'

# 3. Diff the two lists.
```
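Step 3 amounts to a set difference between the registered names and the names that already have a test. A minimal sketch in Python, assuming both lists have already been collected from steps 1 and 2; the helper name `coverage_gaps` and the sample inputs are hypothetical and are not the audit's actual tooling:

```python
def coverage_gaps(registered, tested):
    """Return registered extractor names that have no parity test, sorted."""
    return sorted(set(registered) - set(tested))


# Hypothetical inputs for illustration only.
registered = ["motion_cuda", "vif_cuda", "psnr_cuda", "ciede_cuda"]
tested = ["motion_cuda", "vif_cuda"]

print(coverage_gaps(registered, tested))
```

Working on sets rather than raw grep output keeps the comparison independent of file ordering and of duplicate matches.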

Pre-PR coverage (master tip `bbcaa8d127`)

Backend	Registered extractors	Parity-tested	Coverage %
CUDA	17	2 (`motion_cuda`, `vif_cuda`)	12 %
HIP	18	2 (`motion_hip`, `adm_hip`)	11 %
SYCL	17	2 (`motion_sycl`, `cambi_sycl`)	12 %
Metal	8	1 spot-asserted (`motion_v2_metal`)	13 %
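The coverage column is the parity-tested count divided by the registered count, shown as a whole percent. The sketch below reproduces it; the helper name `coverage_percent` is hypothetical, and round-half-up is an assumption inferred from the Metal row (1/8 = 12.5 %, shown as 13 %):

```python
from decimal import Decimal, ROUND_HALF_UP


def coverage_percent(tested: int, registered: int) -> int:
    """Parity-tested share of registered extractors, as a whole percent."""
    ratio = Decimal(tested) * 100 / Decimal(registered)
    # Round half up so 12.5 becomes 13; Python's built-in round() would give 12.
    return int(ratio.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# (tested, registered) pairs taken from the pre-PR table.
pre_pr = {"CUDA": (2, 17), "HIP": (2, 18), "SYCL": (2, 17), "Metal": (1, 8)}
for backend, (tested, registered) in pre_pr.items():
    print(f"{backend}: {coverage_percent(tested, registered)} %")
```

Applied to the post-PR counts, the same helper gives 24 % for 4 of 17 and 22 % for 4 of 18.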

Post-PR coverage (this branch)

Backend	Registered	Parity-tested	Δ	New
CUDA	17	4	+2	`psnr_cuda`, `ciede_cuda`
HIP	18	4	+2	`psnr_hip`, `vif_hip`
SYCL	17	4	+2	`psnr_sycl`, `vif_sycl`
Metal	8	8 (registration audit) + 1	+7	full 8-extractor name + TEMPORAL-flag audit

Kernel selection rationale

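For each of CUDA, HIP, and SYCL, the PR adds parity tests for the two highest-leverage kernels (PSNR everywhere, plus VIF on HIP/SYCL and CIEDE2000 on CUDA):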

PSNR — per-plane SSE reduction; sensitive to work-group tiling and atomic-reduce ordering. Covers the most-shipped GPU compute pattern (host-side log10 plus device-side SSE).
VIF (HIP/SYCL) — separable Gaussian filter chain feeding M1..M4 statistical accumulators; sensitive to filter accuracy and accumulator order. CUDA already has test_integer_vif_cpu_cuda_parity.
CIEDE2000 (CUDA) — CIE-Lab conversion + chroma-rotation kernel; sensitive to the colour-conversion path.
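The PSNR pattern above (SSE reduced on the device, log10 applied on the host) can be sketched on the CPU as follows. This is an illustration only, not the project's kernel: the function names, the tile size, and the 8-bit peak of 255 are assumptions, and the tiled variant merely mimics a work-group reduction by combining per-tile partial sums in a different order.

```python
import math


def sse_sequential(ref, dis):
    """Sum of squared errors accumulated strictly left to right."""
    total = 0.0
    for r, d in zip(ref, dis):
        total += (r - d) ** 2
    return total


def sse_tiled(ref, dis, tile=4):
    """Same SSE, but reduced per tile and then combined in reverse order."""
    partials = [
        sum((r - d) ** 2 for r, d in zip(ref[i:i + tile], dis[i:i + tile]))
        for i in range(0, len(ref), tile)
    ]
    return float(sum(reversed(partials)))


def psnr_from_sse(sse, count, peak=255.0):
    """Host-side step: PSNR = 10 * log10(peak^2 / MSE)."""
    if sse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / (sse / count))


# Synthetic float samples; real planes would come from decoded frames.
ref = [0.1 * i for i in range(64)]
dis = [0.1 * i + 0.01 * (i % 7) for i in range(64)]

psnr_a = psnr_from_sse(sse_sequential(ref, dis), len(ref))
psnr_b = psnr_from_sse(sse_tiled(ref, dis), len(ref))
print(abs(psnr_a - psnr_b) < 1e-4)
```

Floating-point addition is not associative, so the two reduction orders may differ in the last bits. A parity test therefore compares against a tolerance rather than for equality.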

Tolerance budget

Per ADR-0214:

places=4 (1e-4) — unfiltered reductions (PSNR, CIEDE2000).
places=3 (1e-3) — filtered features (VIF scale0).

GPU results are NOT bit-exact with CPU results, per the user-memory rule feedback_golden_gate_cpu_only. Tolerances tighter than these would fail on legitimate floating-point differences, i.e. report false positives.
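The budget above can be expressed as a small lookup plus an absolute-difference check. A minimal sketch, assuming `places=N` is read as a bound of 10^-N as the list states; the names `PLACES` and `within_budget` and the sample scores are hypothetical:

```python
# Decimal places per feature class, following the budget listed above.
PLACES = {"unfiltered": 4, "filtered": 3}


def within_budget(cpu_score: float, gpu_score: float, kind: str) -> bool:
    """True when the CPU/GPU difference is inside the class tolerance."""
    tolerance = 10.0 ** -PLACES[kind]
    return abs(cpu_score - gpu_score) < tolerance


# Hypothetical scores for illustration only.
print(within_budget(30.12345, 30.12349, "unfiltered"))  # difference of 4e-5
print(within_budget(0.9000, 0.9005, "filtered"))        # difference of 5e-4
print(within_budget(30.0, 30.0002, "unfiltered"))       # difference of 2e-4
```

If the gates are written with Python's `unittest.TestCase.assertAlmostEqual(places=N)`, note that it rounds the difference to N decimal places before comparing with zero, which makes the effective bound roughly half of 10^-N.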

Remaining gaps (follow-up backlog)

core/test/ still lacks parity gates for 40 extractors:

CUDA (13): cambi_cuda, adm_cuda, float_psnr_cuda, float_vif_cuda, float_adm_cuda, float_motion_cuda, psnr_hvs_cuda, integer_ssim_cuda, float_ssim_cuda, float_ms_ssim_cuda, float_moment_cuda, speed_chroma_cuda, speed_temporal_cuda.
HIP (14): ciede_hip, cambi_hip, float_psnr_hip, float_vif_hip, float_adm_hip, float_motion_hip, float_moment_hip, psnr_hvs_hip, integer_ssim_hip, float_ssim_hip, integer_ms_ssim_hip, motion_v2_hip, speed_chroma_hip, speed_temporal_hip, ssimulacra2_hip.
SYCL (13): adm_sycl, ciede_sycl, psnr_hvs_sycl, integer_ssim_sycl, float_ssim_sycl, float_ms_ssim_sycl, float_psnr_sycl, float_vif_sycl, float_adm_sycl, float_motion_sycl, motion_v2_sycl, speed_chroma_sycl, speed_temporal_sycl, float_moment_sycl, ssimulacra2_sycl.

Recommend tracking in .workingdir2/BACKLOG.md under gpu-coverage-tier-2, sized at ~6 tests per follow-up PR to stay within the 200–800 LOC bundle target.
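The sizing recommendation can be sketched as a simple chunking of the backlog. The helper name `batch_backlog` is hypothetical, and the batch size of 6 is the approximate figure quoted above rather than a hard limit:

```python
def batch_backlog(extractors, batch_size=6):
    """Split a backlog into consecutive batches of at most batch_size names."""
    return [
        extractors[i:i + batch_size]
        for i in range(0, len(extractors), batch_size)
    ]


# The CUDA backlog as listed above.
cuda_backlog = [
    "cambi_cuda", "adm_cuda", "float_psnr_cuda", "float_vif_cuda",
    "float_adm_cuda", "float_motion_cuda", "psnr_hvs_cuda",
    "integer_ssim_cuda", "float_ssim_cuda", "float_ms_ssim_cuda",
    "float_moment_cuda", "speed_chroma_cuda", "speed_temporal_cuda",
]

for number, batch in enumerate(batch_backlog(cuda_backlog), start=1):
    print(f"follow-up PR {number}: {len(batch)} tests")
```

In practice a trailing batch of one test would likely be merged into a neighbouring PR or combined with another backend's backlog.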

PR overlap audit

Confirmed no overlap with avoid-list PRs:

#289 (CUDA PTX unload) — touches cuda/picture_cuda.c, not kernel parity tests.
#290 (HIP ssimulacra2) — ships ssimulacra2_hip kernel + a bit-exactness gate scoped to that extractor only. Our HIP tests cover psnr_hip + vif_hip.
#293 (SYCL 4-extractor) — different SYCL extractors (adm_sycl, float_motion_sycl, etc.). Our SYCL tests cover psnr_sycl + vif_sycl.
#294 (Metal dispatch) — Metal dispatch-strategy assertion; our Metal test asserts the extractor-registration surface.
#308 (HIP/Metal -ENOSYS stubs) — scaffold-shape gates.
#315 (orphan tests) — wires already-written tests into meson; we add net-new test files.

References

req — user request asking for GPU-kernel test coverage push across CUDA + HIP + SYCL + Metal with the avoid-list above.
ADR-0214 — cross-backend tolerance budget.
ADR-0361 — Metal backend rollout.
feedback_golden_gate_cpu_only (user memory) — GPU paths are not bit-exact with CPU.