GPU backend kernel coverage audit (2026-05-30)
Scope
Cross-reference of the registered GPU feature extractors against existing parity / smoke tests under core/test/, to identify coverage gaps that would let a regression in a kernel's reduction, separable filter, or per-plane accumulator escape CI.
Method
```bash
# 1. Enumerate registered GPU extractors.
grep -rE '\.name\s*=\s*"[a-z0-9_]+_(cuda|hip|sycl|metal)"' \
    core/src/feature/{cuda,hip,sycl,metal} | sort
# 2. Enumerate test files referencing each.
find core/test -name '*test*cuda*' -o -name '*test*hip*' \
    -o -name '*test*sycl*' -o -name '*test*metal*'
# 3. Diff the two lists.
```
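Step 3 is only a comment in the commands above. A minimal sketch of that diff follows, assuming both lists have already been reduced by hand to extractor names (test names such as `test_integer_vif_cpu_cuda_parity` do not map one-to-one onto extractor names). The helper `coverage_gap` and the abbreviated sample lists are hypothetical and not part of the repository.

```python
def coverage_gap(registered, covered):
    """Return (untested extractor names, coverage percentage)."""
    registered = set(registered)
    covered = set(covered) & registered  # ignore tests for unregistered names
    missing = sorted(registered - covered)
    pct = 100.0 * len(covered) / len(registered) if registered else 0.0
    return missing, pct


# Hypothetical, abbreviated inputs; the real lists come from steps 1 and 2.
registered = ["motion_cuda", "vif_cuda", "psnr_cuda", "adm_cuda"]
covered = ["motion_cuda", "vif_cuda"]

missing, pct = coverage_gap(registered, covered)
print(missing, f"{pct:.0f} %")
```

The same call, run once per backend, would reproduce the per-backend rows of the coverage tables and the backlog lists.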
Pre-PR coverage (master tip bbcaa8d127)
| Backend | Registered extractors | Parity-tested | Coverage % |
|---|---|---|---|
| CUDA | 17 | 2 (motion_cuda, vif_cuda) | 12 % |
| HIP | 18 | 2 (motion_hip, adm_hip) | 11 % |
| SYCL | 17 | 2 (motion_sycl, cambi_sycl) | 12 % |
| Metal | 8 | 1 spot-asserted (motion_v2_metal) | 13 % |
Post-PR coverage (this branch)
| Backend | Registered extractors | Parity-tested | Δ | Newly covered |
|---|---|---|---|---|
| CUDA | 17 | 4 | +2 | psnr_cuda, ciede_cuda |
| HIP | 18 | 4 | +2 | psnr_hip, vif_hip |
| SYCL | 17 | 4 | +2 | psnr_sycl, vif_sycl |
| Metal | 8 | 8 (registration audit) + 1 | +7 | full 8-extractor name + TEMPORAL-flag audit |
Kernel selection rationale
For each of CUDA, HIP, and SYCL, picked the two highest-leverage kernels: PSNR everywhere, plus VIF on HIP/SYCL and CIEDE2000 on CUDA.
- PSNR (all three): per-plane SSE reduction; sensitive to work-group tiling and atomic-reduce ordering. Covers the most-shipped GPU compute pattern (host-side log10 plus device-side SSE).
- VIF (HIP/SYCL): separable Gaussian filter chain feeding M1..M4 statistical accumulators; sensitive to filter accuracy and accumulator order. CUDA already has `test_integer_vif_cpu_cuda_parity`.
- CIEDE2000 (CUDA): CIE-Lab conversion + chroma-rotation kernel; sensitive to the colour-conversion path.
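To illustrate why reduction order matters for the PSNR pattern, here is a minimal sketch. It assumes a float32 accumulator, which this audit does not confirm for any backend. The helpers `f32`, `reduce_f32`, and `psnr_db` are hypothetical, and the partial sums are contrived to make the rounding visible.

```python
import math
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def reduce_f32(partials):
    """Sum per-work-group partial SSEs left to right in emulated float32."""
    acc = 0.0
    for p in partials:
        acc = f32(acc + p)
    return acc


def psnr_db(sse, n_pixels, peak=255.0):
    """Host-side step: PSNR in dB from a device-side SSE."""
    if sse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak * n_pixels / sse)


# One large partial sum (2**24) and four small ones, combined in two orders.
partials = [16777216.0, 1.0, 1.0, 1.0, 1.0]
large_first = reduce_f32(partials)
small_first = reduce_f32(reversed(partials))

n_pixels = 1920 * 1080
delta_db = abs(psnr_db(large_first, n_pixels) - psnr_db(small_first, n_pixels))
print(large_first, small_first, delta_db)
```

Under these assumptions the two orders yield different SSE totals (16777216.0 versus 16777220.0), while the resulting PSNR values differ by roughly 1e-6 dB. That is the kind of drift a parity tolerance has to absorb without masking a real regression.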
Tolerance budget
Per ADR-0214:
- places=4 (1e-4) — unfiltered reductions (PSNR, CIEDE2000).
- places=3 (1e-3) — filtered features (VIF scale0).
GPU results are NOT bit-exact with CPU results, per the user-memory rule feedback_golden_gate_cpu_only. Tighter tolerances would produce false-positive failures rather than catch real regressions.
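The budget above can be sketched as a lookup plus an absolute-difference check. This assumes `places=N` means "absolute difference of at most 10^-N", as the parenthesised values suggest. The table `PLACES` and the helper `within_budget` are hypothetical names, not the project's test API.

```python
# Decimal places per feature, taken from the two bullets above.
PLACES = {"psnr": 4, "ciede2000": 4, "vif_scale0": 3}


def within_budget(feature, cpu_score, gpu_score):
    """True if the CPU/GPU scores agree within the feature's tolerance."""
    tolerance = 10.0 ** -PLACES[feature]
    return abs(cpu_score - gpu_score) <= tolerance


ok_psnr = within_budget("psnr", 39.05123, 39.05127)     # diff is about 4e-5
bad_psnr = within_budget("psnr", 39.0512, 39.0515)      # diff is about 3e-4
ok_vif = within_budget("vif_scale0", 0.9512, 0.9518)    # diff is about 6e-4
print(ok_psnr, bad_psnr, ok_vif)
```

Note that Python's `unittest.TestCase.assertAlmostEqual(places=N)` rounds the difference to N places before comparing with zero, which is slightly stricter than the check sketched here. Which of the two ADR-0214 intends is not stated in this audit.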
Remaining gaps (follow-up backlog)
core/test/ still lacks parity gates for 40 extractors:
- CUDA (13): `cambi_cuda`, `adm_cuda`, `float_psnr_cuda`, `float_vif_cuda`, `float_adm_cuda`, `float_motion_cuda`, `psnr_hvs_cuda`, `integer_ssim_cuda`, `float_ssim_cuda`, `float_ms_ssim_cuda`, `float_moment_cuda`, `speed_chroma_cuda`, `speed_temporal_cuda`.
- HIP (14): `ciede_hip`, `cambi_hip`, `float_psnr_hip`, `float_vif_hip`, `float_adm_hip`, `float_motion_hip`, `float_moment_hip`, `psnr_hvs_hip`, `integer_ssim_hip`, `float_ssim_hip`, `integer_ms_ssim_hip`, `motion_v2_hip`, `speed_chroma_hip`, `speed_temporal_hip`, `ssimulacra2_hip`.
- SYCL (13): `adm_sycl`, `ciede_sycl`, `psnr_hvs_sycl`, `integer_ssim_sycl`, `float_ssim_sycl`, `float_ms_ssim_sycl`, `float_psnr_sycl`, `float_vif_sycl`, `float_adm_sycl`, `float_motion_sycl`, `motion_v2_sycl`, `speed_chroma_sycl`, `speed_temporal_sycl`, `float_moment_sycl`, `ssimulacra2_sycl`.
Recommend tracking in .workingdir2/BACKLOG.md under gpu-coverage-tier-2, sized at ~6 tests per follow-up PR to stay within the 200–800 LOC bundle target.
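As a rough illustration of the sizing recommendation, the sketch below splits the CUDA backlog into groups of at most six. The helper `batch` is hypothetical, and the grouping is purely positional; a real plan would more likely group related kernels.

```python
def batch(items, size=6):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# CUDA backlog as listed above.
cuda_backlog = [
    "cambi_cuda", "adm_cuda", "float_psnr_cuda", "float_vif_cuda",
    "float_adm_cuda", "float_motion_cuda", "psnr_hvs_cuda",
    "integer_ssim_cuda", "float_ssim_cuda", "float_ms_ssim_cuda",
    "float_moment_cuda", "speed_chroma_cuda", "speed_temporal_cuda",
]

groups = batch(cuda_backlog)
for number, group in enumerate(groups, start=1):
    print(f"follow-up PR {number}: {', '.join(group)}")
```

With thirteen CUDA entries this gives three follow-up PRs of six, six, and one test.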
PR overlap audit
Confirmed no overlap with avoid-list PRs:
- #289 (CUDA PTX unload): touches `cuda/picture_cuda.c`, not kernel parity tests.
- #290 (HIP ssimulacra2): ships the `ssimulacra2_hip` kernel plus a bit-exactness gate scoped to that extractor only. Our HIP tests cover `psnr_hip` + `vif_hip`.
- #293 (SYCL 4-extractor): different SYCL extractors (`adm_sycl`, `float_motion_sycl`, etc.). Our SYCL tests cover `psnr_sycl` + `vif_sycl`.
- #294 (Metal dispatch): Metal dispatch-strategy assertion; our Metal test asserts the extractor-registration surface.
- #308 (HIP/Metal -ENOSYS stubs): scaffold-shape gates.
- #315 (orphan tests): wires already-written tests into meson; we add net-new test files.
References
- `req`: user request asking for a GPU-kernel test coverage push across CUDA + HIP + SYCL + Metal, with the avoid-list above.
- ADR-0214: cross-backend tolerance budget.
- ADR-0361: Metal backend rollout.
- `feedback_golden_gate_cpu_only` (user memory): GPU paths are not bit-exact with CPU.