ADR-0956: CUDA kernel parity coverage — round 4 (last 5 uncovered kernels)¶
- Status: Accepted
- Date: 2026-05-31
- Deciders: lusoris, Claude
- Tags: testing, cuda, parity, fork-local, gpu-coverage
Context¶
The 2026-05-30 GPU-backend kernel coverage audit (docs/research/gpu-backend-kernel-coverage-audit-2026-05-30.md, ADR-0868) catalogued 19 CUDA feature-extractor registrations under core/src/feature/cuda/. Round 1 (PR #351, ADR-0868) added 2 CUDA parity tests (psnr_cuda, ciede_cuda); round 2 (PR #374, ADR-0886) queued 5 more (adm_cuda, motion_v2_cuda, cambi_cuda, psnr_hvs_cuda, integer_ssim_cuda); round 3 (PR #442, ADR-0947) added 5 float-path twins (float_psnr_cuda, float_vif_cuda, float_ms_ssim_cuda, float_moment_cuda, ssimulacra2_cuda). Combined with the pre-existing master gates (test_cuda_motion3_parity.c, test_integer_vif_cpu_cuda_parity.c) that brings cumulative coverage to 14 of 19 kernels (~74 %).
This ADR closes the last 5 uncovered kernels:
| Kernel | CPU twin | Gate type |
|---|---|---|
float_adm_cuda | float_adm | CPU-vs-CUDA parity (ADR-0214) |
float_motion_cuda | float_motion | CPU-vs-CUDA parity (ADR-0214) |
float_ssim_cuda | float_ssim | CPU-vs-CUDA parity (ADR-0214) |
speed_chroma_cuda | (none — CUDA-only) | finite-score smoke |
speed_temporal_cuda | (none — CUDA-only) | finite-score smoke |
The first three follow the round-3 parity-test template verbatim. The two speed_*_cuda extractors are CUDA-only — there is no CPU twin emitting the same Speed_{chroma,temporal}_feature_*_score family (the closest CPU surface, speed_qa, computes a different spatial+temporal scalar). For those, a CPU-vs-CUDA parity assertion is the wrong tool; instead the gate is a smoke test that runs the extractor end-to-end on a CUDA device and asserts finite, sane output. That catches NaN/Inf drift from kernel-grid changes or covariance-matrix degenerate cases — the failure modes the ADR-0567 chroma/temporal eigendecomp path is exposed to.
Decision¶
Add five fork-local tests under core/test/:
| Test file | Kernel | Probed feature(s) |
|---|---|---|
test_cuda_float_adm_parity.c | float_adm_cuda | VMAF_feature_adm2_score + four adm_scale[0..3]_score |
test_cuda_float_motion_parity.c | float_motion_cuda | VMAF_feature_motion_score, ..._motion2_score (motion3 is host-side; covered by test_cuda_motion3_parity.c) |
test_cuda_float_ssim_parity.c | float_ssim_cuda | float_ssim |
test_cuda_speed_chroma_smoke.c | speed_chroma_cuda | three Speed_chroma_feature_speed_chroma_{u,v,uv}_score |
test_cuda_speed_temporal_smoke.c | speed_temporal_cuda | Speed_temporal_feature_speed_temporal_score |
The three parity tests follow the round-3 test_cuda_float_psnr_parity.c template: 256x144 YUV420P 8-bpc synthetic fixture, 3 frames, score read at frame index 1, places=4 / 1e-4 tolerance (ADR-0214), [skip: no CUDA device] guard when vmaf_cuda_state_init fails.
The two smoke tests use a larger 640x360 fixture so the operating resolution after the ADR-0567 2^4 decimation (40x22 luma, 20x11 chroma) admits a non-singular covariance matrix; they assert finite scores and skip cleanly when no CUDA device is visible.
All five wire into core/test/meson.build behind the existing get_option('enable_cuda') guard, register with suite ['fast', 'gpu'], and link the same (pthread, cuda, math) dependency triple as test_cuda_motion3_parity.
Post-PR CUDA-extractor coverage rises from 14 of 19 (~74 %, after PRs #351 / #374 / #442 land) to 19 of 19 (100 %) once this PR merges. The kernel-coverage backlog is then closed.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Add parity tests only (3 float twins) and defer the speed kernels again | Simpler review, fewer test files | The speed_*_cuda kernels are exactly the ones most likely to silently drift (host-side eigendecomp with degenerate-matrix paths); deferring again kicks the can a fourth round | Smoke gates close the gap without inventing CPU twins; the cost is two ~150 LOC files |
Synthesise a CPU reference for speed_* (port the eigendecomp to a CPU-only TU and parity-test against it) | Tightest possible gate | The CPU reference is exactly what the CUDA kernel calls (speed_internal_* is the host-side path); we'd be parity-testing against ourselves | The host-side eigendecomp is shared code; the smoke test still catches kernel drift without a redundant reference |
Use 1920x1080 fixture (matches python/test set_default_speed_chroma_edge_case) | Maximum coverage of operating resolution | ~50× slowdown for no extra fault detection | 640x360 already exercises 32 blocks (temporal) and 8 blocks (chroma) — enough for non-degenerate eigendecomp |
| Skip ADR-0956 and let drift surface via CHUG re-extracts | Zero new test surface | CHUG re-extracts take hours and only surface drift after the bad commit has shipped; smoke tests fire in seconds in --suite=fast | Defeats the point of the coverage push (ADR-0868 thesis: early detection beats late) |
Consequences¶
- Positive: closes the kernel coverage backlog (19 of 19 CUDA feature extractors gated); pins the last float-path numerical contract; smoke gates on the host-side eigendecomp paths catch the NaN/Inf failure modes the ADR-0567 chroma/temporal path is exposed to.
- Negative: +5 test binaries (~1100 LOC) to the build matrix; ~3 s added to
meson test --suite=gpuwall time on a CUDA-enabled runner. - Neutral / follow-ups: HIP and SYCL kernel coverage are tracked separately (ADR-0945 in flight; SYCL backlog open). The CUDA path is now feature-complete.
References¶
- ADR-0214 — cross-backend tolerance gate (
places=4) - ADR-0567 — speed_chroma/temporal CUDA real-impl + host eigendecomp
- ADR-0868 — round 1 (psnr + ciede)
- ADR-0886 — round 2 (adm/motion_v2/cambi/psnr_hvs/ssim)
- ADR-0947 — round 3 (float-path twins + ssimulacra2)
- ADR-0108 — six-deliverables rule
docs/research/gpu-backend-kernel-coverage-audit-2026-05-30.md— original audit + backlog- Source: req (CUDA kernel coverage round 4 — close the last 5 uncovered CUDA kernels)