ADR-0956: CUDA kernel parity coverage — round 4 (last 5 uncovered kernels)¶

Status: Accepted
Date: 2026-05-31
Deciders: lusoris, Claude
Tags: testing, cuda, parity, fork-local, gpu-coverage

Context¶

The 2026-05-30 GPU-backend kernel coverage audit (docs/research/gpu-backend-kernel-coverage-audit-2026-05-30.md, ADR-0868) catalogued 19 CUDA feature-extractor registrations under core/src/feature/cuda/. Round 1 (PR #351, ADR-0868) added 2 CUDA parity tests (psnr_cuda, ciede_cuda); round 2 (PR #374, ADR-0886) queued 5 more (adm_cuda, motion_v2_cuda, cambi_cuda, psnr_hvs_cuda, integer_ssim_cuda); round 3 (PR #442, ADR-0947) added 5 float-path twins (float_psnr_cuda, float_vif_cuda, float_ms_ssim_cuda, float_moment_cuda, ssimulacra2_cuda). Combined with the pre-existing master gates (test_cuda_motion3_parity.c, test_integer_vif_cpu_cuda_parity.c) that brings cumulative coverage to 14 of 19 kernels (~74 %).

This ADR closes the last 5 uncovered kernels:

Kernel	CPU twin	Gate type
`float_adm_cuda`	`float_adm`	CPU-vs-CUDA parity (ADR-0214)
`float_motion_cuda`	`float_motion`	CPU-vs-CUDA parity (ADR-0214)
`float_ssim_cuda`	`float_ssim`	CPU-vs-CUDA parity (ADR-0214)
`speed_chroma_cuda`	(none — CUDA-only)	finite-score smoke
`speed_temporal_cuda`	(none — CUDA-only)	finite-score smoke

The first three follow the round-3 parity-test template verbatim. The two speed_*_cuda extractors are CUDA-only — there is no CPU twin emitting the same Speed_{chroma,temporal}_feature_*_score family (the closest CPU surface, speed_qa, computes a different spatial+temporal scalar). For those, a CPU-vs-CUDA parity assertion is the wrong tool; instead the gate is a smoke test that runs the extractor end-to-end on a CUDA device and asserts finite, sane output. That catches NaN/Inf drift from kernel-grid changes or covariance-matrix degenerate cases — the failure modes the ADR-0567 chroma/temporal eigendecomp path is exposed to.

Decision¶

Add five fork-local tests under core/test/:

Test file	Kernel	Probed feature(s)
`test_cuda_float_adm_parity.c`	`float_adm_cuda`	`VMAF_feature_adm2_score` + four `adm_scale[0..3]_score`
`test_cuda_float_motion_parity.c`	`float_motion_cuda`	`VMAF_feature_motion_score`, `..._motion2_score` (motion3 is host-side; covered by `test_cuda_motion3_parity.c`)
`test_cuda_float_ssim_parity.c`	`float_ssim_cuda`	`float_ssim`
`test_cuda_speed_chroma_smoke.c`	`speed_chroma_cuda`	three `Speed_chroma_feature_speed_chroma_{u,v,uv}_score`
`test_cuda_speed_temporal_smoke.c`	`speed_temporal_cuda`	`Speed_temporal_feature_speed_temporal_score`

The three parity tests follow the round-3 test_cuda_float_psnr_parity.c template: 256x144 YUV420P 8-bpc synthetic fixture, 3 frames, score read at frame index 1, places=4 / 1e-4 tolerance (ADR-0214), [skip: no CUDA device] guard when vmaf_cuda_state_init fails.

The two smoke tests use a larger 640x360 fixture so the operating resolution after the ADR-0567 2^4 decimation (40x22 luma, 20x11 chroma) admits a non-singular covariance matrix; they assert finite scores and skip cleanly when no CUDA device is visible.

All five wire into core/test/meson.build behind the existing get_option('enable_cuda') guard, register with suite ['fast', 'gpu'], and link the same (pthread, cuda, math) dependency triple as test_cuda_motion3_parity.

Post-PR CUDA-extractor coverage rises from 14 of 19 (~74 %, after PRs #351 / #374 / #442 land) to 19 of 19 (100 %) once this PR merges. The kernel-coverage backlog is then closed.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
Add parity tests only (3 float twins) and defer the speed kernels again	Simpler review, fewer test files	The `speed_*_cuda` kernels are exactly the ones most likely to silently drift (host-side eigendecomp with degenerate-matrix paths); deferring again kicks the can a fourth round	Smoke gates close the gap without inventing CPU twins; the cost is two ~150 LOC files
Synthesise a CPU reference for `speed_*` (port the eigendecomp to a CPU-only TU and parity-test against it)	Tightest possible gate	The CPU reference is exactly what the CUDA kernel calls (`speed_internal_*` is the host-side path); we'd be parity-testing against ourselves	The host-side eigendecomp is shared code; the smoke test still catches kernel drift without a redundant reference
Use 1920x1080 fixture (matches python/test `set_default_speed_chroma_edge_case`)	Maximum coverage of operating resolution	~50× slowdown for no extra fault detection	640x360 already exercises 32 blocks (temporal) and 8 blocks (chroma) — enough for non-degenerate eigendecomp
Skip ADR-0956 and let drift surface via CHUG re-extracts	Zero new test surface	CHUG re-extracts take hours and only surface drift after the bad commit has shipped; smoke tests fire in seconds in `--suite=fast`	Defeats the point of the coverage push (ADR-0868 thesis: early detection beats late)

Consequences¶

Positive: closes the kernel coverage backlog (19 of 19 CUDA feature extractors gated); pins the last float-path numerical contract; smoke gates on the host-side eigendecomp paths catch the NaN/Inf failure modes the ADR-0567 chroma/temporal path is exposed to.
Negative: +5 test binaries (~1100 LOC) to the build matrix; ~3 s added to meson test --suite=gpu wall time on a CUDA-enabled runner.
Neutral / follow-ups: HIP and SYCL kernel coverage are tracked separately (ADR-0945 in flight; SYCL backlog open). The CUDA path is now feature-complete.

References¶

ADR-0214 — cross-backend tolerance gate (places=4)
ADR-0567 — speed_chroma/temporal CUDA real-impl + host eigendecomp
ADR-0868 — round 1 (psnr + ciede)
ADR-0886 — round 2 (adm/motion_v2/cambi/psnr_hvs/ssim)
ADR-0947 — round 3 (float-path twins + ssimulacra2)
ADR-0108 — six-deliverables rule
docs/research/gpu-backend-kernel-coverage-audit-2026-05-30.md — original audit + backlog
Source: req (CUDA kernel coverage round 4 — close the last 5 uncovered CUDA kernels)