ADR-0886: CUDA kernel parity test coverage, round 2 gap-fill

Status: Accepted
Date: 2026-05-30
Deciders: Lusoris
Tags: testing, cuda, gpu, parity, coverage

Context

ADR-0868 (PR #351) closed the largest single GPU-kernel parity gap by adding test_cuda_psnr_parity and test_cuda_ciede_parity, raising CUDA-backend assertion coverage from roughly 12 % (motion3 + VIF only) to about 25 % of the 18-extractor surface registered by the CUDA runtime.

Twelve CUDA extractors still ship without a CPU-vs-CUDA cross-backend gate. The highest-impact remaining gaps are the kernels that feed the shipped libvmaf-2.x.x default-model lineage:

adm_cuda — load-bearing for every VMAF score (default model).
motion_v2_cuda — used by the vmaf_v0.6.1neg lineage.
cambi_cuda — used by vmaf_v0.6.1neg and vmaf_4k_v0.6.1.
psnr_hvs_cuda — emitted as a side-channel in CHUG sidecars.
integer_ssim_cuda — standard reference companion to VMAF.

Without a cross-backend gate, a regression in any of these kernels (e.g. a DWT-stage off-by-one in ADM, an FP-summation order change in CAMBI, a 5-tap Gaussian boundary drift in motion_v2) would silently bias every CHUG re-extract that runs on CUDA. It would also go uncaught: the existing Netflix CPU golden gate never runs against the CUDA backend (per CLAUDE.md §8), and test_cuda_motion3_parity / test_integer_vif_cpu_cuda_parity exercise different kernels.

Decision

Add five new parity tests under core/test/. Each follows the scaffold established by the round-1 tests: feed one or two synthetic YUV420P frames through both backends, then assert agreement to places=4 (1e-4) per ADR-0214. The five tests are:

| Test | Kernel | Feature(s) gated |
| --- | --- | --- |
| `test_cuda_adm_parity` | `integer_adm_cuda.c` | `VMAF_integer_feature_adm2_score`, `VMAF_integer_feature_adm3_score` |
| `test_cuda_motion_v2_parity` | `integer_motion_v2_cuda.c` | `VMAF_integer_feature_motion_v2_sad_score`, `VMAF_integer_feature_motion2_v2_score` |
| `test_cuda_cambi_parity` | `integer_cambi_cuda.c` | `Cambi_feature_cambi_score` |
| `test_cuda_psnr_hvs_parity` | `integer_psnr_hvs_cuda.c` | `psnr_hvs` |
| `test_cuda_ssim_parity` | `ssim_cuda.c` | `ssim` |
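The first step of the scaffold is building the synthetic input. The ADR does not describe the actual fixture pattern, so the following is only a minimal sketch of how a deterministic 8-bit YUV420P frame could be generated. `SynthFrame`, `synth_frame_init`, `synth_frame_free` and the fill formula are hypothetical names and choices made for illustration, not taken from the test sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixture type: one 8-bit YUV420P frame, tightly packed planes. */
typedef struct {
    unsigned w, h;      /* luma size; chroma planes are (w/2) x (h/2) */
    uint8_t *plane[3];  /* Y, U, V */
} SynthFrame;

/* Release all planes; safe to call on a partially initialised frame. */
static void synth_frame_free(SynthFrame *f)
{
    for (int p = 0; p < 3; p++) {
        free(f->plane[p]);
        f->plane[p] = NULL;
    }
}

/* Allocate the three planes and fill them with a deterministic pattern.
 * `seed` shifts the pattern so a reference/distorted pair differs.
 * Returns 0 on success, -1 on allocation failure. */
static int synth_frame_init(SynthFrame *f, unsigned w, unsigned h, unsigned seed)
{
    f->w = w;
    f->h = h;
    f->plane[0] = f->plane[1] = f->plane[2] = NULL;

    for (unsigned p = 0; p < 3; p++) {
        unsigned pw = p ? w / 2 : w;   /* 4:2:0 halves both chroma dimensions */
        unsigned ph = p ? h / 2 : h;

        f->plane[p] = malloc((size_t)pw * ph);
        if (!f->plane[p]) {
            synth_frame_free(f);
            return -1;
        }
        for (unsigned y = 0; y < ph; y++)
            for (unsigned x = 0; x < pw; x++)
                f->plane[p][(size_t)y * pw + x] =
                    (uint8_t)((x * 3u + y * 5u + seed * 7u + p * 11u) & 0xFFu);
    }
    return 0;
}
```

A fixed formula rather than a random generator keeps the input identical on every run and every host, so any CPU-vs-CUDA difference comes from the kernels and not from the fixture.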

All five share the round-1 skip convention: when vmaf_cuda_state_init() fails (no CUDA driver, no visible device), the test emits [skip: no CUDA device] to stderr and passes, so the tests remain safe on CPU-only CI runners and Apple-Silicon dev hosts. Suite membership is ['fast', 'gpu'], matching the round-1 tests.
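The skip convention can be sketched as a small wrapper. This is an illustration under stated assumptions, not the code in the tests: the backend probe is modelled as a plain function pointer because the real signature of vmaf_cuda_state_init() is not reproduced here, and `run_parity_or_skip` plus the four stub functions are hypothetical names.

```c
#include <stdio.h>

/* Hypothetical stand-in for the backend probe (the role played by
 * vmaf_cuda_state_init() in the real tests): returns 0 when usable. */
typedef int (*backend_init_fn)(void);

/* Hypothetical parity body: returns 0 on pass, non-zero on mismatch. */
typedef int (*parity_body_fn)(void);

/* Returns the test's exit code: 0 when skipped or passed, 1 on failure. */
static int run_parity_or_skip(backend_init_fn init, parity_body_fn body)
{
    if (init() != 0) {
        /* No driver or no visible device: report the skip and pass,
         * so CPU-only runners stay green. */
        fprintf(stderr, "[skip: no CUDA device]\n");
        return 0;
    }
    return body() == 0 ? 0 : 1;
}

/* Stubs used only to exercise the wrapper in this sketch. */
static int init_unavailable(void) { return -1; }
static int init_ok(void)          { return 0; }
static int body_pass(void)        { return 0; }
static int body_fail(void)        { return 1; }
```

With these stubs, `run_parity_or_skip(init_unavailable, body_fail)` returns 0 without ever running the body, which is the property that keeps the skip path cheap on hosts without CUDA.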

Tolerance follows ADR-0214: places=4 (1e-4) for unfiltered reductions. The Gaussian-filtered SSIM kernel could in principle warrant the looser places=3 used for VIF, but the fixture geometry (256x144) keeps the accumulated 11x11-window error well inside the 1e-4 envelope. This was verified by inspecting the existing round-1 fixture results. If the SSIM gate proves flaky in CI, the per-test PARITY_TOL macro is the single point of relaxation.
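A minimal sketch of what a per-test tolerance check might look like follows. It assumes that PARITY_TOL is an absolute bound on the per-frame score difference and that a difference exactly equal to the bound passes; neither detail is stated in this ADR. The `#ifndef` guard and the `first_parity_mismatch` helper are hypothetical.

```c
#include <stddef.h>

/* Assumed default: places=4 read as an absolute 1e-4 bound.
 * The guard lets a single test relax it without touching the others. */
#ifndef PARITY_TOL
#define PARITY_TOL 1e-4
#endif

/* Compare per-frame scores from the two backends.
 * Returns -1 if every pair agrees within PARITY_TOL, otherwise the
 * index of the first disagreeing frame. */
static long first_parity_mismatch(const double *cpu, const double *cuda,
                                  size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double d = cpu[i] - cuda[i];
        if (d < 0.0)
            d = -d;
        /* Written as !(d <= tol) so a NaN score also counts as a mismatch. */
        if (!(d <= PARITY_TOL))
            return (long)i;
    }
    return -1;
}
```

Returning the first failing index rather than a bare pass/fail gives a failure message that names the frame where the backends diverged, which helps when bisecting a kernel regression.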

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Add all 12 remaining CUDA-kernel parity tests in one PR | Largest single coverage delta | ~2,500 LOC, mixes load-bearing kernels with low-impact ones (`speed_chroma`, `ssimulacra2`); larger review burden; longer container-rebuild cycle | Five high-impact kernels gives roughly an 80/20 coverage delta with a reviewable PR size |
| Skip the model-lineage kernels and add `speed_chroma_cuda` + `ssimulacra2_cuda` tests instead | Closes the lowest-coverage extractors | Those two are not used by any shipped model; their drift would not bias CHUG re-extracts | Risk-prioritised: cover model-feeding kernels first |
| Parameterise a single test with a list of `(cpu, cuda)` extractor pairs | Single file, less boilerplate | Feature-output naming is non-uniform (`Cambi_feature_cambi_score` vs. `psnr_hvs` vs. `ssim` vs. multi-feature ADM), and the `vmaf_use_feature` options dict differs per kernel; the abstraction would carry more conditional plumbing than the per-test scaffold | Round-1 picked per-test files for the same readability reason; round-2 stays consistent |
| Defer to the `validate-scores` skill / cross-backend-diff CI gate | Already exists end-to-end | That gate runs full Netflix benchmark fixtures (slow), not synthetic frames; failure mode is "VMAF score drifted" not "kernel X disagreed by Y at index Z", which is much harder to bisect | Synthetic per-kernel gate localises regressions to a single file |

Consequences

Positive: CUDA-backend assertion coverage rises to roughly 53 % of the registered extractor surface (10 of 18 kernels now under cross-backend gate). Five model-feeding kernels are now regression-guarded. The pattern remains identical to round-1, so round-3 (the remaining seven extractors: float_*_cuda, speed_*_cuda, ssimulacra2_cuda, float_moment_cuda) is mechanical.
Negative: Adds five test binaries to the build matrix; each adds about 5 s wall-time when CUDA is present. CPU-only CI is unaffected (skip path returns immediately after vmaf_cuda_state_init()).
Neutral / follow-ups: A round-3 PR will close the remaining seven CUDA extractors. The HIP and SYCL backends also have residual gaps past round-1 (ADR-0868); equivalent round-2 PRs against those backends are scheduled.

References

ADR-0868 — round-1 GPU-backend kernel coverage gap-fill (0868-gpu-backend-kernel-coverage.md)
ADR-0214 — cross-backend numerical tolerance gate
ADR-0541 — VIF CPU-vs-CUDA parity test (round-0 prior art)
Round-1 PR — PR #351
Source: user direction "extend CUDA kernel test coverage beyond PR #351 (13 more CUDA kernels need parity tests)" — round-2 picks five highest-impact kernels (ADM, motion_v2, CAMBI, PSNR-HVS, SSIM)