ADR-0886: CUDA kernel parity test coverage, round 2 gap-fill
- Status: Accepted
- Date: 2026-05-30
- Deciders: Lusoris
- Tags: testing, cuda, gpu, parity, coverage
Context
ADR-0868 (PR #351) closed the largest single GPU-kernel parity gap by adding test_cuda_psnr_parity and test_cuda_ciede_parity, raising CUDA-backend assertion coverage from roughly 12 % (motion3 + VIF only) to about 25 % of the 18-extractor surface registered by the CUDA runtime.
Twelve CUDA extractors still ship without a CPU-vs-CUDA cross-backend gate. The highest-impact remaining gaps are the kernels that feed the shipped libvmaf-2.x.x default-model lineage:
- `adm_cuda`: load-bearing for every VMAF score (default model).
- `motion_v2_cuda`: used by the `vmaf_v0.6.1neg` lineage.
- `cambi_cuda`: used by `vmaf_v0.6.1neg` and `vmaf_4k_v0.6.1`.
- `psnr_hvs_cuda`: emitted as a side-channel in CHUG sidecars.
- `integer_ssim_cuda`: standard reference companion to VMAF.
Without a cross-backend gate, a regression in any of these kernels (e.g. a DWT-stage off-by-one in ADM, an FP-summation order change in CAMBI, a 5-tap Gaussian boundary drift in motion_v2) would silently bias every CHUG re-extract that runs on CUDA. Neither existing gate would catch it: the Netflix CPU golden gate never runs against the CUDA backend (per CLAUDE.md §8), and test_cuda_motion3_parity / test_integer_vif_cpu_cuda_parity exercise different kernels.
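To make the "FP-summation order change" failure mode concrete, here is a toy sketch in C. It is not CAMBI code; the helper names (`sum_forward`, `sum_reverse`, `summation_order_gap`) are hypothetical and the input values are chosen only to make the effect visible. It shows that adding the same single-precision values in a different order can change the result by far more than a 1e-4 tolerance.

```c
#include <stddef.h>

/* Sum in index order. The volatile accumulator forces the running total to
 * be rounded to float after every addition, as a plain float store would. */
static float sum_forward(const float *v, size_t n)
{
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc = acc + v[i];
    return acc;
}

/* Sum the same values in reverse index order. */
static float sum_reverse(const float *v, size_t n)
{
    volatile float acc = 0.0f;
    for (size_t i = n; i > 0; i--)
        acc = acc + v[i - 1];
    return acc;
}

/* Hypothetical demo: one large value followed by sixteen ones.
 * Forward order absorbs every 1.0f into 1e8f (float spacing there is 8),
 * reverse order first accumulates 16.0f and then adds it exactly. */
static float summation_order_gap(void)
{
    float v[17];
    v[0] = 1.0e8f;
    for (size_t i = 1; i < 17; i++)
        v[i] = 1.0f;
    float d = sum_forward(v, 17) - sum_reverse(v, 17);
    return d < 0.0f ? -d : d;
}
```

Real kernels sum values of much more similar magnitude, so the drift is far smaller than in this toy case, but the mechanism is the same one a per-kernel parity gate is meant to surface.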
Decision
Add five new parity tests under core/test/, all following the same "feed one or two synthetic YUV420P frames through both backends, assert places=4 (1e-4) tolerance per ADR-0214" scaffold established by the round-1 tests:
| Test | Kernel | Feature(s) gated |
|---|---|---|
| test_cuda_adm_parity | integer_adm_cuda.c | VMAF_integer_feature_adm2_score, VMAF_integer_feature_adm3_score |
| test_cuda_motion_v2_parity | integer_motion_v2_cuda.c | VMAF_integer_feature_motion_v2_sad_score, VMAF_integer_feature_motion2_v2_score |
| test_cuda_cambi_parity | integer_cambi_cuda.c | Cambi_feature_cambi_score |
| test_cuda_psnr_hvs_parity | integer_psnr_hvs_cuda.c | psnr_hvs |
| test_cuda_ssim_parity | ssim_cuda.c | ssim |
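As an illustration of the "synthetic YUV420P frame" input these tests rely on, the following is a minimal sketch of a deterministic 8-bit frame generator. The `SynthFrame` type, the function names and the diagonal-ramp pattern are all hypothetical; they are not libvmaf types and the real fixtures may use a different pattern.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one 8-bit YUV420P frame (not a libvmaf type). */
typedef struct {
    unsigned w, h;
    uint8_t *y, *u, *v;
} SynthFrame;

/* Total bytes for an 8-bit YUV420P frame: a full-size luma plane plus two
 * half-width, half-height chroma planes (rounded up for odd sizes). */
static size_t synth_frame_bytes(unsigned w, unsigned h)
{
    const size_t cw = (w + 1u) / 2u, ch = (h + 1u) / 2u;
    return (size_t)w * h + 2u * cw * ch;
}

/* Deterministic luma pattern: a diagonal ramp shifted by `seed`. */
static uint8_t synth_luma(unsigned x, unsigned y, unsigned seed)
{
    return (uint8_t)((x + 2u * y + seed) & 0xFFu);
}

/* Allocate and fill a frame; returns 0 on success, -1 if allocation fails. */
static int synth_frame_init(SynthFrame *f, unsigned w, unsigned h, unsigned seed)
{
    const size_t cw = (w + 1u) / 2u, ch = (h + 1u) / 2u;
    f->w = w;
    f->h = h;
    f->y = malloc((size_t)w * h);
    f->u = malloc(cw * ch);
    f->v = malloc(cw * ch);
    if (!f->y || !f->u || !f->v) {
        free(f->y); free(f->u); free(f->v);
        f->y = f->u = f->v = NULL;
        return -1;
    }
    for (unsigned j = 0; j < h; j++)
        for (unsigned i = 0; i < w; i++)
            f->y[(size_t)j * w + i] = synth_luma(i, j, seed);
    memset(f->u, 128, cw * ch); /* neutral chroma */
    memset(f->v, 128, cw * ch);
    return 0;
}

/* Release all three planes. */
static void synth_frame_free(SynthFrame *f)
{
    free(f->y); free(f->u); free(f->v);
    f->y = f->u = f->v = NULL;
}
```

Using a different `seed` for the reference and distorted frames gives the two backends a non-trivial but fully reproducible input pair.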
All five share the round-1 skip convention: when vmaf_cuda_state_init() fails (no CUDA driver, no visible device) the test emits [skip: no CUDA device] to stderr and passes — so they remain safe on CPU-only CI runners and Apple-Silicon dev hosts. Suite membership is ['fast', 'gpu'], matching the round-1 tests.
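The skip convention can be sketched as follows. This is an assumption-laden outline, not the actual test source: `init_fn` is a hypothetical stand-in for the backend initialisation step (the real `vmaf_cuda_state_init()` takes arguments and is not called here), and `run_parity_test` and the stub functions are illustrative names.

```c
#include <stdio.h>

/* Hypothetical stand-in for backend initialisation: returns 0 on success,
 * non-zero when no CUDA driver or visible device is available. */
typedef int (*init_fn)(void);

/* Hypothetical test body: returns 0 on pass, non-zero on failure. */
typedef int (*parity_body_fn)(void);

/* Runs the parity body only when the backend initialises. A missing device
 * is reported on stderr and counted as a pass, so CPU-only runners stay green. */
static int run_parity_test(init_fn init, parity_body_fn body)
{
    if (init() != 0) {
        fprintf(stderr, "[skip: no CUDA device]\n");
        return 0;
    }
    return body();
}

/* Stubs used only to exercise both paths of the sketch. */
static int init_no_device(void) { return -1; }
static int init_ok(void)        { return 0; }
static int body_pass(void)      { return 0; }
static int body_fail(void)      { return 1; }
```

With these stubs, `run_parity_test(init_no_device, body_fail)` returns 0 because the body never runs, while `run_parity_test(init_ok, body_fail)` propagates the failure.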
Tolerance follows ADR-0214: places=4 (1e-4) for unfiltered reductions. The Gaussian-filtered SSIM kernel could in principle warrant the looser places=3 used for VIF, but the fixture geometry (256x144) keeps the accumulated 11x11-window error well inside the 1e-4 envelope — verified by inspection of the existing round-1 fixture results. If the SSIM gate proves flaky in CI, the per-test PARITY_TOL macro is the single point of relaxation.
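A minimal sketch of how a per-test `PARITY_TOL` macro could gate the comparison, assuming the tolerance is applied as an absolute difference per frame score. Only the macro name comes from this ADR; the function `first_parity_mismatch` and its signature are hypothetical.

```c
#include <stddef.h>

/* Single point of relaxation: override with -DPARITY_TOL=1e-3 if needed. */
#ifndef PARITY_TOL
#define PARITY_TOL 1e-4 /* places=4 */
#endif

/* Returns the index of the first score where CPU and CUDA differ by more
 * than `tol`, or -1 when all n scores agree. A NaN on either side fails the
 * `<=` test and is therefore reported as a mismatch. */
static long first_parity_mismatch(const double *cpu, const double *cuda,
                                  size_t n, double tol)
{
    for (size_t i = 0; i < n; i++) {
        double d = cpu[i] - cuda[i];
        if (d < 0.0)
            d = -d;
        if (!(d <= tol))
            return (long)i;
    }
    return -1;
}
```

Returning the index rather than a boolean keeps the failure report in the "kernel X disagreed by Y at index Z" form, which is easier to bisect than an aggregate score drift.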
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Add all 12 remaining CUDA-kernel parity tests in one PR | Largest single coverage delta | ~2,500 LOC, mixes load-bearing kernels with low-impact ones (speed_chroma, ssimulacra2); larger review burden; longer container-rebuild cycle | Five high-impact kernels gives roughly an 80/20 coverage delta with a reviewable PR size |
| Skip the model-lineage kernels and add speed_chroma_cuda + ssimulacra2_cuda tests instead | Closes the lowest-coverage extractors | Those two are not used by any shipped model; their drift would not bias CHUG re-extracts | Risk-prioritised: cover model-feeding kernels first |
| Parameterise a single test with a list of (cpu, cuda) extractor pairs | Single file, less boilerplate | Feature-output naming is non-uniform (Cambi_feature_cambi_score vs. psnr_hvs vs. ssim vs. multi-feature ADM), and the vmaf_use_feature options dict differs per kernel; the abstraction would carry more conditional plumbing than the per-test scaffold | Round-1 picked per-test files for the same readability reason; round-2 stays consistent |
| Defer to the validate-scores skill / cross-backend-diff CI gate | Already exists end-to-end | That gate runs full Netflix benchmark fixtures (slow), not synthetic frames; failure mode is "VMAF score drifted" not "kernel X disagreed by Y at index Z", which is much harder to bisect | Synthetic per-kernel gate localises regressions to a single file |
Consequences
- Positive: CUDA-backend assertion coverage rises to roughly 53 % of the registered extractor surface (10 of 18 kernels now under cross-backend gate). Five model-feeding kernels are now regression-guarded. The pattern remains identical to round-1, so round-3 (the remaining seven extractors: `float_*_cuda`, `speed_*_cuda`, `ssimulacra2_cuda`, `float_moment_cuda`) is mechanical.
- Negative: Adds five test binaries to the build matrix; each adds about 5 s wall-time when CUDA is present. CPU-only CI is unaffected (skip path returns immediately after `vmaf_cuda_state_init()`).
- Neutral / follow-ups: A round-3 PR will close the remaining seven CUDA extractors. The HIP and SYCL backends also have residual gaps past round-1 (ADR-0868); equivalent round-2 PRs against those backends are scheduled.
References
- ADR-0868: round-1 GPU-backend kernel coverage gap-fill (`0868-gpu-backend-kernel-coverage.md`)
- ADR-0214: cross-backend numerical tolerance gate
- ADR-0541: VIF CPU-vs-CUDA parity test (round-0 prior art)
- Round-1 PR: PR #351
- Source: user direction "extend CUDA kernel test coverage beyond PR #351 (13 more CUDA kernels need parity tests)"; round-2 picks the five highest-impact kernels (ADM, motion_v2, CAMBI, PSNR-HVS, SSIM)