ADR-0947: CUDA kernel parity coverage, round 3 (float-path twins + ssimulacra2)

Status: Accepted
Date: 2026-05-31
Deciders: lusoris, Claude
Tags: testing, cuda, parity, fork-local, gpu-coverage

Context

The 2026-05-30 GPU-backend kernel coverage audit (docs/research/gpu-backend-kernel-coverage-audit-2026-05-30.md, ADR-0868) catalogued 18 CUDA feature extractors under core/src/feature/cuda/ and found only 1 had a CPU-vs-CUDA parity test on origin/master (test_cuda_motion3_parity.c, ADR-0214 gate).

Round 1 (PR #351, ADR-0868) added 2 CUDA parity tests (psnr_cuda, ciede_cuda). Round 2 (PR #374, ADR-0886) is in flight and queues 5 more (adm_cuda, motion_v2_cuda, cambi_cuda, psnr_hvs_cuda, integer_ssim_cuda).

That leaves the float-path twins and the new ssimulacra2_cuda kernel without a cross-backend assertion gate. The float path is what the vmaf_float_v0.6.1 lineage and every research-time --feature float_* invocation exercise. Without a parity test, a SIMD or kernel rewrite on either backend could silently drift the float-path scores away from the CPU reference, and the drift would only surface weeks later via a CHUG re-extract diff.

Nine kernels remain on the ADR-0886 backlog after rounds 1+2: float_adm_cuda, float_motion_cuda, float_psnr_cuda, float_vif_cuda, float_ms_ssim_cuda, float_moment_cuda, speed_chroma_cuda, speed_temporal_cuda, ssimulacra2_cuda. The speed_* twins use a 25×25 eigendecomp on host that splits the numerical-equivalence story (see ADR-0567); they are deferred to a separate ADR with their own tolerance budget.

Decision

Add 5 fork-local parity tests under core/test/, one per kernel, following the test_cuda_motion3_parity.c template (ADR-0214 places=4 / 1e-4 tolerance, skip-on-no-device guard):

| Test | Kernel | CPU twin | Emitted feature(s) probed |
| --- | --- | --- | --- |
| `test_cuda_float_psnr_parity` | `float_psnr_cuda` | `float_psnr` | `float_psnr` (luma) |
| `test_cuda_float_vif_parity` | `float_vif_cuda` | `float_vif` | `VMAF_feature_vif_scale[0..3]_score` |
| `test_cuda_float_ms_ssim_parity` | `float_ms_ssim_cuda` | `float_ms_ssim` | `float_ms_ssim` |
| `test_cuda_float_moment_parity` | `float_moment_cuda` | `float_moment` | `float_moment_{ref,dis}{1st,2nd}` |
| `test_cuda_ssimulacra2_parity` | `ssimulacra2_cuda` | `ssimulacra2` | `ssimulacra2` |
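
The tolerance and skip behaviour shared by these tests can be sketched roughly as follows. This is a minimal illustration, not the contents of test_cuda_motion3_parity.c: the names `PARITY_TOL`, `parity_result`, `scores_match` and `check_parity` are hypothetical, the device probe is reduced to a boolean flag, and the use of exit status 77 for a skipped test is an assumption about how the skip-on-no-device guard reports itself.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed tolerance: the ADR-0214 gate is quoted as places=4 / 1e-4. */
#define PARITY_TOL 1e-4

/* Hypothetical result codes; 77 is assumed to be the skip status. */
enum parity_result { PARITY_PASS = 0, PARITY_FAIL = 1, PARITY_SKIP = 77 };

/* True when the absolute CPU-vs-CUDA difference is within tol.
 * A NaN on either side makes the comparison false, so it counts as a mismatch. */
static bool scores_match(double cpu, double cuda, double tol)
{
    return fabs(cpu - cuda) <= tol;
}

/* Compare n per-frame scores collected from the CPU twin and the CUDA kernel.
 * have_device stands in for a real CUDA device probe. */
static enum parity_result check_parity(bool have_device, const double *cpu,
                                       const double *cuda, size_t n)
{
    if (!have_device)
        return PARITY_SKIP; /* no GPU on this runner: skip rather than fail */
    for (size_t i = 0; i < n; i++) {
        if (!scores_match(cpu[i], cuda[i], PARITY_TOL))
            return PARITY_FAIL;
    }
    return PARITY_PASS;
}
```

Meson reports a test that exits with status 77 as skipped, so returning the result of `check_parity` from `main` would keep runners without a GPU green without hiding real mismatches on runners that have one.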

Each test wires into core/test/meson.build behind the existing get_option('enable_cuda') guard, registers with suite ['fast', 'gpu'], and links the same (pthread, cuda, math) dependency triple as test_cuda_motion3_parity.

CUDA-extractor assertion coverage currently stands at ~17 % (3 of 18 on origin/master: the existing motion3 test plus the round-1 psnr and ciede parity gates). It rises to ~44 % (8 of 18) once rounds 1+2 have both landed, and to ~72 % (13 of 18) once all three round PRs merge.
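
The percentages above follow directly from the extractor counts. As a quick check, a small helper (hypothetical name `coverage_pct`, rounding to the nearest whole percent) reproduces them:

```c
/* Share of extractors with a parity test, rounded to the nearest integer percent.
 * Hypothetical helper for checking the figures quoted in this ADR. */
static int coverage_pct(int covered, int total)
{
    return (covered * 100 + total / 2) / total;
}

/* With the 18 extractors catalogued by the audit:
 *   coverage_pct(3, 18)  == 17
 *   coverage_pct(8, 18)  == 44
 *   coverage_pct(13, 18) == 72
 */
```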

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Pick `float_*` only (4 kernels) | Smaller PR; one coherent theme | Leaves `ssimulacra2_cuda` (the most newly-landed kernel, highest drift risk) without a gate | `ssimulacra2_cuda` has no CPU-twin parity test and was added recently; covering it now while the design memory is fresh is cheaper than later |
| Pick all 9 remaining kernels (incl. `speed_*` + `float_motion_cuda`) | Closes the backlog in one PR | `speed_*` involves a host-side eigendecomp with its own tolerance budget (ADR-0567 deferred); `float_motion_cuda` overlaps PR #374's `motion_v2_cuda` blend formula | Right-size to ADR-0108's 200–800 LOC bundle; defer `speed_*` to ADR-N+1 |
| Add per-feature option matrices (chroma on/off, ms_ssim's `enable_lcs/db/clip_db`) | Tighter coverage of option combinations | Triples test surface for no observed bug; option matrix has no audit hit | Stick to default options; matrix work is a separate ADR if drift surfaces |
| Defer until rounds 1+2 merge | Avoids meson.build conflicts | Loses parallelism on the kernel-coverage push; round 3 picks distinct kernels so the per-file conflict surface is `core/test/meson.build` only | Conflicts on a single file are trivially resolvable; the parallel work outweighs the conflict cost |

Consequences

Positive: closes the largest remaining gap in CUDA kernel cross-backend coverage; pins the float-path numerical contract that every vmaf_float_* model depends on; the new ssimulacra2_cuda gate prevents the kind of silent drift that the round-1 PSNR/CIEDE gates caught during 2026-05 audits.
Negative: adds 5 test binaries (~1000 LOC) to the build matrix and ~5 s to meson test --suite=gpu wall time on a CUDA-enabled runner.
Neutral / follow-ups: ADR-N+1 will cover speed_chroma_cuda / speed_temporal_cuda with their own tolerance budget once the ADR-0567 chroma-eigen path is reviewed. float_motion_cuda is deferred to the same follow-up (overlaps PR #374's motion_v2_cuda blend coverage).

References

ADR-0214 — cross-backend tolerance gate (places=4)
ADR-0868 — round 1 (psnr + ciede)
ADR-0886 — round 2 (adm/motion_v2/cambi/psnr_hvs/ssim)
ADR-0108 — six-deliverables rule
ADR-0567 — speed_chroma host-side eigendecomp
docs/research/gpu-backend-kernel-coverage-audit-2026-05-30.md — round 1 audit + backlog
Source: req (CUDA kernel coverage round 3 — extend beyond PRs #351 + #374)