ADR-0868: GPU backend kernel parity-test coverage gap-fill

Status: Accepted
Date: 2026-05-30
Deciders: lusoris
Tags: tests, cuda, hip, sycl, metal, coverage

Context

A coverage audit of core/test/ against the registered GPU feature extractors revealed broad gaps. CUDA shipped 14 extractors, but only motion_cuda and vif_cuda had cross-backend parity tests; the rest (psnr_cuda, ciede_cuda, cambi_cuda, adm_cuda, float_psnr_cuda, float_vif_cuda, float_adm_cuda, float_motion_cuda, psnr_hvs_cuda, integer_ssim_cuda, float_ssim_cuda, float_ms_ssim_cuda, float_moment_cuda, speed_chroma_cuda, speed_temporal_cuda) were unverified at CI time. HIP shipped 18 extractors, with only motion_hip and adm_hip parity-gated. SYCL shipped 17 extractors, with only motion_sycl and cambi_sycl covered. Metal shipped 8 extractors, with only the runtime smoke test (motion_v2_metal) asserting registration.

A regression in any of the uncovered kernels' reductions, separable filters, or per-plane accumulators would have escaped CI and surfaced only at downstream model-prediction time, polluting CHUG re-extracts and vmaf-tune feature exports.

Decision

We add six cross-backend parity tests plus one Metal registration audit, picked to:

- spread coverage uniformly across CUDA / HIP / SYCL / Metal,
- target the two highest-leverage kernels per backend: PSNR for the reduction path on every compute backend, paired with CIEDE2000 for the colour-conversion path on CUDA and with VIF for the separable-filter path on HIP and SYCL,
- mirror the established test_*_motion3_parity.c scaffold so the review surface is uniform.

The new tests are:

| Test | Backend | Kernel | Tolerance |
| --- | --- | --- | --- |
| `test_cuda_psnr_parity` | CUDA | `psnr_cuda` (integer_psnr_cuda.c + psnr_score.cu) | 1e-4 (places=4) |
| `test_cuda_ciede_parity` | CUDA | `ciede_cuda` (integer_ciede_cuda.c + ciede_score.cu) | 1e-4 (places=4) |
| `test_hip_psnr_parity` | HIP | `psnr_hip` (integer_psnr_hip.c + psnr_score.hip) | 1e-4 (places=4) |
| `test_hip_vif_parity` | HIP | `vif_hip` (integer_vif_hip.c + vif_statistics.hip) | 1e-3 (places=3) |
| `test_sycl_psnr_parity` | SYCL | `psnr_sycl` (integer_psnr_sycl.cpp) | 1e-4 (places=4) |
| `test_sycl_vif_parity` | SYCL | `vif_sycl` (integer_vif_sycl.cpp) | 1e-3 (places=3) |
| `test_metal_kernel_registration` | Metal | 8 extractor registrations + TEMPORAL flag audit | n/a |

Tolerances follow the ADR-0214 cross-backend gate: places=4 (1e-4) for unfiltered reductions, places=3 (1e-3) for filtered features (VIF) where the separable Gaussian's accumulator order changes the last few bits. Each parity test skips cleanly with a [skip: no <backend> device] notice when the runtime is unavailable, matching the existing test_*_motion3_parity.c skip pattern.
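As an illustration of the gate described above, a minimal C sketch of the tolerance check and the skip notice might look like the following. The helper names (`places_to_tolerance`, `scores_within_places`, `skip_if_no_device`) are hypothetical and are not taken from the existing scaffold; the mapping from places to an absolute tolerance of 10^-places is assumed from the 1e-4 / 1e-3 figures quoted here.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed mapping from a "places" budget to an absolute tolerance:
 * places=4 -> 1e-4, places=3 -> 1e-3. */
static double places_to_tolerance(int places)
{
    assert(places >= 0);
    double tol = 1.0;
    for (int i = 0; i < places; i++)
        tol /= 10.0;
    return tol;
}

/* Returns 1 when the GPU score lies within the places budget of the CPU
 * reference, 0 otherwise. A NaN on either side fails the comparison. */
static int scores_within_places(double cpu_score, double gpu_score, int places)
{
    double diff = cpu_score - gpu_score;
    if (diff < 0.0)
        diff = -diff;
    return diff <= places_to_tolerance(places);
}

/* Hypothetical skip helper: prints the notice and returns 1 when the backend
 * has no usable device, so the caller can exit with success. How device
 * presence is probed is backend-specific and left to the caller. */
static int skip_if_no_device(const char *backend, int device_present)
{
    if (device_present)
        return 0;
    printf("[skip: no %s device]\n", backend);
    return 1;
}
```

A parity test built on this sketch would call `skip_if_no_device` first and return success when it reports a skip, then compare each CPU/GPU score pair with `places=4` for PSNR and CIEDE2000 or `places=3` for VIF.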

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Full coverage in a single PR (~30 tests across the 45+ extractor gap) | Closes the audit in one shot. | 30+ test files in one PR is past the 200–800-LOC sweet spot called out in the PR-hygiene rule; reviewer cost balloons. | Land highest-leverage 6+1 here, leave a follow-up backlog item for the remaining ~38. |
| Synthetic fixtures only (current approach) | Deterministic, no file I/O, runs in <1s per backend. | Doesn't exercise the realistic-content code paths that natural video stresses. | Accepted: same trade-off the existing `test_*_motion3_parity.c` makes. The CHUG / netflix-benchmark sweep covers natural content end-to-end. |
| Bit-exactness (1e-9) tolerance per ADR-0138/0139 | Strongest gate. | GPUs are NOT bit-exact vs CPU per the user-memory rule `feedback_golden_gate_cpu_only.md`; would force false-positive failures. | Use the documented near-exact 1e-4 / 1e-3 places-budget instead. |
| One big merged parity test with subtests per kernel | Fewer executables. | Couples backend availability: a failed CUDA driver init would mask a SYCL regression. Existing tests are one-extractor-per-binary; keep the pattern. | Stay one-binary-per-kernel. |
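To make the synthetic-fixture option concrete, the sketch below fills an 8-bit plane with a deterministic pattern and performs no file I/O. The function name, the pattern, and the constants are illustrative assumptions; this is not the fixture used by `test_*_motion3_parity.c`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixture: fill an 8-bit plane with a pattern that depends only
 * on the pixel position and a seed, so every run and every backend sees the
 * same input. The multipliers 7 and 13 are arbitrary illustrative choices. */
static void fill_synthetic_plane(uint8_t *plane, size_t width, size_t height,
                                 size_t stride, uint32_t seed)
{
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            uint32_t v = (uint32_t)x * 7u + (uint32_t)y * 13u + seed;
            plane[y * stride + x] = (uint8_t)(v & 0xFFu); /* wrap to 8 bits */
        }
    }
}
```

Filling the reference and distorted planes with different seeds would give the CPU and GPU paths identical, reproducible inputs that differ from each other.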

Consequences

- Positive: 6 GPU kernels gain a cross-backend gate that fires on every PR. Future regressions in the reduction / filter accumulator on any of CUDA / HIP / SYCL are caught at CI time rather than at CHUG re-extract time.
- Positive: Metal's 8-extractor registration is now fully audited (was 2/8 spot-checked); a future T8-1d refactor that drops a .mm translation unit will surface immediately.
- Negative: 7 new test binaries inflate the GPU CI lane by roughly 7 × (driver-init time + 1 frame). On a 4090 / Arc / MI300 lane this adds well under 5 s total.
- Follow-up: ~38 GPU extractors remain uncovered (full list in docs/research/gpu-backend-kernel-coverage-audit-2026-05-30.md). Track via a .workingdir2/BACKLOG.md row tagged gpu-coverage-tier-2. Not gating on this PR.
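The CI-cost estimate in the Negative item implies a per-binary budget. Writing $t_{\text{init}}$ for driver-init time and $t_{\text{frame}}$ for the one-frame run (symbols introduced here for illustration), and taking the 5 s figure as an upper bound:

$$7\,(t_{\text{init}} + t_{\text{frame}}) < 5\ \text{s} \quad\Longrightarrow\quad t_{\text{init}} + t_{\text{frame}} < \tfrac{5}{7}\ \text{s} \approx 0.71\ \text{s}$$

so each new binary would need to finish in roughly 0.7 s on average for the estimate to hold.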

References

- ADR-0214: cross-backend tolerance budget (places=4 unfiltered, places=3 filtered).
- ADR-0361: Metal backend rollout (T8-1c/d).
- ADR-0420: Metal kernel-template runtime.
- ADR-0421: Metal extractor MSL shaders.
- ADR-0108: six fork-local deliverables rule.
- feedback_golden_gate_cpu_only (user memory): GPU paths are not bit-exact with CPU; near-exact tolerance is the correct gate.
- Source: req. Request to push test coverage on GPU backend kernels covering CUDA + HIP + SYCL + Metal feature extractors, with the avoid-list (PRs #289, #290, #293, #294, #308, #315) and the instruction to wire into core/test/meson.build under the appropriate suite tags.