ADR-0965: CUDA SpEED TU repair - align with current CudaFunctions table (closes T-CUDA-SPEED-TU-REPAIR-2026-05-31)
- Status: Accepted
- Date: 2026-05-31
- Deciders: lusoris, Claude (CUDA TU repair)
- Tags: cuda, feature-extractor, cross-backend-parity, speed, fork-local
Context
ADR-0964 (PR #473) wired the HIP and SYCL SpEED twins into meson and the feature-extractor registry, but explicitly deferred the CUDA wiring because the CUDA TUs at `core/src/feature/cuda/speed_chroma_cuda.c` (868 LOC) and `core/src/feature/cuda/speed_temporal_cuda.c` (670 LOC) contained three classes of latent bugs that surfaced the moment the TUs were compiled:
- `CHECK_CUDA(cu_f, CALL)`: undefined macro. Every working CUDA TU in the fork uses `CHECK_CUDA_GOTO(cu_f, CALL, label)` (when cleanup is pending) or `CHECK_CUDA_RETURN(cu_f, CALL)` (when no cleanup is needed). The bare `CHECK_CUDA` form was never defined in this tree; it existed in an earlier iteration of `cuda_helper.cuh` that was removed when Netflix#1420 was addressed (abort-on-error replaced by errno-return).
- `cuMemAllocHost`: not a `CudaFunctions` member. The pinned-host allocation API in the fork's CUDA function table is `cuMemHostAlloc` (with a flags argument). `cuMemAllocHost` is the older driver-API variant and is not exposed through the `CudaFunctions` dispatch struct. Every working CUDA TU that allocates pinned memory calls `cuMemHostAlloc` (see `picture_cuda.c:150`, `common.c:334`).
- Copyright header included `"and Claude (Anthropic)"`: the project's copyright policy (decided 2026-05-27, memory entry `project_copyright_lusoris_only.md`) uses `Copyright 2026 Lusoris` only.
The bugs were latent because the TUs were never compiled before this PR.
Decision
Repair both TUs:
- Replace all `CHECK_CUDA(cu_f, CALL)` calls with `CHECK_CUDA_GOTO(cu_f, CALL, fail)`.
- Replace `cuMemAllocHost((void **)&ptr, sz)` with `cuMemHostAlloc((void **)&ptr, sz, 0x01u)`.
- Fix copyright headers to `Copyright 2026 Lusoris`.
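The first two repairs can be illustrated with a small, self-contained sketch. It is not the fork's code: the `Mock*` types, the mock allocator, and the body of `CHECK_CUDA_GOTO` below are assumptions made for illustration (the real macro lives in `cuda_helper.cuh` and may map errors differently). Only the call shapes `CHECK_CUDA_GOTO(cu_f, CALL, fail)` and `cuMemHostAlloc((void **)&ptr, sz, 0x01u)` come from this ADR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for CUresult and the CudaFunctions dispatch table. */
typedef int MockCUresult;
#define MOCK_CUDA_SUCCESS 0

typedef struct MockCudaFunctions {
    MockCUresult (*cuMemHostAlloc)(void **pp, size_t bytesize, unsigned int flags);
    MockCUresult (*cuMemFreeHost)(void *p);
} MockCudaFunctions;

/* Assumed macro shape: on failure, store a negative errno in the caller's
 * local `err` and jump to the cleanup label instead of aborting. */
#define CHECK_CUDA_GOTO(funcs, CALL, label)            \
    do {                                               \
        if ((funcs)->CALL != MOCK_CUDA_SUCCESS) {      \
            err = -EINVAL;                             \
            goto label;                                \
        }                                              \
    } while (0)

/* Test hooks for the mock: -1 means "never fail", N means "fail after N allocations". */
static int mock_allocs_before_failure = -1;
static int mock_live_allocs = 0;

static MockCUresult mock_host_alloc(void **pp, size_t bytesize, unsigned int flags)
{
    (void)flags; /* a real driver would honour the flags; the mock ignores them */
    if (mock_allocs_before_failure == 0)
        return 2; /* simulated allocation failure */
    if (mock_allocs_before_failure > 0)
        mock_allocs_before_failure--;
    *pp = malloc(bytesize);
    if (!*pp)
        return 2;
    mock_live_allocs++;
    return MOCK_CUDA_SUCCESS;
}

static MockCUresult mock_free_host(void *p)
{
    free(p);
    mock_live_allocs--;
    return MOCK_CUDA_SUCCESS;
}

/* Allocates two pinned buffers; if the second allocation fails, the first is released. */
static int alloc_two_buffers(const MockCudaFunctions *cu_f, size_t sz, float **a, float **b)
{
    int err = 0;
    assert(cu_f && a && b);
    *a = NULL;
    *b = NULL;

    /* 0x01u is the value of CU_MEMHOSTALLOC_PORTABLE in the CUDA driver API. */
    CHECK_CUDA_GOTO(cu_f, cuMemHostAlloc((void **)a, sz, 0x01u), fail);
    CHECK_CUDA_GOTO(cu_f, cuMemHostAlloc((void **)b, sz, 0x01u), fail);
    return 0;

fail:
    if (*a) {
        cu_f->cuMemFreeHost(*a);
        *a = NULL;
    }
    return err;
}
```

The point of the `goto fail` form is that a failure in the second allocation still releases the first one, which a bare abort-style check would not do.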
Wire the repaired TUs into `core/src/meson.build` (under `if is_cuda_enabled`), declare the extern symbols and add registry rows in `core/src/feature/feature_extractor.c` (under `#if HAVE_CUDA`), and add CPU-vs-CUDA parity tests `test_cuda_speed_chroma_parity.c` and `test_cuda_speed_temporal_parity.c` mirroring the SYCL parity tests from ADR-0964.
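The registry side of this wiring can be pictured with the following sketch. It is an assumption about the general shape, not a copy of `feature_extractor.c`: the `Sketch*` type, the `sketch_*` symbols, and the NULL-terminated table are hypothetical, and the extractor objects are defined in the same file here only to keep the example self-contained (in the fork they are extern symbols defined in the CUDA TUs).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifndef HAVE_CUDA
#define HAVE_CUDA 1 /* assumption for this sketch; the build system normally sets it */
#endif

/* Hypothetical, reduced extractor descriptor: only the lookup name. */
typedef struct SketchFeatureExtractor {
    const char *name;
} SketchFeatureExtractor;

#if HAVE_CUDA
static SketchFeatureExtractor sketch_fex_speed_chroma_cuda = { "speed_chroma_cuda" };
static SketchFeatureExtractor sketch_fex_speed_temporal_cuda = { "speed_temporal_cuda" };
#endif

/* NULL-terminated registry; the CUDA rows exist only on CUDA-enabled builds. */
static SketchFeatureExtractor *sketch_registry[] = {
#if HAVE_CUDA
    &sketch_fex_speed_chroma_cuda,
    &sketch_fex_speed_temporal_cuda,
#endif
    NULL,
};

/* Linear lookup by name; returns NULL when the extractor is not registered. */
static SketchFeatureExtractor *sketch_get_by_name(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; sketch_registry[i] != NULL; i++) {
        if (strcmp(sketch_registry[i]->name, name) == 0)
            return sketch_registry[i];
    }
    return NULL;
}
```

With `HAVE_CUDA` set to 0 the two rows disappear and both lookups return NULL, which is the behaviour a non-CUDA build would be expected to show.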
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Repair in-place + wire (chosen) | Minimal blast radius; all algorithmic content (GPU kernels, CPU linalg path) was correct — only the error-handling and host-alloc API were wrong | None at the chosen scope | Chosen |
| Rewrite CUDA TUs using `vmaf_cuda_buffer_host_alloc` wrapper from `common.c` | Consistent with newer CUDA features (e.g. `float_adm_cuda.c`) | Wider diff; the direct `cuMemHostAlloc` call is already the pattern used by `picture_cuda.c` and is correct | Higher cost, no functional benefit |
| Drop CUDA SpEED twins, rely on HIP/SYCL only | Simpler CUDA build | Breaks CUDA-only deployments that want SpEED scores; creates asymmetry between backends | Rejected |
Consequences
- Positive:
  - `vmaf_get_feature_extractor_by_name("speed_chroma_cuda")` and `"speed_temporal_cuda"` now resolve on CUDA-enabled builds.
  - Two new CPU-vs-CUDA parity gates (`test_cuda_speed_chroma_parity`, `test_cuda_speed_temporal_parity`) catch regressions to within places=4 (ADR-0214 cross-backend tolerance).
  - `T-CUDA-SPEED-TU-REPAIR-2026-05-31` in `docs/state.md` is closed.
- Negative:
  - None at the current scope.
- Neutral / follow-ups:
  - The parity tests skip cleanly when no CUDA device is visible (same NaN sentinel pattern as `test_cuda_motion3_parity.c`); they will only fire on hardware CI lanes.
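A parity gate with a NaN skip sentinel can be sketched as follows. Two things here are assumptions rather than facts from ADR-0214 or the test sources: that places=4 means the difference rounds to zero at four decimals (that is, `|cpu - gpu| < 0.5e-4`), and that the GPU helper reports "no device" by returning NaN as the score. The names `ParityResult`, `agrees_to_places`, and `check_parity` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { PARITY_PASS, PARITY_FAIL, PARITY_SKIP } ParityResult;

/* Assumed reading of "places=N": the absolute difference is below 0.5 * 10^-N. */
static bool agrees_to_places(double a, double b, int places)
{
    assert(places >= 0);
    double tol = 0.5;
    for (int i = 0; i < places; i++)
        tol /= 10.0;
    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;
    return diff < tol;
}

/* A NaN GPU score is treated as "no CUDA device visible": skip, do not fail. */
static ParityResult check_parity(double cpu_score, double gpu_score)
{
    if (gpu_score != gpu_score) /* true only for NaN */
        return PARITY_SKIP;
    return agrees_to_places(cpu_score, gpu_score, 4) ? PARITY_PASS : PARITY_FAIL;
}
```

Under these assumptions a CPU/GPU pair such as 0.123456 and 0.123460 passes, a pair such as 0.1234 and 0.1236 fails, and a NaN GPU score is reported as a skip.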
References
- ADR-0964: `speed_internal.c` implementation + HIP/SYCL wiring (the PR that deferred this repair and opened the tracking ticket).
- ADR-0214: cross-backend parity tolerance (places=4).
- `core/src/cuda/cuda_helper.cuh`: `CHECK_CUDA_GOTO` / `CHECK_CUDA_RETURN` definitions.