ADR-0156: CUDA backend: graceful error propagation (Netflix#1420)
- Status: Accepted
- Date: 2026-04-24
- Deciders: Lusoris, Claude (Anthropic)
- Tags: cuda, correctness, api, netflix-upstream, reliability
Context
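Netflix upstream issue #1420 reports that running two VMAF-CUDA analyses concurrently aborts the second process with the assertion failure quoted in the issue title: ".../src/cuda/common.c:166: vmaf_cuda_buffer_alloc: Assertion `0' failed."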
Root cause: the CHECK_CUDA macro in core/src/cuda/cuda_helper.cuh called assert(0) on any CUDA error:
    #define CHECK_CUDA(funcs, CALL) \
        do { \
            const CUresult cu_err = funcs->CALL; \
            if (CUDA_SUCCESS != cu_err) { \
                const char *err_txt; \
                funcs->cuGetErrorName(cu_err, &err_txt); \
                printf("code: %d; description: %s\n", \
                       (int)cu_err, err_txt); \
                assert(0); \
            } \
        } while (0)
Legitimate failure modes — cuMemAlloc OOM, cuStreamCreate resource exhaustion, cuModuleLoadData compile errors on a mismatched driver — all collapsed into assert(0). Two consequences:
- No graceful recovery: a downstream caller that wanted to retry, fall back to CPU, or surface a clean error to the user had no way to do so; the process was already dead.
- NDEBUG footgun: under a release build with NDEBUG defined, assert(0) is a no-op; the function then silently continued with an unallocated buffer, leading to a segfault on the first dereference.
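The second consequence can be reproduced without a GPU. The following is a minimal sketch, not code from the fork: fake_mem_alloc, OLD_STYLE_CHECK, and continues_past_failed_alloc are hypothetical names, and the failing allocation is simulated. It only illustrates that a check built on assert(0) stops nothing once NDEBUG is defined.

```c
#define NDEBUG /* simulate a release build; must come before <assert.h> */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a failing driver allocation.
 * The value 2 mirrors CUDA_ERROR_OUT_OF_MEMORY. */
static int fake_mem_alloc(void **ptr)
{
    *ptr = NULL; /* nothing was allocated */
    return 2;
}

/* Same shape as the old macro: report the error, then assert(0). */
#define OLD_STYLE_CHECK(CALL)                                      \
    do {                                                           \
        const int err_ = (CALL);                                   \
        if (err_ != 0) {                                           \
            printf("code: %d\n", err_);                            \
            assert(0); /* compiles to a no-op because of NDEBUG */ \
        }                                                          \
    } while (0)

/* Returns 1 when control flows past the failed check with buf still NULL. */
static int continues_past_failed_alloc(void)
{
    void *buf = NULL;
    OLD_STYLE_CHECK(fake_mem_alloc(&buf));
    /* Real code would dereference the buffer at this point. */
    return buf == NULL;
}
```

In this sketch the function returns 1: the error is printed, the assertion vanishes, and the caller never learns that the allocation failed.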
vmaf_cuda_buffer_alloc already returned int and every call site used the ret |= ... pattern (see core/src/feature/cuda/integer_motion_cuda.c:159–167, integer_vif_cuda.c:150–155, integer_adm_cuda.c:1035–1054). Callers were ready to handle a failure — but the macro never let the failure propagate.
Fork PRs #60 and #62 (ADR-0122, ADR-0123) hardened the null-state path in common.c but left the assert-on-any-error semantics untouched. Netflix#1420 is the remaining half of that hardening story.
Decision
Replace CHECK_CUDA's abort-on-error with graceful error propagation across the entire CUDA backend. Two new macros in core/src/cuda/cuda_helper.cuh:
- CHECK_CUDA_GOTO(funcs, CALL, label): on CUDA failure, logs the error (file / line / CUresult name / call text) and jumps to label: for cleanup. The caller declares int _cuda_err = 0; at function entry and puts cleanup code (context pop, buffer free, etc.) under label: before returning _cuda_err. Used for call sites with pending cleanup state.
- CHECK_CUDA_RETURN(funcs, CALL): on CUDA failure, logs and returns directly from the enclosing function. Used for sites with no pending state to clean up (kernel launch dispatches, post-pop stream syncs).
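To make the two flavours concrete, here is a hedged sketch of how such macros could be shaped. It is not the fork's actual definition in cuda_helper.cuh. Every SKETCH_*, sketch_*, and fake_* name is hypothetical, the function table is a mock rather than the CUDA driver table, and the errno translation is reduced to two cases. Only the control flow (log, translate, then goto or return) follows the description above.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver function table; 0 means success. */
typedef struct SketchFuncs {
    int (*alloc)(void **ptr, size_t size);
    int (*release)(void *ptr);
    int (*error_name)(int err, const char **name);
} SketchFuncs;

/* Simplified stand-in for the CUresult -> -errno translation. */
static int sketch_to_errno(int err)
{
    return err == 0 ? 0 : (err == 2 ? -ENOMEM : -EIO);
}

/* Logs file / line / error name / call text, returns the translated code. */
static int sketch_log(const SketchFuncs *funcs, int err, const char *file,
                      int line, const char *call)
{
    const char *name = "unknown";
    funcs->error_name(err, &name);
    fprintf(stderr, "CUDA error at %s:%d: %s (%d) in %s\n",
            file, line, name, err, call);
    return sketch_to_errno(err);
}

/* Expects int _cuda_err in scope; jumps to label so cleanup can run. */
#define SKETCH_CHECK_GOTO(funcs, CALL, label)                             \
    do {                                                                  \
        const int err_ = (funcs)->CALL;                                   \
        if (err_ != 0) {                                                  \
            _cuda_err =                                                   \
                sketch_log((funcs), err_, __FILE__, __LINE__, #CALL);     \
            goto label;                                                   \
        }                                                                 \
    } while (0)

/* No pending state: return straight from the enclosing function. */
#define SKETCH_CHECK_RETURN(funcs, CALL)                                  \
    do {                                                                  \
        const int err_ = (funcs)->CALL;                                   \
        if (err_ != 0)                                                    \
            return sketch_log((funcs), err_, __FILE__, __LINE__, #CALL);  \
    } while (0)

/* Fake backend: at most fake_capacity allocations may be live at once. */
static int fake_capacity = 0, fake_live = 0;

static int fake_alloc(void **ptr, size_t size)
{
    if (fake_live >= fake_capacity || (*ptr = malloc(size)) == NULL)
        return 2; /* mirrors CUDA_ERROR_OUT_OF_MEMORY */
    fake_live++;
    return 0;
}

static int fake_release(void *ptr)
{
    if (ptr != NULL)
        fake_live--;
    free(ptr); /* free(NULL) is a no-op */
    return 0;
}

static int fake_error_name(int err, const char **name)
{
    *name = (err == 2) ? "CUDA_ERROR_OUT_OF_MEMORY" : "CUDA_ERROR_UNKNOWN";
    return 0;
}

static const SketchFuncs fake_funcs = { fake_alloc, fake_release,
                                        fake_error_name };

/* GOTO flavour: two allocations, one shared cleanup path. */
static int sketch_use_two_buffers(const SketchFuncs *funcs)
{
    int _cuda_err = 0;
    void *a = NULL, *b = NULL;

    SKETCH_CHECK_GOTO(funcs, alloc(&a, 64), cleanup);
    SKETCH_CHECK_GOTO(funcs, alloc(&b, 64), cleanup);

cleanup:
    funcs->release(b); /* NULL-safe in this sketch */
    funcs->release(a);
    return _cuda_err;
}

/* RETURN flavour: nothing to undo, so the error is returned directly. */
static int sketch_free_buffer(const SketchFuncs *funcs, void *ptr)
{
    SKETCH_CHECK_RETURN(funcs, release(ptr));
    return 0;
}
```

With fake_capacity set to 1, the second allocation in sketch_use_two_buffers fails, the first buffer is released under the cleanup label, and the caller receives -ENOMEM. Success and failure share the same exit path, which is the usual reason to prefer the goto form whenever state is pending.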
A third helper maps CUresult → -errno:
    static inline int vmaf_cuda_result_to_errno(int cu_err_code)
    {
        switch (cu_err_code) {
        case 0: return 0; /* CUDA_SUCCESS */
        case 2: return -ENOMEM; /* CUDA_ERROR_OUT_OF_MEMORY */
        case 3:
        case 4: return -ENODEV; /* NOT_INITIALIZED / DEINITIALIZED */
        case 1:
        case 101:
        case 201:
        case 400: return -EINVAL; /* INVALID_* */
        default: return -EIO;
        }
    }
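As a quick illustration of how the mapping collapses many CUresult values into four errno codes, the sketch below copies the helper so that it compiles on its own and adds a table-driven self-check. The count_mapping_mismatches function is hypothetical and not part of the fork. The code 999 is used as an example of an unlisted value, on the assumption that it corresponds to CUDA_ERROR_UNKNOWN.

```c
#include <errno.h>
#include <stddef.h>

/* Copied from the helper above so this block is self-contained. */
static inline int vmaf_cuda_result_to_errno(int cu_err_code)
{
    switch (cu_err_code) {
    case 0: return 0;         /* CUDA_SUCCESS */
    case 2: return -ENOMEM;   /* CUDA_ERROR_OUT_OF_MEMORY */
    case 3:
    case 4: return -ENODEV;   /* NOT_INITIALIZED / DEINITIALIZED */
    case 1:
    case 101:
    case 201:
    case 400: return -EINVAL; /* INVALID_* */
    default: return -EIO;
    }
}

/* Hypothetical self-check: returns the number of rows that do not match. */
static int count_mapping_mismatches(void)
{
    static const struct { int cu_err; int expected; } rows[] = {
        { 0, 0 },
        { 2, -ENOMEM },
        { 3, -ENODEV },   { 4, -ENODEV },
        { 1, -EINVAL },   { 101, -EINVAL },
        { 201, -EINVAL }, { 400, -EINVAL },
        { 999, -EIO }, /* unlisted code: falls through to the default */
    };
    int mismatches = 0;

    for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
        if (vmaf_cuda_result_to_errno(rows[i].cu_err) != rows[i].expected)
            mismatches++;
    }
    return mismatches;
}
```

The default branch matters most here: any CUresult the switch does not name still reaches the caller as a negative errno (-EIO) rather than as success.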
Every one of the 178 existing CHECK_CUDA(...) call sites across 7 files (common.c, picture_cuda.c, libvmaf.c, integer_motion_cuda.c, integer_vif_cuda.c, integer_adm_cuda.c, and the macro definition in cuda_helper.cuh) is converted:
- 122 sites → CHECK_CUDA_GOTO with cleanup labels.
- 56 sites → CHECK_CUDA_RETURN.
Twelve static helper functions that previously returned void and used CHECK_CUDA internally are promoted to int so failures reach their callers:
- integer_motion_cuda.c: calculate_motion_score (+ matching function-pointer type in MotionStateCuda).
- integer_vif_cuda.c: filter1d_8, filter1d_16.
- integer_adm_cuda.c: dwt2_8_device, adm_dwt2_s123_combined_device, adm_dwt2_16_device, adm_csf_device, i4_adm_csf_device, adm_csf_den_s123_device, adm_csf_den_scale_device, i4_adm_cm_device, adm_cm_device, integer_compute_adm_cuda.
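The shape of such a promotion, including the function-pointer type that has to change with it, might look like the sketch below. All names (SketchMotionState, sketch_score_ok, sketch_score_fail, sketch_extract) are hypothetical and the arithmetic is a placeholder. The real helpers launch CUDA kernels and have different signatures.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical state struct. The member's return type changes together with
 * the helper it points to (before the promotion it would return void). */
typedef struct SketchMotionState {
    int (*calculate_score)(int frame_index, double *score);
} SketchMotionState;

/* Promoted helper: success is now reported explicitly. */
static int sketch_score_ok(int frame_index, double *score)
{
    *score = (double)frame_index * 0.5; /* placeholder arithmetic */
    return 0;
}

/* Same signature, simulating a failed kernel dispatch. */
static int sketch_score_fail(int frame_index, double *score)
{
    (void)frame_index;
    (void)score; /* output is left untouched on failure */
    return -EIO;
}

/* Caller: stops at the first failing step and forwards its error code. */
static int sketch_extract(const SketchMotionState *s, int frame_index,
                          double *score)
{
    const int ret = s->calculate_score(frame_index, score);
    if (ret != 0)
        return ret;
    /* ... further steps would run only after success ... */
    return 0;
}
```

Because these helpers are static, a change of this kind stays inside the translation unit; only the internal function-pointer type has to follow.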
Public ABI is unchanged: every function exported via core/include/libvmaf/libvmaf_cuda.h already returned int and continues to return int with the same sign convention. Previously-undocumented failure modes (OOM, stream creation failure, kernel dispatch failure) now reach callers as -ENOMEM / -EIO / -EINVAL / -ENODEV instead of aborting the process.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Wholesale CHECK_CUDA replacement (this ADR) | Fixes the reported bug and every analogous site in one pass; no mixed abort/graceful semantics left to reason about; the NDEBUG footgun is eliminated everywhere | 178 call sites + 12 void→int signature changes → large refactor; proportional review cost | Chosen — the user explicitly selected wholesale scope via popup; incremental coverage would leave the NDEBUG footgun alive in every untouched site |
| Surgical: only vmaf_cuda_buffer_alloc | Smallest diff; addresses the exact reported crash | Every other cuMemAlloc / cuStreamCreate / kernel-launch site keeps the abort-on-error semantics; reporter's symptom recurs on any other CUDA failure | Rejected via popup: user chose wholesale |
| Keep assert(0) but add fprintf + abort() | Preserves "fail loud" behaviour; no caller changes | Still aborts the process; NDEBUG no longer gates the abort, but downstream still can't retry / fall back / degrade gracefully | Rejected: doesn't solve the reported problem, just changes the flavour of abort |
| Introduce a callback for error policy (abort vs return) | Configurable per-caller | Adds a runtime branch on every CHECK; new public API surface; over-engineered for a yes/no problem | Rejected — the graceful-return path is always the right default; there's no call-site that prefers abort-on-OOM |
Consequences
- Positive:
  - Netflix#1420 reproducer resolved: a second concurrent VMAF-CUDA process that OOMs on cuMemAlloc now gets a clean -ENOMEM return from vmaf_cuda_buffer_alloc (and transitively from init_fex_cuda), instead of aborting.
  - NDEBUG footgun eliminated: with assert(0) gone, release builds can no longer silently continue past a failed cuMemAlloc into a segfault.
  - Downstream integrations (ffmpeg filter, libvmaf CLI, mcp-server) can distinguish transient resource pressure (-ENOMEM, -EIO) from configuration errors (-EINVAL) and device loss (-ENODEV). This enables "retry with smaller batch", "fall back to CPU", or "surface a clean error" behaviour.
  - Every error path now logs via fprintf(stderr, "CUDA error at %s:%d: %s (%d) in %s\n", ...), including the #CALL stringification, which is more actionable than the old printf + assert(0) stack-trace dump.
  - ADR-0122 / ADR-0123 null-guards on the public entry points in common.c are preserved verbatim; the graceful-return path and the null-guard path compose cleanly.
- Negative:
  - Visible behaviour change for callers that relied on the process aborting rather than returning an error: specifically, if any downstream code assumed CUDA failures produced a visible crash (and therefore treated VMAF-CUDA success as "process still alive"), that assumption is now false. Flagged under ### Changed in CHANGELOG.
  - Larger binary: every touched function now has an extra cleanup label + error-translation switch. Measured delta on libvmaf.so.3.0.0: ~4 kB code size increase (0.05%), within noise.
  - Twelve void → int signature changes in static helper functions. Private to the CUDA TUs; no public ABI change.
- Neutral / follow-ups:
  - New reducer test core/test/test_cuda_buffer_alloc_oom.c exercises cuMemAlloc(1 TiB) and verifies the return is -ENOMEM (not abort). GPU-gated: runs only when CUDA is available at test time; skips cleanly otherwise. Verified on this host to actually hit the OOM path at cuMemAlloc (not the NULL-state early return).
  - Rebase-notes entry 0049 pins the invariant: on upstream sync, keep the fork's CHECK_CUDA_GOTO / CHECK_CUDA_RETURN macros; upstream Netflix still uses assert(0) in the macro.
  - Pre-existing performance-no-int-to-ptr warnings in integer_adm_cuda.c + integer_vif_cuda.c (CUDA device-pointer casts (T*)(size_t)cu_ptr.data, inherent to the CUDA Driver API) bracketed with NOLINTBEGIN/END(performance-no-int-to-ptr) + inline ADR-0141 citation. Two scoped blocks + one NOLINTNEXTLINE cover 47 sites.
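A downstream caller could turn the distinct return codes into a recovery policy along the following lines. This is only a sketch under stated assumptions: SketchAction and sketch_choose_action are hypothetical names that do not exist in libvmaf, and the pairing of each code with an action is one plausible reading of the categories listed above, not a policy prescribed by this ADR.

```c
#include <errno.h>

/* Hypothetical caller-side policy; none of these names exist in libvmaf. */
typedef enum SketchAction {
    SKETCH_PROCEED,       /* no error */
    SKETCH_RETRY_SMALLER, /* transient resource pressure */
    SKETCH_FALL_BACK_CPU, /* device lost or never initialised */
    SKETCH_REPORT_ERROR   /* configuration error or anything unexpected */
} SketchAction;

/* Maps a negative errno returned by the CUDA backend to an action. */
static SketchAction sketch_choose_action(int err)
{
    switch (err) {
    case 0:
        return SKETCH_PROCEED;
    case -ENOMEM:
    case -EIO:
        return SKETCH_RETRY_SMALLER;
    case -ENODEV:
        return SKETCH_FALL_BACK_CPU;
    case -EINVAL:
    default:
        return SKETCH_REPORT_ERROR;
    }
}
```

Under the old abort-on-error macro none of these branches could run, because the process ended before the caller saw a return value.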
Verification
- meson test -C core/build-cuda → 39/39 pass (was: 38/38 pre-PR; + new test_cuda_buffer_alloc_oom).
- meson test -C build (CPU-only) → 35/35 pass.
- Reducer test verified to hit the OOM branch on this host: cuMemAlloc(1 TiB) returns CUDA_ERROR_OUT_OF_MEMORY (2) → vmaf_cuda_result_to_errno(2) = -ENOMEM → caller receives -ENOMEM, process continues. Pre-fix, the same line fired assert(0) and aborted.
- clang-tidy -p core/build-cuda --quiet <6 files> → exit 0 on every file (zero errors in the WarningsAsErrors set). NOLINT brackets added for performance-no-int-to-ptr at 47 CUDA device-pointer-cast sites, each with inline ADR-0141 upstream-parity citation.
- pre-commit run --files <touched> → all hooks pass.
- ADR-0122 / ADR-0123 null-guards (is_cudastate_empty(...) at vmaf_cuda_sync, vmaf_cuda_release, vmaf_cuda_buffer_alloc, vmaf_cuda_buffer_free, vmaf_cuda_buffer_upload, vmaf_cuda_buffer_download, vmaf_cuda_buffer_host_alloc, vmaf_cuda_buffer_host_free) preserved verbatim.
References
- Upstream issue: Netflix/vmaf#1420 ("Crash when 2 files are analyzed simultaneously .../src/cuda/common.c:166: vmaf_cuda_buffer_alloc: Assertion `0' failed."), OPEN as of 2026-04-24.
- ADR-0122: fork PR #60 CUDA framesync hardening.
- ADR-0123: fork PR #62 null-guard at CUDA public entry points.
- ADR-0141: touched-file lint-clean rule (scoping the performance-no-int-to-ptr NOLINTs).
- ADR-0154: prior precedent for "transient vs fatal" error-code splits in libvmaf.
- rebase-notes 0049: upstream-sync invariants for this decision.
- Backlog: .workingdir2/BACKLOG.md T1-6.
- User direction 2026-04-24 popup: "T1-6 CUDA concurrency assert Netflix#1420" → "Wholesale: replace CHECK_CUDA everywhere".