ADR-0157: CUDA preallocation memory leak fix + vmaf_cuda_state_free public API (Netflix#1300)

  • Status: Accepted
  • Date: 2026-04-24
  • Deciders: Lusoris, Claude (Anthropic)
  • Tags: cuda, correctness, api, netflix-upstream, memory

Context

Netflix upstream issue #1300 reports that users running CUDA-accelerated VMAF in a loop — init → preallocate pictures → fetch frames → close → (repeat every 30 frames) — see GPU memory rise monotonically across cycles. Reporter matched the test_cuda_picture_preallocation_method_device pattern and could only avoid the leak by switching to the non-preallocating allocation path.

Verification via meson setup build-asan-cuda -Db_sanitize=address and running test_cuda_pic_preallocation confirmed 30 799 bytes leaked in 28 allocations across several distinct framework-side paths. Code inspection identified the root causes:

  • VmafCudaState struct ownership ambiguity: vmaf_cuda_state_init(&cu_state, cfg) at core/src/cuda/common.c mallocs a VmafCudaState. vmaf_cuda_import_state at core/src/libvmaf.c copies the struct by value (vmaf->cuda.state = *cu_state;) but does not take ownership of the pointer. vmaf_close calls vmaf_cuda_release(&vmaf->cuda.state) on the copy, which memsets it to zero, but the original cu_state pointer returned to the caller is never freed. There was no public vmaf_cuda_state_free API; vmaf_cuda_release lives in the internal header core/src/cuda/common.h and is unavailable to callers. Per-cycle host-memory leak: one VmafCudaState struct (~80 bytes).

  • CudaFunctions driver table never freed: vmaf_cuda_state_init calls cuda_load_functions(&c->f, NULL) from nv-codec-headers, which dlopens libcuda.so.1 and allocates a CudaFunctions* struct holding dlsym'd function pointers. vmaf_cuda_release destroys the CUDA stream + context but never calls cuda_free_functions(&cu_state->f). Per-cycle host-memory leak: one CudaFunctions struct.

  • pthread_mutex_destroy missing in ring-buffer close: vmaf_ring_buffer_close at core/src/cuda/ring_buffer.c:80 locks ring_buffer->busy, frees the pictures, frees the buffer memory, but never unlocks or destroys the mutex. Destroying a locked mutex is POSIX UB; on glibc the mutex internals don't heap-allocate for default-initialized mutexes, but the UB is real and future glibc versions may add internal state that leaks.

  • Cold-start error-path leak in init_with_primary_context (discovered during the fix): if cuStreamCreateWithPriority fails after cuDevicePrimaryCtxRetain succeeds, the retained primary context is not released. This is adjacent to the first two root causes above and easier to fix in the same commit than to track separately.

Public-API design note: existing SYCL backend has the same shape but declares ownership explicitly — vmaf_sycl_import_state() documents that ownership is NOT transferred and the caller must call vmaf_sycl_state_free() after vmaf_close(). CUDA should match.

Earlier fork PRs — #60 / ADR-0122 (framesync segfault) and #62 / ADR-0123 (null-guard) — hardened the CUDA path against NULL-state dereferences but did not address the ownership leaks. PR #93 / ADR-0156 (CHECK_CUDA graceful error propagation) is the prerequisite for doing ownership cleanup correctly — error paths now propagate instead of aborting, so "free on error" is actually reachable.

Decision

Fix all four leak sources + introduce the missing public API, in one PR:

New public API

/* core/include/libvmaf/libvmaf_cuda.h */

/**
 * Free VmafCudaState allocated by `vmaf_cuda_state_init()`.
 *
 * Must be called AFTER `vmaf_close()` on any VmafContext that
 * imported this state via `vmaf_cuda_import_state()`, because
 * `vmaf_close()` destroys the underlying CUDA stream and context.
 * Calling `vmaf_cuda_state_free()` first would leave `vmaf_close()`
 * with a dangling state.
 *
 * @param cu_state CUDA state to free. Safe to pass NULL.
 * @return 0 on success, or < 0 (a negative errno code) on error.
 */
int vmaf_cuda_state_free(VmafCudaState *cu_state);

Implementation in core/src/cuda/common.c is a NULL-safe free() wrapper — vmaf_close / vmaf_cuda_release already destroyed the stream, popped the context, and memset'd the struct. The only remaining owned resource is the heap allocation itself.
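The ownership contract can be illustrated with a minimal, self-contained sketch. It uses hypothetical stand-in types and functions (DemoState, DemoContext, demo_*), not the real libvmaf structs or API, and only models the heap block: import copies by value, close zeroes the copy, and a NULL-safe wrapper releases the caller-owned allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; NOT the real VmafCudaState / VmafContext. */
typedef struct DemoState { int stream; int ctx; } DemoState;
typedef struct DemoContext { DemoState state; } DemoContext;

static int live_blocks = 0; /* outstanding heap blocks, for illustration */

/* Heap-allocates the state and hands the pointer to the caller. */
static int demo_state_init(DemoState **out)
{
    assert(out != NULL);
    DemoState *s = malloc(sizeof(*s));
    if (!s) return -1;
    s->stream = 1;
    s->ctx = 1;
    live_blocks++;
    *out = s;
    return 0;
}

/* Copies the struct by value; the pointer still belongs to the caller. */
static void demo_import_state(DemoContext *c, const DemoState *s)
{
    c->state = *s;
}

/* Zeroes the context's copy only; it cannot reach the caller's block. */
static void demo_close(DemoContext *c)
{
    memset(&c->state, 0, sizeof(c->state));
}

/* NULL-safe free wrapper: the only owned resource left is the block itself. */
static int demo_state_free(DemoState *s)
{
    if (!s) return 0;
    free(s);
    live_blocks--;
    return 0;
}

/* Runs n init/import/close/free cycles; returns blocks still outstanding. */
static int demo_cycles(int n)
{
    for (int i = 0; i < n; i++) {
        DemoState *s = NULL;
        DemoContext c;
        if (demo_state_init(&s) < 0) return -1;
        demo_import_state(&c, s);
        demo_close(&c);      /* close first ... */
        demo_state_free(s);  /* ... then free the caller-owned state */
    }
    return live_blocks;
}
```

Dropping the demo_state_free call from the loop would leave one block outstanding per cycle, which is the shape of the leak described in the Context section.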

vmaf_cuda_release frees the CudaFunctions table

In vmaf_cuda_release, save the CudaFunctions* pointer before the existing memset, then call cuda_free_functions(&f) after the memset. Order matters: memset first so cu_state->f is zero; then free via the saved pointer. Avoids a dangling f in the caller's struct if it inspects the state after close.
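The save, memset, free ordering can be sketched as follows. DemoFunctions, DemoState and the demo_* functions are hypothetical stand-ins; demo_free_functions only imitates the assumed shape of cuda_free_functions (free the table, NULL the pointer) and does not load or unload any library.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for CudaFunctions and VmafCudaState. */
typedef struct DemoFunctions { void *lib; } DemoFunctions;
typedef struct DemoState { DemoFunctions *f; int stream; int ctx; } DemoState;

/* Frees the table and NULLs the pointer it was given. */
static void demo_free_functions(DemoFunctions **f)
{
    if (!f || !*f) return;
    free(*f);
    *f = NULL;
}

static int demo_release(DemoState *s)
{
    if (!s) return -EINVAL;
    DemoFunctions *f = s->f;   /* 1. save the table pointer */
    /* stream and context teardown would happen here */
    memset(s, 0, sizeof(*s));  /* 2. zero the caller-visible struct */
    demo_free_functions(&f);   /* 3. free through the saved pointer */
    return 0;
}

/* Returns 0 if release succeeded and left no stale fields behind. */
static int demo_release_roundtrip(void)
{
    DemoState s = { 0 };
    s.f = calloc(1, sizeof(*s.f));
    if (!s.f) return -ENOMEM;
    s.stream = 1;
    s.ctx = 1;
    int err = demo_release(&s);
    assert(s.f == NULL); /* no dangling table pointer after release */
    return (err == 0 && s.stream == 0 && s.ctx == 0) ? 0 : 1;
}
```

Because the struct is zeroed before the table is freed, a caller inspecting the state after close sees a NULL f rather than a pointer to freed memory.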

Ring buffer close: unlock + destroy

vmaf_ring_buffer_close now does:

err |= pthread_mutex_unlock(&ring_buffer->busy);
err |= pthread_mutex_destroy(&ring_buffer->busy);
free(ring_buffer->pic);
free(ring_buffer);
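For context, a self-contained sketch of the full lock, unlock, destroy, free sequence is shown here. DemoRingBuffer and the demo_* functions are hypothetical; only the field names busy and pic are taken from the snippet above, and per-picture teardown is elided.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the ring buffer. */
typedef struct DemoRingBuffer {
    pthread_mutex_t busy;
    int *pic;
} DemoRingBuffer;

static int demo_ring_buffer_init(DemoRingBuffer **out, size_t n)
{
    assert(out != NULL);
    DemoRingBuffer *rb = calloc(1, sizeof(*rb));
    if (!rb) return -1;
    rb->pic = calloc(n, sizeof(*rb->pic));
    if (!rb->pic) {
        free(rb);
        return -1;
    }
    if (pthread_mutex_init(&rb->busy, NULL) != 0) {
        free(rb->pic);
        free(rb);
        return -1;
    }
    *out = rb;
    return 0;
}

static int demo_ring_buffer_close(DemoRingBuffer *rb)
{
    if (!rb) return -1;
    int err = pthread_mutex_lock(&rb->busy);
    /* per-picture teardown would happen here, under the lock */
    err |= pthread_mutex_unlock(&rb->busy);  /* never destroy a locked mutex */
    err |= pthread_mutex_destroy(&rb->busy); /* release mutex resources */
    free(rb->pic);
    free(rb); /* the mutex lives inside rb, so destroy it before this free */
    return err;
}
```

The pthread calls return 0 on success and a positive error number on failure, so OR-ing them yields a nonzero value if any step failed.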

Cold-start unwind in init_with_primary_context

On the fail_after_pop path, release the retained primary context before returning. Also add an outer unwind in vmaf_cuda_state_init so that any failure in the inner init frees both c and c->f cleanly.
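A minimal sketch of this two-level unwind, using goto-based cleanup, might look like the following. The driver calls are replaced by hypothetical counters (demo_ctx_retain, demo_stream_create), the label name is illustrative, and the CudaFunctions table is left out for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical counter standing in for retained primary contexts. */
static int retained_contexts = 0;

static int demo_ctx_retain(void) { retained_contexts++; return 0; }
static void demo_ctx_release(void) { retained_contexts--; }
static int demo_stream_create(int should_fail) { return should_fail ? -1 : 0; }

typedef struct DemoState { int has_ctx; int has_stream; } DemoState;

/* Inner init: releases the retained context if stream creation fails. */
static int demo_init_with_primary_context(DemoState *s, int fail_stream)
{
    assert(s != NULL);
    int err = demo_ctx_retain();
    if (err) return err;
    s->has_ctx = 1;

    err = demo_stream_create(fail_stream);
    if (err) goto fail_release_ctx;
    s->has_stream = 1;
    return 0;

fail_release_ctx:
    demo_ctx_release(); /* undo the retain before reporting the error */
    s->has_ctx = 0;
    return err;
}

/* Outer init: frees the heap block if the inner init fails. */
static int demo_state_init(DemoState **out, int fail_stream)
{
    DemoState *s = calloc(1, sizeof(*s));
    if (!s) return -1;
    int err = demo_init_with_primary_context(s, fail_stream);
    if (err) {
        free(s);
        return err;
    }
    *out = s;
    return 0;
}
```

Each level undoes only what it acquired, so a failure at any point leaves neither a retained context nor a heap block behind.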

GPU-gated reducer test

core/test/test_cuda_preallocation_leak.c — a 10-cycle reducer that does init → preallocate → fetch 10 pictures → close with full cleanup on each cycle (vmaf_cuda_state_free, vmaf_model_destroy). GPU-gated: cycle 0 probes the driver; no visible device → SKIP cleanly.
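The control flow of such a GPU-gated reducer can be sketched as below. This is not the test's actual code: the probe and cycle callbacks are hypothetical, and the use of exit code 77 (which Meson reports as a skipped test) is an assumption about how the skip is signalled.

```c
#include <assert.h>

/* 77 is the exit code Meson treats as "skipped"; its use here is an assumption. */
enum { DEMO_PASS = 0, DEMO_FAIL = 1, DEMO_SKIP = 77 };

typedef int (*demo_probe_fn)(void);      /* nonzero if a device is visible */
typedef int (*demo_cycle_fn)(int cycle); /* 0 on success, < 0 on error */

/* Runs `cycles` full init/close cycles, probing the driver on cycle 0. */
static int demo_reducer(int cycles, demo_probe_fn probe, demo_cycle_fn run_cycle)
{
    assert(cycles > 0);
    for (int i = 0; i < cycles; i++) {
        if (i == 0 && !probe())
            return DEMO_SKIP; /* no visible device: skip, do not fail */
        if (run_cycle(i) < 0)
            return DEMO_FAIL;
    }
    return DEMO_PASS;
}

/* Hypothetical callbacks used to exercise both paths. */
static int demo_cycles_run = 0;
static int demo_no_device(void) { return 0; }
static int demo_has_device(void) { return 1; }
static int demo_count_cycle(int cycle) { (void)cycle; demo_cycles_run++; return 0; }
```

In a real reducer, the cycle callback would perform init, preallocate, fetch and close with full cleanup, and a leak checker such as ASan would report anything left behind at exit.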

Existing test cleanup

test_cuda_pic_preallocation.c and test_cuda_buffer_alloc_oom.c — add the missing vmaf_cuda_state_free(cu_state) and vmaf_model_destroy(model) calls after vmaf_close(vmaf). Fixes the test-side cleanup gap that masked the framework leaks before.

Alternatives considered

| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Fix all leaks + add vmaf_cuda_state_free (this ADR) | Addresses every per-cycle leak surfaced by ASan; matches SYCL's explicit-ownership pattern; one coherent PR | New public API surface to maintain; larger diff than surgical | Chosen: user popup 2026-04-24 selected "Full fix: all leaks + public state_free API" |
| Surgical: only cuda_free_functions + add state_free API | Minimum to close Netflix#1300's CUDA-memory symptom | Leaves pthread mutex UB + cold-start leak unfixed; they'd surface as separate bugs later | Rejected via popup: user preferred wholesale |
| Take ownership in vmaf_cuda_import_state | No new API; existing callers don't change | Silent ownership transfer is surprising; violates POLA; inconsistent with SYCL's explicit-free pattern | Rejected: explicit > implicit for ownership |
| Document + defer | Zero code change | Users stay affected; Netflix#1300 OPEN since 2024 with no upstream response | Rejected: has a straightforward fix |

Consequences

  • Positive:
      • Netflix#1300 reproducer resolved: 10-cycle loop leaks zero framework bytes (183 bytes remain in libcuda.so.1 internal state; this is a per-process driver cache, not per-cycle, and matches SYCL backend behaviour).
      • CUDA backend ownership story now matches SYCL: caller allocates state, framework copies-by-value in import, caller calls state_free after vmaf_close. Symmetric and documented.
      • Ring-buffer pthread_mutex_destroy closes a POSIX UB that future glibc versions could turn into a real crash.
      • Adjacent cold-start leak fixed in the same commit (retained primary context on stream-create failure).
      • ADR-0122 / ADR-0123 null-guards preserved verbatim; ADR-0156 CHECK_CUDA_GOTO cleanup paths preserved; composable with the new free calls.
  • Negative:
      • New required step for every CUDA caller: after vmaf_close(vmaf), call vmaf_cuda_state_free(cu_state). Callers who already release the state informally with free(cu_state) must replace that call rather than add the new one alongside it; doing both is a double-free and will crash. Flagged under ### Changed in CHANGELOG.
      • Public ABI growth by one symbol (additive; versioned via the shared library's symbol version script when release-please next cuts).
  • Neutral / follow-ups:
      • Tests updated to demonstrate the full cleanup pattern. ffmpeg filter (libavfilter/vf_libvmaf.c) should be audited for the same cleanup sequence during the next ffmpeg-patches refresh (backlog follow-up).
      • If a self-hosted GPU runner lands (backlog T7-3), the new reducer gets real CI coverage instead of skipping at the driver-probe step.

Verification

  • meson test -C core/build-cuda → 40/40 pass (was 39/39 pre-PR + new reducer).
  • meson test -C build (CPU-only) → 35/35 pass.
  • ASAN_OPTIONS='detect_leaks=1:leak_check_at_exit=1' build-asan-cuda/test/test_cuda_preallocation_leak → 183 bytes leaked in 4 allocations, all in libcuda.so.1 driver internal state (cuInit cache: persists for process lifetime, does NOT grow per cycle; verified by N=1 vs N=10 comparison). Zero core/src/* frames in the leak traces.
  • clang-tidy -p build-cuda --quiet <5 touched files> → exit 0.
  • CI-equivalent clang-tidy -p build --quiet core/include/libvmaf/libvmaf_cuda.h (the only CI-visible file after the ADR-0156 exclusion filter) → exit 0.
  • pre-commit run --files <touched> → all hooks pass.
  • Reducer verified to exercise the fix: pre-fix (with the sweep reverted), ASan reports the VmafCudaState malloc + the CudaFunctions dlopen table as leaked per cycle. Post-fix those stack frames are gone.

References

  • Upstream issue: Netflix/vmaf#1300 ("CUDA-VMAF Memory Leak (preallocation method) in libvmaf"), OPEN since 2024; no maintainer fix as of 2026-04-24.
  • ADR-0122 — fork PR #60 CUDA framesync hardening (preserved).
  • ADR-0123 — fork PR #62 null-guard (preserved).
  • ADR-0156 — CHECK_CUDA graceful error propagation (prerequisite; enables the new error-path cleanup to actually run).
  • ADR-0141 — touched-file lint rule.
  • rebase-notes 0050 — upstream-sync invariants for this decision.
  • Backlog: .workingdir2/BACKLOG.md T1-7.
  • User direction 2026-04-24 popup: "T1-7 CUDA preallocation leak Netflix#1300" → "Full fix: all leaks + public state_free API".