ADR-0754 — CUDA SSIM vert_combine: __ldg() + __launch_bounds__ + pinned-host leak fix¶
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-05-29 |
| Tags | cuda, performance, correctness, ssim, fork-local |
Context¶
A code review of calculate_ssim_vert_combine in core/src/feature/cuda/integer_ssim/ssim_score.cu surfaced three findings suitable for a single low-risk PR (ordered lowest to highest risk below).
F4 — __launch_bounds__(128) (occupancy hint). The kernel launches with 16×8 = 128 threads/block. Without a __launch_bounds__ annotation, nvcc may allocate more registers than needed for 128-thread occupancy. Adding __launch_bounds__(128) sets an upper register budget consistent with the actual configuration and follows the pattern established in ADR-0743 for the VIF filter1d kernel.
F2 — __ldg() on the 5 read-only intermediate buffers. calculate_ssim_vert_combine reads 5 float buffers (h_ref_mu, h_cmp_mu, h_ref_sq, h_cmp_sq, h_refcmp) that are written exclusively by the preceding horizontal pass and never aliased in the vertical pass. These 5x11 = 55 inner-loop loads are therefore good candidates for the non-coherent read-only path (texture cache). However, the buffers are passed as VmafCudaBuffer structs by value; without __restrict__, the compiler cannot determine the underlying pointers are non-aliased. Fix: extract const float *__restrict__ pointers once before the inner loop, then use __ldg(&ptr[idx]) for every load. The compiler can now route all 55 loads through the L1 read-only cache, reducing L2 pressure at resolution >= 1080p where the combined 5-plane intermediate footprint exceeds L2 capacity (mirrors ADR-0743 note on VIF tmp buffers).
F6 — pinned-host memory leak in close_fex_cuda (correctness). vmaf_cuda_kernel_readback_free (in cuda/kernel_template.h) explicitly NULLs rb->host_pinned but does NOT free the underlying CUDA pinned host allocation — the template comment documents this as a caller responsibility. integer_ssim_cuda.c::close_fex_cuda called readback_free without first saving the host pointer, losing the only reference needed to call vmaf_cuda_buffer_host_free. Result: one page of pinned host memory leaked per vmaf_close() cycle. Fix: save rb.host_pinned to a local before readback_free, then call vmaf_cuda_buffer_host_free(fex->cu_state, saved).
Deferred findings (not in this PR):
- F1 (AoS to SoA buffer pack) — bigger change; revisit if F2 does not move the needle enough at 1080p.
- F3 (kernel signature change paired with F1) — dependent on F1.
- F5, F7, F8 — cosmetic or separate-concern; different PRs.
integer_psnr_cuda.chas the samereadback_free/host_freegap and is explicitly named for a follow-up fix.
Decision¶
Apply F4, F2, and F6 together in this PR:
- Add
__launch_bounds__(128)tocalculate_ssim_vert_combine. - Extract
const float *__restrict__pointers from the 5VmafCudaBufferarguments before the inner loop; use__ldg(&ptr[idx])for all 55 loads. - In
close_fex_cuda, saves->rb.host_pinnedbeforereadback_free, then callvmaf_cuda_buffer_host_free(fex->cu_state, saved_host_pinned).
F6 is unconditional (memory-safety fix). F4 + F2 are abandoned if ncu measurements show regression (per the ms_ssim_decimate lesson, Research-0749).
Alternatives considered¶
| Option | Considered | Outcome |
|---|---|---|
| F2 only (skip F4) | Yes | F4 is two characters with zero risk; adds occupancy-budget symmetry with ADR-0743. Include. |
| F2 + F4 only (skip F1/F3) | Yes — adopted | AoS to SoA (F1) is a larger change. Phase it: measure if F2 moves the needle at 1080p; if not, revisit F1. |
| F1 + F2 + F3 + F4 full rewrite | Yes | Touching both .cu and .c in a restructure PR increases review risk. Phased approach preferred. |
Fix readback_free itself to call host_free | Yes | Changes the template's documented contract; requires a broader caller audit. Per-caller fix is lower-risk for now. |
Consequences¶
calculate_ssim_vert_combinecarries__launch_bounds__(128).- All 55 inner-loop loads route through the read-only texture cache. L2 pressure is reduced at >= 1080p.
close_fex_cudano longer leaks pinned host memory.integer_psnr_cuda.chas the same readback_free / host_free gap and is named for a follow-up fix.- Invariant notes added to
core/src/feature/cuda/AGENTS.md: __ldg()pattern for read-only intermediate buffers in pass-2 kernels.- Pinned-host memory free responsibility after
readback_free.
References¶
- req: user direction 2026-05-29: "Implement the 3 actionable findings from the SSIM vert_combine review report."
- ADR-0743: VIF filter1d
__launch_bounds__+__ldg()precedent. - ADR-0246: CUDA kernel-template lifecycle contract.
- ADR-0214: GPU-parity CI gate (places=4).
cuda/kernel_template.h: comment nearvmaf_cuda_kernel_readback_freedocumenting the host-free caller responsibility.