ADR-0756: CUDA F3 struct-by-value kernel audit (scope + dispatch order)¶
- Status: Accepted
- Date: 2026-05-29
- Deciders: lusoris
- Tags:
cuda,perf,research
Context¶
PR #93 identified "F3" — a pattern where a __global__ kernel accepts a VmafCudaBuffer (or VmafPicture, AdmBufferCuda) by value. Because VmafCudaBuffer.data is a CUdeviceptr (opaque integer) nested inside a struct copy, ptxas cannot infer that the underlying pointer is non-aliased and cannot emit ld.global.nc (the read-only L1 texture path). PR #93 applied the in-kernel __ldg() extraction fix to calculate_ssim_vert_combine and measured -4.2% kernel duration at 1080p (Research-0754).
The scope of remaining instances across the CUDA kernel suite was not enumerated before PR #93 merged. This ADR records the fork-wide audit result and the chosen dispatch order for follow-on PRs.
Decision¶
We will address the F3 pattern using the in-kernel __ldg() extraction strategy (extract raw const T * from each struct arg before the hot inner loop; read via __ldg()). We will apply this in severity order per Research-0756: ms_ssim_vert_lcs first (PR-1), then ms_ssim_horiz, ciede_8bpc/16bpc, adm_decouple. Kernels whose inner loop is already smem-resident after the tile-load phase (motion, adm_cm) are explicitly out of scope for F3 treatment — the DRAM-bound portion is the tile load, not a repeated inner-loop global read.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Host-side extraction (pass raw CUdeviceptr to cuLaunchKernel) | Full compiler visibility; enables ptxas to use ld.global.nc without __ldg() | Requires modifying every call site in _cuda.c dispatch files; higher risk, larger diff | Deferred; revisit if post-F3 ncu shows remaining DRAM headroom |
| AoS → SoA buffer restructure (F1) | Best long-term coalescing | Structural change to VmafCudaBuffer; API impact | Deferred per Research-0754 decision section |
| Skip F3 for all kernels that are launch-starved at 576p | Correct that at 576p F3 rarely matters | Ignores 1080p+ production workloads | Not chosen; fork targets 1080p/4K production |
Consequences¶
- Positive: Each dispatched PR carries at most 30–40 LOC diff, is bit-identical (zero score drift), and is independently deployable. The in-kernel pattern was proven by PR #93 with a live ncu A/B.
- Negative: Multiple separate PRs rather than one large refactor. Call-site code in
_cuda.cfiles still passes structs; the compiler sees the__ldg()hint but not full__restrict__aliasing. Full benefit requires the eventual host-side extraction pass.
References¶
- req: "Per PR #93: this affects every kernel that takes its inputs as VmafCudaBuffer struct copies" (per user direction, 2026-05-29)
- Research-0754 (
calculate_ssim_vert_combinencu A/B) - Research-0756 (this audit)
- Research-0736 (SSIM ncu hotpath)
- Research-0737 (MS-SSIM ncu hotpath)
- Research-0734 (ADM ncu hotpath)
- ADR-0108 (deep-dive deliverables)