Skip to content

ADR-0756: CUDA F3 struct-by-value kernel audit (scope + dispatch order)

  • Status: Accepted
  • Date: 2026-05-29
  • Deciders: lusoris
  • Tags: cuda, perf, research

Context

PR #93 identified "F3" — a pattern where a __global__ kernel accepts a VmafCudaBuffer (or VmafPicture, AdmBufferCuda) by value. Because VmafCudaBuffer.data is a CUdeviceptr (opaque integer) nested inside a struct copy, ptxas cannot infer that the underlying pointer is non-aliased and cannot emit ld.global.nc (the read-only L1 texture path). PR #93 applied the in-kernel __ldg() extraction fix to calculate_ssim_vert_combine and measured -4.2% kernel duration at 1080p (Research-0754).

The scope of remaining instances across the CUDA kernel suite was not enumerated before PR #93 merged. This ADR records the fork-wide audit result and the chosen dispatch order for follow-on PRs.

Decision

We will address the F3 pattern using the in-kernel __ldg() extraction strategy (extract raw const T * from each struct arg before the hot inner loop; read via __ldg()). We will apply this in severity order per Research-0756: ms_ssim_vert_lcs first (PR-1), then ms_ssim_horiz, ciede_8bpc/16bpc, adm_decouple. Kernels whose inner loop is already smem-resident after the tile-load phase (motion, adm_cm) are explicitly out of scope for F3 treatment — the DRAM-bound portion is the tile load, not a repeated inner-loop global read.

Alternatives considered

Option Pros Cons Why not chosen
Host-side extraction (pass raw CUdeviceptr to cuLaunchKernel) Full compiler visibility; enables ptxas to use ld.global.nc without __ldg() Requires modifying every call site in _cuda.c dispatch files; higher risk, larger diff Deferred; revisit if post-F3 ncu shows remaining DRAM headroom
AoS → SoA buffer restructure (F1) Best long-term coalescing Structural change to VmafCudaBuffer; API impact Deferred per Research-0754 decision section
Skip F3 for all kernels that are launch-starved at 576p Correct that at 576p F3 rarely matters Ignores 1080p+ production workloads Not chosen; fork targets 1080p/4K production

Consequences

  • Positive: Each dispatched PR carries at most 30–40 LOC diff, is bit-identical (zero score drift), and is independently deployable. The in-kernel pattern was proven by PR #93 with a live ncu A/B.
  • Negative: Multiple separate PRs rather than one large refactor. Call-site code in _cuda.c files still passes structs; the compiler sees the __ldg() hint but not full __restrict__ aliasing. Full benefit requires the eventual host-side extraction pass.

References

  • req: "Per PR #93: this affects every kernel that takes its inputs as VmafCudaBuffer struct copies" (per user direction, 2026-05-29)
  • Research-0754 (calculate_ssim_vert_combine ncu A/B)
  • Research-0756 (this audit)
  • Research-0736 (SSIM ncu hotpath)
  • Research-0737 (MS-SSIM ncu hotpath)
  • Research-0734 (ADM ncu hotpath)
  • ADR-0108 (deep-dive deliverables)