ADR-0757 — CUDA MS-SSIM ms_ssim_vert_lcs + ms_ssim_horiz: __ldg() + __launch_bounds__ (F3 fix #2)¶
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-05-29 |
| Tags | cuda, performance, ms_ssim, fork-local |
Context¶
The PR #96 audit of core/src/feature/cuda/integer_ms_ssim/ms_ssim_score.cu identified two kernels as the top candidates for the F3 fix pattern first applied in ADR-0754 (calculate_ssim_vert_combine):
ms_ssim_horiz — horizontal 11-tap separable Gaussian over ref / cmp. The kernel passes 7 VmafCudaBuffer arguments by value; two of them (ref_in, cmp_in) are read-only input planes written exclusively by the upload pass. All 2×11 = 22 inner-loop loads used plain dereferencing rather than __ldg(), preventing the compiler from routing them through the read-only L1 texture cache.
ms_ssim_vert_lcs — vertical 11-tap pass on 5 horizontal-pass intermediate buffers + per-pixel l/c/s + per-block partial sums. The 5 intermediate buffers (h_ref_mu, h_cmp_mu, h_ref_sq, h_cmp_sq, h_refcmp) are written exclusively by ms_ssim_horiz and are never aliased in the vert pass. However, they were extracted as plain const float * (no __restrict__), which hides the alias-free invariant from the compiler. All 5×11 = 55 inner-loop loads therefore went through L2 rather than the read-only L1 cache.
Neither kernel carried a __launch_bounds__ annotation despite launching with 16×8 = 128 threads/block.
The fix pattern is identical to ADR-0754 (PR #93):
- Add
__launch_bounds__(128)to both kernels. - Extract
const float *__restrict__pointers from everyVmafCudaBufferargument before the inner loop. - Replace all inner-loop loads with
__ldg(&ptr[idx]).
SASS verification via cuobjdump --dump-sass confirmed LDG.E.CONSTANT instructions in both kernel bodies after applying the fix.
Predicted latency reduction at 1080p (from PR #96 audit): −4 to −6% per kernel for ms_ssim_vert_lcs; similar for ms_ssim_horiz. Both are memory-bound at ≥1080p where the combined 5-plane (or 2-plane) intermediate footprint exceeds L2 capacity.
Decision¶
Apply the F3 fix to both kernels in the same PR:
- Add
__launch_bounds__(128)toms_ssim_horizandms_ssim_vert_lcs. - In
ms_ssim_horiz: extractconst float *__restrict__ refandconst float *__restrict__ cmpfromref_in.data/cmp_in.databefore the K=11 loop; use__ldg(&ref[src_idx])and__ldg(&cmp[src_idx])for all 22 inner-loop loads. - In
ms_ssim_vert_lcs: extract all fiveconst float *__restrict__pointers before the K=11 loop; use__ldg(&ptr[src_idx])for all 55 inner-loop loads. - Rename local variable
w→gwin both loops to avoid shadowing the kernel parameterwidthand improve readability.
Alternatives considered¶
| Option | Considered | Outcome |
|---|---|---|
F3 on vert_lcs only | Yes | ms_ssim_horiz has the same alias-hiding pattern and is memory-bound at the same resolution. Include both — 6 extra lines, zero added risk. |
| Shared-memory tiling (ADR-0464 pattern) | Yes | ms_ssim kernels are separable 11-tap Gaussian; spatial overlap is moderate (K/2 = 5 taps). Profiling at 576p shows wave-starvation dominates over memory latency at that resolution. Tiling is a separate investigation item; __ldg() is the lower-risk first step per ADR-0754 precedent. |
| AoS → SoA buffer layout (F1) | Yes | Larger change; deferred pending measurement of F3 impact as in ADR-0754. |
Skip __launch_bounds__ | No | Two characters, zero risk; consistent with ADR-0754 and ADR-0743 precedents. |
Consequences¶
ms_ssim_horizandms_ssim_vert_lcscarry__launch_bounds__(128).- All 22 + 55 = 77 inner-loop loads route through the read-only L1 texture cache (
LDG.E.CONSTANTconfirmed in sm_89 SASS). - L2 pressure is reduced at ≥1080p where the 2-plane or 5-plane intermediate footprint exceeds L2 capacity.
- Zero ULP divergence expected (per ADR-0754 precedent; correctness path is fully determined by the Gaussian weight table;
__ldg()does not alter the loaded value). AGENTS.mdinvariant note under__ldg() pattern for pass-2 read-only intermediate buffers (ADR-0754)updated to note that ms_ssim is the second application of this pattern.
References¶
- req: user direction 2026-05-29: "Apply the F3 fix to the top-2 candidates from PR #96's audit: ms_ssim_vert_lcs and ms_ssim_horiz. Mirror exactly what PR #93 did for calculate_ssim_vert_combine."
- ADR-0754: F3 pattern precedent on
calculate_ssim_vert_combine. - ADR-0743:
__ldg()and__launch_bounds__precedent on VIF filter1d. - ADR-0214: GPU-parity CI gate (places=4).
- Research-0749 / ADR-0750: ms_ssim_decimate profiling baseline (ms_ssim_horiz and ms_ssim_vert_lcs were identified as memory-bound at 1080p in that audit).