ADR-0744: CUDA adm_cm __launch_bounds__(128, 8) register reduction (ms_ssim_decimate smem tiling reverted)¶
- Status: Accepted
- Date: 2026-05-28
- Superseded by / Follow-up: ADR-0750 (measurement)
- Deciders: lusoris
- Tags: cuda, performance, adm, ms_ssim, occupancy
Context¶
PR #77 ncu profiles identified two CUDA kernels as bottlenecks:
ms_ssim_decimate— 81 global/L2 reads per output pixel, no shared-memory reuse, twomirror_idxcalls per read (modulo + conditional branches).adm_cm_line_kernel_8— 114 registers/thread, ~33% theoretical occupancy on Ampere (RTX 4090, CC 8.9).
Two optimisations were proposed and initially implemented (Research-0744):
- Opt A (
ms_ssim_decimatesmem tiling): cooperative CTA tile load ofTILE_H × TILE_W_PAD = 24 × 41floats (3936 B), amortisingmirror_idxcost to once per source element. - Opt B (
adm_cm_line_kernel_8 __launch_bounds__(128, 8)): instructs ptxas to target ≤64 regs/thread (65536 / (8 CTAs × 128 threads)), raising theoretical occupancy from 33% to ~67%.
Hardware measurement was conducted in PR #89 using ncu 2026.2.0 on vmaf-dev-mcp:cuda13.3 (RTX 4090, 128 SMs):
- Opt A measured: +8 to +24% kernel duration (regression at all tested resolutions — 576p and 1080p). Root cause: the baseline
ms_ssim_decimatekernel was already L1-resident (95% hit rate measured by ncu MemoryWorkloadAnalysis). The cooperative-load +__syncthreads()overhead broke the hardware prefetcher and converted L1 hits into L2/DRAM misses. - Opt B measured: −9.3% kernel duration at 1080p, neutral at 576p. Registers confirmed 114→64 per thread.
End-to-end aggregate: +3.9–4.8% fps improvement driven entirely by Opt B.
Decision¶
- Revert Opt A (
ms_ssim_decimatesmem tiling). The L1-resident baseline makes smem staging a net regression. The non-tiled kernel remains atcore/src/feature/cuda/integer_ms_ssim/ms_ssim_score.cu. - Keep Opt B (
adm_cm_line_kernel_8 __launch_bounds__(128, 8)). Confirmed −9.3% at 1080p; negligible at 576p (occupancy-limited only at larger grids). Change is incore/src/feature/cuda/integer_adm/adm_cm.cu.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep both Opt A + Opt B | +Opt B gain retained | Opt A regresses ms_ssim by +8–24%; smem staging adds unnecessary complexity | Measured regression — PR #89 |
| Keep Opt A, drop Opt B | Possible gain if L1 occupancy drops on future hardware | Measured regression now; removes −9.3% adm_cm gain | Dominated by Opt B on current target arch |
| Drop both | No regression risk | Loses confirmed −9.3% adm_cm gain; end-to-end +3.9–4.8% fps lost | Leaves confirmed occupancy win on table |
| Keep Opt A with larger BLOCK size to reduce smem overhead | Might lower cooperative-load cost | L1 hit rate at baseline (95%) means problem is not DRAM throughput but prefetcher disruption; block-size change won't fix it | Root cause is prefetcher, not smem size |
Consequences¶
- Positive: +3.9–4.8% fps end-to-end from adm_cm occupancy gain; ms_ssim_decimate performance unchanged (baseline was already L1-efficient).
- Negative: Smem tiling design is documented as a regression on current hardware (RTX 4090, CUDA 13.3). Future hardware with lower L1 hit rates may benefit — see Research-0744 and ADR-0750 for re-evaluation criteria.
- Neutral: Correctness unaffected. ADR-0214 places=4 parity gate confirmed bit-exact results before and after.
References¶
- Research-0744 — analysis + implementation details
- ADR-0750 — hardware measurement (PR #89)
- ADR-0454 — filter1d.cu smem precedent (L1-absent case)
- ADR-0214 — GPU/CPU parity gate
- req: "Partial revert of PR #79 per the hardware measurement in PR #89 digest: revert ms_ssim_decimate smem tiling (measured -8 to -24%), keep adm_cm launch_bounds (-9.3% at 1080p). Registers 114→64 confirmed."