ADR-0750: Hardware Measurement Verdict for PR perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven¶
- Status: Accepted
- Date: 2026-05-29
- Deciders: lusoris
- Tags: cuda, performance, ms_ssim, adm_cm, measurement
Context¶
PR perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven-20260528 introduced two CUDA optimizations authored with ncu estimates only (no hardware measurement due to GPU contention at authoring time):
ms_ssim_decimatesmem tiling — estimated +68–93% kernel speedupadm_cm_line_kernel_8__launch_bounds__(128, 8)— estimated +66.7% kernel speedup
Research-0749 provides the hardware measurements on RTX 4090 (CC 8.9), ncu 2026.2.0, CUDA 13.3.
Decision¶
Accept the adm_cm_line_kernel_8 __launch_bounds__ change; revert the ms_ssim_decimate smem tiling.
The adm_cm change reduces registers/thread from 114 to 64 (confirmed by ncu) and delivers a measured -9.3% kernel duration improvement at 1080p. At 576x324 the wave count bottleneck dominates regardless, and the change is noise-level neutral.
The ms_ssim_decimate smem tiling is a measured regression: kernel duration increases +24% (WL1) and +8% (WL2). The baseline already achieves 95% L1 hit rate via hardware prefetch — the tiling assumption (that DRAM bandwidth was the bottleneck) was invalid for Ada Lovelace at both profiled resolutions. The cooperative-load pattern breaks the hardware prefetcher and introduces a __syncthreads barrier with no net benefit.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep both changes | Positive end-to-end (+3.9–4.8%) driven by adm_cm | ms_ssim kernel regresses 8–24% hidden by adm_cm gain | Regressed kernel should not land |
| Revert both changes | Clean revert | Loses confirmed adm_cm improvement | adm_cm improvement is real and validated |
| Keep smem tiling, tune | Occupancy improves 3–8% | L1 hit rate drops 95%→24%; requires a fundamentally different load pattern | Not worth re-engineering in this PR |
Consequences¶
- Positive: adm_cm register pressure reduced 43.9% (114→64 regs); measured -9.3% kernel duration at 1080p; correctness passes; no change to public API.
- Negative: ms_ssim_decimate smem tiling must be reverted; the estimated +68–93% improvement was an analysis error.
- Neutral / follow-ups: A correct optimization for ms_ssim_decimate at 1080p should explore occupancy via block-size tuning or horiz/vert kernel fusion (baseline occupancy is 79% at 1080p — the kernel is already throughput-limited, not memory-bound).
References¶
- Research-0749
- Source: user direction per task spec dated 2026-05-28