ADR-0750: Hardware Measurement Verdict for PR perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven¶

Status: Accepted
Date: 2026-05-29
Deciders: lusoris
Tags: cuda, performance, ms_ssim, adm_cm, measurement

Context¶

PR perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven-20260528 introduced two CUDA optimizations authored with ncu estimates only (no hardware measurement due to GPU contention at authoring time):

ms_ssim_decimate smem tiling — estimated +68–93% kernel speedup
adm_cm_line_kernel_8 __launch_bounds__(128, 8) — estimated +66.7% kernel speedup

Research-0749 provides the hardware measurements on RTX 4090 (CC 8.9), ncu 2026.2.0, CUDA 13.3.

Decision¶

Accept the adm_cm_line_kernel_8 __launch_bounds__ change; revert the ms_ssim_decimate smem tiling.

The adm_cm change reduces registers/thread from 114 to 64 (confirmed by ncu) and delivers a measured -9.3% kernel duration improvement at 1080p. At 576x324 the wave count bottleneck dominates regardless, and the change is noise-level neutral.

The ms_ssim_decimate smem tiling is a measured regression: kernel duration increases +24% (WL1) and +8% (WL2). The baseline already achieves 95% L1 hit rate via hardware prefetch — the tiling assumption (that DRAM bandwidth was the bottleneck) was invalid for Ada Lovelace at both profiled resolutions. The cooperative-load pattern breaks the hardware prefetcher and introduces a __syncthreads barrier with no net benefit.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
Keep both changes	Positive end-to-end (+3.9–4.8%) driven by adm_cm	ms_ssim kernel regresses 8–24% hidden by adm_cm gain	Regressed kernel should not land
Revert both changes	Clean revert	Loses confirmed adm_cm improvement	adm_cm improvement is real and validated
Keep smem tiling, tune	Occupancy improves 3–8%	L1 hit rate drops 95%→24%; requires a fundamentally different load pattern	Not worth re-engineering in this PR

Consequences¶

Positive: adm_cm register pressure reduced 43.9% (114→64 regs); measured -9.3% kernel duration at 1080p; correctness passes; no change to public API.
Negative: ms_ssim_decimate smem tiling must be reverted; the estimated +68–93% improvement was an analysis error.
Neutral / follow-ups: A correct optimization for ms_ssim_decimate at 1080p should explore occupancy via block-size tuning or horiz/vert kernel fusion (baseline occupancy is 79% at 1080p — the kernel is already throughput-limited, not memory-bound).

References¶

Research-0749
Source: user direction per task spec dated 2026-05-28