ADR-0744: CUDA adm_cm `__launch_bounds__(128, 8)` register reduction (ms_ssim_decimate smem tiling reverted)¶

Context¶

PR #77 ncu profiles identified two CUDA kernels as bottlenecks:

ms_ssim_decimate — 81 global/L2 reads per output pixel, no shared-memory reuse, two mirror_idx calls per read (modulo + conditional branches).
adm_cm_line_kernel_8 — 114 registers/thread, ~33% theoretical occupancy on Ampere (RTX 4090, CC 8.9).

Two optimisations were proposed and initially implemented (Research-0744):

Opt A (ms_ssim_decimate smem tiling): cooperative CTA tile load of TILE_H × TILE_W_PAD = 24 × 41 floats (3936 B), amortising mirror_idx cost to once per source element.
Opt B (adm_cm_line_kernel_8 __launch_bounds__(128, 8)): instructs ptxas to target ≤64 regs/thread (65536 / (8 CTAs × 128 threads)), raising theoretical occupancy from 33% to ~67%.

Hardware measurement was conducted in PR #89 using ncu 2026.2.0 on vmaf-dev-mcp:cuda13.3 (RTX 4090, 128 SMs):

Opt A measured: +8 to +24% kernel duration (regression at all tested resolutions — 576p and 1080p). Root cause: the baseline ms_ssim_decimate kernel was already L1-resident (95% hit rate measured by ncu MemoryWorkloadAnalysis). The cooperative-load + __syncthreads() overhead broke the hardware prefetcher and converted L1 hits into L2/DRAM misses.
Opt B measured: −9.3% kernel duration at 1080p, neutral at 576p. Registers confirmed 114→64 per thread.

End-to-end aggregate: +3.9–4.8% fps improvement driven entirely by Opt B.

Revert Opt A (ms_ssim_decimate smem tiling). The L1-resident baseline makes smem staging a net regression. The non-tiled kernel remains at core/src/feature/cuda/integer_ms_ssim/ms_ssim_score.cu.
Keep Opt B (adm_cm_line_kernel_8 __launch_bounds__(128, 8)). Confirmed −9.3% at 1080p; negligible at 576p (occupancy-limited only at larger grids). Change is in core/src/feature/cuda/integer_adm/adm_cm.cu.

Option	Pros	Cons	Why not chosen
Keep both Opt A + Opt B	+Opt B gain retained	Opt A regresses ms_ssim by +8–24%; smem staging adds unnecessary complexity	Measured regression — PR #89
Keep Opt A, drop Opt B	Possible gain if L1 occupancy drops on future hardware	Measured regression now; removes −9.3% adm_cm gain	Dominated by Opt B on current target arch
Drop both	No regression risk	Loses confirmed −9.3% adm_cm gain; end-to-end +3.9–4.8% fps lost	Leaves confirmed occupancy win on table
Keep Opt A with larger BLOCK size to reduce smem overhead	Might lower cooperative-load cost	L1 hit rate at baseline (95%) means problem is not DRAM throughput but prefetcher disruption; block-size change won't fix it	Root cause is prefetcher, not smem size

Positive: +3.9–4.8% fps end-to-end from adm_cm occupancy gain; ms_ssim_decimate performance unchanged (baseline was already L1-efficient).
Negative: Smem tiling design is documented as a regression on current hardware (RTX 4090, CUDA 13.3). Future hardware with lower L1 hit rates may benefit — see Research-0744 and ADR-0750 for re-evaluation criteria.
Neutral: Correctness unaffected. ADR-0214 places=4 parity gate confirmed bit-exact results before and after.

Research-0744 — analysis + implementation details
ADR-0750 — hardware measurement (PR #89)
ADR-0454 — filter1d.cu smem precedent (L1-absent case)
ADR-0214 — GPU/CPU parity gate
req: "Partial revert of PR #79 per the hardware measurement in PR #89 digest: revert ms_ssim_decimate smem tiling (measured -8 to -24%), keep adm_cm launch_bounds (-9.3% at 1080p). Registers 114→64 confirmed."