ADR-0743: CUDA VIF filter1d ncu-driven performance optimizations¶
- Status: Accepted
- Date: 2026-05-28
- Deciders: lusoris
- Tags:
cuda,performance,vif
Context¶
PR #74 profiling identified filter1d_8_horizontal_kernel_2_17_9 as consuming 35.3% of VIF wall time on RTX 4090 / CUDA 13.3 with 48.6% achieved occupancy versus a 75% theoretical ceiling (56 registers per thread, 9 blocks/SM).
Three optimizations were proposed:
val_per_thread2 → 4 (halve grid count, reduce wave fragmentation)- Split accumulator live range to reduce register count below 48
__ldg()/const __restrict__on the 7 read-only tmp channel loads
Baseline ncu measurements (576×324 YUV, sm_89):
- Registers Per Thread: 56
- Theoretical Occupancy: 75%
- Achieved Occupancy: 42.91%
- Kernel Duration: 17.6 µs
Decision¶
Apply __launch_bounds__(128, 10) to the FILTER1D_8_HORI macro and add __ldg() on the 7 global tmp-channel loads in the smem-fill phase. Reject val_per_thread=4 (see Alternatives). The __launch_bounds__ approach replaces the "serialise accumulators into two sub-loops" proposal because it achieves the same register reduction without algorithmic restructuring or added __syncthreads() overhead.
Post-optimization ncu measurements (same workload):
- Registers Per Thread: 48 (down from 56)
- Theoretical Occupancy: 83.33% (up from 75%)
- Achieved Occupancy: 41.35%
- Kernel Duration: 18.4 µs (within noise at 576×324 — wave-limited)
Correctness verified: CUDA-optimized vs CPU reference delta ≤ 0.000010 (max per-frame absolute), within the ADR-0214 places=4 gate (≤ 0.0001).
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
val_per_thread=4 (opt #1) | −33% grid count at 576p | smem 7644→14812 B/block; sm_89 becomes smem-limited at 37.5% occ. vs 62.5% for vpt=2 | Rejected: net occupancy regression |
| Serialise accumulators into two sub-loops (original opt #2 proposal) | Reduces live register range algorithmically | Adds an extra __syncthreads(), extra smem writes for mu1/mu2, doubles inner loop body | Replaced by __launch_bounds__ which achieves the same register budget without algorithmic changes |
__launch_bounds__(128, 10) only, no __ldg | Zero overhead on warm L1 | No L2 pressure relief at ≥1080p | Kept both; __ldg is zero-overhead at register level and beneficial at high-res |
__launch_bounds__(128, 8) (sm_75-safe) | No ptxas advisory | Max 64 regs budget on sm_89 — too loose to force 48 | sm_89 (RTX 4090) is primary target; sm_75 gets no regression, just no benefit |
Consequences¶
- Positive: Theoretical occupancy 75% → 83.3% on sm_89 (RTX 4090). Register count 56 → 48. At production resolutions (≥1080p), more CTAs reside concurrently per SM, improving warp latency hiding.
- Negative: ptxas emits a non-fatal advisory "minnctapersm out of range, ignored" for sm_75/sm_80/sm_86 targets in the multi-arch fatbin. These targets see no register change and no performance regression.
- Neutral: The
__ldgchange is zero-impact at 576×324 (wave-limited) but provides cache-routing benefit at ≥1080p. Both changes are semantically neutral — same integer arithmetic.
References¶
- PR #74 ncu profile report (research digest:
docs/research/research-0743-cuda-vif-filter1d-perf-impl.md) - ADR-0454: stencil/convolution smem staging rule
- ADR-0214: GPU-parity CI gate (places=4 tolerance)
- req: "Implement the 3 ncu-measured optimizations on filter1d.cu from PR #74's profile"