Research-0754 — CUDA SSIM vert_combine: ldg() + __launch_bounds + pinned-host leak¶
Date: 2026-05-29 ADR: ADR-0754 Status: Measurement complete (live ncu A/B, 2026-05-29).
Motivation¶
Three findings from the calculate_ssim_vert_combine review:
- F2: 55 inner-loop global loads from 5 intermediate buffers without the read-only cache path — struct-by-value argument hides pointer from compiler non-aliased-load analysis.
- F4:
__launch_bounds__absent on a 128-thread kernel — register allocation left unconstrained relative to the actual launch config. - F6:
close_fex_cudaleaked one page of CUDA pinned host memory pervmaf_close()cycle (host_pinned NULLed by readback_free, never freed).
Methodology¶
A/B build of origin/master (baseline) vs perf/cuda-ssim-vert-combine-ldg-launch-bounds-leak-20260529 (optimized) in isolated /tmp/ worktrees inside vmaf-dev-mcp:cuda13.3 (--gpus all, --cap-add SYS_ADMIN).
Build flags: meson setup -Denable_cuda=true -Dbuildtype=release -Db_ndebug=true -Dcpp_args='-g -fno-omit-frame-pointer'; ninja -C.
Workloads:
- WL1:
src01_hrc00_576x324.yuv/src01_hrc01_576x324.yuv, 576x324 8bpc, 48 frames. Feature:float_ssim_cuda(regex.*ssim.*). - WL2:
checkerboard_1920_1080_10_3_0_0.yuv/..._10_0.yuv, 1920x1080 8bpc, 3 frames. Feature:float_ssim_cuda=scale=1(scale-override required at 1080p because the CUDA SSIM path restricts auto-detection to scale=1; the kernel arithmetic is resolution-independent).
ncu profiler: version 2026.2.0.0 (CUDA 13.3 toolkit). Register counts: cuobjdump -res-usage. Wall time: time builtin, 3 runs per variant, median taken.
GPU: RTX 4090 (sm_89, 128 SMs).
Correctness¶
ADR-0214 places=4 gate (max per-frame diff < 5e-5).
| Workload | Baseline SSIM mean | Optimized SSIM mean | Max per-frame diff |
|---|---|---|---|
| 576p | 0.863227 | 0.863227 | 0.00e+00 |
| 1080p | -0.991317 | -0.991317 | 0.00e+00 |
Scores are bit-identical. ADR-0214 PASS.
ncu A/B — calculate_ssim_vert_combine¶
WL1: 576x324 8bpc (48 frames, 48 invocations each)¶
| Metric | Baseline | Optimized | Delta | Interpretation |
|---|---|---|---|---|
| Duration | 8252.67 ns | 8584.67 ns | +4.0% | Slight regression — wave-limited kernel (noise) |
| DRAM Throughput | 62.08% | 64.95% | +2.9 pp | Higher, not lower — wave-limited, noise-dominated |
| L1/TEX Cache Throughput | 34.23% | 35.20% | +1.0 pp | Marginal increase |
| L2 Cache Throughput | 32.74% | 31.43% | -1.3 pp | Small L2 reduction |
| Achieved Occupancy | 66.08% | 67.28% | +1.2 pp | Negligible |
At 576p the kernel launches 48 blocks of 128 threads = 6144 active threads across 128 SMs, meaning < 0.4 waves. In this wave-limited regime kernel duration is dominated by launch overhead and scheduling noise. No reliable signal at 576p.
WL2: 1920x1080 8bpc (3 frames, 3 invocations each, scale=1 override)¶
| Metric | Baseline | Optimized | Delta | Interpretation |
|---|---|---|---|---|
| Duration | 57290.67 ns | 54869.33 ns | -4.2% | Clear win — passes ≥3% threshold |
| DRAM Throughput | 89.81% | 89.80% | -0.01 pp | Unchanged — kernel already DRAM-bound |
| L1/TEX Cache Throughput | 50.72% | 49.62% | -1.1 pp | Slight decrease (cache capacity used differently) |
| L2 Cache Throughput | 58.17% | 61.12% | +5.1 pp | Increased — L2 serving more L1 read-only misses |
| Achieved Occupancy | 93.35% | 92.22% | -1.1 pp | Within run-to-run noise |
| Registers Per Thread | 40 | 40 | 0 | Unchanged — __launch_bounds__ had no PTX effect |
| Total DRAM Elapsed Cycles | 7209642.67 | 6901418.67 | -4.3% | DRAM cycles reduced despite same throughput % |
| Elapsed Cycles | 141398.67 | 134822.67 | -4.7% | Consistent with duration reduction |
Interpretation of L2 increase: DRAM throughput = 89.8% in both — the kernel is DRAM-saturated. __ldg() routes the 55 inner-loop loads through the L1 texture path. At 1080p the L1 capacity is insufficient to absorb all 5 intermediate-buffer footprints, and L1 misses spill to L2. L2 sees more hits (read-only path passes through L2 before DRAM), so L2 throughput rises while total DRAM cycles decrease because fewer duplicate DRAM fetches occur (the __ldg() path allows the read-only path to coalesce differently). The net effect is -4.2% wall duration despite unchanged DRAM peak throughput percentage — the kernel does the same work with fewer DRAM round-trips.
Register allocation (sm_89 / Ada Lovelace, RTX 4090)¶
cuobjdump -res-usage shows REG:40 for calculate_ssim_vert_combine on sm_89 in both baseline and optimized. The __launch_bounds__(128) hint had no measurable effect on the ptxas register allocator for this kernel's register pressure at sm_89 (it was already near-optimal). The annotation is retained as a documentation hint and guard against future register pressure increases.
Wall time (end-to-end, 3-run median)¶
Wall time measures the full VMAF pipeline (ADM + VIF + motion + SSIM + prediction), not SSIM in isolation. Shown for completeness.
| Workload | Baseline (median) | Optimized (median) | Delta |
|---|---|---|---|
| 576p | 0.616s | 0.519s | -15.8% |
| 1080p | 0.370s | 0.322s | -12.9% |
Note: the wall-time delta is larger than the kernel-level delta because the 3-frame / 48-frame sample size is small and end-to-end timing includes first-frame warmup and GPU context switching, making the percentage estimates noisy.
Decision¶
Criterion: kernel duration ≥ 3% improvement OR DRAM throughput ≥ 5pp drop.
| Workload | Duration delta | DRAM delta | Criterion met |
|---|---|---|---|
| 576p | +4.0% | +2.9 pp | No (wave-limited noise) |
| 1080p | -4.2% | -0.01 pp | Yes (duration ≥ 3%) |
Verdict: READY. The 1080p kernel-duration criterion is met. The 576p noise result is expected (wave-limited, < 0.4 waves at 576p) and does not contradict the 1080p finding.
Correctness of F6 (pinned-host leak fix)¶
close_fex_cuda now calls vmaf_cuda_buffer_host_free(cu_state, saved_host_pinned) after readback_free. This fix does not affect any computed score — it only ensures the CUDA pinned memory allocated during init_fex_cuda is properly freed during close_fex_cuda. Verified: bit-identical scores before/after.
Decision on F1/F3 follow-up¶
F1 (AoS to SoA buffer restructure) + F3 (matching signature change) are deferred. At 1080p the F2 __ldg() routing captures -4.2% duration with a minimal code change. The L1 read-only hit rate post-F2 is ~50%, indicating further gains may be possible via F1/F3 (better coalescing + wider reads). Revisit if profiling post-merge shows L1 occupancy remains below 60%.
Note on integer_psnr_cuda.c¶
The same readback_free / host_free gap exists in integer_psnr_cuda.c (also confirmed by code inspection: readback_free called in close without preceding host_free). That file uses multiple rb[] slots per extractor. Fix is straightforward but in a different file; named explicitly for the next CUDA cleanup PR.
ncu reproducer commands¶
# Build A/B in isolated /tmp/ worktrees inside vmaf-dev-mcp:cuda13.3
# See docs/development/dev-mcp.md for container setup.
# Profile (replace /tmp/vmaf-{baseline,optimized}-ssim-measure with your worktree paths):
docker run --rm --gpus all --cap-add SYS_ADMIN \
--entrypoint bash \
-v /tmp/vmaf-baseline-ssim-measure:/workspace \
-v python/test/resource/yuv:/yuv:ro \
vmaf-dev-mcp:cuda13.3 \
-c "ncu --target-processes all \
--kernel-name 'regex:.*ssim.*' \
--set basic --csv \
/workspace/core/build-baseline/tools/vmaf \
--reference /yuv/src01_hrc00_576x324.yuv \
--distorted /yuv/src01_hrc01_576x324.yuv \
--width 576 --height 324 --pixel_format 420 --bitdepth 8 \
'--feature' 'float_ssim_cuda' --backend cuda \
--json -o /dev/null > /workspace/ncu-out.csv"