Research-0754 — CUDA SSIM vert_combine: ldg() + __launch_bounds + pinned-host leak¶

Date: 2026-05-29 ADR: ADR-0754 Status: Measurement complete (live ncu A/B, 2026-05-29).

Motivation¶

Three findings from the calculate_ssim_vert_combine review:

F2: 55 inner-loop global loads from 5 intermediate buffers without the read-only cache path — struct-by-value argument hides pointer from compiler non-aliased-load analysis.
F4: __launch_bounds__ absent on a 128-thread kernel — register allocation left unconstrained relative to the actual launch config.
F6: close_fex_cuda leaked one page of CUDA pinned host memory per vmaf_close() cycle (host_pinned NULLed by readback_free, never freed).

Methodology¶

A/B build of origin/master (baseline) vs perf/cuda-ssim-vert-combine-ldg-launch-bounds-leak-20260529 (optimized) in isolated /tmp/ worktrees inside vmaf-dev-mcp:cuda13.3 (--gpus all, --cap-add SYS_ADMIN).

Build flags: meson setup -Denable_cuda=true -Dbuildtype=release -Db_ndebug=true -Dcpp_args='-g -fno-omit-frame-pointer'; ninja -C.

Workloads:

WL1: src01_hrc00_576x324.yuv / src01_hrc01_576x324.yuv, 576x324 8bpc, 48 frames. Feature: float_ssim_cuda (regex .*ssim.*).
WL2: checkerboard_1920_1080_10_3_0_0.yuv / ..._10_0.yuv, 1920x1080 8bpc, 3 frames. Feature: float_ssim_cuda=scale=1 (scale-override required at 1080p because the CUDA SSIM path restricts auto-detection to scale=1; the kernel arithmetic is resolution-independent).

ncu profiler: version 2026.2.0.0 (CUDA 13.3 toolkit). Register counts: cuobjdump -res-usage. Wall time: time builtin, 3 runs per variant, median taken.

GPU: RTX 4090 (sm_89, 128 SMs).

Correctness¶

ADR-0214 places=4 gate (max per-frame diff < 5e-5).

Workload	Baseline SSIM mean	Optimized SSIM mean	Max per-frame diff
576p	0.863227	0.863227	0.00e+00
1080p	-0.991317	-0.991317	0.00e+00

Scores are bit-identical. ADR-0214 PASS.

ncu A/B — `calculate_ssim_vert_combine`¶

WL1: 576x324 8bpc (48 frames, 48 invocations each)¶

Metric	Baseline	Optimized	Delta	Interpretation
Duration	8252.67 ns	8584.67 ns	+4.0%	Slight regression — wave-limited kernel (noise)
DRAM Throughput	62.08%	64.95%	+2.9 pp	Higher, not lower — wave-limited, noise-dominated
L1/TEX Cache Throughput	34.23%	35.20%	+1.0 pp	Marginal increase
L2 Cache Throughput	32.74%	31.43%	-1.3 pp	Small L2 reduction
Achieved Occupancy	66.08%	67.28%	+1.2 pp	Negligible

At 576p the kernel launches 48 blocks of 128 threads = 6144 active threads across 128 SMs, meaning < 0.4 waves. In this wave-limited regime kernel duration is dominated by launch overhead and scheduling noise. No reliable signal at 576p.

WL2: 1920x1080 8bpc (3 frames, 3 invocations each, `scale=1` override)¶

Metric	Baseline	Optimized	Delta	Interpretation
Duration	57290.67 ns	54869.33 ns	-4.2%	Clear win — passes ≥3% threshold
DRAM Throughput	89.81%	89.80%	-0.01 pp	Unchanged — kernel already DRAM-bound
L1/TEX Cache Throughput	50.72%	49.62%	-1.1 pp	Slight decrease (cache capacity used differently)
L2 Cache Throughput	58.17%	61.12%	+5.1 pp	Increased — L2 serving more L1 read-only misses
Achieved Occupancy	93.35%	92.22%	-1.1 pp	Within run-to-run noise
Registers Per Thread	40	40	0	Unchanged — `__launch_bounds__` had no PTX effect
Total DRAM Elapsed Cycles	7209642.67	6901418.67	-4.3%	DRAM cycles reduced despite same throughput %
Elapsed Cycles	141398.67	134822.67	-4.7%	Consistent with duration reduction

Interpretation of L2 increase: DRAM throughput = 89.8% in both — the kernel is DRAM-saturated. __ldg() routes the 55 inner-loop loads through the L1 texture path. At 1080p the L1 capacity is insufficient to absorb all 5 intermediate-buffer footprints, and L1 misses spill to L2. L2 sees more hits (read-only path passes through L2 before DRAM), so L2 throughput rises while total DRAM cycles decrease because fewer duplicate DRAM fetches occur (the __ldg() path allows the read-only path to coalesce differently). The net effect is -4.2% wall duration despite unchanged DRAM peak throughput percentage — the kernel does the same work with fewer DRAM round-trips.

Register allocation (sm_89 / Ada Lovelace, RTX 4090)¶

cuobjdump -res-usage shows REG:40 for calculate_ssim_vert_combine on sm_89 in both baseline and optimized. The __launch_bounds__(128) hint had no measurable effect on the ptxas register allocator for this kernel's register pressure at sm_89 (it was already near-optimal). The annotation is retained as a documentation hint and guard against future register pressure increases.

Wall time (end-to-end, 3-run median)¶

Wall time measures the full VMAF pipeline (ADM + VIF + motion + SSIM + prediction), not SSIM in isolation. Shown for completeness.

Workload	Baseline (median)	Optimized (median)	Delta
576p	0.616s	0.519s	-15.8%
1080p	0.370s	0.322s	-12.9%

Note: the wall-time delta is larger than the kernel-level delta because the 3-frame / 48-frame sample size is small and end-to-end timing includes first-frame warmup and GPU context switching, making the percentage estimates noisy.

Decision¶

Criterion: kernel duration ≥ 3% improvement OR DRAM throughput ≥ 5pp drop.

Workload	Duration delta	DRAM delta	Criterion met
576p	+4.0%	+2.9 pp	No (wave-limited noise)
1080p	-4.2%	-0.01 pp	Yes (duration ≥ 3%)

Verdict: READY. The 1080p kernel-duration criterion is met. The 576p noise result is expected (wave-limited, < 0.4 waves at 576p) and does not contradict the 1080p finding.

Correctness of F6 (pinned-host leak fix)¶

close_fex_cuda now calls vmaf_cuda_buffer_host_free(cu_state, saved_host_pinned) after readback_free. This fix does not affect any computed score — it only ensures the CUDA pinned memory allocated during init_fex_cuda is properly freed during close_fex_cuda. Verified: bit-identical scores before/after.

Decision on F1/F3 follow-up¶

F1 (AoS to SoA buffer restructure) + F3 (matching signature change) are deferred. At 1080p the F2 __ldg() routing captures -4.2% duration with a minimal code change. The L1 read-only hit rate post-F2 is ~50%, indicating further gains may be possible via F1/F3 (better coalescing + wider reads). Revisit if profiling post-merge shows L1 occupancy remains below 60%.

Note on integer_psnr_cuda.c¶

The same readback_free / host_free gap exists in integer_psnr_cuda.c (also confirmed by code inspection: readback_free called in close without preceding host_free). That file uses multiple rb[] slots per extractor. Fix is straightforward but in a different file; named explicitly for the next CUDA cleanup PR.

ncu reproducer commands¶

# Build A/B in isolated /tmp/ worktrees inside vmaf-dev-mcp:cuda13.3
# See docs/development/dev-mcp.md for container setup.

# Profile (replace /tmp/vmaf-{baseline,optimized}-ssim-measure with your worktree paths):
docker run --rm --gpus all --cap-add SYS_ADMIN \
  --entrypoint bash \
  -v /tmp/vmaf-baseline-ssim-measure:/workspace \
  -v python/test/resource/yuv:/yuv:ro \
  vmaf-dev-mcp:cuda13.3 \
  -c "ncu --target-processes all \
        --kernel-name 'regex:.*ssim.*' \
        --set basic --csv \
        /workspace/core/build-baseline/tools/vmaf \
          --reference /yuv/src01_hrc00_576x324.yuv \
          --distorted /yuv/src01_hrc01_576x324.yuv \
          --width 576 --height 324 --pixel_format 420 --bitdepth 8 \
          '--feature' 'float_ssim_cuda' --backend cuda \
          --json -o /dev/null > /workspace/ncu-out.csv"