Research-0749: Hardware Measurement of PR perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven-20260528¶
Date: 2026-05-29 Branch: perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven-20260528 (commit 06a6a00215) Baseline: container image vmaf-dev-mcp:cuda13.3 built from master tip b04b07fd70 Device: RTX 4090 (CC 8.9, 128 SMs), CUDA 13.3 (driver 610.43.02), ncu 2026.2.0 Method: isolated worktrees; optimized libvmaf.so via LD_PRELOAD; privileged container for ncu PMU access
Purpose¶
The PR introducing smem tiling for ms_ssim_decimate and __launch_bounds__(128, 8) for adm_cm_line_kernel_8 was written with ncu estimates only — no hardware measurement was possible at authoring time due to GPU contention. This digest provides the actual measured numbers.
Correctness¶
Both optimizations are bit-exact against master at ADR-0214 places=4 (max_diff=0.0 on all frames, both workloads). Correctness: PASS.
End-to-End Throughput (3-run median, --feature float_ms_ssim_cuda, --backend cuda)¶
| Workload | Baseline | Optimized | Delta |
|---|---|---|---|
| WL1: 576x324, 48f | 2469.1 fps | 2586.8 fps | +4.8% |
| WL2: 1080p, 3f | 482.2 fps | 501.2 fps | +3.9% |
Both workloads show a positive end-to-end delta. Note that ms_ssim_decimate and adm_cm_line_kernel_8 are two of many kernels contributing to total throughput; the aggregate improvement is diluted by other kernel time.
Per-Kernel: ms_ssim_decimate¶
The smem tiling optimization exhibits an unexpected L1-hit-rate inversion: the baseline already achieves 95% L1 hit rate (the small working set fits in L1/L2 cache naturally), so adding smem overhead increases kernel duration at both resolutions.
WL1 (576x324) — largest decimation launch, grid (18,21,1):¶
| Metric | Baseline | Optimized | Delta |
|---|---|---|---|
| Duration | 5120 ns | 6368 ns | +24.4% SLOWER |
| DRAM throughput | 19.24% | 13.76% | -28% |
| L1 hit rate | 95.46% | 24.51% | severely degraded |
| Achieved occupancy | 20.03% | 23.09% | +3% |
| Registers/thread | 39 | 40 | +1 |
| Launches/frame | 8 per channel per scale | same | — |
WL2 (1080p) — largest decimation launch, grid (60,68,1):¶
| Metric | Baseline | Optimized | Delta |
|---|---|---|---|
| Duration | 18656 ns | 20224 ns | +8.4% SLOWER |
| DRAM throughput | 46.08% | 42.67% | -7.4% |
| L1 hit rate | 95.57% | 23.93% | severely degraded |
| Achieved occupancy | 79.86% | 86.18% | +8% |
| Registers/thread | 39 | 40 | +1 |
Root cause: The smem cooperative load pattern breaks the hardware prefetcher's ability to reuse lines in L1. In the baseline kernel, each thread accesses its 9×9 neighbourhood sequentially and the hardware cache delivers 95%+ hit rates. After tiling, the load phase accesses the smem padded array (TILE_W_PAD=41) in a strided pattern that maps poorly to 128-byte cache sectors on Ada Lovelace, causing a large L1 miss penalty that outweighs the eliminated global-memory reads in the compute phase. The 1-register increase (39→40) also likely pushes a compiler decision boundary.
Despite kernel-level regression, the end-to-end throughput is positive because the adm_cm optimization dominates the ADM path which is in the critical pipeline.
Per-Kernel: adm_cm_line_kernel_8¶
The __launch_bounds__(128, 8) hint achieves its primary goal: ptxas reduces register file from 114 to 64 registers per thread.
WL1 (576x324) — grid (8,5,3) = 120 CTAs:¶
| Metric | Baseline | Optimized | Delta |
|---|---|---|---|
| Duration | ~14.2 µs | ~14.9 µs | +5% (noise level) |
| Registers/thread | 114 | 64 | -43.9% |
| Shared memory | 0 B | 0 B | — |
| Achieved occupancy | 7.37% | 7.37% | 0% |
At 576x324 the wave count (120 CTAs / 128 SMs = <1 wave) limits occupancy regardless of register pressure; the register reduction does not help.
WL2 (1080p) — grid (25,14,3) = 1050 CTAs:¶
| Metric | Baseline | Optimized | Delta |
|---|---|---|---|
| Duration | ~68.4 µs | ~62.0 µs | -9.3% FASTER |
| Registers/thread | 114 | 64 | -43.9% |
| Achieved occupancy | 32.1% | see note | — |
At 1080p, with 1050 CTAs across 128 SMs, the register reduction measurably improves scheduling. The -9.3% duration improvement is consistent with the theoretical occupancy increase from 33% to ~50% (65536 / (12864) = 8 CTAs/SM vs 65536 / (128114) ≈ 4.5). This is the workload the PR targeted.
Comparison with ADR-0744 Estimates¶
| Kernel | ADR-0744 estimate | Measured (WL2) |
|---|---|---|
| ms_ssim_decimate speedup | +68–93% | -8.4% (regression) |
| adm_cm speedup | +66.7% | -9.3% (correct direction, ~7× lower magnitude) |
The ms_ssim_decimate estimate was entirely wrong: the original analysis assumed the baseline was DRAM-bound, but it was already L1-bound at both resolutions. The adm_cm estimate was directionally correct at 1080p but overestimated by ~7× because the theoretical occupancy increase only partially converts to throughput given the compute-bound nature of the kernel at 1080p wave counts.
Verdict¶
- adm_cm_line_kernel_8
__launch_bounds__: MEASURED POSITIVE at 1080p (-9.3% kernel duration). Correctness passes. Regression at 576x324 is within noise. Keep. - ms_ssim_decimate smem tiling: MEASURED REGRESSION at both resolutions (kernel +8–24% slower). The L1 hit rate drops from 95% to 24% — the smem tiling assumption was invalid for this kernel on Ada Lovelace. End-to-end improvement (+3.9–4.8%) is driven entirely by the adm_cm change. Recommend reverting the ms_ssim_decimate tiling.
Recommendation for PR perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven-20260528¶
Partial revert: retain __launch_bounds__(128, 8) on adm_cm_line_kernel_8, revert the smem tiling of ms_ssim_decimate in ms_ssim_score.cu. The ms_ssim kernel on Ada Lovelace is already L1-resident for both 576p and 1080p inputs; a better optimization path is occupancy improvement (currently 20–80%) via block size tuning or kernel fusion with the horiz/vert passes, not smem tiling.
Regression vs Last Committed Snapshot¶
testdata/perf_benchmark_results.json does not include float_ms_ssim_cuda timings. No regression flagged for the committed snapshot. WL1/WL2 end-to-end numbers are documented here as the new baseline for future comparisons against this PR branch.
Reproducer Commands¶
# Build optimized libvmaf (patches just 2 CUDA files in container build tree)
docker run --rm --privileged --entrypoint /bin/bash \
--runtime=nvidia --gpus all \
-v /path/to/wt-opt/core/src/feature/cuda/integer_ms_ssim/ms_ssim_score.cu:/opt/ms.cu:ro \
-v /path/to/wt-opt/core/src/feature/cuda/integer_adm/adm_cm.cu:/opt/adm.cu:ro \
-v /tmp/bins:/output \
vmaf-dev-mcp:cuda13.3 \
-c "cp /opt/ms.cu /build/vmaf/core/src/feature/cuda/integer_ms_ssim/ms_ssim_score.cu && \
cp /opt/adm.cu /build/vmaf/core/src/feature/cuda/integer_adm/adm_cm.cu && \
ninja -C /build/vmaf/core/build src/ms_ssim_score.fatbin src/adm_cm.fatbin tools/vmaf && \
cp /build/vmaf/core/build/src/libvmaf.so.3.0.0 /output/libvmaf-opt.so"
# End-to-end timing (baseline)
docker run --rm --privileged --entrypoint /bin/bash \
--runtime=nvidia --gpus all vmaf-dev-mcp:cuda13.3 \
-c "vmaf --backend cuda -r /build/vmaf/testdata/ref_576x324_48f.yuv \
-d /build/vmaf/testdata/dis_576x324_48f.yuv -w 576 -h 324 -p 420 -b 8 \
-m path=/build/vmaf/model/vmaf_v0.6.1.json --feature float_ms_ssim_cuda \
--json -o /dev/stderr 2>&1 | python3 -c 'import json,sys; d=json.load(sys.stdin); print(d[\"fps\"])'
# ncu kernel profiling (requires --privileged)
ncu --target-processes all -k regex:ms_ssim_decimate \
--metrics gpu__time_duration.sum,l1tex__t_sector_hit_rate.pct,\
dram__throughput.avg.pct_of_peak_sustained_elapsed,launch__registers_per_thread \
--csv <vmaf-binary> --backend cuda -r <ref.yuv> -d <dis.yuv> ...