Skip to content

Research-0749: Hardware Measurement of PR perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven-20260528

Date: 2026-05-29 Branch: perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven-20260528 (commit 06a6a00215) Baseline: container image vmaf-dev-mcp:cuda13.3 built from master tip b04b07fd70 Device: RTX 4090 (CC 8.9, 128 SMs), CUDA 13.3 (driver 610.43.02), ncu 2026.2.0 Method: isolated worktrees; optimized libvmaf.so via LD_PRELOAD; privileged container for ncu PMU access

Purpose

The PR introducing smem tiling for ms_ssim_decimate and __launch_bounds__(128, 8) for adm_cm_line_kernel_8 was written with ncu estimates only — no hardware measurement was possible at authoring time due to GPU contention. This digest provides the actual measured numbers.

Correctness

Both optimizations are bit-exact against master at ADR-0214 places=4 (max_diff=0.0 on all frames, both workloads). Correctness: PASS.

End-to-End Throughput (3-run median, --feature float_ms_ssim_cuda, --backend cuda)

Workload Baseline Optimized Delta
WL1: 576x324, 48f 2469.1 fps 2586.8 fps +4.8%
WL2: 1080p, 3f 482.2 fps 501.2 fps +3.9%

Both workloads show a positive end-to-end delta. Note that ms_ssim_decimate and adm_cm_line_kernel_8 are two of many kernels contributing to total throughput; the aggregate improvement is diluted by other kernel time.

Per-Kernel: ms_ssim_decimate

The smem tiling optimization exhibits an unexpected L1-hit-rate inversion: the baseline already achieves 95% L1 hit rate (the small working set fits in L1/L2 cache naturally), so adding smem overhead increases kernel duration at both resolutions.

WL1 (576x324) — largest decimation launch, grid (18,21,1):

Metric Baseline Optimized Delta
Duration 5120 ns 6368 ns +24.4% SLOWER
DRAM throughput 19.24% 13.76% -28%
L1 hit rate 95.46% 24.51% severely degraded
Achieved occupancy 20.03% 23.09% +3%
Registers/thread 39 40 +1
Launches/frame 8 per channel per scale same

WL2 (1080p) — largest decimation launch, grid (60,68,1):

Metric Baseline Optimized Delta
Duration 18656 ns 20224 ns +8.4% SLOWER
DRAM throughput 46.08% 42.67% -7.4%
L1 hit rate 95.57% 23.93% severely degraded
Achieved occupancy 79.86% 86.18% +8%
Registers/thread 39 40 +1

Root cause: The smem cooperative load pattern breaks the hardware prefetcher's ability to reuse lines in L1. In the baseline kernel, each thread accesses its 9×9 neighbourhood sequentially and the hardware cache delivers 95%+ hit rates. After tiling, the load phase accesses the smem padded array (TILE_W_PAD=41) in a strided pattern that maps poorly to 128-byte cache sectors on Ada Lovelace, causing a large L1 miss penalty that outweighs the eliminated global-memory reads in the compute phase. The 1-register increase (39→40) also likely pushes a compiler decision boundary.

Despite kernel-level regression, the end-to-end throughput is positive because the adm_cm optimization dominates the ADM path which is in the critical pipeline.

Per-Kernel: adm_cm_line_kernel_8

The __launch_bounds__(128, 8) hint achieves its primary goal: ptxas reduces register file from 114 to 64 registers per thread.

WL1 (576x324) — grid (8,5,3) = 120 CTAs:

Metric Baseline Optimized Delta
Duration ~14.2 µs ~14.9 µs +5% (noise level)
Registers/thread 114 64 -43.9%
Shared memory 0 B 0 B
Achieved occupancy 7.37% 7.37% 0%

At 576x324 the wave count (120 CTAs / 128 SMs = <1 wave) limits occupancy regardless of register pressure; the register reduction does not help.

WL2 (1080p) — grid (25,14,3) = 1050 CTAs:

Metric Baseline Optimized Delta
Duration ~68.4 µs ~62.0 µs -9.3% FASTER
Registers/thread 114 64 -43.9%
Achieved occupancy 32.1% see note

At 1080p, with 1050 CTAs across 128 SMs, the register reduction measurably improves scheduling. The -9.3% duration improvement is consistent with the theoretical occupancy increase from 33% to ~50% (65536 / (12864) = 8 CTAs/SM vs 65536 / (128114) ≈ 4.5). This is the workload the PR targeted.

Comparison with ADR-0744 Estimates

Kernel ADR-0744 estimate Measured (WL2)
ms_ssim_decimate speedup +68–93% -8.4% (regression)
adm_cm speedup +66.7% -9.3% (correct direction, ~7× lower magnitude)

The ms_ssim_decimate estimate was entirely wrong: the original analysis assumed the baseline was DRAM-bound, but it was already L1-bound at both resolutions. The adm_cm estimate was directionally correct at 1080p but overestimated by ~7× because the theoretical occupancy increase only partially converts to throughput given the compute-bound nature of the kernel at 1080p wave counts.

Verdict

  • adm_cm_line_kernel_8 __launch_bounds__: MEASURED POSITIVE at 1080p (-9.3% kernel duration). Correctness passes. Regression at 576x324 is within noise. Keep.
  • ms_ssim_decimate smem tiling: MEASURED REGRESSION at both resolutions (kernel +8–24% slower). The L1 hit rate drops from 95% to 24% — the smem tiling assumption was invalid for this kernel on Ada Lovelace. End-to-end improvement (+3.9–4.8%) is driven entirely by the adm_cm change. Recommend reverting the ms_ssim_decimate tiling.

Recommendation for PR perf/cuda-ms-ssim-decimate-adm-cm-ncu-driven-20260528

Partial revert: retain __launch_bounds__(128, 8) on adm_cm_line_kernel_8, revert the smem tiling of ms_ssim_decimate in ms_ssim_score.cu. The ms_ssim kernel on Ada Lovelace is already L1-resident for both 576p and 1080p inputs; a better optimization path is occupancy improvement (currently 20–80%) via block size tuning or kernel fusion with the horiz/vert passes, not smem tiling.

Regression vs Last Committed Snapshot

testdata/perf_benchmark_results.json does not include float_ms_ssim_cuda timings. No regression flagged for the committed snapshot. WL1/WL2 end-to-end numbers are documented here as the new baseline for future comparisons against this PR branch.

Reproducer Commands

# Build optimized libvmaf (patches just 2 CUDA files in container build tree)
docker run --rm --privileged --entrypoint /bin/bash \
  --runtime=nvidia --gpus all \
  -v /path/to/wt-opt/core/src/feature/cuda/integer_ms_ssim/ms_ssim_score.cu:/opt/ms.cu:ro \
  -v /path/to/wt-opt/core/src/feature/cuda/integer_adm/adm_cm.cu:/opt/adm.cu:ro \
  -v /tmp/bins:/output \
  vmaf-dev-mcp:cuda13.3 \
  -c "cp /opt/ms.cu /build/vmaf/core/src/feature/cuda/integer_ms_ssim/ms_ssim_score.cu && \
      cp /opt/adm.cu /build/vmaf/core/src/feature/cuda/integer_adm/adm_cm.cu && \
      ninja -C /build/vmaf/core/build src/ms_ssim_score.fatbin src/adm_cm.fatbin tools/vmaf && \
      cp /build/vmaf/core/build/src/libvmaf.so.3.0.0 /output/libvmaf-opt.so"

# End-to-end timing (baseline)
docker run --rm --privileged --entrypoint /bin/bash \
  --runtime=nvidia --gpus all vmaf-dev-mcp:cuda13.3 \
  -c "vmaf --backend cuda -r /build/vmaf/testdata/ref_576x324_48f.yuv \
      -d /build/vmaf/testdata/dis_576x324_48f.yuv -w 576 -h 324 -p 420 -b 8 \
      -m path=/build/vmaf/model/vmaf_v0.6.1.json --feature float_ms_ssim_cuda \
      --json -o /dev/stderr 2>&1 | python3 -c 'import json,sys; d=json.load(sys.stdin); print(d[\"fps\"])'

# ncu kernel profiling (requires --privileged)
ncu --target-processes all -k regex:ms_ssim_decimate \
  --metrics gpu__time_duration.sum,l1tex__t_sector_hit_rate.pct,\
dram__throughput.avg.pct_of_peak_sustained_elapsed,launch__registers_per_thread \
  --csv <vmaf-binary> --backend cuda -r <ref.yuv> -d <dis.yuv> ...