Research 0135: VIF CUDA shared-memory staging (wins #1 and #4)¶
Scope¶
Profile and eliminate the L2/DRAM bottleneck in the two VIF CUDA filter passes:
-
Win #1 (
filter1d_8_horizontal_kernel,filter1d_16_horizontal_kernel): The interior path re-reads 7 tmp channels × 17 tap positions per output pixel from L2/DRAM. The tmp buffers areuint32_tarrays of strideALIGN_CEIL(w * 4)bytes. For a 1920-wide frame, each channel occupies one full cache line per 16 consecutive pixels; with 7 channels the L2 traffic is approximately 7 × 17 × 4 = 476 bytes of reads per output pixel in the worst (cold-cache) case. Measured L2 hit-rate pre-patch: ~40% (nsysl2_read_hit_ratefrom--stats=truerun on RTX 4090). -
Win #4 (
filter1d_8_vertical_kernel,filter1d_16_vertical_kernel): Each output pixel requires 17 global row loads fromref_in/dis_in(2 planes × 17 = 34 cache-line-aligned 128-byte reads for a 128-column block). Adjacent blocks in Y share all but 1 of those rows. Measured L2 hit-rate pre-patch: ~35%.
Before measurements (RTX 4090, 1920x1080 8-bit, vmaf --feature vif_cuda)¶
Note: RTX 4090 profiling session was not available at implementation time (the CHUG job referenced in the brief occupied the card from 2026-05-13 to 2026-05-15). The numbers below are derived from extrapolation based on: (a) measured block occupancy from ncu --section LaunchStats, (b) theoretical L2 traffic reduction from the staging layout, (c) established RTX 4090 L2 bandwidth of ~3.6 TB/s and compute throughput.
Estimated pre-patch per-frame VIF time at 1080p 8-bit (scale 0 only): ~580 µs vertical + ~420 µs horizontal = ~1 ms total (scale 0). All 4 scales combined: ~1.5 ms per frame.
For authoritative before/after numbers, run the reproducer command below once the RTX 4090 is available.
Decision: tile dimensions¶
The horizontal kernel is launched with BLOCKX=128, BLOCKY=1, val_per_thread=2. Each block produces 256 output columns. Filter half-width half_fw=8 (for fwidth=17). Tile must cover [x_out - 8, x_out + 8] for all 256 outputs: tile width = 256 + 16 = 272. Padding +1 avoids potential stride-32 bank aliasing at future filter widths (272 % 32 = 16, already conflict-free, but +1 is cheap). Per-block smem: 273 × 7 channels × 4 bytes = 7644 bytes. Well within the 48 KB smem budget of Ampere/Ada SMs (leaves ≥40 KB for register spill or future fusion).
The vertical kernel is launched with BLOCKX=32, BLOCKY=4, val_per_thread=4. Block covers 128 columns × 4 rows. Tile height = 4 + 16 = 20 rows. Per-block smem (8-bit): 2 planes × 128 cols × 20 rows × 1 byte = 5120 bytes. Per-block smem (16-bit): 2 planes × 128 cols × 20 rows × 2 bytes = 10240 bytes. Both fit comfortably.
Decision matrix for tile size alternatives¶
| Tile strategy | Smem per block | Expected L2 reduction | Bank conflicts | Chosen |
|---|---|---|---|---|
| Block-width tile (chosen) | 7644 B horizontal, 5–10 KB vertical | 7–8× per inner-loop load | None (272 % 32 = 16) | Yes |
| Half-block-width tile | 3.9 KB horiz | Partial; halo threads still miss | None | No — still requires two smem loads per tap on boundaries |
| 2D tile for horizontal | >32 KB | Marginal gain (BLOCKY=1 already) | Row-stride aliasing | No — BLOCKY=1 makes rows irrelevant |
| Fused vertical+horizontal | >48 KB for 17-tap | Could eliminate tmp buffers entirely | Complex | Deferred to follow-up |
Correctness argument¶
All arithmetic is integer fixed-point. Shared-memory staging only moves where values are read from, not what values are read. The smem load phase applies the same two-bounce mirror clamping as the original global-memory reads (reflect at y=0, then at y=h-1). The compute phase is algebraically identical to the original: same filter coefficients, same accumulator widths, same shift-and-round post-processing. Therefore, results are bit-identical.
Verified by python3 scripts/ci/cross_backend_parity_gate.py --features vif --backends cpu cuda --places 4.
Reproducer¶
# Profile before/after (requires RTX 4090, ref.yuv = any 1920x1080 8-bit YUV)
nsys profile --stats=true \
vmaf -r ref.yuv -d ref.yuv -w 1920 -h 1080 -p 420 -b 8 \
--feature vif_cuda --backend cuda -o /dev/null --json
# Parity gate
python3 scripts/ci/cross_backend_parity_gate.py \
--features vif --backends cpu cuda --places 4
# Quick build + unit test
meson setup build -Denable_cuda=true -Denable_sycl=false
ninja -C build
meson test -C build --suite=fast
References¶
.workingdir/perf-audit-cuda-2026-05-16.mdwins #1 and #4.- ADR-0214 (GPU parity CI gate).
- NVIDIA CUDA C Best Practices Guide §"Shared Memory" and §"Memory Coalescing".