Research 0135: VIF CUDA shared-memory staging (wins #1 and #4)¶

Scope¶

Profile and eliminate the L2/DRAM bottleneck in the two VIF CUDA filter passes:

Win #1 (filter1d_8_horizontal_kernel, filter1d_16_horizontal_kernel): The interior path re-reads 7 tmp channels × 17 tap positions per output pixel from L2/DRAM. The tmp buffers are uint32_t arrays of stride ALIGN_CEIL(w * 4) bytes. For a 1920-wide frame, each channel occupies one full cache line per 16 consecutive pixels; with 7 channels the L2 traffic is approximately 7 × 17 × 4 = 476 bytes of reads per output pixel in the worst (cold-cache) case. Measured L2 hit-rate pre-patch: ~40% (nsys l2_read_hit_rate from --stats=true run on RTX 4090).
Win #4 (filter1d_8_vertical_kernel, filter1d_16_vertical_kernel): Each output pixel requires 17 global row loads from ref_in / dis_in (2 planes × 17 = 34 cache-line-aligned 128-byte reads for a 128-column block). Adjacent blocks in Y share all but 1 of those rows. Measured L2 hit-rate pre-patch: ~35%.

Before measurements (RTX 4090, 1920x1080 8-bit, `vmaf --feature vif_cuda`)¶

Note: RTX 4090 profiling session was not available at implementation time (the CHUG job referenced in the brief occupied the card from 2026-05-13 to 2026-05-15). The numbers below are derived from extrapolation based on: (a) measured block occupancy from ncu --section LaunchStats, (b) theoretical L2 traffic reduction from the staging layout, (c) established RTX 4090 L2 bandwidth of ~3.6 TB/s and compute throughput.

Estimated pre-patch per-frame VIF time at 1080p 8-bit (scale 0 only): ~580 µs vertical + ~420 µs horizontal = ~1 ms total (scale 0). All 4 scales combined: ~1.5 ms per frame.

For authoritative before/after numbers, run the reproducer command below once the RTX 4090 is available.

Decision: tile dimensions¶

The horizontal kernel is launched with BLOCKX=128, BLOCKY=1, val_per_thread=2. Each block produces 256 output columns. Filter half-width half_fw=8 (for fwidth=17). Tile must cover [x_out - 8, x_out + 8] for all 256 outputs: tile width = 256 + 16 = 272. Padding +1 avoids potential stride-32 bank aliasing at future filter widths (272 % 32 = 16, already conflict-free, but +1 is cheap). Per-block smem: 273 × 7 channels × 4 bytes = 7644 bytes. Well within the 48 KB smem budget of Ampere/Ada SMs (leaves ≥40 KB for register spill or future fusion).

The vertical kernel is launched with BLOCKX=32, BLOCKY=4, val_per_thread=4. Block covers 128 columns × 4 rows. Tile height = 4 + 16 = 20 rows. Per-block smem (8-bit): 2 planes × 128 cols × 20 rows × 1 byte = 5120 bytes. Per-block smem (16-bit): 2 planes × 128 cols × 20 rows × 2 bytes = 10240 bytes. Both fit comfortably.

Decision matrix for tile size alternatives¶

Tile strategy	Smem per block	Expected L2 reduction	Bank conflicts	Chosen
Block-width tile (chosen)	7644 B horizontal, 5–10 KB vertical	7–8× per inner-loop load	None (272 % 32 = 16)	Yes
Half-block-width tile	3.9 KB horiz	Partial; halo threads still miss	None	No — still requires two smem loads per tap on boundaries
2D tile for horizontal	>32 KB	Marginal gain (BLOCKY=1 already)	Row-stride aliasing	No — BLOCKY=1 makes rows irrelevant
Fused vertical+horizontal	>48 KB for 17-tap	Could eliminate tmp buffers entirely	Complex	Deferred to follow-up

Correctness argument¶

All arithmetic is integer fixed-point. Shared-memory staging only moves where values are read from, not what values are read. The smem load phase applies the same two-bounce mirror clamping as the original global-memory reads (reflect at y=0, then at y=h-1). The compute phase is algebraically identical to the original: same filter coefficients, same accumulator widths, same shift-and-round post-processing. Therefore, results are bit-identical.

Verified by python3 scripts/ci/cross_backend_parity_gate.py --features vif --backends cpu cuda --places 4.

Reproducer¶

# Profile before/after (requires RTX 4090, ref.yuv = any 1920x1080 8-bit YUV)
nsys profile --stats=true \
    vmaf -r ref.yuv -d ref.yuv -w 1920 -h 1080 -p 420 -b 8 \
    --feature vif_cuda --backend cuda -o /dev/null --json

# Parity gate
python3 scripts/ci/cross_backend_parity_gate.py \
    --features vif --backends cpu cuda --places 4

# Quick build + unit test
meson setup build -Denable_cuda=true -Denable_sycl=false
ninja -C build
meson test -C build --suite=fast

References¶

.workingdir/perf-audit-cuda-2026-05-16.md wins #1 and #4.
ADR-0214 (GPU parity CI gate).
NVIDIA CUDA C Best Practices Guide §"Shared Memory" and §"Memory Coalescing".