ADR-0454: VIF CUDA shared-memory staging for horizontal and vertical filter passes¶
- Status: Proposed
- Date: 2026-05-16
- Deciders: lusoris
- Tags:
cuda,gpu,vif,performance,smem,fork-local
Context¶
The VIF feature extractor is the dominant cost in a full VMAF CUDA run (~60% of frame time at 1080p). Its two-pass separable filter (filter1d_8/16_vertical_kernel → filter1d_8/16_horizontal_kernel) reads input data entirely from L2/DRAM on every filter tap with no explicit data reuse:
-
Horizontal pass: 7 tmp channels × 17 taps = 119 global 32-bit reads per output pixel, even though all threads in a block share the same 256-wide row slice. Each warp issues up to 17 cache-line misses into L2 per tap group.
-
Vertical pass: 17 global row reads × 2 planes per output row, with adjacent blocks in Y sharing 16 of those 17 rows — all loaded redundantly.
The perf-audit (perf-audit-cuda-2026-05-16.md) identified these as wins #1 and #4 with estimated speedups of 20–35% (horizontal) and 15–25% (vertical) on a VIF-only run.
Decision¶
Stage input data into __shared__ memory before the filter accumulation loop in all four filter template functions:
filter1d_8_vertical_kernel: tile(BLOCKY + fwidth_0 - 1) × 128per plane (uint8_t). Block-cooperative load, single__syncthreads().filter1d_8_horizontal_kernel: tile 273 uint32_t per tmp channel × 7 channels. Block-cooperative load into 7__shared__arrays.filter1d_16_vertical_kernel: tile(BLOCKY + fwidth - 1) × 128per plane (uint16_t).filter1d_16_horizontal_kernel: same layout as 8-bit horizontal.
The boundary mirror (reflect at y=0, y=h-1, x=0, x=w-1) is applied in the smem load phase; the compute phase reads smem unconditionally. All arithmetic is unchanged — results are bit-identical to the pre-patch kernel.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Texture cache (read-only path via __ldg) | Zero code change to tile logic | Only improves L2 hit-rate; does not eliminate redundant loads across threads | Does not address inter-thread reuse within the block |
| Fused vertical + horizontal single kernel | Eliminates 7 tmp buffer round-trips | 17-tap × 7-channel smem would exceed 48 KB at full block occupancy; requires 2D block restructuring | Deferred to follow-up; the tmp buffers allow pipelined execution today |
| Half-width tile to reduce smem | Halves smem per block, doubles occupancy | Halo threads still issue global loads; the saving is marginal for the dominant interior path | Not worth the indexing complexity |
cudaFuncSetCacheConfig(cudaFuncCachePreferL1) | Zero code change | Moves L1/L2 boundary but does not change the access pattern | Insufficient for 17-tap reuse distance |
Consequences¶
- Positive: Horizontal kernel reduces unique L2 reads by 7–8× for the interior path (7 smem loads replace 7 × 17 global loads per output pixel). Vertical kernel eliminates 16 of 17 redundant row loads per warp across adjacent blocks in Y. Estimated combined speedup: 15–35% on VIF.
- Negative: Smem budget per block increases by ~7.6 KB (horizontal) and ~5–10 KB (vertical). At BLOCKX=128, BLOCKY=1 for horizontal and BLOCKX=32, BLOCKY=4 for vertical, the SM can still host 6+ active blocks per SM (48 KB / 7.6 KB ≈ 6). No occupancy regression expected.
- Neutral: The
__syncthreads()adds one barrier per kernel invocation. For 1080p this is ~135 barriers per pass (1920/256 × 1080 blocks) — a negligible overhead vs. the eliminated L2 traffic.
References¶
- Research digest: Research-0135
- Perf audit:
.workingdir/perf-audit-cuda-2026-05-16.mdwins #1, #4. - ADR-0214 (GPU parity CI gate —
places=4must pass). - NVIDIA CUDA C Best Practices Guide §"Shared Memory".