Skip to content

ADR-0454: VIF CUDA shared-memory staging for horizontal and vertical filter passes

  • Status: Proposed
  • Date: 2026-05-16
  • Deciders: lusoris
  • Tags: cuda, gpu, vif, performance, smem, fork-local

Context

The VIF feature extractor is the dominant cost in a full VMAF CUDA run (~60% of frame time at 1080p). Its two-pass separable filter (filter1d_8/16_vertical_kernelfilter1d_8/16_horizontal_kernel) reads input data entirely from L2/DRAM on every filter tap with no explicit data reuse:

  • Horizontal pass: 7 tmp channels × 17 taps = 119 global 32-bit reads per output pixel, even though all threads in a block share the same 256-wide row slice. Each warp issues up to 17 cache-line misses into L2 per tap group.

  • Vertical pass: 17 global row reads × 2 planes per output row, with adjacent blocks in Y sharing 16 of those 17 rows — all loaded redundantly.

The perf-audit (perf-audit-cuda-2026-05-16.md) identified these as wins #1 and #4 with estimated speedups of 20–35% (horizontal) and 15–25% (vertical) on a VIF-only run.

Decision

Stage input data into __shared__ memory before the filter accumulation loop in all four filter template functions:

  • filter1d_8_vertical_kernel: tile (BLOCKY + fwidth_0 - 1) × 128 per plane (uint8_t). Block-cooperative load, single __syncthreads().
  • filter1d_8_horizontal_kernel: tile 273 uint32_t per tmp channel × 7 channels. Block-cooperative load into 7 __shared__ arrays.
  • filter1d_16_vertical_kernel: tile (BLOCKY + fwidth - 1) × 128 per plane (uint16_t).
  • filter1d_16_horizontal_kernel: same layout as 8-bit horizontal.

The boundary mirror (reflect at y=0, y=h-1, x=0, x=w-1) is applied in the smem load phase; the compute phase reads smem unconditionally. All arithmetic is unchanged — results are bit-identical to the pre-patch kernel.

Alternatives considered

Option Pros Cons Why not chosen
Texture cache (read-only path via __ldg) Zero code change to tile logic Only improves L2 hit-rate; does not eliminate redundant loads across threads Does not address inter-thread reuse within the block
Fused vertical + horizontal single kernel Eliminates 7 tmp buffer round-trips 17-tap × 7-channel smem would exceed 48 KB at full block occupancy; requires 2D block restructuring Deferred to follow-up; the tmp buffers allow pipelined execution today
Half-width tile to reduce smem Halves smem per block, doubles occupancy Halo threads still issue global loads; the saving is marginal for the dominant interior path Not worth the indexing complexity
cudaFuncSetCacheConfig(cudaFuncCachePreferL1) Zero code change Moves L1/L2 boundary but does not change the access pattern Insufficient for 17-tap reuse distance

Consequences

  • Positive: Horizontal kernel reduces unique L2 reads by 7–8× for the interior path (7 smem loads replace 7 × 17 global loads per output pixel). Vertical kernel eliminates 16 of 17 redundant row loads per warp across adjacent blocks in Y. Estimated combined speedup: 15–35% on VIF.
  • Negative: Smem budget per block increases by ~7.6 KB (horizontal) and ~5–10 KB (vertical). At BLOCKX=128, BLOCKY=1 for horizontal and BLOCKX=32, BLOCKY=4 for vertical, the SM can still host 6+ active blocks per SM (48 KB / 7.6 KB ≈ 6). No occupancy regression expected.
  • Neutral: The __syncthreads() adds one barrier per kernel invocation. For 1080p this is ~135 barriers per pass (1920/256 × 1080 blocks) — a negligible overhead vs. the eliminated L2 traffic.

References

  • Research digest: Research-0135
  • Perf audit: .workingdir/perf-audit-cuda-2026-05-16.md wins #1, #4.
  • ADR-0214 (GPU parity CI gate — places=4 must pass).
  • NVIDIA CUDA C Best Practices Guide §"Shared Memory".