ADR-0845: CUDA motion — multi-frame SAD batching to reduce per-launch overhead¶
- Status: Proposed
- Date: 2026-05-29
- Deciders: lusoris
- Tags:
cuda,performance,motion,fork-local
Context¶
The CUDA calculate_motion_score_kernel_8bpc is dispatch-bottlenecked at all resolutions below 4K (Research-0760, 2026-05-29). ncu profiling on an RTX 4090 shows the kernel takes 7 µs at 576p but the total per-frame wall time is ~12.7 ms, meaning GPU busy fraction is under 0.1%. The CUDA backend runs at 0.22× CPU throughput at 576p (79 fps vs 353 fps CPU) purely because of driver round-trip overhead. At 1080p the situation is similarly dominated by dispatch overhead.
The root cause is that the legacy submit/collect pattern issues one cuLaunchKernel, one cuEventRecord × 2, one cuStreamWaitEvent, one cuMemcpyDtoHAsync, and one cuStreamSynchronize per frame. Each driver API call costs 1–3 ms of CPU-side overhead; the kernel itself is irrelevant to wall-clock performance at small resolutions.
This optimization is tracked under PR #75 and PR #77 findings, with the baseline measurement provided by Research-0760.
Decision¶
We will batch MOTION_BATCH_DEPTH (8) kernel launches before issuing a single cuStreamSynchronize to drain all device-to-host copies. Each frame's SAD accumulator uses a dedicated slot in a ring of 8 VmafCudaBuffer *sad[] device buffers and a single pinned host array of 8 uint64_t values. The DtoH readback is deferred from submit() to a batch-boundary collect() call, reducing the number of cuStreamSynchronize calls from N (one per frame) to ceil(N / MOTION_BATCH_DEPTH) (one per 8 frames).
Score emission is also deferred to batch boundaries: non-boundary collect() calls increment frame_index and return 0 without emitting; the batch-boundary collect() emits scores for all MOTION_BATCH_DEPTH frames at once using the emit_batch_scores() helper. flush() handles the final partial batch.
The motion3_postprocess_cuda moving-average guard uses a per-frame frame_index saved/restored inside emit_batch_scores() to preserve numerical correctness (ADR-0219 / ADR-0214 places=4 gate).
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Multi-frame kernel (process N frames in one launch) | Maximally reduces launch count | Requires new kernel variant; significant device-code change; hard to keep bit-exact with SYCL/Vulkan twins | Too high risk and scope; correctness gate harder to maintain |
| Separable horizontal + vertical filter passes (Candidate 2, Research-0760) | Reduces compute latency at 4K by ~40% | Two kernel launches per frame (doubles dispatch overhead at sub-4K); only helps at 4K where kernel compute dominates | Worsens sub-4K where the bottleneck is not compute; orthogonal follow-on |
cp.async tile prefetch (Candidate 3, Research-0760) | ~10–15% occupancy improvement | Adds assembly-level PTX inline; sm_80+ only; kernel-level improvement irrelevant at 576p (dispatch-bottlenecked) | Small gain; should be bundled with Candidate 2 as a follow-on |
__launch_bounds__ / block-size tuning only | Zero code risk | Does not address dispatch overhead; kernel improvements are <5% at dispatch-bottlenecked resolutions | Does not address the bottleneck |
| BATCH_DEPTH = 2 (shallow pipeline) | Simpler logic | ~50% overhead reduction vs ~87.5% at BATCH=8 | BATCH=8 achieves 87.5% sync reduction with the same structural complexity |
Consequences¶
- Positive: CUDA motion expected to achieve ≥ 2× CPU throughput at 576p and 1080p (from 0.22× baseline). Per Research-0760 Candidate 1 estimate: ~79 fps → ~800 fps at 576p with BATCH=8. 4K performance is unchanged (dispatch is already amortised there).
- Negative: Score emission is deferred by up to MOTION_BATCH_DEPTH frames. This is transparent to callers because
vmaf_feature_collector_appendaccepts any index. Memory overhead increases from 1 device SAD buffer + 1 pinned uint64 to 8 device SAD buffers (8 × 8 bytes = 64 bytes) + 1 pinned array of 8 × 8 bytes = 64 bytes. Negligible. - Neutral / follow-ups: (1) The drain_batch engine-scope fence optimization (ADR-0242) is no longer relevant for the motion extractor — it has been removed from the submit path. The extractor is no longer registered with the drain batch. (2) The
finishedevent handle (previously used for drain_batch) has been removed from MotionStateCuda. (3) The MOTION_BATCH_DEPTH constant is compile-time only; no runtime option is exposed. (4) A/B measurement against the baseline in Research-0760 is required before promoting to DRAFT or marking PASS. (5) The SYCL / Vulkan motion twins do NOT implement batching — their dispatch overhead profiles differ and this optimization does not apply to them. (6) A correctness check at places=4 (ADR-0214) against CPU is required.
References¶
- Research-0760 (2026-05-29): CUDA motion kernel ncu multi-resolution profile
- Research-0735 (2026-05-28): CUDA motion kernel hotpath analysis at 576p
- ADR-0219: motion3 GPU contract
- ADR-0214: GPU-parity CI gate (places=4)
- ADR-0242: engine-scope fence batching (drain_batch — no longer consumed by motion after this ADR)
- ADR-0760: CUDA motion ncu multi-resolution profiling methodology
req: per-agent task brief 2026-05-29 — "reduce per-launch overhead... Target: motion CUDA wins or ties CPU at 576p AND 1080p"- PR #75, PR #77: benchmark findings that motivated this work