Research-0092: CUDA motion sub-4K performance root cause (2026-05-10)¶
Status: Confirmed — fix in PR #695 (fix/motion-cuda-stream).
Problem statement¶
At 576x324, motion_cuda on RTX 4090 measures ~17 fps against a CPU scalar baseline of ~30 fps (0.55x CPU). The crossover with CPU is at ~1080p; at 4K the GPU wins. The prior agent reached "definitive diagnosis stage" before the session was interrupted. This document confirms the diagnosis and adds the implementation findings from code inspection.
Root cause (confirmed)¶
Single line: picture_cuda.c:226 (vmaf_cuda_picture_alloc).
// BEFORE (broken):
CHECK_CUDA_GOTO(cu_f, cuStreamCreate(&priv->cuda.str, CU_STREAM_DEFAULT), fail);
CU_STREAM_DEFAULT (flag = 0) opts the stream into the CUDA legacy null-stream implicit-serialisation rule (Programming Guide §3.2.5.5). Every operation enqueued on any non-null stream created with this flag must wait for all work on every other stream in the same context to complete before it launches, AND all work on other streams must wait for it to finish. This makes each per-picture upload a context-wide barrier.
Per-frame execution sequence at 576x324:
- HtoD upload of ref picture on
pic_stream— enqueues on aCU_STREAM_DEFAULTstream. The implicit barrier stalls until all other streams (globalVmafCudaState.str, extractorMotionStateCuda.str) are idle before the upload can start. cuStreamWaitEvent(pic_stream, ready_event)— already on theCU_STREAM_DEFAULTstream, so the context is again flushed.- Motion kernel launch on
pic_stream. cuMemcpyDtoHAsync(SAD readback) onMotionStateCuda.str.
At small frame sizes the kernel occupies the GPU for ~5-15 µs. The implicit context barrier costs ~50-100 µs round-trip. With 30 frames/s CPU doing work in ~33 ms, the CUDA path burns the same 33 ms entirely in serialisation overhead, leaving GPU nearly idle.
At 4K the kernel itself takes several ms, dwarfing the barrier latency, so the GPU wins on raw compute.
Supporting evidence from code inspection¶
common.c:77(init_with_primary_context):CU_STREAM_NON_BLOCKING— correct.common.c:125(init_with_provided_context):CU_STREAM_NON_BLOCKING— correct.integer_motion_cuda.c:262(init_fex_cuda):cuStreamCreateWithPriority(..., CU_STREAM_NON_BLOCKING, 0)— correct.drain_batch.c:119(drain stream):CU_STREAM_NON_BLOCKING— correct.picture_cuda.c:226(vmaf_cuda_picture_alloc):CU_STREAM_DEFAULT— BROKEN.
Every runtime stream except the picture-upload stream uses CU_STREAM_NON_BLOCKING. The picture-upload stream is on the critical path for every frame.
Hypotheses evaluated¶
| Hypothesis | Verdict |
|---|---|
CU_STREAM_DEFAULT implicit serialisation on picture-upload stream | Confirmed — root cause |
| Redundant per-frame prev-frame HtoD upload | Not present; ping-pong buffer in s->blur[2] is device-resident. |
| Host-side reduction via DtoH copy-back per frame | Present but not the bottleneck: cuMemcpyDtoHAsync of 8 bytes on s->str overlaps with next frame's CPU work and is gated by cuStreamSynchronize in collect(), not in the hot loop. |
cudaMemcpy (synchronous) instead of async | Not present; code uses cuMemcpyDtoHAsync and cuMemcpy2DAsync throughout. |
Fix¶
Replace cuStreamCreate(..., CU_STREAM_DEFAULT) with cuStreamCreateWithPriority(..., CU_STREAM_NON_BLOCKING, 0) at picture_cuda.c:226.
Patch size: 1 line deleted, 8 lines added (comment + new call).
Actual bench results (2026-05-10, RTX 4090, 48 frames)¶
BEFORE: commit 88c022a9 — CU_STREAM_DEFAULT¶
| Feature | CUDA fps | Notes |
|---|---|---|
| motion (CUDA) | 3034 fps | |
| vif (CUDA) | 8219 fps | (kernel-bound, less affected) |
| adm (CUDA) | 6931 fps | (kernel-bound, less affected) |
AFTER: commit d641105f — CU_STREAM_NON_BLOCKING¶
| Feature | CPU fps | CUDA fps | CUDA/CPU ratio |
|---|---|---|---|
| motion | 13121 fps | 9190 fps | 0.70x |
| vif | 924 fps | 3149 fps | 3.41x |
| adm | 3294 fps | 7059 fps | 2.14x |
motion (CUDA) improvement: 3034 fps -> 9190 fps = 3.03x speedup from the fix.
The CUDA motion is now 0.70x CPU at 576x324. This does not reach the ≥5x target because 576x324 = 186,624 pixels — the GPU is severely underutilised at this tiny tile count. The ≥5x crossover is at ~1080p as documented in the bench data. The key result is the removal of the 0.55x regression — the prior barrier made CUDA slower than CPU even though the kernel was faster; now the relationship is correct (GPU nearly matches CPU at sub-4K, dominates at 4K).
ncu profiling recommendation¶
To confirm the improvement:
ncu --section LaunchStats --section MemoryWorkloadAnalysis \
./build-cuda/core/tools/vmaf_bench --resolution 576x324 --gpu-only --frames 20
Before fix: LaunchStats will show high Idle cycles and low SM occupancy at 576x324. After fix: occupancy should increase, Idle cycles should drop.
Files affected¶
core/src/cuda/picture_cuda.cline 226 — the fix.docs/adr/0378-picture-stream-non-blocking.md— architectural decision.