ADR-0358: CUDA motion correctness — SAD race, pinned-mem leak, and motion2/motion3 precision parity with CPU¶
- Status: Accepted
- Date: 2026-05-09
- Deciders: lusoris, Claude (Anthropic)
- Tags:
cuda,motion,correctness,precision
Context¶
A targeted cuda-reviewer pass on core/src/feature/cuda/integer_motion_cuda.c on 2026-05-09 surfaced four real defects that were latent only because the default golden gate exercises a single configuration (motion_fps_weight = 1.0, motion_moving_average = false):
-
Cross-stream race on the SAD accumulator.
submit_fex_cuda()issuedcuMemsetD8Asyncof the single-int64 SAD buffer ons->str, but launched thecalculate_motion_score_kernel_*kernel on the picture's stream (pic_stream). Both streams areCU_STREAM_NON_BLOCKINGand no event pair links them; the kernel'satomicAddto the same buffer is therefore ordered relative to the memset only by happenstance of single-frame cadence. The matching cousininteger_motion_v2_cuda.c:188already runs the memset onpic_stream. -
Pinned-memory leak of
s->sad_host.init_fex_cuda()allocates a single page-lockeduint64_tviavmaf_cuda_buffer_host_allocfor the D2H copy of the SAD score.close_fex_cuda()did not free it. Each init/close cycle leaked one pinned page.compute-sanitizer --tool memcheck --leak-check fullonmasterreportsLEAK SUMMARY: 8 bytes leaked in 1 allocationstraced toinit_fex_cuda → cuMemHostAlloc. -
motion2_scoreskippedmotion_fps_weight × clip. The CPU reference (integer_motion.c:563) emitsMIN(score2 * motion_fps_weight, motion_max_val)for theVMAF_integer_feature_motion2_scorerow. Both the CUDA collect path (line 468 pre-fix) and the flush path (line 359 pre-fix) emitted the rawmin(prev, cur)(and raws->scorerespectively) — bit-exact only whilemotion_fps_weight == 1.0and themotion_max_valclip never triggers.motion3_scorewas already weighted-and-clipped because themotion3_postprocess_cudahelper does it inline. -
Off-by-one in the moving-average guard.
s->frame_indexis pre-incremented incollect_fex_cuda()beforemotion3_postprocess_cudaruns. The CPU reference (integer_motion.c:523) guards withindex > minimum_past_frames_neededwhereminimum_past_frames_needed == 1for 3-frame mode — i.e. at framework-collect index 1 the guard evaluates1 > 1 = falseand the moving average is not applied. With the pre-increment the GPU helper sawframe_index == 2at that call and applied the average (2 > 1 = true), diverging from CPU at the first non-zero frame whenevermotion_moving_average=true.
In addition the kernels carry two performance advisories:
-
Bank conflict on
__shared__ float tile[20*20]. With 32-bank shared memory andTILE_W = 20,GCD(20, 32) = 4aliases consecutive rows onto the same 4-bank cycle, producing a 2-way conflict between(y=1, x=12..15)and(y=0, x=0..3). Padding the inner dimension to 21 (GCD(21, 32) = 1) eliminates the conflict at +64 bytes per block (1764 vs 1600, far under the 48 KB SM limit). -
No
__launch_bounds__directive. The motion blur+SAD kernels use 16×16 blocks (256 threads) and modest register pressure, so an explicit__launch_bounds__(256, 8)is well within reach for small-/mid-tier GPUs and lets nvcc trim register usage to keep occupancy stable across the supported gencode set.
Decision¶
We will:
- Move the SAD
cuMemsetD8Asyncfroms->strontopic_streamso it shares ordering with the kernel that consumes it; mirrors the v2 pattern (BLOCKER 1). - Free
s->sad_hostinclose_fex_cuda()and theinit_fex_cuda()error unwind viavmaf_cuda_buffer_host_free(BLOCKER 2). - Emit
MIN(score * motion_fps_weight, motion_max_val)forVMAF_integer_feature_motion2_scorein both the collect and flush paths, matchinginteger_motion.c:563line-for-line (NEEDS-CHANGES 3). - Adjust the moving-average guard in
motion3_postprocess_cuda()tos->frame_index > 2(compensating for the pre-increment) so framework- collect-index 1 skips averaging exactly as the CPU reference does (NEEDS-CHANGES 4). - Pad the
motion_score.cuandmotion_v2_score.cushared-memory tile inner dimension fromTILE_WtoTILE_W + 1and add__launch_bounds__(BLOCK_X * BLOCK_Y, 8)to all four kernels (ADVISORY 5 + 6).
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Move increment of frame_index after motion3_postprocess_cuda instead of changing the guard | More obviously parallel to the CPU code's index > 1 | Touches both the index==0 and index>0 branches, and the increment is also relied on for the index==0 zero-emit logic; harder to keep the existing semantics intact | The narrow-scope fix is to compensate at the consumer (the helper), keeping the increment site behavioural with the rest of the file |
Insert cuStreamWaitEvent linking s->str to pic_stream for the memset | Preserves the existing two-stream split | Extra event hop per frame for no engineering benefit; the v2 path already established that running the memset on pic_stream is correct | The two-stream split exists so the D2H copy can run in parallel with subsequent work; the memset itself is fast and trivially co-locates with the kernel |
| Leave the bank-conflict padding to a follow-up PR | Smaller PR | The padding cost (+64 B/block) is negligible and the fix is one line per kernel; bundling avoids a second build/CI cycle | Bundle |
Consequences¶
- Positive: motion / motion2 / motion3 are now bit-exact (places=4) with the CPU reference under all option combinations of
motion_fps_weight,motion_max_val, andmotion_moving_average— not just defaults; one page-locked allocation no longer leaks per init/close cycle; the SAD accumulator is no longer racing the kernel; the motion kernels no longer pay a 2-way bank-conflict tax on every shared-memory access. - Negative:
__shared__footprint of the motion blur kernels grows by 64 bytes per block — irrelevant at the 48 KB SM cap. - Neutral / follow-ups: the cuda-reviewer pass also flagged
core/src/cuda/common.c:388,416for an inverted stream-select condition (no live callers); deferred to a separate small PR per the agent brief. Motion3 GPU coverage (T3-15(c)) remains a separate feature track and is not expanded here.
References¶
req: cuda-reviewer 2026-05-09 brief identifying BLOCKERS 1+2, NEEDS-CHANGES 3+4, and ADVISORIES 5+6.- CPU reference:
core/src/feature/integer_motion.c:523, 563. - Existing correct pattern:
core/src/feature/cuda/integer_motion_v2_cuda.c:188. - Related: ADR-0219 (motion3 CUDA scaffold), ADR-0242 (engine-scope fence batching).