ADR-0219: motion3 GPU coverage on Vulkan + CUDA + SYCL (3-frame window)
- Status: Accepted
- Date: 2026-04-29
- Deciders: Lusoris, Claude (Opus 4.7)
- Tags: gpu, vulkan, cuda, sycl, motion, feature-extractor, fork-local, t3-15c, places-4
Context
The CPU motion extractor (core/src/feature/integer_motion.c) emits three outputs per frame: motion_score, motion2_score, and motion3_score. The GPU twins shipped to date — motion_vulkan (ADR-0177), motion_cuda, motion_sycl — emitted only the first two; motion3_score was deliberately deferred (see the DELIBERATE: motion3_score is omitted comment block at the end of motion_vulkan.c pre-PR). Backlog item T3-15(c) (formerly T3-17) tracks closing that gap.
motion3_score has two distinct code paths in the CPU extractor:
- 3-frame window (default, motion_five_frame_window=false). motion3 = clip(motion_blend(motion2 * fps_weight, blend_factor, blend_offset), max_val) with optional moving-average against the previous unaveraged blended value. No new device-side state: it's a pure host-side scalar transform of motion2, which the GPU already produces.
- 5-frame window (motion_five_frame_window=true). Adds a second SAD pair (frame i-2 ↔ i-4), so the device needs a 5-deep blur ring buffer. No shipped VMAF model uses this mode; the option exists for ad-hoc CLI tuning.
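The 3-frame path described above can be sketched as a small host-side helper. This is a minimal sketch, not the extractor's code: the struct, the sketch_* names, and in particular the piecewise-linear blend are assumptions made for illustration (the real motion_blend() may compute something different). Applying the clip before the moving average is also an assumption; only the overall sequence of scale, blend, clip, and optional average follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-extractor state; field names mirror the option
 * names in this ADR but the layout is an assumption. */
typedef struct {
    double blend_factor;
    double blend_offset;
    double fps_weight;
    double max_val;
    bool moving_average;
    bool has_prev;       /* false until the first frame has been seen */
    double prev_blended; /* previous clipped, unaveraged value */
} Motion3Sketch;

/* Stand-in for motion_blend(): ASSUMED piecewise-linear form that
 * leaves scores at or below the offset untouched and scales the
 * excess above it. */
static double sketch_blend(double score, double factor, double offset)
{
    if (score <= offset)
        return score;
    return offset + (score - offset) * factor;
}

/* Scalar transform of motion2 into motion3 (3-frame window only). */
static double sketch_motion3(Motion3Sketch *s, double motion2)
{
    assert(s != NULL);

    double blended = sketch_blend(motion2 * s->fps_weight,
                                  s->blend_factor, s->blend_offset);
    if (blended > s->max_val)
        blended = s->max_val; /* clip to max_val */

    double out = blended;
    if (s->moving_average && s->has_prev)
        out = 0.5 * (blended + s->prev_blended);

    /* Remember the unaveraged value for the next frame. */
    s->prev_blended = blended;
    s->has_prev = true;
    return out;
}
```

Because the helper touches only scalars, it needs no device state: a backend would call it once per frame on the motion2 value it has already read back.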
T3-15(c)'s scope is the coverage gap, not the 5-frame mode itself. Closing path 1 across all three GPU backends takes the existing motion2 SAD output and adds the same scalar post-processing the CPU runs in extract() / flush(). Closing path 2 requires re-shaping the device-side ring buffer in 3 different kernel languages (GLSL/SPIR-V, CUDA, SYCL), and there is no test that exercises it.
NASA/JPL Power-of-10 rule 6 (declared scope) and CERT C "fail loud, fail early" both argue for accepting motion3 in the 3-frame mode and rejecting motion_five_frame_window=true with -ENOTSUP at init() rather than silently producing wrong answers.
Decision
We will emit VMAF_integer_feature_motion3_score from the motion_vulkan, motion_cuda, and motion_sycl extractors in 3-frame window mode (the default). The post-processing — motion_blend() + clip-to-motion_max_val + optional moving-average — runs on the host inside the existing collect() / extract() / flush() paths, mirroring the CPU extractor's integer_motion.c lines 510-560 byte-for-byte. The full options surface (motion_blend_factor, motion_blend_offset, motion_fps_weight, motion_max_val, motion_moving_average) is added to the GPU options tables; defaults match the CPU extractor.
motion_five_frame_window=true is rejected with -ENOTSUP at init() on all three GPU backends. The cross-backend parity gate (scripts/ci/cross_backend_parity_gate.py, scripts/ci/cross_backend_vif_diff.py) is extended to compare integer_motion3 at places=4 whenever a --feature motion backend pair is exercised.
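The init-time rejection could look roughly like the following. It is a sketch under stated assumptions: SketchMotionOpts and sketch_motion_gpu_init() are hypothetical names, and the real init() callbacks take the framework's own arguments and perform backend setup that is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the GPU extractor options. */
typedef struct {
    bool motion_five_frame_window;
} SketchMotionOpts;

/* Fail at init() rather than emit a motion3 value computed with the
 * wrong window size. Returns 0 on success, -ENOTSUP otherwise. */
static int sketch_motion_gpu_init(const SketchMotionOpts *opts)
{
    assert(opts != NULL);

    if (opts->motion_five_frame_window)
        return -ENOTSUP; /* 5-frame window is not implemented on GPU */

    /* Backend-specific setup (buffers, pipelines, streams) would
     * follow here in a real extractor. */
    return 0;
}
```

Rejecting the option before any device resources are allocated keeps the error path free of cleanup work.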
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Extend existing motion kernels (chosen) | Zero device-code change; pure host scalar add-on; cross-backend numerical parity automatic because motion2 is already gated; tracks CPU algorithm exactly | The 5-frame window mode still deferred (no shipped model needs it) | Best ratio of coverage gain to risk |
| New standalone motion3_* extractor per backend | Clean separation of concerns | Triples the kernel-launch path for a metric that is a deterministic post-process of motion2; would need to recompute motion2 internally; doubles the cross-backend gate matrix | Wasted compute, no upside |
| 5-frame ring buffer on device + full motion3 (paths 1 + 2) | Closes both gaps in one PR | 3 kernels (GLSL, CUDA, SYCL) require a 2-deep → 5-deep ring rewrite, second SAD-pair dispatch, new push-constant / spec-constant geometry; no existing fixture tests motion_five_frame_window=true | Out-of-budget; no test gate; defer |
| Per-frame readback of blurred buffers + CPU-side 5-frame motion3 | Avoids device-side ring expansion | 4× device-host bandwidth at the only point the GPU pipeline is currently bandwidth-clean; defeats the purpose of GPU offload | Defeats GPU offload |
| Batched readback of N frames | Lower amortised bandwidth | Still requires reshaping the ping-pong protocol; adds latency variance; harder to reason about with the SYCL combined-graph API | More complex than the deferred device path |
Consequences
- Positive:
  - GPU motion now feeds the same downstream model surface as CPU motion. Tools that rely on motion3 (e.g. ad-hoc CLI tuning, research notebooks at ai/scripts/phase3_subset_sweep.py) work unchanged across --backend cuda|sycl|vulkan.
  - Cross-backend parity gate auto-extends: adding integer_motion3 to FEATURE_METRICS exercises every existing pair (CPU↔CUDA, CPU↔SYCL, CPU↔Vulkan) at places=4.
  - provided_features[] for the three GPU twins now matches CPU, closing the framework-routing gap that previously sent a --feature integer_motion3 request to the CPU path even with --backend cuda|sycl|vulkan.
- Negative:
  - motion_five_frame_window=true is now an explicit GPU failure (-ENOTSUP) where it was previously just unsupported by virtue of motion3 absence. Callers that were relying on the silent-degradation behaviour will see a hard error.
  - Three additional double-typed options enter each GPU extractor's option dict, growing the per-extractor state struct by ~40 bytes. Negligible at runtime.
- Neutral / follow-ups:
  - 5-frame window mode tracked as a sub-task of T3-15(c) / motion_five_frame_window-gpu. The kernel work is non-trivial on all 3 backends; defer until a shipped model needs it.
  - Cross-backend gate at places=4 should hold trivially since motion3 is a deterministic scalar post-process of motion2, which already meets places=4 on all three backends. If a future regression surfaces, the root cause is in the motion2 kernel, not in the new post-processing.
  - The motion3_postprocess_* helper triplicates across the three extractors. A future refactor could lift it into motion_blend_tools.h as a shared inline helper alongside motion_blend(). Not done in this PR: keeping the implementation local to each extractor matches the existing "each backend owns its score-emission logic" pattern.
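As a rough illustration of what a places=4 comparison checks, the sketch below assumes that "places=N" means two scores differ by less than 0.5 × 10^-N, in the style of Python's unittest assertAlmostEqual. The gate scripts may define the tolerance differently, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASSUMED semantics of places=N: |a - b| < 0.5 * 10^-N. */
static bool sketch_agree_at_places(double a, double b, int places)
{
    assert(places >= 0);

    double tol = 0.5;
    for (int i = 0; i < places; i++)
        tol /= 10.0;

    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;
    return diff < tol;
}

/* Count the frames whose CPU and GPU scores disagree at the given
 * number of places; 0 means the backend pair passes. */
static size_t sketch_count_mismatches(const double *cpu, const double *gpu,
                                      size_t n, int places)
{
    assert(cpu != NULL && gpu != NULL);

    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (!sketch_agree_at_places(cpu[i], gpu[i], places))
            bad++;
    }
    return bad;
}
```

Under this reading, a per-frame motion3 difference of 1e-5 between CPU and a GPU backend passes the gate, while a difference of 1e-3 fails it.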
References
- Upstream extractor: core/src/feature/integer_motion.c (CPU reference; lines 510-560 are the motion3 emission in extract(), lines 401-438 in flush()).
- Sister GPU motion ADRs: ADR-0177 (Vulkan motion T5-1c), ADR-0193 (motion_v2 Vulkan), ADR-0145 (motion_v2 NEON).
- Cross-backend gate: ADR-0125, ADR-0138, ADR-0214.
- Backlog: docs/backlog-audit-2026-04-28.md row A.1.4 (Vulkan motion3). Note that the audit row only mentions Vulkan; this PR closes CUDA + SYCL in the same change since the post-processing is identical.
- Prior fork close: T4-1 / Netflix#1486 (CPU motion3 already present, see ADR-0158).
- Source: req, backlog row T3-15(c): "motion3 (5-frame window) GPU coverage on Vulkan + CUDA + SYCL (former T3-17; T4-1/Netflix#1486 closed CPU side)."