ADR-0219: motion3 GPU coverage on Vulkan + CUDA + SYCL (3-frame window)¶

Status: Accepted
Date: 2026-04-29
Deciders: Lusoris, Claude (Opus 4.7)
Tags: gpu, vulkan, cuda, sycl, motion, feature-extractor, fork-local, t3-15c, places-4

Context¶

The CPU motion extractor (core/src/feature/integer_motion.c) emits three outputs per frame: motion_score, motion2_score, and motion3_score. The GPU twins shipped to date — motion_vulkan (ADR-0177), motion_cuda, motion_sycl — emitted only the first two; motion3_score was deliberately deferred (see the DELIBERATE: motion3_score is omitted comment block at the end of motion_vulkan.c pre-PR). Backlog item T3-15(c) (formerly T3-17) tracks closing that gap.

motion3_score has two distinct code paths in CPU:

3-frame window (default, motion_five_frame_window=false). motion3 = clip(motion_blend(motion2 * fps_weight, blend_factor, blend_offset), max_val) with optional moving-average against the previous unaveraged blended value. No new device-side state: it's a pure host-side scalar transform of motion2, which the GPU already produces.
5-frame window (motion_five_frame_window=true). Adds a second SAD pair (frame i-2 ↔ i-4), so the device needs a 5-deep blur ring buffer. No shipped VMAF model uses this mode; the option exists for ad-hoc CLI tuning.

T3-15(c)'s scope is the coverage gap, not the 5-frame mode itself. Closing path 1 across all three GPU backends takes the existing motion2 SAD output and adds the same scalar post-processing the CPU runs in extract() / flush(). Closing path 2 requires re-shaping the device-side ring buffer in 3 different kernel languages (GLSL/SPIR-V, CUDA, SYCL), and there is no test that exercises it.

NASA/JPL Power-of-10 rule 6 (declared scope) and CERT C "fail loud, fail early" both argue for accepting motion3 in the 3-frame mode and rejecting motion_five_frame_window=true with -ENOTSUP at init() rather than silently producing wrong answers.

Decision¶

We will emit VMAF_integer_feature_motion3_score from the motion_vulkan, motion_cuda, and motion_sycl extractors in 3-frame window mode (the default). The post-processing — motion_blend() + clip-to-motion_max_val + optional moving-average — runs on the host inside the existing collect() / extract() / flush() paths, mirroring the CPU extractor's integer_motion.c lines 510-560 byte-for-byte. The full options surface (motion_blend_factor, motion_blend_offset, motion_fps_weight, motion_max_val, motion_moving_average) is added to the GPU options tables; defaults match the CPU extractor.

motion_five_frame_window=true is rejected with -ENOTSUP at init() on all three GPU backends. The cross-backend parity gate (scripts/ci/cross_backend_parity_gate.py, scripts/ci/cross_backend_vif_diff.py) is extended to compare integer_motion3 at places=4 whenever a --feature motion backend pair is exercised.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
Extend existing motion kernels (chosen)	Zero device-code change; pure host scalar add-on; cross-backend numerical parity automatic because motion2 is already gated; tracks CPU algorithm exactly	The 5-frame window mode still deferred (no shipped model needs it)	Best ratio of coverage gain to risk
New standalone `motion3_*` extractor per backend	Clean separation of concerns	Triples the kernel-launch path for a metric that is a deterministic post-process of motion2; would need to recompute motion2 internally; doubles the cross-backend gate matrix	Wasted compute, no upside
5-frame ring buffer on device + full motion3 (paths 1 + 2)	Closes both gaps in one PR	3 kernels (GLSL, CUDA, SYCL) require a 2-deep → 5-deep ring rewrite, second SAD-pair dispatch, new push-constant / spec-constant geometry; no existing fixture tests `motion_five_frame_window=true`	Out-of-budget; no test gate; defer
Per-frame readback of blurred buffers + CPU-side 5-frame motion3	Avoids device-side ring expansion	4× device-host bandwidth at the only point the GPU pipeline is currently bandwidth-clean; defeats the purpose of GPU offload	Defeats GPU offload
Batched readback of N frames	Lower amortised bandwidth	Still requires reshaping the ping-pong protocol; adds latency variance; harder to reason about with the SYCL combined-graph API	More complex than the deferred device path

Consequences¶

Positive:
GPU motion now feeds the same downstream model surface as CPU motion. Tools that rely on motion3 (e.g. ad-hoc CLI tuning, research notebooks at ai/scripts/phase3_subset_sweep.py) work unchanged across --backend cuda|sycl|vulkan.
Cross-backend parity gate auto-extends: adding integer_motion3 to FEATURE_METRICS exercises every existing pair (CPU↔CUDA, CPU↔SYCL, CPU↔Vulkan) at places=4.
provided_features[] for the three GPU twins now matches CPU, closing the framework-routing gap that previously sent a --feature integer_motion3 request to the CPU path even with --backend cuda|sycl|vulkan.
Negative:
motion_five_frame_window=true is now an explicit GPU failure (-ENOTSUP) where it was previously just unsupported by virtue of motion3 absence. Callers that were relying on the silent-degradation behaviour will see a hard error.
Three additional double-typed options enter each GPU extractor's option dict, growing the per-extractor state struct by ~40 bytes. Negligible at runtime.
Neutral / follow-ups:
5-frame window mode tracked as a sub-task of T3-15(c) / motion_five_frame_window-gpu. The kernel work is non-trivial on all 3 backends; defer until a shipped model needs it.
Cross-backend gate at places=4 should hold trivially since motion3 is a deterministic scalar post-process of motion2, which already meets places=4 on all three backends. If a future regression surfaces, the root cause is in the motion2 kernel, not in the new post-processing.
The motion3_postprocess_* helper triplicates across the three extractors. A future refactor could lift it into motion_blend_tools.h as a shared inline helper alongside motion_blend(). Not done in this PR — keeping the implementation local to each extractor matches the existing "each backend owns its score-emission logic" pattern.

References¶

Upstream extractor: core/src/feature/integer_motion.c (CPU reference; lines 510-560 are the motion3 emission in extract(), lines 401-438 in flush()).
Sister GPU motion ADRs: ADR-0177 (Vulkan motion T5-1c), ADR-0193 (motion_v2 Vulkan), ADR-0145 (motion_v2 NEON).
Cross-backend gate: ADR-0125, ADR-0138, ADR-0214.
Backlog: docs/backlog-audit-2026-04-28.md row A.1.4 (Vulkan motion3) — note that the audit row only mentions Vulkan; this PR closes CUDA + SYCL in the same change since the post-processing is identical.
Prior fork close: T4-1 / Netflix#1486 (CPU motion3 already present, see ADR-0158).
Source: req — backlog row T3-15(c): "motion3 (5-frame window) GPU coverage on Vulkan + CUDA + SYCL (former T3-17; T4-1/Netflix#1486 closed CPU side)."