ADR-1108: CUDA motion_v2 twin emits motion3_v2_score

Status: Accepted
Date: 2026-06-13
Deciders: Lusoris
Tags: cuda, feature, parity

Context

The CPU reference extractor motion_v2 (core/src/feature/integer_motion_v2.c) emits three features in its EOS flush(): motion_v2_sad_score (per-frame), motion2_v2_score (the min-blend), and motion3_v2_score. The last is a per-frame motion_blend(motion2, blend_factor, blend_offset) clipped to motion_max_val, with an optional 2-tap moving average; indices below min_idx = 1 are seeded with a stamp_value. Before this change, the CUDA twin motion_v2_cuda (core/src/feature/cuda/integer_motion_v2_cuda.c) emitted only the first two: its provided_features[] listed sad + motion2_v2, and its flush() stopped after motion2_v2. Its option table carried only motion_fps_weight, so the four motion3-driving options (motion_blend_factor, motion_blend_offset, motion_max_val, motion_moving_average) were silently unavailable on the CUDA path.

ADR-0337 added the full option surface and motion3_v2_score to the CPU extractor, but its Consequences §Neutral explicitly deferred the GPU twins: "GPU twins (CUDA, SYCL, HIP, Vulkan) of motion_v2 do not need the option surface in this PR ... whether GPU twins gain the same options will be decided when each twin needs to emit motion3_v2_score." A motion_v2_cuda consumer (a CHUG re-extract, a model file carrying motion_v2=motion_blend_factor=…, or a co-scheduled CPU+CUDA parity run) that requested motion3_v2_score therefore got nothing on the CUDA path — a genuine feature-coverage gap, not a numerical one. This ADR closes the deferral for the CUDA twin.

Decision

We will make motion_v2_cuda emit VMAF_integer_feature_motion3_v2_score with the same per-frame formula, ordering, stamp_value seeding, and feature-name dictionary handling as the CPU integer_motion_v2.c::flush(). The CUDA twin's option table gains the four CPU options it consumes (motion_blend_factor, motion_blend_offset, motion_max_val, motion_moving_average), mirroring the CPU VmafOption definitions field-for-field: same names, aliases, defaults, min/max, and VMAF_OPT_FLAG_FEATURE_PARAM. The post-process is a host-side scalar loop over the SAD scores the kernel already produces, so no GPU kernel change is needed; the motion_blend helper is reused verbatim from the shared motion_blend_tools.h header (already consumed by the v1 CUDA twin integer_motion_cuda.c). motion_force_zero and motion_five_frame_window stay CPU-only: the CUDA kernel always computes the SAD, and the 5-frame window remains unsupported per ADR-0337.
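
The shape of that host-side post-process can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in integer_motion_v2.c or integer_motion_v2_cuda.c: Motion3Opts, sketch_motion_blend, and sketch_flush are hypothetical names, the piecewise-linear blend stands in for the real motion_blend from motion_blend_tools.h (whose exact formula is not reproduced in this ADR), and the moving-average rule (averaging the current clipped value with the previous frame's clipped value) is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option bundle; fields echo the four options named in this ADR. */
typedef struct {
    double blend_factor;   /* motion_blend_factor */
    double blend_offset;   /* motion_blend_offset */
    double max_val;        /* motion_max_val */
    int moving_average;    /* motion_moving_average (0 or 1) */
} Motion3Opts;

/* ASSUMPTION: stand-in for motion_blend(); scores above the offset are
 * scaled by the factor, scores at or below it pass through unchanged. */
static double sketch_motion_blend(double m, double factor, double offset)
{
    return m > offset ? offset + factor * (m - offset) : m;
}

static double min_d(double a, double b) { return a < b ? a : b; }

/* Batch post-process over n collected per-frame SAD scores.
 *   motion2[i] = min(sad[i], sad[i+1]); the last frame reuses its own SAD.
 *   motion3[i] = min(blend(motion2[i]), max_val), optionally averaged with
 *                the previous frame's clipped value (2-tap average).
 *   Indices below min_idx receive stamp_value instead. */
static void sketch_flush(const double *sad, size_t n, const Motion3Opts *o,
                         size_t min_idx, double stamp_value,
                         double *motion2, double *motion3)
{
    assert(sad && o && motion2 && motion3);
    double prev = 0.0; /* previous frame's clipped, un-averaged value */
    for (size_t i = 0; i < n; i++) {
        double next = (i + 1 < n) ? sad[i + 1] : sad[i];
        motion2[i] = min_d(sad[i], next);
        if (i < min_idx) {
            motion3[i] = stamp_value;
            continue;
        }
        double cur = min_d(
            sketch_motion_blend(motion2[i], o->blend_factor, o->blend_offset),
            o->max_val);
        /* Average only once a previously computed value exists. */
        motion3[i] = (o->moving_average && i > min_idx) ? 0.5 * (cur + prev) : cur;
        prev = cur;
    }
}
```

Because the loop reads only the SAD array that the kernel already fills, nothing in it touches device memory, which is why the decision needs no kernel change.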

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Mirror CPU `flush()` host-side over collected SAD scores (chosen) | Bit-exact by construction; no kernel change; reuses shared `motion_blend` | ~40 lines of host loop duplicated across twins | This is the established v1 precedent (`integer_motion_cuda.c`) and the lowest-risk path to bit-parity |
| Replicate `motion_blend` as a static inline in the CUDA TU | Self-contained | Duplicates a shared header that is already includable from CUDA TUs (v1 twin includes it) | Needless duplication; drift risk against the CPU source of truth |
| Promote the CPU streaming post-process (`motion3_postprocess_cuda` style from v1) | Matches v1 twin's streaming shape | v2's `flush()` is a batch loop with `stamp_value` seeding, not a per-frame streaming post-process; reshaping it invites off-by-one bugs | Mirroring v2's own `flush()` is closer to its source of truth |
| Leave the gap, document it | Zero code | Model files referencing `motion3_v2` silently lose the feature on CUDA; breaks co-scheduled CPU+CUDA parity | Defeats the purpose; the deferral was always meant to close "when each twin needs to emit motion3_v2_score" |

Consequences

Positive:
motion_v2_cuda now emits motion3_v2_score, closing the GPU-twin coverage gap that ADR-0337 deferred. Measured CPU-vs-CUDA parity on the Netflix src01_hrc00_576x324 ↔ src01_hrc01_576x324 pair (48 frames) shows a maximum absolute per-frame difference of 0.000e+00, both at default options and with motion_blend_factor=0.5 + motion_moving_average=1 (tolerance: places=4, ADR-0214).
A model file or CLI invocation carrying motion_v2_cuda=motion_blend_factor=… now loads and scores identically to the CPU path; the feature-name dict suffix (_mbf_0.5) matches between CPU and CUDA, so sfr/hfr co-scheduled naming stays consistent.
motion2_v2_score is now emitted via append_with_dict (previously the bare append), matching the CPU naming path so that renamed co-schedule columns line up across backends.
Negative:
The host-side flush() post-process is duplicated between integer_motion_v2.c and integer_motion_v2_cuda.c. ADR-0141 (touched-file lint-clean) catches formula drift on the next edit of either file; the inline comments cite the CPU source lines.
Neutral / follow-ups:
The SYCL, HIP, and Metal twins still emit only sad + motion2_v2 (the same gap). This ADR scopes the CUDA twin only; the other twins are tracked as follow-ups (docs/state.md row T-CUDA-MOTION-V2-MOTION3-MISSING notes the sibling gaps). The mirroring is mechanical once the CUDA shape is settled.
Netflix-golden gate (CPU, places=4, ADR-0024) is unaffected: motion v1 and the CPU motion_v2 are untouched; this change is additive on the CUDA path only.
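
The parity figure quoted under Positive (maximum absolute per-frame difference, judged against places=4) can be illustrated with the sketch below. The helper names max_abs_diff and within_places are hypothetical and are not taken from the repository's test harness. The tolerance rule is also an assumption: it reads places=4 as "the difference is below 0.5e-4", which approximates the rounding check used by Python-style almost-equal assertions; ADR-0214 holds the authoritative definition.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute per-frame difference between two score arrays. */
static double max_abs_diff(const double *cpu, const double *gpu, size_t n)
{
    assert(cpu && gpu);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = cpu[i] - gpu[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* ASSUMPTION: a "places" tolerance means the difference is smaller than
 * half a unit in the last kept decimal place, i.e. diff < 0.5 * 10^-places. */
static int within_places(double diff, int places)
{
    double tol = 0.5;
    for (int p = 0; p < places; p++)
        tol /= 10.0;
    return diff < tol;
}
```

A bit-exact result such as the 0.000e+00 reported above passes this check trivially; the tolerance only matters for backends that are not bit-exact by construction.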

References

Supersedes the GPU-twin deferral claim in ADR-0337 §Consequences (Neutral / follow-ups) for the CUDA twin only. ADR-0337 otherwise remains Accepted.
Sister ADRs:
ADR-0219 — GPU motion3 coverage precedent (host-side scalar post-process).
ADR-0358 — the v1 motion_cuda flush motion3 emission this twin mirrors.
ADR-0214 — the places=4 GPU-parity tolerance.
Source: req — user direction to make the CUDA motion_v2 extractor emit motion3_v2_score bit-exactly matching the CPU reference.