
ADR-0193: motion_v2 Vulkan kernel — single-dispatch SAD via convolution linearity

  • Status: Accepted
  • Date: 2026-04-27
  • Deciders: Lusoris, Claude (Anthropic)
  • Tags: vulkan, gpu, feature-extractor, fork-local, bit-exact

Context

ADR-0192 scopes batch 3 with motion_v2 as the first metric. It has the smallest CPU reference (300 LOC) and builds directly on the already-shipped motion_vulkan (ADR-0177): same 5-tap separable Gaussian filter, same int64 partial-sum reduction, same VkSpecializationInfo shape.

The CPU reference is integer_motion_v2.c, a stateless variant of integer_motion: instead of storing blurred ref frames between extract calls, it exploits convolution linearity:

SAD(blur(prev), blur(cur)) == sum(|blur(prev - cur)|)

so each frame computes its score in one V→H separable convolve over (prev_ref - cur_ref), accumulating |h| directly into the SAD instead of writing blurred frames to a side buffer.
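The identity can be checked numerically with a minimal 1-D sketch. It uses the unrounded filter, because linearity is exact only before any round-and-shift step is applied. The tap values other than the centre tap 26386 are an assumption (this ADR quotes only the centre tap), and the function names and the clamped border policy are hypothetical, chosen for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed 5-tap kernel; only the centre tap (26386) is quoted in this ADR. */
static const int64_t FILTER[5] = { 3571, 16004, 26386, 16004, 3571 };

/* Unrounded 1-D blur of x at position i. Indices are clamped at the borders;
 * the border policy does not affect linearity, so the simplest one is used. */
static int64_t blur_at(const int32_t *x, int n, int i)
{
    assert(n > 0 && i >= 0 && i < n);
    int64_t acc = 0;
    for (int k = 0; k < 5; k++) {
        int j = i + k - 2;
        if (j < 0) j = 0;
        if (j >= n) j = n - 1;
        acc += FILTER[k] * (int64_t)x[j];
    }
    return acc;
}

/* Left-hand side: blur both signals, then take the SAD. */
int64_t sad_of_blurs(const int32_t *prev, const int32_t *cur, int n)
{
    int64_t sad = 0;
    for (int i = 0; i < n; i++)
        sad += llabs(blur_at(prev, n, i) - blur_at(cur, n, i));
    return sad;
}

/* Right-hand side: blur the difference signal once, then sum |.|. */
int64_t sad_of_blurred_diff(const int32_t *prev, const int32_t *cur, int n)
{
    int32_t *diff = malloc((size_t)n * sizeof *diff);
    assert(diff != NULL);
    for (int i = 0; i < n; i++)
        diff[i] = prev[i] - cur[i];
    int64_t sad = 0;
    for (int i = 0; i < n; i++)
        sad += llabs(blur_at(diff, n, i));
    free(diff);
    return sad;
}
```

Both functions return the same value for any pair of inputs, which is what lets the stateless variant skip the stored blurred frame.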

Decision

Ship motion_v2_vulkan as a single-dispatch GLSL compute kernel with GPU-side raw-pixel ping-pong:

Single dispatch per frame (after frame 0)

The kernel consumes both prev and cur ref pixels via two read-only SSBOs and emits per-WG int64 SAD partials. Frame 0 is short-circuited host-side without a dispatch (CPU emits 0.0 too). No blurred output buffer — the diff is filtered and reduced inline.

GPU-side raw-pixel ping-pong

Two SSBOs, ref_buf[2], hold the two most recent ref planes. Each frame uploads only the current ref into ref_buf[cur_ref_idx], then binds ref_buf[1 - cur_ref_idx] as "prev". This halves the per-frame upload bandwidth compared with using the framework's fex->prev_ref, which would require uploading both this frame's ref and the previous frame's ref on every call.
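The index bookkeeping can be sketched on the host side as follows. This is an illustration only: plain byte arrays stand in for the two SSBOs, and the PingPong struct, MAX_PIXELS, pingpong_init and pingpong_submit are hypothetical names, not the kernel's actual host glue.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PIXELS 16  /* tiny plane size, purely for illustration */

/* Hypothetical host-side state; plain arrays stand in for the two SSBOs. */
typedef struct {
    unsigned char ref_buf[2][MAX_PIXELS];
    int cur_ref_idx;   /* slot that receives the current frame */
    int frames_seen;   /* number of frames submitted so far */
} PingPong;

void pingpong_init(PingPong *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof *s);
}

/* Upload only the current ref plane. Returns the slot holding "prev",
 * or -1 on frame 0, when no previous plane exists yet. */
int pingpong_submit(PingPong *s, const unsigned char *ref, size_t n)
{
    assert(s != NULL && ref != NULL && n <= MAX_PIXELS);
    /* Flip first, so the slot written by the last call becomes "prev". */
    if (s->frames_seen > 0)
        s->cur_ref_idx = 1 - s->cur_ref_idx;
    memcpy(s->ref_buf[s->cur_ref_idx], ref, n);
    int prev_idx = (s->frames_seen == 0) ? -1 : 1 - s->cur_ref_idx;
    s->frames_seen++;
    return prev_idx;
}
```

Each call copies exactly one plane, so the upload cost per frame stays at one ref plane regardless of how many frames have been processed.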

Mirror padding — diverges from motion.comp

Correction (ADR-0662, 2026-05-20): the literal below was stale. Current CPU integer_motion_v2.c::mirror uses 2 * size - idx - 2, and the CUDA / SYCL / Vulkan twins must match that reflect-101 formula. The old -1 value produced the lavapipe drift fixed by ADR-0662.

CPU integer_motion_v2.c::mirror as originally documented here:

if (idx >= size) return 2 * size - idx - 1;   /* edge replication */

CPU integer_motion.c::edge_8 / edge_16:

if (i_tap >= height) i_tap = 2 * (height - 1) - i_tap;  /* no edge replication */

These differ by one at the right/bottom boundary. motion.comp implements the second formula (matches its CPU twin); this kernel must implement the first one. Catching this offset cost the only debug round in bring-up — initial implementation reused motion.comp's dev_mirror verbatim and produced max_abs_diff = 2.62e-3. The fix is one line in dev_mirror and gets max_abs_diff to 0.0.
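The one-sample offset between the two quoted literals is easy to see when they are evaluated side by side. The sketch below covers only the upper-boundary branch quoted above; the function names are hypothetical. Note that 2 * (size - 1) - idx is algebraically identical to the 2 * size - idx - 2 form named in the ADR-0662 correction.

```c
#include <assert.h>

/* Reflect with the edge sample repeated: ... s-2, s-1 | s-1, s-2 ...
 * This is the "- 1" literal as originally documented. */
int mirror_edge_repeat(int idx, int size)
{
    assert(size > 0);
    if (idx >= size) return 2 * size - idx - 1;
    return idx;
}

/* Reflect-101, edge sample not repeated: ... s-2, s-1 | s-2, s-3 ...
 * Same value as 2 * size - idx - 2. */
int mirror_reflect101(int idx, int size)
{
    assert(size > 1);
    if (idx >= size) return 2 * (size - 1) - idx;
    return idx;
}
```

For size = 8, the first out-of-range index 8 maps to 7 under the edge-repeating formula and to 6 under reflect-101, so every tap that crosses the right or bottom boundary reads a neighbouring pixel instead of the intended one.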

Precision contract: bit-exact (places=4 in the gate)

CPU computes the full pipeline in integer arithmetic (int32 / int64 accumulators with explicit round-and-shift). The shader matches the same arithmetic shape:

  • Vertical accum: int64, then (acc + (1 << (bpc-1))) >> bpc → int32 cell in s_vert.
  • Horizontal accum: int64, then (h + 32768) >> 16 → int64.
  • abs(blurred) → per-thread int64 contribution.
  • subgroupAdd → per-WG slot in the int64 SAD SSBO.
  • Host-side: int64 sum across slots, / 256.0 / (W·H) → final motion_v2_sad_score.
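The arithmetic shape in the list above can be sketched as a scalar C reference. This is not the shader or the CPU source: the subgroupAdd and per-WG slots are replaced by a plain running sum, the taps other than the centre tap 26386 are assumed, the idx < 0 branch of the mirror is an assumption, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed taps (sum 65536); only the centre tap 26386 is quoted in this ADR. */
static const int64_t FILTER[5] = { 3571, 16004, 26386, 16004, 3571 };

/* Reflect-101 border handling, per the ADR-0662 correction. The idx < 0
 * branch is an assumption; only the upper branch is quoted in the text. */
static int mirror(int idx, int size)
{
    if (idx < 0) return -idx;
    if (idx >= size) return 2 * size - idx - 2;
    return idx;
}

/* Vertical pass on (prev - cur) at (row, col): int64 accumulate, then
 * round-and-shift by bpc into an int32 cell. */
static int32_t vert_cell(const uint16_t *prev, const uint16_t *cur,
                         int w, int h, int bpc, int row, int col)
{
    int64_t acc = 0;
    for (int k = 0; k < 5; k++) {
        int r = mirror(row + k - 2, h);
        acc += FILTER[k] * ((int64_t)prev[r * w + col] - (int64_t)cur[r * w + col]);
    }
    /* Relies on arithmetic right shift for negative values. */
    return (int32_t)((acc + ((int64_t)1 << (bpc - 1))) >> bpc);
}

/* Whole-frame score: horizontal pass over the vertical cells, |.| per pixel,
 * int64 sum, then the host-side normalisation / 256.0 / (W*H). */
double motion_v2_sad_sketch(const uint16_t *prev, const uint16_t *cur,
                            int w, int h, int bpc)
{
    assert(w >= 3 && h >= 3 && bpc >= 8 && bpc <= 16);
    int64_t sad = 0;
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j++) {
            int64_t acc = 0;
            for (int k = 0; k < 5; k++) {
                int c = mirror(j + k - 2, w);
                acc += FILTER[k] * (int64_t)vert_cell(prev, cur, w, h, bpc, i, c);
            }
            int64_t blurred = (acc + 32768) >> 16;
            sad += llabs(blurred);
        }
    }
    return (double)sad / 256.0 / ((double)w * (double)h);
}
```

With these assumed taps, a uniform 8-bit difference of 3 between the two frames yields a score of exactly 3.0, and identical frames yield 0.0, which is a convenient sanity check for the shift amounts.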

int64 is pinned by GL_EXT_shader_explicit_arithmetic_types_int64. The vertical accumulator could fit int32 for bpc <= 12 but overflows at bpc=16 (filter[k]=26386 × diff range ±65535 × 5 taps ≈ 8.6e9 > INT32_MAX), so the kernel uses int64 throughout for uniformity — the perf cost on Arc / Mesa is negligible (the kernel is bandwidth-bound on the input load).
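The overflow estimate above can be reproduced with a few lines of C. The sketch uses the same conservative bound as the text (centre tap times maximum absolute difference times 5 taps); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Conservative bound from the text: centre tap x max |diff| x 5 taps. */
int64_t vertical_acc_bound(int bpc)
{
    assert(bpc >= 1 && bpc <= 16);
    const int64_t centre_tap = 26386;
    const int64_t max_abs_diff = ((int64_t)1 << bpc) - 1;
    return centre_tap * max_abs_diff * 5;
}

/* Nonzero when the bound fits a signed 32-bit accumulator. */
int fits_int32(int64_t bound)
{
    return bound <= INT32_MAX;
}
```

At bpc = 16 the bound is 8646032550, which exceeds INT32_MAX (2147483647); at bpc = 12 it is 540253350, which fits.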

Empirical verification (Intel Arc A380 + Mesa anv driver, Netflix normal pair 576×324):

| Bit depth | Frames | motion_v2_sad_score max_abs_diff | motion2_v2_score max_abs_diff |
|---|---|---|---|
| 8-bit | 48 | 0.0e+00 | 0.0e+00 |
| 10-bit | 3 | 0.0e+00 | 0.0e+00 |

The places=4 gate threshold (5e-5) is far looser than the measured delta, which is exactly 0.0 in every run above; the contract could be raised to places=17 if a use case ever demands it. Setting it at places=4 keeps it consistent with the rest of the GPU long-tail.

motion2_v2_score emitted host-side in flush()

The motion2_v2_score = min(score[i], score[i+1]) post-process runs on collected feature scores after the per-frame pipeline finishes — same shape as CPU integer_motion_v2.c::flush. No GPU work needed; the kernel only emits motion_v2_sad_score.
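A minimal sketch of that host-side post-process is shown below. The function name is hypothetical, and the handling of the last frame (which has no successor) is an assumption: here it simply keeps its own score.

```c
#include <assert.h>
#include <stddef.h>

/* motion2[i] = min(score[i], score[i+1]). The last frame has no successor,
 * so this sketch assumes it keeps its own score. */
void motion2_from_sad(const double *score, double *motion2, size_t n)
{
    assert(score != NULL && motion2 != NULL);
    for (size_t i = 0; i < n; i++) {
        double next = (i + 1 < n) ? score[i + 1] : score[i];
        motion2[i] = (next < score[i]) ? next : score[i];
    }
}
```

For the scores {0, 5, 3, 7} this produces {0, 3, 3, 7}. Because it needs score[i+1], it can only run once the following frame's score has been collected, which is why it sits in flush() rather than in the per-frame path.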

Alternatives considered

| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| 2-dispatch (V then H, intermediate buffer) | Cleaner halo handling: each pass is embarrassingly parallel within a row | Needs an int32 intermediate buffer of size W·H (≈730 KB at 576×324, ≈8.3 MB at 1080p). Twice the dispatch cost. | The single-dispatch tile-with-halo design from motion.comp is already reviewed and optimal at the WG sizes we use; reusing it 1:1 (modulo the diff load and dropped output) is the smallest possible delta on top of motion. |
| Use framework's fex->prev_ref and re-upload prev each frame | Matches CPU code shape exactly; no GPU-side ping-pong state | 2× upload bandwidth per frame (prev + cur) vs ping-pong's 1× (cur only). Negligible at 576×324, noticeable at 4K. | Ping-pong is the same pattern motion_vulkan already uses for blurred frames: symmetric scaffolding, lower bandwidth. |
| Spec-constant for bpc instead of push constant | Compiler can specialise the >> bpc shift; one less push-const per dispatch | Already a spec constant via BPC (constant_id = 2). Push constants only carry runtime params (width/height/wg_count). | This is what we do; flagged for the audit trail. |
| Drop int64 for bpc <= 12 | int32 accumulators free up a register lane; small perf win on integrated GPUs | Two pipeline variants to maintain (low-bpc vs high-bpc); kernel is bandwidth-bound anyway, no measurable speedup. | YAGNI; the uniform int64 path costs nothing measurable and keeps the shader simple. |

Consequences

  • Positive: bit-exact match to CPU on both 8-bit and 10-bit paths. Strongest possible precision contract — the gate runs at places=4 for parity with the rest of the long-tail but the actual floor is exact.
  • Positive: ~280 LOC kernel + ~280 LOC host glue. Smallest fork-local kernel PR in batch 3, validates the "delta on top of motion" thesis from ADR-0192's per-metric ordering.
  • Negative: the original mirror divergence warning became stale after the CPU reference settled on the reflect-101 2 * size - idx - 2 formula. ADR-0662 corrects the Vulkan / CUDA / SYCL twins and keeps lavapipe parity gated.
  • Neutral: motion3 is irrelevant for motion_v2 (no 5-frame window mode in the v2 algorithm); no equivalent of motion.comp's scope exclusion needed.

References