ADR-0192: GPU long-tail batch 3 - closing every remaining metric gap (motion_v2 / float_ansnr / ssimulacra2 / cambi + float twins)
- Status: Accepted
- Date: 2026-04-27
- Deciders: Lusoris, Claude (Anthropic)
- Tags: gpu, cuda, sycl, vulkan, feature-extractor, fork-local
Context
Batch 1 (ADR-0182) closed psnr / float_moment / ciede across all three backends (PR #125 → #137). Batch 2 (ADR-0188) closed float_ssim / float_ms_ssim / psnr_hvs (PR #139 → #144). The remaining matrix gaps after batch 2 fall into two groups:
Group A: metrics with no GPU twin at all
| Metric | CPU LOC | Notes |
|---|---|---|
| integer_motion_v2 | 300 | Builds directly on integer_motion (already on GPU). Extra step is a per-row absolute difference + clamped accumulation; the convolve scaffolding is shared. |
| float_ansnr | 124 | Anti-noise SNR. Per-pixel (ref - dis)² accumulator + noise = mean(dis²): two reductions, no spatial filter. The fork ships float-only; no integer variant exists upstream. |
| ssimulacra2 | 1118 | Recent fork addition (ADR-0130). XYB color transform + 4-stage IIR low-pass + per-stage ssim-style combine. Float-domain throughout. |
| cambi | 1533 | Banding detection. Multi-scale per-pixel range tracking with a sequential per-row state update (get_derivative_data_for_row); the hardest GPU port in the long tail. |
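The two-reduction shape described in the float_ansnr row can be sketched as below. This is a minimal CPU-side illustration, not the fork's kernel: the function name `two_reductions`, the `wg_size` parameter, and the sample values are hypothetical, and the final score transform is left out because the table does not define it.

```python
def two_reductions(ref, dis, wg_size=4):
    """Return (sum((ref - dis)^2), sum(dis^2)) via per-work-group partial sums."""
    diff_partials, dis_partials = [], []
    for start in range(0, len(ref), wg_size):
        r = ref[start:start + wg_size]
        d = dis[start:start + wg_size]
        # Each simulated work group reduces its own tile independently.
        diff_partials.append(sum((a - b) ** 2 for a, b in zip(r, d)))
        dis_partials.append(sum(b * b for b in d))
    # Final reduction over the per-group partials (host side in this sketch).
    return sum(diff_partials), sum(dis_partials)


# Hypothetical luma samples, chosen so every intermediate is exact in floating point.
ref = [10.0, 20.0, 30.0, 40.0, 50.0]
dis = [11.0, 19.0, 30.0, 42.0, 50.0]
sq_err, dis_energy = two_reductions(ref, dis)
noise = dis_energy / len(dis)  # mean(dis^2), as in the table row
```

Both accumulators walk the same pixels, so a single dispatch can produce both partial-sum arrays; whether the real kernels fuse them this way is an assumption.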
Group B: float twins of metrics whose int variant is on GPU
float_psnr, float_adm, float_vif, float_motion exist as CPU-only paths in the matrix while the matching integer_* variant ships on CUDA + SYCL + Vulkan. The Netflix default models use the integer path; the float variants exist for (1) reproducing Netflix's published float_* scores and (2) the cross-backend gate's float-vs-int sanity check.
This ADR commits to closing both groups in one batch, not split across two. Per direction taken in the popup decision (option 3 of the scoping question on 2026-04-27): finish the GPU long-tail in one logical chunk so the matrix has a clean "every metric has a GPU twin" terminus.
Decision
Per-metric ordering
Ship in ascending complexity, mirroring batch 2's pattern:
1. integer_motion_v2 first. Smallest (300 LOC), reuses the already-shipped integer_motion Vulkan/CUDA/SYCL convolve as a subroutine. Validates that "delta on top of an existing kernel" composes cleanly across backends before tackling new compute shapes.
2. float_ansnr second. Tiny (124 LOC), no spatial filter: pure per-pixel reduction. Same partial-sum pattern as psnr, second reduction for the noise term. Establishes the "two-parallel-reductions" idiom that cambi later reuses for its multi-scale accumulators.
3. Float twins of int metrics on GPU: float_psnr, float_motion, float_vif, float_adm. Shipped in this order, smallest first (psnr → motion → vif → adm). Each is structurally the float-domain twin of an int kernel that already exists on each backend; the kernel work is mostly translating the integer accumulator to float and the post-processing log/divide. Not aliased to the int kernels; see "Alternatives considered".
4. ssimulacra2 fourth. ~1100 LOC of XYB transform + IIR + per-stage SSIM-style compute. Uses the SSIM scaffolding from batch 2 (ADR-0189 / 0190) as a reusable subroutine for the per-stage compute, but ssimulacra2's IIR low-pass is sequential along the long axis (forward + backward pass per row). The IIR coefficients are runtime-fixed, so it is implemented as a per-row dispatch with a work-group serial scan.
5. cambi last. The range-tracking state is sequential per row; the GPU port has to re-shape the algorithm into a per-row parallel scan (Hillis-Steele or Kogge-Stone) plus the multi-scale pyramid. Highest implementation risk; landing it last means every other batch-3 metric's review is already closed before cambi review starts.
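The per-row parallel scan named for cambi can be illustrated with a Hillis-Steele inclusive scan. The sketch below simulates the GPU steps on the CPU with a double buffer; the function name and sample row are hypothetical. It assumes the per-row state update can be written as an associative operator (running max and running sum are used here as stand-ins). Whether cambi's actual update has that property is exactly what the feasibility spike would have to establish.

```python
def hillis_steele_scan(values, op):
    """Inclusive scan: out[i] = op(values[0], ..., values[i]) for associative op."""
    out = list(values)
    stride = 1
    while stride < len(out):
        # Double buffer: every simulated thread reads the previous step's values.
        prev = list(out)
        for i in range(stride, len(out)):
            # Earlier element goes first so non-commutative operators stay ordered.
            out[i] = op(prev[i - stride], prev[i])
        stride *= 2
    return out


row = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical per-pixel values for one row
running_max = hillis_steele_scan(row, max)
running_sum = hillis_steele_scan(row, lambda a, b: a + b)
```

The scan finishes in ceil(log2(n)) steps instead of n sequential updates, at the cost of more total operations. With float operands the additions are applied in a different order than a sequential CPU loop, which is one source of the looser precision contract.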
Per-backend ordering (within each metric)
Same as batches 1 + 2: Vulkan → CUDA → SYCL. Vulkan GLSL is the clean reference; CUDA + SYCL ports follow once the numerical contract is locked.
Precision contracts (measured-first per ADR-0188's pattern)
| Metric | Target | Rationale |
|---|---|---|
| integer_motion_v2 | places=4 | Integer reduction; matches the integer_motion precedent. Bit-exactness possible if convolve reuses the existing kernel. |
| float_ansnr | places=3 | Float-domain log10 final transform compresses per-pixel error. Same shape as float_psnr (which we measure first). |
| float_psnr / float_motion / float_vif / float_adm | places=3 | Float accumulators + log10/divide post-process. Looser than the integer twin's places=4 because the int kernels can keep int64 partials whereas float partials lose precision in the per-WG reduction. |
| ssimulacra2 | places=2 | Multi-stage float pipeline (XYB + IIR + SSIM-combine + log). Each stage's float rounding accumulates. May surprise upward; measure first per ADR-0188. |
| cambi | places=2 | Multi-scale + log post-process. Sequential range-tracking forces a per-row parallel scan that re-orders float adds vs the CPU. Likely the loosest contract. |
Each per-metric ADR (one per Vulkan PR, mirroring batches 1 + 2) locks the actual measured floor.
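The rationale for the looser float contracts (float partials lose precision and re-ordered adds change the result, while integer partials do not) can be reproduced in a few lines. This is a sketch under stated assumptions: binary32 accumulation is emulated on the CPU through a `struct` round-trip, the helper names are hypothetical, and the values are chosen to sit at the edge of float32's 24-bit significand rather than taken from any real frame.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate left to right, rounding to binary32 after every add."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc


partials = [16777216, 1, 1, 1, 1]  # 2**24 followed by four small partials

forward = sum_f32([float(p) for p in partials])             # large value first
reordered = sum_f32([float(p) for p in reversed(partials)])  # small values first
exact = sum(partials)                                        # integer accumulator
```

In the forward order each `+ 1` is rounded away, so the float32 result stays at 16777216.0; in the reversed order the small terms combine first and the result is 16777220.0, matching the integer sum. The same data therefore yields different float results depending on reduction order, which is why the contract is measured per metric rather than assumed.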
Chroma handling
- integer_motion_v2: luma-only (matches integer_motion).
- float_ansnr: luma-only.
- Float twins: same plane mask as the corresponding int kernel.
- ssimulacra2: needs all three planes (XYB color transform). Chroma upload via the 0x7 bitmask landed in PR #137.
- cambi: luma-only by default, optional chroma extension via the existing enc_bitdepth knob.
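A plane mask of this kind can be decoded as below. Only the fact that 0x7 selects all three planes comes from the text above; the individual bit assignment (bit 0 = Y, bit 1 = Cb, bit 2 = Cr) and every name in the sketch are assumptions made for illustration.

```python
# Assumed bit assignment; the fork's actual constants may differ.
PLANE_Y, PLANE_CB, PLANE_CR = 0x1, 0x2, 0x4


def planes_in_mask(mask):
    """Return the plane names selected by a 3-bit plane mask."""
    names = ((PLANE_Y, "y"), (PLANE_CB, "cb"), (PLANE_CR, "cr"))
    return [name for bit, name in names if mask & bit]


luma_only = planes_in_mask(0x1)   # e.g. integer_motion_v2, float_ansnr
all_planes = planes_in_mask(0x7)  # e.g. ssimulacra2
```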
Per-PR deliverables
Same six deliverables as batches 1 + 2, per ADR-0108:
- Kernel + host glue.
- New metric entry in scripts/ci/cross_backend_vif_diff.py FEATURE_METRICS (first backend's PR per metric only).
- Lavapipe lane step in tests-and-quality-gates.yml (Vulkan PRs only).
- CHANGELOG bullet + matrix update + features.md row update.
- Per-metric ADR for the Vulkan PR (ADR-0193..0199 reserved for batch 3 per-metric ADRs — ssimulacra2 and cambi will likely eat ADR slots for sub-stages too, so the actual count may drift).
- State.md row + rebase-notes entry per CLAUDE §12 r13 / r14.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Defer cambi to batch 4 | Cambi's sequential-state port is the riskiest single chunk; isolating it lets batch 3 ship faster | Splits the long-tail terminus across two batches; the matrix never reaches a "fully closed" state for the duration of the split | User direction (popup, 2026-04-27): finish the long-tail in one batch. The risk-isolation argument is valid but doesn't beat shipping a closed matrix. |
| Skip float twins (Group B), only do Group A | Halves the PR count (12 vs 21+); the int twins already serve every production model path | Breaks the cross-backend gate's float-vs-int sanity coverage; Netflix's published float_* scores can't be reproduced on GPU | Same user direction — the popup explicitly named "all remaining gaps". The cross-backend gate's value compounds with float twin coverage. |
| Alias float twins to the int kernel + a host-side post-quantize | Saves ~12 PRs of kernel work | The int kernel operates on uint8/uint16 quantized samples; the float kernel takes [0, 255] linear floats. Aliasing would either silently quantize the float input (wrong) or run two kernels back-to-back (no win) | The two kernels operate on fundamentally different domains; a thin wrapper would mis-represent one as the other. Better to ship native float kernels. |
| One mega-PR for the whole batch | One review pass | ~7 metrics × 3 backends ≈ 21 PRs' worth of code in one diff (~10k LOC). Mixed precision contracts. Bisect impossible if a regression slips. | Per-PR granularity is non-negotiable at this scale. Same answer as batches 1 + 2. |
| Lock all precision contracts at places=2 upfront | Easier to set CI thresholds; no per-metric measurement overhead | Sells precision short for the integer-reduction-friendly metrics (motion_v2, the float twins of int kernels). Already shown by batches 1 + 2 that several metrics land at places=4 empirically. | Measure first, set the contract second; same approach as ciede (ADR-0187), ssim (ADR-0189), ms_ssim (ADR-0190). |
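The "silently quantize the float input" failure mode from the aliasing row can be shown with a toy example. This is a sketch, not the fork's code: the helper names are hypothetical, and the sample values assume the float path can carry sub-integer sample values in the [0, 255] range.

```python
def sum_sq_err(ref, dis):
    """Sum of squared per-pixel differences."""
    return sum((r - d) ** 2 for r, d in zip(ref, dis))


def quantize_u8(samples):
    """Round to the nearest integer and clamp to the uint8 range."""
    return [min(255, max(0, int(round(s)))) for s in samples]


# Hypothetical float-domain samples that differ by a quarter of a code value.
ref = [100.25, 100.25, 100.25, 100.25]
dis = [100.0, 100.0, 100.0, 100.0]

native = sum_sq_err(ref, dis)                             # float-domain kernel
aliased = sum_sq_err(quantize_u8(ref), quantize_u8(dis))  # int kernel behind a wrapper
```

The native float path reports a non-zero error (0.25), while the aliased path reports zero because both inputs collapse to the same integer before the kernel sees them. A wrapper around the int kernel would therefore report a perfect match for inputs that the CPU float path scores as different.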
Consequences
- Positive: closes the GPU long-tail in one logical chunk. After this batch, every registered feature extractor in the fork has at least one GPU twin (lpips remains delegated to ORT execution providers per ADR-0022). Per-PR scope stays in the ~500-1000 LOC range — same shape as batches 1 + 2.
- Negative: largest batch yet by PR count. 7 metrics × 3 backends = 21 PRs minimum, plus 7+ per-metric ADRs. ssimulacra2 and cambi may each split into multiple sub-PRs (XYB + IIR + ssim-combine for ssimulacra2; multi-scale + scan for cambi), pushing the count toward 30. Total review surface is ~3× batch 2.
- Negative: cambi is the biggest implementation-risk chunk in the entire long-tail effort. Its get_derivative_data_for_row state update is sequential per row and pixel; the GPU port has to either (a) re-implement as a parallel scan with proven-correct algebra, or (b) fall back to a per-row dispatch with no intra-row parallelism (waste). The choice gets locked in cambi's per-metric ADR after a feasibility spike.
- Neutral / follow-ups:
  - integer_motion_v2_vulkan first (ADR-0193).
  - CUDA + SYCL twins follow per batch-2 cadence.
  - float_ansnr_vulkan next (ADR-0194).
  - Float twins (4 metrics × 3 backends = 12 PRs) ship as a middle phase; they could be parallelised across the three backends if review bandwidth allows.
  - ssimulacra2_vulkan likely splits into 2-3 PRs (XYB + IIR + compute), to be decided in its per-metric ADR after reviewing the GLSL shape.
  - cambi_vulkan last (ADR-0199 or later). Must be preceded by a feasibility spike (parallel-scan algebra for the range-tracking state).
  - Once batch 3 closes, the matrix at .workingdir2/analysis/metrics-backends-matrix.md should show every row with at least one GPU ✓★. Subsequent GPU work shifts from "long-tail" to "polish" (alternative algorithms, perf tuning, half-precision experiments).
References
- Parent: ADR-0182 — batch 1 scope.
- Sibling: ADR-0188 — batch 2 scope.
- Per-metric ADRs (batch 1 + 2 precedent): ADR-0183 ... ADR-0191.
- CPU references for batch 3: integer_motion_v2.c, float_ansnr.c, ssimulacra2.c, cambi.c.
- User direction: AskUserQuestion popup, 2026-04-27: "All remaining gaps in one batch" / "Yes, draft ADR-0192 now".
Status update 2026-05-09: T3-15 first port landed
The successor backlog item T3-15 (GPU coverage long-tail batch 4) shipped its first proof-of-concept port — CUDA psnr chroma extension (psnr_cuda now emits psnr_y / psnr_cb / psnr_cr). Cross-backend gate at places=4 clears bit-exactly (0/48 mismatches on the Netflix normal pair across all three planes). Seven follow-up kernels remain (SYCL PSNR chroma, CUDA + SYCL chroma SSIM / MS-SSIM, CUDA + SYCL cambi); see docs/research/0090-t3-15-gpu-coverage-long-tail-2026-05-09.md for the corrected gap re-audit and per-kernel ordering. The body of this ADR remains frozen per ADR-0028 — this status note is in the References section only.
Status update 2026-05-09: T3-15 #2 SYCL PSNR chroma
T3-15(b) — the SYCL twin of the CUDA PSNR chroma extension — landed in the same session. psnr_sycl now emits psnr_y / psnr_cb / psnr_cr. The implementation differs from CUDA in one structural respect: SYCL's existing shared frame buffer (vmaf_sycl_shared_frame_init) is luma-only by design (see core/src/sycl/common.h), so chroma planes ride on per-extractor device buffers populated by host-side staging copies in the combined-graph pre_fn. Luma stays graph-recorded; chroma kernels run direct in post_fn on the same in-order combined queue. Cross-backend gate (scripts/ci/cross_backend_vif_diff.py --feature psnr --backend sycl --places 4) clears bit-exactly on Intel Arc A380 (0/48 mismatches on the Netflix normal pair across all three planes, max_abs_diff = 0.0). Six follow-up kernels remain (CUDA + SYCL chroma SSIM / MS-SSIM, CUDA + SYCL cambi).
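The figures quoted in these status notes (mismatch count out of total samples, max_abs_diff) can be produced by a comparison of the following shape. This is a hedged sketch only: the function name and the tolerance rule (half a unit in the last kept decimal place) are assumptions, and scripts/ci/cross_backend_vif_diff.py may define places differently.

```python
def gate_summary(cpu_scores, gpu_scores, places):
    """Return (mismatches, total, max_abs_diff) for paired per-frame scores."""
    # Assumed tolerance: half a unit in the last kept decimal place.
    tol = 0.5 * 10 ** (-places)
    diffs = [abs(c - g) for c, g in zip(cpu_scores, gpu_scores)]
    mismatches = sum(1 for d in diffs if d > tol)
    return mismatches, len(diffs), max(diffs, default=0.0)


# Hypothetical per-frame scores; not measured data.
cpu = [30.0, 31.0, 32.5]
bit_exact = gate_summary(cpu, [30.0, 31.0, 32.5], places=4)
one_off = gate_summary(cpu, [30.0, 31.5, 32.5], places=4)
```

A bit-exact run reports zero mismatches and a max_abs_diff of 0.0, which is the strongest outcome a places=4 gate can show.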