ADR-0345: cambi × {CUDA, SYCL, HIP} GPU port strategy¶
- Status: Accepted
- Status update 2026-05-15: CUDA + SYCL ports implemented;
core/src/feature/cuda/integer_cambi_cuda.candcore/src/feature/sycl/integer_cambi_sycl.cpppresent on master; cross-backend gate at places=4 active. HIP port deferred to a separate follow-up task. - Date: 2026-05-09
- Deciders: lusoris@pm.me, Claude (Anthropic)
- Tags: cuda, sycl, hip, gpu, cambi, fork-local, places-4
Context¶
The CAMBI banding metric ships on CPU (scalar + AVX2 + AVX-512 + NEON) and on Vulkan (per ADR-0210, shipped as PR #196 / T7-36). It does not yet ship on CUDA, SYCL, or HIP — those three backends were explicitly deferred from ADR-0192's long-tail batch 3 ("follow per-backend cadence after the Vulkan terminus lands") and again from Research-0090 §Ordering rationale ("cambi × {CUDA, SYCL} last. Highest risk by 5×. ... Defer to a separate planning round once chroma is in").
This ADR is that planning round's decision. It does not implement the ports. The implementing PRs (one per backend) follow per the Research-0091 ordered plan and each carry their own narrow per-port ADR if a new architectural decision arises at port time.
The structural blocker — calculate_c_values's sliding 1024-bin histogram per output column with sequential row-major update — is algorithmic, not hardware-specific. It pessimises CUDA / SYCL / HIP the same way it pessimises Vulkan (ADR-0205 §Decision). The host-staged Strategy II Vulkan precedent therefore generalises directly.
Decision¶
We will port cambi to CUDA, SYCL, and HIP using Strategy II host-staged hybrid — the same architecture ADR-0205 chose and ADR-0210 shipped on Vulkan.
For each of the three backends:
- The GPU services every embarrassingly-parallel pre-pass:
cambi_preprocess,cambi_derivative, the spatial-mask SAT (row-prefix + col-prefix + threshold compare),cambi_decimate, and the separable 3-tap mode filter. - The
calculate_c_valuessliding-histogram pass and the top-K spatial pooling stay on the host, executing the unmodified CPU code (vmaf_cambi_calculate_c_values/vmaf_cambi_spatial_poolingfromcambi_internal.h) against GPU-produced byte-identical buffers. - The cross-backend gate runs at
places=4from day one — same contract as every other GPU twin in the fork. Per ADR-0205 §Precision contract this is by-construction (all GPU pre-passes are integer-only; readback is byte-identical; host residual runs unmodified CPU code).
Implementation ordering: CUDA first, SYCL second, HIP third, per Research-0091 §6.
LOC envelopes (anchored against in-tree comparables — Vulkan cambi 1916 LOC; CUDA integer_adm 2700 LOC; SYCL integer_adm 1663 LOC; HIP integer_psnr ≈600 LOC):
- CUDA ≈ 1100 LOC, risk LOW.
- SYCL ≈ 1300 LOC, risk MEDIUM (single-TU style + dual toolchain matrix per ADR-0335).
- HIP ≈ 1100 LOC, risk MEDIUM-LOW (most-complex HIP consumer to date; structural mirror of CUDA via hipify-perl).
Each per-port PR ships its own ADR-0108 six-deliverables set. The per-port ADR cites this ADR + Research-0091 as design parents; if no new architectural decision arises at port time, the per-port ADR's ## Alternatives considered is "no alternatives — Strategy II inherited from ADR-0345".
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Strategy II hybrid for all three (chosen) | Inherits ADR-0205 / ADR-0210 precedent verbatim; bit-exact w.r.t. CPU; closes the GPU coverage matrix; modest LOC; reuses cambi_internal.h host residual that already shipped with the Vulkan port. | C-values phase doesn't accelerate on any of the three backends; PCIe / HSA shuttle per scale (≤50 MiB at 4K). | Selected — same risk/reward calculation as ADR-0205, applied to three more backends with no new structural argument. |
Strategy III fully-on-GPU calculate_c_values for the new ports only | Maximal GPU utilisation on the new ports; might justify the LOC investment if profile data on RDNA / Ada / Xe shows good cache behaviour at the per-pixel 65×65 scan. | Speculative — no profile data on per-output-pixel 4225-read cache-hit rate exists yet; would create three different cambi GPU architectures (Vulkan = II, CUDA/SYCL/HIP = III) and complicate the cross-backend gate; per ADR-0205 §Precision investigation Strategy III needs its own per-backend precision audit because the histogram-update order differs. | Premature optimisation. Strategy III remains parked per ADR-0205 §Out of scope as a focused v2 PR with its own ADR + profile data. Mixing strategies across backends would multiply the maintenance surface for marginal gain. |
| Single ADR-0NNN per backend at decision time, no umbrella ADR | Fine-grained ADR per port; each port ADR is self-contained. | Creates three near-identical ADRs that all repeat the Strategy II argument from ADR-0205; per ADR-0028 ADRs document non-trivial decisions and three repeats of the same decision are noise; per CLAUDE.md §12 r8 "non-trivial = another engineer could reasonably have chosen differently" — at this point a different engineer cannot reasonably have chosen Strategy III given the Vulkan precedent. | This umbrella ADR locks the strategy once; per-port ADRs become near-mechanical and only fire if a port-time issue forces a new decision. Mirrors the way ADR-0192 acted as the umbrella for the long-tail batch 3 metrics. |
| Defer all three indefinitely | Zero implementation risk for this round. | Leaves three permanent gaps in the GPU coverage matrix; CAMBI is a registered feature extractor that real users invoke (not an experimental metric); the Vulkan terminus is closed but the matrix is not. | Rejected — closing the matrix on the remaining three backends is the explicit Research-0090 §Ordering rationale follow-up. |
| Different strategies per backend (II for CUDA/SYCL, III for HIP) | Could exploit the fact HIP just landed (PR #499) to start clean with the more ambitious architecture. | No rationale supports the asymmetry; cache-hit-rate data is missing on RDNA the same as on Ada / Xe; HIP is the most-junior backend and adding novel-architecture risk on top of post-bring-up risk concentrates failure modes. | Rejected — uniform Strategy II minimises cross-backend variance and lets the existing cambi cross-backend gate cover all three with one line of cross_backend_vif_diff.py config. |
Consequences¶
- Positive: closes the GPU coverage matrix on CAMBI for the three remaining backends. Combined with the Vulkan terminus from ADR-0210, every backend in the fork (CPU + Vulkan + CUDA + SYCL + HIP, plus the in-flight Metal scaffold from PR #509) has a path to a CAMBI twin.
- Positive: preserves the canonical
places=4cross-backend gate on cambi for all three backends from day one — no per-backend tolerance carve-outs. - Positive: the existing
cambi_internal.h(shipped with PR #196 for the Vulkan port) is reused as-is. The host residual surface is one shared header; the three new GPU twins each link against it. - Positive: lowers the cost of any future Strategy III v2 PR. When profile data eventually lands, Strategy III can be applied to one backend at a time (likely starting with whichever vendor shows the best cache-hit data) without disturbing the other ports.
- Negative: cambi GPU performance in v1 is bottlenecked by the CPU c-values phase on all three backends. At 4K this remains ~250 ms/frame end-to-end. Real users who care about cambi throughput need to wait for v2 (Strategy III) or stick with CPU.
- Negative: adds three new maintenance surfaces that mirror the CPU
calculate_c_valueshost residual. Mitigated by the sharedcambi_internal.h— any CPU-side change to the c-value formula flows through to all four GPU twins (Vulkan + CUDA + SYCL + HIP) via the header. - Neutral / follow-ups:
- Per-backend port PR cadence per Research-0091 §6: CUDA → SYCL → HIP, one PR each.
- Each port PR registers a
cambi_<backend>row inscripts/ci/cross_backend_vif_diff.py::FEATURE_METRICSand adds a smoke-fixture lane to the relevant CI workflow. - SYCL port runs through both oneAPI and AdaptiveCpp toolchains per ADR-0335 — the SYCL port PR adds smoke fixtures for both, not just oneAPI (Research-0091 §9).
- HIP port is seeded from the CUDA port via
hipify-perl; the HIP port PR documents the deltas inline, no new architectural decision expected. - v2 Strategy III ADR (likely
ADR-0NNN-cambi-strategy-iii) is a focused follow-up with its own profile data and per-backend precision audit. Out of scope for this round.
Precision investigation¶
Not applicable for this planning round — the precision argument is inherited verbatim from ADR-0205 §Precision contract and ADR-0210 §Precision investigation:
- All Strategy II GPU pre-passes are integer arithmetic (no float rounding).
- Readback into the per-scale
(image, mask)VmafPicturepair is byte-identical to what the CPU code path would have written in-place. - The host residual runs the unmodified CPU c-values code on those buffers; output is bit-identical.
Therefore each per-port PR is expected to land at ULP=0 / max_abs_diff=0 against the CPU on the smoke fixture, comfortably under places=4. Per memory feedback_no_test_weakening: if a port misses places=4 empirically, fix the kernel — never relax the gate.
The one place where empirical confirmation is genuinely required is the SYCL port's AdaptiveCpp lane (Research-0091 §9), because AdaptiveCpp's sycl::group_reduce lowering on AMD differs from oneAPI's. The SYCL port PR carries that empirical step.
References¶
- Companion digest: Research-0091 — full per-backend Strategy / LOC / risk / ordering analysis.
- Predecessor ADR: ADR-0205 — feasibility spike, Strategy II decision for Vulkan.
- Predecessor ADR: ADR-0210 — Vulkan v1 integration that landed as PR #196 / T7-36.
- Predecessor digest: Research-0020 — cross-strategy comparison; bandwidth analysis; literature survey.
- Predecessor digest: Research-0032 — Vulkan integration-time trade-offs.
- Predecessor digest: Research-0090 §Ordering rationale — deferred this round.
- Parent ADR: ADR-0192 — GPU long-tail batch 3 scope; explicitly conditioned cambi × {CUDA, SYCL, HIP} on a separate planning round.
- Sibling ADR: ADR-0212 — HIP backend scaffold.
- Sibling ADR: ADR-0241 — HIP consumer cadence.
- Sibling ADR: ADR-0335 (in-flight, PR #498) — dual SYCL toolchain scope.
- Source:
req— PR #520 §Ordering rationale, "Defer to a separate planning round once chroma is in — the cambi backlog is large enough to warrant its own ADR(s)." - User direction: standing CLAUDE.md §12 r8 (ADRs document non- trivial decisions); §12 r10/r11 (every fork-local PR ships the six deep-dive deliverables; doc-substance rule applies); §12 r4 (
make lintbefore push); memoryfeedback_no_guessing(LOC + risk estimates cite in-tree comparables); memoryfeedback_no_test_weakening(places=4 is non-negotiable).