ADR-0205: cambi GPU feasibility spike¶
- Status: Accepted
- Date: 2026-04-28
- Deciders: lusoris@pm.me, Claude (Anthropic)
- Tags: vulkan, gpu, cambi, feasibility-spike, fork-local
Context¶
ADR-0192 committed the GPU long-tail batch 3 to ship every remaining metric across Vulkan / CUDA / SYCL. After parts 1–6 closed (ssimulacra2 Vulkan kernel landed in ADR-0201; the float twins shipped in ADR-0194 .. ADR-0199), only one metric remains: cambi (1533 LOC, the largest single feature in the fork).
ADR-0192 explicitly conditioned cambi's implementation on a feasibility spike — the calculate_c_values pass maintains a sliding 65 × 65 window histogram per output column and updates it row-by-row using +1 / -1 range modifications. The CPU's get_derivative_data_for_row hot path likewise walks left-to-right per row. The AVX2 / AVX-512 / NEON SIMD paths only vectorise within a single histogram update across columns (cambi_avx2.c::cambi_increment_range_avx2); the row-to-row state dependency is preserved sequentially. That combination — sequential-state running update + 1024-bin histogram per output column — is the single most SIMT-hostile shape in the fork's metric set.
This ADR is the spike. It answers two questions:
- Is cambi feasible on GPU at all? Yes — see Decision.
- Which of the three classical re-formulations (single-WG direct port, parallel scan, parallel tile) wins? None exclusively. Ship a hybrid host/GPU pipeline for v1, mirroring ADR-0201's precedent. Defer a fully-on-GPU c-values pass to a future batch.
The full strategy comparison (effort tables, bandwidth analysis, literature survey) lives in research digest 0020; this ADR locks the decision.
Decision¶
Feasibility verdict¶
Confirmed feasible — ship hybrid host/GPU. Every cambi pipeline phase except calculate_c_values (sliding histogram) is embarrassingly parallel and ports cleanly to Vulkan / CUDA / SYCL. The c-values phase stays on the host for v1.
v1 architecture sketch¶
cambi_vulkan_extract()
│
┌──────────────────────────────┼──────────────────────────────┐
│ │ │
▼ ▼ ▼
GPU shader chain GPU shader chain GPU shader chain
(per-frame setup) (per-scale 0..4) (per-scale 0..4)
cambi_preprocess.comp cambi_decimate.comp cambi_derivative.comp
- decimate to enc_w/enc_h (skipped at scale 0 - per-pixel right + bottom
- bit-shift to 10-bit unless high_res_speedup) - bool: `equal_right && equal_bottom`
- optional anti-dither cambi_filter_mode.comp cambi_mask_dp.comp
- 3-tap mode horizontal - separable summed-area table
- 3-tap mode vertical - threshold compare → mask plane
─── readback boundary ───
CPU residual (host code)
- calculate_c_values
(sliding histogram, top-k pool)
- scale-weighted final score
Concrete shader contracts:
cambi_preprocess.comp— per-pixel decimation + bit-shift to 10-bit + (optional) 4-tap anti-dither. Single dispatch, per- pixel thread, no reduction. Trivially parallel.cambi_derivative.comp— implementsget_derivative_data_for_rowover the whole frame in one dispatch. Each thread readsimage[i, j],image[i, j+1],image[i+1, j]and writesderivative[i, j] = (h_eq && v_eq). Edge cases (last column / last row) match the AVX2 path's special handling. Output is oneuint8per pixel.cambi_mask_dp.comp(split into two passes) — computes the summed-area table ofderivative[]and threshold-compares the 7×7 box-sum againstget_mask_index(...). The CPU does this with a cyclically-indexed sliding 9-row DP buffer; on GPU we materialise the full SAT because global memory is cheap and the SAT is separable (one row-prefix-sum pass + one column- prefix-sum pass, bothsubgroupInclusiveAdd-friendly within a workgroup, cross-WG reduction via decoupled lookback per Merrill & Grimshaw 2016 if needed; the fixture sizes (≤4K) are small enough that a two-dispatch host-orchestrated pass suffices).cambi_decimate.comp— per-pixeldata[i, j] = data[2i, 2j]. Trivial.cambi_filter_mode.comp— separable 3-tap mode filter. Two dispatches (horizontal then vertical); each pass per-pixel threaded. Uses the 3-elementmode3reduction.- Host residual — runs the CPU
calculate_c_values+spatial_poolingagainst the GPU-produced image + mask buffers (mapped HOST_VISIBLE). The c-values + pool produce adoubleper scale; the host applies the scale weighting and the finalMIN(score, cambi_max_val)clamp.
Precision contract — places=4 (tightened from ADR-0192's places=2)¶
ADR-0192 carried places=2 as a planning placeholder for cambi ("measure-first"). The hybrid architecture lets us tighten that to places=4 as the canonical floor — same contract as every other GPU twin in the fork — for two structural reasons:
- The GPU phases (preprocess, derivative, SAT, decimate, mode filter) are all integer + bit-exact w.r.t. the CPU. The derivative output is a single
boolper pixel; the SAT threshold compare is a single>againstmask_index; the decimate is a stride-2 gather; the mode filter is a 3-element lookup. None of these operations introduce float rounding — the GPU buffer is byte-for-byte identical to what the CPU code path would write. - The c-values + pooling phase stays on the host, executing the exact CPU code path against the GPU-produced buffers. Since the input buffers are bit-identical, the output is bit-identical.
So the v1 implementation should land at ULP=0 on the cross- backend gate fixture, comfortably under places=4 (5e-5 threshold). The CI step in tests-and-quality-gates.yml lands at --places 4 from day one — no temporary places=2 lane, no "measure first then ratchet" two-step. The contract is set by the architecture, not by empirical observation.
Per project rule (feedback_no_test_weakening): if v1 misses places=4 we fix the kernel, not the gate.
v1 implementation effort estimate¶
| Surface | Estimated LOC | Notes |
|---|---|---|
cambi_vulkan.c host glue | ~700 | Mirrors ssimulacra2_vulkan.c shape (init, descriptor pool, pipeline build, dispatch sequencer, host residual call). |
cambi_*.comp shaders (6 files) | ~400 total | Each 30–80 LOC; preprocessing, derivative, SAT-row, SAT-col, decimate, filter-mode. |
| Host residual integration | ~50 | Refactor calculate_c_values to take a buffer pair instead of a VmafPicture, share with the existing CPU extractor. |
| Meson + feature_extractor wiring | ~30 | Add to vulkan_sources + extern in feature_extractor.c. |
| Smoke test + cross-backend hookup | ~50 | Lavapipe lane in tests-and-quality-gates.yml; FEATURE_METRICS row in cross_backend_vif_diff.py. |
| Total | ~1230 LOC | Three weeks of focused work, comparable to ssimulacra2_vulkan (1544 LOC). |
v1 PR ordering (post-spike)¶
- This PR — spike (ADR-0205 + research digest 0020 + reference shader scaffolds + dormant
cambi_vulkan.chost skeleton). No build wiring; no feature_extractor registration. - Follow-up PR — wire the scaffolds into the build, implement the host residual call path, validate against the cross-backend gate, register
vmaf_fex_cambi_vulkan. This matches the ADR-0201 precedent: the feasibility kernel lands first as an in-tree reference, then the build wiring + integration follow. - Future — CUDA + SYCL twins, mirroring the Vulkan kernel's shape (per-PR cadence per ADR-0192).
Out of scope for this spike (deferred)¶
- The fully-on-GPU
calculate_c_valuespass via strategy III (direct per-pixel histogram, see digest §III). Tracked as long-tail batch 4 / "GPU perf polish". - Real-time / streaming use-cases on 4K (the host residual at 4K runs ~250 ms / frame; for batch processing this is fine and matches CPU performance, but real-time integration would need strategy III or II).
- Heatmap dump on GPU. Heatmap export remains a host-only feature (
heatmaps_pathoption).
Alternatives considered¶
| Option | Pros | Cons | Effort | Why not chosen |
|---|---|---|---|---|
| Strategy I — single-WG direct port | Bit-exact w.r.t. CPU; trivial to implement | Single workgroup = ~1/64 of GPU active; empirically slower than CPU. Defeats the purpose of a GPU port. | ~600 LOC | Not viable; GPU utilisation is catastrophically low. |
| Strategy II — parallel-scan reformulation | High GPU utilisation; integer-bit-exact; uses well-known Blelloch / Merrill & Grimshaw scan primitives. | Materialises per-row column histograms (W × H × 1033 bytes ≈ 17 GiB at 4K scale 0). Bandwidth-heavy gather phase for c-value lookup. ~1500 LOC. | ~1500 LOC | Bandwidth cost dominates; intermediate storage at 4K is impractical. |
| Strategy III — direct per-pixel histogram (full GPU) | Algorithmically clean; full grid utilisation; integer-bit-exact. Reads only ~9 bins per pixel (the c-value formula's domain). | Per-pixel 65² = 4225 reads → ~9× CPU bandwidth. Cache-hit-rate dependent. ~800 LOC. | ~800 LOC | Worth doing eventually but premature for the long-tail terminus; needs profile data we don't yet have. Tracked as v2 follow-up. |
| Hybrid host/GPU (chosen) | Closes the matrix gap. Mirrors ADR-0201's precedent. Bit-exact. Modest LOC. Leaves perf upside on the table for v2. | The c-values phase doesn't accelerate (host does same work). PCIe shuttle for the per-scale buffer (~50 MiB / frame at 4K, <2 ms). | ~1230 LOC v1 | Selected — best risk/reward for closing the long-tail terminus. |
| Defer cambi GPU port indefinitely | Zero implementation risk for batch 3. | Leaves the matrix permanently incomplete; lpips already has its dedicated exception per ADR-0022, accumulating exceptions weakens the gate. | 0 | Rejected — closing the long-tail is the explicit ADR-0192 goal. |
The runner-up was strategy III. We rejected it for v1 because:
- We have no profile data on the cache-hit rate at the per-output- pixel 65×65 window scan (4225 reads/thread). On Lavapipe (the CI gate) it is plausibly bandwidth-bound; on RDNA / Ada it is plausibly cache-friendly. Without empirical evidence the LOC investment is speculative.
- The hybrid v1 closes the matrix gap now, unblocks the long-tail terminus declared in ADR-0192, and leaves strategy III as a clean follow-up under a focused PR (rather than coupling the architectural decision to the profile-driven optimisation).
- Per the user's standing direction in CLAUDE.md §13: prefer shipping the closed matrix over chasing peak GPU utilisation when the latter trades implementation risk for a perf gain we haven't yet measured.
Consequences¶
- Positive: closes the GPU long-tail matrix terminus. After the follow-up integration PR, every registered feature extractor in the fork has at least one GPU twin (lpips remains delegated to ORT execution providers per ADR-0022). Establishes a reusable hybrid host/GPU template for any future metric whose state is too sequential for full SIMT parallelism.
- Positive: the architectural decision is separated from the full-GPU optimisation. v1 ships the matrix closure; v2 (strategy III) is a focused perf PR with its own ADR and profile data. This matches how ADR-0201 split ssimulacra2 into "hybrid landing first" + "future GPU XYB / SSIM compute".
- Negative: cambi GPU performance in v1 is bottlenecked by the CPU c-values phase. At 4K this is ~250 ms / frame — comparable to a CPU-only run. The GPU does the easy parts faster but doesn't accelerate the hot path. Real users who care about cambi throughput will need to wait for v2 (strategy III) or stick with CPU.
- Negative: adds maintenance surface that mirrors CPU code. The
calculate_c_valueshost residual must stay in lock-step with the CPU extractor; any CPU-side change (e.g. Netflix upstream tuning the c-value formula) requires touching both sites. Mitigated by sharing the samec_value_pixel/update_histogram_*helpers via header inclusion (the host residual is the CPUcalculate_c_values). - Neutral / follow-ups:
- Follow-up PR wires the scaffolds into the build, implements the host residual call path, validates against the cross- backend gate, registers
vmaf_fex_cambi_vulkan. Lavapipe lane added totests-and-quality-gates.yml. - CUDA + SYCL twins follow per ADR-0192 cadence.
- v2 ADR (likely
ADR-02XX-cambi-vulkan-c-values-gpu) tackles strategy III with profile data from Lavapipe + RDNA + Ada. - Heatmap dump remains host-only; if a follow-up wants GPU heatmaps, it adds a write-out shader at the c-values phase.
Precision investigation¶
Not applicable for v1 (the precision-sensitive c-values phase stays on the host). The GPU phases are integer + bit-exact-by-construction (all operations are uint16 arithmetic, comparison, gather; no float rounding, no cross-WG float reduction). The follow-up integration PR — tracked as backlog T7-36 per the Section-A audit decisions §A.1.3 — will run the cross-backend gate empirically and either confirm bit-exact or document any drift in its own ADR.
The strategy III v2 PR will need its own precision investigation because the order of integer histogram increments in a per-pixel scan differs from the row-major sliding update. The c-value formula itself involves a float divide (diff_weight * p_0 * p_1) / (p_1 + p_0) which is order-insensitive at the per-pixel granularity (no cross-pixel summation), so bit-exactness should still hold; empirical validation is required.
References¶
- Parent: ADR-0192 — GPU long- tail batch 3 scope; explicitly conditions cambi on this spike.
- Sibling: ADR-0201 — the ssimulacra2 Vulkan kernel's hybrid host/GPU split is the direct precedent for the cambi v1 decision.
- Sibling: ADR-0178 — Vulkan ADM kernel; closest pipeline shape (multi-scale, multi-stage).
- Sibling: ADR-0022 — lpips carve-out; cambi v1 closes the only other GPU long-tail exception.
- Companion: research digest 0020 — full strategy comparison, bandwidth analysis, literature survey (Blelloch 1990, Sengupta 2007, Merrill & Grimshaw 2016).
- CPU reference:
core/src/feature/cambi.c(1533 LOC); SIMD paths inx86/cambi_avx2.c,x86/cambi_avx512.c,arm64/cambi_neon.c. - Banding-detection literature: Tandon et al., "CAMBI: Contrast- Aware Multiscale Banding Index" (PCS 2021); Wu & Liu, "Banding artefact detection in compressed images" (VCIP 2007).
- User direction: standing CLAUDE.md §12 r10/r11 (every fork- local PR ships the six deep-dive deliverables; doc-substance rule applies); standing CLAUDE.md §13 (prefer closing the matrix over chasing peak utilisation).