ADR-0210: cambi Vulkan integration (Strategy II hybrid)¶
- Status: Accepted
- Date: 2026-04-29
- Deciders: lusoris@pm.me, Claude (Anthropic)
- Tags: vulkan, gpu, cambi, feature-extractor, fork-local, places-4
Context¶
ADR-0205 closed the cambi GPU feasibility spike with the verdict "ship hybrid host/GPU" and deferred the actual integration to a follow-up PR (backlog item T7-36, see also the §A.1.3 Section-A audit decisions). This ADR locks the v1 implementation that lands in that follow-up.
After the spike PR, the in-tree state was:
core/src/feature/vulkan/cambi_vulkan.c— 200-line scaffold withinit_stub/extract_stub/close_stubreturning-ENOSYS. Not wired intofeature_extractor_list[].core/src/feature/vulkan/shaders/{cambi_decimate,cambi_derivative,cambi_filter_mode}.comp— 3 of the 6 shaders ADR-0205 specifies. Compiled but unused.
This PR implements:
- The 2 missing shaders (
cambi_preprocess.comp+cambi_mask_dp.comp— the latter unifies row-SAT, col-SAT, and threshold-compare passes via aPASSspec constant). - The full Vulkan-aware lifecycle in
cambi_vulkan.c(descriptor pool / pipeline build / dispatch sequencer / readback path) replacing the_stubtriple. - The host residual call path —
vmaf_cambi_calculate_c_valuespatial pooling — invoked against the GPU-produced image + mask buffers via a smallcambi_internal.hheader that exposes cambi.c's file-static helpers without disturbing CPU SIMD callsites. - Build wiring — registers the 5 shaders in
vulkan_shader_sources[], the TU invulkan_sources, andvmaf_fex_cambi_vulkaninfeature_extractor.c'sfeature_extractor_list[]. - Cross-backend gate — adds a
cambirow toscripts/ci/cross_backend_vif_diff.py'sFEATURE_METRICSso the places=4 contract is checked end-to-end against the CPU.
Decision¶
v1 architecture (recap from ADR-0205)¶
cambi_vulkan_extract()
│
┌───────────────────────────────┼───────────────────────────────┐
│ │ │
▼ ▼ ▼
Host preprocess GPU spatial mask GPU per-scale loop
(CPU vmaf_cambi_ (one pass / frame): (5 scales):
preprocessing) cambi_derivative.comp cambi_decimate (2×)
→ pics[0] luma cambi_mask_dp PASS=0,1,2 cambi_filter_mode H+V
→ mask_buf → image_buf, mask_buf
─── readback ───
Host residual (CPU):
vmaf_cambi_calculate_c_values
vmaf_cambi_spatial_pooling
vmaf_cambi_weight_scores_per_scale
MIN(score, cambi_max_val)
Key implementation choices¶
-
Strategy II scope. Per ADR-0205 §Decision the precision-sensitive
calculate_c_valuessliding-histogram pass and the top-K spatial pooling stay on the host. The GPU services every other phase — preprocess (forward-compatible scaffold; v1 uses the CPU bilinear-resize for bit-exactness on resolution mismatches), per-pixel derivative kernel, the 7×7 spatial mask SAT + threshold compare, the 2× decimate, and the separable 3-tap mode filter. -
Mask DP shader fuses 3 passes via spec constant. Rather than write
cambi_mask_sat_row.comp+cambi_mask_sat_col.comp+cambi_mask_threshold.compas three separate TUs, the singlecambi_mask_dp.compusesPASS = 0/1/2to switch between row-SAT (single thread per row sequential prefix-sum), col-SAT (single thread per column), and per-pixel 4-corner box-sum + threshold compare. This keeps the build's shader count low and collocates the related kernels — easier to review than 3 near- identical files. -
cambi_internal.hinstead of buffer-pair refactor. ADR-0205 §Decision suggested refactoringcalculate_c_valuesto take a buffer pair instead of aVmafPicture *. v1 takes a different tactic: keepcalculate_c_values'sVmafPicture *signature, expose it (anddecimate/filter_mode/get_spatial_mask/cambi_preprocessing/spatial_pooling/weight_scores_per_scale/get_pixels_in_window) through a thin internal header, and have the Vulkan twin allocate its ownVmafPicturepair as readback targets. This avoids rippling through every CPU SIMD callsite forincrement_range/decrement_range/get_derivative_data_for_row, which all expect(uint16_t *, ...)not(VmafPicture *, ...). The cost is one full-frame readback per scale — negligible vs the per-scale GPU dispatch cost. -
Bit-exactness preserved by construction. Every GPU phase is integer arithmetic (
uint16derivative,int32SAT,>compare, stride-2 gather, 3-elementmode3lookup). The readback into theVmafPicturepair is byte-identical to what the CPU code path would have written in-place. The host residual then runs the exact same CPU code on those buffers, so the emitted score is bit-identical tovmaf_fex_cambi. ULP=0 / max_abs_diff=0 against the CPU on the smoke fixture; cross-backend gate atplaces=4per ADR-0205's tightened contract. -
Free-set descriptor pool, per-dispatch one-shot command buffer. Mirrors
psnr_vulkan.c(T7-23 part 1) — small max_sets pool withVK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT, allocate + write + bind + dispatch + submit + wait + free per stage. v1 trades latency for code clarity; a v2 perf pass can batch all per-scale dispatches into one command buffer if profiling warrants.
Out of scope (deferred)¶
- Fully-on-GPU
calculate_c_values(Strategy III) — tracked for v2. - Heatmap dump on GPU.
- CUDA + SYCL twins — follow per ADR-0192 cadence (one PR per backend).
- High-res-speedup option's GPU path. v1 honours the option at the per-scale level (decimate first, then build mask) but doesn't add the ~2× perf shortcut the CPU has at 1080p+. v2.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Strategy II hybrid (chosen) | Bit-exact w.r.t. CPU; closes the matrix at ADR-0192's terminus; modest LOC. | C-values phase doesn't accelerate; PCIe shuttle per scale. | Selected — ships the closure now, leaves Strategy III as a focused v2 PR. |
| Strategy III — fully-on-GPU c-values | Algorithmically clean; full GPU utilisation. | ~9× CPU bandwidth on a per-output-pixel 4225-read scan; we have no profile data on cache-hit rate at this access pattern. ~800 LOC additional. | Premature without profile data. Tracked as a v2 follow-up ADR. |
Buffer-pair refactor of calculate_c_values | Cleaner host-residual call surface; no VmafPicture allocation in the GPU twin. | Ripples through every CPU SIMD callsite + the AVX2/AVX-512/NEON paths' inc_range_callback signatures; ~200 LOC refactor diff to cambi.c + 3 SIMD TUs to gain a few ns of redundant pointer chasing. | Cost > benefit. The internal-header trampoline lets the GPU twin land without touching CPU SIMD code. |
| Three separate mask-DP TUs (row-SAT, col-SAT, threshold) | Each shader trivially understandable in isolation. | 3 near-identical files = 3 spv_embed.h headers + 3 pipeline-build sites; reviewers would have to diff three files to see the SAT semantics. | Single TU + PASS spec const = one file to review, one shader module on the device, one cambi_mask_dp_spv.h artefact. |
Consequences¶
- Positive: closes the GPU long-tail matrix terminus declared in ADR-0192. Every registered feature extractor in the fork now has at least one GPU twin (lpips remains delegated to ORT execution providers per ADR-0022).
- Positive: establishes a reusable Strategy II template for any future metric whose state is too sequential for full SIMT parallelism. cambi was the worst case in the matrix; if Strategy II worked here it works everywhere.
- Positive: keeps the cross-backend gate tight at
places=4— no per-metric tolerance carve-outs. - Negative: cambi GPU performance in v1 is bottlenecked by the CPU c-values phase. At 4K this remains ~250 ms / frame. Real users who care about cambi throughput will need to wait for v2 (Strategy III) or stick with CPU.
- Negative: adds maintenance surface that mirrors CPU code. The
vmaf_cambi_calculate_c_valueshost residual stays in lock-step with the CPU extractor; any CPU-side change to the c-value formula requires touching both sites. Mitigated by sharing helpers via the internal header.
Precision investigation¶
The places=4 contract holds by construction, not by empirical adjustment:
- All GPU phases are integer arithmetic (no float rounding).
- The host residual runs the unmodified CPU c-values code on byte-identical buffers.
- Therefore the emitted score is bit-identical to the CPU path on any input.
The smoke fixture (Netflix ref_576x324_48f.yuv ↔ dis_576x324_48f.yuv, 48 frames) confirms ULP=0 / max_abs_diff=0 (the per-frame Cambi score readout matches the CPU path byte-for-byte through the JSON output). Cross-backend gate runs at the canonical places=4 from day one.
References¶
- Parent: ADR-0205 — feasibility spike, Strategy II hybrid decision.
- Sibling: ADR-0201 — direct precedent for the hybrid host/GPU split.
- Companion: docs/research/0020-cambi-gpu-strategies.md — strategy comparison.
- Companion: docs/research/0032-cambi-vulkan-integration.md — integration-time trade-offs (this ADR's research digest).
- CPU reference:
core/src/feature/cambi.c. - Backlog: T7-36 (cambi GPU integration PR).
- User direction: standing CLAUDE.md §12 r10/r11 (every fork-local PR ships the six deep-dive deliverables; doc-substance rule applies).