ADR-0216: Vulkan PSNR — chroma extension (psnr_cb / psnr_cr)¶

Status: Accepted
Date: 2026-04-29
Deciders: Lusoris, Claude (Anthropic)
Tags: vulkan, gpu, feature-extractor, psnr

Context¶

psnr_vulkan (T7-23 / ADR-0182) landed luma-only — provided_features = {"psnr_y", NULL} and a single dispatch per frame against a luma-sized SSBO. The original header comment justified the omission with "the picture_vulkan upload path is luma-only today", but that turned out to be wrong on inspection: core/src/vulkan/picture_vulkan.c is a generic VMA byte-buffer allocator and the per-feature kernels (psnr_vulkan.c, vif_vulkan.c, etc.) already memcpy plane data into their own buffers. The host loop in psnr_vulkan.c::extract was already documented as plane-agnostic; only state, descriptor-set count, and the dispatch loop needed extension.

CPU integer_psnr.c emits psnr_y / psnr_cb / psnr_cr unconditionally on YUV420/422/444 (clamping to luma-only on YUV400 via enable_chroma = false). Until this ADR, any pipeline asking the Vulkan backend for chroma PSNR fell through to CPU because the extractor's provided_features claimed only psnr_y. Closing that gap is part of the GPU long-tail backlog and is a prerequisite for chroma SSIM / MS-SSIM follow-ups.

Decision¶

Extend psnr_vulkan.c to dispatch the existing psnr.comp shader three times per frame — once each for Y, Cb, Cr — against per-plane input buffers and per-plane SE-partials buffers, with per-plane (width, height, num_workgroups_x) push constants. The shader is unchanged; it was already plane-agnostic and reads its dims from the push-constant block. State carries ref_in[3] / dis_in[3] / se_partials[3] arrays; one descriptor set per plane is allocated per extract call; a single command buffer issues all three back-to-back dispatches with no inter-dispatch barrier (the SSBOs are independent across planes); the host fence-waits once and reduces all three SE buffers serially. provided_features becomes {"psnr_y", "psnr_cb", "psnr_cr"}. YUV400 clamps n_planes = 1 so chroma dispatches and emits are skipped at runtime.

psnr_max[p] follows the CPU integer_psnr.c default branch (min_sse == 0): psnr_max[p] = (6 * bpc) + 12, identical for all three planes. The min_sse-driven per-plane formula is left unimplemented (no shipped extractor sets min_sse); the array layout makes it a one-line change if needed.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
Three-element arrays in a single state struct (chosen)	Minimal diff against the v1 luma-only state; descriptor-set / buffer pattern matches sister kernels; no allocator churn beyond what the v1 already paid.	Per-plane arrays read slightly more verbosely than a singleton.	Smallest, most reviewable shape.
Three independent `PsnrVulkanState` instances dispatched as a meta-extractor	Clean per-plane isolation.	Triples descriptor pools, command-buffer alloc churn, pipeline objects, and `feature_name_dict` instances; CPU integer_psnr.c uses the array shape too — staying close to the canonical layout helps the rebase story.	Wrong cost / benefit.
One pipeline per plane via a dedicated `psnr_chroma.comp` (or a `PLANE_INDEX` spec constant)	Lets the shader make plane-specific compile-time decisions.	The shader already takes `(width, height)` from push constants — there is no plane-specific decision the GLSL needs to make. Three pipelines would burn 3× SPIR-V cache and pipeline-creation latency for zero benefit.	Spec-constant variation buys nothing here.
Subsample-aware host loop plus shader-side subsampling	Would let one dispatch handle all three planes with branched indexing inside the shader.	Defeats the per-WG int64 reduction pattern (every WG would have to know which plane it owned); kills bit-exactness against CPU; needs three independent shared-memory accumulators per WG.	Way more complex than three small dispatches.

Consequences¶

Positive:
Vulkan PSNR now matches the CPU provided_features set — pipelines that name psnr_cb / psnr_cr are routed to Vulkan instead of silently falling through to CPU.
Cross-backend gate (scripts/ci/cross_backend_vif_diff.py --feature psnr) covers all three plane scores at places=4; measured max_abs_diff = 0.0 across 48 frames at 576×324 (lavapipe, 8-bit 4:2:0). Both sides use deterministic int64 SSE accumulators on integer YUV inputs.
Unblocks the chroma SSIM / chroma MS-SSIM follow-ups (T-rows queued separately), which need the same per-plane buffer pattern.
Negative:
Three dispatches per frame instead of one (.chars .n_dispatches_per_frame bumped 1 → 3). Per-plane WG counts are smaller, so wall-time impact is sub-linear — chroma dispatches at 4:2:0 cover 25 % of luma area each.
Descriptor pool sized for 12 sets × 36 buffer descriptors (was 4 × 12) to absorb the per-plane fanout with frames-in-flight headroom.
Neutral / follow-ups:
Chroma SSIM (ssim_vulkan chroma extension) — separate row.
Chroma MS-SSIM (ms_ssim_vulkan chroma extension) — separate row, gated on chroma SSIM landing.
The min_sse-driven psnr_max[p] branch from CPU integer_psnr.c is intentionally not replicated; reactivate when a shipped extractor configuration sets min_sse.

References¶

ADR-0182 — GPU long-tail batch 1, the original luma-only psnr_vulkan row.
ADR-0125 / ADR-0175 — Vulkan backend framework already covers the buffer / descriptor / dispatch patterns this PR reuses; no fresh research digest needed.
core/src/feature/integer_psnr.c — CPU scalar reference for psnr_y / psnr_cb / psnr_cr and the psnr_max[p] default.
Source: req (T3-15(b) prompt — "Extend psnr_vulkan.c to compute psnr_cb and psnr_cr alongside the existing psnr_y").