ADR-0216: Vulkan PSNR chroma extension (psnr_cb / psnr_cr)
- Status: Accepted
- Date: 2026-04-29
- Deciders: Lusoris, Claude (Anthropic)
- Tags: vulkan, gpu, feature-extractor, psnr
Context
psnr_vulkan (T7-23 / ADR-0182) landed luma-only — provided_features = {"psnr_y", NULL} and a single dispatch per frame against a luma-sized SSBO. The original header comment justified the omission with "the picture_vulkan upload path is luma-only today", but that turned out to be wrong on inspection: core/src/vulkan/picture_vulkan.c is a generic VMA byte-buffer allocator and the per-feature kernels (psnr_vulkan.c, vif_vulkan.c, etc.) already memcpy plane data into their own buffers. The host loop in psnr_vulkan.c::extract was already documented as plane-agnostic; only state, descriptor-set count, and the dispatch loop needed extension.
CPU integer_psnr.c emits psnr_y / psnr_cb / psnr_cr unconditionally on YUV420/422/444 (clamping to luma-only on YUV400 via enable_chroma = false). Until this ADR, any pipeline asking the Vulkan backend for chroma PSNR fell through to CPU because the extractor's provided_features claimed only psnr_y. Closing that gap is part of the GPU long-tail backlog and is a prerequisite for chroma SSIM / MS-SSIM follow-ups.
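The fall-through described above follows from a name lookup against the extractor's NULL-terminated feature list. A minimal sketch of that lookup, assuming a plain linear search; the array names and the `provides_feature` helper are hypothetical and only illustrate the shape of `provided_features` before and after this ADR:

```c
#include <stddef.h>
#include <string.h>

/* NULL-terminated feature lists in the provided_features shape quoted above.
 * The array names are hypothetical, chosen for this sketch only. */
static const char *const psnr_vulkan_v1_features[] = {"psnr_y", NULL};
static const char *const psnr_vulkan_v2_features[] = {
    "psnr_y", "psnr_cb", "psnr_cr", NULL};

/* Hypothetical helper: returns 1 if `name` is in the list, 0 otherwise. */
static int provides_feature(const char *const *features, const char *name)
{
    for (size_t i = 0; features[i] != NULL; i++) {
        if (strcmp(features[i], name) == 0)
            return 1;
    }
    return 0;
}
```

With the luma-only list, a request for `psnr_cb` finds no match, which is the condition under which a pipeline would be routed to the CPU extractor instead.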
Decision
Extend psnr_vulkan.c to dispatch the existing psnr.comp shader three times per frame — once each for Y, Cb, Cr — against per-plane input buffers and per-plane SE-partials buffers, with per-plane (width, height, num_workgroups_x) push constants. The shader is unchanged; it was already plane-agnostic and reads its dims from the push-constant block. State carries ref_in[3] / dis_in[3] / se_partials[3] arrays; one descriptor set per plane is allocated per extract call; a single command buffer issues all three back-to-back dispatches with no inter-dispatch barrier (the SSBOs are independent across planes); the host fence-waits once and reduces all three SE buffers serially. provided_features becomes {"psnr_y", "psnr_cb", "psnr_cr"}. YUV400 clamps n_planes = 1 so chroma dispatches and emits are skipped at runtime.
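The per-plane push constants can be derived on the host from the luma size and the chroma subsampling. The sketch below is illustrative only: the `pix_fmt` enum, the `plane_push` struct, the `fill_plane_push` helper, and the 16-wide workgroup are all assumptions, since the shader's actual local size and the codebase's type names are not given here. Rounding chroma dimensions up for odd sizes is likewise an assumption.

```c
#include <stdint.h>

/* Hypothetical pixel-format tags; the real enum lives in the codebase. */
enum pix_fmt { FMT_YUV400, FMT_YUV420, FMT_YUV422, FMT_YUV444 };

/* Mirrors the (width, height, num_workgroups_x) push-constant triple. */
struct plane_push {
    uint32_t width;
    uint32_t height;
    uint32_t num_workgroups_x;
};

/* ASSUMPTION: 16-wide workgroups; the real local size is not stated. */
#define WG_SIZE_X 16u

static uint32_t ceil_div(uint32_t a, uint32_t b) { return (a + b - 1u) / b; }

/* Fills push[0..n_planes-1] and returns n_planes (1 for YUV400, else 3). */
static unsigned fill_plane_push(enum pix_fmt fmt, uint32_t w, uint32_t h,
                                struct plane_push push[3])
{
    const unsigned n_planes = (fmt == FMT_YUV400) ? 1u : 3u;
    /* Chroma subsampling shifts: 4:2:0 halves both axes, 4:2:2 only x. */
    const unsigned ss_hor = (fmt == FMT_YUV420 || fmt == FMT_YUV422) ? 1u : 0u;
    const unsigned ss_ver = (fmt == FMT_YUV420) ? 1u : 0u;

    for (unsigned p = 0; p < n_planes; p++) {
        const unsigned sh = (p == 0) ? 0u : ss_hor;
        const unsigned sv = (p == 0) ? 0u : ss_ver;
        /* ASSUMPTION: round up so odd luma sizes keep their last column/row. */
        push[p].width = (w + (1u << sh) - 1u) >> sh;
        push[p].height = (h + (1u << sv) - 1u) >> sv;
        push[p].num_workgroups_x = ceil_div(push[p].width, WG_SIZE_X);
    }
    return n_planes;
}
```

For a 576×324 4:2:0 frame this yields 576×324 for Y and 288×162 for Cb and Cr. A YUV400 input returns one plane, so the chroma dispatches would be skipped.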
psnr_max[p] follows the CPU integer_psnr.c default branch (min_sse == 0): psnr_max[p] = (6 * bpc) + 12, identical for all three planes. The min_sse-driven per-plane formula is left unimplemented (no shipped extractor sets min_sse); the array layout makes it a one-line change if needed.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Three-element arrays in a single state struct (chosen) | Minimal diff against the v1 luma-only state; descriptor-set / buffer pattern matches sister kernels; no allocator churn beyond what the v1 already paid. | Per-plane arrays read slightly more verbosely than a singleton. | Smallest, most reviewable shape. |
| Three independent PsnrVulkanState instances dispatched as a meta-extractor | Clean per-plane isolation. | Triples descriptor pools, command-buffer alloc churn, pipeline objects, and feature_name_dict instances; CPU integer_psnr.c uses the array shape too, and staying close to the canonical layout helps the rebase story. | Wrong cost / benefit. |
| One pipeline per plane via a dedicated psnr_chroma.comp (or a PLANE_INDEX spec constant) | Lets the shader make plane-specific compile-time decisions. | The shader already takes (width, height) from push constants, so there is no plane-specific decision the GLSL needs to make. Three pipelines would burn 3× SPIR-V cache and pipeline-creation latency for zero benefit. | Spec-constant variation buys nothing here. |
| Subsample-aware host loop plus shader-side subsampling | Would let one dispatch handle all three planes with branched indexing inside the shader. | Defeats the per-WG int64 reduction pattern (every WG would have to know which plane it owned); kills bit-exactness against CPU; needs three independent shared-memory accumulators per WG. | Way more complex than three small dispatches. |
Consequences
- Positive:
  - Vulkan PSNR now matches the CPU provided_features set: pipelines that name psnr_cb / psnr_cr are routed to Vulkan instead of silently falling through to CPU.
  - Cross-backend gate (scripts/ci/cross_backend_vif_diff.py --feature psnr) covers all three plane scores at places=4; measured max_abs_diff = 0.0 across 48 frames at 576×324 (lavapipe, 8-bit 4:2:0). Both sides use deterministic int64 SSE accumulators on integer YUV inputs.
  - Unblocks the chroma SSIM / chroma MS-SSIM follow-ups (T-rows queued separately), which need the same per-plane buffer pattern.
- Negative:
  - Three dispatches per frame instead of one (.chars.n_dispatches_per_frame bumped 1 → 3). Per-plane WG counts are smaller, so wall-time impact is sub-linear: chroma dispatches at 4:2:0 cover 25 % of luma area each.
  - Descriptor pool sized for 12 sets × 36 buffer descriptors (was 4 × 12) to absorb the per-plane fanout with frames-in-flight headroom.
- Neutral / follow-ups:
  - Chroma SSIM (ssim_vulkan chroma extension): separate row.
  - Chroma MS-SSIM (ms_ssim_vulkan chroma extension): separate row, gated on chroma SSIM landing.
  - The min_sse-driven psnr_max[p] branch from CPU integer_psnr.c is intentionally not replicated; reactivate when a shipped extractor configuration sets min_sse.
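The figures in the Negative items can be checked with a little arithmetic. This sketch assumes three storage buffers per descriptor set (ref, dis, SE partials) and a headroom factor of four sets per plane. Both values are inferred from the 4 × 12 and 12 × 36 numbers rather than stated outright, and the helper names are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTIONS inferred from the quoted pool sizes, not stated in the ADR. */
#define BUFFERS_PER_SET 3u /* ref input, dis input, SE partials */
#define SETS_HEADROOM 4u   /* sets reserved per plane for frames in flight */

/* Number of descriptor sets the pool must hold for `planes` planes. */
static uint32_t pool_max_sets(uint32_t planes)
{
    return SETS_HEADROOM * planes;
}

/* Number of storage-buffer descriptors backing those sets. */
static uint32_t pool_buffer_descriptors(uint32_t planes)
{
    return pool_max_sets(planes) * BUFFERS_PER_SET;
}

/* Total pixels across Y, Cb, Cr as a percentage of the luma plane alone,
 * given the chroma subsampling shifts (1, 1 for 4:2:0). */
static uint32_t total_work_percent(unsigned ss_hor, unsigned ss_ver)
{
    const uint32_t chroma_percent = 100u >> (ss_hor + ss_ver);
    return 100u + 2u * chroma_percent;
}
```

One plane gives 4 sets and 12 descriptors; three planes give 12 and 36. At 4:2:0 the three dispatches touch 150 % of the luma pixel count, which is why tripling the dispatch count does not triple the work.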
References
- ADR-0182: GPU long-tail batch 1, the original luma-only psnr_vulkan row.
- ADR-0125 / ADR-0175: Vulkan backend framework already covers the buffer / descriptor / dispatch patterns this PR reuses; no fresh research digest needed.
- core/src/feature/integer_psnr.c: CPU scalar reference for psnr_y / psnr_cb / psnr_cr and the psnr_max[p] default.
- Source: req (T3-15(b) prompt: "Extend psnr_vulkan.c to compute psnr_cb and psnr_cr alongside the existing psnr_y").