ADR-0195: float_psnr GPU kernels — single-dispatch diff² with float partials, bit-exact vs CPU¶

Status: Accepted
Date: 2026-04-27
Deciders: Lusoris, Claude (Anthropic)
Tags: vulkan, cuda, sycl, gpu, feature-extractor, fork-local, bit-exact

Context¶

ADR-0192 lists float_psnr as the first Group B metric — float twin of an int kernel that already ships on GPU (psnr_y on CUDA / SYCL / Vulkan emits the int-domain luma PSNR). The CPU reference is float_psnr.c (170 LOC):

picture_copy(offset=0) normalises raw samples to float in [0, peak] (peak = 255.0 for bpc=8, etc.).
Per-pixel (ref - dis)² accumulated in double per row.
noise = total / (w · h).
score = MIN(10·log10(peak² / max(noise, 1e-10)), psnr_max).

Single emitted metric: float_psnr. The features.md table prior to this PR claimed float_psnr_y / _cb / _cr plane outputs — that was wrong; the CPU extractor only emits float_psnr. This PR fixes the docs alongside the new GPU kernels.

Decision¶

Ship float_psnr_{vulkan,cuda,sycl} as single-dispatch kernels that compute per-pixel (ref - dis)² in float and emit per-WG float partials. Host accumulates in double and applies the CPU formula.

Single dispatch, no halo, no shared tile¶

The kernel doesn't need a halo or shared tile because every pixel is independent:

16×16 WG.
Per thread: read its ref + dis pixel, compute (ref - dis)².
Sub-group reduce_over_group → SLM cross-subgroup → single float per WG into the output partial buffer.

Smallest GPU kernel in the long-tail by a wide margin: ~120 LOC of GLSL, ~110 LOC of CUDA PTX, ~150 LOC of SYCL.

Float math, no bit-exactness engineering, but empirical 0.0 drift¶

Float per-pixel + float per-WG reduction + log10 final transform are normally a recipe for ULP drift vs the CPU's left-to-right accumulator. Empirically the drift is zero on the cross-backend gate fixture:

Backend	Bit depth	Frames	`float_psnr` max_abs_diff
Vulkan (Intel Arc A380 + Mesa anv)	8-bit	48	0.00e+00
Vulkan	10-bit	3	0.00e+00
CUDA (NVIDIA RTX 4090)	8-bit	48	0.00e+00
CUDA	10-bit	3	0.00e+00
SYCL (Intel Arc A380 + oneAPI 2025.3)	8-bit	48	0.00e+00
SYCL	10-bit	3	0.00e+00

The exactness is not engineered — it falls out of (a) the kernel's trivial structure (one square per pixel, one summation), and (b) the host-side double accumulation absorbing any per-WG ULP noise into a value whose log10 rounds to the same double as the CPU's. The contract per ADR-0192 was nominally places=3; the empirical floor is places=∞ (exact JSON-level agreement).

This isn't the place to lock a "bit-exact forever" contract — future driver / kernel changes could reintroduce ULP drift, and the gate's places=4 threshold is the safety margin for that. ADR-0192's places=3 precision contract still applies to the gate; reality just exceeds it.

Mirror padding: N/A¶

float_psnr has no convolution — every pixel is independent — so the mirror-divergence-from-motion footgun called out in ADR-0193 / 0194 doesn't apply here. (Worth noting because the corresponding motion-family kernels share the directory tree but the mirror contract is per-kernel.)

Float twins kept native, not aliased¶

ADR-0192 considered aliasing float_psnr to the existing psnr_y kernel. Rejected: psnr_y reads quantized uint pixels and emits the dB score; float_psnr reads pixels normalised by the bpc- specific scaler (1 / 4 / 16 / 256). The kernels operate on fundamentally different input domains. Aliasing would either silently quantize the float input (wrong) or run two kernels back-to-back (no win). Native kernel is cleaner.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
*Alias `float_psnr_` to the existing `psnr_y` kernel + post-quantize on the host**	Saves 3 small kernel files	Fundamentally different input domains (raw uint vs `[0, peak]` float). Aliasing changes semantics.	Per ADR-0192 alternatives table
Atomic float adds to a single global `noise` accumulator	One global write, no per-WG slots	Cross-WG associativity drift (the very thing per-WG + double host accum sidesteps); float atomicAdd extension dependency on Vulkan	Per-WG floats + host double is the canonical pattern for this fork (ciede, ansnr, ...)
Two passes (compute squared diffs in pass 1 → write to buffer; reduce in pass 2)	Pass 2 can use parallel-tree reduction with deeper double accumulation	The kernel is trivially short already; pass 2 buys nothing	YAGNI; single-pass with sub-group + SLM reduce is sufficient
Defer to "after `motion_v2` lands" instead of bundling into batch 3 part 3	Smaller PR queue at any moment	Doesn't change the merge surface; the four float twins all share scaffolding	Match ADR-0192's per-metric ordering: smallest first (`float_psnr` < `float_motion` < `float_vif` < `float_adm`)

Consequences¶

Positive: closes the first of four matrix gaps for Group B float twins. The next three (float_motion, float_vif, float_adm) follow the same shape — float per-pixel + per-WG partials + host double reduction.
Positive: ~120 LOC GLSL + ~110 LOC PTX + ~150 LOC SYCL + ~250 LOC of host glue per backend. Smallest GPU twin batch by a wide margin.
Positive: empirically bit-exact on all three backends — strongest possible numerical contract for a float-domain kernel.
Neutral / follow-ups:
CHANGELOG + features.md updates ship in the same PR per ADR-0100.
features.md row 39 corrected: float_psnr emits one metric (float_psnr), not three plane outputs.
Lavapipe lane gains a float_psnr step.
Next batch 3 metric: float_motion (ADR-0196 to come).

References¶

Parent: ADR-0192 — batch 3 scope.
Sibling: ADR-0193, ADR-0194.
CPU reference: float_psnr.c (170 LOC).
Verification: cross-backend gate scripts/ci/cross_backend_vif_diff.py with --feature float_psnr --places 4. New step in the lavapipe lane of tests-and-quality-gates.yml.