ADR-0201: ssimulacra2 Vulkan kernel
- Status: Accepted
- Date: 2026-04-27
- Deciders: lusoris@pm.me
- Tags: vulkan, gpu, ssimulacra2, precision
Context
ADR-0192 scopes GPU long-tail batch 3, which targets a Vulkan twin for every CPU extractor that still lacks one. ssimulacra2 — the SSIMULACRA 2 perceptual quality metric ported from libjxl (ADR-0130) — is the second-most complex remaining metric after cambi (deferred for a feasibility spike). Its CPU pipeline (full-resolution YUV → linear RGB → 6-scale pyramid; per scale: linear-RGB → XYB, separable 3-pole IIR Gaussian blur of 5 statistics, per-pixel SSIM + edge-diff stats, host accumulation of ~108 weighted norms + cubic polynomial + power transform) makes a single fused GPU kernel impractical.
Closing the Vulkan slot for ssimulacra2 unblocks ADR-0192's Group A coverage matrix (no GPU twin yet) and provides the reference layout that the CUDA + SYCL twins (a follow-up PR) will mirror.
Decision
We land ssimulacra2_vulkan as a hybrid host/GPU pipeline with the IIR blur + per-pixel multiplies on the GPU and the precision-sensitive XYB pre-pass + per-pixel SSIM combine on the host. The GPU shaders are:
- ssimulacra2_mul.comp: elementwise 3-plane multiply (mirrors multiply_3plane in ssimulacra2.c) for the ref², dis², and ref·dis pre-blur products.
- ssimulacra2_blur.comp: separable Charalampidis 2016 3-pole recursive IIR blur with sigma=1.5. The IIR is sequential along the scan axis, so we use one workgroup per row for the horizontal pass (local_size = 1, dispatch(1, H, 1)) and one workgroup per column for the vertical pass (dispatch(1, W, 1)). Per-channel offsets in the 3-plane buffer come from push constants (in_offset, out_offset), so the descriptor set is bound once per (in_buf, out_buf) pair. Updating descriptors between recorded vkCmdDispatch calls leaves only the LAST-written binding visible at submit time, a pitfall we hit during development.
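The offset and dispatch arithmetic for the blur shader can be sketched in plain C. This is a minimal illustration, not the code in ssimulacra2_vulkan.c: the names Ss2vBlurPush, Ss2vDispatch, ss2v_plane_offset, ss2v_blur_push, and ss2v_blur_dispatch are hypothetical, the width/height fields are assumed, and offsets are assumed to be counted in float elements rather than bytes. Only in_offset, out_offset, the c * full_w * full_h formula, and the two dispatch shapes come from this ADR.

```c
#include <stdint.h>

/* Hypothetical push-constant block. Only in_offset/out_offset are named in
 * this ADR; width/height are assumed extra fields. */
typedef struct {
    uint32_t width;      /* active width at the current scale */
    uint32_t height;     /* active height at the current scale */
    uint32_t in_offset;  /* element offset of the source plane (assumed unit) */
    uint32_t out_offset; /* element offset of the destination plane */
} Ss2vBlurPush;

typedef struct {
    uint32_t x, y, z;
} Ss2vDispatch;

/* Channel c starts at its full-resolution slot, whatever the current scale. */
static uint32_t ss2v_plane_offset(uint32_t c, uint32_t full_w, uint32_t full_h)
{
    return c * full_w * full_h;
}

/* The descriptor set stays bound; only these values change per channel. */
static Ss2vBlurPush ss2v_blur_push(uint32_t c, uint32_t cur_w, uint32_t cur_h,
                                   uint32_t full_w, uint32_t full_h)
{
    Ss2vBlurPush pc;
    pc.width = cur_w;
    pc.height = cur_h;
    pc.in_offset = ss2v_plane_offset(c, full_w, full_h);
    pc.out_offset = ss2v_plane_offset(c, full_w, full_h);
    return pc;
}

/* One workgroup per row (horizontal pass) or per column (vertical pass). */
static Ss2vDispatch ss2v_blur_dispatch(int vertical, uint32_t cur_w, uint32_t cur_h)
{
    Ss2vDispatch d = {1u, vertical ? cur_w : cur_h, 1u};
    return d;
}
```

In a real command buffer these values would be recorded with vkCmdPushConstants before each vkCmdDispatch, which is what allows a single descriptor binding to serve all three channels.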
Host responsibilities (in ssimulacra2_vulkan.c):
- YUV → linear RGB at full resolution, using the same scalar libjxl port as ssimulacra2.c::picture_to_linear_rgb (deterministic sRGB EOTF LUT from ADR-0164).
- 2×2 box downsample between scales (cheap vs the GPU work; keeps the GPU dispatch chain focused on the per-scale blur pipeline). The downsample uses the full-resolution plane stride consistently: every pyramid level keeps its 3 planes in their full-resolution slots with the active data at the head of the slot, so the GPU shaders' channel offsets (c * full_w * full_h) line up across scales.
- linear-RGB → XYB at every scale via ss2v_host_linear_rgb_to_xyb (verbatim port of ssimulacra2.c::linear_rgb_to_xyb). Bit-exact with the CPU extractor; see §Precision investigation for why this is required rather than running XYB on the GPU.
- Per-pixel SSIMMap + EdgeDiffMap combine in double precision over the GPU-blurred mu/sigma buffers (which are HOST_VISIBLE + MAPPED via VMA). Mirrors ssim_map + edge_diff_map exactly, including the (double)num_m * (double)num_s / (double)denom_s promotion at the divide site.
- 108-weighted-norm pool + cubic polynomial + power 0.6276 transform (mirrors pool_score).
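The double-precision promotion at the divide site can be illustrated with a small C sketch. Only the expression (double)num_m * (double)num_s / (double)denom_s is taken from this ADR. The function name ss2v_ssim_d is hypothetical, the formulas for num_m, num_s, and denom_s follow the usual SSIM shape and are assumptions to be checked against ssim_map in ssimulacra2.c, and the stabilising constant is passed in as c2 because its value is not stated here.

```c
/* Hypothetical per-pixel SSIM error term over blurred statistics.
 * mu1/mu2: blurred ref/dis; s11/s22/s12: blurred ref^2, dis^2, ref*dis.
 * The float formulas are assumptions; the double promotion at the divide
 * is the part quoted in the ADR. */
static double ss2v_ssim_d(float mu1, float mu2, float s11, float s22,
                          float s12, float c2)
{
    const float mu11 = mu1 * mu1;
    const float mu22 = mu2 * mu2;
    const float mu12 = mu1 * mu2;
    const float mu_diff = mu1 - mu2;

    const float num_m = 1.0f - mu_diff * mu_diff;
    const float num_s = 2.0f * (s12 - mu12) + c2;
    const float denom_s = (s11 - mu11) + (s22 - mu22) + c2;

    /* Promote every operand before the multiply and divide. */
    const double d = 1.0 - ((double)num_m * (double)num_s) / (double)denom_s;

    /* Clamp negative values to zero (assumed to match the CPU path). */
    return d > 0.0 ? d : 0.0;
}
```

For identical reference and distorted statistics the numerator and denominator coincide and the term is exactly zero, which makes a convenient sanity check when comparing host and CPU outputs.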
The shader source files ssimulacra2_xyb.comp and ssimulacra2_ssim.comp are kept in-tree as references for future follow-up work (§Consequences). The pipelines they configure are still allocated at init for forward-compatibility but never dispatched in v1.
Min-dim guard: the host loop early-exits when the current scale falls below 8×8 (matches the CPU if (cw < 8u || ch < 8u) break). Init rejects inputs below 8×8.
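The guard's effect on the pyramid walk can be sketched as follows. This is an illustration under stated assumptions: ss2v_count_scales and SS2V_NUM_SCALES are hypothetical names, and the round-up halving (w + 1) / 2 between scales is assumed rather than taken from this ADR. The 6-scale limit and the cw < 8u || ch < 8u test are from the text.

```c
#define SS2V_NUM_SCALES 6u /* 6-scale pyramid, per the Context section */

/* Returns how many pyramid scales are processed for a w x h input. */
static unsigned ss2v_count_scales(unsigned w, unsigned h)
{
    unsigned cw = w;
    unsigned ch = h;
    unsigned processed = 0u;

    for (unsigned s = 0u; s < SS2V_NUM_SCALES; s++) {
        /* Same early exit as the CPU extractor. */
        if (cw < 8u || ch < 8u)
            break;
        processed++;
        /* 2x2 box downsample; rounding up is an assumption. */
        cw = (cw + 1u) / 2u;
        ch = (ch + 1u) / 2u;
    }
    return processed;
}
```

Under these assumptions a 576×324 input runs all six scales (the smallest being 18×11), while a 16×16 input stops after two.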
Strict-mode SPIR-V compilation (-O0): all 4 ssimulacra2 shaders build with -O0 to disable the SPIR-V optimizer's FMA contraction. The IIR blur in particular carries state across iterations — even one compiler-introduced FMA per pixel would compound across the blur radius and worsen the per-scale SSIM stats drift. The IIR shader additionally carries precise qualifiers on every state variable + per-pixel intermediate to block driver-side FMA fusion that the SPIR-V NoContraction decoration alone did not catch on lavapipe / Mesa anv / RADV (see §Precision investigation).
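Why a single contracted multiply-add matters can be shown on the CPU with a short C sketch. It is not shader code: the function names are hypothetical, the volatile temporary stands in for the explicit temp staging mentioned above, and the fused result is emulated with a double intermediate (exact for a product of two floats) rather than a hardware FMA.

```c
/* Unfused: the product is rounded to float before the add. The volatile
 * temporary stops the C compiler from contracting the two operations. */
static float mul_add_unfused(float a, float b, float c)
{
    volatile float prod = a * b;
    return prod + c;
}

/* Fused-like: a float*float product fits exactly in a double, so the
 * result is rounded to float only once, at the end. */
static float mul_add_fused_like(float a, float b, float c)
{
    const double exact_prod = (double)a * (double)b;
    return (float)(exact_prod + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the unfused form returns 0 because the 2^-24 tail of the product is rounded away, while the fused-like form returns 2^-24. In a recursive filter that difference feeds back into every later output along the scan axis.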
Empirical precision
ADR-0192 sets places=2 as the nominal precision target for ssimulacra2 with the "measure first; may surprise upward" qualifier inherited from ADR-0188. Final achieved precision on the Netflix normal pair (src01_hrc00_576x324.yuv ↔ src01_hrc01_576x324.yuv, 576×324, 48 frames):
- Pooled ssimulacra2 score (full --precision max output): max abs diff = 1.81e-7, mean = 3.65e-8, P95 = 1.56e-7. Cross-backend gate runs at places=4 (5e-5 threshold; 0/48 mismatches), matching the rest of the Vulkan VIF/MS-SSIM family. We exceed the parent's places=2 target by ~4 decimal places (places=6 effective).
- Per-stage CPU↔GPU equivalence (verified by per-pixel buffer dumps on frame 0):
  - XYB plane: bit-exact (host-side XYB).
  - IIR-blurred mu/sigma planes: bit-exact (proven by feeding the GPU's XYB output through CPU's fast_gaussian_1d and comparing to the GPU's blur output: 0 ULP diff across all 576×324 pixels in plane Y).
  - Per-pixel SSIM + EdgeDiff d value: bit-exact (host-side combine in double precision).
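The relationship between a places=N gate and its absolute threshold can be written out in C. This sketch assumes the gate is a plain comparison against 0.5 × 10^-N, which is consistent with the 5e-5 (places=4) and 5e-3 (places=2) figures in this ADR; the actual gate implementation may differ, and both function names are hypothetical.

```c
/* Absolute tolerance for a places=N comparison: 0.5 * 10^-N (assumed). */
static double places_tolerance(int places)
{
    double tol = 0.5;
    for (int i = 0; i < places; i++)
        tol /= 10.0;
    return tol;
}

/* Largest N in [0, max_places] whose tolerance still admits the given
 * max abs diff, or -1 if even places=0 fails. */
static int effective_places(double max_abs_diff, int max_places)
{
    int best = -1;
    for (int p = 0; p <= max_places; p++) {
        if (max_abs_diff <= places_tolerance(p))
            best = p;
        else
            break;
    }
    return best;
}
```

Under that assumption, the measured 1.81e-7 clears the places=4 gate and corresponds to places=6, while a drift of 1.59e-2 only clears places=1.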
Precision investigation
The first iteration of ssimulacra2_vulkan shipped XYB and the per-pixel SSIM combine on the GPU and produced a pooled-score drift of 1.59e-2 (places=1 only). Driving the contract to the planned places=2 (≤5e-3) — and beyond, to places=4 — took five staged measurements:
| Tactic | Pooled max_abs | Notes |
|---|---|---|
| Baseline (in-shader XYB + in-shader SSIM, all float) | 1.59e-2 | --places 1 only. |
| Add precise qualifier + explicit FMA-blocking temp staging on the XYB matmul + cube-root + SSIM d compute | 1.54e-2 | ~3% improvement; precise + NoContraction decorations confirmed in spirv-dis output, but lavapipe / Mesa anv / RADV still produced ~1.7e-6 max per-pixel drift on the X plane. |
| Move per-pixel SSIM combine to host (double precision over GPU-blurred buffers) | 1.54e-2 | No improvement: per-pixel d is dominated by upstream mu/sigma drift, not by the divide's float-vs-double precision. |
| Per-pixel buffer dump + decompose CPU/GPU difference: confirm IIR is bit-exact (CPU fast_gaussian_1d on GPU XYB == GPU IIR output to 0 ULP) | n/a | Diagnostic only: isolated XYB as the sole drift source. |
| Move XYB to host (bit-exact port of linear_rgb_to_xyb) | 1.81e-7 | places=6 effective. Final fix. |
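The ULP comparisons used in the buffer-dump step of the table can be reproduced with a helper along these lines. It is a sketch, not the tooling used for this ADR: the function names are hypothetical, NaN inputs are not handled, and +0.0 and -0.0 are treated as identical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern onto a signed integer line that preserves
 * numeric ordering (negative floats mirror below zero). */
static int64_t float_to_ordered(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if (bits & 0x80000000u)
        return -(int64_t)(bits & 0x7FFFFFFFu);
    return (int64_t)bits;
}

/* Number of representable floats between a and b (0 means bit-exact,
 * up to the sign of zero). */
static uint32_t ulp_distance(float a, float b)
{
    const int64_t d = float_to_ordered(a) - float_to_ordered(b);
    return (uint32_t)(d < 0 ? -d : d);
}

/* Worst-case ULP distance between two dumped planes of n floats. */
static uint32_t max_ulp_diff(const float *cpu, const float *gpu, size_t n)
{
    uint32_t worst = 0u;
    for (size_t i = 0; i < n; i++) {
        const uint32_t d = ulp_distance(cpu[i], gpu[i]);
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Running max_ulp_diff over a CPU-side plane and the mapped GPU buffer yields 0 when the stage is bit-exact, which is the criterion the diagnostic row relies on.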
The driver-side compile chain (lavapipe, Mesa anv, and RADV were all tested) does not in practice preserve the exact float operation order required for ULP-equivalent XYB, even with precise on every intermediate and NoContraction on every OpFMul/OpFAdd. The worst-case 42-ULP X-plane drift comes from cancellation in 0.5 * (cbrt(l) - cbrt(m)) when l ≈ m; even sub-ULP per-input deviation in the matmul kM00*r + kM01*g + kM02*b + kOpsinBias amplifies through the cancellation site. Host-side XYB sidesteps the entire compile chain by running the canonical CPU port.
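The amplification at a cancellation site follows from the condition number of subtraction: a relative perturbation δ in x and y can produce a relative error of up to (|x| + |y|) / |x − y| · δ in x − y. The sketch below computes that factor; the function name is hypothetical and the sample values in the note are illustrative, not measurements from the XYB stage.

```c
/* Worst-case relative-error amplification of (x - y) with respect to
 * relative perturbations of x and y. Grows without bound as x -> y
 * (and divides by zero when x == y). */
static double cancellation_amplification(double x, double y)
{
    const double ax = x < 0.0 ? -x : x;
    const double ay = y < 0.0 ? -y : y;
    const double diff = x - y;
    const double adiff = diff < 0.0 ? -diff : diff;
    return (ax + ay) / adiff;
}
```

For two operands that agree to three decimal digits, such as 1.0 and 0.999, the factor is roughly 2000, so a half-ULP deviation in each operand can surface as hundreds of ULPs in the difference.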
The IIR blur and SSIM combine both turned out to be bit-exact when fed bit-exact inputs — so once XYB matches CPU exactly, the rest of the pipeline matches by construction.
Wall-time impact (576×324, lavapipe): host XYB adds ~3 ms per scale 0 frame (vs <1 ms for the GPU dispatch); host SSIM combine adds ~1.5 ms per scale per channel. Total per-frame extract time is dominated by the IIR (still GPU). Net wall-time impact under 2% on the Netflix normal pair.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Single fused kernel (XYB + blur + SSIM in one shader) | Fewer dispatches, lower CPU overhead | The IIR blur has carry-state along the scan axis incompatible with the SSIM stats' 2D parallel layout; would require expensive shared-memory thread-pinning per row/column | Correctness-first: separating the IIR into its own shader makes the per-stage data flow auditable against the CPU reference |
| Keep XYB on GPU, accept places=1 contract | Pure GPU per-scale pipeline; minimal host work | Diverges from ADR-0192's nominal places=2 and from the rest of the Vulkan VIF/MS-SSIM family (all at places=4) | Rejected by the user (paraphrased to neutral English: "places=1 is not good"); the §Precision investigation table shows host-side XYB clears the entire family's places=4 bar with a wall-time cost under 2% |
| Host-side IIR blur (GPU does only XYB + SSIM) | Simpler host code, no IIR shader | Defeats the purpose — IIR is the dominant per-scale cost (~50% of CPU time) | Not chosen — would leave the hottest stage on CPU |
| GPU-side YUV → linear RGB (sRGB EOTF on GPU) | Pure GPU pipeline, no host pre-pass | Requires uploading the 1024-entry sRGB EOTF LUT and an if/else-heavy YUV-matrix dispatcher into the shader | Not chosen for v1: host YUV→RGB is fast (already SIMD'd via ADR-0163); follow-up if profiling shows it's a bottleneck |
| Pack 3 channels into a single vec3 per dispatch | One dispatch processes all 3 channels at once | Triples per-pixel register pressure; (6 prev-state floats + 6 outputs) × 3 channels = 36 live floats per lane, exceeding most GPUs' register budget | Not chosen: per-channel iteration is simpler and matches the CPU reference one-for-one |
| GPU-side XYB with Float64 capability + precise everywhere | Could in principle keep XYB on GPU and still match CPU bit-for-bit | Requires shaderFloat64 (an optional device feature, not guaranteed by core Vulkan and not supported on every consumer GPU); doubles the cube-root cost; the cancellation site (0.5*(cbrt(l)-cbrt(m))) still cancels in float at the consumer if the output buffer is float | Not chosen: host XYB is bit-exact by construction without any device-feature gating |
Consequences
- Positive:
  - GPU long-tail batch 3 part 7 closes the Vulkan slot for ssimulacra2 (Group A coverage per ADR-0192).
  - Cross-backend precision lands at places=4 (max abs 1.81e-7 on the Netflix normal pair), matching the rest of the Vulkan VIF/MS-SSIM family rather than holding places=1 as the first iteration did.
  - The hybrid host/GPU layout becomes the reference template for the CUDA and SYCL twins (follow-up PR per ADR-0192 §scope).
- Negative:
  - The ssimulacra2_xyb.comp and ssimulacra2_ssim.comp shaders and their pipelines are kept in-tree but not dispatched in v1. A follow-up PR can either delete them entirely (simplest) or keep them behind a Float64-gated optional GPU-only mode.
  - Host-side XYB and SSIM combine add ~5% CPU time per scale but move that work off-GPU; net wall-time impact is under 2% on lavapipe/ANV/RADV given the IIR remains on GPU.
  - One workgroup per row / column for the IIR blur is conservative; performance follow-ups can re-bin multiple rows/columns per WG once the empirical contract is in place.
- Neutral / follow-ups:
  - CUDA + SYCL twins land in a separate PR (this PR is Vulkan-only per the user's scope direction). Both should mirror the hybrid host/GPU split unless profiling shows the host XYB is a bottleneck on those backends.
  - GPU-side YUV → linear-RGB pre-pass and GPU-side downsample are deferrable optimisations; measure first, and pursue them only if profiling flags them.
  - The psnr_hvs_strict_shaders list in core/src/vulkan/meson.build grows by 4 entries; the list is now misnamed, but renaming it is out of scope for this PR (rename in a follow-up).
References
- Parent: ADR-0192 — GPU long-tail batch 3 scope.
- CPU reference: ADR-0130 (extractor) + ADR-0161 (SIMD bit-exactness) + ADR-0162 (blur SIMD) + ADR-0163 (YUV→RGB SIMD) + ADR-0164 (deterministic EOTF + cbrt LUT/Newton).
- Vulkan precedent: ADR-0190 — ms_ssim_vulkan, the closest precedent (5-level pyramid + per-scale SSIM stats with per-WG partials).
- Min-dim guard precedent: ADR-0153.
- Source: req (user prompt for batch-3 part 7, feat/ssimulacra2-vulkan PR).