Research-0051: ssimulacra2 GPU XYB shader precision re-investigation (T-GPU-OPT-VK-3)
| Field | Value |
|---|---|
| Date | 2026-05-02 |
| Status | Decision: NO-GO. GPU XYB stays disabled; host XYB remains the canonical path. |
| Companion ADR | ADR-0201 (Accepted; this digest extends §Precision investigation with the NVIDIA driver measurement) |
| Tags | gpu, vulkan, ssimulacra2, precision, xyb, fma |
Why now
ADR-0201 §Precision investigation landed ssimulacra2_vulkan with a hybrid host/GPU layout: the linear-RGB → XYB pre-pass runs host-side (bit-exact with ssimulacra2.c::linear_rgb_to_xyb) while the IIR blur and per-pixel multiplies run on the GPU. The shader source ssimulacra2_xyb.comp ships in-tree as a reference but is never dispatched in v1: its pipeline is allocated and its descriptor sets are wired, but the vkCmdDispatch call is omitted (replaced by a (void) cast).
The original investigation matrix tested lavapipe / Mesa anv / Mesa RADV. The empirically-measured per-pixel drift on the X plane (~1.7e-6) compounded through the 6-scale pyramid + IIR + 108-weight pool to a 1.59e-2 pooled-score drift — places=1 only. The host fallback closes the gap to 1.81e-7 (places=6 effective).
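The places=N figures can be read as a count of agreeing decimal places. The sketch below assumes the gate uses the same rule as Python's `unittest.assertAlmostEqual`, that is `round(diff, places) == 0`. The actual comparison inside cross_backend_vif_diff.py is not shown in this digest, and both helper names are hypothetical.

```python
def passes_places(abs_diff: float, places: int) -> bool:
    """Assumed rule: the difference rounds to zero at `places` decimals."""
    return round(abs_diff, places) == 0


def max_places(abs_diff: float, limit: int = 12) -> int:
    """Largest `places` value (up to `limit`) that the difference still passes."""
    best = -1
    for places in range(limit + 1):
        if passes_places(abs_diff, places):
            best = places
    return best


# Figures quoted in this digest.
print(max_places(1.59e-2))          # GPU XYB pooled-score drift
print(max_places(1.81e-7))          # host XYB fallback
print(passes_places(1.5416e-2, 4))  # NVIDIA GPU XYB result against the places=4 gate
```

Under that assumed rule, 1.59e-2 passes only up to places=1 and 1.81e-7 up to places=6, which is consistent with the figures quoted above.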
Two questions stayed open after ADR-0201:
- Does the same drift appear on NVIDIA proprietary Vulkan (which was not part of the investigation matrix)? NVIDIA's compiler chain is a very different code generator from Mesa's, and the cancellation-amplification mechanism described in §Precision investigation is at the limit of what `precise` / `NoContraction` / `-O0` can mitigate, so there is a non-zero chance that a different driver gets it right.
- If NVIDIA is the only platform that holds, is that a strong enough position to ship a runtime-gated GPU XYB path?
The T-GPU-OPT-VK-3 spike answers both.
Method
Worktree: feat/ssimulacra2-gpu-xyb-shader-precision at master tip (e266bf8e). Toggle introduced as a compile-time SS2V_USE_GPU_XYB macro in core/src/feature/vulkan/ssimulacra2_vulkan.c (default 0 = host XYB; 1 = GPU XYB). The shader source itself was not modified — the existing ssimulacra2_xyb.comp already ships every known precision mitigation:
- `precise` qualifier on every per-pixel intermediate (matmul partials, cube-root state, MakePositiveXYB rescale).
- Explicit per-multiply temp staging (`l_r = kM00 * r; l_g = m01 * g; ... l = l_rg + l_b + kOpsinBias;`) to block driver-side `OpExtInst Fma` fusion that lavapipe / Mesa anv / RADV emit even when `precise` is on the LHS.
- `NoContraction` decoration on every `OpFMul` / `OpFAdd` (verified via `spirv-dis`).
- Bit-trick + 2 Newton iteration cube-root that mirrors `vmaf_ss2_cbrtf` op-for-op, including the explicit decomposition of `(2y + x/y²) / 3` into discrete mul / div / add to defeat driver-side FMA fusion.
- `-O0` SPIR-V compilation (per ADR-0201 Decision §Strict-mode SPIR-V compilation) to disable the SPIR-V optimizer's contraction passes.
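Two of these mitigations can be illustrated on the host. The sketch below emulates binary32 arithmetic with Python's `struct` module: it shows how a fused multiply-add can differ from a staged multiply and add by one rounding, and it spells out a bit-trick seed plus two Newton steps with every operation rounded separately. The helper names are hypothetical. The seed constant is the one used by FreeBSD's `cbrtf` and is an assumption here, because the constant used by vmaf_ss2_cbrtf is not quoted in this digest.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mul_add_unfused(a: float, b: float, c: float) -> float:
    """Staged multiply then add: the product is rounded before the add."""
    return f32(f32(a * b) + c)


def mul_add_fused(a: float, b: float, c: float) -> float:
    """FMA-style evaluation with a single rounding at the end.

    The binary64 product of two binary32 values is exact, and the sum is
    exact for the operands used below, so this models a true FMA here.
    """
    return f32(a * b + c)


def cbrt_f32(x: float, iterations: int = 2) -> float:
    """Bit-trick seed plus Newton steps, each operation rounded to binary32.

    Assumption: seed constant borrowed from FreeBSD's cbrtf.
    Valid for positive, normal x only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    y = struct.unpack("<f", struct.pack("<I", bits // 3 + 709958130))[0]
    for _ in range(iterations):
        y_sq = f32(y * y)          # y^2
        quot = f32(x / y_sq)       # x / y^2
        two_y = f32(2.0 * y)       # 2y
        total = f32(two_y + quot)  # 2y + x/y^2
        y = f32(total / 3.0)       # (2y + x/y^2) / 3
    return y


a = f32(1.0 + 2.0 ** -12)
print(mul_add_unfused(a, a, -1.0))  # product rounded first
print(mul_add_fused(a, a, -1.0))    # single rounding at the end
print(cbrt_f32(27.0))
```

For the chosen operands the two multiply-add variants differ by 2^-24, which is the kind of sub-ULP discrepancy the staging and `NoContraction` decorations are meant to rule out.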
Build: meson setup build -Denable_vulkan=enabled -Denable_cuda=false -Denable_sycl=false -Dc_args="-DSS2V_USE_GPU_XYB=1"; ninja -C build in libvmaf/. Cross-backend gate:
```bash
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json \
python3 scripts/ci/cross_backend_vif_diff.py \
  --feature ssimulacra2 --backend vulkan --places 4 \
  --vmaf-binary core/build/tools/vmaf \
  --reference python/test/resource/yuv/src01_hrc00_576x324.yuv \
  --distorted python/test/resource/yuv/src01_hrc01_576x324.yuv \
  --width 576 --height 324 --pixel-format 420 --bitdepth 8
```
Driver: NVIDIA proprietary Vulkan ICD, /usr/share/vulkan/icd.d/nvidia_icd.json, NVIDIA driver 595.71.05 (RTX class consumer GPU).
Result
| SS2V_USE_GPU_XYB | Driver | max_abs_diff | Mismatches at places=4 | Verdict |
|---|---|---|---|---|
| 0 (host XYB, shipped default) | NVIDIA | 1.000000e-06 | 0/48 | OK |
| 1 (GPU XYB, this spike) | NVIDIA | 1.541600e-02 | 47/48 | FAIL |
| (ADR-0201 baseline) | lavapipe / anv / RADV | 1.59e-2 | 47/48 | FAIL |
| (ADR-0201 mitigated) | lavapipe / anv / RADV | 1.54e-2 | 47/48 | FAIL |
NVIDIA's drift sits at 1.5416e-2, a virtually-identical magnitude to the Mesa stack's 1.54e-2 result. Three readings:
- The cancellation-amplification site (`X = 0.5 * (cbrt(l) - cbrt(m))`) is not driver-specific. It is an algebraic property of the conversion: when `l ≈ m` (which is the common case for natural imagery in the L/M opsin channels), any sub-ULP perturbation in the matmul inputs is amplified to single-ULP magnitude in `X`, and the X plane is a load-bearing input to the SSIM stats that feed the 108-weight pool.
- The pre-amplification perturbation comes from the float matmul order `kM00*r + m01*g + kM02*b + kOpsinBias`. The shader stages every multiply into a `precise` temp and spells out the addition tree, but at ULP granularity the float `+` operator is non-associative, and the GPU's SIMD lane ordering interacts with the SPIR-V optimizer's constant-folding passes in ways the spec's `precise` / `NoContraction` decorations do not bound to bit-equivalent IEEE-754 semantics.
- NVIDIA's compiler reproduces the Mesa stack's error magnitude rather than improving on it. This rules out the "maybe NVIDIA's compiler stack is more conservative" hypothesis.
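The first two readings can be checked numerically on the host. The sketch below measures how much a 1-ULP (binary32) change in `l` moves `X = 0.5 * (cbrt(l) - cbrt(m))` in relative terms, and shows that binary32 addition gives different results for different association orders. The input values are illustrative and are not taken from the test clip, and all helper names are hypothetical.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def next_up_f32(x: float) -> float:
    """Next binary32 value above a positive, finite binary32 x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


def xyb_x(l: float, m: float) -> float:
    """X = 0.5 * (cbrt(l) - cbrt(m)), evaluated in binary64 as a reference."""
    return 0.5 * (l ** (1.0 / 3.0) - m ** (1.0 / 3.0))


def amplification(l: float, m: float) -> float:
    """Relative change in X divided by the relative 1-ULP change in l."""
    l_up = next_up_f32(l)
    rel_in = (l_up - l) / l
    rel_out = (xyb_x(l_up, m) - xyb_x(l, m)) / xyb_x(l, m)
    return rel_out / rel_in


def sum_left_to_right(a: float, b: float, c: float) -> float:
    """(a + b) + c with binary32 rounding after each add."""
    return f32(f32(a + b) + c)


def sum_right_to_left(a: float, b: float, c: float) -> float:
    """a + (b + c) with binary32 rounding after each add."""
    return f32(a + f32(b + c))


# Illustrative opsin-like values with l close to m.
l, m = f32(0.5), f32(0.49)
print(amplification(l, m))

big = 2.0 ** 24
print(sum_left_to_right(big, 1.0, 1.0), sum_right_to_left(big, 1.0, 1.0))
```

The amplification factor grows as `l` approaches `m`, so the value printed for these inputs should not be read as reproducing the ~30× figure quoted in this digest.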
Sanity check: rebuilding without the toggle (SS2V_USE_GPU_XYB=0, the shipped default) and re-running the same gate produces max_abs_diff = 1.0e-6, 0/48 OK — confirming the regression is isolated to the GPU XYB path and the host fallback continues to hold places=4.
Decision
NO-GO. The GPU XYB shader cannot reach places=4 on any tested driver (lavapipe, Mesa anv, Mesa RADV, NVIDIA proprietary). The limiting factor is algorithmic — cancellation in 0.5 * (cbrt(l) - cbrt(m)) amplifies any sub-ULP matmul perturbation by ~30× — not driver-specific or precision-decoration-tunable.
Three options were weighed:
| Option | Pros | Cons | Rationale |
|---|---|---|---|
| NO-GO (chosen) — keep host XYB as canonical, leave shader as in-tree reference | Preserves places=4 cross-backend contract; no runtime knob to misconfigure; no per-driver gating logic | The shader stays compiled but never dispatched — small idle code surface | Consistent with ADR-0201's investigation; new evidence (NVIDIA result) reinforces rather than overturns the original conclusion |
| Ship runtime toggle, default off, NVIDIA-only opt-in | Lets advanced users trade precision for ~5% wall-time saving on NVIDIA | NVIDIA result is not better than Mesa — the toggle would not even help on NVIDIA. The hypothesis the toggle was supposed to validate was falsified by the measurement | Falsified by the data — there is no driver where the toggle would deliver a precision-acceptable speedup |
| Convert XYB to Float64 shader path (extension-gated) | Could in principle break the cancellation amplification | Requires shaderFloat64 (not core on Vulkan 1.0; absent on a meaningful slice of consumer GPUs); doubles cube-root cost; the X output is still float so the cancellation reappears at the store site | ADR-0201 already weighed and rejected this; the new measurement does not change the trade-off |
Code change
None. The toggle macro and the dispatch hook used during the spike were reverted before commit. The C diff against master is empty.
The shader source ssimulacra2_xyb.comp, its pipeline allocation in ssimulacra2_vulkan.c::ss2v_create, and the descriptor set allocation in ss2v_run_scale all stay in their current form — that arrangement is what ADR-0201 Decision §Consequences specifies for forward-compatibility. A future follow-up that ships an opt-in Float64 GPU XYB path would reuse the same scaffolding.
Reproducer
```bash
git checkout feat/ssimulacra2-gpu-xyb-shader-precision
# manually apply the 3-hunk feasibility patch (SS2V_USE_GPU_XYB
# macro + #if guards in ss2v_run_scale and the per-scale loop);
# the patch is documented inline in this digest above.
cd libvmaf
meson setup build -Denable_vulkan=enabled -Denable_cuda=false \
  -Denable_sycl=false -Dc_args="-DSS2V_USE_GPU_XYB=1"
ninja -C build
cd ..
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json \
python3 scripts/ci/cross_backend_vif_diff.py \
  --feature ssimulacra2 --backend vulkan --places 4 \
  --vmaf-binary core/build/tools/vmaf \
  --reference python/test/resource/yuv/src01_hrc00_576x324.yuv \
  --distorted python/test/resource/yuv/src01_hrc01_576x324.yuv \
  --width 576 --height 324 --pixel-format 420 --bitdepth 8
# Expect: max_abs_diff = 1.5416e-2, 47/48 mismatches at places=4.
```
Open questions / follow-ups
- Float64 GPU XYB, gated on the `shaderFloat64` device feature, remains a theoretical option. It is not on the near-term backlog: the cancellation reappears at the float store step regardless of intermediate precision unless the entire XYB output buffer is widened to `double`, which doubles downstream blur / mul / SSIM memory traffic. Any pursuit of this would need a fresh ADR.
- Runtime profiling delta of host XYB vs the (hypothetical) GPU XYB: ADR-0201 reports <2% wall-time impact on lavapipe. Re-measuring on NVIDIA could be useful for the host XYB SIMD-isation work (concurrent VK-2 subagent on `feat/ssimulacra2-host-xyb-simd`): if the host XYB is already ~5% of frame time, an AVX-512 path saves ~3% wall. This digest does not block or substitute for that work.
- CUDA / SYCL twins of `ssimulacra2` mirror the same hybrid layout per ADR-0206. This digest's evidence reinforces that the same host-XYB decision applies to both.
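The "~5% of frame time, ~3% wall" arithmetic in the second item follows Amdahl's law. The sketch below works out the stage speedup those two figures imply. The resulting factor is an inference from the quoted numbers, not something this digest states, and the function names are hypothetical.

```python
def wall_time_saving(stage_fraction: float, stage_speedup: float) -> float:
    """Fraction of total wall time saved when one stage runs `stage_speedup` times faster."""
    return stage_fraction * (1.0 - 1.0 / stage_speedup)


def required_stage_speedup(stage_fraction: float, target_saving: float) -> float:
    """Stage speedup needed to save `target_saving` of total wall time."""
    if not 0.0 <= target_saving < stage_fraction:
        raise ValueError("saving must be non-negative and below the stage's share of wall time")
    return 1.0 / (1.0 - target_saving / stage_fraction)


# Host XYB at ~5% of frame time, ~3% wall-time saving (figures from the item above).
print(required_stage_speedup(0.05, 0.03))
print(wall_time_saving(0.05, 2.5))
```

Read this way, a ~3% wall-time saving from a stage that takes ~5% of the frame corresponds to that stage running roughly 2.5 times faster.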
References
- ADR-0201: original precision investigation and host-XYB decision.
- ADR-0192: batch 3 GPU long-tail scope; ssimulacra2 is part 7.
- ADR-0164: deterministic sRGB EOTF + cbrt scalar reference.
- `core/src/feature/vulkan/shaders/ssimulacra2_xyb.comp`: the shader under investigation.
- `core/src/feature/ssimulacra2_math.h::vmaf_ss2_cbrtf`: scalar cube-root the shader mirrors.
- `scripts/ci/cross_backend_vif_diff.py`: the gate the measurement was taken with.
- Source: T-GPU-OPT-VK-3 (Vulkan-side optimization tracker).