Research-0089: Vulkan VIF API-1.4 NVIDIA residual (CPU double vs Vulkan float stage bisect)
- Status: Active
- Workstream: ADR-0264, ADR-0269, state.md row T-VK-VIF-1.4-RESIDUAL.
- Last updated: 2026-05-08
Question
PR #346 / ADR-0269 (Step A of the Vulkan 1.4 bump path) tagged the load-bearing FP ops in vif.comp with GLSL precise. The optimised SPIR-V correctly emits OpDecorate ... NoContraction on every float-arithmetic op (verified locally — see §1 below). Yet on NVIDIA RTX 4090 + driver 595.71.05 + API 1.4.329 the cross-backend places=4 gate still reports 45/48 mismatches on integer_vif_scale2, max abs 1.527e-02 — the same magnitude as the pre-Step-A baseline. RADV (Mesa 26.0.6) and lavapipe stay clean. The state.md row T-VK-VIF-1.4-RESIDUAL asks for a CPU double vs Vulkan float stage-by-stage bisect to localise the contraction-or-precision surface that PR #346 missed.
This digest is the bisect's static-analysis output. The dynamic empirical leg (rebuilding the CPU reference in float end-to-end and re-running the gate against the live NVIDIA lane) is recorded as not run in this session — see §"Empirical leg not executed" below.
Sources
- ADR-0264 — parent decision (deferred bump, two-step plan).
- ADR-0269 — Step A decision; vif precise-decoration scope.
- research-0053 — root-cause investigation that motivated Step A.
- research-0056 — Step A implementation findings (SPIR-V emission audit + numbers table).
- ADR-0391 research-0055 — ciede2000 sibling investigation that proved the structural f32-vs-f64 hypothesis on the chained colour-space chain.
- core/src/feature/integer_vif.c: CPU reference (the double-precision side of the bisect).
- core/src/feature/sycl/integer_vif_sycl.cpp: SYCL companion, all-float (already passes the gate).
- core/src/feature/vulkan/shaders/vif.comp: the Vulkan path under investigation.
Approach
The state.md row's brief was: stage-by-stage ULP-vs-CPU per device on the 48-frame Netflix fixture, localising the divergent stage on NVIDIA. That requires a live NVIDIA run plus per-stage instrumentation in both the C reference and the GLSL kernel. This session's work is the static half of that bisect:
- Re-verify PR #346's claim that every float-arithmetic op in the optimised SPIR-V carries OpDecorate NoContraction, and count how many such ops actually exist.
- Diff the CPU and Vulkan FP-graphs op-for-op. Identify every site where the CPU runs double and the Vulkan path runs float.
- Cross-check against the SYCL backend (also all-float), which passes the gate, to test whether "f32 throughout" alone can account for the NVIDIA-1.4 residual.
- Synthesise: is the residual (a) a missed contraction surface, (b) the structural f32-vs-f64 class T-VK-CIEDE-F32-F64 documents for the colour-space chain, or (c) something opaque the SPIR-V-surface model cannot see?
Findings
1. The optimised SPIR-V has only 5 float-arithmetic ops, all NoContraction-decorated
Re-running glslc 2026.1 + spirv-dis at --target-env=vulkan1.4 -O against vif.comp on this worktree (research/vif-1.4-residual-bisect-2026-05-08, fork master tip 0a8b539e):
glslc --target-env=vulkan1.4 -O core/src/feature/vulkan/shaders/vif.comp -o /tmp/vif-14.spv
spirv-dis /tmp/vif-14.spv | grep -E 'OpFDiv|OpFMul|OpFAdd|OpFSub'
Output (5 ops, every one of them precise-tagged):
| SPIR-V ID | Op | GLSL line in vif.comp | NoContraction? |
|---|---|---|---|
| %1360 | OpFDiv %float %1357 %1359 | g = float(sigma12) / float(sigma1_sq) (line 505) | yes |
| %1366 | OpFMul %float %1360 %1357 | g * float(sigma12) (line 506) | yes |
| %1367 | OpFSub %float %1362 %1366 | float(sigma2_sq) - g * float(sigma12) → sv_sq (line 506) | yes |
| %1379 | OpFMul %float %1376 %1376 | g * g (line 509) | yes |
| %1382 | OpFMul %float %1379 %1359 | g*g * float(sigma1_sq) → gg_sigma_f (line 509) | yes |
Sanity-check at -O0 (unoptimised): 6 FP ops total — the same 5 arithmetic ops plus one OpFOrdLessThan for the if (sv_sq < 0.0) guard, which is not contraction-relevant. So the optimiser does not synthesise extra FP arithmetic from the GLSL; the FP-arithmetic surface is the 5 ops listed above.
cmp /tmp/vif-13.spv /tmp/vif-14.spv is byte-identical (28 924 bytes) — the SPIR-V the driver receives does not depend on the runtime API version. This confirms research-0053 §2's earlier finding still holds post-Step-A.
Conclusion of stage 1: PR #346 is complete on the SPIR-V surface. Adding more precise decorations to vif.comp cannot help — there is nothing left to decorate in the FP-arithmetic graph. The 62 NoContraction decorations PR #346 emits are spread across integer ops the optimiser dragged the precise qualifier into; the 5 floating-point ops the residual could possibly live in are all already covered.
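The audit in this section can be approximated mechanically. The following is a minimal sketch, assuming the spirv-dis text output uses the usual `%id = OpFDiv ...` and `OpDecorate %id NoContraction` line shapes. The function name `audit_no_contraction` and the sample disassembly are hypothetical illustrations, not part of the fork's tooling.

```python
import re

# Float-arithmetic opcodes that the grep in this section looks for.
FLOAT_ARITH_OPS = ("OpFDiv", "OpFMul", "OpFAdd", "OpFSub")

RESULT_RE = re.compile(r"^\s*(%\w+)\s*=\s*(Op\w+)\b")
DECORATE_RE = re.compile(r"^\s*OpDecorate\s+(%\w+)\s+NoContraction\b")


def audit_no_contraction(disassembly: str):
    """Return (float_ops, undecorated) for a spirv-dis text dump.

    float_ops maps result id -> opcode for every float-arithmetic op;
    undecorated lists the ids that lack an OpDecorate ... NoContraction.
    """
    decorated = set()
    float_ops = {}
    for line in disassembly.splitlines():
        m = DECORATE_RE.match(line)
        if m:
            decorated.add(m.group(1))
            continue
        m = RESULT_RE.match(line)
        if m and m.group(2) in FLOAT_ARITH_OPS:
            float_ops[m.group(1)] = m.group(2)
    undecorated = sorted(i for i in float_ops if i not in decorated)
    return float_ops, undecorated


# Hypothetical excerpt shaped like the five ops listed in the table above.
SAMPLE = """
OpDecorate %1360 NoContraction
OpDecorate %1366 NoContraction
OpDecorate %1367 NoContraction
OpDecorate %1379 NoContraction
OpDecorate %1382 NoContraction
%1360 = OpFDiv %float %1357 %1359
%1366 = OpFMul %float %1360 %1357
%1367 = OpFSub %float %1362 %1366
%1379 = OpFMul %float %1376 %1376
%1382 = OpFMul %float %1379 %1359
"""

ops, missing = audit_no_contraction(SAMPLE)
print(len(ops), "float-arithmetic ops;", len(missing), "without NoContraction")
```

Feeding it the real `spirv-dis /tmp/vif-14.spv` output instead of `SAMPLE` would be the natural use; an empty `missing` list corresponds to the "nothing left to decorate" conclusion.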
2. CPU vs Vulkan FP-graph diff: the structural mismatches
Side-by-side of the per-pixel inner expression, for the sigma1_sq >= sigma_nsq && sigma12 > 0 && sigma2_sq > 0 branch that gates 45 of 48 frames at scale 2:
| Stage | CPU (integer_vif.c::vif_statistic_8 lines 326–336) | Vulkan (vif.comp lines 500–509) |
|---|---|---|
| g = sigma12 / sigma1_sq | double g = sigma12 / (sigma1_sq + eps) (f64, eps = 6.5536e-6) | precise float g = float(sigma12) / float(sigma1_sq) (f32, no eps) |
| sv_sq | int32_t sv_sq = sigma2_sq - g * sigma12 (RHS in f64, then truncate-to-int32) | precise float sv_sq = float(sigma2_sq) - g * float(sigma12) (f32, kept as float) |
| g-clamp | g = MIN(g, vif_enhn_gain_limit) (f64) | g = min(g, pc.vif_enhn_gain_limit) (f32) |
| gg * sigma1_sq | (int64_t)((g * g * sigma1_sq)) (f64 product, truncate-to-int64) | precise float gg_sigma_f = g * g * float(sigma1_sq), then int64_t(gg_sigma_f) |
| numerator | int64_t numer1_tmp = (int64_t)(g*g*sigma1_sq) + numer1 (exact int64 from f64) | uint64_t(int64_t(gg_sigma_f)) + uint64_t(numer1) (int64 from f32) |
| log2 LUT | log2_64(table, numer1_tmp) - log2_64(table, numer1) (exact integer table) | lut[numlog - 32768u] - lut[denlog - 32768u] (exact integer table) |
Five structural differences:
- The eps term (6.5536e-6) is present in the CPU divisor and absent in the shader. The CPU comment ("this epsilon can go away") suggests it is a guard against sigma1_sq == 0; the shader covers that with the sigma12 > 0 && sigma1_sq != 0 && sigma2_sq != 0 outer guard. Mathematically this affects every active-branch sample, but the relative magnitude is eps / sigma1_sq ≤ eps / sigma_nsq = 5e-11, well below one ULP at f32, so it is not a plausible explanation for 1.527e-02.
- g is f64 on CPU, f32 on Vulkan. Sole load-bearing difference for the first dataflow stage.
- sv_sq is f64-then-int32 on CPU, f32 on Vulkan. The CPU product g * sigma12 is f64 before the int32 truncation; the shader product is f32. ULP-class divergence for high-magnitude sigma12.
- g*g*sigma1_sq is f64 on CPU, f32 on Vulkan. This is the gg_sigma_f expression PR #346 explicitly tagged precise. Even with NoContraction, the f32 product carries only 23 explicit mantissa bits where the f64 carries 52.
- The int64(gg_sigma_f) truncation narrows the f32 result before the integer-domain accumulation; the CPU narrows the f64 result, which has 29 more mantissa bits to spare.
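To get a feel for the size of these structural differences, the two chains can be emulated on the host. This is a minimal sketch, not the fork's code: it rounds every intermediate to f32 with `struct` to mimic the shader column, omits the g-clamp, and uses a hypothetical input sample. `SIGMA_NSQ = 131072` is an assumption inferred from the stated bound eps / sigma_nsq = 5e-11.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# eps as quoted in the table above; SIGMA_NSQ is an inferred assumption.
EPS = 6.5536e-6
SIGMA_NSQ = 131072.0


def cpu_stage(sigma1_sq, sigma2_sq, sigma12):
    """f64 chain mirroring the CPU column (g-clamp omitted)."""
    g = sigma12 / (sigma1_sq + EPS)
    sv_sq = int(sigma2_sq - g * sigma12)   # truncates toward zero, like an int32 cast
    gg_sigma = int(g * g * sigma1_sq)      # truncates toward zero, like an int64 cast
    return g, sv_sq, gg_sigma


def vulkan_stage(sigma1_sq, sigma2_sq, sigma12):
    """f32 chain mirroring the shader column; each op result is rounded to f32."""
    s1, s2, s12 = f32(sigma1_sq), f32(sigma2_sq), f32(sigma12)
    g = f32(s12 / s1)
    sv_sq = f32(s2 - f32(g * s12))         # kept as float, as in the shader
    gg_sigma = int(f32(f32(g * g) * s1))
    return g, sv_sq, gg_sigma


# Hypothetical active-branch sample: (sigma1_sq, sigma2_sq, sigma12).
sample = (3_000_000, 2_800_000, 2_500_000)
g64, sv64, gg64 = cpu_stage(*sample)
g32, sv32, gg32 = vulkan_stage(*sample)

print("eps bound     :", EPS / SIGMA_NSQ)
print("g rel. diff   :", abs(g32 - g64) / g64)
print("sv_sq f64/f32 :", sv64, sv32)
print("gg_sigma diff :", gg32 - gg64)
```

On a sample like this the per-stage gaps stay at the f32 rounding level, which is the scale of error the structural f32-vs-f64 argument can account for.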
3. SYCL is all-float and passes the gate
core/src/feature/sycl/integer_vif_sycl.cpp lines 706–716 show SYCL uses float g, float sv_sq, float gg_sigma_f — identical precision contract to the Vulkan shader. SYCL passes the cross-backend gate at places=4 on every backend the fork ships against (Intel Arc A380 / Mesa anv; CUDA on the shared kernel).
This rules out "f32-vs-f64 alone" as the explanation for the NVIDIA-1.4 residual. If structural f32-vs-f64 were sufficient, the SYCL backend would fail the gate too, and ciede's pre-Step-A 1.3 baseline would have been similarly deep red. Instead:
| | API 1.3, no precise | API 1.3, with precise (PR #346) | API 1.4, with precise |
|---|---|---|---|
| vif scale 2 vs CPU-f64 | 0/48 (≤1e-06) | 0/48 (2e-06) | 45/48 (1.527e-02) |
| ciede2000 vs CPU-f64 | 42/48 (1.67e-04) | 5/48 (8.9e-05) | 5/48 (8.9e-05) |
The vif row's 1.3 column is clean. Same f32 shader, same f64 CPU reference, same NVIDIA driver, same fixture — passing 0/48. Switching the runtime API version to 1.4 is the only differentiating input, and the SPIR-V byte-identity check (§1, also research-0053 §2) rules out any compile-side codegen change.
4. Synthesis: what the bisect localises and what it cannot
The static evidence converges on the following decomposition:
- The 5 SPIR-V FP-arithmetic ops on the float side are not the residual contraction surface PR #346 missed. All 5 are decorated NoContraction; the SPIR-V surface PR #346 protected is exhaustive.
- A pure f32-vs-f64 precision gap (analogous to T-VK-CIEDE-F32-F64) is also insufficient on its own. The 1.3 lane runs the exact same f32 graph against the exact same f64 CPU reference and is clean. f32-vs-f64 is the upper bound on the worst-case error, not the typical error; on this fixture it stayed under the gate at 1.3. The 1.4 lane lifts that error by a factor of ~10^4 (~1e-6 → 1.527e-2), about four orders of magnitude, which is far more than the structural f32-vs-f64 contract permits on the same five ops.
- What's left is opaque to the SPIR-V surface. Plausible candidates the SPIR-V-level model cannot directly resolve:
  - NVIDIA's internal shaderFloatControls2 v2 codegen (core in 1.4) flipping a default that is not exposed via SPIR-V declarable execution-mode bits, e.g. a reciprocal-multiply vs true-divide substitution on OpFDiv, an rsq rewrite for OpFMul %x %x, or a fused mad-class instruction selection that is not strictly an FMA in the IEEE sense (some NVIDIA SASS MUFU paths). NoContraction blocks a*b+c → fma(a,b,c) but does not bind reciprocal / transcendental selection.
  - A subgroup-reduction codegen change that affects the int64 cross-lane reduction (less likely: the int64 ops are decorated by the optimiser leak in §1, and int64 isn't contraction-relevant either).
  - A driver-bug-class shift in 595.71's 1.4 path that the SPIR-V contract doesn't cover. Outside the fork's reach without an NVIDIA driver-team escalation (research-0053 §"Open questions" already filed this).
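The amplification figure used in this synthesis is plain arithmetic on the two reported maxima. A small sketch, using only numbers quoted in this digest plus the standard f32 unit roundoff 2^-24 for scale:

```python
import math

err_api_13 = 1e-6       # order of the clean 1.3 lane's max abs diff (table in §3)
err_api_14 = 1.527e-2   # max abs diff reported on the 1.4 lane

ratio = err_api_14 / err_api_13
print(f"amplification ~ {ratio:.3g} (10^{math.log10(ratio):.2f})")

# f32 unit roundoff for comparison: 2**-24 is about 5.96e-8.
f32_unit_roundoff = 2.0 ** -24
print(f"1.4 residual / f32 unit roundoff ~ {err_api_14 / f32_unit_roundoff:.3g}")
```

Either way of scaling the 1.4 residual puts it several orders of magnitude above single-operation f32 rounding.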
5. Empirical leg not executed (NVIDIA-side dynamic stage diff)
The full state.md brief asks for per-stage ULP dumps on the live NVIDIA + RADV + lavapipe lanes — instrument every stage of the inner expression (g, sv_sq, gg_sigma_f, log-LUT inputs) on both CPU and Vulkan, dump per-frame, ULP-diff. That requires:
- A debug-build of core/src/feature/integer_vif.c with per-stage fprintf on the active branch of frame 0..47.
- A modified vif.comp writing the same per-stage values to an SSBO (the shader currently emits only the 7-field int64 accumulator per workgroup).
- Re-runs against all three Vulkan ICDs.
- A local API-1.4 bump (4-site change in core/src/vulkan/common.c + vma_impl.cpp) that is not on master.
The session this digest was produced in did not run that instrumentation. The static evidence above is sufficient to recommend implementation phase 2 NOT proceed — see §"Phase 2 recommendation" — because the SPIR-V-surface mitigation space is exhausted. The dynamic per-stage table from the empirical run is recorded as [UNVERIFIED — needs NVIDIA hardware run + per-stage instrumentation] per the session's hardware-availability constraint and the no-fabrication rule.
| Stage | CPU baseline | NVIDIA Vulkan | RADV Vulkan | lavapipe Vulkan |
|---|---|---|---|---|
| g = sigma12 / sigma1_sq | f64 ground truth | [UNVERIFIED] | [UNVERIFIED] | [UNVERIFIED] |
| sv_sq = sigma2_sq - g·sigma12 | f64 ground truth | [UNVERIFIED] | [UNVERIFIED] | [UNVERIFIED] |
| gg_sigma_f = g·g·sigma1_sq | f64 ground truth | [UNVERIFIED] | [UNVERIFIED] | [UNVERIFIED] |
| log2_64(numer1_tmp) (int LUT) | exact integer | exact integer | exact integer | exact integer |
| log2_64(numer1) (int LUT) | exact integer | exact integer | exact integer | exact integer |
| frame integrand | f64 ground truth | [UNVERIFIED] | [UNVERIFIED] | [UNVERIFIED] |
| 48-frame places=4 verdict | reference | 45/48 FAIL, max abs 1.527e-02 (research-0053/0056, reproduced 2026-05-03 by lawrence; not re-run this session) | 0/48 (research-0053) | 0/48 (research-0053, predicted clean by symmetry) |
The only NVIDIA cell with a non-[UNVERIFIED] value is the final-verdict row, and its number is cited from research-0053 + research-0056, not re-measured here.
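For whoever fills in the table, one way to express a "ULP diff" between the CPU's f64 value and a GPU f32 value is to round the reference to f32 and count representable f32 steps between the two. This is a sketch under that assumption; the helper names are hypothetical, and NaN or out-of-f32-range inputs are not handled.

```python
import struct


def f32_bits_ordered(x: float) -> int:
    """Map a float (rounded to f32) to an integer that is monotonic in x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Negative floats sort in reverse bit order; flip them onto one number line.
    return bits if bits < 0x80000000 else 0x80000000 - bits


def ulp_distance_f32(reference_f64: float, measured_f32: float) -> int:
    """ULP gap between the f32-rounded CPU value and a GPU f32 value."""
    return abs(f32_bits_ordered(reference_f64) - f32_bits_ordered(measured_f32))


# 0x3F800001 is the f32 value immediately above 1.0 (0x3F800000).
next_after_one = struct.unpack("<f", struct.pack("<I", 0x3F800001))[0]
print(ulp_distance_f32(1.0, next_after_one))  # adjacent f32 values -> 1
```

Rounding the f64 reference first means a result of 0 or 1 reads as "agrees to f32 precision"; a per-stage dump would apply this to each of g, sv_sq and gg_sigma_f.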
Phase 2 recommendation
Do not attempt a phase-2 shader fix in this PR. The bisect's shader-side conclusion is negative-by-exhaustion: the only FP arithmetic the shader emits is already precise-decorated, the SYCL counter-example rules out a pure f32-vs-f64 gap, and the residual sits in driver-internal codegen the SPIR-V surface cannot bind.
The remaining unblock paths for T-VK-1.4-BUMP (Step B), in descending preference:
- Empirical phase-1 dynamic dump on the live NVIDIA lane. Per-stage SSBO instrumentation as sketched in §5 above, plus the matching CPU-side per-stage prints, plus a local API-1.4 bump. If the per-stage NVIDIA values diverge from RADV at g, sv_sq, or gg_sigma_f before the int64 truncation, the driver is mis-handling one of the five ops despite the NoContraction contract; file as an NVIDIA driver bug and document a workaround. If they diverge only after the int64 truncation while the f32 values agree, that is a benign rounding-bias pattern and a places=3 NVIDIA-only override ADR is the path. Owner-driven; ~1 day of work; gated on NVIDIA hardware access.
- Try OpExecutionMode SignedZeroInfNanPreserveFloat32 via hand-edited SPIR-V or GL_EXT_shader_float_controls2 once glslc gains support. Research-0053 §"Open questions" already tracks this. Low expected value (the regression magnitude is above what signed-zero / inf / nan mishandling could plausibly cause), but cheap to prove.
- Defer indefinitely with the places override pattern. A per-feature places=3 NVIDIA-only override (analogous to T-VK-CIEDE-F32-F64) shipping the partial f32 contract as documented fork debt. Gated on writing a separate ADR; would reduce the gate's strictness by 1 ULP-class on this one metric on this one driver.
- NVIDIA driver-team escalation. Outside fork reach without a paid NVIDIA dev relations channel.
Open questions
- Why does API 1.3 → 1.4 specifically amplify the f32 error by ~10^4 on these five ops despite NoContraction being declared? The leading hypothesis (research-0053 §4) is that NVIDIA's v2-shaderFloatControls codegen path activates at instance API 1.4 and changes a non-IEEE-bound code-selection default (reciprocal-multiply for divide, fast-rsq for g*g, or similar) that is not constrained by NoContraction. The Vulkan SPIR-V surface has no normative declarable for that class.
- Does GL_EXT_shader_float_controls2 (when glslc 2026.1+ supports it) expose a knob that binds reciprocal selection? Not in the current spec: shaderFloatControls2 exposes denormal / rounding-mode / signed-zero-inf-nan preserve, none of which cover reciprocal substitution.
- Is the residual reproducible on NVIDIA driver 600.x once released? Unknown; no driver-release-notes entry calls out shader-compiler changes between 595.71 and the next branch. Worth re-checking opportunistically.
References
- ADR-0214: places=4 cross-backend parity gate.
- scripts/ci/cross_backend_vif_diff.py: gate harness; runs the 48-frame Netflix fixture per backend.
- SPIR-V 1.6: OpDecorate NoContraction
- VK_KHR_shader_float_controls2 (core in 1.4)
- GLSL 4.50 §4.7.1: precise qualifier
- Source: parent-agent task brief, 2026-05-08 (paraphrased): localise the root cause of the Vulkan VIF residual NVIDIA mismatch at API 1.4 that PR #346's Step A did not close; per-stage CPU-double vs Vulkan-float ULP bisect; if inconclusive, document the gap and Step B remains blocked.
Status update 2026-05-09: Phase 2 dynamic dump landed, refuting the FP-precision hypothesis
The Phase 2 follow-up ran the live NVIDIA RTX 4090 + driver 595.71.05 + Vulkan loader 1.4.341 lane this session with the local API-1.4 bump applied (core/src/vulkan/common.c 3 sites + core/src/vulkan/vma_impl.cpp VMA_VULKAN_VERSION 1004000) and a fresh build (-Denable_vulkan=enabled, glslc 2026.1, vulkan1.3 target-env). The Phase 2 brief was to instrument 5 SSBO writes after each FP op in vif.comp and produce a per-stage ULP table; the debug=true channel that vif_vulkan already exposes (integer_vif_num_scaleN / integer_vif_den_scaleN per frame) made that instrumentation unnecessary because it surfaced the bug at the accumulator level, well upstream of the FP-arithmetic surface that this digest's static analysis pursued. The findings below refute the residual-FP-precision hypothesis the rest of this digest builds on.
Empirical numbers: what the live RTX 4090 actually produces
Reproduction confirmed at the gate: harness scripts/ci/cross_backend_vif_diff.py against vif_vulkan extractor on the 576x324 48-frame Netflix fixture, places=4 tolerance, API 1.4 bump applied:
metric max_abs_diff mismatches
integer_vif_scale0 1.000000e-06 0/48 OK
integer_vif_scale1 1.000000e-06 0/48 OK
integer_vif_scale2 1.526800e-02 45/48 FAIL
integer_vif_scale3 2.000000e-06 0/48 OK
max_abs = 1.527e-02 matches the digest body exactly. The 1.3 control lane on the same machine is 0/48 max=0.000e+00 — confirmed bit-exact, deterministic across 5 repeat runs.
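As a reading aid for the verdict column, here is a sketch of a places-style tolerance check. It assumes the gate's `places` follows the rounding semantics of Python's `unittest.assertAlmostEqual` (difference rounds to zero at that many decimal places); the harness's actual comparison is not shown in this digest, and `within_places` is a hypothetical helper. It also checks only the reported maxima, whereas the harness counts per-frame mismatches.

```python
def within_places(max_abs_diff: float, places: int = 4) -> bool:
    """True when the difference rounds to zero at `places` decimal places."""
    return round(abs(max_abs_diff), places) == 0


# max_abs_diff values copied from the gate output above.
reported = {
    "integer_vif_scale0": 1.000000e-06,
    "integer_vif_scale1": 1.000000e-06,
    "integer_vif_scale2": 1.526800e-02,
    "integer_vif_scale3": 2.000000e-06,
}
for metric, diff in reported.items():
    print(metric, "OK" if within_places(diff) else "FAIL")
```

Under that assumption only scale 2 fails, matching the OK/FAIL column.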
debug=true per-frame intermediates, frame 5, scale 2:
| Source | num_scale2 | den_scale2 | reported vif_scale2 |
|---|---|---|---|
| CPU reference (any run) | 2.4944e+04 | 2.5225e+04 | 0.988835 |
| NVIDIA Vulkan 1.4, run 1 | 7.479e+14 | -7.776e+15 | 1.000000 |
| NVIDIA Vulkan 1.4, run 2 | 2.991e+14 | -1.495e+14 | 1.000000 |
| NVIDIA Vulkan 1.4, run 3 | 1.047e+15 | -1.032e+16 | 1.000000 |
| NVIDIA Vulkan 1.4, run 4 | 1.197e+15 | -1.032e+16 | 1.000000 |
| NVIDIA Vulkan 1.4, run 5 | 8.974e+14 | -7.776e+15 | 1.000000 |
Three facts neither the digest body nor research-0053 captured:
- The accumulator outputs are off by ~10¹¹ in magnitude: den_scale2 ~ -10¹⁶ vs CPU's 2.52e+04. No FP-precision flip on five OpFDiv / OpFMul / OpFSub ops can synthesise a 10¹¹× amplification. The bug is not in the FP-arithmetic graph this digest's body bisected.
- NVIDIA at API 1.4 is non-deterministic on vif_vulkan scale 2 (5 runs, 5 distinct (num, den) pairs). API 1.3 on the same machine is fully deterministic across the same 5 runs. The 1.3 → 1.4 transition does not just amplify a precision gap; it introduces a memory race or memory-model regression that the 1.3 path was implicitly defended against.
- The bug is isolated to the SCALE = 2 specialization. Scales 0, 1, 3 are deterministic and produce sane positive num / den values across the same runs (numerically a few ppm off CPU due to f32-vs-f64, well under the places=4 gate). Only SCALE = 2 produces the negative den and the run-to-run drift.
The reported vif_scale2 = 1.000000 is the CPU-side reduction formula's den <= 0 fallback in reduce_and_emit() of core/src/feature/vulkan/vif_vulkan.c: (scale_den[i] > 0.0) ? scale_num[i] / scale_den[i] : 1.0. The score never reflects the ALU output when den flips sign — it just collapses to 1.0 ≡ "perfect VIF", which is the 45/48 frames the gate flags.
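The collapse to 1.0 is easy to reproduce from the table. A minimal sketch mirroring the quoted ternary in Python (the function name is hypothetical), fed with the frame 5, scale 2 values above:

```python
def vif_scale_score(num: float, den: float) -> float:
    """Mirror of the quoted expression: (den > 0.0) ? num / den : 1.0."""
    return num / den if den > 0.0 else 1.0


# Frame 5, scale 2 values from the table above.
cpu = vif_scale_score(2.4944e04, 2.5225e04)
vulkan_run1 = vif_scale_score(7.479e14, -7.776e15)
print(f"CPU: {cpu:.6f}  Vulkan 1.4 run 1: {vulkan_run1:.6f}")
```

The CPU quotient agrees with the reported 0.988835 only to about four decimal places because the table's num and den are rounded to five significant figures; the negative den takes the fallback branch regardless of num.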
Per-stage table: populated row that matters, others retired
The §5 per-stage table this digest carried with [UNVERIFIED] cells asked the wrong question. The stages behind its g, sv_sq, and gg_sigma_f rows could not have produced the observed magnitude even in the worst-case f32-precision scenario, and the dynamic dump above shows the divergence is upstream of all of them, in the accumulator write path. Replacing the table with a single row that does carry a real measurement:
| Quantity (frame 5, scale 2) | CPU reference | NVIDIA Vulkan 1.4 | RADV Vulkan 1.4 / Intel A380 1.4 |
|---|---|---|---|
| integer_vif_num_scale2 | 2.494e+04 | 7e+14 .. 1.2e+15 (run-dependent) | not measured this session; 1.3 control was 0/48 |
| integer_vif_den_scale2 | 2.522e+04 | -1.5e+14 .. -1.0e+16 (run-dependent) | not measured this session; 1.3 control was 0/48 |
| 5-run determinism | yes | no | not measured |
| 1.3-vs-1.4 ratio | n/a | ~10¹¹× magnitude flip + sign flip | not measured |
The debug=true host intermediates on RADV (Granite Ridge integrated gfx1036) and Intel Arc A380 weren't sampled in this session because the 1.3 control already proved both lanes 0/48 across 48 frames at the gate, and the brief's localisation question was specifically "NVIDIA vs the rest". Sampling RADV / A380 + lavapipe's num / den at API 1.4 specifically remains a Phase 3 task if the upstream-fix direction warrants confirming the rest of the matrix isn't also going non-deterministic on the same SCALE = 2 specialisation — current evidence says they aren't, but the formal gate result is the only number this session re-verified.
What this means for the Step B unblock paths
Of the four paths the digest body listed:
- ~~"Try OpExecutionMode SignedZeroInfNanPreserveFloat32 / GL_EXT_shader_float_controls2"~~: discarded. The bug is not in IEEE FP semantics; it's an integer-accumulator memory race.
- ~~"Per-feature places=3 NVIDIA-only override ADR"~~: discarded. A precision-tolerance loosening cannot accommodate non-deterministic 10¹¹× accumulator drift; the next run would fail at any tolerance.
- NVIDIA driver-team escalation: still possible but no longer the sole path. Worth filing as a confirmed memory-model regression, not a contraction-codegen issue.
- A new path the original digest did not list: fix the shader's memory-model assumptions for Vulkan 1.4. The subgroupAdd() / barrier() / cross-subgroup reduction in Phase 4 of vif.comp (lines 547–592) is the prime suspect. The SCALE = 2 specialisation is the smallest plane (144x81 ≈ 5 wgs); the cross-subgroup reduction's for (uint s = 0u; s < n_subgrps; s++) loop reads s_lmem without a memory-scope-qualified barrier, relying on barrier()'s implicit WorkgroupMemoryBarrier semantics. Vulkan 1.4 picks a stricter default memory model on NVIDIA than 1.3 did; the shader needs an explicit controlBarrier(gl_ScopeWorkgroup, gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsAcquireRelease) (or the GLSL memoryBarrierShared() + barrier() pair) before the thread-0 read. This is testable cheaply against the 5-run determinism check above. Phase 3 should attempt this fix under enable_vulkan with a 5-run gate; if determinism returns AND places=4 passes, Step B unblocks for free.
Why the Phase 1 SPIR-V analysis still stands (just answers a different question)
§1's OpFDiv + 3×OpFMul + OpFSub enumeration is correct. Every one of those 5 ops is NoContraction-decorated. That analysis was sufficient to exclude the FP-arithmetic surface as the shader-side mitigation target — which is exactly what the empirical finding above now confirms. The digest body's negative-by-exhaustion conclusion was right; the alternative-hypothesis section was wrong about which opaque driver behavior was responsible. The non-determinism + sign-flip + 10¹¹× magnitude pattern is the signature of a memory-model issue, not a codegen one.
Reproduction recipe for Phase 3
# Apply the local API-1.4 bump (off-master, manual reproducer).
sed -i 's/VK_API_VERSION_1_3/VK_API_VERSION_1_4/g' \
core/src/vulkan/common.c
sed -i 's/VMA_VULKAN_VERSION 1003000/VMA_VULKAN_VERSION 1004000/' \
core/src/vulkan/vma_impl.cpp
# Build with Vulkan only.
cd libvmaf
meson setup build_phase2 -Denable_vulkan=enabled \
-Denable_cuda=false -Denable_sycl=false -Denable_tests=false
ninja -C build_phase2
# Reproduce the gate failure (45/48 places=4 fail on scale 2).
python3 ../scripts/ci/cross_backend_vif_diff.py \
--vmaf-binary $PWD/build_phase2/tools/vmaf \
--reference ../testdata/ref_576x324_48f.yuv \
--distorted ../testdata/dis_576x324_48f.yuv \
--width 576 --height 324 --feature vif --backend vulkan \
--device 0 --places 4
# Confirm non-determinism (5 distinct (num, den) pairs at frame 5).
for run in 1 2 3 4 5; do
build_phase2/tools/vmaf -r ../testdata/ref_576x324_48f.yuv \
-d ../testdata/dis_576x324_48f.yuv -w 576 -h 324 -p 420 -b 8 \
--feature 'vif_vulkan=debug=true' --backend vulkan \
--vulkan_device 0 -n --json -o /tmp/vif_run${run}.json
python3 -c "import json; m=json.load(open('/tmp/vif_run${run}.json'))['frames'][5]['metrics']; print(f'run ${run}: num={m[\"integer_vif_num_scale2\"]:.4g} den={m[\"integer_vif_den_scale2\"]:.4g}')"
done
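The determinism check at the end of the recipe can also be scripted instead of eyeballed. A minimal sketch, assuming the JSON layout the one-liner above already relies on (`frames[5].metrics`) and the `/tmp/vif_run{N}.json` paths the loop writes; the function names are hypothetical.

```python
import json
from pathlib import Path


def frame_pair(report: dict, frame: int = 5, scale: int = 2):
    """Extract (num, den) for one frame from a parsed vmaf JSON report."""
    metrics = report["frames"][frame]["metrics"]
    return (metrics[f"integer_vif_num_scale{scale}"],
            metrics[f"integer_vif_den_scale{scale}"])


def distinct_pairs(reports, frame: int = 5, scale: int = 2):
    """Set of distinct (num, den) pairs across runs; size 1 means deterministic."""
    return {frame_pair(r, frame, scale) for r in reports}


def load_reports(pattern: str = "/tmp/vif_run{run}.json", runs=range(1, 6)):
    """Load the per-run JSON files written by the loop above."""
    return [json.loads(Path(pattern.format(run=run)).read_text()) for run in runs]
```

After the five runs, `len(distinct_pairs(load_reports())) == 1` would indicate a deterministic lane; anything larger reproduces the run-to-run drift.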
Hardware lane this session: NVIDIA GeForce RTX 4090 (UUID e478b41b-5c4f-1ddb-f990-e44916aff4c8), driver 595.71.05, Vulkan device API 1.4.329, Vulkan instance loader 1.4.341. RADV (gfx1036 Granite Ridge integrated) at 26.1.0 and Intel Arc A380 (DG2) at 26.1.0 confirmed present + queryable but not used for this Phase 2 data — 1.3 control gates 0/48 elsewhere on this hardware and the brief was NVIDIA-specific.
Status update 2026-05-09: Phase 3 fix landed (NVIDIA + RADV closed; Arc-A380 residual)
Phase 3 implementation (this PR; the Phase-2 dump landed in PR #510, and this is the shader-fix successor) replaced all three bare barrier() calls in vif.comp (Phase-1 cooperative tile load, Phase-2 vertical-conv shared write, Phase-4 cross-subgroup reduction) with explicit memoryBarrierShared() + barrier() pairs. Both forms proposed in the Phase 2 update (the controlBarrier(...) call and this pair) expand to the same SPIR-V OpControlBarrier with gl_StorageSemanticsShared | gl_SemanticsAcquireRelease shared-memory acquire-release semantics. The fix is applied uniformly to all SCALE values because the structural race lives in the code shared by all four pipeline specialisations; SCALE = 2 is just the smallest workgroup count where the hardware schedule made the bug observable.
Hardware lane this session: corrected device map
The Phase-2 dump's "NVIDIA RTX 4090" attribution was off-by-one in the device map. On this multi-GPU host the libvmaf Vulkan enumerator sorts physical devices by devtype_score, which keeps the vkEnumeratePhysicalDevices order between same-type devices. The Vulkan loader's sorted order on this box is:
[0] Intel(R) Arc(tm) A380 Graphics (DG2) [ANV / Mesa 26.1.0]
[1] NVIDIA GeForce RTX 4090 [proprietary 595.71.05]
[2] AMD Radeon Graphics (RADV RAPHAEL_MENDOCINO) [Mesa 26.1.0]
--vulkan_device 0 therefore lands on Arc, not NVIDIA, on this hardware. Phase 2's empirical numbers (den_scale2 ≈ -10¹⁶, num_scale2 ≈ +10¹⁵, 5 distinct run pairs) reproduce exactly on device 0 = Arc A380 + ANV at API 1.4 — the bug shape PR #510 identified is real, but it lives on Mesa-ANV, not NVIDIA. The NVIDIA RTX 4090 lane (device 1) was already deterministic and 0/48 at API 1.4 pre-fix on this exact hardware setup. The places=4 45/48 mismatch the gate showed in PR #510 was the Arc data path.
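The ordering behaviour described above is what a stable sort on a per-type score produces. The sketch below illustrates that idea only; the score values, the `Device` type, and treating the listed order as the enumeration order are all assumptions, not the libvmaf implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    devtype: str  # "discrete" or "integrated"


# Hypothetical scores: only the ordering discrete > integrated matters here.
DEVTYPE_SCORE = {"discrete": 2, "integrated": 1}


def order_devices(enumerated):
    """Stable sort by descending score; ties keep the enumeration order."""
    return sorted(enumerated, key=lambda d: -DEVTYPE_SCORE[d.devtype])


# Assumed enumeration order, taken from the list above.
loader_order = [
    Device("Intel Arc A380 (ANV)", "discrete"),
    Device("NVIDIA GeForce RTX 4090", "discrete"),
    Device("AMD Radeon Graphics (RADV)", "integrated"),
]
for index, dev in enumerate(order_devices(loader_order)):
    print(f"[{index}] {dev.name}")
```

Because both discrete GPUs tie on score, whichever the loader enumerates first keeps index 0, which is how `--vulkan_device 0` can land on the Arc rather than the RTX 4090.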
Empirical Phase-3 results (real, this session)
Build: meson setup build_phase3 -Denable_vulkan=enabled, -Denable_cuda=false, -Denable_sycl=false. Local API-1.4 bump applied for the 1.4 measurements only (Step B, out of scope for this PR).
Cross-backend gate, all 3 visible Vulkan devices, both API tiers:
| Device | API 1.3 (default) | API 1.4 + shader fix |
|---|---|---|
| Arc A380 (ANV / Mesa) | 0/48 OK | 45/48 FAIL scale-2 (residual) |
| NVIDIA RTX 4090 (595.71) | 0/48 OK | 0/48 OK |
| RADV iGPU (Mesa) | 0/48 OK | 0/48 OK |
5-run determinism check (vif_vulkan=debug=true, frame 5, integer_vif_num_scale2 / integer_vif_den_scale2):
NVIDIA RTX 4090 + API 1.4 + shader fix:
run 1: num=+2.494358e+04 den=+2.522523e+04
run 2: num=+2.494358e+04 den=+2.522523e+04
run 3: num=+2.494358e+04 den=+2.522523e+04
run 4: num=+2.494358e+04 den=+2.522523e+04
run 5: num=+2.494358e+04 den=+2.522523e+04
CPU reference: num=+2.494e+04 den=+2.522e+04 — match.
Arc A380 (ANV) + API 1.4 + shader fix (residual):
run 1: num=+1.495701e+15 den=-1.285952e+16
run 2: num=+1.495701e+15 den=-1.285952e+16
run 3: num=+1.495701e+15 den=-1.270999e+16
run 4: num=+1.346167e+15 den=-1.016792e+16
run 5: num=+1.495701e+15 den=-1.285952e+16
-- still non-deterministic. memoryBarrierShared() + barrier()
pair is insufficient on Mesa-ANV at API 1.4.
Netflix golden gate unaffected — the fix is shader-only on a non-CPU code path; the 3 Netflix CPU goldens never enter the Vulkan dispatch.
Phase 3 outcome: split
- Closed: NVIDIA RTX 4090 + driver 595.71.05 + Vulkan 1.4 residual. The shader's bare barrier() was relying on implementation-defined shared-memory ordering that NVIDIA's 1.4 default memory model no longer provides; the explicit memoryBarrierShared() pair restores the prior ordering. 5-run deterministic, 0/48 at places=4. RADV is also clean (was already clean pre-fix, stays clean post-fix).
- Open (new finding): Arc A380 (Mesa-ANV / DG2) at API 1.4 exhibits the same non-deterministic int64 accumulator signature (10¹¹× magnitudes, sign flips, run-to-run drift in the (num, den) pairs) but does not close under memoryBarrierShared() + barrier(). This is a separate driver-side behaviour: ANV on DG2 may need a device-scope barrier (controlBarrier(gl_ScopeDevice, ...)) rather than workgroup-scope, or the shared-memory layout needs coherent or volatile qualifiers, or there's a subgroup-scope publish gap that requires a subgroupMemoryBarrierShared() before the elected-thread write. None of these were attempted in this PR: the brief was a single-call swap, and per the user's feedback_no_test_weakening rule the PR documents the residual rather than relaxing the gate. Tracked as T-VK-VIF-1.4-RESIDUAL-ARC for a follow-up Phase-3b.
Decision matrix retired by Phase 3
- ~~"Driver-internal codegen flip on shaderFloatControls2"~~ — discarded by PR #510's Phase-2 dump (already retired).
- ~~"NVIDIA escalation"~~ — discarded for NVIDIA. The bug was inside the fork's shader, not the driver. (Still open as a Mesa-ANV escalation candidate for the Arc residual.)
- ~~"Per-feature places=3 NVIDIA-only override"~~ — discarded for NVIDIA (not needed). Also discarded for Arc (residual is non-deterministic; tolerance loosening can't accommodate).
- Adopted: shared-memory release-acquire fix in vif.comp. Closes NVIDIA + RADV; Arc residual moves to Phase-3b.