Research-0985: SYCL parity divergence in float_ssim + ssimulacra2 + float_ansnr (2026-06-03)
- Date: 2026-06-03
- Researcher: Claude (Anthropic) / Lusoris
- Status: Complete. Divergence sources identified; kernel patches blocked pending hardware validation
- Companion ADR: ADR-0985 (stub reserved; full ADR deferred until hardware gate exists)
- Triggered by: .workingdir2/cross-backend-arc-20260527/sycl_matrix.md rows: float_ssim FAIL (2.68e-4), ssimulacra2 FAIL (8.72e-2), float_ansnr FAIL (1.59e-4)
1. Executive summary
Three SYCL parity matrix rows are open. Investigation found:
| Row | Root cause | Fixable without hardware? | Disposition |
|---|---|---|---|
| float_ansnr | Feature removed (PR #38, commit 70ed8b3c) | N/A | CLOSED (stale row) |
| float_ssim | (a) CPU and GPU use different SSIM formulas; (b) Arc A380 fp32 accumulation at full-frame scale | Algorithm difference is intentional; precision gap is Arc-architecture-specific | CANNOT FIX with high confidence; needs device-calibration ADR |
| ssimulacra2 | Arc A380 lacks native fp64; IIR blur fp32 accumulation error amplified ~17× beyond places=2 contract | Cannot reach acceptable precision on fp64-less hardware without fundamental kernel redesign | CANNOT FIX with high confidence; needs device-calibration ADR |
2. float_ansnr: closed, stale row
The float_ansnr extractor was removed in PR #38 (commit 70ed8b3ce3, "feat(core): drop legacy ansnr feature (pre-VMAF, never Netflix-adopted)"). The SYCL parity matrix row at .workingdir2/cross-backend-arc-20260527/sycl_matrix.md predates the removal. Both the CPU extractor (vmaf_fex_float_ansnr) and the SYCL twin were dropped. The row is stale and should be treated as RESOLVED / CLOSED: the feature no longer exists in tree.
Verification:
grep -r "float_ansnr" core/src/ → no results (extractor gone)
grep -r "float_ansnr" core/src/feature/sycl/ → no results
3. float_ssim: two-cause divergence on Arc A380
3.1 Divergence data
- SYCL: max_abs_diff = 2.68e-04, 40/48 frames fail
- Vulkan: max_abs_diff = 3.02e-04, 1/48 frames fail
- Gate tolerance: 5.0e-05 (places=4)
- Divergence is ~5.4× over tolerance
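The tolerance and ratio figures above follow from simple arithmetic. A minimal sketch, assuming the gate maps places=N to 0.5 × 10^-N (inferred from the values quoted in this note, not read from the gate's source); both function names are hypothetical:

```cpp
#include <cmath>

// Tolerance for an N-decimal-place gate, e.g. places=4 -> 5.0e-05.
// Assumption: the 0.5 * 10^-places mapping is inferred from this note's figures.
double tolerance_from_places(int places) {
    return 0.5 * std::pow(10.0, -places);
}

// How many times the measured worst-case difference exceeds the gate.
double over_tolerance_ratio(double max_abs_diff, int places) {
    return max_abs_diff / tolerance_from_places(places);
}
```

With the SYCL figure, over_tolerance_ratio(2.68e-4, 4) evaluates to about 5.36, which is the ~5.4× quoted above.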
3.2 Formula difference (primary cause)
The CPU scalar extractor (float_ssim.c via iqa_ssim → ssim_accumulate_default_scalar) implements the L×C×S decomposition from Wang et al. (2004) Eq.(1):
sigma_comb = sqrt(var_ref * var_cmp)
l = (2*mu1*mu2 + C1) / (mu1^2 + mu2^2 + C1)
c = (2*sigma_comb + C2) / (var_ref + var_cmp + C2)
s = (covar + C3) / (sigma_comb + C3)
SSIM = l * c * s
The SYCL extractor (integer_ssim_sycl.cpp, vmaf_fex_float_ssim_sycl) implements the combined Wang et al. (2004) Eq.(13):
num = (2*mu1*mu2 + C1) * (2*covar + C2)
den = (mu1^2 + mu2^2 + C1) * (var_ref + var_cmp + C2)
SSIM = num / den
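These are not the same computation. The L×C×S form routes the contrast and structure terms through sigma_comb = sqrt(var_ref * var_cmp) (the product of the two standard deviations) and a separate constant C3, while the combined form uses covar (the covariance) directly and needs no sqrt. In exact arithmetic the two agree for all inputs only when C3 = C2/2 and both variances are non-negative; in fp32 the sqrt and the extra divisions round separately, so the outputs can differ even then. The CUDA twin (ssim_score.cu, calculate_ssim_vert_combine) uses the same combined formula as SYCL. The Vulkan twin uses a GLSL shader with the same combined form.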
This formula divergence was present from the initial GPU kernel landing and was accepted as a known precision trade-off — the combined form is simpler to implement in GPU kernels and avoids a sqrt per pixel.
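The two forms can be compared on a single window. This is a minimal sketch rather than the tree's code: the WindowStats struct and both function names are hypothetical, the clamp before the sqrt is a guard added for this sketch, and C3 is passed explicitly because its value in the CPU path is not stated in this note.

```cpp
#include <cmath>

// Per-window statistics; field names follow the pseudocode above.
struct WindowStats {
    float mu1, mu2;          // local means of reference and distorted
    float var_ref, var_cmp;  // local variances
    float covar;             // local covariance
};

// L x C x S form, as described for the CPU scalar path.
float ssim_lcs(const WindowStats &w, float C1, float C2, float C3) {
    // Guard added for this sketch: keep the sqrt argument non-negative.
    float sigma_comb = std::sqrt(std::fmax(w.var_ref * w.var_cmp, 0.0f));
    float l = (2.0f * w.mu1 * w.mu2 + C1) / (w.mu1 * w.mu1 + w.mu2 * w.mu2 + C1);
    float c = (2.0f * sigma_comb + C2) / (w.var_ref + w.var_cmp + C2);
    float s = (w.covar + C3) / (sigma_comb + C3);
    return l * c * s;
}

// Combined form, as described for the SYCL, CUDA and Vulkan twins.
float ssim_combined(const WindowStats &w, float C1, float C2) {
    float num = (2.0f * w.mu1 * w.mu2 + C1) * (2.0f * w.covar + C2);
    float den = (w.mu1 * w.mu1 + w.mu2 * w.mu2 + C1) * (w.var_ref + w.var_cmp + C2);
    return num / den;
}
```

Under the common convention C3 = C2/2 (assumed here, not confirmed against float_ssim.c), the two functions agree up to fp32 rounding; with any other C3 they differ structurally. Feeding both the same window statistics is a cheap way to see how much of a measured gap the formula alone can explain.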
3.3 Accumulation precision (secondary cause on Arc A380)
The Arc A380 (DG2-G10) has shaderFloat64 = false and does not support native fp64 in shader/kernel code. The SYCL kernel performs 11-tap separable Gaussian convolution accumulations in float precision with sycl::reduce_over_group. On RTX 4090 (Ada Lovelace, native fp64), the same SYCL kernel passes places=4 (confirmed in Research-0730 §5). On lavapipe (Mesa software renderer), the Vulkan twin also passes places=4.
Arc A380 accumulates float rounding error over (576-10) × (324-10) = 566 × 314 = 177,724 pixel SSIM values, each computed from an 11-tap vertical convolution sum. The per-pixel rounding is bounded but compounds to ~2.7e-4 at the mean level.
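The pixel count and the effect of an fp32-only accumulator can be reproduced on the host. This is a sketch under stated assumptions: the function names are hypothetical, and the accumulator is a deliberately pessimistic sequential loop, whereas the kernel's sycl::reduce_over_group tree reduction rounds in a different order.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Number of window positions for a "valid" (unpadded) separable filter.
std::size_t valid_positions(std::size_t width, std::size_t height, std::size_t taps) {
    return (width - (taps - 1)) * (height - (taps - 1));
}

// Mean with an fp32 accumulator, the only option on an fp64-less device.
float mean_fp32(const std::vector<float> &v) {
    float sum = 0.0f;
    for (float x : v) sum += x;
    return sum / static_cast<float>(v.size());
}

// Mean with an fp64 accumulator, used here as the reference.
double mean_fp64(const std::vector<float> &v) {
    double sum = 0.0;
    for (float x : v) sum += x;
    return sum / static_cast<double>(v.size());
}

// Absolute gap between the two means for the same input.
double mean_drift(const std::vector<float> &v) {
    return std::fabs(static_cast<double>(mean_fp32(v)) - mean_fp64(v));
}
```

valid_positions(576, 324, 11) reproduces the 177,724 figure. Calling mean_drift on a vector of that many near-1.0 values shows how far a naive fp32 mean can move from the fp64 one; the real kernel's drift depends on its reduction order and on the device.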
3.4 Why a kernel fix is blocked
To fix the divergence on Arc A380 one would need to:
(a) Change the GPU formula to match the L×C×S CPU path (requires adding a sqrt per pixel, changes the existing GPU output for all hardware, and the CUDA/Vulkan twins would need matching changes), or
(b) Use integer or int64 intermediate accumulators for the convolution sums (as the integer_ssim_sycl.cpp::IssimStateSycl path does for the ssim feature, with places=4-5 documented in ADR-0564), or
(c) Add a device-specific calibration table entry that sets the float_ssim tolerance to places=3 (5e-04) for Arc A380 class devices.
Option (a) is high-risk because it changes GPU output on all hardware. Option (b) is a significant rewrite of the float_ssim_sycl kernel and cannot be validated without Arc hardware in the test loop. Option (c) is the ADR-driven path recommended by Research-0730 §6.1.
Recommendation: Open an ADR for device-calibration entry (float_ssim places=3 on Arc/DG2-class devices). The kernel code is correct for its documented algorithm; the formula difference from CPU is an accepted trade-off in the GPU backends.
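A device-calibration entry amounts to a lookup keyed on device and feature. A minimal sketch, assuming a table matched by substring against the device name; CalibrationEntry, example_table and places_for are hypothetical names that do not exist in the tree, and the entries are the values proposed in this note:

```cpp
#include <string>
#include <vector>

// One relaxed-precision entry (hypothetical layout).
struct CalibrationEntry {
    std::string device_substr;  // matched against the reported device name
    std::string feature;        // feature the relaxation applies to
    int places;                 // relaxed decimal-places contract
};

// Entries proposed in this note; the "A380" match key is an assumption.
std::vector<CalibrationEntry> example_table() {
    return {
        {"A380", "float_ssim", 3},
        {"A380", "ssimulacra2", 1},
    };
}

// Returns the relaxed contract if an entry matches, else the default.
int places_for(const std::vector<CalibrationEntry> &table,
               const std::string &device_name,
               const std::string &feature,
               int default_places) {
    for (const CalibrationEntry &e : table) {
        if (e.feature == feature &&
            device_name.find(e.device_substr) != std::string::npos) {
            return e.places;
        }
    }
    return default_places;
}
```

Keeping the default outside the table means devices without an entry stay on the existing places=4 contract, so the relaxation cannot leak to other hardware.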
4. ssimulacra2: systematic fp32 amplification on Arc A380
4.1 Divergence data
- SYCL: max_abs_diff = 8.72e-02, 38/48 frames fail
- Vulkan: max_abs_diff = 5.48e-02, 43/48 frames fail
- Gate tolerance: 5.0e-03 (places=2, per ADR-0192)
- Divergence is ~17× over tolerance on SYCL, ~11× on Vulkan
4.2 Pipeline analysis
The ssimulacra2_sycl.cpp kernel uses the following split:
- Host-side (fp64-safe): YUV → linear-RGB → XYB (cube-root moved to host per ADR-0201 §Precision investigation; GPU cbrt diverges by 42 ULP from libm, cascading to ~1.5e-2 drift on Vulkan without the fix). 2×2 box downsample in linear-RGB. Per-pixel SSIM + EdgeDiff combine (double precision). 108-weight polynomial pooling (double precision).
- GPU-side: 3-plane elementwise multiply + separable 3-pole IIR Charalampidis 2016 blur. Both compiled with -fp-model=precise (icpx) to block FMA contraction.
The XYB and pooling paths are already on host and use the same vmaf_ss2_cbrtf LUT as the CPU. The divergence therefore comes from the GPU IIR blur path.
4.3 IIR blur fp32 accumulation analysis
The 3-pole recursive IIR in launch_blur (ssimulacra2_sycl.cpp, lines 502–593) implements the Charalampidis 2016 recurrence. It matches the scalar fast_gaussian_1d byte-for-byte (with -fp-model=precise blocking contraction on icpx). The recurrence is stable for sigma=1.5 (radius=5), so instability of the filter itself is ruled out.
The divergence on Arc A380 but not RTX 4090 is consistent with:
- Different floating-point rounding unit: even with -fp-model=precise, Arc A380 (Xe-HPG micro-architecture) may use a slightly different rounding mode or denormal handling than NVIDIA Ada Lovelace. The IIR recurrence accumulates prev1_k and prev2_k state over up to h + radius rows (up to 329 rows at scale 0 for the 576×324 fixture). Each step's rounding error compounds through the recurrence. After 6 scales (pyramid) the pooled score diverges by ~9e-2.
- fp64-less intermediate spill: on some Xe micro-architectures, the compiler may spill intermediate float registers to memory with different flush-to-zero semantics than the CPU.
- Cannot be verified without hardware: any proposed fix (e.g., Kahan-compensated summation in the IIR accumulation, or int64 partial sums for the blur pass) cannot be validated without running on Arc A380. A code change that passes on the CPU and RTX 4090 but still diverges on Arc would not be an improvement.
4.4 Considered code changes
Option A (Kahan compensated summation in IIR): Replace o0 + o1 + o2 with a Kahan-compensated reduction. This could halve the per-output float error, but it adds 6 FP ops per output pixel and requires extra state variables, which is a significant increase in kernel complexity with unverifiable benefit on Arc.
Option B (move more computation to host): Run the entire IIR blur on host (same as scalar). This would trivially achieve places=4 but eliminates all GPU benefit for ssimulacra2_sycl: the entire kernel becomes a CPU pass with GPU overhead for the mul3 step only, which defeats the purpose of the GPU extractor.
Option C (device calibration entry, recommended): Per the Research-0730 §6.2 recommendation, add an Arc A380 calibration entry that sets the ssimulacra2 tolerance to places=1 (5e-02) on dg2-g10 and related DG2-class devices. This is honest about the hardware limit and does not risk changing scores on any other hardware. Note that the measured max_abs_diff values in §4.1 (8.72e-02 on SYCL, 5.48e-02 on Vulkan) still exceed 5e-02, so places=1 is a lower bound on the relaxation and the ADR must settle the exact tolerance against hardware runs.
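For reference, the compensated summation named in Option A looks as follows on the host. This is a generic array version, not the three-term in-kernel variant; the function names are hypothetical. The technique only works when the compiler preserves the operation order (no fast-math style reassociation), which is the kind of guarantee -fp-model=precise is meant to provide.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Plain left-to-right fp32 sum.
float naive_sum(const float *x, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += x[i];
    return sum;
}

// Kahan-compensated fp32 sum: c carries the low-order bits lost by each add.
float kahan_sum(const float *x, std::size_t n) {
    float sum = 0.0f;
    float c = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float y = x[i] - c;   // apply the correction from the previous step
        float t = sum + y;    // low-order bits of y may be lost here
        c = (t - sum) - y;    // recover what was lost (algebraically zero)
        sum = t;
    }
    return sum;
}

// Helper for experiments: n copies of the same value.
std::vector<float> constant_series(std::size_t n, float v) {
    return std::vector<float>(n, v);
}

// Absolute error of a float result against a reference value.
float abs_err(float got, float want) {
    return std::fabs(got - want);
}
```

Summing one million copies of 0.1f shows the difference clearly: the naive loop drifts by hundreds, while the compensated loop stays within a few ulps of 100000. Whether the same gain materialises inside the IIR recurrence on Arc remains unverified, as stated above.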
5. SYCL parity test coverage audit
Existing tests checked:
- test_sycl_ssim_parity.c: tests integer_ssim (not float_ssim), tolerance 1e-4
- test_sycl_ssimulacra2_parity.c: tests ssimulacra2, tolerance 5e-3 (ADR-0214 contract)
Gap identified: there is no unit test that exercises vmaf_fex_float_ssim_sycl (the float_ssim_sycl extractor in integer_ssim_sycl.cpp). The test_sycl_ssim_parity.c tests integer_ssim_sycl only. A smoke test for float_ssim_sycl registration and parity (at places=3 tolerance to accommodate Arc A380) would complete the coverage.
6. Proposed follow-up actions
6.1 Immediate (no hardware required)
- Close the float_ansnr row in .workingdir2/ and docs/state.md: the feature is gone. Confirm no ANSNR SYCL symbols remain in tree.
- Add a float_ssim_sycl smoke test to core/test/ that:
  - Verifies vmaf_fex_float_ssim_sycl is registered
  - Runs a CPU vs SYCL parity check with tolerance 5e-04 (places=3)
  - Skips gracefully if no SYCL device is present

  This test would pass on RTX 4090 at the current places=4 contract and pass on Arc A380 at the relaxed places=3.
- Document the formula difference between the CPU and GPU float_ssim implementations in a code comment in integer_ssim_sycl.cpp.
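The comparison step of such a smoke test can be sketched independently of SYCL. The code below is an illustration only: ParityReport and compare_scores are hypothetical names, the per-frame scores are assumed to have been collected already from the CPU and SYCL extractors, and device detection and the skip path are left out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Summary in the same shape as the parity matrix rows
// (max_abs_diff plus failing/total frame counts).
struct ParityReport {
    double max_abs_diff;
    std::size_t failing_frames;
    std::size_t total_frames;
};

// Compares per-frame scores against an absolute tolerance.
// Only the common prefix is compared if the lengths differ.
ParityReport compare_scores(const std::vector<double> &cpu,
                            const std::vector<double> &gpu,
                            double tolerance) {
    std::size_t n = cpu.size() < gpu.size() ? cpu.size() : gpu.size();
    ParityReport report{0.0, 0, n};
    for (std::size_t i = 0; i < n; ++i) {
        double diff = std::fabs(cpu[i] - gpu[i]);
        if (diff > report.max_abs_diff) report.max_abs_diff = diff;
        if (diff > tolerance) ++report.failing_frames;
    }
    return report;
}
```

A test would pass when failing_frames is zero at the chosen tolerance: 5e-04 for the relaxed places=3 contract, 5e-05 for places=4.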
6.2 Requires ADR and hardware validation
- ADR for device-calibration entries (Arc A380 class):
  - float_ssim: places=3 (5e-04) on DG2-G10 and related devices
  - ssimulacra2: places=1 (5e-02) minimum on DG2-G10 class
- Investigate ssimulacra2 IIR with Arc hardware: Run a bisect between "all on host" and "GPU IIR blur" to isolate exactly which blur pass (horizontal vs vertical, which scale level) contributes the most divergence. Use SYCL_DEVICE_FILTER=level_zero (or ONEAPI_DEVICE_SELECTOR=level_zero:gpu on oneAPI releases that no longer honor SYCL_DEVICE_FILTER) with a print-partial-score-per-scale harness.
7. Files examined
- core/src/feature/sycl/integer_ssim_sycl.cpp: float_ssim_sycl + integer_ssim_sycl
- core/src/feature/sycl/ssimulacra2_sycl.cpp: ssimulacra2_sycl IIR + combine
- core/src/feature/float_ssim.c: CPU float_ssim reference
- core/src/feature/iqa/ssim_tools.c: CPU SSIM formula (L×C×S)
- core/src/feature/cuda/integer_ssim/ssim_score.cu: CUDA SSIM formula (combined)
- core/src/meson.build: lines 1560–1625, SYCL build flags; confirmed -fp-model=precise
- docs/research/0730-cross-backend-arc-parity-20260527.md: prior parity measurement
- docs/adr/0214-gpu-parity-ci-gate.md: tolerance table and promotion criteria
- .workingdir2/cross-backend-arc-20260527/sycl_matrix.md: failure data source
- core/test/test_sycl_ssim_parity.c: existing SSIM parity test (integer_ssim only)
- core/test/test_sycl_ssimulacra2_parity.c: existing ssimulacra2 parity test
8. References
- Research-0730: Cross-backend numerical parity — Intel Arc A380 (2026-05-27)
- ADR-0214: GPU-parity CI gate (T6-8)
- ADR-0192: ssimulacra2 Vulkan kernel and precision contract (places=2)
- ADR-0188/ADR-0189: float_ssim GPU kernel (combined formula) and places=4 contract
- ADR-0201: ssimulacra2 Vulkan precision investigation — cbrt moved to host
- ADR-0206: ssimulacra2 CUDA + SYCL precision contracts
- ADR-0564: integer_ssim_sycl fp64-free precision, places=4-5
- PR #38: removal of float_ansnr feature