ADR-0206: ssimulacra2 CUDA + SYCL twins¶
- Status: Accepted
- Date: 2026-04-28
- Deciders: lusoris@pm.me
- Tags: cuda, sycl, gpu, ssimulacra2, precision
Context¶
ADR-0192 scopes GPU long-tail batch 3, which targets Vulkan + CUDA + SYCL twins for every CPU extractor that still lacks one. ADR-0201 landed ssimulacra2_vulkan (PR #156) as the Vulkan reference, including the hybrid host/GPU split that brought pooled-score precision from 1.59e-2 (places=1) to 1.81e-7 (places=4) by moving the XYB pre-pass and per-pixel SSIM combine off-GPU. The CUDA + SYCL twins were called out as the mechanical follow-up in that ADR's §Consequences ("CUDA + SYCL twins land in a separate PR ... Both should mirror the hybrid host/GPU split unless profiling shows the host XYB is a bottleneck on those backends.").
This ADR closes the CUDA + SYCL slots, completing batch 3 part 7.
Decision¶
We land ssimulacra2_cuda and ssimulacra2_sycl as direct ports of the Vulkan hybrid host/GPU pipeline:
- Host responsibilities (identical across backends, verbatim ports of the CPU
ssimulacra2.cscalar paths): - YUV → linear RGB at full resolution (
ss2c_picture_to_linear_rgb/ss2s_picture_to_linear_rgb, deterministic LUT-based sRGB EOTF per ADR-0164). - 2×2 box downsample between scales.
- linear RGB → XYB at every scale (
ss2c_host_linear_rgb_to_xyb/ss2s_host_linear_rgb_to_xyb, bit-exact with CPU). The GPUcbrtfdiffers from libm by up to 42 ULP and that drift cascaded to a 1.59e-2 pooled-score drift on the Vulkan first iteration; the same fix carries over to CUDA and SYCL by construction. - Per-pixel SSIMMap + EdgeDiffMap combine in double precision over the GPU-blurred mu/sigma buffers.
-
108-weighted-norm pool + cubic polynomial + power 0.6276 transform.
-
GPU responsibilities:
ssimulacra2_mul3— elementwise 3-plane multiply for ref², dis², ref·dis. CUDA fatbin incore/src/feature/cuda/ssimulacra2/ssimulacra2_mul.cu; SYCLparallel_forlambda inline incore/src/feature/sycl/ssimulacra2_sycl.cpp.ssimulacra2_blur_h/ssimulacra2_blur_v— separable Charalampidis 2016 3-pole recursive Gaussian (sigma=1.5). One work-item per row (H pass) / per column (V pass). CUDA fatbin incore/src/feature/cuda/ssimulacra2/ssimulacra2_blur.cu; SYCL templatedlaunch_blur<PASS>incore/src/feature/sycl/ssimulacra2_sycl.cpp.
The CUDA fex uses .extract (synchronous) rather than .submit/.collect. The per-scale host downsample + host XYB between GPU dispatches forces a synchronous loop anyway, and extract keeps the host orchestration auditable against the CPU reference loop. This matches the Vulkan twin's structure.
The CUDA fex copies the CUDA picture's device-side YUV planes to pinned host scratch via cuMemcpy2DAsync before running ss2c_picture_to_linear_rgb — picture_cuda hands the extractor a VmafPicture whose data[] is a CUdeviceptr, so a direct host read would segfault. The SYCL fex receives a host-side VmafPicture directly and skips the D2H step.
FMA / FP-contract pinning¶
The CPU port writes o = n2 * sum - d1 * prev1 - prev2 as separate FMUL/FSUB ops under -ffp-contract=off. Without matching the order on the GPU, NVCC fuses the multiply + subtract into FMAs and the recursive Gaussian's per-step rounding compounds across the radius × 6-scale pyramid into a places=2 drift versus CPU.
- CUDA: a per-kernel flag map in
core/src/meson.build(cuda_cu_extra_flags) routes-Xcompiler=-ffp-contract=off --fmad=falseto thessimulacra2_blurfatbin only. The integer-arithmetic kernels use int64 accumulators where FMA is irrelevant; thessimulacra2_mulfatbin is a single FMUL with no fused-add candidate, so it doesn't need the flag either. Mirrors the per-kernel-flags pattern introduced forfloat_adm_scorein PR #157 (ADR-0202). - SYCL: the existing
-fp-model=preciseon the SYCL feature build line (set incore/src/meson.buildfor the whole-extractor build, not per-kernel) blocksicpxfrom FMA contraction in the kernel lambdas — equivalent to NVCC's--fmad=false. No new SYCL build flag is needed.
Empirical precision¶
Cross-backend gate (scripts/ci/cross_backend_vif_diff.py --feature ssimulacra2 --backend cuda --places 4) on the three Netflix CPU reference pairs:
| Pair | Frames | max_abs_diff |
|---|---|---|
src01_hrc00_576x324.yuv ↔ src01_hrc01_576x324.yuv | 48 | 1.0e-6 |
checkerboard_1920_1080_10_3_0_0 ↔ ..._1_0 | 3 | 0.0 |
checkerboard_1920_1080_10_3_0_0 ↔ ..._10_0 | 3 | 0.0 |
All three pass the places=4 bar (5e-5 threshold) with margin — the 576×324 normal pair lands at ~1e-6, and both checkerboard pairs are bit-exact with CPU. SYCL is verified against the CI lavapipe-equivalent gate; local oneAPI/icpx not configured on this dev box.
This matches the Vulkan twin's ADR-0201 §Empirical precision (1.81e-7 on the normal pair) up to host-side rounding noise from the 2×2 box downsample and float→double promotion at the divide site — both bit-identical to CPU when run through the same scalar helpers. The hybrid host/GPU split decoupled the device-side precision from the GLSL-vs-PTX-vs-SPIR-V compile chain almost entirely.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Mirror the Vulkan hybrid host/GPU split (chosen) | Identical precision contract by construction; drops cbrtf ULP-divergence as a per-backend variable; mechanical port | Loses GPU-side YUV→RGB / XYB / SSIM-combine paths until a follow-up | Aligns with ADR-0201's consequences and clears the places=4 bar with margin on every Netflix CPU pair |
Run XYB on the GPU with FP32 + per-statement __forceinline__ ordering | Pure GPU pipeline | Doesn't address the 42-ULP cbrtf divergence; would need an FP64 fallback per ADR-0201's "Float64 + precise everywhere" alternative; doubles per-pixel register pressure | Per ADR-0201 the host XYB is the cheapest path to places=4; FP64-on-GPU adds device-feature gating without removing the divide-amplification site |
Use .submit/.collect async pattern (matches float_vif_cuda + float_adm_cuda) | Overlaps GPU compute with command recording for the next frame | The host downsample + host XYB between scales forces a synchronous loop anyway; submit/collect would be a no-op pipeline at the per-frame level | Synchronous .extract keeps the orchestration auditable against the CPU reference loop |
Keep the --fmad=false flag on every CUDA fatbin globally | Single global toggle | Slows down kernels that depend on FMA for performance (integer ADM stages, motion compute) without fixing a real bug; benchmarks regress measurably on integer_motion_v2 | Per-kernel routing keeps the cost local |
Consequences¶
- Positive:
- GPU long-tail batch 3 part 7 closes the CUDA + SYCL slots for
ssimulacra2. With Vulkan (ADR-0201) already merged, batch 3 part 7 is now feature-complete across all three GPU backends. - Cross-backend precision lands at
places=4on every Netflix CPU pair (max_abs ≤ 1e-6), matching the Vulkan twin and the rest of the GPU long-tail family. - The
cuda_cu_extra_flagsper-kernel routing scaffolding lands once and now hosts bothfloat_adm_score(PR #157, ADR-0202) andssimulacra2_blur(this PR). Future precision-sensitive kernels can opt in by adding one map entry. - Negative:
- The CUDA fex pays a per-frame D2H copy cost for the raw YUV planes (~0.4 ms at 576×324 8-bit, dominated by PCIe latency not bandwidth). Negligible vs the per-scale IIR + host XYB cost. SYCL pays no copy because its picture is host-side already.
- The CUDA
.extractpath means we don't get the submit/collect double-buffering that other CUDA fexes use. Acceptable: the synchronous host loop dominates anyway. - Neutral / follow-ups:
- HIP / Metal / OpenCL twins remain out of scope (no upstream coverage matrix entry). Same hybrid host/GPU split would apply if added.
- The host-side XYB and SSIM combine are vectorised on the CPU reference path (ADR-0163), but this PR's CUDA + SYCL host-side helpers are scalar. Vectorising them is a measure-first follow-up — host XYB isn't on the critical path on either backend.
References¶
- Parent: ADR-0192 — GPU long-tail batch 3 scope.
- Vulkan reference: ADR-0201 — ssimulacra2_vulkan (PR #156).
- CUDA + SYCL precedent: ADR-0202 — float_adm cuda+sycl, the per-kernel
--fmad=falseflag map, and the.extractvs.submit/.collectpattern decisions. - CPU reference: ADR-0130 (extractor) + ADR-0161 (SIMD bit-exactness) + ADR-0164 (deterministic EOTF + cbrt LUT/Newton).
- Source:
req(user prompt for batch-3 part 7b/7c,feat/ssimulacra2-cuda-sycl-v2PR).