ADR-0990: Restore double-precision L/C/S accumulation in CUDA ms_ssim_vert_lcs
- Status: Accepted
- Date: 2026-06-03
- Deciders: Lusoris, Claude (Anthropic)
- Tags: cuda, precision, ms-ssim, bit-exactness
Context
The CUDA kernel ms_ssim_vert_lcs in core/src/feature/cuda/integer_ms_ssim/ms_ssim_score.cu computed the per-pixel luminance (L), contrast (C), and structure (S) components using float literals (2.0f) and stored per-block reduction sums as float. The host struct in integer_ms_ssim_cuda.c allocated the partials buffers and pinned host arrays as float*.
The CPU scalar reference, ssim_accumulate_default_scalar in core/src/feature/iqa/ssim_tools.c (lines 183-196), computes the same quantities using 2.0, a double literal. Under C's usual arithmetic conversions, multiplying by a double widens the float operands (ref_mu, cmp_mu) to double, so the products are evaluated in double rather than in float. The CPU numerators for L and C are therefore computed in double, and the final lv * cv * sv triple product is also in double.
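The effect of the literal's type can be seen in a small host-side C sketch. The function names are hypothetical, and only the 2.0f versus 2.0 distinction is taken from the text above; the real expressions in ssim_tools.c and in the kernel contain further terms that are omitted here.

```c
#include <float.h>
#include <stdio.h>

/* Hypothetical helper: numerator term written with a float literal.
 * Both multiplies are performed and rounded in float. */
float num_float_literal(float ref_mu, float cmp_mu)
{
    float r = 2.0f * ref_mu * cmp_mu;
    return r;
}

/* Hypothetical helper: the same term written with a double literal.
 * The usual arithmetic conversions widen ref_mu, then cmp_mu, to double,
 * so both multiplies are performed in double. */
double num_double_literal(float ref_mu, float cmp_mu)
{
    return 2.0 * ref_mu * cmp_mu;
}

/* Prints both results for an input whose product needs more than the
 * 24 significand bits a float can hold. */
void show_literal_difference(void)
{
    float x = 1.0f + FLT_EPSILON;
    printf("float  literal: %.17g\n", (double)num_float_literal(x, x));
    printf("double literal: %.17g\n", num_double_literal(x, x));
}
```

Calling show_literal_difference() prints two values that differ in their low-order digits. This is the kind of per-pixel discrepancy that the accumulation then sums over the whole frame.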
The float accumulation in ms_ssim_vert_lcs caused rounding drift of approximately 0.004 for a 256x144 fixture at scale 0 (approximately 33k pixels), which is roughly 40 times the places=4 tolerance of 1e-4 required by test_cuda_float_ms_ssim_parity. The blamed commit is 8db2715ac2.
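How a float accumulator drifts over roughly 33k terms can be sketched on the host as follows. This is an illustration under simplifying assumptions (every term is the same value, summed sequentially rather than by a warp/block reduction), so it does not reproduce the 0.004 figure; all names are hypothetical.

```c
#include <stddef.h>

/* Adds `value` n times into a float accumulator (the pre-fix pattern). */
float sum_repeated_float(float value, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += value;
    return acc;
}

/* Adds `value` n times into a double accumulator (the post-fix pattern).
 * The float input is widened to double on each addition. */
double sum_repeated_double(float value, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += value;
    return acc;
}

/* Absolute difference between a computed sum and a reference value. */
double abs_error(double got, double want)
{
    double d = got - want;
    return d < 0.0 ? -d : d;
}
```

For example, comparing both sums of 0.1f over 32964 terms against (double)32964 * (double)0.1f with abs_error shows a visibly larger error for the float accumulator, because each float addition rounds the running sum to 24 significand bits.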
ADR-0139 already documented and fixed the identical pattern for AVX2/AVX-512 SIMD paths. This ADR applies the same fix to the CUDA backend.
CUDA warp shuffles are available from sm_30 onward, and __shfl_down_sync accepts double operands; all VMAF-supported GPU targets are sm_52 or newer.
Decision
- Change the my_l, my_c, my_s local variables in ms_ssim_vert_lcs from float to double.
- Replace 2.0f * literals with 2.0 * (double) in the L and C numerator expressions, matching the CPU scalar reference exactly.
- Promote input floats (ref_mu, cmp_mu, etc.) to double at the point of use in the L/C/S expressions, preserving the same float-to-double widening that C scalar performs via 2.0 *.
- Change the shared-memory warp-partial arrays (s_l_warp, s_c_warp, s_s_warp) from float[...] to double[...] so the warp-level and block-level reductions are also in double.
- Change __shfl_down_sync operands to double (the 64-bit overload of __shfl_down_sync is available on all sm_30+ devices).
- Write final block-sums as double to the l_partials, c_partials, s_partials device buffers.
- Change the kernel signature c1, c2, c3 parameters from float to double.
- In integer_ms_ssim_cuda.c:
  - Change float c1; float c2; float c3; in MsSsimStateCuda to double.
  - Change the constants computation to use double literals.
  - Change float *h_l_partials[...] (and h_c_partials, h_s_partials) to double *.
  - Change device and pinned-host buffer sizes from sizeof(float) to sizeof(double).
  - Change the cuMemcpyDtoHAsync size from sizeof(float) to sizeof(double).
  - Remove the (double) casts in collect_fex_cuda since the arrays are already double.
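The warp-level step in the list above can be pictured with a host-side C emulation of a shuffle-down tree reduction over double values. This is a sketch, not the kernel's code: the function name is hypothetical, and the device-side idiom quoted in the comment is the common CUDA pattern rather than a line taken from ms_ssim_score.cu.

```c
#define WARP_SIZE 32

/* Host-side emulation of a shuffle-down tree reduction across one warp.
 * On the device, each step would typically read
 *     val += __shfl_down_sync(0xffffffffu, val, offset);
 * with val declared as double. After the last step only lane 0 holds
 * the full sum; the array is modified in place. */
double warp_reduce_sum_emulated(double lanes[WARP_SIZE])
{
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
        /* Ascending lane order means lanes[lane + offset] still holds its
         * value from before this step, as it would in a real shuffle. */
        for (int lane = 0; lane + offset < WARP_SIZE; ++lane)
            lanes[lane] += lanes[lane + offset];
    }
    return lanes[0];
}
```

Because every addition in the tree is a double addition, small per-lane contributions are not rounded away against a large running sum as readily as they would be with float lanes.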
The CPU scalar path is unchanged, as are the kernel's intermediate float computations (ref_mu, ref_sq, sigma_xy_geom, etc.); those remain float and match the scalar path.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Double reduction (chosen) | Matches scalar C type promotions exactly; closes places=4 gate | Doubles shared-memory footprint for partial arrays; minor extra memory bandwidth on DtoH | Decision — only pattern that closes the parity gate without redesigning the caller |
| Kahan compensated float summation | Keeps float throughput; can recover precision | Does not reproduce C's specific 2.0 * float_val promotion path; complex to verify correctness | Rejected — precision gain is asymmetric and hard to audit |
| Full double pyramid (horiz pass too) | Maximum precision | 2x memory for all intermediate buffers; large performance regression | Rejected — the horiz reduction is not the source of drift; float is sufficient there |
| Accept the places=4 failure, widen tolerance | Zero code change | Contradicts ADR-0139 invariant; silently hides future regressions | Rejected — places=4 is the required gate per ADR-0214 |
Consequences
- Positive: test_cuda_float_ms_ssim_parity passes the places=4 (1e-4) tolerance gate. CUDA float_ms_ssim output matches CPU scalar output consistent with ADR-0139's double-precision accumulation invariant, extended to the CUDA backend.
- Negative: Per-scale partials buffers grow from block_count * 4 bytes to block_count * 8 bytes (5 scales, 3 channels, 2 allocation tiers: device + pinned host). At 1080p with 16x8 blocks this is approximately 1.1 MB device-side, which is negligible relative to pyramid frame buffers.
- Neutral / follow-ups:
  - The HIP integer_ms_ssim_hip.c and HIP kernel have the same float partial pattern. A follow-up should apply the same fix.
  - docs/rebase-notes.md entry documents the double-partials invariant so future upstream ports preserve it.
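The buffer growth noted under "Negative" follows from a simple size formula, sketched below. The function and macro names are hypothetical, the three buffers per scale are assumed to be the l, c and s partials, and the block counts are supplied by the caller because the exact launch geometry is not given here.

```c
#include <stddef.h>

#define MS_SSIM_SCALES   5  /* number of scales, per the text */
#define MS_SSIM_CHANNELS 3  /* assumed: l, c and s partials buffers */

/* Bytes needed for one allocation tier (device or pinned host),
 * summed over all scales. block_counts[i] is the number of thread
 * blocks launched at scale i; elem_size is sizeof(float) before the
 * change and sizeof(double) after it. */
size_t partials_bytes_per_tier(const size_t block_counts[MS_SSIM_SCALES],
                               size_t elem_size)
{
    size_t blocks = 0;
    for (int i = 0; i < MS_SSIM_SCALES; ++i)
        blocks += block_counts[i];
    return blocks * MS_SSIM_CHANNELS * elem_size;
}
```

Passing the same block counts with an element size of 8 instead of 4 doubles the result, which is the whole memory cost of the change for each tier.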
References
- Phase-2 diagnosis by agent w487fr96l (2026-06-03): float accumulation in ms_ssim_vert_lcs using 2.0f literals causes approximately 0.004 drift over 33k pixels, 40x the places=4 gate.
- Blamed commit: 8db2715ac2.
- Related ADR: ADR-0139 (same 2.0 * double-promotion fix for AVX2/AVX-512 paths).
- Related ADR: ADR-0214 (places=4 GPU parity gate).
- Scalar reference: core/src/feature/iqa/ssim_tools.c, function ssim_accumulate_default_scalar.