ADR-1071: Promote HIP ms_ssim_vert_lcs to double precision (ADR-0990 parity)¶

Status: Accepted
Date: 2026-06-06
Deciders: Lusoris
Tags: hip, precision, ms_ssim, cross-backend-parity

Context¶

ADR-0990 (2026-06-03) promoted the CUDA ms_ssim_vert_lcs kernel's per-pixel L/C/S computation, warp/block reductions, and per-block partial buffers from float to double, closing a ~0.004 drift at scale 0 relative to the CPU reference (ssim_tools.c uses 2.0 * double literals). The host struct fields c1, c2, c3 and the pinned partial arrays were also promoted to double.

The HIP twin (core/src/feature/hip/integer_ms_ssim/ms_ssim_score.hip and integer_ms_ssim_hip.c) did not receive the same fix. A cross-backend audit on 2026-06-06 (moment, ciede2000, ms_ssim extractors) confirmed:

ms_ssim_vert_lcs kernel takes float c1, c2, c3 and writes float *l_partials, float *c_partials, float *s_partials — same pre-ADR-0990 precision posture.
MsSsimStateHip has float c1, c2, c3, float *h_{l,c,s}_partials[5], and device partial allocations sized sizeof(float) — mismatched with CUDA's double.
Options enable_db and clip_db, exposed by both the CPU and CUDA extractors, were absent from the HIP extractor's options[] array, causing silent score drift when callers request dB output.

Decision¶

Apply the ADR-0990 precision fix to the HIP backend:

Kernel (ms_ssim_score.hip): promote c1/c2/c3 parameters from float to double; promote per-pixel my_l/my_c/my_s from float to double; use double literals (2.0 *) for the L/C/S numerators; promote shared reduction arrays to double; use __shfl_down on double operands; write double * output partials.
Host (integer_ms_ssim_hip.c): promote MsSsimStateHip.c1/c2/c3 to double; compute them using const double L = 255.0; resize device and pinned-host partial allocations to sizeof(double); resize hipMemcpyAsync DtoH copies accordingly; promote h_{l,c,s}_partials arrays to double *.
Add enable_db and clip_db to the HIP extractor's options[] and apply the dB conversion in collect_fex_hip(), matching the CPU and CUDA paths.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
Keep `float` partials in HIP	No kernel change required	Persistent ~0.004 per-scale drift; fails ADR-0214 places=4 gate on AMD hardware	Correctness non-negotiable
Apply only to host, not kernel	Simpler diff	Kernel still writes `float`; host double allocation would be written by `float` kernel causing UB (type-punning)	Wrong: kernel and host must agree on element size
Annotate as known divergence	Zero code change	Requires weakening ADR-0214 gate for HIP; unacceptable	Violates correctness-first principle

Consequences¶

Positive: HIP ms_ssim L/C/S precision matches CUDA and CPU within ADR-0214 places=4; enable_db/clip_db options now functional on HIP.
Negative: double partial buffers are 2× larger in device VRAM (typically <100 KB total for 5 scales at 1080p; negligible). GCN/RDNA has native fp64 support, so no emulation penalty.
Neutral / follow-ups: SYCL ms_ssim uses float partials intentionally due to Intel Arc A380 lacking native fp64; no change needed there (documented as places=3 tolerance in ADR-0214 SYCL exception).

References¶

ADR-0990 (CUDA ADR-0139/ADR-0990 double-precision fix — the root fix being ported here)
ADR-0139 (AVX2/AVX-512 double-precision fix)
ADR-0214 (cross-backend places=4 CPU-parity gate)
ADR-0285 (HIP ms_ssim extractor scaffolding)
Cross-backend audit 2026-06-06: moment, ciede2000, ms_ssim