ADR-1071: Promote HIP ms_ssim_vert_lcs to double precision (ADR-0990 parity)
- Status: Accepted
- Date: 2026-06-06
- Deciders: Lusoris
- Tags: hip, precision, ms_ssim, cross-backend-parity
Context
ADR-0990 (2026-06-03) promoted the CUDA ms_ssim_vert_lcs kernel's per-pixel L/C/S computation, warp/block reductions, and per-block partial buffers from float to double, closing a ~0.004 drift at scale 0 relative to the CPU reference (ssim_tools.c uses 2.0 * double literals). The host struct fields c1, c2, c3 and the pinned partial arrays were also promoted to double.
The HIP twin (core/src/feature/hip/integer_ms_ssim/ms_ssim_score.hip and integer_ms_ssim_hip.c) did not receive the same fix. A cross-backend audit on 2026-06-06 (moment, ciede2000, ms_ssim extractors) confirmed:
- The `ms_ssim_vert_lcs` kernel takes `float c1, c2, c3` and writes `float *l_partials`, `float *c_partials`, `float *s_partials`, the same pre-ADR-0990 precision posture.
- `MsSsimStateHip` has `float c1, c2, c3`, `float *h_{l,c,s}_partials[5]`, and device partial allocations sized `sizeof(float)`, which mismatches CUDA's `double`.
- Options `enable_db` and `clip_db`, exposed by both the CPU and CUDA extractors, were absent from the HIP extractor's `options[]` array, causing silent score drift when callers request dB output.
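The effect of single-precision accumulation can be illustrated on the CPU. The sketch below is not the kernel: it evaluates the standard SSIM luminance term once in `float` and once in `double` and sums it sequentially, whereas the real kernels use warp/block reductions. The function names and the value `c1 = 6.5025` (that is, `(0.01 * 255)^2`) are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Luminance term l = (2*mu_x*mu_y + c1) / (mu_x^2 + mu_y^2 + c1),
 * evaluated entirely in single precision (pre-fix posture). */
float lum_term_f32(float mu_x, float mu_y, float c1)
{
    return (2.0f * mu_x * mu_y + c1) / (mu_x * mu_x + mu_y * mu_y + c1);
}

/* Same term with double operands and a double literal (2.0 *). */
double lum_term_f64(double mu_x, double mu_y, double c1)
{
    return (2.0 * mu_x * mu_y + c1) / (mu_x * mu_x + mu_y * mu_y + c1);
}

/* Sequential float accumulation of n identical terms. Once the running
 * sum is large, each addend is rounded to the accumulator's spacing. */
float sum_lum_f32(size_t n, float mu_x, float mu_y, float c1)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += lum_term_f32(mu_x, mu_y, c1);
    return acc;
}

/* Sequential double accumulation of the same n terms. */
double sum_lum_f64(size_t n, double mu_x, double mu_y, double c1)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += lum_term_f64(mu_x, mu_y, c1);
    return acc;
}
```

Calling both sum functions with `n = 1000000`, `mu_x = 100`, `mu_y = 101` and `c1 = 6.5025` yields totals that differ by more than 1.0, because every float addition rounds the near-one term up once the accumulator grows. A tree reduction loses less than this worst-case sequential loop, but the mechanism is the same.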
Decision
Apply the ADR-0990 precision fix to the HIP backend:
- Kernel (`ms_ssim_score.hip`): promote `c1`/`c2`/`c3` parameters from `float` to `double`; promote per-pixel `my_l`/`my_c`/`my_s` from `float` to `double`; use double literals (`2.0 *`) for the L/C/S numerators; promote shared reduction arrays to `double`; use `__shfl_down` on `double` operands; write `double *` output partials.
- Host (`integer_ms_ssim_hip.c`): promote `MsSsimStateHip.c1`/`c2`/`c3` to `double`; compute them using `const double L = 255.0`; resize device and pinned-host partial allocations to `sizeof(double)`; resize `hipMemcpyAsync` DtoH copies accordingly; promote `h_{l,c,s}_partials` arrays to `double *`.
- Add `enable_db` and `clip_db` to the HIP extractor's `options[]` and apply the dB conversion in `collect_fex_hip()`, matching the CPU and CUDA paths.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep float partials in HIP | No kernel change required | Persistent ~0.004 per-scale drift; fails ADR-0214 places=4 gate on AMD hardware | Correctness non-negotiable |
| Apply only to host, not kernel | Simpler diff | Kernel still writes float; host double allocation would be written by float kernel causing UB (type-punning) | Wrong: kernel and host must agree on element size |
| Annotate as known divergence | Zero code change | Requires weakening ADR-0214 gate for HIP; unacceptable | Violates correctness-first principle |
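The host-only alternative fails because of element size, which a small CPU simulation can show. The sketch below is illustrative only: the function names are hypothetical, a byte buffer stands in for the device partial buffer, and `memcpy` is used so that the example itself stays well defined. It assumes the usual 4-byte `float` and 8-byte `double`.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for a kernel that still writes n float partials into buf. */
void write_float_partials_sketch(unsigned char *buf, size_t n, float value)
{
    for (size_t i = 0; i < n; i++)
        memcpy(buf + i * sizeof(float), &value, sizeof(float));
}

/* Stand-in for a host that reads the same bytes as double partials. */
double read_double_partial_sketch(const unsigned char *buf, size_t i)
{
    double d;
    memcpy(&d, buf + i * sizeof(double), sizeof(double));
    return d;
}
```

If four `float` values of `1.0f` are written into a zeroed buffer sized for four `double` elements, the host sees two elements that are each built from a pair of float bit patterns and are not equal to 1.0, followed by two elements that were never written. Both sides have to change together.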
Consequences
- Positive: HIP ms_ssim L/C/S precision matches CUDA and CPU within ADR-0214 places=4; `enable_db`/`clip_db` options now functional on HIP.
- Negative: `double` partial buffers are 2× larger in device VRAM (typically <100 KB total for 5 scales at 1080p; negligible). GCN/RDNA has native fp64 support, so no emulation penalty.
- Neutral / follow-ups: SYCL ms_ssim uses `float` partials intentionally due to Intel Arc A380 lacking native fp64; no change needed there (documented as places=3 tolerance in ADR-0214 SYCL exception).
References
- ADR-0990 (CUDA ADR-0139/ADR-0990 double-precision fix — the root fix being ported here)
- ADR-0139 (AVX2/AVX-512 double-precision fix)
- ADR-0214 (cross-backend places=4 CPU-parity gate)
- ADR-0285 (HIP ms_ssim extractor scaffolding)
- Cross-backend audit 2026-06-06: moment, ciede2000, ms_ssim