ADR-0539: integer ADM HIP kernels — real implementation replacing weak HSACO stubs¶
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, Claude
- Tags: hip, gpu, feature, integer-adm, port
Context¶
ADR-0533 (PR #1292) wired the integer ADM HIP extractor into the dispatch table, but the four .hip kernel sources it depends on (adm_dwt2.hip, adm_csf.hip, adm_csf_den.hip, adm_cm.hip) did not build standalone via hipcc --genco:
adm_dwt2.hipandadm_csf.hipwere near-line-for-line CUDA ports that already compiled, but had never been registered inhip_kernel_sources.adm_csf_den.hiphad an incomplete port — the first kernel was missing thetemplate <int val_per_thread, int cta_size>declaration and the__device__ __forceinline__ voidqualifiers, so the file failed to parse.adm_cm.hipwas wrapped in an unterminated#if defined(COMPARE_FUSED_SPLIT)block and referenced CUDA-only helpers (uint64_cu,warp_reduce,VMAF_CUDA_THREADS_PER_WARP,atomicAdd_int64,__float2uint_ru,__log2f) thatcuda_helper.cuhprovides buthipccdoes not.
ADR-0536 (PR #1296) papered over the missing strong symbols with weak fallbacks in hip_hsaco_stubs.c so the host link succeeded; at runtime hipModuleLoadData on an empty blob returned non-zero and the extractor silently fell back to CPU. The user directive — "no stubs anywhere" — requires real, working kernels.
Decision¶
Port the four kernels to compile standalone under hipcc --genco, register them in hip_kernel_sources, and delete the four ADM weak slots from hip_hsaco_stubs.c. The xxd-embedded _hsaco strong symbols from the new targets supply the previously-stubbed blobs.
The CUDA twin's per-warp __shfl_down_sync reduction (cuda_helper.cuh::warp_reduce) is replaced by per-thread atomicAdd on the 64-bit unsigned accumulator. This mirrors the pattern adopted in vif_statistics.hip (ADR-0537): AMD wavefronts are 64 wide (not the 32 the CUDA shuffle mask hard-codes), and __shfl_down_sync is not portable across hipcc / NV. Per-thread atomicAdd on a 64-bit unsigned accumulator is bit-exact with respect to the CUDA twin — only the reduction order differs, and unsigned integer addition is associative and commutative.
The two-kernel pattern for adm_cm (i4_adm_cm_line_kernel writing per-thread scratch + adm_cm_reduce_line_kernel_4 consuming it) is preserved — the HIP host TU (integer_adm_hip.c) still launches both, mirroring the pre-fusion CUDA shape. The newer fused CUDA kernel (i4_adm_cm_line_kernel_fused) is not adopted here; migrating the HIP host TU is a follow-up.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Per-thread atomicAdd on uint64 accum (chosen) | Bit-exact w.r.t. CUDA twin (uint64 add is associative). Trivial to read. Same approach as vif_statistics.hip (ADR-0537). Works on every AMD wavefront width without #ifdef. | Higher atomic traffic than per-warp reduce. Negligible at 576x324; revisit if profiling shows contention on 4K / HDR ROI. | Selected. |
HIP __shfl_down with warpSize (runtime) | Closer to CUDA twin shape. | __shfl_down semantics differ between hipcc 5.x and 6.x; warpSize is a compile-time constant on AMD but 32 vs. 64 differ between gfx generations; would need per-arch dispatch. | Complexity / portability cost not justified for a four-kernel port. |
__hip_atomic_compare_exchange CAS-loop reduction | Avoids 64-bit atomicAdd on older GCN. | atomicAdd-on-uint64 is native on gfx90a / gfx10 / gfx11 (every dev-MCP target); CAS-loop is the implicit fallback for older silicon already. | Manual fallback redundant. |
Compile kernels via host-side hipcc with cuda_helper.cuh shim | Reuses CUDA twin verbatim. | cuda_helper.cuh pulls ffnvcodec/dynlink_loader.h and assumes warpSize=32 in the shuffle mask; HIP support would need a separate shim header maintained in parallel — same surface area as just porting. | Net negative — more files to maintain, no clarity benefit. |
Consequences¶
- Positive: HIP backend now produces bit-exact ADM scores vs. CPU on the Netflix golden src01 pair (verified diff = 0.000000 across all six emitted features:
integer_adm,integer_adm2,integer_adm3,integer_adm_scale[0-3]). Closes the silent CPU-fallback bug ADR-0533 introduced. - Positive:
hip_hsaco_stubs.cdrops from four stub slots to zero; the weak-fallback macro is retained as a pattern for future in-progress ports but is currently used by zero extractors. - Negative: Per-thread atomicAdd is theoretically slower than the CUDA per-warp reduce. Not measured here; defer profiling to a perf pass once ADM HIP is exercised on 4K HDR fixtures.
- Neutral / follow-ups: Host TU (
integer_adm_hip.c) untouched — same kernel names, same launch shapes, sameWarpShiftstruct layout. The HIP host TU still uses the two-kernel reduce pattern; CUDA's newer fused variant could be ported as a follow-up if perf becomes a concern.
References¶
- ADR-0533 (PR #1292): integer ADM HIP extractor wiring.
- ADR-0536 (PR #1296): weak HSACO stubs that this ADR removes for the four ADM blobs.
- ADR-0537:
vif_statistics.hipadopted the same per-thread atomicAdd reduction pattern that this ADR carries into ADM. - ADR-0214: cross-backend numerical-parity gate (places=4).
- Upstream CUDA twins:
core/src/feature/cuda/integer_adm/adm_*.cu. - Source:
req(user direction): "no stubs anywhere".