ADR-0375: HIP batch-3 — float_moment_hip and float_ssim_hip real kernels¶
- Status: Accepted
- Date: 2026-05-10
- Deciders: lusoris, Claude (Anthropic)
- Tags:
hip,gpu,build,feature-extractor,fork-local
Context¶
After ADR-0373 landed batch-2 (float_motion_hip), six of eleven HIP feature extractors remained at the -ENOSYS scaffold posture. The batch-3 goal was to promote two straightforward extractors — float_moment_hip and float_ssim_hip — to real HIP module-API consumers, following the pattern established by PR #612 (float_psnr_hip, ADR-0254), batch-1 (ADR-0372), and batch-2 (ADR-0373).
float_moment_hip was the simplest remaining target: a single kernel dispatch per frame with four uint64 atomic accumulators, directly mirroring cuda/integer_moment/moment_score.cu. float_ssim_hip was the most architecturally complex of the tractable targets: a two-pass separable 11-tap Gaussian (horizontal pass into five intermediate float buffers, vertical pass + SSIM combine + per-block float partial sum), directly mirroring cuda/integer_ssim/ssim_score.cu. Both are scale=1-only in v1 (same as the CUDA and Vulkan twins).
The remaining five extractors (ciede_hip, integer_psnr_hip — already promoted in batch-1 — integer_motion_v2_hip, adm_hip, vif_hip) and the old-style motion_hip.c stub (non-VmafFeatureExtractor API) are deferred to future batches.
Decision¶
We promote float_moment_hip and float_ssim_hip from -ENOSYS stubs to real #ifdef HAVE_HIPCC dual-path consumers.
float_moment_hip (hip/float_moment/moment_score.hip + updated float_moment_hip.c / float_moment_hip.h):
- Two kernels:
calculate_moment_hip_kernel_8bpc/_16bpc(same 7-arg signature for both — nobpcarg; raw uint8 bytes addressed as uint16 for 16bpc). - Warp reduction: two uint32
__shfl_downshuffles for the 64-lane GCN/RDNA warp (mirrors CUDA twin'swarp_reduce_u64technique). - Accumulator pattern:
hipMemsetAsynczeros the four device uint64 counters before each dispatch;hipMemcpyAsyncreads them back after. - Host: four sums divided by
w×h→ fourfloat_moment_*features.
float_ssim_hip (hip/float_ssim/ssim_score.hip + updated float_ssim_hip.c / float_ssim_hip.h):
- Three kernels:
calculate_ssim_hip_horiz_8bpc,calculate_ssim_hip_horiz_16bpc,calculate_ssim_hip_vert_combine. - Warp size 64 (GCN/RDNA):
SSIM_WARPS_PER_BLOCK = 128 / 64 = 2(vs CUDA's128 / 32 = 4). The shared-memory warp-sum array is sized accordingly. - Five intermediate float device buffers (
d_ref_mu,d_cmp_mu,d_ref_sq,d_cmp_sq,d_refcmp) allocated viahipMalloc(no wrapper struct, unlike the CUDAVmafCudaBufferlayer). - Two luma staging buffers (
ref_in,cmp_in) for HtoD copies (hipMemcpy2DAsync,hipMemcpyHostToDevice), consistent with T7-10b's "pictures arrive as CPU VmafPictures" posture. - Both passes run on the same private stream; implicit stream ordering provides the happens-before between Pass 1 writes and Pass 2 reads.
- v1 constraint:
scale=1only, enforced atinit()with-EINVAL. - Host: per-block float partials accumulated in
double, divided by(W-10)×(H-10)→ singlefloat_ssimfeature.
Both host TUs follow the float_psnr_hip.c #ifdef HAVE_HIPCC dual-path pattern: helper functions that call HIP APIs are inside #ifdef HAVE_HIPCC; free-helpers called from error labels are defined outside the guard with internal #ifdef HAVE_HIPCC guards (mirrors float_psnr_hip_module_free). Without enable_hipcc, init() returns -ENOSYS (scaffold contract preserved).
moment_score and ssim_score are added to the hip_kernel_sources dict in core/src/meson.build, extending the hipcc --genco → HSACO → xxd -i pipeline established by ADR-0372.
HIP real-kernel count: 6 of 11 extractors.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Also promote integer_motion_v2_hip | One more extractor done | Algorithm (convolution linearity trick) requires a different kernel shape — needs a separate .hip file and host wiring distinct from motion/motion_score.hip | Deferred to batch-4 for focused review |
Promote ciede_hip instead of float_ssim_hip | Simpler reduction (per-block float partials, single pass) | Less user-visible than SSIM; SSIM closes the more prominent quality metric gap | float_ssim_hip chosen for higher impact |
Use vmaf_hip_kernel_submit_pre_launch to zero partials for SSIM | Uniform pre-launch contract | SSIM uses per-block writes, not atomics — no zero needed; the helper is for atomic accumulator consumers | Not used for SSIM, used correctly for moment |
Consequences¶
- Positive:
float_moment_hipandfloat_ssim_hipproduce real scores on AMD GCN/RDNA hardware when built withenable_hip=true enable_hipcc=true. CPU-only and HIP-without-hipcc builds preserve the-ENOSYSscaffold posture unchanged. - Positive:
moment_score.hipandssim_score.hipjoin the HSACO build pipeline; future GPU-specific work on these metrics starts from working device code rather than from scratch. - Negative: Two new HSACO fat binaries in the build output; adds approximately 30-60 s to
hipcc --gencobuild time per target architecture. - Neutral:
float_ssim_hipv1 rejectsscale != 1with-EINVAL, matching the CUDA and Vulkan twins. Lifting this constraint is a future batch item. - Neutral / follow-up:
motion_hip.c(old-style API, not aVmafFeatureExtractor) and the remaining five stubs (ciede_hip,integer_motion_v2_hip,adm_hip,vif_hip,integer_motion_hip— not yet registered) are deferred to batch-4.
References¶
- ADR-0254:
float_psnr_hip— first real HIP kernel, establishes the#ifdef HAVE_HIPCCdual-path pattern. - ADR-0372: HIP batch-1 (
integer_psnr_hip,float_ansnr_hip). - ADR-0373: HIP batch-2 (
float_motion_hip). - ADR-0260:
float_moment_hipscaffold (fourth consumer). - ADR-0274:
float_ssim_hipscaffold (eighth consumer). - CUDA twins:
cuda/integer_moment/moment_score.cu,cuda/integer_ssim/ssim_score.cu,cuda/integer_ssim_cuda.c. - req: "open a NEW PR replacing 2-3 more HIP -ENOSYS stubs with real implementations … Pick the simplest 2-3 for batch-3: motion_hip, float_ssim_hip, and float_moment_hip."