Skip to content

ADR-0375: HIP batch-3 — float_moment_hip and float_ssim_hip real kernels

  • Status: Accepted
  • Date: 2026-05-10
  • Deciders: lusoris, Claude (Anthropic)
  • Tags: hip, gpu, build, feature-extractor, fork-local

Context

After ADR-0373 landed batch-2 (float_motion_hip), six of eleven HIP feature extractors remained at the -ENOSYS scaffold posture. The batch-3 goal was to promote two straightforward extractors — float_moment_hip and float_ssim_hip — to real HIP module-API consumers, following the pattern established by PR #612 (float_psnr_hip, ADR-0254), batch-1 (ADR-0372), and batch-2 (ADR-0373).

float_moment_hip was the simplest remaining target: a single kernel dispatch per frame with four uint64 atomic accumulators, directly mirroring cuda/integer_moment/moment_score.cu. float_ssim_hip was the most architecturally complex of the tractable targets: a two-pass separable 11-tap Gaussian (horizontal pass into five intermediate float buffers, vertical pass + SSIM combine + per-block float partial sum), directly mirroring cuda/integer_ssim/ssim_score.cu. Both are scale=1-only in v1 (same as the CUDA and Vulkan twins).

The remaining five extractors (ciede_hip, integer_psnr_hip — already promoted in batch-1 — integer_motion_v2_hip, adm_hip, vif_hip) and the old-style motion_hip.c stub (non-VmafFeatureExtractor API) are deferred to future batches.

Decision

We promote float_moment_hip and float_ssim_hip from -ENOSYS stubs to real #ifdef HAVE_HIPCC dual-path consumers.

float_moment_hip (hip/float_moment/moment_score.hip + updated float_moment_hip.c / float_moment_hip.h):

  • Two kernels: calculate_moment_hip_kernel_8bpc / _16bpc (same 7-arg signature for both — no bpc arg; raw uint8 bytes addressed as uint16 for 16bpc).
  • Warp reduction: two uint32 __shfl_down shuffles for the 64-lane GCN/RDNA warp (mirrors CUDA twin's warp_reduce_u64 technique).
  • Accumulator pattern: hipMemsetAsync zeros the four device uint64 counters before each dispatch; hipMemcpyAsync reads them back after.
  • Host: four sums divided by w×h → four float_moment_* features.

float_ssim_hip (hip/float_ssim/ssim_score.hip + updated float_ssim_hip.c / float_ssim_hip.h):

  • Three kernels: calculate_ssim_hip_horiz_8bpc, calculate_ssim_hip_horiz_16bpc, calculate_ssim_hip_vert_combine.
  • Warp size 64 (GCN/RDNA): SSIM_WARPS_PER_BLOCK = 128 / 64 = 2 (vs CUDA's 128 / 32 = 4). The shared-memory warp-sum array is sized accordingly.
  • Five intermediate float device buffers (d_ref_mu, d_cmp_mu, d_ref_sq, d_cmp_sq, d_refcmp) allocated via hipMalloc (no wrapper struct, unlike the CUDA VmafCudaBuffer layer).
  • Two luma staging buffers (ref_in, cmp_in) for HtoD copies (hipMemcpy2DAsync, hipMemcpyHostToDevice), consistent with T7-10b's "pictures arrive as CPU VmafPictures" posture.
  • Both passes run on the same private stream; implicit stream ordering provides the happens-before between Pass 1 writes and Pass 2 reads.
  • v1 constraint: scale=1 only, enforced at init() with -EINVAL.
  • Host: per-block float partials accumulated in double, divided by (W-10)×(H-10) → single float_ssim feature.

Both host TUs follow the float_psnr_hip.c #ifdef HAVE_HIPCC dual-path pattern: helper functions that call HIP APIs are inside #ifdef HAVE_HIPCC; free-helpers called from error labels are defined outside the guard with internal #ifdef HAVE_HIPCC guards (mirrors float_psnr_hip_module_free). Without enable_hipcc, init() returns -ENOSYS (scaffold contract preserved).

moment_score and ssim_score are added to the hip_kernel_sources dict in core/src/meson.build, extending the hipcc --genco → HSACO → xxd -i pipeline established by ADR-0372.

HIP real-kernel count: 6 of 11 extractors.

Alternatives considered

Option Pros Cons Why not chosen
Also promote integer_motion_v2_hip One more extractor done Algorithm (convolution linearity trick) requires a different kernel shape — needs a separate .hip file and host wiring distinct from motion/motion_score.hip Deferred to batch-4 for focused review
Promote ciede_hip instead of float_ssim_hip Simpler reduction (per-block float partials, single pass) Less user-visible than SSIM; SSIM closes the more prominent quality metric gap float_ssim_hip chosen for higher impact
Use vmaf_hip_kernel_submit_pre_launch to zero partials for SSIM Uniform pre-launch contract SSIM uses per-block writes, not atomics — no zero needed; the helper is for atomic accumulator consumers Not used for SSIM, used correctly for moment

Consequences

  • Positive: float_moment_hip and float_ssim_hip produce real scores on AMD GCN/RDNA hardware when built with enable_hip=true enable_hipcc=true. CPU-only and HIP-without-hipcc builds preserve the -ENOSYS scaffold posture unchanged.
  • Positive: moment_score.hip and ssim_score.hip join the HSACO build pipeline; future GPU-specific work on these metrics starts from working device code rather than from scratch.
  • Negative: Two new HSACO fat binaries in the build output; adds approximately 30-60 s to hipcc --genco build time per target architecture.
  • Neutral: float_ssim_hip v1 rejects scale != 1 with -EINVAL, matching the CUDA and Vulkan twins. Lifting this constraint is a future batch item.
  • Neutral / follow-up: motion_hip.c (old-style API, not a VmafFeatureExtractor) and the remaining five stubs (ciede_hip, integer_motion_v2_hip, adm_hip, vif_hip, integer_motion_hip — not yet registered) are deferred to batch-4.

References

  • ADR-0254: float_psnr_hip — first real HIP kernel, establishes the #ifdef HAVE_HIPCC dual-path pattern.
  • ADR-0372: HIP batch-1 (integer_psnr_hip, float_ansnr_hip).
  • ADR-0373: HIP batch-2 (float_motion_hip).
  • ADR-0260: float_moment_hip scaffold (fourth consumer).
  • ADR-0274: float_ssim_hip scaffold (eighth consumer).
  • CUDA twins: cuda/integer_moment/moment_score.cu, cuda/integer_ssim/ssim_score.cu, cuda/integer_ssim_cuda.c.
  • req: "open a NEW PR replacing 2-3 more HIP -ENOSYS stubs with real implementations … Pick the simplest 2-3 for batch-3: motion_hip, float_ssim_hip, and float_moment_hip."