ADR-0373: HIP Batch-2 - float_motion_hip Real Kernel
- Status: Accepted
- Date: 2026-05-10
- Deciders: lusoris, Claude (Anthropic)
- Tags: hip, gpu, build
Context
ADR-0372 promoted integer_psnr_hip and float_ansnr_hip from -ENOSYS stubs to real HIP module-API consumers (batch-1). It explicitly deferred float_motion_hip because the existing scaffold used a different API shape (opaque uintptr_t slots instead of real hipMalloc device buffers) and the temporal design (blur ping-pong + flush() tail emission) required more adaptation than the stateless scalar-reduction targets.
A review of the scaffold showed the adaptation to be mechanical: the HIP device kernel already existed at feature/hip/float_motion/float_motion_score.hip (a warp-64 port of the CUDA twin), and the host-side changes follow the same hipModuleLoadData + hipModuleGetFunction + hipModuleLaunchKernel pattern established by ADR-0254 and ADR-0372. The only non-trivial aspect is the temporal state: the blur[2] ping-pong (two hipMalloc float arrays) plus the ref_in staging buffer, with compute_sad=0 on the first frame and the prev_motion_score carry for motion2.
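The ping-pong described above can be illustrated with a small host-only sketch. It is not the real extractor: the PingPong struct, submit_sketch, the toy frame size N, selecting the slot by index parity, and normalising the sum of absolute differences by the pixel count are all assumptions made for illustration. Plain float arrays stand in for the two hipMalloc buffers, and the input is assumed to be already blurred.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 4 }; /* pixels per toy frame (hypothetical size) */

/* Hypothetical stand-in for the extractor state described in this ADR. */
typedef struct {
    float blur[2][N]; /* host arrays standing in for the two device blur buffers */
    unsigned index;   /* number of frames submitted so far */
} PingPong;

/* Stand-in for one kernel launch. Writes this frame's blurred pixels into the
 * current slot and, when a previous blur exists, returns the mean absolute
 * difference against the other slot. The first frame returns 0.0. */
static float submit_sketch(PingPong *s, const float *blurred)
{
    assert(s != NULL && blurred != NULL);
    unsigned cur = s->index % 2u;     /* assumption: slot chosen by index parity */
    unsigned prev = 1u - cur;         /* the other slot holds the previous blur */
    int compute_sad = (s->index > 0); /* 0 on the first frame, 1 afterwards */
    float sad = 0.0f;

    for (size_t i = 0; i < N; i++) {
        s->blur[cur][i] = blurred[i]; /* cur_blur is always written */
        if (compute_sad) {
            float d = blurred[i] - s->blur[prev][i];
            sad += (d < 0.0f) ? -d : d;
        }
    }
    s->index++;
    return sad / (float)N;
}
```

Because the slot just written becomes the previous slot on the next call, no copy between buffers is needed; only two buffers are ever live.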
Decision
Promote float_motion_hip (emits VMAF_feature_motion_score and VMAF_feature_motion2_score) from -ENOSYS scaffold to a real HIP module-API consumer following ADR-0254 / ADR-0372.
Key implementation choices:
- Struct layout: Replace the three uintptr_t opaque slots (ref_in, blur[0], blur[1]) with real void * device pointers allocated via hipMalloc. Module and per-bpc function handles (hipModule_t module, hipFunction_t funcbpc8/funcbpc16) added inside #ifdef HAVE_HIPCC.
- Helper extraction: fm_hip_bufs_alloc and fm_hip_bufs_free are extracted to keep init_fex_hip under the 60-line function-size limit. fm_hip_bufs_free also unloads the module (single point of teardown).
- Temporal logic: compute_sad=0 on the first frame (no previous blur available; kernel writes cur_blur but partials are all 0.0 by contract). collect() emits motion2_score = min(prev, cur) at index - 1 and flush() emits the tail motion2_score at s->index. Mirrors the CUDA twin's flush_fex_cuda exactly.
- HtoD copy: hipMemcpy2DAsync with hipMemcpyHostToDevice because pictures arrive as CPU VmafPictures (VMAF_FEATURE_EXTRACTOR_HIP flag not yet set, the same posture as all prior HIP consumers).
- Meson: float_motion_score added to hip_kernel_sources in src/meson.build so hipcc --genco + xxd -i embed the HSACO blob at build time when enable_hipcc=true.
- Non-HAVE_HIPCC path: scaffold posture preserved; init() returns -ENOSYS via the template helpers, submit() calls vmaf_hip_kernel_submit_pre_launch and returns -ENOSYS, collect() and flush() return -ENOSYS/1 respectively. CPU-only CI is unaffected.
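The emission timing in the temporal-logic item can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not the extractor's code: Motion2State, collect_sketch, flush_sketch and motion2_sequence are hypothetical names, the out array stands in for the feature collector, and the first frame's score is taken to be 0.0 because compute_sad=0 there. Whether the real collect() special-cases the first frames is not stated in this ADR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side state; field names echo the ADR, not the real struct. */
typedef struct {
    unsigned index;           /* index of the last frame passed to collect */
    double prev_motion_score; /* motion score carried over from that frame */
} Motion2State;

static double min_double(double a, double b) { return a < b ? a : b; }

/* Per-frame step: cur is the motion score of frame `index`.
 * out[i] stands in for the feature collector slot of frame i. */
static void collect_sketch(Motion2State *s, double *out, unsigned index, double cur)
{
    if (index > 0) {
        /* motion2 of the previous frame is only known once cur arrives */
        out[index - 1] = min_double(s->prev_motion_score, cur);
    }
    s->prev_motion_score = cur;
    s->index = index;
}

/* End of stream: the last frame has no successor, so its carried score
 * is emitted unchanged at s->index. */
static void flush_sketch(const Motion2State *s, double *out)
{
    out[s->index] = s->prev_motion_score;
}

/* Drives collect and flush over a whole sequence of motion scores. */
static void motion2_sequence(const double *motion, size_t n, double *out)
{
    assert(n > 0);
    Motion2State s = {0u, 0.0};
    for (size_t i = 0; i < n; i++)
        collect_sketch(&s, out, (unsigned)i, motion[i]);
    flush_sketch(&s, out);
}
```

For motion scores {0, 4, 2, 6} the sketch yields motion2 values {0, 2, 2, 6}: every frame except the last takes the minimum with its successor, and the last frame's value comes from the flush step. Without that tail emission the final frame would have no motion2 score.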
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Port float_moment_hip as batch-2 instead | Stateless, simpler | Kernel already implemented; float_moment_hip.c needs its own HAVE_HIPCC promotion separately | Both can land; float_motion is higher priority (used by VMAF models directly) |
| Port float_ssim_hip | Completes more extractors | Two-pass design, five intermediate device buffers, three kernel functions; non-trivial ABI | Per the per-task guidance: STOP if bit-exactness is not trivially clear |
| Keep motion_hip (the old raw-API stub) | No change | It uses a different API shape and is not a VmafFeatureExtractor consumer | It is a separate implementation; float_motion_hip is the correct consumer |
Consequences
- Positive: float_motion_hip is now a real kernel consumer. HIP extractor count rises from 3/11 real (ADR-0372) to 4/11 real. VMAF_feature_motion_score and VMAF_feature_motion2_score are now GPU-accelerated on ROCm.
- Neutral: float_ssim_hip, float_moment_hip, integer_motion_v2_hip, adm_hip, vif_hip, ciede_hip remain at -ENOSYS scaffold posture.
- Follow-up: float_moment_hip and integer_motion_v2_hip are the next simplest batch-3 candidates (the device kernels already exist).
References
- ADR-0273 (float_motion_hip seventh consumer scaffold).
- ADR-0372 (HIP batch-1: integer_psnr_hip + float_ansnr_hip).
- ADR-0254 (float_psnr_hip canonical HIP module-API pattern).
- paraphrased: the user requested recovery of the previous batch-2 agent's work on float_motion_hip, promoting the host TU from -ENOSYS scaffold to a real HIP module-API consumer using the kernel already present in float_motion/float_motion_score.hip.