ADR-0274: HIP eighth kernel-template consumer — `float_ssim_hip`¶

Status: Accepted
Date: 2026-05-03
Deciders: Lusoris, Claude (T7-10b follow-up)
Tags: hip, gpu, feature-extractor, kernel-template, multi-dispatch, fork-local

Context¶

The first seven HIP kernel-template consumers (ADR-0241, ADR-0254, ADR-0259, ADR-0260, ADR-0266, ADR-0267, and ADR-0273) are all single-dispatch: each frame is processed in one kernel launch. The runtime PR (T7-10b) needs at least one multi-dispatch consumer in place before it lands so the helper-body flip can validate the case where two kernels run on the picture stream with implicit happens-before ordering between them.

This ADR adds the eighth consumer — float_ssim_hip, the HIP twin of core/src/feature/cuda/integer_ssim_cuda.c (384 LOC, which despite its filename registers vmaf_fex_float_ssim_cuda and emits the float_ssim feature). It pins two new shapes: two-dispatch separable Gaussian (horizontal pass writes five intermediate float buffers, vertical pass reads them and writes per-block float partials) and chars.n_dispatches_per_frame == 2 in the extractor's characteristics — every prior HIP consumer has n_dispatches_per_frame == 1. The smoke test pins this value explicitly so the runtime PR's dispatch-counter accounting can't silently drift.

float_ssim carries a v1 scale=1 constraint matching ssim_cuda and ssim_vulkan: auto-decimation rejects scale>1 with -EINVAL at init time. Pinning this in the scaffold so a caller asking for float_ssim_hip:scale=2 sees a clean validation surface instead of the kernel-template's -ENOSYS.

Decision¶

We add core/src/feature/hip/float_ssim_hip.{c,h} as the eighth kernel-template consumer. The TU mirrors the CUDA twin call-graph-for-call-graph: identical state struct (modulo the CUDA-driver CUfunction slots and the VmafCudaBuffer *-vs-uintptr_t buffer slot type difference), identical validate_dims / init_dims helpers (extracted to keep init() under the readability-function-size budget), identical init/submit/collect/close lifecycle, and identical c1 / c2 SSIM constants (L = 255.0, K1 = 0.01, K2 = 0.03). init() returns -ENOSYS once dimension validation passes, until T7-10b lands the runtime helpers; the smoke test pins both the registration shape and chars.n_dispatches_per_frame == 2; CHANGELOG and rebase-notes carry the addition.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
`float_ssim_hip` (chosen)	Pins the multi-dispatch shape (`n_dispatches_per_frame == 2`); pins the five-intermediate-float-buffer pyramid that no prior consumer exercises; pins the v1 scale=1 `-EINVAL` validation surface so callers don't get a confusing `-ENOSYS` from a runtime-not-ready kernel for an input the kernel wouldn't have supported anyway.	384 LOC is the largest unported CUDA twin in the smallest-twin tier (still under integer_ssim's 384 → integer_psnr_hvs's 492).	—
`float_motion_hip` (361 LOC)	Single-dispatch motion mirror.	Not multi-dispatch — leaves the runtime PR without a multi-dispatch consumer to validate against.	Picked alongside this one as the seventh consumer (ADR-0273) for the temporal+three-buffer-ping-pong shape.
`integer_psnr_hvs_cuda.c` (492 LOC)	Pins the per-DCT-block SSIM-like shape.	28% larger than `integer_ssim_cuda.c`; complex DCT scratch buffers.	Defer to a later batch.
`integer_ms_ssim_cuda.c` (502 LOC)	Pins the multi-scale pyramid shape.	30% larger; multi-scale pyramid is a more invasive shape than two-dispatch single-scale.	Defer to a later batch.
Skip the multi-dispatch consumer	Keeps the diff smaller.	Leaves the runtime PR without any consumer that exercises `n_dispatches_per_frame == 2` — every helper-body flip lands without that validation.	Adds a follow-up PR for no real saving.

Consequences¶

Positive: The multi-dispatch shape is now pinned ahead of T7-10b. The runtime PR's helper-body flip can validate that two kernels on the picture stream with implicit happens-before ordering produce the expected per-block partials. The chars.n_dispatches_per_frame == 2 invariant is asserted in the smoke test, so a refactor that accidentally drops that field to 1 fails the lookup contract.
Negative: Adds two helper functions (validate_dims_hip, init_dims_hip) extracted from init() to keep the readability-function-size budget. The CUDA twin keeps everything inline; the HIP twin needs the extraction because it adds extra context-allocation steps the CUDA path doesn't. The AGENTS.md note pins this layout difference so a future refactor doesn't try to re-inline the helpers and bust the budget.
Neutral / follow-ups: When the runtime PR lands a HIP device-buffer allocator, the five intermediate float buffer slots (uintptr_t h_{ref_mu,cmp_mu,ref_sq,cmp_sq,refcmp}) become its first five-buffer client. The CUDA twin's VmafCudaBuffer *h_{ref_mu,cmp_mu,ref_sq,cmp_sq,refcmp} field shape is the target the runtime PR will mirror.

References¶

ADR-0212 — HIP audit-first scaffold (T7-10).
ADR-0241 — first consumer + kernel template.
ADR-0254 — second consumer.
ADR-0259 — third consumer.
ADR-0260 — fourth consumer.
ADR-0266 — fifth consumer (PR #340).
ADR-0267 — sixth consumer (PR #340).
ADR-0273 — seventh consumer (this PR).
ADR-0246 — GPU kernel-template pattern.
ADR-0214 — places=4 cross-backend gate.
CUDA twin: core/src/feature/cuda/integer_ssim_cuda.c (registers vmaf_fex_float_ssim_cuda).
Source: req (user dispatch) — "Add the seventh + eighth HIP runtime kernel-template consumers. ... Pick two more cleanest CUDA twins."

ADR-0274: HIP eighth kernel-template consumer — float_ssim_hip¶