ADR-0254: HIP second-consumer kernel — float_psnr_hip via mirrored kernel-template¶
- Status: Accepted
- Date: 2026-05-03
- Deciders: Lusoris, Claude (Anthropic)
- Tags: gpu, hip, rocm, amd, kernel-template, fork-local
Context¶
ADR-0212 shipped the HIP backend as a build-only scaffold (T7-10 audit half). ADR-0241 followed up with the first kernel-template consumer (integer_psnr_hip), establishing the per-frame async lifecycle the runtime PR (T7-10b) will eventually fill in.
This ADR ships the second consumer — float_psnr_hip — to exercise the same scaffold contract through a second feature with a slightly different precision posture (float per-WG partials instead of uint64 SSE). The goal is to prove the kernel-template's shape generalises across feature precisions before the runtime PR lands, so that PR can swap helper bodies for live HIP calls without churning either consumer's call-graph.
float_psnr_cuda.c is the natural twin to mirror: same algorithmic shape as integer_psnr_cuda.c, slightly larger surface (bit-depth-aware peak / clamp formula, float partials) at 272 LOC, single dispatch per channel, single readback. The features are functionally distinguishable so smoke-test coverage actually disambiguates lookup behaviour rather than collapsing to the first-consumer assertion.
Decision¶
Land a second kernel-template consumer (float_psnr_hip), kernel body still deferred to T7-10b¶
The PR ships:
core/src/feature/hip/float_psnr_hip.{c,h}— mirrorscore/src/feature/cuda/float_psnr_cuda.c's call graph verbatim:init → context_new + lifecycle_init + readback_alloc + feature_name_dict,submit → submit_pre_launch + (kernel-launch + event-record + DtoH copy land in T7-10b),collect → collect_wait score-emit (also T7-10b),close → lifecycle_close + readback_free + dictionary_free + context_destroy. The bit-depth-awarepeak/psnr_maxtable is established atinit()time so the eventual cross-backend numeric gate has nothing fork-specific to track.init()returns-ENOSYSbecause the template helpers do; the consumer's call-site shape is the load-bearing artefact this PR pins.- Registration:
vmaf_fex_float_psnr_hipis added to thefeature_extractor_listincore/src/feature/feature_extractor.cunder#if HAVE_HIP, immediately after the first consumer. Same posture as ADR-0241: registration succeeds, theVMAF_FEATURE_EXTRACTOR_HIPflag stays cleared until T7-10b, no device buffer-type plumbing is added yet. - Smoke test extension:
core/test/test_hip_smoke.cgrows one sub-test (test_float_psnr_hip_extractor_registered) that mirrors the existing first-consumer registration assertion but forfloat_psnr_hip. The smoke test file's table-driven structure keepsrun_tests()flat (15 sub-tests after this PR; thereadability-function-size15-branch budget is comfortably under the threshold). - Meson wiring:
core/src/hip/meson.buildadds../feature/hip/float_psnr_hip.ctohip_sources. No new dependency — the consumer compiles on a stock Ubuntu runner without any AMD packages installed (same posture as ADR-0212 + ADR-0241).
What stays on T7-10b¶
- Real
hipStreamCreate/hipMemcpyAsync/hipMallocAsyncbodies inkernel_template.c. - Per-bpc kernel launch (
float_psnr_kernel_8bpc/float_psnr_kernel_16bpcHIP twins). VMAF_FEATURE_EXTRACTOR_HIPflag flip on both consumers.- Picture buffer-type plumbing for HIP-resident frames.
- Score emission (
vmaf_feature_collector_append_with_dictwith the log10 / clamp formula).
The runtime PR will keep this consumer's call-graph verbatim and flip every -ENOSYS to a live error code, mirroring how the CUDA twin handles failures today.
Alternatives considered¶
A. Pick integer_motion_hip as the second consumer¶
Rejected. integer_motion_cuda.c is 518 LOC with a stateful single-frame-delay reference plane plus its own ring-buffer; mirroring it before the runtime exists would force scaffold gymnastics around state that has no observable effect (every frame's submit returns -ENOSYS). float_psnr_cuda.c is half the LOC, single-dispatch, no inter-frame state — the cleanest second consumer in the CUDA tree.
B. Pick integer_motion_v2_hip (320 LOC)¶
Rejected on the same state-plumbing argument. v2 still carries the single-frame-delay reference and the v1/v2 dual-feature emission that doubles the score-collector surface area. Pre-runtime, that extra surface is dead weight; post-runtime it's a one-PR follow-up once the runtime path is proven on the simpler twin.
C. Promote one of the existing scaffold stubs (adm_hip / vif_hip / motion_hip) to a kernel-template consumer¶
Rejected. The scaffold stubs in adm_hip.c / vif_hip.c / motion_hip.c use a different ABI shape (vmaf_hip_<feature>_init / _run / _destroy C functions, not VmafFeatureExtractor struct registration). They were scaffolded by ADR-0212 deliberately as "backend-internal stubs", not consumers. Repurposing one would collapse two distinct contracts (the kernel-internal C ABI versus the feature-extractor registration ABI) into a single confused shape. The fork keeps the stubs as scaffold-only placeholders the runtime PR can wire up later, and grows kernel-template consumers side-by-side as separate TUs.
D. Defer the second consumer entirely until T7-10b lands¶
Rejected. The kernel-template's load-bearing claim is "every fork-added HIP feature consumer goes through this contract". A single consumer doesn't prove the contract generalises; a second consumer at a different precision (float partials versus int64 SSE) does. Catching a contract-shape mismatch before the runtime PR is dramatically cheaper than catching it after — once T7-10b flips bodies to real HIP calls, every consumer-side regression becomes runtime-visible and must be debugged against a real device.
Consequences¶
- The HIP
feature_extractor_list[]grows from 1 entry to 2. - A caller asking for
vmaf --feature float_psnr_hipnow gets "extractor found, runtime not ready (-ENOSYS at init)" instead of "no such extractor". - The smoke-test sub-test count grows from 14 to 15. Still under the
readability-function-size15-branch budget thanks to the table-drivenrun_tests()shape ADR-0241 introduced. - T7-10b's surface is unchanged: same six kernel-template helper bodies to flip, same two consumers to validate against. The second consumer simply doubles the validation surface for free.
- No new build-time dependency. The scaffold-only build still succeeds on a stock Ubuntu runner without ROCm.
- Bit-exactness is not claimed; the kernels still don't exist.
/cross-backend-diffintegration lands with T7-10b.
References¶
- ADR-0212 — HIP backend scaffold (T7-10 audit half).
- ADR-0241 — first kernel-template consumer (
integer_psnr_hip); this ADR's predecessor. - ADR-0221 — original CUDA kernel template that ADR-0241 mirrored onto HIP.
- ADR-0246 — kernel-template generalisation across GPU backends.
core/src/feature/cuda/float_psnr_cuda.c— the CUDA reference whose call graph this consumer mirrors.req— user direction in T7-10b implementation prompt (paraphrased: "Land the next HIP runtime kernel body — pick the lightest CUDA twin").