ADR-0259: HIP third-consumer kernel — `ciede_hip` via mirrored kernel-template

Status: Accepted
Date: 2026-05-03
Deciders: Lusoris, Claude (Anthropic)
Tags: gpu, hip, rocm, amd, kernel-template, fork-local

Context

ADR-0212 shipped the HIP backend as a build-only scaffold. ADR-0241 added the first kernel-template consumer (integer_psnr_hip). PR #324 / ADR-0254 is in flight as the second consumer (float_psnr_hip). The runtime PR (T7-10b) is still pending; until it lands, every kernel-template helper body returns -ENOSYS so consumer init() calls surface that error verbatim.
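The pre-runtime behaviour described above can be pictured with a minimal sketch. The function names below are hypothetical stand-ins, not the real kernel-template API; the only point is that a stubbed helper returns -ENOSYS and the consumer's init() forwards that code unchanged.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel-template helper before T7-10b:
 * the body does no work and reports "not implemented". */
int sketch_template_lifecycle_init(void *lifecycle)
{
    (void)lifecycle;
    return -ENOSYS;
}

/* Hypothetical consumer init(): forwards the helper's error code
 * unchanged instead of translating or swallowing it. */
int sketch_consumer_init(void *state)
{
    int err = sketch_template_lifecycle_init(state);
    if (err != 0)
        return err; /* -ENOSYS reaches the caller verbatim */
    return 0;
}
```

Under this shape, the runtime PR would only need to change the helper body; the consumer's error path would stay as written.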

This ADR ships the third consumer, ciede_hip, to widen the kernel-template's pre-runtime validation surface. Its submit path intentionally bypasses the template's submit_pre_launch helper: the ciede CUDA twin inlines the wait because its kernel writes one float per block, so neither an atomic nor a memset is required. Pinning that bypass shape pre-runtime keeps the runtime PR's diff small: T7-10b flips helper bodies but does not have to invent a new "no-memset" template variant.

integer_ciede_cuda.c (243 LOC) is the cleanest CUDA twin to mirror after the two already-claimed PSNR twins: single dispatch per frame, single readback (per-block float partials), no inter-frame state, no fork-specific tweaks. Its precision posture (per-block float partials plus host double accumulation, per ADR-0187) sits between integer_psnr_hip's int64 SSE and float_psnr_hip's float partials.
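As an illustration of that precision posture, the host-side reduction might look like the sketch below. It assumes the readback buffer is a flat array of float partials, one per block; sketch_sum_partials is a hypothetical name and not a function in the CUDA twin.

```c
#include <stddef.h>

/* Hypothetical host-side reduction: each per-block partial arrives as a
 * float, but the running sum is kept in double so rounding error does not
 * grow with the number of blocks. */
double sketch_sum_partials(const float *partials, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += (double)partials[i];
    return sum;
}
```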

Decision

Land `ciede_hip` as the third kernel-template consumer; runtime body still deferred

The PR ships:

core/src/feature/hip/ciede_hip.{c,h}: mirrors the call graph of core/src/feature/cuda/integer_ciede_cuda.c verbatim. init → context_new + lifecycle_init + readback_alloc + feature_name_dict; submit → -ENOSYS (the runtime PR will fill in the live hipStreamWaitEvent + dispatch + event-record + DtoH copy chain; note the intentional bypass of submit_pre_launch, mirroring the CUDA twin's no-memset path); collect → collect_wait + score-emit (T7-10b); close → lifecycle_close + readback_free + dictionary_free + context_destroy. The 16×16 workgroup tile constants (CIEDE_HIP_BX / CIEDE_HIP_BY) are kept verbatim from the CUDA twin so the runtime PR's partials_count math agrees.
Registration: vmaf_fex_ciede_hip is added to feature_extractor_list in core/src/feature/feature_extractor.c under #if HAVE_HIP, immediately after the second consumer. Same posture as ADR-0241 / ADR-0254: registration succeeds, VMAF_FEATURE_EXTRACTOR_HIP flag stays cleared.
Smoke test extension: core/test/test_hip_smoke.c grows one sub-test (test_ciede_hip_extractor_registered).
Meson wiring: core/src/hip/meson.build adds ../feature/hip/ciede_hip.c to hip_sources. No new dependency — the consumer compiles on a stock Ubuntu runner without any AMD packages installed.
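Why the tile constants need to agree can be shown with a small sketch. It assumes the kernel emits one float per 16×16 block and that the grid is sized by ceiling division; SKETCH_CIEDE_BX, SKETCH_CIEDE_BY, and sketch_partials_count are hypothetical names, not the identifiers used in ciede_hip.c.

```c
#include <stddef.h>

/* Hypothetical mirrors of the 16x16 workgroup tile constants. */
#define SKETCH_CIEDE_BX 16u
#define SKETCH_CIEDE_BY 16u

/* Number of per-block partials for a frame, assuming one float per block
 * and a grid that rounds each dimension up to a whole tile. */
size_t sketch_partials_count(unsigned width, unsigned height)
{
    size_t grid_x = (width + SKETCH_CIEDE_BX - 1u) / SKETCH_CIEDE_BX;
    size_t grid_y = (height + SKETCH_CIEDE_BY - 1u) / SKETCH_CIEDE_BY;
    return grid_x * grid_y;
}
```

For a 1920×1080 frame this gives 120 × 68 = 8160 partials. If the host and the kernel disagreed on the tile size, the readback buffer would be sized for the wrong number of blocks.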

What stays on T7-10b

Real hipStreamCreate / hipMemcpyAsync / hipMallocAsync bodies in kernel_template.c.
Per-bpc kernel launch (ciede_kernel_8bpc / ciede_kernel_16bpc HIP twins).
VMAF_FEATURE_EXTRACTOR_HIP flag flip on every consumer.
Picture buffer-type plumbing for HIP-resident frames.
Score emission with the 45 - 20*log10(mean_dE) formula.

The runtime PR will keep this consumer's call graph verbatim and replace every -ENOSYS return with the error code of the live call, mirroring how the CUDA twin handles failures today.

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| `ciede_hip` (chosen) | 243 LOC, single dispatch, intentional `submit_pre_launch` bypass widens validation surface, CUDA twin stable | Pre-runtime, the bypass shape is a paper-only contract | The bypass-shape coverage is exactly the value-add of a third consumer at this stage. |
| `integer_motion_hip` | Cleaner CUDA twin in raw LOC | 503 LOC + stateful single-frame-delay reference plane + ring-buffer; mirroring before runtime forces scaffold gymnastics around state with no observable effect (every submit returns `-ENOSYS`) | Higher complexity / LOC for less validation surface. |
| `float_ansnr_hip` | 298 LOC, similar precision posture to `float_psnr` | Substantively duplicates the second consumer's precision posture (float partials); adds little new validation | Smaller delta vs. ADR-0254. |
| Defer until T7-10b | Avoids one round of paper-only scaffold | One or two consumers don't prove the contract generalises; catching contract drift post-runtime is dramatically more expensive, because every regression then needs a real device to debug | The fork's pattern is "validate scaffolds before runtimes" (Vulkan T5-1 → T5-1b cadence). |

Consequences

Positive: kernel-template's "no-memset bypass" path is now pinned by a smoke-tested consumer. The runtime PR can flip helper bodies without inventing a new template variant for ciede.
Positive: HIP feature_extractor_list[] grows by one row (third row); a caller asking for vmaf --feature ciede_hip now gets "extractor found, runtime not ready (-ENOSYS at init)" instead of "no such extractor".
Neutral: smoke-test sub-test count grows by one (was 15 after ADR-0254, becomes 16 with this PR plus ADR-0260). Still fits the table-driven run_tests() shape ADR-0241 introduced (the readability-function-size 15-branch budget applies to the function's branches, not the table length).
Neutral: T7-10b's surface is unchanged — same six kernel-template helper bodies to flip, plus per-feature kernel launches.
Neutral: bit-exactness is not claimed; the kernels still don't exist. /cross-backend-diff integration lands with T7-10b.
Negative: the PR introduces another file pair pinned at -ENOSYS until T7-10b. If T7-10b slips, the -ENOSYS rows accrue a small maintenance cost (rebasing and reformatting the touched files).
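The difference between "no such extractor" and "extractor found, runtime not ready" can be sketched as a name lookup over a registration table. All types and names below are hypothetical simplifications; the real feature_extractor_list and its element type are not reproduced here.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified extractor record. */
typedef struct SketchExtractor {
    const char *name;
    int (*init)(void);
} SketchExtractor;

/* Pre-runtime init: registered, but not yet functional. */
int sketch_ciede_hip_init(void) { return -ENOSYS; }

/* Hypothetical registration table holding the one row discussed here. */
const SketchExtractor sketch_list[] = {
    { "ciede_hip", sketch_ciede_hip_init },
};

/* Returns the matching row, or NULL when the name was never registered. */
const SketchExtractor *sketch_find(const char *name)
{
    size_t n = sizeof(sketch_list) / sizeof(sketch_list[0]);
    for (size_t i = 0; i < n; i++)
        if (strcmp(sketch_list[i].name, name) == 0)
            return &sketch_list[i];
    return NULL;
}
```

In this sketch a registered name resolves to a row whose init() reports -ENOSYS, while an unregistered name yields NULL, so the caller can tell the two failure modes apart.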

References

ADR-0212 — HIP scaffold-only PR.
ADR-0241 — first kernel-template consumer (integer_psnr_hip).
ADR-0254 — second kernel-template consumer (float_psnr_hip); in flight as PR #324.
ADR-0221 — original CUDA kernel template that ADR-0241 mirrored onto HIP.
ADR-0187 — ciede precision / places=4 empirical floor argument; carries to the HIP twin via the per-block float partials + host double accumulation pattern.
core/src/feature/cuda/integer_ciede_cuda.c — the CUDA reference whose call graph this consumer mirrors.
req — user direction in T7-10b implementation prompt (paraphrased: "Land the third and fourth HIP runtime kernel-template consumers; pick the cleanest CUDA twins").
