ADR-0259: HIP third-consumer kernel — ciede_hip via mirrored kernel-template¶
- Status: Accepted
- Date: 2026-05-03
- Deciders: Lusoris, Claude (Anthropic)
- Tags: gpu, hip, rocm, amd, kernel-template, fork-local
Context¶
ADR-0212 shipped the HIP backend as a build-only scaffold. ADR-0241 added the first kernel-template consumer (integer_psnr_hip). PR #324 / ADR-0254 is in flight as the second consumer (float_psnr_hip). The runtime PR (T7-10b) is still pending; until it lands, every kernel-template helper body returns -ENOSYS so consumer init() calls surface that error verbatim.
This ADR ships the third consumer — ciede_hip — to widen the kernel-template's pre-runtime validation surface with a feature whose submit path intentionally bypasses the template's submit_pre_launch helper (the ciede CUDA twin inlines the wait because its kernel writes one float per block — no atomic, no memset required). Pinning that bypass shape pre-runtime keeps the runtime-PR's diff small: T7-10b flips helper bodies, but does not have to invent a new "no-memset" template variant.
integer_ciede_cuda.c (243 LOC) is the cleanest CUDA twin to mirror after the two already-claimed PSNR twins: single dispatch per frame, single readback (per-block float partials), no inter-frame state, no fork-specific tweaks. Its precision posture (per-block float partials plus host double accumulation, per ADR-0187) sits between integer_psnr_hip's int64 SSE and float_psnr_hip's float partials.
Decision¶
Land ciede_hip as the third kernel-template consumer; runtime body still deferred¶
The PR ships:
core/src/feature/hip/ciede_hip.{c,h}— mirrorscore/src/feature/cuda/integer_ciede_cuda.c's call graph verbatim:init → context_new + lifecycle_init + readback_alloc + feature_name_dict,submit → -ENOSYS(the runtime PR will fill in the livehipStreamWaitEvent+ dispatch + event-record + DtoH copy chain — note the intentional bypass ofsubmit_pre_launch, mirroring the CUDA twin's no-memset path),collect → collect_wait + score-emit (T7-10b),close → lifecycle_close + readback_free + dictionary_free + context_destroy. The 16×16 workgroup tile constants (CIEDE_HIP_BX/CIEDE_HIP_BY) are kept verbatim from the CUDA twin so the runtime PR'spartials_countmath agrees.- Registration:
vmaf_fex_ciede_hipis added tofeature_extractor_listincore/src/feature/feature_extractor.cunder#if HAVE_HIP, immediately after the second consumer. Same posture as ADR-0241 / ADR-0254: registration succeeds,VMAF_FEATURE_EXTRACTOR_HIPflag stays cleared. - Smoke test extension:
core/test/test_hip_smoke.cgrows one sub-test (test_ciede_hip_extractor_registered). - Meson wiring:
core/src/hip/meson.buildadds../feature/hip/ciede_hip.ctohip_sources. No new dependency — the consumer compiles on a stock Ubuntu runner without any AMD packages installed.
What stays on T7-10b¶
- Real
hipStreamCreate/hipMemcpyAsync/hipMallocAsyncbodies inkernel_template.c. - Per-bpc kernel launch (
ciede_kernel_8bpc/ciede_kernel_16bpcHIP twins). VMAF_FEATURE_EXTRACTOR_HIPflag flip on every consumer.- Picture buffer-type plumbing for HIP-resident frames.
- Score emission with the
45 - 20*log10(mean_dE)formula.
The runtime PR will keep this consumer's call-graph verbatim and flip every -ENOSYS to a live error code, mirroring how the CUDA twin handles failures today.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
ciede_hip (chosen) | 243 LOC, single dispatch, intentional submit_pre_launch bypass widens validation surface, CUDA twin stable | Pre-runtime, the bypass shape is a paper-only contract | The bypass-shape coverage is exactly the value-add of a third consumer at this stage. |
integer_motion_hip | Cleaner CUDA twin in raw LOC | 503 LOC + stateful single-frame-delay reference plane + ring-buffer; mirroring before runtime forces scaffold gymnastics around state with no observable effect (every submit returns -ENOSYS) | Higher complexity / LOC for less validation surface. |
float_ansnr_hip | 298 LOC, similar precision posture to float_psnr | Substantively duplicates the second consumer's precision posture (float partials) — adds little new validation | Smaller delta vs. ADR-0254. |
| Defer until T7-10b | Avoids one round of paper-only scaffold | A single (or two) consumer doesn't prove the contract generalises; catching contract drift post-runtime is dramatically more expensive — every regression then needs a real device to debug | The fork's pattern is "validate scaffolds before runtimes" (Vulkan T5-1 → T5-1b cadence). |
Consequences¶
- Positive: kernel-template's "no-memset bypass" path is now pinned by a smoke-tested consumer. The runtime PR can flip helper bodies without inventing a new template variant for ciede.
- Positive: HIP
feature_extractor_list[]grows by one row (third row); a caller asking forvmaf --feature ciede_hipnow gets "extractor found, runtime not ready (-ENOSYS at init)" instead of "no such extractor". - Neutral: smoke-test sub-test count grows by one (was 15 after ADR-0254, becomes 16 with this PR plus ADR-0260). Still fits the table-driven
run_tests()shape ADR-0241 introduced (thereadability-function-size15-branch budget applies to the function's branches, not the table length). - Neutral: T7-10b's surface is unchanged — same six kernel-template helper bodies to flip, plus per-feature kernel launches.
- Neutral: bit-exactness is not claimed; the kernels still don't exist.
/cross-backend-diffintegration lands with T7-10b. - Negative: the PR introduces another file pair pinned at
-ENOSYSuntil T7-10b. If T7-10b slips, the-ENOSYSrows accrue small maintenance cost (re-rebase, re-format on touched files).
References¶
- ADR-0212 — HIP scaffold-only PR.
- ADR-0241 — first kernel-template consumer (
integer_psnr_hip). - ADR-0254 — second kernel-template consumer (
float_psnr_hip); in flight as PR #324. - ADR-0221 — original CUDA kernel template that ADR-0241 mirrored onto HIP.
- ADR-0187 — ciede precision /
places=4empirical floor argument; carries to the HIP twin via the per-block float partials + host double accumulation pattern. core/src/feature/cuda/integer_ciede_cuda.c— the CUDA reference whose call graph this consumer mirrors.req— user direction in T7-10b implementation prompt (paraphrased: "Land the third and fourth HIP runtime kernel-template consumers; pick the cleanest CUDA twins").