ADR-0240: GPU backend public-header pattern doc (PR3 of GPU dedup, doc-only)¶

Status: Accepted
Date: 2026-05-02
Deciders: Lusoris
Tags: docs, gpu, agents, fork-local

Context¶

GPU dedup PR3 (after PR2 = ADR-0239 picture-pool extract) was originally scoped as "GPU public header codegen — extract libvmaf_{cuda,sycl,vulkan,hip}.h from one template". A 2026-05-02 audit measured the four headers and found:

~20 of ~200 lines per header truly shared (state init / import_state / state_free lifecycle).
The remaining 90 % is backend-specific feature surface — CUDA's picture preallocation, SYCL's DMABuf/VA/D3D11 import, Vulkan's VkImage zero-copy import + ring depth, HIP's scaffold-only surface.

Codegenning 10 % of each header would add a build-system Python dependency (Jinja2 or hand-rolled emitter) for too little return — the divergence between backends is the API surface, not the lifecycle boilerplate. The dedup-via-pattern approach the tiny-AI extractor template (ADR-0250) uses is a better fit: a recipe lives in docs/, new backends consult it, no build-system additions.

Decision¶

PR3 of the GPU dedup sequence ships two documentation artefacts instead of a codegen pipeline:

docs/development/gpu-backend-template.md — the recipe new GPU backends follow (state lifecycle, optional sections for _available / _list_devices / picture preallocation / hwaccel zero-copy import, Doxygen + ABI conventions).
core/include/libvmaf/AGENTS.md — a public-headers-tree invariant note that points back at the template + pins the rebase-sensitive ordering of the four existing backend headers.

The picture-preallocation section enforces the SYCL/Vulkan NONE / HOST / DEVICE three-method shape (CUDA's CUDA-specific HOST_PINNED is documented as a historical quirk new backends should not replicate). The "implementation MUST delegate to VmafGpuPicturePool" rule (ADR-0239) is restated in the template so future backends can't reimplement the round-robin lifecycle.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
Doc-pattern (chosen)	Mirrors the tiny-AI template precedent (ADR-0250); zero build-system dependency; agent-readable AGENTS guidance	Future backend additions rely on developer discipline (no enforcement)	Picked: 90 % of each header is genuinely backend-specific; codegen is the wrong tool for ~10 % shared shape
Jinja2 codegen at build time	Forces shared lifecycle into one template; new backends are a one-line spec addition	Adds a Python build dependency; obscures the headers (you edit the template, not the file); high friction for a 20-LOC dedup	Rejected: ROI doesn't justify the build-system addition
Pure-Python codegen with checked-in outputs	No build-time Python; same template + a `make regen-headers` target	Two sources of truth — easy to drift; requires a CI gate that re-renders + diffs	Rejected: same return as Jinja2 with extra drift surface
Thin shared header with macro-emit blocks	Pure C; no codegen; `#define X_BACKEND_LIFECYCLE(prefix)` style	C macro stamping for type names is fragile; readability collapses; Doxygen on macros is brittle	Rejected: cure worse than the disease
Skip PR3 entirely; jump to PR4 (kernel host-glue)	Less PR churn	Leaves the user's "headers second" sequence step unhonoured; no agent-readable pattern doc for the next GPU backend addition	Rejected: doc-pattern is small, real, and lands the sequence step

Consequences¶

Positive:
New GPU backends (Metal, DirectML, future ROCm-replacement) have a written recipe to follow — same dedup pattern the tiny-AI extractors use (ADR-0250 / PR #251 → migrated in PR #265).
The four-header rebase invariant note pins which headers upstream syncs are expected to touch (only libvmaf_cuda.h) versus mis-merge signals (any of the other three).
The NONE / HOST / DEVICE three-method picture-preallocation convention is locked as the new-backend default; HOST_PINNED won't propagate beyond CUDA.
Negative:
No machine-enforced rule that future backends consult the template. Discipline-only; relies on PR review + the AGENTS.md pointer firing in agents' contexts.
The dedup return is small in absolute LOC (template doc + AGENTS.md, ~250 lines added to ship the recipe).
Neutral / follow-ups:
PR4 of the dedup sequence (T-GPU-DEDUP-3): GPU feature kernel host-glue extract. 10+ files per backend share state/init/close boilerplate (~250/500 LOC per file). The leverage there is dramatic; this is the highest-LOC dedup of the sequence.
Vulkan picture_vulkan_pool.c migration (T-GPU-DEDUP-2-vulkan): after #264 (Vulkan picture preallocation) lands, rewrite the Vulkan pool as a thin wrapper around the generic pool — parallel to what #266 did for SYCL. Tracked separately because it depends on #264 merging first.

References¶

Source: req 2026-05-02 — popup answer "all three sequenced — pool first, headers second, kernels third"; clarification reply "we only created the boilerplate but we didn't actually migrate" (which led to the audit-then-doc-pattern reframing of PR3 from codegen to recipe).
2026-05-02 GPU dedup audit (in-conversation, agent-produced).
ADR-0239 — PR2 of the dedup sequence (picture-pool extract).
ADR-0250 — the precedent for "recipe doc + shared helpers, not codegen".
docs/development/gpu-backend-template.md — the recipe.
core/include/libvmaf/AGENTS.md — the public-headers-tree invariant note.