Skip to content

ADR-0246: Per-backend GPU kernel scaffolding templates (CUDA + Vulkan)

  • Status: Accepted
  • Date: 2026-04-29
  • Deciders: Lusoris, Claude
  • Tags: gpu, cuda, vulkan, refactor, fork-local

Context

Every fork-added GPU feature kernel re-implements the same lifecycle boilerplate by hand. On CUDA that's a private non-blocking stream + two events + a device accumulator + a pinned host readback slot, with a cuCtxPushCurrent / cuCtxPopCurrent ladder for init and a synchronise + destroy ladder for close. On Vulkan it's a descriptor-set layout + pipeline layout + shader module + compute pipeline + descriptor pool, plus a per-frame command-buffer + fence pair. Both lifecycles are mechanical, but they're hand-rolled in every kernel — a 14-kernel CUDA matrix and a 15-kernel Vulkan matrix means roughly 90 copies of the same six-line stream

  • event scaffold and 30 copies of the same descriptor-pool sizing heuristic.

A sister-agent scope analysis (referenced under "Reproducer / smoke-test command" below) measured the savings: ~6 LOC per CUDA kernel migration, ~30 LOC per Vulkan kernel migration. The wins are real but small per kernel; the larger value is in centralising the partial-init unwind paths (forgetting cuStreamSynchronize before cuStreamDestroy was the root cause of one regression; leaking a VkDescriptorPool on a vkCreateComputePipelines failure was a second).

The same analysis ruled out a cross-backend template: CUDA's async-stream + event model and Vulkan's command-buffer + fence + descriptor-pool model share no concrete shape. A unified abstraction would force a lowest-common-denominator API that captures neither.

Decision

Land per-backend kernel scaffolding templates as header-only inline helpers under core/src/cuda/kernel_template.h and core/src/vulkan/kernel_template.h. The templates are template-only: no existing kernel includes them in this PR. Each future kernel migration ships in its own PR, gated by the existing places=4 cross-backend-diff lane (per ADR-0214) plus the Netflix CPU golden gate.

Alternatives considered

Option Pros Cons Why not chosen
Per-backend templates (this ADR) Honest about the underlying platform; helpers can use the platform's idioms (events vs fences) directly; small, focused diffs Two headers instead of one; a new GPU backend (HIP, Metal) needs its own template Chosen — matches the actual shape of CUDA vs Vulkan code
Single cross-backend template One mental model for "GPU kernel lifecycle"; new backends bolt onto the same shape Lowest-common-denominator API drops async-stream nuance on CUDA and descriptor-pool nuance on Vulkan; helper bodies become switch-on-backend ladders Sister-agent analysis showed the two backends share no concrete shape — the unified API would be a fiction
Macros (CUDA helper-style BEGIN/END pairs) One-line use-sites; matches the existing CHECK_CUDA_GOTO style cuda-gdb / Nsight / RenderDoc step poorly through macro-expanded blocks; type errors land at expansion sites, not call sites Step-debugging GPU code is already hard; the macro form raises the floor unnecessarily
Helper functions (this ADR) Debuggable; type-checked at the call site; the existing CHECK_CUDA_* macros stay where they pay off Inline functions still expand into the caller — no real cost Chosen
Templates + migrate-with-template (one big PR) One PR closes the boilerplate-debt bullet Each migration would gate on its own places=4 cross-backend-diff cell; failures would be hard to localise; PR review surface explodes Each migration is its own PR with a focused gate — see ADR-0214
Templates only, deferred migrations (this ADR) Templates land unused, breaking nothing; each migration is small + gated The boilerplate-debt bullet stays open until follow-up PRs land Chosen — splits risk and review cost

Consequences

  • Positive: Future kernel migrations get a small, focused diff with a central place to encode the lifecycle invariants (stream sync before destroy, descriptor pool sized to frames-in-flight × n_planes, fence freed on partial-init failure). Each migration is independently reviewable and gated.
  • Negative: 706 LOC of helper headers exist with zero callers in this PR — adopters arrive over the follow-up T-rows. New GPU backends (HIP, Metal) will eventually want their own template; the per-backend split is honest but not free.
  • Neutral / follow-ups:
  • T7-XX-followup-a: SYCL kernel-template refactor (deferred — needs icpx host).
  • T7-XX-followup-b: migrate fork-added CUDA kernels to template (integer_psnr_cuda first, then ssimulacra2_cuda).
  • T7-XX-followup-c: migrate fork-added Vulkan kernels to template (psnr_vulkan first, then motion_vulkan / ssim_vulkan / cambi_vulkan).
  • User-facing doc lives at docs/backends/kernel-scaffolding.md.
  • Rebase invariant rows added to core/src/cuda/AGENTS.md and a new core/src/vulkan/AGENTS.md; the templates are fork-local additions that an upstream sync must not silently drop.

References