Skip to content

ADR-0238: Vulkan VmafPicture preallocation surface (API parity with CUDA / SYCL)

  • Status: Accepted
  • Date: 2026-05-02
  • Deciders: Lusoris
  • Tags: vulkan, api, preallocation, fork-local, parity

Context

The CUDA backend ships vmaf_cuda_preallocate_pictures / vmaf_cuda_fetch_preallocated_picture (with HOST / HOST_PINNED / DEVICE methods); the SYCL backend ships vmaf_sycl_preallocate_pictures / vmaf_sycl_picture_fetch (HOST / DEVICE). Vulkan ships nothing — docs/api/gpu.md flagged this as a known limitation:

Per-feature picture preallocation API is not yet exposed — Vulkan kernels allocate device-side VkBuffers internally per feature. The CUDA/SYCL-style VmafVulkanPicturePreallocationMethod surface is a follow-up.

The gap blocks two real use cases:

  1. Driver-style integration: a caller that owns its decode loop wants to write source frames directly into the buffers libvmaf's compute kernels will read, removing the host → device staging copy. Without preallocation, every frame round-trips through the per-extractor internal buffers.
  2. Memory budgeting: in a long-running pipeline (FFmpeg per-shot tooling, encoder-side ROI scoring), pinning the picture pool depth up front lets operators reason about the resident set instead of tracking per-feature internal allocators.

Closing the parity gap is purely additive — the existing import-image zero-copy path (ADR-0186 / ADR-0251) remains the right answer for external-VkImage callers; preallocation serves the host-driven path.

Decision

We will ship VmafVulkanPicturePreallocationMethod (NONE / HOST / DEVICE) and the matching vmaf_vulkan_preallocate_pictures + vmaf_vulkan_picture_fetch entry points, modelled on the SYCL surface (the SYCL pool is the simpler reference; CUDA's HOST_PINNED option maps to a CUDA-only allocator and has no Vulkan analogue).

HOST allocates pictures via the regular vmaf_picture_alloc — useful when the score consumer is CPU-side. DEVICE backs each picture's luma plane with a host-visible Vulkan buffer (VMA AUTO_PREFER_HOST); the persistent mapped pointer is exposed as pic->data[0], so the caller writes once and the kernel descriptor sets read the same memory. A new VMAF_PICTURE_BUFFER_TYPE_VULKAN_DEVICE tags these pictures so cross-backend extractors can refuse mixed backings.

Pool depth is fixed at the canonical frames-in-flight = 2 (matches SYCL's compile-time constant); the depth is not part of the public surface in this PR. ADR-0251's max_outstanding_frames knob is unrelated — that controls the import-image fence ring, not the preallocation pool. A separate follow-up can grow the configuration struct if a real workload needs depth > 2.

Alternatives considered

Option Pros Cons Why not chosen
Mirror SYCL surface (chosen) Shortest review path; SYCL pool is already battle-tested in the FFmpeg integration; same enum names + same pool-depth contract Inherits SYCL's "luma-only" pool (chroma allocation deferred) Picked: API parity is the goal; the chroma deferral is consistent across SYCL + Vulkan and tracked separately
Mirror CUDA surface (HOST / HOST_PINNED / DEVICE) Three methods feels more complete; CUDA users see familiar names HOST_PINNED has no Vulkan analogue (VMA AUTO_PREFER_HOST is the closest, but it's not "pinned" in the CUDA sense) Rejected: would ship a method we'd then have to caveat as "treated like HOST"
Skip the public surface; expose a fork-internal helper Smaller blast radius; less ABI commitment Doesn't close the parity gap noted in docs/api/gpu.md; FFmpeg integrators still bounce off the missing surface Rejected: the gap is explicit and visible
Bake preallocation into vmaf_vulkan_state_init (auto-pool-on-import) One fewer call site for callers Forces every Vulkan caller to pay the pool cost; breaks the import-image zero-copy path which deliberately doesn't pre-allocate Rejected: opt-in is the right ergonomics
Add a pic_cnt parameter to the configuration struct in this PR Configurable from day one Premature; SYCL's hard-coded 2 has shipped without complaint; growing the struct later is additive Rejected: ship the minimal surface, grow it if a workload demands depth > 2

Consequences

  • Positive:
  • Closes the CUDA / SYCL / Vulkan API parity gap for picture preallocation, removing the explicit "follow-up" caveat from docs/api/gpu.md.
  • Lets FFmpeg-side host-driven integrations (the non-zero-copy path) write source frames directly into the buffers libvmaf compute kernels read, removing one host → device staging copy.
  • The new VMAF_PICTURE_BUFFER_TYPE_VULKAN_DEVICE tag enables cross-backend extractors to refuse mixed-backing inputs in future work.
  • Pool tear-down lives in a single hook in vmaf_close(), matching the SYCL pattern; no per-extractor cleanup needed.
  • Negative:
  • One additional translation unit (core/src/vulkan/picture_vulkan_pool.c, ~180 LOC).
  • Public ABI grows two entry points + one enum + one struct. Backwards compatibility is preserved (additive only).
  • The pool currently allocates only the Y plane — same as SYCL. Chroma-aware extractors that want pre-allocated U/V planes (none of the current Vulkan kernels do) need a follow-up.
  • Neutral / follow-ups:
  • Pool depth tunable. The current 2 matches SYCL's compile-time constant; if a workload needs depth > 2, grow the VmafVulkanPictureConfiguration struct additively (mirroring ADR-0251 follow-up #3's pattern).
  • Chroma plane preallocation. Out of scope here; tracked as a separate item once a Vulkan kernel actually consumes pre-allocated chroma.
  • External-handles parity. The pool currently uses the imported state's VkInstance / VkDevice via the fork-internal vmaf_vulkan_state_context() accessor. External callers (vmaf_vulkan_state_init_external) work transparently; no additional plumbing required.
  • Cross-backend parity gate stays unchanged at places=4 — preallocation only changes where the buffer lives, not which bytes the kernel reads.

FFmpeg-side adoption (deferred, by design)

CLAUDE.md §12 r14 requires ffmpeg-patches/ updates when libvmaf adds an entry point that the patches consume. None of the in-tree patches (0002 cuda / 0003 sycl / 0004 vulkan-selector / 0006 libvmaf_vulkan) call any *_preallocate_pictures / *_picture_fetch entry point today — neither for CUDA nor for SYCL. Those filters go through the regular per-frame vmaf_picture_alloc host path. Switching them to consume the preallocation pool is a separate optimization with two distinct angles:

  • vf_libvmaf_vulkan (zero-copy import path, patch 0006): this filter receives external VkImages via vmaf_vulkan_import_image and never allocates VmafPictures itself. Preallocation is orthogonal — no change needed, ever.
  • vf_libvmaf host path with vulkan_device=N: currently allocates host-backed pictures per frame. Switching to vmaf_vulkan_preallocate_pictures(DEVICE) would let FFmpeg's frame data land directly in the kernel-read buffer. Real perf win, but it's a separate PR with its own benchmark gate (mirrors the pattern of ADR-0251's measurement-gate flip). For symmetry with SYCL — which also ships preallocation but whose FFmpeg filter doesn't consume it yet — we deliberately keep this PR's scope at "API parity surface" rather than "FFmpeg integration optimization". When the FFmpeg adoption PR lands, it's additionally gated on demonstrating no host → device staging copy in perf record traces against an unchanged Netflix golden score.

ffmpeg-patches/series.txt is unchanged by this PR; the apply --check gate against the pinned n8.1 baseline passes unmodified.

References

Status update 2026-05-08: Accepted

Audited as part of the 2026-05-08 ADR Proposed sweep (Research-0086).

Acceptance criteria verified in tree at HEAD 0a8b539e:

  • core/include/libvmaf/libvmaf_vulkan.h declares VmafVulkanPicturePreallocationMethod (line 153), VmafVulkanPictureConfiguration::pic_prealloc_method (line 165), vmaf_vulkan_preallocate_pictures (line 180), vmaf_vulkan_picture_fetch (line 191).
  • Verification command: grep -nE "VmafVulkanPicturePreallocationMethod|vmaf_vulkan_preallocate_pictures|vmaf_vulkan_picture_fetch" core/include/libvmaf/libvmaf_vulkan.h.