Research-0045: Vulkan picture preallocation — option-space digest¶

Date: 2026-05-02
Companion ADR: ADR-0238

Question¶

CUDA + SYCL ship a public picture-preallocation surface (*_preallocate_pictures

*_picture_fetch); Vulkan does not. What's the smallest surface that closes the parity gap without locking the design to a single host-allocator pattern?

Reference surfaces compared¶

Trait	CUDA (`libvmaf_cuda.h`)	SYCL (`libvmaf_sycl.h`)	Vulkan (this digest)
Methods	`NONE`, `HOST`, `HOST_PINNED`, `DEVICE`	`NONE`, `HOST`, `DEVICE`	`NONE`, `HOST`, `DEVICE`
Pool depth	Caller-controlled (via `vmaf_cuda_ring_buffer`)	Compile-time `pic_cnt = 2`	Compile-time `pic_cnt = 2` (mirrors SYCL)
Backing	`cudaMalloc` / `cudaMallocHost` / pinned	`sycl::malloc_device` / `sycl::malloc_host`	VMA `AUTO_PREFER_HOST` VkBuffer / regular host
Plane coverage	All planes	Y plane only	Y plane only (mirrors SYCL)
Buffer-type tag	`VMAF_PICTURE_BUFFER_TYPE_CUDA_*`	`VMAF_PICTURE_BUFFER_TYPE_SYCL_DEVICE`	`VMAF_PICTURE_BUFFER_TYPE_VULKAN_DEVICE` (new)
Lock model	None (single-threaded fetch)	`std::mutex`	`pthread_mutex_t` (no C++ in vulkan/)

The SYCL surface is the cleaner reference. CUDA's HOST_PINNED is a CUDA-allocator-specific concept; VMA's AUTO_PREFER_HOST is the closest analogue but isn't "pinned" in the CUDA sense — exposing it under the same name would invite the wrong mental model.

Decisions implied by the option survey¶

Three methods, not four — NONE / HOST / DEVICE. No HOST_PINNED analogue.
Compile-time pool depth = 2 — matches SYCL exactly. Growing to a configurable depth is additive (mirror ADR-0251 follow-up #3's pattern of growing VmafVulkanConfiguration with an optional unsigned).
Y-plane only DEVICE backing — matches SYCL. Chroma kernel work in this fork still allocates its own buffers per-feature; no Vulkan extractor currently consumes preallocated chroma.
pthread_mutex_t instead of C++ — core/src/vulkan/ is pure C; the SYCL pool's std::mutex doesn't translate. The round-robin counter is the only contended state.
Pool lives on VmafContext, not on VmafVulkanState — matches both CUDA and SYCL. The state is the GPU resource handle; the pool is a per-context resource that's created via vmaf_vulkan_preallocate_pictures and torn down in vmaf_close().

Lifetime / fail-paths surveyed¶

The SYCL pool's allocation unwind on partial failure is the right model: pic_cnt allocations attempted in sequence; on the i-th failure, free pictures [0, i) and the pool struct itself, then return the original error. The new C implementation (picture_vulkan_pool.c::pool_unwind) preserves this contract.

The DEVICE method introduces a bookkeeping wrinkle: VmafVulkanBuffer already maintains a host_ptr → buffer map for the legacy vmaf_vulkan_picture_alloc shim. The pool path bypasses that map by attaching the buffer handle to the picture's priv cookie + release_picture callback — when the pool unrefs a pic, the standard VmafPicture refcount path frees the buffer through VMA directly. No double-free risk since the pool is the sole owner of the cookie.

Test contract¶

Six smoke tests pin the API contract without dispatching GPU work:

NONE is a no-op (returns 0, allocates nothing).
HOST allocates, fetches round-robin, unrefs cleanly.
DEVICE allocates, fetches, the host-mapped pointer is writable, ASan / UBSan don't flag a use-after-free on unref.
Fetch without preallocate falls back to a host-backed picture (mirrors the SYCL fallback contract for callers that ignore the preallocation surface).
Unknown method → -EINVAL.
NULL args → -EINVAL (no crash).

End-to-end scoring against pool-allocated pictures lives in the cross-backend parity gate (unchanged — preallocation only changes where bytes live, not which bytes the kernel reads).

What this digest deliberately defers¶

Pool depth tunable. SYCL has shipped at depth 2 for the full FFmpeg integration without complaint. If a real workload needs more, grow VmafVulkanPictureConfiguration additively.
Chroma plane preallocation. Both SYCL and Vulkan kernels are luma-only today. A future U/V-aware extractor pulls this in.
External-handles parity. The pool uses the imported state's VkInstance/VkDevice via the fork-internal vmaf_vulkan_state_context() accessor; external-handles callers (state_init_external) work transparently with no extra plumbing.
HOST_PINNED-like option. VMA's allocator API has no "pinned in the CUDA sense" mode. If a workload turns up that needs it, the right path is a new method (VMAF_VULKAN_PICTURE_PREALLOCATION_METHOD_HOST_VISIBLE_DEVICE_LOCAL, ReBAR / SAM territory).

References¶

VMA AUTO_PREFER_HOST semantics: https://gpuopen.com/learn/vulkan-renderers-memory-allocation/.
SYCL pool reference: core/src/sycl/picture_sycl.cpp (vmaf_sycl_picture_pool_init).
CUDA preallocation reference: core/include/libvmaf/libvmaf_cuda.h, core/src/libvmaf.c::vmaf_cuda_preallocate_pictures.
Parallel surfaces: ADR-0186 / ADR-0251 (zero-copy import) — not replaced by this PR; preallocation serves the host-driven path.