ADR-0251: Vulkan VkImage import — v2 async pending-fence model (T7-29 part 4)
- Status: Accepted
- Date: 2026-05-01
- Deciders: Lusoris, Claude (Anthropic)
- Tags: vulkan, ffmpeg, fork-local, zero-copy, performance, implementation
Context
ADR-0186 shipped the Vulkan VkImage zero-copy import surface with a deliberately synchronous v1 design: vmaf_vulkan_import_image() records the copy, submits it, and waits on the fence in-call. That ADR's Alternatives considered row called out async pending-fence as the v2 follow-up "once profiling shows the wait is a bottleneck." That signal arrived: lawrence's 2026-04-30 profile of the FFmpeg libvmaf_vulkan filter (Issue #239) confirms that the synchronous fence wait inside vmaf_vulkan_import_image() serialises CPU and GPU work, exactly the bottleneck the parent ADR predicted. The decoder thread idles every other frame because libvmaf will not return until the GPU finishes the luma copy.
This ADR records the v2 design that swaps the in-call fence wait for a per-frame fence ring and a deferred drain in vmaf_vulkan_wait_compute(). The public ABI is preserved — the four entry points keep their signatures, and the fence pool is fully internal to VmafVulkanState.
Decision
We will replace the single fence + single command buffer in VmafVulkanImportSlots with a per-frame ring keyed by frame_index % ring_size. The ring depth is fixed at state init via a configurable max_outstanding_frames parameter (default 4). Init pre-allocates 2 × ring_size staging VkBuffers (one ref and one dis buffer per slot), ring_size VkCommandBuffers, and ring_size VkFences, so there is no runtime allocation on the import hot path. vmaf_vulkan_import_image() records and submits into the slot for frame_index % ring_size and returns immediately; if that slot is still in flight from a prior frame, the call waits on the prior fence first (back-pressure). vmaf_vulkan_wait_compute() blocks on every outstanding fence in submission order and is the natural drain point before vmaf_vulkan_state_build_pictures() reads back the host mappings. vmaf_vulkan_state_free() drains the ring before destroying any handle.
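The slot selection, back-pressure, and drain rules above can be sketched in plain C without any Vulkan calls. This is a minimal model, not the fork's implementation: SketchSlot, SketchRing, and the sketch_* functions are hypothetical names, and a boolean stands in for an unsignalled VkFence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4u /* mirrors the default max_outstanding_frames */

/* Hypothetical stand-in for one ring slot. A real slot would own a
 * VkFence, a VkCommandBuffer, and the ref/dis staging buffers. */
typedef struct {
    bool in_flight;       /* models a submitted, not-yet-waited fence */
    unsigned frame_index; /* frame that last used this slot */
} SketchSlot;

typedef struct {
    SketchSlot slots[RING_SIZE];
    unsigned waits; /* total fence waits performed so far */
} SketchRing;

/* Models waiting on a slot's fence and resetting it. */
static void sketch_wait_slot(SketchRing *ring, SketchSlot *slot)
{
    if (slot->in_flight) {
        slot->in_flight = false;
        ring->waits++;
    }
}

/* Models the import call: select the slot, wait first if it is still in
 * flight from an earlier frame (back-pressure), then "submit" and return
 * without waiting. Returns true when back-pressure was applied. */
static bool sketch_import(SketchRing *ring, unsigned frame_index)
{
    assert(ring != NULL);
    SketchSlot *slot = &ring->slots[frame_index % RING_SIZE];
    bool stalled = slot->in_flight;
    sketch_wait_slot(ring, slot);
    slot->frame_index = frame_index;
    slot->in_flight = true;
    return stalled;
}

/* Models the drain: wait on every outstanding slot, oldest frame first.
 * Returns the number of slots that were drained. */
static unsigned sketch_drain(SketchRing *ring)
{
    assert(ring != NULL);
    unsigned drained = 0;
    for (;;) {
        SketchSlot *oldest = NULL;
        for (size_t i = 0; i < RING_SIZE; i++) {
            SketchSlot *s = &ring->slots[i];
            if (s->in_flight &&
                (oldest == NULL || s->frame_index < oldest->frame_index))
                oldest = s;
        }
        if (oldest == NULL)
            return drained;
        sketch_wait_slot(ring, oldest);
        drained++;
    }
}
```

In this model, frames 0 to 3 submit without waiting, and frame 4 is the first to hit back-pressure because it reuses slot 0.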
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Per-frame fence pool, FIFO ring (chosen) | Bounded memory, no runtime alloc, ABI-stable, matches the canonical Vulkan game-engine pattern | Ring size has to be picked up-front; if max_outstanding_frames < FFmpeg's filter graph depth the back-pressure stalls show up exactly where the v1 wait did | Simplest change that breaks the serial bottleneck without a Vulkan 1.2 hard dependency |
| Single fence with delayed wait (record + submit in import_image, wait in wait_compute) | Minimal diff vs v1 | Only one frame can be in flight at any time: the decoder still blocks once it loops back to record the next frame against the same command buffer; gain over v1 is marginal | Doesn't actually remove the serialisation, only relocates the wait |
| Timeline semaphore with monotonic counter (drop fences, signal a VkSemaphore of type VK_SEMAPHORE_TYPE_TIMELINE, wait on a value) | One synchronisation primitive instead of N fences; matches FFmpeg's hwframes context (AVVkFrame::sem); cleaner host API | Requires VK_KHR_timeline_semaphore (core in 1.2). The fork's pinned api_version is 1.3, so it is present everywhere we run, but the swap touches every kernel TU's submit path and complicates the FFmpeg filter's existing per-frame timeline-semaphore wait (would need a second timeline). Bigger blast radius than the ring | Deferred to v3; revisit when a feature kernel needs a queue family transfer (where timeline semaphores are the only correct primitive) |
| Stay on v1 | Zero new code, matrix unchanged | Profile signal (Issue #239) is direct evidence the wait dominates the FFmpeg filter wall-clock; staying on v1 means accepting that bottleneck indefinitely | The whole reason v1 existed was "we'll fix it when we have data." The data is in. |
Consequences
- Positive: Decoder/copy/compute can overlap up to max_outstanding_frames deep. For FFmpeg's typical 2–3 frame buffering, the default ring_size = 4 keeps the libvmaf filter off the critical path until the back-pressure budget is exceeded. ABI is preserved (the ring is fully internal to VmafVulkanState); the FFmpeg patch in ffmpeg-patches/0006-libvmaf-add-libvmaf-vulkan-filter.patch needs no signature change.
- Negative: Staging-buffer memory grows from 1 × to max_outstanding_frames × per direction (ref + dis), so the default quadruples the allocation footprint vs v1, from 2 × stride × h to 8 × stride × h. For a 1080p 8-bit Y plane that is ~16 MB host-visible per state, well below any practical memory budget but worth noting. The unit test matrix doubles: every existing v1 contract test (NULL state, wrong geometry, unimported index) is replicated for index > ring_size to verify ring wrap, plus new tests for fence-pool init/teardown ordering and the wait_compute drain.
- Neutral / follow-ups:
  - Cross-backend gate (scripts/ci/cross_backend_parity_gate.py) keeps places=4 as the v2 contract. Async submission does not change which bytes the staging buffer receives, only when the host can read them.
  - Measurement gate to flip Status → Accepted: v2 wall-clock ≤ 0.7 × v1 on the Netflix normal pair under the FFmpeg libvmaf_vulkan filter (PR-235 lavapipe lane). If the lavapipe ICD's single-threaded software submit model masks the gain (likely, since lavapipe has no real queue concurrency), document that and re-gate against a hardware Arc / RTX / RX run before flipping Accepted.
  - Ring-size tuning (landed): VmafVulkanConfiguration.max_outstanding_frames is now a public field. 0 selects the canonical default (4); values clamp to [1, VMAF_VULKAN_RING_MAX] internally. The observable readback is vmaf_vulkan_state_max_outstanding_frames(). External-handles callers (vmaf_vulkan_state_init_external) still receive the default; extending VmafVulkanExternalHandles is deferred to a separate ABI bump. Smoke-test contract pinned in core/test/test_vulkan_async_pending_fence.c (test_ring_size_* group).
  - Timeline semaphore v3: tracked under T7-29 part 5 once a feature kernel actually needs the cross-queue-family transfer property timeline semaphores buy us. Fence ring is sufficient for the single-queue-family compute path v2 ships against.
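The staging-buffer arithmetic in the Negative item can be checked with a short C sketch. It assumes a tightly packed 8-bit luma plane (stride equal to the 1920-pixel width); a real allocation may pad the stride, so treat the figures as a lower bound. sketch_staging_bytes is a hypothetical helper, not part of libvmaf.

```c
#include <assert.h>
#include <stdint.h>

/* Host-visible staging bytes for one state: ref + dis (factor 2), one
 * buffer per ring slot, each holding stride * height bytes of luma.
 * Hypothetical helper for illustration only. */
static uint64_t sketch_staging_bytes(uint32_t ring_size, uint32_t stride,
                                     uint32_t height)
{
    assert(ring_size > 0);
    /* Promote to 64-bit before multiplying so large frames cannot
     * overflow 32-bit arithmetic. */
    return 2u * (uint64_t)ring_size * stride * height;
}
```

With stride = 1920 and height = 1080, a ring of 1 (the v1 layout) needs 4,147,200 bytes and the default ring of 4 needs 16,588,800 bytes, which matches the ~16 MB figure and is four times the v1 footprint.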
References
- Parent: ADR-0186, which declares the v2 async-pending-fence follow-up as the deferred path-3 row of its Alternatives considered.
- Grandparent: ADR-0184, which pinned the public ABI surface that v2 preserves.
- Profile signal: Issue #239, the FFmpeg filter wall-clock serialisation report (lawrence, 2026-04-30).
- Pattern source: Vulkan ring-fence is the canonical "frames in flight" pattern from Khronos synchronization examples.
- FFmpeg filter coupling: CLAUDE.md §12 r14. Every libvmaf surface change ships the matching patch in ffmpeg-patches/0006-libvmaf-add-libvmaf-vulkan-filter.patch. v2 keeps the public signatures byte-identical, so this PR does not modify the patch.
- Source: T7-29 part 4 in .workingdir2/BACKLOG.md.
- Per-PR rule: ADR-0108 deep-dive deliverables checklist.
Status update 2026-05-08: Accepted
Audited as part of the 2026-05-08 ADR Proposed sweep (Research-0086).
Acceptance criteria verified in tree at HEAD 0a8b539e:
- core/include/libvmaf/libvmaf_vulkan.h:64 declares VmafVulkanConfiguration::max_outstanding_frames.
- core/src/vulkan/common.c:444-486 implements vmaf_vulkan_clamp_ring_size, vmaf_vulkan_state_max_outstanding_frames, and the per-frame fence ring sized at s->requested_ring_size.
- core/src/vulkan/vulkan_internal.h:47-117 documents the captured request depth and the ring sizer contract.
- Verification command: grep -n "max_outstanding_frames" core/src/vulkan/*.{c,h} core/include/libvmaf/libvmaf_vulkan.h.
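For readers without the tree at hand, the ring sizer contract described in this ADR (0 selects the default of 4; other values clamp to [1, VMAF_VULKAN_RING_MAX]) might look roughly like the sketch below. The function name and the upper bound of 8 are assumptions made for illustration; the real vmaf_vulkan_clamp_ring_size signature and the real VMAF_VULKAN_RING_MAX value are not stated in this document.

```c
#include <assert.h>

#define SKETCH_RING_DEFAULT 4u /* default stated in this ADR */
#define SKETCH_RING_MAX     8u /* assumed value; the real limit is not given here */

/* Hypothetical ring sizer: maps a requested depth to the depth the ring
 * would actually use. */
static unsigned sketch_clamp_ring_size(unsigned requested)
{
    unsigned depth = requested;
    if (depth == 0u)
        depth = SKETCH_RING_DEFAULT; /* 0 means "use the default" */
    if (depth > SKETCH_RING_MAX)
        depth = SKETCH_RING_MAX;     /* cap oversized requests */
    /* With an unsigned input, the lower bound of 1 holds automatically
     * once 0 has been mapped to the default. */
    assert(depth >= 1u && depth <= SKETCH_RING_MAX);
    return depth;
}
```

Under these assumptions a request of 0 yields 4, a request of 3 is kept as 3, and a request of 1000 is capped at 8.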