ADR-0186: Vulkan VkImage import + filter (T7-29 parts 2 + 3)
- Status: Accepted
- Date: 2026-04-27
- Deciders: Lusoris, Claude (Anthropic)
- Tags: vulkan, ffmpeg, fork-local, zero-copy, implementation
Context
ADR-0184 shipped the public surface for the Vulkan VkImage import path as -ENOSYS-returning stubs:
- vmaf_vulkan_import_image()
- vmaf_vulkan_wait_compute()
- vmaf_vulkan_read_imported_pictures()
The signatures landed (PR #128) so downstream consumers — the future libvmaf_vulkan FFmpeg filter (T7-29 part 3) and any direct C-API callers — could compile against the contract. This ADR records the design decisions for the actual implementation now that we are dropping the stubs.
The implementation is needed before T7-29 part 3 (the FFmpeg filter) can land; the filter code is otherwise untestable.
Decision
Implement the three import entry points with a synchronous v1 design plus a documented v2 follow-up. Add a fourth entry point — vmaf_vulkan_state_init_external — so the FFmpeg filter can run libvmaf compute on the decoder's VkDevice (source VkImage handles are device-bound; cross-device import would require dmabuf export/import plumbing that is out of scope for v1). Bundle the FFmpeg filter (ffmpeg-patches/0006-libvmaf-add-libvmaf-vulkan-filter.patch) in the same PR — see "FFmpeg patch coupling" below.
Per-state staging buffers
VmafVulkanState gains a struct VmafVulkanImportSlots field holding one ref + one dis staging VkBuffer (HOST_VISIBLE | HOST_COHERENT, allocated via VMA), reused across frames. The buffers are sized to match the DATA_ALIGN-rounded stride that vmaf_picture_alloc would produce — so the pixel data can be handed straight to vmaf_read_pictures without an additional host memcpy on the libvmaf side.
Geometry (w, h, bpc) is pinned by the first vmaf_vulkan_import_image() call. Subsequent calls must match or return -EINVAL — same contract as the SYCL vmaf_sycl_init_frame_buffers() model. Lazy allocation avoids needing a separate init entry point in the public surface.
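The stride rounding and the pin-on-first-call rule can be sketched as below. This is a minimal illustration, not the code in import.c: the struct and function names are hypothetical, and the value 32 for DATA_ALIGN is an assumption about what the picture allocator uses.

```c
#include <errno.h>
#include <stddef.h>

/* Assumption: the picture allocator rounds each row up to 32 pixels. */
#define DATA_ALIGN 32u

/* Hypothetical stand-in for the geometry part of VmafVulkanImportSlots. */
struct import_geometry {
    unsigned w, h, bpc;
    int pinned; /* 0 until the first import call has set the geometry */
};

/* Row stride in bytes: width rounded up to DATA_ALIGN pixels,
 * doubled when samples are wider than 8 bits. */
static size_t luma_stride(unsigned w, unsigned bpc)
{
    size_t aligned_w =
        ((size_t)w + DATA_ALIGN - 1u) & ~(size_t)(DATA_ALIGN - 1u);
    return aligned_w << (bpc > 8u);
}

/* Size in bytes of one luma staging buffer. */
static size_t luma_buffer_size(unsigned w, unsigned h, unsigned bpc)
{
    return luma_stride(w, bpc) * h;
}

/* First call pins the geometry; later calls must match or get -EINVAL. */
static int pin_or_check_geometry(struct import_geometry *g,
                                 unsigned w, unsigned h, unsigned bpc)
{
    if (!g)
        return -EINVAL;
    if (!g->pinned) {
        g->w = w;
        g->h = h;
        g->bpc = bpc;
        g->pinned = 1;
        return 0;
    }
    return (g->w == w && g->h == h && g->bpc == bpc) ? 0 : -EINVAL;
}
```

Under these assumptions a 1920x1080 8-bit frame has a stride of 1920 bytes and a luma buffer of 2,073,600 bytes, and a second call with 1280x720 on the same state is rejected.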
Synchronous copy path (v1)
Inside vmaf_vulkan_import_image() we:
- Lazy-allocate the staging buffers + a reusable command buffer + a fence on first call.
- Record the command buffer:
  - vkCmdPipelineBarrier: caller's vk_layout → TRANSFER_SRC_OPTIMAL (we do not transition back; AVVkFrame discardable semantics).
  - vkCmdCopyImageToBuffer for the Y plane only.
- Submit with the caller's timeline semaphore as pWaitSemaphores (or skip the wait when vk_semaphore == 0, e.g. for the smoke test).
- Wait on the fence in-call before returning.
vmaf_vulkan_wait_compute() is therefore a no-op on this path — the work has already drained. The function is kept in the surface so the v2 async-pending-fence model can drop in without an ABI change.
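The control flow of the synchronous path, and why the wait call has nothing left to do, can be modelled without any Vulkan calls. The sketch below is only a bookkeeping model: the struct, the function names, the -EINVAL rejection codes, and the -EBUSY branch are assumptions made for illustration, and the real recording and submission steps are reduced to comments.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the per-state import bookkeeping. */
struct import_model {
    int      resources_ready; /* staging buffers, command buffer, fence exist */
    unsigned alloc_count;     /* how many times lazy allocation ran */
    unsigned in_flight;       /* submissions whose fence was not yet waited */
};

/* Models the import call: allocate once, record, submit, drain. */
static int model_import_image(struct import_model *s, uint64_t vk_image,
                              uint64_t vk_semaphore)
{
    if (!s || vk_image == 0)
        return -EINVAL;

    if (!s->resources_ready) { /* lazy allocation, first call only */
        s->alloc_count++;
        s->resources_ready = 1;
    }

    /* Recording of the barrier and the luma copy is omitted here. */

    /* Submit; a zero semaphore means "no wait", as in the smoke test. */
    (void)vk_semaphore;
    s->in_flight++;

    /* Wait on the fence before returning, so nothing stays in flight. */
    s->in_flight--;
    return 0;
}

/* Models the wait call: on the v1 path nothing is ever pending. */
static int model_wait_compute(const struct import_model *s)
{
    if (!s)
        return -EINVAL;
    return s->in_flight == 0 ? 0 : -EBUSY; /* -EBUSY is unreachable in v1 */
}
```

In this model two consecutive imports allocate exactly once and leave `in_flight` at zero, so the wait call returns 0 immediately. A v2 design would move the fence wait out of the import call and into the wait call without changing either signature.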
vmaf_vulkan_read_imported_pictures(ctx, index) (in libvmaf.c under HAVE_VULKAN) wraps the staging buffers' host pointers into proper VmafPicture handles via a builder in import.c, attaches a no-op release callback (the buffers are owned by the state, not the picture pool), and routes through the standard vmaf_read_pictures() pipeline.
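The ownership rule behind the no-op release callback can be shown with a miniature ref-counted picture. The types and functions here are hypothetical and much simpler than libvmaf's VmafPicture and VmafRef; the sketch only illustrates that dropping the last reference must not free memory the state still owns, and that a picture without a ref is rejected on unref.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical miniature of a ref-counted picture; not libvmaf's types. */
struct mini_picture {
    void *data;     /* borrowed pointer into a state-owned staging buffer */
    int  *refcount; /* NULL means no ref is attached */
    int (*release)(struct mini_picture *pic, void *cookie);
    void *cookie;
};

/* No-op release: the staging buffer belongs to the state, so nothing is freed. */
static int release_noop(struct mini_picture *pic, void *cookie)
{
    (void)pic;
    (void)cookie;
    return 0;
}

/* Wrap a state-owned buffer; the caller supplies storage for the counter. */
static int wrap_staging(struct mini_picture *pic, void *host_ptr, int *counter)
{
    if (!pic || !host_ptr || !counter)
        return -EINVAL;
    *counter = 1;
    pic->data = host_ptr;
    pic->refcount = counter;
    pic->release = release_noop;
    pic->cookie = NULL;
    return 0;
}

/* Drop one reference; a picture without a ref is rejected with -EINVAL. */
static int mini_unref(struct mini_picture *pic)
{
    if (!pic || !pic->refcount)
        return -EINVAL;
    int err = 0;
    if (--*pic->refcount == 0 && pic->release)
        err = pic->release(pic, pic->cookie);
    /* The handle is cleared, but the underlying buffer is left alone. */
    pic->data = NULL;
    pic->refcount = NULL;
    return err;
}
```

After `mini_unref` the picture handle is empty but the wrapped buffer is still valid, which is the property that lets the same staging buffers be reused on the next frame.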
Why YUV400P (luma-only)
The first iteration emits luma-only VmafPicture (pix_fmt = VMAF_PIX_FMT_YUV400P). Every fork-added Vulkan extractor shipped to date — psnr, vif, motion, adm, moment — is luma-only, so chroma planes are never read. Adding chroma support is a mechanical extension when the first chroma-aware extractor arrives.
File split
- core/src/vulkan/import.c (new, ~310 LOC): the buffer lifecycle, command-buffer recording, fence wait, and the vmaf_vulkan_state_build_pictures() builder.
- core/src/vulkan/import_picture.h (new): exposes the builder so libvmaf.c can include it without inheriting <volk.h>.
- core/src/vulkan/vulkan_internal.h: gains VmafVulkanImportSlots, owns_handles, and promotes VmafVulkanState from common.c so both files can see the slot layout.
- core/src/vulkan/common.c: adds vmaf_vulkan_state_init_external + the matching internal vmaf_vulkan_context_new_external that adopts caller-supplied handles, skipping vkCreate{Instance,Device}.
- core/src/libvmaf.c: implements vmaf_vulkan_read_imported_pictures() next to the existing vmaf_vulkan_import_state().
- ffmpeg-patches/0006-libvmaf-add-libvmaf-vulkan-filter.patch (new, ~280 LOC of additions to FFmpeg n8.1): the libvmaf_vulkan filter consuming AV_PIX_FMT_VULKAN, pulling AVVkFrame * from data[0], calling vmaf_vulkan_state_init_external with the device's compute queue, then import_image + read_imported_pictures per frame. Mirrors 0005-libvmaf-add-libvmaf-sycl-filter.patch.
FFmpeg patch coupling (new fork rule)
Bundling parts 2 + 3 surfaces a recurring failure mode: the fork ships its FFmpeg integration as a stack of patches against n8.1, and any libvmaf-side surface change probed by those patches breaks the next rebase silently. This PR adds rule §12 r14 to CLAUDE.md (and the AGENTS.md mirror): every PR that touches a libvmaf public surface used by ffmpeg-patches/ updates the relevant patch in the same PR — pure libvmaf-internal refactors, doc-only, and test-only PRs are exempt. Reviewers verify with for p in ffmpeg-patches/000*-*.patch; do git -C ffmpeg-8 apply --check "$p"; done against the pinned n8.1 baseline.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Async pending-fence v2: record + submit in import_image, return immediately, track the fence; wait_compute blocks on the fence | True overlap of decode/copy; lower latency on the fast path | Requires a per-frame fence pool, dual-buffering for outstanding submits, careful interaction with state lifecycle; doubles the test matrix | v1 is enough to unlock T7-29 part 3 (FFmpeg filter) which already serializes per-frame; v2 is a follow-up once profiling shows the wait is a bottleneck |
| Kernels read VkImage directly via VkSampler / storage-image bindings | True zero-copy on the GPU side | Requires refactoring every Vulkan extractor (psnr/vif/motion/adm/moment) to support both VmafVulkanBuffer and VkImage inputs; ~3-5x the LOC of v1 | Out of scope for T7-29 part 2; revisit after part 3 ships and the FFmpeg-side workflow is real |
| Build a fake VmafPicture without a VmafRef | Avoids the ref-init / no-op-release-callback dance | vmaf_read_pictures() always calls vmaf_picture_unref on cleanup, which returns -EINVAL on pic->ref == NULL and propagates that as a non-zero return from vmaf_read_pictures | Following the existing release-callback contract is cleaner; the per-frame overhead is a single vmaf_ref_init + decrement |
| Allocate via vmaf_picture_alloc + memcpy from staging into the alloc'd picture | No release-callback wiring | Adds one host memcpy per plane per frame on top of the existing extractor upload_plane memcpy | The release-callback approach is only ~25 LOC and avoids the extra memcpy |
| Defer to T7-29 part 3 — implement everything inside the FFmpeg filter without a libvmaf-side surface | Smaller libvmaf footprint | The filter ends up reaching into Vulkan handle internals to do vkCmdCopyImageToBuffer against an internal VmafVulkanBuffer — leaks abstraction and duplicates logic for any direct C-API caller | The C-API surface is the contract; the FFmpeg filter is one consumer |
Consequences
- Positive: All three import entry points return 0 on the success path. Geometry validation matches SYCL's init_frame_buffers contract. The staging-buffer reuse (one allocation per state, not per frame) keeps the per-frame cost to one vkCmdCopyImageToBuffer + one fence wait. Header purity from ADR-0184 is preserved (no <volk.h> leaks into libvmaf.c).
- Negative: v1 is synchronous, so every vmaf_vulkan_import_image() call blocks the caller until the GPU finishes the copy. For a 1080p 8-bit Y plane this is sub-millisecond (~2 MB at >5 GB/s PCIe), but it precludes decode/copy overlap. Documented; v2 follow-up addresses it.
- Neutral / follow-ups:
  - T7-29 part 3 (S): package the FFmpeg-side libvmaf_vulkan filter as ffmpeg-patches/0006-libvmaf-add-libvmaf-vulkan-filter.patch. Now possible because the API works.
  - v2 async pending-fence model (deferred): once part 3 ships and is exercised, profile to confirm the synchronous wait is the bottleneck before refactoring.
  - Chroma support (deferred): extend the staging-buffer pair to ref/dis × Y/U/V (or a single plane-stride array) when the first chroma-aware Vulkan extractor lands.
  - Validation layer integration (deferred): VmafVulkanConfiguration.enable_validation is still a no-op; the field reservation lands in T5-1c.
Verification
End-to-end GPU-plumbing validation lives downstream in T7-29 part 3 (the FFmpeg filter): the natural test is ffmpeg -hwaccel vulkan ... -vf libvmaf_vulkan and verifying the score matches the CPU-path baseline at places=4. For this PR, validation is contract-level:
- 10/10 unit tests in core/test/test_vulkan_smoke.c cover: NULL-state rejection, vk_image == 0 rejection, wait_compute on an idle state returns 0, read_imported_pictures on a NULL ctx → -EINVAL.
- The float_moment Vulkan cross-backend gate (scripts/ci/cross_backend_vif_diff.py --feature float_moment --backend vulkan) re-runs clean: 0/48 mismatches × 4 metrics on Intel Arc A380, which confirms the import-slot promotion and state struct change did not regress the existing kernel paths.
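For the downstream score comparison at places=4, a tolerance check along the following lines would be one way to express the criterion. It assumes that "places=4" means the same thing as in Python's unittest assertAlmostEqual, i.e. the difference rounds to zero at four decimal places; the function name is hypothetical and the sketch is not taken from the fork's test harness.

```c
/* Returns 1 when a and b agree to `places` decimal places, approximated
 * as |a - b| < 0.5 * 10^-places. Assumption: this mirrors the rounding
 * rule of Python's assertAlmostEqual(places=...). */
static int scores_match(double a, double b, unsigned places)
{
    double tol = 0.5;
    for (unsigned i = 0; i < places; i++)
        tol /= 10.0;

    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;

    /* NaN on either side makes the comparison false, so it counts as a mismatch. */
    return diff < tol;
}
```

With places=4 the tolerance is 0.00005, so a GPU-path score that differs from the CPU baseline by 0.00002 passes, while a difference of 0.0002 fails.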
References
- Parent: ADR-0184, which declares the API shape this ADR implements.
- Pattern source: SYCL trio (vmaf_sycl_import_va_surface / vmaf_sycl_wait_compute / vmaf_read_pictures_sycl) in libvmaf_sycl.h.
- Source: T7-29 in .workingdir2/BACKLOG.md.
- Per-PR rule: ADR-0108 deep-dive deliverables checklist.