ADR-0353: Vulkan submit-pool migration PR-B — six secondary kernels¶
- Status: Accepted
- Date: 2026-05-09
- Deciders: lusoris
- Tags: vulkan, gpu, performance, fork-local
Context¶
ADR-0256 defined the VmafVulkanKernelSubmitPool abstraction, which pre-allocates command buffers and fences so the hot-path frame loop does not call vkAllocateCommandBuffers / vkCreateFence / vkFreeCommandBuffers / vkDestroyFence on every frame. PR #563 (branch perf/vulkan-submit-pool-pr-a-adm-motion-psnr) landed the migration for the three highest-priority kernels: adm_vulkan.c, motion_vulkan.c, and psnr_vulkan.c.
Six secondary kernels remained on the legacy per-frame allocation path:
| Kernel | Note |
|---|---|
ssim_vulkan.c | Single pipeline, 3 SSBOs — all stable |
ciede_vulkan.c | Single pipeline, 7 SSBOs — all stable |
ms_ssim_vulkan.c | Two pipeline bundles (decimate + SSIM), 5 pyramid scales |
motion_v2_vulkan.c | Ping-pong ref_buf[cur/prev] — one descriptor update per frame |
float_psnr_vulkan.c | Single pipeline, 3 SSBOs — all stable |
float_motion_vulkan.c | Ping-pong blur[cur/prev] — one descriptor update per frame |
Per-frame allocation incurs vkAllocateCommandBuffers + vkCreateFence + vkAllocateDescriptorSets on every extracted frame. On a 4K / 120-fps encode quality assessment loop this is measurable CPU overhead and creates allocator pressure on non-persistent allocator paths.
The migration also unlocks T-GPU-OPT-VK-4 (descriptor set pre-allocation): for kernels with fully-stable SSBO handles (ssim, ciede, float_psnr) the descriptor sets can be written once at init() and reused every frame, eliminating vkUpdateDescriptorSets from the hot path entirely.
Decision¶
Migrate all six secondary Vulkan kernels to the submit-pool pattern established by PR-A. The migration is mechanical and follows the PR-A reference exactly:
- Add
VmafVulkanKernelSubmitPool sub_pool(one pool per pipeline bundle) to the kernel state struct. - Add pre-allocated
VkDescriptorSet pre_set[N]fields. - In
init(): callvmaf_vulkan_kernel_submit_pool_createthenvmaf_vulkan_kernel_descriptor_sets_alloc. For stable kernels, callwrite_descriptor_setonce here. For ping-pong kernels, defer the write toextract(). - In
extract(): replace the legacy alloc/begin/end/submit/wait/free sequence withvmaf_vulkan_kernel_submit_acquire+vmaf_vulkan_kernel_submit_end_and_wait+vmaf_vulkan_kernel_submit_free. - In
close_fex(): callvmaf_vulkan_kernel_submit_pool_destroybeforevmaf_vulkan_kernel_pipeline_destroy(the pipeline destructor callsvkDeviceWaitIdleand destroys the descriptor pool; all in-flight work must be drained first).
ms_ssim_vulkan.c receives two pools (sub_pool_decimate with 1 slot, sub_pool_ssim with MS_SSIM_SCALES = 5 slots) and three pre-allocated descriptor set arrays (dec_sets_ref[4], dec_sets_cmp[4], ssim_sets[5]). All SSBO handles are stable from init() onward, so all writes happen once.
Ping-pong kernels (motion_v2, float_motion) keep one vkUpdateDescriptorSets call per frame because the cur/prev assignment changes every frame — this is the same pattern as motion_vulkan.c (PR-A).
No SPIR-V changes. No numerical output changes. Bit-exactness is preserved by construction: only the command-buffer and descriptor-set lifecycle changes; the dispatch geometry and push constants are untouched.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| One mega-PR for all 13 kernels (PR-A + PR-B + PR-C in one) | Single review surface; minimal PR overhead | 13 files × ~200 LOC delta = ~2 600 LOC diff; hard to review without context; risk of a single mechanical error blocking all 13; CI runtime identical | Violates user feedback on PR size consolidation (100–800 LOC preferred); PR-A/B/C split costs the same CI minutes total and is strictly easier to bisect |
| Per-kernel PR (13 PRs) | Maximum isolation; trivial bisection | 13× PR overhead (description, checklist, CI queue); review context must be rebuilt 13 times; branch hygiene; not justifiable for mechanical migrations | 13 PRs for identical mechanical pattern is pure overhead |
| 3-PR split (A: hot-path top-3, B: secondary 6, C: long-tail 4) — chosen | Reasonably-sized diffs; clear priority ordering; A/B/C can merge sequentially while C is still being authored; bisection by PR is sufficient | Three separate CI runs, three PR descriptions | Best balance: PR-A landed immediately on the highest-priority kernels; PR-B migrates the next tier; PR-C will clean up the remaining four |
| Defer PR-B until after PR-C | No extra PR | PR-A leaves 10 kernels on the legacy path; every frame loop for the 6 secondary kernels continues to allocate; delay compounds | No benefit to deferring a ready migration |
Consequences¶
- Positive: Six more kernel frame loops free from per-frame
vkAllocateCommandBuffers/vkCreateFence. Forssim,ciede, andfloat_psnr,vkUpdateDescriptorSetsis removed from the hot path entirely (T-GPU-OPT-VK-4). Thems_ssimmulti-scale loop eliminates per-scale per-frame command buffer allocation. Total per-frame Vulkan call reduction across all migrated kernels (PR-A + PR-B): 9 kernel × 3 alloc/destroy calls → 0. - Negative: Slightly more complex
init()andclose_fex()per kernel; the pool ordering invariant (pool_destroy before pipeline_destroy) must be maintained on future refactors. - Neutral / follow-ups: PR-C (
ansnr_vulkan.c,vif_vulkan.c,ssimulacra2_vulkan.c,cambi_vulkan.c) remains on the legacy path until that branch lands. Thems_ssimtwo-pool approach should be kept in sync with any future multi-scale dispatch strategy changes.
References¶
- ADR-0256 — submit pool design.
- PR-A: branch
perf/vulkan-submit-pool-pr-a-adm-motion-psnr, PR #563. - ADR-0189 — SSIM Vulkan kernel.
- ADR-0190 — MS-SSIM Vulkan kernel.
- ADR-0193 — motion-v2 Vulkan kernel.
- Vulkan 1.3 spec §5.3 (Command Buffer Lifecycle), §14.2 (Descriptor Sets).
- Source:
req— user task specification for PR-B migration (2026-05-09).