ADR-0101: SYCL USM-backed picture pre-allocation pool¶
- Status: Accepted
- Date: 2026-04-18
- Deciders: Lusoris, Claude (Opus 4.7)
- Tags:
sycl,gpu,picture-api,memory
Context¶
vmaf_sycl_preallocate_pictures() and vmaf_sycl_picture_fetch() shipped in an earlier SYCL backend scaffolding PR (#25) as no-op stubs:
int vmaf_sycl_preallocate_pictures(VmafContext *vmaf, VmafSyclPictureConfiguration cfg)
{
// SYCL extractors handle picture upload internally,
// so preallocation is not strictly needed.
(void)vmaf; (void)cfg; return 0;
}
int vmaf_sycl_picture_fetch(VmafContext *vmaf, VmafPicture *pic)
{
return vmaf_picture_alloc(pic, vmaf->pic_params.pix_fmt, ...);
}
The stubs were documented as "not strictly needed" because the primary SYCL upload path (vmaf_sycl_shared_frame_upload in src/sycl/common.cpp) copies from host VmafPicture.data[0] into double-buffered shared device USM buffers once per frame. The fetch stub simply delegated to vmaf_picture_alloc, which returns a host-backed picture.
Issue #26 tracks this as a correctness gap: the public API implies "DEVICE" and "HOST" pre-allocation modes return USM-backed pictures, but in fact every caller receives an ordinary host picture regardless of pic_prealloc_method. Callers that expect a device pointer (e.g., for zero-copy interop with a decoder that writes straight to SYCL USM) have to work around the silent fallback. The CUDA backend already ships a full implementation via vmaf_cuda_preallocate_pictures + VmafRingBuffer (src/cuda/ring_buffer.c); SYCL had parity only on paper.
Decision¶
Implement a USM-backed picture pool for SYCL — VmafSyclPicturePool in core/src/sycl/picture_sycl.{h,cpp} — and wire vmaf_sycl_preallocate_pictures() / vmaf_sycl_picture_fetch() / vmaf_close() to use it.
Key choices:
- Scope: Y-plane only. VMAF operates on luma; U/V planes are not read by any SYCL extractor. Keeping scope identical to the CUDA pool avoids diverging the two backends' contracts.
- Pool depth: 2 pictures. Matches the double-buffered shared-frame upload (
cur_upload/cur_computeinVmafSyclState), so frame N+1 can start filling pool slot 1 while frame N is still being consumed from slot 0. - Allocator selection:
VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_DEVICE→sycl::malloc_device;VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_HOST→sycl::malloc_host.NONEremains a true no-op (no pool created;vmaf_sycl_picture_fetchfalls back tovmaf_picture_alloc). - Refcount semantics: Each pool entry is initialised with a VMAF ref (=1) owned by the pool.
vmaf_sycl_picture_pool_fetchcallsvmaf_picture_refwhich increments to 2 — the caller owns the outer ref and releases it viavmaf_picture_unref(→ back to 1, pool still holds).vmaf_closecallsvmaf_sycl_picture_pool_closewhich walks the pool and releases each entry's buffers + priv + ref. - buf_type tag: Every pool picture gets
VmafPicturePrivate.buf_type = VMAF_PICTURE_BUFFER_TYPE_SYCL_DEVICE(regardless of DEVICE vs HOST USM method) sovalidate_pic_paramscan enforce consistent backing across ref/dist pairs.
The primary upload path (vmaf_sycl_shared_frame_upload) is unchanged — SYCL copy-queue memcpy is allocator-agnostic, so a USM source works identically to a host source (transparent D2D or H2D depending on allocator kind).
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep the no-op stubs | Zero code change; no blast radius | Silent API lie: DEVICE/HOST modes return a host picture. Breaks zero-copy decoder interop. Issue #26 stays open indefinitely. | Rejected: the public API contract requires real USM buffers on these modes. |
Unify with CUDA VmafRingBuffer | Single implementation, single refcount model | Would require shimming cuStream semantics onto a sycl::queue; ring_buffer.c pulls in CUDA headers unconditionally. High churn for no functional gain. | Rejected: blast radius touches CUDA code paths that are already stable. |
Extend generic picture_pool.c (the reusable byte-pool from ADR-0029) | One pool for all backends | picture_pool.c is host-byte oriented; adding USM support means injecting a backend-dispatch layer into code that currently has zero GPU awareness. Leaks SYCL types into a generic path. | Rejected: keeps the generic pool focused on host byte buffers. |
Consequences¶
- Positive:
VmafSyclPictureConfiguration.pic_prealloc_methodnow behaves as its public doc-comment promises for all three enum values.- Zero-copy decoder interop (VPL → Level Zero → SYCL USM) has a landing spot: the decoder can write directly into
pic.data[0]on a pool picture without a host round-trip. - Pool lifetime is bounded by
vmaf_close(), matching CUDA behaviour. - Negative:
- Adds ~180 lines of C++ (
picture_sycl.cpp) plus a small API surface (VmafSyclPoolMethod, threevmaf_sycl_picture_pool_*functions) in the internal header. - Two new USM allocations per
vmaf_sycl_preallocate_picturescall — minor GPU memory baseline increase (e.g., 2 × 1920×1080 × 1 B ≈ 4 MB on default 8-bit 1080p runs). - Neutral / follow-ups:
- New test:
core/test/test_sycl_pic_preallocation.ccovers NONE / DEVICE / HOST / no-state / double-preallocate / refcount-cycle paths. docs/api/gpu.mdupdated — removes the "SYCL preallocation is a no-op" callout introduced in PR #25.- PR #25 tracked Issue #26 as "scaffolded but not implemented"; this ADR closes that gap.
References¶
- Issue #26 — "SYCL:
vmaf_sycl_preallocate_picturesis a no-op stub" - PR #25 — original SYCL preallocation scaffolding (public API surface)
core/src/cuda/ring_buffer.c— CUDA reference implementationcore/src/sycl/common.cpp(vmaf_sycl_shared_frame_upload) — confirms SYCL copy-queue memcpy is allocator-agnostic- Source:
req— user selected "Option 1 — implement USM properly" and "Yes — full ADR under docs/adr/" in session 90ad1e1d AskUserQuestion popup.