Skip to content

ADR-0101: SYCL USM-backed picture pre-allocation pool

  • Status: Accepted
  • Date: 2026-04-18
  • Deciders: Lusoris, Claude (Opus 4.7)
  • Tags: sycl, gpu, picture-api, memory

Context

vmaf_sycl_preallocate_pictures() and vmaf_sycl_picture_fetch() shipped in an earlier SYCL backend scaffolding PR (#25) as no-op stubs:

int vmaf_sycl_preallocate_pictures(VmafContext *vmaf, VmafSyclPictureConfiguration cfg)
{
    // SYCL extractors handle picture upload internally,
    // so preallocation is not strictly needed.
    (void)vmaf; (void)cfg; return 0;
}

int vmaf_sycl_picture_fetch(VmafContext *vmaf, VmafPicture *pic)
{
    return vmaf_picture_alloc(pic, vmaf->pic_params.pix_fmt, ...);
}

The stubs were documented as "not strictly needed" because the primary SYCL upload path (vmaf_sycl_shared_frame_upload in src/sycl/common.cpp) copies from host VmafPicture.data[0] into double-buffered shared device USM buffers once per frame. The fetch stub simply delegated to vmaf_picture_alloc, which returns a host-backed picture.

Issue #26 tracks this as a correctness gap: the public API implies "DEVICE" and "HOST" pre-allocation modes return USM-backed pictures, but in fact every caller receives an ordinary host picture regardless of pic_prealloc_method. Callers that expect a device pointer (e.g., for zero-copy interop with a decoder that writes straight to SYCL USM) have to work around the silent fallback. The CUDA backend already ships a full implementation via vmaf_cuda_preallocate_pictures + VmafRingBuffer (src/cuda/ring_buffer.c); SYCL had parity only on paper.

Decision

Implement a USM-backed picture pool for SYCL — VmafSyclPicturePool in core/src/sycl/picture_sycl.{h,cpp} — and wire vmaf_sycl_preallocate_pictures() / vmaf_sycl_picture_fetch() / vmaf_close() to use it.

Key choices:

  • Scope: Y-plane only. VMAF operates on luma; U/V planes are not read by any SYCL extractor. Keeping scope identical to the CUDA pool avoids diverging the two backends' contracts.
  • Pool depth: 2 pictures. Matches the double-buffered shared-frame upload (cur_upload / cur_compute in VmafSyclState), so frame N+1 can start filling pool slot 1 while frame N is still being consumed from slot 0.
  • Allocator selection: VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_DEVICEsycl::malloc_device; VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_HOSTsycl::malloc_host. NONE remains a true no-op (no pool created; vmaf_sycl_picture_fetch falls back to vmaf_picture_alloc).
  • Refcount semantics: Each pool entry is initialised with a VMAF ref (=1) owned by the pool. vmaf_sycl_picture_pool_fetch calls vmaf_picture_ref which increments to 2 — the caller owns the outer ref and releases it via vmaf_picture_unref (→ back to 1, pool still holds). vmaf_close calls vmaf_sycl_picture_pool_close which walks the pool and releases each entry's buffers + priv + ref.
  • buf_type tag: Every pool picture gets VmafPicturePrivate.buf_type = VMAF_PICTURE_BUFFER_TYPE_SYCL_DEVICE (regardless of DEVICE vs HOST USM method) so validate_pic_params can enforce consistent backing across ref/dist pairs.

The primary upload path (vmaf_sycl_shared_frame_upload) is unchanged — SYCL copy-queue memcpy is allocator-agnostic, so a USM source works identically to a host source (transparent D2D or H2D depending on allocator kind).

Alternatives considered

Option Pros Cons Why not chosen
Keep the no-op stubs Zero code change; no blast radius Silent API lie: DEVICE/HOST modes return a host picture. Breaks zero-copy decoder interop. Issue #26 stays open indefinitely. Rejected: the public API contract requires real USM buffers on these modes.
Unify with CUDA VmafRingBuffer Single implementation, single refcount model Would require shimming cuStream semantics onto a sycl::queue; ring_buffer.c pulls in CUDA headers unconditionally. High churn for no functional gain. Rejected: blast radius touches CUDA code paths that are already stable.
Extend generic picture_pool.c (the reusable byte-pool from ADR-0029) One pool for all backends picture_pool.c is host-byte oriented; adding USM support means injecting a backend-dispatch layer into code that currently has zero GPU awareness. Leaks SYCL types into a generic path. Rejected: keeps the generic pool focused on host byte buffers.

Consequences

  • Positive:
  • VmafSyclPictureConfiguration.pic_prealloc_method now behaves as its public doc-comment promises for all three enum values.
  • Zero-copy decoder interop (VPL → Level Zero → SYCL USM) has a landing spot: the decoder can write directly into pic.data[0] on a pool picture without a host round-trip.
  • Pool lifetime is bounded by vmaf_close(), matching CUDA behaviour.
  • Negative:
  • Adds ~180 lines of C++ (picture_sycl.cpp) plus a small API surface (VmafSyclPoolMethod, three vmaf_sycl_picture_pool_* functions) in the internal header.
  • Two new USM allocations per vmaf_sycl_preallocate_pictures call — minor GPU memory baseline increase (e.g., 2 × 1920×1080 × 1 B ≈ 4 MB on default 8-bit 1080p runs).
  • Neutral / follow-ups:
  • New test: core/test/test_sycl_pic_preallocation.c covers NONE / DEVICE / HOST / no-state / double-preallocate / refcount-cycle paths.
  • docs/api/gpu.md updated — removes the "SYCL preallocation is a no-op" callout introduced in PR #25.
  • PR #25 tracked Issue #26 as "scaffolded but not implemented"; this ADR closes that gap.

References

  • Issue #26 — "SYCL: vmaf_sycl_preallocate_pictures is a no-op stub"
  • PR #25 — original SYCL preallocation scaffolding (public API surface)
  • core/src/cuda/ring_buffer.c — CUDA reference implementation
  • core/src/sycl/common.cpp (vmaf_sycl_shared_frame_upload) — confirms SYCL copy-queue memcpy is allocator-agnostic
  • Source: req — user selected "Option 1 — implement USM properly" and "Yes — full ADR under docs/adr/" in session 90ad1e1d AskUserQuestion popup.