`libvmaf_metal.h`¶

Each GPU backend adds its own small API on top of the core libvmaf.h surface — a state object, picture preallocation helpers, and (SYCL / Metal) zero-copy import paths. This page is the reference for the four active backends (CUDA, SYCL, HIP, Metal); the Vulkan backend was removed in ADR-0726 — historical notes are preserved below in the Vulkan section. HIP still has three unported feature kernels, while Metal has a live Apple-Silicon runtime and first kernel batch.

Core API primer: index.md. CLI equivalents: ../usage/cli.md#backend-selection. Backend dispatch rules + runtime precedence: ../backends/index.md.

When these headers apply¶

The CUDA header is only useful in a build with -Denable_cuda=true (linking CUDA runtime + nvcc-compiled kernels). When disabled, the symbols are absent and calls won't link.
The SYCL header is only useful in a build with -Denable_sycl=true (linking oneAPI / Level Zero). Same rule.
~~The Vulkan header requires -Denable_vulkan=enabled (linking volk + the compute-shader feature kernels).~~ Vulkan was removed in ADR-0726. The header, source files, and enable_vulkan Meson option no longer exist.
The HIP header requires -Denable_hip=true -Denable_hipcc=true (linking ROCm). All feature kernels are real; 3 legacy stubs (adm_hip, vif_hip, motion_hip) return -ENOSYS (older _init/_run/_destroy API, not registered extractors).
The Metal header requires -Denable_metal=auto/enabled on macOS. Runtime entry points are live on Apple Silicon; unsupported devices return -ENODEV. Eight feature kernels are currently wired.
To write portable code that compiles against any libvmaf build, wrap GPU-specific sections in #ifdef HAVE_CUDA / #ifdef HAVE_SYCL / #ifdef HAVE_HIP / #ifdef HAVE_METAL, which pkg-config --cflags libvmaf surfaces automatically. (HAVE_VULKAN is no longer defined — Vulkan was removed in ADR-0726.)

CUDA¶

core/include/libvmaf/libvmaf_cuda.h

Lifecycle addition¶

  vmaf_init()
  vmaf_cuda_state_init()         ← new
  vmaf_cuda_import_state()       ← new; hands state to ctx
  vmaf_cuda_preallocate_pictures()
  loop:
    vmaf_cuda_fetch_preallocated_picture()
    ...                                     write into .data[i]
    vmaf_read_pictures()
  vmaf_score_pooled()
  vmaf_close()              /* destroys the by-value copy of CUDA state */
  vmaf_cuda_state_free()    /* always required — frees the original allocation */

State¶

typedef struct VmafCudaState VmafCudaState;

typedef struct VmafCudaConfiguration {
    void *cu_ctx;   /* CUcontext; NULL → libvmaf creates one on device 0 */
} VmafCudaConfiguration;

int vmaf_cuda_state_init(VmafCudaState **cu_state, VmafCudaConfiguration cfg);
int vmaf_cuda_import_state(VmafContext *vmaf, VmafCudaState *cu_state);
int vmaf_cuda_state_free(VmafCudaState *cu_state);

cu_ctx = NULL — libvmaf creates a fresh CUDA context on CUDA device 0. This is the common case for standalone tooling.
cu_ctx != NULL — must be a CUcontext from the driver API; libvmaf adopts it for all allocations. Use this to interop with an application that already owns a context (e.g. NVENC / NVDEC).

Ownership and explicit free¶

vmaf_cuda_import_state(vmaf, cu_state) copies the VmafCudaState by value into the VmafContext — it does not transfer ownership of the original heap allocation. vmaf_close(vmaf) tears down the embedded copy (destroying the CUDA stream and, if libvmaf created the context, releasing the primary context). After vmaf_close() returns, the caller must call vmaf_cuda_state_free(cu_state) to release the original heap allocation. Skipping this call leaks the allocation. Do not import the same state into two contexts.

vmaf_cuda_state_free(VmafCudaState *cu_state) (added in ADR-0157) is a NULL-safe free() wrapper for the original pointer returned by vmaf_cuda_state_init(). At the time of the call, vmaf_close() has already run vmaf_cuda_release() on the embedded copy (destroying the stream and context), so vmaf_cuda_state_free() only needs to free() the struct. It also serves as the escape hatch when the state was built via vmaf_cuda_state_init() but never imported (e.g. an early vmaf_init() failure), in which case it additionally tears down the stream and context before freeing.

VmafCudaState *cuda = NULL;
int err = vmaf_cuda_state_init(&cuda, (VmafCudaConfiguration){ .cu_ctx = NULL });
if (err) { return err; }

if (some_unrelated_setup_failed()) {
    vmaf_cuda_state_free(cuda);   /* not yet imported — also destroys stream/ctx */
    return -1;
}

err = vmaf_cuda_import_state(ctx, cuda);
/* ctx now holds a by-value copy of the state.
 * vmaf_close(ctx) destroys that copy (stream + context).
 * vmaf_cuda_state_free(cuda) must still be called afterwards to
 * release the original heap allocation from vmaf_cuda_state_init(). */
vmaf_close(ctx);
vmaf_cuda_state_free(cuda);  /* always required after import + vmaf_close */

The CUDA and SYCL lifetime models differ deliberately: CUDA state is copied by value into the context; the caller still owns the heap pointer. SYCL state is also always caller-freed after vmaf_close() (the queue is queue-scoped and survives a scoring session boundary). Both require an explicit free after vmaf_close() — CUDA via vmaf_cuda_state_free, SYCL via vmaf_sycl_state_free. Match the API to the lifetime model of the underlying runtime.

Picture preallocation¶

enum VmafCudaPicturePreallocationMethod {
    VMAF_CUDA_PICTURE_PREALLOCATION_METHOD_NONE,
    VMAF_CUDA_PICTURE_PREALLOCATION_METHOD_DEVICE,
    VMAF_CUDA_PICTURE_PREALLOCATION_METHOD_HOST,
    VMAF_CUDA_PICTURE_PREALLOCATION_METHOD_HOST_PINNED,
};

typedef struct VmafCudaPictureConfiguration {
    struct { unsigned w, h; unsigned bpc; enum VmafPixelFormat pix_fmt; } pic_params;
    enum VmafCudaPicturePreallocationMethod pic_prealloc_method;
} VmafCudaPictureConfiguration;

int vmaf_cuda_preallocate_pictures(VmafContext *ctx, VmafCudaPictureConfiguration cfg);
int vmaf_cuda_fetch_preallocated_picture(VmafContext *ctx, VmafPicture *pic);

Preallocation methods:

Method	`.data[i]` memory	Use when
`NONE`	caller-provided	You already own device buffers and set `.data[i]` yourself before `vmaf_read_pictures()`.
`DEVICE`	`cudaMalloc`	Your source data already lives on GPU (decoder output, encoder input). No H2D copy on `vmaf_read_pictures`.
`HOST`	`malloc`	Your source is on host; libvmaf inserts the H2D copy.
`HOST_PINNED`	`cudaMallocHost`	Host source but you want overlap with async compute — pinned memory allows concurrent DMA.

HOST_PINNED is almost always the right choice for CPU-decoded feeds; the peak throughput difference vs HOST on a PCIe-4 x16 link is 15–25% for 1080p. See ../backends/cuda/overview.md.

Complete CUDA example¶

#include <cuda.h>
#include <libvmaf/libvmaf.h>
#include <libvmaf/libvmaf_cuda.h>
#include <libvmaf/model.h>
#include <libvmaf/picture.h>

int main(void) {
    VmafConfiguration cfg = { .log_level = VMAF_LOG_LEVEL_WARNING, .n_threads = 4 };
    VmafContext *vmaf = NULL;
    int err = vmaf_init(&vmaf, cfg);

    VmafCudaState *cuda = NULL;
    VmafCudaConfiguration cu_cfg = { .cu_ctx = NULL };  /* libvmaf picks device 0 */
    err = vmaf_cuda_state_init(&cuda, cu_cfg);
    err = vmaf_cuda_import_state(vmaf, cuda);

    VmafModel *model = NULL;
    VmafModelConfig mcfg = { .name = "vmaf" };
    err = vmaf_model_load(&model, &mcfg, "vmaf_v0.6.1");
    err = vmaf_use_features_from_model(vmaf, model);

    VmafCudaPictureConfiguration pcfg = {
        .pic_params = { .w = 1920, .h = 1080, .bpc = 8, .pix_fmt = VMAF_PIX_FMT_YUV420P },
        .pic_prealloc_method = VMAF_CUDA_PICTURE_PREALLOCATION_METHOD_HOST_PINNED,
    };
    err = vmaf_cuda_preallocate_pictures(vmaf, pcfg);

    for (unsigned i = 0; i < nframes; i++) {
        VmafPicture ref = {0}, dist = {0};
        err = vmaf_cuda_fetch_preallocated_picture(vmaf, &ref);
        err = vmaf_cuda_fetch_preallocated_picture(vmaf, &dist);
        /* fill ref.data[i] / dist.data[i] — host-pinned, write normally */
        err = vmaf_read_pictures(vmaf, &ref, &dist, i);
    }

    err = vmaf_read_pictures(vmaf, NULL, NULL, 0);

    double score;
    err = vmaf_score_pooled(vmaf, model, VMAF_POOL_METHOD_MEAN, &score, 0, UINT_MAX);
    printf("VMAF: %.17g\n", score);

    vmaf_model_destroy(model);
    vmaf_close(vmaf);         /* tears down the by-value copy of CUDA state */
    vmaf_cuda_state_free(cuda); /* releases the original heap allocation */
    return 0;
}

Limitations¶

Single device. VmafCudaConfiguration does not expose a device index; launch libvmaf on device N by setting the current context to N before vmaf_cuda_state_init() (via cuCtxSetCurrent or cudaSetDevice).
No stream parameter. libvmaf runs its own streams internally; interop with an external stream is not exposed in v1.

SYCL¶

Header¶

core/include/libvmaf/libvmaf_sycl.h

State¶

typedef struct VmafSyclState VmafSyclState;

typedef struct VmafSyclConfiguration {
    int device_index;        /* -1 = SYCL default device; 0+ = specific ordinal */
    int enable_profiling;    /* non-zero: queue w/ enable_profiling property */
} VmafSyclConfiguration;

int  vmaf_sycl_state_init(VmafSyclState **out, VmafSyclConfiguration cfg);
int  vmaf_sycl_import_state(VmafContext *ctx, VmafSyclState *state);
void vmaf_sycl_state_free(VmafSyclState **state);
int  vmaf_sycl_list_devices(void);

vmaf_sycl_state_free is unusual — SYCL state is not owned by the context after import. You must call it explicitly after vmaf_close(ctx). This asymmetry exists because SYCL USM allocations are queue-scoped and the queue outlives one scoring session.

vmaf_sycl_list_devices enumerates device_type::gpu only (CPU / FPGA / accelerator devices are skipped) and prints one line per device with its ordinal, platform, vendor, driver version, and fp64 support flag. Returns the count, or -EIO on a SYCL exception. Used by vmaf_bench --list-devices (../usage/bench.md).

Picture preallocation (simple path)¶

enum VmafSyclPicturePreallocationMethod {
    VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_NONE,
    VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_DEVICE,
    VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_HOST,
};

typedef struct VmafSyclPictureConfiguration {
    struct { unsigned w, h; unsigned bpc; enum VmafPixelFormat pix_fmt; } pic_params;
    enum VmafSyclPicturePreallocationMethod pic_prealloc_method;
} VmafSyclPictureConfiguration;

int vmaf_sycl_preallocate_pictures(VmafContext *ctx, VmafSyclPictureConfiguration cfg);
int vmaf_sycl_picture_fetch(VmafContext *ctx, VmafPicture *pic);

vmaf_sycl_preallocate_pictures now honors the enum and creates a 2-deep SYCL picture pool when DEVICE or HOST is selected:

Method	Backing	Use case
`VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_NONE`	no pool; `vmaf_sycl_picture_fetch` falls back to `vmaf_picture_alloc()`	CPU-fed callers and test harnesses
`VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_DEVICE`	`sycl::malloc_device` USM	decoder / uploader writes directly into GPU-resident planes
`VMAF_SYCL_PICTURE_PREALLOCATION_METHOD_HOST`	`sycl::malloc_host` USM	CPU-visible pooled planes with SYCL-friendly lifetime semantics

The caller owns each VmafPicture reference returned by vmaf_sycl_picture_fetch() and must release it with vmaf_picture_unref() after submitting it through vmaf_read_pictures(). The pool keeps its own references until vmaf_close() tears down the context.

Zero-copy frame-buffer path¶

For callers that own a GPU-resident decode pipeline (Intel VPL, VA-API, D3D11), the frame-buffer API exposes two shared Y-plane buffers (ref + dis) and alternative ingest entry points. Use it when the pipeline wants the SYCL backend's built-in double-buffered Y-plane upload path instead of managing whole VmafPicture instances from the preallocation pool.

int vmaf_sycl_init_frame_buffers (VmafContext *ctx, unsigned w, unsigned h, unsigned bpc);
int vmaf_sycl_get_frame_buffers  (VmafContext *ctx, void **ref, void **dis);
int vmaf_read_pictures_sycl      (VmafContext *ctx, unsigned index);  /* replaces vmaf_read_pictures */
int vmaf_sycl_wait_compute       (VmafContext *ctx);
int vmaf_flush_sycl              (VmafContext *ctx);                  /* replaces (NULL,NULL,0) flush */

Typical zero-copy loop:

vmaf_sycl_init_frame_buffers(vmaf, W, H, 8);
void *ref_buf = NULL, *dis_buf = NULL;
vmaf_sycl_get_frame_buffers(vmaf, &ref_buf, &dis_buf);

for (unsigned i = 0; i < nframes; i++) {
    /* write Y-plane luma directly into ref_buf / dis_buf
     * (kernels, SYCL events, dmabuf imports — whatever the decoder exposes) */
    vmaf_read_pictures_sycl(vmaf, i);
    /* vmaf_sycl_wait_compute() is only needed if you must reuse ref_buf/dis_buf
     * for a later frame while compute is still in flight */
}
vmaf_flush_sycl(vmaf);

GPU-resident import paths¶

/* Linux / Level Zero */
int  vmaf_sycl_dmabuf_import (VmafSyclState *state, int fd, size_t size, void **ptr);
void vmaf_sycl_dmabuf_free   (VmafSyclState *state, void *ptr);

int  vmaf_sycl_import_va_surface (VmafSyclState *state, void *va_display,
                                  unsigned int va_surface, int is_ref,
                                  unsigned w, unsigned h, unsigned bpc);

int  vmaf_sycl_upload_plane (VmafSyclState *state, const void *src, unsigned pitch,
                             int is_ref, unsigned w, unsigned h, unsigned bpc);

/* Windows (conditional) */
#ifdef _WIN32
int  vmaf_sycl_import_d3d11_surface (VmafSyclState *state, void *d3d11_device,
                                     void *d3d11_texture, unsigned subresource,
                                     int is_ref, unsigned w, unsigned h, unsigned bpc);
#endif

vmaf_sycl_dmabuf_import is the primitive — turns a DMA-BUF fd into a SYCL device pointer via Level Zero external memory import. Stable.
vmaf_sycl_import_va_surface is the convenience wrapper on top of dmabuf — preferred path for a VA-API decode feed. Falls back to vaGetImage + memcpy when the DRM-PRIME export fails (older Mesa / proprietary drivers).
vmaf_sycl_upload_plane is the platform-agnostic escape hatch — memcpy from a host pointer. Use when nothing better works or when you need a baseline for benchmarking.
vmaf_sycl_import_d3d11_surface (Windows only) is declared in the public header but not implemented in-tree (rg finds zero definitions). The Doxygen block describes an intended host-roundtrip design — staging texture → CPU map → H2D memcpy — but no translation unit provides the symbol today. Tracked as issue #27. On Windows, use vmaf_sycl_upload_plane for a host → USM fallback.

See ADR-0016 for how these APIs landed and ../backends/sycl/overview.md for the ingestion-path decision tree.

Profiling helpers¶

int  vmaf_sycl_profiling_enable     (VmafSyclState *state);
void vmaf_sycl_profiling_disable    (VmafSyclState *state);
void vmaf_sycl_profiling_print      (VmafSyclState *state);
int  vmaf_sycl_profiling_get_string (VmafSyclState *state, char **out);

Profiling must be enabled at init time — the SYCL queue is created with the enable_profiling property inside vmaf_sycl_state_init() only when VmafSyclConfiguration.enable_profiling = 1 is passed. vmaf_sycl_profiling_enable does not re-create the queue; it only flips a bool on the state (core/src/sycl/common.cpp:1053). If the queue was not built with enable_profiling, calling vmaf_sycl_profiling_enable succeeds but subsequent get_profiling_info calls on kernel events will throw a sycl::exception. In practice: set enable_profiling=1 at init, then use the enable/disable pair to gate which frame ranges get timed.

vmaf_sycl_profiling_get_string yields a caller-owned buffer — free with free(). Equivalent to vmaf_bench --gpu-profile (../usage/bench.md).

Limitations¶

Zero-copy ingest paths (dmabuf_import, import_va_surface) require the SYCL queue to use the Level Zero backend — they call sycl::get_native<ext_oneapi_level_zero> directly. On an OpenCL-backend SYCL build these throw sycl::exception, which the wrapper catches and converts to -EIO. The error log is generic ("SYCL DMA-BUF import exception: <what()>") rather than a specific "not on Level Zero" diagnostic, so callers that want a graceful degradation should detect their own backend via sycl::queue::get_backend() up front and fall back to vmaf_sycl_upload_plane without relying on the log text.
vmaf_sycl_import_d3d11_surface is declared but unimplemented (ghost symbol — see issue #27). Windows callers must use vmaf_sycl_upload_plane today.
vmaf_sycl_init_frame_buffers is single-resolution. Changing w/h/bpc mid-stream requires vmaf_close + re-init.

Vulkan¶

Status: REMOVED — ADR-0726 (2026-05-28). The Vulkan backend, all source files (core/src/vulkan/, core/src/feature/vulkan/), the public header (libvmaf_vulkan.h), and the enable_vulkan Meson option were deleted. The CLI flags --vulkan_device, --no_vulkan, and --backend vulkan are no longer accepted. The sections below are preserved as a historical reference for what was implemented; none of the entry points listed exist in current builds. For active GPU backends see CUDA, SYCL, HIP, and Metal.

Historical note: at the time of removal, the backend had reached T5-1c (full default-model coverage). The state-level API, all feature extractors, image-import zero-copy paths, and FFmpeg AVVulkanDeviceContext interop described below are no longer present. ADR references: ADR-0127, ADR-0175, ADR-0186, ADR-0726.

Header¶

core/include/libvmaf/libvmaf_vulkan.h

State¶

typedef struct VmafVulkanState VmafVulkanState;

typedef struct VmafVulkanConfiguration {
    int device_index;                /* -1 = first device with compute queue */
    int enable_validation;           /* non-zero: load VK_LAYER_KHRONOS_validation */
    unsigned max_outstanding_frames; /* 0 = default (4); clamped to [1, 8] */
} VmafVulkanConfiguration;

int      vmaf_vulkan_available(void);
int      vmaf_vulkan_state_init(VmafVulkanState **out, VmafVulkanConfiguration cfg);
unsigned vmaf_vulkan_state_max_outstanding_frames(const VmafVulkanState *state);
int      vmaf_vulkan_import_state(VmafContext *ctx, VmafVulkanState *state);
void     vmaf_vulkan_state_free(VmafVulkanState **state);
int      vmaf_vulkan_list_devices(void);

The lifetime model mirrors CUDA's: after vmaf_vulkan_import_state(ctx, state) the context owns the state and vmaf_close(ctx) frees it. The _state_free() helper exists for the pre-import escape hatch (caller built a state but never imported it — e.g. early vmaf_init() failure or a benchmark harness that constructs and tears down a state without scoring).

vmaf_vulkan_available() returns 1 when libvmaf was built with -Denable_vulkan=enabled and 0 otherwise.

Lifecycle¶

  vmaf_init()
  vmaf_vulkan_state_init()         ← creates VkInstance + VkDevice + compute queue
  vmaf_vulkan_import_state()       ← state ownership transfers to ctx
  ...
  vmaf_score_pooled()
  vmaf_close()                     ← frees the imported state

For zero-copy interop with caller-owned VkInstance / VkDevice handles (typically from FFmpeg's AVVulkanDeviceContext), use vmaf_vulkan_state_init_external together with vmaf_vulkan_import_image / vmaf_vulkan_wait_compute / vmaf_vulkan_read_imported_pictures. See ADR-0186 and backends/vulkan/overview.md.

Async pending-fence pipelining (v2 — ADR-0251)¶

vmaf_vulkan_import_image is non-blocking as of T7-29 part 4. It records the GPU copy, submits to the compute queue, and returns immediately — the caller's decoder thread can run ahead while libvmaf's transfer queue drains in the background. Up to max_outstanding_frames (default 4) frames may be in flight before the next import_image call back-pressures on the oldest fence.

The drain happens automatically inside vmaf_vulkan_state_build_pictures (called by vmaf_vulkan_read_imported_pictures); callers who need an explicit drain — e.g. before reusing the imported VkImage on the decoder side — call vmaf_vulkan_wait_compute(), which now blocks on every outstanding fence in the ring.

Memory cost: the staging arena scales with max_outstanding_frames. At the default depth and 1080p 8-bit Y, the arena is roughly 16 MiB of host-visible buffers per VmafVulkanState. Higher resolutions or multi-state setups should size accordingly.

The ring depth is configurable via VmafVulkanConfiguration.max_outstanding_frames (0 selects the canonical default of 4; values are clamped to [1, 8] internally). The clamped value is observable via vmaf_vulkan_state_max_outstanding_frames(). ADR-0235 follow-up #3, T7-29 part 4 (this knob currently affects only vmaf_vulkan_state_init; external-handles callers receive the default until a separate ABI bump extends VmafVulkanExternalHandles).

Picture preallocation (ADR-0238)¶

Mirrors the CUDA / SYCL preallocation surface:

enum VmafVulkanPicturePreallocationMethod {
    VMAF_VULKAN_PICTURE_PREALLOCATION_METHOD_NONE = 0,
    VMAF_VULKAN_PICTURE_PREALLOCATION_METHOD_HOST,
    VMAF_VULKAN_PICTURE_PREALLOCATION_METHOD_DEVICE,
};

typedef struct VmafVulkanPictureConfiguration {
    struct {
        unsigned w, h;
        unsigned bpc;
        enum VmafPixelFormat pix_fmt;
    } pic_params;
    enum VmafVulkanPicturePreallocationMethod pic_prealloc_method;
} VmafVulkanPictureConfiguration;

int vmaf_vulkan_preallocate_pictures(VmafContext *vmaf, VmafVulkanPictureConfiguration cfg);
int vmaf_vulkan_picture_fetch(VmafContext *vmaf, VmafPicture *pic);

HOST allocates pictures via the regular vmaf_picture_alloc; DEVICE backs each picture's luma plane with a host-visible Vulkan buffer (VMA AUTO_PREFER_HOST) — the persistent mapped pointer is exposed as pic->data[0], so the caller writes once and the kernel descriptor sets read the same memory. Pool depth is fixed at the canonical frames-in-flight = 2 (matches SYCL); pictures are dispensed round-robin via vmaf_vulkan_picture_fetch. Fetch falls back to a host-backed picture if the caller skipped preallocate_pictures entirely.

Limitations¶

Pool depth is currently compile-time pic_cnt = 2 (matches SYCL). Growing the depth is an additive VmafVulkanPictureConfiguration field — gated on a real workload needing more.
Pool currently allocates the Y plane only (matches SYCL). Chroma-aware extractors that want preallocated U/V planes need a follow-up.
The ffmpeg libvmaf filter exposes vulkan_device=N (set to >= 0 to enable the Vulkan backend; see docs/usage/ffmpeg.md). Image-import zero-copy through AVVulkanDeviceContext is wired by ffmpeg-patches/0004-libvmaf-wire-vulkan-backend-selector.patch on top of T7-29's _state_init_external API.
HIP / AMD-ROCm support: libvmaf_hip.h is shipping (T7-10 scaffold, ADR-0212; runtime + all registered feature kernels real). 3 legacy stubs (adm_hip/vif_hip/motion_hip) use an older _init/_run/_destroy API and are not registered extractors. FFmpeg integration is wired by ffmpeg-patches/0011-libvmaf-wire-hip-backend-selector.patch (--enable-libvmaf-hip + hip_device=N, ADR-0380).

HIP¶

Header¶

core/include/libvmaf/libvmaf_hip.h

libvmaf_hip.h exposes the AMD ROCm/HIP lifecycle surface. It is available only in builds with -Denable_hip=true -Denable_hipcc=true; without those flags the symbols are absent and calls will not link. HIP runtime types (hipDevice_t, hipStream_t) cross the public ABI as uintptr_t to keep the header free of <hip/hip_runtime.h> — cast on the caller side.

Core lifecycle API¶

Symbol	Description
`vmaf_hip_available`	Returns 1 if libvmaf was built with `-Denable_hip=true`, 0 otherwise. Cheap to call; no HIP runtime is touched until `vmaf_hip_state_init()`.
`vmaf_hip_state_init`	Allocates a `VmafHipState` pinned to a HIP device. `device_index = -1` selects the first compute-capable HIP device; 0+ selects a specific ordinal. Returns `-ENODEV` when no compatible device is found.
`vmaf_hip_import_state`	Hands an allocated `VmafHipState` to a `VmafContext`. The caller retains ownership and must call `vmaf_hip_state_free` after `vmaf_close`. Returns `0` on success, `-EINVAL` when `ctx` or `state` is `NULL`, `-ENOSYS` when built without HIP.
`vmaf_hip_state_free`	Releases a state allocated via `vmaf_hip_state_init`. Safe to pass `NULL` or a state that was never imported. Sets the pointer to `NULL` on return.
`vmaf_hip_list_devices`	Enumerates compute-capable HIP devices visible to the runtime. Prints one line per device with its ordinal, name, and compute capability. Returns device count or `-ENOSYS` when built without HIP.

State¶

typedef struct VmafHipState VmafHipState;

typedef struct VmafHipConfiguration {
    int device_index; /**< -1 = first HIP device with compute capability */
    int flags;        /**< reserved for future use; pass 0 */
} VmafHipConfiguration;

int  vmaf_hip_available(void);
int  vmaf_hip_state_init(VmafHipState **out, VmafHipConfiguration cfg);
int  vmaf_hip_import_state(VmafContext *ctx, VmafHipState *state);
void vmaf_hip_state_free(VmafHipState **state);
int  vmaf_hip_list_devices(void);

Ownership¶

The HIP backend follows the same caller-owned-state model as SYCL: after vmaf_hip_import_state(ctx, state) the caller still owns the state and must call vmaf_hip_state_free(&state) after vmaf_close(ctx). This differs from the CUDA model (where the context takes ownership post-import). The rationale mirrors SYCL: HIP state may outlive a single scoring session when the caller manages a multi-pass workflow against the same device.

Typical call sequence¶

vmaf_init()
vmaf_hip_state_init(&state, cfg)     ← allocate state for device N
vmaf_hip_import_state(vmaf, state)   ← hands state to ctx; caller still owns it
loop:
  vmaf_read_pictures(vmaf, &ref, &dist, i)
vmaf_score_pooled(vmaf, ...)
vmaf_close(vmaf)
vmaf_hip_state_free(&state)          ← caller frees after vmaf_close

Limitations and current feature coverage¶

The HIP backend is compile-time gated behind -Denable_hip=true and requires ROCm 7.0+ at runtime. As of ADR-0533 / ADR-0539, 21 feature extractors are registered and end-to-end verified on AMD gfx hardware:

PSNR, float-PSNR, CIEDE, float-moment, integer-moment, float-SSIM, MS-SSIM, PSNR-HVS, CAMBI, SSIMULACRA2, integer-motion, integer-motion-v2, float-motion, float-VIF, integer-VIF, integer-ADM, float-ADM, integer-SSIM, speed-chroma, speed-temporal, integer-CIEDE.

Three legacy-API stubs (adm_hip, vif_hip, motion_hip) exist in tree but use an older _init/_run/_destroy API shape that is not compatible with the VmafFeatureExtractor registration system; they return -ENOSYS at init() and are not selectable via --feature. The float_ansnr extractor was removed in commit 70ed8b3ce3 (PR #38); it is no longer registered on any backend.

See ../backends/hip/overview.md for the complete extractor table, build flags, HSACO fat-binary target selection, FFmpeg integration (hip_device=N — ADR-0380), and per-kernel notes.

Metal¶

libvmaf_metal.h exposes the Apple Metal lifecycle and IOSurface import surface. It is available only on macOS builds with -Denable_metal=auto or -Denable_metal=enabled; unsupported hosts return -ENODEV instead of silently falling back to CPU. The header is installed into the system prefix by meson install whenever Metal is enabled, so that downstream FFmpeg --enable-libvmaf-metal configure probes can locate it (ADR-0437).

Core lifecycle API¶

Symbol	Description
`vmaf_metal_available`	Returns 1 if the library was built with Metal support; 0 otherwise.
`vmaf_metal_state_init`	Allocates a `VmafMetalState`, selecting a device by index (-1 = system default). Returns `-ENODEV` on non-Apple-Family-7 hosts.
`vmaf_metal_import_state`	Hands an allocated `VmafMetalState` to a `VmafContext` for use during feature extraction. The caller retains ownership and must call `vmaf_metal_state_free` after `vmaf_close`.
`vmaf_metal_state_free`	Releases a state allocated via `vmaf_metal_state_init` or `vmaf_metal_state_init_external`. Safe to pass `NULL`.
`vmaf_metal_list_devices`	Enumerates Apple-Family-7+ Metal devices. Returns device count or `-ENOSYS` when built without Metal.

Typical call sequence:

vmaf_init()
vmaf_metal_state_init(&state, cfg)     ← new
vmaf_metal_import_state(vmaf, state)   ← new; hands state to ctx
loop:
  vmaf_metal_picture_import(state, iosurface, plane, w, h, bpc, is_ref, index)
  vmaf_metal_wait_compute(state)
  vmaf_metal_read_imported_pictures(vmaf, index)
vmaf_score_pooled(vmaf, ...)
vmaf_close(vmaf)
vmaf_metal_state_free(&state)

IOSurface zero-copy import (ADR-0423)¶

For FFmpeg/VideoToolbox callers that hold CVPixelBufferRef-backed frames, the fork ships a zero-copy IOSurface import path. The caller pulls the IOSurface via CVPixelBufferGetIOSurface and hands it to libvmaf; the Metal feature kernels read the frame without a host round-trip.

Symbol	Description
`VmafMetalExternalHandles`	Struct carrying an `id<MTLDevice>` and `id<MTLCommandQueue>` as `uintptr_t` to keep the header Metal-framework-free.
`vmaf_metal_state_init_external`	Allocates a `VmafMetalState` that adopts caller-supplied Metal handles instead of creating its own device/queue. Required when the IOSurface source and libvmaf compute must share the same `MTLDevice`.
`vmaf_metal_picture_import`	Imports a single plane of an `IOSurfaceRef` (as `uintptr_t`) into the libvmaf Metal pipeline. The caller retains ownership; libvmaf locks the surface read-only and copies the plane into a shared-storage `VmafPicture`.
`vmaf_metal_wait_compute`	Blocks until all Metal compute work on `state` has finished. Currently a synchronous no-op (v1 import path is a host-side memcpy); future async paths replace this with an `MTLSharedEvent` drain.
`vmaf_metal_read_imported_pictures`	Triggers a libvmaf score read for the ref+dis IOSurfaces at `index`. Mirrors `vmaf_vulkan_read_imported_pictures`.

Metal currently targets Apple Silicon (Apple-Family-7, M1 and later) and ships runtime dispatch for 8 feature kernels. VIF, ADM, CIEDE, CAMBI, SSIMULACRA2, MS-SSIM, PSNR-HVS, and motion3 are tracked as follow-up kernels. Intel-Mac paths remain MoltenVK/Vulkan-oriented unless a future ADR adds a dedicated Metal runtime contract for those devices.

index.md — core API (everything on this page sits on top of it)
dnn.md — tiny-AI session API (separate from classic GPU dispatch)
../usage/cli.md#backend-selection — --no_cuda / --no_sycl / --sycl_device flags
../backends/cuda/overview.md, ../backends/sycl/overview.md, and ../backends/hip/overview.md — user-facing backend pages
../usage/bench.md — vmaf_bench, which consumes these APIs to produce the perf + validation tables
ADR-0016, ADR-0022, ADR-0027 — governing decisions

GPU backends C API — libvmaf_cuda.h / libvmaf_sycl.h / libvmaf_hip.h / libvmaf_metal.h¶

When these headers apply¶

CUDA¶

Header¶

Lifecycle addition¶

State¶

Ownership and explicit free¶

Picture preallocation¶

Complete CUDA example¶

Limitations¶

SYCL¶

Header¶

State¶

Picture preallocation (simple path)¶

Zero-copy frame-buffer path¶

GPU-resident import paths¶

Profiling helpers¶

Limitations¶

Vulkan¶

Header¶

State¶

Lifecycle¶

Async pending-fence pipelining (v2 — ADR-0251)¶

Picture preallocation (ADR-0238)¶

Limitations¶

HIP¶

Header¶

Core lifecycle API¶

State¶

Ownership¶

Typical call sequence¶

Limitations and current feature coverage¶

Metal¶

Core lifecycle API¶

IOSurface zero-copy import (ADR-0423)¶

Related¶

GPU backends C API — `libvmaf_cuda.h` / `libvmaf_sycl.h` / `libvmaf_hip.h` / `libvmaf_metal.h`¶