Metal (Apple Silicon) compute backend¶
Status: 17 kernels wired, registered, and parity-tested. The Metal backend has a real Apple-Silicon runtime, shared-memory MTLBuffer picture storage, metallib embedding, and 17 wired feature extractors, which together form the full cross-backend metric set: float_moment_metal, float_motion_metal, float_ms_ssim_metal, float_psnr_metal, float_ssim_metal, integer_motion_metal, integer_psnr_metal, motion_v2_metal, integer_ssim_metal, float_vif_metal, integer_vif_metal, float_adm_metal, integer_adm_metal, integer_ciede_metal, integer_psnr_hvs_metal, integer_cambi_metal, and ssimulacra2_metal. Each carries a per-kernel CPU-vs-Metal score parity test (see core/test/test_metal_kernel_coverage_audit.c). (float_ansnr_metal was removed in commit 70ed8b3ce3 / PR #38 together with the CPU and HIP twins.)
The dispatch support predicate recognises both those extractor names and every key in their provided-features arrays (psnr_y, psnr_cb, psnr_cr, float_ms_ssim, vif, adm_num, ciede2000, Cambi_feature_cambi_score, ssimulacra2, etc.). The one remaining Metal-twin gap is the SpEED family (speed_chroma / speed_temporal), which has CUDA/SYCL/HIP twins but no Metal kernel yet.
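The support predicate can be pictured as a lookup over two tables. The following is a minimal C sketch under stated assumptions, not the fork's dispatch code: the function name metal_sketch_supports, the helper, and both tables are hypothetical, and the tables hold only a handful of the names quoted in the status paragraph.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, abbreviated tables: a few extractor names and a few
 * provided-feature keys taken from the status paragraph. */
static const char *const sketch_extractors[] = {
    "float_psnr_metal", "float_ssim_metal", "float_ms_ssim_metal",
    "integer_vif_metal", "integer_adm_metal", "ssimulacra2_metal",
};

static const char *const sketch_feature_keys[] = {
    "psnr_y", "psnr_cb", "psnr_cr", "float_ms_ssim", "ciede2000",
    "ssimulacra2",
};

/* Linear search; the tables are tiny, so nothing smarter is needed. */
static bool sketch_in_table(const char *name, const char *const *table, size_t n)
{
    assert(table != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, table[i]) == 0)
            return true;
    }
    return false;
}

/* True when `name` is an extractor name or a key from some
 * extractor's provided-features array. */
bool metal_sketch_supports(const char *name)
{
    if (name == NULL)
        return false;
    return sketch_in_table(name, sketch_extractors,
                           sizeof(sketch_extractors) / sizeof(sketch_extractors[0])) ||
           sketch_in_table(name, sketch_feature_keys,
                           sizeof(sketch_feature_keys) / sizeof(sketch_feature_keys[0]));
}
```

With these tables, a request for psnr_y or float_ssim_metal is accepted, while speed_chroma is rejected, which matches the SpEED gap noted above.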
Why Metal¶
Apple Silicon (M1+) is the perf story for Apple-platform users. The fork's existing Apple-Silicon coverage is the NEON SIMD CPU path (per ADR-0145 and the wider NEON twin matrix); this backend adds the GPU compute path that NEON cannot reach.
Three properties make a native Metal backend worth shipping:
- Unified memory. MTLBuffer allocations created with MTLResourceStorageModeShared are zero-copy across CPU↔GPU; the submit-side H2D / D2H staging that the CUDA / HIP / Vulkan backends spend the bulk of their complexity on collapses to host stores and direct [buffer contents] reads.
- First-party Apple compute API. OpenCL has been deprecated since macOS 10.14 (2018) and receives no driver updates; Vulkan reaches the GPU only through MoltenVK's translation layer (Vulkan command buffer → Metal command buffer rewrite), which adds per-dispatch overhead. Metal is the supported user-space surface.
- No PCIe boundary. GPU and CPU share the same DRAM with cache coherence; the runtime PR can keep the previous-frame ref Y plane in one shared buffer rather than ping-ponging two device allocations the way the HIP twin does.
See ADR-0361 §Context for the full reasoning and rejected alternatives (MoltenVK, oneAPI, OpenCL, Swift-based runtime).
Apple Silicon only¶
The runtime PR (T8-1b) gates device selection on MTLGPUFamily.Apple7 (M1 and later) via -[id<MTLDevice> supportsFamily:]. Intel Macs and non-macOS hosts surface as -ENODEV from vmaf_metal_state_init. Reasoning: Apple discontinued Intel-Mac GPU parity, and the unified-memory zero-copy story does not apply on Intel-Mac discrete GPUs (Radeon Pro / Vega) which sit behind PCIe. See ADR-0361 §Apple Silicon-only.
Build¶
On macOS:
-Denable_metal=auto (the default) auto-resolves to enabled on host_machine.system() == 'darwin' and disabled elsewhere. -Denable_metal=disabled suppresses the auto-probe even on macOS. -Denable_metal=enabled forces the Metal frameworks to be linked; on non-macOS hosts the meson dependency('Metal') probe fails the setup step with a clear missing-framework error.
The backend has zero hard runtime dependencies on non-macOS hosts because the Metal subdirectory is not entered there unless -Denable_metal=enabled is forced. On macOS the dependency('Metal') / dependency('IOSurface') probes resolve to the system frameworks; MetalKit is optional.
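The three option values reduce to a small decision table. The C sketch below models that table for illustration only; the real logic lives in the Meson build files, and every identifier here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_option { SKETCH_AUTO, SKETCH_ENABLED, SKETCH_DISABLED };

enum sketch_result {
    SKETCH_BUILD_WITHOUT_METAL,
    SKETCH_BUILD_WITH_METAL,
    SKETCH_SETUP_ERROR          /* framework probe failed during setup */
};

/* Models how -Denable_metal resolves for a given host. */
enum sketch_result sketch_resolve_enable_metal(enum sketch_option opt,
                                               bool host_is_darwin)
{
    switch (opt) {
    case SKETCH_DISABLED:
        /* No probe at all, even on macOS. */
        return SKETCH_BUILD_WITHOUT_METAL;
    case SKETCH_AUTO:
        /* Enabled on darwin, silently disabled elsewhere. */
        return host_is_darwin ? SKETCH_BUILD_WITH_METAL
                              : SKETCH_BUILD_WITHOUT_METAL;
    case SKETCH_ENABLED:
        /* Forced: the framework probe runs and fails setup off macOS. */
        return host_is_darwin ? SKETCH_BUILD_WITH_METAL
                              : SKETCH_SETUP_ERROR;
    }
    assert(!"unknown option value");
    return SKETCH_SETUP_ERROR;
}
```

The only combination that stops the build is enabled on a non-macOS host; auto never fails, which is why it is a safe default.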
Runtime layer¶
The runtime layer uses Objective-C++ .mm TUs under ARC and keeps Metal object handles opaque at C boundaries as void * / uintptr_t. vmaf_metal_context_new creates an Apple-Family-7+ id<MTLDevice> and id<MTLCommandQueue>, picture_metal.mm allocates shared MTLBuffer storage, and kernel_template.mm wraps per-feature command-buffer lifecycle and readback waits.
Kernel sources are Metal Shading Language (.metal) compiled to .air and linked into a default.metallib with xcrun metal / xcrun metallib. The metallib is embedded into the libvmaf binary's __TEXT,__metallib section and loaded by the Obj-C++ host dispatch files.
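The opaque-handle convention at the C boundary can be illustrated in plain C. This is a sketch under assumptions: SketchMetalContext, its field names, and the accessor functions are hypothetical, and an ordinary pointer stands in for the Objective-C object. In the real .mm translation units the conversion would additionally need an ARC bridging cast, and ownership transfer is outside the scope of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C-visible context: handles are plain integers, so C callers never
 * see an Objective-C type. Field names are hypothetical. */
typedef struct SketchMetalContext {
    uintptr_t device;         /* would hold a bridged id<MTLDevice> */
    uintptr_t command_queue;  /* would hold a bridged id<MTLCommandQueue> */
} SketchMetalContext;

/* Pack a pointer into the opaque slot (done on the .mm side in practice). */
static uintptr_t sketch_handle_from_ptr(void *p)
{
    return (uintptr_t)p;
}

/* Recover the pointer; only the owning translation unit should do this. */
static void *sketch_handle_to_ptr(uintptr_t h)
{
    return (void *)h;
}

void sketch_context_set_device(SketchMetalContext *ctx, void *device)
{
    assert(ctx != NULL);
    ctx->device = sketch_handle_from_ptr(device);
}

void *sketch_context_device(const SketchMetalContext *ctx)
{
    assert(ctx != NULL);
    return sketch_handle_to_ptr(ctx->device);
}
```

Keeping the fields as uintptr_t means the public header compiles as plain C on every platform, including hosts that have no Metal headers.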
Rollout sequence¶
- T8-1 (scaffold PR + batch-1): public header, src/metal tree, first consumer registrations, enable_metal Meson option, smoke test, and macOS CI lane.
- T8-1b (runtime PR): MTLCreateSystemDefaultDevice / id<MTLCommandQueue> / id<MTLBuffer> lifecycle. Runtime entry points return 0 on a real Apple Silicon device and -ENODEV on Intel Mac or non-Apple-Family-7 GPUs.
- T8-1c…T8-1j (first kernel batch): motion_v2, float/integer PSNR, float moment, float ANSNR, float/integer motion, and float SSIM host dispatch + MSL kernels.
- T8-1k: integer_moment_metal (uint32 hi/lo reduction).
- T8-2b: float_ms_ssim_metal (ADR-0490), float-precision 5-scale MS-SSIM on Metal. Three MSL kernels (ms_ssim_decimate, ms_ssim_horiz, ms_ssim_vert_lcs); Wang (2003) weights applied host-side in double precision.
- T8-2c+: remaining kernels (VIF, ADM, CIEDE, CAMBI, SSIMULACRA2, etc.) follow as their own PRs gated by the places=4 cross-backend-diff lane (per ADR-0214).
- enable_metal default flip from auto to enabled: only after the kernel matrix proves bit-exactness via the places=4 cross-backend gate (mirrors the enable_vulkan and enable_hip roadmaps).
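The places=4 gate mentioned in the rollout can be sketched as a tolerance check. This assumes that places=4 means two scores agree when their absolute difference is below 0.5 × 10^-4; the fork's actual gate may define agreement differently, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Tolerance for a given number of decimal places: 0.5 * 10^-places. */
static double sketch_places_tolerance(int places)
{
    assert(places >= 0);
    double tol = 0.5;
    for (int i = 0; i < places; i++)
        tol /= 10.0;
    return tol;
}

/* True when the CPU and GPU scores match to `places` decimal places.
 * A NaN on either side compares false, so it is reported as a mismatch. */
bool sketch_scores_agree(double cpu_score, double gpu_score, int places)
{
    double diff = cpu_score - gpu_score;
    if (diff < 0.0)
        diff = -diff;
    return diff < sketch_places_tolerance(places);
}
```

Under this assumption, 0.98765 and 0.98767 pass at places=4, while 0.9876 and 0.9878 do not.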
Feature extractor options¶
float_ssim_metal¶
float_ssim_metal now reaches full option parity with the CPU float_ssim extractor (ADR-0484):
- enable_lcs (bool, default false): emit per-frame luminance (float_ssim_l), contrast (float_ssim_c), and structure (float_ssim_s) sub-scores alongside the composite SSIM score. When enabled, the float_ssim_vert_combine kernel accumulates three additional per-WG partial sums (L, C, S) in a single threadgroup reduction pass, with no extra dispatch.
- enable_db (bool, default false): convert the SSIM score to decibels: -10·log10(1 − SSIM). Applied host-side after the partial-sum reduction.
- clip_db (bool, default false): clamp the dB output to a finite maximum derived from frame dimensions and bit depth. Mirrors the CPU helper exactly.
- scale (int, default 0 = auto-detect): decimation scale factor. v1 supports scale=1 only; scale=0 on frames where auto-detect would choose scale>1 returns -EINVAL at init time with a log message.
Usage example:
vmaf --feature float_ssim_metal:enable_lcs=true:enable_db=true \
--reference ref.yuv --distorted dist.yuv ...
Coordination with NEON¶
The Metal backend targets the GPU on Apple Silicon. The NEON SIMD twin matrix (per ADR-0145) stays the CPU-side path on the same hardware. The two are complementary:
- Small / latency-sensitive runs land on NEON via the existing CPU dispatch (no GPU command-buffer setup overhead).
- Large / throughput-bound runs land on Metal when one of the shipped Metal feature kernels is requested; the GPU's parallelism + unified memory eliminate both the CPU-bound bottleneck and the H2D / D2H staging cost.
Backend selection follows the standard libvmaf precedence (see ../index.md §Runtime selection): GPU paths win when available, CPU SIMD wins otherwise.
Verification¶
The macOS CI lane Build — macOS Metal is the ground-truth gate; it runs on every PR with -Denable_metal=enabled and exercises the smoke test plus the currently wired kernel batch. Linux-host dev sessions cannot reproduce the lane locally because Metal.framework only exists on macOS hosts.
Reviewers verifying locally on a Mac:
References¶
- ADR-0361: original audit-first Metal backend ADR.
- ADR-0212: HIP scaffold precedent (T7-10).
- ADR-0175: Vulkan scaffold precedent (T5-1), the original audit-first GPU-backend pattern.
- ADR-0145: motion_v2 NEON twin on Apple Silicon CPU.
- ADR-0214: places=4 cross-backend gate; the runtime PR's incoming numerics gate.
- ADR-0490: float_ms_ssim_metal port (T8-2b), design rationale for the float 5-scale MS-SSIM Metal twin.
- Apple Developer documentation: Metal-cpp, https://developer.apple.com/metal/cpp/ (accessed 2026-05-09).