ADR-0299: GPU scoring backend for `vmaf-tune` (`--score-backend`)¶

Status: Accepted
Date: 2026-05-03
Deciders: Lusoris
Tags: tooling, cuda, vulkan, sycl, ai, automation, fork-local

Context¶

vmaf-tune (Phase A; ADR-0237) drives an (encoder, preset, crf) grid sweep, encoding each cell with FFmpeg and scoring the encode against the reference with the libvmaf CLI. On a 60-second 1080p source, CPU-only VMAF scoring runs at 1–2 fps — the score axis dominates the corpus wall-clock once encodes parallelise.

The fork already ships GPU-accelerated scoring backends:

CUDA (ADR-0127 sibling, mature): 10–30 fps at 1080p depending on GPU class.
Vulkan (ADR-0175, ADR-0186): comparable to CUDA on recent NVIDIA / AMD silicon.
SYCL/oneAPI: Intel-first, comparable on Arc / Iris Xe.

The libvmaf CLI exposes them via the unified --backend NAME selector (values: auto|cpu|cuda|sycl|vulkan). Until this ADR, vmaf-tune invoked the CLI without --backend, leaving the binary in its built-in auto-mode — which is not the same as actively detecting the host's fastest backend, and was effectively CPU on workstations where the GPU backends' auto-engagement heuristics don't fire (e.g. when neither --gpumask nor --sycl_device is set).

The user-facing speedup is 10–30× on score wall-clock, which translates directly to corpus throughput once encodes are no longer the long pole.

Decision¶

We add a --score-backend {auto|cpu|cuda|sycl|vulkan} flag to vmaf-tune corpus (default auto) that resolves to a libvmaf --backend NAME argument before any encodes run.

Selection logic lives in a new module tools/vmaf-tune/src/vmaftune/score_backend.py:

parse_supported_backends(help_text) extracts the alternation from the vmaf binary's --help output. CPU is always considered supported.
detect_available_backends() intersects binary support with cheap hardware probes (nvidia-smi -L, vulkaninfo --summary, sycl-ls).
select_backend(prefer) honours the user's choice. auto walks the fallback chain (cuda → vulkan → sycl → cpu) and returns the first available; any other value is treated strictly — if the requested backend is not available, raise BackendUnavailableError with a diagnostic message. We never silently downgrade an explicit GPU request to CPU; that would mask hardware/build mismatches and lie to the operator about wall-clock expectations.

run_score and build_vmaf_command accept an optional backend kwarg that, when set, appends --backend NAME to the spawned argv. None preserves legacy behaviour for callers that haven't migrated.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
Always inject `--backend auto`	One-liner change in `score.py`	The libvmaf `auto` heuristic is conservative and routes to CPU on hosts where the user would want CUDA but no `--gpumask` is set. No telemetry for which backend ran.	Doesn't actually deliver the 10–30× speedup the user wants by default.
Detect once globally, ignore user preference	Simplest CLI surface (no new flag)	Operators on multi-GPU CI runners need to pin a backend for reproducibility. No way to force CPU for a known-bad GPU driver day.	Loses the strict-mode guarantee (no silent downgrade).
Probe by invoking vmaf with each backend and keeping the survivor	Most accurate (catches runtime init failures, not just driver presence)	Spawns up to 4 subprocesses per `vmaf-tune` invocation — adds 5–10 s of cold start; doesn't compose with the Phase A "JSONL row per cell" mental model.	Too slow for an interactive selection step.
Parse `vmaf --capabilities` JSON	Most robust (no help-text parsing fragility)	The CLI doesn't expose a machine-readable capability dump yet; would require a libvmaf change.	Future work; ship the help-parser first, swap when available.

Consequences¶

Positive:
10–30× faster vmaf-tune corpus runs on GPU-equipped hosts; the score axis stops dominating wall-clock.
Strict-mode (--score-backend cuda on a CPU-only host) fails fast with a clear error — surfaces build/runtime mismatches the moment the operator hits them.
Auto-detection means default-on benefit for anyone with a working GPU; no flag-flip required.
Negative:
Adds a (mockable) dependency on nvidia-smi / vulkaninfo / sycl-ls for the detection step. Missing tools degrade gracefully to "backend not available", never to a hard error.
Help-text parsing is fragile — if libvmaf renames --backend or reformats the auto|cpu|cuda|sycl|vulkan line, detection silently reports CPU-only. Mitigated by the unit tests in tests/test_score_backend.py, which pin the parser against a known-good help fragment.
Neutral / follow-ups:
When libvmaf adds a machine-readable --capabilities dump, swap the help parser for that.
Phase B/C (bisect, predict per ADR-0237) inherit the same flag for free since they share score.run_score.

References¶

ADR-0237 — vmaf-tune umbrella.
ADR-0127 — Vulkan compute backend.
ADR-0175 — Vulkan scaffold.
ADR-0186 — Vulkan image-import.
ADR-0214 — cross-backend numerical parity gate.
Source: req — user requested wiring vmaf-tune's scoring step to the existing CUDA/Vulkan/SYCL libvmaf backends, with the explicit hard rules "force-cuda on a host without CUDA must fail with a clear error" and "do not silently fall back to CPU if user explicitly requested GPU".

ADR-0299: GPU scoring backend for vmaf-tune (--score-backend)¶