ADR-0237: Quality-aware encode automation surface (`vmaf-tune`)

Status: Accepted (Phase A only; Phases B–F remain Proposed)
Date: 2026-05-02 (Proposed); 2026-05-03 (Phase A acceptance)
Deciders: Lusoris
Tags: tooling, ai, ffmpeg, codec, automation, fork-local

Context

The fork has built a deep quality-measurement stack — VMAF + tiny-AI fusion regressors (vmaf_tiny_v2, fr_regressor_v1), no-reference metrics (nr_metric_v1), perceptual extractors (LPIPS, SSIMULACRA 2, CIEDE2000, CAMBI, psnr_hvs), pre/post filters (vmaf_pre, learned_filter_v1), saliency (MobileSal placeholder), shot detection (TransNet V2 placeholder), and a per-shot CRF predictor (T6-3b in flight). It also ships first-class FFmpeg integration (six in-tree patches against n8.1, with CPU / CUDA / Vulkan filters) and a codec-aware FR regressor surface (ADR-0235 codec collision: ai/src/vmaf_train/codec.py, six-bucket codec one-hot, training BLOCKED on corpus).

What the fork does not ship is the action loop: nothing drives the encoder. Every metric we compute is on someone else's encode. The natural next layer — for an opinionated fork that already ships every quality input a per-title / per-shot / per-codec optimiser would need — is a quality-aware encode automation tool that closes the loop: given a source and a quality target (or a bitrate budget, or a Pareto request), drive FFmpeg to find the encoding parameters that hit it.

The user-facing framing is: the fork becomes a "quality + codec parameterisation automation tool" on top of the canonical Netflix VMAF reference numbers, with vmaf-tune as the integration point. This ADR pins the scope and phase ordering before any code lands, since the implementation surface is large enough to grow unbounded if not constrained up front.

Decision

We will ship tools/vmaf-tune/ as a new fork-local automation surface. It is a hybrid C + Python tool (same shape as the existing vmaf-perShot binary), built via Meson alongside the rest of the libvmaf tree. The tool exposes one harness layer (drive FFmpeg with parameter grids, capture bitrate + decode score-via-libvmaf), one search layer (target-quality bisect / Bayesian / Pareto), and one selector layer (pre-trained per-title and per-shot CRF predictors with codec-aware conditioning).
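As a rough illustration of the harness layer's "parameter grid" idea, the sketch below expands a (preset, CRF) grid into FFmpeg argument lists. The `EncodeJob` class and `expand_grid` helper are hypothetical names chosen for this example, not the tool's actual API; only the FFmpeg options used (`-i`, `-c:v`, `-preset`, `-crf`, `-an`, `-y`) are standard. Nothing is executed here; the sketch only builds the command lines.

```python
import itertools
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class EncodeJob:
    """One cell of the parameter grid (hypothetical structure)."""
    source: str
    encoder: str
    preset: str
    crf: int

    def ffmpeg_argv(self, output: str) -> List[str]:
        # Build the argument list only; the caller decides how to run it.
        return [
            "ffmpeg", "-y", "-i", self.source,
            "-c:v", self.encoder,
            "-preset", self.preset,
            "-crf", str(self.crf),
            "-an", output,
        ]


def expand_grid(source: str, encoder: str,
                presets: Sequence[str], crfs: Sequence[int]) -> Iterator[EncodeJob]:
    """Yield one EncodeJob per (preset, crf) combination."""
    for preset, crf in itertools.product(presets, crfs):
        yield EncodeJob(source, encoder, preset, crf)


jobs = list(expand_grid("src.y4m", "libx264", ["fast", "medium"], [20, 24, 28]))
print(len(jobs))
print(" ".join(jobs[0].ffmpeg_argv("out.mp4")))
```

A real harness would additionally run each command, measure the output bitrate, and score the decode through libvmaf; those steps are left out because their shape is not fixed by this ADR.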

The codec scope is multi-codec from day one — libx264, libx265, libsvtav1, libvpx-vp9, libvvenc, plus neural-codec adapters (DCVC family / CompressAI research models / NVC) and emerging codecs (LCEVC, EVC, AV2 when ffmpeg gains support) — but each codec is gated behind a thin codec adapter interface so we never special-case the search loop on codec identity. Phase A ships the harness against libx264 only; adapters are added one-per-PR as the underlying corpora exist.
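One way to keep the search loop free of codec special cases is a structural interface that each adapter satisfies. The sketch below is a minimal illustration under that assumption; `CodecAdapter`, `X264Adapter`, `Vp9Adapter`, and `build_argv` are hypothetical names, and the real interface is the one sketched in Research-0044. The CRF ranges and the `-b:v 0` requirement for libvpx-vp9 constant-quality mode follow FFmpeg's documented behaviour.

```python
from typing import List, Protocol, Tuple


class CodecAdapter(Protocol):
    """Hypothetical adapter contract: the search loop sees only this."""
    name: str                       # FFmpeg encoder name
    quality_range: Tuple[int, int]  # inclusive bounds of the quality knob

    def quality_args(self, value: int) -> List[str]: ...


class X264Adapter:
    name = "libx264"
    quality_range = (0, 51)  # 8-bit libx264 CRF range

    def quality_args(self, value: int) -> List[str]:
        lo, hi = self.quality_range
        if not lo <= value <= hi:
            raise ValueError(f"CRF {value} outside [{lo}, {hi}]")
        return ["-crf", str(value)]


class Vp9Adapter:
    name = "libvpx-vp9"
    quality_range = (0, 63)

    def quality_args(self, value: int) -> List[str]:
        lo, hi = self.quality_range
        if not lo <= value <= hi:
            raise ValueError(f"CRF {value} outside [{lo}, {hi}]")
        # Constant-quality mode in libvpx-vp9 needs -b:v 0 alongside -crf.
        return ["-crf", str(value), "-b:v", "0"]


def build_argv(adapter: CodecAdapter, source: str, value: int, output: str) -> List[str]:
    """Codec-agnostic command builder: no branching on codec identity."""
    return ["ffmpeg", "-y", "-i", source, "-c:v", adapter.name,
            *adapter.quality_args(value), output]


print(build_argv(X264Adapter(), "src.y4m", 23, "out.mp4"))
print(build_argv(Vp9Adapter(), "src.y4m", 31, "out.webm"))
```

The point of the design is that `build_argv` never inspects `adapter.name` to decide what to do; adding a codec means adding a class, not editing the loop.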

We will land this as an ADR-Proposed-only PR for now (no code), with Research-0044 as the option-space digest. Phase A implementation lands as a separate PR gated on the user greenlighting the design + corpus plan in this ADR.

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Hybrid C + Python under `tools/vmaf-tune/` (chosen) | Matches existing fork tools (`vmaf-perShot`, `vmaf_roi`); harness can call libvmaf in-process via the C API for speed; Python wraps FFmpeg + search + AI inference; meson installs alongside `vmaf` | Two-language seam adds complexity; build-system surface grows | Picked per `req` (popup Q3 chose `tools/vmaf-tune/`) and matches the existing pattern; the C-side avoids a process-boundary penalty when the harness scores thousands of encodes |
| Pure Python under `ai/automation/` | Fastest iteration; reuses `ai/`'s `pyproject.toml`; ships as console script | Process-boundary cost on every score (spawn `vmaf` binary per encode); doesn't compose with libvmaf's preallocation API; mixing automation + training under `ai/` muddies the separation | Rejected: the harness is an installable binary, not a research script; wrong tree |
| Top-level `automation/` subtree | Signals "new product surface"; clean for growth into multiple binaries | New top-level dir for one tool initially; no integration with existing meson build | Rejected: premature; revisit if `vmaf-tune` grows multiple sibling binaries |
| MCP-server-only (`mcp-server/vmaf-mcp/` + new tools) | Agent-driven from day one; reuses the JSON-RPC surface | Forces non-agent users through MCP; heavier dep on MCP runtime; not a CLI tool | Rejected: orthogonal; MCP wiring is Phase F, layered on top of the CLI tool, not in place of it |
| Defer entirely; encode automation lives downstream | Smaller fork surface; lets others build on libvmaf | Wastes the per-shot / saliency / codec-aware infrastructure that's already 70% of the work; nobody else has the integrated quality stack | Rejected: user explicitly asked for the fork to take this scope (`req`) |
| Single codec at a time, multi-codec deferred | Faster Phase A ship; less corpus burden | Codec adapter interface is harder to retrofit than to design up front; codec-aware FR regressor (ADR-0235) already commits to multi-codec | Rejected: multi-codec is the whole point; ship the adapter interface in Phase A even if only x264 is wired |
| AV1-only (mirror av1an's scope) | Minimum viable; av1an proves the bisect strategy works | x264 has 100× the deployment leverage; encoding x264 is 10–100× faster than AV1 so the corpus generation is cheaper | Rejected: x264 first, AV1 follows |

Consequences

Positive:
The fork's quality stack (every metric + tiny-AI model + per-shot infra) finally has a consumer that exercises it end-to-end — the strongest dogfood we'll get for the metric pipeline.
Closes the loop on the codec-aware FR regressor (ADR-0235 collision): Phase A's harness produces the corpus that unblocks its training.
Per-shot CRF predictor (T6-3b) gains a downstream consumer that turns its predictions into actual encodes.
MCP integration in Phase F makes the whole automation surface agent-callable — an agent can "encode this clip at VMAF=93, using x265, with a 2-second-GOP constraint" via tool calls.
Differentiates the fork against upstream Netflix/vmaf without forking the metric definitions (Netflix golden gate stays intact; this is a purely additive tool tree).
Negative:
New 6-month-class workstream. Phase A alone is ~1 week; the full A→F path is ~2–3 months at the user's typical cadence. Risk of half-finished phases is real.
Encoder version coupling: x264 / x265 / svt-av1 / libvpx / libvvenc default values shift between versions. The harness has to capture encoder build + commit + version into every parquet row, and CI has to pin a known encoder set.
Training corpus we own: per-title / per-shot CRF predictors need (source, encode, score) tuples. We have Netflix Public NViD + BVI-DVC sources; the encodes have to come from Phase A's harness because we can't redistribute third-party encodes. This makes Phase A a hard prerequisite for every AI phase.
Neural-codec adapters (DCVC, NVC, CompressAI) are research-grade, Python-only, depend on heavy ML stacks (PyTorch, CUDA). They will live behind an opt-in vmaf-tune extra and never block the traditional-codec path.
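The encoder-version coupling noted above suggests that a corpus row should be rejected outright when its encoder provenance is missing. The sketch below shows one possible shape for such a row using only the standard library; every field name and value is an assumption made for illustration, since the actual Parquet schema is defined by Phase A, not here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CorpusRow:
    """Hypothetical (params, bitrate, metrics) row; field names are assumptions."""
    source: str
    encoder: str
    encoder_version: str  # build / commit / version string reported by the encoder
    preset: str
    crf: int
    bitrate_kbps: float
    vmaf: float

    def __post_init__(self) -> None:
        # Refuse rows without provenance: scores from an unknown encoder
        # build cannot be compared across versions.
        if not self.encoder_version.strip():
            raise ValueError("encoder_version must be recorded for every row")


# Illustrative values only, not measured data.
row = CorpusRow("clip.y4m", "libx264", "example-version-string",
                "medium", 23, 1850.0, 93.2)
record = asdict(row)  # plain dict, ready for a columnar writer
print(sorted(record))
```

Making the version field mandatory at construction time means a harness bug that drops it fails loudly at write time instead of silently polluting the training corpus.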
Neutral / follow-ups:
Research-0044 ships in the same PR as this ADR — the option-space digest covering encode-search strategies (grid / coordinate descent / Bayesian / bisect), prior art (av1an, ab-av1, Netflix per-title, Bitmovin Per-Title), training-corpus plan, and the codec-adapter interface sketch.
Phase A (encode harness MVP, ~1 week, x264-only) ships standalone: the vmaf-tune binary plus a Parquet schema for the (params, bitrate, metrics) corpus. It is useful by itself.
Phase B (target-VMAF bisect, ~3 days) ports av1an-style binary search across our metric set.
Phase C (per-title CRF predictor, ~1 week, gated on Phase A producing corpus) trains a small regressor: source canonical-6 + codec one-hot + resolution + framerate → CRF for target VMAF.
Phase D (per-shot dynamic CRF, gated on T6-3b landing) consumes per-shot CRF predictions, emits --qpfile / x265 zone files, drives 2-pass encode.
Phase E (Pareto ABR ladder, ~1 week) per-title across resolutions; emits a manifest in DASH/HLS-friendly shape.
Phase F (MCP tools, ~3 days) encode_search, recommend_crf, generate_ladder exposed via vmaf-mcp.
New docs surfaces: docs/usage/vmaf-tune.md, docs/ai/models/per_title_crf.md, ffmpeg recipe additions in docs/usage/ffmpeg.md.
Codec adapter interface lives at tools/vmaf-tune/codec_adapters/<codec>.py — one file per codec, must declare its parameter space, its quality knob (CRF / CQ / qmin-qmax / λ-control), its 2-pass shape, and its log-parsing for emitted bitrate.
Test-data licensing audit: any sources we encode for the training corpus stay under their original licence; the encodes are fork-generated and gitignored — only the parquet of features + scores ships.
This ADR will be split into per-phase ADRs as each phase lands — ADR-0237 stays the umbrella; ADR-0237a (Phase A harness), 0237b (bisect), etc. will be linked from here.
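The target-VMAF bisect planned for Phase B can be sketched as an integer binary search over CRF. This is a minimal illustration, not av1an's implementation or the planned one: it assumes the score never increases as CRF rises, and `bisect_crf` and `fake_vmaf` are hypothetical names. In practice `score_at` would run an encode and score it through libvmaf; here a synthetic linear model stands in so the example runs on its own.

```python
from typing import Callable, Optional


def bisect_crf(score_at: Callable[[int], float], target: float,
               lo: int = 0, hi: int = 51) -> Optional[int]:
    """Return the largest CRF in [lo, hi] whose score is >= target, or None.

    Assumes score_at is monotonically non-increasing in CRF.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if score_at(mid) >= target:
            best = mid    # quality is sufficient: try a higher CRF (smaller file)
            lo = mid + 1
        else:
            hi = mid - 1  # quality too low: move to a lower CRF
    return best


def fake_vmaf(crf: int) -> float:
    """Synthetic stand-in for encode + score; not a real quality model."""
    return 100.0 - 0.5 * crf


print(bisect_crf(fake_vmaf, 93.0))   # 14
print(bisect_crf(fake_vmaf, 101.0))  # None: target unreachable in range
```

Each probe costs one full encode plus one scoring pass, so the search needs at most about six probes for the 0 to 51 range, compared with 52 for an exhaustive sweep.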

References

Source: req 2026-05-02 — "what if we change the scope to the fork that of course netflix is always the vmaf number truth but our fork will be a full quality metric and codec parametring automation tool/ai? like in combination with ffmpeg of course" (paraphrased per CLAUDE.md user-quote rule).
Popup Q1 2026-05-02: scope = Just write the spec / RFC for now.
Popup Q2 2026-05-02: codecs = x264 + x265 + AV1 + VP9 + that ai codec and more modern codecs of course (translates to the multi-codec phasing in this ADR).
Popup Q3 2026-05-02: location = tools/vmaf-tune/ (C + Python mix, like vmaf-perShot).
Research-0044 — option-space + prior art + corpus plan + codec-adapter interface.
ADR-0235 — codec-aware FR regressor v2 (training BLOCKED, corpus produced by Phase A).
ADR-0223 — shot detection for Phase D.
T6-3b backlog — per-shot CRF predictor.
ADR-0186 — Vulkan zero-copy import path consumed by vf_libvmaf_vulkan; the harness will use this when scoring on the encode side to keep CPU↔GPU traffic off the hot path.
Prior art surveyed in Research-0044: av1an, ab-av1, Netflix per-title (2015 paper + dynamic optimiser), Bitmovin Per-Title, qpfile / x265 zone files / svt-av1 segment table, av1an --target-quality, ffmpeg-bitrate-stats.
Bristol VI-Lab 2026 NVC review (docs/research/0033-bristol-nvc-review-2026.md) — neural codec landscape, informs the neural-codec adapter scoping.
