Research-0044: Quality-aware encode automation (`vmaf-tune`) — option-space digest

Date: 2026-05-02
Companion ADR: ADR-0237
Status: Snapshot at proposal time. Phase A implementation PR(s) supersede the operational details; this digest stays as the "why we picked these axes" reference.

Question

The fork has every quality input a per-title, per-shot, codec-aware encode optimiser would need (VMAF + 8 supporting metrics + tiny-AI fusion regressor + saliency + shot detection + per-shot CRF predictor + FFmpeg patches + codec-aware vocabulary). What's the smallest tool that closes the loop (drives FFmpeg, captures bitrate + quality, recommends parameters) without locking the design to one codec, one search strategy, or one quality target?

Prior art surveyed

Tool	What it does	What we'd borrow	What we'd improve
av1an (Rust)	AV1-only chunked encoder with optional `--target-quality` (VMAF bisect over CRF per chunk). Uses scenedetect for chunking.	The bisect strategy is solid: "encode at midpoint CRF, score, halve interval, repeat 3–5 times". Concretely: target VMAF ± 0.5, ≤ 5 encodes per chunk.	Multi-codec; drop the AV1-only assumption; use our VMAF + ssimulacra2 + lpips ensemble; consume our shot detector instead of scenedetect-py
ab-av1 (Rust)	Single-clip CRF bisect against a target VMAF. AV1-only. No per-shot.	Same bisect shape; simpler than av1an's chunking.	Multi-codec; per-title (predict starting CRF instead of always starting at midpoint)
Netflix Per-Title (paper, 2015)	Complexity-bucket sources, pick CRF per bucket. Shaping the bitrate-quality curve from offline data.	The "predict CRF from source features" model — that's our Phase C.	Use canonical-6 + codec one-hot + resolution + framerate as the source descriptor (we already extract these); skip the manual bucketing
Netflix Dynamic Optimiser (paper, 2018)	Per-shot Lagrangian λ-optimisation across the title; convex-hull pruning of (R, D) candidates.	The convex-hull formulation is the right shape for Phase E (Pareto ABR ladder).	Out of scope for Phase A–C; revisit at Phase E
Bitmovin Per-Title (closed)	Source-classification-based CRF + ladder generation. SaaS.	Confirms the per-title-CRF approach is production-real.	We're open-source, AI-driven, multi-metric. Different deployment shape.
x265 Zone files (`--zones`)	Per-frame-range qp/bitrate/aq override map. Native to x265.	Phase D output format for x265.	None — consume directly
x264 `--qpfile`	Per-frame qp override list.	Phase D output format for x264.	None — consume directly
svt-av1 segment table	AV1 zone-style overrides.	Phase D output format for svt-av1.	None — consume directly
ffmpeg-bitrate-stats (Python)	Parses ffmpeg encode logs for per-frame bitrate.	Reusable parser for the harness's bitrate-extraction step (could vendor or pip-depend).	Add metric extraction (we score post-encode via libvmaf, not from log lines)
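
The convex-hull pruning in the Dynamic Optimiser row can be illustrated with a small sketch. It assumes candidates are plain (bitrate, quality) pairs with higher quality being better; the function name `pareto_hull` and the data layout are hypothetical, and this is a generic upper-hull computation, not the published Netflix algorithm.

```python
def pareto_hull(points):
    """Return (bitrate, quality) points on the upper convex hull, ordered by
    bitrate and truncated once extra bits stop buying extra quality."""
    # Keep only the best quality seen at each bitrate.
    best = {}
    for rate, quality in points:
        if rate not in best or quality > best[rate]:
            best[rate] = quality
    pts = sorted(best.items())

    def cross(o, a, b):
        # > 0: counter-clockwise turn o->a->b; < 0: clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    hull = []
    for p in pts:
        # On an upper hull every turn must be clockwise; pop points below it.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    if not hull:
        return []
    # Drop the tail where bitrate rises but quality does not.
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: peak + 1]
```

Each surviving point is a candidate operating point; everything pruned is either dominated or offers a worse quality-per-bit slope than a neighbour.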

Search-strategy axis

Strategy	Encodes per target	When it wins	Notes
Grid (full sweep)	O(|grid|)	Producing the training corpus for Phase C; one-time per source	Only used in Phase A. Never used at inference.
Coordinate descent	5–15	Tuning a single quality knob (CRF) when other params are fixed	Simple, no ML deps. Phase A baseline strategy.
Bisect (binary search)	4–6	Single-knob target-quality (target VMAF ± 0.5)	Av1an-proven; this is Phase B's primary algorithm.
Bayesian optimisation	8–20	Multi-knob optimisation (CRF × preset × GOP × ref)	Adds `scikit-optimize` dep. Phase B+ optional.
Per-title prediction + bisect refinement	1 (predict) + 2–3 (refine)	Phase C — fastest "given source, hit VMAF=X"	The whole point of training a predictor: skip 2–3 bisect rounds.
Convex-hull / λ-sweep	O(N²) candidates × M shots	Phase E ABR-ladder generation	Netflix Dynamic-Optimiser-style; deferred.

Decision: Phase A ships grid (corpus generation) + bisect (target-quality). Phase C swaps in the predictor as the "first guess" for the bisect. Bayesian opt is opt-in, not the default.
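
A minimal sketch of the bisect strategy, assuming quality falls monotonically as CRF rises. `score_at` is a hypothetical callback standing in for "encode at this CRF, then score with libvmaf"; `first_guess` models a predictor-supplied starting point. The function name, the defaults (tolerance 0.5, at most 6 encodes) and the return shape are illustrative, not the harness's actual API.

```python
from typing import Callable, Optional


def bisect_crf(
    score_at: Callable[[int], float],  # hypothetical: encode at CRF, return VMAF
    target: float,
    lo: int = 0,
    hi: int = 51,
    tolerance: float = 0.5,
    max_encodes: int = 6,
    first_guess: Optional[int] = None,
) -> tuple[int, float, int]:
    """Return (crf, score, encodes_used) for the highest CRF found that meets
    the target. Assumes score_at is non-increasing in CRF."""
    best: Optional[tuple[int, float]] = None
    encodes = 0
    crf = first_guess if first_guess is not None else (lo + hi) // 2
    while lo <= hi and encodes < max_encodes:
        crf = min(max(crf, lo), hi)  # keep the probe inside the interval
        score = score_at(crf)
        encodes += 1
        if score >= target:
            # Target met: remember it, then try a higher CRF (smaller file).
            best = (crf, score)
            if score - target <= tolerance:
                break
            lo = crf + 1
        else:
            # Quality too low: continue among lower CRF values.
            hi = crf - 1
        crf = (lo + hi) // 2
    if best is None:
        raise ValueError("target not reached within the CRF range or encode budget")
    return best[0], best[1], encodes
```

With a synthetic scorer such as `lambda crf: 100.0 - crf`, a good `first_guess` ends the search after one encode. A fuller refinement step would probably also narrow `lo` and `hi` around the prediction so that a near miss still saves rounds.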

Codec-adapter interface

Every codec exposes a different parameter shape: CRF for x264/x265, CQ for libvpx, --crf for svt-av1 and libvvenc, and a rate-λ parameter for neural codecs. The harness must not branch on codec identity in the search loop. Sketch:

class CodecAdapter:
    name: str                       # "libx264", "libx265", "libsvtav1", ...
    quality_knob: str               # "crf", "cq", "lambda", ...
    quality_range: tuple[int, int]  # e.g. (0, 51) for x264
    quality_default: int            # mid-range default
    invert_quality: bool            # higher knob = lower quality? True for CRF/CQ

    def build_command(self, source, output, params: dict) -> list[str]: ...
    def parse_log(self, stderr: str) -> EncodeMetrics: ...   # bitrate, time, frames
    def two_pass_supported(self) -> bool: ...
    def emit_per_shot_overrides(self, shots: list[Shot]) -> str: ...  # qpfile / zone / segments

Phase A wires libx264. Each subsequent codec is a one-file adapter that doesn't touch the harness or search loop.
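
As an illustration of the "one-file adapter" idea, here is a hedged sketch of what a libx264 adapter could look like. Only `build_command` and `two_pass_supported` are shown. The class name `X264Adapter`, the `params` keys and the container-file input are assumptions (raw YUV sources would also need size and pixel-format arguments); the FFmpeg options used (`-y`, `-i`, `-c:v`, `-preset`, `-crf`, `-an`) are standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class X264Adapter:
    """Hypothetical libx264 adapter following the CodecAdapter sketch."""

    name: str = "libx264"
    quality_knob: str = "crf"
    quality_range: tuple[int, int] = (0, 51)
    quality_default: int = 23    # libx264's own default CRF
    invert_quality: bool = True  # higher CRF means lower quality

    def build_command(self, source: str, output: str, params: dict) -> list[str]:
        crf = int(params.get(self.quality_knob, self.quality_default))
        lo, hi = self.quality_range
        if not lo <= crf <= hi:
            raise ValueError(f"crf {crf} outside {self.quality_range}")
        return [
            "ffmpeg", "-y", "-i", source,
            "-c:v", self.name,
            "-preset", str(params.get("preset", "medium")),
            "-crf", str(crf),
            "-an",  # video-only: drop audio streams
            output,
        ]

    def two_pass_supported(self) -> bool:
        return True
```

The search loop would only ever call `adapter.build_command(...)` and read `adapter.quality_range`; it never inspects `adapter.name`, which is what keeps later codecs out of the harness code.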

Codec scope (per popup `Q2` 2026-05-02)

Codec	FFmpeg encoder	Phase A?	Quality knob	Phase D format	Notes
H.264	`libx264`	yes	`-crf`	`--qpfile`	Highest leverage; 100× deployment of any other codec; corpus gen is fastest
H.265 / HEVC	`libx265`	Phase A+1	`-crf`	`--zones`	Mature; second-most-deployed
AV1	`libsvtav1`	Phase A+2	`-crf`	segment table	Fastest AV1 encoder; corpus gen costs 5–10× x264
VP9	`libvpx-vp9`	Phase A+3	`-crf` (CQ mode)	(no per-shot native)	YouTube-scale; fewer per-shot levers
H.266 / VVC	`libvvenc`	Phase A+4	`-crf`	(TBD — VVenC has perceptual-QP overrides)	Newest standard codec; encoders still rapidly evolving
LCEVC (MPEG-5 Part 2)	`liblcevc-eilp` (3rd-party)	Phase A+5	base-codec CRF + enhancement params	(enhancement-layer specific)	Two-codec system; harness needs to compose
EVC (MPEG-5 Part 1)	`libxeve` (3rd-party)	Phase A+5	`-crf`	(TBD)	Royalty-friendly H.265 alternative
AVS3	`libuavs3e` / `libxavs2`	Phase A+5	`-crf`	(TBD)	Chinese standard; deployment is regional
Neural codecs (DCVC family, NVC, CompressAI research models)	(Python, PyTorch)	Phase A+6 (extras)	rate-λ	(no per-shot model)	Not via FFmpeg — research-grade Python encoders. Lives behind `pip install vmaf-tune[neural]` extra. The Bristol VI-Lab 2026 NVC review (`docs/research/0033-bristol-nvc-review-2026.md`) is the landscape map.
JPEG-AI / image neural codecs	(Python)	Out of scope	—	—	Image-only; not a video tool

The phasing reflects one rule: ship a codec adapter only once we can produce an encode corpus for that codec ourselves. libx264 comes first; neural codecs come last because their corpus production is 100–1000× more expensive per encode and their inference is too slow for live use cases.

Training-corpus plan

Per-title CRF predictor (Phase C) and codec-aware FR regressor (ADR-0235, currently BLOCKED) need (source, encode, score) tuples. We can never redistribute third-party encodes; therefore we own the encoder.

Source	Sources we have	Encodes we own	Status
Netflix Public Dataset	9 ref + 70 dis YUVs (37 GB, .workingdir2/netflix/)	We re-encode at Phase A grid	Sources: present locally per memory note 2026-04-27
KoNViD-1k	sources + per-clip MOS	We re-encode at Phase A grid	Sources: CC BY 4.0, available
BVI-DVC (parts A+B+C+D)	sources + per-clip ratings	We re-encode at Phase A grid	Already used for vmaf_tiny_v2
BVI-VC	sources	Optional Phase A+ corpus	TBD

Process: Phase A's harness runs the grid sweep over every source × codec × CRF setting in our codec scope. Output is a parquet schema:

source_path | source_canonical_6 | resolution | framerate | duration_s
codec | preset | crf | extra_params_json
encode_path (gitignored) | encode_size_bytes | encode_time_s
vmaf | ssimulacra2 | lpips_sq | psnr_y | psnr_hvs | cambi
encoder_version | encoder_commit | ffmpeg_version

The encodes themselves stay gitignored under .workingdir2/encodes/ (or a configurable cache dir). Only the parquet ships — and only after a licensing audit confirms the source provenance permits publishing the features + scores (it does for all four corpora).
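
One way to pin the schema down in code is a typed row record. The sketch below mirrors the column list above; the Python types, the `"WxH"` resolution format, the list-of-six shape for `source_canonical_6`, and the derived `bitrate_kbps` property are all assumptions, not the harness's confirmed layout.

```python
from dataclasses import dataclass, fields


@dataclass
class CorpusRow:
    # Source descriptor
    source_path: str
    source_canonical_6: list[float]  # assumed: six feature values per source
    resolution: str                  # assumed "WxH" string
    framerate: float
    duration_s: float
    # Encode settings
    codec: str
    preset: str
    crf: int
    extra_params_json: str
    # Encode outputs
    encode_path: str
    encode_size_bytes: int
    encode_time_s: float
    # Scores
    vmaf: float
    ssimulacra2: float
    lpips_sq: float
    psnr_y: float
    psnr_hvs: float
    cambi: float
    # Provenance
    encoder_version: str
    encoder_commit: str
    ffmpeg_version: str

    @property
    def bitrate_kbps(self) -> float:
        """Average bitrate derived from size and duration (not a stored column)."""
        return self.encode_size_bytes * 8 / 1000 / self.duration_s


# Column order for whatever writer serialises the rows.
COLUMNS = [f.name for f in fields(CorpusRow)]
```

Keeping bitrate as a derived value avoids storing a column that can disagree with `encode_size_bytes` and `duration_s`.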

What we deliberately don't solve in Phase A

Live / latency-sensitive encoding. This whole tool is batch-VOD-shaped. Live transcoding has totally different constraints (encode in real time, no second-pass). Out of scope.
Audio. vmaf-tune is video-only. Audio-quality automation is a separate problem with separate metrics (PESQ, POLQA, ViSQOL).
Quality / bitrate constraints from CDN economics. We produce the (bitrate, quality) curve; deciding the right point on it for a CDN budget is the operator's call, not ours.
Subjective MOS prediction beyond what our metrics give us. Our metrics are the truth surface; we're not building a new MOS model in this tool.
Encoder selection. The user picks the codec; we don't recommend "use AV1 instead of H.264". (A future Phase G could, given a corpus that compares codecs at iso-quality across bitrates, but that's a separate tool.)

Risks the digest flags

Corpus-generation cost: a 60-CRF-value × 4-codec × 100-source grid is 24 000 encodes. Some neural-codec encodes take minutes per clip. Phase A has to ship with a sampling story: stratify by resolution / motion class / codec; don't bulk-encode the full cross product.
Encoder-version drift: pinning encoders is non-trivial because a distribution's ffmpeg package links whatever encoder versions that distribution (e.g. Ubuntu) ships. The harness records exact build IDs; CI uses a vendored / pinned set.
Per-shot CRF conflicts with rate control: x264's --qpfile overrides VBV/CBR rate control. Phase D has to document the interaction and fall back to constant-CRF mode when the user wants per-shot.
Neural-codec adapters add a heavy dependency tree (PyTorch, CUDA, model weights). They live behind an opt-in extra and can never be a hard dependency of vmaf-tune — Phase A+ users who want only x264 should not need PyTorch.
Scope creep: every encode-tuning feature on the planet could be added here. The phasing exists to anchor "Phase A ships standalone, every later phase is gated on the prior corpus existing". Reviewers should reject Phase A→F bundling.
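
The sampling story for the corpus-generation risk can be sketched as a stratified draw over the full cross product. The stratum key (resolution, motion class, codec) follows the text above; the function name, the source-dict keys and the per-stratum quota are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product


def stratified_sample(sources, codecs, crfs, per_stratum, seed=0):
    """Pick at most `per_stratum` (source_path, codec, crf) jobs per stratum.

    A stratum is (resolution, motion_class, codec). `sources` is a list of
    dicts with hypothetical keys "path", "resolution" and "motion_class".
    """
    rng = random.Random(seed)  # seeded so the sampled corpus is reproducible
    strata = defaultdict(list)
    for src, codec, crf in product(sources, codecs, crfs):
        key = (src["resolution"], src["motion_class"], codec)
        strata[key].append((src["path"], codec, crf))
    jobs = []
    for key in sorted(strata):  # fixed iteration order keeps the draw stable
        candidates = strata[key]
        jobs.extend(rng.sample(candidates, min(per_stratum, len(candidates))))
    return jobs
```

For the 60 × 4 × 100 example, two resolutions and two motion classes give 16 strata; a quota of 50 per stratum cuts the 24 000-encode cross product to 800 jobs while still covering every (resolution, motion class, codec) combination.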

Decision implications for the ADR

Multi-codec from day one (codec-adapter interface — designed in Phase A, only libx264 wired).
Tools tree (tools/vmaf-tune/), C + Python hybrid.
Phase A standalone (~1 week); A→F roadmap with hard gates.
Per-title predictor (Phase C) gated on Phase A corpus.
Per-shot (Phase D) gated on T6-3b landing.
Neural-codec adapters gated on extras + their own corpus track.

References

av1an: https://github.com/master-of-zen/Av1an.
ab-av1: https://github.com/alexheretic/ab-av1.
Netflix Per-Title (2015): https://netflixtechblog.com/per-title-encode-optimization-7e99442b62a2.
Netflix Dynamic Optimiser (2018): Katsavounidis 2018, "Dynamic optimizer — a perceptual video encoding optimization framework".
Bitmovin Per-Title: https://bitmovin.com/blog/per-title-encoding/.
ffmpeg --qpfile: https://trac.ffmpeg.org/wiki/Encode/H.264#FAQ.
x265 zones: https://x265.readthedocs.io/en/master/cli.html#cmdoption-zones.
svt-av1 segments: https://gitlab.com/AOMediaCodec/SVT-AV1/-/blob/master/Docs/Parameters.md.
CompressAI research catalog: https://github.com/InterDigitalInc/CompressAI.
Bristol VI-Lab 2026 NVC review: docs/research/0033-bristol-nvc-review-2026.md.
ADR-0235 (codec-aware FR regressor): codec one-hot vocabulary the per-title predictor will inherit.
ADR-0223 (TransNet V2 shot detector): Phase D shot input.
T6-3b backlog: per-shot CRF predictor, Phase D model.
