ADR-0397: vmaf-tune Phase F — auto adaptive recipe-aware tuning
- Status: Accepted
- Date: 2026-05-08
- Deciders: Lusoris
- Tags: tooling, automation, vmaf-tune, ffmpeg, codec, fork-local
Context
Phases A through E of vmaf-tune ship as eight standalone CLI subcommands (corpus, recommend, fast, predict, tune-per-shot, recommend-saliency, ladder, compare) plus three orthogonal modes (HDR auto-detect via ADR-0300, sample-clip via ADR-0301, and resolution-aware model selection via ADR-0289). The operator-facing question — "give me the cheapest encode that meets target VMAF for this content" — currently requires the operator to compose roughly eight phases manually, ≈ 5–6 hours of wall-clock for a 2-hour 1080p title (see Research-0067 cost table). The user's 2026-05-08 vision text (paraphrased: "the real long-term potential is building an adaptive encoding ecosystem around community-generated training data, perceptual analysis and continual model improvement") frames Phase F as the first composition layer that exposes this ecosystem behind a single CLI verb.
Research-0067 walks the cost model, the seven short-circuit cases, and the four failure modes; it concludes that a deterministic decision tree hits the explainability + reproducibility floor the fork requires without sacrificing the wall-time savings a learned policy would deliver.
Decision
We will ship vmaf-tune auto as the Phase F entry point, implemented as a deterministic decision tree in tools/vmaf-tune/src/vmaftune/auto.py that composes the existing phase subcommands sequentially. The tree is hand-coded (no learned policy at runtime), every branch maps to an existing per-phase ADR contract, and the full tree fits within a 30-line pseudocode specification (see Research-0067 §"Phase F decision tree"). The phased rollout below splits the work into four follow-up PRs so each slice ships with its own validation:
- F.0 — this ADR (design only). No code. Establishes the decision tree, short-circuit list, escalation policy, and the rule that Phase F never invents new sub-phases.
- F.1 — scaffolded vmaf-tune auto (tools/vmaf-tune/src/vmaftune/auto.py). Sequential composition of the existing subcommands; no short-circuits, no escalation. The CLI flags --src, --target-vmaf, --max-budget-bitrate, --allow-codecs, --codec, --smoke are stable from this PR forward. --smoke exercises the composition end-to-end with mocked sub-phases (no ffmpeg, no ONNX); production wiring lands in F.2.
- F.2 — short-circuit logic. The seven cases from Research-0067 §"When Phase F should short-circuit" become conditional branches: single-rung ladder when source < 2160p; codec known; predictor verdict GOSPEL; short / low-variance source skips Phase D; non-animation / non-screen-content skips saliency; SDR source skips HDR pipeline; sample-clip propagates to internal sweeps.
- F.3 — confidence-aware fallbacks. When Phase C's predictor returns FALL_BACK on a (rung, codec) cell, escalate only that cell to recommend.coarse_to_fine. The escalation is per-cell, bounded, and logged. GOSPEL and LIKELY verdicts skip the escalation. Encoder-ROI / saliency-binary missing degrades to a warning, never aborts.
- F.4 — per-content-type recipe overrides. Auto-detect animation / live-action / screen-content via a fork-local classifier (TransNet V2 is already present and exposes a shot-cut histogram that correlates well with the three classes; fallback heuristics in auto.py cover the no-classifier case). The class drives saliency / preset / per-shot defaults; users override via explicit flags.
The tree is the v1 surface; learned policy stays a research follow-up after F.1–F.4 have produced enough labelled compositions to seed a future supervised baseline. No closed-source ML services, no Internet calls during encode, no runtime learned-policy inference — Phase F is integration, not invention.
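As a rough illustration of the F.1 slice, sequential composition with mocked sub-phases could look like the sketch below. The names AutoPlan, run_auto, STAGE_ORDER, and the stage labels are hypothetical stand-ins and not the contents of auto.py; only the idea (a fixed stage order, with smoke mode swapping in mocks) is taken from this ADR.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stage labels; the real phase wiring lives in auto.py.
STAGE_ORDER: List[str] = ["probe", "predict", "recommend", "ladder", "encode"]


@dataclass
class AutoPlan:
    """Illustrative plan record; the field names are assumptions."""
    src: str
    target_vmaf: float
    stages_run: List[str] = field(default_factory=list)


Stage = Callable[[AutoPlan], None]


def _mock_stage(name: str) -> Stage:
    """Build a stand-in stage that only records that it ran (no ffmpeg, no ONNX)."""
    def stage(plan: AutoPlan) -> None:
        plan.stages_run.append(name)
    return stage


def run_auto(src: str, target_vmaf: float, stages: Dict[str, Stage]) -> AutoPlan:
    """Run every stage in a fixed order so repeated runs take the same path."""
    plan = AutoPlan(src=src, target_vmaf=target_vmaf)
    for name in STAGE_ORDER:
        stages[name](plan)
    return plan


# Smoke mode: every stage is mocked, so only the composition itself is exercised.
smoke_stages = {name: _mock_stage(name) for name in STAGE_ORDER}
smoke_plan = run_auto("input.mkv", 93.0, smoke_stages)
```

Because the stage order is a plain list rather than a learned choice, a test can assert the exact sequence that ran.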
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Deterministic decision tree (chosen) | Explainable; reproducible across runs; every branch maps to an existing ADR contract; testable with mocked sub-phases; no runtime ML dependency on the auto path. | Hard-codes priorities; new sub-phases require tree edits. | Picked: matches the fork's "no learned policy at runtime" constraint, the user's explainability requirement, and the per-phase contract carve-outs in ADR-0237 / ADR-0276 / ADR-0295. |
| Pure-grid composition (today's manual workflow) | Zero new code; fully reproducible. | 8-step manual composition; ≈ 5–6 h wall-clock for a typical 2-hour movie; high operator-error rate; the colleague's "day per movie" pain point. | Rejected: the cost is exactly why Phase F is a backlog item. |
| Optuna over the full composition space | Strong optimum; reuses Phase A.5 search infrastructure. | Per-source TPE warm-up cost; no closed-form way to express "skip Phase D when source is short"; opaque to operators ("why did it pick x265?"); too few independent samples per source for Bayesian search to beat a hand-tuned tree. | Rejected: search-over-recipes is the wrong model for a discrete composition problem with operator-explainability requirements. |
| Learned policy (RL or supervised over a Phase A corpus) | Adapts to corpus drift; aligns with the long-term "continual model improvement" arm of the user's vision text. | Requires a labelled "this composition was right" dataset that doesn't yet exist; runtime inference adds an ONNX dependency to the auto path; reproducibility suffers (model drift between runs); violates the fork's "no learned-policy at runtime" constraint. | Rejected for v1; revisit as a research experiment once the deterministic tree has emitted a labelled corpus of recipe choices. |
| One mega-subcommand replacing all phases | Single operator surface. | Breaks every existing per-phase contract; downstream consumers (CI, MCP server, the FFmpeg patch series) lose stable per-phase hooks; ADR-0237 / ADR-0276 / ADR-0295 explicitly carve those contracts. | Rejected: the carve-outs are load-bearing. |
| Ship auto as a thin shell-script wrapper | No new Python module. | No way to express the short-circuit conditions or the FALL_BACK escalation cleanly in bash; smoke testing harder; operator-error surface (quoting, env vars) larger. | Rejected: the harness is Python; auto.py keeps the contract testable. |
Consequences
- Positive:
  - Operators get a single CLI verb (vmaf-tune auto) for the common "encode at target VMAF" workflow; the eight-step manual composition collapses to one invocation.
  - Wall-clock floor on short-circuit-eligible content (1080p SDR photographic, single-codec, GOSPEL predictor) is the final encode itself — no redundant sweeps.
  - Every branch is testable in isolation; F.1 ships with mocked sub-phases for end-to-end smoke coverage.
  - Phase F's deterministic-tree footprint provides a labelled audit trail (which short-circuits fired, which fallbacks escalated) that a future learned-policy study can train on.
- Negative:
  - The decision tree is hand-tuned; recipe drift (new codec, new preset) requires tree edits, not just a re-trained model.
  - Adding a new sub-phase (F.5 onwards) means tree-and-test edits, not configuration-only changes.
  - Operator must still understand what Phase F chose — the explainability surface (per-cell verdicts, escalation log) is a new doc burden.
- Neutral / follow-ups:
  - F.1 PR ships tools/vmaf-tune/src/vmaftune/auto.py, tests/test_auto.py, and the --smoke mode. Adds the auto subcommand to cli.py.
  - F.1 PR adds a docs page under docs/usage/vmaf-tune.md documenting the auto flags and the decision tree (the doc must reproduce the pseudocode block from Research-0067 so operators can predict what auto will do without reading the code).
  - F.2 / F.3 / F.4 are sibling PRs gated on F.1 landing; each extends auto.py with smoke and integration tests.
  - Future: when a labelled-composition corpus exists, evaluate a supervised classifier as a recipe selector; the deterministic tree stays as the canonical fallback per the no-runtime-ML constraint.
  - Adds rebase-notes entry: Phase F ties together the per-phase contracts; future upstream syncs that touch any one phase must re-validate the tree.
References
- ADR-0237 — vmaf-tune umbrella decision (Phase A scaffold).
- ADR-0276 fast — Phase A.5 proxy + Bayesian.
- ADR-0276 phase-d — Phase D per-shot scaffold.
- ADR-0289 — resolution-aware model selection.
- ADR-0293 — saliency-aware ROI tuning.
- ADR-0295 — Phase E per-title ABR ladder.
- ADR-0300 — HDR-aware encoding + scoring.
- ADR-0301 — sample-clip mode.
- ADR-0306 — coarse-to-fine CRF search.
- Research-0060 — Phase A.5 cost model (parent).
- Research-0067 — Phase F feasibility (companion digest, this ADR).
- Source: req — paraphrased user vision text from the 2026-05-08 ChatGPT exchange ("the real long-term potential is building an adaptive encoding ecosystem around community-generated training data, perceptual analysis and continual model improvement"); the English-translated paraphrase lives in this ADR's Context to satisfy the user-quote-handling rule for non-References sections.
Status update 2026-05-08: F.2 short-circuits landed
F.1 sequential scaffold + F.2 short-circuits ship together in one PR (the bigger-content path the user prefers over per-LOC PRs). tools/vmaf-tune/src/vmaftune/auto.py exposes the seven _should_short_circuit_<N> predicates as standalone helpers; each fires its corresponding stage-skip and records the firing in plan.metadata.short_circuits. The Phase D 5-min / 0.15-shot-variance thresholds ship as constants (PHASE_D_DURATION_GATE_S, PHASE_D_SHOT_VARIANCE_GATE) pending F.3 empirical fit. F.3 (per-cell recommend.coarse_to_fine escalation on FALL_BACK) and F.4 (per-content-type recipe overrides) remain deferred per the original phased rollout.
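The predicate-plus-audit-trail pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: the SourceMeta fields, the Plan shape, the predicate labels, and the skipped-stage names are hypothetical, and treating "short / low-variance" as an either-or condition is a guess. Only the two constant names and their 5-min / 0.15 values come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Constant names and values are the ones quoted in the status update.
PHASE_D_DURATION_GATE_S = 300.0
PHASE_D_SHOT_VARIANCE_GATE = 0.15


@dataclass
class SourceMeta:
    """Illustrative subset of source metadata; field names are assumptions."""
    duration_s: float
    shot_variance: float
    is_hdr: bool


@dataclass
class Plan:
    skipped_stages: List[str] = field(default_factory=list)
    metadata: Dict[str, List[str]] = field(
        default_factory=lambda: {"short_circuits": []}
    )


def skip_phase_d(meta: SourceMeta) -> bool:
    # Assumption: either a short source or low shot variance is enough to skip.
    return (meta.duration_s < PHASE_D_DURATION_GATE_S
            or meta.shot_variance < PHASE_D_SHOT_VARIANCE_GATE)


def skip_hdr(meta: SourceMeta) -> bool:
    return not meta.is_hdr


# (label, predicate, stage skipped when the predicate fires)
PREDICATES: List[Tuple[str, Callable[[SourceMeta], bool], str]] = [
    ("short-or-low-variance", skip_phase_d, "tune-per-shot"),
    ("sdr-source", skip_hdr, "hdr-pipeline"),
]


def apply_short_circuits(meta: SourceMeta, plan: Plan) -> Plan:
    """Evaluate each predicate in order and record every firing."""
    for label, predicate, stage in PREDICATES:
        if predicate(meta):
            plan.skipped_stages.append(stage)
            plan.metadata["short_circuits"].append(label)
    return plan


plan = apply_short_circuits(SourceMeta(120.0, 0.40, False), Plan())
```

Keeping each predicate a standalone function means it can be unit-tested without building a full plan.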
Status update 2026-05-08: F.3 confidence-aware fallbacks landed
F.3 ships _confidence_aware_escalation(verdict, interval, thresholds) in tools/vmaf-tune/src/vmaftune/auto.py, returning one of SKIP_ESCALATION / RECOMMEND_ESCALATION / FORCE_ESCALATION. The helper consumes the conformal interval width emitted by Predictor.predict_vmaf_with_uncertainty (ADR-0279) at every (rung, codec) cell. Two width gates carve the half-width axis into three regions; tight_interval_max_width overrides FALL_BACK to skip, wide_interval_min_width overrides GOSPEL to escalate, the middle band defers to the F.2 verdict. Thresholds load from a calibration JSON sidecar via load_confidence_thresholds; missing-sidecar paths fall back to the documented Research-0067 defaults (2.0 / 5.0 VMAF) and emit a one-line warning. Per-cell decisions are recorded in plan.metadata.confidence_aware_escalations[]. F.4 (per-content-type recipe overrides) remains deferred.
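A minimal sketch of the three-region logic, assuming inclusive gate boundaries and assuming that the middle band maps FALL_BACK to RECOMMEND_ESCALATION and every other verdict to SKIP_ESCALATION. The function names decide_escalation and load_thresholds and the JSON key names are hypothetical; the real helper's signature and sidecar schema may differ.

```python
import json
import warnings
from dataclasses import dataclass
from pathlib import Path

SKIP_ESCALATION = "SKIP_ESCALATION"
RECOMMEND_ESCALATION = "RECOMMEND_ESCALATION"
FORCE_ESCALATION = "FORCE_ESCALATION"


@dataclass(frozen=True)
class ConfidenceThresholds:
    # Defaults are the Research-0067 values quoted above (VMAF points).
    tight_interval_max_width: float = 2.0
    wide_interval_min_width: float = 5.0


def load_thresholds(path: Path) -> ConfidenceThresholds:
    """Read a JSON sidecar; fall back to defaults with a warning if it is absent."""
    if not path.is_file():
        warnings.warn(f"no calibration sidecar at {path}; using defaults")
        return ConfidenceThresholds()
    data = json.loads(path.read_text(encoding="utf-8"))
    # Key names are assumed to mirror the threshold names in the text.
    return ConfidenceThresholds(
        float(data["tight_interval_max_width"]),
        float(data["wide_interval_min_width"]),
    )


def decide_escalation(verdict: str, width: float, t: ConfidenceThresholds) -> str:
    if width <= t.tight_interval_max_width:
        return SKIP_ESCALATION  # tight interval overrides even FALL_BACK
    if width >= t.wide_interval_min_width:
        return FORCE_ESCALATION  # wide interval overrides even GOSPEL
    # Middle band: defer to the predictor verdict (assumed mapping).
    return RECOMMEND_ESCALATION if verdict == "FALL_BACK" else SKIP_ESCALATION
```

With the default gates, a FALL_BACK cell with width 1.0 is skipped, a GOSPEL cell with width 6.0 is forced to escalate, and a FALL_BACK cell with width 3.0 gets a recommended escalation.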
Status update 2026-05-09: F.4 recipes landed
F.4 ships _apply_recipe_override(meta, plan_state, thresholds) in tools/vmaf-tune/src/vmaftune/auto.py, returning a (recipe_class, recipe, effective_thresholds) triple. The recipe table at module scope (_CONTENT_RECIPE_TABLE) holds factory callables for the four named classes (animation, screen_content, live_action_hdr, ugc) plus the empty default; each call returns a fresh override dict so mutations cannot leak between runs. Recipes fire before the F.2 short-circuits evaluate so a recipe can flip force_single_rung and have the ladder stage honour it. The four override keys consumed by the driver are tight_interval_max_width, force_single_rung, saliency_intensity, and target_vmaf_offset. Per memory feedback_no_test_weakening, target_vmaf_offset shifts only the predictor's effective target; the input --target-vmaf (production-flip gate) is preserved verbatim in plan.metadata.target_vmaf while the offset target lands in plan.metadata.effective_predictor_target_vmaf. Every threshold value shipped here is [provisional, calibrate against real corpus in F.5]. F.5 closes the calibration loop once F.4 has emitted enough labelled recipe applications to fit the placeholders empirically. Phase F phased rollout is now complete (F.0 design + F.1+F.2 + F.3 + F.4); F.5 calibration is a follow-up backlog item, not a Phase F gate.
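The two properties described above (factories returning fresh dicts, and the offset shifting only the predictor's effective target) can be illustrated with a small sketch. RECIPE_TABLE and apply_recipe are hypothetical simplifications of _CONTENT_RECIPE_TABLE and _apply_recipe_override, the table holds only two of the four classes, and the numeric values are illustrative.

```python
from typing import Any, Callable, Dict, Tuple

Recipe = Dict[str, Any]

# Factory callables: each call builds a fresh dict, so a caller that mutates
# one recipe cannot affect later runs. Values are illustrative only.
RECIPE_TABLE: Dict[str, Callable[[], Recipe]] = {
    "animation": lambda: {"force_single_rung": True, "target_vmaf_offset": 2.0},
    "ugc": lambda: {"force_single_rung": False, "target_vmaf_offset": 1.5},
    "default": lambda: {},
}


def apply_recipe(content_class: str, target_vmaf: float) -> Tuple[Recipe, Dict[str, float]]:
    """Return the recipe plus metadata that keeps the input target untouched."""
    factory = RECIPE_TABLE.get(content_class, RECIPE_TABLE["default"])
    recipe = factory()
    offset = float(recipe.get("target_vmaf_offset", 0.0))
    metadata = {
        # The operator's --target-vmaf is preserved verbatim.
        "target_vmaf": target_vmaf,
        # Only the predictor sees the shifted target.
        "effective_predictor_target_vmaf": target_vmaf + offset,
    }
    return recipe, metadata


recipe, meta = apply_recipe("animation", 93.0)
```

An unknown class falls through to the empty default recipe, so the effective target equals the input target in that case.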
Status update 2026-05-09: F.5 calibrated
F.5 ships ai/scripts/calibrate_phase_f_recipes.py and the calibrated override JSON at ai/data/phase_f_recipes_calibrated.json. The calibration was run against the K150K corpus (.workingdir2/konvid-150k/konvid_150k.jsonl, 148 543 rows out of 153 841 expected, so the ingestion was ~96.6 % complete; the partial corpus is still statistically valid for high-level class statistics, and a re-run on the full corpus is a follow-up PR). The calibrated values replace the F.4 [provisional, calibrate against real corpus in F.5] placeholders at module-load time via vmaftune.auto._load_calibrated_recipes; if the JSON is missing or malformed, the F.4 placeholder constants in _F4_PLACEHOLDER_RECIPES remain in force (graceful fallback covered by tools/vmaf-tune/tests/test_calibrated_recipes.py).
Calibrated values:
| Class | tight_interval_max_width | force_single_rung | saliency_intensity | target_vmaf_offset | Source |
|---|---|---|---|---|---|
| animation | 1.75 | true | aggressive | +2.0 | proxy (UGC-anchored) |
| screen_content | (unset) | (unset) | very_aggressive | +1.0 | proxy (UGC-anchored) |
| live_action_hdr | 1.4 | (unset) | (default) | 0.0 | proxy (UGC-anchored) |
| ugc | 3.5 | false | default | +1.5 | corpus (K150K) |
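The graceful fallback for a missing or malformed calibration file can be sketched as below. load_recipes and PLACEHOLDER_RECIPES are hypothetical stand-ins for _load_calibrated_recipes and _F4_PLACEHOLDER_RECIPES, and the JSON shape (class name mapped to an override dict) is an assumption rather than the documented schema.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Stand-in placeholder table; only the UGC width is shown, for illustration.
PLACEHOLDER_RECIPES: Dict[str, Dict[str, Any]] = {
    "ugc": {"tight_interval_max_width": 3.0},
}


def load_recipes(path: Path) -> Dict[str, Dict[str, Any]]:
    """Return calibrated recipes, or the placeholders if the file is unusable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError covers a missing or unreadable file; ValueError covers
        # malformed JSON (json.JSONDecodeError) and bad encodings.
        return PLACEHOLDER_RECIPES
    if not isinstance(data, dict) or not all(isinstance(v, dict) for v in data.values()):
        return PLACEHOLDER_RECIPES  # valid JSON, but not the expected shape
    return data
```

Falling back silently keeps the planner usable on a checkout without the calibration artefact; a real implementation might also log which table is in force.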
Honest-data caveats:
- K150K is a UGC-only corpus and carries no per-source content_class column; only the ugc row is corpus-derived. The other three rows are calibrated as documented absolute offsets ("proxy") anchored on the F.4 envelope. PR #477's TransNet shot-metadata columns plus a class-labelled subset will let a future re-calibration replace the proxy rows with corpus-derived values.
- UGC's empirical target_vmaf_offset came out positive (+1.5) on K150K because the corpus's MOS distribution has a heavier upper tail than lower tail. The calibration script clamps every offset to the F.4 documented envelope of [-2.0, +2.0] so a pathological corpus cannot push the predictor target outside the regime the planner has been exercised against. Per memory feedback_no_test_weakening, the offset shifts only the predictor's effective target — never the input --target-vmaf gate that ships models.
- The mos_to_vmaf_proxy mapping (slope 20, intercept 0) is the Hosu et al. 2017 §3.3 anchor. A future re-calibration that measures end-to-end VMAF against the K150K reference clips (libvmaf full-reference pass) will replace the proxy with measured scores.
The calibration widens each placeholder width (UGC 3.0 → 3.5; animation 1.5 → 1.75; live-action HDR 1.2 → 1.4), but every value stays inside the ConfidenceThresholds invariant (tight ≤ wide). The test_calibrated_ugc_width_below_wide_gate_ceiling regression test locks this in: any future re-calibration that exceeds the wide-gate ceiling needs a separate ADR. Phase F is now fully calibrated; the one outstanding follow-up is the class-labelled re-calibration once PR #477 lands.
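The proxy mapping, the offset clamp, and the width invariant are simple enough to state in code. The sketch below assumes MOS is on a 1 to 5 scale; clamp_offset and widths_are_valid are hypothetical helper names, while the slope, intercept, and envelope values are the ones quoted in this section.

```python
OFFSET_ENVELOPE = (-2.0, 2.0)  # F.4 documented envelope quoted above


def mos_to_vmaf_proxy(mos: float, slope: float = 20.0, intercept: float = 0.0) -> float:
    """Linear MOS to VMAF proxy; slope 20 and intercept 0 are the values in the text."""
    return slope * mos + intercept


def clamp_offset(raw_offset: float) -> float:
    """Keep a fitted offset inside the envelope the planner has been exercised against."""
    lo, hi = OFFSET_ENVELOPE
    return max(lo, min(hi, raw_offset))


def widths_are_valid(tight: float, wide: float) -> bool:
    """ConfidenceThresholds invariant from the text: tight <= wide."""
    return tight <= wide
```

For example, a MOS of 4.5 maps to a proxy VMAF of 90.0, and a raw fitted offset of +3.7 would be clamped to +2.0.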
Status update 2026-05-10: F.1/F.2 additional short-circuits landed
Three additional short-circuit predicates ship in tools/vmaf-tune/src/vmaftune/auto.py, appended after the original seven in SHORT_CIRCUIT_PREDICATES (canonical zero-based positions 7, 8, 9, i.e. predicates #8 to #10):
- #8 low-complexity (_should_short_circuit_low_complexity) — skips recommend.coarse_to_fine when meta.complexity_score (the probe-encode bitrate at the adapter's probe_quality / probe_preset) is below LOW_COMPLEXITY_PROBE_BITRATE_THRESHOLD_KBPS (200 kbps placeholder). 0.0 or NaN does not fire (no probe yet).
- #9 baseline-meets-target (_should_short_circuit_baseline_meets_target) — skips the full predictor sweep when meta.baseline_vmaf already meets or exceeds plan_state.target_vmaf. 0.0 or NaN does not fire.
- #10 no-two-pass (_should_short_circuit_no_two_pass) — skips the two-pass calibration stage when the resolved codec adapter's supports_two_pass flag is False (ADR-0333). None (adapter not yet resolved) does not fire.
SourceMeta gains complexity_score and baseline_vmaf fields (both default 0.0). PlanState gains adapter_supports_two_pass (default None). 28 new unit tests in tools/vmaf-tune/tests/test_auto_phase_f1_f2.py; two existing tests updated to reflect 10 total predicates.
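A sketch of the three predicates and their "unset does not fire" guards, under stated assumptions: the dataclasses below carry only the fields named in this update, the shortened function names are hypothetical, and the real predicates in auto.py may take different arguments.

```python
import math
from dataclasses import dataclass
from typing import Optional

LOW_COMPLEXITY_PROBE_BITRATE_THRESHOLD_KBPS = 200.0  # placeholder from the text


@dataclass
class SourceMeta:
    complexity_score: float = 0.0  # probe-encode bitrate in kbps; 0.0 = no probe yet
    baseline_vmaf: float = 0.0     # 0.0 = no baseline measured yet


@dataclass
class PlanState:
    target_vmaf: float
    adapter_supports_two_pass: Optional[bool] = None  # None = adapter unresolved


def _is_unset(value: float) -> bool:
    """Treat the 0.0 default and NaN as 'not measured yet'."""
    return value == 0.0 or math.isnan(value)


def low_complexity(meta: SourceMeta, plan_state: PlanState) -> bool:
    if _is_unset(meta.complexity_score):
        return False
    return meta.complexity_score < LOW_COMPLEXITY_PROBE_BITRATE_THRESHOLD_KBPS


def baseline_meets_target(meta: SourceMeta, plan_state: PlanState) -> bool:
    if _is_unset(meta.baseline_vmaf):
        return False
    return meta.baseline_vmaf >= plan_state.target_vmaf


def no_two_pass(meta: SourceMeta, plan_state: PlanState) -> bool:
    # Only an explicit False fires; None means the adapter is not resolved yet.
    return plan_state.adapter_supports_two_pass is False
```

The explicit `is False` comparison matters: a plain truthiness check would also fire on None and skip two-pass calibration before the adapter is known.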