Research-0090: vmaf-tune Phase B target-VMAF bisect — feasibility¶
- Date: 2026-05-08
- Author: lusoris
- Companion ADR: ADR-0326
Question¶
What is the right algorithmic primitive for Phase B (target-VMAF bisect) of vmaf-tune? How many real encodes does it cost per call, and how does it compose with the existing coarse-to-fine grid search (ADR-0306)?
Algorithm choice¶
Three candidates, evaluated against integer-CRF, monotone-VMAF-in-CRF real-encode constraints:
| Algorithm | Encodes per call | Convergence | Notes |
|---|---|---|---|
| Integer binary search (chosen) | ceil(log2(range)) ≈ 6 for range=51 | Tight O(log) | Mirrors predictor.pick_crf; clean monotonicity assertion |
| Golden-section search | ~log_φ(range) ≈ 8.4 for range=51 | Same asymptote, irregular below 4-CRF windows | Continuous-domain assumption fights integer rounding |
| Brute-force linear scan | range = 52 for x264 | O(n) | ~9× more encodes; ignores monotonicity instead of asserting it |
| Bayesian (Optuna TPE) | 10–20 trials typical | Robust to non-monotone curves | Pulls Optuna into the production predicate; the fast subcommand already owns this surface |
Binary search wins on every axis except "robustness to non-monotone curves", and for that case we deliberately abort rather than degrade into a slower fallback (the AGENTS.md invariant). Real-world content + modern codecs are monotone in CRF; pathological exceptions are encoder bugs we want to surface, not paper over.
Expected encode budget per call¶
The codec adapter's quality_range for x264 widens to (0, 51) per ADR-0306, so the worst-case binary search converges in ceil(log2(52)) = 6 halvings. We cap at max_iterations=8 to absorb the off-by-one between window collapse and the "best so far" promotion path. With a 2-to-3-second per-encode wall time at 1080p medium preset on a single CPU socket, one bisect call costs 12–24 wall-clock seconds. Phase E ladder generation runs the bisect five times per resolution per target VMAF, so a 5-resolution × 3-target-VMAF ladder is ~3–6 wall-clock minutes on CPU; per ADR-0299 the score side is GPU-accelerated, halving that further.
Composition with coarse-to-fine¶
ADR-0306 ships corpus.coarse_to_fine_search as a 2-pass (preset × CRF) grid search. Phase B is the single-axis primitive the coarse-to-fine loop wraps when only the CRF axis is in play:
- Coarse-to-fine sweeps a
presets × CRF-coarsegrid first (~15 cells for x264), picks the best preset, then runs aCRF-finepass at that preset (~5 cells). Total: ~20 encodes. - Phase B is what
recommend.pick_target_vmafcould use as a callback if the operator pinned a preset up-front and only needed CRF optimised. Total: ~6 encodes.
The two are complementary, not redundant. Coarse-to-fine spans a 2-axis space; bisect spans a 1-axis space inside that. We do not fold coarse-to-fine into the bisect entry-point — the API surface stays minimal so callers compose them explicitly.
Recommendation¶
Ship the integer binary search per the ADR. Resist the temptation to add fallback strategies for monotonicity-violation cases; surfacing the violation is more useful than recovering from it.
References¶
- ADR-0326 — the decision.
predictor.pick_crf— the analytical-curve precedent for the same algorithmic shape.- ADR-0306 — coarse-to-fine grid search; complementary primitive.
- ADR-0237 — vmaf-tune umbrella spec.