Skip to content

Research-0090: vmaf-tune Phase B target-VMAF bisect — feasibility

  • Date: 2026-05-08
  • Author: lusoris
  • Companion ADR: ADR-0326

Question

What is the right algorithmic primitive for Phase B (target-VMAF bisect) of vmaf-tune? How many real encodes does it cost per call, and how does it compose with the existing coarse-to-fine grid search (ADR-0306)?

Algorithm choice

Three candidates, evaluated against integer-CRF, monotone-VMAF-in-CRF real-encode constraints:

Algorithm Encodes per call Convergence Notes
Integer binary search (chosen) ceil(log2(range)) ≈ 6 for range=51 Tight O(log) Mirrors predictor.pick_crf; clean monotonicity assertion
Golden-section search ~log_φ(range) ≈ 8.4 for range=51 Same asymptote, irregular below 4-CRF windows Continuous-domain assumption fights integer rounding
Brute-force linear scan range = 52 for x264 O(n) ~9× more encodes; ignores monotonicity instead of asserting it
Bayesian (Optuna TPE) 10–20 trials typical Robust to non-monotone curves Pulls Optuna into the production predicate; the fast subcommand already owns this surface

Binary search wins on every axis except "robustness to non-monotone curves", and for that case we deliberately abort rather than degrade into a slower fallback (the AGENTS.md invariant). Real-world content + modern codecs are monotone in CRF; pathological exceptions are encoder bugs we want to surface, not paper over.

Expected encode budget per call

The codec adapter's quality_range for x264 widens to (0, 51) per ADR-0306, so the worst-case binary search converges in ceil(log2(52)) = 6 halvings. We cap at max_iterations=8 to absorb the off-by-one between window collapse and the "best so far" promotion path. With a 2-to-3-second per-encode wall time at 1080p medium preset on a single CPU socket, one bisect call costs 12–24 wall-clock seconds. Phase E ladder generation runs the bisect five times per resolution per target VMAF, so a 5-resolution × 3-target-VMAF ladder is ~3–6 wall-clock minutes on CPU; per ADR-0299 the score side is GPU-accelerated, halving that further.

Composition with coarse-to-fine

ADR-0306 ships corpus.coarse_to_fine_search as a 2-pass (preset × CRF) grid search. Phase B is the single-axis primitive the coarse-to-fine loop wraps when only the CRF axis is in play:

  • Coarse-to-fine sweeps a presets × CRF-coarse grid first (~15 cells for x264), picks the best preset, then runs a CRF-fine pass at that preset (~5 cells). Total: ~20 encodes.
  • Phase B is what recommend.pick_target_vmaf could use as a callback if the operator pinned a preset up-front and only needed CRF optimised. Total: ~6 encodes.

The two are complementary, not redundant. Coarse-to-fine spans a 2-axis space; bisect spans a 1-axis space inside that. We do not fold coarse-to-fine into the bisect entry-point — the API surface stays minimal so callers compose them explicitly.

Recommendation

Ship the integer binary search per the ADR. Resist the temptation to add fallback strategies for monotonicity-violation cases; surfacing the violation is more useful than recovering from it.

References

  • ADR-0326 — the decision.
  • predictor.pick_crf — the analytical-curve precedent for the same algorithmic shape.
  • ADR-0306 — coarse-to-fine grid search; complementary primitive.
  • ADR-0237 — vmaf-tune umbrella spec.