Research-0060: vmaf-tune fast, proxy-based recommend (Phase A.5)
- Date: 2026-05-03
- Companion ADR: ADR-0276
- Parent ADR: ADR-0237 (umbrella, Phase A Accepted)
- Status: Snapshot at proposal time. The production fast-path implementation PR will supersede the operational details; this digest stays as the "why proxy + Bayesian + GPU verify" reference.
Question
Phase A (PR #329) ships a grid-sweep corpus generator: for every (preset, crf) cell, encode with libx264, score with libvmaf, emit a JSONL row. On 1080p sources at medium preset that loop costs roughly 5–15 seconds per cell on a workstation, scaling linearly with grid size. A typical corpus run for one source over (medium, slow) × CRF 18..40 is ≈140 cells ≈ 30–60 minutes; expanding to four presets and the full CRF range pushes it past two hours per source.
The user's framing — "vmaf-tune planning should be GPU/AI fast — Netflix-quality decisions in short time" — asks whether the fork's existing AI primitives can collapse that wall-time without sacrificing the recommendation quality the slow grid would have produced. The slow grid stays as ground truth (ADR-0237 contract); the question is whether a thin, opt-in fast subcommand can hit within a small VMAF tolerance of the grid's optimum in seconds-to-minutes rather than hours.
Bottleneck analysis (Phase A wall-time)
A single Phase A grid cell decomposes into three serial stages. Wall-time profile from a 1080p 10-second clip on x86-64 (12-core workstation, no GPU usage):
| Stage | Tool | Wall-time | Share |
|---|---|---|---|
| 1. Encode | ffmpeg -c:v libx264 -preset medium -crf N | 3–9 s | 60–80 % |
| 2. Decode + score | vmaf (CPU bit-exact, full feature set) | 1.5–3 s | 15–30 % |
| 3. JSONL emit + cleanup | corpus.py | < 0.1 s | < 2 % |
Three observations drive the fast-path design:
- The encode step dominates. Slower presets (slow, veryslow) shift even more of the budget into stage 1; veryslow on 1080p is closer to 60+ seconds per cell. The grid scales as |presets| × |CRF range| × #sources; on a 5-source × 4-preset × 23-CRF sweep that's 460 cells × ~10 s ≈ 75 minutes minimum.
- Stage 2 is independently a known cost. CPU bit-exact VMAF on 1080p runs at roughly 1–2 fps; the fork's CUDA / Vulkan / SYCL backends already accelerate it 8–20× (ADR-0157 / ADR-0186 / the Phase 5 GPU work), but Phase A's harness invokes the CPU CLI path because that codepath is the most reproducible.
- The grid is unstructured. Every cell is encoded + scored independently; there is no early termination, no surrogate model, no Bayesian prior. The corpus is exhaustive by design (Phase A's purpose is to be the training set Phase B/C consume), but the recommendation use case — "pick the CRF that hits target VMAF with minimum bitrate" — does not need exhaustive coverage.
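The grid-size arithmetic in the first observation can be reproduced with a few lines. This is a minimal sketch; the helper names `grid_cells` and `grid_wall_time_s` are hypothetical, and the 10 s/cell figure is the rough average quoted above, not a measurement.

```python
def grid_cells(n_sources: int, n_presets: int, crf_lo: int, crf_hi: int) -> int:
    """Number of (source, preset, crf) cells in a dense grid; CRF bounds are inclusive."""
    return n_sources * n_presets * (crf_hi - crf_lo + 1)


def grid_wall_time_s(cells: int, seconds_per_cell: float) -> float:
    """Serial wall-time: every cell is encoded and scored independently."""
    return cells * seconds_per_cell


# 5 sources x 4 presets x CRF 18..40 (23 values), at roughly 10 s per cell.
cells = grid_cells(n_sources=5, n_presets=4, crf_lo=18, crf_hi=40)
total_s = grid_wall_time_s(cells, seconds_per_cell=10.0)
print(f"{cells} cells, {total_s:.0f} s, {total_s / 60:.1f} min")
```

Because the cost is a plain product, every added preset, CRF value, or source grows the wall-time linearly, which is why the fast path attacks the trial count and the per-trial cost rather than the grid shape.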
Acceleration levers
The fork already ships every primitive a fast-path needs. We classify them by where in the grid loop they intervene:
Lever A: VMAF proxy via fr_regressor_v2 (codec-aware)
Skip stage 2 entirely for grid exploration. fr_regressor_v1 (257 params, canonical-6 → VMAF, ADR-0249, ADR-0235) and the in-flight fr_regressor_v2 (codec-aware, 9-D conditioning over encoder × preset × CRF, ADR-0272 / Research-0058) are MLP models designed for exactly this prediction shape: estimate VMAF without running VMAF. Inference is microseconds per row on CPU; a 460-cell "grid" becomes a sub-second sweep.
The proxy's job is not to replace libvmaf; it is to rank candidates well enough that the search converges on the correct neighbourhood, after which a single full VMAF measurement verifies the chosen point. Calibration: PLCC (Pearson linear correlation coefficient) against real VMAF on the parent's training corpus is the gating metric; v1 reports 0.9+ on Netflix Public, and v2 adds the codec one-hot to handle the medium vs slow rate-distortion shift.
Speedup ceiling on its own: ≈2.5× (stage 2 collapses across the grid, but stage 1 is unchanged, so the encode floor caps the gain).
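A calibration gate on PLCC could look like the following sketch. The function names, the 0.9 threshold (borrowed from the v1 figure above), and the sample values are all illustrative assumptions, not data from the corpus or code from the fork.

```python
import math


def plcc(predicted: list[float], measured: list[float]) -> float:
    """Pearson linear correlation between proxy predictions and measured VMAF."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_p * var_m)


def proxy_passes_gate(predicted: list[float], measured: list[float],
                      min_plcc: float = 0.9) -> bool:
    """Hypothetical gate: accept the proxy only above a minimum PLCC."""
    return plcc(predicted, measured) >= min_plcc


# Made-up numbers for illustration only, not measurements.
proxy_scores = [97.1, 94.8, 91.0, 86.2, 79.5]
real_scores = [96.4, 95.1, 90.2, 87.0, 78.1]
print(proxy_passes_gate(proxy_scores, real_scores))
```

Since the proxy only has to rank candidates, a rank correlation would be an equally reasonable gate; PLCC is used here because it is the metric the text names.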
Lever B: Bayesian search via Optuna TPE
Replace the dense grid with a sparse Bayesian sample. Optuna's TPE sampler typically reaches the same final VMAF in ≈1/10 the trials a dense grid needs, because each trial's posterior narrows the search range. CRF is a one-dimensional ordinal in [10, 51] for x264 — the easiest possible BO surface.
Speedup on its own: 5–10× on trial count. Combines multiplicatively with lever A: Bayesian sampling over a proxy is both fewer trials and cheaper per trial.
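A minimal sketch of the search loop, assuming a penalised single-objective formulation (minimise bitrate, with a large penalty when the VMAF target is missed). The mock VMAF and bitrate curves, the 93.0 target, and every function name are hypothetical stand-ins. The seeded random-search fallback exists only so the sketch runs without the optional dependency; it is not TPE.

```python
import random

TARGET_VMAF = 93.0  # hypothetical target


def mock_predict_vmaf(crf: int) -> float:
    """Stand-in for the proxy: VMAF falls monotonically as CRF rises."""
    d = crf - 10
    return max(0.0, 100.0 - 0.012 * d * d - 0.25 * d)


def mock_bitrate_kbps(crf: int) -> float:
    """Stand-in rate model: bitrate roughly halves every 6 CRF steps."""
    return 20000.0 * 2.0 ** (-(crf - 10) / 6.0)


def cost(crf: int) -> float:
    """Bitrate to minimise, plus a large penalty for missing the VMAF target."""
    shortfall = max(0.0, TARGET_VMAF - mock_predict_vmaf(crf))
    return mock_bitrate_kbps(crf) + 1e6 * shortfall


def search_crf(n_trials: int = 50, seed: int = 0) -> int:
    """Return the best CRF found in n_trials evaluations of cost()."""
    try:
        import optuna
    except ImportError:
        # Fallback (NOT TPE): seeded random search over the same range.
        rng = random.Random(seed)
        return min((rng.randint(10, 51) for _ in range(n_trials)), key=cost)

    optuna.logging.set_verbosity(optuna.logging.WARNING)
    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(seed=seed),
    )
    # CRF is a single integer parameter, as described above.
    study.optimize(lambda trial: cost(trial.suggest_int("crf", 10, 51)),
                   n_trials=n_trials)
    return study.best_params["crf"]


best = search_crf()
print(best, mock_predict_vmaf(best), mock_bitrate_kbps(best))
```

The penalty is sized so that any CRF missing the target costs more than any CRF meeting it, which keeps the best trial feasible as long as at least one feasible CRF was sampled.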
Lever C: Hardware-encoded grid via NVENC / QSV / AMF
Replace libx264 with NVENC (h264_nvenc) for the encode step. Hardware-encoded H.264 on a desktop GPU runs at hundreds of fps (realtime ≈ 30 fps, NVENC ≈ 300–600 fps on Ada-class) — 1–2 orders of magnitude faster than libx264 medium software. The cost is that NVENC's rate-distortion curve differs from libx264: same CRF lands at different VMAF / bitrate. The fr_regressor_v2's encoder one-hot is the right shape to handle this — if the training corpus covered NVENC, which Phase A's libx264-only corpus does not yet.
Speedup ceiling on its own: 10–30×, gated on encoder availability and a follow-up calibration corpus.
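Since this lever is gated on encoder availability, a detection step might be sketched as below. It assumes the usual `ffmpeg -encoders` row layout (a flags column, then the encoder name, then a description); the function names are hypothetical. A listed encoder only means FFmpeg was built with it, not that the matching hardware is present.

```python
import subprocess

# FFmpeg encoder names for the three hardware H.264 paths named above.
HW_H264_ENCODERS = ("h264_nvenc", "h264_qsv", "h264_amf")


def parse_hw_encoders(encoders_output: str) -> list[str]:
    """Return the hardware H.264 encoders named in `ffmpeg -encoders` output."""
    listed = set()
    for line in encoders_output.splitlines():
        fields = line.split()
        # Encoder rows look like: "<flags> <name> <description...>".
        if len(fields) >= 2:
            listed.add(fields[1])
    return [name for name in HW_H264_ENCODERS if name in listed]


def detect_hw_encoders(ffmpeg: str = "ffmpeg") -> list[str]:
    """Run ffmpeg and parse its encoder list; an empty list means fall back to libx264."""
    try:
        result = subprocess.run(
            [ffmpeg, "-hide_banner", "-encoders"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_hw_encoders(result.stdout)


sample = (
    " V....D libx264              libx264 H.264 / AVC / MPEG-4 AVC (codec h264)\n"
    " V....D h264_nvenc           NVIDIA NVENC H.264 encoder (codec h264)\n"
)
print(parse_hw_encoders(sample))
```

Keeping the parser separate from the subprocess call lets the detection logic be tested on hosts without FFmpeg installed.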
Lever D: Per-shot parallelisation via TransNet V2
Split the source into shots (TransNet V2 real weights via PR #334) and run an independent recommendation per shot, then aggregate. This is orthogonal to A/B/C: it makes the optimisation embarrassingly parallel along the time axis. Wins compound on long sources where a single CRF is suboptimal anyway. Out of scope for the Phase A.5 scaffold; queued as a Phase D follow-up under the existing vmaf-perShot story (ADR-0222).
Lever E: GPU VMAF backend for the verification pass
When the proxy converges, run one real VMAF measurement at the recommended CRF to verify the prediction. Use CUDA / Vulkan / SYCL (ADR-0157 / ADR-0186) for an 8–20× speedup on the verify step. The verify is one cell, not the grid, so the GPU win matters less than the proxy win — but it closes the loop with a real measurement and costs single-digit seconds.
Speedup contribution: small in absolute wall-time (one measurement), large in trust (the recommended CRF is grounded in a real number, not just a proxy estimate).
Speedup model (rough estimates)
Baseline: 460-cell grid × 10 s/cell ≈ 4600 s (≈ 75 min) for the 5-source × 4-preset × 23-CRF sweep described under the bottleneck analysis.
| Combination | Trials | Per-trial cost | Verify cost | Total wall-time | Speedup |
|---|---|---|---|---|---|
| None (Phase A grid) | 460 | 10 s (encode + CPU score) | – | 4600 s | 1× |
| A (proxy only, dense grid) | 460 | 4 s (encode only; proxy ≈ µs) | 1 s GPU verify | ≈1840 s | ~2.5× |
| B (Bayesian, real score) | 50 | 10 s | – | 500 s | ~9× |
| A + B (Bayesian + proxy) | 50 | 4 s | 1 s | 201 s | ~23× |
| A + B + E (Bayesian + proxy + GPU verify) | 50 | 4 s (encode is the floor) | 0.3 s | 200 s | ~23× |
| A + B + E + sample-chunk (5-second sample, full encode-time scales linearly) | 50 | 2 s (5 s clip × ~0.4× realtime) | 0.3 s | ≈100 s | ~46× |
| A + B + C + E (NVENC + Bayesian + proxy + GPU verify) | 50 | 0.3 s NVENC + µs proxy | 0.3 s | ≈30 s | ~150× |
The published 50–500× figures from the user's framing assume lever C (hardware encoder) carries the bulk. The conservative "no NVENC required" combination still hits 20–50× on a CPU-only host — which is the baseline the scaffold targets.
Hard caveat: these are upper bounds. The proxy's recommendation is only as good as its calibration. Without a Phase A corpus trained per-encoder, fr_regressor_v2's predictions on libx264 medium / slow at non-training CRFs are an extrapolation. The production claim has to be validated against Phase A baseline once the corpus exists; the scaffold ships a smoke-mode that fakes trials so the pipeline can be exercised without that corpus.
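The totals and speedups in the table follow from trials × per-trial cost + verify cost, measured against the 4600 s baseline. The sketch below recomputes the software-encoder rows from the table's own inputs; the helper names are hypothetical. The NVENC row is left out because its ≈30 s total appears to include overhead beyond the listed per-trial and verify costs.

```python
BASELINE_S = 460 * 10.0  # dense grid: 460 cells at 10 s per cell

# (label, trials, per-trial seconds, verify seconds), copied from the table.
ROWS = [
    ("A", 460, 4.0, 1.0),
    ("B", 50, 10.0, 0.0),
    ("A + B", 50, 4.0, 1.0),
    ("A + B + E", 50, 4.0, 0.3),
    ("A + B + E + sample-chunk", 50, 2.0, 0.3),
]


def total_wall_time_s(trials: int, per_trial_s: float, verify_s: float) -> float:
    """Serial trials plus a single verification measurement."""
    return trials * per_trial_s + verify_s


def speedup(total_s: float) -> float:
    """Speedup relative to the dense-grid baseline."""
    return BASELINE_S / total_s


for label, trials, per_trial_s, verify_s in ROWS:
    total = total_wall_time_s(trials, per_trial_s, verify_s)
    print(f"{label}: {total:.1f} s, {speedup(total):.1f}x")
```

The recomputation makes the encode floor visible: once the proxy removes scoring, only fewer trials or a cheaper encode move the total.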
Decision matrix: which combination is "fast-path v1"?
| Combination | Pros | Cons | Recommendation |
|---|---|---|---|
| A + B + E (recommended) | No new external dep beyond Optuna; works on any host; scales gracefully when GPU is absent (lever E is opt-in); the proxy is already-shipped fr_regressor_v2.onnx | Speedup capped at ~50×; encode floor remains software | First scaffold ships A + B + E. |
| A + B + C + E | Headline 100–500× speedup | NVENC requires an FFmpeg compiled with --enable-nvenc and an NVIDIA GPU; QSV/AMF analogues fragment the matrix; the proxy needs a hardware-encoder corpus to be calibrated | Follow-up PR. Requires Phase A.5b corpus regeneration with NVENC. |
| A only (dense grid + proxy) | Zero search-strategy churn; deterministic | Still scans every CRF; barely better than the grid | Rejected; misses the easy Bayesian win. |
| B only (Bayesian + real score) | No proxy calibration risk | 10× speedup but still CPU-bound on encode + score | Backup plan if the proxy's calibration regresses. |
| D (per-shot) bolted onto any | Compounds linearly with shot count | Out of scope for v1; orthogonal | Phase D follow-up. |
The recommended canonical fast-path is A + B + E — proxy + Bayesian + GPU verify — because it ships on any host the rest of the fork already supports, requires only the existing tiny-AI ONNX surface plus one new Python dep (optuna), and degrades gracefully on hosts without a GPU verify backend.
Failure modes the scaffold has to surface
- Proxy out-of-distribution. If a source's canonical-6 features sit far from the training distribution, the proxy will rank candidates incorrectly. The scaffold's verify step catches the symptom (predicted VMAF vs measured VMAF diverges); the structured response is to fall back to the slow grid for that source. The CLI must report the proxy / verify gap and exit non-zero past a configurable tolerance.
- No Phase A corpus available. Without a corpus the proxy ships with placeholder weights; the fast-path is structurally sound but numerically meaningless. The CLI carries an explicit --smoke flag that skips both encode and proxy and synthesises trials, so the pipeline can be exercised in CI before real weights exist.
- Hardware encoder unavailable. Lever C is opt-in; defaulting to libx264 keeps the tool runnable on CPU-only hosts. Auto-detect via ffmpeg -encoders happens in a follow-up.
- Optuna missing. Optuna is an optional runtime dep guarded behind an extras = ["fast"] block in the package metadata. The CLI errors with a clear install-instruction message when vmaf-tune fast is invoked without the extra installed.
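Two of these failure modes reduce to small guards, sketched below. The function names, the message wording, and the default tolerance of 1.0 VMAF point are assumptions for illustration; the source only requires that the gap is reported, that the exit status is non-zero past a configurable tolerance, and that a missing Optuna produces a clear install instruction.

```python
import sys


def verify_gap_exit_code(predicted_vmaf: float, measured_vmaf: float,
                         tolerance: float = 1.0, stream=sys.stderr) -> int:
    """Report the proxy / verify gap and return a process exit code."""
    gap = abs(predicted_vmaf - measured_vmaf)
    print(f"proxy={predicted_vmaf:.2f} verify={measured_vmaf:.2f} "
          f"gap={gap:.2f} tolerance={tolerance:.2f}", file=stream)
    if gap > tolerance:
        print("proxy / verify gap exceeds tolerance; "
              "fall back to the slow grid for this source", file=stream)
        return 1
    return 0


def require_optuna():
    """Import Optuna lazily, exiting with an install hint when it is absent."""
    try:
        import optuna
    except ImportError as exc:
        raise SystemExit(
            "vmaf-tune fast needs Optuna; install it with: "
            "pip install vmaf-tune[fast]"
        ) from exc
    return optuna


# A 0.4-point gap passes the default tolerance; a 3-point gap does not.
print(verify_gap_exit_code(93.0, 92.6), verify_gap_exit_code(93.0, 90.0))
```

Importing Optuna inside the function rather than at module top level keeps the core install importable without the extra.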
What the scaffold ships (this PR)
- New vmaftune.fast module with a fast_recommend(...) entry point and a --smoke mode that synthesises 50 fake trials and runs Optuna over them to validate the pipeline end-to-end.
- vmaf-tune fast CLI subcommand wired into the existing argparse tree.
- Extended docs/usage/vmaf-tune.md with the fast-path section and an explicit "what's needed for production" checklist.
- Optional dep on Optuna via pip install vmaf-tune[fast]. Core install is unchanged.
- ADR-0276 + this digest + AGENTS.md invariant + changelog fragment + rebase-notes entry (per ADR-0108).
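How a fast subcommand might hang off an argparse tree is sketched below. Only the subcommand name and the --smoke flag come from this digest; --target-vmaf, --n-trials, their defaults, and `build_parser` are hypothetical and may not match the scaffold's actual options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a vmaf-tune parser with a single 'fast' subcommand."""
    parser = argparse.ArgumentParser(prog="vmaf-tune")
    subparsers = parser.add_subparsers(dest="command", required=True)

    fast = subparsers.add_parser("fast", help="proxy-based fast recommend")
    fast.add_argument("--smoke", action="store_true",
                      help="synthesise trials instead of encoding and scoring")
    # Hypothetical options, shown only to illustrate the wiring.
    fast.add_argument("--target-vmaf", type=float, default=93.0)
    fast.add_argument("--n-trials", type=int, default=50)
    return parser


args = build_parser().parse_args(["fast", "--smoke"])
print(args.command, args.smoke, args.n_trials)
```

Registering the subcommand through add_subparsers leaves any existing subcommands untouched, which matches the "core install is unchanged" goal.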
What is deferred to follow-up PRs:
- Real fr_regressor_v2 weights (gated on PR #347 merging plus T-VMAF-TUNE-CORPUS-A producing training data).
- ONNX Runtime wiring for the proxy inference call (the scaffold uses a stub predict_vmaf that returns a deterministic mock score so the smoke test stays self-contained).
- NVENC / QSV / AMF auto-detection (lever C).
- Per-shot parallelisation (lever D).
- GPU verify wiring (vmaf CLI invocation with --cuda/--vulkan/--sycl selected per host capability).
- A Phase B "real recommend" mode that runs the actual encoder + proxy loop on a real source.
References
- req: user prompt, "vmaf-tune planning should be GPU/AI fast — Netflix-quality decisions in short time, would be huge" (paraphrased per user-quote handling rule).
- ADR-0237: parent umbrella spec.
- ADR-0272: fr_regressor_v2 codec-aware scaffold (Phase B prereq).
- ADR-0249: fr_regressor_v1 baseline (canonical-6 → VMAF).
- ADR-0235: codec-aware conditioning shape.
- ADR-0223: TransNet V2 scaffold (lever D enabler).
- Research-0044: parent option-space digest.
- Research-0058: fr_regressor_v2 feasibility.
- Optuna documentation: TPE sampler reference.