vmaf-tune — quality-aware encode automation harness
vmaf-tune is a fork-added Python tool (ADR-0237, Research-0044) that drives FFmpeg over an encoder parameter grid, scores each encode with vmaf, and emits a JSONL corpus of (source, encoder, params, bitrate, vmaf) rows.
Subcommands at a glance
| Subcommand | Purpose | ADR / research |
|---|---|---|
corpus | Multi-codec encoder grid sweep + scoring | ADR-0237 |
recommend | Target-VMAF / target-bitrate predicate (Phase B) | Research-0061, buckets 4+5 |
tune-per-shot | Per-shot CRF zones | ADR-0392 |
recommend-saliency | Saliency-aware ROI tuning | ADR-0287 ecosystem; consumes vmaf-roi sidecars |
ladder | Per-title bitrate ladder (Pareto ABR) | ADR-0295 |
fast | Predicted-CRF fast path | ADR-0276 |
prefilter | Pelorus deband strengths + CRF joint autotune | ADR-1116 |
hdr | HDR-aware tuning + HDR-VMAF scoring | (PR #434, bucket #9) |
compare | Apples-to-apples codec comparison at matched VMAF | (PR #435) |
benchmark | Offline cross-codec report from an existing JSONL | ADR-0424 |
sidecar | Local predictor bias-correction training / inspection | ADR-0394 |
encode-profile | Encode one recommendation from a report profile | ADR-0643 |
Codec adapters
19 adapters under tools/vmaf-tune/src/vmaftune/codec_adapters/:
| Family | Software | NVIDIA NVENC | Intel QSV | AMD AMF | Apple VideoToolbox |
|---|---|---|---|---|---|
| AV1 | libaom-av1, libsvtav1 | av1_nvenc (ADR-0290) | av1_qsv | av1_amf | av1_videotoolbox placeholder |
| H.264 | x264 (Phase A) | h264_nvenc | h264_qsv | h264_amf | h264_videotoolbox |
| HEVC | x265 (ADR-0288) | hevc_nvenc | hevc_qsv | hevc_amf | hevc_videotoolbox |
| VP9 | libvpx-vp9 | — | — | — | — |
| VVC | vvenc | — | — | — | — |
Pipeline
ref.yuv ──► vmaf-tune corpus ──► encode (libx264|libx265) ──► vmaf score ──► corpus.jsonl
│
└─► encodes are written to --encode-dir, deleted post-score
unless --keep-encodes
corpus.jsonl ──► vmaf-tune recommend --target-vmaf T ──► smallest CRF >= T
--target-bitrate B ──► CRF closest to B
corpus.jsonl ──► vmaf-tune benchmark --target-vmaf T ──► encoder ranking
JSON artifact portability
Human-facing vmaf-tune CLI JSON outputs, report-style artifacts, Phase-F executor result JSONL (tune_results*.jsonl), and local sidecar state files are strict RFC-8259 JSON. Diagnostic values that are non-finite in memory (NaN, Infinity, -Infinity) are serialized as null rather than as JavaScript-only tokens, so notebooks, dashboards, FFmpeg profile consumers, and MCP clients can parse the files with strict JSON decoders. Corpus JSONL rows remain the training interchange format; their feature-missing semantics are documented separately in the corpus schema below.
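The null-for-non-finite rule can be illustrated with a minimal sketch. The helper name `to_strict_json` is hypothetical and is not taken from the vmaf-tune source; it only shows one way to obtain strict RFC-8259 output from Python's standard `json` module.

```python
import json
import math


def to_strict_json(value):
    """Recursively replace NaN / Infinity / -Infinity with None (JSON null)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: to_strict_json(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_strict_json(item) for item in value]
    return value


# Illustrative report payload, not a real vmaf-tune artifact.
report = {"codec": "libx264", "vmaf": float("nan"), "gaps": [1.5, float("inf")]}

# allow_nan=False makes json.dumps raise if a non-finite float slipped through,
# so the output is guaranteed to parse with strict decoders.
encoded = json.dumps(to_strict_json(report), allow_nan=False)
```

Calling `json.dumps(report, allow_nan=False)` without the sanitising pass raises `ValueError`, which is a convenient way to catch unsanitised values in tests.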
Environment variables
| Variable | Default | Notes |
|---|---|---|
VMAFTUNE_WORKDIR | (OS default, typically /tmp) | Parent directory under which all vmaf-tune subcommands create their per-run temporary scratch directories (decoded reference YUV, intermediate encodes). Set this to a path on a volume with sufficient free space when the OS /tmp is a small tmpfs — e.g. a 634-second 1080p60 source decodes to approximately 118 GB of raw YUV420p. The compare, tune-per-shot, and ladder subcommands also accept --workdir PATH to override this variable per invocation. In the vmaf-dev-mcp container this is pre-set to /probes/vmaftune-work (the 435 GB /probes bind-mount). Resolution order: --workdir flag > VMAFTUNE_WORKDIR env var > OS default. (ADR-0598) |
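The resolution order and the 118 GB figure can be checked with a short sketch. Both function names are hypothetical, and the size estimate assumes 8-bit yuv420p (1.5 bytes per pixel); higher bit depths would need more space.

```python
import os
import tempfile


def resolve_workdir(flag_value=None, environ=None):
    """Pick the scratch parent: --workdir flag > VMAFTUNE_WORKDIR > OS default."""
    env = os.environ if environ is None else environ
    if flag_value:
        return flag_value
    if env.get("VMAFTUNE_WORKDIR"):
        return env["VMAFTUNE_WORKDIR"]
    return tempfile.gettempdir()


def raw_yuv420p_bytes(width, height, fps, seconds):
    """Estimate decoded size for 8-bit 4:2:0 video (assumption: 8-bit samples)."""
    # One luma byte per pixel plus two quarter-resolution chroma planes.
    frame_bytes = width * height * 3 // 2
    return frame_bytes * int(fps * seconds)


# 634 s of 1080p60: 38,040 frames of 3,110,400 bytes each.
size_bytes = raw_yuv420p_bytes(1920, 1080, 60, 634)
size_gb = size_bytes / 1e9  # roughly 118.3 (decimal GB)
```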
Workdir disk-space management (ADR-0577)
compare and tune-per-shot both run bisects inside a thread pool. Without a concurrency cap, each codec thread decodes the full reference source in parallel — a 3-codec compare against a 110 GB BBB 1080p source would peak at 330 GB, which overflowed the 420 GB /probes volume in the BBB v13 failure.
Three safeguards are active by default:
- Decode concurrency cap (--max-concurrent-decodes 1). A semaphore serialises reference-YUV decode operations across all threads. Peak disk stays at one YUV at a time (110 GB for BBB) regardless of how many codecs run in parallel. Raise the cap on hosts with large volumes and fast disks.
- Aggressive cleanup. After each bisect (codec × target VMAF pair) completes, its decoded reference YUV is deleted immediately. Per-iteration .mkv and .decoded.yuv files are deleted as soon as scoring finishes. Scratch accumulation is therefore bounded to the working set of the currently active bisect, not the full run.
- Mid-run disk-space check. Before each iteration's decode, shutil.disk_usage is called and compared against 2× the estimated YUV size. If the volume is too full the bisect fails fast with a structured error message that names the codec and target VMAF, rather than an opaque ffmpeg rc=228 (ENOSPC) followed by a corrupt output JSON.
To reproduce the BBB v13 scenario in tests, monkeypatch shutil.disk_usage to return low free space and pass a container source — the preflight and mid-run checks both fire before any real decode is attempted.
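As a rough illustration of how the decode cap and the disk-space check fit together, here is a minimal sketch. `InsufficientDiskSpace`, `check_free_space`, and `guarded_decode` are hypothetical names, not the harness's own; the `disk_usage` parameter exists only so a test can inject low free space instead of monkeypatching.

```python
import shutil
import threading

# One slot mirrors the default --max-concurrent-decodes 1.
DECODE_SLOTS = threading.Semaphore(1)


class InsufficientDiskSpace(RuntimeError):
    """Hypothetical error type; the real harness may raise something else."""


def check_free_space(workdir, estimated_yuv_bytes, codec, target_vmaf,
                     disk_usage=shutil.disk_usage):
    """Fail fast when free space is below 2x the estimated YUV size."""
    free = disk_usage(workdir).free
    needed = 2 * estimated_yuv_bytes
    if free < needed:
        raise InsufficientDiskSpace(
            f"{codec} @ target VMAF {target_vmaf}: need {needed} bytes free "
            f"in {workdir}, only {free} available"
        )


def guarded_decode(workdir, estimated_yuv_bytes, codec, target_vmaf, decode,
                   disk_usage=shutil.disk_usage):
    """Serialise decodes across threads and check disk space before each one."""
    with DECODE_SLOTS:
        check_free_space(workdir, estimated_yuv_bytes, codec, target_vmaf,
                         disk_usage)
        return decode()
```

Raising the semaphore's initial value corresponds to raising the cap; the check runs inside the guarded region so that two threads cannot both pass it against the same free space.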
Install
The tool ships under tools/vmaf-tune/ as a standalone Python package. Phase A has zero runtime dependencies beyond the standard library.
pip install -e tools/vmaf-tune
# or run directly from the checkout:
python tools/vmaf-tune/vmaf-tune --help
External binaries required at runtime:
- ffmpeg with --enable-libx264 (and --enable-libsvtav1 if you pass --encoder libsvtav1) on PATH (or --ffmpeg-bin).
- vmaf (this fork's CLI, built via meson) on PATH (or --vmaf-bin).
Predictor Training Corpora
vmaftune.predictor_train trains the per-codec ONNX predictors consumed by the fast, per-shot, ladder, and auto paths. --corpus accepts either a single Phase-A JSONL file or a directory of JSONL shards; directory inputs are scanned recursively in sorted order so the trainer can consume .workingdir2/corpus_run/ directly:
python -m vmaftune.predictor_train \
--corpus .workingdir2/corpus_run \
--codec libx264 \
--output-dir .workingdir2/predictor-real
Rows are filtered per codec after schema aliases are normalised. The trainer accepts both current corpus.py rows (encoder, crf, vmaf_score, bitrate_kbps) and older hardware-sweep rows (codec, q/cq, vmaf, actual_kbps). If a codec has no usable rows, that codec still falls back to the documented synthetic-stub corpus and the model card records corpus.kind: synthetic-stub-*. Directory inputs are passed through the same loader as single files; a codec with rows in any shard records corpus.kind: real-N=<rows>.
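The alias normalisation described above could look roughly like the sketch below. The alias pairs come from the paragraph; `ALIASES`, `normalise_row`, and `rows_for_codec` are hypothetical names, and the real loader may handle missing columns differently.

```python
# Canonical column name -> accepted spellings, newest first.
ALIASES = {
    "encoder": ("encoder", "codec"),
    "crf": ("crf", "q", "cq"),
    "vmaf_score": ("vmaf_score", "vmaf"),
    "bitrate_kbps": ("bitrate_kbps", "actual_kbps"),
}


def normalise_row(row):
    """Map a current or legacy row onto canonical keys; None if unusable."""
    out = {}
    for canonical, candidates in ALIASES.items():
        for key in candidates:
            if key in row:
                out[canonical] = row[key]
                break
        else:
            # No spelling of this column is present: treat the row as unusable.
            return None
    return out


def rows_for_codec(rows, codec):
    """Normalise first, then filter per codec, as the trainer is described to do."""
    normalised = (normalise_row(row) for row in rows)
    return [row for row in normalised if row is not None and row["encoder"] == codec]
```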
When rows include richer predictor inputs, the trainer now preserves them instead of zero-filling: probe_i_frame_avg_bytes, probe_p_frame_avg_bytes, probe_b_frame_avg_bytes, saliency_mean, saliency_var, frame_diff_mean, y_avg, and y_var. Older rows remain valid; missing probe-byte columns use the deterministic bitrate stand-ins and missing saliency / signalstats columns stay 0.0.
vmaf-tune predict --use-saliency uses the same saliency feature slots at runtime. It decodes each sampled shot to temporary yuv420p, runs the configured saliency ONNX, and feeds only saliency_mean / saliency_var into the predictor:
vmaf-tune predict \
--source source.mp4 \
--codec libx264 \
--target-vmaf 96 \
--use-saliency \
--saliency-model model/tiny/saliency_student_v1.onnx
This flag is separate from recommend-saliency --saliency-aware, which creates ROI/QP sidecars for the encoder.
Local Sidecar Bias Correction
vmaf-tune sidecar exposes the local sidecar model from docs/ai/local-sidecar-training.md as an operator CLI. It never uploads captures and never mutates the shipped predictor; it stores only the online-ridge correction under ${XDG_CACHE_HOME:-~/.cache}/vmaf-tune/sidecar/.
Inspect the current sidecar state:
Record one observed encode. features.json may be either a flat ShotFeatures object or { "features": { ... } }; the required fields are probe_bitrate_kbps, probe_i_frame_avg_bytes, probe_p_frame_avg_bytes, and probe_b_frame_avg_bytes.
vmaf-tune sidecar record \
--codec libx264 \
--features-json features.json \
--crf 28 \
--observed-vmaf 94.2
Batch training accepts JSONL with one row per observed encode:
{"features":{"probe_bitrate_kbps":3000,"probe_i_frame_avg_bytes":10000,"probe_p_frame_avg_bytes":2000,"probe_b_frame_avg_bytes":1000,"width":1920,"height":1080,"fps":24},"crf":28,"observed_vmaf":94.2}
Prediction reports the bare predictor score, sidecar correction, and clamped final score:
Quick start
Generate a 6-row corpus over the (medium, slow) × (22, 28, 34) grid for one 1080p source clip:
vmaf-tune corpus \
--source ref.yuv \
--width 1920 --height 1080 --pix-fmt yuv420p \
--framerate 24 --duration 10 \
--preset medium --preset slow \
--crf 22 --crf 28 --crf 34 \
--output corpus.jsonl
--source is repeatable — pass one flag per source clip. The grid is the Cartesian product of --preset × --crf.
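The grid expansion is a plain Cartesian product, sketched here with `itertools.product`. `build_grid` is a hypothetical helper for illustration; the key names follow the corpus row schema.

```python
from itertools import product


def build_grid(sources, presets, crfs):
    """Expand repeatable --source / --preset / --crf flags into grid cells."""
    return [
        {"src": src, "preset": preset, "crf": crf}
        for src, preset, crf in product(sources, presets, crfs)
    ]


# Two presets x three CRFs for one source gives six cells.
cells = build_grid(["ref.yuv"], ["medium", "slow"], [22, 28, 34])
```

Each additional source multiplies the cell count, so a sweep over S sources, P presets and C CRF values produces S × P × C encodes.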
SVT-AV1 example (ADR-0278)
The libsvtav1 adapter accepts the same x264-style preset names — they are translated to SVT-AV1's integer presets internally. AV1 CRF values live in 0..63; the Phase A informative window is (20, 50):
vmaf-tune corpus \
--source ref.yuv \
--width 1920 --height 1080 --pix-fmt yuv420p \
--framerate 24 --duration 10 \
--encoder libsvtav1 \
--preset medium --preset slow \
--crf 28 --crf 35 --crf 42 \
--output corpus_av1.jsonl
The corpus row records the human-readable preset name ("medium"), while the FFmpeg argv carries the integer SVT-AV1 expects (-preset 7).
Codec adapter parameter ranges
Each adapter declares its own quality knob, range, and preset vocabulary. The harness validates (preset, crf) against the adapter before invoking FFmpeg.
| Encoder | CRF (absolute) | Phase A CRF window | Default CRF | Presets |
|---|---|---|---|---|
libx264 | 0..51 | (15, 40) | 23 | ultrafast, superfast, veryfast, faster, fast, medium, slow, slower, veryslow |
libsvtav1 | 0..63 | (20, 50) | 35 | placebo, slowest, slower, slow, medium, fast, faster, veryfast |
libvpx-vp9 | 0..63 | (0, 63) | 32 | placebo, slowest, slower, slow, medium, fast, faster, veryfast, superfast, ultrafast |
SVT-AV1 preset name -> integer mapping
SVT-AV1 uses integer presets 0..13 (0 = slowest / best, 13 = fastest). The harness maps the x264-style names below to AV1 integers so the corpus row schema is codec-independent:
| Name | SVT-AV1 integer | Notes |
|---|---|---|
placebo | 0 | Slowest; research-grade only. |
slowest | 1 | |
slower | 3 | |
slow | 5 | |
medium | 7 | SVT-AV1 default. |
fast | 9 | |
faster | 11 | |
veryfast | 13 | Fastest. |
The mapping is closed and order-stable; see ADR-0294.
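A minimal sketch of the name-to-integer translation, using the values from the table above. `SVTAV1_PRESETS` and `svtav1_preset_args` are hypothetical names; the adapter's actual interface may differ.

```python
# Closed mapping copied from the table; names outside it are rejected.
SVTAV1_PRESETS = {
    "placebo": 0,
    "slowest": 1,
    "slower": 3,
    "slow": 5,
    "medium": 7,
    "fast": 9,
    "faster": 11,
    "veryfast": 13,
}


def svtav1_preset_args(name):
    """Translate an x264-style preset name into FFmpeg argv for libsvtav1."""
    try:
        value = SVTAV1_PRESETS[name]
    except KeyError:
        raise ValueError(
            f"unknown libsvtav1 preset {name!r}; expected one of {sorted(SVTAV1_PRESETS)}"
        ) from None
    return ["-preset", str(value)]
```

The corpus row would keep the name ("medium") while the argv carries the integer, which is what keeps the row schema codec-independent.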
CLI flags
| Flag | Default | Notes |
|---|---|---|
--source PATH | — | Required. Repeatable for multi-source sweeps. |
--width / --height | — | Required. Source resolution. |
--pix-fmt PFMT | yuv420p | Forwarded to ffmpeg -pix_fmt. |
--framerate F | 24.0 | Source framerate. |
--duration S | 0 | Source duration in seconds (used for bitrate calc). |
--encoder NAME | libx264 | Codec adapter name, e.g. libx264, libx265, libsvtav1 (ADR-0294), h264_amf, hevc_amf, av1_amf; see "Codec adapters" above. |
--preset P | — | Required. Repeatable. Preset name (see "Codec adapter parameter ranges" above). |
--crf N | — | Required. Repeatable. CRF integer (range varies by codec). |
--output PATH | corpus.jsonl | JSONL destination. |
--encode-dir PATH | .workingdir2/encodes | Scratch dir; gitignored by convention. |
--keep-encodes | off | Retain encoded files after scoring. |
--vmaf-model NAME | vmaf_v0.6.1 | Forwarded to vmaf --model. Only used when --no-resolution-aware is set; otherwise auto-picked per encode resolution (see "Resolution-aware mode" below). |
--resolution-aware / --no-resolution-aware | on | Auto-pick the VMAF model per encode resolution. Default on. |
--ffmpeg-bin PATH | ffmpeg | Override the ffmpeg binary. |
--ffprobe-bin PATH | ffprobe | Override the ffprobe binary (used for HDR detection). |
--vmaf-bin PATH | vmaf | Override the vmaf binary. |
--score-backend NAME | auto | libvmaf scoring backend — auto\|cpu\|cuda\|sycl\|hip. See below. (vulkan was removed in ADR-0726.) |
--no-source-hash | off | Skip src_sha256 (faster on large YUVs; loses provenance). |
--auto-hdr | (default) | Probe each source via ffprobe; inject HDR flags + HDR model when PQ / HLG signaling is detected. |
--force-sdr | off | Treat all sources as SDR; skip HDR detection. |
--force-hdr-pq | off | Treat all sources as HDR PQ (SMPTE-2084) without probing. Useful for raw YUV refs that ffprobe cannot read color metadata from. |
--force-hdr-hlg | off | Treat all sources as HDR HLG (ARIB STD-B67) without probing. |
--two-pass | off | Phase F (ADR-0333 + ADR-0546). Run a 2-pass encode for codecs whose adapter sets supports_two_pass = True (today: libx264, libx265, libvpx-vp9, libaom-av1, libvvenc). Codecs without 2-pass support fall back to single-pass with a stderr warning. Doubles encode wall time. |
--vaapi-device PATH | auto | VA-API DRI render-node for Intel QSV hardware-device init. auto selects the first Intel render node from /sys/class/drm and falls back to /dev/dri/renderD128 only when no Intel node is discoverable. Also overridable via VMAFTUNE_VAAPI_DEVICE env var; the flag takes precedence. Ignored for non-QSV encoders. (ADR-0641) |
Resolution-aware mode
VMAF is a resolution-aware metric: the fork ships two production-grade pooled-mean models — vmaf_v0.6.1 (trained on a 1080p viewing setup) and vmaf_4k_v0.6.1 (re-fit for a 4K display). Scoring 4K content against the 1080p model under-counts spatial detail; scoring 1080p content against the 4K model over-counts coding artefacts. The bias is several VMAF points either way — large enough to poison a mixed-resolution ABR-ladder corpus.
When --resolution-aware is on (the default), vmaf-tune picks the model per encode according to a height-only rule that mirrors Netflix's published guidance:
| Encode height | Selected model |
|---|---|
≥ 2160 (UHD-1 and up) | vmaf_4k_v0.6.1 |
< 2160 (everything else, including 1440p / 720p / SD) | vmaf_v0.6.1 |
The fork has no 720p / 1440p / SD model — vmaf_v0.6.1 is the canonical fallback for all sub-2160p content (matches Netflix's recommendation).
The emitted JSONL row's vmaf_model field records the effective model used for that row, not the global --vmaf-model option. Mixed-ladder corpora legitimately contain multiple distinct vmaf_model values across rows; downstream consumers should group / filter by vmaf_model rather than assume a constant.
To force a single model regardless of resolution (e.g. to reproduce a legacy single-model corpus), pass --no-resolution-aware:
vmaf-tune corpus \
--source ref_4k.yuv --width 3840 --height 2160 \
--preset medium --crf 23 \
--no-resolution-aware --vmaf-model vmaf_v0.6.1 \
--output corpus.jsonl
The Python API exposes the decision rule directly for callers that need to consult it outside the corpus loop:
from vmaftune.resolution import (
    select_vmaf_model_version,  # str: "vmaf_v0.6.1" or "vmaf_4k_v0.6.1"
    select_vmaf_model,          # Path: in-tree model JSON file
    crf_offset_for_resolution,  # int: -2 / 0 / +2 / +4 by resolution band
)
assert select_vmaf_model_version(3840, 2160) == "vmaf_4k_v0.6.1"
assert select_vmaf_model_version(1920, 1080) == "vmaf_v0.6.1"
crf_offset_for_resolution returns a small integer offset that the future search layer (Phase B target-VMAF bisect) can apply when seeding bisect bounds across an ABR ladder. The shipped defaults are codec-agnostic and conservative; Phase B/C/D will learn per-codec offsets from real corpora and override them via the same function signature. See ADR-0289 and Research-0054 for the full rationale.
GPU scoring backend
Per ADR-0299, vmaf-tune corpus forwards a --backend NAME argument to the libvmaf CLI so scoring runs on a GPU when one is present. CPU VMAF runs at ~1–2 fps on 1080p; the CUDA / SYCL / HIP backends shipped with this fork (ADR-0667) deliver 10–30× speedup on the score axis. (The Vulkan backend was removed in ADR-0726; use HIP for vendor-neutral AMD / Intel Arc GPU scoring.)
Modes
| Value | Behaviour |
|---|---|
auto (default) | Detect what the local vmaf binary supports + what the host hardware advertises, then pick the fastest available backend in vmaf-tune probe order: cuda → sycl → hip → cpu. Falls back to CPU silently only when no GPU is found. |
cuda / sycl / hip | Strict mode. Errors out with BackendUnavailableError if the local vmaf binary does not support the requested backend or the host hardware is missing. No silent downgrade to CPU — that would mask hardware/build mismatches and lie about wall-clock expectations. |
cpu | Force the CPU path. Useful for reproducibility against the Netflix golden-data gate or to bypass a known-bad GPU driver day. |
Detection heuristics
vmaf-tune inspects the vmaf --help output to learn which backends the binary advertises (the CLI prints a line of the form --backend $name: ...auto|cpu|cuda|sycl|hip|metal), then runs cheap hardware probes:
- CUDA: nvidia-smi -L returns at least one GPU line.
- SYCL: sycl-ls lists at least one :gpu device.
- HIP/ROCm: rocminfo reports a gfx* agent, or rocm-smi reports a GPU.
Missing tools degrade to "backend not available" — they never raise hard errors. CPU is always considered available even if the help line is missing.
Probe order vs. libvmaf registry order: vmaf-tune's auto probe order (cuda → sycl → hip → cpu) differs from the libvmaf internal registry order (sycl → cuda → hip → cpu). vmaf-tune probes CUDA first because nvidia-smi provides the most reliable availability detection. The libvmaf registry prioritises SYCL first as the primary continuous-integration GPU target. When comparing timing results, be aware that auto may select different backends in vmaf-tune versus a direct vmaf --backend auto invocation.
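The auto and strict modes can be summarised in a small sketch. It assumes the binary and hardware probes have already been reduced to sets of backend names; `select_backend` is a hypothetical name, and `BackendUnavailableError` here is a local stand-in for the harness error of the same name.

```python
# vmaf-tune probe order for "auto", as documented above.
PROBE_ORDER = ("cuda", "sycl", "hip", "cpu")


class BackendUnavailableError(RuntimeError):
    """Local stand-in for the error raised in strict mode."""


def select_backend(requested, binary_backends, hardware_backends):
    """Resolve --score-backend against what the binary and host both support."""
    # A GPU backend counts only if the vmaf binary advertises it AND the
    # hardware probe succeeded; CPU is always considered available.
    available = {"cpu"} | (set(binary_backends) & set(hardware_backends))
    if requested == "auto":
        return next(name for name in PROBE_ORDER if name in available)
    if requested not in available:
        raise BackendUnavailableError(
            f"backend {requested!r} requested but not available on this host "
            f"(available: {', '.join(sorted(available))})"
        )
    return requested
```

Strict values never downgrade: a request for cuda on a host without a usable GPU raises instead of returning cpu.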
Wall-clock expectation (60 s 1080p source, indicative)
| Score backend | Hardware | Wall-clock | Throughput |
|---|---|---|---|
cpu | AVX2 desktop CPU | ~600–1200 s | ~1.2–2.5 fps |
cuda | RTX 30/40-class GPU | ~50–120 s | ~12–30 fps |
sycl | Intel Arc / Iris Xe | ~80–180 s | ~8–18 fps |
hip | RDNA2 / RDNA3 ROCm host | ~80–180 s | ~8–18 fps |
Numbers are order-of-magnitude only; exact figures depend on the specific feature extractors enabled by the model (vmaf_v0.6.1 versus tiny-AI variants), whether --keep-encodes is on, and host I/O bandwidth. Cross-backend numerical parity is guaranteed to places=4 by the ADR-0214 CI gate.
Examples
# Default — auto-pick the fastest backend.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
--preset medium --crf 22 --crf 28
# Force CUDA. Errors out clearly if /opt/cuda is missing or the
# vmaf binary was built without CUDA support.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
--preset medium --crf 22 --score-backend cuda
# Pin CPU for reproducibility against the Netflix golden gate.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
--preset medium --crf 22 --score-backend cpu
Vulkan score backend (--score-backend=vulkan)
Status: REMOVED — ADR-0726 (2026-05-28). The Vulkan backend was removed from libvmaf. --score-backend=vulkan is no longer a valid value — passing it returns a non-zero exit code with an "unsupported backend" error. Use --score-backend=hip for vendor-neutral GPU scoring on AMD or Intel Arc hosts. The sections below are preserved for historical reference only; none of the workflows described are functional in current builds.
Per ADR-0314 (now superseded by ADR-0726), the vulkan value of --score-backend was the vendor-neutral GPU score path. Use --score-backend=hip for AMD/Intel Arc hosts going forward.
Supported platforms
| Host | Driver | Status |
|---|---|---|
| Linux + AMD RDNA2/RDNA3 | Mesa RADV | Production. |
| Linux + Intel Arc / Iris Xe | Mesa anv | Production. |
| Linux + NVIDIA | Mesa NVK or proprietary nvidia driver | Production; coexists with --score-backend=cuda. |
| Linux CI / no GPU | Mesa lavapipe (software rasteriser) | Slow but functional — the cross-backend parity gate (ADR-0214) runs on lavapipe. |
| macOS (Apple Silicon + Intel) | MoltenVK 1.2 layered over Metal | Functional. |
| Windows | Vendor-supplied Vulkan ICD | Functional; not gated in CI yet. |
The libvmaf binary needs to be built with -Denable_vulkan=true (default in fork release artefacts). The vulkan value will fail strict-mode validation otherwise — vmaf --help will not advertise vulkan in its --backend alternation.
Verifying Vulkan availability
vmaf-tune runs the same probe libvmaf does — vulkaninfo --summary must succeed and report at least one deviceName:
If that command is missing, install the Vulkan SDK loader (Linux: vulkan-tools package; macOS: brew install vulkan-tools).
Example
# Vendor-neutral GPU scoring on AMD / Intel Arc / MoltenVK hosts.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
--preset medium --crf 22 --crf 28 --score-backend vulkan
Failure mode (no Vulkan loader installed):
vmaf-tune: backend 'vulkan' requested but not available on this host
(available: cpu). Check that the local vmaf binary was built with the
matching backend support and the corresponding runtime/driver is
installed.
The exit code is 2 and no encodes are dispatched — the strict-mode guarantee from ADR-0299 (no silent CPU downgrade) is preserved across all four backend values.
| Flag | Default | Notes |
|---|---|---|
| --no-cache | off | Disable the content-addressed encode/score cache (default: ON). |
| --cache-dir PATH | $XDG_CACHE_HOME/vmaf-tune | Override cache location (falls back to ~/.cache/vmaf-tune). |
| --cache-size-gb N | 10 | LRU eviction cap in GiB. |
| --sample-clip-seconds N | 0 | Encode/score only the centre N-second slice of each source. 0 (default) keeps the legacy full-source behaviour. See Sample-clip mode. |
Sample-clip mode
Set --sample-clip-seconds N to evaluate each grid cell on the centre N-second slice of the source instead of the full reference. This is a runtime/accuracy trade-off, formalised in ADR-0301.
- Speedup. Encode wall-time scales roughly linearly with slice length, so e.g. a 10-second slice of a 60-second source is a ~6x speedup per grid point. The libvmaf scoring pass shrinks by the same factor (it reads only the matching reference window via --frame_skip_ref / --frame_cnt).
- Accuracy delta. Expect ~1-2 VMAF points of drift versus full-clip on diverse content (mixed-shot trailers, sports, action), tighter (~0.3-0.5 VMAF points) on uniform content (single-shot interviews, animation, static stills). The delta is per-cell consistent — relative ordering between (preset, crf) cells survives, which is what Phase B (target-VMAF bisect) and Phase C (per-title CRF predictor) actually consume. Full-clip rescoring of the predictor's pick is the recommended Phase C epilogue.
- Window placement. Naive centre-anchored: the slice is (duration_s − N) / 2 .. (duration_s + N) / 2. Smarter shot-aware placement (e.g. via transnet_v2) is on the follow-up backlog.
- Fallback. If N >= duration_s the harness silently falls back to full-clip mode and tags the row clip_mode="full" — the request is treated as "use the whole source", not as an error.
- Bitrate semantics. bitrate_kbps is computed against the encoded duration, so sample-clip rows aren't biased low by dividing slice-bytes by full-source seconds. duration_s keeps the original source provenance.
# 6x faster grid sweep — 10s of a 60s source per cell.
vmaf-tune corpus \
--source ref.yuv \
--width 1920 --height 1080 --pix-fmt yuv420p \
--framerate 24 --duration 60 \
--preset medium --preset slow \
--crf 22 --crf 28 --crf 34 \
--sample-clip-seconds 10 \
--output corpus.jsonl
Each emitted row carries clip_mode="sample_10s" (or "full"), letting Phase B/C either filter sample rows out, weight them differently, or rescore the chosen cell on the full source.
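The window placement, fallback, and bitrate rules above reduce to a few lines of arithmetic. The sketch below uses hypothetical function names and assumes N is passed as an integer number of seconds, so the tag renders as "sample_10s".

```python
def sample_window(duration_s, clip_seconds):
    """Return (start_s, end_s, clip_mode) for a centre-anchored slice."""
    # 0 keeps full-source behaviour; N >= duration falls back to full clip.
    if clip_seconds <= 0 or clip_seconds >= duration_s:
        return 0.0, float(duration_s), "full"
    start = (duration_s - clip_seconds) / 2
    return start, start + clip_seconds, f"sample_{clip_seconds}s"


def bitrate_kbps(encode_size_bytes, encoded_seconds):
    """Bitrate over the encoded duration (the slice length in sample mode)."""
    return (encode_size_bytes * 8 / 1000) / encoded_seconds


# A 10 s slice of a 60 s source covers 25 s .. 35 s.
start_s, end_s, clip_mode = sample_window(60, 10)
```

Dividing by the slice length rather than the full source duration is what keeps sample-clip bitrates comparable with full-clip rows.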
benchmark subcommand — corpus-level codec ranking
vmaf-tune benchmark is Phase G of the tune workflow. It does not run FFmpeg or libvmaf; it reads an existing Phase-A JSONL corpus and ranks encoders by their best matched-quality point.
For each encoder in the corpus, the command filters to successful rows with finite vmaf_score and bitrate_kbps, then chooses the lowest bitrate row whose VMAF clears --target-vmaf. Encoders that never clear the target stay in the report as status=unmet using their closest VMAF miss, so a too-narrow CRF sweep is visible instead of silently dropped.
vmaf-tune benchmark \
--from-corpus corpus.jsonl \
--target-vmaf 92 \
--baseline-encoder libx264 \
--format markdown
Output formats:
| Format | Use |
|---|---|
markdown | PR comments and human review. |
json | Notebooks, dashboards, and follow-up automation. |
csv | Spreadsheets and quick plots. |
--baseline-encoder controls the bitrate-delta column. When omitted, the baseline is the lowest-bitrate encoder that clears the target. The report inherits the corpus coverage: if libx264 was swept over 20 CRFs and libx265 over 3 CRFs, the ranking reflects those sampled points only.
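The per-encoder selection rule can be sketched as follows. This is an illustration only: `best_point` is a hypothetical name, "successful" is assumed to mean exit_status == 0, and the "met" label is an assumption (the text above only names status=unmet).

```python
import math


def best_point(rows, encoder, target_vmaf):
    """Pick one encoder's matched-quality row, or its closest miss."""
    usable = [
        row for row in rows
        if row["encoder"] == encoder
        and row["exit_status"] == 0  # assumption: success means exit code 0
        and math.isfinite(row["vmaf_score"])
        and math.isfinite(row["bitrate_kbps"])
    ]
    if not usable:
        return None
    met = [row for row in usable if row["vmaf_score"] >= target_vmaf]
    if met:
        # Lowest bitrate among rows that clear the target.
        return {**min(met, key=lambda row: row["bitrate_kbps"]), "status": "met"}
    # No row clears the target: report the closest miss instead of dropping it.
    return {**max(usable, key=lambda row: row["vmaf_score"]), "status": "unmet"}
```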
Corpus JSONL schema
Each row is one JSON object on its own line. The full key list is exported as vmaftune.CORPUS_ROW_KEYS for programmatic consumers and versioned via vmaftune.SCHEMA_VERSION (currently 3 — v2 added clip_mode for sample-clip mode under ADR-0301; v3 added the HDR provenance triple hdr_transfer / hdr_primaries / hdr_forced when corpus.iter_rows was wired to hdr.detect_hdr + hdr.hdr_codec_args per the ADR-0300 status update of 2026-05-08). Bumping the schema is a coordinated change with Phase B/C; do not edit row shape without bumping the version.
| Key | Type | Description |
|---|---|---|
schema_version | int | Currently 3. v3 adds the enc_internal_* aggregates (ADR-0332). |
run_id | str | Per-row UUID4 hex. |
timestamp | str | UTC ISO-8601 (seconds precision). |
src | str | Path to the reference YUV. |
src_sha256 | str | SHA-256 of the reference (empty if --no-source-hash). |
width / height | int | Source dimensions. |
pix_fmt | str | Source pixel format. |
framerate | float | Source framerate. |
duration_s | float | Source duration in seconds. |
encoder | str | Codec adapter name (e.g. libx264). |
encoder_version | str | Detected encoder version (e.g. libx264-164). |
preset | str | Encoder preset. |
crf | int | Quality knob value. |
extra_params | list[str] | Additional encoder argv (Phase A: []). |
encode_path | str | Path to encoded file (empty if not retained). |
encode_size_bytes | int | Encoded file size. |
bitrate_kbps | float | (encode_size_bytes × 8 / 1000) / duration_s. |
encode_time_ms | float | Wall-clock encode time. |
vmaf_score | float | Pooled-mean VMAF (NaN if scoring skipped/failed). |
vmaf_model | str | Model version string (e.g. vmaf_v0.6.1). |
score_time_ms | float | Wall-clock scoring time. |
ffmpeg_version | str | Detected ffmpeg version. |
vmaf_binary_version | str | Detected vmaf binary version. |
exit_status | int | First non-zero of (encode, score) exit codes. |
hdr_transfer | str | "" (SDR), "pq" (SMPTE-2084) or "hlg" (ARIB STD-B67). Schema v3+. |
hdr_primaries | str | Raw ffprobe color_primaries (e.g. bt2020); empty for SDR. Schema v3+. |
hdr_forced | bool | true iff the user overrode detection via --force-hdr-* / --force-sdr. Schema v3+. |
clip_mode | str | "full" (default) or "sample_<N>s" per --sample-clip-seconds. Schema v2+. |
shot_count | int | Number of TransNet-V2 shots in the source (0 when shot detection unavailable). v3+. |
shot_avg_duration_sec | float | Mean shot length in seconds (0.0 when unavailable). v3+. |
shot_duration_std_sec | float | Population std of shot lengths in seconds — content-class proxy (animation: low; live action: high). v3+. |
adm2_mean | float | Per-frame ADM2 mean (canonical-6). NaN when scoring skipped. v3+ (ADR-0366). |
vif_scale0_mean … motion2_std | float | Remaining canonical-6 mean/std aggregates (12 columns total). v3+ (ADR-0366). |
enc_internal_qp_mean | float | Per-frame QP mean from x264 pass-1 stats. 0.0 for opt-out adapters. v3+ (ADR-0332). |
enc_internal_qp_std | float | Per-frame QP standard deviation. v3+ (ADR-0332). |
enc_internal_bits_mean | float | Per-frame bit-cost mean (tex+mv+misc). v3+ (ADR-0332). |
enc_internal_bits_std | float | Per-frame bit-cost standard deviation. v3+ (ADR-0332). |
enc_internal_mv_mean | float | Per-frame motion-vector bit-cost mean. v3+ (ADR-0332). |
enc_internal_mv_std | float | Per-frame motion-vector bit-cost standard deviation. v3+ (ADR-0332). |
enc_internal_itex_mean | float | Mean intra-texture cost across I/i frames. v3+ (ADR-0332). |
enc_internal_ptex_mean | float | Mean predicted-texture cost across P/B/b frames. v3+ (ADR-0332). |
enc_internal_intra_ratio | float | Fraction of macroblocks coded as intra. v3+ (ADR-0332). |
enc_internal_skip_ratio | float | Fraction of macroblocks coded as skip. v3+ (ADR-0332). |
The ten enc_internal_* columns are populated for adapters that declare supports_encoder_stats = True (currently libx264 and libx265). The parser normalizes x264 macroblock counters and x265 icu / pcu / scu CTU counters into the same intra / predicted / skip ratio columns. Hardware encoders (NVENC / AMF / QSV / VideoToolbox) and the AV1 / VVC software encoders (libaom-av1 / libsvtav1 / libvvenc) opt out and emit 0.0 for every column so the schema is uniform across the corpus. The trade-off: per-encode wall-clock cost roughly doubles for opt-in adapters because the harness runs a stats-only -pass 1 invocation before the production CRF encode.
Example row
{
"schema_version": 3,
"run_id": "0a3b1c8b...",
"timestamp": "2026-05-03T16:00:00+00:00",
"src": "ref.yuv",
"src_sha256": "",
"width": 1920, "height": 1080, "pix_fmt": "yuv420p",
"framerate": 24.0, "duration_s": 10.0,
"encoder": "libx264", "encoder_version": "libx264-164",
"preset": "medium", "crf": 28,
"extra_params": [],
"encode_path": "",
"encode_size_bytes": 845210,
"bitrate_kbps": 676.168,
"encode_time_ms": 4321.0,
"vmaf_score": 92.41,
"vmaf_model": "vmaf_v0.6.1",
"score_time_ms": 1820.5,
"ffmpeg_version": "6.1.1",
"vmaf_binary_version": "3.0.0-lusoris.0",
"exit_status": 0,
"clip_mode": "full",
"shot_count": 12,
"shot_avg_duration_sec": 0.83,
"shot_duration_std_sec": 0.41,
"adm2_mean": 9.73, "adm2_std": 0.12,
"enc_internal_qp_mean": 25.23,
"enc_internal_qp_std": 0.12,
"enc_internal_bits_mean": 4975.0,
"enc_internal_bits_std": 1820.5,
"enc_internal_mv_mean": 60.3,
"enc_internal_mv_std": 32.1,
"enc_internal_itex_mean": 8000.0,
"enc_internal_ptex_mean": 1500.0,
"enc_internal_intra_ratio": 0.07,
"enc_internal_skip_ratio": 0.16
}
recommend subcommand — target-VMAF / target-bitrate
vmaf-tune recommend consumes the corpus (either pre-built via --from-corpus PATH.jsonl or generated on the fly from --source + grid flags) and applies one of two predicates:
- --target-vmaf T — return the row with the smallest CRF whose vmaf_score >= T. If no row clears the bar, the row with the highest VMAF is returned and the predicate is annotated (UNMET) in the output. Exit code is still 0 for an honest closest-miss.
- --target-bitrate KBPS — return the row whose bitrate_kbps is closest (absolute distance) to KBPS. Ties on distance go to the smaller CRF (higher quality).
The two flags are mutually exclusive — argparse rejects passing both with exit code 2.
When reading a pre-built corpus, recommend ignores rows whose exit_status is non-zero and rows with missing or non-finite vmaf_score. The --encoder and --preset flags act as filters in that mode, so a mixed-codec corpus can be reused without first splitting it into per-codec files.
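The two predicates and the row filter can be expressed compactly. The sketch below follows the rules as stated in this section; the function names are hypothetical and an empty usable set is not handled (min/max would raise ValueError).

```python
import math


def usable_rows(rows):
    """Drop failed encodes and rows with missing or non-finite vmaf_score."""
    return [
        row for row in rows
        if row.get("exit_status") == 0
        and isinstance(row.get("vmaf_score"), (int, float))
        and math.isfinite(row["vmaf_score"])
    ]


def pick_target_vmaf(rows, target):
    """Return (row, met): smallest CRF clearing the target, else highest VMAF."""
    rows = usable_rows(rows)
    clearing = [row for row in rows if row["vmaf_score"] >= target]
    if clearing:
        return min(clearing, key=lambda row: row["crf"]), True
    return max(rows, key=lambda row: row["vmaf_score"]), False  # UNMET


def pick_target_bitrate(rows, kbps):
    """Closest bitrate by absolute distance; ties go to the smaller CRF."""
    rows = usable_rows(rows)
    return min(rows, key=lambda row: (abs(row["bitrate_kbps"] - kbps), row["crf"]))
```

Sorting on the tuple (distance, crf) implements the tie-break directly: equal distances fall through to the CRF comparison.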
Use a pre-built corpus
# Phase A — build once.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
--framerate 24 --duration 10 \
--preset medium --crf 18 --crf 22 --crf 26 --crf 30 --crf 34 \
--output corpus.jsonl
# Smallest CRF whose VMAF >= 92.
vmaf-tune recommend --from-corpus corpus.jsonl --target-vmaf 92.0
# CRF whose bitrate is closest to 5 Mbps.
vmaf-tune recommend --from-corpus corpus.jsonl --target-bitrate 5000
Build the corpus on the fly
vmaf-tune recommend \
--source ref.yuv --width 1920 --height 1080 \
--framerate 24 --duration 10 \
--preset medium --crf 18 --crf 22 --crf 26 --crf 30 \
--target-vmaf 92.0
If --preset and --crf are omitted, recommend sweeps medium × range(18, 36, 2) as a sensible default for ad-hoc runs.
Output¶
Default output is a single human-readable line on stdout, e.g.
encoder=libx264 preset=medium crf=22 vmaf=95.000 bitrate_kbps=5000.00 \
predicate=target_vmaf>=92.0 margin=+3.000
Pass --json to get the full corpus row as a JSON object on stdout instead — convenient for piping into other tooling.
recommend flags¶
| Flag | Default | Notes |
|---|---|---|
| --target-vmaf T | — | Smallest CRF whose vmaf_score >= T. |
| --target-bitrate KBPS | — | CRF whose bitrate_kbps is closest to KBPS. |
| --from-corpus PATH | — | Read rows from a pre-built JSONL. Skips encode + score. |
| --source / --width / --height / --framerate / --duration | — | Build a corpus on the fly. Required when --from-corpus is omitted. |
| --encoder / --preset / --crf | libx264 / medium / [18,20,...,34] | Sweep grid (when building). Filter (when loading). |
| --json | off | Emit the winning row as JSON instead of the prose summary. |
fast subcommand — proxy + Bayesian + GPU-verify (Phase A.5)¶
vmaf-tune fast is the seconds-to-minutes alternative to the Phase A grid for the recommendation use case. It runs an Optuna TPE search over the integer CRF axis, scores each trial with the fr_regressor_v2 proxy (ADR-0291) on the canonical-6 libvmaf features extracted from a short probe encode, then runs one real-encode + libvmaf verify pass at the recommended CRF before reporting. The slow grid stays canonical (ADR-0276) — fast is opt-in, and falls back to the grid when the proxy/verify gap exceeds the configured tolerance.
Install¶
The fast-path needs Optuna in addition to the core install:
The shipped [fast] extra is the only correct install path; the core package stays zero-extra-dep so corpus generation works on hosts that never run the fast path.
Smoke run (no ffmpeg / no ONNX / no GPU)¶
The --smoke flag swaps the proxy + verify pipeline for a deterministic synthetic CRF→VMAF curve so CI on bare hosts still exercises the search loop:
{
"encoder": "libx264",
"n_trials": 12,
"notes": "smoke mode — synthetic predictor; no ffmpeg / ONNX / GPU. See ADR-0276 + ADR-0304 + Research-0076 for the production path.",
"predicted_kbps": 1954.27,
"predicted_vmaf": 82.65,
"proxy_verify_gap": null,
"recommended_crf": 27,
"smoke": true,
"target_vmaf": 92.0,
"verify_vmaf": null
}
Production run¶
vmaf-tune fast \
--src ref.yuv --width 1920 --height 1080 \
--framerate 24 --pix-fmt yuv420p \
--encoder libx264 --preset medium \
--target-vmaf 92.0 \
--crf-min 18 --crf-max 40 \
--n-trials 30 \
--score-backend auto \
--output recommendation.json
The recommendation lands as a single JSON object — same schema recommend and predict already emit, plus the fast-path-specific verify_vmaf and proxy_verify_gap diagnostics:
{
"encoder": "libx264",
"target_vmaf": 92.0,
"recommended_crf": 22,
"predicted_vmaf": 92.41,
"predicted_kbps": 4820.0,
"n_trials": 30,
"smoke": false,
"notes": "production: TPE over 30 trials with v2 proxy; GPU verify gap = 0.612 VMAF (tolerance 1.50).",
"verify_vmaf": 91.80,
"proxy_verify_gap": 0.612,
"score_backend": "cuda"
}
Exit codes¶
| Code | Meaning |
|---|---|
| 0 | Recommendation produced; proxy/verify gap within tolerance. |
| 2 | Argument validation error (missing --src, bad CRF range, ...). |
| 3 | Out-of-distribution: proxy/verify gap exceeded --proxy-tolerance. The recommendation is still emitted; callers should fall back to the slow Phase A grid (vmaf-tune corpus + vmaf-tune recommend). |
Fall-back idiom¶
vmaf-tune fast --src ref.yuv --width 1920 --height 1080 \
--target-vmaf 92.0 --output rec.json \
|| vmaf-tune recommend --source ref.yuv --width 1920 --height 1080 \
--preset medium --target-vmaf 92.0 --output rec.json
The || chain falls back on any non-zero exit, which covers both the argument-validation case (rc=2) and the OOD case (rc=3), so the slow grid is the safety net whenever the fast path fails or is not confident.
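When the caller is a script rather than a shell one-liner, the exit codes can be told apart instead of lumped together. The sketch below is one possible wrapper, not part of vmaf-tune: the function name and the injectable run parameter are hypothetical, and it deliberately treats rc=2 as a caller bug rather than a reason to fall back.

```python
import subprocess
from typing import Callable, Sequence


def recommend_with_fallback(
    fast_cmd: Sequence[str],
    grid_cmd: Sequence[str],
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> str:
    """Run the fast path; fall back to the grid only on the OOD exit code (3)."""
    rc = run(list(fast_cmd))
    if rc == 0:
        return "fast"  # proxy/verify gap within tolerance
    if rc == 3:
        grid_rc = run(list(grid_cmd))  # slow Phase A grid as the safety net
        if grid_rc != 0:
            raise RuntimeError(f"grid fallback failed with exit code {grid_rc}")
        return "grid"
    # rc == 2 (bad arguments) or anything unexpected: surface it to the caller.
    raise RuntimeError(f"fast path failed with exit code {rc}")
```

Passing a fake run callable makes the branching testable without ffmpeg or the vmaf-tune CLI installed.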
fast flags¶
| Flag | Default | Notes |
|---|---|---|
| --src PATH | — | Source video. Required outside --smoke. |
| --width / --height | 0 | Raw-YUV geometry. Required outside --smoke. |
| --pix-fmt | yuv420p | ffmpeg pix_fmt for the probe + verify encodes. |
| --framerate | 24.0 | Reference framerate. |
| --target-vmaf T | — | Quality target on the standard [0, 100] scale. Required. |
| --encoder | libx264 | Codec adapter; must be in ENCODER_VOCAB_V2 for production mode. |
| --preset | medium | Encoder preset for the probe + verify encodes. |
| --crf-min / --crf-max | 10 / 51 | TPE search range over the integer CRF axis. |
| --n-trials | 30 (prod), 50 (smoke) | TPE trial budget. |
| --time-budget-s | 300 | Soft wall-clock cap for Optuna. TPE stops scheduling new trials after the timeout; an in-flight probe finishes. |
| --proxy-tolerance | 1.5 | Max abs proxy/verify gap before exit code 3. |
| --sample-chunk-seconds | 5.0 | Probe-slice duration per TPE trial. |
| --smoke | off | Synthetic curve; no ffmpeg / ONNX / GPU. |
| --score-backend | auto | Verify-pass backend (auto/cpu/cuda/sycl/hip). (vulkan removed in ADR-0726.) |
| --ffmpeg-bin / --vmaf-bin | ffmpeg / vmaf | Tool paths. |
| --vmaf-model | vmaf_v0.6.1 | libvmaf model for the verify pass. |
| --encode-dir | .workingdir2/fast | Scratch dir for probe + verify encodes. |
| --output | stdout | JSON destination for the recommendation payload. |
prefilter subcommand — Pelorus deband + CRF joint autotune (ADR-1116)¶
vmaf-tune prefilter runs a VMAF-in-the-loop search that jointly tunes the Pelorus deband pre-filter's strengths and the encoder CRF, returning the lowest-bitrate combination that hits a target VMAF. This is the control-plane ("mode 2") seam between vmafx and Pelorus (Pelorus ADR-0106; contract in Pelorus ADR-0110).
vmafx stays Vulkan-free: it only emits the ffmpeg -vf "pelorus_deband_vulkan=range=..:thry=..:.." string and scores the encoded output. The deband filter runs inside ffmpeg — so the live loop requires an ffmpeg build with the Pelorus Vulkan deband filter. When that filter is absent the subcommand refuses the live run with a clear message and points you at --smoke.
The frozen knob contract¶
The search space is exactly the 10 tunable knobs Pelorus ADR-0110 freezes (name / type / range / default). vmafx hard-codes this table — renaming, narrowing, or retyping a knob is a coordinated two-repo break:
| Knob | Type | Range | Default | Meaning |
|---|---|---|---|---|
| range | int | 1–31 | 15 | reference-sampling radius (px) |
| thry | float | 0.0–0.25 | 0.012 | luma flat-test threshold |
| thrc | float | 0.0–0.25 | 0.012 | chroma flat-test threshold |
| grainy | float | 0.0–0.4 | 0.006 | luma grain amplitude |
| grainc | float | 0.0–0.4 | 0.0 | chroma grain amplitude |
| softness | float | 0.0–1.0 | 0.5 | soft-blend transition width |
| detail | float | 0.0–0.25 | 0.06 | detail-mask activity threshold |
| dither | enum | 0–2 | 2 | 0=none, 1=bayer8, 2=bluenoise |
| dynamic | bool | 0–1 | 1 | re-seed grain each frame |
| protect | bool | 0–1 | 1 | gate debanding off textured regions |
The out-of-contract options (sample, blur, planes, meta) are never swept — they are pipeline-topology / reporting switches set once per run outside the optimizer.
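As an illustration of how a frozen knob table can drive both validation and the emitted -vf string, here is a small Python sketch. The knob names, ranges, and defaults are copied from the table above; the KNOBS layout, the build_vf name, and the exact value formatting are assumptions, not the code vmafx ships.

```python
# Knob table copied from the contract: name -> (type, min, max, default).
# Enum and bool knobs are modelled as small integers, as in the Range column.
KNOBS = {
    "range": (int, 1, 31, 15),
    "thry": (float, 0.0, 0.25, 0.012),
    "thrc": (float, 0.0, 0.25, 0.012),
    "grainy": (float, 0.0, 0.4, 0.006),
    "grainc": (float, 0.0, 0.4, 0.0),
    "softness": (float, 0.0, 1.0, 0.5),
    "detail": (float, 0.0, 0.25, 0.06),
    "dither": (int, 0, 2, 2),
    "dynamic": (int, 0, 1, 1),
    "protect": (int, 0, 1, 1),
}


def build_vf(overrides=None):
    """Compose a pelorus_deband_vulkan filter string from validated knob values."""
    values = {name: spec[3] for name, spec in KNOBS.items()}
    for name, value in (overrides or {}).items():
        if name not in KNOBS:
            # sample / blur / planes / meta are out of contract and never swept.
            raise KeyError(f"not a contract knob: {name}")
        kind, lo, hi, _default = KNOBS[name]
        value = kind(value)
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} outside [{lo}, {hi}]")
        values[name] = value
    options = ":".join(f"{name}={value}" for name, value in values.items())
    return f"pelorus_deband_vulkan={options}"
```

Keeping the table as data means an optimizer and a validator read the same source of truth, which is the property the two-repo contract relies on.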
How the joint search works¶
Each Optuna TPE trial proposes a full (deband-dict, crf) and the loop runs [pelorus_deband_vulkan=...] → HW encode → VMAF score for it. The objective is |achieved_vmaf − target| + λ·kbps, so the search converges on the lowest-bitrate deband+CRF that hits the target. CRF is an ordinal integer dimension joined to the 10 deband dimensions in one study — the two axes are co-optimised, not searched in nested loops (debanding shifts the rate-quality curve, so they are not separable). The result reports the recommended strengths, CRF, achieved VMAF, and the per-probe VMAF log.
The search engine is the same TPESampler study the fast subcommand uses (ADR-0276 / ADR-0304) — prefilter only constructs the joint search space and the objective.
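The objective described above is simple enough to write down directly. The following Python sketch assumes a hypothetical λ of 1e-4 per kbps and a hypothetical probe record shape (crf, vmaf, kbps); the real weighting and the Optuna wiring are not shown in this document.

```python
def joint_objective(achieved_vmaf, kbps, target_vmaf, lam=1e-4):
    """|achieved_vmaf - target| + lam * kbps; lower is better."""
    return abs(achieved_vmaf - target_vmaf) + lam * kbps


def best_probe(probes, target_vmaf, lam=1e-4):
    """Pick the probe with the smallest objective from a list of probe dicts."""
    return min(
        probes,
        key=lambda p: joint_objective(p["vmaf"], p["kbps"], target_vmaf, lam),
    )
```

With a small λ the distance-to-target term dominates, and the bitrate term mainly breaks ties between probes that land equally close to the target.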
Quick start (smoke)¶
Exercise the joint search end-to-end with a synthetic surface — no ffmpeg, no Vulkan, no GPU:
Live loop (requires a Pelorus-enabled ffmpeg)¶
vmaf-tune prefilter \
--src ref.yuv --width 1920 --height 1080 \
--target-vmaf 93 --encoder libx264 \
--crf-min 18 --crf-max 40 \
--ffmpeg-bin /opt/ffmpeg-pelorus/bin/ffmpeg
The emitted recommendation includes a ready-to-use recommended_vf fragment, e.g.:
Restrict the swept knobs with one or more --sweep-knob flags (the rest stay at the filter default):
prefilter flags¶
| Flag | Default | Notes |
|---|---|---|
| --src PATH | — | Source video. Required for the live loop. |
| --width / --height | 0 | Raw-YUV geometry. Required for the live loop. |
| --pix-fmt | yuv420p | ffmpeg pix_fmt for the probe encodes. |
| --framerate | 24.0 | Reference framerate. |
| --target-vmaf T | — | Quality target on the [0, 100] scale. Required. |
| --encoder | libx264 | Codec adapter that performs the post-deband encode. |
| --preset | medium | Encoder preset for the probe encodes. |
| --filter | pelorus_deband | Pre-encode filter adapter to autotune. |
| --sweep-knob KNOB | all 10 | Repeatable; restricts the swept deband knobs. |
| --crf-min / --crf-max | 18 / 40 | Joint TPE search range over CRF. |
| --n-trials | 60 (live), 40 (smoke) | TPE trial budget. |
| --time-budget-s | 600 | Soft wall-clock cap for Optuna. |
| --seed | 0 | TPE sampler seed (reproducible search). |
| --smoke | off | Synthetic deband+CRF surface; no ffmpeg / Vulkan / GPU. |
| --score-backend | auto | Probe-score backend (auto/cpu/cuda/sycl/hip). |
| --ffmpeg-bin / --vmaf-bin | ffmpeg / vmaf | Tool paths. |
| --vmaf-model | vmaf_v0.6.1 | libvmaf model for the probe scores. |
| --neg | off | Use the VMAF NEG model variant. |
| --encode-dir | .workingdir2/prefilter | Scratch dir for probe encodes. |
| --output | stdout | JSON destination for the recommendation payload. |
Note: the live encode path is unit-tested with a mocked encode/score loop but has not been run against a real Pelorus-enabled ffmpeg in this environment (the filter is not installed here). See docs/state.md → T-PREFILTER-LIVE-ENCODE-UNTESTED-2026-06-14.
Codec adapters¶
Phase A wires libx264 end-to-end through the search loop. Additional codec adapters land as one-file additions under tools/vmaf-tune/src/vmaftune/codec_adapters/ and join the registry without touching the search loop. The currently-registered adapters are discoverable via vmaftune.codec_adapters.known_codecs().
libaom-av1¶
Google's reference AV1 encoder. Adapter shipped via ADR-0279.
- Encoder name (FFmpeg): libaom-av1.
- Quality knob: -crf integer in [0, 63] (default 35); higher CRF is lower quality.
- Speed knob: -cpu-used integer in [0, 9] (0 = slowest/best, 9 = fastest).

The adapter exposes a human-readable preset vocabulary that matches x264/x265 so a single sweep axis covers all four encoders.
| --preset name | libaom -cpu-used |
|---|---|
| placebo | 0 (slowest, highest quality) |
| slowest | 1 |
| slower | 2 |
| slow | 3 |
| medium | 4 (default) |
| fast | 5 |
| faster | 6 |
| veryfast | 7 |
| superfast | 8 |
| ultrafast | 9 (fastest) |
Sample FFmpeg invocation produced by the adapter (Phase B+ wires the codec args into vmaftune.encode):
libaom vs SVT-AV1 trade-offs¶
libaom-av1 and libsvtav1 both target the AV1 bitstream but sit at different points on the speed/quality curve. Use the table below as a rough decision aid; the per-corpus numbers belong in your local sweep output, not here.
| Concern | libaom-av1 | libsvtav1 |
|---|---|---|
| Encode wall time at matched preset | meaningfully slower | meaningfully faster |
| Quality at slow presets (matched bitrate) | slightly higher per AOM benchmarks | slightly lower |
| Quality at fast presets | comparable | comparable, sometimes ahead |
| CRF range | 0..63 | 0..63 |
| Best fit | offline / high-quality archive encodes | live, batch, large catalog |
The fork's vmaf-tune corpus rows record the exact (encoder, preset, crf, vmaf_score, encode_time_ms, bitrate_kbps) tuple, so Phase C/D predictors can pick whichever encoder dominates the relevant region of the rate-distortion plane on a given source.
SVT-AV1-HDR (juliobbv-p/svt-av1-hdr) is a runtime variant of the libsvtav1 adapter, not a separate adapter. Its FFmpeg examples still use -c:v libsvtav1; the difference is which SVT library the FFmpeg binary is linked against. Use ADAPTER@VARIANT tokens plus --encoder-ffmpeg-bin to compare mainline SVT-AV1 and SVT-AV1-HDR in one report:
vmaf-tune compare \
--src clip.mkv --width 3840 --height 2160 --pix-fmt yuv420p10le \
--target-vmafs 94,96,98 \
--encoders libsvtav1,libsvtav1@svt-av1-hdr \
--ffmpeg-bin /opt/ffmpeg-8.1.1-main/bin/ffmpeg \
--encoder-ffmpeg-bin libsvtav1@svt-av1-hdr=/opt/ffmpeg-8.1.1-svtav1-hdr/bin/ffmpeg \
--format json --output svtav1-vs-hdr.json
The row label stays human-readable (codec = libsvtav1@svt-av1-hdr), while machine-readable provenance records adapter = libsvtav1, runtime_variant = svt-av1-hdr, and the exact ffmpeg_bin path. Tokens without a per-token binding use the global --ffmpeg-bin.
Hardware encoders (NVENC)¶
Phase A also wires the NVIDIA NVENC family for hardware-accelerated sweeps:
| Adapter --encoder | FFmpeg encoder | Hardware required |
|---|---|---|
| h264_nvenc | h264_nvenc | NVIDIA Kepler+ (most modern GPUs) |
| hevc_nvenc | hevc_nvenc | NVIDIA Maxwell 2nd-gen+ (GTX 960+) |
| av1_nvenc | av1_nvenc | NVIDIA Ada Lovelace+ (RTX 40-series, L40, L4) |
NVENC's quality knob is -cq (a constant-quality target under VBR rate control), the closest analogue to libx264 CRF. The fork's CQ window is the same [15, 40] perceptually informative range used for libx264; the hardware accepts [0, 51].
NVENC has seven preset levels (p1 fastest → p7 slowest). The CLI takes the same mnemonic preset names as libx264 and maps them:
| Mnemonic | NVENC preset |
|---|---|
| ultrafast, superfast, veryfast | p1 |
| faster | p2 |
| fast | p3 |
| medium (default) | p4 |
| slow | p5 |
| slower | p6 |
| slowest, placebo | p7 |
Hardware encoders (AMF)¶
vmaf-tune ships software (libx264) and hardware adapters under one codec-adapter contract — the harness search loop is identical for all of them. AMD AMF (Advanced Media Framework) covers H.264, HEVC, and AV1 on AMD GPUs (see ADR-0282):
| Encoder | Codec | Hardware | FFmpeg flag | Quality knob |
|---|---|---|---|---|
| h264_amf | H.264 / AVC | Any AMD GPU with AMF | -c:v h264_amf | -qp_i / -qp_p (cqp) |
| hevc_amf | HEVC / H.265 | Any AMD GPU with AMF | -c:v hevc_amf | -qp_i / -qp_p (cqp) |
| av1_amf | AV1 | RDNA3+ only (RX 7000 series and newer) | -c:v av1_amf | -qp_i / -qp_p (cqp) |
Requirements:
- AMD GPU with AMF support (AV1 needs RDNA3 silicon or newer).
- FFmpeg built with --enable-amf.
- The AMF runtime / driver installed (Adrenalin on Windows; the open-source Mesa AMF stack or AMD Pro driver on Linux).
Behavioural notes:
- Preset compression: 7 levels collapse to 3. AMF exposes only three -quality rungs (quality, balanced, speed) where libx264 / NVENC / QSV expose seven preset names. The adapter maps preset names onto AMF rungs as follows:

| Preset names | AMF -quality |
|---|---|
| placebo, slowest, slower, slow | quality |
| medium (default) | balanced |
| fast, faster, veryfast, superfast, ultrafast | speed |

This is opinionated: AMF's hardware pipeline does not expose finer steps. Callers that need finer granularity should pin -qp_i / -qp_p (the qp quality knob) instead of the preset.

- Rate control is constant-QP. AMF -rc cqp plus matched -qp_i / -qp_p is the closest analogue to x264 CRF. Range is 0..51; the harness exposes the (15, 40) Phase A informative window for cross-codec comparability.
Example:
Coarse-to-fine CRF search (ADR-0306)¶
Sweeping every CRF from 0..51 (52 encodes per source × preset) is wasteful when the only question is "what's the smallest CRF whose VMAF still meets my target?". The fork ships a 2-pass coarse-to-fine search that visits ~15 points instead of 52, a ~3.5× wall-time speedup with no measurable quality regression.
How it works:
- Coarse pass at --coarse-step over [10, 50]; by default that's [10, 20, 30, 40, 50] (5 encodes).
- Fine pass at --fine-step within ±--fine-radius of the best-coarse CRF. With defaults that's the 10 unique CRFs around the best (e.g. [25..29, 31..35] if best-coarse is 30).
- 1-pass shortcut: when the highest-CRF coarse point already meets the VMAF target, no refinement is needed (lower bitrate would have to come from CRFs above the coarse grid, which the fine pass wouldn't probe anyway). The search stops after the coarse pass.
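The point counts above can be checked with a short Python sketch of the two grids. The function names are hypothetical, and the defaults used here (coarse step 10, fine step 1, fine radius 5, clipping to 0..51) are inferred from the worked example rather than taken from the tool's source.

```python
def coarse_points(lo=10, hi=50, step=10):
    """CRFs visited by the coarse pass, inclusive of both ends."""
    return list(range(lo, hi + 1, step))


def fine_points(best_coarse, radius=5, step=1, crf_min=0, crf_max=51, visited=()):
    """CRFs within +/- radius of the best coarse CRF that were not already encoded."""
    seen = set(visited)
    window = range(best_coarse - radius, best_coarse + radius + 1, step)
    return [crf for crf in window if crf_min <= crf <= crf_max and crf not in seen]


def needs_fine_pass(coarse_scores, target_vmaf):
    """1-pass shortcut: skip refinement when the highest coarse CRF meets the target."""
    highest_crf = max(coarse_scores)
    return coarse_scores[highest_crf] < target_vmaf
```

With these defaults the coarse pass contributes 5 encodes and the fine pass 10, which is where the figure of roughly 15 points comes from.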
corpus --coarse-to-fine¶
vmaf-tune corpus \
--source ref.yuv \
--width 1920 --height 1080 --pix-fmt yuv420p \
--framerate 24 --duration 10 \
--encoder h264_nvenc \
--preset medium --preset slow \
--crf 23 --crf 28 --crf 34 \
--output corpus_nvenc.jsonl
(The --crf flag carries the quality value regardless of whether the encoder names it CRF or CQ; the adapter forwards it as -cq for NVENC.)
Hardware vs software trade-off¶
NVENC is 10–100× faster than the software encoders at the cost of quality. Empirically, h264_nvenc at medium typically loses 3–5 VMAF points versus libx264 medium at the same bitrate, depending on content complexity. The Pareto frontier is genuinely different — that is precisely why the harness treats NVENC as separate codec entries rather than a flag on libx264. Use NVENC when you need a large corpus quickly or when the production pipeline is GPU-encoded; use software when you need the perceptually best encode at a given bitrate.
If FFmpeg reports Encoder h264_nvenc not found (or one of the sibling encoders), the FFmpeg build wasn't compiled with --enable-nvenc or the GPU lacks the relevant generation. The harness records the failure as exit_status != 0 and skips scoring, so a partial corpus over a heterogeneous fleet is still well-formed.
vmaf-tune corpus \
--encoder h264_amf \
--preset slow --preset medium --preset fast \
--crf 23 --crf 28 --crf 34 \
--output corpus_amf.jsonl
The raw FFmpeg invocation the adapter emits looks like:
Hardware encoders (QSV)¶
Beyond libx264, the registry exposes the three Intel QSV (Quick Sync Video) hardware adapters added in ADR-0281. They share the QSV preset vocabulary (veryslow, slower, slow, medium, fast, faster, veryfast — same names as x264's medium-and-down subset) and use -global_quality N (ICQ rate control, range 1..51, semantically similar to CRF).
| Adapter name | FFmpeg encoder | Quality knob | Hardware required |
|---|---|---|---|
| h264_qsv | h264_qsv | global_quality (1–51) | Intel iGPU 7th-gen+ (Kaby Lake or newer) or Arc / Battlemage |
| hevc_qsv | hevc_qsv | global_quality (1–51) | Intel iGPU 7th-gen+ (10-bit needs 11th-gen+) or Arc / Battlemage |
| av1_qsv | av1_qsv | global_quality (1–51) | Intel iGPU 12th-gen+ only or Arc / Battlemage |
Hardware-device initialisation (Linux). FFmpeg's QSV bridge on Linux requires an explicit VA-API device chain before the input and a pixel-format conversion filter after it. vmaf-tune injects these automatically for all QSV encoders:
# Before -i:
-init_hw_device vaapi=va:/dev/dri/renderD129
-init_hw_device qsv=qsv_dev@va
-filter_hw_device va
# After -i, before -c:v:
-vf format=nv12,hwupload=extra_hw_frames=64
Without this chain every QSV encode fails with -22 Invalid argument. The VA-API device defaults to auto: vmaf-tune walks /dev/dri/by-path and /sys/class/drm/renderD*/device/vendor, selects the first Intel vendor node (0x8086), and falls back to /dev/dri/renderD128 only when no Intel render node is discoverable. Override with --vaapi-device /dev/dri/renderD129 or VMAFTUNE_VAAPI_DEVICE=/dev/dri/renderD129 when you need to pin a specific node. (ADR-0641)
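The auto-selection rule can be approximated in Python as follows. This is a simplified sketch, not the harness's code: it only reads the sysfs vendor files (the /dev/dri/by-path walk is omitted), it models the environment override but not the --vaapi-device flag, and the function name and parameters are hypothetical.

```python
import os
from pathlib import Path

INTEL_VENDOR_ID = "0x8086"
FALLBACK_DEVICE = "/dev/dri/renderD128"


def pick_vaapi_device(sysfs_drm="/sys/class/drm", env=None):
    """Return a render node path: env override, first Intel node, or the fallback."""
    env = os.environ if env is None else env
    pinned = env.get("VMAFTUNE_VAAPI_DEVICE")
    if pinned:
        return pinned
    for node in sorted(Path(sysfs_drm).glob("renderD*")):
        try:
            vendor = (node / "device" / "vendor").read_text().strip().lower()
        except OSError:
            continue  # node without a readable vendor file
        if vendor == INTEL_VENDOR_ID:
            return f"/dev/dri/{node.name}"
    return FALLBACK_DEVICE
```

Taking the sysfs root as a parameter keeps the function testable on hosts without any DRM devices.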
The underlying FFmpeg invocations look like:
ffmpeg \
-init_hw_device vaapi=va:/dev/dri/renderD129 \
-init_hw_device qsv=qsv_dev@va \
-filter_hw_device va \
-i src.mkv \
-vf format=nv12,hwupload=extra_hw_frames=64 \
-c:v h264_qsv -preset medium -global_quality 23 -an out.mkv
vmaf-tune validates the (preset, global_quality) pair before spawning ffmpeg and probes ffmpeg -encoders for the requested encoder; if libmfx / VPL is not compiled in, the harness raises RuntimeError with a build-time hint rather than letting ffmpeg emit an Encoder not found line buried in stderr.
Apple VideoToolbox adapters¶
The h264_videotoolbox, hevc_videotoolbox, and prores_videotoolbox registry entries cover Apple Silicon (M-series) and T2-equipped Intel Macs via the VideoToolbox.framework hardware encoder. The H.264 and HEVC adapters were added in ADR-0283; the ProRes adapter follows the same registry pattern (see ADR-0283 Status update 2026-05-09). All three share _videotoolbox_common.py for the preset → -realtime mapping; H.264 / HEVC also share the -q:v quality knob, while ProRes uses the integer profile tier instead.
| Adapter name | FFmpeg encoder | Quality knob | Hardware required |
|---|---|---|---|
| h264_videotoolbox | h264_videotoolbox | q:v (0..100, higher = better) | Apple Silicon or Intel Mac with T2 |
| hevc_videotoolbox | hevc_videotoolbox | q:v (0..100, higher = better) | Apple Silicon or Intel Mac with T2 |
| prores_videotoolbox | prores_videotoolbox | profile:v (0..5, higher tier = better) | Apple Silicon M1 Pro / Max / Ultra or later |
VideoToolbox exposes only a binary -realtime {0,1} flag instead of a multi-valued preset, so the harness's nine-name preset vocabulary collapses onto that boolean per the table in _videotoolbox_common.py: ultrafast/superfast/veryfast/faster/fast → realtime=1 (low-latency fast path); medium/slow/slower/veryslow → realtime=0 (offline / quality-priority). The mapping is intentionally lossy — VT cannot expose a finer dial.
The underlying FFmpeg invocations look like:
ffmpeg -i src.mkv -c:v h264_videotoolbox -realtime 0 -q:v 60 -an out.mkv
ffmpeg -i src.mkv -c:v hevc_videotoolbox -realtime 0 -q:v 60 -an out.mkv
ffmpeg -i src.mkv -c:v prores_videotoolbox -realtime 0 -profile:v hq -an out.mov
AV1 hardware encoding is intentionally not wired — Apple Silicon has no AV1 hardware encoder block as of 2026 and FFmpeg exposes no av1_videotoolbox. Use libaom-av1 or libsvtav1 for AV1 on macOS.
vmaf-tune validates the (preset, quality) pair via the adapter (-q:v for H.264 / HEVC; integer tier id for ProRes) and probes ffmpeg -encoders for the requested encoder; if VideoToolbox is unavailable (e.g. the host is Linux), the harness raises RuntimeError rather than letting ffmpeg emit Encoder not found.
ProRes tier reference¶
ProRes is a fixed-rate intermediate codec — there is no CRF / QP scalar. Quality is selected entirely by the tier, and bitrate is implicit in the tier × resolution × frame-rate combination. The harness's --crf flag carries the integer tier id (the FFmpeg profile:v AVOption value); the adapter emits the canonical FFmpeg alias on the argv for diagnosability.
| crf value | FFmpeg alias | Marketing name | Typical use |
|---|---|---|---|
| 0 | proxy | ProRes 422 Proxy | Offline editing, dailies |
| 1 | lt | ProRes 422 LT | Broadcast acquisition |
| 2 | standard | ProRes 422 | Mainline broadcast master |
| 3 | hq | ProRes 422 HQ | High-end broadcast / film master (default) |
| 4 | 4444 | ProRes 4444 | Graphics, alpha, colour grading |
| 5 | xq | ProRes 4444 XQ | High-dynamic-range / wide-gamut master |
Source: FFmpeg libavcodec/videotoolboxenc.c prores_options AVOption table (verified against an FFmpeg n8.1.1 checkout).
ProRes is intra-only — every frame is a keyframe — so --keyint / --force-keyframes flags are accepted but have no rate-distortion effect. The harness still emits them so the muxer's seek-table density is predictable across codecs.
Saliency-aware encoding (recommend-saliency --saliency-aware)¶
Bucket #2 of the PR #354 audit (see ADR-0293) wires the fork-trained saliency_student_v1 ONNX model (ADR-0286) into vmaf-tune so a single command can produce an encode that biases bits toward salient regions (faces, focal subjects, action) and saves bits on background.
Synopsis¶
vmaf-tune recommend-saliency \
--src ref.yuv --width 1920 --height 1080 --framerate 24 \
--duration-frames 240 \
--preset medium --crf 23 \
--saliency-aware \
[--saliency-offset -4] \
[--saliency-model model/tiny/saliency_student_v1.onnx] \
[--saliency-aggregator mean|ema|max|motion-weighted] \
--output out.mp4
How it works¶
- compute_saliency_map() samples the requested frame window from the source YUV, runs those frames through saliency_student_v1.onnx (ImageNet-normalised RGB derived from YUV, NCHW [1, 3, H, W]), and reduces the per-frame saliency outputs into one mask in [0, 1].
- saliency_to_qp_map() linearly maps the mask to per-pixel QP deltas: --saliency-offset is the QP delta at peak saliency (negative means better quality on salient regions). Background gets the symmetric positive delta. The output is clamped to [-12, +12] (matching the vmaf-roi sidecar convention from ADR-0247).
- The pixel-level QP-offset map is dispatched to the encoder-specific ROI channel (see the encoder-targets table below); each channel reduces the map to its native granularity before writing any sidecar file or composing the argv slice.
- The augmented extra_params (qpfile path, zones string, or ROI-map path) are appended to the FFmpeg command and the normal encode path runs.
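The mask-to-QP step and the block reduction can be illustrated with plain Python lists. This is a sketch under stated assumptions: the linear form offset × (2s − 1) is one mapping that gives the offset at peak saliency and the symmetric positive delta on background, the block reducer uses a rounded mean, and both function names are hypothetical stand-ins rather than the functions in vmaftune.saliency.

```python
def saliency_to_qp(mask, offset=-4.0, clamp=12.0):
    """Map a 2-D saliency mask in [0, 1] to per-pixel QP deltas.

    s = 1 (peak saliency) yields `offset`; s = 0 (background) yields `-offset`.
    """
    return [
        [max(-clamp, min(clamp, offset * (2.0 * s - 1.0))) for s in row]
        for row in mask
    ]


def reduce_to_blocks(qp_map, block=16):
    """Average per-pixel deltas over block x block tiles and round to integers."""
    height, width = len(qp_map), len(qp_map[0])
    blocks = []
    for top in range(0, height, block):
        block_row = []
        for left in range(0, width, block):
            cells = [
                qp_map[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            block_row.append(round(sum(cells) / len(cells)))
        blocks.append(block_row)
    return blocks
```

The same reducer with block=64 would correspond to the coarser grid used by encoders whose ROI unit is a 64×64 block.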
Trade-off¶
| Axis | Direction |
|---|---|
| Bitrate (same VMAF) | −10 % to −20 % for content with strong attention focus (faces, action, sport). Background-uniform content sees little change. |
| Encode time | +5 % typical (saliency inference + per-MB reduce; per-frame model time is sub-millisecond on CPU at SD/HD). |
| Decode time | unchanged (the bitstream is standard-compliant for all five supported encoders). |
| Quality (VMAF) | unchanged at the clip-mean level; concentrated where the eye looks. |
Numbers are indicative. Today's recommend-saliency subcommand is a one-shot encode at --crf (or the adapter default); target-VMAF selection remains the job of recommend, compare, or a caller-provided bisect loop.
Temporal aggregation¶
--saliency-aggregator controls how sampled frame masks become the single ROI pattern applied to the encode:
| Aggregator | Behaviour | Use when |
|---|---|---|
| mean | Per-pixel arithmetic mean; preserves the original implementation. | Default, stable clips, and baseline comparisons. |
| ema | Exponential moving average with --saliency-ema-alpha as the current-frame weight. | Motion or cuts make the latest sampled frames more representative. |
| max | Per-pixel maximum over sampled masks. | Missing a brief salient object is worse than over-protecting background. |
| motion-weighted | Weighted mean where each sampled frame is weighted by luma delta from the previous sampled frame. | Motion-heavy clips where changing frames should dominate the aggregate. |
All four reducers use the same saliency_student_v1 weights and the same downstream QP mapping, so changing the reducer does not change the model contract or encoder sidecar format.
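Three of the reducers are easy to express on flattened masks, as in the Python sketch below. The function name, the flat-list representation, and the default alpha of 0.5 are assumptions for illustration; motion-weighted is left out because the exact weighting of the first sampled frame is not specified here.

```python
def aggregate_masks(masks, mode="mean", ema_alpha=0.5):
    """Reduce per-frame masks (flat lists of floats in [0, 1]) to one mask."""
    if not masks:
        raise ValueError("need at least one sampled mask")
    size = len(masks[0])
    if mode == "mean":
        return [sum(m[i] for m in masks) / len(masks) for i in range(size)]
    if mode == "max":
        return [max(m[i] for m in masks) for i in range(size)]
    if mode == "ema":
        acc = list(masks[0])
        for current in masks[1:]:
            # ema_alpha is the weight of the current frame, as in the table above.
            acc = [
                ema_alpha * cur + (1.0 - ema_alpha) * prev
                for cur, prev in zip(current, acc)
            ]
        return acc
    raise ValueError(f"unsupported aggregator in this sketch: {mode}")
```

Because every mode returns a mask of the same shape and range, the downstream QP mapping does not need to know which reducer produced it.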
Graceful fallback¶
If onnxruntime is not installed or model/tiny/saliency_student_v1.onnx cannot be loaded, recommend-saliency --saliency-aware logs a warning and falls back to a plain encode. This matches the vmaf-roi C sidecar's posture.
If the chosen --encoder has no ROI dispatch implementation (e.g. h264_nvenc, libvpx-vp9, hevc_nvenc) the command exits with code 2 and a structured error message listing the supported codecs. To accept a plain encode instead, pass --saliency-fallback-plain (or set the environment variable VMAFTUNE_SALIENCY_FALLBACK_OK=1); in that case an ERROR is logged rather than a WARNING. This hard-fail posture matches ADR-0498 / ADR-0546.
Encoder targets¶
The saliency pipeline supports five encoder ROI mechanisms:
| Encoder | ROI channel | Granularity | argv slot |
|---|---|---|---|
| libx264 | ASCII --qpfile (x264 r2390+) | 16×16 luma MB | -x264-params qpfile=… |
| libaom-av1 | patched FFmpeg -qpfile ROI bridge | 16×16 luma MB mapped onto libaom MI cells | -qpfile … |
| libx265 | --zones QP delta (per-clip mean) | full-clip spatial mean | -x265-params zones=0,N,q=<delta> |
| libsvtav1 | --qp-file offset map (SVT-AV1 v1.7+) | 64×64 super-block | -svtav1-params qp-file=… |
| libvvenc | ROIFile CSV (VVenC v1.14.0+) | 64×64 CTU | -vvenc-params ROIFile=… |
See ADR-0293 (x264 baseline) and ADR-0414 (x265 / SVT-AV1 / VVenC). The libaom-av1 row uses the shared -qpfile bridge documented in vmaf-tune-ffmpeg.md.
Per-adapter helpers in vmaftune.saliency:
- write_x265_zones_arg(block_offsets, duration_frames) → zones string
- write_x264_qpfile(block_offsets, out_path, duration_frames) → x264/libaom qpfile
- write_svtav1_qpoffset_map(block_offsets, out_path, duration_frames) → Path
- write_vvenc_roi_csv(block_offsets, out_path, duration_frames) → Path
Caveats¶
- Aggregate mask, not per-frame ROI. The current implementation reduces saliency across the sampled frames and applies one delta pattern across the whole clip. Per-frame ROI is on the roadmap.
- x265 zones: spatial mean only. x265's --zones has temporal granularity but not per-block spatial granularity. The zone carries the mean QP delta across all blocks. Per-block x265 ROI requires a future x265 qpfile port (different format from x264).
- libaom segment quantisation. The FFmpeg bridge maps 16×16 qpfile deltas onto libaom's MI grid and at most eight segment QPs. Very fine-grained deltas are therefore quantised to the nearest segment.
- SVT-AV1 / VVenC: 64×64 granularity. Both encoders document 64×64 as their ROI-map unit. The saliency mask is reduced to this grid via reduce_qp_map_to_blocks(qp_map, block=64) before writing.
- RGB saliency input. The student model receives BT.709-limited yuv420p converted to ImageNet-normalised RGB. Chroma is nearest-neighbour upsampled to luma resolution before inference.
- Don't use the placeholder. mobilesal_placeholder_v0 and the radial fallback inside vmaf-roi are smoke-test stubs. Pass an explicit --saliency-model pointing at the real fork-trained weights when you want a perceptual benefit.
Reproducer¶
The test suite mocks the ONNX session and the encode runner so it runs without ffmpeg or onnxruntime installed:
pytest tools/vmaf-tune/tests/test_saliency.py \
tools/vmaf-tune/tests/test_saliency_roi_adapters.py \
tools/vmaf-tune/tests/test_saliency_roi_codec.py -v
Codec adapter contract¶
The encode driver (tools/vmaf-tune/src/vmaftune/encode.py) is codec-agnostic as of ADR-0297. It looks up the codec adapter via vmaftune.codec_adapters.get_adapter(req.encoder) and asks the adapter for its FFmpeg argv slice — the harness itself never branches on codec identity. Adding a new codec is one file under tools/vmaf-tune/src/vmaftune/codec_adapters/ plus a registry entry; the search loop, corpus row schema, and FFmpeg invocation stay untouched.
A codec adapter is a frozen dataclass exposing:
| Member | Type | Purpose |
|---|---|---|
| name | str | Human-readable codec id ("libx264"). |
| encoder | str | FFmpeg -c:v value ("libx264", "h264_nvenc", ...). |
| quality_knob | str | Name of the quality knob ("crf", "cq", "qp", ...). |
| quality_range | tuple[int, int] | Inclusive (min, max) for the knob. |
| quality_default | int | Default quality value. |
| invert_quality | bool | True when a higher value means lower quality (CRF / QP). |
| presets | tuple[str, ...] | Allowed preset names. |
| validate(preset, quality) -> None | method | Raises ValueError on out-of-range input. |
| ffmpeg_codec_args(preset, quality) -> list[str] | method | Codec-specific argv slice (e.g. ["-c:v", "libx264", "-preset", "medium", "-crf", "23"]). |
| extra_params() -> tuple[str, ...] | method (optional) | Additional non-codec argv (e.g. ("-svtav1-params", "tune=0")). |
The dispatcher composes the final ffmpeg command as:
[ffmpeg, -y, -hide_banner, -loglevel info,
-f rawvideo -pix_fmt <pf> -s WxH -r FR -i <src>,
*adapter.ffmpeg_codec_args(preset, quality),
*adapter.extra_params(),
*req.extra_params,
<output>]
Adapters that do not yet implement ffmpeg_codec_args (or for which get_adapter raises KeyError) fall back to the legacy x264-CRF shape (-c:v <encoder> -preset <p> -crf <q>) so partial adapters stay drivable end-to-end while their per-codec PRs are in flight.
parse_versions(stderr, encoder=...) selects a per-codec version probe; missing matches degrade to "unknown" rather than raising.
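To make the contract concrete, here is a toy adapter and dispatcher in Python. It is an illustration only: ToyNvencAdapter and compose_argv are hypothetical names, the default quality of 23 is an assumption, and the real adapters under codec_adapters/ may differ in detail. The preset mapping and the [15, 40] window are taken from the NVENC section of this page.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Mnemonic preset -> NVENC preset level, per the NVENC table in this document.
_NVENC_PRESETS = {
    "ultrafast": "p1", "superfast": "p1", "veryfast": "p1",
    "faster": "p2", "fast": "p3", "medium": "p4",
    "slow": "p5", "slower": "p6", "slowest": "p7", "placebo": "p7",
}


@dataclass(frozen=True)
class ToyNvencAdapter:
    name: str = "h264_nvenc"
    encoder: str = "h264_nvenc"
    quality_knob: str = "cq"
    quality_range: Tuple[int, int] = (15, 40)
    quality_default: int = 23  # assumed default, not confirmed by the source
    invert_quality: bool = True
    presets: Tuple[str, ...] = tuple(_NVENC_PRESETS)

    def validate(self, preset: str, quality: int) -> None:
        lo, hi = self.quality_range
        if preset not in self.presets:
            raise ValueError(f"unknown preset: {preset}")
        if not lo <= quality <= hi:
            raise ValueError(f"{self.quality_knob}={quality} outside [{lo}, {hi}]")

    def ffmpeg_codec_args(self, preset: str, quality: int) -> List[str]:
        self.validate(preset, quality)
        return ["-c:v", self.encoder, "-preset", _NVENC_PRESETS[preset],
                "-cq", str(quality)]

    def extra_params(self) -> Tuple[str, ...]:
        return ()


def compose_argv(adapter, src, out, width, height, pix_fmt, framerate,
                 preset, quality, req_extra=()):
    """Assemble the ffmpeg command in the order the dispatcher section lists."""
    return [
        "ffmpeg", "-y", "-hide_banner", "-loglevel", "info",
        "-f", "rawvideo", "-pix_fmt", pix_fmt,
        "-s", f"{width}x{height}", "-r", str(framerate), "-i", src,
        *adapter.ffmpeg_codec_args(preset, quality),
        *adapter.extra_params(),
        *req_extra,
        out,
    ]
```

Because compose_argv only calls adapter methods, swapping in another adapter object changes the codec slice without touching the dispatcher, which is the property the contract is meant to guarantee.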
Adapters in flight¶
| Codec | PR | Status |
|---|---|---|
libx264 | shipped (Phase A) | green |
libx265 | #362 | adapter ships; dispatcher unblocks end-to-end |
libsvtav1 | #370 | adapter ships; dispatcher unblocks end-to-end |
libaom-av1 | #360 | adapter ships; dispatcher unblocks end-to-end |
libvvenc | #368 | adapter ships; dispatcher unblocks end-to-end |
h264_nvenc / hevc_nvenc / av1_nvenc | #364 | adapter ships; dispatcher unblocks end-to-end |
h264_qsv / hevc_qsv | #367 | adapter ships; dispatcher unblocks end-to-end |
h264_amf / hevc_amf | #366 | adapter ships; dispatcher unblocks end-to-end |
h264_videotoolbox / hevc_videotoolbox | #373 | adapter ships; dispatcher unblocks end-to-end |
Codec comparison¶
vmaf-tune compare answers the perennial "should I migrate from x264 to SVT-AV1 yet?" question per-source: given one reference and a target VMAF, run each codec's recommend predicate in parallel and rank the results by smallest file. This is Bucket #7 of the vmaf-tune capability audit; the default CLI backend is Phase B target-VMAF bisect per ADR-0326.
vmaf-tune compare \
--src ref.yuv \
--width 1920 --height 1080 --pix-fmt yuv420p \
--framerate 24 --duration 10 \
--sample-clip-seconds 4 \
--target-vmaf 92 \
--encoders libx264,libx265,libsvtav1,libaom-av1,libvvenc \
--crf-min 15 --crf-max 40 \
--format markdown
The real bisect backend needs source geometry because raw YUV does not self-describe. Pass --width and --height explicitly; --pix-fmt, --framerate, and --duration default to the common SDR 24 fps shape but should be set for accurate scoring and bitrate math. Pass --sample-clip-seconds N to evaluate the centre N seconds of the source per bisect iteration; compare forwards matching --frame_skip_ref / --frame_cnt scorer bounds and normalises bitrate against the sample duration. For custom rankers or tests, --predicate-module MODULE:CALLABLE still accepts any importable (codec, src, target_vmaf) -> RecommendResult callable and bypasses the bisect backend.
Container sources auto-probe their framerate / duration (ADR-0509). When --src is a container (mp4 / mkv / mov / Y4M / ...) and you omit --framerate / --duration, the CLI calls ffprobe on the source and substitutes the probed values before invoking the bisect predicate. This avoids a silent failure mode where the argparse default --framerate=24 against a 60 fps source mis-aligns reference vs. distorted decodes and collapses VMAF to the 4-90 band regardless of CRF. Explicit --framerate / --duration flags still win; a stderr warning fires when an explicit value disagrees with the probed source rate (subsampling is legitimate but should be deliberate). For raw .yuv sources nothing changes — the probe is skipped because raw YUV does not self-describe its framerate.
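The precedence between explicit flags, probed values, and the argparse default can be summarised in a few lines. The helper name `resolve_framerate`, the tolerance, and the warning text are assumptions made for this sketch; only the ordering comes from the paragraph above.

```python
def resolve_framerate(explicit, probed, default=24.0):
    """Return (framerate, warning) following the documented precedence.

    explicit: value the user passed, or None when the flag was omitted.
    probed:   value reported by ffprobe, or None for raw YUV sources.
    """
    if explicit is not None:
        warning = None
        # Assumed tolerance; the real mismatch check may differ.
        if probed is not None and abs(explicit - probed) > 1e-3:
            warning = f"--framerate {explicit} disagrees with probed {probed}"
        return explicit, warning
    if probed is not None:
        return probed, None
    return default, None
```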
Sample output (--format markdown, abridged):
# Codec comparison — target VMAF 92
- Source: `ref.yuv`
- Tool: `vmaf-tune 0.0.1`
- Wall time: 6421.3 ms
| Rank | Codec | Encoder | Best CRF | Bitrate (kbps) | Encode time (ms) | VMAF | Status |
|---:|---|---|---:|---:|---:|---:|---|
| 1 | libaom-av1 | libaom-3.8.0 | 30 | 1500.0 | 18000.0 | 92.40 | ok |
| 2 | libx265 | libx265-3.5 | 26 | 1700.0 | 4200.0 | 92.00 | ok |
| 3 | libsvtav1 | libsvtav1-1.7.0 | 32 | 1900.0 | 2800.0 | 92.30 | ok |
| 4 | libx264 | libx264-164 | 23 | 2400.0 | 1500.0 | 92.10 | ok |
**Smallest file**: `libaom-av1` at CRF 30 → 1500.0 kbps (VMAF 92.40).
Multi-target rate-quality sweep (schema v2, ADR-0516 + ADR-0534 + ADR-0538)¶
vmaf-tune compare defaults to a 4-point rate-quality sweep covering premium-archival operating points: --target-vmafs 94,96,97,98 (ADR-0538, supersedes the ADR-0534 streaming defaults). The JSON output stamps schema_version: 2 and carries one row per (codec, target_vmaf) pair, plus an optional bisect_samples list per row that records every encode+score probe the underlying target-VMAF bisect computed. vmaf-tune report --compare-json sweep.json then renders a per-codec rate-quality curve assembled from those probes, with the picked-CRF rows highlighted as larger circled markers and the pareto frontier (lowest bitrate at each target) drawn as a heavier dashed overlay. Current HTML/Markdown profile cards also include a Quick takeaways block before the charts, short "how to read this" notes, report-local codec identity chips (text badges that link to the upstream codec or vendor project), per-codec failure status, and an embedded encoder_profile JSON payload. The takeaways spell out the smallest successful row at each target, failed/unavailable row count, ladder span, and per-shot CRF spread so non-expert readers do not have to infer the main result from the chart alone. The profile payload is intentionally machine-readable: vmaf-tune encode-profile can read the HTML, Markdown, or raw JSON and turn one selected recommendation into a concrete FFmpeg encode.
# Out of the box: default 4-point sweep, 2-codec compare, GPU scoring.
vmaf-tune compare \
--src bbb_1080p_60fps.mp4 \
--width 1920 --height 1080 --framerate 60 \
--encoders libx265,libsvtav1 \
--sample-clip-seconds 3 --max-iterations 3 \
--score-backend cuda \
--format json --output sweep.json
vmaf-tune report \
--src bbb_1080p_60fps.mp4 \
--compare-json sweep.json --target-vmaf 92 \
--format html --output sweep_report.html
# Convenience path: render the same profile card directly from compare.
vmaf-tune compare \
--src bbb_1080p_60fps.mp4 \
--width 1920 --height 1080 --framerate 60 \
--encoders libx265,libsvtav1,av1_nvenc,av1_qsv \
--sample-clip-seconds 3 --max-iterations 3 \
--score-backend cuda \
--format both --output sweep_profile.html
# Reuse one recommendation from that profile without re-running the sweep.
vmaf-tune encode-profile \
--profile sweep_profile.html \
--src bbb_1080p_60fps.mp4 \
--codec libsvtav1 --target-vmaf 96 \
--output bbb_svtav1_vmaf96.mkv
Why these defaults (ADR-0538 supersedes ADR-0530)
- 94,96,97,98 covers premium-archival. The fork's primary user encodes archival masters at VMAF >= 95 exclusively; VMAF 94 is the subjectively-transparent floor on 4K source, 98 is the near-lossless ceiling. The previous ADR-0530 / ADR-0534 default (75,80,85,90,93) targeted streaming / broadcast workflows that this fork does not service, so its R-Q chart contained no points the user picks CRFs from.
- VMAF >= 95 is reachable. The earlier "top stops at 93" caveat was a bisect-harness artefact: the search window defaulted to the codec adapter's narrow quality_range (e.g. libx265 = (15, 40), libsvtav1 = (20, 50)) which the adapter validator additionally enforced as a hard CRF gate. ADR-0538 widens the default search window to the encoder's absolute CRF range; see the High-VMAF bisect contract subsection below for the per-codec table and the contract callers can rely on.
- The chart renders from bisect_samples, not from the picked-CRF cells. Connecting picked-CRF rows per codec across targets produced physically impossible downward dips, because the bisect overshoots each target by a different amount. Plotting every probe the bisect already computed shows the genuine per-codec R-Q curve.
High-VMAF bisect contract (ADR-0538)¶
When crf_range is left at its default (the CLI default; no --crf-min / --crf-max passed), bisect_target_vmaf searches the encoder's absolute CRF range, not the codec adapter's perceptually-informative quality_range:
| Codec | Absolute CRF range | Informative quality_range (legacy default) |
|---|---|---|
libx264 | 0..51 | 0..51 (already maximal) |
libx265 | 0..51 | 15..40 |
libvpx-vp9 | 0..63 | 0..63 (already maximal) |
libaom-av1 | 0..63 | 0..63 (already maximal) |
libsvtav1 | 0..63 | 20..50 |
Other (*_nvenc, *_qsv, *_amf, *_videotoolbox, libvvenc) | falls back to adapter.crf_min/crf_max then adapter.quality_range | as declared by the adapter |
The contract guarantees:
- The bisect starts at the encoder's accepted floor. For libx264 / libx265 / libvpx-vp9 / libaom-av1 / libsvtav1 the lowest probe is CRF 0 (lossless), so any reasonable source produces VMAF >= 98 at that CRF and high targets are reachable.
- max_iterations >= 6 covers the widest window. ceil(log2(64)) = 6; the CLI default is --max-iterations 8 (+2 safety). For a target near the top of the VMAF scale (>= 97) callers running the bisect programmatically should keep at least 6 iterations.
- Overshoot at the floor is OK. If the codec already overshoots the target at CRF 0 (e.g. CRF 0 gives VMAF 99.5 on a low-distortion source against target 96), the bisect narrows toward higher CRFs looking for the highest CRF that still clears the target and returns that one with ok=True. The achieved VMAF will be >= target_vmaf by construction.
- The adapter's narrow informative window is bypassed for CRF validation but not for preset validation. Preset names are still checked against the adapter's whitelist; CRFs are checked against the encoder absolute range. Pass --crf-min / --crf-max explicitly to recover the historical narrow-window behaviour.
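To make the "highest CRF that still clears the target" rule concrete, here is a plain integer bisect over a CRF window. It is a sketch under stated assumptions, not bisect_target_vmaf itself: `bisect_highest_crf` is a hypothetical name, the `score` callback stands in for one encode+score round trip, VMAF is assumed to fall monotonically as CRF rises, and the dedicated floor probe is omitted.

```python
def bisect_highest_crf(score, target_vmaf, crf_range=(0, 63), max_iterations=8):
    """Find the highest CRF whose score still meets target_vmaf (sketch).

    Returns (best, samples): best is (crf, vmaf) or None when no probe
    cleared the target; samples lists every probe in visiting order.
    """
    lo, hi = crf_range
    best = None
    samples = []
    while lo <= hi and len(samples) < max_iterations:
        crf = (lo + hi) // 2
        vmaf = score(crf)
        samples.append((crf, vmaf))
        if vmaf >= target_vmaf:
            best = (crf, vmaf)  # clears the target; try a higher CRF
            lo = crf + 1
        else:
            hi = crf - 1        # misses the target; back off to lower CRFs
    return best, samples


# Synthetic, strictly decreasing quality curve used only for illustration.
best, samples = bisect_highest_crf(lambda crf: 100.0 - 0.25 * crf, 96.0)
```

Because a probe is only recorded as `best` when it meets the target, the returned VMAF is at or above the target whenever a result exists, which mirrors the "by construction" guarantee above.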
Single-target legacy (v1, back-compat): pass --target-vmaf NN without --target-vmafs to fall back to the v1 single-target schema (one row per codec at the single target, bar+dot chart). A _TrackedDefaultAction sentinel distinguishes "user passed --target-vmaf explicitly" from "argparse default" so legacy single-target scripts keep working unchanged.
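One way such a sentinel can work is an argparse action that records whether the flag was actually seen. The sketch below is an assumption about the general technique, not the fork's `_TrackedDefaultAction`; the class name, the `_explicit` attribute suffix, and the `None` default for `--target-vmafs` are all hypothetical simplifications.

```python
import argparse


class TrackedDefault(argparse.Action):
    """Store the value and remember that the user passed the flag."""

    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, values)
        setattr(namespace, f"{self.dest}_explicit", True)


def build_parser():
    parser = argparse.ArgumentParser(prog="compare-sketch")
    parser.add_argument("--target-vmaf", type=float, default=92.0,
                        action=TrackedDefault)
    # Simplification: None means "sweep left at its default".
    parser.add_argument("--target-vmafs", default=None)
    return parser


def wants_v1(ns):
    """True when only --target-vmaf was passed explicitly."""
    explicit = getattr(ns, "target_vmaf_explicit", False)
    return explicit and ns.target_vmafs is None
```

The action only runs when the option appears on the command line, so the marker attribute is absent for the argparse default and `getattr` falls back to `False`.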
Hardware encoders (*_nvenc, *_qsv, *_amf) are availability-probed before dispatch: a two-stage probe greps ffmpeg -encoders for the encoder name (catches "not compiled into this ffmpeg build") and then runs a 1-frame lavfi nullsrc dummy encode (catches "no compatible GPU at runtime"). Encoders that fail either stage surface as ok=false rows with a stable hardware encoder not available: … error string; the renderer flags them visually and the sweep continues without aborting the run.
The --encoders flag is optional; when omitted it defaults to the CPU set libx265,libsvtav1. Older archival smoke runs also swept libx264 and libvpx-vp9, but those two CPU lanes dominate wall time without covering the fork's current HEVC/AV1 decision points.
BBB v9 artifacts are retained only as historical bug evidence; do not use v9 as a current run baseline. New BBB compare probes should start from the ADR-0641 profile-report path (--format both) and list libx264 / libvpx-vp9 only for an explicit legacy comparison.
compare CLI flags¶
| Flag | Default | Notes |
|---|---|---|
--src PATH | — | Required. Single reference clip. |
--target-vmaf F | 92.0 | Single VMAF target. When passed explicitly and --target-vmafs is at its default sweep, the v1 single-target schema is emitted (ADR-0534 back-compat). |
--target-vmafs LIST | 94,96,97,98 (ADR-0538, supersedes ADR-0534) | Comma-separated VMAF targets to sweep per codec. The default covers premium-archival operating points (4K archival masters at VMAF 94-98). Pass a single value to opt into the v1 path; pass an explicit multi-value list (e.g. 80,85,90) to override the sweep range. |
--encoders LIST | libx265,libsvtav1 | Comma-separated codec names. Hardware encoders (h264_nvenc, hevc_nvenc, av1_nvenc, h264_qsv, hevc_qsv, av1_qsv, h264_amf, hevc_amf, av1_amf) are accepted; missing-encoder rows skip with a reason rather than failing the whole run. |
--width / --height | — | Required for the default real-bisect backend. |
--pix-fmt | yuv420p | Source pixel format forwarded to the scorer. |
--framerate | 24.0 (or auto-probed for container --src) | Source framerate. Container sources (mp4 / mkv / mov / Y4M / ...) auto-probe via ffprobe when this flag is left at its default; explicit values still win with a stderr-warning on probed-vs-user mismatch (ADR-0509). |
--duration | 0.0 (or auto-probed for container --src) | Source duration in seconds, used for bitrate math. Same auto-probe semantics as --framerate for container sources (ADR-0509). |
--sample-clip-seconds | 0.0 | 0.0 scores the full source. Positive values shorter than --duration use a centre sample window for encode, score, and bitrate math (ADR-0301). |
--preset | adapter default | Preset forwarded to the codec adapter. |
--crf-min / --crf-max | encoder absolute range (ADR-0538); adapter range where no absolute range is tabled | Inclusive CRF search window. Pass both or neither. |
--max-iterations | 8 | Encode+score round-trip cap per codec. |
--vmaf-model | vmaf_v0.6.1 | VMAF model forwarded to the scorer. |
--score-backend | scorer default | cpu, cuda, sycl, hip, or auto. (vulkan removed in ADR-0726.) |
--ffmpeg-bin / --vmaf-bin | ffmpeg / vmaf | Binary overrides. |
--encoder-ffmpeg-bin ENCODER=PATH | off | Bind one compare token to a specific FFmpeg binary. Use with ADAPTER@VARIANT labels such as libsvtav1@svt-av1-hdr=/opt/ffmpeg-8.1.1-svtav1-hdr/bin/ffmpeg; unbound tokens use --ffmpeg-bin. |
--format | markdown | One of markdown, json, csv, html, both. html and both render the profile-card report directly; both writes .html and .md next to --output and therefore requires --output. |
--no-parallel | off | Run codecs sequentially (default: thread pool, one per codec). |
--max-workers N | len(encoders) | Cap on the parallel thread pool. |
--predicate-module MOD:FN | off | Advanced hook that bypasses the bisect backend. |
--no-bisect | off | Switch to CRF-sweep mode: skip target-VMAF bisect and encode each (codec, CRF) pair from --crf-sweep exactly once. See CRF sweep mode below. (ADR-0542) |
--crf-sweep LIST | — | Comma-separated CRF values to use in --no-bisect mode. Example: 18,23,28,33. Required when --no-bisect is passed. (ADR-0542) |
--workdir PATH | None | Directory under which to create the per-run temporary scratch directory for the decoded reference YUV and encodes. Overrides VMAFTUNE_WORKDIR. When unset, falls through to VMAFTUNE_WORKDIR (if set), then to the OS default (/tmp). Pass this when your source is large and /tmp is a small tmpfs (e.g. 634-second 1080p60 BBB decodes to ~118 GB raw YUV). (ADR-0598) |
--max-concurrent-decodes N | 1 | Maximum number of reference-YUV decode operations that may run simultaneously across all codec bisect threads. Default 1 (serial decodes) caps peak workdir disk usage to one YUV at a time regardless of how many codecs run in the thread pool. For example, with 3 codecs and a 110 GB BBB source, the default prevents the 330 GB peak that caused the v13 ENOSPC failure — peak stays at 110 GB. Raise to N on hosts where --workdir points to a volume with sufficient free space and the I/O subsystem can sustain N parallel decode streams. Encoder runs are always parallel; only the decode-to-raw-YUV step is serialised at the default. (ADR-0577) |
--output PATH | stdout | Write the rendered report to PATH instead of stdout. |
compare output schema¶
The JSON / CSV columns are exported as vmaftune.compare.COMPARE_ROW_KEYS: codec, adapter, runtime_variant, ffmpeg_bin, encoder_version, best_crf, bitrate_kbps, encode_time_ms, vmaf_score, target_vmaf, ok, error. Failed rows trail successful ones in the ranking; ok=False rows carry a human-readable error and sentinel numerics (-1 for best_crf, NaN for the floats). adapter, runtime_variant, and ffmpeg_bin are provenance fields for ADAPTER@VARIANT compare runs; they are empty on rows produced by old programmatic predicates that do not bind a runtime variant.
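The ranking rule (successful rows by smallest bitrate, failed rows trailing with sentinel numerics) could be expressed as below. The helper names `failed_row` and `rank_rows` are hypothetical, and only a subset of the row keys is populated to keep the sketch short.

```python
import math


def failed_row(codec, error):
    """Build an ok=False row with the documented sentinel numerics."""
    return {"codec": codec, "best_crf": -1, "bitrate_kbps": math.nan,
            "encode_time_ms": math.nan, "vmaf_score": math.nan,
            "ok": False, "error": error}


def rank_rows(rows):
    """Sort ok rows by ascending bitrate; failed rows keep input order last."""
    ok = sorted((r for r in rows if r["ok"]), key=lambda r: r["bitrate_kbps"])
    failed = [r for r in rows if not r["ok"]]
    return ok + failed


rows = [
    {"codec": "libx265", "best_crf": 26, "bitrate_kbps": 1700.0, "ok": True, "error": ""},
    failed_row("h264_nvenc", "hardware encoder not available"),
    {"codec": "libaom-av1", "best_crf": 30, "bitrate_kbps": 1500.0, "ok": True, "error": ""},
]
ranked = [r["codec"] for r in rank_rows(rows)]
```

Splitting the rows before sorting avoids comparing NaN bitrates, which would otherwise make the sort order unpredictable.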
v1 (single-target legacy): emitted when --target-vmaf is passed explicitly and --target-vmafs is left at its default, or when --target-vmafs carries a single value. The JSON has no schema_version key; rows is one row per codec at the single target.
v2 (multi-target sweep, ADR-0516 + ADR-0534 + ADR-0538): emitted when --target-vmafs lists ≥ 2 targets (the new default, ADR-0538). The JSON carries "schema_version": 2, "target_vmafs": [94.0, 96.0, 97.0, 98.0], and rows is one row per (codec, target_vmaf) pair. Each row also carries an optional bisect_samples: [{crf, bitrate_kbps, vmaf_score, encode_time_ms}, ...] list (ADR-0530, additive) recording every encode+score probe the underlying bisect computed; the field is absent on old v2 dumps pre-dating ADR-0530.
vmaf-tune report detects the v1 vs v2 discriminator via the schema-version key (or the presence of target_vmafs) and picks the right chart: v1 renders the bar+dot chart, v2 renders the per-codec rate-quality curve. When bisect_samples is populated the chart plots every probe per codec (deduplicated by CRF, sorted by bitrate) and overlays the picked-CRF rows as larger circled markers; the Pareto frontier stays as the dashed overlay. When bisect_samples is absent (old v2 dump or v1) the chart falls back to the legacy connect-the-dots line with a caveat note in the title. The connect-the-dots representation can show physically impossible downward dips when the bisect's per-target overshoot varies, which is the failure mode ADR-0530 fixes for the new path.
The CSV emitter intentionally drops the bisect_samples column (extrasaction="ignore" on the writer); it stays a JSON-only structured field.
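A minimal sketch of that behaviour with the standard library writer follows. The key tuple is copied from the column list above; `rows_to_csv` is a hypothetical helper, not the emitter in vmaftune.compare.

```python
import csv
import io

# Column order as listed for vmaftune.compare.COMPARE_ROW_KEYS.
COMPARE_ROW_KEYS = (
    "codec", "adapter", "runtime_variant", "ffmpeg_bin", "encoder_version",
    "best_crf", "bitrate_kbps", "encode_time_ms", "vmaf_score",
    "target_vmaf", "ok", "error",
)


def rows_to_csv(rows):
    """Render rows as CSV, silently dropping keys outside COMPARE_ROW_KEYS."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COMPARE_ROW_KEYS,
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


csv_text = rows_to_csv([{
    "codec": "libx265", "best_crf": 26, "bitrate_kbps": 1700.0,
    "vmaf_score": 92.0, "target_vmaf": 92.0, "ok": True, "error": "",
    "bisect_samples": [{"crf": 26, "bitrate_kbps": 1700.0}],
}])
```

With `extrasaction="ignore"` the nested list never reaches the flat CSV, and keys missing from a row are written as empty cells.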
Encode-time normalisation: the encode_time_ms column is wall-clock on whatever machine ran the predicate. Cross-codec time comparisons only make sense when every predicate was run on the same hardware in the same configuration; see Research-0061 Bucket #7.
CRF sweep mode (--no-bisect)¶
When --no-bisect is passed together with --crf-sweep, compare skips the target-VMAF bisect entirely and instead encodes each (codec, CRF) pair exactly once. This is useful for operators who want to sweep a fixed CRF ladder (e.g. 18, 23, 28, 33) and inspect the resulting (bitrate, VMAF) pairs — faster than running four separate bisect sweeps and more direct than the corpus pipeline.
vmaf-tune compare \
--src clip.mp4 \
--encoders libx264,libx265,libsvtav1 \
--no-bisect --crf-sweep 18,23,28,33 \
--duration 60 --sample-clip-seconds 30 \
--format json --output cmp_sweep.json
3 codecs × 4 CRFs = 12 rows in cmp_sweep.json.
v3 (CRF-sweep, ADR-0542): emitted when --no-bisect is set. The JSON carries "schema_version": 3, "mode": "crf_sweep", "crf_sweep": [18, 23, 28, 33], and rows is one row per (codec, crf) pair. Each row carries codec, adapter, runtime_variant, ffmpeg_bin, crf, bitrate_kbps, vmaf_score, encode_time_ms, encoder_version, ok, and error. The --target-vmaf / --target-vmafs flags are accepted but act as label-only knobs in this mode (they annotate pareto frontier markers when rendered, but do not drive the encode loop). CRF-sweep mode currently requires --format json; render the result with a separate downstream report step once v3 ingestion lands.
Hardware encoder availability is probed the same way as the bisect path: unavailable encoders (e.g. h264_nvenc without an NVIDIA device) produce ok=false rows for each CRF rather than aborting the entire run.
HDR-aware tuning (Bucket #9, ADR-0300)¶
Phase A auto-detects HDR sources and injects codec-appropriate HDR encode flags + HDR VMAF scoring. Detection runs ffprobe against each --source once at corpus start; the per-source encode argv gets the resulting HDR flag set appended.
What gets detected¶
A source is classified as HDR iff its first video stream carries both of:
- color_transfer ∈ {smpte2084 (PQ), arib-std-b67 / hlg}, and
- color_primaries ∈ {bt2020, bt2020nc, bt2020-ncl, bt2020c, bt2020-cl}.
Mismatched signaling (e.g. PQ transfer with BT.709 primaries) is treated as SDR — misclassifying SDR as HDR is the dangerous failure mode. Mastering-display + max-CLL SEI side data is read when present and propagated to encoders that expose stable FFmpeg SEI flags (x265, SVT-AV1, HEVC NVENC).
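The classification rule reduces to two set lookups. The sketch below assumes the ffprobe fields have already been read into strings; `classify_hdr` and the constant names are hypothetical, while the member values come from the list above.

```python
PQ_TRANSFERS = {"smpte2084"}
HLG_TRANSFERS = {"arib-std-b67", "hlg"}
BT2020_PRIMARIES = {"bt2020", "bt2020nc", "bt2020-ncl", "bt2020c", "bt2020-cl"}


def classify_hdr(color_transfer, color_primaries):
    """Return "pq", "hlg", or None (SDR) for one video stream.

    Both conditions must hold; mismatched signaling is treated as SDR.
    """
    if color_primaries not in BT2020_PRIMARIES:
        return None
    if color_transfer in PQ_TRANSFERS:
        return "pq"
    if color_transfer in HLG_TRANSFERS:
        return "hlg"
    return None
```

Checking primaries first keeps the conservative default: a PQ transfer tagged with BT.709 primaries falls through to SDR.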
Detection modes¶
| Mode | When to use |
|---|---|
--auto-hdr (default) | Mixed corpora; let ffprobe classify each source. |
--force-sdr | Disable HDR injection entirely (override probe). |
--force-hdr-pq | Raw YUV refs with no container metadata; you know the source is PQ. |
--force-hdr-hlg | Same, for HLG. |
The four flags are mutually exclusive.
Codec dispatch¶
| Encoder | HDR signaling carrier |
|---|---|
libx264 | Container-level -color_* flags only (x264 has no in-stream HDR SEI). |
libaom-av1 | Global -color_* tags only; no fork-owned private SEI mapping yet. |
libx265 | Global -color_* + -x265-params colorprim=bt2020:transfer=...:colormatrix=bt2020nc[:master-display=...:max-cll=...:hdr10-opt=1]. |
libsvtav1 | Global -color_* + -svtav1-params color-primaries=9:transfer-characteristics=16 (PQ) or =18 (HLG) :matrix-coefficients=9. |
hevc_nvenc | -pix_fmt p010le -profile:v main10 + global -color_* + -master_display / -max_cll (when ffmpeg supports them). |
av1_nvenc | -pix_fmt p010le + global -color_* tags. |
hevc_qsv / hevc_amf / hevc_videotoolbox | -pix_fmt p010le -profile:v main10 + global -color_* tags. |
av1_qsv / av1_amf | -pix_fmt p010le + global -color_* tags. |
libvvenc | Global -color_* only (SEI options live behind --vvenc-params in newer ffmpeg builds). |
Encoders not in the dispatch table emit no HDR flags and the corpus row's hdr_* fields still record the detection result.
HDR VMAF scoring (model-port slot)¶
vmaftune.hdr.select_hdr_vmaf_model(model_dir, transfer="pq"|"hlg") resolves an HDR-trained model JSON via a two-stage lookup:
- canonical filename: model/vmaf_hdr_v0.6.1.json (the Netflix research artefact name); preferred when transfer is "pq" or "hlg".
- glob fallback: model/vmaf_hdr_*.json (so future revisions can land without code changes).
The fork does not ship the JSON in this PR. Verified 2026-05-08 against Netflix/vmaf master model/: no vmaf_hdr_*.json is present in the upstream public tree; Netflix publishes the artefact in a separate research bundle outside the repo. A fork-local license review is the gating follow-up (ADR-0300 § Status update 2026-05-08). Until then, HDR sources are scored against the SDR model with a one-shot warning logged on the first miss (subsequent misses stay quiet). Resulting vmaf_score values trend low for high-luminance regions and are not directly comparable to SDR scores. Drop a licensed copy at model/vmaf_hdr_v0.6.1.json and the harness picks it up automatically — no code change required.
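The two-stage lookup might be sketched like this. It is not the body of vmaftune.hdr.select_hdr_vmaf_model: the function name, the choice of the lexicographically last glob match, and returning None for unknown transfers are assumptions; a None result stands for "fall back to the SDR model".

```python
from pathlib import Path

CANONICAL_HDR_MODEL = "vmaf_hdr_v0.6.1.json"


def select_hdr_model_sketch(model_dir, transfer):
    """Return the HDR model path, or None when no HDR model is available."""
    if transfer not in ("pq", "hlg"):
        return None  # assumption: other transfers never get an HDR model
    model_dir = Path(model_dir)
    canonical = model_dir / CANONICAL_HDR_MODEL
    if canonical.is_file():
        return canonical
    # Glob fallback; picking the last sorted name is an assumed tie-break.
    candidates = sorted(model_dir.glob("vmaf_hdr_*.json"))
    return candidates[-1] if candidates else None
```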
Content-addressed cache¶
Re-running a corpus sweep after adjusting an unrelated flag should not re-encode and re-score tuples that have not changed. The content-addressed cache turns repeated (src, encoder, preset, crf) combinations into a free hit on the second run, restoring the parsed (bitrate, vmaf, encode_time, score_time) tuple from disk and skipping both subprocess calls. See ADR-0298 for the design.
Key composition¶
The cache key is sha256 of the canonical-JSON-encoded six-tuple:
- src_sha256: content hash of the reference YUV
- encoder: adapter name (libx264, …)
- preset: encoder preset string
- crf: quality knob value (int)
- adapter_version: bumps when the codec adapter's argv shape changes
- ffmpeg_version: host ffmpeg version string
Dropping any one of these would let stale entries shadow real results when the adapter or ffmpeg is upgraded — the test suite asserts each field flips the key.
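A key of this shape can be derived with hashlib and json. The exact canonical-JSON settings used by the harness are not stated here, so the compact separators and list encoding below are assumptions, as is the function name `cache_key`.

```python
import hashlib
import json


def cache_key(src_sha256, encoder, preset, crf, adapter_version, ffmpeg_version):
    """Hash the six-tuple as compact JSON (assumed canonical form)."""
    payload = json.dumps(
        [src_sha256, encoder, preset, crf, adapter_version, ffmpeg_version],
        separators=(",", ":"),
        ensure_ascii=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = ("ab" * 32, "libx264", "medium", 23, "1", "ffmpeg version 7.0")
base_key = cache_key(*base)
```

Encoding the fields as a JSON list rather than concatenating strings keeps field boundaries unambiguous, so ("libx26", "4medium") cannot collide with ("libx264", "medium").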
Layout¶
The cache lives at $XDG_CACHE_HOME/vmaf-tune/ (or ~/.cache/vmaf-tune/ if the env var is unset). Override with --cache-dir. Layout:
<cache-dir>/
meta/<key>.json — parsed (bitrate, vmaf, encode_time, ...) tuple
blobs/<key>.bin — opaque encoded artifact
__index__.json — last-access timestamps for LRU eviction
Eviction¶
LRU with a default 10 GiB cap (configurable via --cache-size-gb). On every put, the oldest entries are dropped until the total on-disk size sits at or below the cap.
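The eviction pass can be pictured as a walk over entries from oldest to newest access. This sketch works on two in-memory dicts rather than the on-disk __index__.json, and `evict_lru` is a hypothetical name.

```python
def evict_lru(last_access, sizes, cap_bytes):
    """Return the keys to drop, oldest first, until total size <= cap_bytes.

    last_access: key -> last-access timestamp.
    sizes:       key -> on-disk size in bytes (meta + blob).
    """
    total = sum(sizes.values())
    evicted = []
    for key in sorted(last_access, key=last_access.get):
        if total <= cap_bytes:
            break
        total -= sizes[key]
        evicted.append(key)
    return evicted


GIB = 1024 ** 3
to_drop = evict_lru({"a": 1.0, "b": 2.0, "c": 3.0},
                    {"a": 4 * GIB, "b": 4 * GIB, "c": 4 * GIB},
                    cap_bytes=10 * GIB)
```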
Disabling¶
Pass --no-cache to force a re-encode/re-score on every cell. The cache is also automatically skipped when --no-source-hash is active (no stable content key) or when ffmpeg -version cannot be probed before the run.
Caveats¶
- The cache is not baked into the JSONL row; the row stays the canonical record, the cache is an opaque sidecar.
- Cache hits do not write a synthetic encode_path; that field remains empty unless --keep-encodes is set.
- Concurrent runs against a shared cache dir (e.g. NFS) work for reads; writes are last-writer-wins and both writers' bytes are valid by content addressing.
vmaf-tune corpus --source ref.yuv \
--width 1920 --height 1080 --framerate 24 --duration 10 \
--preset medium \
--coarse-to-fine --target-vmaf 92 \
--output corpus.jsonl
--crf is no longer required — the CRF axis is generated by the search. --target-vmaf is optional here; without it the search still runs both passes and refines around the highest-VMAF coarse point.
recommend — pick a CRF for a quality target¶
The recommend subcommand always runs coarse-to-fine and prints the single recommended (preset, crf) pair plus its measured VMAF. It also writes the visited points to --output so callers have the corpus row for downstream analysis.
vmaf-tune recommend \
--source ref.yuv \
--width 1920 --height 1080 --framerate 24 --duration 10 \
--preset medium \
--target-vmaf 92
# stdout: src=ref.yuv preset=medium crf=27 vmaf=92.341 (visited 15 encodes)
Tunables¶
| Flag | Default | Notes |
|---|---|---|
--coarse-to-fine | off (corpus); on (recommend) | Activate the 2-pass search. |
--coarse-step N | 10 | Step for the coarse pass. With defaults gives [10, 20, 30, 40, 50]. |
--fine-radius R | 5 | ±R around best-coarse for the fine pass. |
--fine-step S | 1 | Step for the fine pass. |
--target-vmaf V | unset | Required for recommend; optional for corpus. |
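The two passes generate their CRF lists roughly as sketched below. The function names and the 0..51 window are assumptions; the coarse pass is assumed to start at the step value because that reproduces the documented [10, 20, 30, 40, 50] list.

```python
def coarse_points(crf_range=(0, 51), coarse_step=10):
    """CRFs for the coarse pass: multiples of the step inside the window."""
    lo, hi = crf_range
    start = max(lo, coarse_step)
    return list(range(start, hi + 1, coarse_step))


def fine_points(best_coarse, visited, crf_range=(0, 51),
                fine_radius=5, fine_step=1):
    """CRFs for the fine pass: +/- radius around the best coarse point,
    clamped to the window and skipping points already encoded."""
    lo, hi = crf_range
    first = max(lo, best_coarse - fine_radius)
    last = min(hi, best_coarse + fine_radius)
    return [crf for crf in range(first, last + 1, fine_step)
            if crf not in visited]


coarse = coarse_points()
fine = fine_points(30, visited=set(coarse))
total_visited = len(coarse) + len(fine)
```

With the defaults and a best coarse point of 30 this visits 5 coarse plus 10 fine points, which is where the 15-encode figure comes from.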
Timing comparison¶
Numbers below are illustrative — actual encode + score wall time per point varies with source resolution, preset, and the libvmaf backend (cpu/cuda/sycl/hip). The relevant ratio is points visited, not seconds:
| Mode | Points visited | Relative wall time |
|---|---|---|
Full grid --crf 0 ... 51 | 52 | 1.00× (baseline) |
| Coarse-to-fine, defaults, target met mid-range | 15 | ~0.29× (3.47× faster) |
| Coarse-to-fine, 1-pass shortcut (target met at coarse max) | 5 | ~0.10× (10.4× faster) |
| Coarse-to-fine, target unmet (full fine pass anyway) | 15 | ~0.29× |
For a 1080p --preset medium clip where one (encode + score) pass takes ~5 s, the coarse-to-fine path drops a single recommend run from ~260 s to ~75 s.
What Phase A does not do¶
- No target-VMAF bisect (Phase B).
- No per-title CRF prediction (Phase C).
- No Pareto ABR ladder generation (Phase E).
Per-title ladder (Phase E)¶
Phase E ships the vmaf-tune ladder subcommand — given one source, sample (resolution × target-VMAF) points, take the Pareto upper-convex hull on (bitrate, vmaf), pick n evenly-spaced rungs along the hull, and emit the result as an HLS master playlist, DASH MPD, or JSON descriptor. This is the "per-title encoding" loop in one command — a fixed authoring-spec ladder is replaced by the ladder that's actually optimal for this title.
See ADR-0295 for the design and the alternatives considered (geometric ladder, JND-spaced, fixed Apple HLS).
The default sampler is wired: for each (resolution, target_vmaf) cell it runs the canonical 5-point CRF sweep 18,23,28,33,38 through the normal Phase A encode-and-score path, picks the closest row to the target, and feeds those points into hull and knee selection. A custom sampler= callback remains supported for callers that want a finer grid, a bisect loop, or a precomputed corpus stream.
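For readers unfamiliar with the hull step, the sketch below computes an upper convex hull over (bitrate, vmaf) points with a monotone-chain pass and then trims everything past the peak-VMAF point, since higher-bitrate, lower-quality points are dominated. It illustrates the general technique only; the function names are hypothetical and the shipped hull and select_knees logic may differ.

```python
def _cross(o, a, b):
    """Z-component of (a - o) x (b - o); negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def pareto_upper_hull(points):
    """Upper convex hull of (bitrate_kbps, vmaf) points, cut at peak VMAF."""
    pts = sorted(set(points))
    hull = []
    for p in pts:
        # Pop while the last two hull points and p fail to turn clockwise.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    peak = max(range(len(hull)), key=lambda i: hull[i][1]) if hull else -1
    return hull[:peak + 1]


samples = [(500, 70), (1000, 85), (1500, 86), (2000, 93), (3000, 95), (4000, 94)]
frontier = pareto_upper_hull(samples)
```

In this synthetic example the (1500, 86) point sits below the chord between its neighbours and the (4000, 94) point costs more bitrate for less quality, so neither survives.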
Container and Y4M sources¶
--src accepts any input ffmpeg can decode: raw .yuv, .y4m, or a container (.mp4, .mkv, .webm, …). The ladder, corpus, and bisect paths transparently decode the reference once per sweep into a .ref.decoded.yuv sidecar under the encode dir and reuse it across every cell — there is no need to pre-decode the source by hand (ADR-0499). When --duration is set, the reference decode is clamped to that window so a 10-second probe against a multi-minute source produces a bounded YUV instead of materialising the full file (ADR-0498).
The libvmaf CLI treats its inputs as raw planar YUV whenever --width / --height / --pixel_format / --bitdepth are passed (which vmaf-tune always does). In that mode a .y4m file would be misread the same way an .mp4 would, so the wrapper decodes both to raw YUV first.
Canonical 5-rung invocation¶
The default rendition set is the canonical 5-rung 1080p/720p/480p/360p/240p ladder against VMAF targets {95, 90, 85, 75, 65}:
vmaf-tune ladder \
--src episode01.yuv \
--encoder libx264 \
--resolutions 1920x1080,1280x720,854x480,640x360,426x240 \
--target-vmafs 95,90,85,75,65 \
--quality-tiers 5 \
--format hls \
--output episode01_ladder.m3u8
The output is an HLS master playlist with one #EXT-X-STREAM-INF per rung; bandwidth (in bps) is monotonically increasing. Variant URIs are placeholders — re-point them at your per-rendition playlists when packaging the encoded segments.
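A master playlist of that shape can be rendered from the rung list in a few lines. This is a sketch, not the ladder emitter: `render_hls_master` and the placeholder URI pattern are hypothetical, and only the BANDWIDTH and RESOLUTION attributes are written.

```python
def render_hls_master(renditions):
    """Render an HLS master playlist with rungs in ascending bandwidth."""
    lines = ["#EXTM3U"]
    for r in sorted(renditions, key=lambda r: r["bandwidth_bps"]):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r['bandwidth_bps']},"
            f"RESOLUTION={r['width']}x{r['height']}"
        )
        # Placeholder URI; re-point at the real per-rendition playlist.
        lines.append(f"rendition_{r['height']}p.m3u8")
    return "\n".join(lines) + "\n"


playlist = render_hls_master([
    {"width": 1920, "height": 1080, "bandwidth_bps": 4_500_000},
    {"width": 640, "height": 360, "bandwidth_bps": 700_000},
    {"width": 1280, "height": 720, "bandwidth_bps": 2_200_000},
])
```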
Other manifest formats¶
# DASH MPD
vmaf-tune ladder --src ep01.yuv --format dash --output ladder.mpd
# JSON descriptor (machine-readable, vmaf-tune-ladder/v1 schema)
vmaf-tune ladder --src ep01.yuv --format json --output ladder.json
The JSON descriptor carries three top-level fields:
- schema: schema identifier ("vmaf-tune-ladder/v1").
- renditions[]: the post-hull rungs that select_knees picked. Ascending-bitrate order; each entry has width, height, bitrate_kbps, bandwidth_bps, vmaf, crf.
- samples[]: every encoded (resolution, crf) row the sampler scored, pre-hull. Ascending by (pixel_count, bitrate_kbps); same per-entry shape as renditions[]. Added 2026-05-18 per ADR-0501 so vmaf-tune report --ladder-json can render the Pareto-cloud overlay; widened 2026-05-18 per ADR-0505 from one-row-per-target-cell (V4 emit shape) to the full per-CRF sweep, so every CRF encoded by the sampler now lands in the array exactly once, de-duplicated by (width, height, crf). The array is always present; callers running with no scored cells get an empty list rather than a missing key.
Cross-resolution scoring against container sources¶
Prior to ADR-0501 the reference leg of a cross-resolution rung was decoded at the source's native geometry while the libvmaf CLI was told to read both legs at the rung target. A 1920x1080 reference was therefore mis-parsed as a 1280x720 frame and emitted a catastrophic VMAF (~21 instead of ~93), collapsing the post-hull ladder to a single rendition. The corpus / ladder paths now downscale the reference YUV sidecar to the rung target via a per-rung -vf scale=W:H filter on the ffmpeg decode call. The per-rung sidecar's filename embeds the target dims (<src>.ref.decoded.<W>x<H>.yuv) so a multi-rung sweep in the same encode_dir doesn't collide on a stale path. Single-resolution ladders and rungs that match the source geometry keep the legacy decode-at-native-geometry path — there's no decode overhead when the target already matches.
ADR-0505 closes the matching gap on the encode side. Before 2026-05-18 the encode driver passed -f rawvideo -pix_fmt yuv420p -s WxH -i src.mp4 against every source, re-interpreting a container's compressed bytes as planar YUV pixels and producing a uniformly-bogus ~50 Mbps encode with VMAF in the 4-9 band regardless of CRF. The corpus now detects container sources by suffix (anything outside _VMAF_RAW_SUFFIXES = {".yuv", ""}) and sets EncodeRequest.source_is_container=True, so ffmpeg auto-detects the format and the rung-target -vf scale=W:H filter handles the resolution change. Raw .yuv sources keep the legacy rawvideo framing.
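The suffix check and its effect on the input side of the argv might look like the sketch below. The suffix set is quoted from the paragraph above; the helper names, the lower-casing of the suffix, and the exact argv shape are assumptions for illustration.

```python
from pathlib import Path

_VMAF_RAW_SUFFIXES = {".yuv", ""}


def source_is_container(src):
    """True when the suffix falls outside the raw-YUV set."""
    # Lower-casing is an assumption; the real check may be case-sensitive.
    return Path(src).suffix.lower() not in _VMAF_RAW_SUFFIXES


def input_args(src, width, height, pix_fmt="yuv420p"):
    """Input-side ffmpeg argv: rawvideo framing only for raw sources."""
    if source_is_container(src):
        return ["-i", str(src)]  # let ffmpeg auto-detect the format
    return ["-f", "rawvideo", "-pix_fmt", pix_fmt,
            "-s", f"{width}x{height}", "-i", str(src)]
```

Note that under this rule a .y4m source counts as a container, which matches the wrapper decoding Y4M rather than passing it through as raw YUV.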
Rung spacing¶
--spacing log_bitrate (default) doubles bandwidth per rung — Apple HLS authoring-spec convention. --spacing vmaf spaces rungs by equal VMAF gaps, matching how viewers perceive quality steps. uniform is accepted as a legacy alias for vmaf.
Phase E ladder CLI flags¶
| Flag | Default | Notes |
|---|---|---|
--src PATH | — | Required. Source clip: raw .yuv, .y4m, or any container ffmpeg can decode. |
--encoder NAME | libx264 | Codec adapter (Phase A wires libx264 only). |
--resolutions WxH,... | 1920x1080,1280x720,854x480,640x360,426x240 | Canonical 5-rung. |
--target-vmafs F,... | 95,90,85,75,65 | VMAF targets per resolution. |
--quality-tiers N | 5 | Rungs to pick from the Pareto hull. |
--spacing | log_bitrate | log_bitrate (HLS spec) or vmaf (perceptual); uniform is a legacy alias for vmaf. |
--format | hls | hls, dash, or json. |
--with-uncertainty | off | Apply the ADR-0279 prune/insert recipe. Sampled vmaf_interval payloads win; point-only rows use the active wide-interval threshold as a conservative fallback. |
--uncertainty-sidecar PATH | default thresholds | Calibration sidecar for the uncertainty recipe. |
--rung-overlap-threshold F | 0.5 | Adjacent-rung interval overlap threshold for pruning. |
--output PATH | stdout | Manifest destination. |
--src-width INT | largest --resolutions entry | Actual source width for raw YUV cross-resolution ladders. When the source is a higher resolution than the smallest rung, this is the demuxer-side -s W:H; the encode pipe scales to each rung target via -vf scale=W:H. Container sources auto-detect geometry and ignore this flag. Added 2026-05-18 per ADR-0498 / Bug #v2-B. |
--src-height INT | largest --resolutions entry | Companion to --src-width. Default picks the tallest entry in --resolutions so a --resolutions 1920x1080,1280x720,854x480 ladder against a 1080p raw YUV "just works". |
--score-backend NAME | auto | libvmaf scoring backend used by the default corpus sampler. Accepts auto\|cpu\|cuda\|sycl\|hip (vulkan was removed in ADR-0726 — same enum as corpus --score-backend / compare --score-backend). auto picks the fastest available in native-first order (cuda > sycl > hip > cpu); a specific name is honoured strictly and the run errors out with RC=2 before any encodes start if the local vmaf binary does not advertise it. Use cpu to force bit-exact CPU scoring for verification against the Netflix golden gate. Added 2026-05-18 per ADR-0511 / Bug C; HIP and native-first order added by ADR-0667. |
--vmaf-bin PATH | vmaf | Path to the vmaf binary used to probe backend availability for --score-backend. Added 2026-05-18 per ADR-0511 / Bug C. |
--workdir PATH | None | Directory under which to create the per-run temporary scratch directory. Overrides VMAFTUNE_WORKDIR. See compare --workdir for full semantics. (ADR-0598) |
--max-concurrent-decodes N | 1 | Accepted for consistency with compare and tune-per-shot (ADR-0577). Currently a no-op for ladder because the corpus sampler does not use the bisect decode path; effective when the ladder sampler is updated to bisect in a future PR. |
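The --score-backend resolution rule in the table can be sketched as below. This is an illustration under stated assumptions: `resolve_score_backend` is a hypothetical name, `advertised` stands in for whatever the local vmaf binary reports, and falling back to cpu when nothing else is advertised is an assumption.

```python
import sys

# Native-first preference order quoted in the table.
_AUTO_ORDER = ("cuda", "sycl", "hip", "cpu")


def resolve_score_backend(requested: str, advertised: set[str]) -> str:
    """Pick a scoring backend from the set the vmaf binary advertises."""
    if requested == "auto":
        for name in _AUTO_ORDER:
            if name in advertised:
                return name
        return "cpu"  # Assumption: CPU scoring is always available.
    if requested not in advertised:
        # A specific name is honoured strictly: fail before any encode starts.
        print(f"score backend {requested!r} is not available", file=sys.stderr)
        raise SystemExit(2)
    return requested
```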
Cross-resolution ladders against raw YUV: prior to ADR-0498 the default sampler used the rung target dims as the source dims, which corrupted every encode against a raw YUV source whose actual resolution differed from the requested rung (-s 1280x720 on 1080p bytes = decoded garbage). The ladder now accepts separate source dims and injects a -vf scale=W:H filter for each sub-source rung. Container (.mp4 / .mkv) sources are unaffected; ffmpeg auto-detects their geometry. ADR-0501 closes the corresponding gap on the reference leg: a per-rung .ref.decoded.<W>x<H>.yuv sidecar is now produced so the libvmaf CLI reads both legs at the same geometry.
report subcommand — stdout aggregate flags¶
The report subcommand renders a profile-card (HTML / Markdown) from one or more JSON dumps emitted upstream of it (--compare-json, --ladder-json, --per-shot-json). It also writes a one-line stdout JSON summary that downstream automation can pipe to jq. The rendered card starts with a Quick takeaways block that summarizes the concrete recommendation and coverage gaps before the detailed tables. The stdout fields:
| Key | Meaning |
|---|---|
| ok | true when at least one codec row succeeded and no non-ok row is a real encode failure. An "encoder unavailable" row (an infrastructure gap: the codec is not built into ffmpeg) does not flip this to false. With no codec rows the report is informational and stays ok=true. |
| degraded | true when at least one codec row is an "encoder unavailable" row. Lets dashboards show the missing codec without flipping the run red. Added 2026-05-18 per ADR-0501 / Bug #V4-C. |
| codec_rows / codec_rows_ok / codec_rows_failed | Total / succeeded / failed row counts. |
| codec_rows_unavailable | Subset of codec_rows_failed whose error starts with "encoder unavailable". Added 2026-05-18 per ADR-0501. |
| ladder_samples / ladder_rungs | Counts read from --ladder-json. ladder_samples reads the top-level samples[] array (always present in vmaf-tune-ladder/v1 JSON since ADR-0501). |
| shots | Count of per-shot rows read from --per-shot-json. |
| outputs | Paths of the rendered card files. |
The "encoder unavailable" discrimination keys on the bisect-stage error prefix added in ADR-0498 ("encoder unavailable (NAME): …"). A row whose error does not start with that prefix is treated as a genuine encode/score failure and flips ok=false.
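The ok / degraded rules can be restated as a small aggregation. The sketch below assumes each codec row is reduced to its error string (or None on success); that row shape and the function name `summarize_codec_rows` are hypothetical, chosen only for illustration.

```python
from typing import Optional

UNAVAILABLE_PREFIX = "encoder unavailable"


def summarize_codec_rows(errors: list[Optional[str]]) -> dict:
    """Aggregate per-codec rows; each entry is an error string or None."""
    total = len(errors)
    succeeded = sum(1 for e in errors if e is None)
    unavailable = sum(
        1 for e in errors if e is not None and e.startswith(UNAVAILABLE_PREFIX)
    )
    failed = total - succeeded
    real_failures = failed - unavailable
    # No rows: informational report. Otherwise at least one success is
    # required and only "encoder unavailable" rows may be non-ok.
    ok = total == 0 or (succeeded >= 1 and real_failures == 0)
    return {
        "ok": ok,
        "degraded": unavailable >= 1,
        "codec_rows": total,
        "codec_rows_ok": succeeded,
        "codec_rows_failed": failed,
        "codec_rows_unavailable": unavailable,
    }
```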
encode-profile subcommand — reuse report recommendations¶
Every HTML, Markdown, and raw JSON profile card embeds encoder_profile with schema vmaftune.encoder_profile.v1 (ADR-0643). The payload contains source metadata, tool/binary provenance, codec metadata, failures, and a sorted list of concrete encoder recommendations. encode-profile reads that payload and runs exactly one selected FFmpeg encode; it never encodes the full ladder or every codec by default.
# Inspect the exact FFmpeg argv first.
vmaf-tune encode-profile \
--profile sweep_profile.html \
--src bbb_1080p_60fps.mp4 \
--codec libsvtav1 \
--target-vmaf 96 \
--output bbb_svtav1_vmaf96.mkv \
--dry-run
# Run the encode once the selected row looks right.
vmaf-tune encode-profile \
--profile sweep_profile.html \
--src bbb_1080p_60fps.mp4 \
--codec libsvtav1 \
--target-vmaf 96 \
--output bbb_svtav1_vmaf96.mkv \
--extra-ffmpeg-arg=-movflags --extra-ffmpeg-arg=+faststart
Selection rules:
- With no filters, the first pareto-selected row with the lowest bitrate is used.
- --codec NAME narrows the candidate list to one adapter token (libsvtav1, libx265, av1_nvenc, ...).
- --target-vmaf F narrows by the exact target stored in the profile.
- --recommendation-index N picks the zero-based row after those filters, useful when several rows remain tied or intentionally comparable.
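The selection rules can be read as a filter-then-pick step. The sketch below is a guess at that shape, not the tool's implementation: the row keys (`codec`, `target_vmaf`, `pareto`, `bitrate_kbps`) are hypothetical, and applying the lowest-bitrate rule after filters when no index is given is an assumption.

```python
def select_recommendation(rows, codec=None, target_vmaf=None, index=None):
    """Pick one recommendation row from a profile's candidate list."""
    candidates = [
        r for r in rows
        if (codec is None or r["codec"] == codec)
        and (target_vmaf is None or r["target_vmaf"] == target_vmaf)
    ]
    if not candidates:
        raise LookupError("no recommendation matches the filters")
    if index is not None:
        return candidates[index]  # zero-based, applied after the filters
    # Assumption: prefer Pareto-selected rows, then the lowest bitrate.
    # min() returns the first row on ties.
    pool = [r for r in candidates if r.get("pareto")] or candidates
    return min(pool, key=lambda r: r["bitrate_kbps"])
```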
Input handling:
| Flag | Default | Notes |
|---|---|---|
| --profile PATH | — | Required. Accepts report JSON, report HTML, or report Markdown. |
| --src PATH | profile source | Override when the profile was generated on another machine. |
| --output PATH | — | Required encoded output path. |
| --source-kind auto\|container\|raw | auto | auto treats .yuv / .raw / .rgb / .gray as raw and everything else as FFmpeg-auto-detected container input. |
| --width / --height / --framerate / --pix-fmt | profile values | Required only when the selected source is raw and the profile did not carry those fields. |
| --preset | profile row / adapter default | Override the selected row's preset. |
| --duration | profile source duration | Bounds the input-side encode window. |
| --sample-clip-seconds / --sample-clip-start-s | 0 | Optional input-side clip flags forwarded to FFmpeg. |
| --extra-ffmpeg-arg TOKEN | none | Append one raw FFmpeg argv token after codec args. Repeat as needed; use --extra-ffmpeg-arg=-movflags for leading-dash tokens. |
| --ffmpeg-bin PATH | profile ffmpeg_bin, then ffmpeg | Override the FFmpeg binary. |
| --dry-run | off | Print the selected recommendation and ffmpeg_argv without encoding. |
Phase F — multi-pass encoding (ADR-0333)¶
Phase F lights up 2-pass encoding for codecs that benefit. Default behaviour stays single-pass; opting in via --two-pass runs the encoder twice — pass 1 analyses the source and writes a stats file to a temp directory, pass 2 reads those stats to make better rate-allocation decisions.
When to use it¶
2-pass encoding pays off most clearly in target-bitrate workflows (VOD ladder generation, codec comparisons at fixed bitrate). Constant-quality (CRF) encodes already adapt QPs frame-by-frame from the encoder's lookahead, so the win at fixed CRF is more modest. Expect:
- +1 to +3 VMAF points at a fixed bitrate target on libx265 vs 1-pass ABR (typical-content range; see x265 rate-control docs).
- ~2× encode wall time — the second pass roughly doubles the cost.
Quick start¶
vmaf-tune corpus \
--source ref.yuv --width 1920 --height 1080 \
--pix-fmt yuv420p --framerate 24 --duration 5 \
--encoder libx265 --preset medium --crf 23 \
--two-pass \
--output corpus_2pass.jsonl
The driver materialises a per-encode stats file under tempfile.gettempdir() (e.g. /tmp/vmaftune-2pass-XXXXXX/), runs both passes back-to-back, and removes the stats file plus known encoder sidecars when the run completes — successful or not.
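The stats-file lifecycle described above follows a familiar create, run, always-clean-up pattern. A minimal sketch, assuming a caller-supplied `run_pass` callable stands in for the real ffmpeg invocation; `run_two_pass` and the `stats.log` file name are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_two_pass(run_pass: Callable[[int, Path], None]) -> Path:
    """Run pass 1 and pass 2 with a private stats file, then clean up."""
    # mkdtemp() creates the directory under tempfile.gettempdir() by default.
    workdir = Path(tempfile.mkdtemp(prefix="vmaftune-2pass-"))
    stats = workdir / "stats.log"  # hypothetical file name
    try:
        run_pass(1, stats)  # analysis pass writes the stats file
        run_pass(2, stats)  # final pass reads it back
    finally:
        # Removed whether the passes succeeded or raised.
        shutil.rmtree(workdir, ignore_errors=True)
    return workdir
```

Giving each encode its own directory keeps concurrent runs from overwriting one another's stats.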
Codec support matrix¶
ADR-0546 closed the contract for every codec adapter. Each adapter now either runs a real two-invocation 2-pass, returns single-invocation quality-boost flags callers can splice into extra_params, or raises a typed error documenting why the encoder cannot support multi-pass.
| Codec | supports_two_pass | two_pass_args(1, p) returns | Notes |
|---|---|---|---|
| libx264 | yes | -pass 1 -passlogfile <prefix> | FFmpeg-native two-invocation 2-pass. ADR-0333. |
| libx265 | yes | -x265-params pass=1:stats=<path> | x265 routes pass control through its codec-private payload. ADR-0333. |
| libvpx-vp9 | yes | -pass 1 -passlogfile <prefix> | FFmpeg-native two-invocation 2-pass; CRF mode pinned via -b:v 0. |
| libaom-av1 | yes | -pass 1 -passlogfile <prefix> | FFmpeg-native two-invocation 2-pass. ADR-0546. |
| libvvenc | yes | -pass 1 -passlogfile <prefix> | FFmpeg ≥ 6.1 translates -pass to VVenC's RcStatsFile config. ADR-0546. |
| libsvtav1 | no | -pass 1 -passlogfile <prefix> | SVT-AV1 forbids multi-pass in CRF mode (verified against v4.1.0: Svt[error]: CRF does not support multi-pass. Use single pass.). The adapter still returns VBR-mode argv for callers that explicitly switch into bitrate-targeted mode via extra_params. ADR-0546. |
| h264_nvenc / hevc_nvenc / av1_nvenc | no | -multipass fullres | NVENC's multipass is a single-invocation, in-encoder full-resolution analysis. Splice the pass-1 return value into EncodeRequest.extra_params for a quality-boosted single-pass encode. Requires NVENC's VBR rate control + target bitrate. ADR-0546. |
| h264_qsv / hevc_qsv / av1_qsv | no | -extbrc 1 -look_ahead_depth 40 | Intel QSV's extended-BRC look-ahead is a single-invocation in-encoder pre-analysis window. Compose into extra_params for the quality boost. ADR-0546. |
| h264_amf / hevc_amf / av1_amf | no | -preanalysis true | AMD AMF's pre-analysis stage runs inside a single ffmpeg invocation. Compose into extra_params for the quality boost. ADR-0546. |
| h264_videotoolbox / hevc_videotoolbox / av1_videotoolbox / prores_videotoolbox | no | raise VideoToolboxTwoPassUnsupportedError | Apple VTCompressionSession has no multi-pass C API. For true 2-pass on macOS, switch to a software encoder (libx264 / libx265 / libsvtav1 / libaom-av1 / libvvenc), all of which ship in the same FFmpeg build. ADR-0546. |
When --two-pass is set against a codec where supports_two_pass = False, vmaf-tune writes a one-line warning to stderr and runs single-pass. (Mirrors the saliency.py "unsupported ROI encoder, fallback to plain encode" precedent.) To fail loud instead, callers using the Python API can pass on_unsupported="raise" to run_two_pass_encode. VideoToolbox calls to adapter.two_pass_args() always raise the typed error so callers introspecting the contract can disambiguate "API limitation" from "we forgot to implement".
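The warn-or-raise behaviour can be sketched as a small decision function. Everything named here is illustrative: `choose_pass_count`, the `TwoPassUnsupportedError` class, and the `"warn"` default are assumptions; only the `on_unsupported="raise"` value comes from the text above.

```python
import sys


class TwoPassUnsupportedError(RuntimeError):
    """Hypothetical error type used only by this sketch."""


def choose_pass_count(supports_two_pass: bool, two_pass: bool,
                      on_unsupported: str = "warn") -> int:
    """Return how many encoder invocations to run for one encode."""
    if not two_pass:
        return 1
    if supports_two_pass:
        return 2
    if on_unsupported == "raise":
        raise TwoPassUnsupportedError("encoder cannot run two-pass")
    # Default: one-line warning on stderr, then fall back to single-pass.
    print("warning: two-pass unsupported; running single-pass", file=sys.stderr)
    return 1
```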
Hardware quality-boost composition (ADR-0546)¶
Hardware encoders expose their "2-pass equivalent" as single-invocation flags. To combine them with the harness's normal single-pass driver, pull adapter.two_pass_args(1, Path("/tmp/_unused")) and splice the result into EncodeRequest.extra_params:
from pathlib import Path
from vmaftune.codec_adapters import get_adapter
from vmaftune.encode import EncodeRequest, run_encode
ref = Path("ref.yuv")        # placeholder: raw 1080p source
out = Path("out_nvenc.mp4")  # placeholder: encode destination
adapter = get_adapter("h264_nvenc")
# Placeholder stats path; NVENC multipass runs inside a single invocation.
boost = adapter.two_pass_args(1, Path("/tmp/_unused"))  # ('-multipass', 'fullres')
req = EncodeRequest(
    source=ref, width=1920, height=1080, pix_fmt="yuv420p", framerate=24.0,
    encoder="h264_nvenc", preset="slow", crf=22, output=out,
    extra_params=boost,  # quality-boosted single-pass
)
run_encode(req)
Cache interaction¶
The content-addressed encode cache (ADR-0298) keys on pass count, so a 1-pass encode and a 2-pass encode of the same (src, codec, preset, crf) are distinct cache entries — the cache will never serve a 1-pass encode for a 2-pass request.
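A content-addressed key that includes the pass count might look like the sketch below. The field set and the JSON-plus-SHA-256 construction are assumptions made for illustration; ADR-0298 defines the real key.

```python
import hashlib
import json


def cache_key(src_sha256: str, codec: str, preset: str, crf: int, passes: int) -> str:
    """Hash the encode identity; the pass count is part of that identity."""
    payload = json.dumps(
        {"src": src_sha256, "codec": codec, "preset": preset,
         "crf": crf, "passes": passes},
        sort_keys=True,  # stable field order keeps the digest deterministic
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this shape, changing only `passes` from 1 to 2 produces a different digest, so the two encodes can never alias.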
Sample-clip composition¶
Sample-clip mode (ADR-0297) composes with 2-pass: both passes apply the same -ss <start> -t <N> input slice, and the per-encode stats file is unique per slice. No special handling required.
What Phase A / E / F do not do¶
- No target-VMAF bisect (Phase B). Phase E uses the canonical 5-point CRF sweep as its production default sampler.
- No per-title or per-shot CRF prediction (Phase C / D).
- No real-corpus end-to-end ladder validation against a Netflix per-title baseline yet.
- No true two-invocation 2-pass on codecs the underlying encoder refuses (today: libsvtav1 in CRF mode; ADR-0546). The adapter still exposes meaningful argv for callers that switch into the encoder's supported mode (e.g. VBR for SVT-AV1).
- No encoder-stats parsing for libvpx-vp9: FFmpeg's libvpx passlog is binary first-pass data, not the x264/x265 text schema consumed by encoder_stats.py.
Phase A.5 — opt-in fast recommend¶
The fast subcommand (ADR-0276, Research-0060) is an opt-in recommendation surface that combines three acceleration levers (VMAF proxy via fr_regressor_v2, Bayesian search via Optuna's TPE sampler, and GPU-accelerated VMAF for the verify step) to replace the exhaustive grid for the recommendation use case. The slow corpus path stays the canonical ground truth. Production mode runs the sample encode, extracts canonical-6 features, calls the fr_regressor_v2 ONNX proxy, and performs a single verify pass through the selected VMAF backend. --smoke is still available for dependency-free CI and local plumbing checks.
Install¶
The core install path stays zero-dependency; the [fast] extra is strictly opt-in.
Smoke-mode quick start¶
Smoke mode runs Optuna over a synthetic x264-shaped CRF→VMAF curve. No ffmpeg, no ONNX Runtime, no GPU is touched. Output is a single JSON object:
{
"encoder": "libx264",
"target_vmaf": 92.0,
"recommended_crf": 18,
"predicted_vmaf": 92.39,
"predicted_kbps": 4121.09,
"n_trials": 50,
"smoke": true,
"notes": "smoke mode — synthetic predictor; ..."
}
CLI flags¶
| Flag | Default | Notes |
|---|---|---|
| --src PATH | — | Source video. Required outside --smoke. |
| --target-vmaf F | 92.0 | Quality target on VMAF [0, 100] scale. |
| --encoder NAME | libx264 | Codec adapter. Smoke mode is x264-shaped; production mode accepts any registered adapter. |
| --crf-lo N | 10 | Lower bound of the CRF search. |
| --crf-hi N | 51 | Upper bound of the CRF search. |
| --n-trials N | 50 | Optuna TPE trial count. |
| --time-budget-s N | 300 | Soft wall-clock budget for Optuna. Completed trial count may be lower than --n-trials when the timeout is hit. |
| --smoke | off | Synthetic predictor; exercises the pipeline without ffmpeg / ONNX. |
Speedup model¶
Per Research-0060 §Speedup model:
| Combination | Speedup vs Phase A grid |
|---|---|
| Phase A grid (baseline) | 1× |
| fast (proxy + Bayesian + GPU verify) | ≈20–50× |
| fast + NVENC (lever C, follow-up) | ≈100–500× |
These are upper bounds. The production claim is gated on a recommendation-quality benchmark against the slow grid.
Production limits¶
fast is a recommendation shortcut, not a corpus generator. It still needs a representative source clip, a usable encoder, and a VMAF binary for verification. When the proxy / verify gap exceeds --proxy-tolerance, the CLI exits with code 3 so operators can fall back to the slow grid:
vmaf-tune fast --src ref.yuv --target-vmaf 92 || \
vmaf-tune recommend --from-corpus corpus.jsonl --target-vmaf 92
Per-shot parallelisation remains a separate integration with TransNet V2 and vmaf-perShot.
fast accepts any registered codec adapter in production mode. Smoke mode remains synthetic and x264-shaped by design.
libvpx-vp9 example¶
The VP9 adapter routes --encoder libvpx-vp9 through FFmpeg's libvpx-vp9 wrapper. It maps the shared preset vocabulary to -deadline good -cpu-used N, emits -crf, and forces VP9 constant-quality mode with -b:v 0.
vmaf-tune corpus \
--encoder libvpx-vp9 \
--source ref.yuv \
--width 1920 --height 1080 --pix-fmt yuv420p \
--framerate 24 --duration 10 \
--preset medium --preset fast \
--crf 28 --crf 32 --crf 36 \
--output corpus_vp9.jsonl
--two-pass is supported for VP9 via FFmpeg's generic -pass / -passlogfile switches. Per-frame encoder-stats columns remain zero for VP9 until a binary libvpx first-pass parser lands.
x265 example¶
x265 ships ten presets (ultrafast … placebo) on the same 0..51 CRF scale as x264; the harness routes --encoder libx265 through ffmpeg's -c:v libx265 path. External ffmpeg must be built with --enable-libx265.
vmaf-tune corpus \
--encoder libx265 \
--source ref.yuv \
--width 1920 --height 1080 --pix-fmt yuv420p \
--framerate 24 --duration 10 \
--preset medium --preset slow --preset placebo \
--crf 23 --crf 28 --crf 34 \
--output corpus_x265.jsonl
The 10-bit pipeline is enabled by setting --pix-fmt yuv420p10le; the adapter reports the corresponding HEVC profile (main10) via X265Adapter.profile_for(pix_fmt) for downstream consumers that need it.
The software adapters (libx264, libx265, libsvtav1, libaom-av1, libvpx-vp9, libvvenc) all route through the same codec-adapter registry; the search loop does not branch on codec name.
VVenC (H.266 / VVC)¶
vmaf-tune ships a libvvenc codec adapter (ADR-0285) that drives Fraunhofer HHI's open-source VVC encoder via FFmpeg's -c:v libvvenc wrapper. VVC is the ITU-T / ISO standard that succeeds HEVC and delivers roughly 30-50% better compression at equal quality. As a rough rule of thumb, VVenC slow is ~5-10% better quality than HEVC slower at the same bitrate but ~3-5× slower wall-clock — VVC is the "opt in to longer encodes for tighter bitrates" branch of the adapter set.
Quality knob and presets¶
| Property | Value |
|---|---|
| Quality knob | qp (forwarded as the integer that the harness's --crf flag carries; VVenC's wrapper accepts the value regardless of label) |
| Quality range | [17, 50] (perceptually informative window; full VVenC scale is 0..63) |
| Default | 32 |
| Native presets | faster, fast, medium, slow, slower (5 levels) |
The harness's canonical 10-name preset vocabulary (placebo / slowest / slower / slow / medium / fast / faster / veryfast / superfast / ultrafast) compresses onto VVenC's native 5 levels via a static map: anything strictly slower than slow pins to slower; anything strictly faster than fast pins to faster; the central three names map identically. This matches the projection rule used by the parallel HEVC and AV1 adapters so that predictor inputs stay codec-uniform.
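The projection rule above is small enough to write out. The sketch assumes the names are ordered slowest to fastest exactly as listed in the text; `vvenc_preset` is a hypothetical helper, not the adapter's method.

```python
# Shared preset names ordered slowest to fastest, as listed in the text.
_CANONICAL = ("placebo", "slowest", "slower", "slow", "medium",
              "fast", "faster", "veryfast", "superfast", "ultrafast")


def vvenc_preset(name: str) -> str:
    """Project a shared preset name onto VVenC's five native presets."""
    rank = _CANONICAL.index(name)  # raises ValueError for unknown names
    if rank < _CANONICAL.index("slow"):
        return "slower"  # strictly slower than slow
    if rank > _CANONICAL.index("fast"):
        return "faster"  # strictly faster than fast
    return name  # slow / medium / fast map identically
```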
Tuning surface (real VVenC 1.14.0 knobs)¶
The adapter exposes a curated subset of VVenC's config keys via FFmpeg's -vvenc-params key=value:key=value channel. Keys are sourced verbatim from source/Lib/apputils/VVEncAppCfg.h at tag v1.14.0 (SHA 9428ea8636ae7f443ecde89999d16b2dfc421524, accessed 2026-05-09). Every knob defaults to None (library default preserved); the search loop opts into a non-default value only when the corpus row records it.
| Knob | VVenC key | Default | Effect / typical use |
|---|---|---|---|
| perceptual_qpa | PerceptQPA | library default | XPSNR-driven perceptual QPA. Materially shifts the rate-distortion curve; recorded per-row in encoder_extra_params for predictor conditioning. |
| internal_bitdepth | InternalBitDepth | library default (10 for VVC) | Force 8- or 10-bit internal precision; necessary for HDR profiles. |
| tier | Tier | library default (main) | main or high. Caps signalled max bitrate / resolution. |
| tiles | Tiles | single tile | (cols, rows) partitioning, emitted as NxM. Useful for parallel encode of high-resolution content. |
| max_parallel_frames | MaxParallelFrames | library default (auto) | Parallel-frames perf knob; 0 disables, >=2 enables. |
| rpr | RPR | library default | VVC reference-picture-resampling. 0 off / 1 on / 2 RPR-ready. |
| sao | SAO | library default (on) | Sample Adaptive Offset loop filter. Useful for ablation studies. |
| alf | ALF | library default | Adaptive Loop Filter; useful for ablation. |
| ccalf | CCALF | library default | Cross-Component ALF (only meaningful when alf is on). |
Toggles are emitted in field-declaration order so the argv stays byte-stable for cache-key hashing (per ADR-0298). The adapter_version field bumps to "2" for the 2026-05-09 surface — stale cached results are invalidated automatically.
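Field-declaration-order emission can be illustrated with a dataclass, since `dataclasses.fields()` returns fields in declaration order. The `VVenCTuning` class, the Python types chosen for each knob, and the 0/1 encoding of booleans are assumptions for this sketch; only the knob names, VVenC keys, the NxM tile format, and the `key=value:key=value` channel come from the text.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple

_KEYS = {
    "perceptual_qpa": "PerceptQPA", "internal_bitdepth": "InternalBitDepth",
    "tier": "Tier", "tiles": "Tiles", "max_parallel_frames": "MaxParallelFrames",
    "rpr": "RPR", "sao": "SAO", "alf": "ALF", "ccalf": "CCALF",
}


@dataclass(frozen=True)
class VVenCTuning:  # hypothetical container; None keeps the library default
    perceptual_qpa: Optional[bool] = None
    internal_bitdepth: Optional[int] = None
    tier: Optional[str] = None
    tiles: Optional[Tuple[int, int]] = None
    max_parallel_frames: Optional[int] = None
    rpr: Optional[int] = None
    sao: Optional[bool] = None
    alf: Optional[bool] = None
    ccalf: Optional[bool] = None


def vvenc_params(tuning: VVenCTuning) -> list[str]:
    """Render non-default knobs as one -vvenc-params argument."""
    parts = []
    for f in fields(tuning):  # declaration order, so the argv is stable
        value = getattr(tuning, f.name)
        if value is None:
            continue
        if isinstance(value, bool):
            text = "1" if value else "0"  # assumption: booleans render as 0/1
        elif isinstance(value, tuple):
            text = f"{value[0]}x{value[1]}"  # tiles are emitted as NxM
        else:
            text = str(value)
        parts.append(f"{_KEYS[f.name]}={text}")
    return ["-vvenc-params", ":".join(parts)] if parts else []
```

The output order depends only on the class definition, never on the order in which the caller passes keyword arguments.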
NN-VC status (deferred)¶
VVC the standard defines NN-VC tool-points (NN-based intra prediction, NN-based loop filter, NN-based super-resolution), but VVenC 1.14.0 does not ship implementations of any of them. An earlier draft of this adapter exposed an nnvc_intra toggle that emitted -vvenc-params IntraNN=1; that key has never existed in any released VVenC and has been removed (see ADR-0285 §"Status update 2026-05-09"). If upstream VVenC ever lands NN-VC tools the adapter will pick them up via the placeholder pattern from ADR-0339's self-activating adapter set.
External binary requirements¶
Running the VVenC adapter end-to-end requires:
- ffmpeg compiled with --enable-libvvenc on PATH (or --ffmpeg-bin).
- The libvvenc shared library and headers from https://github.com/fraunhoferhhi/vvenc.
The shipped unit tests mock subprocess.run so the adapter can be exercised without either binary present; integration smoke is gated to a CI runner that has a libvvenc-enabled FFmpeg.
Phase D — per-shot CRF tuning¶
The tune-per-shot subcommand drives the Netflix-style per-shot encoding path. It cuts the source into shots (via the C-side vmaf-perShot binary, which wraps TransNet V2 — see ADR-0223), extracts each shot to a temporary raw-YUV reference, runs the Phase-B target-VMAF bisect for that shot, and emits an FFmpeg encoding plan that produces one segment per shot plus a final concat-demuxer command.
The target-VMAF predicate remains pluggable for advanced callers: --predicate-module MODULE:CALLABLE bypasses the default bisect path, and the Python API still accepts tune_per_shot(..., predicate=...). Codec emission is still portable segment-and-concat output rather than native per-shot mechanisms (--qpfile for x264, --zones for x265, the SVT-AV1 segment table).
Design rationale and the decision matrix live in ADR-0392.
Quick start¶
Container source (geometry auto-probed, no pre-extraction needed — ADR-0542):
vmaf-tune tune-per-shot \
--src clip.mp4 \
--target-vmaf 92 \
--encoder libx264 \
--output per_shot_encode.mp4 \
--plan-out plan.json
Raw YUV source (explicit geometry required):
vmaf-tune tune-per-shot \
--src ref.yuv \
--width 1920 --height 1080 \
--framerate 24 \
--target-vmaf 92 \
--encoder libx264 \
--output per_shot_encode.mp4 \
--plan-out plan.json
The plan is emitted to stdout as JSON unless --plan-out is specified. Pass --script-out plan.sh to also receive a copy-paste shell script of the per-segment + concat commands.
CLI flags¶
| Flag | Default | Notes |
|---|---|---|
| --src PATH | — | Required. Source video. Accepts raw YUV (.yuv / .raw) or any container format (mp4, mkv, mov, ts, …). For container sources, --width, --height, --framerate, and --total-frames are auto-derived from ffprobe. (ADR-0542) |
| --width / --height | auto-probed | Source resolution. Required for raw YUV sources. For container sources the values are auto-derived from ffprobe when these flags are omitted. Pass explicit values to override the probe. (ADR-0542) |
| --pix-fmt PFMT | yuv420p | Forwarded to vmaf-perShot. |
| --framerate F | auto-probed | Source framerate. Auto-derived from ffprobe for container sources; defaults to 24.0 if the probe cannot determine a rate. (ADR-0542) |
| --target-vmaf V | 92.0 | Per-shot quality target. |
| --encoder NAME | libx264 | Any registered codec adapter accepted by the Phase-B bisect backend. |
| --bitdepth N | 8 | Forwarded to vmaf-perShot (8, 10, or 12). |
| --total-frames N | 0 | Frame count for the single-shot fallback when vmaf-perShot is unavailable. Auto-derived from ffprobe for container sources. (ADR-0542) |
| --scene-threshold X | unset | Override vmaf-perShot --diff-threshold (mean-absolute-luma-delta cutoff for cut classification; lower = more shots). Omit to keep the C-side compiled default (12.0 on 8-bit content). See ADR-0513. |
| --max-shot-duration S | 2.0 | Uniform-time-window splitter (seconds). Any detected shot longer than S is sliced into equal-length sub-shots so the per-shot tuner sees a non-degenerate timeline even when the detector under-cuts (e.g. short clips, low-contrast fades). Set to 0 to disable. See ADR-0513. |
| --per-shot-bin PATH | vmaf-perShot | Override the shot detector binary. |
| --ffmpeg-bin PATH | ffmpeg | Override the FFmpeg binary. |
| --vmaf-bin PATH | vmaf | Override the libvmaf CLI used by the per-shot scorer. |
| --preset NAME | adapter default | Codec preset forwarded to Phase-B bisect. |
| --crf-min / --crf-max | adapter range | Optional inclusive CRF search bounds; pass both or neither. |
| --max-iterations N | 8 | Maximum encode+score iterations per detected shot. |
| --vmaf-model NAME | vmaf_v0.6.1 | VMAF model forwarded to the per-shot scorer. |
| --score-backend NAME | auto | libvmaf scoring backend for the per-shot scorer (auto, cpu, cuda, sycl, hip). (vulkan removed in ADR-0726.) |
| --predicate-module SPEC | — | Advanced hook MODULE:CALLABLE matching (shot, target_vmaf, encoder) -> (crf, measured_vmaf); bypasses real bisect. |
| --workdir PATH | None | Directory under which to create the per-run temporary scratch directory. Overrides VMAFTUNE_WORKDIR. See compare --workdir for full semantics. (ADR-0598) |
| --max-concurrent-decodes N | 1 | Maximum number of simultaneous reference-YUV decode operations across all per-shot bisect threads. Default 1 (serial decodes). See compare --max-concurrent-decodes for full semantics. (ADR-0577) |
| --output PATH | per_shot_encode.mp4 | Final concatenated encode destination. |
| --segment-dir PATH | see below | Directory for per-shot segment files. Priority order: (1) explicit --segment-dir; (2) <plan-out>.parent/segments when --plan-out is set; (3) <output>.parent/segments. If the resolved directory is not writable (e.g. a read-only bind-mount), a WARN is emitted to stderr and the command still exits 0; the plan JSON remains the authoritative deliverable. See ADR-0532. |
| --plan-out PATH | stdout | Write the JSON plan here instead of stdout. |
| --script-out PATH | — | Optional: also emit a copy-paste shell script. |
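The three-step --segment-dir priority order in the table can be sketched with pathlib. `resolve_segment_dir` is a hypothetical helper name; the writability check and the WARN path are left out of this sketch.

```python
from pathlib import Path
from typing import Optional


def resolve_segment_dir(output: Path, plan_out: Optional[Path] = None,
                        segment_dir: Optional[Path] = None) -> Path:
    """Apply the documented priority order for the per-shot segment directory."""
    if segment_dir is not None:
        return segment_dir  # (1) explicit --segment-dir
    if plan_out is not None:
        return plan_out.parent / "segments"  # (2) next to the plan JSON
    return output.parent / "segments"  # (3) next to the final encode
```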
Plan JSON schema¶
{
"encoder": "libx264",
"framerate": 24.0,
"predicate": "bisect",
"target_vmaf": 92.0,
"shots": [
{
"start_frame": 0, "end_frame": 24,
"crf": 22, "predicted_vmaf": 93.0,
"bitrate_kbps": 5234.12
},
{
"start_frame": 24, "end_frame": 72,
"crf": 26, "predicted_vmaf": 92.5,
"bitrate_kbps": 4182.44
}
],
"segment_commands": [
["ffmpeg", "-y", "-hide_banner", "-ss", "0.000000", "-i", "ref.mp4",
"-frames:v", "24", "-c:v", "libx264", "-crf", "22",
"/tmp/segments/shot_0000.mp4"]
],
"concat_command": [
"ffmpeg", "-y", "-hide_banner", "-f", "concat", "-safe", "0",
"-i", "/tmp/segments/concat.txt", "-c", "copy", "out.mp4"
]
}
start_frame is inclusive, end_frame is exclusive (Python-slice convention). The vmaf-perShot CSV/JSON sidecar uses inclusive end_frame; the planner normalises into the half-open form and the segment commands honour the half-open semantics via -frames:v.
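The inclusive-to-half-open normalisation and the resulting FFmpeg window tokens can be sketched as follows. Both helper names are hypothetical, and the six-decimal -ss formatting is inferred from the sample plan above rather than from the planner's source; the real segment command places other arguments (such as -i) between these tokens.

```python
def to_half_open(start_frame: int, inclusive_end: int) -> tuple[int, int]:
    """Convert an inclusive end_frame into the half-open plan form."""
    return start_frame, inclusive_end + 1


def segment_window_args(start_frame: int, end_frame: int, framerate: float) -> list[str]:
    """Seek and frame-count tokens for a half-open [start_frame, end_frame) shot."""
    return [
        "-ss", f"{start_frame / framerate:.6f}",
        "-frames:v", str(end_frame - start_frame),
    ]
```

A shot the detector reports as frames 0 through 23 inclusive becomes (0, 24) and encodes exactly 24 frames.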
bitrate_kbps (added in ADR-0531) carries the encoded segment bitrate measured by the Phase-B bisect backend: (segment_size_bytes × 8 / 1000) / shot_duration_s. It is null when a custom --predicate-module is used (no real encode happens in that path) and always a positive finite float for the default bisect backend. The vmaf-tune report renderer uses this field to populate the Bitrate column in the per-shot table; a null / absent value renders as "—".
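The bitrate formula is simple enough to check by hand. In the sketch below, `segment_bitrate_kbps` is a hypothetical name, and passing None for the size is this sketch's way of modelling the custom-predicate path where no encode exists.

```python
import math
from typing import Optional


def segment_bitrate_kbps(segment_size_bytes: Optional[int],
                         shot_duration_s: float) -> Optional[float]:
    """(segment_size_bytes * 8 / 1000) / shot_duration_s, or None without an encode."""
    if segment_size_bytes is None:
        return None  # custom predicate path: no real encode happened
    if not math.isfinite(shot_duration_s) or shot_duration_s <= 0:
        raise ValueError("shot duration must be a positive finite number")
    return (segment_size_bytes * 8 / 1000) / shot_duration_s
```

A 1,000,000-byte segment covering a 2-second shot works out to 8000 kilobits over 2 seconds, or 4000 kbps.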
Single-shot fallback¶
If the vmaf-perShot binary is not on PATH, or it exits non-zero, the planner falls back to a single shot covering the whole clip ([0, --total-frames)). This keeps tune-per-shot usable as a smoke test on machines that have not built the shot-detector binary yet.
The uniform-time-window splitter (--max-shot-duration, default 2.0 s) still applies to that fallback: even when the detector returns one giant shot, the splitter slices it into roughly ceil(duration / window) equal-length sub-shots so the per-shot tuner produces a useful CRF timeline. Pass --max-shot-duration 0 to restore the historical single-shot behaviour. See ADR-0513.
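One way to realise the ceil(duration / window) split on frame indices is sketched below. The function name, the integer-boundary rounding, and the guard against empty sub-shots are assumptions; ADR-0513 defines the actual splitter.

```python
import math


def split_shot(start_frame: int, end_frame: int, framerate: float,
               max_shot_duration_s: float) -> list[tuple[int, int]]:
    """Slice a half-open shot into roughly equal sub-shots no longer than the window."""
    total = end_frame - start_frame
    if max_shot_duration_s <= 0 or total <= 0:
        return [(start_frame, end_frame)]  # 0 disables the splitter
    duration_s = total / framerate
    parts = max(1, math.ceil(duration_s / max_shot_duration_s))
    parts = min(parts, total)  # never emit an empty sub-shot
    bounds = [start_frame + (total * i) // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

A 120-frame clip at 24 fps lasts 5 seconds, so a 2-second window gives ceil(2.5) = 3 sub-shots of 40 frames each.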
Tuning scene sensitivity¶
vmaf-perShot classifies frame N as a cut when the mean absolute luma delta against frame N-1 (in the 8-bit domain) crosses the --diff-threshold cutoff. The compiled-in default is 12.0, which the empirical calibration in ADR-0222 tuned against the testdata fixtures. Real-world content varies: animated material with deep saturated transitions trips the heuristic easily, but short live-action clips with low-contrast scene changes (BBB's underwater segment, indoor talking-heads) under-cut at the default.
Lower --scene-threshold (e.g. 4 - 6) to recover those cuts. Higher values (e.g. 18 - 25) reject motion-rich in-shot bursts that the default classifies as fades. The CLI flag passes through to the C binary verbatim so existing diagnostic scripts that invoke vmaf-perShot directly use the same units.
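To make the effect of the threshold concrete, here is a pure-Python sketch of the mean-absolute-luma-delta rule on 8-bit luma planes. It is not the C implementation in vmaf-perShot: the function names are hypothetical, and using a strict greater-than comparison for "crosses the cutoff" is an assumption.

```python
def mean_abs_luma_delta(prev: bytes, cur: bytes) -> float:
    """Mean absolute difference between two equally sized 8-bit luma planes."""
    if len(prev) != len(cur) or not cur:
        raise ValueError("luma planes must be non-empty and the same size")
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def detect_cuts(luma_frames: list[bytes], threshold: float = 12.0) -> list[int]:
    """Indices N where frame N differs from frame N-1 by more than the threshold."""
    return [
        n for n in range(1, len(luma_frames))
        if mean_abs_luma_delta(luma_frames[n - 1], luma_frames[n]) > threshold
    ]
```

With three flat frames at luma 10, 12, and 40, the default threshold reports a cut only at index 2, while a threshold of 1.0 also reports index 1.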
What Phase D does not do¶
- Does not run the encodes; it only emits the plan. Run the script written by --script-out plan.sh through sh to execute it manually.
- Does not emit native per-codec per-shot mechanisms (x264 --qpfile, x265 --zones, SVT-AV1 segment tables). Per-segment encode plus concat-demuxer is the portable fallback.
- Does not handle GOP-aligned shot boundaries; the per-segment approach side-steps this by re-encoding each shot from frame 0.
auto¶
vmaf-tune auto is the Phase F entry point (ADR-0325). One CLI verb composes the per-phase subcommands (corpus, recommend, predict, tune-per-shot, recommend-saliency, ladder, compare) plus the orthogonal modes (HDR auto-detect, sample-clip, resolution-aware) into a deterministic decision tree. F.1 shipped the sequential composition; F.2-F.5 added the short-circuits, confidence-aware fallbacks, and per-content recipes.
Synopsis¶
vmaf-tune auto \
--src reference.mp4 \
--target-vmaf 93 \
--max-budget-bitrate 5000 \
--allow-codecs libx264,libx265 \
[--codec libx265] \
[--sample-clip-seconds 10] \
[--smoke] \
[--output plan.json]
The non-smoke path probes source geometry, source duration, and HDR metadata through the same ffprobe/HDR helpers used by the corpus path. Probe failures degrade to conservative 1920x1080 SDR defaults so the planner can still emit a JSON plan and expose which later stages need real evidence. --smoke still exercises the composition end-to-end with mocked sub-phases (no ffmpeg, no ONNX).
The JSON plan emitted under metadata.short_circuits records which short-circuits fired; post-hoc analysis uses this to measure the speedup contribution of each one.
For each non-smoke cell, auto now feeds the probed metadata into the existing Predictor path, picks a codec-specific CRF for metadata.effective_predictor_target_vmaf, and records predictor estimates for estimated_vmaf and estimated_bitrate_kbps. These are planner estimates, not measured encode results, until the future realise/encode step scores the chosen cells. The per-cell prediction_source key distinguishes the production estimate path ("predictor") from the --smoke composition placeholder ("smoke-placeholder").
Short-circuits¶
The ten short-circuits below are the F.2 surface. Each one is a guarded fast-path with one trigger condition; the predicates are exposed as _should_short_circuit_<N> helpers in tools/vmaf-tune/src/vmaftune/auto.py so they can be unit-tested in isolation.
| # | Identifier | Trigger | Skips |
|---|---|---|---|
| 1 | ladder-single-rung | meta.height < 2160 | Multi-rung ABR ladder evaluation (ADR-0289 / ADR-0295). |
| 2 | codec-pinned | --codec set or --allow-codecs resolves to one entry | The compare.shortlist stage. |
| 3 | predictor-gospel | predict.crf_for_target returns GOSPEL (ADR-0306) | The recommend.coarse_to_fine fallback for that cell. |
| 4 | skip-saliency | meta.content_class is photographic / live-action (not animation / screen content) | The recommend_saliency.maybe_apply stage (ADR-0293). |
| 5 | sdr-skip | not meta.is_hdr (per ADR-0300 detector) | The HDR resolution + model-selection branch. |
| 6 | sample-clip-propagate | --sample-clip-seconds > 0 | Re-deciding clip length per stage; the user-supplied value propagates verbatim (ADR-0301). |
| 7 | skip-per-shot | duration < 5min AND shot_variance < 0.15 | The tune_per_shot.refine pass (ADR-0392). |
| 8 | low-complexity | meta.complexity_score < 200 kbps (probe-encode bitrate) | The recommend.coarse_to_fine sweep — the predictor's point estimate is already tight on simple content. 0.0/NaN does not fire (no probe run yet). |
| 9 | baseline-meets-target | meta.baseline_vmaf >= target_vmaf | The full predictor sweep — the default-CRF encode already satisfies the quality target. 0.0/NaN does not fire (no baseline scored yet). |
| 10 | no-two-pass | adapter.supports_two_pass == False (ADR-0333 + ADR-0546) | The two-pass calibration stage. Hardware encoders (*_nvenc, *_amf, *_qsv, *_videotoolbox) and libsvtav1 (CRF-mode multi-pass prohibition) fire this. libx264 / libx265 / libvpx-vp9 / libaom-av1 / libvvenc set supports_two_pass = True. |
The 5-min and 0.15 thresholds in short-circuit #7 are placeholders; F.3 fits them empirically once Phase F has emitted enough labelled compositions to make the fit statistically defensible. The constants live at the top of auto.py (PHASE_D_DURATION_GATE_S, PHASE_D_SHOT_VARIANCE_GATE) so the eventual fit lands as a one-line edit. The 200 kbps threshold in short-circuit #8 is likewise a placeholder: LOW_COMPLEXITY_PROBE_BITRATE_THRESHOLD_KBPS at the top of auto.py.
The evaluation order in SHORT_CIRCUIT_PREDICATES is part of the public contract: tests assert that an earlier-firing predicate doesn't shadow a later one whose result would have been different. Adding a new short-circuit means appending; never reordering.
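The predicate-plus-ordered-registry shape described above can be sketched as follows. This is a minimal illustration, not the code in auto.py: the SourceMeta dataclass, its field names, the constant names, and the helper names are all hypothetical, and only three of the ten predicates are shown. The threshold values are the placeholders quoted in the text.

```python
import math
from dataclasses import dataclass

# Placeholder gates quoted in the text; the real constants live in auto.py.
DURATION_GATE_S = 5 * 60
SHOT_VARIANCE_GATE = 0.15
LOW_COMPLEXITY_KBPS = 200.0


@dataclass
class SourceMeta:
    """Hypothetical stand-in for the probed metadata object."""
    duration_s: float = 0.0
    shot_variance: float = 0.0
    complexity_score: float = 0.0  # probe-encode bitrate in kbps; 0.0 = not probed
    baseline_vmaf: float = 0.0     # 0.0 = no baseline scored yet


def _is_measured(value: float) -> bool:
    """0.0 and NaN both mean 'no evidence yet', so they never fire a gate."""
    return not math.isnan(value) and value != 0.0


def skip_per_shot(meta: SourceMeta, target_vmaf: float) -> bool:
    return meta.duration_s < DURATION_GATE_S and meta.shot_variance < SHOT_VARIANCE_GATE


def low_complexity(meta: SourceMeta, target_vmaf: float) -> bool:
    return _is_measured(meta.complexity_score) and meta.complexity_score < LOW_COMPLEXITY_KBPS


def baseline_meets_target(meta: SourceMeta, target_vmaf: float) -> bool:
    return _is_measured(meta.baseline_vmaf) and meta.baseline_vmaf >= target_vmaf


# Order is treated as part of the contract: new entries are appended, never inserted.
PREDICATES = (
    ("skip-per-shot", skip_per_shot),
    ("low-complexity", low_complexity),
    ("baseline-meets-target", baseline_meets_target),
)


def fired_short_circuits(meta: SourceMeta, target_vmaf: float) -> list:
    """Evaluate every predicate independently, so an early hit cannot shadow a later one."""
    return [name for name, predicate in PREDICATES if predicate(meta, target_vmaf)]
```

Keeping each predicate a pure function of the metadata and the target is what makes the isolated unit tests cheap: a test builds one SourceMeta and asserts on a single boolean.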
auto does not dispatch the fast subcommand from inside its tree. fast (ADR-0276 fast-path) is a different operator surface (proxy + Bayesian over a single codec) and remains a sibling, not a child, of auto.
For HDR sources, every emitted cell records the codec-specific hdr_args produced by vmaftune.hdr.hdr_codec_args(codec, info). That means x264 gets only container-level -color_* flags, x265 gets -x265-params SEI signalling, SVT-AV1 gets -svtav1-params, and SDR cells record an empty list after the sdr-skip short-circuit fires.
Winner selection¶
auto now finishes the planning pass by selecting one estimated cell. The chosen cell is marked with "selected": true in cells[]; every other cell has "selected": false. The same decision is copied to metadata.winner so scripts can read a stable object without scanning the cell array.
Winner statuses:
| Status | Meaning |
|---|---|
| budget_and_quality_met | At least one cell met --target-vmaf and --max-budget-bitrate; the lowest estimated bitrate wins. |
| quality_met_budget_exceeded | No cell was inside budget, but at least one cell met the quality target; the smallest budget overage wins. |
| target_unmet | No cell met the quality target; the closest estimated VMAF miss wins so the caller still gets a concrete next encode. |
| no_eligible_cells | No cell carried finite estimated_vmaf and estimated_bitrate_kbps; this is an input/planner evidence failure. |
metadata.winner records cell_index, rung, codec, crf, estimated_vmaf, estimated_bitrate_kbps, quality_margin, and budget_margin_kbps. This is still a planning result: the selected cell is the next encode target, not a substitute for the final encode/score verification pass.
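The four statuses can be read as a small decision procedure. The sketch below is one plausible reading, not the shipped selector: the function name select_winner is hypothetical, cells are assumed to be plain dicts with the two estimate keys named above, "inside budget" is assumed to mean less than or equal to the budget, and the over-budget branch is assumed to consider only cells that met the quality target.

```python
import math


def select_winner(cells, target_vmaf, max_budget_kbps):
    """Return (status, index of the winning cell or None)."""
    # Only cells with finite estimates are eligible at all.
    eligible = [
        (i, c) for i, c in enumerate(cells)
        if math.isfinite(c.get("estimated_vmaf", math.nan))
        and math.isfinite(c.get("estimated_bitrate_kbps", math.nan))
    ]
    if not eligible:
        return "no_eligible_cells", None

    quality_ok = [(i, c) for i, c in eligible if c["estimated_vmaf"] >= target_vmaf]
    if not quality_ok:
        # Closest miss: every eligible cell is below target, so take the highest VMAF.
        i, _ = max(eligible, key=lambda ic: ic[1]["estimated_vmaf"])
        return "target_unmet", i

    in_budget = [(i, c) for i, c in quality_ok
                 if c["estimated_bitrate_kbps"] <= max_budget_kbps]
    if in_budget:
        i, _ = min(in_budget, key=lambda ic: ic[1]["estimated_bitrate_kbps"])
        return "budget_and_quality_met", i

    # Every quality-meeting cell is over budget; the smallest overage is the lowest bitrate.
    i, _ = min(quality_ok, key=lambda ic: ic[1]["estimated_bitrate_kbps"])
    return "quality_met_budget_exceeded", i
```

With cells estimated at (94.0 VMAF, 4200 kbps), (93.5, 3800) and (91.0, 2500), a target of 93 and a 5000 kbps budget selects the second cell; tightening the budget to 3000 kbps still selects it, but with status quality_met_budget_exceeded.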
Execute mode (ADR-0454)¶
After the planning pass, --execute drives real FFmpeg encodes and libvmaf scores for the selected cell(s):
vmaf-tune auto \
--src reference.mp4 \
--target-vmaf 93 \
--max-budget-bitrate 5000 \
--allow-codecs libx264,libx265 \
--output plan.json \
--execute \
--runs-dir runs/
--execute flags:
| Flag | Default | Description |
|---|---|---|
| --execute | off | Enable execute mode; plan-only when absent. |
| --runs-dir PATH | runs/ | Destination for encoded files and tune_results.jsonl. |
| --execute-all | off | Run every plan cell instead of only the selected winner. |
Results are appended to <runs-dir>/tune_results.jsonl one row per executed cell. Each row carries the cell metadata (codec, preset, CRF, estimated VMAF/bitrate) merged with encode outcomes (size, encode time, FFmpeg version) and score outcomes (measured VMAF, per-feature means/stds, vmaf-binary version). The file appends on each run so partial runs and incremental re-runs do not overwrite previous results.
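The append-only behaviour is the usual JSONL pattern: open in append mode and write one JSON object per line. A minimal sketch, assuming hypothetical helper names (append_rows, read_rows) and illustrative row keys rather than the executor's actual schema:

```python
import json
from pathlib import Path


def append_rows(results_path, rows):
    """Append one JSON object per line; lines already in the file are left untouched."""
    results_path = Path(results_path)
    results_path.parent.mkdir(parents=True, exist_ok=True)
    with results_path.open("a", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, sort_keys=True) + "\n")


def read_rows(results_path):
    """Read every row back, skipping blank lines."""
    with Path(results_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each line is a complete JSON document, a run that stops part-way leaves every previously written row readable, and a later run simply adds rows after them.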
The CLI exits with status 1 if at least one cell was executed but none scored successfully (encode failures, vmaf binary absent, etc.); it exits 0 on plan-only runs regardless of plan content.
Per-shot execution (ADR-0468)¶
run_plan_per_shot splits the source into shot boundaries (via vmaf-perShot / TransNet V2, ADR-0223) and scores each segment independently. Results land in <runs-dir>/tune_results_per_shot.jsonl:
from pathlib import Path

from vmaftune.executor import run_plan_per_shot

# `plan` is assumed to be the plan object produced by the planning pass.
per_shot_results = run_plan_per_shot(
    plan, src=Path("reference.mp4"), out_dir=Path("runs/"),
    vmaf_model="vmaf_v0.6.1",
)
for r in per_shot_results:
    print(f"cell {r.row['cell_index']}: "
          f"{r.row['shot_count']} shots, "
          f"weighted VMAF={r.weighted_vmaf:.2f}")
Each top-level row carries shot_count and weighted_vmaf (frame-length-weighted mean of per-shot scores). When vmaf-perShot is absent, the call falls back to a single-shot range covering the whole clip; shot_count == 1 signals this.
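A frame-length-weighted mean weights each shot's score by the number of frames it covers, so long shots dominate short ones. The sketch below shows the arithmetic only; the function name and the (frame_count, vmaf) pair layout are assumptions made for illustration, not the executor's internal representation.

```python
def weighted_vmaf(shots):
    """Frame-length-weighted mean over an iterable of (frame_count, vmaf) pairs."""
    shots = list(shots)
    total_frames = sum(frames for frames, _ in shots)
    if total_frames <= 0:
        raise ValueError("need at least one shot with a positive frame count")
    return sum(frames * score for frames, score in shots) / total_frames
```

For two shots of 100 frames at VMAF 90.0 and 300 frames at VMAF 94.0, the result is (100 × 90.0 + 300 × 94.0) / 400 = 93.0. In the single-shot fallback the weighted mean is just that one shot's score.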
Saliency execution (ADR-0468)¶
run_plan_saliency applies saliency-aware encoding via saliency_aware_encode (see § Saliency-aware encoding) before scoring. Results land in <runs-dir>/tune_results_saliency.jsonl:
from pathlib import Path

from vmaftune.executor import run_plan_saliency

# `plan` is assumed to be the plan object produced by the planning pass;
# `total_frame_count` is the source's frame count, supplied by the caller.
sal_results = run_plan_saliency(
    plan, src=Path("reference.mp4"), out_dir=Path("runs/"),
    saliency_model_path=Path("model/tiny/saliency_student_v1.onnx"),
    duration_frames=total_frame_count,
)
for r in sal_results:
    print(f"cell {r.row['cell_index']}: "
          f"saliency_available={r.saliency_available}, "
          f"VMAF={r.row['vmaf_score']}")
saliency_available in the result row is True when the ONNX model ran; False means the encoder fell back to a plain encode (model file missing or onnxruntime not installed). The encode and score always proceed regardless.
Confidence-aware fallbacks (F.3)¶
F.2 treats the predictor's verdict as a binary GOSPEL / FALL_BACK gate (short-circuit #3). F.3 makes the gate continuous by consulting the conformal interval half-width returned by Predictor.predict_vmaf_with_uncertainty (ADR-0393). Two width gates carve the half-width axis into three regions:
| Interval width | Outcome | Effect on F.2 |
|---|---|---|
| width <= tight_interval_max_width | SKIP_ESCALATION | Predictor is confident; trust the point estimate even when the native verdict said FALL_BACK. |
| tight < width < wide | RECOMMEND_ESCALATION (on FALL_BACK / unknown) or SKIP_ESCALATION (on GOSPEL / LIKELY) | Defer to the native verdict, exactly as F.2 does. |
| width >= wide_interval_min_width | FORCE_ESCALATION | Predictor is uncertain; escalate to recommend.coarse_to_fine even when the native verdict said GOSPEL. |
The two thresholds are corpus-derived. The conformal-VQA calibration pipeline ships a JSON sidecar with the canonical keys tight_interval_max_width and wide_interval_min_width; the loader honours per-corpus overrides transparently. When no sidecar is found the loader falls back to the Research-0067 emergency floor (2.0 / 5.0 VMAF) and emits a one-line warning; the floor is documented behaviour, not a magic constant.
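The sidecar-with-fallback behaviour can be sketched as below. The two JSON keys and the 2.0 / 5.0 floor come from the text; the function name load_thresholds, the single-path lookup, and the warning wording are assumptions, and the real loader's search order and override handling are not shown.

```python
import json
import warnings
from pathlib import Path

# Emergency floor quoted in the text (VMAF units).
EMERGENCY_FLOOR = {"tight_interval_max_width": 2.0, "wide_interval_min_width": 5.0}


def load_thresholds(sidecar_path=None):
    """Return the two width gates from a JSON sidecar, or the floor if none is found."""
    if sidecar_path is not None and Path(sidecar_path).is_file():
        data = json.loads(Path(sidecar_path).read_text(encoding="utf-8"))
        # A sidecar missing either canonical key raises KeyError rather than guessing.
        return {key: float(data[key]) for key in EMERGENCY_FLOOR}
    warnings.warn("no calibration sidecar found; using the 2.0 / 5.0 VMAF emergency floor")
    return dict(EMERGENCY_FLOOR)
```

Returning a copy of the floor keeps callers from mutating the module-level default by accident.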
Per-cell decisions are recorded in plan.metadata.confidence_aware_escalations[] (one entry per cell, keyed by rung, codec, verdict, interval_width, decision), and each cell in plan.cells[] carries its own confidence_decision + interval_width keys so JSON consumers don't need to cross-reference the metadata array index.
Per CLAUDE.md feedback_no_test_weakening: the thresholds are calibration outputs. If a sidecar value triggers surprising cell escalations on real data, the fix is a recalibration PR — not a loosening of the F.3 gate here.
The helper _confidence_aware_escalation(verdict, interval_width, thresholds) in tools/vmaf-tune/src/vmaftune/auto.py is exposed for unit testing and direct embedding by downstream tools (the MCP server's auto proxy, the CI corpus collector). It is a pure function of its three inputs.
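The three-region gate from the table above reduces to a few comparisons. The following is a sketch of that logic under stated assumptions, not the body of the real helper: verdicts and decisions are modelled as plain strings, thresholds as a dict with the two canonical keys, and the function name drops the leading underscore to mark it as illustrative.

```python
def confidence_aware_escalation(verdict, interval_width, thresholds):
    """Map (native verdict, conformal interval width) to an escalation decision."""
    tight = thresholds["tight_interval_max_width"]
    wide = thresholds["wide_interval_min_width"]
    if interval_width <= tight:
        # Confident predictor: trust the point estimate whatever the verdict.
        return "SKIP_ESCALATION"
    if interval_width >= wide:
        # Uncertain predictor: escalate even on GOSPEL.
        return "FORCE_ESCALATION"
    # Middle band: defer to the native verdict, as F.2 does.
    if verdict in ("GOSPEL", "LIKELY"):
        return "SKIP_ESCALATION"
    return "RECOMMEND_ESCALATION"
```

With the 2.0 / 5.0 floor, a FALL_BACK verdict at width 1.5 is skipped, a GOSPEL verdict at width 6.0 is forced to escalate, and at width 3.0 the decision follows the verdict.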
Per-content-type recipes (F.4)¶
F.4 layers per-content-type recipe overrides on top of F.1+F.2+F.3. When the upstream classifier (per_shot.detect_shots plus the fork-local content-class heuristics) tags a source as animation, screen_content, live_action_hdr, or ugc, the auto driver applies a small override dict before the F.2 short-circuits evaluate so a recipe can flip force_single_rung and have the ladder stage honour it. Any source whose meta.content_class doesn't match a named recipe falls through to the empty default recipe.
The override keys consumed by the driver are:
| Key | Type | Effect |
|---|---|---|
| tight_interval_max_width | float | Narrows / widens the F.3 conformal-tight gate. |
| force_single_rung | bool | Arms short-circuit #1 (ladder-single-rung) even on >= 2160p sources. |
| saliency_intensity | str | Passed through to the saliency stage when not skipped. One of default, aggressive, very_aggressive. |
| target_vmaf_offset | float | Additive offset applied to the predictor's effective target VMAF. The input --target-vmaf (the gate that ships models) is never shifted by this value; see the no-test-weakening note below. |
The five recipe classes ship the following overrides. The values below are the F.5-calibrated thresholds emitted by ai/scripts/calibrate_phase_f_recipes.py and shipped in ai/data/phase_f_recipes_calibrated.json. The calibration was run on 2026-05-09 against the K150K corpus (.workingdir2/konvid-150k/konvid_150k.jsonl, 148 543 rows out of an expected 153 841 — the ingestion was ~96.6 % complete; a re-run on the full corpus is a follow-up PR). Threshold rationale and the per-class proxy-vs-corpus provenance break-down live in Research-0067 §"F.4 recipe-override placeholders" plus the JSON metadata block.
| Class | tight_interval_max_width | force_single_rung | saliency_intensity | target_vmaf_offset | Source |
|---|---|---|---|---|---|
| animation | 1.75 | true | aggressive | +2.0 | proxy (UGC-anchored) |
| screen_content | (unset) | (unset) | very_aggressive | +1.0 | proxy (UGC-anchored) |
| live_action_hdr | 1.4 | (unset) | (default) | 0.0 | proxy (UGC-anchored) |
| ugc | 3.5 | false | default | +1.5 | corpus (K150K) |
| default | (unset) | (unset) | (default) | 0.0 | n/a |
K150K is a UGC-only corpus and carries no per-source content_class column; only the ugc row is corpus-derived. The other three rows are calibrated as documented absolute offsets ("proxy") anchored on the F.4 envelope until PR #477's TransNet shot-metadata columns plus a class-labelled subset land. The JSON recipes.<class>._provenance sub-dict records the source per row so future re-calibrations can distinguish corpus-derived from proxy-derived values. UGC's target_vmaf_offset came out empirically positive (+1.5) on K150K because the corpus's MOS distribution has a heavier upper tail than lower tail; the calibration script clamps every offset to the F.4 documented envelope of [-2.0, +2.0] so a pathological corpus cannot push the predictor target outside the regime the planner has been exercised against.
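The envelope clamp mentioned above is a plain min/max bound. A minimal sketch, assuming a hypothetical function name and constant; only the [-2.0, +2.0] range is taken from the text:

```python
# F.4 documented envelope for target_vmaf_offset, as quoted in the text.
OFFSET_ENVELOPE = (-2.0, 2.0)


def clamp_offset(raw_offset, envelope=OFFSET_ENVELOPE):
    """Bound a calibrated offset to the envelope so an outlier corpus cannot escape it."""
    low, high = envelope
    return max(low, min(high, raw_offset))
```

The K150K-derived +1.5 passes through unchanged, while a hypothetical raw fit of +3.7 would be recorded as +2.0.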
The auto.py runtime loads the JSON at module import via _load_calibrated_recipes; if the JSON file is missing or malformed, the F.4 placeholder constants in _F4_PLACEHOLDER_RECIPES apply as a graceful fallback. The generated JSON includes ADR-0661 run_provenance with the calibration script, argv, source corpus JSONL, row cap, and recipe output target. To regenerate the calibration after the corpus ingestion completes (or when a class-labelled corpus replaces K150K), run:
python ai/scripts/calibrate_phase_f_recipes.py \
--corpus .workingdir2/konvid-150k/konvid_150k.jsonl \
--out ai/data/phase_f_recipes_calibrated.json
Recipe rationale (each cited threshold is provisional pending F.5 calibration):
- Animation: predictor residuals are tighter on flat colour fields, a single-rung ladder is sufficient, and saliency is more aggressive on cel-line edges. Animation is intrinsically more compressible at a given perceptual quality, so the predictor aims ~2 VMAF higher.
- Screen content: split-frame structure (low-entropy background + high-detail text/icon regions) benefits from very_aggressive saliency that raises QP on the background while keeping text near-lossless. Predictor target nudged +1.
- Live-action HDR: per ADR-0300 the HDR pipeline already runs; the F.3 conformal-tight gate is narrowed to 1.4 because a wide predictor interval on HDR is more suspect than on SDR (the predictor was largely trained on SDR per ADR-0393).
- UGC: user-generated content carries higher upstream-encode noise, inconsistent grading, and resolution mismatches; predictor uncertainty is the baseline. Widening the F.3 tight gate to 3.5 avoids over-flagging UGC cells as "needs escalation" simply because the interval is wider than a Netflix-grade reference. The K150K calibration nudges the predictor target up 1.5 because the corpus MOS distribution has a heavier upper tail.
The recipe class is recorded in plan.metadata.recipe_applied (one of animation, screen_content, live_action_hdr, ugc, or default) and the override dict in plan.metadata.recipe_overrides. Each cell in plan.cells[] also carries the resolved saliency_intensity and effective_predictor_target_vmaf so JSON consumers don't need to cross-reference the metadata block.
Per CLAUDE.md memory feedback_no_test_weakening: recipe overrides MUST NOT silently widen the production-flip gate that ships models. They affect predictor thresholds (the effective_predictor_target_vmaf that the predictor aims for; the F.3 width gate that decides per-cell escalation), not the input --target-vmaf that downstream consumers treat as the contract. The driver records the input target_vmaf verbatim in plan.metadata.target_vmaf and the offset target in plan.metadata.effective_predictor_target_vmaf; the two are kept distinct.
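The separation between the input target and the predictor's effective target can be sketched as below. The two metadata key names come from the text; the function name plan_targets and the dict-shaped recipe are assumptions for illustration.

```python
def plan_targets(target_vmaf, recipe):
    """Record the input target verbatim next to the offset target the predictor aims for."""
    offset = float(recipe.get("target_vmaf_offset", 0.0))
    return {
        # The contract value downstream consumers gate on; never shifted.
        "target_vmaf": target_vmaf,
        # Predictor-only target: input plus the recipe's additive offset.
        "effective_predictor_target_vmaf": target_vmaf + offset,
    }
```

For an input target of 93.0 and the animation offset of +2.0, the predictor aims for 95.0 while the recorded target_vmaf stays at 93.0.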
The helper _apply_recipe_override(meta, plan_state, thresholds) in tools/vmaf-tune/src/vmaftune/auto.py resolves the recipe and returns a (recipe_class, recipe, effective_thresholds) triple; the get_recipe_for_class(content_class) helper returns a fresh override dict for any of the five canonical class strings. Both are pure functions; the table at module scope (_CONTENT_RECIPE_TABLE) holds factory callables so each call returns a fresh dict that callers may mutate without affecting subsequent runs.
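The factory-callable table is what makes "a fresh dict per call" hold. The sketch below illustrates the pattern with three of the five classes and the override values from the table above; the names RECIPE_TABLE and recipe_for are hypothetical stand-ins for the module's own table and helper.

```python
def _animation():
    return {
        "tight_interval_max_width": 1.75,
        "force_single_rung": True,
        "saliency_intensity": "aggressive",
        "target_vmaf_offset": 2.0,
    }


def _ugc():
    return {
        "tight_interval_max_width": 3.5,
        "force_single_rung": False,
        "saliency_intensity": "default",
        "target_vmaf_offset": 1.5,
    }


def _default():
    return {}


# The table stores factories, not dicts, so no caller ever shares state with another.
RECIPE_TABLE = {"animation": _animation, "ugc": _ugc, "default": _default}


def recipe_for(content_class):
    """Return a fresh override dict; unknown classes fall through to the empty default."""
    factory = RECIPE_TABLE.get(content_class, RECIPE_TABLE["default"])
    return factory()
```

Storing the dicts directly would let one caller's mutation leak into every later run in the same process; building the dict inside a function avoids that without needing a deep copy.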
References: ADR-0397 §F.4, ADR-0393, ADR-0300, Research-0067.
Tests¶
The shipped suite mocks subprocess.run, so it requires neither ffmpeg nor a built vmaf. Real-binary integration coverage will land when the codec adapter set widens.