Skip to content

vmaf-tune — quality-aware encode automation harness

vmaf-tune is a fork-added Python tool (ADR-0237, Research-0044) that drives FFmpeg over an encoder parameter grid, scores each encode with vmaf, and emits a JSONL corpus of (source, encoder, params, bitrate, vmaf) rows.

Subcommands at a glance

Subcommand Purpose ADR / research
corpus Multi-codec encoder grid sweep + scoring ADR-0237
recommend Target-VMAF / target-bitrate predicate (Phase B) Research-0061, buckets 4+5
tune-per-shot Per-shot CRF zones ADR-0392
recommend-saliency Saliency-aware ROI tuning ADR-0287 ecosystem; consumes vmaf-roi sidecars
ladder Per-title bitrate ladder (Pareto ABR) ADR-0295
fast Predicted-CRF fast path ADR-0276
prefilter Pelorus deband strengths + CRF joint autotune ADR-1116
hdr HDR-aware tuning + HDR-VMAF scoring (PR #434, bucket #9)
compare Apples-to-apples codec comparison at matched VMAF (PR #435)
benchmark Offline cross-codec report from an existing JSONL ADR-0424
sidecar Local predictor bias-correction training / inspection ADR-0394
encode-profile Encode one recommendation from a report profile ADR-0643

Codec adapters

19 adapters under tools/vmaf-tune/src/vmaftune/codec_adapters/:

Family Software NVIDIA NVENC Intel QSV AMD AMF Apple VideoToolbox
AV1 libaom-av1, libsvtav1 av1_nvenc (ADR-0290) av1_qsv av1_amf av1_videotoolbox placeholder
H.264 x264 (Phase A) h264_nvenc h264_qsv h264_amf h264_videotoolbox
HEVC x265 (ADR-0288) hevc_nvenc hevc_qsv hevc_amf hevc_videotoolbox
VP9 libvpx-vp9
VVC vvenc

Pipeline

ref.yuv ──► vmaf-tune corpus ──► encode (libx264) ──► vmaf score ──► corpus.jsonl
ref.yuv ──► vmaf-tune corpus ──► encode (libx264|libx265) ──► vmaf score ──► corpus.jsonl
              └─► encodes are written to --encode-dir, deleted post-score
                  unless --keep-encodes

corpus.jsonl ──► vmaf-tune recommend --target-vmaf T   ──► smallest CRF >= T
                                    --target-bitrate B ──► CRF closest to B
corpus.jsonl ──► vmaf-tune benchmark --target-vmaf T   ──► encoder ranking

JSON artifact portability

Human-facing vmaf-tune CLI JSON outputs, report-style artifacts, Phase-F executor result JSONL (tune_results*.jsonl), and local sidecar state files are strict RFC-8259 JSON. Diagnostic values that are non-finite in memory (NaN, Infinity, -Infinity) are serialized as null rather than as JavaScript-only tokens, so notebooks, dashboards, FFmpeg profile consumers, and MCP clients can parse the files with strict JSON decoders. Corpus JSONL rows remain the training interchange format; their feature-missing semantics are documented separately in the corpus schema below.

Environment variables

Variable Default Notes
VMAFTUNE_WORKDIR (OS default, typically /tmp) Parent directory under which all vmaf-tune subcommands create their per-run temporary scratch directories (decoded reference YUV, intermediate encodes). Set this to a path on a volume with sufficient free space when the OS /tmp is a small tmpfs — e.g. a 634-second 1080p60 source decodes to approximately 118 GB of raw YUV420p. The compare, tune-per-shot, and ladder subcommands also accept --workdir PATH to override this variable per invocation. In the vmaf-dev-mcp container this is pre-set to /probes/vmaftune-work (the 435 GB /probes bind-mount). Resolution order: --workdir flag > VMAFTUNE_WORKDIR env var > OS default. (ADR-0598)

Workdir disk-space management (ADR-0577)

compare and tune-per-shot both run bisects inside a thread pool. Without a concurrency cap, each codec thread decodes the full reference source in parallel — a 3-codec compare against a 110 GB BBB 1080p source would peak at 330 GB, which overflowed the 420 GB /probes volume in the BBB v13 failure.

Three safeguards are active by default:

  1. Decode concurrency cap (--max-concurrent-decodes 1). A semaphore serialises reference-YUV decode operations across all threads. Peak disk stays at one YUV at a time (110 GB for BBB) regardless of how many codecs run in parallel. Raise the cap on hosts with large volumes and fast disks.

  2. Aggressive cleanup. After each bisect (codec × target VMAF pair) completes, its decoded reference YUV is deleted immediately. Per-iteration .mkv and .decoded.yuv files are deleted as soon as scoring finishes. Scratch accumulation is therefore bounded to the working set of the currently active bisect, not the full run.

  3. Mid-run disk-space check. Before each iteration's decode, shutil.disk_usage is called and compared against 2× the estimated YUV size. If the volume is too full the bisect fails fast with a structured error message that names the codec and target VMAF, rather than an opaque ffmpeg rc=228 (ENOSPC) followed by a corrupt output JSON.

To reproduce the BBB v13 scenario in tests, monkeypatch shutil.disk_usage to return low free space and pass a container source — the preflight and mid-run checks both fire before any real decode is attempted.

Install

The tool ships under tools/vmaf-tune/ as a standalone Python package. Phase A has zero runtime dependencies beyond the standard library.

pip install -e tools/vmaf-tune
# or run directly from the checkout:
python tools/vmaf-tune/vmaf-tune --help

External binaries required at runtime:

  • ffmpeg with --enable-libx264 (and --enable-libsvtav1 if you pass --encoder libsvtav1) on PATH (or --ffmpeg-bin).
  • vmaf (this fork's CLI, built via meson) on PATH (or --vmaf-bin).

Predictor Training Corpora

vmaftune.predictor_train trains the per-codec ONNX predictors consumed by the fast, per-shot, ladder, and auto paths. --corpus accepts either a single Phase-A JSONL file or a directory of JSONL shards; directory inputs are scanned recursively in sorted order so the trainer can consume .workingdir2/corpus_run/ directly:

python -m vmaftune.predictor_train \
    --corpus .workingdir2/corpus_run \
    --codec libx264 \
    --output-dir .workingdir2/predictor-real

Rows are filtered per codec after schema aliases are normalised. The trainer accepts both current corpus.py rows (encoder, crf, vmaf_score, bitrate_kbps) and older hardware-sweep rows (codec, q/cq, vmaf, actual_kbps). If a codec has no usable rows, that codec still falls back to the documented synthetic-stub corpus and the model card records corpus.kind: synthetic-stub-*. Directory inputs are passed through the same loader as single files; a codec with rows in any shard records corpus.kind: real-N=<rows>.

When rows include richer predictor inputs, the trainer now preserves them instead of zero-filling: probe_i_frame_avg_bytes, probe_p_frame_avg_bytes, probe_b_frame_avg_bytes, saliency_mean, saliency_var, frame_diff_mean, y_avg, and y_var. Older rows remain valid; missing probe-byte columns use the deterministic bitrate stand-ins and missing saliency / signalstats columns stay 0.0.

vmaf-tune predict --use-saliency uses the same saliency feature slots at runtime. It decodes each sampled shot to temporary yuv420p, runs the configured saliency ONNX, and feeds only saliency_mean / saliency_var into the predictor:

vmaf-tune predict \
    --source source.mp4 \
    --codec libx264 \
    --target-vmaf 96 \
    --use-saliency \
    --saliency-model model/tiny/saliency_student_v1.onnx

This flag is separate from recommend-saliency --saliency-aware, which creates ROI/QP sidecars for the encoder.

Local Sidecar Bias Correction

vmaf-tune sidecar exposes the local sidecar model from docs/ai/local-sidecar-training.md as an operator CLI. It never uploads captures and never mutates the shipped predictor; it stores only the online-ridge correction under ${XDG_CACHE_HOME:-~/.cache}/vmaf-tune/sidecar/.

Inspect the current sidecar state:

vmaf-tune sidecar status --codec libx264 --json

Record one observed encode. features.json may be either a flat ShotFeatures object or { "features": { ... } }; the required fields are probe_bitrate_kbps, probe_i_frame_avg_bytes, probe_p_frame_avg_bytes, and probe_b_frame_avg_bytes.

vmaf-tune sidecar record \
    --codec libx264 \
    --features-json features.json \
    --crf 28 \
    --observed-vmaf 94.2

Batch training accepts JSONL with one row per observed encode:

{"features":{"probe_bitrate_kbps":3000,"probe_i_frame_avg_bytes":10000,"probe_p_frame_avg_bytes":2000,"probe_b_frame_avg_bytes":1000,"width":1920,"height":1080,"fps":24},"crf":28,"observed_vmaf":94.2}
vmaf-tune sidecar batch-record --codec libx264 --captures-jsonl captures.jsonl

Prediction reports the bare predictor score, sidecar correction, and clamped final score:

vmaf-tune sidecar predict --codec libx264 --features-json features.json --crf 28 --json

Quick start

Generate a 6-cell corpus row over (medium, slow) × (22, 28, 34) for one 1080p source clip:

vmaf-tune corpus \
    --source ref.yuv \
    --width 1920 --height 1080 --pix-fmt yuv420p \
    --framerate 24 --duration 10 \
    --preset medium --preset slow \
    --crf 22 --crf 28 --crf 34 \
    --output corpus.jsonl

--source is repeatable — pass one flag per source clip. The grid is the Cartesian product of --preset × --crf.

SVT-AV1 example (ADR-0278)

The libsvtav1 adapter accepts the same x264-style preset names — they are translated to SVT-AV1's integer presets internally. AV1 CRF values live in 0..63; the Phase A informative window is (20, 50):

vmaf-tune corpus \
    --source ref.yuv \
    --width 1920 --height 1080 --pix-fmt yuv420p \
    --framerate 24 --duration 10 \
    --encoder libsvtav1 \
    --preset medium --preset slow \
    --crf 28 --crf 35 --crf 42 \
    --output corpus_av1.jsonl

The corpus row records the human-readable preset name ("medium"), while the FFmpeg argv carries the integer SVT-AV1 expects (-preset 7).

Codec adapter parameter ranges

Each adapter declares its own quality knob, range, and preset vocabulary. The harness validates (preset, crf) against the adapter before invoking FFmpeg.

Encoder CRF (absolute) Phase A CRF window Default CRF Presets
libx264 0..51 (15, 40) 23 ultrafast, superfast, veryfast, faster, fast, medium, slow, slower, veryslow
libsvtav1 0..63 (20, 50) 35 placebo, slowest, slower, slow, medium, fast, faster, veryfast
libvpx-vp9 0..63 (0, 63) 32 placebo, slowest, slower, slow, medium, fast, faster, veryfast, superfast, ultrafast

SVT-AV1 preset name -> integer mapping

SVT-AV1 uses integer presets 0..13 (0 = slowest / best, 13 = fastest). The harness maps the x264-style names below to AV1 integers so the corpus row schema is codec-independent:

Name SVT-AV1 integer Notes
placebo 0 Slowest; research-grade only.
slowest 1
slower 3
slow 5
medium 7 SVT-AV1 default.
fast 9
faster 11
veryfast 13 Fastest.

The mapping is closed and order-stable; see ADR-0294.

CLI flags

Flag Default Notes
--source PATH Required. Repeatable for multi-source sweeps.
--width / --height Required. Source resolution.
--pix-fmt PFMT yuv420p Forwarded to ffmpeg -pix_fmt.
--framerate F 24.0 Source framerate.
--duration S 0 Source duration in seconds (used for bitrate calc).
--encoder NAME libx264 One of libx264, libx265.
--encoder NAME libx264 One of libx264, h264_amf, hevc_amf, av1_amf.
--preset P Required. Repeatable. x264 preset name.
--crf N Required. Repeatable. x264 CRF integer.
--encoder NAME libx264 Currently wired: libx264, libsvtav1 (ADR-0294).
--preset P Required. Repeatable. Preset name (see codec table below).
--crf N Required. Repeatable. CRF integer (range varies by codec).
--output PATH corpus.jsonl JSONL destination.
--encode-dir PATH .workingdir2/encodes Scratch dir; gitignored by convention.
--keep-encodes off Retain encoded files after scoring.
--vmaf-model NAME vmaf_v0.6.1 Forwarded to vmaf --model. Only used when --no-resolution-aware is set; otherwise auto-picked per encode resolution (see "Resolution-aware mode" below).
--resolution-aware / --no-resolution-aware on Auto-pick the VMAF model per encode resolution. Default on.
--ffmpeg-bin PATH ffmpeg Override the ffmpeg binary.
--ffprobe-bin PATH ffprobe Override the ffprobe binary (used for HDR detection).
--vmaf-bin PATH vmaf Override the vmaf binary.
--score-backend NAME auto libvmaf scoring backend — auto\|cpu\|cuda\|sycl\|hip. See below. (vulkan was removed in ADR-0726.)
--no-source-hash off Skip src_sha256 (faster on large YUVs; loses provenance).
--auto-hdr (default) Probe each source via ffprobe; inject HDR flags + HDR model when PQ / HLG signaling is detected.
--force-sdr off Treat all sources as SDR; skip HDR detection.
--force-hdr-pq off Treat all sources as HDR PQ (SMPTE-2084) without probing. Useful for raw YUV refs that ffprobe cannot read color metadata from.
--force-hdr-hlg off Treat all sources as HDR HLG (ARIB STD-B67) without probing.
--two-pass off Phase F (ADR-0333 + ADR-0546). Run a 2-pass encode for codecs whose adapter sets supports_two_pass = True (today: libx264, libx265, libvpx-vp9, libaom-av1, libvvenc). Codecs without 2-pass support fall back to single-pass with a stderr warning. Doubles encode wall time.
--vaapi-device PATH auto VA-API DRI render-node for Intel QSV hardware-device init. auto selects the first Intel render node from /sys/class/drm and falls back to /dev/dri/renderD128 only when no Intel node is discoverable. Also overridable via VMAFTUNE_VAAPI_DEVICE env var; the flag takes precedence. Ignored for non-QSV encoders. (ADR-0641)

Resolution-aware mode

VMAF is a resolution-aware metric: the fork ships two production-grade pooled-mean models — vmaf_v0.6.1 (trained on a 1080p viewing setup) and vmaf_4k_v0.6.1 (re-fit for a 4K display). Scoring 4K content against the 1080p model under-counts spatial detail; scoring 1080p content against the 4K model over-counts coding artefacts. The bias is several VMAF points either way — large enough to poison a mixed-resolution ABR-ladder corpus.

When --resolution-aware is on (the default), vmaf-tune picks the model per encode according to a height-only rule that mirrors Netflix's published guidance:

Encode height Selected model
≥ 2160 (UHD-1 and up) vmaf_4k_v0.6.1
< 2160 (everything else, including 1440p / 720p / SD) vmaf_v0.6.1

The fork has no 720p / 1440p / SD model — vmaf_v0.6.1 is the canonical fallback for all sub-2160p content (matches Netflix's recommendation).

The emitted JSONL row's vmaf_model field now records the effective model used for that row, not the global --vmaf-model opt. Mixed-ladder corpora legitimately contain multiple distinct vmaf_model values across rows; downstream consumers should group / filter by vmaf_model rather than assume a constant.

To force a single model regardless of resolution (e.g. to reproduce a legacy single-model corpus), pass --no-resolution-aware:

vmaf-tune corpus \
    --source ref_4k.yuv --width 3840 --height 2160 \
    --preset medium --crf 23 \
    --no-resolution-aware --vmaf-model vmaf_v0.6.1 \
    --output corpus.jsonl

The Python API exposes the decision rule directly for callers that need to consult it outside the corpus loop:

from vmaftune.resolution import (
    select_vmaf_model_version,    # str: "vmaf_v0.6.1" or "vmaf_4k_v0.6.1"
    select_vmaf_model,            # Path: in-tree model JSON file
    crf_offset_for_resolution,    # int: -2 / 0 / +2 / +4 by resolution band
)

assert select_vmaf_model_version(3840, 2160) == "vmaf_4k_v0.6.1"
assert select_vmaf_model_version(1920, 1080) == "vmaf_v0.6.1"

crf_offset_for_resolution returns a small integer offset that the future search layer (Phase B target-VMAF bisect) can apply when seeding bisect bounds across an ABR ladder. The shipped defaults are codec-agnostic and conservative; Phase B/C/D will learn per-codec offsets from real corpora and override them via the same function signature. See ADR-0289 and Research-0054 for the full rationale.

GPU scoring backend

Per ADR-0299, vmaf-tune corpus forwards a --backend NAME argument to the libvmaf CLI so scoring runs on a GPU when one is present. CPU VMAF runs at ~1–2 fps on 1080p; the CUDA / SYCL / HIP backends shipped with this fork (ADR-0667) deliver 10–30× speedup on the score axis. (The Vulkan backend was removed in ADR-0726; use HIP for vendor-neutral AMD / Intel Arc GPU scoring.)

Modes

Value Behaviour
auto (default) Detect what the local vmaf binary supports + what the host hardware advertises, then pick the fastest available backend in vmaf-tune probe order: cuda → sycl → hip → cpu. Falls back to CPU silently only because no GPU was found.
cuda / sycl / hip Strict mode. Errors out with BackendUnavailableError if the local vmaf binary does not support the requested backend or the host hardware is missing. No silent downgrade to CPU — that would mask hardware/build mismatches and lie about wall-clock expectations.
cpu Force the CPU path. Useful for reproducibility against the Netflix golden-data gate or to bypass a known-bad GPU driver day.

Detection heuristics

vmaf-tune inspects the vmaf --help output to learn which backends the binary advertises (the CLI prints a line of the form --backend $name: ...auto|cpu|cuda|sycl|hip|metal), then runs cheap hardware probes:

  • CUDA: nvidia-smi -L returns at least one GPU line.
  • SYCL: sycl-ls lists at least one :gpu device.
  • HIP/ROCm: rocminfo reports a gfx* agent, or rocm-smi reports a GPU.

Missing tools degrade to "backend not available" — they never raise hard errors. CPU is always considered available even if the help line is missing.

Probe order vs. libvmaf registry order: vmaf-tune's auto probe order (cuda → sycl → hip → cpu) differs from the libvmaf internal registry order (sycl → cuda → hip → cpu). vmaf-tune probes CUDA first because nvidia-smi provides the most reliable availability detection. The libvmaf registry prioritises SYCL first as the primary continuous-integration GPU target. When comparing timing results, be aware that auto may select different backends in vmaf-tune versus a direct vmaf --backend auto invocation.

Wall-clock expectation (60 s 1080p source, indicative)

Score backend Hardware Wall-clock Throughput
cpu AVX2 desktop CPU ~600–1200 s ~1.2–2.5 fps
cuda RTX 30/40-class GPU ~50–120 s ~12–30 fps
sycl Intel Arc / Iris Xe ~80–180 s ~8–18 fps
hip RDNA2 / RDNA3 ROCm host ~80–180 s ~8–18 fps

Numbers are order-of-magnitude only; exact figures depend on the specific feature extractors enabled by the model (vmaf_v0.6.1 versus tiny-AI variants), whether --keep-encodes is on, and host I/O bandwidth. Cross-backend numerical parity is guaranteed to places=4 by the ADR-0214 CI gate.

Examples

# Default — auto-pick the fastest backend.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
    --preset medium --crf 22 --crf 28

# Force CUDA. Errors out clearly if /opt/cuda is missing or the
# vmaf binary was built without CUDA support.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
    --preset medium --crf 22 --score-backend cuda

# Pin CPU for reproducibility against the Netflix golden gate.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
    --preset medium --crf 22 --score-backend cpu

Vulkan score backend (--score-backend=vulkan)

Status: REMOVED — ADR-0726 (2026-05-28). The Vulkan backend was removed from libvmaf. --score-backend=vulkan is no longer a valid value — passing it returns a non-zero exit code with an "unsupported backend" error. Use --score-backend=hip for vendor-neutral GPU scoring on AMD or Intel Arc hosts. The sections below are preserved for historical reference only; none of the workflows described are functional in current builds.

Per ADR-0314 (now superseded by ADR-0726), the vulkan value of --score-backend was the vendor-neutral GPU score path. Use --score-backend=hip for AMD/Intel Arc hosts going forward.

Supported platforms

Host Driver Status
Linux + AMD RDNA2/RDNA3 Mesa RADV Production.
Linux + Intel Arc / Iris Xe Mesa anv Production.
Linux + NVIDIA Mesa NVK or proprietary nvidia driver Production; coexists with --score-backend=cuda.
Linux CI / no GPU Mesa lavapipe (software rasteriser) Slow but functional — the cross-backend parity gate (ADR-0214) runs on lavapipe.
macOS (Apple Silicon + Intel) MoltenVK 1.2 layered over Metal Functional.
Windows Vendor-supplied Vulkan ICD Functional; not gated in CI yet.

The libvmaf binary needs to be built with -Denable_vulkan=true (default in fork release artefacts). The vulkan value will fail strict-mode validation otherwise — vmaf --help will not advertise vulkan in its --backend alternation.

Verifying Vulkan availability

vmaf-tune runs the same probe libvmaf does — vulkaninfo --summary must succeed and report at least one deviceName:

$ vulkaninfo --summary | grep deviceName
        deviceName        = AMD Radeon RX 7900 XTX (RADV NAVI31)

If that command is missing, install the Vulkan SDK loader (Linux: vulkan-tools package; macOS: brew install vulkan-tools).

Example

# Vendor-neutral GPU scoring on AMD / Intel Arc / MoltenVK hosts.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
    --preset medium --crf 22 --crf 28 --score-backend vulkan

Failure mode (no Vulkan loader installed):

vmaf-tune: backend 'vulkan' requested but not available on this host
(available: cpu). Check that the local vmaf binary was built with the
matching backend support and the corresponding runtime/driver is
installed.

The exit code is 2 and no encodes are dispatched — the strict-mode guarantee from ADR-0299 (no silent CPU downgrade) is preserved across all four backend values.

| --no-cache | off | Disable the content-addressed encode/score cache (default: ON). | | --cache-dir PATH | $XDG_CACHE_HOME/vmaf-tune | Override cache location (falls back to ~/.cache/vmaf-tune). | | --cache-size-gb N | 10 | LRU eviction cap in GiB. | | --sample-clip-seconds N | 0 | Encode/score only the centre N-second slice of each source. 0 (default) keeps the legacy full-source behaviour. See Sample-clip mode. |

Sample-clip mode

Set --sample-clip-seconds N to evaluate each grid cell on the centre N-second slice of the source instead of the full reference. This is a runtime/accuracy trade-off, formalised in ADR-0301.

  • Speedup. Encode wall-time scales roughly linearly with slice length, so e.g. a 10-second slice of a 60-second source is a ~6x speedup per grid point. The libvmaf scoring pass shrinks by the same factor (it reads only the matching reference window via --frame_skip_ref / --frame_cnt).
  • Accuracy delta. Expect ~1-2 VMAF points of drift versus full-clip on diverse content (mixed-shot trailers, sports, action), tighter (~0.3-0.5 VMAF points) on uniform content (single-shot interviews, animation, static stills). The delta is per-cell consistent — relative ordering between (preset, crf) cells survives, which is what Phase B (target-VMAF bisect) and Phase C (per-title CRF predictor) actually consume. Full-clip rescoring of the predictor's pick is the recommended Phase C epilogue.
  • Window placement. Naive centre-anchored: the slice is (duration_s − N) / 2 .. (duration_s + N) / 2. Smarter shot-aware placement (e.g. via transnet_v2) is on the follow-up backlog.
  • Fallback. If N >= duration_s the harness silently falls back to full-clip mode and tags the row clip_mode="full" — the request is treated as "use the whole source", not as an error.
  • Bitrate semantics. bitrate_kbps is computed against the encoded duration, so sample-clip rows aren't biased low by dividing slice-bytes by full-source seconds. duration_s keeps the original source provenance.
# 6x faster grid sweep — 10s of a 60s source per cell.
vmaf-tune corpus \
    --source ref.yuv \
    --width 1920 --height 1080 --pix-fmt yuv420p \
    --framerate 24 --duration 60 \
    --preset medium --preset slow \
    --crf 22 --crf 28 --crf 34 \
    --sample-clip-seconds 10 \
    --output corpus.jsonl

Each emitted row carries clip_mode="sample_10s" (or "full"), letting Phase B/C either filter sample rows out, weight them differently, or rescore the chosen cell on the full source.

benchmark subcommand — corpus-level codec ranking

vmaf-tune benchmark is Phase G of the tune workflow. It does not run FFmpeg or libvmaf; it reads an existing Phase-A JSONL corpus and ranks encoders by their best matched-quality point.

For each encoder in the corpus, the command filters to successful rows with finite vmaf_score and bitrate_kbps, then chooses the lowest bitrate row whose VMAF clears --target-vmaf. Encoders that never clear the target stay in the report as status=unmet using their closest VMAF miss, so a too-narrow CRF sweep is visible instead of silently dropped.

vmaf-tune benchmark \
    --from-corpus corpus.jsonl \
    --target-vmaf 92 \
    --baseline-encoder libx264 \
    --format markdown

Output formats:

Format Use
markdown PR comments and human review.
json Notebooks, dashboards, and follow-up automation.
csv Spreadsheets and quick plots.

--baseline-encoder controls the bitrate-delta column. When omitted, the baseline is the lowest-bitrate encoder that clears the target. The report inherits the corpus coverage: if libx264 was swept over 20 CRFs and libx265 over 3 CRFs, the ranking reflects those sampled points only.

Corpus JSONL schema

Each row is one JSON object on its own line. The full key list is exported as vmaftune.CORPUS_ROW_KEYS for programmatic consumers and versioned via vmaftune.SCHEMA_VERSION (currently 3 — v2 added clip_mode for sample-clip mode under ADR-0301; v3 added the HDR provenance triple hdr_transfer / hdr_primaries / hdr_forced when corpus.iter_rows was wired to hdr.detect_hdr + hdr.hdr_codec_args per the ADR-0300 status update of 2026-05-08). Bumping the schema is a coordinated change with Phase B/C; do not edit row shape without bumping the version.

Key Type Description
schema_version int Currently 3. v3 adds the enc_internal_* aggregates (ADR-0332).
run_id str Per-row UUID4 hex.
timestamp str UTC ISO-8601 (seconds precision).
src str Path to the reference YUV.
src_sha256 str SHA-256 of the reference (empty if --no-source-hash).
width / height int Source dimensions.
pix_fmt str Source pixel format.
framerate float Source framerate.
duration_s float Source duration in seconds.
encoder str Codec adapter name (e.g. libx264).
encoder_version str Detected encoder version (e.g. libx264-164).
preset str Encoder preset.
crf int Quality knob value.
extra_params list[str] Additional encoder argv (Phase A: []).
encode_path str Path to encoded file (empty if not retained).
encode_size_bytes int Encoded file size.
bitrate_kbps float (encode_size_bytes × 8 / 1000) / duration_s.
encode_time_ms float Wall-clock encode time.
vmaf_score float Pooled-mean VMAF (NaN if scoring skipped/failed).
vmaf_model str Model version string (e.g. vmaf_v0.6.1).
score_time_ms float Wall-clock scoring time.
ffmpeg_version str Detected ffmpeg version.
vmaf_binary_version str Detected vmaf binary version.
exit_status int First non-zero of (encode, score) exit codes.
hdr_transfer str "" (SDR), "pq" (SMPTE-2084) or "hlg" (ARIB STD-B67). Schema v3+.
hdr_primaries str Raw ffprobe color_primaries (e.g. bt2020); empty for SDR. Schema v3+.
hdr_forced bool true iff the user overrode detection via --force-hdr-* / --force-sdr. Schema v3+.
clip_mode str "full" (default) or "sample_<N>s" per --sample-clip-seconds. Schema v2+.
shot_count int Number of TransNet-V2 shots in the source (0 when shot detection unavailable). v3+.
shot_avg_duration_sec float Mean shot length in seconds (0.0 when unavailable). v3+.
shot_duration_std_sec float Population std of shot lengths in seconds — content-class proxy (animation: low; live action: high). v3+.
adm2_mean float Per-frame ADM2 mean (canonical-6). NaN when scoring skipped. v3+ (ADR-0366).
vif_scale0_meanmotion2_std float Remaining canonical-6 mean/std aggregates (12 columns total). v3+ (ADR-0366).
enc_internal_qp_mean float Per-frame QP mean from x264 pass-1 stats. 0.0 for opt-out adapters. v3+ (ADR-0332).
enc_internal_qp_std float Per-frame QP standard deviation. v3+ (ADR-0332).
enc_internal_bits_mean float Per-frame bit-cost mean (tex+mv+misc). v3+ (ADR-0332).
enc_internal_bits_std float Per-frame bit-cost standard deviation. v3+ (ADR-0332).
enc_internal_mv_mean float Per-frame motion-vector bit-cost mean. v3+ (ADR-0332).
enc_internal_mv_std float Per-frame motion-vector bit-cost standard deviation. v3+ (ADR-0332).
enc_internal_itex_mean float Mean intra-texture cost across I/i frames. v3+ (ADR-0332).
enc_internal_ptex_mean float Mean predicted-texture cost across P/B/b frames. v3+ (ADR-0332).
enc_internal_intra_ratio float Fraction of macroblocks coded as intra. v3+ (ADR-0332).
enc_internal_skip_ratio float Fraction of macroblocks coded as skip. v3+ (ADR-0332).

The ten enc_internal_* columns are populated for adapters that declare supports_encoder_stats = True (currently libx264 and libx265). The parser normalizes x264 macroblock counters and x265 icu / pcu / scu CTU counters into the same intra / predicted / skip ratio columns. Hardware encoders (NVENC / AMF / QSV / VideoToolbox) and AV1 software encoders (libaom-av1 / libsvtav1 / libvvenc) opt out and emit 0.0 for every column so the schema is uniform across the corpus. The trade-off: per-encode wall-clock cost roughly doubles for opt-in adapters because the harness runs a stats-only -pass 1 invocation before the production CRF encode.

Example row

{
  "schema_version": 3,
  "run_id": "0a3b1c8b...",
  "timestamp": "2026-05-03T16:00:00+00:00",
  "src": "ref.yuv",
  "src_sha256": "",
  "width": 1920, "height": 1080, "pix_fmt": "yuv420p",
  "framerate": 24.0, "duration_s": 10.0,
  "encoder": "libx264", "encoder_version": "libx264-164",
  "preset": "medium", "crf": 28,
  "extra_params": [],
  "encode_path": "",
  "encode_size_bytes": 845210,
  "bitrate_kbps": 676.168,
  "encode_time_ms": 4321.0,
  "vmaf_score": 92.41,
  "vmaf_model": "vmaf_v0.6.1",
  "score_time_ms": 1820.5,
  "ffmpeg_version": "6.1.1",
  "vmaf_binary_version": "3.0.0-lusoris.0",
  "exit_status": 0,
  "clip_mode": "full",
  "shot_count": 12,
  "shot_avg_duration_sec": 0.83,
  "shot_duration_std_sec": 0.41,
  "adm2_mean": 9.73, "adm2_std": 0.12,
  "enc_internal_qp_mean": 25.23,
  "enc_internal_qp_std": 0.12,
  "enc_internal_bits_mean": 4975.0,
  "enc_internal_bits_std": 1820.5,
  "enc_internal_mv_mean": 60.3,
  "enc_internal_mv_std": 32.1,
  "enc_internal_itex_mean": 8000.0,
  "enc_internal_ptex_mean": 1500.0,
  "enc_internal_intra_ratio": 0.07,
  "enc_internal_skip_ratio": 0.16
}

recommend subcommand — target-VMAF / target-bitrate

vmaf-tune recommend consumes the corpus (either pre-built via --from-corpus PATH.jsonl or generated on the fly from --source + grid flags) and applies one of two predicates:

  • --target-vmaf T — return the row with the smallest CRF whose vmaf_score >= T. If no row clears the bar, the row with the highest VMAF is returned and the predicate is annotated (UNMET) in the output. Exit code is still 0 for an honest closest-miss.
  • --target-bitrate KBPS — return the row whose bitrate_kbps is closest (absolute distance) to KBPS. Ties on distance go to the smaller CRF (higher quality).

The two flags are mutually exclusive — argparse rejects passing both with exit code 2.

When reading a pre-built corpus, recommend ignores rows whose exit_status is non-zero and rows with missing or non-finite vmaf_score. The --encoder and --preset flags act as filters in that mode, so a mixed-codec corpus can be reused without first splitting it into per-codec files.

Use a pre-built corpus

# Phase A — build once.
vmaf-tune corpus --source ref.yuv --width 1920 --height 1080 \
    --framerate 24 --duration 10 \
    --preset medium --crf 18 --crf 22 --crf 26 --crf 30 --crf 34 \
    --output corpus.jsonl

# Smallest CRF whose VMAF >= 92.
vmaf-tune recommend --from-corpus corpus.jsonl --target-vmaf 92.0

# CRF whose bitrate is closest to 5 Mbps.
vmaf-tune recommend --from-corpus corpus.jsonl --target-bitrate 5000

Build the corpus on the fly

vmaf-tune recommend \
    --source ref.yuv --width 1920 --height 1080 \
    --framerate 24 --duration 10 \
    --preset medium --crf 18 --crf 22 --crf 26 --crf 30 \
    --target-vmaf 92.0

If --preset and --crf are omitted, recommend sweeps medium × range(18, 36, 2) as a sensible default for ad-hoc runs.

Output

Default output is a single human-readable line on stdout, e.g.

encoder=libx264 preset=medium crf=22 vmaf=95.000 bitrate_kbps=5000.00 \
  predicate=target_vmaf>=92.0 margin=+3.000

Pass --json to get the full corpus row as a JSON object on stdout instead — convenient for piping into other tooling.

recommend flags

Flag Default Notes
--target-vmaf T Smallest CRF whose vmaf_score >= T.
--target-bitrate KBPS CRF whose bitrate_kbps is closest to KBPS.
--from-corpus PATH Read rows from a pre-built JSONL. Skips encode + score.
--source / --width / --height / --framerate / --duration Build a corpus on the fly. Required when --from-corpus is omitted.
--encoder / --preset / --crf libx264 / medium / [18,20,...,34] Sweep grid (when building). Filter (when loading).
--json off Emit the winning row as JSON instead of the prose summary.

fast subcommand — proxy + Bayesian + GPU-verify (Phase A.5)

vmaf-tune fast is the seconds-to-minutes alternative to the Phase A grid for the recommendation use case. It runs an Optuna TPE search over the integer CRF axis, scores each trial with the fr_regressor_v2 proxy (ADR-0291) on the canonical-6 libvmaf features extracted from a short probe encode, then runs one real-encode + libvmaf verify pass at the recommended CRF before reporting. The slow grid stays canonical (ADR-0276) — fast is opt-in, and falls back to the grid when the proxy/verify gap exceeds the configured tolerance.

Install

The fast-path needs Optuna in addition to the core install:

pip install 'vmaf-tune[fast]'

The shipped [fast] extra is the only correct install path; the core package stays zero-extra-dep so corpus generation works on hosts that never run the fast path.

Smoke run (no ffmpeg / no ONNX / no GPU)

The --smoke flag swaps the proxy + verify pipeline for a deterministic synthetic CRF→VMAF curve so CI on bare hosts still exercises the search loop:

vmaf-tune fast --target-vmaf 92.0 --smoke --n-trials 12
{
  "encoder": "libx264",
  "n_trials": 12,
  "notes": "smoke mode — synthetic predictor; no ffmpeg / ONNX / GPU. See ADR-0276 + ADR-0304 + Research-0076 for the production path.",
  "predicted_kbps": 1954.27,
  "predicted_vmaf": 82.65,
  "proxy_verify_gap": null,
  "recommended_crf": 27,
  "smoke": true,
  "target_vmaf": 92.0,
  "verify_vmaf": null
}

Production run

vmaf-tune fast \
    --src ref.yuv --width 1920 --height 1080 \
    --framerate 24 --pix-fmt yuv420p \
    --encoder libx264 --preset medium \
    --target-vmaf 92.0 \
    --crf-min 18 --crf-max 40 \
    --n-trials 30 \
    --score-backend auto \
    --output recommendation.json

The recommendation lands as a single JSON object — same schema recommend and predict already emit, plus the fast-path-specific verify_vmaf and proxy_verify_gap diagnostics:

{
  "encoder": "libx264",
  "target_vmaf": 92.0,
  "recommended_crf": 22,
  "predicted_vmaf": 92.41,
  "predicted_kbps": 4820.0,
  "n_trials": 30,
  "smoke": false,
  "notes": "production: TPE over 30 trials with v2 proxy; GPU verify gap = 0.612 VMAF (tolerance 1.50).",
  "verify_vmaf": 91.80,
  "proxy_verify_gap": 0.612,
  "score_backend": "cuda"
}

Exit codes

Code Meaning
0 Recommendation produced; proxy/verify gap within tolerance.
2 Argument validation error (missing --src, bad CRF range, ...).
3 Out-of-distribution: proxy/verify gap exceeded --proxy-tolerance. The recommendation is still emitted; callers should fall back to the slow Phase A grid (vmaf-tune corpus + vmaf-tune recommend).

Fall-back idiom

vmaf-tune fast --src ref.yuv --width 1920 --height 1080 \
    --target-vmaf 92.0 --output rec.json \
  || vmaf-tune recommend --source ref.yuv --width 1920 --height 1080 \
        --preset medium --target-vmaf 92.0 --output rec.json

The || chain captures both the production-error case (rc=2) and the OOD case (rc=3), so the slow grid is the safety net whenever the fast-path is not confident.

fast flags

Flag Default Notes
--src PATH Source video. Required outside --smoke.
--width / --height 0 Raw-YUV geometry. Required outside --smoke.
--pix-fmt yuv420p ffmpeg pix_fmt for the probe + verify encodes.
--framerate 24.0 Reference framerate.
--target-vmaf T Quality target on the standard [0, 100] scale. Required.
--encoder libx264 Codec adapter; must be in ENCODER_VOCAB_V2 for production mode.
--preset medium Encoder preset for the probe + verify encodes.
--crf-min / --crf-max 10 / 51 TPE search range over the integer CRF axis.
--n-trials 30 (prod), 50 (smoke) TPE trial budget.
--time-budget-s 300 Soft wall-clock cap for Optuna. TPE stops scheduling new trials after the timeout; an in-flight probe finishes.
--proxy-tolerance 1.5 Max abs proxy/verify gap before exit code 3.
--sample-chunk-seconds 5.0 Probe-slice duration per TPE trial.
--smoke off Synthetic curve; no ffmpeg / ONNX / GPU.
--score-backend auto Verify-pass backend (auto/cpu/cuda/sycl/hip). (vulkan removed in ADR-0726.)
--ffmpeg-bin / --vmaf-bin ffmpeg / vmaf Tool paths.
--vmaf-model vmaf_v0.6.1 libvmaf model for the verify pass.
--encode-dir .workingdir2/fast Scratch dir for probe + verify encodes.
--output stdout JSON destination for the recommendation payload.

prefilter subcommand — Pelorus deband + CRF joint autotune (ADR-1116)

vmaf-tune prefilter runs a VMAF-in-the-loop search that jointly tunes the Pelorus deband pre-filter's strengths and the encoder CRF, returning the lowest-bitrate combination that hits a target VMAF. This is the control-plane ("mode 2") seam between vmafx and Pelorus (Pelorus ADR-0106; contract in Pelorus ADR-0110).

vmafx stays Vulkan-free: it only emits the ffmpeg -vf "pelorus_deband_vulkan=range=..:thry=..:.." string and scores the encoded output. The deband filter runs inside ffmpeg — so the live loop requires an ffmpeg build with the Pelorus Vulkan deband filter. When that filter is absent the subcommand refuses the live run with a clear message and points you at --smoke.

The frozen knob contract

The search space is exactly the 10 tunable knobs Pelorus ADR-0110 freezes (name / type / range / default). vmafx hard-codes this table — renaming, narrowing, or retyping a knob is a coordinated two-repo break:

Knob Type Range Default Meaning
range int 1–31 15 reference-sampling radius (px)
thry float 0.0–0.25 0.012 luma flat-test threshold
thrc float 0.0–0.25 0.012 chroma flat-test threshold
grainy float 0.0–0.4 0.006 luma grain amplitude
grainc float 0.0–0.4 0.0 chroma grain amplitude
softness float 0.0–1.0 0.5 soft-blend transition width
detail float 0.0–0.25 0.06 detail-mask activity threshold
dither enum 0–2 2 0=none, 1=bayer8, 2=bluenoise
dynamic bool 0–1 1 re-seed grain each frame
protect bool 0–1 1 gate debanding off textured regions

The out-of-contract options (sample, blur, planes, meta) are never swept — they are pipeline-topology / reporting switches set once per run outside the optimizer.

How the joint search works

Each Optuna TPE trial proposes a full (deband-dict, crf) and the loop runs [pelorus_deband_vulkan=...] → HW encode → VMAF score for it. The objective is |achieved_vmaf − target| + λ·kbps, so the search converges on the lowest-bitrate deband+CRF that hits the target. CRF is an ordinal integer dimension joined to the 10 deband dimensions in one study — the two axes are co-optimised, not searched in nested loops (debanding shifts the rate-quality curve, so they are not separable). The result reports the recommended strengths, CRF, achieved VMAF, and the per-probe VMAF log.

The search engine is the same TPESampler study the fast subcommand uses (ADR-0276 / ADR-0304) — prefilter only constructs the joint search space and the objective.

Quick start (smoke)

Exercise the joint search end-to-end with a synthetic surface — no ffmpeg, no Vulkan, no GPU:

vmaf-tune prefilter --target-vmaf 92 --smoke --n-trials 40

Live loop (requires a Pelorus-enabled ffmpeg)

vmaf-tune prefilter \
  --src ref.yuv --width 1920 --height 1080 \
  --target-vmaf 93 --encoder libx264 \
  --crf-min 18 --crf-max 40 \
  --ffmpeg-bin /opt/ffmpeg-pelorus/bin/ffmpeg

The emitted recommendation includes a ready-to-use recommended_vf fragment, e.g.:

pelorus_deband_vulkan=range=12:thry=0.018:grainy=0.008

Restrict the swept knobs with one or more --sweep-knob flags (the rest stay at the filter default):

vmaf-tune prefilter --target-vmaf 93 --smoke \
  --sweep-knob range --sweep-knob grainy

prefilter flags

Flag Default Notes
--src PATH Source video. Required for the live loop.
--width / --height 0 Raw-YUV geometry. Required for the live loop.
--pix-fmt yuv420p ffmpeg pix_fmt for the probe encodes.
--framerate 24.0 Reference framerate.
--target-vmaf T Quality target on the [0, 100] scale. Required.
--encoder libx264 Codec adapter that performs the post-deband encode.
--preset medium Encoder preset for the probe encodes.
--filter pelorus_deband Pre-encode filter adapter to autotune.
--sweep-knob KNOB all 10 Repeatable; restricts the swept deband knobs.
--crf-min / --crf-max 18 / 40 Joint TPE search range over CRF.
--n-trials 60 (live), 40 (smoke) TPE trial budget.
--time-budget-s 600 Soft wall-clock cap for Optuna.
--seed 0 TPE sampler seed (reproducible search).
--smoke off Synthetic deband+CRF surface; no ffmpeg / Vulkan / GPU.
--score-backend auto Probe-score backend (auto/cpu/cuda/sycl/hip).
--ffmpeg-bin / --vmaf-bin ffmpeg / vmaf Tool paths.
--vmaf-model vmaf_v0.6.1 libvmaf model for the probe scores.
--neg off Use the VMAF NEG model variant.
--encode-dir .workingdir2/prefilter Scratch dir for probe encodes.
--output stdout JSON destination for the recommendation payload.

Note: the live encode path is unit-tested with a mocked encode/score loop but has not been run against a real Pelorus-enabled ffmpeg in this environment (the filter is not installed here). See docs/state.mdT-PREFILTER-LIVE-ENCODE-UNTESTED-2026-06-14.

Codec adapters

Phase A wires libx264 end-to-end through the search loop. Additional codec adapters land as one-file additions under tools/vmaf-tune/src/vmaftune/codec_adapters/ and join the registry without touching the search loop. The currently-registered adapters are discoverable via vmaftune.codec_adapters.known_codecs().

libaom-av1

Google's reference AV1 encoder. Adapter shipped via ADR-0279.

  • Encoder name (FFmpeg): libaom-av1.
  • Quality knob: -crf integer in [0, 63] (default 35); higher CRF is lower quality.
  • Speed knob: -cpu-used integer in [0, 9] (0 = slowest/best, 9 = fastest). The adapter exposes a human-readable preset vocabulary that matches x264/x265 so a single sweep axis covers all four encoders.
--preset name libaom -cpu-used
placebo 0 (slowest, highest quality)
slowest 1
slower 2
slow 3
medium 4 (default)
fast 5
faster 6
veryfast 7
superfast 8
ultrafast 9 (fastest)

Sample FFmpeg invocation produced by the adapter (Phase B+ wires the codec args into vmaftune.encode):

ffmpeg -i ref.y4m -c:v libaom-av1 -crf 35 -cpu-used 4 -an -y out.mkv

libaom vs SVT-AV1 trade-offs

libaom-av1 and libsvtav1 both target the AV1 bitstream but sit at different points on the speed/quality curve. Use the table below as a rough decision aid; the per-corpus numbers belong in your local sweep output, not here.

Concern libaom-av1 libsvtav1
Encode wall time at matched preset meaningfully slower meaningfully faster
Quality at slow presets (matched bitrate) slightly higher per AOM benchmarks slightly lower
Quality at fast presets comparable comparable, sometimes ahead
CRF range 0..63 0..63
Best fit offline / high-quality archive encodes live, batch, large catalog

The fork's vmaf-tune corpus rows record the exact (encoder, preset, crf, vmaf_score, encode_time_ms, bitrate_kbps) tuple, so Phase C/D predictors can pick whichever encoder dominates the relevant region of the rate-distortion plane on a given source.

SVT-AV1-HDR (juliobbv-p/svt-av1-hdr) is a runtime variant of the libsvtav1 adapter, not a separate adapter. Its FFmpeg examples still use -c:v libsvtav1; the difference is which SVT library the FFmpeg binary is linked against. Use ADAPTER@VARIANT tokens plus --encoder-ffmpeg-bin to compare mainline SVT-AV1 and SVT-AV1-HDR in one report:

vmaf-tune compare \
    --src clip.mkv --width 3840 --height 2160 --pix-fmt yuv420p10le \
    --target-vmafs 94,96,98 \
    --encoders libsvtav1,libsvtav1@svt-av1-hdr \
    --ffmpeg-bin /opt/ffmpeg-8.1.1-main/bin/ffmpeg \
    --encoder-ffmpeg-bin libsvtav1@svt-av1-hdr=/opt/ffmpeg-8.1.1-svtav1-hdr/bin/ffmpeg \
    --format json --output svtav1-vs-hdr.json

The row label stays human-readable (codec = libsvtav1@svt-av1-hdr), while machine-readable provenance records adapter = libsvtav1, runtime_variant = svt-av1-hdr, and the exact ffmpeg_bin path. Tokens without a per-token binding use the global --ffmpeg-bin.

Hardware encoders (NVENC)

Phase A also wires the NVIDIA NVENC family for hardware-accelerated sweeps:

Adapter --encoder FFmpeg encoder Hardware required
h264_nvenc h264_nvenc NVIDIA Kepler+ (most modern GPUs)
hevc_nvenc hevc_nvenc NVIDIA Maxwell 2nd-gen+ (GTX 960+)
av1_nvenc av1_nvenc NVIDIA Ada Lovelace+ (RTX 40-series, L40, L4)

NVENC's quality knob is -cq (constant quantizer), the closest analogue to libx264 CRF. The fork's CQ window is the same [15, 40] perceptually informative range used for libx264; the hardware accepts [0, 51].

NVENC has seven preset levels (p1 fastest → p7 slowest). The CLI takes the same mnemonic preset names as libx264 and maps them:

Mnemonic NVENC preset
ultrafast, superfast, veryfast p1
faster p2
fast p3
medium (default) p4
slow p5
slower p6
slowest, placebo p7

Hardware encoders (AMF)

vmaf-tune ships software (libx264) and hardware adapters under one codec-adapter contract — the harness search loop is identical for all of them. AMD AMF (Advanced Media Framework) covers H.264, HEVC, and AV1 on AMD GPUs (see ADR-0282):

Encoder Codec Hardware FFmpeg flag Quality knob
h264_amf H.264 / AVC Any AMD GPU with AMF -c:v h264_amf -qp_i / -qp_p (cqp)
hevc_amf HEVC / H.265 Any AMD GPU with AMF -c:v hevc_amf -qp_i / -qp_p (cqp)
av1_amf AV1 RDNA3+ only (RX 7000 series and newer) -c:v av1_amf -qp_i / -qp_p (cqp)

Requirements:

  • AMD GPU with AMF support (AV1 needs RDNA3 silicon or newer).
  • FFmpeg built with --enable-amf.
  • The AMF runtime / driver installed (Adrenalin on Windows; the open-source Mesa AMF stack or AMD Pro driver on Linux).

Behavioural notes:

  • Preset compression — 7 levels collapse to 3. AMF exposes only three -quality rungs (quality, balanced, speed) where libx264 / NVENC / QSV expose seven preset names. The adapter maps preset names onto AMF rungs as follows:
Preset names AMF -quality
placebo, slowest, slower, slow quality
medium (default) balanced
fast, faster, veryfast, superfast, ultrafast speed

This is opinionated: AMF's hardware pipeline does not expose finer steps. Callers that need finer granularity should pin -qp_i / -qp_p (the qp quality knob) instead of the preset.

  • Rate control is constant-QP. AMF -rc cqp plus matched -qp_i / -qp_p is the closest analogue to x264 CRF. Range is 0..51; the harness exposes the (15, 40) Phase A informative window for cross-codec comparability.

Example:

Coarse-to-fine CRF search (ADR-0306)

Sweeping every CRF from 0..51 (52 encodes per source × preset) is wasteful when the only question is "what's the smallest CRF whose VMAF still meets my target?". The fork ships a 2-pass coarse-to-fine search that visits ~15 points instead of 52, a ~3.5× wall-time speedup with no measurable quality regression.

How it works:

  1. Coarse pass at --coarse-step over [10, 50] — by default that's [10, 20, 30, 40, 50] (5 encodes).
  2. Fine pass at --fine-step within ±--fine-radius of the best-coarse CRF. With defaults that's the 10 unique CRFs around the best (e.g. [25..29, 31..35] if best-coarse is 30).
  3. 1-pass shortcut: when the highest-CRF coarse point already meets the VMAF target, no refinement is needed (lower bitrate would have to come from CRFs above the coarse grid, which the fine pass wouldn't probe anyway). The search stops after the coarse pass.

corpus --coarse-to-fine

vmaf-tune corpus \
    --source ref.yuv \
    --width 1920 --height 1080 --pix-fmt yuv420p \
    --framerate 24 --duration 10 \
    --encoder h264_nvenc \
    --preset medium --preset slow \
    --crf 23 --crf 28 --crf 34 \
    --output corpus_nvenc.jsonl

(The --crf flag carries the quality value regardless of whether the encoder names it CRF or CQ; the adapter forwards it as -cq for NVENC.)

Hardware vs software trade-off

NVENC is 10–100× faster than the software encoders at the cost of quality. Empirically, h264_nvenc at medium typically loses 3–5 VMAF points versus libx264 medium at the same bitrate, depending on content complexity. The Pareto frontier is genuinely different — that is precisely why the harness treats NVENC as separate codec entries rather than a flag on libx264. Use NVENC when you need a large corpus quickly or when the production pipeline is GPU-encoded; use software when you need the perceptually best encode at a given bitrate.

If FFmpeg reports Encoder h264_nvenc not found (or one of the sibling encoders), the FFmpeg build wasn't compiled with --enable-nvenc or the GPU lacks the relevant generation. The harness records the failure as exit_status != 0 and skips scoring, so a partial corpus over a heterogeneous fleet is still well-formed.

vmaf-tune corpus \
    --encoder h264_amf \
    --preset slow --preset medium --preset fast \
    --crf 23 --crf 28 --crf 34 \
    --output corpus_amf.jsonl

The raw FFmpeg invocation the adapter emits looks like:

ffmpeg -i ref.yuv -c:v h264_amf \
       -quality balanced -rc cqp -qp_i 23 -qp_p 23 \
       -an out.mkv

Hardware encoders (QSV)

Beyond libx264, the registry exposes the three Intel QSV (Quick Sync Video) hardware adapters added in ADR-0281. They share the QSV preset vocabulary (veryslow, slower, slow, medium, fast, faster, veryfast — same names as x264's medium-and-down subset) and use -global_quality N (ICQ rate control, range 1..51, semantically similar to CRF).

Adapter name FFmpeg encoder Quality knob Hardware required
h264_qsv h264_qsv global_quality (1–51) Intel iGPU 7th-gen+ (Kaby Lake or newer) or Arc / Battlemage
hevc_qsv hevc_qsv global_quality (1–51) Intel iGPU 7th-gen+ (10-bit needs 11th-gen+) or Arc / Battlemage
av1_qsv av1_qsv global_quality (1–51) Intel iGPU 12th-gen+ only or Arc / Battlemage

Hardware-device initialisation (Linux). FFmpeg's QSV bridge on Linux requires an explicit VA-API device chain before the input and a pixel-format conversion filter after it. vmaf-tune injects these automatically for all QSV encoders:

# Before -i:
-init_hw_device vaapi=va:/dev/dri/renderD129
-init_hw_device qsv=qsv_dev@va
-filter_hw_device va
# After -i, before -c:v:
-vf format=nv12,hwupload=extra_hw_frames=64

Without this chain every QSV encode fails with -22 Invalid argument. The VA-API device defaults to auto: vmaf-tune walks /dev/dri/by-path and /sys/class/drm/renderD*/device/vendor, selects the first Intel vendor node (0x8086), and falls back to /dev/dri/renderD128 only when no Intel render node is discoverable. Override with --vaapi-device /dev/dri/renderD129 or VMAFTUNE_VAAPI_DEVICE=/dev/dri/renderD129 when you need to pin a specific node. (ADR-0641)

The underlying FFmpeg invocations look like:

ffmpeg \
    -init_hw_device vaapi=va:/dev/dri/renderD129 \
    -init_hw_device qsv=qsv_dev@va \
    -filter_hw_device va \
    -i src.mkv \
    -vf format=nv12,hwupload=extra_hw_frames=64 \
    -c:v h264_qsv -preset medium -global_quality 23 -an out.mkv

vmaf-tune validates the (preset, global_quality) pair before spawning ffmpeg and probes ffmpeg -encoders for the requested encoder; if libmfx / VPL is not compiled in, the harness raises RuntimeError with a build-time hint rather than letting ffmpeg emit an Encoder not found line buried in stderr.

Apple VideoToolbox adapters

The h264_videotoolbox, hevc_videotoolbox, and prores_videotoolbox registry entries cover Apple Silicon (M-series) and T2-equipped Intel Macs via the VideoToolbox.framework hardware encoder. The H.264 and HEVC adapters were added in ADR-0283; the ProRes adapter follows the same registry pattern (see ADR-0283 Status update 2026-05-09). All three share _videotoolbox_common.py for the preset → -realtime mapping; H.264 / HEVC also share the -q:v quality knob, while ProRes uses the integer profile tier instead.

Adapter name FFmpeg encoder Quality knob Hardware required
h264_videotoolbox h264_videotoolbox q:v (0..100, higher = better) Apple Silicon or Intel Mac with T2
hevc_videotoolbox hevc_videotoolbox q:v (0..100, higher = better) Apple Silicon or Intel Mac with T2
prores_videotoolbox prores_videotoolbox profile:v (0..5, higher tier = better) Apple Silicon M1 Pro / Max / Ultra or later

VideoToolbox exposes only a binary -realtime {0,1} flag instead of a multi-valued preset, so the harness's nine-name preset vocabulary collapses onto that boolean per the table in _videotoolbox_common.py: ultrafast/superfast/veryfast/faster/fastrealtime=1 (low-latency fast path); medium/slow/slower/veryslowrealtime=0 (offline / quality-priority). The mapping is intentionally lossy — VT cannot expose a finer dial.

The underlying FFmpeg invocations look like:

ffmpeg -i src.mkv -c:v h264_videotoolbox -realtime 0 -q:v 60 -an out.mkv
ffmpeg -i src.mkv -c:v hevc_videotoolbox -realtime 0 -q:v 60 -an out.mkv
ffmpeg -i src.mkv -c:v prores_videotoolbox -realtime 0 -profile:v hq -an out.mov

AV1 hardware encoding is intentionally not wired — Apple Silicon has no AV1 hardware encoder block as of 2026 and FFmpeg exposes no av1_videotoolbox. Use libaom-av1 or libsvtav1 for AV1 on macOS.

vmaf-tune validates the (preset, quality) pair via the adapter (-q:v for H.264 / HEVC; integer tier id for ProRes) and probes ffmpeg -encoders for the requested encoder; if VideoToolbox is unavailable (e.g. the host is Linux), the harness raises RuntimeError rather than letting ffmpeg emit Encoder not found.

ProRes tier reference

ProRes is a fixed-rate intermediate codec — there is no CRF / QP scalar. Quality is selected entirely by the tier, and bitrate is implicit in the tier × resolution × frame-rate combination. The harness's --crf flag carries the integer tier id (the FFmpeg profile:v AVOption value); the adapter emits the canonical FFmpeg alias on the argv for diagnosability.

crf value FFmpeg alias Marketing name Typical use
0 proxy ProRes 422 Proxy Offline editing, dailies
1 lt ProRes 422 LT Broadcast acquisition
2 standard ProRes 422 Mainline broadcast master
3 hq ProRes 422 HQ High-end broadcast / film master (default)
4 4444 ProRes 4444 Graphics, alpha, colour grading
5 xq ProRes 4444 XQ High-dynamic-range / wide-gamut master

Source: FFmpeg libavcodec/videotoolboxenc.c prores_options AVOption table (verified against an FFmpeg n8.1.1 checkout).

ProRes is intra-only — every frame is a keyframe — so --keyint / --force-keyframes flags are accepted but have no rate-distortion effect. The harness still emits them so the muxer's seek-table density is predictable across codecs.

Saliency-aware encoding (recommend-saliency --saliency-aware)

Bucket #2 of the PR #354 audit (see ADR-0293) wires the fork-trained saliency_student_v1 ONNX model (ADR-0286) into vmaf-tune so a single command can produce an encode that biases bits toward salient regions (faces, focal subjects, action) and saves bits on background.

Synopsis

vmaf-tune recommend-saliency \
    --src ref.yuv --width 1920 --height 1080 --framerate 24 \
    --duration-frames 240 \
    --preset medium --crf 23 \
    --saliency-aware \
    [--saliency-offset -4] \
    [--saliency-model model/tiny/saliency_student_v1.onnx] \
    [--saliency-aggregator mean|ema|max|motion-weighted] \
    --output out.mp4

How it works

  1. compute_saliency_map() samples the requested frame window from the source YUV, runs those frames through saliency_student_v1.onnx (ImageNet-normalised RGB derived from YUV, NCHW [1, 3, H, W]), and reduces the per-frame saliency outputs into one mask in [0, 1].
  2. saliency_to_qp_map() linearly maps the mask to per-pixel QP deltas — --saliency-offset is the QP delta at peak saliency (negative means better quality on salient regions). Background gets the symmetric positive delta. The output is clamped to [-12, +12] (matching the vmaf-roi sidecar convention from ADR-0247).
  3. The pixel-level QP-offset map is dispatched to the encoder-specific ROI channel (see the encoder-targets table below); each channel reduces the map to its native granularity before writing any sidecar file or composing the argv slice.
  4. The augmented extra_params (qpfile path, zones string, or ROI-map path) are appended to the FFmpeg command and the normal encode path runs.

Trade-off

Axis Direction
Bitrate (same VMAF) −10 % to −20 % for content with strong attention focus (faces, action, sport). Background-uniform content sees little change.
Encode time +5 % typical (saliency inference + per-MB reduce; per-frame model time is sub-millisecond on CPU at SD/HD).
Decode time unchanged (the bitstream is standard-compliant for all five supported encoders).
Quality (VMAF) unchanged at the clip-mean level; concentrated where the eye looks.

Numbers are indicative. Today's recommend-saliency subcommand is a one-shot encode at --crf (or the adapter default); target-VMAF selection remains the job of recommend, compare, or a caller-provided bisect loop.

Temporal aggregation

--saliency-aggregator controls how sampled frame masks become the single ROI pattern applied to the encode:

Aggregator Behaviour Use when
mean Per-pixel arithmetic mean; preserves the original implementation. Default, stable clips, and baseline comparisons.
ema Exponential moving average with --saliency-ema-alpha as the current-frame weight. Motion or cuts make the latest sampled frames more representative.
max Per-pixel maximum over sampled masks. Missing a brief salient object is worse than over-protecting background.
motion-weighted Weighted mean where each sampled frame is weighted by luma delta from the previous sampled frame. Motion-heavy clips where changing frames should dominate the aggregate.

All four reducers use the same saliency_student_v1 weights and the same downstream QP mapping, so changing the reducer does not change the model contract or encoder sidecar format.

Graceful fallback

If onnxruntime is not installed or model/tiny/saliency_student_v1.onnx cannot be loaded, recommend-saliency --saliency-aware logs a warning and falls back to a plain encode. This matches the vmaf-roi C sidecar's posture.

If the chosen --encoder has no ROI dispatch implementation (e.g. h264_nvenc, libvpx-vp9, hevc_nvenc) the command exits with code 2 and a structured error message listing the supported codecs. To accept a plain encode instead, pass --saliency-fallback-plain (or set the environment variable VMAFTUNE_SALIENCY_FALLBACK_OK=1); in that case an ERROR is logged rather than a WARNING. This hard-fail posture matches ADR-0498 / ADR-0546.

Encoder targets

The saliency pipeline supports five encoder ROI mechanisms:

Encoder ROI channel Granularity argv slot
libx264 ASCII --qpfile (x264 r2390+) 16×16 luma MB -x264-params qpfile=…
libaom-av1 patched FFmpeg -qpfile ROI bridge 16×16 luma MB mapped onto libaom MI cells -qpfile …
libx265 --zones QP delta (per-clip mean) full-clip spatial mean -x265-params zones=0,N,q=<delta>
libsvtav1 --qp-file offset map (SVT-AV1 v1.7+) 64×64 super-block -svtav1-params qp-file=…
libvvenc ROIFile CSV (VVenC v1.14.0+) 64×64 CTU -vvenc-params ROIFile=…

See ADR-0293 (x264 baseline) and ADR-0414 (x265 / SVT-AV1 / VVenC). The libaom-av1 row uses the shared -qpfile bridge documented in vmaf-tune-ffmpeg.md.

Per-adapter helpers in vmaftune.saliency:

  • write_x265_zones_arg(block_offsets, duration_frames) → zones string
  • write_x264_qpfile(block_offsets, out_path, duration_frames) → x264/libaom qpfile
  • write_svtav1_qpoffset_map(block_offsets, out_path, duration_frames) → Path
  • write_vvenc_roi_csv(block_offsets, out_path, duration_frames) → Path

Caveats

  • Aggregate mask, not per-frame ROI. The current implementation reduces saliency across the sampled frames and applies one delta pattern across the whole clip. Per-frame ROI is on the roadmap.
  • x265 zones: spatial mean only. x265's --zones has temporal granularity but not per-block spatial granularity. The zone carries the mean QP delta across all blocks. Per-block x265 ROI requires a future x265 qpfile port (different format from x264).
  • libaom segment quantisation. The FFmpeg bridge maps 16×16 qpfile deltas onto libaom's MI grid and at most eight segment QPs. Very fine-grained deltas are therefore quantised to the nearest segment.
  • SVT-AV1 / VVenC: 64×64 granularity. Both encoders document 64×64 as their ROI-map unit. The saliency mask is reduced to this grid via reduce_qp_map_to_blocks(qp_map, block=64) before writing.
  • RGB saliency input. The student model receives BT.709-limited yuv420p converted to ImageNet-normalised RGB. Chroma is nearest-neighbour upsampled to luma resolution before inference.
  • Don't use the placeholder. mobilesal_placeholder_v0 and the radial fallback inside vmaf-roi are smoke-test stubs. Pass an explicit --saliency-model pointing at the real fork-trained weights when you want a perceptual benefit.

Reproducer

The test suite mocks the ONNX session and the encode runner so it runs without ffmpeg or onnxruntime installed:

pytest tools/vmaf-tune/tests/test_saliency.py \
       tools/vmaf-tune/tests/test_saliency_roi_adapters.py \
       tools/vmaf-tune/tests/test_saliency_roi_codec.py -v

Codec adapter contract

The encode driver (tools/vmaf-tune/src/vmaftune/encode.py) is codec-agnostic as of ADR-0297. It looks up the codec adapter via vmaftune.codec_adapters.get_adapter(req.encoder) and asks the adapter for its FFmpeg argv slice — the harness itself never branches on codec identity. Adding a new codec is one file under tools/vmaf-tune/src/vmaftune/codec_adapters/ plus a registry entry; the search loop, corpus row schema, and FFmpeg invocation stay untouched.

A codec adapter is a frozen dataclass exposing:

Member Type Purpose
name str Human-readable codec id ("libx264").
encoder str FFmpeg -c:v value ("libx264", "h264_nvenc", ...).
quality_knob str Name of the quality knob ("crf", "cq", "qp", ...).
quality_range tuple[int, int] Inclusive (min, max) for the knob.
quality_default int Default quality value.
invert_quality bool True when a higher value means lower quality (CRF / QP).
presets tuple[str, ...] Allowed preset names.
validate(preset, quality) -> None method Raises ValueError on out-of-range input.
ffmpeg_codec_args(preset, quality) -> list[str] method Codec-specific argv slice (e.g. ["-c:v", "libx264", "-preset", "medium", "-crf", "23"]).
extra_params() -> tuple[str, ...] method (optional) Additional non-codec argv (e.g. ("-svtav1-params", "tune=0")).

The dispatcher composes the final ffmpeg command as:

[ffmpeg, -y, -hide_banner, -loglevel info,
 -f rawvideo -pix_fmt <pf> -s WxH -r FR -i <src>,
 *adapter.ffmpeg_codec_args(preset, quality),
 *adapter.extra_params(),
 *req.extra_params,
 <output>]

Adapters that do not yet implement ffmpeg_codec_args (or for which get_adapter raises KeyError) fall back to the legacy x264-CRF shape (-c:v <encoder> -preset <p> -crf <q>) so partial adapters stay drivable end-to-end while their per-codec PRs are in flight.

parse_versions(stderr, encoder=...) selects a per-codec version probe; missing matches degrade to "unknown" rather than raising.

Adapters in flight

Codec PR Status
libx264 shipped (Phase A) green
libx265 #362 adapter ships; dispatcher unblocks end-to-end
libsvtav1 #370 adapter ships; dispatcher unblocks end-to-end
libaom-av1 #360 adapter ships; dispatcher unblocks end-to-end
libvvenc #368 adapter ships; dispatcher unblocks end-to-end
h264_nvenc / hevc_nvenc / av1_nvenc #364 adapter ships; dispatcher unblocks end-to-end
h264_qsv / hevc_qsv #367 adapter ships; dispatcher unblocks end-to-end
h264_amf / hevc_amf #366 adapter ships; dispatcher unblocks end-to-end
h264_videotoolbox / hevc_videotoolbox #373 adapter ships; dispatcher unblocks end-to-end

Codec comparison

vmaf-tune compare answers the perennial "should I migrate from x264 to SVT-AV1 yet?" question per-source: given one reference and a target VMAF, run each codec's recommend predicate in parallel and rank the results by smallest file. This is Bucket #7 of the vmaf-tune capability audit; the default CLI backend is Phase B target-VMAF bisect per ADR-0326.

vmaf-tune compare \
    --src ref.yuv \
    --width 1920 --height 1080 --pix-fmt yuv420p \
    --framerate 24 --duration 10 \
    --sample-clip-seconds 4 \
    --target-vmaf 92 \
    --encoders libx264,libx265,libsvtav1,libaom-av1,libvvenc \
    --crf-min 15 --crf-max 40 \
    --format markdown

The real bisect backend needs source geometry because raw YUV does not self-describe. Pass --width and --height explicitly; --pix-fmt, --framerate, and --duration default to the common SDR 24 fps shape but should be set for accurate scoring and bitrate math. Pass --sample-clip-seconds N to evaluate the centre N seconds of the source per bisect iteration; compare forwards matching --frame_skip_ref / --frame_cnt scorer bounds and normalises bitrate against the sample duration. For custom rankers or tests, --predicate-module MODULE:CALLABLE still accepts any importable (codec, src, target_vmaf) -> RecommendResult callable and bypasses the bisect backend.

Container sources auto-probe their framerate / duration (ADR-0509). When --src is a container (mp4 / mkv / mov / Y4M / ...) and you omit --framerate / --duration, the CLI calls ffprobe on the source and substitutes the probed values before invoking the bisect predicate. This avoids a silent failure mode where the argparse default --framerate=24 against a 60 fps source mis-aligns reference vs. distorted decodes and collapses VMAF to the 4-90 band regardless of CRF. Explicit --framerate / --duration flags still win; a stderr warning fires when an explicit value disagrees with the probed source rate (subsampling is legitimate but should be deliberate). For raw .yuv sources nothing changes — the probe is skipped because raw YUV does not self-describe its framerate.

Sample output (--format markdown, abridged):

# Codec comparison — target VMAF 92

- Source: `ref.yuv`
- Tool: `vmaf-tune 0.0.1`
- Wall time: 6421.3 ms

| Rank | Codec    | Encoder         | Best CRF | Bitrate (kbps) | Encode time (ms) | VMAF  | Status |
|---:|---|---|---:|---:|---:|---:|---|
| 1  | libaom-av1 | libaom-3.8.0   | 30       |         1500.0 |          18000.0 | 92.40 | ok     |
| 2  | libx265   | libx265-3.5     | 26       |         1700.0 |           4200.0 | 92.00 | ok     |
| 3  | libsvtav1 | libsvtav1-1.7.0 | 32       |         1900.0 |           2800.0 | 92.30 | ok     |
| 4  | libx264   | libx264-164     | 23       |         2400.0 |           1500.0 | 92.10 | ok     |

**Smallest file**: `libaom-av1` at CRF 30 → 1500.0 kbps (VMAF 92.40).

Multi-target rate-quality sweep (schema v2, ADR-0516 + ADR-0534 + ADR-0538)

vmaf-tune compare defaults to a 4-point rate-quality sweep covering premium-archival operating points: --target-vmafs 94,96,97,98 (ADR-0538, supersedes the ADR-0534 streaming defaults). The JSON output stamps schema_version: 2 and carries one row per (codec, target_vmaf) pair, plus an optional bisect_samples list per row that records every encode+score probe the underlying target-VMAF bisect computed. vmaf-tune report --compare-json sweep.json then renders a per-codec rate-quality curve assembled from those probes, with the picked-CRF rows highlighted as larger circled markers and the pareto frontier (lowest bitrate at each target) drawn as a heavier dashed overlay. Current HTML/Markdown profile cards also include a Quick takeaways block before the charts, short "how to read this" notes, report-local codec identity chips (text badges that link to the upstream codec or vendor project), per-codec failure status, and an embedded encoder_profile JSON payload. The takeaways spell out the smallest successful row at each target, failed/unavailable row count, ladder span, and per-shot CRF spread so non-expert readers do not have to infer the main result from the chart alone. The profile payload is intentionally machine-readable: vmaf-tune encode-profile can read the HTML, Markdown, or raw JSON and turn one selected recommendation into a concrete FFmpeg encode.

# Out of the box: 5-point sweep, 3-codec compare, GPU scoring.
vmaf-tune compare \
    --src bbb_1080p_60fps.mp4 \
    --width 1920 --height 1080 --framerate 60 \
    --encoders libx265,libsvtav1 \
    --sample-clip-seconds 3 --max-iterations 3 \
    --score-backend cuda \
    --format json --output sweep.json
vmaf-tune report \
    --src bbb_1080p_60fps.mp4 \
    --compare-json sweep.json --target-vmaf 92 \
    --format html --output sweep_report.html

# Convenience path: render the same profile card directly from compare.
vmaf-tune compare \
    --src bbb_1080p_60fps.mp4 \
    --width 1920 --height 1080 --framerate 60 \
    --encoders libx265,libsvtav1,av1_nvenc,av1_qsv \
    --sample-clip-seconds 3 --max-iterations 3 \
    --score-backend cuda \
    --format both --output sweep_profile.html

# Reuse one recommendation from that profile without re-running the sweep.
vmaf-tune encode-profile \
    --profile sweep_profile.html \
    --src bbb_1080p_60fps.mp4 \
    --codec libsvtav1 --target-vmaf 96 \
    --output bbb_svtav1_vmaf96.mkv

Why these defaults (ADR-0538 supersedes ADR-0530)

  • 94,96,97,98 covers premium-archival. The fork's primary user encodes archival masters at VMAF >= 95 exclusively; VMAF 94 is the subjectively-transparent floor on 4K source, 98 is the near-lossless ceiling. The previous ADR-0530 / ADR-0534 default (75,80,85,90,93) targeted streaming / broadcast workflows that this fork does not service, so its R-Q chart contained no points the user picks CRFs from.
  • VMAF >= 95 is reachable. The earlier "top stops at 93" caveat was a bisect-harness artefact: the search window defaulted to the codec adapter's narrow quality_range (e.g. libx265 = (15, 40), libsvtav1 = (20, 50)) which the adapter validator additionally enforced as a hard CRF gate. ADR-0538 widens the default search window to the encoder's absolute CRF range — see the High-VMAF bisect contract subsection below for the per-codec table and the contract callers can rely on.
  • The chart renders from bisect_samples, not from the picked-CRF cells. Connecting picked-CRF rows per codec across targets produced physically impossible downward dips, because the bisect overshoots each target by a different amount. Plotting every probe the bisect already computed shows the genuine per-codec R-Q curve.

High-VMAF bisect contract (ADR-0538)

When crf_range is left at its default (the CLI default; no --crf-min / --crf-max passed), bisect_target_vmaf searches the encoder's absolute CRF range, not the codec adapter's perceptually-informative quality_range:

Codec Absolute CRF range Informative quality_range (legacy default)
libx264 0..51 0..51 (already maximal)
libx265 0..51 15..40
libvpx-vp9 0..63 0..63 (already maximal)
libaom-av1 0..63 0..63 (already maximal)
libsvtav1 0..63 20..50
Other (*_nvenc, *_qsv, *_amf, *_videotoolbox, libvvenc) falls back to adapter.crf_min/crf_max then adapter.quality_range as declared by the adapter

The contract guarantees:

  1. The bisect starts at the encoder's accepted floor. For libx264 / libx265 / libvpx-vp9 / libaom-av1 / libsvtav1 the lowest probe is CRF 0 (lossless), so any reasonable source produces VMAF >= 98 at that CRF and high targets are reachable.
  2. max_iterations >= 6 covers the widest window. ceil(log2(64)) = 6; the CLI default is --max-iterations 8 (+2 safety). For a target near the top of the VMAF scale (>= 97) callers running the bisect programmatically should keep at least 6 iterations.
  3. Overshoot at the floor is OK. If the codec already overshoots the target at CRF 0 (e.g. CRF 0 gives VMAF 99.5 on a low-distortion source against target 96), the bisect narrows toward higher CRFs looking for the highest CRF that still clears the target and returns that one with ok=True. The achieved VMAF will be >= target_vmaf by construction.
  4. The adapter's narrow informative window is bypassed for CRF validation but not for preset validation. Preset names are still checked against the adapter's whitelist; CRFs are checked against the encoder absolute range. Pass --crf-min / --crf-max explicitly to recover the historical narrow-window behaviour.

Single-target legacy (v1, back-compat): pass --target-vmaf NN without --target-vmafs to fall back to the v1 single-target schema (one row per codec at the single target, bar+dot chart). A _TrackedDefaultAction sentinel distinguishes "user passed --target-vmaf explicitly" from "argparse default" so legacy single-target scripts keep working unchanged.

Hardware encoders (*_nvenc, *_qsv, *_amf) are availability- probed before dispatch: a two-stage probe greps ffmpeg -encoders for the encoder name (catches "not compiled into this ffmpeg build") and then runs a 1-frame lavfi nullsrc dummy encode (catches "no compatible GPU at runtime"). Encoders that fail either stage surface as ok=false rows with a stable hardware encoder not available: … error string — the renderer flags them visually and the sweep continues; the run does not abort.

The --encoders flag is optional; when omitted it defaults to the CPU set libx265,libsvtav1. Older archival smoke runs also swept libx264 and libvpx-vp9, but those two CPU lanes dominate wall time without covering the fork's current HEVC/AV1 decision points.

BBB v9 artifacts are retained only as historical bug evidence; do not use v9 as a current run baseline. New BBB compare probes should start from the ADR-0641 profile-report path (--format both) and list libx264 / libvpx-vp9 only for an explicit legacy comparison.

compare CLI flags

Flag Default Notes
--src PATH Required. Single reference clip.
--target-vmaf F 92.0 Single VMAF target. When passed explicitly and --target-vmafs is at its default sweep, the v1 single-target schema is emitted (ADR-0534 back-compat).
--target-vmafs LIST 94,96,97,98 (ADR-0538, supersedes ADR-0534) Comma-separated VMAF targets to sweep per codec. The default covers premium-archival operating points (4K archival masters at VMAF 94-98). Pass a single value to opt into the v1 path; pass an explicit multi-value list (e.g. 80,85,90) to override the sweep range.
--encoders LIST libx265,libsvtav1 Comma-separated codec names. Hardware encoders (h264_nvenc, hevc_nvenc, av1_nvenc, h264_qsv, hevc_qsv, av1_qsv, h264_amf, hevc_amf, av1_amf) are accepted; missing-encoder rows skip with a reason rather than failing the whole run.
--width / --height Required for the default real-bisect backend.
--pix-fmt yuv420p Source pixel format forwarded to the scorer.
--framerate 24.0 (or auto-probed for container --src) Source framerate. Container sources (mp4 / mkv / mov / Y4M / ...) auto-probe via ffprobe when this flag is left at its default; explicit values still win with a stderr-warning on probed-vs-user mismatch (ADR-0509).
--duration 0.0 (or auto-probed for container --src) Source duration in seconds, used for bitrate math. Same auto-probe semantics as --framerate for container sources (ADR-0509).
--sample-clip-seconds 0.0 0.0 scores the full source. Positive values shorter than --duration use a centre sample window for encode, score, and bitrate math (ADR-0301).
--preset adapter default Preset forwarded to the codec adapter.
--crf-min / --crf-max adapter range Inclusive CRF search window. Pass both or neither.
--max-iterations 8 Encode+score round-trip cap per codec.
--vmaf-model vmaf_v0.6.1 VMAF model forwarded to the scorer.
--score-backend scorer default cpu, cuda, sycl, hip, or auto. (vulkan removed in ADR-0726.)
--ffmpeg-bin / --vmaf-bin ffmpeg / vmaf Binary overrides.
--encoder-ffmpeg-bin ENCODER=PATH off Bind one compare token to a specific FFmpeg binary. Use with ADAPTER@VARIANT labels such as libsvtav1@svt-av1-hdr=/opt/ffmpeg-8.1.1-svtav1-hdr/bin/ffmpeg; unbound tokens use --ffmpeg-bin.
--format markdown One of markdown, json, csv, html, both. html and both render the profile-card report directly; both writes .html and .md next to --output and therefore requires --output.
--no-parallel off Run codecs sequentially (default: thread pool, one per codec).
--max-workers N len(encoders) Cap on the parallel thread pool.
--predicate-module MOD:FN off Advanced hook that bypasses the bisect backend.
--no-bisect off Switch to CRF-sweep mode: skip target-VMAF bisect and encode each (codec, CRF) pair from --crf-sweep exactly once. See CRF sweep mode below. (ADR-0542)
--crf-sweep LIST Comma-separated CRF values to use in --no-bisect mode. Example: 18,23,28,33. Required when --no-bisect is passed. (ADR-0542)
--workdir PATH None Directory under which to create the per-run temporary scratch directory for the decoded reference YUV and encodes. Overrides VMAFTUNE_WORKDIR. When unset, falls through to VMAFTUNE_WORKDIR (if set), then to the OS default (/tmp). Pass this when your source is large and /tmp is a small tmpfs (e.g. 634-second 1080p60 BBB decodes to ~118 GB raw YUV). (ADR-0598)
--max-concurrent-decodes N 1 Maximum number of reference-YUV decode operations that may run simultaneously across all codec bisect threads. Default 1 (serial decodes) caps peak workdir disk usage to one YUV at a time regardless of how many codecs run in the thread pool. For example, with 3 codecs and a 110 GB BBB source, the default prevents the 330 GB peak that caused the v13 ENOSPC failure — peak stays at 110 GB. Raise to N on hosts where --workdir points to a volume with sufficient free space and the I/O subsystem can sustain N parallel decode streams. Encoder runs are always parallel; only the decode-to-raw-YUV step is serialised at the default. (ADR-0577)
--output PATH stdout Write the rendered report to PATH instead of stdout.

compare output schema

The JSON / CSV columns are exported as vmaftune.compare.COMPARE_ROW_KEYS: codec, adapter, runtime_variant, ffmpeg_bin, encoder_version, best_crf, bitrate_kbps, encode_time_ms, vmaf_score, target_vmaf, ok, error. Failed rows trail successful ones in the ranking; ok=False rows carry a human-readable error and sentinel numerics (-1 for best_crf, NaN for the floats). adapter, runtime_variant, and ffmpeg_bin are provenance fields for ADAPTER@VARIANT compare runs; they are empty on rows produced by old programmatic predicates that do not bind a runtime variant.

v1 (single-target legacy): emitted when --target-vmafs is not passed. The JSON has no schema_version key; rows is one row per codec at --target-vmaf.

v2 (multi-target sweep, ADR-0516 + ADR-0534 + ADR-0538): emitted when --target-vmafs lists ≥ 2 targets (the new default, ADR-0538). The JSON carries "schema_version": 2, "target_vmafs": [94.0, 96.0, 97.0, 98.0], and rows is one row per (codec, target_vmaf) pair. Each row also carries an optional bisect_samples: [{crf, bitrate_kbps, vmaf_score, encode_time_ms}, ...] list (ADR-0530, additive) recording every encode+score probe the underlying bisect computed; the field is absent on old v2 dumps pre-dating ADR-0530.

vmaf-tune report detects the v1 vs v2 discriminator via the schema-version key (or the presence of target_vmafs) and picks the right chart: v1 renders the bar+dot chart, v2 renders the per-codec rate-quality curve. When bisect_samples is populated the chart plots every probe per codec (deduplicated by CRF, sorted by bitrate) and overlays the picked-CRF rows as larger circled markers; pareto frontier stays as the dashed overlay. When bisect_samples is absent (old v2 dump or v1) the chart falls back to the legacy connect-the-dots line with a caveat note in the title — the dots- connect-the-dots representation can show physically impossible downward dips when the bisect's per-target overshoot varies, which is the failure mode ADR-0530 fixes for the new path.

The CSV emitter intentionally drops the bisect_samples column (extrasaction="ignore" on the writer); it stays a JSON-only structured field.

Encode-time normalisation: the encode_time_ms column is wall-clock on whatever machine ran the predicate. Cross-codec time comparisons only make sense when every predicate was run on the same hardware in the same configuration — see Research-0061 Bucket #7.

CRF sweep mode (--no-bisect)

When --no-bisect is passed together with --crf-sweep, compare skips the target-VMAF bisect entirely and instead encodes each (codec, CRF) pair exactly once. This is useful for operators who want to sweep a fixed CRF ladder (e.g. 18, 23, 28, 33) and inspect the resulting (bitrate, VMAF) pairs — faster than running four separate bisect sweeps and more direct than the corpus pipeline.

vmaf-tune compare \
    --src clip.mp4 \
    --encoders libx264,libx265,libsvtav1 \
    --no-bisect --crf-sweep 18,23,28,33 \
    --duration 60 --sample-clip-seconds 30 \
    --format json --output cmp_sweep.json

3 codecs × 4 CRFs = 12 rows in cmp_sweep.json.

v3 (CRF-sweep, ADR-0542): emitted when --no-bisect is set. The JSON carries "schema_version": 3, "mode": "crf_sweep", "crf_sweep": [18, 23, 28, 33], and rows is one row per (codec, crf) pair. Each row carries codec, adapter, runtime_variant, ffmpeg_bin, crf, bitrate_kbps, vmaf_score, encode_time_ms, encoder_version, ok, and error. The --target-vmaf / --target-vmafs flags are accepted but act as label-only knobs in this mode (they annotate pareto frontier markers when rendered, but do not drive the encode loop). CRF-sweep mode currently requires --format json; render the result with a separate downstream report step once v3 ingestion lands.

Hardware encoder availability is probed the same way as the bisect path: unavailable encoders (e.g. h264_nvenc without an NVIDIA device) produce ok=false rows for each CRF rather than aborting the entire run.

HDR-aware tuning (Bucket #9, ADR-0300)

Phase A auto-detects HDR sources and injects codec-appropriate HDR encode flags + HDR VMAF scoring. Detection runs ffprobe against each --source once at corpus start; the per-source encode argv gets the resulting HDR flag set appended.

What gets detected

A source is classified as HDR iff its first video stream carries both of:

  • color_transfer ∈ {smpte2084 (PQ), arib-std-b67 / hlg} and
  • color_primaries ∈ {bt2020, bt2020nc, bt2020-ncl, bt2020c, bt2020-cl}.

Mismatched signaling (e.g. PQ transfer with BT.709 primaries) is treated as SDR — misclassifying SDR as HDR is the dangerous failure mode. Mastering-display + max-CLL SEI side data is read when present and propagated to encoders that expose stable FFmpeg SEI flags (x265, SVT-AV1, HEVC NVENC).

Detection modes

Mode When to use
--auto-hdr (default) Mixed corpora; let ffprobe classify each source.
--force-sdr Disable HDR injection entirely (override probe).
--force-hdr-pq Raw YUV refs with no container metadata; you know the source is PQ.
--force-hdr-hlg Same, for HLG.

The four flags are mutually exclusive.

Codec dispatch

Encoder HDR signaling carrier
libx264 Container-level -color_* flags only (x264 has no in-stream HDR SEI).
libaom-av1 Global -color_* tags only; no fork-owned private SEI mapping yet.
libx265 Global -color_* + -x265-params colorprim=bt2020:transfer=...:colormatrix=bt2020nc[:master-display=...:max-cll=...:hdr10-opt=1].
libsvtav1 Global -color_* + -svtav1-params color-primaries=9:transfer-characteristics=16 (PQ) or =18 (HLG) :matrix-coefficients=9.
hevc_nvenc -pix_fmt p010le -profile:v main10 + global -color_* + -master_display / -max_cll (when ffmpeg supports them).
av1_nvenc -pix_fmt p010le + global -color_* tags.
hevc_qsv / hevc_amf / hevc_videotoolbox -pix_fmt p010le -profile:v main10 + global -color_* tags.
av1_qsv / av1_amf -pix_fmt p010le + global -color_* tags.
libvvenc Global -color_* only (SEI options live behind --vvenc-params in newer ffmpeg builds).

Encoders not in the dispatch table emit no HDR flags and the corpus row's hdr_* fields still record the detection result.

HDR VMAF scoring (model-port slot)

vmaftune.hdr.select_hdr_vmaf_model(model_dir, transfer="pq"|"hlg") resolves an HDR-trained model JSON via a two-stage lookup:

  1. canonical filename — model/vmaf_hdr_v0.6.1.json (the Netflix research artefact name); preferred when transfer is "pq" or "hlg".
  2. glob fallback — model/vmaf_hdr_*.json (so future revisions can land without code changes).

The fork does not ship the JSON in this PR. Verified 2026-05-08 against Netflix/vmaf master model/: no vmaf_hdr_*.json is present in the upstream public tree; Netflix publishes the artefact in a separate research bundle outside the repo. A fork-local license review is the gating follow-up (ADR-0300 § Status update 2026-05-08). Until then, HDR sources are scored against the SDR model with a one-shot warning logged on the first miss (subsequent misses stay quiet). Resulting vmaf_score values trend low for high-luminance regions and are not directly comparable to SDR scores. Drop a licensed copy at model/vmaf_hdr_v0.6.1.json and the harness picks it up automatically — no code change required.

Content-addressed cache

Re-running a corpus sweep after adjusting an unrelated flag should not re-encode and re-score tuples that have not changed. The content-addressed cache turns repeated (src, encoder, preset, crf) combinations into a free hit on the second run, restoring the parsed (bitrate, vmaf, encode_time, score_time) tuple from disk and skipping both subprocess calls. See ADR-0298 for the design.

Key composition

The cache key is sha256 of the canonical-JSON-encoded six-tuple:

  1. src_sha256 — content hash of the reference YUV
  2. encoder — adapter name (libx264, …)
  3. preset — encoder preset string
  4. crf — quality knob value (int)
  5. adapter_version — bumps when the codec adapter's argv shape changes
  6. ffmpeg_version — host ffmpeg version string

Dropping any one of these would let stale entries shadow real results when the adapter or ffmpeg is upgraded — the test suite asserts each field flips the key.

Layout

The cache lives at $XDG_CACHE_HOME/vmaf-tune/ (or ~/.cache/vmaf-tune/ if the env var is unset). Override with --cache-dir. Layout:

<cache-dir>/
  meta/<key>.json     — parsed (bitrate, vmaf, encode_time, ...) tuple
  blobs/<key>.bin     — opaque encoded artifact
  __index__.json      — last-access timestamps for LRU eviction

Eviction

LRU with a default 10 GiB cap (configurable via --cache-size-gb). On every put, the oldest entries are dropped until the total on-disk size sits at or below the cap.

Disabling

Pass --no-cache to force a re-encode/re-score on every cell. The cache is also automatically skipped when --no-source-hash is active (no stable content key) or when ffmpeg -version cannot be probed before the run.

Caveats

  • The cache is not baked into the JSONL row; the row stays the canonical record, the cache is an opaque sidecar.
  • Cache hits do not write a synthetic encode_path — that field remains empty unless --keep-encodes is set.
  • Concurrent runs against a shared cache dir (e.g. NFS) work for reads; writes are last-writer-wins and both writers' bytes are valid by content addressing.
vmaf-tune corpus --source ref.yuv \
    --width 1920 --height 1080 --framerate 24 --duration 10 \
    --preset medium \
    --coarse-to-fine --target-vmaf 92 \
    --output corpus.jsonl

--crf is no longer required — the CRF axis is generated by the search. --target-vmaf is optional here; without it the search still runs both passes and refines around the highest-VMAF coarse point.

recommend — pick a CRF for a quality target

The recommend subcommand always runs coarse-to-fine and prints the single recommended (preset, crf) pair plus its measured VMAF. It also writes the visited points to --output so callers have the corpus row for downstream analysis.

vmaf-tune recommend \
    --source ref.yuv \
    --width 1920 --height 1080 --framerate 24 --duration 10 \
    --preset medium \
    --target-vmaf 92
# stdout: src=ref.yuv preset=medium crf=27 vmaf=92.341 (visited 15 encodes)

Tunables

Flag Default Notes
--coarse-to-fine off (corpus); on (recommend) Activate the 2-pass search.
--coarse-step N 10 Step for the coarse pass. With defaults gives [10, 20, 30, 40, 50].
--fine-radius R 5 ±R around best-coarse for the fine pass.
--fine-step S 1 Step for the fine pass.
--target-vmaf V unset Required for recommend; optional for corpus.

Timing comparison

Numbers below are illustrative — actual encode + score wall time per point varies with source resolution, preset, and the libvmaf backend (cpu/cuda/sycl/hip). The relevant ratio is points visited, not seconds:

Mode Points visited Relative wall time
Full grid --crf 0 ... 51 52 1.00× (baseline)
Coarse-to-fine, defaults, target met mid-range 15 ~0.29× (3.46× faster)
Coarse-to-fine, 1-pass shortcut (target met at coarse max) 5 ~0.10× (10.4× faster)
Coarse-to-fine, target unmet (full fine pass anyway) 15 ~0.29×

For a 1080p --preset medium clip where one (encode + score) pass takes ~5 s, the coarse-to-fine path drops a single recommend run from ~260 s to ~75 s.

What Phase A does not do

  • No target-VMAF bisect (Phase B).
  • No per-title CRF prediction (Phase C).
  • No Pareto ABR ladder generation (Phase E).

Per-title ladder (Phase E)

Phase E ships the vmaf-tune ladder subcommand — given one source, sample (resolution × target-VMAF) points, take the Pareto upper-convex hull on (bitrate, vmaf), pick n evenly-spaced rungs along the hull, and emit the result as an HLS master playlist, DASH MPD, or JSON descriptor. This is the "per-title encoding" loop in one command — a fixed authoring-spec ladder is replaced by the ladder that's actually optimal for this title.

See ADR-0295 for the design and the alternatives considered (geometric ladder, JND- spaced, fixed Apple HLS).

The default sampler is wired: for each (resolution, target_vmaf) cell it runs the canonical 5-point CRF sweep 18,23,28,33,38 through the normal Phase A encode-and-score path, picks the closest row to the target, and feeds those points into hull and knee selection. A custom sampler= callback remains supported for callers that want a finer grid, a bisect loop, or a precomputed corpus stream.

Container and Y4M sources

--src accepts any input ffmpeg can decode: raw .yuv, .y4m, or a container (.mp4, .mkv, .webm, …). The ladder, corpus, and bisect paths transparently decode the reference once per sweep into a .ref.decoded.yuv sidecar under the encode dir and reuse it across every cell — there is no need to pre-decode the source by hand (ADR-0499). When --duration is set, the reference decode is clamped to that window so a 10-second probe against a multi-minute source produces a bounded YUV instead of materialising the full file (ADR-0498).

The libvmaf CLI itself reads .yuv only when --width / --height / --pixel_format / --bitdepth are passed (which vmaf-tune always does); .y4m is treated as raw planar YUV the same way .mp4 is, so the wrapper decodes both.

Canonical 5-rung invocation

The default rendition set is the canonical 5-rung 1080p/720p/480p/360p/240p ladder against VMAF targets {95, 90, 85, 75, 65}:

vmaf-tune ladder \
    --src episode01.yuv \
    --encoder libx264 \
    --resolutions 1920x1080,1280x720,854x480,640x360,426x240 \
    --target-vmafs 95,90,85,75,65 \
    --quality-tiers 5 \
    --format hls \
    --output episode01_ladder.m3u8

The output is an HLS master playlist with one #EXT-X-STREAM-INF per rung; bandwidth (in bps) is monotonically increasing. Variant URIs are placeholders — re-point them at your per-rendition playlists when packaging the encoded segments.

Other manifest formats

# DASH MPD
vmaf-tune ladder --src ep01.yuv --format dash --output ladder.mpd

# JSON descriptor (machine-readable, vmaf-tune-ladder/v1 schema)
vmaf-tune ladder --src ep01.yuv --format json --output ladder.json

The JSON descriptor carries three top-level fields:

  • schema — schema identifier ("vmaf-tune-ladder/v1").
  • renditions[] — the post-hull rungs that select_knees picked. Ascending-bitrate order; each entry has width, height, bitrate_kbps, bandwidth_bps, vmaf, crf.
  • samples[] — every encoded (resolution, crf) row the sampler scored, pre-hull. Ascending by (pixel_count, bitrate_kbps); same per-entry shape as renditions[]. Added 2026-05-18 per ADR-0501 so vmaf-tune report --ladder-json can render the Pareto-cloud overlay; widened 2026-05-18 per ADR-0505 from one-row-per-target-cell (V4 emit shape) to the full per-CRF sweep — every CRF encoded by the sampler now lands in the array exactly once, de-duplicated by (width, height, crf). The array is always present — callers running with no scored cells get an empty list rather than a missing key.

Cross-resolution scoring against container sources

Prior to ADR-0501 the reference leg of a cross-resolution rung was decoded at the source's native geometry while the libvmaf CLI was told to read both legs at the rung target. A 1920x1080 reference was therefore mis-parsed as a 1280x720 frame and emitted a catastrophic VMAF (~21 instead of ~93), collapsing the post-hull ladder to a single rendition. The corpus / ladder paths now downscale the reference YUV sidecar to the rung target via a per-rung -vf scale=W:H filter on the ffmpeg decode call. The per-rung sidecar's filename embeds the target dims (<src>.ref.decoded.<W>x<H>.yuv) so a multi-rung sweep in the same encode_dir doesn't collide on a stale path. Single-resolution ladders and rungs that match the source geometry keep the legacy decode-at-native-geometry path — there's no decode overhead when the target already matches.

ADR-0505 closes the matching gap on the encode side. Before 2026-05-18 the encode driver passed -f rawvideo -pix_fmt yuv420p -s WxH -i src.mp4 against every source, re-interpreting a container's compressed bytes as planar YUV pixels and producing a uniformly-bogus ~50 Mbps encode with VMAF in the 4-9 band regardless of CRF. The corpus now detects container sources by suffix (anything outside _VMAF_RAW_SUFFIXES = {".yuv", ""}) and sets EncodeRequest.source_is_container=True, so ffmpeg auto-detects the format and the rung-target -vf scale=W:H filter handles the resolution change. Raw .yuv sources keep the legacy rawvideo framing.

Rung spacing

--spacing log_bitrate (default) doubles bandwidth per rung — Apple HLS authoring-spec convention. --spacing vmaf spaces rungs by equal VMAF gaps, matching how viewers perceive quality steps. uniform is accepted as a legacy alias for vmaf.

Phase E ladder CLI flags

Flag Default Notes
--src PATH Required. Source label (sampling currently mocked).
--encoder NAME libx264 Codec adapter (Phase A wires libx264 only).
--resolutions WxH,... 1920x1080,1280x720,854x480,640x360,426x240 Canonical 5-rung.
--target-vmafs F,... 95,90,85,75,65 VMAF targets per resolution.
--quality-tiers N 5 Rungs to pick from the Pareto hull.
--spacing log_bitrate log_bitrate (HLS spec) or vmaf (perceptual); uniform is a legacy alias for vmaf.
--format hls hls, dash, or json.
--with-uncertainty off Apply the ADR-0279 prune/insert recipe. Sampled vmaf_interval payloads win; point-only rows use the active wide-interval threshold as a conservative fallback.
--uncertainty-sidecar PATH default thresholds Calibration sidecar for the uncertainty recipe.
--rung-overlap-threshold F 0.5 Adjacent-rung interval overlap threshold for pruning.
--output PATH stdout Manifest destination.
--src-width INT largest --resolutions entry Actual source width for raw YUV cross-resolution ladders. When the source is a higher resolution than the smallest rung, this is the demuxer-side -s W:H; the encode pipe scales to each rung target via -vf scale=W:H. Container sources auto-detect geometry and ignore this flag. Added 2026-05-18 per ADR-0498 / Bug #v2-B.
--src-height INT largest --resolutions entry Companion to --src-width. Default picks the tallest entry in --resolutions so a --resolutions 1920x1080,1280x720,854x480 ladder against a 1080p raw YUV "just works".
--score-backend NAME auto libvmaf scoring backend used by the default corpus sampler. Accepts auto\|cpu\|cuda\|sycl\|hip (vulkan was removed in ADR-0726 — same enum as corpus --score-backend / compare --score-backend). auto picks the fastest available in native-first order (cuda > sycl > hip > cpu); a specific name is honoured strictly and the run errors out with RC=2 before any encodes start if the local vmaf binary does not advertise it. Use cpu to force bit-exact CPU scoring for verification against the Netflix golden gate. Added 2026-05-18 per ADR-0511 / Bug C; HIP and native-first order added by ADR-0667.
--vmaf-bin PATH vmaf Path to the vmaf binary used to probe backend availability for --score-backend. Added 2026-05-18 per ADR-0511 / Bug C.
--workdir PATH None Directory under which to create the per-run temporary scratch directory. Overrides VMAFTUNE_WORKDIR. See compare --workdir for full semantics. (ADR-0598)
--max-concurrent-decodes N 1 Accepted for consistency with compare and tune-per-shot (ADR-0577). Currently a no-op for ladder because the corpus sampler does not use the bisect decode path; effective when the ladder sampler is updated to bisect in a future PR.

Cross-resolution ladders against raw YUV: prior to ADR-0498 the default sampler used the rung target dims as the source dims, which corrupted every encode against a raw YUV source whose actual resolution differed from the requested rung (-s 1280x720 on 1080p bytes = decoded garbage). The ladder now accepts separate source dims and injects a -vf scale=W:H filter for each sub-source rung. Container (.mp4 / .mkv) sources are unaffected — ffmpeg auto-detects their geometry. ADR-0501 closes the corresponding gap on the reference leg: a per-rung .ref.decoded.<W>x<H>.yuv sidecar is now produced so the libvmaf CLI reads both legs at the same geometry.

report subcommand — stdout aggregate flags

The report subcommand renders a profile-card (HTML / Markdown) from one or more JSON dumps emitted upstream of it (--compare-json, --ladder-json, --per-shot-json). It also writes a one-line stdout JSON summary that downstream automation can pipe to jq. The rendered card starts with a Quick takeaways block that summarizes the concrete recommendation and coverage gaps before the detailed tables. The stdout fields:

Key Meaning
ok true when at least one codec row succeeded and no non-ok row is a real encode failure. An "encoder unavailable" row (infrastructure gap — codec not built into ffmpeg) does not flip this to false. With no codec rows the report is informational and stays ok=true.
degraded true when at least one codec row is an "encoder unavailable" row. Lets dashboards show the missing codec without flipping the run red. Added 2026-05-18 per ADR-0501 / Bug #V4-C.
codec_rows / codec_rows_ok / codec_rows_failed Total / succeeded / failed row counts.
codec_rows_unavailable Subset of codec_rows_failed whose error starts with "encoder unavailable". Added 2026-05-18 per ADR-0501.
ladder_samples / ladder_rungs Counts read from --ladder-json. ladder_samples reads the top-level samples[] array (always present in vmaf-tune-ladder/v1 JSON since ADR-0501).
shots Count of per-shot rows read from --per-shot-json.
outputs Paths of the rendered card files.

The "encoder unavailable" discrimination keys on the bisect-stage error prefix added in ADR-0498 ("encoder unavailable (NAME): …"). A row whose error does not start with that prefix is treated as a genuine encode/score failure and flips ok=false.

encode-profile subcommand — reuse report recommendations

Every HTML, Markdown, and raw JSON profile card embeds encoder_profile with schema vmaftune.encoder_profile.v1 (ADR-0643). The payload contains source metadata, tool/binary provenance, codec metadata, failures, and a sorted list of concrete encoder recommendations. encode-profile reads that payload and runs exactly one selected FFmpeg encode; it never encodes the full ladder or every codec by default.

# Inspect the exact FFmpeg argv first.
vmaf-tune encode-profile \
    --profile sweep_profile.html \
    --src bbb_1080p_60fps.mp4 \
    --codec libsvtav1 \
    --target-vmaf 96 \
    --output bbb_svtav1_vmaf96.mkv \
    --dry-run

# Run the encode once the selected row looks right.
vmaf-tune encode-profile \
    --profile sweep_profile.html \
    --src bbb_1080p_60fps.mp4 \
    --codec libsvtav1 \
    --target-vmaf 96 \
    --output bbb_svtav1_vmaf96.mkv \
    --extra-ffmpeg-arg=-movflags --extra-ffmpeg-arg=+faststart

Selection rules:

  • With no filters, the first pareto-selected row with the lowest bitrate is used.
  • --codec NAME narrows the candidate list to one adapter token (libsvtav1, libx265, av1_nvenc, ...).
  • --target-vmaf F narrows by the exact target stored in the profile.
  • --recommendation-index N picks the zero-based row after those filters, useful when several rows remain tied or intentionally comparable.

Input handling:

Flag Default Notes
--profile PATH Required. Accepts report JSON, report HTML, or report Markdown.
--src PATH profile source Override when the profile was generated on another machine.
--output PATH Required encoded output path.
--source-kind auto\|container\|raw auto auto treats .yuv / .raw / .rgb / .gray as raw and everything else as FFmpeg-auto-detected container input.
--width / --height / --framerate / --pix-fmt profile values Required only when the selected source is raw and the profile did not carry those fields.
--preset profile row / adapter default Override the selected row's preset.
--duration profile source duration Bounds the input-side encode window.
--sample-clip-seconds / --sample-clip-start-s 0 Optional input-side clip flags forwarded to FFmpeg.
--extra-ffmpeg-arg TOKEN none Append one raw FFmpeg argv token after codec args. Repeat as needed; use --extra-ffmpeg-arg=-movflags for leading-dash tokens.
--ffmpeg-bin PATH profile ffmpeg_bin, then ffmpeg Override the FFmpeg binary.
--dry-run off Print the selected recommendation and ffmpeg_argv without encoding.

Phase F — multi-pass encoding (ADR-0333)

Phase F lights up 2-pass encoding for codecs that benefit. Default behaviour stays single-pass; opting in via --two-pass runs the encoder twice — pass 1 analyses the source and writes a stats file to a temp directory, pass 2 reads those stats to make better rate-allocation decisions.

When to use it

2-pass encoding pays off most clearly in target-bitrate workflows (VOD ladder generation, codec comparisons at fixed bitrate). Constant- quality (CRF) encodes already adapt QPs frame-by-frame from the encoder's lookahead, so the win at fixed CRF is more modest. Expect:

  • +1 to +3 VMAF points at a fixed bitrate target on libx265 vs 1-pass ABR (typical-content range; see x265 rate-control docs).
  • ~2× encode wall time — the second pass roughly doubles the cost.

Quick start

vmaf-tune corpus \
  --source ref.yuv --width 1920 --height 1080 \
  --pix-fmt yuv420p --framerate 24 --duration 5 \
  --encoder libx265 --preset medium --crf 23 \
  --two-pass \
  --output corpus_2pass.jsonl

The driver materialises a per-encode stats file under tempfile.gettempdir() (e.g. /tmp/vmaftune-2pass-XXXXXX/), runs both passes back-to-back, and removes the stats file plus known encoder sidecars when the run completes — successful or not.

Codec support matrix

ADR-0546 closed the contract for every codec adapter. Each adapter now either runs a real two-invocation 2-pass, returns single-invocation quality-boost flags callers can splice into extra_params, or raises a typed error documenting why the encoder cannot support multi-pass.

Codec supports_two_pass two_pass_args(1, p) returns Notes
libx264 yes -pass 1 -passlogfile <prefix> FFmpeg-native two-invocation 2-pass. ADR-0333.
libx265 yes -x265-params pass=1:stats=<path> x265 routes pass control through its codec-private payload. ADR-0333.
libvpx-vp9 yes -pass 1 -passlogfile <prefix> FFmpeg-native two-invocation 2-pass; CRF mode pinned via -b:v 0.
libaom-av1 yes -pass 1 -passlogfile <prefix> FFmpeg-native two-invocation 2-pass. ADR-0546.
libvvenc yes -pass 1 -passlogfile <prefix> FFmpeg ≥ 6.1 translates -pass to VVenC's RcStatsFile config. ADR-0546.
libsvtav1 no -pass 1 -passlogfile <prefix> SVT-AV1 forbids multi-pass in CRF mode (verified against v4.1.0: Svt[error]: CRF does not support multi-pass. Use single pass.). The adapter still returns VBR-mode argv for callers that explicitly switch into bitrate-targeted mode via extra_params. ADR-0546.
h264_nvenc / hevc_nvenc / av1_nvenc no -multipass fullres NVENC's multipass is a single-invocation, in-encoder full-resolution analysis. Splice the pass-1 return value into EncodeRequest.extra_params for a quality-boosted single-pass encode. Requires NVENC's VBR rate control + target bitrate. ADR-0546.
h264_qsv / hevc_qsv / av1_qsv no -extbrc 1 -look_ahead_depth 40 Intel QSV's extended-BRC look-ahead is a single-invocation in-encoder pre-analysis window. Compose into extra_params for the quality boost. ADR-0546.
h264_amf / hevc_amf / av1_amf no -preanalysis true AMD AMF's pre-analysis stage runs inside a single ffmpeg invocation. Compose into extra_params for the quality boost. ADR-0546.
h264_videotoolbox / hevc_videotoolbox / av1_videotoolbox / prores_videotoolbox no raise VideoToolboxTwoPassUnsupportedError Apple VTCompressionSession has no multi-pass C API. For true 2-pass on macOS, switch to a software encoder (libx264 / libx265 / libsvtav1 / libaom-av1 / libvvenc) — all of which ship in the same FFmpeg build. ADR-0546.

When --two-pass is set against a codec where supports_two_pass = False, vmaf-tune writes a one-line warning to stderr and runs single-pass. (Mirrors the saliency.py "unsupported ROI encoder, fallback to plain encode" precedent.) To fail loud instead, callers using the Python API can pass on_unsupported="raise" to run_two_pass_encode. VideoToolbox calls to adapter.two_pass_args() always raise the typed error so callers introspecting the contract can disambiguate "API limitation" from "we forgot to implement".

Hardware quality-boost composition (ADR-0546)

Hardware encoders expose their "2-pass equivalent" as single-invocation flags. To combine them with the harness's normal single-pass driver, pull adapter.two_pass_args(1, Path("/tmp/unused")) and splice the result into EncodeRequest.extra_params:

from vmaftune.codec_adapters import get_adapter
from vmaftune.encode import EncodeRequest, run_encode
from pathlib import Path

adapter = get_adapter("h264_nvenc")
boost = adapter.two_pass_args(1, Path("/tmp/_unused"))  # ('-multipass', 'fullres')

req = EncodeRequest(
    source=ref, width=1920, height=1080, pix_fmt="yuv420p", framerate=24.0,
    encoder="h264_nvenc", preset="slow", crf=22, output=out,
    extra_params=boost,  # quality-boosted single-pass
)
run_encode(req)

Cache interaction

The content-addressed encode cache (ADR-0298) keys on pass count, so a 1-pass encode and a 2-pass encode of the same (src, codec, preset, crf) are distinct cache entries — the cache will never serve a 1-pass encode for a 2-pass request.

Sample-clip composition

Sample-clip mode (ADR-0297) composes with 2-pass: both passes apply the same -ss <start> -t <N> input slice, and the per-encode stats file is unique per slice. No special handling required.

What Phase A / E / F do not do

  • No target-VMAF bisect (Phase B). Phase E uses the canonical 5-point CRF sweep as its production default sampler.
  • No per-title or per-shot CRF prediction (Phase C / D).
  • No real-corpus end-to-end ladder validation against a Netflix per- title baseline yet.
  • True two-invocation 2-pass on codecs the underlying encoder refuses (today: libsvtav1 in CRF mode; ADR-0546). The adapter still exposes meaningful argv for callers that switch into the encoder's supported mode (e.g. VBR for SvtAv1).
  • Encoder-stats parsing for libvpx-vp9: FFmpeg's libvpx passlog is binary first-pass data, not the x264/x265 text schema consumed by encoder_stats.py.

Phase A.5 — opt-in fast recommend

The fast subcommand (ADR-0276, Research-0060) is an opt-in recommendation surface that combines three acceleration levers — VMAF proxy via fr_regressor_v2, Bayesian search via Optuna's TPE sampler, and GPU-accelerated VMAF for the verify step — to replace the exhaustive grid for the recommendation use case. The slow corpus path stays the canonical ground truth. Production mode runs the sample encode, extracts canonical-6 features, calls the fr_regressor_v2 ONNX proxy, and performs a single verify pass through the selected VMAF backend. --smoke is still available for dependency- free CI and local plumbing checks.

Install

pip install -e 'tools/vmaf-tune[fast]'   # adds Optuna

The core install path stays zero-dependency; the [fast] extra is strictly opt-in.

Smoke-mode quick start

vmaf-tune fast --smoke --target-vmaf 92

Smoke mode runs Optuna over a synthetic x264-shaped CRF→VMAF curve. No ffmpeg, no ONNX Runtime, no GPU is touched. Output is a single JSON object:

{
  "encoder": "libx264",
  "target_vmaf": 92.0,
  "recommended_crf": 18,
  "predicted_vmaf": 92.39,
  "predicted_kbps": 4121.09,
  "n_trials": 50,
  "smoke": true,
  "notes": "smoke mode — synthetic predictor; ..."
}

CLI flags

Flag Default Notes
--src PATH Source video. Required outside --smoke.
--target-vmaf F 92.0 Quality target on VMAF [0, 100] scale.
--encoder NAME libx264 Codec adapter (Phase A.5: x264 only).
--crf-lo N 10 Lower bound of the CRF search.
--crf-hi N 51 Upper bound of the CRF search.
--n-trials N 50 Optuna TPE trial count.
--time-budget-s N 300 Soft wall-clock budget for Optuna. Completed trial count may be lower than --n-trials when the timeout is hit.
--smoke off Synthetic predictor — exercises the pipeline without ffmpeg / ONNX.

Speedup model

Per Research-0060 §Speedup model:

Combination Speedup vs Phase A grid
Phase A grid (baseline)
fast (proxy + Bayesian + GPU verify) ≈20–50×
fast + NVENC (lever C, follow-up) ≈100–500×

These are upper bounds. The production claim is gated on a recommendation-quality benchmark against the slow grid.

Production limits

fast is a recommendation shortcut, not a corpus generator. It still needs a representative source clip, a usable encoder, and a VMAF binary for verification. When the proxy / verify gap exceeds --proxy-tolerance, the CLI exits with code 3 so operators can fall back to the slow grid:

vmaf-tune fast --src ref.yuv --target-vmaf 92 || \
    vmaf-tune recommend --from-corpus corpus.jsonl --target-vmaf 92

Per-shot parallelisation remains a separate integration with TransNet V2 and vmaf-perShot.

fast accepts any registered codec adapter in production mode. Smoke mode remains synthetic and x264-shaped by design.

libvpx-vp9 example

The VP9 adapter routes --encoder libvpx-vp9 through FFmpeg's libvpx-vp9 wrapper. It maps the shared preset vocabulary to -deadline good -cpu-used N, emits -crf, and forces VP9 constant-quality mode with -b:v 0.

vmaf-tune corpus \
    --encoder libvpx-vp9 \
    --source ref.yuv \
    --width 1920 --height 1080 --pix-fmt yuv420p \
    --framerate 24 --duration 10 \
    --preset medium --preset fast \
    --crf 28 --crf 32 --crf 36 \
    --output corpus_vp9.jsonl

--two-pass is supported for VP9 via FFmpeg's generic -pass / -passlogfile switches. Per-frame encoder-stats columns remain zero for VP9 until a binary libvpx first-pass parser lands.

x265 example

x265 ships ten presets (ultrafastplacebo) on the same 0..51 CRF scale as x264; the harness routes --encoder libx265 through ffmpeg's -c:v libx265 path. External ffmpeg must be built with --enable-libx265.

vmaf-tune corpus \
    --encoder libx265 \
    --source ref.yuv \
    --width 1920 --height 1080 --pix-fmt yuv420p \
    --framerate 24 --duration 10 \
    --preset medium --preset slow --preset placebo \
    --crf 23 --crf 28 --crf 34 \
    --output corpus_x265.jsonl

The 10-bit pipeline is enabled by setting --pix-fmt yuv420p10le; the adapter reports the corresponding HEVC profile (main10) via X265Adapter.profile_for(pix_fmt) for downstream consumers that need it.

The software adapters (libx264, libx265, libsvtav1, libaom-av1, libvpx-vp9, libvvenc) all route through the same codec-adapter registry; the search loop does not branch on codec name.

VVenC (H.266 / VVC)

vmaf-tune ships a libvvenc codec adapter (ADR-0285) that drives Fraunhofer HHI's open-source VVC encoder via FFmpeg's -c:v libvvenc wrapper. VVC is the ITU-T / ISO standard that succeeds HEVC and delivers roughly 30-50% better compression at equal quality. As a rough rule of thumb, VVenC slow is ~5-10% better quality than HEVC slower at the same bitrate but ~3-5× slower wall-clock — VVC is the "opt in to longer encodes for tighter bitrates" branch of the adapter set.

Quality knob and presets

Property Value
Quality knob qp (forwarded as the integer that the harness's --crf flag carries; VVenC's wrapper accepts the value regardless of label)
Quality range [17, 50] (perceptually informative window; full VVenC scale is 0..63)
Default 32
Native presets faster, fast, medium, slow, slower (5 levels)

The harness's canonical 7-name preset vocabulary (placebo / slowest / slower / slow / medium / fast / faster / veryfast / superfast / ultrafast) compresses onto VVenC's native 5 levels via a static map: anything strictly slower than slow pins to slower; anything strictly faster than fast pins to faster; the central three names map identically. This matches the projection rule used by the parallel HEVC and AV1 adapters so that predictor inputs stay codec-uniform.

Tuning surface (real VVenC 1.14.0 knobs)

The adapter exposes a curated subset of VVenC's config keys via FFmpeg's -vvenc-params key=value:key=value channel. Keys are sourced verbatim from source/Lib/apputils/VVEncAppCfg.h at tag v1.14.0 (SHA 9428ea8636ae7f443ecde89999d16b2dfc421524, accessed 2026-05-09). Every knob defaults to None (library default preserved); the search loop opts into a non-default value only when the corpus row records it.

Knob VVenC key Default Effect / typical use
perceptual_qpa PerceptQPA library default XPSNR-driven perceptual QPA. Materially shifts the rate-distortion curve; recorded per-row in encoder_extra_params for predictor conditioning.
internal_bitdepth InternalBitDepth library default (10 for VVC) Force 8- or 10-bit internal precision; necessary for HDR profiles.
tier Tier library default (main) main or high. Caps signalled max bitrate / resolution.
tiles Tiles single tile (cols, rows) partitioning, emitted as NxM. Useful for parallel encode of high-resolution content.
max_parallel_frames MaxParallelFrames library default (auto) Parallel-frames perf knob; 0 disables, >=2 enables.
rpr RPR library default VVC reference-picture-resampling. 0 off / 1 on / 2 RPR-ready.
sao SAO library default (on) Sample Adaptive Offset loop filter. Useful for ablation studies.
alf ALF library default Adaptive Loop Filter; useful for ablation.
ccalf CCALF library default Cross-Component ALF (only meaningful when alf is on).

Toggles are emitted in field-declaration order so the argv stays byte-stable for cache-key hashing (per ADR-0298). The adapter_version field bumps to "2" for the 2026-05-09 surface — stale cached results are invalidated automatically.

NN-VC status (deferred)

VVC the standard defines NN-VC tool-points (NN-based intra prediction, NN-based loop filter, NN-based super-resolution), but VVenC 1.14.0 does not ship implementations of any of them. An earlier draft of this adapter exposed an nnvc_intra toggle that emitted -vvenc-params IntraNN=1; that key has never existed in any released VVenC and has been removed (see ADR-0285 §"Status update 2026-05-09"). If upstream VVenC ever lands NN-VC tools the adapter will pick them up via the placeholder pattern from ADR-0339's self-activating adapter set.

External binary requirements

Running the VVenC adapter end-to-end requires:

The shipped unit tests mock subprocess.run so the adapter can be exercised without either binary present; integration smoke is gated to a CI runner that has a libvvenc-enabled FFmpeg.

Phase D — per-shot CRF tuning

The tune-per-shot subcommand drives the Netflix-style per-shot encoding path. It cuts the source into shots (via the C-side vmaf-perShot binary, which wraps TransNet V2 — see ADR-0223), extracts each shot to a temporary raw-YUV reference, runs the Phase-B target-VMAF bisect for that shot, and emits an FFmpeg encoding plan that produces one segment per shot plus a final concat-demuxer command.

The target-VMAF predicate remains pluggable for advanced callers: --predicate-module MODULE:CALLABLE bypasses the default bisect path, and the Python API still accepts tune_per_shot(..., predicate=...). Codec emission is still portable segment-and-concat output rather than native per-shot mechanisms (--qpfile for x264, --zones for x265, the SVT-AV1 segment table).

Design rationale and the decision matrix live in ADR-0392.

Quick start

Container source (geometry auto-probed, no pre-extraction needed — ADR-0542):

vmaf-tune tune-per-shot \
    --src clip.mp4 \
    --target-vmaf 92 \
    --encoder libx264 \
    --output per_shot_encode.mp4 \
    --plan-out plan.json

Raw YUV source (explicit geometry required):

vmaf-tune tune-per-shot \
    --src ref.yuv \
    --width 1920 --height 1080 \
    --framerate 24 \
    --target-vmaf 92 \
    --encoder libx264 \
    --output per_shot_encode.mp4 \
    --plan-out plan.json

The plan is emitted to stdout as JSON unless --plan-out is specified. Pass --script-out plan.sh to also receive a copy-paste shell script of the per-segment + concat commands.

CLI flags

Flag Default Notes
--src PATH Required. Source video. Accepts raw YUV (.yuv / .raw) or any container format (mp4, mkv, mov, ts, …). For container sources, --width, --height, --framerate, and --total-frames are auto-derived from ffprobe. (ADR-0542)
--width / --height auto-probed Source resolution. Required for raw YUV sources. For container sources the values are auto-derived from ffprobe when these flags are omitted. Pass explicit values to override the probe. (ADR-0542)
--pix-fmt PFMT yuv420p Forwarded to vmaf-perShot.
--framerate F auto-probed Source framerate. Auto-derived from ffprobe for container sources; defaults to 24.0 if the probe cannot determine a rate. (ADR-0542)
--target-vmaf V 92.0 Per-shot quality target.
--encoder NAME libx264 Any registered codec adapter accepted by the Phase-B bisect backend.
--bitdepth N 8 Forwarded to vmaf-perShot (8, 10, or 12).
--total-frames N 0 Frame count for the single-shot fallback when vmaf-perShot is unavailable. Auto-derived from ffprobe for container sources. (ADR-0542)
--scene-threshold X unset Override vmaf-perShot --diff-threshold (mean-absolute-luma-delta cutoff for cut classification; lower = more shots). Omit to keep the C-side compiled default (12.0 on 8-bit content). See ADR-0513.
--max-shot-duration S 2.0 Uniform-time-window splitter (seconds). Any detected shot longer than S is sliced into equal-length sub-shots so the per-shot tuner sees a non-degenerate timeline even when the detector under-cuts (e.g. short clips, low-contrast fades). Set to 0 to disable. See ADR-0513.
--per-shot-bin PATH vmaf-perShot Override the shot detector binary.
--ffmpeg-bin PATH ffmpeg Override the FFmpeg binary.
--vmaf-bin PATH vmaf Override the libvmaf CLI used by the per-shot scorer.
--preset NAME adapter default Codec preset forwarded to Phase-B bisect.
--crf-min / --crf-max adapter range Optional inclusive CRF search bounds; pass both or neither.
--max-iterations N 8 Maximum encode+score iterations per detected shot.
--vmaf-model NAME vmaf_v0.6.1 VMAF model forwarded to the per-shot scorer.
--score-backend NAME auto libvmaf scoring backend for the per-shot scorer (auto, cpu, cuda, sycl, hip). (vulkan removed in ADR-0726.)
--predicate-module SPEC Advanced hook MODULE:CALLABLE matching (shot, target_vmaf, encoder) -> (crf, measured_vmaf); bypasses real bisect.
--workdir PATH None Directory under which to create the per-run temporary scratch directory. Overrides VMAFTUNE_WORKDIR. See compare --workdir for full semantics. (ADR-0598)
--max-concurrent-decodes N 1 Maximum number of simultaneous reference-YUV decode operations across all per-shot bisect threads. Default 1 (serial decodes). See compare --max-concurrent-decodes for full semantics. (ADR-0577)
--output PATH per_shot_encode.mp4 Final concatenated encode destination.
--segment-dir PATH see below Directory for per-shot segment files. Priority order: (1) explicit --segment-dir; (2) <plan-out>.parent/segments when --plan-out is set; (3) <output>.parent/segments. If the resolved directory is not writable (e.g. a read-only bind-mount), a WARN is emitted to stderr and the command still exits 0 — the plan JSON remains the authoritative deliverable. See ADR-0532.
--plan-out PATH stdout Write the JSON plan here instead of stdout.
--script-out PATH Optional: also emit a copy-paste shell script.

Plan JSON schema

{
  "encoder": "libx264",
  "framerate": 24.0,
  "predicate": "bisect",
  "target_vmaf": 92.0,
  "shots": [
    {
      "start_frame": 0, "end_frame": 24,
      "crf": 22, "predicted_vmaf": 93.0,
      "bitrate_kbps": 5234.12
    },
    {
      "start_frame": 24, "end_frame": 72,
      "crf": 26, "predicted_vmaf": 92.5,
      "bitrate_kbps": 4182.44
    }
  ],
  "segment_commands": [
    ["ffmpeg", "-y", "-hide_banner", "-ss", "0.000000", "-i", "ref.mp4",
     "-frames:v", "24", "-c:v", "libx264", "-crf", "22",
     "/tmp/segments/shot_0000.mp4"]
  ],
  "concat_command": [
    "ffmpeg", "-y", "-hide_banner", "-f", "concat", "-safe", "0",
    "-i", "/tmp/segments/concat.txt", "-c", "copy", "out.mp4"
  ]
}

start_frame is inclusive, end_frame is exclusive (Python-slice convention). The vmaf-perShot CSV/JSON sidecar uses inclusive end_frame; the planner normalises into the half-open form and the segment commands honour the half-open semantics via -frames:v.

bitrate_kbps (added in ADR-0531) carries the encoded segment bitrate measured by the Phase-B bisect backend: (segment_size_bytes × 8 / 1000) / shot_duration_s. It is null when a custom --predicate-module is used (no real encode happens in that path) and always a positive finite float for the default bisect backend. The vmaf-tune report renderer uses this field to populate the Bitrate column in the per-shot table; a null / absent value renders as "—".

Single-shot fallback

If the vmaf-perShot binary is not on PATH, or it exits non-zero, the planner falls back to a single shot covering the whole clip ([0, --total-frames)). This keeps tune-per-shot usable as a smoke test on machines that have not built the shot-detector binary yet.

The uniform-time-window splitter (--max-shot-duration, default 2.0 s) still applies to that fallback: even when the detector returns one giant shot, the splitter slices it into roughly ceil(duration / window) equal-length sub-shots so the per-shot tuner produces a useful CRF timeline. Pass --max-shot-duration 0 to restore the historical single-shot behaviour. See ADR-0513.

Tuning scene sensitivity

vmaf-perShot classifies frame N as a cut when the mean absolute luma delta against frame N-1 (in the 8-bit domain) crosses the --diff-threshold cutoff. The compiled-in default is 12.0, which the empirical calibration in ADR-0222 tuned against the testdata fixtures. Real-world content varies: animated material with deep saturated transitions trips the heuristic easily, but short live-action clips with low-contrast scene changes (BBB's underwater segment, indoor talking-heads) under-cut at the default.

Lower --scene-threshold (e.g. 4 - 6) to recover those cuts. Higher values (e.g. 18 - 25) reject motion-rich in-shot bursts that the default classifies as fades. The CLI flag passes through to the C binary verbatim so existing diagnostic scripts that invoke vmaf-perShot directly use the same units.

What Phase D does not do

  • Does not run the encodes — only emits the plan. Pipe --script-out plan.sh through sh to execute it manually.
  • Does not emit native per-codec per-shot mechanisms (x264 --qpfile, x265 --zones, SVT-AV1 segment tables). Per-segment encode plus concat-demuxer is the portable fallback.
  • Does not handle GOP-aligned shot boundaries — the per-segment approach side-steps this by re-encoding each shot from frame 0.

auto

vmaf-tune auto is the Phase F entry point (ADR-0325). One CLI verb composes the per-phase subcommands (corpus, recommend, predict, tune-per-shot, recommend-saliency, ladder, compare) plus the orthogonal modes (HDR auto-detect, sample-clip, resolution-aware) into a deterministic decision tree. F.1 shipped the sequential composition; F.2-F.5 added the short-circuits, confidence-aware fallbacks, and per-content recipes.

Synopsis

vmaf-tune auto \
    --src reference.mp4 \
    --target-vmaf 93 \
    --max-budget-bitrate 5000 \
    --allow-codecs libx264,libx265 \
    [--codec libx265] \
    [--sample-clip-seconds 10] \
    [--smoke] \
    [--output plan.json]

The non-smoke path probes source geometry, source duration, and HDR metadata through the same ffprobe/HDR helpers used by the corpus path. Probe failures degrade to conservative 1920x1080 SDR defaults so the planner can still emit a JSON plan and expose which later stages need real evidence. --smoke still exercises the composition end-to-end with mocked sub-phases (no ffmpeg, no ONNX). The JSON plan emitted under metadata.short_circuits records which short-circuits fired; post-hoc analysis uses this to measure the speedup contribution of each one. For each non-smoke cell, auto now feeds the probed metadata into the existing Predictor path, picks a codec-specific CRF for metadata.effective_predictor_target_vmaf, and records predictor estimates for estimated_vmaf and estimated_bitrate_kbps. These are planner estimates, not measured encode results, until the future realise/encode step scores the chosen cells. The per-cell prediction_source key distinguishes the production estimate path ("predictor") from the --smoke composition placeholder ("smoke-placeholder").

Short-circuits

The ten short-circuits below are the F.2 surface. Each one is a guarded fast-path with one trigger condition; the predicates are exposed as _should_short_circuit_<N> helpers in tools/vmaf-tune/src/vmaftune/auto.py so they can be unit-tested in isolation.

# Identifier Trigger Skips
1 ladder-single-rung meta.height < 2160 Multi-rung ABR ladder evaluation (ADR-0289 / ADR-0295).
2 codec-pinned --codec set or --allow-codecs resolves to one entry The compare.shortlist stage.
3 predictor-gospel predict.crf_for_target returns GOSPEL (ADR-0306) The recommend.coarse_to_fine fallback for that cell.
4 skip-saliency meta.content_class is photographic / live-action (not animation / screen content) The recommend_saliency.maybe_apply stage (ADR-0293).
5 sdr-skip not meta.is_hdr (per ADR-0300 detector) The HDR resolution + model-selection branch.
6 sample-clip-propagate --sample-clip-seconds > 0 Re-deciding clip length per stage; the user-supplied value propagates verbatim (ADR-0301).
7 skip-per-shot duration < 5min AND shot_variance < 0.15 The tune_per_shot.refine pass (ADR-0392).
8 low-complexity meta.complexity_score < 200 kbps (probe-encode bitrate) The recommend.coarse_to_fine sweep — the predictor's point estimate is already tight on simple content. 0.0/NaN does not fire (no probe run yet).
9 baseline-meets-target meta.baseline_vmaf >= target_vmaf The full predictor sweep — the default-CRF encode already satisfies the quality target. 0.0/NaN does not fire (no baseline scored yet).
10 no-two-pass adapter.supports_two_pass == False (ADR-0333 + ADR-0546) The two-pass calibration stage. Hardware encoders (*_nvenc, *_amf, *_qsv, *_videotoolbox) and libsvtav1 (CRF-mode multi-pass prohibition) fire this. libx264 / libx265 / libvpx-vp9 / libaom-av1 / libvvenc set supports_two_pass = True.

The 5-min and 0.15 thresholds in short-circuit #7 are placeholders; F.3 fits them empirically once Phase F has emitted enough labelled compositions to make the fit statistically defensible. The constants live at the top of auto.py (PHASE_D_DURATION_GATE_S, PHASE_D_SHOT_VARIANCE_GATE) so the eventual fit lands as a one-line edit. The 200 kbps threshold in short-circuit #8 is likewise a placeholder: LOW_COMPLEXITY_PROBE_BITRATE_THRESHOLD_KBPS at the top of auto.py.

The evaluation order in SHORT_CIRCUIT_PREDICATES is part of the public contract: tests assert that an earlier-firing predicate doesn't shadow a later one whose result would have been different. Adding a new short-circuit means appending; never reordering.

auto does not dispatch the fast subcommand from inside its tree. fast (ADR-0276 fast-path) is a different operator surface (proxy + Bayesian over a single codec) and remains a sibling, not a child, of auto.

For HDR sources, every emitted cell records the codec-specific hdr_args produced by vmaftune.hdr.hdr_codec_args(codec, info). That means x264 gets only container-level -color_* flags, x265 gets -x265-params SEI signalling, SVT-AV1 gets -svtav1-params, and SDR cells record an empty list after the sdr-skip short-circuit fires.

Winner selection

auto now finishes the planning pass by selecting one estimated cell. The chosen cell is marked with "selected": true in cells[]; every other cell has "selected": false. The same decision is copied to metadata.winner so scripts can read a stable object without scanning the cell array.

Winner statuses:

Status Meaning
budget_and_quality_met At least one cell met --target-vmaf and --max-budget-bitrate; the lowest estimated bitrate wins.
quality_met_budget_exceeded No cell was inside budget, but at least one cell met the quality target; the smallest budget overage wins.
target_unmet No cell met the quality target; the closest estimated VMAF miss wins so the caller still gets a concrete next encode.
no_eligible_cells No cell carried finite estimated_vmaf and estimated_bitrate_kbps; this is an input/planner evidence failure.

metadata.winner records cell_index, rung, codec, crf, estimated_vmaf, estimated_bitrate_kbps, quality_margin, and budget_margin_kbps. This is still a planning result: the selected cell is the next encode target, not a substitute for the final encode/score verification pass.

Execute mode (ADR-0454)

After the planning pass, --execute drives real FFmpeg encodes and libvmaf scores for the selected cell(s):

vmaf-tune auto \
    --src reference.mp4 \
    --target-vmaf 93 \
    --max-budget-bitrate 5000 \
    --allow-codecs libx264,libx265 \
    --output plan.json \
    --execute \
    --runs-dir runs/

--execute flags:

Flag Default Description
--execute off Enable execute mode; plan-only when absent.
--runs-dir PATH runs/ Destination for encoded files and tune_results.jsonl.
--execute-all off Run every plan cell instead of only the selected winner.

Results are appended to <runs-dir>/tune_results.jsonl one row per executed cell. Each row carries the cell metadata (codec, preset, CRF, estimated VMAF/bitrate) merged with encode outcomes (size, encode time, FFmpeg version) and score outcomes (measured VMAF, per-feature means/stds, vmaf-binary version). The file appends on each run so partial runs and incremental re-runs do not overwrite previous results.

The CLI exits with status 1 if at least one cell was executed but none scored successfully (encode failures, vmaf binary absent, etc.); it exits 0 on plan-only runs regardless of plan content.

Per-shot execution (ADR-0468)

run_plan_per_shot splits the source into shot boundaries (via vmaf-perShot / TransNet V2, ADR-0223) and scores each segment independently. Results land in <runs-dir>/tune_results_per_shot.jsonl:

from vmaftune.executor import run_plan_per_shot

per_shot_results = run_plan_per_shot(
    plan, src=Path("reference.mp4"), out_dir=Path("runs/"),
    vmaf_model="vmaf_v0.6.1",
)
for r in per_shot_results:
    print(f"cell {r.row['cell_index']}: "
          f"{r.row['shot_count']} shots, "
          f"weighted VMAF={r.weighted_vmaf:.2f}")

Each top-level row carries shot_count and weighted_vmaf (frame-length-weighted mean of per-shot scores). When vmaf-perShot is absent, the call falls back to a single-shot range covering the whole clip; shot_count == 1 signals this.

Saliency execution (ADR-0468)

run_plan_saliency applies saliency-aware encoding via saliency_aware_encode (see § Saliency-aware encoding) before scoring. Results land in <runs-dir>/tune_results_saliency.jsonl:

from vmaftune.executor import run_plan_saliency

sal_results = run_plan_saliency(
    plan, src=Path("reference.mp4"), out_dir=Path("runs/"),
    saliency_model_path=Path("model/tiny/saliency_student_v1.onnx"),
    duration_frames=total_frame_count,
)
for r in sal_results:
    print(f"cell {r.row['cell_index']}: "
          f"saliency_available={r.saliency_available}, "
          f"VMAF={r.row['vmaf_score']}")

saliency_available in the result row is True when the ONNX model ran; False means the encoder fell back to a plain encode (model file missing or onnxruntime not installed). The encode and score always proceed regardless.

Confidence-aware fallbacks (F.3)

F.2 treats the predictor's verdict as a binary GOSPEL / FALL_BACK gate (short-circuit #3). F.3 makes the gate continuous by consulting the conformal interval half-width returned by Predictor.predict_vmaf_with_uncertainty (ADR-0393). Two width gates carve the half-width axis into three regions:

Interval width Outcome Effect on F.2
width <= tight_interval_max_width SKIP_ESCALATION Predictor is confident; trust the point estimate even when the native verdict said FALL_BACK.
tight < width < wide RECOMMEND_ESCALATION (on FALL_BACK / unknown) or SKIP_ESCALATION (on GOSPEL / LIKELY) Defer to the native verdict — exactly the F.2 behaviour.
width >= wide_interval_min_width FORCE_ESCALATION Predictor is uncertain; escalate to recommend.coarse_to_fine even when the native verdict said GOSPEL.

The two thresholds are corpus-derived. The conformal-VQA calibration pipeline ships a JSON sidecar with the canonical keys tight_interval_max_width and wide_interval_min_width; the loader honours per-corpus overrides transparently. When no sidecar is found the loader falls back to the Research-0067 emergency floor (2.0 / 5.0 VMAF) and emits a one-line warning; the floor is documented behaviour, not a magic constant.

Per-cell decisions are recorded in plan.metadata.confidence_aware_escalations[] (one entry per cell, keyed by rung, codec, verdict, interval_width, decision), and each cell in plan.cells[] carries its own confidence_decision + interval_width keys so JSON consumers don't need to cross-reference the metadata array index.

Per CLAUDE.md feedback_no_test_weakening: the thresholds are calibration outputs. If a sidecar value triggers surprising cell escalations on real data, the fix is a recalibration PR — not a loosening of the F.3 gate here.

The helper _confidence_aware_escalation(verdict, interval_width, thresholds) in tools/vmaf-tune/src/vmaftune/auto.py is exposed for unit testing and direct embedding by downstream tools (the MCP server's auto proxy, the CI corpus collector). It is a pure function of its three inputs.

Per-content-type recipes (F.4)

F.4 layers per-content-type recipe overrides on top of F.1+F.2+F.3. When the upstream classifier (per_shot.detect_shots plus the fork-local content-class heuristics) tags a source as animation, screen_content, live_action_hdr, or ugc, the auto driver applies a small override dict before the F.2 short-circuits evaluate so a recipe can flip force_single_rung and have the ladder stage honour it. Any source whose meta.content_class doesn't match a named recipe falls through to the empty default recipe.

The override keys consumed by the driver are:

Key Type Effect
tight_interval_max_width float Narrows / widens the F.3 conformal-tight gate.
force_single_rung bool Arms short-circuit #1 (ladder-single-rung) even on >= 2160p sources.
saliency_intensity str Passed through to the saliency stage when not skipped. One of default, aggressive, very_aggressive.
target_vmaf_offset float Additive offset applied to the predictor's effective target VMAF. The input --target-vmaf (the gate that ships models) is never shifted by this value — see the no-test-weakening note below.

The five recipe classes ship the following overrides. The values below are the F.5-calibrated thresholds emitted by ai/scripts/calibrate_phase_f_recipes.py and shipped in ai/data/phase_f_recipes_calibrated.json. The calibration was run on 2026-05-09 against the K150K corpus (.workingdir2/konvid-150k/konvid_150k.jsonl, 148 543 rows out of an expected 153 841 — the ingestion was ~96.6 % complete; a re-run on the full corpus is a follow-up PR). Threshold rationale and the per-class proxy-vs-corpus provenance break-down live in Research-0067 §"F.4 recipe-override placeholders" plus the JSON metadata block.

Class tight_interval_max_width force_single_rung saliency_intensity target_vmaf_offset Source
animation 1.75 true aggressive +2.0 proxy (UGC-anchored)
screen_content (unset) (unset) very_aggressive +1.0 proxy (UGC-anchored)
live_action_hdr 1.4 (unset) (default) 0.0 proxy (UGC-anchored)
ugc 3.5 false default +1.5 corpus (K150K)
default (unset) (unset) (default) 0.0 n/a

K150K is a UGC-only corpus and carries no per-source content_class column; only the ugc row is corpus-derived. The other three rows are calibrated as documented absolute offsets ("proxy") anchored on the F.4 envelope until PR #477's TransNet shot-metadata columns plus a class-labelled subset land. The JSON recipes.<class>._provenance sub-dict records the source per row so future re-calibrations can distinguish corpus-derived from proxy-derived values. UGC's target_vmaf_offset came out empirically positive (+1.5) on K150K because the corpus's MOS distribution has a heavier upper tail than lower tail; the calibration script clamps every offset to the F.4 documented envelope of [-2.0, +2.0] so a pathological corpus cannot push the predictor target outside the regime the planner has been exercised against.

The auto.py runtime loads the JSON at module import via _load_calibrated_recipes; if the JSON file is missing or malformed, the F.4 placeholder constants in _F4_PLACEHOLDER_RECIPES apply as a graceful fallback. The generated JSON includes ADR-0661 run_provenance with the calibration script, argv, source corpus JSONL, row cap, and recipe output target. To regenerate the calibration after the corpus ingestion completes (or when a class-labelled corpus replaces K150K), run:

python ai/scripts/calibrate_phase_f_recipes.py \
    --corpus .workingdir2/konvid-150k/konvid_150k.jsonl \
    --out ai/data/phase_f_recipes_calibrated.json

Recipe rationale (each cited threshold is provisional pending F.5 calibration):

  • Animation — predictor residuals are tighter on flat colour fields, single-rung ladder is sufficient, and saliency is more aggressive on cel-line edges. Animation is intrinsically more compressible at a given perceptual quality, so the predictor aims ~2 VMAF higher.
  • Screen content — split-frame structure (low-entropy background + high-detail text/icon regions) benefits from very_aggressive saliency that raises QP on the background while keeping text near-lossless. Predictor target nudged +1.
  • Live-action HDR — per ADR-0300 the HDR pipeline already runs; the F.3 conformal-tight gate is narrowed to 1.4 because a wide predictor interval on HDR is more suspect than on SDR (the predictor was largely trained on SDR per ADR-0393).
  • UGC — user-generated content carries higher upstream-encode noise, inconsistent grading, and resolution mismatches; predictor uncertainty is the baseline. Widening the F.3 tight gate to 3.5 avoids over-flagging UGC cells as "needs escalation" simply because the interval is wider than a Netflix-grade reference. The K150K calibration nudges the predictor target up 1.5 because the corpus MOS distribution has a heavier upper tail.

The recipe class is recorded in plan.metadata.recipe_applied (one of animation, screen_content, live_action_hdr, ugc, or default) and the override dict in plan.metadata.recipe_overrides. Each cell in plan.cells[] also carries the resolved saliency_intensity and effective_predictor_target_vmaf so JSON consumers don't need to cross-reference the metadata block.

Per CLAUDE.md memory feedback_no_test_weakening: recipe overrides MUST NOT silently widen the production-flip gate that ships models. They affect predictor thresholds (the effective_predictor_target_vmaf that the predictor aims for; the F.3 width gate that decides per-cell escalation), not the input --target-vmaf that downstream consumers treat as the contract. The driver records the input target_vmaf verbatim in plan.metadata.target_vmaf and the offset target in plan.metadata.effective_predictor_target_vmaf; the two are kept distinct.

The helper _apply_recipe_override(meta, plan_state, thresholds) in tools/vmaf-tune/src/vmaftune/auto.py resolves the recipe and returns a (recipe_class, recipe, effective_thresholds) triple; the get_recipe_for_class(content_class) helper returns a fresh override dict for any of the five canonical class strings. Both are pure functions; the table at module scope (_CONTENT_RECIPE_TABLE) holds factory callables so each call returns a fresh dict that callers may mutate without affecting subsequent runs.

References: ADR-0397 §F.4, ADR-0393, ADR-0300, Research-0067.

Tests

pytest tools/vmaf-tune/tests/

The shipped suite mocks subprocess.run so it neither requires ffmpeg nor a built vmaf. Real-binary integration coverage will land when the codec adapter set widens.