ADR-0301: vmaf-tune --sample-clip-seconds (sample-clip mode)¶
- Status: Accepted
- Date: 2026-05-03
- Deciders: Lusoris
- Tags: tooling, ffmpeg, vmaf-tune, fork-local
Context¶
ADR-0237 accepted the Phase A scaffold for tools/vmaf-tune/: a grid sweep over (preset, crf) cells that encodes the full source per cell, scores the encode against the reference with the libvmaf CLI, and emits a JSONL row per cell. With the multi-codec backlog (libx265, libsvtav1, libvpx-vp9, libvvenc) and the per-title / per-shot search loops still to come (Phases C–E), full-source encoding becomes a wall-time bottleneck on long sources: a 60-second clip across a 6-cell grid is six 60-second libx264 slow passes, which Phase C calibration sweeps multiply by tens of (preset, crf) cells across many sources.
A representative slice typically reproduces the relative ordering of encoder-parameter cells without the wall-time cost of a full encode. The harness needs an opt-in mode that cuts the per-cell encode and score to a fixed-length window of the source, with the same window on both sides so VMAF stays valid (mismatched windows yield nonsense scores). The fork-local consumers — Phase B (target-VMAF bisect), Phase C (per-title CRF predictor) — need to know whether a row was sampled or full so they can filter, weight, or epilogue-rescore the chosen cell on the full source.
Decision¶
We will add --sample-clip-seconds N to vmaf-tune corpus (default 0 = legacy full-source). When set, the harness computes a centre-anchored window start = (duration_s − N) / 2, prepends -ss <start> -t <N> to FFmpeg's input-side options (so the rawvideo demuxer fast-seeks the YUV by skipping start * framerate frame-sized byte chunks instead of decoding past the slice), and mirrors the same window on the score side via the libvmaf CLI's --frame_skip_ref / --frame_cnt. Each emitted row carries clip_mode = "sample_<N>s" or "full". The corpus schema bumps to SCHEMA_VERSION = 2 (additive — clip_mode is the only new key). bitrate_kbps is computed against the encoded duration (slice seconds when sample-clip is on, source seconds otherwise) so sample-clip rows aren't biased low by a slice-bytes ÷ source-seconds ratio. If N >= duration_s the harness silently falls back to "full" mode.
For the first scaffold, "representative" is naive centre-anchored — shot-aware placement (e.g. via the existing TransNet V2 placeholder to pick the slice with the highest spatial-temporal complexity) is on the follow-up backlog and out of scope here.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Naive centre slice (chosen) | Simplest impl; one int math expression; deterministic; reproducible | Misses content variation if the centre window is uniform (single shot, static interview) | Picked: matches the task's "first scaffold, smarter sampling later" framing. Falls back to full automatically when source is too short. |
| Slice the reference YUV on disk into a temp file, then encode + score the temp | Simpler downstream — both sides see a "normal" source | Adds an I/O pass equal to slice-bytes per cell (defeats some of the speedup), pollutes scratch dir, requires temp-cleanup failure handling | Rejected. Mirroring the slice via --frame_skip_ref / --frame_cnt is zero-I/O on the reference and uses libvmaf CLI flags that already exist. |
Output-side -ss (after -i) | One-line FFmpeg argv change | FFmpeg still decodes (rawvideo demuxer still reads) the full source up to start, then re-encodes only the slice — saves only encoder time, not demuxer / I/O time | Rejected. Defeats the linear-with-duration speedup that's the whole point. |
| Smart shot-aware placement (TransNet V2 + complexity score) | Picks the most-distinct window — better aggregate-VMAF predictor | Brings in the TransNet V2 dependency at the harness layer; needs a complexity scorer; out of scope for the scaffold | Deferred to a follow-up. The naive scaffold is the foundation; smart placement is one swap of _resolve_sample_clip. |
| Per-shot dynamic slicing (Phase D territory) | Closest to what a real per-shot CRF predictor wants | Phase D is its own ADR-0237 phase; conflating sample-clip and per-shot would block both | Out of scope. ADR-0237 keeps these separate phases. |
Consequences¶
- Positive: Per-cell wall time scales with slice length, so
--sample-clip-seconds 10on a 60-second source is roughly a 6x speedup per cell (encode + score). Phase C's multi-source calibration sweeps become tractable on sources longer than a few seconds. Theclip_moderow tag lets Phase B/C consumers filter or weight sample rows without re-reading the source. - Negative: Sample-clip VMAF drifts from full-clip VMAF by typically ~1–2 points on diverse content (mixed-shot trailers, sports, action), tighter (~0.3–0.5 points) on uniform content (single-shot interviews, animation). The relative ordering between cells survives — which is what the search loops actually consume — but the absolute number is noisier. Phase C should rescore the predictor's chosen cell on the full source as an epilogue.
- Neutral / follow-ups:
- Schema bumped to v2 (additive
clip_mode); downstream consumers should readclip_modedefensively (default"full"for v1 rows). - Naive centre placement leaves shot-aware placement on the backlog — when TransNet V2 lands real weights, swap
_resolve_sample_clipto pick the slice with the most distinct content. - Bitrate semantics: rows now divide bytes by encoded duration, not source duration. Documented in
docs/usage/vmaf-tune.md.
References¶
- ADR-0237 — Phase A parent, defines the JSONL contract this ADR extends.
docs/usage/vmaf-tune.md— user-discoverable flag + accuracy caveat documentation.- libvmaf CLI:
--frame_skip_ref/--frame_cnt— seecore/tools/cli_parse.c. - Source:
req— direct user request 2026-05-03 to add a sample-clip mode to vmaf-tune that encodes only a short representative slice (e.g. 10 seconds) per grid point instead of the full video, with the naive middle-slice as the first scaffold and TransNet V2-based smart sampling as a follow-up.