Skip to content

ADR-0222: vmaf-perShot per-shot CRF predictor sidecar

  • Status: Accepted
  • Date: 2026-04-29
  • Deciders: Lusoris
  • Tags: ai, tools, encoder-hint, fork-local, t6-3b

Context

The fork's tiny-AI roadmap §2.4 (per-shot CRF predictor + TransNet V2 shot boundaries) decomposes into two tasks tracked in .workingdir2/BACKLOG.md:

  • T6-3a — TransNet V2 shot-boundary feature extractor (~1M params, 100-frame window). Already proposed as PR #210 / ADR-0220 — emits per-frame shot_boundary_probability and a binary shot_boundary flag.
  • T6-3bthis ADR — the standalone CLI tools/vmaf-perShot that consumes a YUV reference, segments it into shots, computes per-shot signal aggregates (scene complexity, motion energy, length), and emits a per-shot CRF plan (CSV / JSON) for downstream encoders.

The split keeps each side independently mergeable. T6-3b ships a fallback frame-difference shot detector so the tool is useful without T6-3a having merged yet; once T6-3a lands the tool can take a pre-computed shot map and skip its built-in detector. The CRF predictor itself stays inside the sidecar — it never enters libvmaf's metric path.

Decision

We will ship vmaf-perShot as a standalone executable under core/tools/vmaf_per_shot.c with the following v1 contract:

  • Input: planar YUV420P (8 / 10 / 12-bit), a target VMAF, and CRF clamp bounds.
  • Shot detector: per-frame mean absolute luma delta vs. the previous frame, normalised to the 8-bit domain. Cuts trigger when the delta exceeds --diff-threshold (default 12.0) and the running shot has reached VMAF_PER_SHOT_MIN_LEN = 4 frames (suppresses flash / fade flicker).
  • Per-shot signals: mean_complexity (sample variance of luma) and mean_motion (mean absolute frame delta) averaged over the shot's frames.
  • CRF predictor (v1, transparent linear blend):
crf = base
      + 0.20 * range * motion_norm * length_factor
      - 0.20 * range * complexity_norm
      - 0.15 * range * target_norm

where range = crf_max - crf_min, base = crf_min + range/2, signals are clipped into [0, 1], length_factor = 0.5 for shots < 24 frames (else 1.0). Output clamps into [crf_min, crf_max]. - Output: CSV (default) or JSON, one row / object per shot with the full signal vector so downstream encoders can override the predicted CRF with their own rule.

Alternatives considered

Option Pros Cons Why not chosen
Linear-blend heuristic (chosen) Transparent; debuggable; no training corpus needed; runs in 30 ms on a 48-frame fixture Coefficients are a prior, not a fit Roadmap §2.4 says "v1 = small CNN/MLP"; without a labelled corpus a "small MLP" would just memorise the prior. Defer the MLP to v2 once the BVI-DVC + Netflix corpora produce shot-level CRF labels
Trained MLP (PyTorch → ONNX) Closer to roadmap §2.4's "small CNN/MLP" wording; later versions could trade accuracy for size No labelled per-shot CRF dataset exists today; the existing tiny-AI corpora (BVI-DVC, Netflix) have per-clip / per-frame labels, not per-shot CRF targets Forces fabricating labels; adds ORT dependency to the sidecar for a v1 that has no measured advantage over the linear blend
Embed as a libvmaf C-API entry point (vmaf_per_shot_predict) Reusable from FFmpeg / Python Stretches libvmaf's "metric engine" charter into encoder-hint territory; CLI flag set is large enough that an embedded API would still need a CLI wrapper Roadmap §2.4 explicitly says "Standalone CLI (tools/vmaf-perShot) that writes an encoder-ingestible sidecar. Does not run inside libvmaf — its output is a parameter hint, not a quality score"
CSV-only output Simplest tooling Encoders that consume JSON sidecars (some webapp pipelines) need a second tool Both formats are 5 lines of code each; ship both
JSON-only output Modern; nestable Bash / awk pipelines have to install jq CSV is the path-of-least-resistance for shell-driven encode farms
SI / TI complexity (Spatial / Temporal Information per ITU-T P.910) Industry-standard for content classification Two extra Sobel-filter passes per frame; v1 doesn't need that fidelity Frame-variance + mean-abs-delta are correlated enough with SI / TI for a v1 prior; revisit if the linear blend's empirical PLCC vs. real CRF labels comes back below 0.7

Consequences

  • Positive.
  • Encoder pipelines can run vmaf-perShot against a reference YUV and feed the resulting CRF plan into x264 / x265 / SVT-AV1 --zones (or equivalent) without involving the libvmaf metric path.
  • The sidecar contract (CSV / JSON columns: shot_id, start_frame, end_frame, frames, mean_complexity, mean_motion, predicted_crf) is stable across v1; v2's trained MLP can replace predicted_crf without touching columns or the CLI flag set.
  • T6-3b unblocks the CRF-prediction half of roadmap §2.4 independently of T6-3a's merge state.
  • Negative.
  • The frame-difference detector is less accurate than TransNet V2 on dissolves / fades / cross-fades; the v1 plan will sometimes under-segment those transitions. Documented in docs/usage/vmaf-perShot.md.
  • The linear-blend predictor encodes a static prior; a real encode will only validate it against a target VMAF, not optimise over the trade-off.
  • Neutral / follow-ups.
  • v2 trained predictor (separate ADR): wire a per-shot MLP once a labelled corpus exists; reuse the same CSV / JSON schema so consumers don't churn.
  • TransNet V2 wiring (T6-3a follow-up): once the extractor merges, the tool will gain --shots PATH to consume a pre-computed shot map and bypass the heuristic detector.
  • Higher-fidelity complexity (SI / TI): if v2's empirical PLCC against real CRF labels falls below the target, swap the variance signal for SI and the abs-diff signal for TI.

References