ADR-0247: vmaf-roi sidecar binary for per-CTU QP offsets

Status: Accepted
Date: 2026-04-29
Deciders: lusoris, Claude
Tags: tools, ai, roi, encoder

Context

T6-2a (ADR-0218 / PR #208) shipped the in-libvmaf mobilesal saliency extractor — same model, scoring side: it tells callers where the model expects perceptual error. T6-2b is the encoder-steering counterpart: a CLI sidecar that consumes the same saliency map and emits a per-CTU QP-offset file the encoder reads back. Two surfaces, one model.

The fork already ships several CLI tools under core/tools/ (vmaf, vmaf_bench); a new sidecar binary slots into that pattern with no public C-API impact and no library link surface beyond the existing libvmaf/dnn.h session API.

The decision space spans four largely independent axes: the sidecar format (ASCII, raw binary, or a single universal format), the reduction (mean, max, or percentile over each CTU's samples), the encoder coverage (x265-only vs multi-encoder), and the signal blend (saliency-only vs saliency combined with edge density). Each pick locks in a contract that downstream encoder drivers will depend on.

Decision

We ship vmaf-roi as a fork-local sidecar binary at core/tools/vmaf_roi.c that:

Consumes raw planar YUV input + a 0-based frame index, seeks to the requested frame with fseeko (>2 GiB safe), reads only the luma plane.
Computes a per-pixel saliency map via the optional --saliency-model ONNX session (vmaf_dnn_session_run_luma8), or via a deterministic center-weighted radial placeholder when no model is provided (smoke-test fallback only — explicitly documented as not for real encodes).
Reduces the per-pixel map to a per-CTU mean (not max) over each CTU's bounding box, with partial CTUs at right/bottom edges averaged over their actual sample count.
Maps [0, 1] saliency to a signed QP offset via qp = clamp(-strength * (2 * saliency - 1), -12, +12): high saliency drives the offset negative (boost quality), low saliency drives it positive (save bits), and neutral saliency (~0.5) maps to zero.
Emits two formats selected by --encoder: ASCII grid for x265 (--qpfile-style, with two # comment header lines documenting frame / CTU / strength) and raw int8_t binary for SVT-AV1 (--roi-map-file, no header).
Operates on one frame per invocation; multi-frame batching is a shell driver loop (or a future built-in mode).
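The reduction and mapping steps above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code in core/tools/vmaf_roi.c: the helper names (ctu_count, reduce_mean_per_ctu, saliency_to_qp) are hypothetical, the saliency map is assumed to be a row-major float array in [0, 1], and rounding to the nearest integer is an assumption, since the decision text only specifies the clamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of CTUs needed to cover `dim` samples (rounds up). */
static int ctu_count(int dim, int ctu)
{
    return (dim + ctu - 1) / ctu;
}

/* Hypothetical helper: mean saliency per CTU. `sal` is a w*h row-major
 * map; `out` must hold ctu_count(w) * ctu_count(h) floats. Partial CTUs
 * at the right/bottom edges are averaged over their actual sample count. */
static void reduce_mean_per_ctu(const float *sal, int w, int h, int ctu,
                                float *out)
{
    assert(sal && out && w > 0 && h > 0 && ctu > 0);
    const int cols = ctu_count(w, ctu);
    const int rows = ctu_count(h, ctu);
    for (int cy = 0; cy < rows; cy++) {
        for (int cx = 0; cx < cols; cx++) {
            const int x0 = cx * ctu, y0 = cy * ctu;
            const int x1 = (x0 + ctu < w) ? x0 + ctu : w;
            const int y1 = (y0 + ctu < h) ? y0 + ctu : h;
            double sum = 0.0;
            for (int y = y0; y < y1; y++)
                for (int x = x0; x < x1; x++)
                    sum += sal[(size_t)y * (size_t)w + (size_t)x];
            const double n = (double)(x1 - x0) * (double)(y1 - y0);
            out[cy * cols + cx] = (float)(sum / n);
        }
    }
}

/* qp = clamp(-strength * (2 * saliency - 1), -12, +12).
 * Assumption: the result is rounded half away from zero. */
static int8_t saliency_to_qp(float saliency, float strength)
{
    float qp = -strength * (2.0f * saliency - 1.0f);
    if (qp < -12.0f) qp = -12.0f;
    if (qp > 12.0f) qp = 12.0f;
    return (int8_t)(qp < 0.0f ? qp - 0.5f : qp + 0.5f);
}
```

With a strength of 6, a CTU whose mean saliency is 0.75 maps to an offset of -3, a mean of 0.5 maps to 0, and any strength above 12 saturates at the clamp for fully salient or fully non-salient CTUs.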

Alternatives considered

Axis	Option	Pros	Cons	Decision and rationale
Format	ASCII per-row grid (chosen for x265)	Human-readable; matches x265's `--qpfile-style` precedent; trivial to diff in CI	~2-3x larger on disk than binary; slower to parse for very large grids	Selected for x265: the encoder's own qpfile-style is ASCII, so we follow that convention rather than fight it.
Format	Raw int8 binary (chosen for SVT-AV1)	Compact (1 byte per CTU); matches SVT-AV1's `--roi-map-file` byte layout	Not human-readable; needs a hex-dump tool to inspect	Selected for SVT-AV1: the encoder explicitly requires this layout, so there is no choice to make.
Format	Single universal format (e.g. JSON)	Encoder-agnostic on disk	Every encoder driver still needs a converter; defeats the purpose of "sidecar"	Rejected: it just moves the conversion cost to the consumer.
Reduction	Per-CTU mean (chosen)	Matches what most ROI heuristics use; smooth; partial-CTU clamping is straightforward	A single salient pixel inside a mostly-flat CTU gets averaged out	Selected: faces / focal subjects fill many pixels, so the mean tracks the perceptual signal well in practice.
Reduction	Per-CTU max	A single highly salient pixel still drives the CTU's offset; defends against under-allocating to small but important regions	Oversensitive to MobileSal noise; many CTUs end up at the ±12 clamp	Rejected: noise on a learned saliency map is the dominant failure mode, not under-coverage.
Reduction	Per-CTU 90th percentile	Compromise between mean and max	Adds a per-CTU sort; doesn't measurably outperform mean on Wave 1 sweeps	Deferred: revisit if mean shows under-allocation in real encodes.
Encoder coverage	x265 + SVT-AV1 day-one (chosen)	Covers the two encoders most users pair with libvmaf; one binary handles both	Slightly larger code surface; two emit paths to maintain	Selected: the cost of adding the second emit path is ~30 LoC; gating SVT-AV1 to a follow-up PR doubles review rounds for no real savings.
Encoder coverage	x265-only first, SVT-AV1 later	Smallest possible PR	Forces a follow-up PR + ADR for what is fundamentally one decision	Rejected: same review cost twice.
Signal blend	Saliency-only (chosen)	One signal, one model, one place to invest	A flat / textureless salient region still gets a strong negative offset even though encoders need fewer bits there	Selected for T6-2b: keep the contract simple.
Signal blend	Saliency × edge density	Better per-CTU fidelity; punishes flat regions appropriately	Needs a second pass (Sobel / gradient) per frame; couples the sidecar to the libvmaf feature graph	Deferred: tracked as a Wave 2 follow-up; the `vmaf-roi` CLI surface is forward-compatible (a `--blend edge-density` flag drops in cleanly).
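The two formats selected in the table above can be sketched as a pair of emit functions. This is an illustrative sketch only: the function names (emit_ascii_grid, emit_raw_int8) are hypothetical, and the exact header wording and the space separator in the ASCII grid are assumptions, since this ADR only states that the x265 output carries two # comment lines documenting frame / CTU / strength and that the SVT-AV1 output is headerless int8_t. The layout each encoder actually accepts should be checked against that encoder's documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* ASCII grid: two '#' comment lines, then one text row per CTU row.
 * Header wording and the space separator are assumptions. */
static int emit_ascii_grid(FILE *fp, const int8_t *qp, int cols, int rows,
                           int frame, int ctu, double strength)
{
    assert(fp && qp && cols > 0 && rows > 0);
    if (fprintf(fp, "# frame=%d ctu=%d\n", frame, ctu) < 0)
        return -1;
    if (fprintf(fp, "# strength=%g\n", strength) < 0)
        return -1;
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            /* Space between columns, newline after the last column. */
            const char *sep = (c + 1 < cols) ? " " : "\n";
            if (fprintf(fp, "%d%s", qp[r * cols + c], sep) < 0)
                return -1;
        }
    }
    return 0;
}

/* Raw binary: cols * rows signed bytes in row-major order, no header. */
static int emit_raw_int8(FILE *fp, const int8_t *qp, int cols, int rows)
{
    assert(fp && qp && cols > 0 && rows > 0);
    const size_t n = (size_t)cols * (size_t)rows;
    return fwrite(qp, sizeof(int8_t), n, fp) == n ? 0 : -1;
}
```

Both functions take the same per-CTU offset array, which is why supporting a second encoder costs only one extra emit path: the reduction and mapping stages stay shared.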

Consequences

Positive: encoder-side ROI now has a documented, lint-clean, test-covered fork tool that any encoder driver can shell out to. Same mobilesal ONNX feeds both scoring (T6-2a) and steering (T6-2b); no model duplication.
Positive: 8-bit-only contract is explicit in --bitdepth (only 8 accepted) and in docs/usage/vmaf-roi.md; 10/12-bit lands when the mobilesal extractor's bit-depth contract is finalised, not before.
Negative: per-CTU mean is a known signal-attenuation point; we accept it for Wave 1 and revisit if real encodes show under-allocation. The CLI is forward-compatible with a --blend flag for the edge-density follow-up.
Neutral / follow-ups:
Wave 2 should add multi-frame batch mode (one input, N frames per invocation) so encoder drivers don't pay the ORT-session cost N times. Tracked in the roadmap as a sub-bullet of T6-2b.
SVT-AV1's --roi-map-file reads one map per frame; the per-frame file naming convention used in docs/usage/vmaf-roi.md is the consumer's responsibility for now.

References

T6-2b roadmap entry: docs/ai/roadmap.md § Wave 1 saliency surface.
ADR-0218 / PR #208 — mobilesal saliency extractor (T6-2a, the scoring-side counterpart).
ADR-0042 — tiny-AI per-PR docs rule (each AI surface ships docs in the same PR).
ADR-0100 — project-wide doc-substance rule.
ADR-0108 — six deep-dive deliverables.
ADR-0141 — touched-file lint-clean rule (the refactor of parse_args / main / vmaf_roi_reduce_per_ctu was driven by this).
Source: roadmap T6-2b, Wave 1 saliency surface (planning dossier .workingdir2/).