ADR-0247: vmaf-roi sidecar binary for per-CTU QP offsets¶
- Status: Accepted
- Date: 2026-04-29
- Deciders: lusoris, Claude
- Tags:
tools,ai,roi,encoder
Context¶
T6-2a (ADR-0218 / PR #208) shipped the in-libvmaf mobilesal saliency extractor — same model, scoring side: it tells callers where the model expects perceptual error. T6-2b is the encoder-steering counterpart: a CLI sidecar that consumes the same saliency map and emits a per-CTU QP-offset file the encoder reads back. Two surfaces, one model.
The fork already ships several CLI tools under core/tools/ (vmaf, vmaf_bench); a new sidecar binary slots into that pattern with no public C-API impact and no library link surface beyond the existing libvmaf/dnn.h session API.
The decision space sits at the intersection of three orthogonal axes: the sidecar format (ASCII vs binary, encoder-specific), the reduction (mean vs max over CTU samples), and the encoder coverage (x265-only vs multi-encoder). Each pick locks in a contract that downstream encoder drivers will depend on.
Decision¶
We ship vmaf-roi as a fork-local sidecar binary at core/tools/vmaf_roi.c that:
- Consumes raw planar YUV input + a 0-based frame index, seeks to the requested frame with
fseeko(>2 GiB safe), reads only the luma plane. - Computes a per-pixel saliency map via the optional
--saliency-modelONNX session (vmaf_dnn_session_run_luma8), or via a deterministic center-weighted radial placeholder when no model is provided (smoke-test fallback only — explicitly documented as not for real encodes). - Reduces the per-pixel map to a per-CTU mean (not max) over each CTU's bounding box, with partial CTUs at right/bottom edges averaged over their actual sample count.
- Maps
[0, 1]saliency to a signed QP offset viaqp = clamp(-strength * (2 * saliency - 1), -12, +12)— high saliency drives the offset negative (boost quality), low saliency positive (save bits), neutral (~0.5) zero. - Emits two formats selected by
--encoder: ASCII grid for x265 (--qpfile-style, with two#comment header lines documenting frame / CTU / strength) and rawint8_tbinary for SVT-AV1 (--roi-map-file, no header). - Operates one frame per invocation; multi-frame batching is a shell driver loop (or a future built-in mode).
Alternatives considered¶
| Axis | Option | Pros | Cons | Why not chosen |
|---|---|---|---|---|
| Format | ASCII per-row grid (chosen for x265) | Human-readable; matches x265's --qpfile-style precedent; trivial to diff in CI | ~2 - 3x larger on disk than binary; slower to parse for very large grids | Selected for x265 — the encoder's own qpfile-style is ASCII, so we follow that convention rather than fight it. |
| Format | Raw int8 binary (chosen for SVT-AV1) | Compact (1 byte per CTU); matches SVT-AV1's --roi-map-file byte layout | Not human-readable; needs a hex-dump tool to inspect | Selected for SVT-AV1 — the encoder explicitly requires this layout, no choice. |
| Format | Single universal format (e.g. JSON) | Encoder-agnostic on disk | Every encoder driver still needs a converter; defeats the purpose of "sidecar" | Rejected: it just moves the conversion cost to the consumer. |
| Reduction | Per-CTU mean (chosen) | Matches what most ROI heuristics use; smooth; partial-CTU clamping is straightforward | A single salient pixel inside a mostly-flat CTU gets averaged out | Selected: faces / focal subjects fill many pixels, so the mean tracks the perceptual signal well in practice. |
| Reduction | Per-CTU max | One-pixel anomalies still bias the offset; defends against under-allocating to small but important regions | Wildly oversensitive to MobileSal noise; many CTUs end up at the +12/-12 clamp | Rejected: noise on a learned saliency map is the dominant failure mode, not under-coverage. |
| Reduction | Per-CTU 90th percentile | Compromise between mean and max | Adds a per-CTU sort; doesn't measurably outperform mean on Wave 1 sweeps | Deferred: revisit if mean shows under-allocation in real encodes. |
| Encoder coverage | x265 + SVT-AV1 day-one (chosen) | Covers the two encoders most users pair with libvmaf; one binary handles both | Slightly larger code surface; two emit paths to maintain | Selected: the cost of adding the second emit path is ~30 LoC; gating SVT-AV1 to a follow-up PR doubles review rounds for no real savings. |
| Encoder coverage | x265-only first, SVT-AV1 later | Smallest possible PR | Forces a follow-up PR + ADR for what is fundamentally one decision | Rejected: same review cost twice. |
| Signal blend | Saliency-only (chosen) | One signal, one model, one place to invest | A flat / textureless salient region still gets a strong negative offset even though encoders need fewer bits there | Selected for T6-2b: keep the contract simple. |
| Signal blend | Saliency × edge density | Better per-CTU fidelity; punishes flat regions appropriately | Needs a second pass (Sobel / gradient) per frame; couples the sidecar to the libvmaf feature graph | Deferred: tracked as a Wave 2 follow-up; the vmaf-roi CLI surface is forward-compatible (a --blend edge-density flag drops in cleanly). |
Consequences¶
- Positive: encoder-side ROI now has a documented, lint-clean, test-covered fork tool that any encoder driver can shell out to. Same
mobilesalONNX feeds both scoring (T6-2a) and steering (T6-2b); no model duplication. - Positive: 8-bit-only contract is explicit in
--bitdepth(only8accepted) and indocs/usage/vmaf-roi.md; 10/12-bit lands when themobilesalextractor's bit-depth contract is finalised, not before. - Negative: per-CTU mean is a known signal-attenuation point; we accept it for Wave 1 and revisit if real encodes show under-allocation. The CLI is forward-compatible with a
--blendflag for the edge-density follow-up. - Neutral / follow-ups:
- Wave 2 should add multi-frame batch mode (one input, N frames per invocation) so encoder drivers don't pay the ORT-session cost N times. Tracked in the roadmap as a sub-bullet of T6-2b.
- SVT-AV1's
--roi-map-filereads one map per frame; the per-frame file naming convention used indocs/usage/vmaf-roi.mdis the consumer's responsibility for now.
References¶
- T6-2b roadmap entry:
docs/ai/roadmap.md§ Wave 1 saliency surface. - ADR-0218 / PR #208 —
mobilesalsaliency extractor (T6-2a, the scoring-side counterpart). - ADR-0042 — tiny-AI per-PR docs rule (each AI surface ships docs in the same PR).
- ADR-0100 — project-wide doc-substance rule.
- ADR-0108 — six deep-dive deliverables.
- ADR-0141 — touched-file lint-clean rule (the refactor of
parse_args/main/vmaf_roi_reduce_per_ctuwas driven by this). - Source: roadmap T6-2b, Wave 1 saliency surface (planning dossier
.workingdir2/).