vmaf-roi: saliency-driven ROI sidecars for x265 / SVT-AV1
vmaf-roi is a sidecar binary that consumes a per-frame saliency map and emits an encoder-native per-CTU QP-offset file. It complements mobilesal (the scoring-side saliency extractor): same model, two surfaces — scoring the residual vs steering the encoder.
Binary name note: Built and installed as vmaf_roi (underscore, per core/tools/meson.build). Throughout this page the vmaf-roi (hyphen) form refers to the same binary; if you typed vmaf-roi and got "command not found", fall back to vmaf_roi.
This is T6-2b (sidecar). T6-2a shipped the in-libvmaf saliency extractor.
What it produces
For every CTU in a frame the tool emits a signed integer offset:
- High saliency (eyes, faces, focal subject) → negative offset → encoder spends more bits there.
- Low saliency (background, periphery) → positive offset → encoder saves bits.
- Neutral saliency (≈ 0.5) → zero offset → no change.
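The three rules above can be sketched as a small mapping function. This is an illustration only: the linear form around the neutral point 0.5 and the name `qp_offset` are assumptions, not taken from the vmaf-roi source. The 6.0 default gain and the ±12 clamp come from the `--strength` description on this page.

```python
def qp_offset(saliency: float, strength: float = 6.0, limit: int = 12) -> int:
    """Map a saliency value in [0, 1] to a signed per-CTU QP offset.

    Assumed mapping (hypothetical, for illustration): linear around the
    neutral point 0.5, scaled so that saliency 0.0 or 1.0 reaches
    +strength or -strength, then clamped to +/- limit.
    """
    raw = -2.0 * strength * (saliency - 0.5)
    offset = int(round(raw))
    return max(-limit, min(limit, offset))


if __name__ == "__main__":
    for s in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"saliency={s:.2f} -> offset={qp_offset(s):+d}")
```

Under this assumed mapping, raising the strength steepens the slope, but no offset ever leaves the ±12 range.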
Build
vmaf-roi is built whenever -Denable_tools=true (the default):
meson setup build -Denable_cuda=false -Denable_sycl=false -Denable_tools=true
ninja -C build tools/vmaf_roi
The binary depends only on libvmaf's public DNN surface (libvmaf/dnn.h). When libvmaf is built with -Denable_dnn=disabled, the --saliency-model path returns -ENOSYS and the tool falls back to a deterministic radial placeholder, which is useful only for smoke-testing the sidecar plumbing.
Synopsis
vmaf-roi --reference REF.yuv --width W --height H \
--frame N --output qpfile.txt \
[--pixel_format 420|422|444] [--bitdepth 8|10|12|16] \
[--ctu-size 8..128] [--encoder x265|svt-av1] \
[--strength FLOAT] [--saliency-model model.onnx]
Required flags:
| Flag | Meaning |
|---|---|
| --reference | Raw planar YUV input. Read with no demuxer. |
| --width | Frame width in luma samples (≤ 16 384). |
| --height | Frame height in luma samples (≤ 16 384). |
| --frame | 0-based frame index inside the YUV file. |
| --output | Destination path; - writes to stdout. |
Optional flags:
| Flag | Default | Description |
|---|---|---|
| --pixel_format | 420 | One of 420 / 422 / 444. Saliency reads luma only; chroma is skipped. |
| --bitdepth | 8 | One of 8, 10, 12, or 16. High-bit-depth planar YUV uses little-endian 16-bit containers; luma is downscaled to the 8-bit DNN contract. |
| --ctu-size | 64 | Luma samples per CTU side. Range 8..128. Use 64 for both x265 and SVT-AV1. |
| --encoder | x265 | Selects sidecar format: x265 (ASCII) or svt-av1 (binary int8_t). |
| --strength | 6.0 | QP-offset gain. Output is clamped to ±12 regardless of strength. |
| --saliency-model | unset | Path to a tiny ONNX [1, 1, H, W] luma → saliency model. |
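Because --reference is read with no demuxer, locating frame N in the file comes down to byte arithmetic on the width, height, pixel format and bit depth. The sketch below shows that arithmetic. It assumes frames are tightly packed with no padding and that chroma dimensions round up for odd sizes; the function names are hypothetical and not part of vmaf-roi.

```python
def frame_bytes(width: int, height: int,
                pixel_format: str = "420", bitdepth: int = 8) -> int:
    """Size in bytes of one raw planar YUV frame (assumed tightly packed)."""
    # 8-bit samples take one byte; 10/12/16-bit samples sit in 16-bit containers.
    bytes_per_sample = 1 if bitdepth == 8 else 2
    if pixel_format == "420":
        cw, ch = (width + 1) // 2, (height + 1) // 2
    elif pixel_format == "422":
        cw, ch = (width + 1) // 2, height
    elif pixel_format == "444":
        cw, ch = width, height
    else:
        raise ValueError(f"unsupported pixel format: {pixel_format}")
    luma = width * height
    chroma = 2 * cw * ch  # two chroma planes (U and V)
    return (luma + chroma) * bytes_per_sample


def luma_offset(frame: int, width: int, height: int,
                pixel_format: str = "420", bitdepth: int = 8) -> int:
    """Byte offset of the luma plane of a 0-based frame index."""
    return frame * frame_bytes(width, height, pixel_format, bitdepth)


if __name__ == "__main__":
    print(luma_offset(42, 1920, 1080))
```

Only the luma plane at that offset matters for saliency; the chroma bytes merely determine how far to skip to reach the next frame.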
Sidecar formats
x265 (--encoder x265)
ASCII grid, one row per CTU row, space-separated signed offsets, two # comment header lines documenting the run:
# vmaf-roi qpfile (x265, --qpfile-style)
# frame=0 ctu=64 cols=30 rows=17 strength=6.000
0 1 2 3 ...
...
Feed it to x265 via --qpfile.
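Before handing a sidecar to an encoder, it can help to parse the ASCII grid and check it against its own header. The sketch below reads the two-line `#` header and the offset rows as described in this section. The helper names are hypothetical, and the ceiling-division grid size is an assumption, although it agrees with the sample header (cols=30, rows=17 at ctu=64 for a 1920x1080 frame).

```python
def grid_dims(width: int, height: int, ctu: int = 64) -> tuple:
    """CTU columns and rows, assuming partial CTUs at the edges are counted."""
    return (-(-width // ctu), -(-height // ctu))  # ceiling division


def parse_roi_grid(text: str):
    """Parse the ASCII sidecar into (header dict, list of offset rows)."""
    header, rows = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Collect key=value tokens such as "cols=30" from comment lines.
            for token in line[1:].split():
                if "=" in token:
                    key, value = token.split("=", 1)
                    header[key] = value
            continue
        rows.append([int(tok) for tok in line.split()])
    return header, rows


def check_grid(header: dict, rows: list) -> bool:
    """True when the grid shape matches the cols/rows declared in the header."""
    cols, nrows = int(header["cols"]), int(header["rows"])
    return len(rows) == nrows and all(len(r) == cols for r in rows)
```

A caller would typically read the file, call `parse_roi_grid`, and refuse to start the encode if `check_grid` returns False.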
SVT-AV1 (--encoder svt-av1)
Raw binary: one int8_t per CTU, row-major, no header. The file is exactly cols * rows bytes long. Pass it via SVT-AV1's ROI map input.
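The binary layout is simple enough to round-trip in a few lines, which is useful for inspecting a sidecar or checking its length. The sketch below packs and unpacks a grid as signed bytes in row-major order with no header, as described above. The function names are hypothetical and the caller has to supply cols and rows, since the file itself carries no dimensions.

```python
import struct


def pack_roi_map(rows: list) -> bytes:
    """Serialise a rectangular grid of offsets as row-major int8 values."""
    cols = len(rows[0])
    flat = []
    for row in rows:
        if len(row) != cols:
            raise ValueError("grid is not rectangular")
        flat.extend(row)
    # "b" is a signed 8-bit integer; values outside -128..127 raise struct.error.
    return struct.pack(f"{len(flat)}b", *flat)


def unpack_roi_map(data: bytes, cols: int, rows: int) -> list:
    """Rebuild the grid; the length must be exactly cols * rows bytes."""
    if len(data) != cols * rows:
        raise ValueError(f"expected {cols * rows} bytes, got {len(data)}")
    flat = struct.unpack(f"{cols * rows}b", data)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
```

The length check is the only validation available for this format, so a mismatch usually means the CTU size or frame dimensions differ from those used to generate the file.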
Examples
One-shot for frame 42 with the default placeholder
vmaf-roi --reference clip.yuv --width 1920 --height 1080 \
--frame 42 --output frame_42.qp \
--encoder x265 --ctu-size 64 --strength 6.0
With a real saliency model
vmaf-roi --reference clip.yuv --width 1920 --height 1080 \
--frame 42 --output frame_42.qp \
--saliency-model model/tiny/mobilesal.onnx \
--encoder x265 --strength 8.0
10-bit planar input
vmaf-roi --reference hdr_clip_420p10le.yuv --width 3840 --height 2160 \
--frame 42 --pixel_format 420 --bitdepth 10 \
--output frame_42.qp --encoder x265 \
--saliency-model model/tiny/mobilesal.onnx
Per-frame loop (shell-driver pattern)
for f in $(seq 0 99); do
vmaf-roi --reference clip.yuv --width 1920 --height 1080 \
--frame "$f" --output "qp/frame_${f}.qp" \
--saliency-model model/tiny/mobilesal.onnx
done
(A built-in batch mode is on the roadmap; see roadmap §2.3.)
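The same per-frame pattern can be driven from Python when the encoder driver is not a shell script. This is a minimal sketch: it uses only flags listed in the Synopsis, calls the binary by its installed name vmaf_roi (see the binary name note), and the helper names `build_command` and `run_frames` are hypothetical.

```python
import subprocess


def build_command(frame: int, reference: str, width: int, height: int,
                  out_dir: str, model: str = None,
                  binary: str = "vmaf_roi") -> list:
    """Build the argument list for one single-frame invocation."""
    cmd = [
        binary,
        "--reference", reference,
        "--width", str(width),
        "--height", str(height),
        "--frame", str(frame),
        "--output", f"{out_dir}/frame_{frame}.qp",
    ]
    if model is not None:
        cmd += ["--saliency-model", model]
    return cmd


def run_frames(frames, **kwargs) -> None:
    """Invoke the tool once per frame; stops at the first non-zero exit."""
    for frame in frames:
        subprocess.run(build_command(frame, **kwargs), check=True)


if __name__ == "__main__":
    run_frames(range(100), reference="clip.yuv", width=1920, height=1080,
               out_dir="qp", model="model/tiny/mobilesal.onnx")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.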
Caveats
- Placeholder is for smoke testing only. Without --saliency-model the tool emits a center-weighted radial map that has zero perceptual validity. Do not drive a real encode from it.
- High-bit-depth input is luma8-normalised. --bitdepth 10|12|16 accepts little-endian 16-bit planar YUV, skips chroma using the selected --pixel_format, and downscales luma to the saliency model's existing 8-bit input contract. The ROI sidecar itself remains per-CTU QP offsets, not a high-bit-depth image output.
- Single frame per invocation. Wave 1 keeps the sidecar one-frame at a time so callers can reuse it from any encoder driver. A streaming variant is a follow-up.
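To make the high-bit-depth caveat concrete, the sketch below reduces little-endian 16-bit luma samples to 8 bits. It assumes a plain right shift by (bitdepth - 8) with truncation; the actual tool may round or scale differently, and the function name is hypothetical.

```python
import struct


def luma16_to_8(raw: bytes, bitdepth: int) -> bytes:
    """Downscale little-endian 16-bit luma samples to 8-bit.

    Assumption for illustration: a truncating right shift, not necessarily
    what vmaf-roi does internally.
    """
    if bitdepth not in (10, 12, 16):
        raise ValueError("bitdepth must be 10, 12, or 16")
    count = len(raw) // 2
    samples = struct.unpack(f"<{count}H", raw[:count * 2])
    shift = bitdepth - 8
    # Clamp in case a sample exceeds the nominal range for its bit depth.
    return bytes(min(255, s >> shift) for s in samples)
```

For 10-bit input this maps 0..1023 onto 0..255, so the saliency model sees the same value range as it would for 8-bit content.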
See also
- ADR-0247: the decision record (sidecar format, encoder coverage, signal blend).
- docs/ai/roadmap.md §2.3: Wave 1 saliency surface.
- docs/usage/cli.md: index of fork CLIs.