vmaf-roi: saliency-driven ROI sidecars for x265 / SVT-AV1

vmaf-roi is a sidecar binary that consumes a per-frame saliency map and emits an encoder-native per-CTU QP-offset file. It complements mobilesal (the scoring-side saliency extractor). Both use the same model on two surfaces: mobilesal scores the residual, vmaf-roi steers the encoder.

Binary name note: the tool is built and installed as vmaf_roi (underscore, per core/tools/meson.build). Throughout this page the vmaf-roi (hyphen) form refers to the same binary; if you typed vmaf-roi and got "command not found", run vmaf_roi instead.

This is T6-2b (sidecar). T6-2a shipped the in-libvmaf saliency extractor.

What it produces

For every CTU in a frame the tool emits a signed integer offset:

qp_offset = clamp(-strength * (2 * saliency - 1), -12, +12)

High saliency (eyes, faces, focal subject) → negative offset → encoder spends more bits there.
Low saliency (background, periphery) → positive offset → encoder saves bits.
Neutral saliency (≈ 0.5) → zero offset → no change.
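
The mapping above can be sketched in a few lines of Python. This is an illustration of the formula only, not the tool's source: `qp_offset` is a hypothetical helper name, and round-to-nearest is an assumption, since this page states only that the result is a signed integer.

```python
def qp_offset(saliency: float, strength: float = 6.0) -> int:
    """Map a saliency value in [0, 1] to a clamped integer QP offset."""
    # Formula from this page: clamp(-strength * (2 * saliency - 1), -12, +12)
    raw = -strength * (2.0 * saliency - 1.0)
    clamped = max(-12.0, min(12.0, raw))
    # Assumption: round to nearest; the tool's actual rounding may differ.
    return int(round(clamped))


if __name__ == "__main__":
    for s in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"saliency={s:.2f} -> qp_offset={qp_offset(s):+d}")
```

With the default strength of 6.0 the raw value never leaves the range ±6, so the ±12 clamp only takes effect for strengths above 12.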

Build

vmaf-roi is built whenever -Denable_tools=true (the default):

meson setup build -Denable_cuda=false -Denable_sycl=false -Denable_tools=true
ninja -C build tools/vmaf_roi

The binary depends only on libvmaf's public DNN surface (libvmaf/dnn.h). When libvmaf is built with -Denable_dnn=disabled, the --saliency-model path returns -ENOSYS and the tool falls back to a deterministic radial placeholder, which is useful only for smoke-testing the sidecar plumbing.

Synopsis

vmaf-roi --reference REF.yuv --width W --height H \
         --frame N --output qpfile.txt \
         [--pixel_format 420|422|444] [--bitdepth 8|10|12|16] \
         [--ctu-size 8..128] [--encoder x265|svt-av1] \
         [--strength FLOAT] [--saliency-model model.onnx]

Required flags:

Flag	Meaning
`--reference`	Raw planar YUV input. Read with no demuxer.
`--width`	Frame width in luma samples (≤ 16 384).
`--height`	Frame height in luma samples (≤ 16 384).
`--frame`	0-based frame index inside the YUV file.
`--output`	Destination path; `-` writes to stdout.

Optional flags:

Flag	Default	Description
`--pixel_format`	`420`	One of `420` / `422` / `444`. Saliency reads luma only; chroma is skipped.
`--bitdepth`	`8`	One of `8`, `10`, `12`, or `16`. High-bit-depth planar YUV uses little-endian 16-bit containers; luma is downscaled to the 8-bit DNN contract.
`--ctu-size`	`64`	Luma samples per CTU side. Range `8..128`. Use 64 for both x265 and SVT-AV1.
`--encoder`	`x265`	Selects sidecar format: `x265` (ASCII) or `svt-av1` (binary `int8_t`).
`--strength`	`6.0`	QP-offset gain. Output is clamped to ±12 regardless of strength.
`--saliency-model`	unset	Path to a tiny ONNX `[1, 1, H, W]` luma → saliency model.

Sidecar formats

x265 (`--encoder x265`)

ASCII grid: two # comment header lines documenting the run, followed by one line per CTU row of space-separated signed offsets:

# vmaf-roi qpfile (x265, --qpfile-style)
# frame=0 ctu=64 cols=30 rows=17 strength=6.000
0 1 2 3 ...
...

Feed it to x265 via --qpfile:

x265 --input-res 1920x1080 --fps 30 \
     --qpfile vmaf_roi_frame_0.txt \
     -o out.h265 input.yuv
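
As a sanity check before handing the file to an encoder, the ASCII grid can be parsed and its shape compared with the frame geometry. The sketch below is hedged: `parse_x265_sidecar` and `expected_grid` are hypothetical helper names, and the ceiling division is an assumption inferred from the header sample above, where 1920x1080 at ctu=64 yields cols=30 rows=17. The sample text inside the script is synthetic.

```python
def parse_x265_sidecar(text: str) -> list[list[int]]:
    """Parse the ASCII grid: skip '#' comment lines, one CTU row per line."""
    grid = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        grid.append([int(tok) for tok in line.split()])
    return grid


def expected_grid(width: int, height: int, ctu: int) -> tuple[int, int]:
    """Return (cols, rows), assuming partial edge CTUs count as full cells."""
    cols = (width + ctu - 1) // ctu
    rows = (height + ctu - 1) // ctu
    return cols, rows


if __name__ == "__main__":
    # Synthetic sidecar for a 192x100 frame with 64-sample CTUs.
    sample = (
        "# vmaf-roi qpfile (x265, --qpfile-style)\n"
        "# frame=0 ctu=64 cols=3 rows=2 strength=6.000\n"
        "0 -3 2\n"
        "1 -6 4\n"
    )
    grid = parse_x265_sidecar(sample)
    cols, rows = expected_grid(192, 100, 64)
    if len(grid) != rows or any(len(r) != cols for r in grid):
        raise SystemExit("sidecar shape does not match frame geometry")
    print(f"ok: {rows} rows x {cols} cols")
```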

SVT-AV1 (`--encoder svt-av1`)

Raw binary: one int8_t per CTU, row-major, no header. The file length is exactly cols * rows bytes. Pass it via SVT-AV1's ROI map input:

SvtAv1EncApp -i input.yuv -w 1920 -h 1080 \
     --roi-map-file vmaf_roi_frame_0.bin \
     -b out.ivf
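
Because the binary sidecar carries no header, the reader has to supply cols and rows and can check only the total length. A minimal Python sketch of that layout follows, using the standard `struct` module; `pack_roi_map` and `unpack_roi_map` are hypothetical helper names, not part of vmaf-roi.

```python
import struct


def pack_roi_map(grid: list[list[int]]) -> bytes:
    """Serialise a row-major grid as one signed byte (int8) per CTU."""
    flat = [v for row in grid for v in row]
    # 'b' is a signed 8-bit integer; struct raises on out-of-range values.
    return struct.pack(f"{len(flat)}b", *flat)


def unpack_roi_map(blob: bytes, cols: int, rows: int) -> list[list[int]]:
    """Rebuild the grid; the caller must know cols and rows in advance."""
    if len(blob) != cols * rows:
        raise ValueError(f"expected {cols * rows} bytes, got {len(blob)}")
    flat = struct.unpack(f"{cols * rows}b", blob)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


if __name__ == "__main__":
    grid = [[0, -3, 2], [1, -6, 4]]
    blob = pack_roi_map(grid)
    print(len(blob), unpack_roi_map(blob, cols=3, rows=2) == grid)
```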

Examples

One-shot for frame 42 with the default placeholder

vmaf-roi --reference clip.yuv --width 1920 --height 1080 \
         --frame 42 --output frame_42.qp \
         --encoder x265 --ctu-size 64 --strength 6.0

With a real saliency model

vmaf-roi --reference clip.yuv --width 1920 --height 1080 \
         --frame 42 --output frame_42.qp \
         --saliency-model model/tiny/mobilesal.onnx \
         --encoder x265 --strength 8.0

10-bit planar input

vmaf-roi --reference hdr_clip_420p10le.yuv --width 3840 --height 2160 \
         --frame 42 --pixel_format 420 --bitdepth 10 \
         --output frame_42.qp --encoder x265 \
         --saliency-model model/tiny/mobilesal.onnx

Per-frame loop (shell-driver pattern)

for f in $(seq 0 99); do
    vmaf-roi --reference clip.yuv --width 1920 --height 1080 \
             --frame "$f" --output "qp/frame_${f}.qp" \
             --saliency-model model/tiny/mobilesal.onnx
done

(A built-in batch mode is on the roadmap; see roadmap §2.3.)
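
For callers whose encoder driver is already written in Python, the shell loop above could be expressed roughly as follows. This is a sketch rather than part of the tool: `roi_command` is a hypothetical helper, only flags documented on this page are used, and the script probes PATH for both vmaf_roi and vmaf-roi. If neither is found it prints the commands instead of running them.

```python
import shutil
import subprocess
from pathlib import Path


def roi_command(binary, reference, width, height, frame, output, model=None):
    """Build the argv for one single-frame vmaf-roi invocation."""
    cmd = [
        binary,
        "--reference", reference,
        "--width", str(width),
        "--height", str(height),
        "--frame", str(frame),
        "--output", output,
    ]
    if model is not None:
        cmd += ["--saliency-model", model]
    return cmd


if __name__ == "__main__":
    # The installed name is vmaf_roi; the hyphenated name is tried as a fallback.
    binary = shutil.which("vmaf_roi") or shutil.which("vmaf-roi")
    for f in range(100):
        cmd = roi_command(binary or "vmaf_roi", "clip.yuv", 1920, 1080, f,
                          f"qp/frame_{f}.qp", "model/tiny/mobilesal.onnx")
        if binary is None:
            print(" ".join(cmd))  # dry run: tool not found on PATH
        else:
            Path("qp").mkdir(exist_ok=True)
            subprocess.run(cmd, check=True)  # stop on the first failure
```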

Caveats

Placeholder is for smoke testing only. Without --saliency-model the tool emits a center-weighted radial map that has zero perceptual validity. Do not drive a real encode from it.
High-bit-depth input is luma8-normalised. --bitdepth 10|12|16 accepts little-endian 16-bit planar YUV, skips chroma using the selected --pixel_format, and downscales luma to the saliency model's existing 8-bit input contract. The ROI sidecar itself remains per-CTU QP offsets, not a high-bit-depth image output.
Single frame per invocation. Wave 1 keeps the sidecar to one frame at a time so callers can reuse it from any encoder driver. A streaming variant is a follow-up.
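
To make the raw-YUV addressing and the high-bit-depth caveat concrete, the sketch below computes where the luma plane of frame N starts in a planar file and reduces 16-bit-container samples to 8 bits. It rests on stated assumptions: planes are stored Y, U, V with no padding, odd dimensions round chroma up, and the downscale is a plain right shift by bitdepth - 8 (the tool's actual rounding is not documented here). All function names are hypothetical.

```python
def plane_sizes(width: int, height: int, pixel_format: str, bitdepth: int):
    """Return (luma_bytes, chroma_bytes) for one planar YUV frame."""
    bytes_per_sample = 1 if bitdepth == 8 else 2  # 16-bit containers above 8
    if pixel_format == "420":
        cw, ch = (width + 1) // 2, (height + 1) // 2
    elif pixel_format == "422":
        cw, ch = (width + 1) // 2, height
    elif pixel_format == "444":
        cw, ch = width, height
    else:
        raise ValueError(f"unsupported pixel format: {pixel_format}")
    luma = width * height * bytes_per_sample
    chroma = 2 * cw * ch * bytes_per_sample  # U and V planes together
    return luma, chroma


def luma_offset(frame: int, width: int, height: int,
                pixel_format: str = "420", bitdepth: int = 8) -> int:
    """Byte offset of the luma plane of a 0-based frame index."""
    luma, chroma = plane_sizes(width, height, pixel_format, bitdepth)
    return frame * (luma + chroma)


def luma_to_8bit(samples_le: bytes, bitdepth: int) -> bytes:
    """Downscale little-endian 16-bit luma samples to 8 bits by shifting."""
    if bitdepth == 8:
        return samples_le
    shift = bitdepth - 8  # assumption: truncating shift, no rounding
    out = bytearray(len(samples_le) // 2)
    for i in range(len(out)):
        value = int.from_bytes(samples_le[2 * i:2 * i + 2], "little")
        out[i] = min(255, value >> shift)
    return bytes(out)


if __name__ == "__main__":
    print(luma_offset(42, 1920, 1080, "420", 8))
    print(plane_sizes(3840, 2160, "420", 10))
```

Skipping chroma then amounts to reading only the first luma_bytes at that offset and ignoring the chroma_bytes that follow.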