Research-0046: vmaf-tune saliency-aware ROI tuning (Bucket #2)¶
- Status: Companion to ADR-0293
- Date: 2026-05-03
- Tags: tooling, ai, saliency, ffmpeg, codec, fork-local
Question¶
Bucket #2 of the PR #354 audit calls for saliency-aware ROI tuning in vmaf-tune. The fork already ships:
saliency_student_v1.onnx(PR #359, BSD-3-Clause-Plus-Patent, ~113K params, fork-trained on DUTS-TR) — the model.vmaf-roiC sidecar (ADR-0247) — emits per-CTU QP-offset files for x265 / SVT-AV1, one frame per invocation.vmaf-tunePython harness (ADR-0237 Phase A, PR #329) — drives ffmpeg, scores with libvmaf, emits JSONL.
The design question is: how should vmaf-tune consume the saliency signal — by re-implementing inference in Python, or by shelling out to vmaf-roi?
Option space¶
Option A — pure-Python ONNX inference (chosen)¶
saliency.py loads saliency_student_v1.onnx via onnxruntime, samples N frames from the source YUV, runs inference, averages into one mask, maps to QP offsets, reduces to per-MB, writes an x264 --qpfile.
- Pro: zero binary dependency on a built libvmaf —
vmaf-tuneis a pip-installable Python package; pulling in a built C binary for every recommend invocation is friction. - Pro: clean test-seam (
session_factory=…); the test suite runs without onnxruntime or ffmpeg installed. - Pro: Python is the natural layer for x264 qpfile orchestration; C-side
vmaf-roiwrites one file per frame and would need a Python-side aggregator anyway. - Con: numeric duplication of the saliency→QP signal blend (~5 numpy lines vs ~25 C lines — bit-for-bit equivalent, pinned by one test).
- Con: requires
onnxruntimefor the saliency code path. Mitigated by graceful fallback to plain encode when the import fails.
Option B — shell out to vmaf-roi¶
Drive vmaf-roi once per sampled frame, parse its output, average, write the qpfile.
- Pro: one source of truth for the signal blend — no duplication with the C code.
- Pro: covers x265 / SVT-AV1 sidecar formats for free.
- Con: requires a built libvmaf tree on PATH at runtime (
vmaf-roiis built when-Denable_tools=true);vmaf-tunebecomes harder to ship as a standalone Python package. - Con: per-frame subprocess overhead (~50 ms each on a typical CPU) and per-frame YUV re-decode.
- Con:
vmaf-roiis one-frame-per-invocation today (Wave 1); a per-clip aggregate needs Python-side aggregation regardless.
Option C — bake into corpus¶
Add a saliency_aware: bool row column and let the grid sweep emit saliency-aware encodes alongside plain ones.
- Pro: one subcommand, no new CLI surface.
- Con: conflates "sweep a grid" (corpus) with "produce a recommended encode" (recommend); downstream Phase B (target-VMAF bisect) and Phase C (per-shot CRF predictor) would have to special-case saliency-aware rows.
Decision¶
Option A. The graceful-fallback posture covers the no-onnxruntime case so callers always get a result; the bit-for-bit contract with vmaf-roi's C signal blend is pinned by test_saliency_to_qp_map_neutral_is_zero + test_saliency_to_qp_map_high_saliency_lowers_qp. x265 / SVT-AV1 extension is a one-file follow-up under codec_adapters/ once a qpfile formatter for those encoders lands; until then those codecs route through vmaf-roi directly.
Open follow-ups¶
- Per-frame ROI (rather than per-clip aggregate) once
vmaf-tunedecodes per-frame. - x265 / SVT-AV1 codec adapters with native qpfile / ROI-map emitters.
- Real-binary integration coverage (ffmpeg + onnxruntime + real YUVs) gated behind a
--integrationpytest mark; tracked alongside Phase B. - Bitrate-savings empirical sweep (Pareto curve over
--saliency-offset∈ {-2, -4, -6, -8}) — lands with Phase B.