Research-0086 — Video-temporal saliency feasibility for ROI-encode tuning¶

Status: Active. Companion to ADR-0325.
Date: 2026-05-08
Tags: ai, saliency, video-saliency, dnn, vmaf-tune, roi, fork-local, design

Question¶

The fork ships saliency_student_v1 (image-level, ~113 K params, fork-trained on DUTS-TR, BSD-3-Clause-Plus-Patent, ADR-0286) and a vmaf-tune ROI-encode pipeline (ADR-0293, tools/vmaf-tune/src/vmaftune/saliency.py) that computes saliency on N evenly-sampled frames per shot and averages them into a single per-clip mask before reducing to an x264 --qpfile. The aggregator is a true mean over independently-inferred per-frame masks — no temporal model, no recurrence, no 3D conv. Eyes track motion; a model that ignores motion is a known approximation.

The question this digest answers: should the fork ship a true video-saliency model next, or stay on temporal pooling of the existing 2D student? Sub-questions on (a) which dataset, (b) which model architecture, (c) at what cost, and (d) on what timeline are answered in turn so the companion ADR can lock a phased rollout.

The fork's ROI-encode targets are x264 --qpfile, x265 --qpfile, and SVT-AV1 ROI-map. None of these consume a per-frame saliency model at runtime — they consume one or more pre-computed QP-offset maps per shot. The model runs once, in the harness, before the encoder starts. That moves "inference cost" from a hot to a cold path and eliminates the academic-SOTA-only-runs-in-PyTorch risk early.

Method¶

WebSearch + WebFetch on five axes, dated 2026-05-08:

Datasets: DHF1K (Wang et al., CVPR 2018), Hollywood-2 saliency, UCF-Sports saliency, LEDOV (Jiang et al.), AViMoS (AIM 2024).
Models: TASED-Net (ICCV 2019), UNISAL (ECCV 2020), SalEMA (BMVC 2019), ViNet-v2 (ICASSP 2025), STSANet, Mamba-based "ZenithChaser" (AIM 2024 efficient track), and the "Minimalistic Video Saliency Prediction" (arXiv 2502.00397, 2025).
Image-vs-video on ROI encoding: literature on saliency-driven HEVC rate-control BD-rate gains.
Tiny / efficient video saliency: ONNX-deployable, sub-1M parameter regime.
Temporal pooling baselines: SalEMA's "EMA-on-2D-features" finding plus the DHF1K paper's own comparisons.

Findings¶

Axis 1 — Datasets¶

Dataset	Year	Size	License	MOS / fixations	Notes
DHF1K [1]	2018	1000 videos, splits 600 / 100 / 300; Hollywood-2 sibling alone is 74.6 GB	CC BY 4.0	Eye-tracker, 17 observers	Permissive; standard benchmark; the gold reference. Distribution gated through Google Drive — same UI-clickwrap risk as ADR-0257 hit on MobileSal
AViMoS (AIM 2024) [2]	2024	1500 videos, 1080p, ~170 GB extracted ground-truth	CC-BY (per challenge page)	Crowdsourced mouse tracking, > 5000 observers across the corpus	License is permissive but mouse-tracking is a proxy for fixations, not the gold-standard eye-tracker signal the older corpora use
Hollywood-2 saliency [1]	bundled with DHF1K	74.6 GB	inherits DHF1K's CC BY 4.0	Eye-tracker, 16 observers	Movie clips; biases the model toward narrative content
UCF-Sports saliency [1]	bundled with DHF1K	smaller	inherits	Eye-tracker	Sports action; biases toward fast motion
LEDOV [3]	2018	538 videos, 179 336 frames	License not stated on the GitHub repo or paper landing page; needs author contact (Jiang et al., BUAA)	Eye-tracker	Larger than DHF1K in frame count; license-blocker until clarified

DHF1K is the de-facto evaluation reference. AViMoS is more recent and larger but mouse-tracked, which lower-bounds the perceived gap to a true eye-tracking model.

Axis 2 — Models¶

Model	Year	Params	License	DHF1K CC / NSS (paper claim)	Notes
TASED-Net [4]	2019	21.2 M params, 63.2 GFLOPs (T=32 late-aggregation variant) [4a]	MIT	(paper-reported SOTA at submission)	S3D backbone pretrained on Kinetics-400; 3D-conv; the canonical "true video saliency" baseline
UNISAL [5]	2020	sub-5 M params (paper claims "5–20× smaller than competing deep methods"; backbone is MobileNetV2 + Bypass-RNN)	Apache-2.0	SOTA on DHF1K, Hollywood-2, UCF-Sports at one parameter set	Encoder–RNN–decoder, joint image+video training; the architecture closest to "lightweight teacher we can distil from"
SalEMA [6]	2019	small (paper builds on a 2D backbone + EMA over a single conv state)	(PyTorch repo, license not surfaced on paper page; verify before use)	Reports "almost as good as ConvLSTM" without retraining the 2D backbone	Direct evidence that EMA over a 2D backbone closes most of the gap to true temporal models
ViNet-v2 (ViNet-S, ViNet-A) [7]	ICASSP 2025	ViNet-S 36 MB, ViNet-A 148 MB (~5–35 M params equivalent); ViNet-S claims > 1000 fps on a GPU	CC BY-NC-SA 4.0	SOTA on DHF1K + 6 audio-visual datasets at ensemble	License-incompatible with the fork's BSD-3-Clause-Plus-Patent (re-runs the ADR-0257 / MobileSal blocker)
AIM 2024 winners (UMT-based) [2]	2024	top entry (CV_MM) 420.5 M params; 2nd (VistaHL) 187.7 M; 3rd (PeRCeiVe Lab) UMT-based	per-team (no blanket licence)	Top of AViMoS leaderboard	Far above the fork's tiny-AI footprint; not shippable as a runtime model
AIM 2024 efficient (ZenithChaser, Mamba) [2]	2024	0.19 M params	per-team (no blanket licence)	5th–6th overall on AViMoS leaderboard	First-ever "tiny-class" video-saliency entry; demonstrates the regime is reachable. Mamba selective-state-space op is not on the fork's ONNX op-allowlist today
"Minimalistic" (arXiv 2502.00397) [8]	2025	not extracted from the abstract page; positions itself as efficient-decoder + STAL-cued	per-paper repo	claims competitive vs ViNet / TASED-Net	Useful design reference but no shippable open weights confirmed at this digest's date

Axis 3 — Image vs. video saliency for ROI encoding¶

The literature on saliency-driven HEVC and AV1 ROI rate-control reports BD-rate savings in the 3–12 % range depending on content type and saliency-map quality [9]. Critically, none of the cited HEVC papers isolate "image saliency averaged across N frames" vs. "true video saliency model" as separate ablations. The 3–12 % band is an upper-bound on the combined gain from any saliency signal of "reasonable quality" plus the rate-control wiring.

Empirical signal from the saliency literature itself: SalEMA's headline finding [6] is that an exponential moving average over a frozen 2D backbone "achieves almost as well as a sophisticated ConvLSTM recurrence" on DHF1K, without retraining. That establishes the EMA-of-2D regime as a strong baseline — already on the order of 80–90 % of what a heavier ConvLSTM model captures, on the standard CC / NSS / AUC-J / SIM metric set. ViNet-S's 2025 ICASSP claim of > 1000 fps [7] also implies the per-frame cost of a 2D-style backbone, evaluated densely, is not the bottleneck. The bottleneck is recurrent state across frames — and EMA models that with a single weighted running-mean update.

For the fork's specific ROI-encode use case, the saliency mask is reduced to per-MB granularity (16×16 luma) before quantisation. Any spatial precision that survives a 16× downsample is what the encoder actually consumes. Sub-block jitter from a slightly noisier per-frame mean is averaged out by the per-MB reduce. The per-MB step is therefore an inherent low-pass over the saliency signal that flatters cheaper temporal aggregators relative to true 3D-conv models.

Axis 4 — Tiny / efficient video saliency¶

The space below 1 M parameters is reachable: ZenithChaser (AIM 2024, 0.19 M params, Mamba) demonstrates it. UNISAL (MobileNetV2 + Bypass-RNN) sits in the 5 M range. The "Minimalistic" paper [8] explicitly targets parameter-efficient decoders. None of these have weights distributed under a BSD-/Apache-/MIT-compatible licence with direct-HTTP download as of this digest's date — the ZenithChaser model is reproducible from the published paper but has no off-the-shelf ONNX export, and Mamba's selective-state-space op is not on the fork's core/src/dnn/op_allowlist.c.

The shippable path is therefore knowledge distillation: distil a tiny student (target ~150–250 K params, similar regime to saliency_student_v1) from a permissive teacher (UNISAL, Apache-2.0) on a permissive corpus (DHF1K, CC BY 4.0). The student ships under BSD-3-Clause-Plus-Patent, the teacher and dataset never ship in-tree.

Axis 5 — Temporal pooling tricks¶

Two specific aggregators emerge from the literature, both compatible with a frozen saliency_student_v1 (no retraining needed for Phase 1):

Per-frame mean (status quo) — N=8 evenly-spaced frames, per-pixel arithmetic mean. Already in tools/vmaf-tune/src/vmaftune/saliency.py.
EMA over per-frame masks — per-pixel m_t = α · sal_t + (1 − α) · m_{t−1} with α tuned per the SalEMA recipe [6]. Captures temporal coherence (a salient region in frames 1–3 stays salient at t=4 even if the per-frame model slightly mis-fires). Adds a single hyperparameter, no new model.
Motion-weighted mean — weight per-frame mask by per-frame inter-frame difference (a cheap proxy for "this frame has motion the eye is tracking"). Adds one frame-difference computation per sampled frame; no new model.

Aggregator (2) is the SalEMA-validated choice. Aggregator (3) is the content-aware variant that surfaces the motion signal the per-clip mean explicitly throws away.

Recommendation¶

Phase 1 — temporal-pooling baseline, frozen saliency_student_v1 (ship next). Replace the per-clip mean in vmaf-tune/saliency.py with a configurable aggregator: mean (today), ema (default), motion-weighted. No new model, no new ONNX file, no new training script. Cost: ~50 LOC + tests. Expected BD-rate-vs-status-quo: at most a small fraction of the 3–12 % band [9], probably comparable to the SalEMA "EMA closes the ConvLSTM-vs-2D gap" finding [6]. Ships in days, not weeks.

Phase 2 — video_saliency_student_v1 via distillation (follow-up). Distil a ~200–300 K-parameter student from UNISAL (Apache-2.0, MobileNetV2 + Bypass-RNN) on DHF1K (CC BY 4.0). Architecture mirrors saliency_student_v1 (TinyU-Net) plus a Bypass-RNN-style single-state recurrence — i.e., the Phase-1 EMA promoted into a learned per-pixel update rule, exported as one ONNX graph. Use only ops on core/src/dnn/op_allowlist.c (the existing TinyU-Net op-set plus a Where / Mul / Add chain that already lives in ONNX opset 17). Training script under ai/scripts/train_video_saliency_student.py, recipe modelled on train_saliency_student.py. Distil from UNISAL outputs as soft labels on DHF1K-train; eval on DHF1K-val with CC / NSS / AUC-J / SIM, with per-MB 16× downsampled mask IoU as the application-aligned metric. Ship the trained .onnx only; UNISAL weights are not redistributed in-tree. Estimated wall-clock: ~30 minutes on a single GPU at the fork's existing scale.

Phase 3 — ONNX-export + harness integration. Register the new model in model/tiny/video_saliency_student_v1.onnx; add a docs/ai/models/video_saliency_student_v1.md model card; wire a --saliency-mode {image, video} flag into vmaf-tune recommend that switches between today's image-saliency student and the new video-saliency student; default stays image until the BD-rate sweep confirms a positive lift on the corpus.

The decision matrix's runner-up is "ship Phase 1 only and stop." It is the conservative outcome if the Phase-1 EMA closes the gap on the application-aligned per-MB IoU metric, since the per-MB downsample dominates the spatial precision the encoder consumes anyway.

Alternatives considered¶

Option	Pros	Cons	Verdict
(A) Phase-1 + Phase-2 distillation roadmap (recommended)	Cheap immediate win (Phase 1); a true video-saliency model on the same shippable footprint as `saliency_student_v1` (Phase 2); both phases are independently mergeable	Two PRs instead of one; Phase 2 needs a fork-managed teacher run	Chosen
(B) Ship a true 3D-conv model directly (TASED-Net, 21 M params, MIT)	MIT-licensed; well-known SOTA reference	21 M params is 2 orders of magnitude above the fork's tiny-AI footprint (`saliency_student_v1` is 113 K); 3D conv stack inflates ONNX graph size and runs against a different op-allowlist subset; the ROI-encode use case downsamples to per-MB anyway	Rejected — wrong size class for fork, no measurable BD-rate gain expected to justify it
(C) Adopt ViNet-S directly (36 MB, 1000+ fps)	SOTA, fast	CC BY-NC-SA 4.0 — non-commercial, share-alike; same blocker that rejected upstream MobileSal in ADR-0257	Rejected — license-incompatible with BSD-3-Clause-Plus-Patent fork
(D) Train on AViMoS (1500 videos, CC-BY, AIM 2024)	Larger, more recent than DHF1K; permissive	Mouse-tracked rather than eye-tracked → upper-bounded ground-truth quality; ~170 GB ground-truth alone	Held in reserve for `video_saliency_student_v2` if Phase 2 saturates on DHF1K
(E) Stay on per-frame image saliency forever (status quo)	Zero engineering	Eye-tracking literature is unambiguous that motion drives fixation; the per-clip mean discards this signal; SalEMA [6] shows EMA closes most of the gap for free	Rejected — the cheap (Phase 1) option is too cheap to skip
(F) Adopt the AIM 2024 ZenithChaser Mamba 0.19 M model	Right parameter regime; demonstrates "tiny video saliency" is reachable	Mamba's selective-state-space op is not on the fork's ONNX op-allowlist today; would inflate the PR scope into an op-allowlist patch + a training run + a runtime audit	Held — re-evaluate when the op-allowlist gains the relevant ops for an unrelated reason

References¶

[1] Wang, Shen, Guo, Cheng, Borji, "Revisiting Video Saliency: A Large-scale Benchmark and a New Model" (CVPR 2018; PAMI 2019). DHF1K project page: https://github.com/wenguanwang/DHF1K — license Creative Commons Attribution 4.0; train/val/test split 600/100/300; Hollywood-2 sibling 74.6 GB. WebFetch verified 2026-05-08.

[2] AIM 2024 Challenge on Video Saliency Prediction (ECCVW 2024). https://arxiv.org/html/2409.14827v1. Top entries: CV_MM 420.5 M params, VistaHL 187.7 M, PeRCeiVe Lab UMT-based; ZenithChaser 0.19 M params Mamba (efficient track). Dataset: AViMoS, 1500 videos at 1080p, mouse-tracked, > 5000 observers, ~170 GB ground-truth. WebFetch verified 2026-05-08.

[3] Jiang et al., "DeepVS: A Deep Learning Based Video Saliency Prediction Approach" (ECCV 2018). LEDOV repo: https://github.com/remega/LEDOV-eye-tracking-database. License not stated on the public README at the time of this digest. WebSearch 2026-05-08.

[4] Min, Corso, "TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection" (ICCV 2019). https://github.com/MichiganCOG/TASED-Net — MIT license. S3D encoder pretrained on Kinetics-400. WebFetch verified 2026-05-08.

[4a] TASED-Net parameter / FLOPs figure (T=32 late-aggregation variant, 21.2 M params / 63.2 GFLOPs) sourced from the comparative table in the "Minimalistic Video Saliency Prediction" paper [8].

[5] Droste, Jiao, Noble, "Unified Image and Video Saliency Modeling" (ECCV 2020). https://arxiv.org/abs/2003.05477. Repo https://github.com/rdroste/unisal — Apache-2.0. Backbone MobileNetV2 with Bypass-RNN; "5 to 20-fold smaller model size compared to all competing deep methods"; evaluated on DHF1K, Hollywood-2, UCF-Sports, SALICON, MIT300. WebFetch + WebSearch verified 2026-05-08.

[6] Linardos, Mohedano, Nieto, McGuinness, Giró-i-Nieto, O'Connor, "Simple vs complex temporal recurrences for video saliency prediction" (BMVC 2019). https://arxiv.org/abs/1907.01869; project page https://imatge-upc.github.io/SalEMA/. Headline: EMA over a frozen 2D backbone "achieves almost as well as a sophisticated ConvLSTM recurrence" on DHF1K. WebFetch verified 2026-05-08.

[7] ViNet-v2 (ICASSP 2025). https://github.com/ViNet-Saliency/vinet_v2 — CC BY-NC-SA 4.0 (non-commercial, share-alike). ViNet-S 36 MB, ViNet-A 148 MB; ViNet-S claims > 1000 fps. WebFetch verified 2026-05-08. License blocker mirrors ADR-0257 / upstream MobileSal rejection.

[8] "Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues", arXiv 2502.00397 (2025). https://arxiv.org/abs/2502.00397. Source for cross-model parameter / FLOPs figures including TASED-Net [4a]. WebFetch 2026-05-08.

[9] Saliency-driven HEVC rate control: "Saliency based rate control scheme for high efficiency video coding" (https://ieeexplore.ieee.org/document/7820898); "High Efficiency Video Coding Compliant Perceptual Video Coding Using Entropy Based Visual Saliency Model" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514295/); "Content-aware rate control scheme for HEVC based on static and dynamic saliency detection" (https://www.sciencedirect.com/science/article/abs/pii/S0925231220309668). Combined band of reported BD-rate savings: 3–12 %. WebSearch 2026-05-08.

[10] AViMoS challenge / dataset landing page, https://challenges.videoprocessing.ai/challenges/video-saliency-prediction.html; release repo https://github.com/msu-video-group/ECCVW24_Saliency_Prediction. License CC-BY (per challenge page). WebFetch verified 2026-05-08.

[11] Existing fork prior art — ADR-0286 (saliency_student_v1, fork-trained, ~113 K params, BSD-3-Clause-Plus-Patent), ADR-0293 (vmaf-tune saliency-aware ROI), ADR-0257 (license blocker on upstream weights), and tools/vmaf-tune/src/vmaftune/saliency.py (the per-clip mean this digest proposes to upgrade).

[12] Source of this research task: paraphrased — user direction to research a video-temporal saliency model that complements the existing image-level student, and to be honest about cost so the recommendation matches the actual lift available through the per-MB-downsampled ROI surface.