Research-0062: Fork-trained saliency student on DUTS-TR

Field	Value
Date	2026-05-03
Status	Implemented — `saliency_student_v1` shipped
Tags	dnn, tiny-ai, mobilesal, saliency, training, license, fork-local

Companion to ADR-0286. Records the alternatives walked, the dataset-license analysis, the architecture decisions, and the training-recipe rationale behind the from-scratch saliency student that replaces the smoke-only mobilesal_placeholder_v0 checkpoint.

Background

ADR-0257 and Research-0053 (2026-05-03) documented the three blockers that ruled out the FastDVDnet-style real-weights swap for MobileSal: (a) upstream yuhuan-wu/MobileSal is CC BY-NC-SA 4.0 (incompatible with the fork's BSD-3-Clause-Plus-Patent), (b) checkpoints are distributed via Google Drive viewer URLs only, and (c) the network is RGB-D while the C contract is RGB-only. The recommended replacement path was to swap to U-2-Net u2netp (Apache-2.0). Investigated in detail, that path turned out to inherit the same Google-Drive distribution problem, and its 4.7 MB checkpoint dwarfs every other model under model/tiny/.

This digest records the fallback path that was feasible: train a small saliency student from scratch on a permissively-distributed public corpus. The resulting weights are wholly fork-owned, ship under BSD-3-Clause-Plus-Patent, and give this fork its first genuinely content-dependent saliency_mean signal.

Dataset survey

The fork needs a saliency-segmentation dataset with (a) a clearly documented research-distribution licence (no clickwrap), (b) a stable HTTP URL the export script can pin, and (c) enough scale to train a small student to a useful operating point in a single GPU-hour.

Dataset	Size	License / distribution	Picked?
DUTS-TR (Wang et al. 2017)	10,553 RGB images + binary masks; 271 MB	"Free for academic research" per the official site (`http://saliencydetection.net/duts/`); direct `https://saliencydetection.net/duts/download/DUTS-TR.zip` URL with `Last-Modified` headers and a stable `ETag`	Yes — the only dataset that satisfied all three criteria in the survey
ECSSD (Yan et al. 2013)	1,000 images	Free for research; smaller; same direct-URL distribution	Held in reserve as an external test set; not used for training
MSRA10K (Cheng et al. 2014)	10,000 images	Free for research; mostly single-object centred; less diverse than DUTS-TR	Considered for augmentation; not bundled in v1
HKU-IS (Li & Yu 2015)	4,447 images	Direct distribution; OK	Smaller than DUTS-TR; held in reserve
DUTS-TE (Wang et al. 2017)	5,019 images	Same provenance as DUTS-TR	Reserved as the external test split for future evaluations; v1 uses a held-out 5% of DUTS-TR for in-loop validation IoU
SOD (Movahedi & Faugeras 2010)	300 images	Direct distribution	Too small to train on alone
DAVIS-S (Caelles et al. 2018)	Video	Apache-2.0 mask format but video-segmentation flavour, not still-image SOD	Out of scope

DUTS-TR is the de-facto standard training corpus for image-level salient-object detection (used by U-2-Net, BASNet, F3Net, EGNet, ...). The README on the project page records the academic-research distribution terms ("free for academic and research purposes"). Only the trained weights are shipped in this fork; the DUTS images themselves are deliberately not committed to the repository.

DUTS-TR archive provenance recorded in docs/ai/models/saliency_student_v1.md:

URL:           https://saliencydetection.net/duts/download/DUTS-TR.zip
Last-Modified: 2025-03-10
Content-Length: 270 997 309 bytes
SHA-256:       ce61e023c8f59d022b4d46981cf16813b83d089242e6489a45630d83962ea058
Pairs:         10 553 (train images + binary masks, 1:1)
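
A download step could check a local copy of the archive against this record before any training run. The sketch below shows one way to do that with Python's standard library only. The helper names `sha256_of` and `verify_archive` are illustrative and are not taken from the fork's scripts; the expected digest and byte count are copied from the record above.

```python
import hashlib
from pathlib import Path

# Values copied from the provenance record above.
EXPECTED_SHA256 = "ce61e023c8f59d022b4d46981cf16813b83d089242e6489a45630d83962ea058"
EXPECTED_BYTES = 270_997_309


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 in 1 MiB chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(
    path: Path,
    expected_sha256: str = EXPECTED_SHA256,
    expected_bytes: int = EXPECTED_BYTES,
) -> bool:
    """Return True only if both the byte length and the digest match the record."""
    if path.stat().st_size != expected_bytes:
        return False  # cheap size check first, before hashing the whole file
    return sha256_of(path) == expected_sha256
```

Checking the size first rejects truncated downloads without hashing them; the digest comparison is what actually pins the content.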

Architecture survey

Four candidate architectures were reviewed against the existing C contract (input [1, 3, H, W] → saliency_map [1, 1, H, W]) and the fork's ONNX op-allowlist (core/src/dnn/op_allowlist.c):

Architecture	Approx. params	Notes	Picked?
Tiny U-Net (this work)	~113 K	3 down / 3 up with skip connections; `Conv` + `BatchNormalization` + `Relu` + `MaxPool` + `ConvTranspose` + `Concat` + `Sigmoid`; every op is on the allowlist, and using `ConvTranspose` for upsampling keeps the graph load-clean against vanilla origin/master without an allowlist patch in the same PR	Yes
BASNet-lite	~3.4 M	Strong upstream; would require porting a substantive code drop	Too large; needs upstream code import → second licence audit
U-2-Net `u2netp`	~1.1 M (4.7 MB checkpoint)	Strong upstream; Apache-2.0 codebase but Google-Drive-only weights	Code import is cleaner than re-training but still carries the licence question; not a useful first cut
MobileNetV2 + sigmoid head	~2.2 M	Easy to wire	Way over the size budget for a pure saliency student

The chosen architecture has 112,841 trainable parameters — well under the 200,000-parameter target and an order of magnitude smaller than any "real" upstream SOD model — but enough capacity to learn a useful content-dependent saliency signal on DUTS-TR.
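
For readers who want to sanity-check a parameter total like this one, the sketch below shows how trainable parameters are usually tallied per layer type and how the quoted total compares with the budget. The layer widths and kernel sizes of the actual student are not stated in this digest, so the functions take them as arguments and no attempt is made to reproduce 112,841; the fp32 size is a rough estimate of raw weight storage, not of the ONNX file size.

```python
def conv2d_params(c_in: int, c_out: int, k: int = 3, bias: bool = True) -> int:
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def batchnorm_params(channels: int) -> int:
    """Trainable scale and shift only; running mean/variance are buffers."""
    return 2 * channels


def conv_transpose2d_params(c_in: int, c_out: int, k: int = 2, bias: bool = True) -> int:
    """Same count as a regular convolution with the same kernel size."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def fp32_bytes(n_params: int) -> int:
    """Raw storage for n_params 32-bit floats."""
    return n_params * 4


PARAM_BUDGET = 200_000      # target quoted in the text
REPORTED_PARAMS = 112_841   # total quoted in the text

if __name__ == "__main__":
    # Hypothetical first encoder stage (3 -> 16 channels), for illustration only.
    stage = conv2d_params(3, 16) + batchnorm_params(16)
    print("example stage params:", stage)
    print("within budget:", REPORTED_PARAMS < PARAM_BUDGET)
    print("approx fp32 weight bytes:", fp32_bytes(REPORTED_PARAMS))
```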

Training-recipe rationale

Knob	Value	Rationale
Optimizer	Adam, lr = 1e-3	Default working setting for tiny U-Nets in segmentation; matches the per-task brief
LR schedule	Cosine annealing to 0 over 50 epochs	Smoothly winds down without warmup; deterministic given seed
Batch size	32	Fits comfortably in 24 GB at 256×256; good gradient-noise / wall-clock trade-off
Crop size	256×256	A common training resolution for small SOD models; the model is fully convolutional, so inference at native resolution is unrestricted
Loss	BCE + Dice (per-image, mean reduced)	Dice covers the foreground–background imbalance; BCE keeps gradients alive when Dice saturates
Augmentation	Random crop + horizontal flip	Cheap, dataset-license-safe (no external data); larger augmentation packages deferred
Validation split	5% held-out from DUTS-TR (528 pairs)	Stable seed-shuffled split; in-loop selection only — the external DUTS-TE / ECSSD evaluation is a follow-up
Selection	Best epoch by val IoU at threshold 0.5	Single deterministic selection criterion; per-task brief
Early-stop floor	val IoU ≥ 0.5 ships the weights; val IoU < 0.5 ships a docs-only failure PR	Per the task brief: a saliency model with IoU < 0.5 at this scale is not useful, so a docs-only failure PR is more honest than a noisy weights drop
Seed	42	Reproducible across re-runs; baked into `train_saliency_student.py`
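
The schedule, loss, and selection rows above can be made concrete with a small sketch. It is written in plain Python on flat lists so that it runs without PyTorch; the real `train_saliency_student.py` presumably works on tensors, and its exact smoothing constants, clamping, and tie-breaking are not recorded here, so `eps`, `smooth`, and all function names below are assumptions.

```python
import math


def cosine_lr(epoch: int, total_epochs: int = 50, base_lr: float = 1e-3) -> float:
    """Cosine annealing from base_lr (epoch 0) down to 0 (epoch == total_epochs)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def bce_dice_loss(probs, targets, eps: float = 1e-7, smooth: float = 1.0) -> float:
    """Per-image BCE + soft-Dice loss on sigmoid outputs and 0/1 targets."""
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= len(probs)
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)
    return bce + (1.0 - dice)


def iou_at_threshold(probs, targets, threshold: float = 0.5) -> float:
    """Binarise predictions at the threshold, then intersection over union."""
    pred = [p >= threshold for p in probs]
    inter = sum(1 for a, t in zip(pred, targets) if a and t)
    union = sum(1 for a, t in zip(pred, targets) if a or t)
    return inter / union if union else 1.0  # empty vs empty counts as a match


def select_checkpoint(val_iou_per_epoch, floor: float = 0.5):
    """Pick the best epoch by validation IoU and report whether it clears the floor."""
    best_epoch = max(range(len(val_iou_per_epoch)), key=val_iou_per_epoch.__getitem__)
    best_iou = val_iou_per_epoch[best_epoch]
    return best_epoch, best_iou, best_iou >= floor


if __name__ == "__main__":
    print([round(cosine_lr(e), 6) for e in (0, 25, 50)])
    print(select_checkpoint([0.31, 0.62, 0.58]))
```

A batch loss under this scheme would be the mean of `bce_dice_loss` over the images in the batch, which is one reading of "per-image, mean reduced". The third value returned by `select_checkpoint` corresponds to the ship / docs-only decision in the early-stop row.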

Op-allowlist analysis

Every op in the exported ONNX graph was verified against core/src/dnn/op_allowlist.c at export time via ai/src/vmaf_train/op_allowlist.py::check_model. The model uses only Conv, BatchNormalization (folded into Conv by constant folding at export), Relu, MaxPool, ConvTranspose, Concat, and Sigmoid, all of which are on the allowlist. Resize is not used and was not on the allowlist at the time of this PR; using ConvTranspose for stride-2 upsampling avoids expanding the scope to the allowlist and the C-side scanner in the same PR.
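
The kind of check described above reduces to a set difference between the op types in the graph and the allowlist. The sketch below illustrates that idea only: `ASSUMED_ALLOWLIST` is a hypothetical subset built from the ops named in this section (the authoritative list lives in core/src/dnn/op_allowlist.c), and `disallowed_ops` is an illustrative name, not the signature of `check_model`.

```python
# Hypothetical subset of the allowlist, limited to the ops named in this section.
ASSUMED_ALLOWLIST = frozenset(
    {
        "Conv",
        "BatchNormalization",
        "Relu",
        "MaxPool",
        "ConvTranspose",
        "Concat",
        "Sigmoid",
    }
)


def disallowed_ops(op_types, allowlist=ASSUMED_ALLOWLIST):
    """Return the sorted, de-duplicated op types that are not on the allowlist."""
    return sorted(set(op_types) - set(allowlist))


if __name__ == "__main__":
    # With the onnx package installed, the op types of a top-level graph could be
    # collected as: [node.op_type for node in onnx.load(path).graph.node]
    graph_ops = ["Conv", "Relu", "MaxPool", "ConvTranspose", "Concat", "Conv", "Sigmoid"]
    print(disallowed_ops(graph_ops))               # nothing outside the assumed list
    print(disallowed_ops(graph_ops + ["Resize"]))  # Resize would be flagged
```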

Known limitations (carried over from MobileSal)

The C-side feature_mobilesal.c extractor is unchanged, so the known limitations recorded in docs/ai/models/mobilesal.md carry over verbatim — 8-bit YUV only, BT.709 limited-range YUV→RGB, ImageNet normalisation in C. The saliency_student_v1 weights are a true drop-in replacement (same input/output tensor names, same NCHW shapes, same dynamic axes).
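
To make the preprocessing limitations concrete, the sketch below converts one 8-bit limited-range BT.709 Y'CbCr sample to RGB and applies ImageNet normalisation. It is a Python illustration, not the code in feature_mobilesal.c: the coefficients are derived from the standard BT.709 luma weights, the mean and standard deviation are the commonly used ImageNet values, and the extractor's exact rounding and clipping are not shown in this digest.

```python
KR, KB = 0.2126, 0.0722          # BT.709 luma weights for red and blue
KG = 1.0 - KR - KB               # green weight follows from the other two

IMAGENET_MEAN = (0.485, 0.456, 0.406)  # commonly used values, assumed here
IMAGENET_STD = (0.229, 0.224, 0.225)


def _clip01(x: float) -> float:
    return min(max(x, 0.0), 1.0)


def yuv709_limited_to_rgb(y: int, cb: int, cr: int):
    """8-bit limited-range BT.709 Y'CbCr -> R'G'B' floats in [0, 1]."""
    yn = (y - 16) / 219.0        # luma code range 16..235
    pb = (cb - 128) / 224.0      # chroma code range 16..240, centred on 128
    pr = (cr - 128) / 224.0
    r = yn + 2.0 * (1.0 - KR) * pr
    b = yn + 2.0 * (1.0 - KB) * pb
    g = (yn - KR * r - KB * b) / KG  # solve Y = KR*R + KG*G + KB*B for G
    return _clip01(r), _clip01(g), _clip01(b)


def imagenet_normalise(rgb):
    """Per-channel (value - mean) / std."""
    return tuple((c - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def preprocess_pixel(y: int, cb: int, cr: int):
    """Full per-pixel path: colour conversion followed by normalisation."""
    return imagenet_normalise(yuv709_limited_to_rgb(y, cb, cr))


if __name__ == "__main__":
    print(yuv709_limited_to_rgb(16, 128, 128))   # limited-range black
    print(yuv709_limited_to_rgb(235, 128, 128))  # limited-range white
    print(preprocess_pixel(128, 128, 128))
```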

Reproducer

.venv/bin/python ai/scripts/train_saliency_student.py \
    --duts-root /path/to/DUTS-TR \
    --output    model/tiny/saliency_student_v1.onnx \
    --epochs 50 --batch-size 32 --lr 1e-3 --seed 42 \
    --metrics-out build_artifacts/saliency_student_v1_train.json

The training script is deterministic given the seed and the pinned PyTorch / CUDA versions; re-runs reproduce the val-IoU curve to within float-rounding noise.
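
Part of that determinism is the seed-shuffled validation split. A minimal sketch of such a split is shown below, assuming a private `random.Random(seed)` instance and rounding to the nearest pair; the actual script's RNG and rounding rule are not recorded in this digest, and these choices were made only so that the validation count matches the 528 pairs quoted earlier.

```python
import random


def split_indices(n_pairs: int = 10_553, val_fraction: float = 0.05, seed: int = 42):
    """Shuffle pair indices with a private RNG, then slice off the validation set."""
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)  # private RNG: global random state untouched
    n_val = round(n_pairs * val_fraction)
    return indices[n_val:], indices[:n_val]  # (train, val)


if __name__ == "__main__":
    train_idx, val_idx = split_indices()
    print(len(train_idx), len(val_idx))
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the split independent of any other random calls made earlier in the process.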

References

ADR-0286 — accompanying decision record.
ADR-0218 — original MobileSal extractor wiring (unchanged by this PR).
ADR-0257 — the deferral that this PR partly unblocks.
Research-0053 — the upstream survey that ruled out the real-weights swap.
DUTS dataset: Wang et al., "Learning to Detect Salient Objects with Image-Level Supervision", CVPR 2017. Project page: http://saliencydetection.net/duts/. License: free for academic research.
U-Net: Ronneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation", MICCAI 2015.
Source: paraphrased — task brief directive "train a small saliency student from scratch on a permissively-licensed public dataset, replacing the placeholder."