Skip to content

Research-0062 — Fork-trained saliency student on DUTS-TR

Field Value
Date 2026-05-03
Status Implemented — saliency_student_v1 shipped
Tags dnn, tiny-ai, mobilesal, saliency, training, license, fork-local

Companion to ADR-0286. Records the alternatives walked, the dataset-license analysis, the architecture decisions, and the training-recipe rationale behind the from-scratch saliency student that replaces the smoke-only mobilesal_placeholder_v0 checkpoint.

Background

ADR-0257 and Research-0053 (2026-05-03) documented the three blockers that ruled out the FastDVDnet-style real-weights swap for MobileSal: (a) upstream yuhuan-wu/MobileSal is CC BY-NC-SA 4.0 (incompatible with the fork's BSD-3-Clause-Plus-Patent), (b) checkpoints are distributed via Google Drive viewer URLs only, and (c) the network is RGB-D while the C contract is RGB-only. The recommended replacement path was to swap to U-2-Net u2netp (Apache-2.0). When that path was investigated in detail it inherited the same Google-Drive distribution problem and its 4.7 MB checkpoint dwarfs every other model under model/tiny/.

This digest records the fallback path that was feasible: train a small saliency student from scratch on a permissively-distributed public corpus. The resulting weights are wholly fork-owned, ship under BSD-3-Clause-Plus-Patent, and lock down a substantively content-dependent saliency_mean signal for the first time on this fork.

Dataset survey

The fork needs a saliency-segmentation dataset with (a) a clearly documented research-distribution licence (no clickwrap), (b) a stable HTTP URL the export script can pin, and (c) enough scale to train a small student to a useful operating point in a single GPU-hour.

Dataset Size License / distribution Picked?
DUTS-TR (Wang et al. 2017) 10,553 RGB images + binary masks; 271 MB "Free for academic research" per the official site (http://saliencydetection.net/duts/); direct https://saliencydetection.net/duts/download/DUTS-TR.zip URL with Last-Modified headers and a stable ETag Yes — the only dataset that satisfied all three criteria in the survey
ECSSD (Yan et al. 2013) 1,000 images Free for research; smaller; same direct-URL distribution Held in reserve as an external test set; not used for training
MSRA10K (Cheng et al. 2014) 10,000 images Free for research; mostly single-object centred; less diverse than DUTS-TR Considered for augmentation; not bundled in v1
HKU-IS (Li & Yu 2015) 4,447 images Direct distribution; OK Smaller than DUTS-TR; held in reserve
DUTS-TE (Wang et al. 2017) 5,019 images Same provenance as DUTS-TR Used as the external test split for future evaluations; v1 keeps the held-out 5% of DUTS-TR for in-loop validation IoU
SOD (Movahedi & Faugeras 2010) 300 images Direct distribution Too small to train on alone
DAVIS-S (Caelles et al. 2018) Video Apache-2.0 mask format but video-segmentation flavour, not still-image SOD Out of scope

DUTS-TR is the de-facto standard training corpus for image-level salient-object detection (used by U-2-Net, BASNet, F3Net, EGNet, ...). The README on the project page records the academic-research distribution terms ("free for academic and research purposes"). Only the trained weights are shipped in this fork; the DUTS images themselves are deliberately not committed to the repository.

DUTS-TR archive provenance recorded in docs/ai/models/saliency_student_v1.md:

URL:           https://saliencydetection.net/duts/download/DUTS-TR.zip
Last-Modified: 2025-03-10
Content-Length: 270 997 309 bytes
SHA-256:       ce61e023c8f59d022b4d46981cf16813b83d089242e6489a45630d83962ea058
Pairs:         10 553 (train images + binary masks, 1:1)

Architecture survey

Four candidate architectures were reviewed against the existing C contract (input [1, 3, H, W]saliency_map [1, 1, H, W]) and the fork's ONNX op-allowlist (core/src/dnn/op_allowlist.c):

Architecture Approx. params Notes Picked?
Tiny U-Net (this work) ~113 K 3 down / 3 up with skip connections; Conv + BatchNormalization + ReLU + MaxPool + ConvTranspose + Concat + Sigmoid — every op is on the allowlist; ConvTranspose keeps the graph load-clean against vanilla origin/master without an allowlist patch in the same PR Yes
BASNet-lite ~3.4 M Strong upstream; would require porting a substantive code drop Too large; needs upstream code import → second licence audit
U-2-Net u2netp ~4.7 M Strong upstream; Apache-2.0 codebase but Google-Drive-only weights Code import is cleaner than re-training but still wraps the licence question; not a useful first cut
MobileNetV2 + sigmoid head ~2.2 M Easy to wire Way over the size budget for a pure saliency student

The chosen architecture has 112,841 trainable parameters — well under the 200,000-parameter target and an order of magnitude smaller than any "real" upstream SOD model — but enough capacity to learn a useful content-dependent saliency signal on DUTS-TR.

Training-recipe rationale

Knob Value Rationale
Optimizer Adam, lr = 1e-3 Default working setting for tiny U-Nets in segmentation; matches the per-task brief
LR schedule Cosine annealing to 0 over 50 epochs Smoothly winds down without warmup; deterministic given seed
Batch size 32 Fits comfortably in 24 GB at 256×256; good gradient-noise / wall-clock trade-off
Crop size 256×256 Matches what every published SOD model trains at; the model is fully-convolutional so inference at native resolution is unrestricted
Loss BCE + Dice (per-image, mean reduced) Dice covers the foreground–background imbalance; BCE keeps gradients alive when Dice saturates
Augmentation Random crop + horizontal flip Cheap, dataset-license-safe (no external data); larger augmentation packages deferred
Validation split 5% held-out from DUTS-TR (528 pairs) Stable seed-shuffled split; in-loop selection only — the external DUTS-TE / ECSSD evaluation is a follow-up
Selection Best epoch by val IoU at threshold 0.5 Single deterministic selection criterion; per-task brief
Early-stop floor val IoU ≥ 0.5 ships; below ≥ 0.5 ships docs-only failure PR Per the task brief — a saliency model with IoU < 0.5 at this scale is below useful, so a docs-only failure PR is more honest than a noisy weights drop
Seed 42 Reproducible across re-runs; baked into train_saliency_student.py

Op-allowlist analysis

Every op in the exported ONNX graph was verified against core/src/dnn/op_allowlist.c at export time via ai/src/vmaf_train/op_allowlist.py::check_model. The graph contains exactly the ops Conv, BatchNormalization (folded into Conv by constant folding at export), Relu, MaxPool, ConvTranspose, Concat, and Sigmoid — all on the allowlist. Resize is not used (and is not on the allowlist at the time of this PR — using ConvTranspose for stride-2 upsampling avoids a scope expansion into the allowlist + the C-side scanner in the same PR).

Known limitations (carried over from MobileSal)

The C-side feature_mobilesal.c extractor is unchanged, so the known limitations recorded in docs/ai/models/mobilesal.md carry over verbatim — 8-bit YUV only, BT.709 limited-range YUV→RGB, ImageNet normalisation in C. The saliency_student_v1 weights are a true drop-in replacement (same input/output tensor names, same NCHW shapes, same dynamic axes).

Reproducer

.venv/bin/python ai/scripts/train_saliency_student.py \
    --duts-root /path/to/DUTS-TR \
    --output    model/tiny/saliency_student_v1.onnx \
    --epochs 50 --batch-size 32 --lr 1e-3 --seed 42 \
    --metrics-out build_artifacts/saliency_student_v1_train.json

The training script is deterministic given the seed and the pinned PyTorch / CUDA versions; re-runs reproduce the val-IoU curve to within float-rounding noise.

References

  • ADR-0286 — accompanying decision record.
  • ADR-0218 — original MobileSal extractor wiring (unchanged by this PR).
  • ADR-0257 — the deferral that this PR partly unblocks.
  • Research-0053 — the upstream survey that ruled out the real-weights swap.
  • DUTS dataset: Wang et al., "Learning to Detect Salient Objects with Image-Level Supervision", CVPR 2017. Project page: http://saliencydetection.net/duts/. License: free for academic research.
  • U-Net: Ronneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation", MICCAI 2015.
  • Source: paraphrased — task brief directive "train a small saliency student from scratch on a permissively-licensed public dataset, replacing the placeholder."