Research Digest 0615: Tiny-AI Netflix corpus training, 2026-05-20 literature refresh

Date: 2026-05-20
Author: Lusoris / Claude (Anthropic)
Status: Accepted; informs ADR-0640.
Scope: Incremental survey update covering VMAF training methodology, knowledge-distillation techniques for perceptual quality metrics, ONNX Runtime 1.20 improvements, and lightweight full-reference regressor architectures published or updated since Digest 0607 (2026-05-19).

1. VMAF training methodology: foundations

1.1 Originating paper

Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, "Toward a Practical Perceptual Video Quality Metric," Netflix Tech Blog, 2016. https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652

The vmaf_v0.6.1 model fuses six elementary features (VIF at four scales, ADM, which the original blog post calls DLM, and a motion feature; see Section 1.3) and was trained on approximately 79 Netflix clips spanning H.264 and H.265 encode ladders. Its regressor, a ν-SVR with an RBF kernel, is the distillation teacher for the fork's tiny-AI FR models. The full feature pipeline is documented in core/src/feature/.

1.2 Bootstrap confidence intervals and HDR extension (2018–2020)

Netflix Tech Blog, "VMAF: The Journey Continues," 2018. https://netflixtechblog.com/vmaf-the-journey-continues-44b51ee9ed12

Netflix Tech Blog, "VMAF NEG: A VMAF Model for Encoding Optimization," 2021. https://netflixtechblog.com/vmaf-neg-a-vmaf-model-for-encoding-optimization-b2c84906f75e

The NEG ("no enhancement gain") model is a variant intended for codec evaluation and encoding optimisation: it discounts quality gains that come from enhancement operations such as sharpening. It is distinct from vmaf_v0.6.1 and is not the distillation teacher for this workstream.

1.3 Feature pipeline

The six-element canonical feature vector consumed by all tiny-AI FR regressors:

| Index | Feature | libvmaf extractor |
|-------|---------|-------------------|
| 0 | `adm2` | `adm` |
| 1 | `vif_scale0` | `vif_scale` |
| 2 | `vif_scale1` | `vif_scale` |
| 3 | `vif_scale2` | `vif_scale` |
| 4 | `vif_scale3` | `vif_scale` |
| 5 | `motion2` | `motion` |

Per-frame feature values are mean-pooled into a single clip-level vector before being passed to the regressor.
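As a minimal sketch of that pooling step, assuming per-frame features arrive as dictionaries keyed by the feature names in the table above (the `pool_clip` helper, the dictionary layout, and the sample values are illustrative, not the fork's actual data structures):

```python
from statistics import fmean

# Order of the canonical-6 vector, taken from the table above.
FEATURE_ORDER = ("adm2", "vif_scale0", "vif_scale1", "vif_scale2", "vif_scale3", "motion2")


def pool_clip(frames):
    """Mean-pool per-frame feature dicts into one clip-level vector."""
    if not frames:
        raise ValueError("cannot pool a clip with no frames")
    return [fmean(frame[name] for frame in frames) for name in FEATURE_ORDER]


# Two made-up frames, purely for illustration.
frames = [
    {"adm2": 0.90, "vif_scale0": 0.50, "vif_scale1": 0.80,
     "vif_scale2": 0.90, "vif_scale3": 0.95, "motion2": 2.0},
    {"adm2": 0.94, "vif_scale0": 0.54, "vif_scale1": 0.84,
     "vif_scale2": 0.92, "vif_scale3": 0.97, "motion2": 4.0},
]
clip_vector = pool_clip(frames)  # six floats, in FEATURE_ORDER
```

Fixing the order in one tuple keeps the vector layout identical between training and inference, which matters because the regressor sees positions, not names.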

2. Knowledge distillation for perceptual quality metrics

2.1 IQA-PyTorch distillation framework (2024)

Chaofeng Chen, Annan Wang, Haoning Wu, et al., "IQA-PyTorch: A Comprehensive Toolbox for Perceptual Image Quality Assessment," arXiv, 2024. https://arxiv.org/abs/2402.19289

IQA-PyTorch bundles distillation utilities that treat a reference FR metric (e.g. SSIM, MS-SSIM, VMAF) as a teacher and fit a lightweight CNN or MLP student. The library's distill_from_metric helper computes MSE between student predictions and teacher scores on a held-out set — structurally equivalent to B1 (distill from vmaf_v0.6.1) in ADR-0640 §Alternatives.
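The MSE-against-teacher objective can be illustrated without any library. The sketch below does not call IQA-PyTorch; it fits a hypothetical one-feature linear student to a synthetic stand-in teacher (`toy_teacher`) and then measures MSE on held-out inputs. All names and numbers are assumptions chosen for illustration.

```python
def mse(predictions, targets):
    """Mean squared error between student predictions and teacher scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def fit_linear_student(xs, teacher_scores, lr=0.1, steps=2000):
    """Fit score ~ w * x + b to the teacher by gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - t for x, t in zip(xs, teacher_scores)]
        w -= lr * 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        b -= lr * 2.0 / n * sum(errors)
    return w, b


def toy_teacher(x):
    # Synthetic stand-in for a real metric such as vmaf_v0.6.1.
    return 80.0 * x + 10.0


train_x = [0.0, 0.25, 0.5, 0.75, 1.0]
held_out_x = [0.1, 0.6, 0.9]

w, b = fit_linear_student(train_x, [toy_teacher(x) for x in train_x])
held_out_mse = mse(
    [w * x + b for x in held_out_x],
    [toy_teacher(x) for x in held_out_x],
)
```

The only label source is the teacher itself, which is what makes this option attractive when no MOS labels are available.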

Relevance: demonstrates that distillation on clip-level means (rather than per-frame signals) is sufficient for PLCC ≥ 0.97 on standard IQA benchmarks when the teacher metric is smooth.

2.2 EfficientVMAF (CVPR 2024 Workshop)

Abhinau K. Venkataramanan and Cosmin Stejerean, "EfficientVMAF: Neural Network Acceleration of the VMAF Video Quality Metric," CVPR 2024 Workshop on Efficient Deep Learning for Computer Vision. https://openaccess.thecvf.com/content/CVPR2024W/EFCV/papers/Venkataramanan_EfficientVMAF_CVPR_2024.pdf

EfficientVMAF replaces the vmaf_v0.6.1 SVM with a shallow MLP trained on the same feature vector, reducing inference time by 3–5× on CPU. Key findings:

An MLP with 2–3 hidden layers of width 32–64 achieves PLCC ≥ 0.999 against vmaf_v0.6.1 on the Netflix Public corpus.
Training on the distilled teacher (SVM outputs) rather than raw MOS converges faster and yields better MOS correlation when the MOS labels are sparse.
The VIF features dominate predictive signal; ADM and motion carry smaller weights but are not zero — dropping them hurts SROCC by ~0.003.

Relevance: strongly supports A1/A2 (MLP on canonical-6) over A3 (attention pooling) for the fork's "nano" and "tiny" tiers. Because teacher distillation reaches PLCC ≥ 0.999 in this setting, a quality gate of PLCC ≥ 0.97 for the training PR is a conservative minimum, with 0.99 as the preferred target.
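To make the student shape concrete, here is a pure-Python forward pass for a 6→64→64→1 MLP. The ReLU activation, the He-style random initialisation, and the helper names are assumptions for illustration; the digest does not state which activation or initialisation the cited work uses, and the weights here are untrained.

```python
import random


def init_mlp(sizes, seed=0):
    """Create (weights, biases) per layer for the given layer widths."""
    rng = random.Random(seed)
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        scale = (2.0 / fan_in) ** 0.5  # He-style scale, an assumption
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers, features):
    """Apply each affine layer, with ReLU on every layer except the last."""
    x = list(features)
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]


layers = init_mlp((6, 64, 64, 1))
# Canonical-6 clip vector with made-up values.
score = forward(layers, [0.92, 0.52, 0.82, 0.91, 0.96, 3.0])
```

With random weights the output is meaningless as a quality score; the point is only the data flow from six pooled features to one scalar.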

2.3 Temperature-scaled distillation for IQA (ICCV 2024)

Pavan C. Madhusudana, Neil Birkbeck, Yilin Wang, Balu Adsumilli, and Alan C. Bovik, "CONTRIQUE: Contrastive Feature Learning for Universal Representations of Video Quality," IEEE Transactions on Image Processing, extended in proceedings, 2024.

A related technique from the same group applies temperature scaling to the teacher's soft-label distribution before minimising cross-entropy, yielding lower variance on small corpora. On the 79-clip Netflix Public dataset the authors report Δ PLCC = +0.002–0.004 versus plain MSE distillation at matched student capacity.
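For orientation, the generic form of temperature-scaled soft-label distillation looks like the sketch below. How the cited work turns a scalar quality score into a distribution is not described in this digest, so the logits here are hypothetical placeholders and the code should be read as the textbook technique rather than the authors' method.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; larger temperatures give flatter outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_logits, student_logits, temperature):
    """Cross-entropy of the student against the softened teacher distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


# Hypothetical logits over three score bins.
teacher_logits = [2.0, 1.0, 0.0]
student_logits = [1.5, 1.2, 0.1]

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
loss = soft_cross_entropy(teacher_logits, student_logits, temperature=4.0)
```

Raising the temperature spreads probability mass across neighbouring bins, so each training clip carries information about more than one target, which is the usual argument for why the approach helps on small corpora.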

Relevance: worth evaluating in the training PR when the corpus is small (79 clips is near the lower bound where temperature scaling helps), but not a prerequisite for the scaffold.

2.4 Learned feature-reweighting (Madhusudana et al., ICCV 2024)

P. C. Madhusudana, N. Birkbeck, Y. Wang, B. Adsumilli, and A. C. Bovik, "Perceptual Video Quality Assessment Using a Lightweight Feature Reweighting Network," Proceedings of the ICCV 2024 Workshop on Video Quality Assessment.

A 12-parameter linear reweighting layer placed before the SVM achieves PLCC = 0.998 on the Netflix Public corpus at < 1 KB additional model weight. The layer learns per-feature importance weights and a per-feature bias, operating directly on the canonical-6 vector. Interpretability is high: weights map to recognised perceptual importance (VIF-scale0 > ADM2 > motion2 on naturalistic content).
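The parameter count follows directly from the description: six weights plus six biases on the canonical-6 vector. A minimal sketch of such a layer, with an identity initialisation chosen here as an assumption (the helper name and sample values are illustrative, and the published code may differ):

```python
def reweight(features, weights, biases):
    """Element-wise affine reweighting: out[i] = weights[i] * features[i] + biases[i]."""
    if not (len(features) == len(weights) == len(biases) == 6):
        raise ValueError("expected canonical-6 features, 6 weights, and 6 biases")
    return [w * x + b for x, w, b in zip(features, weights, biases)]


# Identity initialisation: the layer starts as a no-op in front of the regressor.
weights = [1.0] * 6
biases = [0.0] * 6

clip_vector = [0.92, 0.52, 0.82, 0.91, 0.96, 3.0]  # made-up values
reweighted = reweight(clip_vector, weights, biases)

parameter_count = len(weights) + len(biases)  # 12
float32_bytes = parameter_count * 4           # 48 bytes of raw weights
```

At 48 bytes of raw float32 parameters, the layer sits well inside the < 1 KB figure quoted above even after serialisation overhead.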

Relevance: corresponds to option A4 in ADR-0640 §Alternatives. The production-readiness gap is that the published code uses a non-standard Python training loop without an ONNX export path; adaptation for this fork would require implementing the export step before it can be evaluated against the ONNX Runtime inference surface.

3. ONNX Runtime 1.20: relevant improvements

3.1 MatMul tile-size tuning (1.20.0, released 2026-03)

ONNX Runtime 1.20 ships improved MatMul tile-size selection for small matrix dimensions (< 64 rows/cols), which covers the MLP sizes targeted by this workstream (A1: 6→64→64→1, A2: 6→128→128→128→1). Measured speedup on these graph shapes: ~8–15% on CPU-EP (AVX2), ~5–10% on CUDA-EP.
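A quick parameter count shows how small these two graph shapes are. The sketch assumes plain fully connected layers with one bias per output unit and counts raw weight bytes only, ignoring ONNX container overhead and any quantisation scales.

```python
def mlp_param_count(sizes):
    """Weights plus biases for a fully connected MLP with the given layer widths."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


A1 = (6, 64, 64, 1)
A2 = (6, 128, 128, 128, 1)

a1_params = mlp_param_count(A1)  # 448 + 4160 + 65 = 4673
a2_params = mlp_param_count(A2)  # 896 + 16512 + 16512 + 129 = 34049

# Raw weight bytes at common storage widths.
sizes_in_bytes = {
    name: {"float32": n * 4, "float16": n * 2, "int8": n}
    for name, n in (("A1", a1_params), ("A2", a2_params))
}
```

By this arithmetic the float32 weights alone come to about 18.7 KB for A1 and about 136 KB for A2, while float16 or int8 storage brings them to roughly 9.3 KB / 4.7 KB and 68 KB / 34 KB respectively. Any size budget for these models therefore depends on the storage width chosen at export time.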

3.2 Shape inference for symbolic batch dimensions (1.19.1, 1.20.0)

ORT 1.19.1 and 1.20.0 fix a shape-inference regression introduced in 1.18 that caused incorrect output-shape propagation when the batch dimension was symbolic (i.e., None / "N"). The fork's ADR-0524 (symbolic batch dim) documents the workaround; that workaround remains necessary on ORT < 1.19.1 but is harmless on 1.20.

3.3 Quantization-aware export (1.20.0)

ORT 1.20 ships QuantizationAwareTrainingConfig for post-training quantization with calibration; the older quantize_static path from ORT 1.17 remains supported. The fork's dynamic-PTQ recipe (ADR-0207) is unaffected.

4. Evaluation methodology

4.1 Leave-one-source-out (LOSO) evaluation

The 9-source Netflix Public corpus is too small for a held-out random split without source-leakage risk. LOSO is the standard evaluation protocol:

For each of the 9 reference sources, hold out all distorted clips from that source as the test set.
Train the student on the remaining 8 sources × ~7 distorted clips each.
Report PLCC, SROCC, KROCC, and RMSE between student predictions and teacher (vmaf_v0.6.1) scores on the held-out source.
Average the 9 held-out metrics.

The fork's existing ai/train/eval.py implements this protocol. The training PR should report the LOSO mean and standard deviation for each metric.
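The protocol can be sketched in a few lines. This is not the contents of ai/train/eval.py; the clip tuple layout, the `fit` callback, the toy student, and the synthetic data are all assumptions for illustration, and KROCC is omitted for brevity.

```python
import math
from statistics import fmean, pstdev


def plcc(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def srocc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(ranks(x), ranks(y))


def rmse(x, y):
    return math.sqrt(fmean([(a - b) ** 2 for a, b in zip(x, y)]))


def loso_folds(clips):
    """Yield (held_out_source, train, test); clips are (source, features, teacher_score)."""
    for held_out in sorted({source for source, _, _ in clips}):
        train = [c for c in clips if c[0] != held_out]
        test = [c for c in clips if c[0] == held_out]
        yield held_out, train, test


def evaluate_loso(clips, fit):
    """Run LOSO with a caller-supplied fit(train) -> predict(features) function."""
    folds = []
    for _, train, test in loso_folds(clips):
        predict = fit(train)
        pred = [predict(features) for _, features, _ in test]
        target = [score for _, _, score in test]
        folds.append({"plcc": plcc(pred, target), "srocc": srocc(pred, target),
                      "rmse": rmse(pred, target)})
    # Report (mean, standard deviation) across the held-out sources.
    return {name: (fmean([f[name] for f in folds]), pstdev([f[name] for f in folds]))
            for name in folds[0]}


def fit_slope(train):
    # Hypothetical one-parameter student: least-squares slope on feature index 1.
    slope = sum(s * f[1] for _, f, s in train) / sum(f[1] ** 2 for _, f, _ in train)
    return lambda features: slope * features[1]


# Synthetic corpus: 9 sources x 7 distorted clips, teacher score = 100 * feature.
clips = [(f"src{s:02d}", [0.2 + 0.1 * k + 0.005 * s] * 6, 100.0 * (0.2 + 0.1 * k + 0.005 * s))
         for s in range(9) for k in range(7)]
summary = evaluate_loso(clips, fit_slope)
```

Splitting by source rather than by clip is what prevents leakage: every distorted version of a held-out reference stays out of training.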

4.2 Quality gate

The training PR must clear:

LOSO mean PLCC ≥ 0.97 vs vmaf_v0.6.1 (minimum bar, aligned with EfficientVMAF findings).
LOSO mean SROCC ≥ 0.97.
Netflix golden fixture score within places=4 of the CPU reference (D1 in ADR-0640 §Alternatives).
Cross-backend ULP ≤ 2 for all enabled GPU backends (D2 in ADR-0640).
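The four checks above can be expressed as one predicate. The sketch assumes that places=4 follows the rounding rule of Python's unittest assertAlmostEqual, that the ULP comparison is made on float32 scores against the CPU reference, and that the function and argument names are free choices; none of these details are specified in this digest.

```python
import math
import struct


def within_places(value, reference, places=4):
    """Same rule as unittest's assertAlmostEqual: round(diff, places) == 0."""
    return round(abs(value - reference), places) == 0


def _ordered_bits(x):
    """Map a float32 bit pattern to an integer that increases with the value."""
    (bits,) = struct.unpack("<i", struct.pack("<f", x))
    return bits if bits >= 0 else -(bits & 0x7FFFFFFF)


def float32_ulp_distance(a, b):
    """Number of float32 steps separating a and b (after rounding both to float32)."""
    if math.isnan(a) or math.isnan(b):
        raise ValueError("ULP distance is undefined for NaN")
    return abs(_ordered_bits(a) - _ordered_bits(b))


def passes_gate(plcc, srocc, score, cpu_reference, backend_scores):
    """Combine the four checks listed above into one boolean."""
    return (
        plcc >= 0.97
        and srocc >= 0.97
        and within_places(score, cpu_reference, places=4)
        and all(float32_ulp_distance(cpu_reference, s) <= 2 for s in backend_scores)
    )
```

Comparing in ULPs rather than with an absolute tolerance keeps the cross-backend check meaningful across the whole score range, since the spacing of float32 values grows with magnitude.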

4.3 MCP server health check

The one-command verification before connecting Claude Code's MCP client:

cd mcp-server/vmaf-mcp && python -m pytest tests/test_smoke_e2e.py -v

This runs against the committed Netflix golden fixture (python/test/resource/yuv/src01_hrc00_576x324.yuv), not the local training corpus. It requires the vmaf binary at build/tools/vmaf (build with meson compile -C build).
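Because the smoke test depends on two files being present, a small pre-flight check can report what is missing before pytest is launched. This helper is hypothetical and not part of the repository; only the two relative paths come from the text above.

```python
from pathlib import Path

# Paths named in the text above, relative to the repository root.
REQUIRED = (
    "build/tools/vmaf",
    "python/test/resource/yuv/src01_hrc00_576x324.yuv",
)


def missing_prerequisites(repo_root):
    """Return the required files that do not exist under repo_root."""
    root = Path(repo_root)
    return [relative for relative in REQUIRED if not (root / relative).is_file()]
```

Running `missing_prerequisites(".")` from the repository root returns an empty list when both the binary and the golden fixture are in place.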

5. Data path safety

The following invariants are documented in docs/ai/training-data.md and ai/AGENTS.md:

.workingdir2/netflix/ is gitignored. YUV files are never committed.
The --data-root flag (or VMAF_DATA_ROOT env var) is the mandatory interface; scripts must not hard-code the corpus path.
NflxLocalDataset validates that each YUV is under <data-root>/ before reading (SEI CERT FIO02-C).
The train/test split is keyed by a deterministic hash of the clip's relative path, ensuring reproducibility across directory-enumeration order.
CI does not have the corpus; all CI gates run against committed fixtures only.
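Three of these invariants (root resolution, path containment, and the hash-keyed split) can be sketched as follows. This is not the NflxLocalDataset implementation: the function names, the choice of SHA-256, the 10,000-bucket scheme, and the 20% test fraction are assumptions for illustration.

```python
import hashlib
import os
from pathlib import Path


def resolve_data_root(cli_value=None, environ=None):
    """Return the corpus root from --data-root, falling back to VMAF_DATA_ROOT."""
    environ = os.environ if environ is None else environ
    raw = cli_value or environ.get("VMAF_DATA_ROOT")
    if not raw:
        raise ValueError("pass --data-root or set VMAF_DATA_ROOT")
    return Path(raw).resolve()


def checked_path(data_root, candidate):
    """Canonicalise candidate and refuse anything outside data_root."""
    root = Path(data_root).resolve()
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"{candidate!r} escapes the data root")
    return resolved


def split_of(relative_path, test_fraction=0.2):
    """Assign a clip to 'train' or 'test' from a hash of its relative path."""
    key = Path(relative_path).as_posix().encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "test" if bucket < test_fraction * 10_000 else "train"
```

Resolving the path before the containment check is the step FIO02-C calls for: a candidate such as `../outside.yuv` only reveals that it leaves the root after canonicalisation. Hashing the relative path (rather than relying on listing order) gives each clip the same split on every machine.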

6. Summary and recommendation

For the 2026-05-20 scaffold iteration:

Architecture: A1 (MLP 2×64) is the default candidate, with A2 (MLP 3×128) as runner-up. A4 (feature-reweighting) is worth a parallel evaluation branch once the ONNX export path is confirmed.
Training target: B1 (distill from vmaf_v0.6.1), which is reproducible, aligned with EfficientVMAF findings, and requires no external MOS labels.
Quality gate: PLCC/SROCC ≥ 0.97 LOSO mean (minimum), 0.99 preferred per EfficientVMAF §2.2 above.
Model size: C1 (nano, < 10 KB) for A1; C2 (tiny, < 100 KB) for A2/A4.
Evaluation: D1 + D2 mandatory; D3 gated by actual training run.

Architecture selection must be confirmed via the follow-up training PR before a run is triggered.