ADR-0242: Tiny-AI training on the original Netflix VMAF training corpus
- Status: Accepted
- Date: 2026-04-27
- Deciders: Lusoris, Claude (Anthropic)
- Tags: ai, training, fork-local, onnx, docs
Context
The fork's tiny-AI surface (ai/, core/src/dnn/) ships small ONNX full-reference (FR) regressors trained from libvmaf feature vectors. Until now those models were trained on the five canonical public datasets documented in docs/ai/training.md (NFLX Public, KoNViD-1k, LIVE-VQC, YouTube-UGC, BVI-DVC). A separate, unpublished copy of the original Netflix VMAF training corpus — 9 reference YUVs and 70 distorted YUVs produced under the encoding ladder used to build vmaf_v0.6.1 — is available on the user's local machine at .workingdir2/netflix/{ref,dis}/. That path is gitignored; the corpus is never committed.
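The directory layout described above can be sanity-checked before any feature extraction. The following is a minimal sketch, not part of the fork: the helper names are hypothetical, and it assumes the clips carry a `.yuv` suffix and sit directly inside `ref/` and `dis/`.

```python
from pathlib import Path


def count_corpus_files(data_root, suffix=".yuv"):
    """Count clips under <data_root>/ref and <data_root>/dis."""
    root = Path(data_root)
    counts = {}
    for split in ("ref", "dis"):
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing corpus directory: {split_dir}")
        # Assumption: raw clips use a .yuv suffix and are not nested deeper.
        counts[split] = sum(
            1 for p in split_dir.iterdir() if p.is_file() and p.suffix == suffix
        )
    return counts


def check_corpus(data_root, expected_ref=9, expected_dis=70):
    """Return True when the layout holds the 9 + 70 clips this ADR describes."""
    return count_corpus_files(data_root) == {"ref": expected_ref, "dis": expected_dis}
```

Calling `check_corpus(".workingdir2/netflix")` from the repository root would then confirm the local copy is complete before a multi-day run is started.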
The corpus naming pattern follows the Netflix encoding-ladder convention:
Training directly on these pairs gives tiny-AI models the best chance of matching or exceeding vmaf_v0.6.1 accuracy on Netflix-internal quality ladders — a goal surfaced in the unpublished-Netflix-models research thread (see user memory entry project_netflix_training_corpus_local.md). The relationship between this corpus and the three Netflix golden CPU reference pairs preserved in python/test/resource/yuv/ (see CLAUDE.md §8) must be explicit: the golden pairs are a held-out correctness gate and are never used as training data.
This ADR is a scaffold-only decision. The actual training run requires GPU access, is estimated at multiple days of wall-clock time per architecture sweep, and depends on forthcoming decisions about model architecture and distillation strategy. This PR defers those decisions deliberately.
Decision
We will merge a scaffold-only PR that:
- Documents the corpus path convention, the `--data-root` loader API, and the recommended evaluation harness in `docs/ai/training-data.md`.
- Records the architecture-choice space and distillation policy in this ADR (§ Alternatives considered) without picking any option yet.
- Adds an MCP end-to-end smoke test (`mcp-server/vmaf-mcp/tests/test_smoke_e2e.py`) that exercises the JSON-RPC `vmaf_score` tool against the smallest Netflix golden fixture, giving the user one-command verification that the MCP server is wired correctly before they attach Claude Code's MCP client to it.
- Does NOT run training, download data, or touch Netflix golden test assertions.
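For orientation, the JSON-RPC exchange that such a smoke test exercises can be sketched as below. MCP invokes tools through the `tools/call` method of JSON-RPC 2.0; everything else here is an assumption for illustration. The helper names are hypothetical and the `ref`/`dis` argument keys are placeholders, since the real `vmaf_score` input schema is not given in this ADR.

```python
import json


def build_tools_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request that invokes one MCP tool via tools/call."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def extract_result(response, request_id):
    """Return the result payload, raising if the reply is mismatched or an error."""
    if response.get("jsonrpc") != "2.0" or response.get("id") != request_id:
        raise ValueError("response does not match the request")
    if "error" in response:
        raise RuntimeError(f"tool call failed: {response['error']}")
    return response["result"]


# Hypothetical arguments: placeholder keys, not the server's actual schema.
request = build_tools_call(1, "vmaf_score", {"ref": "ref.yuv", "dis": "dis.yuv"})
wire_message = json.dumps(request)  # one serialized message ready for the transport
```

A smoke test built this way only needs to send `wire_message`, parse the reply, and assert that `extract_result` returns without raising.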
Actual training, architecture selection, and hyperparameter choices will land in a follow-up PR once the user has reviewed the alternatives table and answered any popup questions from the training agent.
Alternatives considered
A. Model architecture
| Option | Pros | Cons | Status |
|---|---|---|---|
| 2-layer MLP on libvmaf feature vectors (current `fr_tiny_v1` baseline) | Fast to train and evaluate; deterministic; interpretable; no new deps | Accuracy ceiling bounded by the hand-crafted features; no spatial sensitivity | Default starting point |
| 4-layer MLP with batch-norm | Higher capacity; still lightweight for ONNX export | Overfit risk on 70-pair corpus; need careful regularisation | Viable; evaluate in sweep |
| 1-D CNN over temporal feature sequences | Captures motion/temporal quality trends | Much larger; training data sparse for temporal modelling on 70 pairs | Defer to Phase 4 |
| Transformer on feature tokens | SOTA on NR tasks; flexible attention | Overkill for FR on 70 pairs; prohibitive training time without pre-training | Deferred |
B. Distillation vs from-scratch
| Option | Pros | Cons | Status |
|---|---|---|---|
| Distill from `vmaf_v0.6.1` (soft-label regression) | Tiny model inherits the `vmaf_v0.6.1` score distribution without needing raw MOS | Output is bounded by teacher; systematic teacher errors are inherited | Recommended starting point |
| Train from scratch on subjective scores Netflix published (ACM MM 2016 appendix) | Ground truth independent of teacher; potential to exceed vmaf_v0.6.1 | Published MOS for only a subset of pairs; high variance on 70-pair corpus | Viable; run in parallel sweep |
| Fine-tune an existing ONNX checkpoint | Fast convergence; stable initialisation | Risk of catastrophic forgetting; checkpoint may not exist for the right opset | Deferred |
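The difference between the first two rows is only the choice of regression target. The sketch below illustrates that under stated assumptions: the function names are hypothetical, the loss is plain mean squared error (the ADR does not fix a loss), and the scores in the usage note are made-up numbers.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length score sequences."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


def distillation_loss(student_scores, teacher_scores):
    """Soft-label regression: teacher scores stand in for subjective labels."""
    return mse(student_scores, teacher_scores)


def from_scratch_loss(student_scores, mos_scores):
    """From-scratch regression against subjective scores."""
    return mse(student_scores, mos_scores)
```

With illustrative values, a student that reproduces the teacher exactly gets `distillation_loss([80.0, 60.0], [80.0, 60.0]) == 0.0` while `from_scratch_loss([80.0, 60.0], [82.0, 58.0]) == 4.0`, which is the "teacher errors are inherited" drawback in miniature.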
C. Model size
| Option | Target params | Inference budget (CPU, 1080p) | Notes |
|---|---|---|---|
| Micro (≤ 4 KB ONNX) | < 1 K | < 2 ms | Fits embedded / Wasm targets |
| Small (≤ 64 KB ONNX) | 4 K – 16 K | 2–10 ms | Current fr_tiny_v1 range |
| Medium (≤ 512 KB ONNX) | 16 K – 128 K | 10–50 ms | Headroom for spatial features |
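To place a candidate architecture in one of these tiers, the parameter count and raw weight size can be estimated up front. This is a rough sketch with hypothetical helper names; it assumes float32 weights, fully connected layers with biases, and an input width of 6 features chosen purely for illustration. It ignores ONNX graph overhead, so real files will be somewhat larger.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected net, e.g. [inputs, hidden, outputs]."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def float32_weight_bytes(param_count):
    """Raw weight payload at 4 bytes per parameter (no ONNX graph overhead)."""
    return 4 * param_count


# Hypothetical 2-layer MLP: 6 input features, 64 hidden units, scalar output.
params = mlp_param_count([6, 64, 1])        # 6*64 + 64 + 64*1 + 1 = 513
weight_bytes = float32_weight_bytes(params)  # 2052 bytes of weights
```

Under these assumptions the example network has 513 parameters and about 2 KB of weights, so it would fall in the Micro tier before serialization overhead is counted.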
D. Evaluation scope
| Option | Pros | Cons | Status |
|---|---|---|---|
| Netflix golden CPU pairs only (3 pairs, `python/test/`) | Locked CI gate; regression-proof | Tiny sample; overfits to golden distribution | Required gate, not sole criterion |
| Cross-backend ULP delta (`/cross-backend-diff`) | Verifies numerical parity GPU↔CPU | Doesn't measure perceptual accuracy | Required gate for GPU paths |
| Golden + cross-backend + PLCC/SROCC on held-out split | Comprehensive | Most expensive to run each PR | Recommended for release gate |
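The PLCC/SROCC criterion in the last row refers to the Pearson linear and Spearman rank-order correlation between predicted and target scores. A dependency-free sketch follows; the function names are hypothetical and this is not the fork's evaluation harness, which may well use a library implementation instead.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

SROCC rewards getting the ordering of clips right regardless of scale, while PLCC also penalizes non-linear distortions of the score axis, which is why quality-metric evaluations usually report both.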
Consequences
- Positive: the training pipeline is clearly specified and ready to invoke interactively once the user confirms architecture choices; the MCP smoke test immediately catches MCP-server regressions; the data-path convention prevents accidental corpus commits.
- Negative: actual training runs are multi-day and GPU-bound; the corpus is local-only, so CI cannot gate on training correctness (only on the smoke test against the golden fixture).
- Neutral / follow-ups:
    - Follow-up PR to pick architecture, run training, export ONNX, register model, update `docs/ai/models/`.
    - ADR-0042 doc-substance rule applies: the follow-up PR must ship updated docs alongside the trained artefact.
    - The `--data-root` flag must be added to `vmaf-train extract-features` (currently it reads from the `${VMAF_DATA_ROOT}` env var; an explicit CLI flag is cleaner for interactive use).
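One way the `--data-root` follow-up could be wired is sketched below. This is not the existing `vmaf-train` code: the function name and `prog` string are hypothetical, and the precedence (explicit flag first, `VMAF_DATA_ROOT` as fallback) is an assumption, since this ADR does not state which should win.

```python
import argparse
import os


def resolve_data_root(argv, environ=None):
    """Return the corpus root: --data-root if given, else VMAF_DATA_ROOT."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(prog="extract-features-sketch")
    parser.add_argument(
        "--data-root",
        default=None,
        help="corpus directory containing ref/ and dis/",
    )
    args = parser.parse_args(argv)
    # Assumed precedence: the explicit flag overrides the environment variable.
    data_root = args.data_root or environ.get("VMAF_DATA_ROOT")
    if not data_root:
        parser.error("pass --data-root or set VMAF_DATA_ROOT")
    return data_root
```

Keeping the environment variable as a fallback would leave existing non-interactive invocations working while giving interactive users an explicit flag.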
References
- User memory entry `project_netflix_training_corpus_local.md` (paraphrased; Lusoris/Claude collaboration record, not committed).
- Li, Z. et al. "Toward a Practical Perceptual Video Quality Metric." Netflix Tech Blog, 2016. (Originating methodology for `vmaf_v0.6.1`.)
- Netflix Tech Blog: "VMAF: The Journey Continues" (2018, 2020).
- ADR-0042 — tiny-AI doc-substance rule.
- ADR-0108 — six deep-dive deliverables.
- Research digest 0019.
- Related PR: this scaffold PR (feat(ai): tiny-AI training scaffold + MCP smoke test (Netflix corpus prep)).
- Source: `req` (direct user instruction in daily prep-scaffolding routine).