ADR-0390: vmaf_tiny_v4, mlp_large arch (opt-in only; arch ladder stops here)
- Status: Accepted
- Date: 2026-05-02
- Deciders: Lusoris, Claude (Anthropic)
- Tags: ai, tiny-ai, model, inference
Context
PR #294 (parent ADR-0241) shipped vmaf_tiny_v3 (mlp_medium, 6→32→16→1, 769 params), achieving Netflix 9-fold LOSO PLCC = 0.9986 ± 0.0015 (vs v2's 0.9978 ± 0.0021). The PR's own report flagged a Phase-3e candidate, mlp_large (6→64→32→16→1, estimated there at ~2.7K params; the exact count with biases is 3 073), for follow-up evaluation: does the next rung on the architecture ladder buy further headroom, or does the canonical-6 / 4-corpus regime saturate at v3's capacity?
This ADR records the empirical answer.
Decision
We ship vmaf_tiny_v4 (mlp_large, 3 073 params) as an opt-in-only model alongside v2 (production default) and v3 (opt-in higher tier). The architecture ladder stops at v4: we will not pursue mlp_huge or further capacity scaling on the same canonical-6 + 4-corpus regime. Future quality gains require a regime change (more features, a different corpus, or a fundamentally different fusion strategy), not a wider MLP.
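The parameter counts quoted for v3 and v4 can be reproduced from the layer widths alone. The sketch below assumes every layer is a plain fully connected layer with a bias vector and no other learnable parameters; the helper name `mlp_param_count` is illustrative and not taken from the repository.

```python
def mlp_param_count(widths):
    """Count weights plus biases of a fully connected MLP.

    Assumption: each layer is dense with a bias term and there are no
    other learnable parameters (no normalization layers, for example).
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


# Layer widths as stated in this ADR.
ARCHS = {
    "mlp_medium (v3)": [6, 32, 16, 1],
    "mlp_large (v4)": [6, 64, 32, 16, 1],
}

for name, widths in ARCHS.items():
    print(f"{name}: {mlp_param_count(widths)} params")
# mlp_medium (v3): 769 params
# mlp_large (v4): 3073 params
```

Under that assumption v4 carries roughly four times the parameters of v3, which is the capacity step whose payoff this ADR evaluates.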
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Stay at v3 (mlp_medium); do not ship v4 | Smallest opt-in surface; clear "v3 is the top" story. | Loses the sub-rounding +0.0001 LOSO PLCC; the user explicitly requested an empirical evaluation of the v4 candidate. | Rejected: the task requires producing v4 plus an empirical SHIP / NO-SHIP determination. |
| Ship v4 as opt-in only, document that the arch ladder stops (chosen) | Honest empirical record; preserves v3 as an opt-in tier; closes the "is v4 worth it" question for future maintainers; +0.0001 PLCC with identical std vs v3 means no regression. | Adds a third tiny-AI fusion model surface. 14 KB ONNX vs v3's 4.5 KB (~3x). | Selected: the task spec calls for SHIP if PLCC ≥ v3, which v4 narrowly passes. The "arch ladder stops here" guidance is the load-bearing safeguard for future maintainers. |
| Ship v4 as production default, retire v3 | Single highest-PLCC model, less surface area. | The +0.0001 mean PLCC delta is below natural single-seed noise and cannot justify retiring v3 (which already has its own ADR-0241). v4's 14 KB ONNX is 5.7x v2's 2.5 KB. | Rejected: the gain is statistically indistinguishable from noise; the cost is real. |
| Train mlp_huge (6→128→64→32→16→1) as v5 | Tests further saturation. | v4's flat result vs v3 already demonstrates saturation on the current regime. Spending compute and ONNX bytes on a further ladder rung that the v3→v4 result predicts will be flat is wasteful. | Rejected: the saturation evidence is decisive enough; document the stop and move on. |
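The reasoning in the table combines two checks: the PR's gate (SHIP if PLCC ≥ v3's) and the observation that the gain sits inside fold-to-fold noise. The sketch below is one possible reading of that logic, not the project's actual tooling. The function name `ship_verdict` and the three-way mapping are assumptions, and the inputs are the rounded figures quoted in this ADR.

```python
def ship_verdict(v3_plcc, v4_plcc, noise_std):
    """Hypothetical three-way gate for a candidate model.

    NO-SHIP: candidate is below the baseline PLCC.
    OPT-IN:  candidate passes the gate, but the gain is smaller than
             the LOSO standard deviation (indistinguishable from noise).
    SHIP:    candidate passes the gate with a gain larger than the noise.
    """
    if v4_plcc < v3_plcc:
        return "NO-SHIP"
    if v4_plcc - v3_plcc < noise_std:
        return "OPT-IN"
    return "SHIP"


# Rounded figures from this ADR; the +0.0001 delta is described as sub-rounding.
V3_PLCC = 0.9986
V3_STD = 0.0015
V4_PLCC = V3_PLCC + 0.0001

print(ship_verdict(V3_PLCC, V4_PLCC, V3_STD))  # OPT-IN
```

With these inputs the sketch lands on OPT-IN, matching the chosen option: v4 passes the PLCC ≥ v3 gate, but the margin is far smaller than the spread across folds.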
Consequences
- Positive: v4 is registered, signed, and smoke-validated, and is available to users who want a third tier without hand-rolling one. The ADR's "arch ladder stops here" rationale prevents future agents/maintainers from spending cycles training v5/v6 on the same regime. It also establishes a saturation reference point for the canonical-6 + 4-corpus + 90-epoch Adam recipe.
- Negative: Three concurrent vmaf_tiny_* models (v2 default, v3 + v4 opt-in). Slightly more documentation surface. v4's 14 KB ONNX is ~6x v2's; trivial in absolute terms.
- Neutral / follow-ups: Future quality gains on tiny VMAF fusion regressors require a regime change (richer feature set, larger corpus, multi-seed averaging, ensemble), not deeper MLPs. If a maintainer revisits the arch ladder, this ADR is the prior art for "we already tried; it saturated".
References
- Parent ADR-0241 (v3 mlp_medium, ladder candidate).
- Research digest: docs/research/0048-vmaf-tiny-v4-mlp-large-evaluation.md.
- LOSO metrics: runs/vmaf_tiny_v4_loso_metrics.json (9 folds, single seed for parity with v3).
- Source (PR #294 body): "v4 candidate: mlp_large (6 → 64 → 32 → 16 → 1, ~2.7K params); SHIP if PLCC ≥ v3's, DO NOT SHIP otherwise". Verbatim user direction in this session: train + benchmark v4 and report SHIP / NO-SHIP / OPT-IN per the gate.