Research-0048: vmaf_tiny_v4 (mlp_large) evaluation: does the arch ladder saturate?
- Date: 2026-05-02
- Status: Complete
- Authors: Lusoris, Claude (Anthropic)
- Companion ADR: ADR-0242
Question
PR #294 (ADR-0241) shipped v3 (mlp_medium, 769 params) as the next rung above v2 (mlp_small, 257 params). The PR's own follow-up report flagged a Phase-3e candidate mlp_large (6→64→32→16→1, ~2.7K params): does the next rung buy further LOSO PLCC headroom, or has the canonical-6 + 4-corpus regime saturated?
Methodology
Identical to v3's research-0046 except for the architecture function. Recipe:
- Features: canonical-6 = `(adm2, vif_scale0, vif_scale1, vif_scale2, vif_scale3, motion2)`.
- Preprocessing: corpus-wide StandardScaler (mean / std baked into ONNX graph as Constant nodes).
- Optimiser: Adam @ lr=1e-3, MSE loss, 90 epochs, batch_size 256, seed=0.
- Training corpus: 4-corpus parquet (Netflix Public + KoNViD-1k + BVI-DVC A+B+C+D, 330 499 rows).
- Architecture: `nn.Linear(6,64) → ReLU → nn.Linear(64,32) → ReLU → nn.Linear(32,16) → ReLU → nn.Linear(16,1)`, 3 073 params (the spec's "~2.7K" estimate undercounts by ~370; the exact number is irrelevant, the saturation question stands).
- Validation: 5000-row Netflix smoke test (PLCC ≥ 0.97 gate) + 9-fold Netflix LOSO (single seed for parity with v3's eval; v2 was 5-seed).
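
The leave-one-source-out (LOSO) split named in the validation step can be sketched as below. This is a minimal illustration, not the project's evaluation script: the row shape, the `source` key, and the `loso_folds` helper are hypothetical, and the sketch does not say how rows from the non-Netflix corpora are handled in each fold.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, object]  # hypothetical row shape: {"source": str, ...features}


def loso_folds(
    rows: List[Row], key: str = "source"
) -> Iterator[Tuple[str, List[Row], List[Row]]]:
    """Yield (held_out_source, train_rows, test_rows), one fold per source."""
    by_source: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_source[str(row[key])].append(row)
    for held_out in sorted(by_source):
        test_rows = by_source[held_out]
        # Train on every row whose source differs from the held-out one.
        train_rows = [
            r for src, group in by_source.items() if src != held_out for r in group
        ]
        yield held_out, train_rows, test_rows


# Toy rows, illustrative only (not taken from the 4-corpus parquet).
toy_rows = [
    {"source": "A", "adm2": 0.9},
    {"source": "A", "adm2": 0.8},
    {"source": "B", "adm2": 0.7},
    {"source": "C", "adm2": 0.6},
]
folds = list(loso_folds(toy_rows))
```

With nine distinct Netflix sources this generator would produce the nine folds the recipe refers to; each fold's model never sees frames from the source it is scored on.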
Results
Architecture ladder comparison
| Model | Arch | Hidden | Params | ONNX size | Train RMSE | NF LOSO mean PLCC | NF LOSO std PLCC |
|---|---|---|---|---|---|---|---|
| vmaf_tiny_v2 | mlp_small | 6→16→8→1 | 257 | 2 446 B | 0.153 | 0.9978 (5-seed) | 0.0021 |
| vmaf_tiny_v3 | mlp_medium | 6→32→16→1 | 769 | 4 496 B | 0.112 | 0.9986 (1-seed) | 0.0015 |
| vmaf_tiny_v4 | mlp_large | 6→64→32→16→1 | 3 073 | 14 046 B | 0.104 | 0.9987 (1-seed) | 0.0015 |
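
The Params column can be cross-checked from the layer widths alone, since each fully connected layer contributes `in * out` weights plus `out` biases. A small sketch (the `mlp_param_count` helper is illustrative, not part of the repository):

```python
from typing import Dict, List


def mlp_param_count(widths: List[int]) -> int:
    """Weights plus biases of a fully connected MLP with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


# Layer widths as listed in the Hidden column of the ladder table.
ladder: Dict[str, List[int]] = {
    "mlp_small": [6, 16, 8, 1],
    "mlp_medium": [6, 32, 16, 1],
    "mlp_large": [6, 64, 32, 16, 1],
}
counts = {name: mlp_param_count(widths) for name, widths in ladder.items()}
print(counts)  # {'mlp_small': 257, 'mlp_medium': 769, 'mlp_large': 3073}
```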
Per-fold LOSO breakdown (v4)
| Source | n | PLCC | SROCC | RMSE |
|---|---|---|---|---|
| BigBuckBunny | 1500 | 0.9995 | 0.9964 | 0.86 |
| BirdsInCage | 1440 | 0.9999 | 0.9998 | 0.30 |
| CrowdRun | 1050 | 0.9999 | 0.9998 | 0.76 |
| ElFuente1 | 1260 | 0.9993 | 0.9952 | 1.13 |
| ElFuente2 | 1620 | 0.9989 | 0.9996 | 1.22 |
| FoxBird | 900 | 0.9952 | 0.9955 | 2.82 |
| OldTownCross | 1050 | 0.9992 | 0.9999 | 1.23 |
| Seeking | 1500 | 0.9989 | 0.9961 | 1.63 |
| Tennis | 720 | 0.9977 | 0.9978 | 1.15 |
| Mean | | 0.9987 | 0.9978 | 1.23 |
| Std (n-1) | | 0.0015 | 0.0020 | 0.70 |
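
The Mean and Std (n-1) rows can be recomputed from the nine per-fold values. The sketch below uses the rounded numbers printed in the table, so it reproduces the summary rows only to the table's precision; `statistics.stdev` is the sample (n-1) standard deviation.

```python
from statistics import mean, stdev

# Per-fold values copied from the v4 table, in row order.
plcc = [0.9995, 0.9999, 0.9999, 0.9993, 0.9989, 0.9952, 0.9992, 0.9989, 0.9977]
srocc = [0.9964, 0.9998, 0.9998, 0.9952, 0.9996, 0.9955, 0.9999, 0.9961, 0.9978]
rmse = [0.86, 0.30, 0.76, 1.13, 1.22, 2.82, 1.23, 1.63, 1.15]

# (mean, sample std) per metric across the 9 held-out sources.
summary = {
    name: (mean(values), stdev(values))
    for name, values in {"PLCC": plcc, "SROCC": srocc, "RMSE": rmse}.items()
}
for name, (mu, sigma) in summary.items():
    print(f"{name}: mean={mu:.4f} std={sigma:.4f}")
```

FoxBird (PLCC 0.9952, RMSE 2.82) is the fold that contributes most of the spread in both columns.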
v3 → v4 delta
- Mean PLCC: +0.0001 (well below one cross-fold std, 0.0015, of either model).
- Mean SROCC: +0.0001.
- Std PLCC: identical (0.0015).
- Train-set RMSE: −0.008 (a small improvement that reflects extra fitting capacity on the training set, not better generalisation).
- ONNX size: +9 550 B (3.1x v3, 5.7x v2).
Fold by fold, v4 narrowly beats v3 on 5 of 9 folds and matches or loses on the other 4, which is well within single-seed noise.
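
The deltas that can be derived from the ladder table follow by simple arithmetic. The sketch below recomputes them from the tabulated values; the dictionary layout is illustrative only, and the SROCC delta is omitted because v3's mean SROCC is not listed in this digest.

```python
# Values copied from the architecture ladder table.
v2 = {"onnx_bytes": 2446}
v3 = {"plcc": 0.9986, "plcc_std": 0.0015, "train_rmse": 0.112, "onnx_bytes": 4496}
v4 = {"plcc": 0.9987, "plcc_std": 0.0015, "train_rmse": 0.104, "onnx_bytes": 14046}

plcc_delta = round(v4["plcc"] - v3["plcc"], 4)              # mean PLCC change
rmse_delta = round(v4["train_rmse"] - v3["train_rmse"], 3)  # train RMSE change
size_delta = v4["onnx_bytes"] - v3["onnx_bytes"]            # extra bytes on disk
size_vs_v3 = v4["onnx_bytes"] / v3["onnx_bytes"]
size_vs_v2 = v4["onnx_bytes"] / v2["onnx_bytes"]

# How large the PLCC gain is relative to the fold-to-fold spread.
delta_in_stds = plcc_delta / v3["plcc_std"]

print(f"PLCC {plcc_delta:+.4f} ({delta_in_stds:.2f} std), RMSE {rmse_delta:+.3f}")
print(f"ONNX +{size_delta} B ({size_vs_v3:.1f}x v3, {size_vs_v2:.1f}x v2)")
```

The PLCC gain works out to well under a tenth of one cross-fold standard deviation, while the model file roughly triples in size.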
Interpretation
The architecture ladder saturates at v3 in this regime. v4 trains to a marginally better train-set RMSE (it over-fits slightly harder thanks to ~4x the parameters), but the held-out PLCC is flat to within single-seed noise. This is consistent with the canonical-6 feature set being information-bottlenecked: with only 6 input dimensions, a 769-param MLP already has enough capacity to represent the per-frame VMAF target, and adding more capacity buys nothing on held-out sources.
Decision matrix (mirrors ADR-0242)
| Option | Outcome |
|---|---|
| Stay at v3, don't ship v4 | Loses the empirical saturation evidence; user explicitly asked for v4. |
| Ship v4 as opt-in, document ladder stops (chosen) | Records saturation; protects future agents from re-running the same experiment. |
| Ship v4 as production default, retire v3 | +0.0001 PLCC < single-seed noise; not justified. |
| Train mlp_huge as v5 | v3→v4 saturation already predicts a flat outcome; not worth the compute. |
Follow-ups
- Future quality gains require regime change: richer features, larger corpus, multi-seed averaging, or a fundamentally different fusion strategy (e.g. ensemble, distillation from a frame-level CNN). A wider MLP is no longer the lever.
- A multi-seed v3 + v4 LOSO study (5+ seeds, matching v2's protocol) would tighten the variance estimate and confirm the +0.0001 mean delta is noise, not signal. Optional follow-up; not gating.
- The 3-tier (v2 default, v3 + v4 opt-in) story is documented in `docs/ai/inference.md`; downstream users select via `--tiny-model model/tiny/vmaf_tiny_v4.onnx`.
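
For the optional multi-seed study listed in the follow-ups, one way to judge whether a mean delta is noise is to pair runs by seed and compare the mean difference with its standard error. The sketch below is only an outline under stated assumptions: the `paired_seed_delta` helper is hypothetical, the 2-standard-error threshold is a rule of thumb rather than the project's criterion, and the per-seed numbers are placeholders, NOT measured results.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Sequence


def paired_seed_delta(base: Sequence[float], cand: Sequence[float]) -> Dict[str, float]:
    """Summarise per-seed differences (cand - base) for runs that share seeds."""
    if len(base) != len(cand) or len(base) < 2:
        raise ValueError("need at least two seeds, paired one-to-one")
    deltas = [c - b for b, c in zip(base, cand)]
    spread = stdev(deltas)  # sample (n-1) std of the paired differences
    return {
        "mean_delta": mean(deltas),
        "std_delta": spread,
        "std_err": spread / sqrt(len(deltas)),
    }


# Placeholder LOSO mean PLCC per seed (synthetic, for illustration only).
v3_by_seed = [0.9986, 0.9984, 0.9988, 0.9985, 0.9987]
v4_by_seed = [0.9987, 0.9983, 0.9988, 0.9987, 0.9985]

stats = paired_seed_delta(v3_by_seed, v4_by_seed)
# Assumed rule of thumb: |mean delta| under 2 standard errors reads as noise.
looks_like_noise = abs(stats["mean_delta"]) < 2 * stats["std_err"]
```

Pairing by seed removes seed-level variation shared by both models, so fewer seeds are needed than when comparing two independent means.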
References
- ADR-0242 (this digest's companion).
- ADR-0241 (parent — v3 mlp_medium ladder candidate).
- PR #294 body (v3 ship + v4 candidate flag).
- LOSO JSON: `runs/vmaf_tiny_v4_loso_metrics.json`.