Skip to content

Research-0048: vmaf_tiny_v4 (mlp_large) evaluation — does the arch ladder saturate?

  • Date: 2026-05-02
  • Status: Complete
  • Authors: Lusoris, Claude (Anthropic)
  • Companion ADR: ADR-0242

Question

PR #294 (ADR-0241) shipped v3 (mlp_medium, 769 params) as the next rung above v2 (mlp_small, 257 params). The PR's own follow-up report flagged a Phase-3e candidate mlp_large (6→64→32→16→1, ~2.7K params): does the next rung buy further LOSO PLCC headroom, or has the canonical-6 + 4-corpus regime saturated?

Methodology

Identical to v3's research-0046 except for the architecture function. Recipe:

  • Features: canonical-6 = (adm2, vif_scale0, vif_scale1, vif_scale2, vif_scale3, motion2).
  • Preprocessing: corpus-wide StandardScaler (mean / std baked into ONNX graph as Constant nodes).
  • Optimiser: Adam @ lr=1e-3, MSE loss, 90 epochs, batch_size 256, seed=0.
  • Training corpus: 4-corpus parquet (Netflix Public + KoNViD-1k + BVI-DVC A+B+C+D, 330 499 rows).
  • Architecture: nn.Linear(6,64) → ReLU → nn.Linear(64,32) → ReLU → nn.Linear(32,16) → ReLU → nn.Linear(16,1) — 3 073 params (the spec's "~2.7K" estimate undercounts by ~370; exact number is irrelevant, the saturation question stands).
  • Validation: 5000-row Netflix smoke test (PLCC ≥ 0.97 gate) + 9-fold Netflix LOSO (single seed for parity with v3's eval; v2 was 5-seed).

Results

Architecture ladder comparison

Model Arch Hidden Params ONNX size Train RMSE NF LOSO mean PLCC NF LOSO std PLCC
vmaf_tiny_v2 mlp_small 6→16→8→1 257 2 446 B 0.153 0.9978 (5-seed) 0.0021
vmaf_tiny_v3 mlp_medium 6→32→16→1 769 4 496 B 0.112 0.9986 (1-seed) 0.0015
vmaf_tiny_v4 mlp_large 6→64→32→16→1 3 073 14 046 B 0.104 0.9987 (1-seed) 0.0015

Per-fold LOSO breakdown (v4)

Source n PLCC SROCC RMSE
BigBuckBunny 1500 0.9995 0.9964 0.86
BirdsInCage 1440 0.9999 0.9998 0.30
CrowdRun 1050 0.9999 0.9998 0.76
ElFuente1 1260 0.9993 0.9952 1.13
ElFuente2 1620 0.9989 0.9996 1.22
FoxBird 900 0.9952 0.9955 2.82
OldTownCross 1050 0.9992 0.9999 1.23
Seeking 1500 0.9989 0.9961 1.63
Tennis 720 0.9977 0.9978 1.15
Mean 0.9987 0.9978 1.23
Std (n-1) 0.0015 0.0020 0.70

v3 → v4 delta

  • Mean PLCC: +0.0001 (well below 1 std of either model).
  • Mean SROCC: +0.0001.
  • Std PLCC: identical (0.0015).
  • Train-set RMSE: −0.008 (small improvement, reflecting overfit headroom not generalisation).
  • ONNX size: +9 550 B (3.1x v3, 5.7x v2).

The per-fold table shows v4 wins narrowly on 5 / 9 folds and matches or loses on the rest — well within single-seed noise.

Interpretation

The architecture ladder saturates at v3 on this regime. v4 trains to a marginally better train-set RMSE (over-fits a bit harder thanks to the ~4x parameters), but the held-out PLCC is statistically flat. This is consistent with the canonical-6 feature set being information-bottlenecked: with only 6 input dimensions, a 769-param MLP already has enough capacity to represent the per-frame VMAF target, and adding more capacity buys nothing on out-of-distribution sources.

Decision matrix (mirrors ADR-0242)

Option Outcome
Stay at v3, don't ship v4 Loses the empirical saturation evidence; user explicitly asked for v4.
Ship v4 as opt-in, document ladder stops (chosen) Records saturation; protects future agents from re-running the same experiment.
Ship v4 as production default, retire v3 +0.0001 PLCC < single-seed noise; not justified.
Train mlp_huge as v5 v3→v4 saturation already predicts a flat outcome; not worth the compute.

Follow-ups

  • Future quality gains require regime change: richer features, larger corpus, multi-seed averaging, or a fundamentally different fusion strategy (e.g. ensemble, distillation from a frame-level CNN). A wider MLP is no longer the lever.
  • A multi-seed v3 + v4 LOSO study (5+ seeds, matching v2's protocol) would tighten the variance estimate and confirm the +0.0001 mean delta is noise, not signal. Optional follow-up; not gating.
  • The 3-tier (v2 default, v3 + v4 opt-in) story is documented in docs/ai/inference.md; downstream users select via --tiny-model model/tiny/vmaf_tiny_v4.onnx.

References

  • ADR-0242 (this digest's companion).
  • ADR-0241 (parent — v3 mlp_medium ladder candidate).
  • PR #294 body (v3 ship + v4 candidate flag).
  • LOSO JSON: runs/vmaf_tiny_v4_loso_metrics.json.