Research-0046: vmaf_tiny_v3 mlp_medium evaluation vs v2 mlp_small
- Status: Active
- Companion ADR: ADR-0241
- Date: 2026-05-02
Question
Does a wider MLP at the same depth (mlp_medium: 6 → 32 → 16 → 1, 769 params) buy measurable headroom over the shipped vmaf_tiny_v2 (mlp_small: 6 → 16 → 8 → 1, 257 params) on Netflix LOSO when trained on the identical 4-corpus parquet with the identical recipe?
Phase-3d (the original arch sweep, prior to ADR-0216) reported the medium variant as inconclusive. Re-evaluating after the BVI-DVC A+B rows were added to the training corpus (3-corpus → 4-corpus, PR #255) was the last open question on the v2 → v3 path.
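As a quick check on the parameter counts quoted in the question, the totals follow from summing weights and biases for each Linear layer. The helper name below is illustrative and not taken from the repository.

```python
def mlp_param_count(layer_sizes):
    """Count weights + biases of a fully connected MLP given its layer widths."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


mlp_small = mlp_param_count([6, 16, 8, 1])    # v2 layout
mlp_medium = mlp_param_count([6, 32, 16, 1])  # v3 layout
print(mlp_small, mlp_medium, round(mlp_medium / mlp_small, 2))
```

Both layouts have two hidden layers, so the extra capacity comes from width alone.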
Methodology
- Training data: runs/full_features_4corpus.parquet, 330 499 per-frame rows, Netflix Public + KoNViD-1k + BVI-DVC A+B+C+D. Identical to v2.
- Features: canonical-6, (adm2, vif_scale0, vif_scale1, vif_scale2, vif_scale3, motion2). Identical to v2.
- Preprocessing: corpus-wide StandardScaler. Fit on full corpus for the production export; fit per-fold (8 of 9 sources) for the LOSO eval. Identical to v2.
- Optimiser: Adam @ lr=1e-3, MSE loss, 90 epochs, batch_size 256, seed=0. Identical to v2 ship recipe.
- Architecture (only thing that changes):
  - v2 mlp_small: Linear(6, 16) → ReLU → Linear(16, 8) → ReLU → Linear(8, 1), 257 params.
  - v3 mlp_medium: Linear(6, 32) → ReLU → Linear(32, 16) → ReLU → Linear(16, 1), 769 params.
- LOSO methodology: for each of the 9 Netflix sources, train from scratch on the union of the other 8 (with StandardScaler fit on those 8) and evaluate PLCC / SROCC / RMSE on the held-out source. Single-seed (seed=0); v2's published 0.9978 ± 0.0021 was averaged over 5 seeds. A multi-seed v3 sweep is follow-up scope.
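The LOSO protocol above can be sketched in plain Python. This is a minimal illustration, not the repository's eval_loso_vmaf_tiny_v3.py: the function names and the (source, features, target) tuple layout are assumptions, and the training step itself is left out. The scaler uses the population standard deviation, which is what scikit-learn's StandardScaler computes.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Per-feature mean and population std, computed on training rows only."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # constant feature: scale by 1
    return means, stds


def transform(rows, means, stds):
    """Standardise each row with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def loso_folds(samples):
    """Yield (held_out, (x_train, y_train), (x_test, y_test)) per source.

    `samples` is a list of (source_name, feature_row, target) tuples
    (hypothetical layout). The scaler never sees the held-out source.
    """
    for held_out in sorted({s for s, _, _ in samples}):
        train = [(x, y) for s, x, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        means, stds = fit_scaler([x for x, _ in train])
        x_train = transform([x for x, _ in train], means, stds)
        x_test = transform([x for x, _ in test], means, stds)
        yield held_out, (x_train, [y for _, y in train]), (x_test, [y for _, y in test])


# Toy data: three made-up sources with two features each.
toy = [
    ("A", [1.0, 10.0], 50.0), ("A", [2.0, 20.0], 60.0),
    ("B", [3.0, 30.0], 70.0), ("B", [4.0, 40.0], 80.0),
    ("C", [5.0, 50.0], 90.0), ("C", [6.0, 60.0], 95.0),
]
for name, (x_tr, y_tr), (x_te, y_te) in loso_folds(toy):
    print(name, len(x_tr), len(x_te))
```

Fitting the scaler inside the fold keeps held-out statistics from leaking into training, which is the point of the per-fold fit described above.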
v3 per-fold LOSO
| Held-out source | n | PLCC | SROCC | RMSE |
|---|---|---|---|---|
| BigBuckBunny | 1500 | 0.9992 | 0.9963 | 1.222 |
| BirdsInCage | 1440 | 0.9997 | 0.9985 | 0.808 |
| CrowdRun | 1050 | 0.9997 | 0.9998 | 0.810 |
| ElFuente1 | 1260 | 0.9993 | 0.9972 | 1.040 |
| ElFuente2 | 1620 | 0.9988 | 0.9991 | 1.307 |
| FoxBird | 900 | 0.9960 | 0.9958 | 2.682 |
| OldTownCross | 1050 | 0.9998 | 0.9999 | 0.641 |
| Seeking | 1500 | 0.9990 | 0.9957 | 1.307 |
| Tennis | 720 | 0.9961 | 0.9968 | 1.488 |
| mean ± std | — | 0.9986 ± 0.0015 | 0.9977 ± 0.0017 | 1.256 ± 0.604 |
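The summary row can be recomputed from the nine per-fold values. The sketch below uses the sample standard deviation (ddof=1), which is the convention that appears to match the reported figures; the population form would give a PLCC std of 0.0014 instead.

```python
from statistics import mean, stdev

# Per-fold values copied from the table, in row order.
plcc = [0.9992, 0.9997, 0.9997, 0.9993, 0.9988, 0.9960, 0.9998, 0.9990, 0.9961]
srocc = [0.9963, 0.9985, 0.9998, 0.9972, 0.9991, 0.9958, 0.9999, 0.9957, 0.9968]
rmse = [1.222, 0.808, 0.810, 1.040, 1.307, 2.682, 0.641, 1.307, 1.488]

summary = {}
for name, values, digits in (("PLCC", plcc, 4), ("SROCC", srocc, 4), ("RMSE", rmse, 3)):
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    summary[name] = f"{mean(values):.{digits}f} ± {stdev(values):.{digits}f}"
    print(name, summary[name])
```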
Decision matrix: v2 (shipped) vs v3 (candidate)
| Metric | v2 (mlp_small, 257 params, ship recipe) | v3 (mlp_medium, 769 params) | Δ |
|---|---|---|---|
| Netflix LOSO mean PLCC | 0.9978 ± 0.0021 (5-seed) | 0.9986 ± 0.0015 (1-seed) | +0.0008 mean, -29 % std |
| Netflix LOSO mean SROCC | 0.9959 ± 0.0027 (5-seed) | 0.9977 ± 0.0017 (1-seed) | +0.0018 mean, -37 % std |
| Netflix LOSO mean RMSE | — | 1.256 ± 0.604 | — |
| 5000-row Netflix smoke PLCC | 0.9998 | 1.0000 | +0.0002 |
| Train-set RMSE (4-corpus) | 0.153 | 0.112 | -0.041 (-27 %) |
| Parameter count | 257 | 769 | ×3.0 |
| ONNX file size | 2 446 B | 4 496 B | +2 050 B (+84 %) |
| ONNX opset | 17 | 17 | identical |
| Runtime contract | features [N, 6] → vmaf [N], scaler-baked | features [N, 6] → vmaf [N], scaler-baked | identical |
The PLCC/SROCC mean deltas are small in absolute terms, but two signals make the win look robust, with the caveat that v3 is single-seed:
- Fold-to-fold spread shrinks ~29 %. v3's LOSO PLCC std is 0.0015 vs v2's 0.0021, which is roughly a halving in variance terms. v3 is single-seed and v2 is 5-seed, so the comparison is not strictly like-for-like; the working assumption is that the inter-fold spread across held-out content dominates the seed-to-seed spread. On that reading, v3 is a more consistent estimator across diverse Netflix clips.
- The hard folds get easier. v2's worst fold was Tennis (PLCC ~0.994 in the multi-seed history); v3's worst is FoxBird at 0.9960, followed by Tennis at 0.9961. Both of v3's weakest folds improve on the historical v2 worst-fold figures.
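The relative changes in the decision matrix can be recomputed from the rounded table values, and squaring the standard deviations gives the corresponding change in variance. The helper name is illustrative.

```python
def rel_change(new, old):
    """Relative change of `new` against `old`, as a fraction."""
    return (new - old) / old


plcc_std = rel_change(0.0015, 0.0021)            # LOSO PLCC std, v3 vs v2
srocc_std = rel_change(0.0017, 0.0027)           # LOSO SROCC std, v3 vs v2
plcc_var = rel_change(0.0015 ** 2, 0.0021 ** 2)  # same comparison as variance
train_rmse = rel_change(0.112, 0.153)            # train-set RMSE
onnx_size = rel_change(4496, 2446)               # ONNX file size in bytes

for label, value in (
    ("PLCC std", plcc_std),
    ("SROCC std", srocc_std),
    ("PLCC variance", plcc_var),
    ("Train RMSE", train_rmse),
    ("ONNX size", onnx_size),
):
    print(f"{label}: {100 * value:+.1f} %")
```

Because the inputs are already rounded to four decimals, the percentages are only good to about the nearest whole percent.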
Decision
Ship v3 alongside v2 (not as a replacement). Production default stays v2 — the smaller model with the cited Phase-3 baseline. v3 is documented as the higher-PLCC / lower-variance option for users who want it. ADR-0241 captures the ship decision; alternatives weighed include "replace v2 with v3", "keep v2-only", "larger arch", and "multi-seed v3 before shipping". Multi-seed v3 LOSO + KoNViD 5-fold v3 evaluation are documented follow-ups, not gating.
Reproducer
```bash
# Train (~30 s wall on a 16-thread CPU; 4 min CPU-time).
python3 ai/scripts/train_vmaf_tiny_v3.py \
    --parquet runs/full_features_4corpus.parquet \
    --out-ckpt runs/vmaf_tiny_v3.pt \
    --out-stats runs/vmaf_tiny_v3_scaler.json

# Export (StandardScaler stats baked into ONNX as Constant nodes).
python3 ai/scripts/export_vmaf_tiny_v3.py \
    --ckpt runs/vmaf_tiny_v3.pt \
    --out-onnx model/tiny/vmaf_tiny_v3.onnx \
    --out-sidecar model/tiny/vmaf_tiny_v3.json

# Smoke validate (PLCC >= 0.97 gate; v2 diff sanity check).
python3 ai/scripts/validate_vmaf_tiny_v3.py \
    --onnx model/tiny/vmaf_tiny_v3.onnx \
    --parquet runs/full_features_netflix.parquet \
    --rows 5000 --min-plcc 0.97 \
    --v2-onnx model/tiny/vmaf_tiny_v2.onnx

# 9-fold LOSO eval (~10 s wall total).
python3 ai/scripts/eval_loso_vmaf_tiny_v3.py \
    --parquet runs/full_features_netflix.parquet \
    --out-json runs/vmaf_tiny_v3_loso_metrics.json
```
Open follow-ups
- Multi-seed v3 LOSO (5 seeds) for parity with v2's published numbers. Single-seed delta is +0.0008 PLCC; the 5-seed envelope may shift this either way.
- KoNViD 5-fold v3 evaluation. v2's published 0.9998 PLCC is the corpus-portability gate; v3 needs a parallel number.
- BVI-DVC slice metrics. v2 was only validated on the union; per-subset (A vs B vs C vs D) numbers would clarify whether the variance-shrink generalises off-Netflix.
- Phase-3e arch ladder. The next step beyond mlp_medium is mlp_large (~3K params). Phase-3d showed diminishing returns past medium, but on the 4-corpus data this should be re-checked.
- PTQ. Skipped here (model is still <5 KB); revisit if a v4 lands with significantly more capacity.
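For the multi-seed follow-up, one possible way to separate seed-to-seed spread from fold-to-fold spread is sketched below. The function name and the dict layout are assumptions, the numbers are placeholders rather than measured results, and this is not necessarily how v2's published 5-seed figure was aggregated.

```python
from statistics import mean, stdev


def summarise(plcc_by_seed):
    """Summarise per-fold PLCC across seeds.

    `plcc_by_seed` maps seed -> list of per-fold PLCC values, with the
    folds in the same order for every seed (hypothetical layout).
    """
    runs = list(plcc_by_seed.values())
    per_seed_means = [mean(folds) for folds in runs]
    per_fold_means = [mean(col) for col in zip(*runs)]
    return {
        "grand_mean": mean(per_seed_means),
        "seed_std": stdev(per_seed_means),  # spread due to initialisation
        "fold_std": stdev(per_fold_means),  # spread due to held-out content
    }


# Placeholder values for two seeds and three folds, NOT real measurements.
placeholder = {0: [0.999, 0.996, 0.998], 1: [0.998, 0.997, 0.999]}
print(summarise(placeholder))
```

If fold_std comes out well above seed_std on real runs, that would support reading the single-seed LOSO spread as mostly content-driven.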