Research Digest 0023 — 3-arch LOSO results on the Netflix corpus

  • Date: 2026-04-28
  • Author: Lusoris / Claude (Anthropic)
  • Status: Accepted. Confirms ADR-0203's single-split architecture choice under proper leave-one-source-out (LOSO) evaluation; informs vmaf_tiny_v1*.onnx shipping defaults.
  • Scope: Per-fold and aggregate PLCC / SROCC / RMSE for the mlp_small, mlp_medium, and linear regressors against vmaf_v0.6.1 on the 9-source / ~70-distortion Netflix corpus. Companion to Research Digest 0022 (which covered mlp_small alone) and ADR-0203.


1. Setup

  • Corpus: .workingdir2/netflix/{ref,dis}/, 9 reference YUVs.
  • Teacher: vmaf_v0.6.1 per-frame scores via the libvmaf CLI in core/build/tools/vmaf.
  • Architectures (parameter counts):
      • mlp_small: 257
      • mlp_medium: 2 561
      • linear: 7
  • Training: 30 epochs per fold per arch, default optimizer / lr / seed. Each fold trains on the 8 non-held-out sources.
  • Evaluation harness: ai/scripts/eval_loso_3arch.py. Reuses the per-clip JSON cache + _load_session helpers from PR #165's eval_loso_mlp_small.py.
  • Hardware: Ryzen 9 9950X3D / RTX 4090 / 64 GB.
  • Wall time: 9 folds × 3 architectures, at roughly 6 min per fold for mlp_small; mlp_medium takes ~1.4× and linear ~0.4× as long per fold. All three sweeps ran in parallel on local hardware.
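
The parameter counts listed above are consistent with small fully connected networks over 6 input features (the feature count is stated in §2). As a sanity check, the sketch below counts weights plus biases for dense layer stacks. The hidden widths (16, 8) and (64, 32) are assumptions chosen because they reproduce the listed counts; the digest does not state the layer sizes, and `dense_param_count` is a hypothetical helper, not part of the training code.

```python
def dense_param_count(widths):
    """Count weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


# Layer widths are ASSUMED (not stated in the digest): 6 inputs, 1 output.
assumed_widths = {
    "linear": [6, 1],
    "mlp_small": [6, 16, 8, 1],
    "mlp_medium": [6, 64, 32, 1],
}

param_counts = {arch: dense_param_count(w) for arch, w in assumed_widths.items()}
for arch, count in param_counts.items():
    print(f"{arch}: {count}")
```

Under these assumed widths the counts come out to 7, 257, and 2 561, matching the list above; other layer layouts could produce the same totals.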

2. Aggregate results (mean ± std across 9 folds)

| arch | params | mean PLCC | mean SROCC | mean RMSE |
| --- | --- | --- | --- | --- |
| mlp_small | 257 | 0.9808 ± 0.0214 | 0.9848 ± 0.0176 | 14.907 ± 2.218 |
| mlp_medium | 2 561 | 0.9727 ± 0.0202 | 0.9794 ± 0.0156 | 10.848 ± 2.302 |
| linear | 7 | 0.3679 ± 0.0773 | 0.4861 ± 0.0975 | 57.868 ± 5.867 |

The pattern is clear and matches the single-split finding in ADR-0203, now replicated under LOSO:

  • mlp_small wins ranking — highest PLCC + SROCC. Default tiny model vmaf_tiny_v1.onnx stays.
  • mlp_medium wins absolute fit — RMSE 10.85 vs 14.91 (~27 % reduction). Alternate vmaf_tiny_v1_medium.onnx stays for users who care about absolute-VMAF agreement on the Netflix corpus distribution.
  • linear is a sanity floor — PLCC 0.37 vs MLP 0.97+. Confirms the 6 features carry substantial signal but the relationship is strongly non-linear; linear is unshippable as a quality model.
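
The aggregate RMSE figures can be recomputed from the per-fold values in §3. The sketch below does this for the two MLPs, assuming the ± column is the sample standard deviation (n − 1 denominator); that choice is an assumption, and recomputed values may differ in the last digit because the per-fold numbers are rounded.

```python
from statistics import mean, stdev

# Per-fold RMSE copied from the per-fold tables in §3 (same fold order).
rmse = {
    "mlp_small": [16.666, 14.696, 13.273, 14.894, 15.773,
                  17.524, 15.647, 15.754, 9.939],
    "mlp_medium": [10.848, 11.594, 13.269, 9.025, 10.316,
                   14.693, 10.889, 10.311, 6.689],
}

# stdev() is the sample standard deviation (ASSUMED to match the ± column).
summary = {arch: (mean(vals), stdev(vals)) for arch, vals in rmse.items()}
for arch, (m, s) in summary.items():
    print(f"{arch}: RMSE {m:.3f} ± {s:.3f}")

# Relative reduction in mean RMSE when moving from mlp_small to mlp_medium.
reduction = 1.0 - summary["mlp_medium"][0] / summary["mlp_small"][0]
print(f"relative RMSE reduction: {reduction:.1%}")
```

The means reproduce 14.907 and 10.848, and the relative reduction lands at roughly 27 %.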

3. Per-fold tables

mlp_small

| fold | n | PLCC | SROCC | RMSE |
| --- | --- | --- | --- | --- |
| BigBuckBunny | 1 500 | 0.9767 | 0.9794 | 16.666 |
| BirdsInCage | 1 440 | 0.9905 | 0.9954 | 14.696 |
| CrowdRun | 1 050 | 0.9927 | 0.9938 | 13.273 |
| ElFuente1 | 1 260 | 0.9936 | 0.9909 | 14.894 |
| ElFuente2 | 1 620 | 0.9834 | 0.9912 | 15.773 |
| FoxBird | 900 | 0.9266 | 0.9425 | 17.524 |
| OldTownCross | 1 050 | 0.9939 | 0.9990 | 15.647 |
| Seeking | 1 500 | 0.9912 | 0.9952 | 15.754 |
| Tennis | 720 | 0.9788 | 0.9763 | 9.939 |

mlp_medium

| fold | n | PLCC | SROCC | RMSE |
| --- | --- | --- | --- | --- |
| BigBuckBunny | 1 500 | 0.9738 | 0.9787 | 10.848 |
| BirdsInCage | 1 440 | 0.9622 | 0.9715 | 11.594 |
| CrowdRun | 1 050 | 0.9941 | 0.9983 | 13.269 |
| ElFuente1 | 1 260 | 0.9706 | 0.9768 | 9.025 |
| ElFuente2 | 1 620 | 0.9796 | 0.9839 | 10.316 |
| FoxBird | 900 | 0.9286 | 0.9514 | 14.693 |
| OldTownCross | 1 050 | 0.9932 | 0.9976 | 10.889 |
| Seeking | 1 500 | 0.9876 | 0.9921 | 10.311 |
| Tennis | 720 | 0.9645 | 0.9646 | 6.689 |

linear

| fold | n | PLCC | SROCC | RMSE |
| --- | --- | --- | --- | --- |
| BigBuckBunny | 1 500 | 0.3596 | 0.5703 | 61.387 |
| BirdsInCage | 1 440 | 0.3651 | 0.6257 | 65.442 |
| CrowdRun | 1 050 | 0.4685 | 0.5720 | 53.139 |
| ElFuente1 | 1 260 | 0.2710 | 0.3908 | 53.997 |
| ElFuente2 | 1 620 | 0.4426 | 0.4627 | 52.619 |
| FoxBird | 900 | 0.2773 | 0.4026 | 60.054 |
| OldTownCross | 1 050 | 0.4256 | 0.5283 | 57.117 |
| Seeking | 1 500 | 0.2783 | 0.3324 | 50.324 |
| Tennis | 720 | 0.4229 | 0.4896 | 66.728 |

4. Cross-arch observations

  • FoxBird is the per-fold outlier on both MLPs, with the lowest PLCC on mlp_small (0.9266) and mlp_medium (0.9286). FoxBird has the second-smallest sample count (900; only Tennis, at 720, is smaller) and the most distinctive motion-coherence profile in the corpus, so its held-out fold has the least similar training signal. The same outlier appearing on both architectures rules out arch-specific overfitting; it is a corpus-distribution issue within the existing 9-source Netflix Public corpus. The Netflix Public Dataset is already in .workingdir2/netflix/ (9 reference clips + 70 distortion variants) and is what these LOSO runs train on, so "more Netflix Public" is not the unblocker. The natural unblocker is a different or larger training corpus that adds source-distribution diversity beyond Netflix's 9 clips, e.g. KoNViD-1k (Konstanz NR-VQA, 1 200 clips), BVI-DVC, or AOM-CTC source sets. Tier 6 has T6-1b (LPIPS-Sq) for an FR baseline expansion; corpus expansion is its own thread to open if FoxBird-class variance becomes a shipped-model concern.
  • Linear's variance is much higher (PLCC ±0.077 vs ±0.020 for MLPs). The poor-fit linear model exposes more inter-clip noise, which reflects the linear model's inability to capture the feature-to-score relationship rather than a property of the corpus itself.
  • PLCC and SROCC track each other tightly across all 27 fold-arch combinations (Pearson ≈ 0.97 between PLCC and SROCC across the table). Either is a sufficient ranking summary; we keep both for consistency with prior work.
  • mlp_medium's RMSE win is consistent across folds: all 9 folds show lower RMSE on medium than on small. The only near-tie is CrowdRun (13.269 vs 13.273); on the other eight folds the gap ranges from about 2.8 (FoxBird) to about 5.9 (ElFuente1). The 27 % aggregate RMSE reduction is real, not driven by a single fold.
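
The fold-by-fold RMSE comparison can be recomputed directly from the §3 tables. A minimal sketch, with the values copied from those tables; a positive delta means mlp_medium has the lower RMSE on that fold.

```python
folds = ["BigBuckBunny", "BirdsInCage", "CrowdRun", "ElFuente1", "ElFuente2",
         "FoxBird", "OldTownCross", "Seeking", "Tennis"]
rmse_small = [16.666, 14.696, 13.273, 14.894, 15.773,
              17.524, 15.647, 15.754, 9.939]
rmse_medium = [10.848, 11.594, 13.269, 9.025, 10.316,
               14.693, 10.889, 10.311, 6.689]

# delta = small - medium, rounded to the precision of the source tables.
deltas = {f: round(s - m, 3) for f, s, m in zip(folds, rmse_small, rmse_medium)}
medium_wins = [f for f, d in deltas.items() if d > 0]
closest = min(deltas, key=lambda f: abs(deltas[f]))

# Print folds from smallest to largest gap.
for fold, d in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{fold:<13} {d:+.3f}")
print(f"mlp_medium lower on {len(medium_wins)}/{len(folds)} folds; "
      f"closest fold: {closest}")
```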

5. Implications

  • vmaf_tiny_v1.onnx (mlp_small) remains the shipped default. Its LOSO-mean PLCC of 0.98 is the honest "expected accuracy on a new clip from this distribution" number, ahead of mlp_medium by about 0.8 PLCC points (0.9808 vs 0.9727).
  • vmaf_tiny_v1_medium.onnx (mlp_medium) remains the shipped alternate for users who care about absolute-VMAF agreement — the 27 % RMSE reduction is the canonical reason to opt in.
  • linear does not ship; remains as a sanity-floor harness control documented in ADR-0203's "Three-arch sweep" section.
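  • Future work for FoxBird's outlier status: a training corpus with more source diversity than the 9-source Netflix Public Dataset, as outlined in §4. T6-1a (Netflix Public Dataset training corpus) covers the corpus these runs already use, so it does not by itself lift the 9-source ceiling.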

6. Reproducer

# 1. Build libvmaf CLI (CPU is enough; eval is host-only)
meson setup build -Denable_cuda=false -Denable_sycl=false
ninja -C build

# 2. Sweep 9 LOSO folds × 3 arch (sequentially or in parallel; the
#    trainer spawns its own per-fold worker)
for arch in mlp_small mlp_medium linear; do
  for src in BigBuckBunny BirdsInCage CrowdRun ElFuente1 ElFuente2 \
             FoxBird OldTownCross Seeking Tennis; do
    out=model/tiny/training_runs/loso_${arch}/fold_${src}
    mkdir -p "$out"
    VMAF_TRAIN_OUT_DIR="$out" \
      bash ai/scripts/run_training.sh \
        --model-arch "$arch" --epochs 30 --val-source "$src"
  done
done

# 3. 3-arch eval
python ai/scripts/eval_loso_3arch.py
cat runs/loso_eval/loso_3arch_eval.md
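
Because the sweep loop creates each output directory before training starts, a failed fold can leave an empty directory behind. Before running the eval step it may help to confirm that all 27 fold directories are populated. The sketch below uses the directory layout from the loop above; `missing_fold_dirs` is a hypothetical helper, not part of the repository, and it only checks that each directory is non-empty, not that its contents are valid.

```python
from pathlib import Path

ARCHS = ["mlp_small", "mlp_medium", "linear"]
SOURCES = ["BigBuckBunny", "BirdsInCage", "CrowdRun", "ElFuente1", "ElFuente2",
           "FoxBird", "OldTownCross", "Seeking", "Tennis"]


def missing_fold_dirs(root="model/tiny/training_runs"):
    """Return expected fold directories that are absent or empty."""
    missing = []
    for arch in ARCHS:
        for src in SOURCES:
            fold_dir = Path(root) / f"loso_{arch}" / f"fold_{src}"
            # An empty directory means mkdir ran but training wrote nothing.
            if not fold_dir.is_dir() or not any(fold_dir.iterdir()):
                missing.append(fold_dir)
    return missing


gaps = missing_fold_dirs()
total = len(ARCHS) * len(SOURCES)
print(f"{len(gaps)} of {total} fold directories missing or empty")
for path in gaps:
    print(f"  {path}")
```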

7. Known artefacts

  • runs/loso_eval/loso_3arch_eval.{json,md} is gitignored — the empirical numbers in §2–3 above are the durable record. Re-run the harness to regenerate.
  • The per-fold ONNX files under model/tiny/training_runs/loso_{mlp_small,mlp_medium,linear}/ are also gitignored. Regenerate via the loop in §6.
  • The shipped baselines vmaf_tiny_v1*.onnx retain their pre-rename external_data.location references (a known issue worked around in _load_session); a proper re-export with matching names is a follow-up tracked elsewhere.