Research-0081: fr_regressor_v2 ensemble — real-corpus retrain methodology¶
- Status: Active
- Date: 2026-05-05
- ADR: ADR-0309
- Related: Research-0075 (parent — gate theory + conformal calibration sketch), Research-0067 (deterministic LOSO baseline), Research-0058 (codec-aware feasibility).
This digest answers three operational questions for the real-corpus LOSO retrain ADR-0309 specifies:
- Is the locally available Netflix corpus large enough?
- Does the 9-fold LOSO sizing carry over from the deterministic ADR-0291 run?
- Which hyperparameters drive seed diversity (the load-bearing property for the
max - min ≤ 0.005gate)?
1. Corpus-size sufficiency¶
The corpus at .workingdir2/netflix/ carries 9 reference clips and 70 distorted clips, sized at ~37 GB total. Per Research-0067 §Corpus sizing:
- Per source, the deterministic ADR-0291 run derived 33,840 per-frame canonical-6 rows over the Phase A grid (9 sources × NVENC + QSV × preset/CRF sweep). That corpus has saturated PLCC at 0.964 ± 0.002 across re-runs (Research-0067 Table 2).
- The minimum source count for stable 9-fold LOSO is 8 — at 7, Research-0067 saw a 0.004 PLCC variance pop because one held-out source carried a content type underrepresented in the training fold. At 9 sources the variance dropped back to 0.001.
Verdict: the local corpus matches the corpus that produced the 0.964 deterministic baseline. No corpus-size shortfall risk for the ensemble retrain.
2. LOSO fold sizing¶
The 9 Netflix sources baked into ai/scripts/train_fr_regressor_v2_ensemble_loso.py::NETFLIX_SOURCES match the order from eval_loso_vmaf_tiny_v3.py / eval_loso_vmaf_tiny_v5.py so the fold-by-fold PLCC traces are directly comparable between the deterministic ADR-0291 run and the ensemble seeds.
Per-fold expected PLCC (from Research-0067 §LOSO baselines on the deterministic v2):
| Held-out fold | Mean PLCC | Notes |
|---|---|---|
| BigBuckBunny_25fps | 0.962 | clean-source baseline |
| BirdsInCage_30fps | 0.967 | high-motion natural |
| CrowdRun_25fps | 0.971 | dense crowd, hard for motion2 |
| ElFuente1_30fps | 0.954 | low-light + grain |
| ElFuente2_30fps | 0.961 | dim, complementary to fuente1 |
| FoxBird_25fps | 0.965 | static + small-motion mix |
| OldTownCross_25fps | 0.969 | textured |
| Seeking_25fps | 0.952 | weakest fold; rapid camera motion |
| Tennis_24fps | 0.974 | strongest fold; predictable motion |
| Mean across folds | 0.964 | — |
The Seeking_25fps fold is the load-bearing weak point — if seed diversity causes one ensemble member to overfit a Seeking-like content distribution, that seed's hold-out PLCC on Seeking will tank and pull min(PLCC_i) below the spread bound. Watch this fold in the per-seed JSON.
3. Seed-diversity hyperparameters¶
The ensemble's predictive variance is only useful if the seeds actually disagree on out-of-fold predictions. The max - min ≤ 0.005 gate is tight enough that the natural seed-to-seed variance from random init alone usually clears it; the failure mode is insufficient diversity (all seeds collapse to the same near-mode solution), not excess diversity.
| Knob | ADR-0303 default | Effect on diversity |
|---|---|---|
| Random init RNG | torch.manual_seed(seed) | Drives weight init; the dominant diversity source. |
| Mini-batch shuffle order | seeded per epoch | Secondary diversity source; matters when batch-size << corpus-size. |
| Dropout-at-train | None (deterministic) | The FRRegressor (ADR-0235) has no dropout; not a diversity knob here. |
| Weight-decay | 1e-5 | Regularisation; uniform across seeds. Not a diversity knob — varying it would invalidate the gate. |
| Learning rate | 5e-4 | Uniform; same rationale as weight-decay. |
| Epochs | 200 | Long enough that all seeds reach the local minimum; reducing it widens spread. |
| Bootstrap of training rows | not applied | Could be enabled per Lakshminarayanan 2017 §3.2 to raise diversity if 5 seeds collapse. |
Diagnostic for HOLD-on-spread cases:
- Spread
max - min > 0.005with all PLCC ≥ 0.95: seeds are stable individually but disagree per-fold. Most likely cause: one fold (almost always Seeking_25fps) has a seed-dependent optimum. Remediation: re-seed (cheap), or enable bootstrap to raise inter-seed diversity in a controlled way. - Spread
max - min > 0.005with some PLCC < 0.95: one or more seeds are individually under-fit. Remediation: increase epochs (200 → 400 — note: do not lower the per-seed gate threshold; per the no-test-weakening rule fix the implementation, not the gate). - Spread
≤ 0.005but mean< 0.95: tight ensemble agreement on a poor fit. Most likely a corpus regression. Re-derive the Phase A canonical-6 features and confirm against the deterministic v2 baseline first — Research-0067 §Reproducer.
Summary¶
The harness ADR-0309 ships requires no new corpus; the local 9-source Netflix Public Dataset is sufficient. The 9-fold LOSO ordering and the 200-epoch / 5e-4 / 1e-5 hyperparameters are inherited from ADR-0291 / ADR-0303 and should be held uniform across seeds — diversity comes from RNG seed alone, not hyperparameter spread. Watch the Seeking_25fps fold for the dominant failure mode if the gate's max - min ≤ 0.005 ever fires.
References¶
- Lakshminarayanan, B., Pritzel, A., Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NIPS 2017.
- Research-0067 — deterministic v2 LOSO baseline (per-fold PLCC table inherited above).
- Research-0075 — parent: ensemble theory + conformal calibration sketch + the
max - min ≤ 0.005derivation. - ADR-0303 — gate definition consumed by this digest.
- ADR-0309 — the harness ADR this digest accompanies.