Research-0075: `fr_regressor_v2` ensemble — production flip LOSO protocol¶

Date: 2026-05-05
Authors: Lusoris, Claude (Anthropic)
Status: Final (scaffold-time digest — closes the literature loop before the trainer + CI gate ship)
Tags: ai, fr-regressor, ensemble, probabilistic, loso, conformal
Companion ADR: ADR-0303
Predecessors: Research-0067 (probabilistic), Research-0067 (prod-loso)

Question¶

What ship gate does the deep ensemble for fr_regressor_v2 need to clear before the five smoke-seed registry rows (PR #372) flip from smoke: true to smoke: false, and what is the minimum-cost LOSO protocol that exercises that gate on the existing Phase A corpus?

Method¶

The literature loop closes around three reference points:

Lakshminarayanan, Pritzel, Blundell (2017) — Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NIPS 2017. Core finding: independently-seeded networks with proper scoring rule training (NLL or MSE for regression) dominate single-network uncertainty estimators on calibration quality. The empirical recommendation is N=5 ensemble members — N<5 is unreliable, N>5 is diminishing returns on calibration quality at linear inference cost. The Lakshminarayanan calibration plots show that on regression benchmarks (UCI, year-prediction, video MOS), 5-member ensembles reach within 1–2 % of the asymptotic calibration of 100-member ensembles.
Romano, Patterson, Candès (2019) — Conformalized Quantile Regression. NeurIPS 2019. The conformal correction we apply on top of the ensemble's empirical CDF gives a marginal coverage guarantee P(y ∈ [q_lo, q_hi]) ≥ 1 - α independent of the base model's calibration. This is the "free" piece — it adds no training-time cost, only a calibration-fold pass. Out of scope for this PR (the conformal layer is in the probabilistic-head backlog under ADR-0279); we record the protocol here so the eventual prod-flip can plug it in.
Foong et al. (2019) — On the Expressiveness of Approximate Inference in Bayesian Neural Networks. Demonstrates empirically that MC-dropout under-estimates predictive variance on regression tasks where the residual error is content-dependent (which is exactly our setting — VMAF residuals correlate with content complexity). This is the negative result that pushes us to deep ensembles instead of MC-dropout.

Conformal calibration sketch (deferred)¶

The eventual probabilistic-head flip will layer split-conformal on top of the ensemble's predictive CDF:

Hold out a calibration split (~10 % of training rows, stratified by source × encoder).
Train the 5-seed ensemble on the remaining 90 %.
For each calibration row, compute the conformity score r_i = |y_i - μ_ensemble_i| / σ_ensemble_i (normalised residual).
The 95 % conformal interval at inference is [μ - q̂_{0.95}(r) · σ, μ + q̂_{0.95}(r) · σ] where q̂_{0.95} is the empirical 95-percentile of r over the calibration split.

This sketch is recorded for the follow-up; the trainer in this PR emits enough metadata (per-seed predictions on each fold) for the conformal layer to consume without re-running training.

9-fold LOSO protocol — Netflix Public Dataset¶

The LOSO eval runs nine folds, one per Netflix source:

sources = [
    "BigBuckBunny_25fps", "BirdsInCage_30fps",  "CrowdRun_25fps",
    "ElFuente1_30fps",    "ElFuente2_30fps",    "FoxBird_25fps",
    "OldTownCross_25fps", "Seeking_25fps",      "Tennis_24fps",
]

Per fold:

Held-out: all rows where source == held_out.
Train: all other rows in runs/phase_a/full_grid/per_frame_canonical6.jsonl (the corpus from PR #392 — 33,840 per-frame canonical-6 rows × 9 Netflix sources × NVENC + QSV codec families).
Per-seed: train FRRegressor(in_features=6, num_codecs=12) for 200 epochs (Adam, lr=5e-4, batch=32, weight_decay=1e-5) under five seeds {0, 1, 2, 3, 4}. StandardScaler fit on the training fold, applied to the held-out fold.
Eval: PLCC, SROCC, RMSE per fold per seed.
Aggregate: mean PLCC across folds → PLCC_i for seed i.

The trainer emits one JSON per seed (loso_seed{N}.json) with the schema:

{
  "seed": 0,
  "corpus": "runs/phase_a/full_grid/per_frame_canonical6.jsonl",
  "n_folds": 9,
  "folds": [
    {"held_out": "BigBuckBunny_25fps", "plcc": 0.9712, "srocc": 0.99, "rmse": 3.7, "n_train": 30000, "n_val": 3840},
    ...
  ],
  "mean_plcc": 0.9683,
  "std_plcc": 0.0211,
  "wall_time_s": 28.4
}

The CI gate (scripts/ci/ensemble_prod_gate.py) reads the five JSONs and decides:

seeds = [load(f"loso_seed{i}.json") for i in range(5)]
mean_plccs = [s["mean_plcc"] for s in seeds]
gate_pass = (
    sum(mean_plccs) / 5 >= 0.95             # mean ship gate
    and max(mean_plccs) - min(mean_plccs) <= 0.005  # variance bound
)

Expected PLCC baseline¶

ADR-0291 / Research-0067 (prod-loso) reported deterministic v2 LOSO PLCC = 0.9681 ± 0.0207 on the same corpus. The expected ensemble behaviour:

Mean per-seed PLCC ≥ 0.99 baseline aspiration — averaging five independent trainings on the same data should slightly improve on the single-network 0.9681 (typical ensemble lift on UCI-style regression: +0.005–0.02 PLCC). Calling 0.99 a baseline is optimistic for the OldTownCross outlier fold but realistic for the other eight; the gate stays at the conservative 0.95 so we don't accidentally bake in a baseline the corpus can't sustain.
Per-seed std ≤ 0.005 — reseed-to-reseed variation on the same training corpus + protocol typically lands at ~0.002–0.004 PLCC for small MLPs. The 0.005 variance bound is calibrated against this expected envelope plus a margin.

Reproducer (smoke-only — no real corpus on this branch)¶

# Smoke-only: argparse + scaffold smoke (no real training).
python ai/scripts/train_fr_regressor_v2_ensemble_loso.py --help

# When the real corpus is present (follow-up PR):
python ai/scripts/train_fr_regressor_v2_ensemble_loso.py \
    --seeds 0,1,2,3,4 \
    --corpus runs/phase_a/full_grid/per_frame_canonical6.jsonl \
    --out-dir runs/ensemble_loso/

# Then run the gate:
python scripts/ci/ensemble_prod_gate.py runs/ensemble_loso/

Caveats / known limitations¶

OldTownCross fold remains the calibration weak point — single-seed v2 reported PLCC=0.9183 on this fold (Research-0067 prod-loso). The ensemble may not lift this fold's PLCC above 0.95 because the underlying VMAF range on OldTownCross encodes is too compressed (most cells score 96–99). The mean-across-folds gate is the primary protection; we accept that one fold may stay sub-gate.
No SW encoders in corpus — vmaf-tune Phase A used NVENC + QSV hardware encoders (12-encoder vocab v2). The eventual ensemble flip inherits this corpus restriction; a separate retrain on a software-encoder sweep (T-FR-V2-SW-CORPUS) is tracked but blocks neither this scaffold nor the prod flip.
Conformal layer not exercised here — Research-0075's main contribution is the ensemble training + gate; the conformal CDF pass is sketched for the probabilistic-head follow-up only.

References¶

Lakshminarayanan, B., Pritzel, A., Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NIPS 2017.
Romano, Y., Patterson, E., Candès, E. (2019). Conformalized Quantile Regression. NeurIPS 2019.
Foong, A., Burt, D., Li, Y., Turner, R. (2019). On the Expressiveness of Approximate Inference in Bayesian Neural Networks. NeurIPS 2019.
ADR-0303 — ensemble prod-flip trainer + CI gate decision (this digest).
ADR-0291 — deterministic v2 prod flip + 0.95 ship gate.
ADR-0279 — probabilistic head + ensemble scaffold (PR #372).
Research-0067 (prod-loso) — deterministic v2 LOSO baseline (mean PLCC=0.9681 ± 0.0207).
Research-0067 (probabilistic) — ensemble vs MC-dropout vs SWAG calibration shootout.

Research-0075: fr_regressor_v2 ensemble — production flip LOSO protocol¶