ADR-0303: `fr_regressor_v2` ensemble - production flip trainer + CI gate

Status: Accepted
Date: 2026-05-05
Deciders: Lusoris, Claude (Anthropic)
Companion research digest: Research-0075
Tags: ai, fr-regressor, ensemble, probabilistic, loso, ci-gate, fork-local
Related: ADR-0291 (v2 deterministic prod flip — defines the 0.95 LOSO PLCC ship gate), ADR-0279 (probabilistic head scaffold — deep-ensemble + conformal), ADR-0235 (codec-aware decision + the 0.95 LOSO PLCC ship gate it inherits), ADR-0237 (Phase A consumer).

Context

PR #372 shipped the scaffold for the fr_regressor_v2_ensemble_v1 deep ensemble — five kind: "fr" rows in model/tiny/registry.json (fr_regressor_v2_ensemble_v1_seed{0..4}), each carrying smoke: true and a 5 KB smoke ONNX. The scaffold wires the data path (per-member sidecar, manifest JSON, train_fr_regressor_v2_ensemble.py exporter) but the seed checkpoints are not trained on the real Phase A hardware-encoder corpus yet — they are the same under-trained smoke graphs as the original v2 scaffold from ADR-0272.

The deterministic ADR-0291 flip proved the data path works end-to-end on the runs/phase_a/full_grid/per_frame_canonical6.jsonl corpus (33,840 per-frame canonical-6 rows × 9 Netflix sources × NVENC + QSV, 12-encoder vocab v2). Two pieces are still missing: a leave-one-source-out (LOSO) trainer for the ensemble that emits per-seed loso_seed{N}.json artefacts, and a CI gate script that promotes seeds from smoke: true to smoke: false once they clear the production threshold.
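A minimal sketch of the leave-one-source-out loop such a trainer would run for one seed. Everything specific here is an assumption for illustration: the row fields `source`, `x` and `vmaf`, the univariate least-squares fit standing in for the actual regressor, the synthetic corpus, and the report layout (which is not the actual loso_seed{N}.json schema).

```python
import json
import math
import random

def plcc(pred, target):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)

def fit_line(xs, ys):
    """Stand-in regressor: univariate least squares, returns (slope, bias)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def loso_plcc(rows):
    """Hold out each source once; return {source: PLCC on its held-out rows}."""
    per_source = {}
    for held_out in sorted({r["source"] for r in rows}):
        train = [r for r in rows if r["source"] != held_out]
        test = [r for r in rows if r["source"] == held_out]
        slope, bias = fit_line([r["x"] for r in train], [r["vmaf"] for r in train])
        pred = [slope * r["x"] + bias for r in test]
        per_source[held_out] = plcc(pred, [r["vmaf"] for r in test])
    return per_source

# Synthetic stand-in corpus (hypothetical): 9 sources, 40 rows each.
rng = random.Random(0)
rows = []
for s in range(9):
    for _ in range(40):
        x = rng.uniform(0.0, 1.0)
        rows.append({"source": f"src{s}", "x": x,
                     "vmaf": 60 + 35 * x + rng.gauss(0, 1)})

per_source = loso_plcc(rows)
report = {"seed": 0, "per_source_plcc": per_source,
          "mean_plcc": sum(per_source.values()) / len(per_source)}
report_json = json.dumps(report, indent=2)  # what a per-seed artefact could hold
```

The per-seed mean over the nine held-out sources is the quantity the gate consumes; keeping the per-source values in the artefact makes a failing fold easy to spot during review.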

The probabilistic head exists so the in-flight vmaf-tune --quality-confidence flag (consumer of ADR-0237) can answer risk-aware queries — "smallest CRF such that the lower bound of the 95 % VMAF interval is ≥ 92". A deep ensemble gives that distribution through five independent point estimates aggregated at inference; the flip is only safe if the ensemble clears a tighter gate than any single seed alone (otherwise the across-seed spread is misleading).
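The risk-aware query can be sketched as follows. This ADR does not state how the 95 % interval is derived from the five point estimates, so the sketch assumes a Gaussian approximation (mean minus 1.96 sample standard deviations across members); the function names and the per-CRF predictions are hypothetical.

```python
import statistics

def interval_lower_bound(member_preds, z=1.96):
    """Lower end of a Gaussian-approximate 95% interval over ensemble members."""
    mean = statistics.fmean(member_preds)
    std = statistics.stdev(member_preds)  # sample std across the members
    return mean - z * std

def crfs_clearing(preds_by_crf, target):
    """CRFs whose interval lower bound is at least `target`, in ascending order."""
    return [crf for crf in sorted(preds_by_crf)
            if interval_lower_bound(preds_by_crf[crf]) >= target]

# Hypothetical per-member VMAF predictions (five seeds) for three CRF values.
preds_by_crf = {
    20: [96.1, 96.3, 96.0, 96.2, 96.4],
    24: [93.9, 94.1, 94.0, 94.2, 93.8],
    28: [92.4, 91.6, 92.9, 91.2, 92.6],
}
ok = crfs_clearing(preds_by_crf, target=92.0)
```

In this made-up data the CRF 28 members average just above 92, yet their disagreement pulls the lower bound below the target, so CRF 28 is excluded. That is the difference between a point-estimate query and a risk-aware one, and it is also why a misleading across-seed spread would corrupt the answer.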

Decision

We will land the LOSO trainer scaffold + CI gate in this PR (without flipping the registry rows yet) so that a follow-up PR — gated on a real-corpus LOSO run — can flip smoke: true → false for each seed once it clears the gate. The actual ONNX swap and registry flip stay out of scope here; only the trainer + gate ship.

The production ship gate for the ensemble is two-part and tighter than ADR-0291's per-seed gate:

Mean per-seed PLCC ≥ 0.95: mean_i(PLCC_i) ≥ 0.95 over the five seeds, where PLCC_i is the LOSO mean PLCC (Pearson linear correlation coefficient) across the nine Netflix sources for seed i. This inherits ADR-0235 / ADR-0291's ship gate per member.
Variance bound max_i(PLCC_i) - min_i(PLCC_i) ≤ 0.005: the spread of per-seed LOSO PLCC across the ensemble, measured as a max-minus-min range rather than a statistical variance, must stay tight. A wider spread means the seeds disagree on which sources they generalise to, which would invalidate the ensemble mean as an uncertainty estimator (the predictive distribution becomes bimodal-by-seed instead of bimodal-by-content, breaking the conformal calibration assumption).

A seed flips smoke: true → false only after it individually clears PLCC_i ≥ 0.95. The ensemble-mean entry (if/when one is added to the registry) flips only after all five seeds clear and the variance bound holds.
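The gate logic described above can be sketched as below. The two thresholds are the ones stated in this section; the function name, the input mapping and the returned dictionary are hypothetical and are not taken from scripts/ci/ensemble_prod_gate.py.

```python
MEAN_PLCC_MIN = 0.95     # mean per-seed PLCC threshold from this section
PLCC_SPREAD_MAX = 0.005  # max - min bound from this section

def evaluate_gate(seed_plcc):
    """seed_plcc maps seed id -> LOSO mean PLCC for that seed."""
    values = list(seed_plcc.values())
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    # A seed may flip on its own once it individually clears the threshold.
    seeds_cleared = sorted(s for s, p in seed_plcc.items() if p >= MEAN_PLCC_MIN)
    # The ensemble-mean entry needs every seed cleared plus both gate parts.
    ensemble_cleared = (
        len(seeds_cleared) == len(values)
        and mean >= MEAN_PLCC_MIN
        and spread <= PLCC_SPREAD_MAX
    )
    return {"mean_plcc": mean, "spread": spread,
            "seeds_cleared": seeds_cleared, "ensemble_cleared": ensemble_cleared}

# Made-up PLCC values: a tight ensemble and a lopsided one.
tight = evaluate_gate({0: 0.962, 1: 0.960, 2: 0.963, 3: 0.961, 4: 0.964})
lopsided = evaluate_gate({0: 0.99, 1: 0.94, 2: 0.94, 3: 0.94, 4: 0.94})
```

With these made-up numbers the tight ensemble clears everything, while the lopsided one (one 0.99 and four 0.94s, whose mean sits right at 0.95) lets only seed 0 flip and fails the ensemble check on its 0.05 spread. Returning the per-seed and ensemble verdicts separately mirrors the two flip conditions in the paragraph above.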

Alternatives considered

Option	Pros	Cons	Why not chosen
5-seed deep ensemble (chosen)	Lakshminarayanan 2017: strongest empirical calibration on regression benchmarks. ONNX op-allowlist clean (5 forward passes, no dropout-at-inference, no heteroscedastic NLL). Trivially parallel across seeds. Members are ~5 KB each, 25 KB total runtime cost.	5× training wall time vs deterministic v2. The quality of the predictive-variance estimate depends on seed count; 5 is the smallest credible ensemble (Lakshminarayanan's paper shows diminishing returns past 5).	Selected: calibration quality + zero-friction ONNX export trump the wall-time cost on a corpus that trains in <30 s per seed.
MC-dropout	Single trained model; T forward passes at inference for free.	Keeping dropout active at inference adds a `Dropout` op the libvmaf op allowlist currently rejects. Calibration on regression tasks is empirically worse than ensembles (Lakshminarayanan 2017 §5; Foong 2019 in-depth analysis).	Rejected — op-allowlist friction is unjustifiable when ensembles match the inference-time cost (5 forward passes ≈ T forward passes for T=5).
SWAG (Stochastic Weight Averaging — Gaussian)	Posterior over weights from SGD trajectory; one trained model + sampling at inference.	Sampling at inference adds either a runtime-side weight perturbation loop (new C code) or N pre-sampled checkpoints (same N× artefact cost as the ensemble, with lower calibration quality per Maddox 2019). The variance estimate depends on the SGD trajectory's last-K iterates — fragile to hyperparameter choices.	Rejected — same artefact cost as the ensemble for worse calibration on regression.

Consequences

Positive: a clean trainer + gate path makes the eventual registry flip mechanical: (1) run train_fr_regressor_v2_ensemble_loso.py --seeds 0,1,2,3,4 --corpus runs/phase_a/full_grid/per_frame_canonical6.jsonl --out-dir runs/ensemble_loso/; (2) run scripts/ci/ensemble_prod_gate.py on the resulting loso_seed{N}.json files; (3) flip the cleared seeds in the registry. No re-derivation is needed at flip time.
Positive: the variance bound (max - min ≤ 0.005) catches the pathological case where one seed wildly outperforms (or underperforms) the others — without it, mean PLCC ≥ 0.95 could mask a 0.99 + four 0.94s situation that breaks the uncertainty estimate.
Negative: the gate is strictly tighter than the deterministic v2 gate. A real corpus run might clear ADR-0291's gate but miss the variance bound here — that would force either re-seeding (cheap) or a wider tolerance (ADR change required, not silent). The trainer + gate are deliberately split into two artefacts so the gate is reviewable/auditable separately from the trainer.
Neutral / follow-up: the CI workflow wiring (.github/workflows/tests-and-quality-gates.yml — adding a job that invokes scripts/ci/ensemble_prod_gate.py) is out of scope here because there are no real loso_seed{N}.json artefacts to gate yet. A follow-up PR — the actual flip PR — wires the workflow once a real-corpus run produces the JSON.
Neutral / follow-up: conformal calibration on top of the ensemble (per ADR-0279) remains in the probabilistic-head backlog; this ADR addresses the flip mechanism for the deterministic ensemble members, not the calibrated-interval surface.

References

req (2026-05-05, user direction): the user requested a follow-up to PR #372 that ships the LOSO trainer + CI gate scaffold so the ensemble seeds can flip from smoke: true to smoke: false after a real LOSO run. Paraphrased: "land the trainer scaffold + gate script now; actual production flip is gated on a real-corpus LOSO clearing ≥0.95 mean PLCC plus ≤0.005 variance."
Research-0075 — ensemble theory (Lakshminarayanan 2017), conformal calibration sketch (Romano 2019), 9-fold LOSO protocol, expected PLCC baseline.
ADR-0291 — deterministic v2 prod flip; defines the 0.95 LOSO PLCC ship gate this ADR inherits.
ADR-0279 — probabilistic head scaffold; the parent of the ensemble surface this ADR flips.
ADR-0235 — codec-aware decision 0.95 LOSO PLCC ship gate.
ADR-0237 — vmaf-tune Phase A consumer + --quality-confidence flag that needs the ensemble's predictive distribution.
PR #372 — ensemble scaffold (5 smoke seeds in registry).
Lakshminarayanan, B., Pritzel, A., Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NIPS 2017.
Romano, Y., Patterson, E., Candès, E. (2019). Conformalized Quantile Regression. NeurIPS 2019.

Status update 2026-05-08: Phase 2 training script landed

A second consumer of the ADR-0303 production-flip gate landed in ai/scripts/train_predictor_v2_realcorpus.py, the per-codec predictor v2 real-corpus LOSO trainer (Phase 2 of the predictor pipeline; companion to PR #450 / PR #462). The trainer applies the same two-part gate documented above (mean PLCC ≥ 0.95 across LOSO folds, max-min spread ≤ 0.005) per codec rather than per ensemble seed. The constants are mirrored as SHIP_GATE_MEAN_PLCC / SHIP_GATE_PLCC_SPREAD_MAX / SHIP_GATE_PER_FOLD_MIN in ai/scripts/train_predictor_v2_realcorpus.py; for the ensemble they continue to live authoritatively in scripts/ci/ensemble_prod_gate.py. A future ADR change to either threshold MUST update both call sites in lockstep; the test test_gate_constants_match_adr_0303 in ai/tests/test_train_predictor_v2_realcorpus.py pins the predictor-side values to the ADR-0303 §Decision numbers.
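A pinning check of this kind might look roughly like the sketch below. It is not the contents of test_gate_constants_match_adr_0303: the helper name is hypothetical, a stand-in namespace replaces the real trainer import so the snippet runs on its own, and SHIP_GATE_PER_FOLD_MIN is left out because this ADR does not state its value.

```python
from types import SimpleNamespace

# Numbers stated in the Decision section of this ADR.
ADR_0303_MEAN_PLCC = 0.95
ADR_0303_PLCC_SPREAD_MAX = 0.005

# Stand-in for the trainer module (assumption); a real test would import the
# constants from the trainer script instead of building this namespace.
trainer = SimpleNamespace(SHIP_GATE_MEAN_PLCC=0.95,
                          SHIP_GATE_PLCC_SPREAD_MAX=0.005)

def check_gate_constants(module):
    """Raise AssertionError if the module's gate constants drift from the ADR."""
    assert module.SHIP_GATE_MEAN_PLCC == ADR_0303_MEAN_PLCC
    assert module.SHIP_GATE_PLCC_SPREAD_MAX == ADR_0303_PLCC_SPREAD_MAX

check_gate_constants(trainer)  # passes while the values match the ADR
```

Comparing against literals copied from the ADR, rather than against the gate script's constants, means a silent change at either call site fails the check until the ADR itself is amended.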
