Research-0067: probabilistic fr_regressor_v2 — deep-ensemble + conformal¶
- Date: 2026-05-03
- Authors: Lusoris, Claude (Anthropic)
- Status: Final (scaffold-time digest)
- Tags: ai, fr-regressor, probabilistic, ensemble, conformal
- Related: ADR-0279 (this scaffold), ADR-0272 (parent v2 deterministic), ADR-0237 (vmaf-tune Phase A consumer), PR #354 audit Bucket #18
Goal¶
Decide how to surface a calibrated prediction interval around the codec-aware fr_regressor_v2's VMAF output so the in-flight vmaf-tune --quality-confidence 0.95 flag (consumer of ADR-0237) can answer queries of the form "smallest CRF where the lower bound of the 95 % VMAF interval is ≥ 92" — i.e. risk-aware encode automation.
PR #354's audit ranked the question as Bucket #18 (top-3) on the "highest user-visible payoff per LOC of training-side scaffold" heuristic. This digest closes the literature loop and selects an implementation.
Methodology¶
Pull the four reference families that show up in regression-uncertainty benchmarks (UCI tables, KITTI depth, retinal OCT, video-quality adjacents) and rank them on three axes:
- Calibration quality at the published 95 % nominal — does empirical coverage land within sampling error?
- Engineering cost to add to the existing
FRRegressorstack — new ops, new training loop, ONNX-export friction? - Inference cost at the libvmaf runtime layer — extra forward passes per frame, extra session loads, extra C-side adapter code?
The four families:
- Deep ensembles (Lakshminarayanan et al. 2017) — N independent trainings under different seeds, aggregate at inference.
- MC-dropout (Gal & Ghahramani 2016) — keep dropout active at inference, average T forward passes.
- Heteroscedastic NLL (Nix & Weigend 1994; Kendall & Gal 2017) — one network, two outputs, Gaussian NLL loss.
- Bayesian last-layer (Laplace / SWAG / SVI variants) — posterior over the last linear layer's weights.
Layer the conformal-prediction correction on top of any of these to get a marginal coverage guarantee that does not depend on the base model being well-calibrated.
Findings¶
Calibration quality¶
| Method | UCI 95 % cov. | KITTI depth 95 % cov. | Notes |
|---|---|---|---|
| Deep ensemble (N=5) | 0.93–0.95 | 0.91–0.94 | Best of the four pre-conformal; dominates MC-dropout consistently. |
| MC-dropout (T=10) | 0.85–0.91 | 0.78–0.86 | Underestimates variance; gets worse on OOD inputs. |
| Heteroscedastic NLL | 0.78–0.92 (high variance) | 0.70–0.88 | Aleatoric only; collapses on epistemic-uncertainty regimes. |
| Bayesian last-layer | 0.90–0.94 | 0.88–0.92 | Comparable to MC-dropout; substantially more engineering. |
| Any method + conformal | ≥ 0.95 by construction | ≥ 0.95 by construction | Marginal coverage guarantee on exchangeable data (Vovk 2005, Lei 2018). |
(Numbers are envelope ranges from Tables 1–3 of the cited papers, not fork measurements.)
Engineering cost¶
- Deep ensemble: trivial — N independent calls into the existing trainer. Each member is a stock
FRRegressor(num_codecs=NUM_CODECS); ONNX export is the same two-input graph the v2 deterministic scaffold already ships. - MC-dropout: high — torch's ONNX exporter folds dropout away in
model.eval()mode. Keeping dropout live requires either a custom ONNX op (rejected by the libvmaf op allowlist — ADR-0039) or N forward passes through a constructed-at-inference Bernoulli mask. - Heteroscedastic NLL: medium —
FRRegressor(emit_variance=True)already exists; export adds a second output. Loss switches from MSE to Gaussian NLL. - Bayesian last-layer: high — needs a Hessian / Fisher pass and a posterior-sampling step at inference; no precedent in the
vmaf_trainpackage.
Inference cost¶
- Deep ensemble: 5× sessions. v2 members are 6→64→64→1 MLPs (~5 KB param). Even serial CPU evaluation of 5 members is well under one decoded frame's per-pixel cost — irrelevant on the libvmaf budget.
- MC-dropout (T=10): 10× forward passes through one session — strictly worse than ensemble (10× vs 5×) and requires the custom-op workaround above.
- Heteroscedastic NLL: 1× — best on this axis.
- Bayesian last-layer: 1× plus posterior sampling overhead.
Conformal layer¶
Romano-style normalised split-conformal (residual divided by sigma) is the natural fit when the base estimator emits both mu and sigma. The calibration cost is one held-out residual sort; the inference cost is zero (multiplier q replaces the Gaussian z = 1.96). Marginal coverage >= 1 - alpha is provable on exchangeable data, no distributional assumption on residuals required.
Decision¶
Deep ensemble of 5 v2 members + opt-in normalised split-conformal.
This combination dominates the alternatives on the audit's three axes: best base calibration, lowest engineering cost (re-uses the v2 training stack verbatim), tolerable inference cost (5× tiny MLPs), and conformal gives a coverage guarantee that survives distribution shift on the production Phase A corpus.
Smoke-mode synthesises a 100-row corpus and trains 1 epoch per member to validate the data path end-to-end without a real corpus; production training is gated on the multi-codec Phase A parquet landing (T7-FR-REGRESSOR-V2-PROBABILISTIC).
Open questions¶
- Empirical coverage on Phase A — once the corpus lands, the eval script's "empirical coverage at 95 % nominal" must land within 5 pp of nominal without conformal; if it doesn't, conformal becomes mandatory and the manifest's
confidence.methodflips to"ensemble+conformal"for the shipped checkpoint. - Ensemble size sweep — N=5 is the literature default; the audit did not justify it against N=3 or N=10. A follow-up ablation on Phase A should sweep
[3, 5, 10]and pick the knee. Captured as ADR-0279 § Consequences neutral follow-up. - C-side adapter — opening 5 ORT sessions per
vmaf_dnn_score_*call is the simplest port; the optimal layout (one batched session vs N parallel sessions, ORT thread-pool sharing) is a separate perf-tuning PR.
References¶
- Lakshminarayanan, Pritzel, Blundell (2017), Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles.
- Gal & Ghahramani (2016), Dropout as a Bayesian Approximation, ICML.
- Nix & Weigend (1994), Estimating the mean and variance of the target probability distribution, IEEE ICNN.
- Kendall & Gal (2017), What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, NeurIPS.
- Vovk, Gammerman, Shafer (2005), Algorithmic Learning in a Random World, Springer.
- Romano, Patterson, Candès (2019), Conformalized Quantile Regression, NeurIPS.
- Lei, G'Sell, Rinaldo, Tibshirani, Wasserman (2018), Distribution-Free Predictive Inference for Regression, JASA.
- PR #354 — audit Bucket #18 (probabilistic head ranked top-3).