ADR-0393: `fr_regressor_v2` probabilistic head — deep-ensemble + conformal scaffold

Status: Accepted
Status update 2026-05-15: scaffold implemented; model/tiny/fr_regressor_v2_ensemble_v1.json (5-seed ensemble) + conformal quantile scaffold present on master. Production training gated on Phase A corpus completion.
Date: 2026-05-03
Deciders: Lusoris, Claude (Anthropic)
Tags: ai, fr-regressor, probabilistic, ensemble, conformal, fork-local

Context

The codec-aware fr_regressor_v2 emits a single MOS scalar per frame — a point estimate. Producers running quality-aware encode automation (ADR-0237, the in-flight vmaf-tune tool) need a stronger contract: "give me a CRF such that the lower bound of a 95 % VMAF interval is still ≥ 92". A point estimate cannot answer that question — it takes a distribution.

PR #354's audit Bucket #18 (top-3 ranked) calls for a probabilistic head on top of v2: a deep ensemble of small MLPs trained under different seeds, optionally calibrated by split-conformal prediction. The audit cited Lakshminarayanan et al. (2017) (deep ensembles dominate single-network uncertainty estimators in calibration quality) and Romano et al. (2019) (normalised conformal gives a marginal coverage guarantee at no training-time cost beyond the calibration split).

The constraints in play:

No architecture change. v2 is the in-flight scaffold (PR #347); we cannot fork its training stack just to add uncertainty.
Inference cost stays small. v2 members are 6→64→64→1 MLPs (~5 KB each); 5 of them is still a rounding error on libvmaf's per-frame budget vs the existing tiny-AI surfaces.
The ONNX graph stays op-allowlist clean — no sampling / dropout-at-inference / heteroscedastic NLL hacks that would add new ops the runtime loader rejects.
The shipped checkpoint is a smoke probe. No multi-codec Phase A corpus exists yet; the scaffold's job is to wire the data path so the production training run is a one-liner when the corpus lands.

Decision

Add a deep ensemble of N=5 fr_regressor_v2 members trained under distinct random seeds, packaged as 5 separate ONNX files plus an ensemble manifest (model/tiny/fr_regressor_v2_ensemble_v1.json) that records the member list, feature standardisation, codec vocabulary, nominal coverage, and an optional conformal residual quantile. Inference aggregates the 5 outputs into (mu, sigma) and exposes the prediction interval via two interchangeable rules: a Gaussian mu ± z(α/2) · σ baseline and an opt-in mu ± q · σ form where q is the empirical residual quantile from a held-out split-conformal calibration set. The scaffold ships in smoke-only mode (synthetic 100-row corpus, 1 epoch per member); production training is gated on a multi-codec Phase A corpus and tracked as backlog item T7-FR-REGRESSOR-V2-PROBABILISTIC.
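The aggregation and the two interval rules described above can be sketched as follows. This is a minimal illustration, not the scaffold's actual code: the function names `aggregate` and `prediction_interval` are hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption the ADR does not specify.

```python
import statistics
from typing import Optional, Sequence, Tuple


def aggregate(member_outputs: Sequence[float]) -> Tuple[float, float]:
    """Collapse the per-member predictions into (mu, sigma).

    Assumption: sigma is the sample standard deviation across members.
    """
    mu = statistics.fmean(member_outputs)
    sigma = statistics.stdev(member_outputs) if len(member_outputs) > 1 else 0.0
    return mu, sigma


def prediction_interval(
    mu: float,
    sigma: float,
    coverage: float = 0.95,
    conformal_q: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (low, high) for the requested nominal coverage.

    With conformal_q set, use mu +/- q * sigma (the opt-in conformal rule).
    Otherwise use the Gaussian baseline mu +/- z(alpha/2) * sigma.
    """
    if conformal_q is not None:
        half_width = conformal_q * sigma
    else:
        # Two-sided Gaussian critical value, about 1.96 for 95 % coverage.
        z = statistics.NormalDist().inv_cdf(0.5 + coverage / 2.0)
        half_width = z * sigma
    return mu - half_width, mu + half_width
```

Keeping the interval rule outside the ONNX graphs is consistent with the op-allowlist constraint: each member stays a plain MLP, and the aggregation is ordinary host-side arithmetic.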

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Deep ensemble (N=5) + conformal ✅ | Best-in-class calibration on regression benchmarks (Lakshminarayanan 2017); conformal adds distribution-free coverage guarantee; reuses v2 architecture verbatim | 5× training cost; 5× inference cost (mitigated by tiny model size); manifest layer is new | Chosen: calibration quality dominates, inference cost is negligible, scaffold cost is one training script + one eval script. |
| Single-network heteroscedastic NLL | 1× cost; one ONNX graph | Captures aleatoric noise only; collapses on out-of-distribution inputs (Lakshminarayanan §4.2); empirically worse calibration than 5-member ensemble | Rejected: the audit's whole point is distribution-shift coverage (new codecs, OOD CRFs); aleatoric-only would fail silently at the boundary. |
| MC-dropout (T forward passes) | 1 ONNX file; cheap to train | Requires dropout-at-inference, which torch's ONNX export typically folds away; would need a custom `Dropout` op or onnxruntime extension | Rejected: would force a custom op into the libvmaf op allowlist (ADR-0039) for marginal benefit over deep ensemble. |
| Quantile regression (multi-output) | 1 forward pass yields the interval directly | Trains 3 separate quantile heads against pinball loss; tighter intervals only when the noise model matches; no built-in coverage guarantee without conformal | Rejected: strictly worse than deep-ensemble + conformal; no ablation budget to explore both. |
| Bayesian last-layer (Laplace, SWAG) | Theoretically grounded posterior | Requires a Hessian / second-moment pass; more ONNX-export friction; no library precedent in `vmaf_train` | Rejected: engineering surface area not justified for the marginal calibration gain over deep ensembles. |
| Single-network + bootstrap on data | Same architecture as v1 | Bootstrap captures data uncertainty only (not model uncertainty); needs N retrainings plus resampling | Rejected: strictly dominated by deep-ensemble (no resample bookkeeping, captures both noise sources). |

Consequences

Positive:
Producers can drive vmaf-tune --quality-confidence 0.95 --target 92 off a published, calibrated coverage guarantee instead of a point estimate plus folklore margin.
The scaffold is additive: existing v2 deterministic consumers keep working untouched; the ensemble manifest lives next to the per-member ONNX files in model/tiny/.
Conformal calibration is opt-in: when the calibration split falls below a usable threshold (small corpus, OOD test), the manifest falls back to the Gaussian rule rather than failing silently.
Negative:
5× ONNX inference cost at the libvmaf C-side adapter layer (T7-FR-REGRESSOR-V2-PROBABILISTIC follow-up). Mitigated by the tiny model size (~3 KB graph per member) — even serial CPU evaluation of 5 members is well under one decoded frame's per-pixel cost.
5× registry entries per ensemble (one per member). Tolerated to avoid a registry-schema bump; the manifest sidecar is the ensemble-level entry point.
The Gaussian rule's z = 1.96 multiplier gives nominal 95 % coverage only when the residuals are Gaussian; without conformal calibration, real-world coverage will deviate. This is captured in the model card and the eval script's coverage report.
Neutral / follow-ups:
T7-FR-REGRESSOR-V2-PROBABILISTIC: production training run on the Phase A multi-codec corpus once it lands; gated on clearing the v2 deterministic ship floor and on the eval script's empirical 95 % coverage being within 5 pp of nominal.
C-side runtime adapter (read manifest, open 5 sessions, fan-out inputs, aggregate mu / sigma) — separate PR after this scaffold; exposes vmaf_dnn_score_with_interval to core/src/dnn/.
vmaf-tune --quality-confidence flag — Phase B follow-up to ADR-0237; consumer of the new C-side adapter.
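The coverage gate named in the follow-ups (empirical 95 % coverage within 5 pp of nominal) amounts to a simple check on held-out data. The sketch below is illustrative only; `empirical_coverage` and `within_coverage_gate` are hypothetical names and do not describe the eval script's real interface.

```python
from typing import Sequence, Tuple


def empirical_coverage(
    intervals: Sequence[Tuple[float, float]],
    targets: Sequence[float],
) -> float:
    """Fraction of held-out targets that fall inside their predicted interval."""
    if not targets or len(intervals) != len(targets):
        raise ValueError("need equally sized, non-empty intervals and targets")
    hits = sum(1 for (low, high), y in zip(intervals, targets) if low <= y <= high)
    return hits / len(targets)


def within_coverage_gate(
    coverage: float,
    nominal: float = 0.95,
    tolerance_pp: float = 5.0,
) -> bool:
    """True when coverage is within tolerance_pp percentage points of nominal."""
    return abs(coverage - nominal) * 100.0 <= tolerance_pp
```

Under these assumptions, a run that covers 93 % of held-out targets would pass the gate, while one that covers only 75 % would not.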

References

Lakshminarayanan, Pritzel, Blundell (2017), Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles, NeurIPS.
Vovk, Gammerman, Shafer (2005), Algorithmic Learning in a Random World, Springer.
Romano, Patterson, Candès (2019), Conformalized Quantile Regression, NeurIPS.
Lei, G'Sell, Rinaldo, Tibshirani, Wasserman (2018), Distribution-Free Predictive Inference for Regression, JASA.
Research-0054 — audit digest backing this ADR (PR #354 Bucket #18 ranking + literature pull).
ADR-0272 — parent v2 scaffold (placeholder ID; PR #347 may land it at ADR-0261; renumber the cross-reference at merge time if it does).
ADR-0237 — vmaf-tune Phase A; downstream consumer of the probabilistic interval API.
ADR-0039 — runtime op-allowlist constraint that ruled out MC-dropout.
ADR-0040, ADR-0041 — multi-input ONNX precedent the v2 ensemble member graph follows.
Source: req (PR #354 audit Bucket #18, top-3 ranked).

Status update 2026-05-08: implementation landed

Per ADR-0028 immutability, the body above is frozen at proposal time. The Status line at the head of this ADR remains Proposed until the production training run lands; this section records the implementation deliverables that ship today as a non-binding addendum.

What landed: the conformal-prediction surface itself. New tools/vmaf-tune/src/vmaftune/conformal.py ships SplitConformalCalibration (Lei et al. 2018 Theorem 2.2) and CVPlusConformalCalibration (Barber et al. 2021 Theorem 1) as a pure-Python, dependency-free wrapper around the existing Predictor surface. The CLI gains vmaf-tune predict --with-uncertainty --calibration-sidecar <path> [--alpha <a>] per the Decision section above; without a sidecar the wrapper degrades to low == high == point and the report is flagged uncalibrated so consumers don't silently treat a width-zero interval as a real coverage guarantee. Empirical coverage on the synthetic Gaussian-noise corpus matches the nominal 1 - alpha within ~0.01 (0.9515 vs 0.95 nominal on a 2000-point probe with a 400-point calibration set), confirming the marginal-coverage proof in operation.
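For orientation, the split-conformal recipe and the uncalibrated fallback can be sketched in dependency-free Python. This is not the code in `conformal.py`: the names `Interval`, `split_conformal_quantile`, and `predict_with_uncertainty` are hypothetical, and the sketch assumes the plain absolute-residual score with the usual ceil((n + 1)(1 - alpha)) rank.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Interval:
    low: float
    point: float
    high: float
    calibrated: bool  # False means the width carries no coverage guarantee


def split_conformal_quantile(residuals: Sequence[float], alpha: float = 0.05) -> float:
    """Return the k-th smallest absolute calibration residual.

    k = ceil((n + 1) * (1 - alpha)) is the finite-sample rank that yields
    marginal coverage of at least 1 - alpha under exchangeability.
    """
    n = len(residuals)
    if n == 0:
        raise ValueError("calibration set is empty")
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only an infinite
        # half-width keeps the guarantee.
        return math.inf
    return sorted(abs(r) for r in residuals)[k - 1]


def predict_with_uncertainty(point: float, quantile: Optional[float]) -> Interval:
    """Wrap a point prediction; without a quantile, degrade to low == high == point."""
    if quantile is None:
        return Interval(low=point, point=point, high=point, calibrated=False)
    return Interval(low=point - quantile, point=point, high=point + quantile, calibrated=True)
```

The explicit `calibrated` flag mirrors the behaviour described above: a zero-width interval is reported as uncalibrated rather than passed off as a guarantee.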

What remains gated: the deep-ensemble member training run, the C-side runtime adapter (vmaf_dnn_score_with_interval), and the vmaf-tune --quality-confidence Phase B consumer. Those land under T7-FR-REGRESSOR-V2-PROBABILISTIC once the multi-codec Phase A corpus is available; flipping Status to Accepted is gated on that PR.

Per ADR-0028 immutability the body above stays frozen; this entry records the second downstream consumer of the conformal surface without changing the original decision.

What landed: tools/vmaf-tune/src/vmaftune/uncertainty.py centralises the ConfidenceThresholds dataclass, the load_confidence_thresholds sidecar loader, and the classify_interval width-band helper. The defaults tight_interval_max_width=2.0 and wide_interval_min_width=5.0 VMAF mirror the documented floor in the auto-driver F.3 work (PR #495 / Research-0067) byte-for-byte, so a single calibration sidecar drives auto, recommend, and ladder without divergence.
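A rough picture of the width-band classification, using the two defaults quoted above. The dataclass and helper names come from this ADR, but the signature, the inclusive boundaries, and the string labels returned here are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceThresholds:
    # Defaults quoted in this ADR, in VMAF points.
    tight_interval_max_width: float = 2.0
    wide_interval_min_width: float = 5.0


def classify_interval(
    low: float,
    high: float,
    thresholds: ConfidenceThresholds = ConfidenceThresholds(),
) -> str:
    """Bucket an interval by width: "tight", "middle", or "wide".

    Assumption: both boundaries are inclusive; the real helper may differ.
    """
    width = high - low
    if width <= thresholds.tight_interval_max_width:
        return "tight"
    if width >= thresholds.wide_interval_min_width:
        return "wide"
    return "middle"
```

Because the thresholds live in one frozen dataclass, every consumer that loads the same sidecar classifies a given interval identically.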

recommend.py gains pick_target_vmaf_with_uncertainty(rows, UncertaintyAwareRequest) plus a --with-uncertainty / --uncertainty-sidecar CLI flag pair on the recommend subcommand. Tight intervals short-circuit the search at the first row whose conformal lower bound clears the target (O(k) instead of O(n)); wide intervals force a full scan with the result tagged (UNCERTAIN); middle-band and uncalibrated rows defer to the existing point-estimate predicate verbatim. An interval_excludes_target helper surfaces a best-effort UNMET row when every visited interval lies below the target.
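The tight-interval short-circuit and the UNMET fallback can be illustrated with a deliberately simplified sketch. It omits the wide-interval and middle-band branches, and everything in it (the `Row` shape, `pick_first_tight_clear`, the "OK" / "UNDECIDED" labels, and the choice of the highest point estimate as the best-effort row) is hypothetical rather than the real `recommend.py` API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Row:
    crf: int
    point: float  # point-estimate VMAF
    low: float    # conformal lower bound
    high: float   # conformal upper bound


def interval_excludes_target(row: Row, target: float) -> bool:
    """True when the whole interval lies below the target."""
    return row.high < target


def pick_first_tight_clear(
    rows: Sequence[Row],
    target: float,
    tight_max_width: float = 2.0,
) -> Tuple[Optional[Row], str, int]:
    """Return (row, tag, rows_visited).

    Stops at the first tight row whose lower bound clears the target, so
    only k of the n rows are visited when such a row exists.
    """
    visited = 0
    for row in rows:
        visited += 1
        if (row.high - row.low) <= tight_max_width and row.low >= target:
            return row, "OK", visited
    if rows and all(interval_excludes_target(r, target) for r in rows):
        # Best effort: surface the row with the highest point estimate.
        return max(rows, key=lambda r: r.point), "UNMET", visited
    return None, "UNDECIDED", visited
```

In this simplified form, "UNDECIDED" stands in for the cases the real implementation hands to the point-estimate predicate or tags as uncertain.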

ladder.py gains UncertaintyLadderPoint, prune_redundant_rungs_by_uncertainty (drops adjacent rungs whose intervals overlap above DEFAULT_RUNG_OVERLAP_THRESHOLD = 0.5), insert_extra_rungs_in_high_uncertainty_regions (inserts geometric-bitrate / arithmetic-VMAF mid-rungs into wide-interval gaps), and the composed apply_uncertainty_recipe entry point. The existing convex-hull + knee-selection invariants are preserved.
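Two pieces of the ladder logic lend themselves to a short sketch: the geometric-bitrate / arithmetic-VMAF mid-rung and an overlap measure for pruning. The helper names below are hypothetical, and the overlap definition (intersection length divided by the narrower interval's width) is an assumption; this ADR states only the 0.5 threshold, not how overlap is measured.

```python
import math
from typing import Tuple

DEFAULT_RUNG_OVERLAP_THRESHOLD = 0.5  # value quoted in this ADR


def mid_rung(
    bitrate_lo: float, vmaf_lo: float, bitrate_hi: float, vmaf_hi: float
) -> Tuple[float, float]:
    """Mid-rung with geometric-mean bitrate and arithmetic-mean VMAF."""
    return math.sqrt(bitrate_lo * bitrate_hi), (vmaf_lo + vmaf_hi) / 2.0


def overlap_fraction(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection length relative to the narrower of the two intervals.

    Assumption: this is one plausible overlap measure, not the documented one.
    """
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    narrower = min(a[1] - a[0], b[1] - b[0])
    if narrower <= 0.0:
        return 0.0
    return intersection / narrower


def rungs_are_redundant(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when adjacent rung intervals overlap above the threshold."""
    return overlap_fraction(a, b) > DEFAULT_RUNG_OVERLAP_THRESHOLD
```

The geometric mean for bitrate matches the roughly multiplicative spacing of typical bitrate ladders, while VMAF is averaged arithmetically because it sits on a linear score scale.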

Threshold provenance: every numeric default in this PR cites either Research-0067 (Phase F feasibility, the same emergency floor PR #495 documents) or the parent ADR-0279 conformal proof. No threshold was invented locally — see the feedback_no_guessing rule in CLAUDE.md.

Out of scope (deferred to follow-up): wiring the production sampler in _default_sampler to emit UncertaintyLadderPoint directly when the predictor ships a calibration sidecar. The library API is fully functional today and exercised by the unit tests (test_recommend_uncertainty.py, test_ladder_uncertainty.py); the CLI ladder --with-uncertainty flag emits an informational notice noting the follow-up.
