Conformal VQA — distribution-free prediction intervals¶
The conformal-VQA wrapper turns any point predictor under tools/vmaf-tune/ into an interval predictor with a distribution-free, finite-sample coverage guarantee. Where the existing Predictor class returns "predicted VMAF = 87.3", the conformal wrapper returns "predicted VMAF in [85.2, 89.4] with 95 % probability" — and that 95 % is a real coverage bound on exchangeable data, not an informal confidence statement.
What the wrapper does¶
vmaftune.conformal.ConformalPredictor(base, calibration) takes any predictor with a predict_vmaf(features, crf, codec) -> float method and a calibration object built from a held-out residual set. Each predict() call returns a (point, low, high) triple plus the nominal miscoverage level alpha.
Two calibration estimators ship:
- Split conformal (
SplitConformalCalibration) — theLei et al. 2018form. Wants a calibration set disjoint from training. Cheapest, tightest bound. - CV+ / jackknife+ conformal (
CVPlusConformalCalibration) — theBarber et al. 2021form. No holdout needed; coverage bound weakens to1 - 2*alpha. Use this when the labelled corpus is too small to spare a calibration split.
Both estimators consume only (predictions, targets) pairs — there is no architecture change to the underlying ONNX models, no new runtime dependency, and no change to the existing op allowlist.
Output range and interpretation¶
point— the underlying predictor's VMAF estimate, in[0, 100]. Identical to whatPredictor.predict_vmafreturns.low,high— the conformal prediction interval, clamped to[0, 100]. By construction,low <= point <= high. The marginal coverage guarantee isP(target_vmaf in [low, high]) >= 1 - alphafor any test point exchangeable with the calibration set (Lei et al. 2018 Theorem 2.2 for split; Barber et al. 2021 Theorem 1 for CV+).alpha— the nominal miscoverage level. Default0.05(95 % interval). Anull/NaNvalue signals an uncalibrated wrapper:low == high == pointand the interval is not a coverage bound. Persisted calibration sidecars are strict JSON: non-finite diagnostic values are written asnull, not as JavaScript-onlyNaN/Infinitytokens.
The interval does not cover model-specification error or distribution shift. If the calibration set is drawn from a different distribution than the test set, the lower bound silently fails. The shipped coverage_probe() diagnostic flags the most common form of this drift via a MiscalibrationWarning.
CLI usage example¶
The vmaf-tune predict subcommand gains three new flags. Without a calibration sidecar the wrapper degrades to a width-zero interval and the uncertainty.calibrated field in the output is false.
# Baseline — no uncertainty (existing behaviour).
vmaf-tune predict --source ref.mkv --target-vmaf 92
# With uncertainty + a shipped split-conformal sidecar.
vmaf-tune predict \
--source ref.mkv \
--target-vmaf 92 \
--with-uncertainty \
--calibration-sidecar model/predictor_libx264_calibration.json \
--alpha 0.05
Sample JSON output (one residual row, alpha=0.05):
{
"verdict": "GOSPEL",
"uncertainty": {"enabled": true, "calibrated": true, "alpha": 0.05},
"residuals": [
{
"shot_start": 0,
"shot_end": 120,
"crf": 23,
"predicted_vmaf": 87.30,
"measured_vmaf": 86.95,
"residual": 0.35,
"interval": {"low": 85.21, "high": 89.39, "alpha": 0.05}
}
]
}
The Python API offers the same surface for callers that compose vmaf-tune programmatically:
from vmaftune.conformal import calibrate_split, save_split_calibration
from vmaftune.predictor import Predictor
# 1. Calibrate against a held-out (predictions, targets) sample.
cal = calibrate_split(predictions=cal_p, targets=cal_t, alpha=0.05)
save_split_calibration(cal, "predictor_libx264_calibration.json")
# 2. At inference time:
predictor = Predictor()
point, low, high = predictor.predict_vmaf_with_uncertainty(
features, crf=23, codec="libx264", calibration=cal,
)
Provenance and constraints¶
- Implementation: pure Python,
tools/vmaf-tune/src/vmaftune/conformal.py. No new build-time dependencies — the module imports onlymath,statistics,dataclasses, andjson. No new ONNX op is loaded; the wrapper sits outside the ONNX graph. - Sidecar JSON schema (split conformal):
{"method": "split-conformal", "alpha": <float>, "n": <int>, "residuals": [<float>, ...]}. Versioned by themethoddiscriminator; futurecv-plussidecars use a different value. - License: BSD-3-Clause-Plus-Patent, matching the rest of
vmaf-tune/. The conformal-prediction theory is in the public domain (algorithmic results, no patent claims known to us).
Known limitations¶
- Bit-depth / colour space: the wrapper inherits the underlying predictor's input contract; it neither widens nor restricts the supported pixel formats.
- Minimum calibration size: the marginal-coverage proof holds for any
n >= 1, but at smallnthe upper-bound bracket1 - alpha + 1/(n+1)becomes loose. We recommendn >= 100per (codec, resolution-class) cell for a 5 % miscoverage target. Atn < 20the wrapper still returns intervals but they are effectively the maximum-residual fallback. - Distribution shift: marginal validity assumes the calibration set and the test point are exchangeable. Encoding a 4K HDR shot against a 1080p SDR calibration set breaks the assumption silently; the
coverage_probe()diagnostic surfaces this when called with a held-out probe drawn from the operational distribution. - CPU-only: the wrapper is pure Python and runs on the host thread; it is not a GPU path. The cost relative to the underlying ONNX inference is negligible (one quantile on a sorted residual vector per call site, amortised across an entire job).
- Symmetric intervals only (split conformal): split conformal with the absolute-residual score produces a symmetric interval about the point estimate. For asymmetric noise (e.g. the residual distribution skews high near VMAF 100 because of the [0, 100] clamp), use
CVPlusConformalCalibrationor train a quantile head per ADR-0279's "Quantile regression" alternative.
Theoretical background¶
Split conformal was introduced as inductive conformal prediction in Vovk, Gammerman, Shafer (2005), Algorithmic Learning in a Random World, Springer (Chapter 2 / Proposition 2.2). The regression-specific treatment used here follows Lei, G'Sell, Rinaldo, Tibshirani, Wasserman (2018), Distribution-Free Predictive Inference for Regression, JASA 113(523), 1094-1111 (Theorem 2.2 — finite-sample 1 - alpha lower bound, 1 - alpha + 1/(n+1) upper bound). Romano, Patterson, Candès (2019), Conformalized Quantile Regression, NeurIPS, generalises the score function to support the locally weighted / normalised residual variant. The CV+ form is from Barber, Candès, Ramdas, Tibshirani (2021), Predictive Inference with the Jackknife+, Annals of Statistics 49(1), 486-507 (Theorem 1 — the 1 - 2*alpha worst-case bound that the implementation tests pin).
The full proof of marginal validity is reproduced in the module docstring at tools/vmaf-tune/src/vmaftune/conformal.py.
Cross-references¶
- ADR-0279 — the scoping ADR (deep-ensemble + conformal scaffold). The "Status update 2026-05-08" addendum tracks the implementation deliverables shipped here.
- ADR-0042 — the per-PR doc-substance rule this page satisfies.
tools/vmaf-tune/tests/test_conformal.py— the empirical-coverage pin, miscalibration-warning pin, and CV+1 - 2*alphapin.