ADR-0963: ai/src: guard NaN propagation in eval + tune (round-25 audit C.1 + C.2)
- Status: Accepted
- Date: 2026-05-31
- Deciders: lusoris
- Tags: ai, correctness, bisect
Context
Round-25 correctness audit identified two NaN-propagation paths in the vmaf_train package that silently corrupted downstream consumers.
C.1 — eval.correlations (ai/src/vmaf_train/eval.py):
correlations() passed inputs directly to pearsonr/spearmanr without checking for empty or degenerate (constant-valued) inputs.
- Empty arrays (len == 0) cause pearsonr/spearmanr to raise or return NaN, and np.mean() on an empty array returns NaN with a runtime warning. The resulting EvalReport contained NaN fields without any indication of a data-pipeline error.
- Constant-valued arrays have zero variance; pearsonr/spearmanr return NaN for zero-variance inputs because the correlation coefficient is mathematically undefined.
bisect_model_quality._gate evaluates report.plcc >= value. For IEEE-754, NaN >= x is always False, so every model "fails" the gate, causing the bisect to report "first model already bad" with no diagnostic, regardless of the actual model quality. This is the most harmful failure mode: it produces a definitive-sounding wrong answer.
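The comparison behaviour described above can be reproduced in a few lines. This is a minimal sketch: gate_sketch is a hypothetical stand-in for the >= check attributed to _gate, not the actual function from bisect_model_quality.py.

```python
import math


def gate_sketch(plcc: float, threshold: float) -> bool:
    """Hypothetical stand-in for the `report.plcc >= value` comparison."""
    return plcc >= threshold


nan = float("nan")

results = {
    "good model": gate_sketch(0.95, 0.90),
    # NaN compares False against every threshold, even a trivially low one.
    "nan, strict threshold": gate_sketch(nan, 0.90),
    "nan, trivial threshold": gate_sketch(nan, -1.0),
}

for label, passed in results.items():
    print(f"{label}: {passed}")

# NaN is not even equal to itself, so math.isnan is the reliable way to detect it.
print("nan == nan:", nan == nan)
print("math.isnan(nan):", math.isnan(nan))
```

Because the gate returns False for NaN at any threshold, a caller cannot tell "the model is bad" apart from "the metric was never computed" by looking at the gate result alone.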
C.2 — tune.objective (ai/src/vmaf_train/tune.py):
When all training epochs produce NaN for the val/mse (or val/l1) metric column — which happens when a training run diverges from epoch 0 — the expression df["val/mse"].dropna().min() returns NaN (empty-series min). float(NaN) was passed to Optuna as the trial objective. Optuna records it as a completed trial with an undefined comparison value, which corrupts the study's best-trial tracking.
The study in sweep() always uses direction="minimize". There is no maximize path, so the sentinel is float("inf") unconditionally.
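The following plain-Python illustration shows why a NaN objective has no well-defined place in a minimisation, while float("inf") does. It uses the built-in min() over a list of made-up objective values and is not a description of how Optuna tracks its best trial internally.

```python
import math

nan = float("nan")
inf = float("inf")

# With NaN, the "best" value depends on where the NaN appears in the sequence,
# because every comparison against NaN evaluates to False.
nan_first = min([nan, 0.3, 0.1])
nan_last = min([0.3, 0.1, nan])

# With +inf as the sentinel, the ordering is total and position-independent:
# the diverged run is simply the worst candidate.
inf_first = min([inf, 0.3, 0.1])
inf_last = min([0.3, 0.1, inf])

print("NaN first:", nan_first, "| NaN last:", nan_last)
print("inf first:", inf_first, "| inf last:", inf_last)
print("results agree with NaN:", nan_first == nan_last)
print("results agree with inf:", inf_first == inf_last)
```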
Correctness principle (feedback_correctness_first): investigate the root cause; never lower thresholds or weaken correctness gates.
Decision
C.1: Add two guards to correlations():
- Raise ValueError("empty inputs …") if len(pred) == 0. Empty inputs are a data-pipeline bug; raising is the correct signal.
- Check np.var(pred) < 1e-12 or np.var(target) < 1e-12 (constant array). Emit a RuntimeWarning and return EvalReport(plcc=0.0, srocc=0.0, …). Use 0.0 rather than NaN because gate logic uses >= and NaN would silently fail every comparison. The comment in the source explains the choice so future readers do not re-introduce NaN.
evaluate_onnx (the entry point used by bisect_model_quality) delegates to correlations, so both callers are fixed automatically.
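The two C.1 guards can be sketched as follows. This is a standard-library approximation under stated assumptions, not the code in eval.py: ReportSketch stands in for EvalReport with only two of its fields, statistics.pvariance replaces np.var (both are population variances), the hand-written _pearson and _ranks helpers replace pearsonr/spearmanr, and the exception message is illustrative because the ADR elides the real one.

```python
import math
import warnings
from dataclasses import dataclass
from statistics import fmean, pvariance


@dataclass
class ReportSketch:
    """Hypothetical stand-in for EvalReport; the real class has more fields."""
    plcc: float
    srocc: float


def _pearson(x, y):
    # Plain Pearson correlation; callers guarantee non-zero variance.
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def _ranks(values):
    # 1-based ranks, with tied values sharing their average rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def correlations_sketch(pred, target, eps=1e-12):
    # Guard 1: empty input is a pipeline bug, so fail loudly.
    if len(pred) == 0:
        raise ValueError("empty inputs")  # illustrative message
    # Guard 2: zero variance makes the coefficient undefined. Return 0.0, not
    # NaN, so that downstream `>=` gates still behave as ordinary comparisons.
    if pvariance(pred) < eps or pvariance(target) < eps:
        warnings.warn("constant input; returning plcc=srocc=0.0", RuntimeWarning)
        return ReportSketch(plcc=0.0, srocc=0.0)
    return ReportSketch(
        plcc=_pearson(pred, target),
        srocc=_pearson(_ranks(pred), _ranks(target)),  # Spearman = Pearson on ranks
    )
```

A caller that wants to treat the degenerate case as an error can escalate the warning with warnings.simplefilter("error", RuntimeWarning) instead of inspecting the returned values.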
C.2: Extract the metric-reading logic into _read_best_metric(df, col). After dropna(), if the result is empty or the min is NaN, log a WARNING explaining the divergence and return float("inf") (worst-case for the minimisation study). Move the from .train import TrainConfig, train import inside sweep() to keep _read_best_metric importable without pulling in pytorch_lightning (enabling lightweight unit tests).
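A minimal sketch of the C.2 helper follows, assuming a plain list of per-epoch floats in place of the pandas DataFrame column used by the real _read_best_metric(df, col). The function name, the logger name and the log wording are illustrative, not taken from tune.py.

```python
import logging
import math

log = logging.getLogger("tune_sketch")  # hypothetical logger name


def read_best_metric_sketch(values, col):
    """Return the best (lowest) finite value, or +inf if the run diverged.

    `values` stands in for the per-epoch entries of the metric column `col`.
    """
    # Equivalent of dropna(): discard missing and NaN entries.
    finite = [v for v in values if v is not None and not math.isnan(v)]
    if not finite:
        # Every epoch was NaN: report the divergence to the operator and hand
        # the minimisation study a well-ordered worst-case value.
        log.warning("all values of %s are NaN; training diverged, returning inf", col)
        return float("inf")
    return float(min(finite))


nan = float("nan")
print(read_best_metric_sketch([0.5, nan, 0.2], "val/mse"))
print(read_best_metric_sketch([nan, nan], "val/mse"))
```

Keeping the helper free of training imports is what allows it to be unit-tested on its own, which is the motivation the decision gives for moving the train import inside sweep().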
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Return NaN for empty/degenerate (status quo) | No code change | Silently fails gate comparisons; bisect returns wrong answer | Correctness violation |
| Return Result[EvalReport, str] type | Explicit error path | Introduces a new algebraic type not used elsewhere in the codebase | Inconsistent with existing ValueError patterns |
| Return plcc=float("-inf") for degenerate | Also fails >= gates | Still has NaN-propagation risk if consumer does arithmetic | 0.0 is the conventional "worst correlation" and does not NaN-propagate |
| Use warnings.warn for tune divergence (instead of logging.warning) | Catchable with pytest.warns | Training-run divergence is an operator diagnostic, not a calling-code invariant; logging is the right channel | Using caplog in tests is the standard pytest idiom for logging |
Consequences
- Positive: bisect_model_quality no longer reports "first model already bad" when the evaluation input is degenerate. Optuna study best-trial tracking is not corrupted by NaN objectives from diverged training runs.
- Positive: Empty-input bugs in data pipelines are now raised immediately at the evaluation call site rather than silently propagating as NaN.
- Negative: Callers that previously received NaN (and may have been checking np.isnan(report.plcc)) will now get 0.0 for degenerate inputs and a ValueError for empty inputs. No such callers exist in-tree as of 2026-05-31 (verified by grep).
- Neutral: tune.sweep() now lazily imports TrainConfig/train, which is a minor import-order change with no runtime effect on the non-test path.
References
- feedback_correctness_first (user memory note): investigate root cause; never lower thresholds or weaken correctness gates.
- bisect_model_quality._gate in ai/src/vmaf_train/bisect_model_quality.py.
- evaluate_onnx in ai/src/vmaf_train/eval.py.
- ADR-0109: nightly bisect workflow.
- Round-25 correctness audit C.1 + C.2.