ADR-0620: Scaffold audit P0, three silent-correctness fixes
- Status: Accepted
- Date: 2026-05-19
- Deciders: lusoris, Claude (Anthropic)
- Tags: python, correctness, bugfix, fork-local
Context
The 2026-05-19 scaffold audit (docs/research/scaffold-audit-2026-05-19.md) identified three P0 silent-correctness bugs in the Python harness — all tracked in docs/state.md under T-PYTHON-ROUTINE-SWALLOWED-EXCEPTION, T-PYTHON-TRAIN-TEST-STD-ZERO, and T-PYTHON-LOCAL-EXPLAINER-HACKY.
Each bug produces wrong output without any exception reaching the caller:
- `routine.py:604`: `except Exception: print/fallback` swallowed any failure during the extended-stats calculation path and silently continued with uncalibrated normalisation stats, producing misleadingly narrow confidence intervals and silently wrong PLCC/SROCC when the training distribution was non-standard.
- `train_test_model.py:354`: `plot_scatter` substituted `np.zeros(len(ys_label))` for `ys_label_stddev` when the key was absent from the stats dict. Downstream callers (error-bar rendering, `evaluate_stddev`) used the zero array as a normalisation factor, producing incorrect visualisations.
- `local_explainer.py:121`: `model = model[0]  # HACKY, TODO` silently took only the first model from an ensemble list. Callers passing a multi-model bootstrap ensemble obtained per-feature importance numbers from seed-0 only, with no diagnostic.
All three violate the principle, in line with the error-handling rules of the SEI CERT coding standards, that error paths must be explicit and diagnosable. The bugs had been tolerated because the fallback path yielded plausible (but wrong) output, which is the hardest failure mode to detect.
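A toy sketch of why this failure mode is hard to spot. Nothing here is the harness code: `normalise`, `load_stats_silently`, and the stats layout are hypothetical stand-ins that only mimic the swallow-and-fall-back shape described above.

```python
def normalise(scores, stats):
    """Z-score each value using stats['mean'] and stats['std']."""
    return [(s - stats["mean"]) / stats["std"] for s in scores]


def load_stats_silently(raw):
    """Anti-pattern: swallow any failure and continue with default stats."""
    try:
        return {"mean": float(raw["mean"]), "std": float(raw["std"])}
    except Exception:
        # The only trace of the failure is a line on stdout.
        print("warning: falling back to default stats")
        return {"mean": 0.0, "std": 1.0}


scores = [70.0, 80.0, 90.0]

# Calibrated path: stats are complete.
good = normalise(scores, load_stats_silently({"mean": 80.0, "std": 10.0}))

# Broken path: 'std' is missing, the KeyError is swallowed, and the caller
# still receives a well-formed list of floats.
bad = normalise(scores, load_stats_silently({"mean": 80.0}))

print(good)
print(bad)
```

Both calls return a list of three floats and neither raises, so a caller that only checks the return type or shape cannot tell the calibrated result from the fallback one.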
Decision
Replace each silent fallback with an explicit raise:
- `routine.py`: replace the broad `except Exception: print/fallback` with `raise CalibrationError(...) from exc` when `allow_uncalibrated=False` (the new default-safe parameter on `run_test_on_dataset`). Callers that genuinely want the uncalibrated fallback pass `allow_uncalibrated=True`.
- `train_test_model.py`: replace the `np.zeros` substitution with `raise MissingLabelStddevError(...)`. Callers that intentionally want unit error bars pass `assume_unit_stddev=True` to `plot_scatter`.
- `local_explainer.py`: replace the silent `model[0]` pick with `raise EnsembleNotSupportedError(...)` for `len(model) > 1`. A single-element list continues to work (unwraps transparently).
All three exception classes are added to python/vmaf/tools/exceptions.py.
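The three guards can be sketched as below. This is a minimal, self-contained illustration rather than the repository code. The exception class names and the flags `allow_uncalibrated` and `assume_unit_stddev` come from this ADR; the helper functions (`compute_stats`, `resolve_label_stddev`, `unwrap_single_model`), the `Exception` base class, the message wording, and the reading of "unit error bars" as a list of ones are assumptions made for the example.

```python
class CalibrationError(Exception):
    """Extended-stats calculation failed and no fallback was requested."""


class MissingLabelStddevError(Exception):
    """The stats dict has no 'ys_label_stddev' entry."""


class EnsembleNotSupportedError(Exception):
    """A multi-model ensemble was passed where one model is required."""


def compute_stats(compute_extended, fallback_stats, allow_uncalibrated=False):
    """Hypothetical helper: run the extended-stats path, fail loudly by default."""
    try:
        return compute_extended()
    except Exception as exc:
        if not allow_uncalibrated:
            # 'from exc' keeps the original failure as __cause__.
            raise CalibrationError(
                "extended-stats calculation failed; pass "
                "allow_uncalibrated=True to accept uncalibrated stats"
            ) from exc
        return fallback_stats


def resolve_label_stddev(stats, ys_label, assume_unit_stddev=False):
    """Hypothetical helper: return label stddevs or raise if they are absent."""
    if "ys_label_stddev" in stats:
        return list(stats["ys_label_stddev"])
    if assume_unit_stddev:
        # Assumption: "unit error bars" means a stddev of 1.0 per label.
        return [1.0] * len(ys_label)
    raise MissingLabelStddevError(
        "stats has no 'ys_label_stddev'; pass assume_unit_stddev=True "
        "to draw unit error bars"
    )


def unwrap_single_model(model):
    """Hypothetical helper: accept a model or a one-element list of models."""
    if isinstance(model, (list, tuple)):
        if len(model) > 1:
            raise EnsembleNotSupportedError(
                f"got an ensemble of {len(model)} models; explain one at a time"
            )
        return model[0]  # single-element list unwraps transparently
    return model
```

In each helper the default argument selects the raising path, so the previous fallback behaviour is only reachable when the caller names the flag explicitly.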
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep the warning-and-fallback, just improve the warning message | No caller breakage; zero migration cost | Silent wrong output persists; callers cannot distinguish warning from success | The whole problem is silent wrong output; a louder warning does not fix that |
| Raise unconditionally (no opt-in flag) | Strictest posture | Breaks existing callers that relied on the fallback intentionally | allow_uncalibrated / assume_unit_stddev carry zero cognitive overhead and preserve backward compat for deliberate callers |
| Iterate the ensemble and average explanations | Fixes P0-3 without raising | Semantics are undefined (weighted? unweighted? which seed?) — ship correctness first, aggregation strategy in a follow-on | Semantically ambiguous; a well-typed exception unblocks the caller to make an explicit choice |
Consequences
- Positive: callers that hit these code paths now get a diagnosable exception with a message pointing at the opt-in flag; silent wrong output is eliminated.
- Negative: any caller that relied on the silent fallback without `allow_uncalibrated=True` or `assume_unit_stddev=True` will now raise. The migration is a one-liner per call site.
- Neutral / follow-ups:
  - P0-3 follow-up: implement ensemble aggregation (averaged feature weights) once the semantics are agreed; at that point `EnsembleNotSupportedError` becomes optional.
  - `T-PYTHON-ROUTINE-SWALLOWED-EXCEPTION`, `T-PYTHON-TRAIN-TEST-STD-ZERO`, and `T-PYTHON-LOCAL-EXPLAINER-HACKY` closed in `docs/state.md`.
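Until the aggregation semantics for P0-3 are agreed, a caller with an ensemble can explain each member on its own and combine the results explicitly. The sketch below shows one possible choice, an unweighted mean. It is not the agreed strategy and not part of the harness: `average_importances`, the dict-per-model layout, and the feature names in the usage lines are all hypothetical.

```python
from statistics import fmean


def average_importances(per_model):
    """Unweighted mean of per-feature importances across ensemble members.

    `per_model` is a list of dicts mapping feature name to importance,
    one dict per model, each obtained by explaining that model alone.
    """
    if not per_model:
        raise ValueError("need at least one per-model explanation")
    features = per_model[0].keys()
    for importances in per_model:
        if importances.keys() != features:
            raise ValueError("all members must report the same features")
    return {f: fmean(imp[f] for imp in per_model) for f in features}


# Hypothetical explanations from two bootstrap members.
seed_0 = {"vif": 0.25, "adm": 0.75}
seed_1 = {"vif": 0.5, "adm": 0.5}
averaged = average_importances([seed_0, seed_1])
print(averaged)
```

Because the combination happens at the call site, the choice of weighting stays visible to whoever reads that code, which is the property the exception is meant to force.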
References
- `docs/research/scaffold-audit-2026-05-19.md` §P0-1, §P0-2, §P0-3
- `docs/adr/0556-python-mcp-ai-audit-2026-05-18.md` (original audit ADR that opened the three `state.md` tracking rows)
- `python/vmaf/tools/exceptions.py` (new exception classes)
- `python/vmaf/routine.py` (P0-1 fix)
- `python/vmaf/core/train_test_model.py` (P0-2 fix)
- `python/vmaf/core/local_explainer.py` (P0-3 fix)
- `python/test/test_adr0620_scaffold_audit_p0.py` (16 regression tests)