ADR-0687: CHUG HDR MOS head: held-out test partition validator
- Status: Accepted
- Date: 2026-05-27
- Deciders: lusoris
- Tags: ai, chug, mos-head, validation, fork-local
Context
The CHUG corpus (ai/scripts/train_chug_hdr_mos_head.py) has explicit train (3936), val (648), and test (552) splits in each feature JSONL row's split column. Training and cross-validation use only train + val rows. The test partition is held out by design so that at least one unbiased evaluation remains after the model is selected.
The chug_hdr_mos_head_v1_wide_seed20260521 checkpoint reported PLCC 0.8733 / SROCC 0.8528 on the val split but had never been evaluated against test. Before promoting the model or starting a new training pass, the unbiased test-partition result is required to understand the true generalisation gap.
Per ADR-0325 the production-flip gate is PLCC ≥ 0.85, SROCC ≥ 0.82, RMSE ≤ 0.45 MOS units. Per the feedback_no_test_weakening rule, the gate thresholds are never lowered to accommodate a miss.
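The gate above can be written as a small, immutable check. This is a minimal sketch rather than the validator's actual code: the `Gate` class and `check_gate` function are hypothetical names introduced for illustration, and only the three thresholds come from ADR-0325. Freezing the dataclass mirrors the rule that thresholds are never lowered to accommodate a miss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: thresholds cannot be mutated at runtime
class Gate:
    """ADR-0325 production-flip thresholds (hypothetical container)."""
    plcc_min: float = 0.85
    srocc_min: float = 0.82
    rmse_max: float = 0.45  # MOS units


def check_gate(plcc: float, srocc: float, rmse: float, gate: Gate = Gate()) -> dict:
    """Return a per-metric verdict plus an overall PASS/FAIL flag."""
    verdict = {
        "plcc": plcc >= gate.plcc_min,
        "srocc": srocc >= gate.srocc_min,
        "rmse": rmse <= gate.rmse_max,
    }
    # All three conditions must hold for the gate to pass.
    verdict["pass"] = all(verdict.values())
    return verdict


# PLCC/SROCC are the held-out values reported in this ADR; the RMSE argument
# is a placeholder, because the ADR does not state the test RMSE.
result = check_gate(plcc=0.8468, srocc=0.8188, rmse=0.40)
print(result)
```

Returning a per-metric verdict, not a single boolean, makes it clear in a report which threshold was missed.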
Decision
We will add ai/scripts/validate_chug_hdr_mos_head.py, a standalone validator that:
- Loads a CHUG MOS head ONNX (default: the current best checkpoint).
- Reads CHUG feature JSONL shards and filters to `split == "test"` rows.
- Runs ONNX inference using `onnxruntime` (CPU provider).
- Computes PLCC, SROCC, and RMSE against the `mos` column.
- Emits a JSON report, a Markdown report, and a `write_run_manifest` sidecar.
- Exits 0 on gate PASS, 2 on gate FAIL, 1 on input error.
The script reuses _row_to_features, _load_jsonl, _normalise_split, and schema constants from train_konvid_mos_head.py so the feature projection is byte-identical to the training path.
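The split filter, the three metrics, and the exit-code convention described above can be sketched with the standard library alone. This is an illustrative stand-in, not the validator itself: it does not call the reused training helpers or `onnxruntime`, every function name is hypothetical, and the `pred` field in the toy rows stands in for the ONNX model output. Only the `split` and `mos` column names and the exit codes come from this ADR.

```python
import json
import math


def filter_test_rows(jsonl_lines):
    """Parse JSONL lines and keep only rows whose split is "test"."""
    rows = (json.loads(line) for line in jsonl_lines if line.strip())
    return [row for row in rows if row.get("split") == "test"]


def pearson(x, y):
    """PLCC: Pearson linear correlation (undefined for constant input)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """SROCC: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(pred, target):
    """Root-mean-square error in the units of the target (MOS)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def exit_code(input_ok: bool, gate_pass: bool) -> int:
    """0 = gate PASS, 2 = gate FAIL, 1 = input error."""
    if not input_ok:
        return 1
    return 0 if gate_pass else 2


# Toy rows: "pred" is a made-up field standing in for model predictions.
toy_lines = [
    '{"split": "train", "mos": 3.2, "pred": 3.0}',
    '{"split": "test", "mos": 1.0, "pred": 1.5}',
    '{"split": "test", "mos": 2.0, "pred": 2.5}',
    '{"split": "val", "mos": 4.1, "pred": 4.0}',
    '{"split": "test", "mos": 3.0, "pred": 3.5}',
    '{"split": "test", "mos": 4.0, "pred": 4.5}',
]
rows = filter_test_rows(toy_lines)
mos = [row["mos"] for row in rows]
pred = [row["pred"] for row in rows]
metrics = {
    "plcc": pearson(pred, mos),
    "srocc": spearman(pred, mos),
    "rmse": rmse(pred, mos),
}
print(len(rows), metrics)
```

The toy predictions are perfectly correlated with the targets but offset by half a MOS unit, which shows why the gate checks RMSE alongside the two correlation metrics.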
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Add a `--evaluate-test` flag to the trainer | Single entry point | Training loop rerun required; gates would need careful separation from val-gate logic | Separation of concerns: a validator should not retrain |
| Jupyter notebook | Interactive exploration | Not scriptable, not CI-able, no run-manifest | Repository tooling requires reusable scripts |
| Inline ad-hoc evaluation in OPEN.md | Zero code | Not reproducible; no run-manifest; not tested | Must be reproducible and provenance-tracked |
Consequences
- Positive: First unbiased held-out result for the CHUG HDR MOS head. Reproducible via a single command. Emits a provenance-tracked JSON + Markdown report. 19 unit tests cover split filtering, metric computation, and gate logic.
- Negative: PLCC 0.8468 / SROCC 0.8188 on the held-out `test` partition; both miss the ADR-0325 gate (PLCC short by 0.003, SROCC short by 0.001). Model ships `Status: Proposed` only; not promoted to production.
- Neutral / follow-ups: The held-out gap (val PLCC 0.8733 → test PLCC 0.8468) is consistent with mild overfitting to the `val` distribution during seed-sweep model selection. Remedies to investigate: (1) reseed with train+val combined for the ship checkpoint; (2) add saliency or display-profile features (T-CHUG-DISPLAY-PROFILE-TRAINING-2026-05-20); (3) increase training epochs; (4) add more CHUG rows once future extraction batches complete.
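The shortfalls and the val-to-test drop quoted above follow directly from the reported numbers. The snippet below is only a worked check of that arithmetic; the dictionary names are illustrative, and every value is taken from this ADR.

```python
# Values as reported in this ADR.
GATE = {"plcc": 0.85, "srocc": 0.82}      # ADR-0325 thresholds
VAL = {"plcc": 0.8733, "srocc": 0.8528}   # val split
TEST = {"plcc": 0.8468, "srocc": 0.8188}  # held-out test split

# How far each test metric falls below its gate threshold.
shortfall = {name: round(GATE[name] - TEST[name], 4) for name in GATE}

# How much each metric drops from val to test (the generalisation gap).
val_drop = {name: round(VAL[name] - TEST[name], 4) for name in GATE}

for name in GATE:
    print(f"{name}: short of gate by {shortfall[name]}, val->test drop {val_drop[name]}")
```

At four decimals the gate shortfalls are 0.0032 (PLCC) and 0.0012 (SROCC), while the val-to-test drops are 0.0265 and 0.0340, so the gap to the gate is roughly an order of magnitude smaller than the generalisation gap.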
References
- ADR-0325 §Production-flip gate: PLCC ≥ 0.85 / SROCC ≥ 0.82 / RMSE ≤ 0.45.
- Research 0649 (`docs/research/0649-chug-hdr-wide-mos-feature-schema.md`).
- Held-out result: `.workingdir2/training/validation/chug_held_out_test_20260527.json`.
- Source: req ("Write a CHUG MOS-head held-out test validator script + run it + report results.")