ADR-0687: CHUG HDR MOS head held-out test partition validator

Status: Accepted
Date: 2026-05-27
Deciders: lusoris
Tags: ai, chug, mos-head, validation, fork-local

Context

The CHUG corpus (see ai/scripts/train_chug_hdr_mos_head.py) has explicit train (3936), val (648), and test (552) splits, recorded in the split field of each feature JSONL row. Training and cross-validation use only train + val rows. The test partition is held out by design so that at least one unbiased evaluation remains after the model is selected.
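As a minimal sketch of the held-out split, the snippet below parses JSONL lines and keeps only the rows whose split field is "test". The field names split and mos come from this ADR; the helper names, the sample rows, and the strip/lower-case normalisation are assumptions for illustration and are not the repository's _load_jsonl or _normalise_split.

```python
import json
from collections import Counter


def load_rows(lines):
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def normalise_split(value):
    # Assumption: normalisation is strip + lower-case; the real helper may differ.
    return str(value).strip().lower()


def rows_for_split(rows, split):
    """Return only the rows that belong to the requested split."""
    return [r for r in rows if normalise_split(r.get("split", "")) == split]


# Hypothetical sample rows; real shards carry many more feature fields.
sample = [
    '{"split": "train", "mos": 3.1}',
    '{"split": "val", "mos": 2.4}',
    '{"split": "Test ", "mos": 4.0}',
    '',
    '{"split": "test", "mos": 1.7}',
]

rows = load_rows(sample)
counts = Counter(normalise_split(r["split"]) for r in rows)
test_rows = rows_for_split(rows, "test")
print(dict(counts), len(test_rows))
```

Keeping the filter as a separate function makes it straightforward to assert that no train or val row reaches the evaluation step.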

The chug_hdr_mos_head_v1_wide_seed20260521 checkpoint reported PLCC 0.8733 / SROCC 0.8528 on the val split but had never been evaluated on the test split. An unbiased test-partition result is required before promoting the model or starting a new training pass, so that the true generalisation gap is known.

Per ADR-0325 the production-flip gate is PLCC ≥ 0.85, SROCC ≥ 0.82, RMSE ≤ 0.45 MOS units. Per the feedback_no_test_weakening rule, the gate thresholds are never lowered to accommodate a miss.
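The gate can be expressed as three inclusive comparisons. The sketch below is illustrative only: the Gate class and evaluate_gate function are hypothetical names, and the RMSE passed in the usage line is a placeholder because this ADR does not quote a test-partition RMSE. The thresholds themselves are the ADR-0325 values quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """ADR-0325 production-flip thresholds (fixed, never lowered)."""
    min_plcc: float = 0.85
    min_srocc: float = 0.82
    max_rmse: float = 0.45  # MOS units


def evaluate_gate(plcc, srocc, rmse, gate=Gate()):
    """Return (overall_pass, per_metric_checks) using inclusive bounds."""
    checks = {
        "plcc": plcc >= gate.min_plcc,
        "srocc": srocc >= gate.min_srocc,
        "rmse": rmse <= gate.max_rmse,
    }
    return all(checks.values()), checks


# PLCC/SROCC are the held-out values from this ADR; rmse=0.40 is a placeholder.
passed, checks = evaluate_gate(plcc=0.8468, srocc=0.8188, rmse=0.40)
print(passed, checks)
```

Returning the per-metric checks alongside the overall verdict lets a report state which threshold was missed rather than only that the gate failed.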

Decision

We will add ai/scripts/validate_chug_hdr_mos_head.py, a standalone validator that:

Loads a CHUG MOS head ONNX (default: the current best checkpoint).
Reads CHUG feature JSONL shards and filters to split == "test" rows.
Runs ONNX inference using onnxruntime (CPU provider).
Computes PLCC, SROCC, and RMSE against the mos column.
Emits a JSON report, a Markdown report, and a write_run_manifest sidecar.
Exits 0 on gate PASS, 2 on gate FAIL, 1 on input error.
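The metric and exit-code steps above can be sketched in dependency-free Python. This is an assumption-laden illustration, not the script's implementation: the function names are hypothetical, the predictions and MOS values are made up, and the real validator may compute the correlations with a numerical library. SROCC is computed here as PLCC over average ranks, which handles ties.

```python
import math

EXIT_PASS, EXIT_INPUT_ERROR, EXIT_GATE_FAIL = 0, 1, 2


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SROCC) as PLCC over average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(pred, target):
    """Root-mean-square error in MOS units."""
    if len(pred) != len(target) or not pred:
        raise ValueError("need two equal-length, non-empty sequences")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def validate(pred, mos):
    """Return (exit_code, metrics): 0 on PASS, 2 on FAIL, 1 on input error."""
    try:
        metrics = {"plcc": pearson(pred, mos), "srocc": spearman(pred, mos),
                   "rmse": rmse(pred, mos)}
    except ValueError:
        return EXIT_INPUT_ERROR, {}
    passed = (metrics["plcc"] >= 0.85 and metrics["srocc"] >= 0.82
              and metrics["rmse"] <= 0.45)
    return (EXIT_PASS if passed else EXIT_GATE_FAIL), metrics


# Made-up predictions and MOS labels, for illustration only.
code, metrics = validate([2.0, 3.1, 3.9, 4.6], [2.2, 3.0, 4.1, 4.4])
print(code, metrics)
```

Distinct exit codes for a gate failure (2) and an input error (1) let CI tell a model that missed the thresholds apart from a run that never produced metrics.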

The script reuses _row_to_features, _load_jsonl, _normalise_split, and schema constants from train_konvid_mos_head.py so the feature projection is byte-identical to the training path.

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Add a `--evaluate-test` flag to the trainer | Single entry point | Training loop rerun required; gates would need careful separation from val-gate logic | Separation of concerns: a validator should not retrain |
| Jupyter notebook | Interactive exploration | Not scriptable, not CI-able, no run-manifest | Repository tooling requires reusable scripts |
| Inline ad-hoc evaluation in OPEN.md | Zero code | Not reproducible; no run-manifest; not tested | Must be reproducible and provenance-tracked |

Consequences

Positive: First unbiased held-out result for the CHUG HDR MOS head. Reproducible via a single command. Emits a provenance-tracked JSON + Markdown report. 19 unit tests cover split filtering, metric computation, and gate logic.
Negative: PLCC 0.8468 / SROCC 0.8188 on the held-out test partition. Both miss the ADR-0325 gate (PLCC short by 0.0032, SROCC short by 0.0012). The model ships with Status: Proposed only and is not promoted to production.
Neutral / follow-ups: The held-out gap (val PLCC 0.8733 → test PLCC 0.8468) is consistent with mild overfitting to the val distribution during seed-sweep model selection. Remedies to investigate: (1) reseed with train+val combined for the ship checkpoint; (2) add saliency or display-profile features (T-CHUG-DISPLAY-PROFILE-TRAINING-2026-05-20); (3) increase training epochs; (4) add more CHUG rows once future extraction batches complete.
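The shortfalls and the val-to-test drop follow directly from the figures quoted in this ADR. The short sketch below reproduces that arithmetic; the dictionary names are illustrative, and RMSE is omitted because no test-partition RMSE is quoted here.

```python
GATE = {"plcc": 0.85, "srocc": 0.82}      # ADR-0325 minimums
VAL = {"plcc": 0.8733, "srocc": 0.8528}   # val split, as reported in this ADR
TEST = {"plcc": 0.8468, "srocc": 0.8188}  # held-out test, as reported in this ADR

# Distance below the gate on the held-out partition.
shortfall = {k: round(GATE[k] - TEST[k], 4) for k in GATE}

# Drop from the val split to the held-out test split.
val_to_test_drop = {k: round(VAL[k] - TEST[k], 4) for k in GATE}

print("shortfall:", shortfall)
print("val -> test drop:", val_to_test_drop)
```

The drop from val to test is roughly an order of magnitude larger than the distance to the gate, which is why the follow-ups focus on the model-selection procedure rather than on the thresholds.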

References

ADR-0325 §Production-flip gate: PLCC ≥ 0.85 / SROCC ≥ 0.82 / RMSE ≤ 0.45.
Research 0649 (docs/research/0649-chug-hdr-wide-mos-feature-schema.md).
Held-out result: .workingdir2/training/validation/chug_held_out_test_20260527.json.
Source: req — "Write a CHUG MOS-head held-out test validator script + run it + report results."