ADR-0663: MOS Label Materializer
- Status: Accepted
- Date: 2026-05-21
- Deciders: Lusoris agents
- Tags: ai, mos, training, corpus
Context
The AI refresh surfaced a bad failure mode: a KoNViD full-feature parquet had features and split labels but no mos column, and train_konvid_mos_head.py silently fell back to synthetic data. The result was a synthetic checkpoint with a real-looking output filename, which hid the actual problem: MOS labels were never joined onto the feature table.
The fork now has several MOS-labelled corpora and refreshed feature tables. Operators need an explicit, testable join step that materialises labels before training, with enough audit metadata to catch weak coverage and stale keys.
Decision
Add ai/scripts/materialize_mos_labels.py, a table-side MOS label joiner for parquet, JSONL/NDJSON, JSON, and CSV tables. The script joins subjective MOS labels onto already-extracted feature tables using explicit or inferred clip keys, with optional regex key extraction. It writes the mos, mos_raw_0_100, and mos_label_status columns plus provenance columns, rejects conflicting duplicate label keys, and by default fails when unique-key coverage is below 95%.
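The join behaviour described above can be sketched in a few lines of stdlib Python. This is an illustration under stated assumptions, not the code in materialize_mos_labels.py: the function names, the "matched"/"unmatched" status values, the label-side mos field, and the reading of unique-key coverage as matched unique feature keys divided by all unique feature keys are all hypothetical. The mos_raw_0_100 and provenance columns are left out because the ADR does not define them.

```python
import re


def extract_key(value, pattern=None):
    """Return a clip key: the first regex group if a pattern is given, else the stripped text."""
    text = str(value).strip()
    if pattern is None:
        return text
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


def join_mos_labels(feature_rows, label_rows, feature_key, label_key,
                    key_regex=None, min_coverage=0.95):
    """Join MOS labels onto feature rows (lists of dicts); hypothetical helper."""
    # Build key -> MOS lookup and reject duplicates that disagree.
    labels = {}
    for row in label_rows:
        key = extract_key(row[label_key], key_regex)
        if key is None:
            continue
        mos = float(row["mos"])  # assumed label-side column name
        if key in labels and labels[key] != mos:
            raise ValueError(f"conflicting MOS labels for key {key!r}")
        labels[key] = mos

    joined, seen, matched = [], set(), set()
    for row in feature_rows:
        key = extract_key(row[feature_key], key_regex)
        out = dict(row)
        if key is not None:
            seen.add(key)
        if key is not None and key in labels:
            matched.add(key)
            out["mos"] = labels[key]
            out["mos_label_status"] = "matched"    # assumed status value
        else:
            out["mos"] = None
            out["mos_label_status"] = "unmatched"  # assumed status value
        joined.append(out)

    # Coverage gate over unique feature keys (assumed definition).
    coverage = len(matched) / len(seen) if seen else 0.0
    if coverage < min_coverage:
        raise ValueError(
            f"unique-key coverage {coverage:.1%} is below {min_coverage:.0%}"
        )
    return joined, coverage
```

For example, a feature table keyed by paths such as videos/123.mp4 could be joined against labels keyed by bare ids by passing key_regex=r"(\d+)", which is the kind of KoNViD/CHUG-style match that exact key equality would miss.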
train_konvid_mos_head.py now fails with exit code 2 when a real run loads zero MOS-labelled rows. Synthetic data is available through explicit --smoke; --allow-synthetic-fallback exists only as a hidden legacy/debug escape hatch.
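A minimal sketch of that guard, assuming an argparse-based CLI: the two flag names and the exit code come from this ADR, while build_parser, choose_data_source, and the messages are hypothetical and not taken from the trainer.

```python
import argparse
import sys

EXIT_NO_LABELS = 2  # exit code named in this ADR


def build_parser():
    parser = argparse.ArgumentParser(prog="train_konvid_mos_head.py")
    parser.add_argument("--smoke", action="store_true",
                        help="train on synthetic data as an explicit smoke test")
    # help=argparse.SUPPRESS keeps the legacy flag out of --help output.
    parser.add_argument("--allow-synthetic-fallback", action="store_true",
                        help=argparse.SUPPRESS)
    return parser


def choose_data_source(args, labelled_row_count):
    """Return "real" or "synthetic", or exit with code 2 when labels are missing."""
    if args.smoke:
        return "synthetic"
    if labelled_row_count > 0:
        return "real"
    if args.allow_synthetic_fallback:
        print("warning: no MOS-labelled rows; using synthetic fallback",
              file=sys.stderr)
        return "synthetic"
    print("error: real run loaded zero MOS-labelled rows", file=sys.stderr)
    raise SystemExit(EXIT_NO_LABELS)
```

The point of the ordering is that synthetic data is only ever reached through a flag the operator typed, so an empty real corpus cannot produce a checkpoint by accident.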
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep synthetic fallback in the trainer | Existing scripts keep running | Real-data mistakes produce plausible synthetic artefacts and waste GPU time | Rejected; real training must fail loudly when labels are absent |
| Join labels inside each trainer | Fewer operator steps for one model | Duplicates key-matching and coverage policy across KoNViD, CHUG, and future MOS heads | Rejected; table materialisation is easier to audit and reuse |
| Require exact key equality only | Simple implementation | Public corpus feature tables often carry numeric ids inside filenames or paths | Rejected; regex extraction is needed for KoNViD/CHUG-style joins |
| Average duplicate labels | Handles repeated label files | Masks stale reruns and conflicting source rows | Rejected; conflicting duplicates are data poisoning |
Consequences
- Positive: MOS-head training can no longer silently turn an empty real corpus into a synthetic checkpoint.
- Positive: the same labelled feature parquet can feed training, signal-mix audits, and later retrain comparisons.
- Positive: match coverage and label provenance are recorded as an audit JSON rather than inferred from trainer logs.
- Negative: operators must run one extra materialisation command when feature extraction and MOS labels live in separate tables.
- Neutral / follow-ups: use the materializer for the KoNViD refresh and CHUG-derived HDR tables, then retrain the MOS heads with real labelled rows.
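As a rough illustration of the audit record mentioned in the consequences above, the sketch below builds a JSON-serialisable summary of match coverage and provenance. Every field name here is an assumption made for illustration; the actual audit schema is whatever docs/ai/mos-label-materializer.md specifies.

```python
import json
from datetime import datetime, timezone


def build_audit(feature_keys, label_keys, features_path, labels_path,
                min_coverage=0.95):
    """Summarise key coverage for one materialisation run (hypothetical schema)."""
    feature_set = {k for k in feature_keys if k is not None}
    label_set = set(label_keys)
    matched = feature_set & label_set
    coverage = len(matched) / len(feature_set) if feature_set else 0.0
    return {
        "features_path": features_path,
        "labels_path": labels_path,
        "unique_feature_keys": len(feature_set),
        "unique_label_keys": len(label_set),
        "matched_keys": len(matched),
        "coverage": round(coverage, 4),
        "min_coverage": min_coverage,
        "passed": coverage >= min_coverage,
        # Small samples make weak coverage and stale keys easy to spot.
        "unmatched_feature_keys_sample": sorted(feature_set - label_set)[:10],
        "stale_label_keys_sample": sorted(label_set - feature_set)[:10],
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    audit = build_audit(["1", "2", "3", "4"], ["1", "2", "3", "9"],
                        "features.parquet", "labels.csv")
    print(json.dumps(audit, indent=2))
```

Keeping the unmatched and stale key samples in the record lets a reviewer tell a key-format mismatch from a genuinely missing label without rerunning the join.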
References
- docs/ai/mos-label-materializer.md
- docs/ai/models/konvid_mos_head_v1.md
- User request: "and for sure ai trainings to start? ... wake up ffs"
- User request: "and implement everything that is not blocked by the model"