MOS Label Materializer
ai/scripts/materialize_mos_labels.py joins subjective MOS labels onto already-extracted feature tables. Use it before a real MOS-head training run when the feature table has metrics but lacks the mos and mos_raw_0_100 columns.
The script is table-side only. It does not download corpora, extract VMAF features, or train a model.
Inputs
Feature tables may be parquet, JSONL/NDJSON, or JSON objects with rows. Label tables may be parquet, JSONL/NDJSON, JSON objects with rows, or CSV.
Both sides need a stable clip key. The script can infer common key columns such as key, video_id, clip_name, filename, src, or path; pass explicit columns when the table is ambiguous.
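The inference described above could work roughly as follows. This is a minimal sketch, not the script's actual code: the function name `infer_key_column` is hypothetical, and treating more than one candidate column as "ambiguous" is an assumption about what the script means by that word.

```python
from typing import Optional, Sequence

# Candidate key columns named in the text above.
CANDIDATE_KEY_COLUMNS = ("key", "video_id", "clip_name", "filename", "src", "path")


def infer_key_column(columns: Sequence[str], explicit: Optional[str] = None) -> str:
    """Return the clip-key column for a table, preferring an explicit choice."""
    if explicit is not None:
        if explicit not in columns:
            raise KeyError(f"key column {explicit!r} is not in the table")
        return explicit
    matches = [name for name in CANDIDATE_KEY_COLUMNS if name in columns]
    if not matches:
        raise KeyError("no known key column found; pass one explicitly")
    if len(matches) > 1:
        # Assumption: several candidates means the caller must disambiguate.
        raise ValueError(f"ambiguous key columns {matches}; pass one explicitly")
    return matches[0]


print(infer_key_column(["vmaf", "key", "psnr"]))               # key
print(infer_key_column(["src", "path", "mos"], explicit="src"))  # src
```

The second call shows why explicit columns matter: a label table carrying both src and path cannot be resolved by inference alone in this sketch.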
MOS labels may be stored as either:
| Column | Scale | Output |
|---|---|---|
| mos | 1-5 | copied to mos, mapped to mos_raw_0_100 |
| mos_raw_0_100 | 0-100 | mapped to mos, copied to mos_raw_0_100 |
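The table does not spell out how the two scales relate. As a minimal sketch, assuming a linear mapping in which 1 corresponds to 0 and 5 corresponds to 100 (an assumption, not confirmed by the script), the conversion could look like this. The function names are hypothetical.

```python
def mos_to_raw(mos: float) -> float:
    """Map a 1-5 MOS onto 0-100, assuming a linear scale (1 -> 0, 5 -> 100)."""
    if not 1.0 <= mos <= 5.0:
        raise ValueError(f"MOS {mos} is outside the 1-5 scale")
    return (mos - 1.0) * 25.0


def raw_to_mos(raw: float) -> float:
    """Inverse of mos_to_raw under the same linear assumption."""
    if not 0.0 <= raw <= 100.0:
        raise ValueError(f"raw MOS {raw} is outside the 0-100 scale")
    return 1.0 + raw / 25.0


print(mos_to_raw(3.0))   # 50.0
print(raw_to_mos(50.0))  # 3.0
```

Whatever convention the script really uses, the two directions should be exact inverses so that a table labelled from either column ends up with consistent values in both.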
Optional label metadata such as mos_std_dev, n_ratings, split, corpus, and corpus_version is copied onto matching feature rows.
Example
Join KoNViD-150k labels onto a refreshed full-feature parquet whose feature keys and source filenames both contain the numeric clip id:
.venv/bin/python ai/scripts/materialize_mos_labels.py \
--features runs/full_features_konvid_refresh_20260520_with_folds.parquet \
--labels .corpus/konvid-150k/konvid_150k.jsonl \
--feature-key-column key \
--label-key-column src \
--feature-key-regex '([0-9]{6,})' \
--label-key-regex '([0-9]{6,})' \
--out runs/full_features_konvid_refresh_20260520_with_mos.parquet \
--audit-json runs/full_features_konvid_refresh_20260520_with_mos.audit.json
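To see what the two key regexes in the command do, the sketch below applies the same pattern to one feature key and one label filename. It assumes the script takes the first capture group of the first match, which is not confirmed here. The helper name `extract_key` and both sample values are made up for illustration.

```python
import re
from typing import Optional

# Same pattern as --feature-key-regex / --label-key-regex in the command above.
KEY_PATTERN = re.compile(r"([0-9]{6,})")


def extract_key(value: str) -> Optional[str]:
    """Return the first run of six or more digits, or None if there is none."""
    match = KEY_PATTERN.search(value)
    return match.group(1) if match else None


feature_key = "konvid_refresh/3015973642_540p"  # hypothetical feature key
label_src = "videos/3015973642.mp4"             # hypothetical label filename

print(extract_key(feature_key))                             # 3015973642
print(extract_key(feature_key) == extract_key(label_src))   # True
print(extract_key("clip_540p"))                             # None
```

Because the pattern matches the first run of six or more digits, a value that also embeds a date such as 20260520 ahead of the clip id would yield the date instead; a tighter pattern is safer for such keys.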
The default --min-match-rate 0.95 is promotion-oriented: it fails the run if fewer than 95% of unique feature keys receive a label. Lower it only for exploratory audits where missing rows are expected and visible in mos_label_status.
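The threshold check can be pictured as follows. This is a minimal sketch under the description above (share of unique feature keys that receive a label); the function names and the choice of `ValueError` are assumptions, not the script's real interface.

```python
from typing import Iterable


def match_rate(feature_keys: Iterable[str], label_keys: Iterable[str]) -> float:
    """Fraction of unique feature keys that have a label."""
    unique_features = set(feature_keys)
    if not unique_features:
        return 0.0
    matched = unique_features & set(label_keys)
    return len(matched) / len(unique_features)


def check_match_rate(
    feature_keys: Iterable[str],
    label_keys: Iterable[str],
    min_match_rate: float = 0.95,
) -> float:
    """Return the match rate, or raise if it falls below the threshold."""
    rate = match_rate(feature_keys, label_keys)
    if rate < min_match_rate:
        raise ValueError(f"match rate {rate:.3f} is below {min_match_rate}")
    return rate


feature_keys = [f"clip{i:02d}" for i in range(20)]
print(check_match_rate(feature_keys, feature_keys[:19]))  # 0.95
```

With 20 unique feature keys, 19 labelled keys sit exactly on the default threshold and pass; 18 would raise.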
When --audit-json is set, the audit file includes ADR-0661 run_provenance with the script entrypoint, argv, parsed arguments, feature table, label table inputs, output table target, and audit target. Keep that audit next to refreshed training tables so MOS-head training evidence can be reproduced without shell history.
Batch Manifest
Use ai/scripts/batch_materialize_mos_labels.py when several refreshed feature tables need the same label-join policy. Paths in the manifest are relative to the manifest file unless --base-dir is supplied.
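The path rule above can be sketched with pathlib. The helper name `resolve_manifest_path` is hypothetical, and leaving absolute paths untouched is an assumption; `PurePosixPath` is used only so the printed results are the same on every platform.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_manifest_path(
    entry: str,
    manifest_path: PurePosixPath,
    base_dir: Optional[PurePosixPath] = None,
) -> PurePosixPath:
    """Resolve a manifest entry against --base-dir or the manifest's directory."""
    candidate = PurePosixPath(entry)
    if candidate.is_absolute():
        # Assumption: absolute entries are used as-is.
        return candidate
    root = base_dir if base_dir is not None else manifest_path.parent
    return root / candidate


manifest = PurePosixPath("ai/configs/example-manifest.json")  # hypothetical file
print(resolve_manifest_path("runs/features.parquet", manifest))
# ai/configs/runs/features.parquet
print(resolve_manifest_path("runs/features.parquet", manifest, PurePosixPath("/data")))
# /data/runs/features.parquet
```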
Corpus-specific manifests
The repo ships ready-to-use batch manifests under ai/configs/:
| Manifest | Corpora | Key schema |
|---|---|---|
| ai/configs/mos-label-batch-konvid.json | KoNViD-1k, KoNViD-150k | key (feature) ↔ src (label); 6+ digit numeric regex |
| ai/configs/mos-label-batch-chug.json | CHUG UGC-HDR | chug_video_id on both sides; key_normalize: raw |
Run the KoNViD batch (after feature extraction and corpus JSONL ingestion):
.venv/bin/python ai/scripts/batch_materialize_mos_labels.py \
--manifest ai/configs/mos-label-batch-konvid.json \
--report-json .workingdir2/konvid-mos-batch.report.json \
--report-md .workingdir2/konvid-mos-batch.report.md
Run the CHUG batch:
.venv/bin/python ai/scripts/batch_materialize_mos_labels.py \
--manifest ai/configs/mos-label-batch-chug.json \
--report-json .workingdir2/chug-mos-batch.report.json \
--report-md .workingdir2/chug-mos-batch.report.md
Paths in the shipped manifests follow the default layout from konvid_to_full_features.py (runs/) and chug_extract_features.py (.corpus/chug/ for input, .workingdir2/chug/runs/ for output). Pass --base-dir to override the relative-path root for non-default layouts.
Custom manifest
{
  "defaults": {
    "min_match_rate": 0.95,
    "key_normalize": "auto"
  },
  "tables": [
    {
      "id": "konvid",
      "features": "runs/full_features_konvid_refresh.parquet",
      "labels": [".corpus/konvid-150k/konvid_150k.jsonl"],
      "feature_key_column": "key",
      "label_key_column": "src",
      "feature_key_regex": "([0-9]{6,})",
      "label_key_regex": "([0-9]{6,})",
      "out": "runs/full_features_konvid_refresh_with_mos.parquet",
      "audit_json": "runs/full_features_konvid_refresh_with_mos.audit.json"
    }
  ]
}
Each table may override any single-run join option from defaults, including feature_key_column, label_key_column, label_mos_column, key_normalize, feature_key_regex, label_key_regex, min_match_rate, status_column, and overwrite. The batch report uses schema mos-label-materializer-batch-v1 and carries ADR-0661 run_provenance.
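The override rule can be illustrated with a plain dictionary merge in which table entries win over defaults. This is a sketch of the semantics described above, not the batch script's code: `effective_options` is a hypothetical helper, and the "exploratory" table is an invented entry.

```python
import json

# Trimmed-down manifest; the "exploratory" table is invented for illustration.
MANIFEST = json.loads("""
{
  "defaults": {"min_match_rate": 0.95, "key_normalize": "auto"},
  "tables": [
    {"id": "konvid", "feature_key_column": "key"},
    {"id": "exploratory", "min_match_rate": 0.5, "key_normalize": "raw"}
  ]
}
""")


def effective_options(defaults: dict, table: dict) -> dict:
    """Per-table options: start from defaults, let table entries override."""
    return {**defaults, **table}


for table in MANIFEST["tables"]:
    options = effective_options(MANIFEST["defaults"], table)
    print(options["id"], options["min_match_rate"], options["key_normalize"])
# konvid 0.95 auto
# exploratory 0.5 raw
```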
Output Columns
| Column | Meaning |
|---|---|
| mos | Subjective MOS on the 1-5 training scale. |
| mos_raw_0_100 | Same MOS on a 0-100 scale for cross-corpus audits. |
| mos_label_status | ok or missing-label. |
| mos_label_source | Label file that supplied the row. |
| mos_std_dev, mos_n_ratings, split, corpus, corpus_version | Optional label metadata when present. |
Existing MOS columns are not overwritten unless --overwrite is passed. Duplicate label keys with conflicting MOS values are rejected; they normally mean the wrong labels were joined or a stale rerun was mixed into the label directory.
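The duplicate-key rule can be sketched as a label index that accepts repeated keys only when their MOS values agree. The helper name `index_labels`, the tolerance, and the sample keys are assumptions for illustration; the script's actual comparison may differ.

```python
from typing import Dict, Iterable, Tuple


def index_labels(
    rows: Iterable[Tuple[str, float]], tolerance: float = 1e-6
) -> Dict[str, float]:
    """Build key -> MOS, rejecting duplicate keys whose MOS values conflict."""
    labels: Dict[str, float] = {}
    for key, mos in rows:
        previous = labels.get(key)
        if previous is not None and abs(previous - mos) > tolerance:
            raise ValueError(f"conflicting MOS for key {key!r}: {previous} vs {mos}")
        # First value wins; identical repeats are harmless.
        labels.setdefault(key, mos)
    return labels


print(index_labels([("3015973642", 3.2), ("3015973642", 3.2), ("4002118907", 4.1)]))
# {'3015973642': 3.2, '4002118907': 4.1}
```

Feeding the same key with 3.2 and then 2.7 raises, which is the situation the text attributes to a wrong label file or a stale rerun.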
Trainer Contract
ai/scripts/train_konvid_mos_head.py no longer falls back to synthetic data when a real corpus path yields zero labelled rows. Use:
- --smoke for dependency-free synthetic CI/load-path checks;
- materialize_mos_labels.py for real feature tables that need MOS labels;
- --allow-synthetic-fallback only for deliberate legacy debugging.
This keeps real-looking output filenames from hiding synthetic checkpoints.