ADR-0992: MOS-label batch-run manifests for KonViD and CHUG
- Status: Accepted
- Date: 2026-06-03
- Deciders: Lusoris
- Tags: ai, mos, training, corpus, konvid, chug, fork-local
Context
The batch_materialize_mos_labels.py script (ADR-0675) was merged (archived PR #1498), but no corpus-specific batch manifests ship with the code. Operators who want to join MOS labels onto their KonViD or CHUG feature extracts must either write a manifest from scratch or remember the correct column names, key regexes, and path conventions for each corpus.
In addition, the test_batch_materialize_mos_labels.py tests were broken: the _load_module() helper executed batch_materialize_mos_labels.py without inserting ai/scripts/ into sys.path, so _script_bootstrap was not importable when pytest ran from the repo root.
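The import failure described above can be reproduced and avoided with a small loader. This is a minimal sketch, not the actual `_load_module()` helper: the function name `load_script_module`, the stand-in file `batch_tool.py`, and the temporary directory layout are all hypothetical. It only illustrates the general pattern of putting a script's directory on `sys.path` before executing it with `importlib`.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_script_module(script_path: Path, name: str = "loaded_script"):
    """Execute a script file as a module so that its sibling modules import cleanly."""
    script_dir = str(script_path.resolve().parent)
    # Without this insert, a sibling import inside the script fails whenever
    # the current working directory is not the script's own directory.
    if script_dir not in sys.path:
        sys.path.insert(0, script_dir)
    spec = importlib.util.spec_from_file_location(name, script_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Self-contained demonstration with throwaway stand-in files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    scripts = Path(tmp) / "ai" / "scripts"
    scripts.mkdir(parents=True)
    (scripts / "_script_bootstrap.py").write_text("READY = True\n")
    (scripts / "batch_tool.py").write_text(
        "import _script_bootstrap\nBOOTSTRAPPED = _script_bootstrap.READY\n"
    )
    module = load_script_module(scripts / "batch_tool.py")
    print(module.BOOTSTRAPPED)
```

Inserting the directory at position 0 gives the script's siblings priority over same-named modules elsewhere on the path. That is convenient in a test helper but may be too aggressive for production code.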
Decision
- Add ai/configs/mos-label-batch-konvid.json, the batch manifest for joining KonViD-1k and KonViD-150k MOS labels onto extracted feature parquets. The defaults set key_normalize to basename, set both feature_key_regex and label_key_regex to ([0-9]{6,}) (KonViD clip IDs are integers of six or more digits), and set min_match_rate to 0.95. Table entries map the feature-side key column (feature_key_column: key) to the label-side key column (label_key_column: src) for the conventional per-corpus JSONL shapes.
- Add ai/configs/mos-label-batch-chug.json, the batch manifest for joining CHUG UGC-HDR MOS labels. It uses key_normalize: raw because CHUG video IDs are opaque strings (not file paths), and joins on chug_video_id on both the feature and label sides.
- Add ai/tests/test_mos_label_batch_runs_smoke.py, which validates that both manifests are well-formed JSON, parse correctly via load_batch_manifest(), encode the expected corpus-specific key columns and normalisation policy, and run end-to-end with synthetic data (no real corpus files required on CI).
- Fix the pre-existing sys.path bug in ai/tests/test_batch_materialize_mos_labels.py: insert ai/scripts/ into sys.path before executing the script module so the tests pass when pytest is invoked from the repo root (the standard CI invocation).
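The two normalisation policies and the match-rate threshold can be illustrated as follows. This is a sketch under stated assumptions, not the code in batch_materialize_mos_labels.py: the helper names `normalize_key` and `match_rate`, the sample paths, and the choice of feature rows as the denominator of the match rate are all hypothetical. Only the regex, the `basename` and `raw` mode names, and the 0.95 threshold come from this ADR.

```python
import re
from pathlib import PurePosixPath

KONVID_KEY_REGEX = re.compile(r"([0-9]{6,})")  # regex quoted in the KonViD manifest
MIN_MATCH_RATE = 0.95                           # threshold quoted in the KonViD manifest


def normalize_key(value, mode, regex=None):
    """Reduce a raw key to its join form; return None if the regex finds nothing."""
    key = str(value)
    if mode == "basename":
        key = PurePosixPath(key).name  # drop any directory components
    elif mode != "raw":
        raise ValueError(f"unknown key_normalize mode: {mode}")
    if regex is not None:
        match = regex.search(key)
        if match is None:
            return None
        key = match.group(1)
    return key


def match_rate(feature_keys, label_keys):
    """Fraction of feature keys found among the label keys (assumed definition)."""
    labels = {k for k in label_keys if k is not None}
    feature_keys = list(feature_keys)
    if not feature_keys:
        return 0.0
    return sum(k in labels for k in feature_keys) / len(feature_keys)


# Made-up sample values; real corpus paths will differ.
feature_side = ["/data/clips/3339962845.mp4", "/data/clips/4265466447.mp4", "/data/clips/notes.txt"]
label_side = ["3339962845.mp4", "4265466447.mp4"]

feature_keys = [normalize_key(v, "basename", KONVID_KEY_REGEX) for v in feature_side]
label_keys = [normalize_key(v, "basename", KONVID_KEY_REGEX) for v in label_side]
rate = match_rate(feature_keys, label_keys)

print(feature_keys)            # the third entry has no 6+ digit run, so it is None
print(rate >= MIN_MATCH_RATE)  # two of three rows match, below the threshold
print(normalize_key("opaque-id_01", "raw"))  # raw mode leaves the ID untouched
```

In this sketch a stray file in the feature directory is enough to push the rate below 0.95, which is the kind of silent mismatch a match-rate floor is meant to surface.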
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Ship manifests as YAML | Consistent with other configs in ai/configs/ | batch_materialize_mos_labels.py reads JSON; a YAML manifest would need a new parser path | Rejected: avoid adding a parser path for a config format the tool does not support |
| Embed corpus manifests in a single mos-label-batch-all.json | Single file for all corpora | Different corpora have different key schemas and path conventions; a single file mixes them without per-corpus clarity | Rejected: operator intent is clearer with one manifest per corpus family |
| Store manifests under .workingdir2/ | Keeps config next to corpus artifacts | .workingdir2/ is gitignored; operators on fresh clones would have no starting point | Rejected: manifests encode code-time policy (column names, regex, thresholds), not runtime state |
Consequences
- Positive: Operators can run python ai/scripts/batch_materialize_mos_labels.py --manifest ai/configs/mos-label-batch-konvid.json without constructing the manifest by hand. The manifests are version-controlled and stable across machines. The smoke test catches manifest regressions and documents the expected corpus key schemas.
- Positive: Pre-existing test_batch_materialize_mos_labels.py breakage is fixed; all three existing tests now pass from the repo root.
- Negative: Manifest paths use defaults relative to ai/configs/, which expect the standard runs/ and .workingdir2/ layout; operators with non-default corpus layouts must pass --base-dir.
- Neutral / follow-ups: Real batch runs (with actual corpus files) remain operator-local; the manifests document the expected paths but do not ship the corpus data. The CHUG manifest points to .corpus/chug/ for the input JSONL/parquet (convention from chug_extract_features.py) and .workingdir2/chug/runs/ for output (convention from train_chug_hdr_mos_head.py).
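The kind of regression the smoke test guards against can be sketched with a small structural check. The top-level `defaults` and `tables` names and the helper `manifest_problems` are assumptions made for illustration; the shipped manifests and `load_batch_manifest()` may use a different layout. Only the field names and the KonViD values are taken from this ADR.

```python
import json

# Illustrative manifest. The "defaults" and "tables" keys are assumed names;
# the fields and values inside them are the ones this ADR lists for KonViD.
konvid_like = {
    "defaults": {
        "key_normalize": "basename",
        "feature_key_regex": "([0-9]{6,})",
        "label_key_regex": "([0-9]{6,})",
        "min_match_rate": 0.95,
    },
    "tables": [{"feature_key_column": "key", "label_key_column": "src"}],
}


def manifest_problems(manifest, normalize, feature_col, label_col):
    """Return a list of readable mismatches; an empty list means the check passes."""
    problems = []
    defaults = manifest.get("defaults", {})
    if defaults.get("key_normalize") != normalize:
        problems.append(
            f"key_normalize is {defaults.get('key_normalize')!r}, expected {normalize!r}"
        )
    tables = manifest.get("tables", [])
    if not tables:
        problems.append("manifest lists no tables")
    for index, table in enumerate(tables):
        if table.get("feature_key_column") != feature_col:
            problems.append(f"table {index}: unexpected feature_key_column")
        if table.get("label_key_column") != label_col:
            problems.append(f"table {index}: unexpected label_key_column")
    return problems


# Round-trip through JSON text to confirm the structure serialises as written.
round_tripped = json.loads(json.dumps(konvid_like))

# Checked against the KonViD expectations, the manifest passes.
print(manifest_problems(round_tripped, "basename", "key", "src"))
# Checked against the CHUG expectations, every policy field is flagged.
print(manifest_problems(round_tripped, "raw", "chug_video_id", "chug_video_id"))
```

Returning a list of problems rather than raising on the first mismatch makes a failing CI run show every drifted field at once.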