ADR-0676: MOS Corpus Adapter Manifests

  • Status: Accepted
  • Date: 2026-05-21
  • Deciders: Lusoris, Codex
  • Tags: ai, mos, corpus, provenance, fork-local

Context

ADR-0661 made AI run provenance a shared schema, ADR-0669 covered corpus JSONL merge/aggregate outputs, and ADR-0670 covered several legacy trainer-input builders. The remaining MOS corpus adapters still produced local JSONL shards without a sibling replay sidecar. That left CHUG, KoNViD, YouTube-UGC, LSVQ, LIVE-VQC, and Waterloo-IVC rows weaker as model-card evidence than the downstream tables derived from them.

The gap is operationally risky because these adapters download or resolve research corpora from operator-local paths, apply max-row caps, tolerate download attrition, and emit corpus-specific MOS scale fields. A later trainer cannot infer those choices from the JSONL rows alone.

Decision

All MOS corpus JSONL adapters using corpus.base.CorpusIngestBase, plus the KoNViD-1k local adapter, will write <output>.manifest.json by default and accept --manifest-out PATH for experiment bundles. The manifest records the corpus label, run counters, effective ingest config, path inputs/outputs, and ADR-0661 run_provenance.
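
As an illustration only, the default-path and override behaviour described above could be sketched as follows. This is not the code in corpus.base.CorpusIngestBase. The helper names, the --output and --max-rows flags, and the manifest key names are hypothetical; only --manifest-out and the list of recorded items come from this ADR. The sketch also assumes that <output>.manifest.json means the suffix is appended to the full output file name.

```python
import argparse
import json
from pathlib import Path


def default_manifest_path(output):
    # Assumption: "<output>.manifest.json" appends to the whole file name,
    # e.g. chug.jsonl -> chug.jsonl.manifest.json.
    return output.with_name(output.name + ".manifest.json")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical MOS corpus adapter CLI")
    parser.add_argument("--output", type=Path, required=True)  # hypothetical flag
    parser.add_argument("--max-rows", type=int, default=None)  # hypothetical flag
    parser.add_argument("--manifest-out", type=Path, default=None)  # named in this ADR
    return parser.parse_args(argv)


def resolve_manifest_path(args):
    # An explicit --manifest-out wins; otherwise use the sibling sidecar.
    if args.manifest_out is not None:
        return args.manifest_out
    return default_manifest_path(args.output)


def build_manifest(corpus, counters, config, inputs, outputs, run_provenance):
    # Key names are illustrative; the real schema is defined by the adapters
    # and by ADR-0661 for the run_provenance block.
    return {
        "corpus": corpus,
        "counters": counters,
        "config": config,
        "paths": {"inputs": inputs, "outputs": outputs},
        "run_provenance": run_provenance,
    }


def write_manifest(manifest, path):
    # Create the target directory so experiment-bundle paths work too.
    path.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    path.write_text(text, encoding="utf-8")
```

Keeping the run-level record in one sidecar file, rather than in the rows, matches the decision to leave trainer row shapes unchanged.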

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Shared base helper plus per-adapter --manifest-out | One schema path; consistent counters/provenance; minimal per-script code | Touches several CLI docs and tests in one PR | Chosen: it closes the whole adapter family rather than leaving partial evidence |
| Add manifests only to CHUG | Fastest HDR-specific patch | KoNViD/UGC/LSVQ/LIVE/Waterloo shards remain anonymous and inconsistent | Rejected: MOS-head refreshes combine multiple corpora |
| Rely on downstream merge manifests | No new CLI flags | Merge manifests cannot prove download attrition, max-row caps, source roots, or corpus-specific parser config | Rejected: source-adapter choices are lost before merge |
| Embed run metadata into every JSONL row | Single artifact | Repeats run-level metadata per row and changes trainer row shape | Rejected: row schemas should remain stable; run evidence belongs in a sidecar |

Consequences

  • Positive: MOS corpus JSONL shards can be cited by trainers, signal-mix audits, and model cards with replayable input/output evidence.
  • Negative: Any new MOS adapter CLI must document and test its manifest sidecar in addition to row schema.
  • Neutral / follow-ups: Regenerate local CHUG/KoNViD/UGC/LSVQ/LIVE/Waterloo JSONL shards with sidecars before using them in a promoted model-card refresh.
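
A pre-refresh check for the regeneration follow-up might look like the sketch below. It is a hypothetical helper, not part of the repository. It assumes the shards sit in one directory with a .jsonl extension and that each sidecar uses the default sibling name (shard file name plus .manifest.json). Manifests written elsewhere via --manifest-out would not be found by it.

```python
from pathlib import Path


def shards_missing_sidecars(shard_dir):
    """Return the JSONL shards in shard_dir that have no sibling manifest."""
    missing = []
    for shard in sorted(Path(shard_dir).glob("*.jsonl")):
        # Assumed default sidecar name: <shard file name>.manifest.json
        sidecar = shard.with_name(shard.name + ".manifest.json")
        if not sidecar.is_file():
            missing.append(shard)
    return missing
```

An empty result means every shard in the directory has a default-location sidecar; any listed shard would need regenerating before it is cited in a model card.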

References

  • ADR-0661 — shared AI run provenance.
  • ADR-0669 — corpus JSONL aggregate and merge manifests.
  • ADR-0670 — legacy corpus extraction manifests.
  • Source: req — "so all, netflix, regressors, encoders etc... everything we did so far needs updates"
  • Source: req — "well go on i guess we have enough backlog..."