ADR-0388: Ingest BVI-CC as the second tiny-AI training corpus¶
- Status: Draft
- Date: 2026-05-02
- Deciders: TBD
- Tags: ai, fr-regressor, corpus, license, bristol
Context¶
fr_regressor_v1 (T6-1a, commit e421d70) trains on the Netflix Public corpus only — 9 reference + 70 distorted clips, single MOS convention, no explicit codec axis. The codec-aware fr_regressor_v2 plan in ADR-0235 needs subjective labels paired with a real codec sweep. The companion research digest Research-0046 walks the Bristol VI-Lab BVI- family and concludes that BVI-CC* is the smallest useful candidate: 9 references × 34 distorted variants across HM, AV1, VTM at four resolutions, with DMOS labels.
The decision is whether to make BVI-CC the first Bristol corpus ingest for the fork, ahead of BVI-DVC (no labels), BVI-AOM (no labels, mixed-licence subset), or BVI-HD (single codec only).
Decision¶
We will ingest BVI-CC as the second tiny-AI training corpus, behind Netflix Public, in a single dedicated PR. The PR ships: the JSON manifest at ai/src/vmaf_train/data/manifests/bvi-cc.json, a mos_convention column in the feature parquet schema, loader support for inverted-DMOS-to-MOS normalisation, a per-clip licence flag column (carried for forward-compatibility with BVI-AOM), and the user-facing doc page at docs/ai/training-data.md. Raw clips remain gitignored under .workingdir2/bristol/bvi-cc/. The parquet itself is a local-build artifact, not committed.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| BVI-CC first (chosen) | DMOS labels; explicit HM / AV1 / VTM codec axis; small enough (~250–400 GB) to fit on one disk; CC-BY on derived data | Form-gated download (~2 day SLA); DMOS↔MOS conversion adds one normalisation step | — |
| BVI-DVC first | Largest sequence count (800); widest resolution range; useful for parity soak | No subjective labels — does not move fr_regressor_v2 forward; 700 GB – 1.2 TB footprint | Wrong corpus for the labelled-MOS goal; can come later for a separate parity-soak PR |
| BVI-AOM first | Authoritative size figure (124 GB packed); clean direct-S3 download (no form) | No subjective labels; mixed licence — CableLabs subset is CC-BY-NC-ND 3.0 (no derivatives), needs per-clip licence map | Same labelling gap as BVI-DVC plus a licence-complexity tax we don't need on the first ingest |
| BVI-HD first | DMOS labels; simpler than BVI-CC | Single codec (HEVC + HEVC-SYNTH); no AV1 / VVC presence | Doesn't exercise the codec one-hot the v2 regressor was designed for |
| Stay on Netflix Public, defer BVI | Zero new licence work; zero new disk | Blocks ADR-0235 indefinitely; cross-backend soak still has only 3 pairs | Defers the actual problem |
| Pull two corpora at once (BVI-CC + AOM) | One round-trip | TB-scale storage commitment; AOM licence map needs separate review; harder to attribute lift in v2 | Violates the "one decision per PR" principle |
Consequences¶
- Positive:
fr_regressor_v2gains a labelled codec sweep with three encoders; the parquet schema gains amos_conventioncolumn, unblocking any future MOS-bearing corpus from any lab;docs/ai/training-data.mdbecomes the single index of what the fork has ingested. - Positive: A 10-clip BVI-CC subset becomes available for a weekly cross-backend parity soak (idle-GPU job), expanding the parity surface beyond the 3 Netflix golden pairs without touching the golden-data gate.
- Negative: Manifest and parquet schema churn — every downstream consumer of the parquet must accept the new
mos_conventioncolumn. Acceptable cost; the column is additive. - Negative: Disk pressure. The dev box must keep ~400 GB free for the BVI-CC corpus root in addition to the existing ~37 GB Netflix root.
- Neutral: Reproducibility for future contributors requires re-registering with Bristol; the in-repo manifest plus expected SHA-256 list is the audit trail.
- Follow-ups:
- DMOS-to-MOS conversion test (places=4) added to
python/test/against a 5-clip frozen subset. - Memory entry mirroring
project_netflix_training_corpus_local.mdonce the corpus is local. - Decision deferred: BVI-DVC and BVI-AOM ingest. Revisit after measuring v2 lift from BVI-CC alone.
References¶
- Companion: Research-0046 — Bristol VI-Lab dataset feasibility
- Prior: ADR-0019 — Tiny-AI Netflix training
- Prior: ADR-0042 — Tiny-AI docs required per PR
- Prior: ADR-0235 — Codec-aware fr_regressor v2
- Source:
req(user direction, 2026-05-02 — feasibility investigation for Bristol BVI-* corpora)