ADR-0670: AI Legacy Corpus Extraction Manifests
- Status: Proposed
- Date: 2026-05-21
- Deciders: Lusoris maintainers
- Tags: ai, training, provenance, corpus
Context
ADR-0661 made run_provenance the shared evidence block for durable AI artifacts. ADR-0668 covered refreshed FULL_FEATURES table builders, and ADR-0669 covered corpus JSONL merge/aggregate boundaries. The older extraction scripts still left gaps:
- extract_full_features.py writes the Netflix public FULL_FEATURES parquet.
- konvid_to_vmaf_pairs.py writes the synthetic-distortion KoNViD-1k FR-pair parquet consumed by the LOSO trainer.
- bvi_dvc_to_corpus_jsonl.py reshapes cached BVI-DVC libvmaf JSON into vmaf-tune corpus rows for FR-regressor training.
Those outputs are local, gitignored, and often expensive to recreate. Without sidecars they are anonymous training inputs: a later model card can identify a path, but not the exact corpus/cache root, feature list, CRF, row count, failed-clip count, or command line that created the artifact.
Decision
The legacy corpus/extraction scripts emit replay manifests by default:
- extract_full_features.py --manifest-out defaults to <out>.manifest.json and records the Netflix corpus root, cache root, VMAF binary, feature list, pair count, row count, and shared run_provenance.
- konvid_to_vmaf_pairs.py --manifest-out defaults to <out>.manifest.json and records KoNViD root, model/VMAF inputs, cache policy, CRF, feature list, selected/processed/failed clip counts, failed clip IDs, frame count, and shared run_provenance.
- bvi_dvc_to_corpus_jsonl.py --manifest-out defaults to <output>.manifest.json and records cache inputs, adapter labels, row schema version, row count, cache-file count, and shared run_provenance.
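A minimal sketch of the sidecar pattern, for illustration only. It assumes that `<out>.manifest.json` means the suffix is appended to the full output filename, and it treats `run_provenance` as an opaque dict whose shape is defined by ADR-0661. The helper names (`default_manifest_path`, `sha256_file`, `write_manifest`) and the top-level manifest keys are hypothetical and are not taken from the scripts themselves.

```python
import hashlib
import json
import sys
from pathlib import Path
from typing import Any, Optional, Sequence


def default_manifest_path(out: Path) -> Path:
    # Assumption: the manifest sits next to the output and appends
    # ".manifest.json" to the full filename (extension is kept).
    return out.with_name(out.name + ".manifest.json")


def sha256_file(path: Path) -> str:
    # Stream the file in 1 MiB chunks so large parquets are not loaded whole.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(
    out: Path,
    inputs: dict[str, Any],
    counters: dict[str, Any],
    run_provenance: dict[str, Any],
    argv: Optional[Sequence[str]] = None,
    manifest_out: Optional[Path] = None,
) -> Path:
    # An explicit --manifest-out value wins; otherwise use the sibling default.
    manifest_path = manifest_out or default_manifest_path(out)
    manifest = {
        # Hypothetical layout: path and hash of the artifact being described.
        "output": {"path": str(out), "sha256": sha256_file(out)},
        # Script-specific inputs (corpus root, cache root, CRF, feature list, ...).
        "inputs": inputs,
        # Row counts, clip counts, failure counters.
        "counters": counters,
        # Command line that produced the artifact.
        "argv": list(sys.argv if argv is None else argv),
        # Opaque shared block; its schema is owned by ADR-0661.
        "run_provenance": run_provenance,
    }
    manifest_path.write_text(
        json.dumps(manifest, indent=2, sort_keys=True) + "\n", encoding="utf-8"
    )
    return manifest_path
```

Under these assumptions, a script would call `write_manifest` once, after the parquet or JSONL file has been fully written, so the recorded hash and counters describe the final artifact rather than a partial one.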
The scripts keep their existing output schemas. While touching BVI-DVC JSONL, the adapter also emits the current vmaf-tune v3 additive columns with explicit unavailable defaults so stale cached BVI rows no longer fail the current CORPUS_ROW_KEYS contract.
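The additive-column backfill can be sketched as follows. This is an assumption-laden illustration: the column names and default values in `HYPOTHETICAL_V3_DEFAULTS` are placeholders, not the real vmaf-tune v3 columns, and the required-key list passed to `missing_keys` stands in for the actual CORPUS_ROW_KEYS contract.

```python
from typing import Any, Iterable, Mapping

# Placeholder names and defaults; the real v3 additive columns and their
# "unavailable" values are defined by vmaf-tune, not by this sketch.
HYPOTHETICAL_V3_DEFAULTS: dict[str, Any] = {
    "hypothetical_signal_a": None,
    "hypothetical_signal_b": None,
}


def upgrade_row(
    row: Mapping[str, Any],
    defaults: Mapping[str, Any] = HYPOTHETICAL_V3_DEFAULTS,
) -> dict[str, Any]:
    # Start from the defaults, then overlay the cached row so that any value
    # already present in the row is preserved and only gaps are filled.
    upgraded = dict(defaults)
    upgraded.update(row)
    return upgraded


def missing_keys(row: Mapping[str, Any], required: Iterable[str]) -> list[str]:
    # Keys the contract requires but the row does not carry.
    return sorted(set(required) - set(row))
```

The design point is that stale rows gain explicit "unavailable" markers instead of silently lacking columns, so a key-set check against the current contract passes while downstream code can still tell that the signal was never measured.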
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Default sibling manifests | Replayable expensive artifacts; keeps row schemas stable; matches ADR-0668/0669 | Adds one small JSON file per run | Chosen; it closes the evidence gap with low operational cost. |
| Append provenance columns/rows to the parquets and JSONL | Self-contained output files | Changes trainer-facing schemas and repeats run metadata per row | Rejected; run-level evidence belongs in sidecars. |
| Only document operator commands | No code change | Still loses input hashes, parsed arguments, row counts, and failure counters | Rejected; docs are not evidence for local artifacts. |
| Wait for K150K refresh completion | Avoids touching old scripts mid-run | Leaves immediately usable Netflix/KoNViD/BVI artifacts anonymous | Rejected; these scripts are independent of the running K150K job. |
Consequences
- Positive: Legacy Netflix, KoNViD, and BVI training-input artifacts now carry replayable path/hash/argument evidence.
- Positive: BVI-DVC JSONL rows are brought back into current vmaf-tune v3 schema compliance with explicit missing-signal defaults.
- Negative: Operators must keep one extra JSON sidecar next to each local corpus/extraction output.
- Neutral / follow-ups: Other acquisition-only download manifests can adopt the same pattern later, but this ADR covers the older scripts that directly create trainer inputs.
References
- ADR-0661
- ADR-0668
- ADR-0669
- docs/ai/training.md
- Source: req "go on with next backlog?"
- Source: req "everything we did so far needs updates"