ADR-0993: KoNViD / UGC / BVI-DVC Saliency Batch Manifests and Run Scaffolding
- Status: Accepted
- Date: 2026-06-03
- Deciders: Lusoris
- Tags: ai, saliency, materializer, konvid, ugc, bvi-dvc, batch, fork-local
Context
ADR-0673 introduced ai/scripts/batch_materialize_saliency_features.py and the batch manifest schema. PR #540 fixed the Netflix saliency materializer (raw YUV flags, per-file cache, path-column diagnostic) and confirmed 11,190 Netflix rows at 0 failures. The CHUG canonical-shards run also completed successfully (5,136 rows).
The remaining corpora — KoNViD-150K, YouTube UGC, and BVI-DVC — each carry a different path-resolution challenge:
- KoNViD-150K (`konvid_150k.jsonl`): has `src` (relative clip filename), `width`, and `height` columns. Clips reside at `.corpus/konvid-150k/k150ka_extracted/`. This corpus is directly materialisable using the existing batch runner with no preprocessing step.
- YouTube UGC (full-feature parquet `full_features_ugc_refresh_20260520.parquet`): the `source` column holds corpus identifiers such as `ugc-Gaming_1080P-223e-cbr`, not file paths. The actual files are at `.corpus/ugc/download/{ContentName}_{Variant}.{webm,mp4}`. A path-enriched corpus JSONL must be generated first (via `youtube_ugc_to_corpus_jsonl.py` or a thin path-derivation step).
- BVI-DVC (full-feature parquet `full_features_bvi_dvc_D_refresh_20260520.parquet`): the `key` column encodes encode parameters (e.g. `DAdvertisingMassagesBangkokVidevo_480x272_25fps_10bit_420`), not file paths. Raw reference YUVs live at `.corpus/bvi-dvc-raw/`. A path-enriched JSONL or a direct `--root` pass through the raw YUV directory requires mapping `key` back to the source YUV filename.
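As an illustration of the BVI-DVC mapping problem, the sketch below splits a `key` into its encode parameters. The regular expression is inferred only from the single example key quoted above, and the helper names (`parse_bvi_dvc_key`, `candidate_yuv_path`, `rawvideo_pix_fmt`) are hypothetical and not part of the materializer. The assumption that the raw YUV file is named exactly after the key is unverified and would need checking against the contents of `.corpus/bvi-dvc-raw/`.

```python
import re
from pathlib import Path

# Pattern inferred from the one example key in this ADR; real keys may deviate.
KEY_RE = re.compile(
    r"^(?P<name>.+)_(?P<width>\d+)x(?P<height>\d+)"
    r"_(?P<fps>\d+)fps_(?P<bits>\d+)bit_(?P<chroma>\d{3})$"
)


def parse_bvi_dvc_key(key: str) -> dict:
    """Split a BVI-DVC key into its name and encode parameters."""
    match = KEY_RE.match(key)
    if match is None:
        raise ValueError(f"unrecognised BVI-DVC key: {key!r}")
    fields = match.groupdict()
    return {
        "name": fields["name"],
        "width": int(fields["width"]),
        "height": int(fields["height"]),
        "fps": int(fields["fps"]),
        "bits": int(fields["bits"]),
        "chroma": fields["chroma"],
    }


def candidate_yuv_path(key: str, root: Path = Path(".corpus/bvi-dvc-raw")) -> Path:
    # Assumption (unverified): the raw YUV file is named exactly after the key.
    return root / f"{key}.yuv"


def rawvideo_pix_fmt(bits: int, chroma: str) -> str:
    # FFmpeg pixel-format naming: yuv420p for 8-bit, yuv420p10le for 10-bit
    # little-endian. Little-endian storage is an assumption about the corpus.
    base = f"yuv{chroma}p"
    return base if bits == 8 else f"{base}{bits}le"
```

The parsed `width` and `height` are the values a raw-YUV row would need, since headerless `.yuv` files carry no dimensions of their own.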
Decision
Commit in-tree batch manifests under ai/batch-manifests/saliency/:
- `konvid-150k.json`: fully wired; resolves `konvid_150k.jsonl` → clips at `.corpus/konvid-150k/k150ka_extracted/`. Ready to run.
- `ugc.json`: scaffolded with `tables: []` and `_status` / `_resolution` comments that document the blocking gap and both resolution options.
- `bvi-dvc.json`: scaffolded with `tables: []` and `_status` / `_resolution` comments that document the blocking gap, the raw YUV resolution, and the required `-f rawvideo` flags (already handled by the materializer when the extension is `.yuv` and `default_width` / `default_height` are set).
This makes the three corpus manifests visible to agents and reviewers, gives the exact run command for KoNViD-150K, and records the UGC / BVI-DVC gaps as in-tree scaffolding stubs rather than undocumented state in .workingdir2/.
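To make the difference between a wired manifest and a scaffolding stub concrete, here is a minimal sketch that classifies a manifest by whether its `tables` array is populated. It assumes `tables`, `_status`, and `_resolution` are top-level keys of the JSON document; the full schema is defined by ADR-0673 and is not reproduced here. The function names are hypothetical.

```python
import json
from pathlib import Path


def describe_manifest(name: str, manifest: dict) -> str:
    """Report whether a manifest is runnable or a documented no-op stub."""
    tables = manifest.get("tables", [])
    if tables:
        return f"{name}: runnable, {len(tables)} table(s)"
    # Assumption: stubs record their blocking gap under a top-level "_status" key.
    status = manifest.get("_status", "no _status recorded")
    return f"{name}: stub (no-op), {status}"


def describe_manifest_file(path: Path) -> str:
    """Load a manifest from disk and describe it."""
    manifest = json.loads(path.read_text(encoding="utf-8"))
    return describe_manifest(path.name, manifest)
```

A reviewer or agent could run `describe_manifest_file` over every file in `ai/batch-manifests/saliency/` to see at a glance which corpora are still blocked.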
The KoNViD-150K smoke-run command (10-row limit) is documented in the PR description. The full run touches 148,543 clips and should be scheduled as a long-running background job.
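Before scheduling the full job, it can help to confirm that the first few `src` entries actually resolve to files. The sketch below is a standalone pre-flight check, not the batch runner and not the smoke-run command from the PR description. It assumes only what this ADR states about the JSONL (one JSON object per line with a `src` field relative to the clip directory); the function name `preflight` and its parameters are hypothetical.

```python
import json
from itertools import islice
from pathlib import Path


def preflight(jsonl_path, clip_root, limit=10):
    """Return (row_index, src) pairs among the first `limit` rows whose clip is missing."""
    missing = []
    with open(jsonl_path, encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not break parsing.
        rows = (json.loads(line) for line in handle if line.strip())
        for index, row in enumerate(islice(rows, limit)):
            if not (Path(clip_root) / row["src"]).is_file():
                missing.append((index, row["src"]))
    return missing
```

An empty result for the first ten rows suggests the path resolution is sound; any entries in the list point at rows to inspect before committing to the long-running job.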
Alternatives Considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Commit manifests only after all three are wired | Fully runnable from day one | Blocks the PR on UGC/BVI path-enrichment work; KoNViD-150K can run now | Rejected: KoNViD-150K is already unblocked and the UGC/BVI gaps are worth recording in-tree |
| Keep manifests in `.workingdir2/` only | Zero tree churn | Gitignored; lost on clone; invisible to agents parsing the codebase | Rejected: the existing CHUG/Netflix manifests in `.workingdir2/` are already unreachable to fresh clones |
| Generate UGC/BVI path-enriched JSONL in this PR | Both blocked corpora become runnable | Significantly expands scope; UGC download dir may not be present on all machines | Deferred: add as a follow-on PR once the path-derivation strategy is decided |
Consequences
- Positive: KoNViD-150K saliency can be launched from a versioned, in-tree manifest; UGC and BVI-DVC gaps are documented alongside the runnable manifest so the next agent can unblock them in one PR.
- Negative: Two manifests ship with empty `tables` arrays; running them is a no-op until a follow-on populates the tables.
- Neutral / follow-ups:
  - UGC follow-on: generate `ugc_corpus.jsonl` via `youtube_ugc_to_corpus_jsonl.py` and populate `ugc.json`.
  - BVI-DVC follow-on: generate a raw-YUV corpus JSONL mapping `key` back to source filename and populate `bvi-dvc.json`.
  - After all three corpora have saliency columns, re-run the signal-mix audit (`ai/scripts/signal_mix_audit.py`) and measure predictor / MOS-head impact.
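For the UGC follow-on, the "thin path-derivation step" could look roughly like the sketch below. It is not the behaviour of `youtube_ugc_to_corpus_jsonl.py`. It rests on an unverified assumption: that a `source` identifier is the file stem with a `ugc-` prefix and possibly extra `-token` suffixes (such as `-cbr`) appended. The function probes the download directory instead of hard-coding that rule, and the name `resolve_ugc_clip` is hypothetical.

```python
from pathlib import Path
from typing import Optional

EXTENSIONS = (".webm", ".mp4")


def resolve_ugc_clip(source: str, download_dir: Path) -> Optional[Path]:
    """Probe download_dir for a clip matching a UGC source identifier."""
    # Assumption: identifiers carry a "ugc-" prefix that the filenames lack.
    stem = source[len("ugc-"):] if source.startswith("ugc-") else source
    while stem:
        for ext in EXTENSIONS:
            candidate = download_dir / f"{stem}{ext}"
            if candidate.is_file():
                return candidate
        if "-" not in stem:
            break
        # Drop the last "-token" (for example an encode-mode suffix) and retry.
        stem = stem.rsplit("-", 1)[0]
    return None
```

Because each retry shortens the stem, a match found after dropping tokens is a guess and is worth logging, so that ambiguous identifiers can be reviewed before they are written into `ugc_corpus.jsonl`.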
References
- ADR-0655 — shared saliency table materializer.
- ADR-0673 — batch manifest schema.
- ADR-0672 — model and temporal-reducer metadata.
- ADR-0661 — AI run provenance schema.
- PR #540 — Netflix path-column and raw YUV fix.
- Source: req — "KoNViD / UGC / BVI saliency materializer runs — Manifests merged but 0 actual runs executed"