ADR-0677: AI Dataset Fetch Manifests

  • Status: Accepted
  • Date: 2026-05-21
  • Deciders: Lusoris, Codex
  • Tags: ai, datasets, provenance, training, fork-local

Context

ADR-0661 made AI run provenance a shared schema. ADR-0668 through ADR-0676 then applied that schema to derived feature tables, corpus JSONL boundaries, legacy trainer-input builders, source MOS adapters, and materializer batches. Two older fetch helpers still sat upstream of that evidence chain without run manifests of their own:

  • fetch_konvid_1k.py downloads and extracts the KoNViD-1k videos/metadata archives used by early NR and learned-filter work.
  • fetch_youtube_ugc_subset.py selects the smallest complete YouTube-UGC VP9 4-tuples and writes a content manifest used by the vmaf_tiny_v5 / UGC expansion experiments.

Both scripts create operator-local, gitignored inputs. Without a replay manifest, later feature tables and model cards can cite a corpus root but not the exact fetch selection, archive URLs, row cap, or output bundle that seeded the run.

Decision

Add deterministic run-manifest sidecars to both fetch helpers:

  • fetch_konvid_1k.py writes <root>/fetch_manifest.json by default and accepts --manifest-out PATH.
  • fetch_youtube_ugc_subset.py preserves the existing --manifest content manifest and adds --run-manifest-out PATH, defaulting to <manifest>.run-manifest.json.
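
The default output locations listed above could be resolved along the lines of the following sketch. The function names are hypothetical, `--manifest` is marked required only to keep the example short, and reading `<manifest>.run-manifest.json` as "the content manifest's file name with the suffix appended" is an assumption; the actual scripts may derive the path differently.

```python
import argparse
from pathlib import Path
from typing import Optional


def konvid_manifest_path(root: Path, manifest_out: Optional[Path]) -> Path:
    """Use --manifest-out when given, else <root>/fetch_manifest.json."""
    if manifest_out is not None:
        return manifest_out
    return root / "fetch_manifest.json"


def ugc_run_manifest_path(manifest: Path, run_manifest_out: Optional[Path]) -> Path:
    """Use --run-manifest-out when given, else <manifest>.run-manifest.json.

    Assumption: the suffix is appended to the full file name, so
    manifest.json becomes manifest.json.run-manifest.json.
    """
    if run_manifest_out is not None:
        return run_manifest_out
    return manifest.with_name(manifest.name + ".run-manifest.json")


def build_ugc_parser() -> argparse.ArgumentParser:
    # Hypothetical parser with only the two flags named in this ADR.
    parser = argparse.ArgumentParser(prog="fetch_youtube_ugc_subset.py")
    parser.add_argument("--manifest", type=Path, required=True)
    parser.add_argument("--run-manifest-out", type=Path, default=None)
    return parser
```

Keeping the run manifest on a separate flag means an explicit `--run-manifest-out` never changes where the content manifest is written.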

Both sidecars reuse aiutils.run_manifest.build_run_provenance() and record dataset identity, effective selection/download configuration, output paths, and relevant counters. The YouTube helper keeps run metadata out of the existing stem-to-files content manifest so existing consumers do not need a row schema change.
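
A minimal sketch of what writing such a sidecar might look like is shown below. This ADR does not give the signature of `build_run_provenance()`, so the sketch takes the provenance block as a plain mapping instead of calling it. The function name `write_fetch_manifest` and all field names are illustrative assumptions, not the real schema.

```python
import json
from pathlib import Path
from typing import Any, Mapping


def write_fetch_manifest(
    path: Path,
    *,
    dataset: str,
    config: Mapping[str, Any],
    outputs: Mapping[str, str],
    counters: Mapping[str, int],
    provenance: Mapping[str, Any],
) -> str:
    """Write a run-manifest sidecar and return the serialized text.

    All keys below are hypothetical; the real layout comes from the
    shared ADR-0661 schema.
    """
    payload = {
        "dataset": dataset,          # dataset identity
        "config": dict(config),      # effective selection/download settings
        "outputs": dict(outputs),    # output paths
        "counters": dict(counters),  # relevant counters
        "provenance": dict(provenance),
    }
    # Sorted keys and a fixed indent keep the bytes stable for identical inputs.
    text = json.dumps(payload, indent=2, sort_keys=True) + "\n"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```

Because the payload lives in its own file, a content manifest that maps stems to files can stay byte-for-byte unchanged for existing consumers.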

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Add fetch-side run manifests | Closes the evidence gap before corpus conversion; reuses ADR-0661; no row-schema changes | Adds one more sidecar per fetch run | Chosen |
| Put run metadata into the YouTube content manifest | Single artifact | Breaks the simple stem -> {orig,cbr,vod,vodlb} shape existing callers expect | Rejected |
| Only document shell commands | No code change | Shell history is not durable evidence and cannot be cited by later model cards | Rejected |
| Defer until next full data refresh | Avoids a small PR now | The fetch gap would keep producing anonymous local inputs during the current AI refresh | Rejected |

Consequences

  • Positive: Downloaded KoNViD-1k and YouTube-UGC corpus roots can now be replayed or audited from a durable JSON sidecar before any extraction step.
  • Negative: Operators will see an additional gitignored JSON file beside fetch outputs.
  • Neutral / follow-ups: Regenerate the local fetch manifests whenever KoNViD-1k or the YouTube-UGC subset is refreshed, before updating any promoted model card.

References

  • ADR-0661 — shared AI run provenance.
  • ADR-0668 — derived table manifests.
  • ADR-0676 — MOS corpus adapter manifests.
  • Source: req — "so all, netflix, regressors, encoders etc... everything we did so far needs updates"
  • Source: req — "well next backlog batch"