
ADR-0668: AI Derived Table Provenance

  • Status: Proposed
  • Date: 2026-05-21
  • Deciders: Lusoris maintainers
  • Tags: ai, training, provenance, parquet

Context

The AI refresh pipeline now produces multiple local FULL_FEATURES parquet tables before any model export happens: K150K/FR-from-NR extraction, merged multi-corpus tables, and post-hoc CHUG/K150K metadata enrichment. Those files are gitignored and can be large, so the repository cannot rely on committed bytes to prove which inputs, feature schema, backend split, or command-line arguments produced them.

ADR-0661 introduced aiutils.run_manifest.build_run_provenance() for training/export JSON sidecars. The remaining gap is the table-building layer itself. If refreshed tables carry no provenance record of their own, later HDR/MOS regressors and vmaf-tune encoder profiles can inherit stale or unreplayable data even when their own trainer manifests are correct.

Decision

Operator-facing scripts that create or reshape refreshed FULL_FEATURES parquet tables must emit a JSON manifest sidecar by default:

  • ai/scripts/extract_k150k_features.py writes <out>.manifest.json with feature order, extractor split, backend worker counts, restart counters, parquet row count, and run_provenance.
  • ai/scripts/combine_full_feature_parquets.py writes <out>.manifest.json with per-input labels, row counts, missing-feature fill lists, aggregate corpus distribution, output column order, and run_provenance.
  • ai/scripts/enrich_k150k_parquet_metadata.py writes <out>.manifest.json with metadata match/update counters, available metadata keys, overwrite policy, and run_provenance.
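
The sidecar pattern shared by the three scripts above could look roughly like the following. This is a minimal sketch, not the repository's implementation: write_manifest and default_manifest_path are hypothetical names, the payload keys are illustrative, and run_provenance is passed in as an opaque dict because the signature of build_run_provenance() is not given here. It also assumes that <out>.manifest.json means the suffix is appended to the full output file name.

```python
import json
from pathlib import Path


def default_manifest_path(out_path: Path) -> Path:
    # Assumption: "<out>.manifest.json" appends the suffix to the whole
    # output file name, e.g. merged.parquet -> merged.parquet.manifest.json.
    return out_path.with_name(out_path.name + ".manifest.json")


def write_manifest(out_path, payload, run_provenance, manifest_out=None):
    """Write a JSON sidecar next to out_path (or at manifest_out if given).

    payload holds the script-specific fields (row counts, feature order, ...).
    run_provenance is whatever dict the provenance helper returned; this
    sketch treats it as opaque and does not call the real helper.
    """
    if manifest_out is not None:
        manifest_path = Path(manifest_out)
    else:
        manifest_path = default_manifest_path(Path(out_path))

    manifest = dict(payload)
    manifest["run_provenance"] = run_provenance

    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary sibling first so an interrupted run does not
    # leave a truncated manifest behind.
    tmp_path = manifest_path.with_name(manifest_path.name + ".tmp")
    tmp_path.write_text(
        json.dumps(manifest, indent=2, sort_keys=True) + "\n",
        encoding="utf-8",
    )
    tmp_path.replace(manifest_path)
    return manifest_path
```

Keeping the manifest as a separate JSON file means the parquet row schema is never touched, which matches the constraint that existing schemas remain unchanged.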

Each script accepts --manifest-out for experiment bundles that keep the manifest somewhere other than the default sibling path. Existing parquet row schemas remain unchanged.
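
The flag handling described above can be sketched with argparse. Only --manifest-out comes from this ADR; the --out flag name, build_parser, and resolve_manifest_path are assumptions made for illustration, and the real scripts may name their output argument differently.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical table-builder CLI")
    # Assumption: the output parquet path is passed as --out.
    parser.add_argument("--out", type=Path, required=True,
                        help="Path of the parquet table to write")
    parser.add_argument("--manifest-out", type=Path, default=None,
                        help="Override the default sibling manifest path")
    return parser


def resolve_manifest_path(args: argparse.Namespace) -> Path:
    # An explicit --manifest-out wins; otherwise fall back to the sibling
    # "<out>.manifest.json" next to the parquet file.
    if args.manifest_out is not None:
        return args.manifest_out
    return args.out.with_name(args.out.name + ".manifest.json")
```

With this shape, an operator who passes nothing extra still gets a manifest by default, and an experiment bundle can redirect it without changing where the parquet file lands.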

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Sidecar manifests on each derived table | Replayable local artifacts; reuses ADR-0661 helper; no parquet schema churn | Adds one small JSON file per run | Chosen; it closes the evidence gap without touching model inputs. |
| Store provenance inside parquet metadata | Keeps one file per artifact | Harder to inspect with standard tools; many local scripts rewrite parquet through pandas and may drop custom metadata | Rejected; human-readable JSON is easier to audit and preserve. |
| Only document the commands in .workingdir2 notes | No code change | Notes drift from actual invocations and are not consumed by trainer/model-card tooling | Rejected; the scripts must stamp the artifact they create. |
| Defer until the K150K/CHUG runs finish | Avoids touching scripts during a long refresh | The finished table would still be unreplayable and every downstream model would inherit that gap | Rejected; provenance should be in place before the next full refresh table is trusted. |

Consequences

  • Positive: Refreshed AI training tables can be traced back to inputs, args, output targets, feature schema, and row counts without shell history.
  • Positive: Metadata-enriched CHUG/K150K parquet files can be distinguished from freshly extracted tables in later HDR model cards.
  • Negative: Local runs create an additional small JSON sidecar that operators need to keep with the parquet artifact.
  • Neutral / follow-ups: Future table builders should adopt the same helper before they become operator-facing. This ADR does not change the column order of any existing model input.
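
Because the sidecar has to travel with its parquet file, a small check for tables that have lost (or never had) a manifest may be useful before a table is trusted. The following is a sketch under assumptions: find_unstamped_tables is a hypothetical helper, it only looks for the default sibling path, and it assumes the sidecar is named by appending .manifest.json to the parquet file name. Manifests redirected with --manifest-out would not be found by it.

```python
from pathlib import Path


def find_unstamped_tables(root):
    """Return parquet files under root that have no sibling manifest.

    Only the default "<out>.manifest.json" location is checked; manifests
    written elsewhere via --manifest-out are not detected.
    """
    missing = []
    for parquet_path in sorted(Path(root).rglob("*.parquet")):
        sidecar = parquet_path.with_name(parquet_path.name + ".manifest.json")
        if not sidecar.is_file():
            missing.append(parquet_path)
    return missing
```

A non-empty result would flag tables whose inputs and arguments can no longer be recovered from the artifact alone.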

References

  • ADR-0661
  • docs/ai/training.md
  • Source: req "batch things that are connected and create them"
  • Source: req "everything we did so far needs updates"