ADR-0642: AI refresh defaults use current fork full-feature extractors
- Status: Accepted
- Date: 2026-05-20
- Deciders: lusoris
- Tags: ai, training-data, full-features, konvid, ugc, bvi-dvc, fork-local
Context
The fork has changed feature extraction, encoder adapters, and corpus handling enough that old AI feature tables are stale. Several refresh scripts still defaulted to ambiguous vmaf binaries (build/tools/vmaf, /usr/local/bin/vmaf, or caller PATH), which can miss fork-only extractors and silently produce incomplete tables. The KoNViD-1k path only had the legacy pair extractor, UGC refresh still emitted canonical-6 rows with NaN-filled columns, and aggregate full-feature tables were rebuilt by ad hoc concatenation.
Decision
We will make the current fork CPU binary (core/build-cpu/tools/vmaf) the shared default for AI feature extraction, add real full-feature refresh drivers for KoNViD-1k and UGC, allow BVI-DVC refreshes from the known-good lossless MKV bundle, and require aggregate training tables to be rebuilt through ai/scripts/combine_full_feature_parquets.py.
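The "safe default, explicit override" contract can be sketched as follows. This is a minimal illustration, not the fork's actual code: `resolve_vmaf_bin`, `DEFAULT_VMAF_BIN`, and the `repo_root` parameter are hypothetical names, and only the path `core/build-cpu/tools/vmaf` and the `--vmaf-bin` override come from this ADR.

```python
import os
from pathlib import Path
from typing import Optional

# Path named in this ADR, relative to the repository root.
DEFAULT_VMAF_BIN = Path("core/build-cpu/tools/vmaf")


def resolve_vmaf_bin(repo_root: Path, override: Optional[str] = None) -> Path:
    """Return the vmaf binary to use (hypothetical helper, not the fork's API).

    An explicit override (for example the value of --vmaf-bin) wins.
    Otherwise the fork CPU build under repo_root is used. There is
    deliberately no fallback to PATH or to /usr/local/bin/vmaf.
    """
    candidate = Path(override) if override else repo_root / DEFAULT_VMAF_BIN
    if not candidate.is_file() or not os.access(candidate, os.X_OK):
        raise FileNotFoundError(
            f"vmaf binary not found or not executable: {candidate}; "
            "build the fork CPU target or pass an explicit override"
        )
    return candidate
```

Failing loudly when the fork build is missing, rather than falling back to an installed binary, is what keeps a long background job from silently producing tables without fork-only extractors.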
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep each script's historical default binary | No migration work | Stale installed binaries can lack fork-only extractors; refreshes become non-reproducible across host/container contexts | The refresh must regenerate against current fork behavior |
| Require callers to pass --vmaf-bin every time | Explicit and flexible | Easy to forget in long background jobs; scripts keep unsafe defaults | Explicit override stays supported, but the default should be safe |
| Keep UGC canonical-6 with NaN-filled feature columns | Fastest path for old v5 experiments | Produces mixed-schema aggregate tables and stale model inputs | Current FR regressors consume FULL_FEATURES; refresh scripts must emit that schema |
| Combine refreshed parquets in notebooks | Flexible one-off analysis | Recreates schema drift and unreviewed ordering choices | A checked-in combiner gives tests, docs, and a stable normalized schema |
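To illustrate why a checked-in combiner is preferred over ad hoc concatenation, here is a rough sketch of strict schema normalization. It is not the contents of ai/scripts/combine_full_feature_parquets.py: plain dictionaries stand in for parquet rows, and `combine_rows`, the `corpus` column, and the caller-supplied `feature_columns` list are assumptions made for the example.

```python
import math
from typing import Dict, Iterable, List, Sequence

Row = Dict[str, object]


def combine_rows(
    tables: Dict[str, Iterable[Row]],
    feature_columns: Sequence[str],
) -> List[Row]:
    """Concatenate per-corpus rows into one list with a fixed column order.

    Raises ValueError if a row lacks a feature column or holds NaN in one,
    instead of padding silently. Columns not listed in feature_columns
    are dropped.
    """
    combined: List[Row] = []
    for corpus in sorted(tables):  # fixed corpus order, independent of dict order
        for index, row in enumerate(tables[corpus]):
            out: Row = {"corpus": corpus}
            for column in feature_columns:
                if column not in row:
                    raise ValueError(f"{corpus}[{index}]: missing column {column!r}")
                value = row[column]
                if isinstance(value, float) and math.isnan(value):
                    raise ValueError(f"{corpus}[{index}]: NaN in column {column!r}")
                out[column] = value
            combined.append(out)
    return combined
```

Rejecting NaN-filled or missing columns at combine time is one way to keep a canonical-6 table from slipping into a full-feature aggregate; the real combiner may handle this differently.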
Consequences
- Positive: Netflix, KoNViD, BVI-DVC, UGC, and CHUG refresh jobs now have one binary contract and one aggregate schema. Folded KoNViD output becomes reproducible across machines because fold labels are deterministic hashes over clip keys.
- Negative: Full-feature UGC/KoNViD refreshes are slower than the old shortcuts, and local jobs need a current core/build-cpu/tools/vmaf build before running.
- Neutral / follow-ups: Model retraining waits for the long-running refresh jobs (CHUG, KoNViD, UGC, Netflix) to finish. The combiner intentionally writes gitignored parquet outputs; model-card number updates belong in the follow-up retrain PR.
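The reproducibility of folded KoNViD output depends on fold labels being a pure function of the clip key. A minimal sketch of that idea follows; the hash function (SHA-256), the default of five folds, and the name `fold_for_clip` are assumptions for illustration, since this ADR does not specify them.

```python
import hashlib


def fold_for_clip(clip_key: str, num_folds: int = 5) -> int:
    """Map a clip key to a fold index in [0, num_folds) via SHA-256."""
    if num_folds < 1:
        raise ValueError("num_folds must be positive")
    digest = hashlib.sha256(clip_key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the fold count.
    return int.from_bytes(digest[:8], "big") % num_folds
```

A cryptographic digest is used here instead of Python's built-in `hash()`, because string hashing is salted per process by default and would assign different folds on different runs.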
References
- Research digest: AI full-feature refresh defaults.
- ADR-0026 — full-feature table motivation.
- ADR-0340 — multi-corpus aggregation.
- ADR-0362 — existing K150K full-feature precedent.
- Source: req - "i wasnt talking about chug only, we bugfixed so many things, all our ai things must be stale"
- Source: req - "so all, netflix, regressors, encoders etc... everything we did so far needs updates"