ADR-0642: AI refresh defaults use current fork full-feature extractors

  • Status: Accepted
  • Date: 2026-05-20
  • Deciders: lusoris
  • Tags: ai, training-data, full-features, konvid, ugc, bvi-dvc, fork-local

Context

The fork has changed feature extraction, encoder adapters, and corpus handling enough that existing AI feature tables are stale. Several refresh scripts still defaulted to ambiguous vmaf binaries (build/tools/vmaf, /usr/local/bin/vmaf, or whatever vmaf is on the caller's PATH). Such binaries can lack fork-only extractors and silently produce incomplete tables. Three further gaps remained: the KoNViD-1k path had only the legacy pair extractor, UGC refresh still emitted canonical-6 rows with NaN-filled columns, and aggregate full-feature tables were rebuilt by ad hoc concatenation.

Decision

We will make the current fork CPU binary (core/build-cpu/tools/vmaf) the shared default for AI feature extraction, add real full-feature refresh drivers for KoNViD-1k and UGC, allow BVI-DVC refreshes from the known-good lossless MKV bundle, and require aggregate training tables to be rebuilt through ai/scripts/combine_full_feature_parquets.py.
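The binary contract can be sketched as follows. This is a minimal illustration, not the fork's actual code: `resolve_vmaf_bin` and `build_parser` are hypothetical helper names, and only the default path and the `--vmaf-bin` flag come from this ADR. The point is that an explicit override stays possible, while the fallback is the fork CPU build rather than whatever binary happens to be installed.

```python
from __future__ import annotations

import argparse
import os
from pathlib import Path

# Default from this ADR, relative to the repository root.
DEFAULT_VMAF_BIN = Path("core/build-cpu/tools/vmaf")


def resolve_vmaf_bin(explicit: str | None, repo_root: Path) -> Path:
    """Return the explicit override if given, else the fork CPU binary.

    Hypothetical helper: it never falls back to PATH or /usr/local/bin,
    and it fails loudly when the chosen binary is missing.
    """
    candidate = Path(explicit) if explicit else repo_root / DEFAULT_VMAF_BIN
    if not (candidate.is_file() and os.access(candidate, os.X_OK)):
        raise FileNotFoundError(
            f"vmaf binary not found or not executable: {candidate}"
        )
    return candidate


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI skeleton for a refresh driver."""
    parser = argparse.ArgumentParser(description="refresh driver sketch")
    parser.add_argument(
        "--vmaf-bin",
        default=None,
        help="override the default fork CPU binary",
    )
    return parser
```

Failing early on a missing binary matters most for long background jobs, where a silent fallback would only surface as an incomplete table hours later.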

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Keep each script's historical default binary | No migration work | Stale installed binaries can lack fork-only extractors; refreshes become non-reproducible across host/container contexts | The refresh must regenerate against current fork behavior |
| Require callers to pass --vmaf-bin every time | Explicit and flexible | Easy to forget in long background jobs; scripts keep unsafe defaults | Explicit override stays supported, but the default should be safe |
| Keep UGC canonical-6 with NaN-filled feature columns | Fastest path for old v5 experiments | Produces mixed-schema aggregate tables and stale model inputs | Current FR regressors consume FULL_FEATURES; refresh scripts must emit that schema |
| Combine refreshed parquets in notebooks | Flexible one-off analysis | Recreates schema drift and unreviewed ordering choices | A checked-in combiner gives tests, docs, and a stable normalized schema |
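To illustrate what "stable normalized schema" and reviewed ordering mean in practice, here is a small standard-library sketch. It is not the checked-in ai/scripts/combine_full_feature_parquets.py, which works on parquet files. The `FULL_FEATURES` tuple below holds placeholder column names, and rejecting NaN-filled rows is an assumption made for the example rather than documented combiner behavior.

```python
from __future__ import annotations

import math

# Placeholder column names; the real FULL_FEATURES list lives in the fork.
FULL_FEATURES = ("feat_a", "feat_b", "feat_c")


def _is_nan(value: object) -> bool:
    return isinstance(value, float) and math.isnan(value)


def normalize_row(row: dict, corpus: str) -> dict:
    """Project one row onto the fixed schema, in a fixed column order.

    Assumption for this sketch: a missing or NaN feature is an error,
    so a canonical-6 row padded with NaN cannot slip into the aggregate.
    """
    bad = [n for n in FULL_FEATURES if n not in row or _is_nan(row[n])]
    if bad:
        raise ValueError(f"{corpus}: missing or NaN features {bad}")
    return {"corpus": corpus, **{n: float(row[n]) for n in FULL_FEATURES}}


def combine(tables: dict[str, list[dict]]) -> list[dict]:
    """Concatenate per-corpus rows in sorted corpus order."""
    combined = []
    for corpus in sorted(tables):  # explicit, reviewable ordering
        for row in tables[corpus]:
            combined.append(normalize_row(row, corpus))
    return combined
```

Extra columns are dropped and column order never depends on the input, so two runs over the same refreshed tables give the same aggregate layout.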

Consequences

  • Positive: Netflix, KoNViD, BVI-DVC, UGC, and CHUG refresh jobs now have one binary contract and one aggregate schema. Folded KoNViD output becomes reproducible across machines because fold labels are deterministic hashes over clip keys.
  • Negative: Full-feature UGC/KoNViD refreshes are slower than the old shortcuts, and local jobs need a current core/build-cpu/tools/vmaf build before running.
  • Neutral / follow-ups: Model retraining waits for the long-running refresh jobs (CHUG, KoNViD, UGC, Netflix) to finish. The combiner intentionally writes gitignored parquet outputs; model-card number updates belong in the follow-up retrain PR.
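The reproducibility claim for folded KoNViD output rests on deriving each fold label from a hash of the clip key. A minimal sketch of that idea follows; the function name `fold_for_clip`, the choice of SHA-256, and the default of five folds are assumptions for illustration, since this ADR does not specify them.

```python
import hashlib


def fold_for_clip(clip_key: str, n_folds: int = 5) -> int:
    """Map a clip key to a fold index that is stable across machines.

    Uses a cryptographic digest instead of Python's built-in hash(),
    because hash() on strings is salted per process by default and
    would give different folds on different runs.
    """
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    digest = hashlib.sha256(clip_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned big-endian integer.
    return int.from_bytes(digest[:8], "big") % n_folds
```

Because the label depends only on the key, adding or removing other clips never moves an existing clip to a different fold.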

References

  • Research digest: AI full-feature refresh defaults.
  • ADR-0026 — full-feature table motivation.
  • ADR-0340 — multi-corpus aggregation.
  • ADR-0362 — existing K150K full-feature precedent.
  • Source: req — "i wasnt talking about chug only, we bugfixed so many things, all our ai things must be stale"
  • Source: req — "so all, netflix, regressors, encoders etc... everything we did so far needs updates"