Research: AI full-feature refresh defaults
Companion to ADR-0642.
Findings
The refresh audit found three stale-data risks:
- Several scripts defaulted to build/tools/vmaf, /usr/local/bin/vmaf, or PATH lookup. Those locations can point at an older binary without fork-only extractors such as motion_v2, SSIMULACRA2, and the SpEED features.
- extract_ugc_features.py preserved an old v5 shortcut that emitted canonical-6 rows and filled the rest of FULL_FEATURES with NaN. That shape is incompatible with current full-feature FR refreshes.
- KoNViD-1k had a synthetic-FR pair extractor but not a full-feature table generator with deterministic fold metadata.
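The first risk can be illustrated with a minimal sketch of a resolver that never falls back to PATH. The function name `resolve_vmaf_binary`, its parameters, and the `FORK_VMAF` constant are hypothetical and not taken from the repository; only the path core/build-cpu/tools/vmaf comes from this digest.

```python
import os
from pathlib import Path
from typing import Optional

# Assumption: the fork CPU build lives at this repo-relative path,
# as named in this digest. The constant name itself is hypothetical.
FORK_VMAF = Path("core/build-cpu/tools/vmaf")


def resolve_vmaf_binary(explicit: Optional[str] = None,
                        repo_root: Path = Path(".")) -> Path:
    """Return the vmaf binary to use, failing loudly instead of
    silently picking up an older binary from PATH."""
    # An explicit path wins; otherwise use the fork build under repo_root.
    candidate = Path(explicit) if explicit else repo_root / FORK_VMAF
    if not candidate.is_file() or not os.access(candidate, os.X_OK):
        raise FileNotFoundError(
            f"vmaf binary not found or not executable: {candidate}"
        )
    return candidate
```

Raising on a missing binary, rather than searching further, is the design choice being illustrated: a failed refresh is easier to notice than a table built by an older extractor set.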
Local smoke checks used the current fork CPU binary at core/build-cpu/tools/vmaf. BVI-DVC raw-YUV input produced all-zero VMAF in a one-clip probe, while the local lossless MKV bundle produced finite metrics; the refresh driver therefore accepts .mkv, .mp4, and .yuv but documents the MKV bundle as the known-good source.
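Both failure shapes described so far (all-zero VMAF from the raw-YUV probe, and feature columns filled with NaN by the old shortcut) can be caught with a simple per-clip check. The following is a sketch only: `find_stale_signals` and the row layout (one dict per frame) are assumptions for illustration, not the refresh driver's actual code.

```python
import math
from typing import List, Mapping, Sequence


def find_stale_signals(rows: Sequence[Mapping[str, float]],
                       feature_names: Sequence[str],
                       score_key: str = "vmaf") -> List[str]:
    """Return human-readable problems for one clip's per-frame rows.

    An empty list means the clip looks usable.
    """
    if not rows:
        return ["no rows"]
    problems: List[str] = []
    # A score that is zero (or missing) on every frame matches the
    # raw-YUV probe result rather than a real measurement.
    if all(row.get(score_key, 0.0) == 0.0 for row in rows):
        problems.append(f"{score_key} is zero on every row")
    # A feature that is NaN (or missing) on every frame matches the
    # old shortcut's NaN-filled columns.
    for name in feature_names:
        if all(math.isnan(row.get(name, math.nan)) for row in rows):
            problems.append(f"{name} is NaN on every row")
    return problems
```

For example, calling it with `feature_names=["motion_v2"]` on rows that lack that key reports the column as NaN on every row, so a table produced by an older binary is flagged before it is merged.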
Verification
Focused tests:
.venv/bin/python -m pytest \
  ai/tests/test_feature_extractor_defaults.py \
  ai/tests/test_bvi_dvc_dir_mode.py \
  ai/tests/test_extract_k150k_features.py \
  ai/tests/test_e2e_frame_to_score.py \
  ai/tests/test_extract_ugc_features.py \
  ai/tests/test_konvid_full_features.py \
  ai/tests/test_combine_full_feature_parquets.py \
  ai/tests/test_feature_sets.py \
  -q
Real local smokes completed for:
- one KoNViD-1k full-feature clip (200 rows, current FULL_FEATURES + vmaf);
- one UGC distorted pair (3 frames, current FULL_FEATURES + vmaf);
- BVI-DVC tier-D lossless MKV refresh (193 clips, 6176 frames).
Long-running local background jobs continue for Netflix, KoNViD, UGC, and CHUG; their outputs are gitignored corpus artifacts and not part of this PR.