ADR-0366: vmaf-tune corpus schema v3 — canonical-6 per-feature aggregates¶
- Status: Accepted
- Date: 2026-05-08
- Deciders: lusoris, claude
- Tags:
ai,tools,vmaf-tune,corpus,schema
Context¶
The schema-v2 vmaf-tune corpus row records only the pooled vmaf_score per encode and drops the libvmaf per-feature aggregates that the libvmaf CLI already emits in its JSON output. The codec-aware FR regressors (train_fr_regressor_v2.py and train_fr_regressor_v3.py) need the canonical-6 per-feature signals (adm2, vif_scale0..3, motion2) to learn beyond what vmaf_score already implies. The partial-integration audit (docs/research/0091-partial-integration-audit-2026-05-08.md) called this out: train_fr_regressor_v2.py carries an inline TODO documenting the absence and falls back to a synthetic-data smoke mode; train_fr_regressor_v3.py raises ValueError on any real-corpus DataFrame because the columns it asserts (adm2, ..., vmaf, cq, frame_index) do not exist on disk.
We could keep the trainer-side workaround (a sidecar JSON written out-of-band by a wrapper script that re-runs libvmaf in feature-extraction mode) but that doubles the encode wall-time, breaks the single-source-of- truth contract for the corpus row, and forces every downstream consumer to know about the sidecar.
Decision¶
We extend the corpus row schema to v3 by adding the canonical-6 libvmaf features as 12 mean / std aggregate columns (adm2_mean, adm2_std, vif_scale[0..3]_{mean,std}, motion2_{mean,std}) parsed directly from libvmaf's existing pooled_metrics block. The vmaftune.score.run_score driver already runs libvmaf with the features registered by the active model; we surface the per-feature mean / stddev it already computes via parse_feature_aggregates and the corpus row writer projects them into the new columns. Features the active model does not expose (e.g. the cambi-only path) become NaN in the row, never 0.0 — the trainer drops the row instead of training on synthetic zeros. The reader (vmaftune.corpus.read_jsonl) loads schema-v2 rows and back-fills the missing columns with NaN so legacy corpora do not crash newer consumers.
The schema-v3 trainers (v2 + v3) read the columns directly from the corpus DataFrame; the previous out-of-band sidecar JSON path is removed. train_fr_regressor_v2.py keeps its --smoke path as a documented diagnostic flag (the smoke synthesises both per_frame_features and the v3 <feature>_mean columns so both materialisation paths in _row_to_features are exercised). train_fr_regressor_v3.py retires its synthetic-by-default fallback and demands a v3 corpus on the real path; --smoke continues to synthesise a v3-shaped corpus for pipeline integrity checks but the exported registry row is always tagged smoke: true.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Status quo (sidecar JSON) | No schema change | Doubles encode wall-time, two sources of truth, every consumer needs sidecar awareness | Already documented as the partial-integration debt that motivated this ADR |
| Per-frame rows in the corpus | Maximally rich | 30-100x row inflation, breaks Phase B coarse-to-fine row counts, schema overhauled | Aggregates are the input the trainers actually use |
| Add canonical-6 + extra metrics (cambi, psnr, ssim, ms-ssim) in one bump | One schema change covers Phase D too | Wider blast radius, harder to land before downstream consumers stabilise | Defer additional metrics to a future v4 bump once consumers settle |
| Out-of-band per-encode log file with the libvmaf JSON | Zero corpus-row impact | Inflates dev surface, shells the same parser, still leaves the trainers blocked | Same problem as sidecar JSON |
Consequences¶
- Positive:
train_fr_regressor_v2.pyandtrain_fr_regressor_v3.pyconsume canonical-6 per-feature signals directly from the corpus DataFrame. The synthetic fallback thattrain_fr_regressor_v3.pydefaulted to on real corpora is removed — runs that fail to load the v3 schema fail loudly with a pointer to this ADR. Phase B / C consumers get one source of truth for per-feature data and can drop NaN rows (cambi-only fixtures, encode failures) instead of training on invented zeros. - Negative: The on-disk row width grows by 12 columns. Estimated +220 bytes per row on a typical sweep (52 CRFs × 5 presets × 9 sources × 6 codecs ≈ 14k rows ⇒ ~3 MB additional disk for a full corpus), which is within budget. Schema bump from 2 to 3 means every consumer pinned to
SCHEMA_VERSION == 2needs a forward bump or to use the back-compat reader. - Neutral / follow-ups:
ShotFeatures(predictor_features.py) is not yet wired to the new mean / std columns — that is a separate PR (and that work is out of scope here).hw_encoder_corpus.pymay still emit per-frame rows with bare feature names; downstream ensemble-LOSO trainers consume that shape and are unaffected by this PR.
References¶
- Audit digest:
docs/research/0091-partial-integration-audit-2026-05-08.md. - ADR-0237 (vmaf-tune Phase A umbrella).
- ADR-0297 / ADR-0301 (schema v2 sample-clip mode).
- ADR-0272 (codec-aware FR regressor v2 scaffold).
- ADR-0291 (fr_regressor_v2 prod ship).
- ADR-0302 (encoder-vocab v3 schema expansion).
- ADR-0323 (fr_regressor_v3 train and register).
- Source:
req— task brief 2026-05-08 to "ship the corpus-schema bump that unblockstrain_fr_regressor_v2.pyandtrain_fr_regressor_v3.pyconsuming canonical-6 per-frame features directly from the corpus".