Research-0026 — Cross-metric feature fusion for tiny-AI¶
Updated: 2026-04-28.
Question¶
The fork's tiny-AI training pipeline (ADR-0203, Research-0019, PR #180 combined trainer) currently uses only the 6 features from the canonical vmaf_v0.6.1 model:
| Feature | Source | Type |
|---|---|---|
adm2 | float_adm | Detail-loss aggregate |
vif_scale0 | float_vif | Visual-info fidelity scale 0 |
vif_scale1 | float_vif | Visual-info fidelity scale 1 |
vif_scale2 | float_vif | Visual-info fidelity scale 2 |
vif_scale3 | float_vif | Visual-info fidelity scale 3 |
motion2 | integer_motion | Frame-to-frame motion energy |
The fork registers 19 extractors that produce a much wider feature set:
| Extractor | Features produced |
|---|---|
| float_adm / integer_adm | adm, adm2, adm_scale0..3 |
| float_vif / integer_vif | vif, vif_scale0..3 |
| float_motion | motion, motion2 |
| integer_motion_v2 | motion3 (5-frame window) |
| float_ansnr | float_ansnr, float_anpsnr |
| float_psnr / psnr | psnr_y, psnr_cb, psnr_cr |
| float_ssim / ssim | ssim |
| float_ms_ssim | float_ms_ssim |
| float_moment | image moments (mean, var) |
| cambi | cambi (banding) |
| ciede | ciede2000 (color difference) |
| psnr_hvs (CPU SIMD'd) | psnr_hvs, psnr_hvs_y/cb/cr |
| ssimulacra2 | ssimulacra2 |
| lpips | lpips (DNN-based perceptual) |
That's ≈ 27 distinct per-frame features the fork can extract today, all bit-exact across CPU + AVX2 + AVX-512 + NEON, and most also bit-exact (or near so) across CUDA / SYCL / Vulkan.
The question this digest scopes: does training on the broader feature set produce a better tiny model than the canonical 6 features, and which feature subset is Pareto-optimal?
Why this matters now¶
Research-0025 demonstrated that adding more data (KoNViD-1k) tightens LOSO PLCC standard deviation by 5.6× and resolves the FoxBird outlier. The natural next axis is adding more features — same data, richer signal. Three possible outcomes:
- Strong gain. A wider feature set captures perceptual aspects the 6-feature canonical misses (e.g. CAMBI for banding, SSIMULACRA 2 for chroma). The new model becomes
vmaf_tiny_v2.onnx. - Marginal gain. Cross-metric correlation is high enough that the 6 features already capture most of the signal; adding more yields ≤ 0.5 pp PLCC improvement at the cost of slower extraction. Conclusion: stay on 6 features; this digest closes "no significant gain measured".
- Mixed gain. Specific failure modes (FoxBird-class banding, chroma artifacts, color-space mismatch) get fixed; mean PLCC barely moves. Conclusion: ship a model variant per failure mode, not a one-size-fits-all replacement.
Research-0023 / 0025 already show that the FoxBird-class outlier was a content-distribution problem solvable with more data. Outcome (3) above hypothesises there's a feature-set problem hiding underneath that more data masks but doesn't eliminate.
Experimental plan¶
Phase 1 — Inventory + parquet expansion (1 PR)¶
- Extend
ai/data/feature_extractor.pyto optionally extract all 27 features, not just the 6 canonical ones. Gate by a CLI flag (--feature-set {canonical,full,custom}) so existing 6-feature flows are unchanged by default. - Re-run KoNViD acquisition + Netflix corpus extraction with
--feature-set full. Estimated wall: KoNViD ~2 hours (4 × current), Netflix ~10 minutes. Output:ai/data/konvid_vmaf_pairs_full.parquetandai/data/netflix_pairs_full.parquet(gitignored). - Verify against existing 6-feature parquets that the canonical columns are byte-identical (sanity check on the extraction pipeline).
Deliverable: PR adding --feature-set flag + new parquets. Documented in docs/ai/training-data.md. No model change yet.
Phase 2 — Correlation + mutual-information analysis¶
- Pairwise Pearson correlation matrix over the 27 features on the combined corpus (~280 K frames). Heatmap in
docs/research/0026-feature-correlation-heatmap.png. Hypothesis: VIF scales 0-3 are highly intra-correlated; PSNR-Y and SSIM correlate strongly;motion2andmotionare nearly redundant;admandadm2track within ~0.95. - Pairwise mutual-information matrix (handles non-linear relations Pearson misses). Hypothesis:
cambiandssimulacra2carry information orthogonal to the canonical 6 (banding + perceptual color, neither captured by VIF/ADM/motion). - Feature-importance ranking via three independent methods, targets =
vmaf_v0.6.1per-frame teacher score: - LASSO regression on standardized features (sparsity-inducing).
- Random-forest feature importance (model-free non-linear).
- SHAP values from a gradient-boosted regressor. Cross-check rankings — features ranked top-10 by all three methods are the highest-leverage candidates for the v2 model.
Deliverable: Notebook in ai/scripts/feature_analysis.ipynb or .py (whichever ships); aggregated tables in this digest.
Phase 3 — Tiny-AI v2 architecture sweep¶
- Train
mlp_small/mlp_mediumon three feature subsets: - Canonical-6 (baseline; what we have today).
- Top-K = canonical-6 ∪ top-K-additional from Phase 2's consensus ranking, K ∈ {2, 4, 6, 12}.
- Full-27 (sanity ceiling). 30 epochs each,
--val-mode netflix-source-and-konvid-holdout, same seed. - LOSO sweep on the winning subset — 9-fold per-source held out (matches Research-0025 §"LOSO sweep" methodology). Mean ± std PLCC / SROCC / RMSE.
- Per-failure-mode evaluation:
- Banding-heavy clips (need separate CAMBI fixture set).
- Chroma-artifact clips (need 4:4:4 distorted pairs).
- High-motion + low-light (FoxBird-class — already covered).
Deliverable: Research-0027 digest with the sweep results + recommended vmaf_tiny_v2.onnx registration if outcome (1) or (3) materialises.
Phase 4 — Latency + size tradeoff¶
- Measure
vmafCLI extraction time per (ref, dis) pair on each feature set. Tradeoff: full-27 extraction is bounded by the slowest extractor (CAMBI on 4K, ~30 % of frame time). - ONNX file-size comparison:
vmaf_tiny_v1.onnx(1.3 KB) vs Full-27 mlp_small (~5 KB) — both still tiny. - Decide: which subset becomes default for
vmaf-tiny? Per ADR-0042 / ADR-0049 doc-substance + sidecar policy, ship asvmaf_tiny_v2.onnxwith explicitfeature_setfield in the sidecar JSON so downstream tooling knows.
Open questions¶
- Q1. Should
lpips(the only DNN-based feature) be in the candidate pool? It's expensive (ORT inference per frame) and blurs the line between "tiny model on classical features" and "ensemble of DNNs". Recommend excluding lpips from Phase 1/2 and revisiting only if classical features can't close the gap. - Q2. Do we want the v2 to be drop-in compatible with
vmaf_v0.6.1(same 6-feature input layout) or a new public API? Drop-in compat means Phase 3 must include a constrained variant; new API means a richer feature loader. - Q3. Per-fold or per-clip outliers — is FoxBird the only case? Phase 2's mutual-information analysis on the
Seekingclip (which had RMSE 6.66 in Research-0025's LOSO, the worst by RMSE) should reveal whether it's a content- distribution issue or a feature-set issue.
Cost estimate¶
| Phase | Wall time | New artefacts |
|---|---|---|
| 1 — full-feature parquet | 2-3 h re-extraction | 2 parquets, gitignored |
| 2 — correlation/MI/importance | 30 min compute | 1 notebook, 3 plots, in-digest tables |
| 3 — arch sweep | 1-2 h training | ~12 ONNXes per arch, in runs/ (gitignored) |
| 4 — latency/size | 30 min benchmarking | benchmark JSON in testdata/ |
| Total | ~5-7 h focused | digest + Research-0027 follow-up |
Compare to the 8 h estimate for the upstream-port chains (Research-0024) — broadly similar effort with much higher expected upside.
Recommendation¶
- Adopt this as a tracked T-NN backlog item. Strategy: do Phase 1 + Phase 2 in one session (parquet + analysis) → ship Research-0027 with the correlation/importance numbers and a go/no-go decision on Phase 3.
- Defer Phase 3 unless Phase 2 shows ≥ 1 feature with consensus top-3 importance that's NOT in the canonical 6. Phase 2 is cheap (~3 h) and answers the question. Phase 3 only fires if Phase 2 says yes.
- Phase 4 is automatic if Phase 3 lands a v2 model.
References¶
req(user, 2026-04-28): "are we actually training the ai models on all metrics combined? and try to find more usage/overlaps/relations whatever?"- Research-0019 — Netflix training methodology survey.
- Research-0023 — 3-arch LOSO on canonical 6 features.
- Research-0025 — combined corpus resolves content-distribution variance; this digest attacks the orthogonal feature-set axis.
- ADR-0042 — Tiny-AI doc-substance per-PR rule.
- ADR-0049 — Tiny-AI sidecar JSON policy.
- ADR-0203 — Netflix-corpus training prep.
- Li et al. 2016 (Netflix VMAF paper) — original 6-feature SVR.