Research-0029 — Phase-3b: StandardScaler retry of the subset sweep¶
Updated: 2026-04-29.
Question¶
Research-0028 §"Decision" hypothesised that the negative Phase-3 result was a feature-scale-variance artefact, not a feature-signal artefact: raw features fed to mlp_small let psnr_* / cambi / ciede2000 (range 0–100) dominate gradient updates over adm2 / vif_* / float_ssim (range 0–1). Phase-3b is the most-likely-fix retry: same sweep, add per-fold StandardScaler.
Method¶
Identical to Research-0028 §"Method": 9-fold LOSO on the Netflix Public corpus (runs/full_features_netflix.parquet, 11 040 frames × 21 features), mlp_small (Linear→16→8→1), 30 epochs, Adam lr=1e-3, batch 256, seed=0. Difference: per-fold StandardScaler — fit (mean, std) on the train split and apply to both train and val. Statistics never leak from the held-out fold.
Driver flag: --standardize on ai/scripts/phase3_subset_sweep.py (added in this PR).
Results¶
| Subset | Features | Mean PLCC | Mean SROCC | Mean RMSE | Δ PLCC vs canonical6 |
|---|---|---|---|---|---|
| canonical6 | 6 | 0.9677 | 0.9802 | 9.586 | — |
| A | 7 | 0.9669 | 0.9768 | 8.672 | −0.0008 |
| B | 7 | 0.9783 | 0.9803 | 7.893 | +0.0106 |
| C | 21 | 0.9597 | 0.9632 | 10.497 | −0.0081 |
Headline¶
Subset B clears the Research-0027 stopping-rule threshold of +0.005 PLCC by 2× (+0.0106). The hypothesis "broader feature set helps a tiny MLP" is now supported on standardised inputs — the Phase-3a failure was preprocessing, not signal.
Subset B is {adm2, adm_scale3, vif_scale2, motion2, ssimulacra2, psnr_hvs, float_ssim} — the consensus-7 with redundancy-pruned VIF scales. It validates all four of Research-0027's findings:
adm2andvif_scale2are core (canonical features).adm_scale3adds signal beyondadm2(correlation only 0.84).ssimulacra2carries independent perceptual signal.- Redundant pairs (motion family, vif scales 1/2/3) ARE redundant — Subset B drops them and wins; Subset C keeps them all and loses.
Subset comparison detail¶
- Subset A (canonical-6 + ssimulacra2) matches canonical6 within noise (Δ −0.0008). One extra feature without redundancy pruning doesn't help; the model has to make room for ssimulacra2 by attenuating something else. The PLCC delta is statistically meaningless (std on 9 folds is ±0.029), but the expectation that adding any consensus-top feature would lift PLCC was wrong. Pruning matters.
- Subset B (consensus-7 with redundancy pruning) is the winner. Drops 3 redundant VIF scales (kept only
vif_scale2) and dropsmotion3(redundant withmotion2); addsadm_scale3,ssimulacra2,psnr_hvs,float_ssim. Same feature count as A (7), better PLCC (+0.0114 vs A), better RMSE (8.7 → 7.9). - Subset C (full-21) loses (Δ −0.0081). Including all 21 features overwhelms the 16-unit hidden layer; the model can't exploit the high-signal additions because most of the input is noise from redundant features. RMSE goes the wrong way too (10.5 vs canonical6's 9.6).
The pattern is: information curation > information accretion. The MI/LASSO/RF triangulation in Research-0027 correctly identified which features were complementary; Subset C just forgot to apply the redundancy pruning.
Why canonical6 PLCC dropped between Phase-3a and Phase-3b¶
Phase-3a canonical6 PLCC was 0.9845; Phase-3b canonical6 PLCC is 0.9677 — same model, same data, same seed, only difference is StandardScaler. Two contributing factors:
- Adam lr interacts with input scale. The canonical 6 features are roughly in
[0, 1]. Standardising tomean=0, std=1actually enlarges their effective magnitude (small variance in raw → scaled to unit variance). Adam's adaptive learning rate then takes larger steps per epoch; for an already well-tuned 6-feature setup, this overshoots the optimum slightly. Loweringlrfrom1e-3to3e-4(Phase-3c experiment) would likely recover Phase-3a's canonical6 number. - The
seed=0random shuffle order is identical, but the gradient magnitudes are different post-scaling. Different convergence trajectory, different result.
The apples-to-apples comparison is canonical6 vs Subset B with both standardised — that's the +0.0106 PLCC win, and it's the load-bearing finding. A future Phase-3c experiment could re-tune lr for the standardised regime; if canonical6 recovers to its Phase-3a number under a smaller lr, Subset B would have to re-clear the threshold under matched conditions.
Decision¶
ADVANCE. Per Research-0027's stopping rule and the +0.0106 result, Subset B is the v2 candidate. But three caveats gate actual model-shipping:
Required before shipping vmaf_tiny_v2¶
- Multi-seed validation — current numbers are seed=0 only. Re-run Phase-3b at
seed ∈ {0, 1, 2, 3, 4}; require Subset B to maintain Δ ≥ +0.005 against canonical6 on mean over seeds, with std ≤ 0.01. - KoNViD cross-check — extract the full-feature parquet on the 1200-clip KoNViD corpus (~3 h wall, per Research-0025 precedent) and re-run the sweep. The redundancy structure is unlikely to change but the importance ranking may.
- Phase-3c lr-sweep on canonical6 — if a tuned
lrrecovers Phase-3a's 0.9845 PLCC, the +0.0106 advantage may shrink under matched preprocessing.
Phase-3c (gated on Phase-3b)¶
- Sweep
lr ∈ {1e-3, 3e-4, 1e-4}×epochs ∈ {30, 60, 100}×arch ∈ {mlp_small, mlp_medium}on canonical6 + Subset B with StandardScaler. Identifies the canonical6 PLCC ceiling under matched preprocessing and tests whether a wider arch helps Subset B more.
Phase-3d (gated on Phase-3c if B still wins)¶
- Per-feature ablation of Subset B: train (B − {f}) for each f in B's 7 features, report PLCC delta. Identifies which Subset-B features are load-bearing vs which are decoration.
What to take from this digest¶
The Research-0026 hypothesis is alive and supported. Subset B is the right v2 candidate. But the path to vmaf_tiny_v2.onnx needs three more validation steps (multi-seed, KoNViD, matched-preprocessing canonical6). None of those are blocking; all are well-scoped follow-ups.
The most striking secondary finding: Subset C (full-21) loses even with StandardScaler. That's strong evidence that Subset B's redundancy pruning is doing real work — adding more features past the consensus-7 actively hurts, even with normalised inputs. This supports the "tiny model" thesis specifically: a wider feature input would need a correspondingly wider hidden layer (mlp_medium or larger) to benefit, which is the Phase-3c hook.
Reproducer¶
python3 ai/scripts/phase3_subset_sweep.py \
--parquet runs/full_features_netflix.parquet \
--out runs/phase3b_subset_sweep.json \
--subsets canonical6,A,B,C \
--epochs 30 \
--standardize
Wall: ~18 min on ryzen-4090-arc CPU (same as Phase-3a; the StandardScaler step is negligible).
Caveats¶
- Single seed.
seed=0only; multi-seed required before shipping (see §"Required before shipping"). - Single corpus. Netflix only; KoNViD cross-check open (Phase-3b extension).
- canonical6 PLCC moved between phases — the Phase-3a → Phase-3b drop on canonical6 (0.9845 → 0.9677) means the absolute Subset-B number isn't directly comparable to the Phase-2 importance numbers from Research-0027.
mlp_smallonly.mlp_mediummay flip the C-vs-B ordering (more capacity could exploit redundant features). Phase-3c.- No StandardScaler statistics persisted with the model. A shipped v2 ONNX would need to bundle the scaler
(mean, std)into the sidecar (per ADR-0049) so inference applies the same normalisation. This is solvable but unimplemented.
References¶
req(user, 2026-04-29): "go on" in response to "Want me to fire Phase-3b now".- Research-0026 — 4-phase plan; this digest closes Phase 3b (the standardisation retry).
- Research-0027 — Phase-2 consensus-top-10 result that justified Subset B's feature composition.
- Research-0028 — negative result that motivated this retry; explained the standardisation hypothesis.
- ADR-0049 — sidecar JSON policy (governs how a v2 model bundles its scaler statistics).
- Driver:
ai/scripts/phase3_subset_sweep.py(PR #188 baseline,--standardizeflag added in this PR). - Source data:
runs/full_features_netflix.parquet(gitignored; reproducible). - Source results:
runs/phase3b_subset_sweep.json(gitignored; reproducible).