Research-0090 — KonViD MOS head v1 design¶

Status: Notes for ADR-0336 (Phase 3 of ADR-0325)
Date: 2026-05-08
Author: @Lusoris

Question¶

How small can a fork-owned MOS head be while still clearing the ADR-0325 production-flip gate (PLCC ≥ 0.85, SROCC ≥ 0.82, RMSE ≤ 0.45, spread ≤ 0.005) on KonViD-style UGC content? And what feature shape should it consume, given that the in-flight Phase 1/2 ingester (PR #440 / #447) does not yet emit canonical-6 / saliency / shot- metadata columns?

Survey of existing open-weight MOS predictors¶

Model	Params	License	KoNViD-1k PLCC	KoNViD-1k SROCC	Notes
DOVER-Mobile	~3.8M	Apache	0.853	0.860	Mobile sibling of DOVER (Wu et al. 2023). Two-branch (technical + aesthetic) over swin-tiny.
Q-Align	~7B	MIT	0.876	0.884	LLM-based, far too large for embedding inside libvmaf.
FAST-VQA	~22M	Apache	0.859	0.851	Spatial-temporal sampling; comparable size to fr_regressor_v2 family + a 3D CNN frontend.
MD-VQA	~10M	Apache	0.846	0.835	Multi-dim VQA, swin-base.

Sources: published papers + the IQA-PyTorch leaderboard. The common denominator across the three competitive Apache-licensed predictors is roughly 4M+ params plus a backbone the fork's ONNX op-allowlist (core/src/dnn/op_allowlist.c) does not admit without Resize / patch-embed / multi-head-attention surgery.

Design constraint — ONNX-allowlist conformance¶

The fork's allowlist (per ADR-0258 / ADR-0169) is dense + conv + pool + standard activations + LayerNorm. That rules out:

Patch-embed Conv1d / Linear via einsum — admissible only if the graph lowers to plain Gemm (which patch-embed does with reasonable export settings, but the rest of swin-tiny carries MultiHeadAttention ops that don't).
3D CNN frontends — would need Conv3d. Admissible op, but the size cost is prohibitive for an in-libvmaf-shipped model.

Conclusion: a competitive head needs to consume summarised features (canonical-6, saliency mean/var, shot stats) rather than raw frames. That's the shape the rest of the fork's prediction stack already uses and matches the Phase A / fr_regressor_v2 corpus shape.

Decision — feature shape¶

11-D feature vector + 1-D ENCODER_VOCAB v4 one-hot:

Index	Feature	Source
0	`adm2`	libvmaf canonical-6
1	`vif_scale0`	libvmaf canonical-6
2	`vif_scale1`	libvmaf canonical-6
3	`vif_scale2`	libvmaf canonical-6
4	`vif_scale3`	libvmaf canonical-6
5	`motion2`	libvmaf canonical-6
6	`saliency_mean`	`saliency_student_v1` (ADR-0286)
7	`saliency_var`	`saliency_student_v1` (ADR-0286)
8	`shot_count_norm`	TransNet v2 (ADR-0223)
9	`shot_mean_len_norm`	TransNet v2 (ADR-0223)
10	`shot_cut_density`	TransNet v2 (ADR-0223)

Phase 1/2 KonViD JSONL rows do not yet carry columns 0–10; the trainer zero-fills them and runs effectively against the MOS column alone for now. Subsequent PRs (#477 for shot metadata; canonical-6 + saliency extraction during ingestion as a separate follow-up) bolt the columns on.

The ENCODER_VOCAB v4 one-hot is [1.0] (always asserted on the single "ugc-mixed" slot per ADR-0325 §Decision). The 1-D shape is forward-compatible: when LSVQ + YouTube-UGC ingestion lands, the new slots append at the end and existing ONNX stays loadable.

Architecture choice¶

LayerNorm(12)
  → Linear(12, 64) → ReLU → Dropout(0.1)
  → Linear(64, 64) → ReLU → Dropout(0.1)
  → Linear(64, 1)
  → Sigmoid + affine to [1, 5]

5,081 parameters total. The Sigmoid + affine wrapper bakes the [1.0, 5.0] MOS range into the graph — adversarial input cannot drive the output below 1 or above 5 — so the predictor surface does not need a runtime clamp on top.

The 30K–100K-param range from the task brief is wider than this; the actual architecture lands smaller because the input is already a summarised feature vector and a deeper MLP overfits the 600-row synthetic corpus.

Synthetic-corpus gate verdict (this PR)¶

5-fold cross-validation on a deterministic-seeded 600-row synthetic corpus produces:

Fold	PLCC	SROCC	RMSE
0	0.8677	0.8831	0.2565
1	0.8854	0.9356	0.2138
2	0.8017	0.8453	0.3079
3	0.8839	0.9442	0.2263
4	0.8596	0.8938	0.2291
Mean	0.8597	0.9004	0.2467

PLCC spread = 0.0836. The synthetic surrogate gate (PLCC ≥ 0.75) clears; the production-flip gate (PLCC ≥ 0.85 mean, ≤ 0.005 spread) does not clear, which is expected because synthetic noise is honestly noisier than real KonViD inter-rater consistency. The gate is not lowered (memory feedback_no_test_weakening); the checkpoint ships with Status: Proposed and the real-corpus retrain is gated on PR #447.

Fallback path — `mos = (vmaf - 30) / 14`¶

Why this specific linear remap, and not a more sophisticated calibration?

It maps VMAF 30 (visibly distorted) to MOS 0 (clamped to 1) and VMAF 100 (transparent) to MOS 5, giving a plausible 5-point estimate without per-codec calibration data.
The slope 1/14 is the inverse of the empirical (MOS - 1) * 14 + 30 ≈ VMAF regression Netflix's blog post on VMAF-vs-MOS approximations cites; using their inverse keeps the fallback in the same ball-park as the legacy assumption.
The clamp to [1, 5] keeps the surface honest — anything outside the MOS scale is a fallback artefact, not a real prediction.

The model card flags the fallback as approximate, not authoritative; callers that need MOS for a non-debug purpose should ensure the ONNX is shipped or block on the production-flip retrain.

Follow-ups¶

Extend the KonViD ingester (ai/scripts/konvid_*_to_corpus_jsonl.py) to emit the canonical-6 + saliency + shot-metadata columns. Today's trainer accepts those columns when present and zero-fills when absent, so the change is forward-compatible.
When PR #447 lands, re-run the trainer without --smoke and re-evaluate the production-flip gate. If it clears, flip the model card from Proposed to Accepted and update docs/state.md.
ENCODER_VOCAB v4 expansion (LSVQ, YouTube-UGC) — append-only schema bump, retrain under the same seed.

References¶

ADR-0325 — parent corpus-ingestion ADR.
ADR-0303 — production-flip protocol shape.
ADR-0258 — ONNX op-allowlist this graph conforms to.
ADR-0223 — shot-metadata source.
ADR-0286 — saliency-feature source.
Research-0086 — KonViD-150k corpus feasibility audit.
PR #440 — KonViD-1k Phase 1 ingestion.
PR #447 — KonViD-150k Phase 2 ingestion (in flight).