ADR-0168: Tiny-AI Wave 1 baselines C2 + C3 — KoNViD-1k training (T6-1)
- Status: Accepted
- Date: 2026-04-25
- Deciders: Lusoris, Claude (Anthropic)
- Tags: tiny-ai, training, onnx, konvid-1k, c2, c3, fork-local
Context
BACKLOG T6-1 calls for shipping "baseline C1/C2/C3 ONNX checkpoints in model/tiny/". The Wave 1 roadmap defines the three baselines:
- C1 — fr_regressor_v1.onnx (FR; libvmaf features → MOS). Target: match or beat vmaf_v0.6.1 PLCC on Netflix Public.
- C2 — nr_metric_v1.onnx (NR; distorted frame → MOS). Target: useful on live-encode + UGC without a reference.
- C3 — learned_filter_v1.onnx (residual; degraded → clean). Target: residual luma denoise, ≤ +2% encode time.
The roadmap explicitly notes:
"First training run also exercises the vmaf-train CLI end-to-end and proves the sidecar-JSON round-trip."
Pre-existing scaffolding under ai/ covered:
- Lightning models for all three (fr_regressor, nr_metric, learned_filter), each with hparams matching the YAML configs in ai/configs/.
- A C1-shaped VmafTrainDataModule that reads (features, mos) parquet rows, with key-aware splits.
- A typer CLI with extract-features / fit / export / eval / register subcommands.
What was missing: actual ONNX checkpoints (model/tiny/ had only LPIPS-SqueezeNet + smoke fixtures), and the data-loading path for C2 (frame → MOS) / C3 (paired frames) which the C1-only datamodule could not handle.
A 2026-04-25 dataset-access audit (general-purpose Claude agent) established that:
- Netflix Public (the dataset that calibrates vmaf_v0.6.1) is access-gated. Distribution is a Google Drive folder requiring a manual request to Netflix; it cannot be downloaded programmatically. No public mirror exists. This blocks C1 from shipping in this PR.
- KoNViD-1k (UGC NR with crowd-sourced MOS) is freely downloadable from datasets.vqa.mmsp-kn.de with no auth (~2.3 GB videos + ~3 MB metadata). Citation required; the official page does not state a Creative Commons licence verbatim, despite secondary reports that it does.
Per popup 2026-04-25, the user chose: "Defer C1, ship C2 + C3 now."
Decision
Ship trained C2 + C3 baselines, defer C1
- C2 — nr_metric_v1 — train the existing NRMetric (MobileNet-tiny, ~19K params) on KoNViD-1k middle-frames at 224×224 grayscale, key-split 80/10/10 train/val/test, 60 epochs with early stopping (patience 15), 16-bit mixed precision on the user's RTX 4090. Final val/mse = 0.382 on the 1–5 MOS scale (RMSE ≈ 0.62). This is a tiny model trained on a tiny dataset: the pipeline is correct, and quality is "baseline" per roadmap intent.
- C3 — learned_filter_v1 — train the existing LearnedFilter (4-block residual CNN, ~19K params) self-supervised: KoNViD-1k middle-frame + synthetic degradation (Gaussian σ=1.2 + JPEG-Q35) → clean original. 100 epochs, no early stop (loss kept improving), val/L1 = 0.019 on the 224×224 normalised luma plane.
- Defer C1 — pending Netflix Public Dataset access, tracked in docs/state.md under "Open bugs / deferred items".
Four new dataset-shaped scripts under ai/scripts/
- fetch_konvid_1k.py — urllib-based downloader for the videos + metadata zips. Writes to $VMAF_DATA_ROOT/konvid-1k/ (default ~/datasets/konvid-1k/). Idempotent. Note: the mmsp-kn.de TLS certificate was observed to be expired on 2026-04-25; the script falls back to an unverified SSL context for this single hard-coded URL, with the CRC + size sanity floor as the integrity backstop. A comment in the script flags that this fallback is not to be generalised.
- extract_konvid_frames.py — drives ffmpeg per-clip to grab one luma frame at clip midpoint, resizes to 224×224 grayscale via area interpolation, writes per-clip .npy, builds two parquets (C2: (key, frame_path, mos); C3: (key, deg_path, clean_path)).
- train_konvid.py — standalone Lightning driver that sidesteps the C1-only vmaf_train.train glue. Imports the existing NRMetric / LearnedFilter Lightning models, plugs them into new FrameMOSDataset / PairedFrameDataset classes, with key-split + early-stopping wired in. Two-arg invocation: --model {c2,c3,both}.
- export_tiny_models.py — re-uses the existing vmaf_train.models.exports.export_to_onnx pipeline (opset 17, dynamic batch axis, op-allowlist + ORT roundtrip atol 1e-4), writes per-model sidecar JSON, patches model/tiny/registry.json in place.
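The idempotency and integrity floor described for fetch_konvid_1k.py can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the script's actual code: the names crc32_of and needs_download and the parameters min_bytes and expected_crc are hypothetical, and only the Python standard library is used.

```python
import zlib
from pathlib import Path
from typing import Optional


def crc32_of(path: Path, chunk_size: int = 1 << 20) -> int:
    """Stream a file through zlib.crc32 so large zips are never fully in memory."""
    crc = 0
    with path.open("rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def needs_download(path: Path, min_bytes: int,
                   expected_crc: Optional[int] = None) -> bool:
    """Return True when the local copy is missing or fails the sanity floor.

    Hypothetical helper: a re-run skips the download only if the file exists,
    is at least min_bytes long, and (when a CRC is known) matches it.
    """
    if not path.is_file():
        return True
    if path.stat().st_size < min_bytes:
        return True
    if expected_crc is not None and crc32_of(path) != expected_crc:
        return True
    return False
```

Because the TLS fallback removes transport-level verification, a check of this kind is the only thing standing between a truncated or tampered archive and the training run, which is why the size floor and CRC are applied on every invocation rather than only after a fresh download.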
Two new in-tree datamodule classes under vmaf_train.data
- FrameMOSDataset (C2): (frame[1, H, W], mos[scalar]).
- PairedFrameDataset (C3): (degraded[1, H, W], clean[1, H, W]).
Both expose a .keys property so the existing split_keys helper gives deterministic per-clip splits (no leakage between train and val).
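To illustrate why splitting on clip keys prevents leakage, here is a minimal sketch of one way to make a split deterministic per key. It is not the existing split_keys helper, whose signature is not shown in this ADR; assign_split, partition_keys, and the salt value are hypothetical names chosen for the example.

```python
import hashlib
from typing import Dict, Iterable, List


def assign_split(key: str, val_frac: float = 0.1, test_frac: float = 0.1,
                 salt: str = "konvid-1k") -> str:
    """Map a clip key to 'train', 'val' or 'test' using a stable hash.

    The same key always lands in the same split, independent of dataset
    order or of which other keys are present.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a uniform number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def partition_keys(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Group keys by split; every key appears in exactly one list."""
    out: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for key in keys:
        out[assign_split(key)].append(key)
    return out
```

A hash-based assignment only approximates the target fractions on a finite key set, so split sizes will not be exact multiples of 10%. A seeded shuffle of the sorted keys is an alternative when exact counts matter.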
Schema + C-side enum extension for kind: "filter"
model/tiny/registry.schema.json gains "filter" as a third allowed enum for kind. The matching C-side VmafModelKind enum in core/include/libvmaf/model.h gains VMAF_MODEL_KIND_DNN_FILTER = 3, and the sidecar parser in core/src/dnn/model_loader.c recognises the new string. Filter models are registry-tracked (SHA-256-pinned, signed in release) for trust-root hygiene but are not loaded by the libvmaf scoring path — the ffmpeg vmaf_pre filter consumes them by path.
Alternatives considered
- Stub all three with random-weight ONNX placeholders. Rejected: pipeline-true but not roadmap-true. The whole point of T6-1 is to prove the training pipeline, not just the surface.
- Wait for Netflix Public access before shipping anything. Rejected: blocking on an asynchronous external approval would stall T6-2/3/4/5/6/7 indefinitely. C2 + C3 can ship now.
- Substitute KoNViD-1k for C1's training set. Rejected: vmaf_v0.6.1 PLCC comparison is C1's defining target; using a different dataset makes the comparison non-comparable and would ship a number we'd have to caveat in the model card.
- Train C2 only, defer both C1 and C3. Rejected: C3 trains self-supervised on the same KoNViD frames already extracted for C2, so the marginal cost is one additional Lightning trainer and 100 epochs of training time. C3 is too cheap to defer.
- Add a separate model/tiny/filter_registry.json. Rejected: maintaining two registries doubles the trust-root surface and confuses release tooling. Extending the existing registry's enum is the cleaner path.
Consequences
Positive:
- Two real, trained, ONNX-exportable, ORT-validated baseline models ship in model/tiny/. Closes 2 of the 3 sub-items of T6-1.
- The training pipeline is exercised end-to-end: dataset fetch → manifest scan → frame extraction → Lightning training → ONNX export → op-allowlist + ORT roundtrip → registry update.
- Future tiny-AI work has working examples to copy from. C1 in particular only needs the dataset; the rest of the pipeline is proven.
- The kind: "filter" enum extension reserves space for future pre-/post-processing models (vmaf_post, FastDVDnet) without another schema change.
Negative:
- C2 quality is baseline-grade (RMSE ~0.62 on 1–5 MOS, well below state-of-the-art NR metrics at this size). Improvements should use (a) a bigger backbone, (b) more training data, or (c) multi-frame input. Tracked as future work in docs/ai/roadmap.md.
- C3 was trained on synthetic degradation, so real-encoder distortions may not match the Gaussian + JPEG distribution. Worth re-training on real x265 / SVT-AV1 outputs once a paired-encode workflow exists.
- KoNViD-1k MOS values are not redistributed: the populated manifest stays gitignored and the user must re-run vmaf-train manifest-scan on a fresh clone. This is the existing convention per manifests/README.md; not changing this.
- Cross-backend ULP gate runs against CPU only on the user's box (no CUDA EP installed in the venv). The ONNX models are deterministic; CUDA EP validation is a follow-up once the self-hosted GPU runner from ADR-0167 (docs/development/self-hosted-runner.md) is enrolled.
- The C-side enum extension is ABI-additive (new value at the end), not breaking, but consumers with switch statements that don't have default: clauses would emit a -Wswitch warning. Not a problem inside libvmaf (we use default:) but flagged in rebase-notes.
Numerical results
C2 — nr_metric_v1
- Dataset: KoNViD-1k middle-frames, 224×224 grayscale, key-split 80/10/10 (973 train / 106 val / 121 test).
- Architecture: MobileNet-tiny (1×Conv stem + 5×depth-separable blocks + AdaptiveAvgPool + Linear), width=16, ~19.1K params.
- Training: AdamW, lr=1e-3, weight_decay=1e-4, 16-bit mixed precision, batch=64, 60 epochs with early-stop patience=15. Hardware: RTX 4090.
- Final: val/mse = 0.382 (epoch 23, training stopped here on patience). Test set not yet evaluated; left for follow-up.
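The RMSE of roughly 0.62 quoted elsewhere in this ADR follows directly from the validation MSE above. As a quick worked check (variable names are illustrative only):

```python
import math

val_mse = 0.382           # final val/mse reported for nr_metric_v1
mos_min, mos_max = 1, 5   # MOS scale used by KoNViD-1k

rmse = math.sqrt(val_mse)
# Express the error relative to the width of the MOS scale for context.
rmse_fraction_of_range = rmse / (mos_max - mos_min)

print(f"RMSE = {rmse:.3f} MOS points")  # prints: RMSE = 0.618 MOS points
```

This is a validation-set figure only; since the test split has not been evaluated yet, it should not be read as a held-out estimate.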
C3 — learned_filter_v1
- Dataset: same 1200 frames, paired with synthetic Gaussian σ=1.2 + JPEG Q35 degradation.
- Architecture: 4-block residual CNN with entry: Conv(1→16) + 4×ResBlock(16) + exit: Conv(16→1), ~18.9K params, output clamped to [0, 1].
- Training: AdamW, lr=1e-4, 16-bit mixed precision, batch=32, 100 epochs (no early-stop trigger).
- Final: val/L1 = 0.019 on the normalised luma plane (~5/255 in raw uint8). Visually denoising as expected.
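The ~18.9K parameter figure can be sanity-checked from the layer list alone. The sketch below assumes details this ADR does not state: 3×3 kernels throughout, bias terms on every convolution, two convolutions per residual block, and no normalisation layers. Under those assumptions the count lands on the quoted figure, but the real LearnedFilter definition may differ.

```python
def conv2d_params(c_in: int, c_out: int, kernel: int = 3, bias: bool = True) -> int:
    """Weights plus optional biases for a single 2D convolution."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


WIDTH = 16     # channel width from the architecture line
N_BLOCKS = 4   # number of residual blocks

entry = conv2d_params(1, WIDTH)              # Conv(1→16)
res_block = 2 * conv2d_params(WIDTH, WIDTH)  # assumed: two 3×3 convs per block
exit_conv = conv2d_params(WIDTH, 1)          # Conv(16→1)

total = entry + N_BLOCKS * res_block + exit_conv
print(total)  # 18865, i.e. about 18.9K
```

Nearly all of the parameters sit in the residual blocks, so changing the width has a roughly quadratic effect on model size while adding a block has a linear one.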
ONNX export
- Both pass vmaf_train.op_allowlist.check_graph (opset 17, no forbidden ops).
- Both round-trip through ORT CPU within 1e-4 atol of the PyTorch output.
- File sizes: nr_metric_v1.onnx ≈ 51 KB; learned_filter_v1.onnx ≈ 6 KB.
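The round-trip criterion is an absolute tolerance with no relative term. A minimal sketch of that comparison is shown below; flat Python sequences stand in for the PyTorch and ORT output tensors, and the helper names are hypothetical rather than taken from the export pipeline.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def within_atol(reference: Sequence[float], candidate: Sequence[float],
                atol: float = 1e-4) -> bool:
    """True when every element agrees to within the absolute tolerance."""
    return max_abs_diff(reference, candidate) <= atol
```

An absolute-only tolerance suits outputs with a bounded range, such as a 1–5 MOS score or a luma plane clamped to [0, 1], where a relative tolerance would be needlessly strict near zero.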
References
- BACKLOG T6-1 — backlog row.
- Wave 1 roadmap — model definitions + targets.
- ADR-0036 / ADR-0107 — Wave 1 scope.
- ADR-0042 — tiny-AI docs rule.
- ADR-0166 — release channel for the artifacts (the new ONNX files attach + sign on the next release tag via supply-chain.yml).
- KoNViD-1k. Hosu, Hahn, Jenadeleh, Lin, Men, Szirányi, Li, Saupe. "The Konstanz natural video database (KoNViD-1k)," QoMEX 2017. http://database.mmsp-kn.de.
- req — user popup 2026-04-25: "All three — real training" → follow-up: "download them and train locally, we don't have to upload the datasets, only the models" → after audit findings: "Defer C1, ship C2 + C3 now (Recommended)".