Tiny AI: benchmarks

This page explains how to produce comparable numbers for the tiny-AI models and how to read them. The shipped-score table below is a registry snapshot of shipped model-card metrics; regenerate the per-model reports before using any row for a release claim.

Accuracy methodology

FR (C1) and NR (C2) models are evaluated with the three canonical regression metrics:

PLCC — Pearson linear correlation with MOS.
SROCC — Spearman rank-order correlation.
RMSE — root mean square error against MOS (0–100 scale).

All three are computed by vmaf-train eval on the held-out test split produced by vmaf_train.data.splits.split_keys with the fixed salt vmaf-train-splits-v1. Splits are deterministic so baseline and challenger see the same frames/keys.

vmaf-train eval \
    --model model/tiny/vmaf_tiny_fr_v1.onnx \
    --features ai/data/nflx_features.parquet \
    --split test
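One common way to get deterministic splits from a fixed salt is to hash each key together with the salt and bucket the result. The sketch below illustrates that idea only: `assign_split`, the hashing scheme, and the split fractions are assumptions, not the actual behaviour of vmaf_train.data.splits.split_keys.

```python
import hashlib

SALT = "vmaf-train-splits-v1"  # salt named in the text above


def assign_split(key, salt=SALT, val_frac=0.1, test_frac=0.1):
    """Map a key to 'train', 'val', or 'test' using only the key and the salt.

    Hypothetical helper: the fractions and hashing scheme are assumptions,
    not the behaviour of vmaf_train.data.splits.split_keys.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a uniform value in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


keys = ["src01_crf23", "src02_crf28", "src03_crf33"]  # made-up keys
print({k: assign_split(k) for k in keys})
```

Because the assignment depends only on the key and the salt, it does not change with dataset ordering or with which other keys are present, which is what lets a baseline and a challenger see the same test set.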

Baseline: upstream `vmaf_v0.6.1` SVM

To compare a new tiny FR model against the upstream SVM, score the same test pairs through both models and run the helper functions in ai/tests/test_eval_metrics.py on both sets of scores. Record the baseline's version in the committed report for auditability.
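The comparison is only meaningful if both models were scored on exactly the same keys. The sketch below shows one way to enforce that and to carry the baseline version into the report. `compare_on_same_keys`, the report fields, and the sample values are hypothetical and do not reflect the helpers in ai/tests/test_eval_metrics.py.

```python
import math


def pearson(x, y):
    """Pearson linear correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )


def compare_on_same_keys(target, baseline, challenger, baseline_version):
    """Each of the first three arguments maps a test key to a score."""
    if not (target.keys() == baseline.keys() == challenger.keys()):
        raise ValueError("baseline and challenger must be scored on the same test keys")
    keys = sorted(target)
    ref = [target[k] for k in keys]
    return {
        "baseline_version": baseline_version,  # kept in the report for auditability
        "n": len(keys),
        "baseline_plcc": pearson([baseline[k] for k in keys], ref),
        "challenger_plcc": pearson([challenger[k] for k in keys], ref),
    }


# Made-up scores, for illustration only.
target = {"a": 30.0, "b": 55.0, "c": 80.0}
baseline = {"a": 33.0, "b": 52.0, "c": 84.0}
challenger = {"a": 31.0, "b": 56.0, "c": 79.0}
print(compare_on_same_keys(target, baseline, challenger, "vmaf_v0.6.1"))
```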

Runtime methodology

# Frames/second, end-to-end, single-thread CPU:
./testdata/bench_all.sh --tiny-model model/tiny/vmaf_tiny_fr_v1.onnx --backend=cpu

# GPU throughput:
./testdata/bench_all.sh --tiny-model model/tiny/vmaf_tiny_fr_v1.onnx --backend=cuda

The testdata/bench_all.sh harness writes its log to testdata/netflix_benchmark_results.json, an ad-hoc run artefact that is never committed. Collect multiple runs and report the median and p99.
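Aggregating several runs into a median and a p99 can be sketched as follows. The schema of the results JSON is not described here, so the example takes a plain list of frames-per-second values. `percentile`, `summarise`, and the sample numbers are illustrative assumptions.

```python
import math
import statistics


def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def summarise(fps_runs):
    """Collapse repeated benchmark runs into the two reported statistics."""
    return {
        "runs": len(fps_runs),
        "median": statistics.median(fps_runs),
        "p99": percentile(fps_runs, 99),
    }


fps_runs = [100.0, 102.0, 98.0, 101.0, 99.0]  # made-up values, not measurements
print(summarise(fps_runs))
```

With only a handful of runs the p99 sits very close to the maximum, so it becomes informative only once the run count is reasonably large.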

Shipped-score snapshot

| Model | Target | Validation summary | Runtime note |
| --- | --- | --- | --- |
| `fr_regressor_v1` | FR VMAF-teacher score | Netflix Public Dataset 9-fold LOSO mean PLCC `0.9977 ± 0.0025` | Tiny MLP over canonical-6 features; standardisation lives in the sidecar. |
| `fr_regressor_v2` | FR codec-aware VMAF-teacher score | Phase-A corpus in-sample PLCC `0.9794`; promoted by ADR-0291's LOSO gate | Adds codec / preset / CRF conditioning. |
| `fr_regressor_v3` | FR codec-aware VMAF-teacher score | LOSO mean PLCC `0.9975`, gate `>= 0.95` | Current 16-slot encoder-vocab model. |
| `vmaf_tiny_v2` | FR VMAF-teacher score | Netflix LOSO PLCC `0.9978 ± 0.0021`; KoNViD 5-fold PLCC `0.9998` | Production tiny fusion default; StandardScaler baked into ONNX. |
| `vmaf_tiny_v3` | FR VMAF-teacher score | Netflix LOSO PLCC `0.9986 ± 0.0015`; train-set RMSE `0.112` | Higher-capacity opt-in model; int8 sidecar available. |
| `vmaf_tiny_v4` | FR VMAF-teacher score | Netflix LOSO PLCC `0.9987 ± 0.0015` | Largest shipped tiny fusion model; opt-in. |
| `dists_sq_placeholder_v0` | FR perceptual-distance smoke | No perceptual-quality score claimed; registry row is `smoke: true` | ABI / ORT two-input smoke checkpoint only. |
| `mobilesal_placeholder_v0` | NR saliency smoke | Superseded for production ROI by `saliency_student_v1`; registry row is `smoke: true` | Retained to preserve the historical MobileSal I/O contract. |

Runtime throughput depends on ORT execution provider, CPU ISA, and GPU driver. Record measured CPU / CUDA / SYCL / OpenVINO numbers in the individual model card or release note for the exact build under test rather than maintaining a single stale global table here.

Model-size targets

| Model class | Target size | Typical |
| --- | --- | --- |
| C1 (FR MLP) | ≤ 100 KB | ~50 KB |
| C2 (NR CNN) | ≤ 5 MB | ~2 MB |
| C3 (learned filter) | ≤ 2 MB | ~800 KB |

Models larger than VMAF_DNN_DEFAULT_MAX_BYTES (50 MB, compile-time constant) are rejected at load time. The historical VMAF_MAX_MODEL_BYTES env override was retired in T7-12 — tiny-AI is tiny by definition, and if a candidate model balloons past the targets the design is wrong, not the limit.
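The two thresholds play different roles: the per-class target is a design budget, while the 50 MB limit is a hard rejection. A small sketch of that distinction follows. The real limit is a compile-time constant in the loader, so this Python version only illustrates the policy. `classify_model_size` is a hypothetical name, and reading KB/MB as binary units is an assumption.

```python
KB = 1024
MB = 1024 * KB  # assumption: binary units; the text does not say MB vs MiB

HARD_LIMIT_BYTES = 50 * MB  # mirrors VMAF_DNN_DEFAULT_MAX_BYTES as described above
TARGET_BYTES = {"C1": 100 * KB, "C2": 5 * MB, "C3": 2 * MB}


def classify_model_size(size_bytes, model_class):
    """Return 'rejected', 'over_target', or 'ok' for a model of the given class."""
    if size_bytes > HARD_LIMIT_BYTES:
        return "rejected"  # would fail at load time
    if size_bytes > TARGET_BYTES[model_class]:
        return "over_target"  # loads, but signals a design problem
    return "ok"


print(classify_model_size(50 * KB, "C1"))
print(classify_model_size(8 * MB, "C2"))
print(classify_model_size(60 * MB, "C3"))
```

A check like this could run before a candidate model is registered, so that an over-target model is flagged long before it nears the load-time limit.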

Determinism in benchmarks

The same --seed, train_commit, and dataset manifest SHA should reproduce the reported scores within a tight allclose tolerance. CI includes a float-rounding guard: drift ≥ 1e-3 on the primary FR metric trips a regression failure.
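The drift rule amounts to a simple absolute-difference check. The sketch below shows the shape of such a guard. `check_drift` and the sample values are hypothetical, and the actual CI guard may be implemented differently.

```python
DRIFT_LIMIT = 1e-3  # threshold quoted in the text above


def check_drift(reference, reproduced, limit=DRIFT_LIMIT):
    """Return the absolute drift; raise if it reaches the limit."""
    drift = abs(reproduced - reference)
    if drift >= limit:
        raise AssertionError(f"metric drift {drift:.2e} >= {limit:.0e}")
    return drift


# Made-up metric values, for illustration only.
print(check_drift(0.9900, 0.9902))  # small drift, passes
try:
    check_drift(0.9900, 0.9850)  # large drift, trips the guard
except AssertionError as err:
    print("regression:", err)
```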