
Tiny AI — benchmarks

How to produce comparable numbers for the tiny-AI models and how to read them. The table below is a registry snapshot of shipped model-card metrics; regenerate the per-model reports before using any row for a release claim.

Accuracy methodology

Full-reference (FR, C1) and no-reference (NR, C2) models are evaluated on the three canonical regression metrics:

  • PLCC — Pearson linear correlation with MOS.
  • SROCC — Spearman rank-order correlation.
  • RMSE — root mean square error against MOS (0–100 scale).

All three are computed by vmaf-train eval on the held-out test split produced by vmaf_train.data.splits.split_keys with the fixed salt vmaf-train-splits-v1. Splits are deterministic so baseline and challenger see the same frames/keys.
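One common way to get a deterministic, salt-keyed split is to hash each key together with the salt and bucket the result. The sketch below illustrates that idea only; it is an assumption, not the actual `split_keys` algorithm, and the `assign_split` name and the 10% / 10% fractions are hypothetical.

```python
import hashlib

SALT = "vmaf-train-splits-v1"  # salt string quoted in the text above


def assign_split(key, salt=SALT, val_frac=0.1, test_frac=0.1):
    """Map a key to 'train', 'val', or 'test' using only the key and salt.

    Hypothetical helper: the fractions and the hashing scheme are
    illustrative assumptions, not taken from vmaf_train.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a uniform value in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"
```

Because the assignment depends only on the key and the salt, adding or removing other keys never moves an existing key between splits, which is what lets a baseline and a challenger be scored on identical test data.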

vmaf-train eval \
    --model model/tiny/vmaf_tiny_fr_v1.onnx \
    --features ai/data/nflx_features.parquet \
    --split test

Baseline: upstream vmaf_v0.6.1 SVM

To compare a new tiny FR model against the upstream SVM, score the same test pairs through both models and compute the metrics for each with the helper functions in ai/tests/test_eval_metrics.py. Record the baseline's version in the committed report for auditability.
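A comparison report along these lines might look like the following sketch. It does not use the helpers in ai/tests/test_eval_metrics.py, whose signatures are not shown here; `compare_to_baseline`, the local `_pearson` function, and the report field names are all hypothetical.

```python
import math


def _pearson(xs, ys):
    """Pearson correlation, implemented locally to keep the sketch standalone."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def compare_to_baseline(mos, baseline_scores, challenger_scores, baseline_version):
    """Build a report dict for two models scored on the same test pairs."""
    if not (len(mos) == len(baseline_scores) == len(challenger_scores)):
        raise ValueError("all score lists must cover the same test pairs")
    baseline_plcc = _pearson(baseline_scores, mos)
    challenger_plcc = _pearson(challenger_scores, mos)
    return {
        "baseline_version": baseline_version,  # kept for auditability
        "baseline_plcc": baseline_plcc,
        "challenger_plcc": challenger_plcc,
        "plcc_delta": challenger_plcc - baseline_plcc,
    }
```

The length check is a cheap guard against the most common mistake, which is scoring the two models on different subsets of the test split.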

Runtime methodology

# Frames/second, end-to-end, single-thread CPU:
./testdata/bench_all.sh --tiny-model model/tiny/vmaf_tiny_fr_v1.onnx --backend=cpu

# GPU throughput:
./testdata/bench_all.sh --tiny-model model/tiny/vmaf_tiny_fr_v1.onnx --backend=cuda

The testdata/bench_all.sh harness writes its results to testdata/netflix_benchmark_results.json, an ad-hoc run artefact that is never committed. Collect multiple runs and report the median and p99.
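Summarising several runs can be sketched as below. The layout of the results JSON is not documented here, so the sketch takes a plain list of per-run measurements; `summarize_runs` is a hypothetical name, and the nearest-rank percentile definition is one common choice rather than a project requirement.

```python
import statistics


def summarize_runs(measurements):
    """Return the run count, median, and nearest-rank p99 of the measurements."""
    if not measurements:
        raise ValueError("need at least one run")
    ordered = sorted(measurements)
    n = len(ordered)
    # Nearest-rank p99: smallest value with at least 99% of runs at or below it.
    # Integer ceiling of 99 * n / 100 avoids floating-point rounding surprises.
    rank = -(-99 * n // 100)
    return {
        "runs": n,
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }
```

With fewer than 100 runs the nearest-rank p99 is simply the largest measurement, so state the run count alongside the two figures.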

Shipped-score snapshot

| Model | Target | Validation summary | Runtime note |
|---|---|---|---|
| fr_regressor_v1 | FR VMAF-teacher score | Netflix Public Dataset 9-fold LOSO mean PLCC 0.9977 ± 0.0025 | Tiny MLP over canonical-6 features; standardisation lives in the sidecar. |
| fr_regressor_v2 | FR codec-aware VMAF-teacher score | Phase-A corpus in-sample PLCC 0.9794; promoted by ADR-0291's LOSO gate | Adds codec / preset / CRF conditioning. |
| fr_regressor_v3 | FR codec-aware VMAF-teacher score | LOSO mean PLCC 0.9975, gate >= 0.95 | Current 16-slot encoder-vocab model. |
| vmaf_tiny_v2 | FR VMAF-teacher score | Netflix LOSO PLCC 0.9978 ± 0.0021; KoNViD 5-fold PLCC 0.9998 | Production tiny fusion default; StandardScaler baked into ONNX. |
| vmaf_tiny_v3 | FR VMAF-teacher score | Netflix LOSO PLCC 0.9986 ± 0.0015; train-set RMSE 0.112 | Higher-capacity opt-in model; int8 sidecar available. |
| vmaf_tiny_v4 | FR VMAF-teacher score | Netflix LOSO PLCC 0.9987 ± 0.0015 | Largest shipped tiny fusion model; opt-in. |
| dists_sq_placeholder_v0 | FR perceptual-distance smoke | No perceptual-quality score claimed; registry row is smoke: true | ABI / ORT two-input smoke checkpoint only. |
| mobilesal_placeholder_v0 | NR saliency smoke | Superseded for production ROI by saliency_student_v1; registry row is smoke: true | Retained to preserve the historical MobileSal I/O contract. |

Runtime throughput depends on ORT execution provider, CPU ISA, and GPU driver. Record measured CPU / CUDA / SYCL / OpenVINO numbers in the individual model card or release note for the exact build under test rather than maintaining a single stale global table here.

Model-size targets

| Model class | Target size | Typical size |
|---|---|---|
| C1 (FR MLP) | ≤ 100 KB | ~50 KB |
| C2 (NR CNN) | ≤ 5 MB | ~2 MB |
| C3 (learned filter) | ≤ 2 MB | ~800 KB |

Models larger than VMAF_DNN_DEFAULT_MAX_BYTES (50 MB, compile-time constant) are rejected at load time. The historical VMAF_MAX_MODEL_BYTES env override was retired in T7-12 — tiny-AI is tiny by definition, and if a candidate model balloons past the targets the design is wrong, not the limit.
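The real limit is enforced by the loader itself, but a pre-flight check in a training or export script can catch an oversized candidate earlier. The sketch below is illustrative only: `check_model_size` is a hypothetical helper, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption about the constant's exact value.

```python
import os

# Assumption: "50 MB" is interpreted as 50 MiB; the actual constant may differ.
MAX_MODEL_BYTES = 50 * 1024 * 1024


def check_model_size(path, limit=MAX_MODEL_BYTES):
    """Return the file size in bytes, or raise if it exceeds the limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(
            f"{path} is {size} bytes, which exceeds the {limit}-byte limit"
        )
    return size
```

Passing a per-class target from the table above as `limit` (for example 100 * 1024 for a C1 model) turns the same helper into a check against the design targets rather than the hard ceiling.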

Determinism in benchmarks

The same --seed, train_commit, and dataset manifest SHA should reproduce the reported scores to within a tight numerical tolerance (an allclose-style comparison). CI includes a float-rounding guard: drift of 1e-3 or more on the primary FR metric trips a regression failure.
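The drift rule can be expressed as a small predicate. This is a sketch of the stated threshold only, not the CI guard's actual code; `has_drifted` and `DRIFT_LIMIT` are hypothetical names.

```python
DRIFT_LIMIT = 1e-3  # threshold quoted in the text above


def has_drifted(reported, reproduced, limit=DRIFT_LIMIT):
    """True when the reproduced metric differs from the reported one by
    at least `limit` in absolute terms."""
    return abs(reported - reproduced) >= limit
```

For example, a reported PLCC of 0.9977 reproduced as 0.9978 would pass, while 0.9957 would fail.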