Tiny AI: inference
Three consumer surfaces share one runtime: vmaf CLI, libvmaf C API, and ffmpeg filters. All three funnel through core/src/dnn/ort_backend.c.
Prerequisites
- libvmaf built with -Denable_dnn=enabled (or auto with ONNX Runtime discoverable via pkg-config).
- ONNX Runtime ≥ 1.20 available at build time. ONNX Runtime isn't in the distro setup scripts under scripts/setup/ yet; install the prebuilt release tarball from https://github.com/microsoft/onnxruntime/releases or a distro package if available, and make sure its libonnxruntime.so + headers are on PKG_CONFIG_PATH before running meson setup.
- A .onnx model + sidecar .json pair under model/tiny/ or anywhere else; the CLI flag accepts an absolute path.
Verify at runtime:
vmaf --help | grep -- '--tiny-model' # must list the flag
vmaf --tiny-model /missing.onnx 2>&1 # should print a clear error,
# not "option not found"
Optional deployment hardening:
| Environment | Default | Notes |
|---|---|---|
| VMAF_TINY_MODEL_DIR | unset | Directory jail for --tiny-model / libvmaf tiny-model loads. When set, every ONNX path is resolved with symlinks followed and must live below this directory before the loader stats or maps the file. Missing, non-directory, sibling-prefix, and symlink-escape paths fail closed with -EACCES. |
Example:
export VMAF_TINY_MODEL_DIR=/opt/vmaf-models
vmaf -r ref.yuv -d dis.yuv -w 1920 -h 1080 -p 420 -b 8 \
--tiny-model /opt/vmaf-models/vmaf_tiny_v2.onnx
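The "sibling-prefix" case in the table above is the reason a plain string-prefix comparison is not enough for a directory jail. The following is a minimal sketch of the containment test only, assuming both paths have already been canonicalised (for example with POSIX realpath()). The helper name path_is_inside_dir is hypothetical and is not taken from libvmaf; the real loader may structure this differently.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of libvmaf: returns true when `path`
 * names an entry strictly below `dir`. Both arguments are assumed to be
 * canonical absolute paths (symlinks already resolved, e.g. by realpath()). */
bool path_is_inside_dir(const char *dir, const char *path)
{
    size_t n = strlen(dir);

    if (n == 0 || dir[0] != '/')
        return false;                 /* fail closed on a bogus jail */

    /* Ignore trailing slashes on the jail, but keep the root "/". */
    while (n > 1 && dir[n - 1] == '/')
        n--;

    if (strncmp(dir, path, n) != 0)
        return false;

    if (n == 1)                       /* jail is "/" */
        return path[1] != '\0';

    /* The character after the prefix must be a separator; otherwise
     * "/opt/vmaf-models-evil" would pass a jail of "/opt/vmaf-models". */
    return path[n] == '/' && path[n + 1] != '\0';
}
```

With a jail of /opt/vmaf-models, the path /opt/vmaf-models/vmaf_tiny_v2.onnx passes and /opt/vmaf-models-evil/m.onnx is rejected, which a caller would then map to the documented -EACCES.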
Surface 1: the vmaf CLI
# C1 — drop-in augmentation of the classic SVM. Default tiny FR model
# is now vmaf_tiny_v2 (ADR-0244) — supersedes the prior vmaf_tiny_v1.
vmaf -r ref.yuv -d dis.yuv -w 1920 -h 1080 -p 420 -b 8 \
-m version=vmaf_v0.6.1 \
--tiny-model model/tiny/vmaf_tiny_v2.onnx \
--tiny-device cuda
# C2 — no-reference.
vmaf -d dis.yuv -w 1920 -h 1080 -p 420 -b 8 \
--tiny-model model/tiny/vmaf_nr_mobilenet_v1.onnx \
--no-reference
Default flip (2026-04-29).
vmaf_tiny_v2 replaces vmaf_tiny_v1 as the recommended tiny FR fusion model. Same input contract (canonical-6 features), same output range (0–100 VMAF), +0.005–0.018 PLCC across the Phase-3 validation chain. The v1 file stays on disk as a regression baseline. See models/vmaf_tiny_v2.md for the full model card.
vmaf_tiny_v3 available alongside v2 (2026-05-02, ADR-0241). A wider/deeper variant (mlp_medium, 6 → 32 → 16 → 1, 769 params) trained on the same 4-corpus parquet, same recipe. Netflix LOSO mean PLCC 0.9986 ± 0.0015 vs v2's 0.9978 ± 0.0021 (+0.0008 mean, -30 % std). v2 remains the production default; pick v3 for lowest-variance estimates. See models/vmaf_tiny_v3.md.
Architecture ladder (2026-05-02). The tiny VMAF fusion family now spans three rungs sharing the canonical-6 input contract:
| Model | Arch | Params | ONNX | NF LOSO PLCC | Status |
|---|---|---|---|---|---|
| vmaf_tiny_v2 | mlp_small | 257 | 2.5 KB | 0.9978 ± 0.0021 | Production default |
| vmaf_tiny_v3 | mlp_medium | 769 | 4.5 KB | 0.9986 ± 0.0015 | Opt-in (recommended higher tier) |
| vmaf_tiny_v4 | mlp_large | 3073 | 14.0 KB | 0.9987 ± 0.0015 | Opt-in (top of measured ladder) |
v4's PLCC win over v3 is +0.0001 (below 1 std) — the ladder saturates on the canonical-6 + 4-corpus regime. ADR-0242 records "the arch ladder stops here". Pick v3 unless you specifically want the absolute top of the measured ladder; pick v2 for the smallest bundle.
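As a sanity check on the Params column, the count for a fully connected MLP can be derived from its layer widths. The sketch below assumes every layer carries a bias vector (an assumption, though it reproduces the documented 769 for v3's 6 → 32 → 16 → 1 layout). Only v3's widths are stated in this page, so the other rungs are not reproduced here. The function name is illustrative.

```c
#include <stddef.h>

/* Trainable parameters of a fully connected MLP: each layer contributes
 * in*out weights plus out biases (biases on every layer are assumed).
 * `widths` lists the layer sizes from input to output, e.g. {6, 32, 16, 1}. */
size_t mlp_param_count(const size_t *widths, size_t n_widths)
{
    size_t total = 0;
    for (size_t i = 0; i + 1 < n_widths; i++)
        total += widths[i] * widths[i + 1] + widths[i + 1];
    return total;
}
```

For v3 this gives (6·32 + 32) + (32·16 + 16) + (16·1 + 1) = 224 + 528 + 17 = 769.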
New flags:
| Flag | Default | Notes |
|---|---|---|
| --tiny-model PATH | (none) | ONNX model path; sidecar JSON at ${PATH%.onnx}.json. |
| --tiny-device STR | auto | One of auto, cpu, cuda, openvino, coreml, coreml-ane, coreml-gpu, coreml-cpu, openvino-npu, openvino-cpu, openvino-gpu, rocm. |
| --tiny-threads N | 0 | CPU EP intra-op threads; 0 = ORT default. |
| --tiny-fp16 | off | Request fp16 I/O when the EP supports it. |
| --tiny-model-verify | off | Require Sigstore-bundle verification (cosign verify-blob) before model load. Refuses to load on missing bundle, missing cosign, or non-zero exit. See model-registry.md and security.md. |
| --tiny-codec NAME | unknown | Encoder name for codec-aware tiny models (e.g. fr_regressor_v2). Validated against the sidecar's encoder_vocab; unknown names hard-fail at attach time so typos are caught. Common ffprobe aliases (h264, hevc, av1, vp9, vvc) are accepted. See ADR-0522. |
| --tiny-preset STR | medium | Encoder preset (medium / slow / p4 / 5 etc.); interpretation is encoder-specific and mirrors ai/scripts/train_fr_regressor_v2.py::PRESET_ORDINAL. Unknown presets fall back to ordinal 5. |
| --tiny-crf N | 0 | CRF / QP integer used during encoding; clamped to [0, 63] and normalised by 63 to match the trainer. |
| --no-reference | off | Skip reference loading; only valid with an NR tiny model. |
Codec-aware tiny models (fr_regressor_v2)
fr_regressor_v2.onnx carries a second codec input of shape [batch, N_VOCAB + 2]. The first N_VOCAB slots are a one-hot over the sidecar's encoder_vocab (v2: 12 entries — libx264, libx265, libsvtav1, libvvenc, libvpx-vp9, h264_nvenc, hevc_nvenc, av1_nvenc, h264_qsv, hevc_qsv, av1_qsv, unknown); the last two are preset_norm = preset_ordinal / 9.0 and crf_norm = crf / 63.0.
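The layout described above (one-hot over encoder_vocab, then preset_norm and crf_norm) can be sketched as follows. This is an illustration of the vector layout only, not libvmaf's implementation: build_codec_block is a hypothetical name, the preset name-to-ordinal mapping (PRESET_ORDINAL in the trainer) is not reproduced and is taken as an integer argument, and ffprobe alias handling is omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not libvmaf's implementation. Fills
 * out[0 .. n_vocab + 1] with: one-hot encoder, preset_norm, crf_norm.
 * Returns 0 on success, -1 when `encoder` is not in `vocab`. */
int build_codec_block(const char *const *vocab, size_t n_vocab,
                      const char *encoder, int preset_ordinal, int crf,
                      float *out)
{
    size_t hit = n_vocab;
    for (size_t i = 0; i < n_vocab; i++) {
        if (strcmp(vocab[i], encoder) == 0) {
            hit = i;
            break;
        }
    }
    if (hit == n_vocab)
        return -1;                    /* unknown name: caller hard-fails */

    for (size_t i = 0; i < n_vocab; i++)
        out[i] = (i == hit) ? 1.0f : 0.0f;

    /* CRF is clamped to [0, 63] before normalisation, as documented. */
    if (crf < 0)
        crf = 0;
    if (crf > 63)
        crf = 63;

    out[n_vocab] = (float)preset_ordinal / 9.0f;   /* preset_norm */
    out[n_vocab + 1] = (float)crf / 63.0f;         /* crf_norm */
    return 0;
}
```

With the 12-entry v2 vocabulary, libx264 at preset ordinal 5 (the documented fallback ordinal) and CRF 28 yields a 14-float block with slot 0 set, slot 12 ≈ 0.556, and slot 13 ≈ 0.444.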
Without --tiny-codec / --tiny-preset / --tiny-crf the loader pre-seeds the codec block to the unknown baseline (ADR-0518) and the model receives a constant conditioning vector — every call returns the same score regardless of the distorted YUV's encoder. Passing the flags populates the block from the user-supplied parameters via the new vmaf_dnn_set_codec_context() public API (ADR-0519):
vmaf --reference src.yuv --distorted dst.yuv \
--width 576 --height 324 --pixel_format 420 --bitdepth 8 \
--tiny-model model/tiny/fr_regressor_v2.onnx \
--tiny-codec libx264 --tiny-preset medium --tiny-crf 28 \
--json --output /tmp/scores.json
Unknown codec names exit non-zero before the first frame is read:
$ vmaf … --tiny-codec UNKNOWN_ENC …
--tiny-codec 'UNKNOWN_ENC' not found in model encoder_vocab;
use one of the names listed by --help.
Non-codec-aware models (fr_regressor_v1, vmaf_tiny_v4, dists_sq) reject the flags with a -ENOTSUP message — --tiny-codec requires a model whose sidecar carries an encoder_vocab array.
--tiny-model accepts an absolute or relative path. For production, set VMAF_TINY_MODEL_DIR to the trusted model directory and pass paths inside that directory; a model outside the jail fails before ONNX Runtime opens a session. The jail is independent of --tiny-model-verify: use the jail to restrict where models may load from, and verification to pin which signed model bytes may load.
Output JSON gains a tiny_model block alongside pooled_metrics:
{
"pooled_metrics": { "vmaf": { "mean": 91.23... } },
"tiny_model": {
"name": "vmaf_tiny_fr_v1",
"kind": "fr",
"device": "cuda",
"mean": 90.8...,
"per_frame": [...]
}
}
For attached multi-output models, each scalar ONNX output is recorded as its own feature. A single-output model keeps the sidecar name as the score key. Multi-output models use <sidecar-name>_<output-name>, with output-name taken from sidecar output_names[] when present and count-matched, otherwise from the ONNX graph output name. Attached mode still rejects non-scalar output tensors; use vmaf_dnn_session_run() when the caller needs vectors or images.
Auto-resize for image-input tiny models (ADR-0550)
Image-input (rank-4 NCHW) tiny models declare a fixed input shape — the shipped model/tiny/nr_metric_v1.onnx NR scorer, for example, expects [1, 1, 224, 224] because it was trained on KoNViD-1k middle-frames downscaled to 224×224 grayscale. Most NR workflows pass the encoder's native resolution as --width / --height, so a dimension mismatch is the norm rather than the exception.
The per-frame NCHW dispatch can auto-resample the luma plane to the model's input shape when they differ. The default behaviour is disabled — a dimension mismatch returns -ERANGE and the operator must explicitly choose a resize filter. This preserves the strict mode for parity harnesses and avoids a silent free parameter.
Warning:
bilinear, nearest, and bicubic produce scores that differ by approximately 2% on the same input. Treat filter choice as a model hyperparameter and document it alongside the model checkpoint.
Filter selectors:
--tiny-resize disabled # default: mismatch -> -ERANGE (strict mode)
--tiny-resize bilinear # torchvision / OpenCV BILINEAR (half-pixel-centre)
--tiny-resize nearest # nearest-neighbour, floor coord (debug-friendly)
--tiny-resize bicubic # Catmull-Rom (a = -0.5), separable
When the source dims already equal the model dims, the dispatch forwards verbatim to vmaf_tensor_from_luma — the matched-dims path stays bit-identical to the pre-ADR-0550 code, so the Netflix golden gate is unaffected regardless of the selected filter.
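To make the "half-pixel-centre" convention of the bilinear selector concrete, here is a minimal sketch of such a resampler on a float plane. It is not libvmaf's resampler: the function name, the float input type, and the clamp-to-edge border handling are assumptions made for illustration. It does show why matched dimensions degenerate to a verbatim copy: with equal source and destination sizes every sample lands exactly on a source pixel with zero interpolation weight.

```c
#include <stddef.h>

/* Hypothetical sketch, not libvmaf's resampler: bilinear resize of a
 * single float plane. Destination pixel centres map to source coordinates
 * via (d + 0.5) * scale - 0.5; out-of-range coordinates clamp to the edge. */
void resize_bilinear_f32(const float *src, int sw, int sh,
                         float *dst, int dw, int dh)
{
    const float x_scale = (float)sw / (float)dw;
    const float y_scale = (float)sh / (float)dh;

    for (int y = 0; y < dh; y++) {
        float fy = ((float)y + 0.5f) * y_scale - 0.5f;
        if (fy < 0.0f)
            fy = 0.0f;
        if (fy > (float)(sh - 1))
            fy = (float)(sh - 1);
        int y0 = (int)fy;
        int y1 = (y0 + 1 < sh) ? y0 + 1 : y0;
        float wy = fy - (float)y0;

        for (int x = 0; x < dw; x++) {
            float fx = ((float)x + 0.5f) * x_scale - 0.5f;
            if (fx < 0.0f)
                fx = 0.0f;
            if (fx > (float)(sw - 1))
                fx = (float)(sw - 1);
            int x0 = (int)fx;
            int x1 = (x0 + 1 < sw) ? x0 + 1 : x0;
            float wx = fx - (float)x0;

            /* Interpolate along x on both rows, then along y. */
            float top = src[(size_t)y0 * sw + x0] * (1.0f - wx)
                      + src[(size_t)y0 * sw + x1] * wx;
            float bot = src[(size_t)y1 * sw + x0] * (1.0f - wx)
                      + src[(size_t)y1 * sw + x1] * wx;
            dst[(size_t)y * dw + x] = top * (1.0f - wy) + bot * wy;
        }
    }
}
```

Nearest-neighbour and bicubic differ only in how the neighbourhood around (fx, fy) is sampled and weighted, which is where the roughly 2% score spread between filters comes from.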
Smoke test (Finding 11 reproducer — requires explicit --tiny-resize):
vmaf --no-reference \
--tiny-model model/tiny/nr_metric_v1.onnx \
--distorted testdata/dis_576x324_48f.yuv \
--width 576 --height 324 --pixel_format 420 --bitdepth 8 \
--tiny-resize bilinear \
--json --output /tmp/nr.json
# Expected: 48 frames scored, vmaf_tiny_model mean ~ 3.09 (bilinear),
# ~ 3.05 (nearest), ~ 3.11 (bicubic).
Without --tiny-resize, the default (disabled) produces 0 frames and "problem reading pictures" for any size-mismatched NR model:
vmaf --no-reference \
--tiny-model model/tiny/nr_metric_v1.onnx \
--distorted testdata/dis_576x324_48f.yuv \
--width 576 --height 324 --pixel_format 420 --bitdepth 8
# Expected: "problem reading pictures" at frame 0; 0 frames scored.
The same selector is reachable from the C API via vmaf_dnn_set_resize_mode(ctx, mode), where mode is one of VMAF_DNN_RESIZE_DISABLED, VMAF_DNN_RESIZE_BILINEAR, VMAF_DNN_RESIZE_NEAREST, or VMAF_DNN_RESIZE_BICUBIC; see Surface 2 below.
Surface 2: the libvmaf C API
#include <stdbool.h>
#include <stdio.h>
#include <libvmaf/libvmaf.h>
#include <libvmaf/dnn.h>
VmafContext *ctx;
int err = vmaf_init(&ctx, (VmafConfiguration){ /* ... */ });
if (err < 0) { /* handle -errno */ }
/* DNN support is a build-time option; bail out early if it is missing. */
if (!vmaf_dnn_available()) {
    fprintf(stderr, "libvmaf built without -Denable_dnn=enabled; rebuild.\n");
    return 1;
}
VmafDnnConfig dnn_cfg = {
    .device = VMAF_DNN_DEVICE_CUDA,
    .device_index = 0,
    .threads = 0,      /* 0 = ORT default */
    .fp16_io = false,
};
err = vmaf_use_tiny_model(ctx, "/models/vmaf_tiny_fr_v1.onnx", &dnn_cfg);
if (err < 0) { /* handle -errno */ }
/* … feed frames as usual; tiny-model scores appear in the same
   per-frame collector the built-in SVM uses. */
The sidecar JSON is discovered automatically at ${onnx_path%.onnx}.json. Its kind field (fr / nr) tells libvmaf whether to expect a reference. Optional output_names[] entries name attached multi-output scalar scores; the legacy output_name field remains accepted for single-output metadata.
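The sidecar lookup rule ${onnx_path%.onnx}.json is a shell suffix expansion: strip one trailing .onnx if present, then append .json. A minimal C sketch of that rule follows; sidecar_path is a hypothetical name, and the real loader's buffer handling and error codes are not documented here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper mirroring the shell expansion ${path%.onnx}.json:
 * strip one trailing ".onnx" if present, then append ".json".
 * Returns 0 on success, -1 if `out` is too small. */
int sidecar_path(const char *onnx_path, char *out, size_t out_size)
{
    static const char ext[] = ".onnx";
    const size_t ext_len = sizeof ext - 1;
    size_t len = strlen(onnx_path);

    if (len >= ext_len && strcmp(onnx_path + len - ext_len, ext) == 0)
        len -= ext_len;

    int n = snprintf(out, out_size, "%.*s.json", (int)len, onnx_path);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For model/tiny/vmaf_tiny_v2.onnx this produces model/tiny/vmaf_tiny_v2.json; a path without the .onnx suffix simply gains .json, matching the shell behaviour.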
Accepted ONNX input shapes (ADR-0518, extended by ADR-0523)
The loader accepts two input ranks:
| Rank | Shape | Meaning | Example checkpoint |
|---|---|---|---|
| 4 | [N, 1, H, W] | NCHW single-channel luma image — the picture's Y plane is fed through vmaf_tensor_from_luma each frame. Optional (mean, std) normalisation comes from the sidecar's norm_mean / norm_std. | model/tiny/dists_sq.onnx, model/tiny/nr_metric_v1.onnx |
| 2 | [N, F] | Feature-vector model. The host materialises the F features (default canonical-6: adm2, vif_scale0..3, motion2) from libvmaf's classic feature collector at inference time. The sidecar's feature_order (or features) declares the slot-to-feature mapping; the sidecar's feature_mean / feature_std (or input_mean / input_std) apply a StandardScaler before the tensor is handed to ORT. | model/tiny/fr_regressor_v1.onnx, model/tiny/fr_regressor_v2.onnx, model/tiny/vmaf_tiny_v4.onnx |
The batch dimension N may be:
- the fixed value 1 (legacy single-sample exports), or
- the symbolic ONNX dim_param token ('batch', 'N', …) which ORT reports back through the C API as -1. This is the default produced by torch.onnx.export(..., dynamic_axes=...) and is what every shipped NR checkpoint uses (ADR-0523).
A fixed batch greater than 1 is rejected: libvmaf's per-frame inference loop feeds one sample per ORT Run call, so multi-sample batches have no consumer today. The diagnostic reads tiny-model loader: <rank-4|feature-vector> model has fixed batch N; only batch=1 or symbolic batch (-1) is supported.
For rank-4 models, the spatial dims H and W must be known positive values; symbolic H/W ("dynamic-resolution" exports) fails with tiny-model loader: rank-4 model has dynamic / non-positive spatial dims (H=…, W=…); symbolic H/W is unsupported — re-export with a fixed input resolution. The scratch buffer is sized once at attach time, so the runtime cannot accept varying resolution.
Anything other than rank 2 or 4 fails loud with a human-readable log line: tiny-model loader: model has input rank N, expected 2 (feature vector) or 4 (NCHW image).
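The rank, batch, and spatial rules above reduce to a short predicate. The sketch below assumes the shape arrives as an int64_t array with -1 standing in for symbolic dimensions, as described for the ORT C API. The enum and function names are hypothetical, and the sketch does not check the channel dimension of rank-4 inputs.

```c
#include <stdint.h>

/* Hypothetical result codes for this sketch; the real loader's return
 * values are not reproduced here. */
enum shape_check {
    SHAPE_OK = 0,
    SHAPE_BAD_RANK,
    SHAPE_BAD_BATCH,
    SHAPE_BAD_SPATIAL
};

/* `dims` holds the first-input shape as reported by ORT, with -1 for a
 * symbolic dimension; `rank` is the number of entries in `dims`. */
enum shape_check check_input_shape(const int64_t *dims, int rank)
{
    /* Only feature vectors [N, F] and NCHW images [N, 1, H, W]. */
    if (rank != 2 && rank != 4)
        return SHAPE_BAD_RANK;

    /* Batch: fixed 1 or symbolic (-1); a fixed N > 1 has no consumer. */
    if (dims[0] != 1 && dims[0] != -1)
        return SHAPE_BAD_BATCH;

    /* Rank 4: H and W must be known and positive because the scratch
     * buffer is sized once at attach time. */
    if (rank == 4 && (dims[2] <= 0 || dims[3] <= 0))
        return SHAPE_BAD_SPATIAL;

    return SHAPE_OK;
}
```

A dynamic-batch NR export such as [-1, 1, 224, 224] passes, while [8, 6] and [1, 1, -1, -1] are rejected for the batch and spatial reasons respectively.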
Rank-2 models may declare a second input: fr_regressor_v2, for instance, takes a 14-dim codec block (12-entry one-hot encoder + preset_norm + crf_norm). The loader discovers the second-input width via ORT and allocates a zero-initialised scratch buffer pre-seeded to the "unknown encoder" one-hot at the third-from-last slot. To populate the block with the real encoder context, pass --tiny-codec / --tiny-preset / --tiny-crf on the CLI or call vmaf_dnn_set_codec_context() from the C API (see the codec-aware section under Surface 1). Without that context the model scores against the unknown-encoder baseline, so treat fr_regressor_v2 predictions as approximate in that case.
ONNX external data is supported automatically. Models shipped as <basename>.onnx plus a sibling <basename>.onnx.data (the standard ONNX-protobuf-external-data layout) load with no extra configuration — ONNX Runtime resolves the sibling file when given the absolute model path. The fork's fr_regressor_v1 and fr_regressor_v2 ship in this layout.
Surface 3: ffmpeg filters
Apply ffmpeg-patches/*.patch against a pinned FFmpeg SHA (see ffmpeg-patches/test/build-and-run.sh) then:
# C1 / C2 scoring through vf_libvmaf.
ffmpeg -i ref.mp4 -i dis.mp4 \
-lavfi "[0:v][1:v]libvmaf=tiny_model=/models/vmaf_tiny_fr_v1.onnx:tiny_device=cuda" \
-f null -
# C3 learned pre-filter.
ffmpeg -i in.mp4 \
-vf "vmaf_pre=model=/models/filter_denoise_residual_v1.onnx:device=cuda" \
out.mp4
The vmaf_pre filter's device= option accepts the same twelve device strings as tiny_device= in the libvmaf filter — auto, cpu, cuda, openvino, openvino-npu, openvino-cpu, openvino-gpu, coreml, coreml-ane, coreml-gpu, coreml-cpu, and rocm — all mapping to the corresponding VmafDnnDevice enum values (ADR-0482).
Execution-provider matrix
| Backend flag | ORT EP | Notes |
|---|---|---|
| --tiny-device cpu | CPUExecutionProvider | Always available. |
| --tiny-device cuda | CUDAExecutionProvider | Requires CUDA-enabled ORT; shares context with libvmaf-cuda. |
| --tiny-device openvino | OpenVINOExecutionProvider | Covers Intel GPU / SYCL / oneAPI. Tries GPU device type first, falls back to CPU device type. Also covers the integrated Xe / Xe2 GPU on Intel AI-PC platforms (Meteor / Lunar / Arrow Lake) for free. |
| --tiny-device openvino-npu | OpenVINOExecutionProvider, device_type=NPU | Intel AI-PC NPU only (Meteor / Lunar / Arrow Lake). The explicit selector does not fall back to another OpenVINO device type; if the EP isn't compiled in or no NPU silicon is present, the open downgrades to the CPU EP via the same two-stage vmaf_ort_open() fallback that all explicit-EP selectors share. End-to-end NPU validation is pending hardware access; see ADR-0332 and Research-0031. |
| --tiny-device openvino-cpu | OpenVINOExecutionProvider, device_type=CPU | OpenVINO CPU plugin (skip the GPU.0 probe). Useful when you want OpenVINO's CPU implementation specifically, e.g. for parity testing against a measured --tiny-device openvino-gpu run, or as a stable fallback on hosts without Intel iGPU/NPU. |
| --tiny-device openvino-gpu | OpenVINOExecutionProvider, device_type=GPU | OpenVINO GPU.0 plugin. Targets the iGPU / dGPU on systems where OpenVINO's intel_gpu plugin is the desired backend (Arc dGPU, Xe / Xe2 iGPU). |
| --tiny-device coreml | CoreMLExecutionProvider | Apple-only EP (macOS). CoreML auto-routes across the Apple Neural Engine (ANE), Metal-backed GPU, and CPU. The unscoped selector lets CoreML pick the compute unit per-op; use the explicit variants below to pin a single unit. See ADR-0365. |
| --tiny-device coreml-ane | CoreMLExecutionProvider, MLComputeUnits=CPUAndNeuralEngine | Highest perf-per-watt on M-series silicon (M1, M2, M3, M4). Routes to the dedicated on-die Neural Engine and falls back to CPU for ops the ANE doesn't support. Recommended Apple-silicon entry point. |
| --tiny-device coreml-gpu | CoreMLExecutionProvider, MLComputeUnits=CPUAndGPU | Pins CoreML to Metal-backed GPU + CPU. Useful when a graph hits ANE op-coverage gaps and falls back to CPU more aggressively than expected. |
| --tiny-device coreml-cpu | CoreMLExecutionProvider, MLComputeUnits=CPUOnly | Universal CoreML CPU path. Functionally similar to the plain CPU EP but exercises the same dispatch shape as the other coreml-* variants, which is useful for diff-style debugging on macOS. |
| --tiny-device rocm | ROCmExecutionProvider | Requires ROCm-enabled ORT. |
| --tiny-device auto | best available | Ordered try-chain: CUDA → OpenVINO (GPU then CPU) → ROCm → CoreML (auto-route) → CPU. NPU is not in the AUTO chain; it is opt-in only via --tiny-device openvino-npu because of the NPU power-state latency floor on small graphs. CoreML is last because the recommended Apple-silicon entry point is --tiny-device coreml-ane; AUTO picks CoreML only when no discrete-GPU EP is available. |
Graceful EP fallback
If the requested EP isn't compiled into the linked ORT build (for example, you ask for cuda on a CPU-only ORT), the session still opens — it silently degrades to the CPU EP rather than failing. This matches VmafDnnConfig.device being documented as a hint, not a requirement: a laptop and a workstation running the same binary get the best EP each one has.
To see which EP actually bound, call vmaf_dnn_session_attached_ep() on the session:
VmafDnnSession *sess;
vmaf_dnn_session_open(&sess, "/models/m.onnx",
&(VmafDnnConfig){.device = VMAF_DNN_DEVICE_AUTO});
printf("bound EP: %s\n", vmaf_dnn_session_attached_ep(sess));
/* For example: "CPU", "CUDA", "OpenVINO:GPU", "OpenVINO:CPU", "OpenVINO:NPU", "ROCm" */
Consumers that need a hard failure on missing EP should assert on the returned string at the call site (for example strcmp(ep, "CUDA") == 0).
fp16 I/O (VmafDnnConfig.fp16_io)
Setting .fp16_io = true enables a host-side fp32 ↔ fp16 round-trip at the I/O boundary, triggered per input/output slot when the model's graph declares that slot as FLOAT16. The public API always takes fp32; libvmaf performs the cast internally. When the model declares FLOAT32 on a slot, fp16_io = true is a no-op at that slot. When the EP is OpenVINO, the precision hint FP16 is additionally passed to the EP so intermediate compute also runs at half precision.
Example — running a FLOAT16-typed model:
VmafDnnConfig cfg = {.device = VMAF_DNN_DEVICE_AUTO, .fp16_io = true};
VmafDnnSession *sess;
vmaf_dnn_session_open(&sess, "/models/m_fp16.onnx", &cfg);
float in[H*W] = { /* fp32 input */ };
float out[H*W];
VmafDnnInput din = {.data = in, .shape = (int64_t[4]){1,1,H,W}, .rank = 4};
VmafDnnOutput dout = {.data = out, .capacity = H*W};
vmaf_dnn_session_run(sess, &din, 1, &dout, 1);
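For readers who want to see what a host-side fp32 ↔ fp16 cast at the I/O boundary involves, the following is an illustrative sketch of the two conversions on raw IEEE 754 bit patterns. It is not libvmaf's code: the function names are hypothetical, and round-to-nearest-even with saturation to infinity is an assumed rounding policy.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch, not libvmaf's code: binary32 -> binary16 with
 * round-to-nearest-even, saturating to infinity on overflow. */
uint16_t f32_to_f16(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    uint32_t mag = x & 0x7FFFFFFFu;

    if (mag >= 0x7F800000u)            /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (mag > 0x7F800000u ? 0x0200u : 0u));
    if (mag >= 0x477FF000u)            /* >= 65520 rounds to Inf */
        return (uint16_t)(sign | 0x7C00u);
    if (mag <= 0x33000000u)            /* <= 2^-25 rounds to zero */
        return sign;

    uint32_t half, rem, tie;
    if (mag < 0x38800000u) {           /* result is a half subnormal */
        uint32_t shift = 126u - (mag >> 23);
        uint32_t mant = (mag & 0x007FFFFFu) | 0x00800000u;
        half = mant >> shift;
        rem = mant & ((1u << shift) - 1u);
        tie = 1u << (shift - 1u);
    } else {                           /* result is a half normal */
        half = (mag - 0x38000000u) >> 13;
        rem = mag & 0x1FFFu;
        tie = 0x1000u;
    }
    if (rem > tie || (rem == tie && (half & 1u)))
        half++;                        /* a carry into the exponent is correct */
    return (uint16_t)(sign | half);
}

/* binary16 -> binary32; exact for every half value. */
float f16_to_f32(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                /* Inf or NaN */
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {             /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {            /* signed zero */
        bits = sign;
    } else {                           /* subnormal: normalise the mantissa */
        exp = 113u;
        while (!(mant & 0x400u)) {
            mant <<= 1;
            exp--;
        }
        bits = sign | (exp << 23) | ((mant & 0x3FFu) << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The widening direction is lossless, so the precision cost of fp16 I/O comes entirely from the narrowing step: half precision keeps 11 significant bits, which is the origin of the looser FP16 tolerance quoted in the next section.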
Expected cross-device variance
Running the same .onnx on two different EPs produces near-identical scores:
- CPU vs CUDA (FP32): within 1e-4.
- CPU vs CUDA (FP16 via --tiny-fp16): within 1e-2.
CI exercises CPU-only; GPU parity is checked manually on the dev workstation for now (planned: self-hosted runner).
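Until a GPU runner exists, a manual parity check can be scripted around a per-frame comparison such as the sketch below. It assumes the tolerances above are absolute differences on per-frame scores (the text does not say whether they are absolute or relative), and scores_within is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parity helper: true when every pair of per-frame scores
 * differs by at most `tol` in absolute terms. */
bool scores_within(const float *a, const float *b, size_t n, float tol)
{
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        /* Written as !(d <= tol) so that a NaN score also fails. */
        if (!(d <= tol))
            return false;
    }
    return true;
}
```

A caller would pass 1e-4f when comparing FP32 runs and 1e-2f when one side ran with --tiny-fp16.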