ADR-0022: Inference runtime is ONNX Runtime via execution providers¶
- Status: Accepted
- Date: 2026-04-17
- Deciders: Lusoris, Claude (Anthropic)
- Tags: ai, dnn, cuda, sycl, build
Context¶
The fork already ships CPU / CUDA / SYCL / HIP backends. Tiny-AI inference needs a runtime that spans the same matrix without forking inference code per backend, and must be safely sandboxable with an operator allowlist.
Decision¶
We will use ONNX Runtime C API; gate the build with Meson -Denable_dnn=auto; ORT execution providers map to libvmaf backends — CPU → CPU EP, CUDA → CUDA EP, SYCL/Intel → OpenVINO EP, HIP → ROCm EP.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| TensorRT | Fast on NVIDIA | NVIDIA-only; no CPU/SYCL/HIP path | Splits inference code per backend |
| OpenVINO direct | Great on Intel | Intel-focused; no CUDA/HIP | Same |
| Per-backend native runtimes | Max perf | Huge duplication; each backend diverges | Too expensive |
| ONNX Runtime (chosen) | One runtime; execution providers match our backend matrix 1:1; sandboxed graph | Some perf overhead vs native | Rationale: avoids forking inference paths; op allowlist restricts further |
Consequences¶
- Positive: single inference code path; execution provider selection mirrors backend selection.
- Negative: ORT pulls in transitive deps; must maintain op allowlist.
- Neutral / follow-ups: ADR-0039 wires the runtime op-allowlist + registry.
References¶
- Source:
Q5.3 - Related ADRs: ADR-0020, ADR-0021, ADR-0023, ADR-0039