Tiny-AI — PTQ accuracy across Execution Providers¶
Empirical comparison of int8 post-training-quantisation (PTQ) PLCC drop across the Execution Providers (EPs) the fork actually has hardware for: ORT CPUExecutionProvider, ORT CUDAExecutionProvider, and the Intel OpenVINO runtime targeting either an Intel Arc dGPU or the OpenVINO CPU plugin.
The harness lives in ai/scripts/measure_quant_drop_per_ep.py and is the sibling of the CPU-only measure_quant_drop.py gate. The CPU gate is what CI enforces; this script is the investigation tool used to verify that a model that passes on CPU also behaves on GPU EPs before we ship it as a non-CPU default.
What it measures¶
For each (model, EP) pair:
- Load the fp32 ONNX through the chosen EP. The script reuses the
_load_sessionworkaround introduced in PR #165 — some shipped ONNX (vmaf_tiny_v1.onnx,vmaf_tiny_v1_medium.onnx) carry a staleexternal_data.locationpointing at a renamedmlp_*_final.onnx.data; the loader rewrites the entry in memory. - Either reuse the registry-shipped int8 sibling (
<basename>.int8.onnx) or, for fp32-only baselines passed via--extra-fp32, materialise a dynamic-PTQ int8 graph at runtime (matches the published mode forlearned_filter_v1). - Run 16 deterministic synthetic samples (RNG seed 0, fp32 input in
[0, 1)) through both fp32 and int8, collect the headline output, compute the Pearson linear correlation across the concatenated outputs, report1 - PLCCas the drop and assert it stays under the per-modelquant_accuracy_budget_plccfrommodel/tiny/registry.json. - The script also tracks the worst per-sample max-absolute delta so that a degenerate model whose PLCC is undefined (constant output) still surfaces actual numerical divergence.
When invoked with --gate, the exit code is non-zero if any pair exceeds its budget — useful for ad-hoc CI. The default mode is "investigate" (always exit 0; emit JSON + Markdown).
Running it¶
# CUDA EP needs the bundled CUDA-12 ABI .so files on LD_LIBRARY_PATH;
# our system has CUDA 13 installed but ORT-GPU 1.25 wheels expect 12.
SP="$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia"
export LD_LIBRARY_PATH="$SP/cublas/lib:$SP/cudnn/lib:$SP/cuda_nvrtc/lib:$SP/cuda_runtime/lib:$SP/cufft/lib:$SP/curand/lib:$SP/cusolver/lib:$SP/cusparse/lib:$SP/cuda_cupti/lib:$SP/nvtx/lib:$SP/nvjitlink/lib"
python ai/scripts/measure_quant_drop_per_ep.py \
--eps cpu cuda openvino \
--extra-fp32 vmaf_tiny_v1.onnx vmaf_tiny_v1_medium.onnx \
--openvino-device GPU.0 \
--out runs/quant-eps-$(date +%Y-%m-%d)
Outputs results.json (machine-readable) and results.md (the human-readable table that goes into the research digest). The JSON report carries the shared run_provenance block, recording the script revision, argv, selected EPs, hardware tag, tiny-model registry input, optional fp32 baselines, and both report targets. The runs/ directory is gitignored — copy the headline numbers into docs/research/0006-tinyai-ptq-accuracy-targets.md when refreshing the GPU-EP findings section.
Headline findings (2026-04-28)¶
CPU EP and CUDA EP agree to 6 decimal places on every shipped tiny model; the OpenVINO CPU plugin agrees to ~10⁻⁴ PLCC drop. The Intel Arc A380 (GPU.0 through OpenVINO 2026.1) is currently int8-broken — Conv-based int8 graphs fail to compile, MLP int8 graphs compile but emit inf/NaN. The fp32 path is healthy on Arc, so the runtime can fall back to fp32 (or the OpenVINO CPU plugin) for a reliable inference there.
See the full table + failure-mode breakdown in docs/research/0006-tinyai-ptq-accuracy-targets.md §"GPU-EP quantisation".
When to re-run¶
- After bumping
onnxruntime-gpuoropenvinomajor versions (the int8 layout failure on Arc may resolve in a later intel_gpu plugin). - Before adding a new int8 model to the registry — paste the CUDA-EP row into the model-card and the research digest.
- When changing the dynamic-PTQ defaults (e.g. switching from per-tensor to per-channel weight quantisation).
Limitations¶
- The 16 synthetic samples are deterministic but not distribution-realistic. The CPU gate's number is the one that matches the soak-test fixtures; this harness's number is a cross-EP comparison, not a regression on real video.
- The OpenVINO path uses the native
openvinoPython runtime (notonnxruntime-openvino) because nocp314wheel for the ORT-OpenVINO bridge exists on PyPI as of this writing. The numerics are equivalent; the API is just different. - Only the first model output is compared. Multi-output models would need a small extension to the runner.