ADR-0129: Tiny-AI post-training int8 quantisation — static + dynamic + QAT per model¶
- Status: Accepted
- Date: 2026-04-20
- Deciders: Lusoris, Claude (Anthropic)
- Tags: ai, onnx, quantization, model, docs
Context¶
The fork's tiny-AI surface (ai/ for training, core/src/dnn/ for the ONNX Runtime integration) currently ships fp32 ONNX models. Even the smallest fork-trained model (tiny-vmaf-v1.onnx, ~4 MB) dominates per-frame CPU cost on low-end boxes and embedded ARM platforms. The ONNX Runtime int8 story is mature: the onnxruntime.quantization Python module covers static (calibration-based), dynamic (per-activation runtime quant), and QAT (quant-aware training) in one API family; the runtime CPUExecutionProvider has been shipping QDQ-format int8 kernels since 2022.
What we don't currently have is a policy for which quantisation mode each model gets. The three modes trade off in different directions:
- Static PTQ: highest accuracy preservation among PTQ techniques; requires a representative calibration dataset; needs to be re-run when the calibration set changes. Best fit for models where we control training and have a stable calibration set.
- Dynamic PTQ: cheapest (single CLI command, no calibration data); quantises weights offline, activations at runtime; accuracy penalty slightly larger than static. Best fit for models where calibration data is unavailable or where the deployment box differs from the training box.
- QAT: largest training-time investment (needs a second training phase with fake-quant ops in the forward pass); recovers most of the static PTQ accuracy loss; requires the original training code to cooperate. Best fit for models where PTQ drops accuracy below the VMAF-soak tolerance band.
The user directive on 2026-04-20 was explicit: "static, dynamic and QAT" — all three, each used where it fits. What the project lacked was a mechanism for declaring which model uses which, and a harness for running the regression comparison.
The existing model/registry.json (one entry per shipped model) is the natural place to carry the quantisation decision per model.
Decision¶
We will add a per-model quant_mode field to model/registry.json and a three-script PTQ harness under ai/scripts/ that produces quantised artefacts committed alongside the fp32 originals.
- Registry schema extension: each model entry gains
{
"quant_mode": "fp32" | "static" | "dynamic" | "qat",
"quant_calibration_set": "path/to/calibration.bin", // static only
"quant_accuracy_budget_plcc": 0.01 // max allowed PLCC drop vs fp32
}
- Three scripts under
ai/scripts/: ptq_static.py— loads fp32 ONNX, runsonnxruntime.quantization.quantize_staticagainst the calibration set named in the registry.ptq_dynamic.py— one-liner wrapper aroundquantize_dynamic; no calibration needed.qat_train.py— wraps the existing tiny-AI PyTorch trainer withtorch.quantization.prepare_qat_fx, converts to ONNX with quantised ops at the end.- Artefact layout: quantised ONNX sits next to fp32 ONNX with
.int8.onnxsuffix.model/registry.jsonpoints the runtime to the.int8.onnxfile iffquant_mode != "fp32"; the fp32 file is kept as the regression baseline. - Accuracy gate: a new CI leg (
ai-quant-accuracy) runs the quantised model against the same VMAF soak-test fixtures the fp32 model was validated on, and asserts Pearson-linear-correlation-drop against thequant_accuracy_budget_plccthreshold from the registry. A drop beyond budget fails the PR. - Runtime switch: the ONNX Runtime initialisation in
core/src/dnn/inspects the registry entry and loads the quantised file transparently. Users see no API change; the model just runs faster on int8-capable CPUs. - First target models:
tiny-vmaf-v1.onnx→ dynamic (we have no calibration set checked in; the 2x speedup is worth the accuracy cost on an already-small model).- Future SSIMULACRA-2-adjacent models that ship with a calibration set → static.
- Models where static drops PLCC past budget → QAT (deferred until we hit a concrete case; empirically ~1 in 4 tiny-AI models need it).
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Three modes via registry (chosen) | Fine-grained per-model control; audit trail in the registry; clean CI gate per model | Three scripts to maintain; must keep the registry schema and the runtime in sync | Matches the user's explicit "static + dynamic + QAT" directive and the existing registry.json architecture |
| Dynamic-only (pick the easiest) | One script; no calibration data management; covers 80% of the speedup | Models that need static or QAT would silently drift past accuracy budget | Explicitly rejected by the user — "somehow I want static, dynamic and qat" |
| Static-only (pick the strongest PTQ) | Best PTQ accuracy; single codepath | Calibration-set management becomes a submodule; blocks models for which calibration data is unavailable | Misses the "dynamic, no-calibration-data" use case users care about |
| Leave fp32 as-is, optimise elsewhere | Zero new code; no accuracy risk | Leaves the 2–4x inference speedup on the floor; mobile / embedded story stays weak | Doesn't match the modernisation goal |
| Quant decision in Python only, no registry field | Less schema surface | Decision is hidden in a training notebook; impossible to audit per-model post-hoc | Contradicts the fork's existing "registry.json is the source of truth for model metadata" pattern |
Consequences¶
Positive
- Closes the 2–4x inference-speedup gap on int8-capable CPUs (most modern x86 + all ARMv8.2+) for tiny-AI paths.
- Per-model quant-mode field documents the decision permanently at the model level — future maintainers don't have to reconstruct "why is this one static?".
- QAT is now reachable without re-architecting the training pipeline; the existing trainer gains one optional phase.
- The accuracy-budget field + CI gate prevents silent regressions when a quantised model is rebuilt against a new calibration set.
Negative
- Three scripts (static / dynamic / QAT) duplicate about 60% of the ORT API surface. Acceptable because each has distinct user-facing semantics.
- Calibration-set storage: static PTQ calibration data is small (~50 MB) but still new binary under
ai/calibration/. Tracked via git LFS (already set up for fork-trained model weights). - QAT re-training is expensive (roughly 1.5× the fp32 training time). Runs only on-demand; not part of every training cycle.
Neutral
- No impact on the Netflix CPU golden gate — it never exercises tiny-AI models.
- No change to the public C ABI — the quantisation is entirely internal to model loading.
mcp-server/vmaf-mcp/sees no change; the model swap is transparent to the MCP surface.
Alignment with the audit-first directive¶
Per the user's direction on 2026-04-20 ("Audit first" for the tiny-AI model registry), the first implementation PR does not immediately quantise every model. The sequence is:
- Audit PR: extend the registry schema, add the three scripts, add the CI accuracy-gate leg, but do not change any existing model's
quant_modefromfp32. Purely infrastructural. - Per-model quantisation PRs: one PR per model, each with its own accuracy-drop measurement in the PR body and its own soak-test result attached.
- If a model fails the static budget, escalate to QAT in a follow-up PR rather than relaxing the budget.
This keeps each quantisation decision reviewable in isolation.
References¶
- [req] AskUserQuestion popup answered 2026-04-20: PTQ int8 scope → "somehow I want static, dynamic and qat!!!!"; harness layout → "registry.json per-model (Recommended)"; first workstream action → "Audit first".
- Research-0006 — accuracy regression targets, ORT API comparison, calibration-set sourcing.
- ONNX Runtime quantization docs
- PyTorch QAT guide
- ADR-0042 — tiny-AI docs-per-PR rule applies to this workstream.
model/registry.json— the registry this ADR extends.ai/— the training directory where the PTQ / QAT scripts land.- CLAUDE.md §12 r10 — per-surface docs rule (quant-mode user-visible → doc entry under
docs/ai/quantization.md).
Status update 2026-05-08: Accepted¶
Audited as part of the 2026-05-08 ADR Proposed sweep (Research-0086).
Acceptance criteria verified in tree at HEAD 0a8b539e:
model/tiny/registry.schema.jsoncarriesquant_mode(line 64),quant_calibration_set(line 70),quant_int8_sha256(line 75),quant_accuracy_budget_plcc(line 79).- PTQ scripts under
ai/scripts/:ptq_static.py,ptq_dynamic.py,measure_quant_drop.py,measure_quant_drop_per_ep.py. - ADR-0173 (Accepted) shipped the audit-first registry/CI plumbing.
- ADR-0174 (Accepted) shipped the first per-model PTQ (
learned_filter_v1onquant_mode: "dynamic"). - ADR-0207 / ADR-0208 (this sweep, both Accepted) extend the policy to QAT.
- Verification command:
grep -E "quant_mode|quant_accuracy_budget_plcc" model/tiny/registry.schema.json.