ADR-0129: Tiny-AI post-training int8 quantisation — static + dynamic + QAT per model

Status: Accepted
Date: 2026-04-20
Deciders: Lusoris, Claude (Anthropic)
Tags: ai, onnx, quantization, model, docs

Context

The fork's tiny-AI surface (ai/ for training, core/src/dnn/ for the ONNX Runtime integration) currently ships fp32 ONNX models. Even the smallest fork-trained model (tiny-vmaf-v1.onnx, ~4 MB) dominates per-frame CPU cost on low-end boxes and embedded ARM platforms. The ONNX Runtime int8 story is mature: the onnxruntime.quantization Python module covers static PTQ (calibration-based) and dynamic PTQ (activation ranges computed at runtime), and ONNX Runtime can run models produced by quantisation-aware training (QAT) in the original framework once they are exported to ONNX; the runtime CPUExecutionProvider has been shipping QDQ-format int8 kernels since 2022.

What we don't currently have is a policy for which quantisation mode each model gets. The three modes trade off in different directions:

Static PTQ: highest accuracy preservation among PTQ techniques; requires a representative calibration dataset; needs to be re-run when the calibration set changes. Best fit for models where we control training and have a stable calibration set.
Dynamic PTQ: cheapest (single CLI command, no calibration data); quantises weights offline, activations at runtime; accuracy penalty slightly larger than static. Best fit for models where calibration data is unavailable or where the deployment box differs from the training box.
QAT: largest training-time investment (needs a second training phase with fake-quant ops in the forward pass); recovers most of the static PTQ accuracy loss; requires the original training code to cooperate. Best fit for models where PTQ drops accuracy below the VMAF-soak tolerance band.
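
The difference between the static and dynamic trade-offs above comes down to where the activation range is taken from. The following is a toy sketch in plain Python, not ONNX Runtime's actual kernels or calibration methods: all function names and sample values are made up for illustration. It quantises one runtime input twice, once with a range fixed from calibration batches (static) and once with a range computed from the input itself (dynamic).

```python
"""Toy illustration of how static and dynamic PTQ pick activation ranges."""


def affine_params(lo: float, hi: float, qmin: int = -128, qmax: int = 127):
    """Return (scale, zero_point) mapping the float range [lo, hi] onto int8."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantise(values, scale, zero_point, qmin=-128, qmax=127):
    """Round to int8 and clamp anything outside the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantise(q_values, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


def max_abs_error(values, scale, zero_point):
    """Worst-case round-trip error for one input under the given parameters."""
    restored = dequantise(quantise(values, scale, zero_point), scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))


# Static PTQ: the range is fixed offline from calibration batches.
calibration = [[-1.0, 0.5, 2.0], [-0.8, 1.5, 1.9]]
static_params = affine_params(
    min(min(batch) for batch in calibration),
    max(max(batch) for batch in calibration),
)

# Dynamic PTQ: the range is recomputed from each runtime input.
runtime_input = [-0.2, 0.1, 4.0]  # 4.0 lies outside the calibrated range
dynamic_params = affine_params(min(runtime_input), max(runtime_input))

static_err = max_abs_error(runtime_input, *static_params)
dynamic_err = max_abs_error(runtime_input, *dynamic_params)
print(f"static error:  {static_err:.4f}")
print(f"dynamic error: {dynamic_err:.4f}")
```

In this contrived case the static parameters clamp the out-of-range value, which is the failure mode a representative calibration set is meant to prevent. When the calibration set does match deployment data, the fixed range avoids the per-inference range computation that dynamic mode pays for.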

The user directive on 2026-04-20 was explicit: "static, dynamic and QAT" — all three, each used where it fits. What the project lacked was a mechanism for declaring which model uses which, and a harness for running the regression comparison.

The existing model/registry.json (one entry per shipped model) is the natural place to carry the quantisation decision per model.

Decision

We will add a per-model quant_mode field to model/registry.json and a three-script quantisation harness under ai/scripts/ (two PTQ scripts plus one QAT wrapper) that produces quantised artefacts committed alongside the fp32 originals.

Registry schema extension: each model entry gains

{
  "quant_mode": "fp32" | "static" | "dynamic" | "qat",
  "quant_calibration_set": "path/to/calibration.bin",  // static only
  "quant_accuracy_budget_plcc": 0.01                    // max allowed PLCC drop vs fp32
}
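
As a rough illustration of how these fields constrain each other, here is a minimal validation sketch. The function name and the exact rules are assumptions, not the fork's actual schema check: in particular, treating a missing quant_mode as fp32, requiring a budget for every non-fp32 mode, and rejecting a calibration set outside static mode are our reading of the comments above.

```python
"""Hypothetical validator for the quantisation fields of one registry entry."""

VALID_MODES = {"fp32", "static", "dynamic", "qat"}


def validate_quant_fields(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []
    mode = entry.get("quant_mode", "fp32")  # assumption: missing field means fp32
    if mode not in VALID_MODES:
        problems.append(f"unknown quant_mode {mode!r}")

    has_calibration = bool(entry.get("quant_calibration_set"))
    if mode == "static" and not has_calibration:
        problems.append("static mode requires quant_calibration_set")
    if mode != "static" and has_calibration:
        # assumption: the "static only" comment means the field is rejected elsewhere
        problems.append("quant_calibration_set is only meaningful for static mode")

    if mode != "fp32":
        budget = entry.get("quant_accuracy_budget_plcc")
        is_number = isinstance(budget, (int, float)) and not isinstance(budget, bool)
        if not is_number or not 0 <= budget <= 1:
            problems.append("quant_accuracy_budget_plcc must be a number in [0, 1]")
    return problems


if __name__ == "__main__":
    entry = {"quant_mode": "static", "quant_accuracy_budget_plcc": 0.01}
    for problem in validate_quant_fields(entry):
        print(problem)
```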

Three scripts under ai/scripts/:
ptq_static.py: loads the fp32 ONNX model and runs onnxruntime.quantization.quantize_static against the calibration set named in the registry.
ptq_dynamic.py: thin wrapper around onnxruntime.quantization.quantize_dynamic; no calibration data needed.
qat_train.py: wraps the existing tiny-AI PyTorch trainer with FX-graph-mode QAT (torch.ao.quantization.quantize_fx.prepare_qat_fx) and exports to ONNX with quantised ops at the end.
Artefact layout: quantised ONNX sits next to fp32 ONNX with .int8.onnx suffix. model/registry.json points the runtime to the .int8.onnx file iff quant_mode != "fp32"; the fp32 file is kept as the regression baseline.
Accuracy gate: a new CI leg (ai-quant-accuracy) runs the quantised model against the same VMAF soak-test fixtures the fp32 model was validated on and asserts that the drop in Pearson linear correlation coefficient (PLCC) relative to fp32 does not exceed the quant_accuracy_budget_plcc threshold from the registry. A drop beyond budget fails the PR.
Runtime switch: the ONNX Runtime initialisation in core/src/dnn/ inspects the registry entry and loads the quantised file transparently. Users see no API change; the model just runs faster on int8-capable CPUs.
First target models:
tiny-vmaf-v1.onnx → dynamic (we have no calibration set checked in; the 2x speedup is worth the accuracy cost on an already-small model).
Future SSIMULACRA-2-adjacent models that ship with a calibration set → static.
Models where static drops PLCC past budget → QAT (deferred until we hit a concrete case; empirically ~1 in 4 tiny-AI models need it).
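
The artefact-naming rule and the two PTQ entry points can be sketched together as follows. This is a hedged outline, not the content of the shipped scripts: int8_path, runtime_model_path and run_ptq are hypothetical names, the real runtime switch lives in core/src/dnn/ rather than in Python, and the choice of QDQ format with int8 weights and activations is an assumption. The onnxruntime.quantization calls themselves (quantize_dynamic, quantize_static, QuantFormat, QuantType) are the library's public API.

```python
"""Sketch of the .int8.onnx naming rule and a PTQ dispatch on quant_mode."""
from pathlib import Path


def int8_path(fp32_path: str) -> Path:
    """model.onnx -> model.int8.onnx, next to the fp32 original."""
    return Path(fp32_path).with_suffix(".int8.onnx")


def runtime_model_path(fp32_path: str, quant_mode: str) -> Path:
    """Pick the file to load: the int8 artefact iff quant_mode != 'fp32'."""
    return Path(fp32_path) if quant_mode == "fp32" else int8_path(fp32_path)


def run_ptq(fp32_path: str, quant_mode: str, calibration_reader=None) -> Path:
    """Produce the int8 artefact for the two PTQ modes (QAT goes via the trainer)."""
    # Imported lazily so the path helpers above work without onnxruntime installed.
    from onnxruntime.quantization import (
        QuantFormat,
        QuantType,
        quantize_dynamic,
        quantize_static,
    )

    out = int8_path(fp32_path)
    if quant_mode == "dynamic":
        # Weights are quantised offline; activation ranges are computed at runtime.
        quantize_dynamic(fp32_path, str(out), weight_type=QuantType.QInt8)
    elif quant_mode == "static":
        if calibration_reader is None:
            raise ValueError("static PTQ needs a CalibrationDataReader")
        quantize_static(
            fp32_path,
            str(out),
            calibration_reader,  # yields input dicts via get_next(), then None
            quant_format=QuantFormat.QDQ,
            activation_type=QuantType.QInt8,
            weight_type=QuantType.QInt8,
        )
    else:
        raise ValueError(f"run_ptq does not handle quant_mode={quant_mode!r}")
    return out


if __name__ == "__main__":
    print(runtime_model_path("model/tiny-vmaf-v1.onnx", "dynamic"))
```

Keeping the naming rule in one helper means the script that writes the artefact and the loader that reads it cannot disagree about the suffix.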

Alternatives considered

Option	Pros	Cons	Why not chosen
Three modes via registry (chosen)	Fine-grained per-model control; audit trail in the registry; clean CI gate per model	Three scripts to maintain; must keep the registry schema and the runtime in sync	Matches the user's explicit "static + dynamic + QAT" directive and the existing registry.json architecture
Dynamic-only (pick the easiest)	One script; no calibration data management; covers 80% of the speedup	Models that need static or QAT would silently drift past accuracy budget	Explicitly rejected by the user — "somehow I want static, dynamic and qat"
Static-only (pick the strongest PTQ)	Best PTQ accuracy; single codepath	Calibration-set management becomes a submodule; blocks models for which calibration data is unavailable	Misses the "dynamic, no-calibration-data" use case users care about
Leave fp32 as-is, optimise elsewhere	Zero new code; no accuracy risk	Leaves the 2–4x inference speedup on the floor; mobile / embedded story stays weak	Doesn't match the modernisation goal
Quant decision in Python only, no registry field	Less schema surface	Decision is hidden in a training notebook; impossible to audit per-model post-hoc	Contradicts the fork's existing "registry.json is the source of truth for model metadata" pattern

Consequences

Positive

Closes the 2–4x inference-speedup gap on int8-capable CPUs (most modern x86 + all ARMv8.2+) for tiny-AI paths.
Per-model quant-mode field documents the decision permanently at the model level — future maintainers don't have to reconstruct "why is this one static?".
QAT is now reachable without re-architecting the training pipeline; the existing trainer gains one optional phase.
The accuracy-budget field + CI gate prevents silent regressions when a quantised model is rebuilt against a new calibration set.
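
To make the accuracy-budget check concrete, here is a minimal sketch of the comparison the ai-quant-accuracy leg performs. It assumes that "PLCC drop" means PLCC(reference, fp32 scores) minus PLCC(reference, int8 scores) on the same fixtures; the ADR does not spell out the formula, so that reading, the function names, and the sample scores are all illustrative.

```python
"""Sketch of the accuracy gate: PLCC drop of int8 vs fp32 against a budget."""
from math import sqrt


def plcc(xs, ys) -> float:
    """Pearson linear correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PLCC is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def plcc_drop(reference, fp32_scores, int8_scores) -> float:
    """Positive when the int8 model correlates worse with the reference."""
    return plcc(reference, fp32_scores) - plcc(reference, int8_scores)


def within_budget(reference, fp32_scores, int8_scores, budget: float) -> bool:
    """True when the drop does not exceed quant_accuracy_budget_plcc."""
    return plcc_drop(reference, fp32_scores, int8_scores) <= budget


if __name__ == "__main__":
    # Made-up scores; real runs would use the VMAF soak-test fixtures.
    reference = [10.0, 20.0, 30.0, 40.0, 50.0]
    fp32_scores = [11.0, 19.0, 31.0, 41.0, 49.0]
    int8_scores = [12.0, 18.0, 33.0, 39.0, 52.0]
    ok = within_budget(reference, fp32_scores, int8_scores, budget=0.01)
    print("PASS" if ok else "FAIL")
```

Because the drop is measured against the fp32 result on the same fixtures, the gate stays meaningful even if the fp32 model's absolute PLCC changes between training runs.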

Negative

The three scripts (static / dynamic / QAT) overlap in roughly 60% of the ORT API surface they touch. Acceptable because each has distinct user-facing semantics.
Calibration-set storage: static PTQ calibration data is small (~50 MB) but is still a new binary payload under ai/calibration/. It is tracked via git LFS (already set up for fork-trained model weights).
QAT re-training is expensive (roughly 1.5× the fp32 training time). Runs only on-demand; not part of every training cycle.

Neutral

No impact on the Netflix CPU golden gate — it never exercises tiny-AI models.
No change to the public C ABI — the quantisation is entirely internal to model loading.
mcp-server/vmaf-mcp/ sees no change; the model swap is transparent to the MCP surface.

Alignment with the audit-first directive

Per the user's direction on 2026-04-20 ("Audit first" for the tiny-AI model registry), the first implementation PR does not immediately quantise every model. The sequence is:

Audit PR: extend the registry schema, add the three scripts, add the CI accuracy-gate leg, but do not change any existing model's quant_mode from fp32. Purely infrastructural.
Per-model quantisation PRs: one PR per model, each with its own accuracy-drop measurement in the PR body and its own soak-test result attached.
If a model fails the static budget, escalate to QAT in a follow-up PR rather than relaxing the budget.

This keeps each quantisation decision reviewable in isolation.

References

[req] AskUserQuestion popup answered 2026-04-20: PTQ int8 scope → "somehow I want static, dynamic and qat!!!!"; harness layout → "registry.json per-model (Recommended)"; first workstream action → "Audit first".
Research-0006 — accuracy regression targets, ORT API comparison, calibration-set sourcing.
ONNX Runtime quantization docs
PyTorch QAT guide
ADR-0042 — tiny-AI docs-per-PR rule applies to this workstream.
model/registry.json — the registry this ADR extends.
ai/ — the training directory where the PTQ / QAT scripts land.
CLAUDE.md §12 r10 — per-surface docs rule (quant-mode user-visible → doc entry under docs/ai/quantization.md).

Status update 2026-05-08: Accepted

Audited as part of the 2026-05-08 ADR Proposed sweep (Research-0086).

Acceptance criteria verified in tree at HEAD 0a8b539e:

model/tiny/registry.schema.json carries quant_mode (line 64), quant_calibration_set (line 70), quant_int8_sha256 (line 75), quant_accuracy_budget_plcc (line 79).
PTQ scripts under ai/scripts/: ptq_static.py, ptq_dynamic.py, measure_quant_drop.py, measure_quant_drop_per_ep.py.
ADR-0173 (Accepted) shipped the audit-first registry/CI plumbing.
ADR-0174 (Accepted) shipped the first per-model PTQ (learned_filter_v1 on quant_mode: "dynamic").
ADR-0207 / ADR-0208 (this sweep, both Accepted) extend the policy to QAT.
Verification command: grep -E "quant_mode|quant_accuracy_budget_plcc" model/tiny/registry.schema.json.