Skip to content

ADR-0275: vmaf_tiny_v3 and vmaf_tiny_v4 join dynamic-PTQ family (T5-3d follow-up)

  • Status: Accepted
  • Date: 2026-05-03
  • Deciders: Lusoris, Claude (Anthropic)
  • Tags: tiny-ai, onnx, quantization, registry, fork-local

Context

ADR-0173 shipped the audit-first PTQ harness. ADR-0174 flipped the first per-model entry — learned_filter_v1 — into quant_mode: "dynamic". ADR-0248 added nr_metric_v1 once its value_info shape-inference issue was resolved.

The tiny-AI ladder for VMAF feature fusion now has three rungs: vmaf_tiny_v2 (mlp_small, ~257 params), vmaf_tiny_v3 (mlp_medium, ~769 params; ADR-0241), and vmaf_tiny_v4 (mlp_large, ~3 073 params; ADR-0242). v2's quantisation analysis is moot — its fp32 ONNX is 2 446 bytes; the weight tensors are a tiny fraction of that and an int8 sidecar would not deliver a meaningful size win. v3 and v4 are the first VMAF feature-fusion tier where the question is worth asking.

This ADR closes that gap by shipping dynamic-PTQ int8 sidecars for both v3 and v4 so the runtime redirect from ADR-0174 has a target when an operator opts into v3 or v4 with quantisation enabled in their registry overlay.

Decision

We will (1) produce vmaf_tiny_v3.int8.onnx and vmaf_tiny_v4.int8.onnx via ai/scripts/ptq_dynamic.py, (2) add both models to model/tiny/registry.json with quant_mode: "dynamic", int8_sha256, and quant_accuracy_budget_plcc: 0.01, (3) mirror those fields into the per-model sidecars model/tiny/vmaf_tiny_v3.json and vmaf_tiny_v4.json, and (4) extend the ai-quant-accuracy CI gate's coverage transparently (the gate already iterates every non-fp32 registry entry).

Both models stay inside the 0.01 PLCC budget by two orders of magnitude on the Netflix-features parquet (~11k rows of canonical-6 inputs the registered v3/v4 graphs were trained on):

Model fp32 → int8 size PLCC drop (vs fp32 on Netflix) Headroom vs 0.01 budget
vmaf_tiny_v3 4 496 B → 4 267 B (×0.95) 0.000120 ×83
vmaf_tiny_v4 14 046 B → 7 769 B (×0.55, -45 %) 0.000145 ×69

The PLCC self-similarity (int8 vs fp32 on the same inputs) is 0.999963 / 0.999958. KoNViD cross-corpus drop on the canonical-6 parquet (~270k rows) is 0.000177 / 0.000080 — both still inside budget.

Alternatives considered

Option Pros Cons Why not chosen
Static PTQ with calibration data Slightly tighter accuracy; per-channel scales Requires shipping a calibration .npz (~1 MB of canonical-6 vectors); dynamic already inside budget by ~70× Rejected. ADR-0174 precedent: don't add the calibration-asset cost until a budget violation forces it.
Per-channel dynamic (--per-channel) Marginal accuracy improvement on weight-rich models Negligible PLCC delta on graphs this small (already at 1e-4 drop); slightly larger int8 file from per-row scale arrays Rejected. The per-tensor default already lands two orders below budget; per-channel is a follow-up only if a future architecture rung erodes headroom.
Skip v3 entirely (it shrinks only 5 %) Avoids shipping a barely-smaller sidecar Breaks "every quantisable rung is registered" CI invariant; runtime redirect would surprise operators who set quant_mode=dynamic on v3 Rejected. The size win is small, but the gate-coverage and registry-completeness wins justify the 4 KB on-disk cost.
Wait for QAT (ADR-0129) Best-in-class accuracy retention Requires a training-time pipeline; v3 / v4 trainers don't yet emit a quant-aware graph Rejected for now. PTQ inside budget is the cheaper first step; QAT is tracked as a global escalation lever, not a per-model blocker.

Consequences

  • Positive:
  • Closes the v3 / v4 gap in the dynamic-PTQ family. The registry's quantised set now covers learned_filter_v1, nr_metric_v1, vmaf_tiny_v3, vmaf_tiny_v4.
  • v4 in particular shrinks 45 % on disk, making it cheaper to bundle in deploys that pin v4 over v3 for absolute-top-of-ladder PLCC.
  • The ai-quant-accuracy CI gate's coverage matrix grows by two rows transparently — it already iterates models[] and skips fp32 entries.
  • Negative:
  • Two new int8 sidecar files in-tree (4 267 B + 7 769 B = ~12 KB total). Both are well under the "few-MB" external-data threshold, so they ship as committed binaries rather than via the sigstore + .onnx.data pattern (mirroring learned_filter_v1 and nr_metric_v1).
  • v3's size delta is small (×0.95). The ADR's gate-coverage rationale stays the win; readers should not expect the learned_filter_v1 2.4× shrink on every model.
  • Neutral / follow-ups:
  • Sigstore bundles for v3 / v4 fp32 + int8 are populated at release time by .github/workflows/supply-chain.yml; placeholder bundles are not added in this PR.
  • When v5 lands (if ever — ADR-0242 declared the ladder saturated), the same recipe applies: run ptq_dynamic.py, register fields, document.

Tests

  • python ai/scripts/ptq_dynamic.py model/tiny/vmaf_tiny_v3.onnx produces a 4 267-byte int8 file.
  • python ai/scripts/ptq_dynamic.py model/tiny/vmaf_tiny_v4.onnx produces a 7 769-byte int8 file.
  • python ai/scripts/measure_quant_drop.py --all reports [PASS] for both vmaf_tiny_v3 (drop=0.000120) and vmaf_tiny_v4 (drop=0.000145).
  • python ai/scripts/validate_model_registry.py reports OK: 12 registry entries valid against registry.schema.json.

Reproducer

# 1. Quantise.
python ai/scripts/ptq_dynamic.py model/tiny/vmaf_tiny_v3.onnx
python ai/scripts/ptq_dynamic.py model/tiny/vmaf_tiny_v4.onnx

# 2. Gate.
python ai/scripts/measure_quant_drop.py --all
# Expected: [PASS] for vmaf_tiny_v3 + vmaf_tiny_v4 (drops well under 0.01).

# 3. Schema validation.
python ai/scripts/validate_model_registry.py

References

  • ADR-0129 — PTQ policy.
  • ADR-0173 — audit-first PTQ harness.
  • ADR-0174 — first per-model PTQ (learned_filter_v1); established int8_sha256 + quant_accuracy_budget_plcc registry fields and the runtime .int8.onnx redirect.
  • ADR-0248nr_metric_v1 PTQ; same recipe.
  • ADR-0241 — v3 ship decision.
  • ADR-0242 — v4 ship decision.
  • req — user direction 2026-05-03: paraphrased — "add INT8 dynamic-PTQ sidecars for vmaf_tiny_v3 and vmaf_tiny_v4 with the ADR-0174 0.01-PLCC budget."