ADR-0234: GPU-generation-aware ULP calibration head¶
- Status: Accepted (2026-05-03 — calibration-table tier landed; ONNX-head tier remains scoped per "Decision" §)
- Date: 2026-05-01 (proposed) / 2026-05-03 (accepted)
- Deciders: Lusoris, Claude (Anthropic)
- Tags: ai, gpu, vulkan, cuda, sycl, cross-backend, fork-local, t7-gpu-ulp-cal
Context¶
The fork ships four backends (CPU, CUDA, SYCL, Vulkan) and gates them through ADR-0214's scripts/ci/cross_backend_parity_gate.py. The matrix gate confirms that every GPU path stays within the per-feature absolute tolerance (5e-5 for the integer pipeline, 5e-4 / 5e-3 for transcendentals / DCT-heavy features) but never claims bit-exactness. The CPU is the canonical reference; the GPU paths diverge by ~1e-4 on float metrics — small, deterministic, and a direct consequence of fp32-only kernel arithmetic on architectures without native fp64 (see ADR-0220).
Two facts matter for this ADR:
- The divergence is deterministic per (feature, kernel, GPU architecture) — not random noise. Re-running the same
(ref, dist)pair on the same Vulkan device gives the same ~1e-4 delta versus the CPU run. - The cross-backend parity gate already emits a per-cell JSON record (
per_metric_max_abs_diff,per_metric_mismatches,tolerance_abs,n_frames). That record, run over enough frames and broken down per-frame, is a training corpus for a calibration model.
The hypothesis: a tiny per-architecture calibration head (arch_id, raw_gpu_score) → cpu_equivalent_score could close the gap to bit-exact for the user-visible scores, deterministically per arch, without touching the kernel arithmetic. This would not replace the parity gate — it would compose with it: the gate keeps watch over raw GPU scores, the calibration head transforms them into CPU-equivalent values for callers that want bit-exact reproducibility across hardware (e.g. Netflix-golden-replay on a non-Netflix GPU runner).
This is exploratory. The rest of this fork's GPU strategy (ADR-0220, ADR-0214) is "fix it at the kernel" or "accept the known floor"; calibration is a third axis worth scoping but not yet committing to.
Decision¶
We will scope a per-architecture ONNX calibration head (arch_id_one_hot, raw_gpu_score) → cpu_score as a proposal and ship the data-collection plumbing first. The head will live behind a new CLI flag --gpu-calibrated (default off) and will be wired into the existing tiny-AI surface (loadable via the --tiny-model machinery or a dedicated registry entry, decision deferred to the implementation ADR). The flag will not flip default-on until a measurement-driven follow-up ADR cites:
- per-arch held-out PLCC ≥ 0.9999 against the CPU score on the Netflix golden fixtures, and
- a worst-case per-frame absolute residual of ≤ 1e-6 on the calibrated path (i.e. tighter than the current
5e-5parity tolerance, otherwise calibration buys nothing).
The proposal-stage scope is exactly: ADR-0234 (this file), Research-0041 (corpus design), and ai/scripts/collect_gpu_calibration_data.py (data-collection script with a --smoke mode wired to lavapipe so hosted CI can verify the pipeline end-to-end). Training, the ONNX artefact, and the --gpu-calibrated CLI surface are deferred to a follow-up PR (feat(ai): T7-GPU-ULP-CAL — calibration-head v0) gated by the data-collection script producing a real corpus on at least one real GPU.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| (a) Per-arch calibration head (this proposal) | Deterministic per-arch correction; doesn't touch kernel arithmetic; composes with the parity gate; small ONNX (probably < 50 KiB per arch); user opts in via --gpu-calibrated; reuses the tiny-AI runtime already on disk; fits the ADR-0214 "verify, don't modify" philosophy | Adds a new CLI surface, a per-arch training corpus to maintain, and a registry entry with smoke: true until measured; requires a per-arch detection mechanism that works for Vulkan/CUDA/SYCL/HIP; calibration only helps callers that opt in | Picked because it's the only option that closes the residual without re-litigating ADR-0220's fp64-free contract |
| (b) Fix the divergence at the kernel level (matches ADR-0220's approach for SYCL fp64) | Truly closes the gap at the source; no calibration corpus, no new CLI surface, no model registry entry; per ADR-0220 the int64 Q31 path is already the right answer for ADM gain limiting | Requires bit-exact fp32 reductions across every feature × backend cell — a multi-quarter project; some GPUs (Arc A380 et al.) lack the fp64 fallback that would make this trivial; the residual is already below the production tolerance so the ROI on a kernel-level fix is hard to justify against the rebase/maintenance cost; doesn't generalise to future ULP-sensitive consumers (HIP, Metal) | Not chosen yet — kernel-level fixes happen on a per-feature, per-backend basis as their own follow-ups (ADR-0220 was one such); the calibration-head approach is orthogonal and ships in weeks rather than quarters |
| (c) Accept the divergence and do nothing | Zero diff; the parity gate already proves we're within tolerance | Some downstream callers (golden-replay validators, regulatory comparisons) want bit-exactness, not "within tolerance"; the ADR-0220 / ADR-0214 audit explicitly named ULP divergence as a known floor — leaving it untreated forecloses future use cases that need bit-exact CPU equivalence on a GPU runner | Default on the table but rejected because the proposal in (a) is cheap to scope and the corpus collection is a prerequisite for any future treatment, including (b) |
| (d) Fixed per-feature scalar offset (no model, just a constant per (feature, arch) cell) | Tiny — a JSON table; no ONNX runtime dependency; explicit; auditable | The residual is not a constant — it varies per (frame, score-magnitude); a scalar offset would over-correct in some regimes and under-correct in others; would not pass the 1e-6 worst-case bar | Considered as a fallback if the corpus shows the residual really is constant per cell — would obviate the ONNX head entirely. Documented here so future work knows to check |
Consequences¶
- Positive: opens an option for bit-exact GPU↔CPU reproducibility without touching kernel arithmetic; gives the fork a per-arch treatment of the residual that ADR-0214 currently only reports; reuses the tiny-AI runtime; the data-collection script is immediately useful as a richer instrumentation of the parity gate even if calibration never ships (it produces per-frame ULP histograms that the gate's max-abs-diff alone hides).
- Negative: a new CLI flag (
--gpu-calibrated); a per-arch training corpus to keep current as kernels evolve (every kernel change invalidates the corresponding calibration head); a model registry entry that ships assmoke: trueuntil measured; arch-detection plumbing that has to work across CUDA, SYCL, Vulkan, and HIP (the latter not yet shipped — see ADR for HIP scaffold pending). Per CLAUDE.md §12 r1 the Netflix golden assertions remain untouchable; calibration can only run on the GPU side of the pair, never on the CPU reference. - Neutral / follow-ups:
- Research-0041 must converge on a corpus design before any training PR.
- The implementation PR will need its own ADR ("ADR-NNNN: GPU calibration head v0 implementation") with the held-out PLCC / residual measurements and the
--gpu-calibratedCLI contract. - HIP is currently un-scaffolded (T7-10); this proposal does not block on HIP — Vulkan + lavapipe is enough to prove the pipeline.
- If the data-collection script ships and the corpus shows the residual is effectively constant per cell, alternative (d) supersedes this proposal and we drop the ONNX in favour of a JSON offset table. That would be a good outcome.
References¶
- Source:
req— user-direction: scaffold a GPU-generation-aware ULP calibration head as a proposal-stage PR (paraphrased). - ADR-0214 — GPU-parity CI gate. Source of training data; shape of the per-cell JSON record.
- ADR-0220 — SYCL fp64-free contract; documents one cause of the divergence and the alternative-(b) approach this proposal is orthogonal to.
- ADR-0042 — tiny-AI per-PR doc bar; calibration head will inherit it on the implementation PR.
- Research-0041 — corpus design, arch-detection mechanisms, training matrix size estimate, smallest viable training set.
- Companion code:
ai/scripts/collect_gpu_calibration_data.py. - Related PRs: this ADR's PR.
Implementation status (2026-05-03)¶
The calibration-table tier of this ADR shipped under the PR "feat(ci): T-GPU-ULP — per-GPU-gen calibration table". That PR adds scripts/ci/gpu_ulp_calibration.yaml and wires both scripts/ci/cross_backend_vif_diff.py and scripts/ci/cross_backend_parity_gate.py to a per-GPU-id tolerance lookup. Coverage on land:
- 1 calibrated row (Mesa lavapipe — hosted-CI baseline; tolerances match the gate's pre-existing
FEATURE_TOLERANCEdefaults so behaviour is unchanged). - 11 placeholder rows (NVIDIA Ampere/Turing/Ada/Hopper, AMD RDNA2/RDNA3, Intel Arc Alchemist/Battlemage, generic Intel SYCL). Placeholder rows fall back to the per-feature default until a real-hardware corpus replaces their
features:block.
The --gpu-calibrated CLI flag and the ONNX calibration head described in §Decision remain deferred to the follow-up PR feat(ai): T7-GPU-ULP-CAL — calibration-head v0, gated on the data-collection script producing a real corpus on at least one real GPU. The table tier is the prerequisite plumbing — it establishes the per-arch identification surface and the threshold-lookup contract that the head will reuse.