ADR-0214: GPU-parity CI gate (T6-8) — cross-device variance matrix¶
- Status: Accepted
- Date: 2026-04-29
- Deciders: Lusoris, Claude (Anthropic)
- Tags: ci, gpu, vulkan, cuda, sycl, agents, fork-local
Context¶
The fork ships four backends today (CPU, CUDA, SYCL, Vulkan) and is about to add HIP (T7-10). Every feature with a GPU twin is gated today by scripts/ci/cross_backend_vif_diff.py, but the gate runs one cell per CLI invocation: the tests-and-quality-gates.yml vulkan-vif-cross-backend job calls the script ~17 times (once per feature) and each call only diffs CPU↔Vulkan. There is no gate on CPU↔CUDA, CPU↔SYCL, or pairwise GPU↔GPU consistency, no machine-readable summary, and the per-feature tolerance is buried inside a hand-written --places flag on each step (which drifts silently when a feature's empirical floor changes).
Wave-1 roadmap §7 calls for a single matrix gate before Vulkan goes to production. T6-8 is that gate.
Decision¶
We add scripts/ci/cross_backend_parity_gate.py and a new CI job vulkan-parity-matrix-gate in tests-and-quality-gates.yml. The script iterates every (feature, backend-pair) cell, runs vmaf once per backend, diffs per-frame metrics with a per-feature absolute tolerance declared in a single FEATURE_TOLERANCE table, and emits one JSON record + one Markdown row per cell. Default tolerance is 5e-5 (places=4 — the existing fork contract from ADR-0125 / ADR-0138 / ADR-0140); transcendental-heavy features (ciede, psnr_hvs, ssimulacra2) keep their relaxed contracts; a --fp16-features flag forces 1e-2 for the future tiny-AI lane (T7-39 ONNX Runtime FP16). The hosted CI lane uses Mesa lavapipe for Vulkan (no GPU runner needed); CUDA / SYCL / hardware-Vulkan remain advisory on a self-hosted lane until that hardware is plumbed in. The gate never modifies feature implementations — it only verifies.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Per-feature tolerance table (chosen) | Honest about real precision floors (ciede places=2, psnr_hvs places=3, vif places=4); single source of truth | One Python dict to keep current with kernel changes | Aligns with measure-then-set-the-contract from ADR-0188/9; uniform places=4 would force fake-passing or hide the relaxation in a --places flag per CI step |
| Uniform places=4 across all cells | Simplest mental model | Already false (psnr_hvs has been places=3 since ADR-0191; ciede places=2 since ADR-0187); would need per-step overrides anyway | Hides the relaxation in CI YAML rather than in code; reviewers can't see the contract at a glance |
| CUDA / SYCL hard-required on every PR | Catches every regression immediately | Needs self-hosted runner; cost + flake risk on every PR | Lane is if: false until self-hosted runner with the right hardware exists; advisory-only until then |
| Lavapipe-only Vulkan in CI (chosen) | Free hosted runners, no GPU dependency, ~deterministic | No hardware drift coverage (Mesa anv, Intel Arc, NVIDIA proprietary) | Self-hosted Arc lane already exists for vulkan-vif-arc-nightly; T6-8 piggy-backs on it for hardware-Vulkan once enabled |
| JSON-only output | Less code to maintain | Reviewers reading PR comments can't see a status table | Markdown table is ~30 lines; reviewers comment on it directly. JSON for downstream tooling, MD for humans |
Consequences¶
- Positive: one place declares the parity contract for every feature; CI surfaces a single PASS/FAIL plus a per-cell report instead of 17 separate green check-marks; new GPU backends (HIP) drop into the matrix by adding one entry to
BACKEND_SUFFIX; CUDA / SYCL parity becomes a measured value the moment a runner exists (no code change). - Negative: the per-feature
FEATURE_TOLERANCEtable can drift out of agreement with the kernel's empirical floor — bumping tolerance must come with a measurement-driven follow-up ADR per CLAUDE.md §12 r1. - Neutral / follow-ups: the legacy
cross_backend_vif_diff.pystays — sister PRs in this session still wire it as the per-feature smoke. The legacy lane is deleted in the T6-8b PR once the matrix gate has soaked one release cycle. Future--fp16-featuresconsumers (T7-39 ONNX tiny-AI) inherit the gate without further changes.
References¶
req— backlog row T6-8 ("GPU-parity CI gate (CPU ↔ CUDA ↔ SYCL ↔ Vulkan cross-device variance). Wave 1 roadmap §7. Cross-device ULP gate with ≤1e-4 FP32 / ≤1e-2 FP16 tolerance. Blocks Vulkan → production once T5-1b lands.")- ADR-0125 / ADR-0138 / ADR-0140 — places=4 cross-backend contract (places=4 ↔ 5e-5 absolute).
- ADR-0175 — Vulkan backend scaffold; the audit-first shape this PR mirrors.
- ADR-0176 —
cross_backend_vif_diff.pyrationale for places=4 + half-ULP threshold. - ADR-0187 / ADR-0188 — relaxed places=2 contracts for ciede / ssimulacra2.
- ADR-0191 — psnr_hvs places=3 contract.
- Companion code:
scripts/ci/cross_backend_parity_gate.py,.github/workflows/tests-and-quality-gates.ymllanevulkan-parity-matrix-gate,docs/development/cross-backend-gate.md.