ADR-0907: Wall-clock perf regression gate over the multi-resolution baseline¶

Status: Proposed
Date: 2026-05-30
Deciders: Lusoris
Tags: ci, performance, regression-gate, fork-local, testing

Context¶

The fork ships multiple benchmark harnesses but no automated wall-clock regression gate:

testdata/bench_all.sh (ADR-0513) compares per-backend VMAF scores on three canonical fixtures. Its in-script compare() checks numerical correctness (max_diff < 0.01), not wall time.
testdata/bench_perf.py (ADR-0429) is explicitly "operator-facing, not a CI gate".
testdata/benchmark_netflix.py regenerates the netflix_benchmark_results.json snapshot but ADR-0001 classifies that snapshot as "noise" — not a tracked baseline.
scripts/perf/bench-multi-resolution.sh (ADR-0752) writes a versioned baseline at testdata/perf_multi_resolution.json (schema_version=1, hardware-tagged, 50 cells). Its README and ADR say "future perf PRs must re-run the script and include a diff in the PR description" — manual inspection, no enforcement.
.github/workflows/tests-and-quality-gates.yml invokes ./testdata/bench_all.sh --backend=cpu --snapshot-only --tolerance-ulp=2 but bench_all.sh does not parse any of those flags; the run is effectively a no-op gate.
.github/workflows/nightly.yml runs bash testdata/bench_all.sh || true and uploads netflix_benchmark_results.json as an artefact, never asserting anything.

Net effect: a wall-clock regression in any CPU/SYCL/CUDA path can land on master without any signal. The closest existing gate is ADR-0164's SSIMULACRA 2 numerical snapshot gate, which is correctness-only.

Decision¶

We will ship a minimal wall-clock perf regression gate over the ADR-0752 multi-resolution baseline:

Gate script at scripts/perf/check-regression.py — a single Python file (no new third-party deps; stdlib only) that loads the committed testdata/perf_multi_resolution.json baseline and a freshly-produced run JSON, joins them by (resolution, backend, metric), and exits non-zero when any cell's median_ms exceeds the baseline by more than --tolerance-pct (default 5%). Cells with status != "ok" in either side are reported and skipped, not failed, so missing GPU lanes do not block CPU-only CI.
Wire it into tests-and-quality-gates.yml — replace the broken bench_all.sh --backend=cpu --snapshot-only --tolerance-ulp=2 invocation (the flags are silently ignored today) with a CPU-only run of bench-multi-resolution.sh against the baseline. The job stays optional (continue-on-error: true) for one release cycle so we can collect cross-runner variance data before flipping it into the required-check aggregator.
5% wall-clock tolerance as the default. Single-cell runs on GitHub-hosted Ubuntu show typical CPU-time variance of ~2-3% across identical commits; 5% catches real regressions while absorbing measurement noise. The tolerance is a CLI flag so a future ADR can tighten it once we have a self-hosted bench runner with lower variance.
Baseline lives in testdata/perf_multi_resolution.json — already committed, already versioned. Intentional perf improvements regenerate it via scripts/perf/bench-multi-resolution.sh with the regen documented in the commit message (mirrors the /regen-snapshots discipline for SSIMULACRA 2).

Alternatives considered¶

Option	Pros	Cons	Why not chosen
Gate script + CPU-only CI step (this ADR)	Stdlib-only; reuses the existing baseline + harness; CPU-only keeps the gate runnable on stock GitHub Ubuntu; tolerance flag tunable as variance data accrues	Catches CPU regressions only at start; GPU lanes deferred to a follow-up self-hosted job	Chosen. Smallest viable gate; closes the "any wall-clock regression can land silently" gap immediately
Status-quo (manual diff in PR descriptions per ADR-0752)	No infra work	Relies on perfect human discipline; ADR-0752 has been live since 2026-05-29 and no PR has actually pasted a diff	Drops on the floor in practice
Fix `bench_all.sh` to honour the `--snapshot-only --tolerance-ulp=N` flags it is already invoked with	Closes the broken-CI-invocation foot-gun directly	`tolerance-ulp` is a numerical tolerance, not a wall-clock one; would gate on score parity that the SSIMULACRA 2 + Netflix golden gates already cover; doesn't address the wall-clock gap	Wrong tool — would conflate two orthogonal gates
Full GPU matrix gate (CUDA + SYCL + HIP in CI)	Catches GPU regressions too	Requires self-hosted GPU runners; cross-runner variance is much higher than CPU; would block on infra not yet in place	Out of scope for the first iteration; follow-up ADR
External tool (e.g. `pyperformance`, `bencher.dev`)	Off-the-shelf statistical machinery	New dependency; SaaS integration; doesn't read our JSON shape; pinning + supply-chain review cost dwarfs the value at this scale	Overkill; ~120 LOC of stdlib Python covers our shape

Consequences¶

Positive¶

A wall-clock regression > 5% on any tracked CPU cell now turns the CI step red (after the one-cycle continue-on-error warm-up).
The broken bench_all.sh --snapshot-only invocation in tests-and-quality-gates.yml is replaced with a real gate.
The baseline file (testdata/perf_multi_resolution.json) now has a consumer in CI, not just operator-facing inspection.

Negative¶

One CI step adds ~3-5 min of wall time per PR run (CPU-only, five resolutions × five metrics, 3 runs each at median ~74 ms per cell).
Operators who intentionally regress perf for a correctness fix must regenerate the baseline in the same PR (same discipline as ADR-0752 already implies).

Neutral / follow-ups¶

Promote the step from continue-on-error: true to required-check after one release cycle of variance data.
A follow-up ADR will add a self-hosted GPU lane (CUDA + SYCL) once the runner pool stabilises. The script is backend-agnostic by design.
The gate intentionally does not read netflix_benchmark_results.json — ADR-0001 already classified that snapshot as noise.

References¶

ADR-0001 — netflix_benchmark_results.json is noise, not a baseline.
ADR-0429 — bench_perf.py is operator-facing, not a CI gate.
ADR-0513 — bench_all.sh skip-on-missing-backend behaviour.
ADR-0752 — multi-resolution baseline this gate consumes.
ADR-0108 — deep-dive deliverables for this PR.
ADR-0164 — precedent for a snapshot-based regression gate.
Source: req — user direction to "audit existing benchmark infrastructure — is there a performance regression gate? If not, propose one" with a "±5% wall-clock regression gate" target.