ADR-0907: Wall-clock perf regression gate over the multi-resolution baseline
- Status: Proposed
- Date: 2026-05-30
- Deciders: Lusoris
- Tags: ci, performance, regression-gate, fork-local, testing
Context
The fork ships multiple benchmark harnesses but no automated wall-clock regression gate:
- `testdata/bench_all.sh` (ADR-0513) compares per-backend VMAF scores on three canonical fixtures. Its in-script `compare()` checks numerical correctness (max_diff < 0.01), not wall time.
- `testdata/bench_perf.py` (ADR-0429) is explicitly "operator-facing, not a CI gate".
- `testdata/benchmark_netflix.py` regenerates the `netflix_benchmark_results.json` snapshot, but ADR-0001 classifies that snapshot as "noise", not a tracked baseline.
- `scripts/perf/bench-multi-resolution.sh` (ADR-0752) writes a versioned baseline at `testdata/perf_multi_resolution.json` (schema_version=1, hardware-tagged, 50 cells). Its README and ADR say "future perf PRs must re-run the script and include a diff in the PR description": manual inspection, no enforcement.
- `.github/workflows/tests-and-quality-gates.yml` invokes `./testdata/bench_all.sh --backend=cpu --snapshot-only --tolerance-ulp=2`, but `bench_all.sh` does not parse any of those flags; the run is effectively a no-op gate.
- `.github/workflows/nightly.yml` runs `bash testdata/bench_all.sh || true` and uploads `netflix_benchmark_results.json` as an artefact, never asserting anything.
Net effect: a wall-clock regression in any CPU/SYCL/CUDA path can land on master without any signal. The closest existing gate is ADR-0164's SSIMULACRA 2 numerical snapshot gate, which is correctness-only.
Decision
We will ship a minimal wall-clock perf regression gate over the ADR-0752 multi-resolution baseline:
- Gate script at `scripts/perf/check-regression.py`: a single Python file (stdlib only, no new third-party deps) that loads the committed `testdata/perf_multi_resolution.json` baseline and a freshly produced run JSON, joins them by `(resolution, backend, metric)`, and exits non-zero when any cell's `median_ms` exceeds the baseline by more than `--tolerance-pct` (default 5%). Cells with `status != "ok"` on either side are reported and skipped, not failed, so missing GPU lanes do not block CPU-only CI.
- Wire it into `tests-and-quality-gates.yml`: replace the broken `bench_all.sh --backend=cpu --snapshot-only --tolerance-ulp=2` invocation (the flags are silently ignored today) with a CPU-only run of `bench-multi-resolution.sh` against the baseline. The job stays optional (`continue-on-error: true`) for one release cycle so we can collect cross-runner variance data before flipping it into the required-check aggregator.
- 5% wall-clock tolerance as the default. Single-cell runs on GitHub-hosted Ubuntu show typical CPU-time variance of ~2-3% across identical commits; 5% catches real regressions while absorbing measurement noise. The tolerance is a CLI flag so a future ADR can tighten it once we have a self-hosted bench runner with lower variance.
- Baseline lives in `testdata/perf_multi_resolution.json`: already committed, already versioned. Intentional perf improvements regenerate it via `scripts/perf/bench-multi-resolution.sh` with the regen documented in the commit message (mirrors the `/regen-snapshots` discipline for SSIMULACRA 2).
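The comparison at the heart of the gate could look roughly like the sketch below. It is not the contents of `scripts/perf/check-regression.py`; it only illustrates the join-by-key, skip-on-non-ok, and tolerance rules described above. The top-level `cells` list, the sample resolution and metric names, and the helper names `index_cells` and `find_regressions` are assumptions made for illustration; only `median_ms`, `status`, the `(resolution, backend, metric)` key, and the 5% default come from this ADR. Treating a cell that is absent from the fresh run as skipped is also an assumption.

```python
def index_cells(doc):
    """Index a run document by (resolution, backend, metric)."""
    # Assumption: cells live in a top-level "cells" list.
    return {(c["resolution"], c["backend"], c["metric"]): c for c in doc["cells"]}


def find_regressions(baseline, fresh, tolerance_pct=5.0):
    """Return (regressions, skipped) for every cell in the baseline."""
    base_idx = index_cells(baseline)
    fresh_idx = index_cells(fresh)
    regressions, skipped = [], []
    for key in sorted(base_idx):
        base_cell = base_idx[key]
        fresh_cell = fresh_idx.get(key)
        # Cells missing from the fresh run or not "ok" on either side are
        # reported as skipped, never failed.
        if (
            fresh_cell is None
            or base_cell.get("status") != "ok"
            or fresh_cell.get("status") != "ok"
        ):
            skipped.append(key)
            continue
        limit_ms = base_cell["median_ms"] * (1.0 + tolerance_pct / 100.0)
        if fresh_cell["median_ms"] > limit_ms:
            delta_pct = (fresh_cell["median_ms"] / base_cell["median_ms"] - 1.0) * 100.0
            regressions.append(
                (key, base_cell["median_ms"], fresh_cell["median_ms"], delta_pct)
            )
    return regressions, skipped


# Hypothetical sample data; real cells come from the two JSON files.
baseline = {"schema_version": 1, "cells": [
    {"resolution": "1920x1080", "backend": "cpu", "metric": "vmaf",
     "median_ms": 100.0, "status": "ok"},
    {"resolution": "1920x1080", "backend": "cuda", "metric": "vmaf",
     "median_ms": 20.0, "status": "ok"},
]}
fresh = {"schema_version": 1, "cells": [
    {"resolution": "1920x1080", "backend": "cpu", "metric": "vmaf",
     "median_ms": 106.0, "status": "ok"},
    {"resolution": "1920x1080", "backend": "cuda", "metric": "vmaf",
     "median_ms": None, "status": "skipped"},
]}

regressions, skipped = find_regressions(baseline, fresh)
exit_code = 1 if regressions else 0  # non-zero is what turns the CI step red
```

With this sample data the CPU cell is about 6% slower than its baseline and is flagged, while the CUDA cell is skipped because the fresh run did not produce an `ok` result for it.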
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Gate script + CPU-only CI step (this ADR) | Stdlib-only; reuses the existing baseline + harness; CPU-only keeps the gate runnable on stock GitHub Ubuntu; tolerance flag tunable as variance data accrues | Catches CPU regressions only at start; GPU lanes deferred to a follow-up self-hosted job | Chosen. Smallest viable gate; closes the "any wall-clock regression can land silently" gap immediately |
| Status-quo (manual diff in PR descriptions per ADR-0752) | No infra work | Relies on perfect human discipline; ADR-0752 has been live since 2026-05-29 and no PR has actually pasted a diff | Drops on the floor in practice |
| Fix `bench_all.sh` to honour the `--snapshot-only --tolerance-ulp=N` flags it is already invoked with | Closes the broken-CI-invocation foot-gun directly | `tolerance-ulp` is a numerical tolerance, not a wall-clock one; would gate on score parity that the SSIMULACRA 2 + Netflix golden gates already cover; doesn't address the wall-clock gap | Wrong tool: would conflate two orthogonal gates |
| Full GPU matrix gate (CUDA + SYCL + HIP in CI) | Catches GPU regressions too | Requires self-hosted GPU runners; cross-runner variance is much higher than CPU; would block on infra not yet in place | Out of scope for the first iteration; follow-up ADR |
| External tool (e.g. pyperformance, bencher.dev) | Off-the-shelf statistical machinery | New dependency; SaaS integration; doesn't read our JSON shape; pinning + supply-chain review cost dwarfs the value at this scale | Overkill; ~120 LOC of stdlib Python covers our shape |
Consequences
Positive
- A wall-clock regression > 5% on any tracked CPU cell now turns the CI step red (after the one-cycle `continue-on-error` warm-up).
- The broken `bench_all.sh --snapshot-only` invocation in `tests-and-quality-gates.yml` is replaced with a real gate.
- The baseline file (`testdata/perf_multi_resolution.json`) now has a consumer in CI, not just operator-facing inspection.
Negative
- One CI step adds ~3-5 min of wall time per PR run (CPU-only, five resolutions × five metrics, 3 runs each at median ~74 ms per cell).
- Operators who intentionally regress perf for a correctness fix must regenerate the baseline in the same PR (same discipline as ADR-0752 already implies).
Neutral / follow-ups
- Promote the step from `continue-on-error: true` to required-check after one release cycle of variance data.
- A follow-up ADR will add a self-hosted GPU lane (CUDA + SYCL) once the runner pool stabilises. The script is backend-agnostic by design.
- The gate intentionally does not read `netflix_benchmark_results.json`; ADR-0001 already classified that snapshot as noise.
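One way to use the variance data collected during the warm-up cycle is to check, per cell, whether run-to-run noise sits comfortably below the tolerance before the step becomes a required check. The sketch below is an assumption about how that review might be done, not part of the gate: the helper names, the sample timings, and the two-sigma rule of thumb are all hypothetical and do not come from this ADR.

```python
import statistics


def relative_spread_pct(samples_ms):
    """Coefficient of variation (sample stdev / mean) as a percentage."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    return statistics.stdev(samples_ms) / statistics.fmean(samples_ms) * 100.0


def tolerance_is_safe(samples_ms, tolerance_pct=5.0, sigmas=2.0):
    """True when `sigmas` times the relative spread stays under the tolerance.

    The two-sigma margin is an illustrative rule of thumb, not a project rule.
    """
    return sigmas * relative_spread_pct(samples_ms) < tolerance_pct


# Hypothetical median_ms values for one cell across five runs of one commit.
quiet_runner = [100.0, 102.0, 98.0, 101.0, 99.0]
noisy_runner = [100.0, 110.0, 90.0, 105.0, 95.0]

quiet_ok = tolerance_is_safe(quiet_runner)
noisy_ok = tolerance_is_safe(noisy_runner)
```

A cell whose spread fails such a check would argue for keeping the step optional, or for a wider `--tolerance-pct`, until a lower-variance runner is available.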
References
- ADR-0001 — netflix_benchmark_results.json is noise, not a baseline.
- ADR-0429 — bench_perf.py is operator-facing, not a CI gate.
- ADR-0513 — bench_all.sh skip-on-missing-backend behaviour.
- ADR-0752 — multi-resolution baseline this gate consumes.
- ADR-0108 — deep-dive deliverables for this PR.
- ADR-0164 — precedent for a snapshot-based regression gate.
- Source: `req` (user direction to "audit existing benchmark infrastructure — is there a performance regression gate? If not, propose one" with a "±5% wall-clock regression gate" target).