ADR-0890: CI concurrency + cost audit follow-up to PR #301
- Status: Accepted
- Date: 2026-05-30
- Deciders: lusoris, Claude Code
- Tags: ci, cost
Context
PR #301 added top-level concurrency: blocks (group + cancel-in-progress: true) to three PR-triggered workflows that lacked one — go-ci.yml, rust-ci.yml, docker-image.yml — and tightened three ONNX-Runtime install steps with set -euo pipefail + curl -fSL --retry 3. Its "Out-of-scope" note explicitly deferred ffmpeg-integration.yml ("a candidate for the same treatment, but kept out of this PR to stay strictly within the task scope; can land in a follow-up"). The user surfaced the cost issue directly: "CI minutes are being burned because the one-PR-in-flight rule isn't being followed — that is actively costing money" (paraphrased).
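As a rough illustration of the shape PR #301 applied, a top-level concurrency block might look like the sketch below. The group expression is an assumption (a common idiom keyed on workflow name and ref); the ADR does not quote the exact key used in the repository.

```yaml
# Sketch only: the group key is an assumed idiom, not copied from PR #301.
concurrency:
  # One group per workflow and ref, so a new push to the same branch or PR
  # lands in the same group as the run that is still in flight.
  group: ${{ github.workflow }}-${{ github.ref }}
  # Cancel the in-flight run in that group instead of queueing behind it.
  cancel-in-progress: true
```

Keying on the ref keeps cancellation scoped to a single PR or branch: runs on unrelated branches fall into different groups and are left alone.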
An audit of .github/workflows/*.yml against four cost-control axes (concurrency groups, paths-filters, caches, matrix minimization) found five remaining gaps where the in-tree convention exists but had not yet been applied:
- ffmpeg-integration.yml had no concurrency: block. Push + PR triggers meant a force-push or PR rebase queued a duplicate matrix run instead of cancelling the prior one. Cost per duplicate: ~10-15 min runner-min × 3 legs (gcc, clang, SYCL).
- ffmpeg-integration.yml had no ccache wrapper. libvmaf + FFmpeg combined are ~10 min compile time per leg on a cold runner; ccache typically hits 60-85 % after warm-up per Research-0089 §3.1, which the libvmaf-build-matrix.yml legs already exploit. The FFmpeg clone was also full-depth (git clone -q --branch release/8.1), wasting 20-40 s on each leg even though only the worktree at the tip is needed.
- sanitizers.yml (ASan+UBSan PR-gate, TSan master-push) ran clang-18 debug builds with no ccache. Sanitizer builds rebuild fresh every PR push; measured ~5-7 min cold today, well-served by the established ccache pattern.
- security-scans.yml had no paths-ignore on the PR trigger. Doc-only PRs fired all of CodeQL C++ (~35 min build + analyze), CodeQL Python, CodeQL Actions, Semgrep, Gitleaks, Dependency Review, none of which had any security-relevant delta to scan. The weekly cron (0 6 * * 1) still provides full coverage against master, so doc-only PRs that skip the gate are not a coverage gap.
- lint-and-format.yml::clang-tidy paid ~5 min of apt-install + meson-setup + meson-compile before its existing "no C/C++ changes — exit 0" short-circuit at the inner "Run clang-tidy on changed files" step. An early file-delta probe gated on *.c / *.h / *.cpp / *.hpp extension lets the install + build steps be skipped entirely on doc-only / Python-only PRs.
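The ccache and shallow-clone gaps could be closed along the lines of the step fragment below. This is a sketch under stated assumptions, not the contents of ffmpeg-integration.yml: the step names, the cache key, the matrix.compiler variable, the CCACHE_DIR location and the FFMPEG_URL variable are all hypothetical. Only actions/cache@v5, the 400M cap, --depth=1 and the release/8.1 branch come from this ADR.

```yaml
# Sketch only: names, key layout and FFMPEG_URL are illustrative assumptions.
steps:
  - name: Point ccache at a fixed directory
    run: echo "CCACHE_DIR=$HOME/.cache/ccache" >> "$GITHUB_ENV"

  - name: Restore ccache
    uses: actions/cache@v5
    with:
      path: ~/.cache/ccache
      # Unique key per commit so every run saves a fresh cache ...
      key: ccache-ffmpeg-${{ matrix.compiler }}-${{ github.sha }}
      # ... while the prefix match restores the most recent earlier one.
      restore-keys: |
        ccache-ffmpeg-${{ matrix.compiler }}-

  - name: Cap cache size and reset statistics
    run: |
      ccache --max-size=400M
      ccache --zero-stats

  - name: Shallow-clone FFmpeg
    # Only the tip of the branch is needed, so skip the full history.
    run: git clone -q --depth=1 --branch release/8.1 "$FFMPEG_URL" ffmpeg

  - name: Report ccache hit rate
    # Assumes the build steps in between ran the compiler through ccache.
    run: ccache --show-stats
```

Printing the statistics at the end of each leg makes it easy to check whether the observed hit rate falls in the 60-85 % range cited from Research-0089 §3.1.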
docker-publish-production.yml, nightly*.yml, release-please.yml, scorecard.yml, supply-chain.yml, and the four upstream-*-watcher.yml files were intentionally left without concurrency blocks — they are release / scheduled / cron workflows where cancelling mid-run is the wrong behaviour. Same rationale as PR #301.
Decision
Apply the in-tree cost-control conventions to the five gaps above. No new patterns introduced; every change mirrors an existing in-tree precedent (build.yml / libvmaf-build-matrix.yml ccache + concurrency, the paths-ignore deny-list from libvmaf-build-matrix.yml / tests-and-quality-gates.yml, the per-step if: gating already used elsewhere in lint-and-format.yml).
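For the paths-ignore convention, the trigger block might look roughly like the following. The glob patterns are placeholders chosen for illustration; the actual deny-list is the one maintained in libvmaf-build-matrix.yml / tests-and-quality-gates.yml and is not reproduced in this ADR. The cron expression is the one quoted in the Context section.

```yaml
# Sketch only: the two globs below are assumed examples, not the real deny-list.
on:
  pull_request:
    # The workflow is skipped only when every changed file matches one of
    # these patterns; a single non-matching file still triggers the run.
    paths-ignore:
      - "docs/**"
      - "**/*.md"
  schedule:
    # Weekly full scan, Mondays at 06:00 UTC.
    - cron: "0 6 * * 1"
```

Because paths-ignore applies only to the pull_request trigger here, the scheduled run is unaffected and keeps scanning the full tree.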
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Land the five gaps as one consolidated CI hygiene PR (chosen) | One review, one CI cycle, one rebase note. All five changes share the same "follow-up to PR #301" framing and decision logic. | Slightly larger diff than five micro-PRs. | Matches the user's "ONE PR active at a time — strict" rule. Five micro-PRs would burn more reviewer time + more CI cycles than the saving justifies. |
| Five micro-PRs (one per workflow) | Each can be reverted independently. | 5× CI cycles, 5× review queue slots, violates the one-PR-in-flight constraint, no shared decision context. | Cost outweighs the (negligible) revert-granularity benefit; the five changes are all CI-YAML hygiene with no source/header/patch impact. |
| Adopt a third-party reusable concurrency action (e.g. softprops/turnstyle) | Centralises the concurrency-group shape across all workflows. | New external dependency on the supply chain, no benefit over native concurrency:. | The native GitHub Actions concurrency: block already does exactly what we need; a third-party action adds attack surface for zero functional gain. |
| Disable the ffmpeg-integration.yml SYCL leg entirely (most expensive leg, no GPU on runner) | Saves ~10 min/run unconditionally. | Loses build-time coverage of the SYCL FFmpeg patch (0003) on every PR, which is exactly the gate that catches patch-series rebase breakage before a downstream FFmpeg consumer trips it. | Build-only coverage of the SYCL patch is the entire point of the leg per ADR-0186 / ADR-0726 follow-up; ccache narrows the cost without losing the signal. |
Consequences
- Positive:
  - Force-push / PR-rebase on ffmpeg-integration.yml cancels superseded matrix runs instead of queueing dups (mirrors the other 8 PR-triggered workflows).
  - ccache on ffmpeg-integration.yml + sanitizers.yml saves 3-5 min per leg after warm-up (Research-0089 §3.1 baseline). ccache --max-size=400M leaves headroom inside the 10 GB per-repo cache budget.
  - --depth=1 on the FFmpeg clone saves 20-40 s per leg.
  - paths-ignore on security-scans.yml skips ~50-60 min of compute on every doc-only PR (CodeQL C++ build + analyze dominates).
  - Early file-delta probe in lint-and-format.yml::clang-tidy skips ~5 min of apt + meson setup on every doc-only / Python-only PR.
- Negative:
  - One additional actions/cache@v5 call per affected workflow (already a pinned dep, no new SBOM line).
  - The clang-tidy early-skip duplicates the file-pattern list maintained in the later step's exclusion logic. The early probe is intentionally less strict (asks "any C/C++ at all?", not "any non-GPU C/C++"); a follow-up can fold the two probes if maintenance friction surfaces.
- Neutral / follow-ups:
  - The Coverage Gate job in tests-and-quality-gates.yml already caches the ORT GPU .tgz (~150 MB); no change needed there.
  - The Vulkan-wrap packagecache in libvmaf-build-matrix.yml is dead code now that ADR-0726 dropped Vulkan; cleanup belongs in a separate dead-code sweep, not this audit.
  - cppcheck in lint-and-format.yml was left as-is: it scans the whole project (no per-file delta), and its meson setup + meson compile + cppcheck cost on a doc-only PR is ~5 min total. Not a free win the way clang-tidy is; deferred until a dorny/paths-filter-style probe is introduced project-wide.
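The early file-delta probe for the clang-tidy job could be sketched as below. Everything named here is hypothetical: the step id probe, the output name cxx, the diff against the PR base ref, and the placeholder run commands. The fragment also assumes an earlier checkout step fetched enough history for the three-dot diff to resolve. It shows only the per-step if: gating pattern, not the job as it exists in lint-and-format.yml.

```yaml
# Sketch only: ids, output names and the diff range are assumptions.
steps:
  - name: Probe for C/C++ changes
    id: probe
    env:
      BASE_REF: ${{ github.base_ref }}
    run: |
      # List files changed relative to the PR base, then ask the loose
      # question "any C/C++ at all?" by extension.
      changed="$(git diff --name-only "origin/${BASE_REF}...HEAD")"
      if grep -qE '\.(c|h|cpp|hpp)$' <<<"$changed"; then
        echo "cxx=true" >> "$GITHUB_OUTPUT"
      else
        echo "cxx=false" >> "$GITHUB_OUTPUT"
      fi

  - name: Install toolchain and build
    # Skipped entirely on doc-only / Python-only PRs.
    if: steps.probe.outputs.cxx == 'true'
    run: echo "placeholder for apt install + meson setup + meson compile"

  - name: Run clang-tidy on changed files
    if: steps.probe.outputs.cxx == 'true'
    run: echo "placeholder for the existing clang-tidy invocation"
```

Keeping the probe looser than the later step's exclusion logic means it can only skip work the later step would have skipped anyway, at the cost of the duplicated extension list noted under Negative.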
References
- PR #301 (chore(ci): add concurrency groups + shell-strict on curl|tar steps): explicit "Out-of-scope" note deferring ffmpeg-integration.yml.
- ADR-0317: path-filter introduction on ffmpeg-integration.yml / docker-image.yml.
- ADR-0341: paths-ignore deny-list on libvmaf-build-matrix.yml / tests-and-quality-gates.yml.
- Research-0089 §3.1: ccache hit-rate evidence (60-85 % typical after warm-up).
- Source: req (direct user direction surfaced in the parent agent task brief): "ci is clocked because you dont follow my rules and have more than one active not draft pr" + "thats actively wasting my money as well".