ADR-0752: Multi-Resolution Performance Benchmark Baseline
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-05-29 |
| Deciders | lusoris |
| Supersedes | — |
| See also | ADR-0108 (deep-dive deliverables), ADR-0165 (state.md tracking) |
Context
Performance work (GPU kernels, SIMD paths, memory optimisations) has been benchmarked ad-hoc per PR using varying resolutions, fixture sources, and reporting formats. Numbers from PR #76 (1080p re-measurement), the CUDA kernel work, and bench_all.sh exist but are not comparable: different wall-clock methods, different frame counts, different backend-engagement flag semantics, and no single authoritative JSON that a future PR can diff against.
The goal is a versioned JSON baseline (testdata/perf_multi_resolution.json) generated by a reproducible script, so any future performance PR can run the same script and produce a structured diff.
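As a rough illustration of what a structured diff between two baseline files could look like, the sketch below compares per-cell median wall times. The layout it assumes is hypothetical and not taken from the script: a top-level `cells` list whose entries carry `resolution`, `metric`, `backend`, `status`, and `wall_s_median`. Only the `schema_version` key is named in this ADR.

```python
import json


def index_cells(doc):
    """Map (resolution, metric, backend) -> median wall time for successful cells.

    All field names except schema_version are assumptions made for illustration.
    """
    return {
        (c["resolution"], c["metric"], c["backend"]): c["wall_s_median"]
        for c in doc["cells"]
        if c.get("status") == "ok"
    }


def diff_baselines(old_doc, new_doc):
    """Return per-cell timing changes for cells that succeeded in both documents."""
    if old_doc["schema_version"] != new_doc["schema_version"]:
        raise ValueError("schema_version mismatch; refusing to compare")
    old, new = index_cells(old_doc), index_cells(new_doc)
    rows = []
    for key in sorted(old.keys() & new.keys()):
        delta_pct = (new[key] - old[key]) / old[key] * 100.0
        rows.append(
            {
                "cell": key,
                "old_s": old[key],
                "new_s": new[key],
                "delta_pct": round(delta_pct, 2),
            }
        )
    return rows


def load(path):
    """Read one baseline JSON file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A PR author could call `diff_baselines(load(old_path), load(new_path))` and paste the resulting rows into the PR description. Cells that were skipped in either file are left out of the comparison instead of being reported as regressions.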
Decision
Ship scripts/perf/bench-multi-resolution.sh as the canonical multi-resolution benchmark harness. Key choices:
- Shell (bash) over Python or Go. The existing bench scripts in `testdata/` are bash. Shell has zero dependency overhead inside the container and is straightforward to audit. Python is still used inline for JSON assembly and score parsing via `python3 -c` one-liners.
- Five resolutions: 576, 720, 1080, 1440, 2160. 576×324 is the Netflix golden fixture; 640×480 is the second native in-tree fixture; 1080p/1440p are upscaled from 576×324 via ffmpeg bilinear (noted in metadata); 2160p uses the BBB 30-frame trim from `testdata/bbb/`. Upscaled fixtures are cache-written once to `testdata/` and reused on subsequent runs.
- Five metrics: vif, adm, motion, ssim, ms_ssim. These cover the four VMAF component features plus MS-SSIM. The full composite `vmaf_v0.6.1` score is always collected alongside (required for a valid run).
- Three timing runs, median wall time. Best-of-N captures warm-cache steady-state; median is less sensitive to JIT-cache misses than min.
- Optional ncu integration behind `--ncu`. ncu is not always available; gating it on a flag avoids blocking CI-free runs.
- Schema version "1". The JSON carries a `schema_version` key so parsers can branch on format changes without breaking.
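The timing and schema choices above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `median_wall_time`, `make_cell`, `check_schema`, and every JSON field other than `schema_version` are hypothetical names, and `run_once` stands in for whatever invokes the benchmarked binary.

```python
import statistics
import time

SCHEMA_VERSION = "1"  # the schema version named in this ADR


def median_wall_time(run_once, runs=3):
    """Call run_once `runs` times; return (median seconds, all samples)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


def make_cell(resolution, metric, median_s, samples):
    """Assemble one result cell. Field names are illustrative assumptions."""
    return {
        "resolution": resolution,
        "metric": metric,
        "wall_s_median": median_s,
        "wall_s_runs": samples,
    }


def check_schema(doc):
    """Reject documents whose schema_version this parser does not understand."""
    version = doc.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return doc
```

In practice `run_once` would likely wrap a `subprocess.run` call. Keeping the raw samples next to the median lets a reader judge run-to-run spread without re-running the benchmark.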
Alternatives considered
| Option | Rejected because |
|---|---|
| Python orchestrator only (extend `bench_perf.py`) | `bench_perf.py` is FFmpeg-based (uses libvmaf FFmpeg filter); we need to drive the vmaf binary directly per-metric to measure feature-level throughput without the FFmpeg decode overhead |
| Go benchmark binary | Would require a compiled artifact checked into the repo; adds a Go build dependency for a development-only tool; shell + python3 are already present in the container |
| Fewer resolutions (576 + 1080 + 4K only) | Drops 720p and 1440p which are commercially relevant targets; minimal extra cost per run |
| More resolutions (all multiples of 144p up to 4K) | Would double run time with diminishing insight; the five-point set covers SD / HD / QHD / UHD breakpoints |
| Best-of-N timing | More sensitive to CPU boost transients than median; median is the standard for wall-clock benchmarks |
| JSON baseline committed to `testdata/` at HEAD | The initial baseline is generated by the script and committed once; future PRs re-run the script to compare |
Consequences
- `testdata/perf_multi_resolution.json` is the versioned baseline. Future perf PRs must re-run the script and include a diff in the PR description.
- Upscaled fixtures (`testdata/ref_1920x1080_48f.yuv`, etc.) are generated on first run and should be listed in `.gitignore` (they are reproducible).
- The script is portable: no hardcoded `/opt/cuda` paths; uses `$CUDA_HOME` detection with fallback candidates.
- SYCL cells require oneAPI `setvars.sh` to be sourceable in the execution environment. If unavailable, those cells emit `status=skip`.
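The portability behaviour described above could look something like the sketch below. It is an assumption-laden illustration, not the script's logic: the fallback directory list, the `bin/nvcc` probe, the default `setvars.sh` location, and the `"ok"` status value are all hypothetical. Only `$CUDA_HOME` and `status=skip` come from this ADR.

```python
import os
from pathlib import Path

# Hypothetical fallback candidates; the real script's list is not given in this ADR.
FALLBACK_CUDA_DIRS = ("/usr/local/cuda", "/usr/lib/cuda")


def find_cuda_home(env=None, candidates=FALLBACK_CUDA_DIRS):
    """Return the first directory containing bin/nvcc, preferring $CUDA_HOME."""
    env = os.environ if env is None else env
    for candidate in (env.get("CUDA_HOME"), *candidates):
        if candidate and (Path(candidate) / "bin" / "nvcc").is_file():
            return candidate
    return None  # caller decides how to report the missing toolkit


def sycl_cell_status(setvars_path="/opt/intel/oneapi/setvars.sh"):
    """Return "skip" when setvars.sh cannot be read, otherwise "ok".

    The default path is an assumed oneAPI install location.
    """
    path = Path(setvars_path)
    readable = path.is_file() and os.access(path, os.R_OK)
    return "ok" if readable else "skip"
```

Passing `env` and `candidates` explicitly keeps the probe testable on machines that have no CUDA toolkit installed.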
References
- req: "Productionize [the multi-resolution benchmark] as a script + JSON output checked into testdata, so future perf PRs can compare against a versioned baseline."