ADR-0752 — Multi-Resolution Performance Benchmark Baseline¶

Field	Value
Status	Accepted
Date	2026-05-29
Deciders	lusoris
Supersedes	—
See also	ADR-0108 (deep-dive deliverables), ADR-0165 (state.md tracking)

Context¶

Performance work (GPU kernels, SIMD paths, memory optimisations) has been benchmarked ad-hoc per PR using varying resolutions, fixture sources, and reporting formats. Numbers from PR #76 (1080p re-measurement), the CUDA kernel work, and bench_all.sh exist but are not comparable: different wall-clock methods, different frame counts, different backend-engagement flag semantics, and no single authoritative JSON that a future PR can diff against.

The goal is a versioned JSON baseline (testdata/perf_multi_resolution.json) generated by a reproducible script, so any future performance PR can run the same script and produce a structured diff.

Decision¶

Ship scripts/perf/bench-multi-resolution.sh as the canonical multi-resolution benchmark harness. Key choices:

Shell (bash) over Python or Go. The existing bench scripts in testdata/ are bash. Shell has zero dependency overhead inside the container and is straightforward to audit. Python is still used inline for JSON assembly and score parsing via python3 -c one-liners.
Five resolutions: 576, 720, 1080, 1440, 2160. 576×324 is the Netflix golden fixture; 640×480 is the second native in-tree fixture; 1080p/1440p are upscaled from 576×324 via ffmpeg bilinear (noted in metadata); 2160p uses the BBB 30-frame trim from testdata/bbb/. Upscaled fixtures are cache-written once to testdata/ and reused on subsequent runs.
Five metrics: vif, adm, motion, ssim, ms_ssim. These cover the four VMAF component features plus MS-SSIM. The full composite vmaf_v0.6.1 score is always collected alongside (required for a valid run).
Three timing runs, median wall time. Best-of-N captures warm-cache steady-state; median is less sensitive to JIT-cache misses than min.
Optional ncu integration behind --ncu. ncu is not always available; gating it on a flag avoids blocking CI-free runs.
Schema version "1". The JSON carries a schema_version key so parsers can branch on format changes without breaking.

Alternatives considered¶

Option	Rejected because
Python orchestrator only (extend `bench_perf.py`)	`bench_perf.py` is FFmpeg-based (uses `libvmaf` FFmpeg filter); we need to drive the `vmaf` binary directly per-metric to measure feature-level throughput without the FFmpeg decode overhead
Go benchmark binary	Would require a compiled artifact checked into the repo; adds a Go build dependency for a development-only tool; shell + python3 are already present in the container
Fewer resolutions (576 + 1080 + 4K only)	Drops 720p and 1440p which are commercially relevant targets; minimal extra cost per run
More resolutions (all multiples of 144p up to 4K)	Would double run time with diminishing insight; the five-point set covers SD / HD / QHD / UHD breakpoints
Best-of-N timing	More sensitive to CPU boost transients than median; median is the standard for wall-clock benchmarks
JSON baseline committed to `testdata/` at HEAD	The initial baseline is generated by the script and committed once; future PRs re-run the script to compare

Consequences¶

testdata/perf_multi_resolution.json is the versioned baseline. Future perf PRs must re-run the script and include a diff in the PR description.
Upscaled fixtures (testdata/ref_1920x1080_48f.yuv, etc.) are generated on first run and should be listed in .gitignore (they are reproducible).
The script is portable: no hardcoded /opt/cuda paths; uses $CUDA_HOME detection with fallback candidates.
SYCL cells require oneAPI setvars.sh to be sourceable in the execution environment. If unavailable, those cells emit status=skip.

References¶

req: "Productionize [the multi-resolution benchmark] as a script + JSON output checked into testdata, so future perf PRs can compare against a versioned baseline."