Skip to content

ADR-0752 — Multi-Resolution Performance Benchmark Baseline

Field Value
Status Accepted
Date 2026-05-29
Deciders lusoris
Supersedes
See also ADR-0108 (deep-dive deliverables), ADR-0165 (state.md tracking)

Context

Performance work (GPU kernels, SIMD paths, memory optimisations) has been benchmarked ad-hoc per PR using varying resolutions, fixture sources, and reporting formats. Numbers from PR #76 (1080p re-measurement), the CUDA kernel work, and bench_all.sh exist but are not comparable: different wall-clock methods, different frame counts, different backend-engagement flag semantics, and no single authoritative JSON that a future PR can diff against.

The goal is a versioned JSON baseline (testdata/perf_multi_resolution.json) generated by a reproducible script, so any future performance PR can run the same script and produce a structured diff.

Decision

Ship scripts/perf/bench-multi-resolution.sh as the canonical multi-resolution benchmark harness. Key choices:

  • Shell (bash) over Python or Go. The existing bench scripts in testdata/ are bash. Shell has zero dependency overhead inside the container and is straightforward to audit. Python is still used inline for JSON assembly and score parsing via python3 -c one-liners.
  • Five resolutions: 576, 720, 1080, 1440, 2160. 576×324 is the Netflix golden fixture; 640×480 is the second native in-tree fixture; 1080p/1440p are upscaled from 576×324 via ffmpeg bilinear (noted in metadata); 2160p uses the BBB 30-frame trim from testdata/bbb/. Upscaled fixtures are cache-written once to testdata/ and reused on subsequent runs.
  • Five metrics: vif, adm, motion, ssim, ms_ssim. These cover the four VMAF component features plus MS-SSIM. The full composite vmaf_v0.6.1 score is always collected alongside (required for a valid run).
  • Three timing runs, median wall time. Best-of-N captures warm-cache steady-state; median is less sensitive to JIT-cache misses than min.
  • Optional ncu integration behind --ncu. ncu is not always available; gating it on a flag avoids blocking CI-free runs.
  • Schema version "1". The JSON carries a schema_version key so parsers can branch on format changes without breaking.

Alternatives considered

Option Rejected because
Python orchestrator only (extend bench_perf.py) bench_perf.py is FFmpeg-based (uses libvmaf FFmpeg filter); we need to drive the vmaf binary directly per-metric to measure feature-level throughput without the FFmpeg decode overhead
Go benchmark binary Would require a compiled artifact checked into the repo; adds a Go build dependency for a development-only tool; shell + python3 are already present in the container
Fewer resolutions (576 + 1080 + 4K only) Drops 720p and 1440p which are commercially relevant targets; minimal extra cost per run
More resolutions (all multiples of 144p up to 4K) Would double run time with diminishing insight; the five-point set covers SD / HD / QHD / UHD breakpoints
Best-of-N timing More sensitive to CPU boost transients than median; median is the standard for wall-clock benchmarks
JSON baseline committed to testdata/ at HEAD The initial baseline is generated by the script and committed once; future PRs re-run the script to compare

Consequences

  • testdata/perf_multi_resolution.json is the versioned baseline. Future perf PRs must re-run the script and include a diff in the PR description.
  • Upscaled fixtures (testdata/ref_1920x1080_48f.yuv, etc.) are generated on first run and should be listed in .gitignore (they are reproducible).
  • The script is portable: no hardcoded /opt/cuda paths; uses $CUDA_HOME detection with fallback candidates.
  • SYCL cells require oneAPI setvars.sh to be sourceable in the execution environment. If unavailable, those cells emit status=skip.

References

  • req: "Productionize [the multi-resolution benchmark] as a script + JSON output checked into testdata, so future perf PRs can compare against a versioned baseline."