Performance Benchmarking¶

This page describes how to benchmark VMAF throughput across resolutions, backends, and metrics, and how to use the versioned JSON baseline for regression detection.

Quick start¶

# CPU-only (no GPU required)
bash scripts/perf/bench-multi-resolution.sh \
  --backends cpu \
  --resolutions 576,720,1080 \
  --metrics vif,adm,motion,ssim,ms_ssim \
  --output /tmp/my_perf.json

# CPU + CUDA inside the dev container
docker run --rm --gpus all \
  -v $(git rev-parse --show-toplevel):/workspace \
  -w /workspace \
  vmaf-dev-mcp:cuda13.3 bash -c '
    export VMAF_BIN=/workspace/core/build/tools/vmaf
    bash scripts/perf/bench-multi-resolution.sh \
      --backends cpu,cuda \
      --output testdata/perf_multi_resolution.json
  '

Script reference¶

scripts/perf/bench-multi-resolution.sh [OPTIONS]

Flag	Default	Description
`--backends`	`cpu`	Comma-separated list: `cpu`, `cuda`, `sycl`
`--resolutions`	all	Comma-separated height keys: `576`, `720`, `1080`, `1440`, `2160`
`--metrics`	all	`vif`, `adm`, `motion`, `ssim`, `ms_ssim`
`--runs`	`3`	Timing runs per cell; median is kept
`--ncu`	off	Collect Nsight Compute metrics (CUDA cells only)
`--output`	`testdata/perf_multi_resolution.json`	Output JSON path
`--workspace`	git toplevel	Root of the VMAF tree
`--dry-run`	off	Print plan and exit without running

Environment variables¶

Variable	Purpose
`VMAF_BIN`	Override vmaf binary path
`VMAF_CUDA_HOME`	Override CUDA install prefix (fallback: `$CUDA_HOME`, `/opt/cuda`, `/usr/local/cuda`)
`VMAF_ONEAPI_SETVARS`	Override oneAPI `setvars.sh` path
`VMAF_NCU`	Override `ncu` binary path

JSON schema¶

{
  "schema_version": "1",
  "timestamp": "2026-05-29T...",
  "hardware": {
    "cpu_model": "...",
    "gpu_model": "...",
    "nvidia_driver": "...",
    "cuda_version": "...",
    "git_hash": "..."
  },
  "toolkit": {
    "runs_per_cell": 3,
    "timing": "median wall time (ms)",
    "ncu_enabled": false
  },
  "fixture_notes": { ... },
  "total_cells": 50,
  "ok_cells": 40,
  "skipped_cells": 10,
  "runs": [
    {
      "resolution": "1080",
      "backend": "cuda",
      "metric": "adm",
      "width": 1920,
      "height": 1080,
      "bitdepth": 8,
      "frames": 48,
      "fixture_source": "upscaled",
      "median_ms": 142,
      "fps": 338.0,
      "vmaf_score": 74.123456,
      "feature_score": 10.987,
      "status": "ok",
      "skip_reason": null,
      "ncu": {}
    }
  ]
}

status is "ok" or "skip". skip_reason is non-null when status=skip. ncu is populated only when --ncu is passed and the backend is cuda.

Fixtures¶

Key	Size	Source
`576`	576×324, 48f	Native — Netflix golden `testdata/ref_576x324_48f.yuv`
`720`	640×480, 48f	Native — `testdata/ref_640x480_48f.yuv`
`1080`	1920×1080, 48f	Generated on first run — ffmpeg bilinear upscale from 576×324
`1440`	2560×1440, 48f	Generated on first run — ffmpeg bilinear upscale from 576×324
`2160`	3840×2160, 30f	Trimmed from `testdata/bbb/ref_3840x2160_200f.yuv` (gitignored)

Upscaled fixtures are cached in testdata/ after the first run. The upscale is bilinear (throughput vehicle only; scores from these fixtures are not comparable to production content).

Comparing a PR against the baseline¶

Automated (CI gate — ADR-0907)¶

CI runs scripts/perf/check-regression.py against the committed baseline on every PR (CPU-only, tests-and-quality-gates.yml job perf-regression). The gate fails if any (resolution, backend, metric) cell regresses by more than 5% wall-clock vs the baseline. The job is continue-on-error: true for one release cycle so cross-runner variance data can inform whether the 5% tolerance is right before the step is promoted to a required check.

Run the same gate locally:

# 1. Produce a fresh run JSON.
./scripts/perf/bench-multi-resolution.sh \
  --backends cpu --runs 3 \
  --output /tmp/perf_current.json

# 2. Diff against the committed baseline (exit 1 on regression > 5%).
python3 scripts/perf/check-regression.py \
  --baseline testdata/perf_multi_resolution.json \
  --current  /tmp/perf_current.json \
  --tolerance-pct 5.0 \
  --backend cpu

The gate prints a per-cell report:

== Perf regression gate (tolerance: +/- 5.0%) ==

REGRESSIONS (1):
  1080p  cpu      adm :   142.0 ms ->  151.5 ms ( +6.69%)

Improvements (informational, 1):
   720p  cpu      vif :   105.0 ms ->   95.0 ms ( -9.52%)

Cells with status != "ok" in either side (e.g. SYCL skipped because oneAPI is unavailable) are reported under Skipped and do not fail the gate.

Manual diff (legacy)¶

Run the script before and after your change, then diff in Python:

import json, sys
old = {(r["resolution"],r["backend"],r["metric"]): r
       for r in json.load(open(sys.argv[1]))["runs"]}
new = {(r["resolution"],r["backend"],r["metric"]): r
       for r in json.load(open(sys.argv[2]))["runs"]}
for key in sorted(new):
    o = old.get(key, {}); n = new[key]
    if n.get("fps") and o.get("fps"):
        delta = (n["fps"] - o["fps"]) / o["fps"] * 100
        print(f"{key[0]:>5}p/{key[1]:6}/{key[2]:8} "
              f"{o['fps']:7.1f} -> {n['fps']:7.1f} fps  {delta:+.1f}%")

Include the diff table in the PR description under "Performance delta". If the PR intentionally improves performance, commit the updated testdata/perf_multi_resolution.json.

SYCL prerequisites¶

SYCL cells require oneAPI to be sourced inside the execution environment. The script attempts to source setvars.sh from $VMAF_ONEAPI_SETVARS, /opt/intel/oneapi-2025.3/setvars.sh, and /opt/intel/oneapi/setvars.sh in that order. If none are found, SYCL cells emit status=skip.

See also the one-off container SYCL device-access pattern in docs/rebase-notes.md.

ncu integration¶

Pass --ncu to collect Nsight Compute metrics for CUDA cells. The following counters are collected per kernel:

sm__throughput.avg.pct_of_peak_sustained_elapsed — SM utilisation
l1tex__t_sector_hit_rate.pct — L1 cache hit rate
lts__t_sector_hit_rate.pct — L2 cache hit rate
gpu__time_duration.sum — total GPU time

Requires ncu (Nsight Compute ≥ 2022) on PATH or set via $VMAF_NCU.

Baseline location¶

The versioned baseline lives at testdata/perf_multi_resolution.json. The hardware.git_hash and timestamp fields identify when and on what machine it was generated. Regenerate intentionally via the script after any structural performance change; include the justification in the commit message.