Benchmarks¶

Scope: this file tracks fork-added benchmarks (GPU backends, SIMD paths, --precision overhead). Netflix's upstream correctness numbers are the Netflix golden CPU pools — see CLAUDE.md §8.

Runs below are produced by make bench (drives testdata/bench_all.sh) on a fixed hardware profile and pinned commit. Contribute new numbers via a PR that updates this file alongside the commit that motivates the rerun.

Hardware profiles¶

Profile	CPU	GPUs	Memory	OS
`ryzen-4090-arc`	AMD Ryzen 9 9950X3D (16c/32t, Zen 5, AVX-512)	NVIDIA RTX 4090 (24 GB) + Intel Arc A380 (6 GB, fp64-emulated)	96 GB DDR5-6400	Linux 7.0.x (CachyOS)
`xeon-arc`	Intel Xeon w9-3475X	Intel Arc A770	128 GB DDR5-4800	Ubuntu 26.04
`m4-pro`	Apple M4 Pro	(integrated)	48 GB unified	macOS 15

The ryzen-4090-arc profile is the canonical fork bench host: a single machine that exposes CUDA (RTX 4090), SYCL (Arc A380 via oneAPI 2025.3 Level Zero), and HIP so all active backends can run back-to-back from one shell. (Vulkan backend removed per ADR-0726; historical Vulkan bench rows are preserved below for reference.)

Backend comparison (Netflix normal pair, 576×324, 48 frames)¶

Source: python/test/resource/yuv/src01_hrc00_576x324.yuv vs …hrc01…. Model: model/vmaf_v0.6.1.json. Threads: 1. Precision: CLI default %.6f per ADR-0119. Numbers averaged over 5 wall-clock reps after 1 warmup; standard deviation in parentheses. Commit 41301496 on ryzen-4090-arc.

Backend	fps (higher better)	wall ms / 48f	vmaf pool	metrics-keys	delta vs CPU pool
`cpu` (full ISA, AVX-512)	598 (±21)	80.3	`76.667828`	15	0 (reference)
`cuda` (RTX 4090)	278 (±52)	177.5	`76.667828`	12	0.0 — pool match to 6 dp; per-frame max ULP diff 1.8×10⁻⁵
`sycl` (Arc A380)	315 (±0.9)	152.3	`76.667767`	34	-6.1×10⁻⁵ pool; per-frame max diff 1.11×10⁻³
~~`vulkan`~~ (removed ADR-0726)	(historical: 171 fps)	(historical: 280.6ms)	`76.667758`	34	historical reference only

Key-count check. Each backend emits a different frames[0].metrics key set (CPU=15 with integer_aim/integer_motion3/integer_adm3, CUDA=12, SYCL=34 with raw _num/_den intermediates). Identical key counts across two rows would indicate a silent-fallback to CPU; the counts above confirm each backend actually engaged. See AGENTS.md §"Backend-engagement foot-guns".

Backend comparison (1080p, 5 frames)¶

Source: python/test/resource/yuv/src01_hrc{00,01}_1920x1080_5frames.yuv. Same setup as 576×324.

Backend	fps	wall ms / 5f	vmaf pool	metrics-keys
`cpu`	45.6 (±1.0)	109.7	`35.815478`	15
`cuda`	33.6 (±1.1)	148.8	`35.815478`	12
`sycl`	41.1 (±0.7)	121.7	`35.815404`	34
~~`vulkan`~~ (removed ADR-0726)	(historical: 21.8 fps)	(historical: 229.5ms)	`35.815399`	34

CPU outpaces CUDA at 1080p × 5 frames because dispatch overhead dominates the workload — only 5 frames doesn't amortise the CUDA launch/copy cost. CUDA decisively wins once the workload grows (see 4K below).

Backend comparison (BBB 4K, 200 frames)¶

Source: testdata/bbb/{ref,dis}_3840x2160_200f.yuv (BigBuckBunny 4K master, ffmpeg-encoded ref + libx264 CRF=35 round-trip distortion; see How to reproduce).

Backend	fps	wall s / 200f	vmaf pool	speedup vs CPU
`cpu`	13.9 (±0.5)	14.43	`36.343813`	1.0× (baseline)
`cuda` (RTX 4090)	227.6 (±11.3)	0.88	`36.343815`	16.4×
`sycl` (Arc A380)	32.1 (±0.1)	6.23	`36.343780`	2.3×
~~`vulkan`~~ (removed ADR-0726)	(historical: 14.1 fps)	(historical: 14.16s)	`36.343774`	historical

Notes:

CUDA at 4K is the headline number — the RTX 4090 sustains 227 fps on 8-bit 3840×2160 with vmaf_v0.6.1.json, ~16× faster than the CPU + AVX-512 baseline.
SYCL on Arc A380 is fp64-emulated (the A380 is a Gen12.7 part without native fp64). The 2.3× headline understates the SIMD path's potential on a fp64-native dGPU; revisit when an Arc B-series or Battlemage host lands. See backlog T7-17.
Vulkan rows are historical; the backend was removed in ADR-0726. The performance bottleneck (dispatch-overhead-bound on NVIDIA, 14 fps matching CPU) contributed to the removal rationale.

CPU SIMD-ISA breakdown (576×324)¶

Selected via --cpumask (bits set = ISAs to disable).

Configuration	`--cpumask`	fps	wall ms / 48f	speedup vs scalar
Scalar (no SIMD)	`63`	92.4 (±0.8)	519.6	1.0×
Up to AVX2	`48`	273.5 (±0.8)	175.5	2.96×
Default (full, AVX-512)	`0`	611.5 (±8.9)	78.5	6.62×

AVX-512 over AVX2 buys another 2.24× on top of the AVX2 baseline on the 9950X3D (Zen 5 has 512-bit SIMD pipes). Pools match across all three configurations to within assertAlmostEqual(places=6) per the Netflix golden gate.

`--precision` overhead (576×324 CPU, 48 frames)¶

String formatting is not on the hot path; switching from %.6f (default per ADR-0119) to %.17g (--precision=max) changes only the JSON-emit stage.

`--precision`	fps	wall ms	JSON output size	size delta vs default
no flag (`%.6f` default)	613.8 (±6.7)	78.2	31 837 B	baseline
`=6` (explicit)	616.9 (±8.3)	77.8	31 525 B	-1.0 %
`=max` (`%.17g`)	612.8 (±11.2)	78.4	40 041 B	+25.8 %

Wall-time delta is in the noise (<1 % across all three), confirming that the per-frame cost of %.17g is negligible — the cost shows up in JSON byte-count, not in wall time. Use --precision=max whenever cross-backend numerical diffing or IEEE-754 round-trip determinism matters.

How to reproduce¶

# 1. Build with all backends (oneAPI 2025.3 sourced for icx/icpx + Arc visibility)
source /opt/intel/oneapi-2025.3/setvars.sh
CC=icx CXX=icpx meson setup core/build libvmaf \
    -Denable_cuda=true -Denable_sycl=true \
    -Db_lto=false --buildtype=release
# Note: -Denable_vulkan=enabled removed per ADR-0726
ninja -C core/build

# 2. Acquire fixtures (gitignored — don't commit)
#    a) Netflix golden 576x324 + 1080p_5frames already live in
#       python/test/resource/yuv/ in the main checkout.
#    b) BBB 4K 200-frame pair: download from archive.org and ffmpeg-encode.
mkdir -p testdata/bbb
curl -L https://archive.org/download/big-buck-bunny-4k-60fps/BigBuckBunny4k60fps.mp4 \
    -o /tmp/bbb4k.mp4
ffmpeg -y -i /tmp/bbb4k.mp4 -frames:v 200 -pix_fmt yuv420p -s 3840x2160 \
    testdata/bbb/ref_3840x2160_200f.yuv
ffmpeg -y -i /tmp/bbb4k.mp4 -frames:v 200 -c:v libx264 -crf 35 -preset veryfast \
    -pix_fmt yuv420p -s 3840x2160 -f rawvideo - \
    | ffmpeg -y -f rawvideo -pix_fmt yuv420p -s 3840x2160 -i - \
        -pix_fmt yuv420p -s 3840x2160 testdata/bbb/dis_3840x2160_200f.yuv
#    Or any equivalent libx264 CRF=35 round-trip; the absolute pool drifts
#    with codec parameters but the fps numbers don't.

# 3. Run the bench
VMAF_BIN="$(pwd)/core/build/tools/vmaf" bash testdata/bench_all.sh

# 4. Verify each backend engaged via per-row metrics-key counts in the
#    bench output ("CPU 15 keys, CUDA 12 keys, SYCL 34 keys").
#    Identical key counts across two rows = silent CPU fallback.

For SIMD breakdown / --precision overhead numbers, see the harness scripts under testdata/ (or paste the inlined repeat_bench.py / simd_bench.py / precision_bench.py from the T7-37 PR description).

FFmpeg lavfi performance harness¶

testdata/bench_perf.py runs the FFmpeg filter path used by the historical perf_benchmark_results.json snapshot. It is useful when the question is "how fast is FFmpeg decode/upload/filter end-to-end?" rather than "how fast is the vmaf CLI binary?"

The harness is portable across checkouts:

python3 testdata/bench_perf.py \
    --ffmpeg /path/to/ffmpeg \
    --backend cpu \
    --backend cuda \
    --runs 3

Environment overrides are also supported:

Variable	CLI equivalent	Purpose
`VMAF_FFMPEG`	`--ffmpeg`	FFmpeg binary with the fork's libvmaf filters.
`VMAF_BENCH_RUNS`	`--runs`	Timing repetitions per backend.
`VMAF_BENCH_TIMEOUT_S`	`--timeout-s`	Per-run timeout.
`VMAF_BBB_MP4_REF`	`--bbb-mp4-ref`	Optional external BBB 4K MP4 for the decode+VMAF test.
`VMAF_SYCL_DEVICE`	`--sycl-device`	VAAPI render node used by the SYCL/QSV import path.
`VMAF_LD_LIBRARY_PATH`	`--ld-library-path`	Runtime library path for FFmpeg/libvmaf.

The committed raw 4K BBB pair remains the required fixture. The 1080p raw pair and MP4 decode test are optional by default; pass --require-all when you want a strict lab run that fails on any missing configured input. Use --list-tests to audit fixture availability and --dry-run to print the exact FFmpeg commands without touching hardware.