Skip to content

Benchmarks

Scope: this file tracks fork-added benchmarks (GPU backends, SIMD paths, --precision overhead). Netflix's upstream correctness numbers are the Netflix golden CPU pools — see CLAUDE.md §8.

Runs below are produced by make bench (drives testdata/bench_all.sh) on a fixed hardware profile and pinned commit. Contribute new numbers via a PR that updates this file alongside the commit that motivates the rerun.

Hardware profiles

Profile CPU GPUs Memory OS
ryzen-4090-arc AMD Ryzen 9 9950X3D (16c/32t, Zen 5, AVX-512) NVIDIA RTX 4090 (24 GB) + Intel Arc A380 (6 GB, fp64-emulated) 96 GB DDR5-6400 Linux 7.0.x (CachyOS)
xeon-arc Intel Xeon w9-3475X Intel Arc A770 128 GB DDR5-4800 Ubuntu 26.04
m4-pro Apple M4 Pro (integrated) 48 GB unified macOS 15

The ryzen-4090-arc profile is the canonical fork bench host: a single machine that exposes CUDA (RTX 4090), SYCL (Arc A380 via oneAPI 2025.3 Level Zero), and HIP so all active backends can run back-to-back from one shell. (Vulkan backend removed per ADR-0726; historical Vulkan bench rows are preserved below for reference.)

Backend comparison (Netflix normal pair, 576×324, 48 frames)

Source: python/test/resource/yuv/src01_hrc00_576x324.yuv vs …hrc01…. Model: model/vmaf_v0.6.1.json. Threads: 1. Precision: CLI default %.6f per ADR-0119. Numbers averaged over 5 wall-clock reps after 1 warmup; standard deviation in parentheses. Commit 41301496 on ryzen-4090-arc.

Backend fps (higher better) wall ms / 48f vmaf pool metrics-keys delta vs CPU pool
cpu (full ISA, AVX-512) 598 (±21) 80.3 76.667828 15 0 (reference)
cuda (RTX 4090) 278 (±52) 177.5 76.667828 12 0.0 — pool match to 6 dp; per-frame max ULP diff 1.8×10⁻⁵
sycl (Arc A380) 315 (±0.9) 152.3 76.667767 34 -6.1×10⁻⁵ pool; per-frame max diff 1.11×10⁻³
~~vulkan~~ (removed ADR-0726) (historical: 171 fps) (historical: 280.6ms) 76.667758 34 historical reference only

Key-count check. Each backend emits a different frames[0].metrics key set (CPU=15 with integer_aim/integer_motion3/integer_adm3, CUDA=12, SYCL=34 with raw _num/_den intermediates). Identical key counts across two rows would indicate a silent-fallback to CPU; the counts above confirm each backend actually engaged. See AGENTS.md §"Backend-engagement foot-guns".

Backend comparison (1080p, 5 frames)

Source: python/test/resource/yuv/src01_hrc{00,01}_1920x1080_5frames.yuv. Same setup as 576×324.

Backend fps wall ms / 5f vmaf pool metrics-keys
cpu 45.6 (±1.0) 109.7 35.815478 15
cuda 33.6 (±1.1) 148.8 35.815478 12
sycl 41.1 (±0.7) 121.7 35.815404 34
~~vulkan~~ (removed ADR-0726) (historical: 21.8 fps) (historical: 229.5ms) 35.815399 34

CPU outpaces CUDA at 1080p × 5 frames because dispatch overhead dominates the workload — only 5 frames doesn't amortise the CUDA launch/copy cost. CUDA decisively wins once the workload grows (see 4K below).

Backend comparison (BBB 4K, 200 frames)

Source: testdata/bbb/{ref,dis}_3840x2160_200f.yuv (BigBuckBunny 4K master, ffmpeg-encoded ref + libx264 CRF=35 round-trip distortion; see How to reproduce).

Backend fps wall s / 200f vmaf pool speedup vs CPU
cpu 13.9 (±0.5) 14.43 36.343813 1.0× (baseline)
cuda (RTX 4090) 227.6 (±11.3) 0.88 36.343815 16.4×
sycl (Arc A380) 32.1 (±0.1) 6.23 36.343780 2.3×
~~vulkan~~ (removed ADR-0726) (historical: 14.1 fps) (historical: 14.16s) 36.343774 historical

Notes:

  • CUDA at 4K is the headline number — the RTX 4090 sustains 227 fps on 8-bit 3840×2160 with vmaf_v0.6.1.json, ~16× faster than the CPU + AVX-512 baseline.
  • SYCL on Arc A380 is fp64-emulated (the A380 is a Gen12.7 part without native fp64). The 2.3× headline understates the SIMD path's potential on a fp64-native dGPU; revisit when an Arc B-series or Battlemage host lands. See backlog T7-17.
  • Vulkan rows are historical; the backend was removed in ADR-0726. The performance bottleneck (dispatch-overhead-bound on NVIDIA, 14 fps matching CPU) contributed to the removal rationale.

CPU SIMD-ISA breakdown (576×324)

Selected via --cpumask (bits set = ISAs to disable).

Configuration --cpumask fps wall ms / 48f speedup vs scalar
Scalar (no SIMD) 63 92.4 (±0.8) 519.6 1.0×
Up to AVX2 48 273.5 (±0.8) 175.5 2.96×
Default (full, AVX-512) 0 611.5 (±8.9) 78.5 6.62×

AVX-512 over AVX2 buys another 2.24× on top of the AVX2 baseline on the 9950X3D (Zen 5 has 512-bit SIMD pipes). Pools match across all three configurations to within assertAlmostEqual(places=6) per the Netflix golden gate.

--precision overhead (576×324 CPU, 48 frames)

String formatting is not on the hot path; switching from %.6f (default per ADR-0119) to %.17g (--precision=max) changes only the JSON-emit stage.

--precision fps wall ms JSON output size size delta vs default
no flag (%.6f default) 613.8 (±6.7) 78.2 31 837 B baseline
=6 (explicit) 616.9 (±8.3) 77.8 31 525 B -1.0 %
=max (%.17g) 612.8 (±11.2) 78.4 40 041 B +25.8 %

Wall-time delta is in the noise (<1 % across all three), confirming that the per-frame cost of %.17g is negligible — the cost shows up in JSON byte-count, not in wall time. Use --precision=max whenever cross-backend numerical diffing or IEEE-754 round-trip determinism matters.

How to reproduce

# 1. Build with all backends (oneAPI 2025.3 sourced for icx/icpx + Arc visibility)
source /opt/intel/oneapi-2025.3/setvars.sh
CC=icx CXX=icpx meson setup core/build libvmaf \
    -Denable_cuda=true -Denable_sycl=true \
    -Db_lto=false --buildtype=release
# Note: -Denable_vulkan=enabled removed per ADR-0726
ninja -C core/build

# 2. Acquire fixtures (gitignored — don't commit)
#    a) Netflix golden 576x324 + 1080p_5frames already live in
#       python/test/resource/yuv/ in the main checkout.
#    b) BBB 4K 200-frame pair: download from archive.org and ffmpeg-encode.
mkdir -p testdata/bbb
curl -L https://archive.org/download/big-buck-bunny-4k-60fps/BigBuckBunny4k60fps.mp4 \
    -o /tmp/bbb4k.mp4
ffmpeg -y -i /tmp/bbb4k.mp4 -frames:v 200 -pix_fmt yuv420p -s 3840x2160 \
    testdata/bbb/ref_3840x2160_200f.yuv
ffmpeg -y -i /tmp/bbb4k.mp4 -frames:v 200 -c:v libx264 -crf 35 -preset veryfast \
    -pix_fmt yuv420p -s 3840x2160 -f rawvideo - \
    | ffmpeg -y -f rawvideo -pix_fmt yuv420p -s 3840x2160 -i - \
        -pix_fmt yuv420p -s 3840x2160 testdata/bbb/dis_3840x2160_200f.yuv
#    Or any equivalent libx264 CRF=35 round-trip; the absolute pool drifts
#    with codec parameters but the fps numbers don't.

# 3. Run the bench
VMAF_BIN="$(pwd)/core/build/tools/vmaf" bash testdata/bench_all.sh

# 4. Verify each backend engaged via per-row metrics-key counts in the
#    bench output ("CPU 15 keys, CUDA 12 keys, SYCL 34 keys").
#    Identical key counts across two rows = silent CPU fallback.

For SIMD breakdown / --precision overhead numbers, see the harness scripts under testdata/ (or paste the inlined repeat_bench.py / simd_bench.py / precision_bench.py from the T7-37 PR description).

FFmpeg lavfi performance harness

testdata/bench_perf.py runs the FFmpeg filter path used by the historical perf_benchmark_results.json snapshot. It is useful when the question is "how fast is FFmpeg decode/upload/filter end-to-end?" rather than "how fast is the vmaf CLI binary?"

The harness is portable across checkouts:

python3 testdata/bench_perf.py \
    --ffmpeg /path/to/ffmpeg \
    --backend cpu \
    --backend cuda \
    --runs 3

Environment overrides are also supported:

Variable CLI equivalent Purpose
VMAF_FFMPEG --ffmpeg FFmpeg binary with the fork's libvmaf filters.
VMAF_BENCH_RUNS --runs Timing repetitions per backend.
VMAF_BENCH_TIMEOUT_S --timeout-s Per-run timeout.
VMAF_BBB_MP4_REF --bbb-mp4-ref Optional external BBB 4K MP4 for the decode+VMAF test.
VMAF_SYCL_DEVICE --sycl-device VAAPI render node used by the SYCL/QSV import path.
VMAF_LD_LIBRARY_PATH --ld-library-path Runtime library path for FFmpeg/libvmaf.

The committed raw 4K BBB pair remains the required fixture. The 1080p raw pair and MP4 decode test are optional by default; pass --require-all when you want a strict lab run that fails on any missing configured input. Use --list-tests to audit fixture availability and --dry-run to print the exact FFmpeg commands without touching hardware.