Research-0744: Cross-Backend Parity Baseline (Pre-ncu-opt)
Date: 2026-05-28

Branch: research/cuda-cross-backend-baseline-20260528

Purpose: Establish a numeric and throughput baseline for CPU and CUDA before perf/cuda-vif-filter1d-ncu-driven and future optimization PRs land. SYCL was skipped because no Intel device is exposed to the one-off container. All future performance PRs should cite this digest as the pre-optimization reference point.
Environment
| Item | Value |
|---|---|
| Container | vmaf-dev-mcp:cuda13.3 (one-off docker run --rm --gpus all) |
| GPU | NVIDIA GeForce RTX 4090 |
| CUDA driver | 610.43.02 |
| vmaf binary | /build/vmaf/core/build/tools/vmaf v3.0.0 |
| Model | vmaf_v0.6.1.json |
| SYCL | Skipped — vmaf_sycl_state_init fails (no /dev/dri Intel node passed to container) |
| HIP | Skipped — no AMD device on host |
| Metal | Skipped — not Linux |
Workloads
| ID | File pair | Resolution | Frames | bit-depth |
|---|---|---|---|---|
| WL1 | src01_hrc00_576x324.yuv ↔ src01_hrc01_576x324.yuv | 576×324 | 48 | 8-bit |
| WL2 | checkerboard_1920_1080_10_3_0_0.yuv ↔ ..._1_0.yuv | 1920×1080 | 3 | 8-bit |
| WL3 | checkerboard_1920_1080_10_3_0_0.yuv ↔ ..._10_0.yuv | 1920×1080 | 3 | 8-bit |
Wall-Time Baseline (3 runs each; all values in ms)
| Workload | Backend | Run 1 | Run 2 | Run 3 | Median |
|---|---|---|---|---|---|
| WL1 (576×324, 48f) | CPU | 96 | 86 | 90 | 90 |
| WL1 (576×324, 48f) | CUDA | 170 | 148 | 149 | 149 |
| WL2 (1080p, 3f) | CPU | 107 | 72 | 73 | 73 |
| WL2 (1080p, 3f) | CUDA | 161 | 147 | 154 | 154 |
| WL3 (1080p, 3f) | CPU | 73 | 75 | 74 | 74 |
| WL3 (1080p, 3f) | CUDA | 147 | 153 | 152 | 152 |
Interpretation: CPU is faster than CUDA on all three workloads at these frame counts. Fixed CUDA overhead (init plus H2D/D2H transfers) dominates at 3–48 frames. A throughput advantage only appears once per-frame work outweighs that fixed cost: the perf_benchmark_results.json snapshot (BBB 1080p, 48 frames) shows CUDA at ~249 fps vs CPU at ~44 fps, a 5.7× advantage. For 1080p content these data points bound the crossover between 3 and 48 frames; it is estimated at roughly 10 frames but was not measured directly.
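The crossover can be roughly located with a fixed-cost-plus-per-frame model. The sketch below is an estimate under stated assumptions, not a measurement: it takes steady-state per-frame cost from the BBB snapshot throughput (43.6 and 249.4 fps) and backs out a fixed per-invocation cost from the WL2 3-frame medians, even though the two come from different clips. The function names and the linear model itself are illustrative.

```python
def fixed_cost_ms(median_ms, frames, fps):
    """Fixed per-invocation cost implied by a wall time and a steady-state fps."""
    return median_ms - frames * 1000.0 / fps


def crossover_frames(cpu_fixed, cpu_fps, cuda_fixed, cuda_fps):
    """Frame count n where cpu_fixed + n*cpu_frame == cuda_fixed + n*cuda_frame."""
    cpu_per_frame = 1000.0 / cpu_fps
    cuda_per_frame = 1000.0 / cuda_fps
    return (cuda_fixed - cpu_fixed) / (cpu_per_frame - cuda_per_frame)


# Throughput from the BBB 1080p snapshot (assumed to hold for other 1080p clips).
CPU_FPS, CUDA_FPS = 43.6, 249.4

# WL2 medians from the wall-time table: 3 frames, 73 ms (CPU) and 154 ms (CUDA).
cpu_fixed = fixed_cost_ms(73, 3, CPU_FPS)
cuda_fixed = fixed_cost_ms(154, 3, CUDA_FPS)

n = crossover_frames(cpu_fixed, CPU_FPS, cuda_fixed, CUDA_FPS)
print(f"implied CUDA fixed cost: {cuda_fixed:.0f} ms")
print(f"estimated 1080p crossover: {n:.1f} frames")
```

Under these assumptions the estimate comes out at roughly 7 frames, with an implied CUDA fixed cost of about 140 ms. Measuring 1080p clips at several frame counts would replace the model with data.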
Score Correctness vs CPU (ADR-0119 tolerance)
| Workload | CPU score | CUDA score | Delta | ULP diff | Status |
|---|---|---|---|---|---|
| WL1 (576×324, 48f) | 76.667830 | 76.667829 | −1.0×10⁻⁶ | 70,368,744 | WITHIN TOLERANCE |
| WL2 (1080p, 3f) | 35.068672 | 35.068668 | −4.0×10⁻⁶ | 562,949,953 | WITHIN TOLERANCE |
| WL3 (1080p, 3f) | 7.985899 | 7.985899 | 0.0 | 0 | EXACT |
The ULP values are the IEEE-754 bit-level distance between the two float64 values shown (scores as printed to six decimal places). The WL2 delta corresponds to a relative error of ~1.1×10⁻⁷, well within the established "close but never bit-exact" GPU tolerance (the golden gate is CPU-only; GPU backends are not expected to be bit-exact). No correctness finding.
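For readers who want to recompute the ULP column, one common way to get a float64 ULP distance is to compare the values' bit patterns as ordered integers. This is a minimal sketch, assuming the table was derived from the six-decimal scores; the helper names are illustrative and not part of the vmaf tooling.

```python
import struct


def ordered_bits(x: float) -> int:
    """Map a float64 to an integer whose ordering matches the float ordering."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats have the sign bit set; fold them onto the negative integers.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Count of float64 steps separating a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))


def relative_error(reference: float, candidate: float) -> float:
    return abs(candidate - reference) / abs(reference)


SCORES = [
    ("WL1", 76.667830, 76.667829),
    ("WL2", 35.068672, 35.068668),
    ("WL3", 7.985899, 7.985899),
]
for name, cpu, cuda in SCORES:
    print(name, ulp_distance(cpu, cuda), f"{relative_error(cpu, cuda):.2e}")
```

The counts should land within a step or two of the table's ULP column, since each decimal literal is itself rounded to the nearest float64.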
Netflix golden gate (WL1): the Python assertAlmostEqual(..., places=4) gate expects 76.6678. The CPU score 76.667830 and the CUDA score 76.667829 both pass at places=4. No regression.
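The places=4 rule in Python's unittest rounds the absolute difference to four decimal places and requires the result to be zero. The sketch below applies that rule to the two WL1 scores; the helper name passes_places is illustrative, and the golden value 76.6678 is taken from the text above.

```python
import unittest


def passes_places(actual, expected, places=4):
    """Mirror unittest's rule: the difference rounded to `places` must be zero."""
    return round(abs(actual - expected), places) == 0


GOLDEN = 76.6678
WL1_SCORES = {"cpu": 76.667830, "cuda": 76.667829}

checker = unittest.TestCase()
for backend, score in WL1_SCORES.items():
    # Raises AssertionError if the score misses the gate.
    checker.assertAlmostEqual(score, GOLDEN, places=4)
    print(backend, passes_places(score, GOLDEN))
```

A score would have to drift by about 5×10⁻⁵ before this gate fails, which is far coarser than the CPU/CUDA deltas recorded here.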
Per-Feature Parity (WL2, CPU vs CUDA)
| Feature | CPU | CUDA | Delta | ULP | Status |
|---|---|---|---|---|---|
| integer_adm2 | 0.78533200 | 0.78533200 | 0.0 | 0 | EXACT |
| integer_adm3 | 0.81757600 | MISSING | n/a | n/a | GAP |
| integer_adm_scale0 | 0.72133600 | 0.72133600 | 0.0 | 0 | EXACT |
| integer_adm_scale1 | 0.69258800 | 0.69258800 | 0.0 | 0 | EXACT |
| integer_adm_scale2 | 0.80414800 | 0.80414800 | 0.0 | 0 | EXACT |
| integer_adm_scale3 | 0.82791000 | 0.82791000 | 0.0 | 0 | EXACT |
| integer_aim | 0.15018000 | MISSING | n/a | n/a | GAP |
| integer_motion | 12.55471200 | 12.55471200 | 0.0 | 0 | EXACT |
| integer_motion2 | 12.55471200 | 12.55471200 | 0.0 | 0 | EXACT |
| integer_motion3 | 18.82319100 | 18.82319100 | 0.0 | 0 | EXACT |
| integer_vif_scale0 | 0.11290600 | 0.11290600 | 0.0 | 0 | EXACT |
| integer_vif_scale1 | 0.29837200 | 0.29837200 | 0.0 | 0 | EXACT |
| integer_vif_scale2 | 0.33743200 | 0.33743200 | 0.0 | 0 | EXACT |
| integer_vif_scale3 | 0.49644500 | 0.49644500 | 0.0 | 0 | EXACT |
| vmaf | 35.06867200 | 35.06866800 | −4.0×10⁻⁶ | 562,949,953 | WITHIN TOLERANCE |
Feature gaps noted: integer_adm3 and integer_aim are absent from CUDA pooled_metrics. These are model-layer post-processing scalars; the CUDA feature extractor computes integer_adm2 plus the four scale components but does not emit the derived adm3 aggregate or the aim (Additive Impairment Measure) score. The vmaf model JSON references these; the CPU path computes them as a post-processing pass. Whether the CUDA path omits the collection step or the JSON writer filters them out requires a code-path check, which is outside the scope of this baseline.
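Gaps like these can be caught mechanically by diffing the feature keys of the two output files. The following is a minimal sketch, assuming each output JSON carries a top-level pooled_metrics object keyed by feature name with a mean field; the inline dictionaries stand in for the parsed CPU and CUDA files and contain only a subset of the WL2 rows.

```python
def missing_features(reference: dict, candidate: dict) -> list:
    """Feature names pooled in `reference` but absent from `candidate`."""
    ref_keys = set(reference.get("pooled_metrics", {}))
    cand_keys = set(candidate.get("pooled_metrics", {}))
    return sorted(ref_keys - cand_keys)


# Stand-ins for json.load() of the CPU and CUDA outputs (assumed layout).
cpu_out = {
    "pooled_metrics": {
        "integer_adm2": {"mean": 0.785332},
        "integer_adm3": {"mean": 0.817576},
        "integer_aim": {"mean": 0.150180},
        "vmaf": {"mean": 35.068672},
    }
}
cuda_out = {
    "pooled_metrics": {
        "integer_adm2": {"mean": 0.785332},
        "vmaf": {"mean": 35.068668},
    }
}

gaps = missing_features(cpu_out, cuda_out)
print(gaps)
```

Running the same check in the opposite direction would flag features that only the CUDA path emits.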
Existing Snapshot Comparison
testdata/perf_benchmark_results.json records a previous BBB 1080p 48-frame run:
| Backend | Pooled score | Best FPS | This baseline |
|---|---|---|---|
| CPU | 95.098707 | 43.6 fps | — (different YUV) |
| CUDA | 95.098706 | 249.4 fps | — |
| SYCL | 95.098681 | 93.4 fps | SKIP (device unavailable) |
The new baseline uses Netflix golden YUV pairs rather than BBB, so scores are not comparable. However, the snapshot's 1080p throughput figures (SYCL 93 fps, CUDA 249 fps) remain the reference numbers that future PRs should beat.
Reproducer
```bash
# One-off container run (isolates from shared state)
docker run --rm --gpus all \
  --entrypoint /bin/bash \
  -v /home/kilian/dev/vmaf/python/test/resource/yuv:/yuv:ro \
  vmaf-dev-mcp:cuda13.3 \
  -c '/build/vmaf/core/build/tools/vmaf \
    -r /yuv/src01_hrc00_576x324.yuv \
    -d /yuv/src01_hrc01_576x324.yuv \
    -w 576 -h 324 -p 420 -b 8 \
    -m path=/build/vmaf/model/vmaf_v0.6.1.json \
    -o /tmp/out.json --json -q \
    --backend cpu'
# Replace --backend cpu with --backend cuda for CUDA run
```
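The digest does not record how the three wall times per workload were captured. One possible harness, sketched below, runs a command several times and reports the median. The function name is hypothetical, and the stand-in command only keeps the sketch runnable anywhere; substitute the docker invocation above as an argument list to time vmaf itself, bearing in mind that container startup would then be included in the figure.

```python
import statistics
import subprocess
import sys
import time


def median_wall_ms(cmd, runs=3):
    """Run `cmd` `runs` times; return (median, samples) in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), samples


# Stand-in command (a no-op Python process), not the vmaf invocation.
stand_in = [sys.executable, "-c", "pass"]
median_ms, samples_ms = median_wall_ms(stand_in, runs=3)
print(f"median {median_ms:.0f} ms over {len(samples_ms)} runs")
```

With three runs, the median discards a slow first run, such as the 170 ms CUDA sample on WL1.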
Observations and Open Questions
- CUDA slower than CPU below roughly 10 frames at 1080p. The crossover depends on frame count, not resolution alone: at 48 frames CPU still wins at 576×324 (WL1), while the perf_benchmark_results.json BBB data shows CUDA winning at 1080p. Budget 100–150 ms for CUDA init per invocation.
- integer_adm3 / integer_aim missing from CUDA output. These features are emitted by the CPU path; the CUDA float_adm_cuda extractor apparently does not expose them in the pooled metrics JSON. Requires investigation: ADR-0574 added VMAF_feature_aim_score and VMAF_feature_adm3_score to float_adm_cuda. The missing output may be an integer vs float ADM path distinction: the vmaf_v0.6.1 model uses integer ADM by default, and the CUDA path may route through integer_adm_cuda, which predates ADR-0574.
- SYCL baseline unavailable. The Intel Arc A380 device is not exposed to one-off containers via --device /dev/dri/renderD129. The prior snapshot shows SYCL at ~93 fps on 1080p. To get a SYCL baseline, run via the shared vmaf-dev-mcp container: docker exec vmaf-dev-mcp vmaf ... --backend sycl.
- Score determinism is good. All three CPU runs and all three CUDA runs produce identical scores across runs (no non-determinism from threading or GPU scheduling observed at this clip length).
Future Comparison Baseline
When perf/cuda-vif-filter1d-ncu-driven lands, compare against:
- WL1 CPU median: 90ms, CUDA median: 149ms
- WL2 CPU median: 73ms, CUDA median: 154ms
- WL1 VMAF score (CPU): 76.667830, (CUDA): 76.667829, delta: −1×10⁻⁶
- WL2 VMAF score (CPU): 35.068672, (CUDA): 35.068668, delta: −4×10⁻⁶
- Existing BBB 1080p CUDA throughput snapshot: 249 fps (from perf_benchmark_results.json)
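A later PR could compare its own medians against these numbers with a small helper like the one below. This is a sketch only: the dictionary layout and function names are illustrative, and the call at the end feeds the baseline back in as a self-check rather than using new measurements.

```python
# Median wall times (ms) from this digest, keyed by (workload, backend).
BASELINE_MS = {
    ("WL1", "cpu"): 90,
    ("WL1", "cuda"): 149,
    ("WL2", "cpu"): 73,
    ("WL2", "cuda"): 154,
}


def percent_change(baseline_ms, new_ms):
    """Signed change relative to baseline; negative means faster."""
    return (new_ms - baseline_ms) / baseline_ms * 100.0


def compare(new_medians, baseline=BASELINE_MS):
    """Percent change for every (workload, backend) present in both maps."""
    return {
        key: percent_change(baseline[key], value)
        for key, value in new_medians.items()
        if key in baseline
    }


# Self-check: comparing the baseline with itself yields 0% everywhere.
report = compare(dict(BASELINE_MS))
for (workload, backend), delta in sorted(report.items()):
    print(f"{workload} {backend}: {delta:+.1f}%")
```

Given the 3-run sample size and the run-to-run spread in the wall-time table, small percentage changes should be read with caution.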