BD-rate with VMAF¶
BD-rate (Bjøntegaard Delta rate) measures the average bitrate difference at equivalent quality between two encoders. Negative BD-rate means the test encoder is more efficient — it spends fewer bits to reach the same VMAF/PSNR/MS-SSIM as the reference. The metric was introduced in Bjøntegaard 2001 (VCEG-M33) and is the standard codec-comparison reporting axis at JVET, AOM, and most recent neural-video-compression (NVC) papers (e.g. Gao et al., Advances in Neural Video Compression: A Review and Benchmarking, Bristol VI-Lab 2026, where VMAF is one of three reporting metrics alongside PSNR and MS-SSIM).
This page documents how to run a BD-rate study with VMAF using the fork's existing tooling.
What you have¶
python/vmaf/tools/bd_rate.py implements the Bjøntegaard piecewise-cubic-Hermite interpolation (pchip_interpolate) with the integration window clamped to the overlap region of the two RD curves, plus a convex-hull mode for near-monotonic data. The implementation is validated against JCTVC-E137 (the JCTVC reference numbers).
python/vmaf/tools/bd_rate_calculator.py is the higher-level calculator class.
python/test/bd_rate_calculator_test.py gates the math against published reference numbers; it runs in CI under the Netflix CPU Golden Tests (D24) lane.
What this fork ships beyond Netflix upstream¶
Same calculator. The fork-local additions are around it: a --precision CLI flag (%.17g default, IEEE-754 round-trip lossless — see docs/usage/precision.md) so the RD-point CSV your encoder run produces survives float-text round-trip, plus the GPU + SIMD speedups that make sweeping a BD-rate study tractable on a single workstation.
Recipe — NVC-style BD-rate against a reference encoder¶
The Bristol VI-Lab 2026 review reports BD-rate against VTM-20 LD/RA on UVG, MCL-JCV, HEVC B-E, and AOM A2-A5 with VMAF, PSNR, and MS-SSIM as the three quality metrics (Tables 3-4 of that paper). The protocol below mirrors that, scoped to a single sequence pair to keep the documentation compact.
1 — pick the sequences¶
The standard reference sets:
| Set | Source | Notes |
|---|---|---|
| UVG | ultravideo.fi (Mercat 2020) | 1080p / 4K, 7-12 sequences, 120 frames each |
| MCL-JCV | mcl.usc.edu (Wang 2016) | 1080p, 30 sequences, 150 frames each |
| HEVC B-E | JCTVC CTC (Bossen) | 416×240 to 1080p, 5 classes |
| AOM A1-A5 | AOM CTC (Zhao) | A1-A5 resolutions |
The fork's tiny-AI training already uses BVI-DVC + KoNViD-1k + BVI-AOM (see docs/research/0019-tiny-ai-netflix-training.md and docs/research/0033-bristol-nvc-review-2026.md); those are training corpora and don't substitute for the codec-comparison sets above.
2 — encode at multiple QP/CRF rungs¶
Standard ladders:
- HEVC/AVC: QP
{22, 27, 32, 37}(JVET CTC) or 6-rung{22, 27, 32, 37, 42, 47}for low-bitrate analysis. - AV1: CRF
{23, 31, 39, 47, 55, 63}(AOM CTC). - VVC: same as HEVC plus
{42, 47}for modern reporting.
For each rung, log:
bitrate_kbps— actual coded bitrate from the bitstream (ffprobe -show_streamsor the encoder's stat file).vmaf— score from libvmaf (this fork's CLI:vmafor via FFmpeg'slibvmaffilter).psnr—psnr_yis the standard reporting axis; the fork additionally emitspsnr_cb/psnr_cron the Vulkan backend withenable_chroma=true(seedocs/backends/vulkan/overview.md).ms_ssim—float_ms_ssim(CPU AVX2/AVX-512/NEON, CUDA, SYCL, Vulkan; the fork also exposes the 15 per-scale L/C/S sub-metrics behindenable_lcs=true, seedocs/metrics/features.md§SSIM/MS-SSIM).
3 — feed the calculator¶
from vmaf.tools.bd_rate import calculate_bd_rate
# (rate, quality) tuples — same metric on both sides.
reference = [(8650, 91.2), (4112, 86.5), (1956, 79.8), (920, 70.1)] # x265 slow
candidate = [(7980, 91.2), (3801, 86.5), (1810, 79.8), (852, 70.1)] # SVT-AV1
bd_rate_pct = calculate_bd_rate(reference, candidate)
print(f"BD-rate (VMAF): {bd_rate_pct:+.2f}%")
# Negative means `candidate` is more efficient than `reference`.
For the convex-hull variant (recommended on noisy or non-monotonic RD points):
The function clamps the integration window to the metric overlap between the two curves and raises BdRateNoOverlapException if they don't intersect. Use min_metric / max_metric to clamp explicitly.
4 — report¶
For an NVC-style table you want one BD-rate column per quality metric × test set combination. Example layout (mirroring Tables 3-4 of the Bristol paper):
| Encoder | UVG (VMAF) | UVG (PSNR-Y) | UVG (MS-SSIM) | MCL-JCV (VMAF) | … |
|---|---|---|---|---|---|
| x265 slow | 0.0 | 0.0 | 0.0 | 0.0 | |
| SVT-AV1 P2 | -8.4% | -3.1% | -7.6% | -10.1% | |
| VVenC slower | -22.3% | -16.5% | -21.0% | -24.8% |
(Values illustrative.)
Common pitfalls¶
- Different VMAF model versions across the two columns. Always pin
vmaf_v0.6.1or whatever model your study uses; the fork auto-loadsvmaf_v0.6.1.jsonby default. Verify withvmaf --version --verbose | grep model. - VBV / 1-pass vs 2-pass mismatch. BD-rate compares like-for-like rate-control modes. Two-pass encodes have a different RD shape than 1-pass
crf=encodes; mixing them inflates the apparent delta. The fork doesn't enforce this — your harness must. - GPU vs CPU score divergence. CUDA / SYCL / Vulkan backends produce ≤ ~1e-4 ULP-different VMAF scores from the CPU reference (within
places=4tolerance, but observable). For competitive BD-rate reporting use the same backend on both columns. Seedocs/research/0034-ci-pipeline-audit-2026-05.mdand ADR-0234 (proposed) on the forthcoming--gpu-calibratedflag for cross-backend reconciliation. - Too few rungs. The PCHIP interpolant degenerates with fewer than 4 RD points;
calculate_bd_rateraisesBdRateNotEnoughPointsExceptionbelow the threshold. For a publication-grade NVC report, 6 rungs is the AOM CTC norm.
References¶
- Bjøntegaard, G., Calculation of Average PSNR Differences between RD-curves, VCEG-M33, ITU-T VCEG, 2001.
- Bossen, F., Common Test Conditions and Software Reference Configurations, JCTVC, 2010.
- Gao et al., Advances in Neural Video Compression: A Review and Benchmarking, Bristol VI-Lab preprint 2026 (audited in
docs/research/0033-bristol-nvc-review-2026.md). - Mercat et al., UVG dataset: 50/120fps 4K sequences for video codec analysis and development, MMSys 2020.
- Wang et al., MCL-JCV: a JND-based H.264/AVC video quality assessment dataset, ICIP 2016.
- Zhao et al., AOM Common Test Conditions v3, AOM, 2022.