psnr_hvs AVX-512 bench-first audit (2026-05-09)
Scope: T3-9 (a) close-out under the unified AVX-512 audit-sweep methodology — bench AVX-512 against the existing AVX2 path on the Netflix normal pair (src01_hrc00_576x324.yuv ↔ src01_hrc01_576x324.yuv, 576×324, 8-bit, 48 frames); ship if 16-lane wins by ≥ 1.3× over AVX2; otherwise document as an ADR-0180-style ceiling row. This digest is the empirical companion to ADR-0350.
1. Methodology
- Host: AMD Ryzen 9 9950X3D (Zen 5). Full AVX-512 in /proc/cpuinfo: avx512f / avx512dq / avx512cd / avx512bw / avx512vl / avx512ifma / avx512vbmi.
- OS / toolchain: Linux 7.0.5-cachyos, GCC 14, single-thread.
- Repo state: master post 9cd2a354 (current tip at bench-time).
- Build: CPU-only release, -Denable_cuda=false --buildtype=release against core/build/. LTO on by default (b_lto=true in core/meson.build:8).
- Fixture: Netflix normal pair copied to /tmp/vmaf_test/ per the vmaf_bench data-dir convention.
- Runs: n=10 per row for wall-clock; one perf record -F 4000 -g for the cycle-share breakdown.
- Isolation: --no_prediction --feature psnr_hvs versus --no_prediction alone; the wall-clock delta is the isolated psnr_hvs increment, free of the default-model feature stack.
- Scalar trigger: --cpumask 0xfffffffe masks every SIMD flag, forcing the dispatch in core/src/feature/third_party/xiph/psnr_hvs.c back to the scalar calc_psnrhvs.
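The AVX-512 feature list in the Host row can be checked mechanically. The following is a minimal sketch, assuming a Linux-style `flags` line as found in /proc/cpuinfo; the helper name `list_avx512_flags` is hypothetical and not part of the repo, and the sample input is illustrative rather than a capture from the bench host.

```bash
# Hypothetical helper (not part of the repo): print the distinct avx512*
# feature flags found on stdin, sorted, on a single line.
list_avx512_flags() {
    grep -o 'avx512[a-z0-9_]*' | sort -u | paste -sd' ' -
}

# Illustrative input mimicking a /proc/cpuinfo flags line.
printf 'flags : fpu sse2 avx2 avx512f avx512dq avx512bw avx512vl\n' | list_avx512_flags
```

On a Linux host the same helper could be fed the real file with `list_avx512_flags < /proc/cpuinfo`; because /proc/cpuinfo repeats the flags line once per logical core, the `sort -u` step is what collapses the duplicates.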
2. Wall-clock results (delta isolation)
| run | wall-clock (s, n=10 mean) | psnr_hvs increment (s) |
|---|---|---|
| --no_prediction only (default cpumask) | 0.0027 | n/a |
| --no_prediction --cpumask 0xfffffffe | 0.0022 | n/a |
| --no_prediction --feature psnr_hvs (default) | 0.0743 | 0.0716 (AVX2) |
| --no_prediction --feature psnr_hvs --cpumask 0xfffffffe | 0.1096 | 0.1074 (scalar) |
- Effective AVX2 vs scalar speedup on this host: 0.1074 / 0.0716 = 1.50×. Notably better than the 1.17× recorded in ADR-0180 on a different host — Zen 5 has a wider integer-mul throughput than the earlier audit machine, so the 8-lane DCT amortises better.
- This strengthens the ceiling argument: AVX2 already recovers more of the available headroom than the prior bench measured.
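The delta-isolation arithmetic behind the table can be replayed directly. This is a minimal sketch using the four means reported above; the function names `increment` and `speedup` are hypothetical helpers introduced only for illustration.

```bash
# Hypothetical helpers: psnr_hvs increment = (run with feature) - (baseline),
# and speedup = scalar increment / SIMD increment.
increment() { awk -v f="$1" -v b="$2" 'BEGIN { printf "%.4f\n", f - b }'; }
speedup()   { awk -v s="$1" -v v="$2" 'BEGIN { printf "%.2f\n", s / v }'; }

# Means taken from the wall-clock table (seconds, n=10).
avx2=$(increment 0.0743 0.0027)
scalar=$(increment 0.1096 0.0022)

echo "AVX2 increment:   ${avx2}s"
echo "scalar increment: ${scalar}s"
echo "AVX2 vs scalar:   $(speedup "$scalar" "$avx2")x"
```

Subtracting the matching baseline for each cpumask setting, rather than a single shared baseline, keeps process start-up and I/O cost out of both increments.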
3. Per-symbol cycle share (the decisive number)
$ perf record -F 4000 -g -o /tmp/perf.data -- \
build/tools/vmaf -r /tmp/vmaf_test/ref_576x324.yuv \
-d /tmp/vmaf_test/dis_576x324.yuv \
-w 576 -h 324 -p 420 -b 8 \
--threads 1 --no_prediction --feature psnr_hvs \
-o /tmp/bench.json --json
$ perf report -i /tmp/perf.data --stdio --no-children -g none | head
78.42% vmaf libvmaf.so.3.0.0 [.] calc_psnrhvs_avx2
14.82% vmaf libvmaf.so.3.0.0 [.] od_bin_fdct8x8_avx2
0.83% vmaf libc.so.6 [.] 0x00000000001b24f3
0.82% vmaf libc.so.6 [.] 0x00000000001b2354
...
- calc_psnrhvs_avx2 (78.42 %) is the per-block scalar tail: byte loads, global and quadrant means, variances, the float fold of compute_masks, and accumulate_error. ADR-0138 + ADR-0139 (re-asserted by ADR-0159) lock these as per-lane-scalar to preserve byte-for-byte parity with the scalar reference's IEEE-754 summation tree. Vectorising this would break the Netflix golden (~5.5e-5 drift; see ADR-0160 §Context).
- od_bin_fdct8x8_avx2 (14.82 %) is the only piece an AVX-512 widening could attack: the integer 8×8 DCT butterfly + transpose network.
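The reason a reordered summation tree cannot stay bit-exact is that IEEE-754 addition is not associative. The sketch below illustrates this with awk, which computes in double precision; the psnr_hvs kernel works on float, and the three operands are textbook values, not data from the kernel, so this demonstrates only the mechanism and not the measured ~5.5e-5 drift. The function names are hypothetical.

```bash
# Same three operands, two different summation trees.
# awk arithmetic is IEEE-754 double; %.17g prints enough digits to
# distinguish any two distinct doubles.
sum_left()  { awk 'BEGIN { printf "%.17g\n", (0.1 + 0.2) + 0.3 }'; }
sum_right() { awk 'BEGIN { printf "%.17g\n", 0.1 + (0.2 + 0.3) }'; }

echo "left-to-right:  $(sum_left)"
echo "right-to-left:  $(sum_right)"
if [ "$(sum_left)" != "$(sum_right)" ]; then
    echo "summation order changes the result"
fi
```

A lane-parallel reduction regroups the additions in the same way as the second function does, which is why the per-lane-scalar discipline keeps the original order.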
4. Amdahl ceiling
p = vectorisable share = 14.82 / (14.82 + 78.42 + 6.76) ≈ 0.1482 of total cycles (treating libc / kernel / glue as non-vectorisable). Even an infinitely fast DCT bounds wall-clock improvement at:

$$ S_{\max} = \frac{1}{1 - p} = \frac{1}{1 - 0.1482} \approx 1.174 $$

i.e. 1.17× over current AVX2. A realistic 2-block-batched 16-lane DCT recovers ~50 % of the DCT cost (the per-block scalar setup and per-pass transpose overheads do not scale linearly with lane count on the 8×8 working set), which corresponds to a local DCT speedup of s = 2:

$$ S_{\text{real}} = \frac{1}{(1 - p) + p / s} = \frac{1}{0.8518 + 0.0741} \approx 1.080 $$

i.e. 1.08× over AVX2. The T3-9 ship gate is 1.3×: the realistic figure misses by 22 percentage points and the theoretical-best ceiling misses by 13 percentage points. There is no plausible path to 1.3× from a 14.82 % cycle slice.
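The Amdahl figures quoted above can be reproduced numerically. This is a minimal sketch, assuming the standard form of Amdahl's law with p as the DCT cycle share and s as the local DCT speedup; the function names `amdahl` and `amdahl_ceiling` are hypothetical.

```bash
# Amdahl's law: overall speedup when a fraction p of the runtime is
# accelerated by a factor s.
amdahl() {
    awk -v p="$1" -v s="$2" 'BEGIN { printf "%.3f\n", 1 / ((1 - p) + p / s) }'
}

# Limit of the same expression as s grows without bound.
amdahl_ceiling() {
    awk -v p="$1" 'BEGIN { printf "%.3f\n", 1 / (1 - p) }'
}

p=0.1482   # od_bin_fdct8x8_avx2 share from the perf report
echo "ceiling (infinitely fast DCT): $(amdahl_ceiling "$p")x"
echo "realistic (DCT cost halved):   $(amdahl "$p" 2)x"
```

Inverting the ceiling formula gives the share that a 1.3× gate would need even with an infinitely fast DCT: p ≥ 1 − 1/1.3 ≈ 0.231, well above the measured 0.1482.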
5. Why the DCT is bandwidth-amortised at 8 lanes already
The integer 8×8 DCT does 30 butterfly ops × 2 passes = 60 ops per column, distributed across 8 columns by AVX2's 8-lane __m256i. The per-block working set is 64 × 4 bytes = 256 bytes, i.e. four 64-byte cache lines or exactly eight __m256i registers, so the load / store side is already negligible. The AVX2 path's runtime is dominated by:
- The two 8-lane butterfly passes (operands stay in registers across the 60 ops; pure ALU throughput).
- The 3-stage 8×8 transpose between passes (unpacklo/hi_epi32 + unpacklo/hi_epi64 + 2× permute2x128_si256).
- Final transpose + 8 stores.
Going to 16 lanes (2-block batch) doubles the operand pressure without doubling the throughput on Zen 5: vpmulld / vpaddd / vpsrad issue at 1/cycle on AVX-512 just as they do on AVX2 (they run on the 256-bit FMA / int execution port which doesn't widen on Zen 5). The transpose network also needs an additional cross-block lane crossing (vshufi32x4) that AVX2's 2-block-equivalent doesn't pay. Net: the per-instruction throughput of the DCT is similar on AVX2 and AVX-512; the only win is amortising fixed per-call overhead across two blocks, which the perf data shows is small.
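The working-set and op-count figures used in this section follow from simple integer arithmetic. The sketch below assumes 64-byte cache lines, 32-byte (256-bit) YMM registers and 64-byte (512-bit) ZMM registers; the variable names are illustrative only.

```bash
# Per-block working set: 64 coefficients of 4 bytes (32-bit integers) each.
block_bytes=$((64 * 4))
cache_lines=$((block_bytes / 64))      # 64-byte cache lines touched per block
ymm_regs=$((block_bytes / 32))         # 256-bit registers holding one block
zmm_regs_two_blocks=$((2 * block_bytes / 64))  # 512-bit registers for a 2-block batch
ops_per_column=$((30 * 2))             # 30 butterfly ops x 2 passes

echo "block working set:        ${block_bytes} bytes"
echo "cache lines per block:    ${cache_lines}"
echo "YMM registers per block:  ${ymm_regs}"
echo "ZMM registers (2 blocks): ${zmm_regs_two_blocks}"
echo "butterfly ops per column: ${ops_per_column}"
```

The register count is the same for one block in YMM and two blocks in ZMM, which is consistent with the point that a 16-lane batch mainly amortises fixed per-call overhead.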
This matches the pattern in ADR-0138 §"Follow-up" (iqa_convolve AVX-512 — same suspected ceiling on 8×8 windows; T3-9 (c) bench scheduled), ADR-0179 (float_moment — memory-bound reduction at 4096 pixels), and ADR-0180 (audit-level close-out for both kernels).
6. Decision
T3-9 (a) closes as AVX2 ceiling. No psnr_hvs_avx512.c is added; ADR-0350 records the close-out; ADR-0160 §Consequences gets a status-update appendix pointing at ADR-0350; T3-9 row in BACKLOG.md flips sub-row (a) to DONE-as-ceiling.
T3-9 (b) (SSIMULACRA 2 IIR blur + picture_to_linear_rgb AVX-512+NEON drift audit) and (c) (iqa_convolve AVX-512) remain open; they bench independently in their own follow-up PRs.
7. Reproducer
cd "$REPO/libvmaf"
meson setup build -Denable_cuda=false --buildtype=release
ninja -C build
mkdir -p /tmp/vmaf_test
cp ../python/test/resource/yuv/src01_hrc00_576x324.yuv /tmp/vmaf_test/ref_576x324.yuv
cp ../python/test/resource/yuv/src01_hrc01_576x324.yuv /tmp/vmaf_test/dis_576x324.yuv
# Wall-clock isolation (n=10 each): the two baselines, then the two psnr_hvs rows
for label in 'AVX2 baseline' 'scalar baseline' 'AVX2 + psnr_hvs' 'scalar + psnr_hvs'; do
  case "$label" in
    'AVX2 baseline')     FLAGS='' ;;
    'scalar baseline')   FLAGS='--cpumask 0xfffffffe' ;;
    'AVX2 + psnr_hvs')   FLAGS='--feature psnr_hvs' ;;
    'scalar + psnr_hvs') FLAGS='--feature psnr_hvs --cpumask 0xfffffffe' ;;
  esac
  for i in $(seq 1 10); do
    t0=$(date +%s%N)
    # $FLAGS is left unquoted on purpose so it splits into separate arguments
    build/tools/vmaf -r /tmp/vmaf_test/ref_576x324.yuv \
      -d /tmp/vmaf_test/dis_576x324.yuv \
      -w 576 -h 324 -p 420 -b 8 --threads 1 \
      --no_prediction $FLAGS \
      -o /tmp/bench.json --json >/dev/null 2>&1
    t1=$(date +%s%N)
    awk -v a="$t0" -v b="$t1" -v l="$label" 'BEGIN { printf "%-20s %.4fs\n", l, (b-a)/1e9 }'
  done
done
# Cycle-share breakdown
perf record -F 4000 -g -o /tmp/perf.data -- \
  build/tools/vmaf -r /tmp/vmaf_test/ref_576x324.yuv \
  -d /tmp/vmaf_test/dis_576x324.yuv \
  -w 576 -h 324 -p 420 -b 8 \
  --threads 1 --no_prediction --feature psnr_hvs \
  -o /tmp/bench.json --json
perf report -i /tmp/perf.data --stdio --no-children -g none | head -10
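The timing loop prints one line per run, so the n=10 means still have to be computed from its output. The following is a minimal sketch of that aggregation, assuming lines in the loop's `label  0.0000s` format; `mean_per_label` is a hypothetical helper and the sample timings are illustrative, not measurements.

```bash
# Hypothetical helper: average the per-run timings for each label.
# Expects lines such as "AVX2 + psnr_hvs      0.0743s" on stdin.
mean_per_label() {
    awk '{
        t = $NF; sub(/s$/, "", t)     # last field is the time, e.g. "0.0743s"
        $NF = ""; sub(/ +$/, "")      # what remains of the line is the label
        label = $0
        if (!(label in n)) order[++k] = label   # remember first-seen order
        sum[label] += t; n[label]++
    }
    END {
        for (i = 1; i <= k; i++)
            printf "%s: mean %.4fs (n=%d)\n", order[i], sum[order[i]] / n[order[i]], n[order[i]]
    }'
}

# Illustrative input in the same format the timing loop emits.
printf '%s\n' \
    'AVX2 + psnr_hvs      0.0740s' \
    'AVX2 + psnr_hvs      0.0746s' \
    'scalar + psnr_hvs    0.1096s' | mean_per_label
```

In practice the timing loop's output could be redirected to a file and piped through the helper, after which the increments are the differences between each psnr_hvs mean and its matching baseline mean.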
8. References
- ADR-0350: the close-out ADR companion to this digest.
- ADR-0180: original ceiling decision (2026-04-26); this digest re-validates it.
- ADR-0179: sister bandwidth-bound kernel (float_moment).
- ADR-0159 / ADR-0160: AVX2 + NEON bit-exactness contracts that lock the scalar tail.
- ADR-0138 / ADR-0139: per-lane-scalar float reduction discipline.
- T3-9 row at .workingdir2/BACKLOG.md:549.