`psnr_hvs` AVX-512 bench-first audit (2026-05-09)

Scope: T3-9 (a) close-out under the unified AVX-512 audit-sweep methodology. Bench AVX-512 against the existing AVX2 path on the Netflix normal pair (src01_hrc00_576x324.yuv ↔ src01_hrc01_576x324.yuv, 576×324, 8-bit, 48 frames). Ship if the 16-lane path wins by ≥ 1.3× over AVX2; otherwise document it as an ADR-0180-style ceiling row. This digest is the empirical companion to ADR-0350.

1. Methodology

Host: AMD Ryzen 9 9950X3D (Zen 5). Full AVX-512 in /proc/cpuinfo: avx512f / avx512dq / avx512cd / avx512bw / avx512vl / avx512ifma / avx512vbmi.
OS / toolchain: Linux 7.0.5-cachyos, GCC 14, single-thread.
Repo state: master post 9cd2a354 (current tip at bench-time).
Build: CPU-only release, -Denable_cuda=false --buildtype=release against core/build/. LTO on by default (b_lto=true in core/meson.build:8).
Fixture: Netflix normal pair copied to /tmp/vmaf_test/ per the vmaf_bench data-dir convention.
Runs: n=10 per row for wall-clock; one perf record -F 4000 -g for the cycle-share breakdown.
Isolation: --no_prediction --feature psnr_hvs versus --no_prediction alone — the wall-clock delta is the isolated psnr_hvs increment, free of the default-model feature stack.
Scalar trigger: --cpumask 0xfffffffe masks every SIMD flag, forcing the dispatch in core/src/feature/third_party/xiph/psnr_hvs.c back to the scalar calc_psnrhvs.

2. Wall-clock results (delta isolation)

| run | wall-clock (s, n=10 mean) | psnr_hvs increment (s) |
|---|---|---|
| `--no_prediction` only (default cpumask) | 0.0027 | n/a |
| `--no_prediction --cpumask 0xfffffffe` | 0.0022 | n/a |
| `--no_prediction --feature psnr_hvs` (default) | 0.0743 | 0.0716 (AVX2) |
| `--no_prediction --feature psnr_hvs --cpumask 0xfffffffe` | 0.1096 | 0.1074 (scalar) |

Effective AVX2 vs scalar speedup on this host: 0.1074 / 0.0716 = 1.50×. This is notably better than the 1.17× recorded in ADR-0180 on a different host: Zen 5 has wider integer-multiply throughput than the earlier audit machine, so the 8-lane DCT amortises better.
This strengthens the ceiling argument: AVX2 already recovers more of the available headroom than the prior bench measured.
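
The delta isolation behind the table can be replayed in a few lines of shell. This is a minimal sketch that takes the four wall-clock means from the table as inputs; the helper names `increment` and `speedup` are illustrative and not part of the repo.

```shell
# Delta isolation: subtract the baseline run from the feature run,
# then compare the scalar and AVX2 increments.
# increment / speedup are hypothetical helper names for illustration.

increment() {  # $1 = wall-clock with psnr_hvs, $2 = baseline wall-clock
  awk -v w="$1" -v b="$2" 'BEGIN { printf "%.4f\n", w - b }'
}

speedup() {    # $1 = slower increment, $2 = faster increment
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

avx2=$(increment 0.0743 0.0027)     # AVX2 psnr_hvs increment
scalar=$(increment 0.1096 0.0022)   # scalar psnr_hvs increment
echo "AVX2 increment:   $avx2 s"
echo "scalar increment: $scalar s"
echo "AVX2 vs scalar:   $(speedup "$scalar" "$avx2")x"
```

Each configuration is subtracted against its own baseline (default cpumask vs masked), so the small difference between the two baselines does not leak into the ratio.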

3. Cycle-share breakdown

$ perf record -F 4000 -g -o /tmp/perf.data -- \
    build/tools/vmaf -r /tmp/vmaf_test/ref_576x324.yuv \
                     -d /tmp/vmaf_test/dis_576x324.yuv \
                     -w 576 -h 324 -p 420 -b 8 \
                     --threads 1 --no_prediction --feature psnr_hvs \
                     -o /tmp/bench.json --json

$ perf report -i /tmp/perf.data --stdio --no-children -g none | head
    78.42%  vmaf  libvmaf.so.3.0.0  [.] calc_psnrhvs_avx2
    14.82%  vmaf  libvmaf.so.3.0.0  [.] od_bin_fdct8x8_avx2
     0.83%  vmaf  libc.so.6         [.] 0x00000000001b24f3
     0.82%  vmaf  libc.so.6         [.] 0x00000000001b2354
     ...

calc_psnrhvs_avx2 (78.42 %) is the per-block scalar tail: byte-load + global/quadrant means, variances, the float fold of compute_masks, accumulate_error. ADR-0138 + ADR-0139 (re-asserted by ADR-0159) lock these as per-lane-scalar to preserve byte-for-byte parity with the scalar reference's IEEE-754 summation tree. Vectorising this would break the Netflix golden (~5.5e-5 drift — see ADR-0160 §Context).
od_bin_fdct8x8_avx2 (14.82 %) is the only piece an AVX-512 widening could attack — the integer 8×8 DCT butterfly + transpose network.
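
As a sketch of how these cycle shares can be pulled out mechanically, the snippet below reads `perf report --stdio` style rows, picks out the two psnr_hvs symbols, and derives the remainder. The here-document reuses the rows shown above. The function name `cycle_shares` is hypothetical, and the parser assumes the percentage is the first field and the symbol is the last.

```shell
# Extract per-symbol cycle shares from `perf report --stdio` style rows.
# cycle_shares is a hypothetical helper; it assumes column 1 is "NN.NN%"
# and the last column is the symbol name.

cycle_shares() {
  awk '
    $1 ~ /^[0-9.]+%$/ {
      pct = $1; sub(/%$/, "", pct)                 # drop the trailing "%"
      if ($NF == "calc_psnrhvs_avx2")   st  = pct + 0
      if ($NF == "od_bin_fdct8x8_avx2") dct = pct + 0
    }
    END {
      # Everything that is neither scalar tail nor DCT: libc, kernel, glue.
      printf "scalar_tail=%.2f dct=%.2f other=%.2f\n", st, dct, 100 - st - dct
    }'
}

cycle_shares <<'EOF'
    78.42%  vmaf  libvmaf.so.3.0.0  [.] calc_psnrhvs_avx2
    14.82%  vmaf  libvmaf.so.3.0.0  [.] od_bin_fdct8x8_avx2
     0.83%  vmaf  libc.so.6         [.] 0x00000000001b24f3
     0.82%  vmaf  libc.so.6         [.] 0x00000000001b2354
EOF
```

The `other` figure is computed as a remainder rather than by summing the visible libc rows, because the report is truncated by `head`.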

4. Amdahl ceiling

p = vectorisable share = 14.82 / (14.82 + 78.42 + 6.76) = 0.1482 of total cycles, where 6.76 % is the remainder (libc / kernel / glue), treated as non-vectorisable along with the scalar tail. Even an infinitely fast DCT bounds wall-clock improvement at:

S_max = 1 / (1 − p) = 1 / (1 − 0.1482) = 1.174

i.e. 1.17× over current AVX2. A realistic 2-block-batched 16-lane DCT recovers ~50 % of the DCT cost (the per-block scalar setup and per-pass transpose overheads do not scale linearly with lane count on the 8×8 working set):

S_realistic ≈ 1 / (1 − 0.5 × 0.1482) = 1.080

i.e. 1.08× over AVX2. The T3-9 ship gate is 1.3×: the realistic figure falls 0.22 short of the gate, and even the theoretical-best ceiling falls 0.13 short (1.3 − 1.174 ≈ 0.126). There is no plausible path to 1.3× from a 14.82 % cycle slice.
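
Both figures are instances of the general Amdahl form S = 1 / ((1 − p) + p / s), where s is the speedup of the vectorisable slice: s → ∞ gives the ceiling, and recovering 50 % of the DCT cost corresponds to s = 2. A minimal sketch, with hypothetical helper names `amdahl` and `required_share`:

```shell
# Amdahl's law: S = 1 / ((1 - p) + p / s)
#   p = fraction of runtime that gets faster, s = speedup of that fraction.
# amdahl / required_share are hypothetical helper names.

amdahl() {          # $1 = p, $2 = s (pass "inf" for an infinitely fast kernel)
  awk -v p="$1" -v s="$2" 'BEGIN {
    rest = (s == "inf") ? 0 : p / s
    printf "%.3f\n", 1 / ((1 - p) + rest)
  }'
}

required_share() {  # smallest p that reaches gate $1 even as s -> infinity
  awk -v g="$1" 'BEGIN { printf "%.4f\n", 1 - 1 / g }'
}

echo "ceiling   (s = inf): $(amdahl 0.1482 inf)"
echo "realistic (s = 2):   $(amdahl 0.1482 2)"
echo "share needed for 1.3x: $(required_share 1.3)"
```

Inverting the formula shows the DCT would need to account for roughly 23.1 % of cycles before even an infinitely fast kernel could reach the 1.3× gate, against the 14.82 % measured here.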

5. Why the DCT is bandwidth-amortised at 8 lanes already

The integer 8×8 DCT does 30 butterfly ops × 2 passes = 60 ops per column, spread across the block's 8 columns by AVX2's 8-lane __m256i. The per-block working set is 64 × 4 bytes = 256 bytes, i.e. four 64-byte cache lines that stay L1-resident, so the load / store side is already negligible. The AVX2 path's runtime is dominated by:

The two 8-lane butterfly passes (operands stay in registers across the 60 ops; pure ALU throughput).
The 3-stage 8×8 transpose between passes (unpacklo/hi_epi32 + unpacklo/hi_epi64 + 2× permute2x128_si256).
Final transpose + 8 stores.
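
The footprint arithmetic behind "operands stay in registers" can be checked directly. The sketch below assumes 64-byte cache lines and one row of eight 32-bit coefficients per vector register; the variable names are illustrative.

```shell
# Register and cache footprint of one 8x8 block of 32-bit coefficients.
# Assumes 64-byte cache lines and one 8-coefficient row per 256-bit register.

coeffs=64          # 8 x 8 block
bytes_per_coeff=4  # 32-bit integer lanes
cache_line=64      # bytes (assumption stated above)
ymm_bytes=32       # one 256-bit __m256i
zmm_bytes=64       # one 512-bit __m512i

block_bytes=$((coeffs * bytes_per_coeff))
echo "block working set:                 $block_bytes bytes"
echo "cache lines per block:             $((block_bytes / cache_line))"
echo "YMM registers per block:           $((block_bytes / ymm_bytes))"
echo "ZMM registers for a 2-block batch: $((2 * block_bytes / zmm_bytes))"
```

One block fills 8 of the 16 YMM registers, and a 2-block batch fills 8 ZMM registers, so in both cases the block data itself fits in the register file before temporaries are counted.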

Going to 16 lanes (2-block batch) doubles the operand pressure without doubling the throughput on Zen 5: vpmulld / vpaddd / vpsrad issue at 1/cycle on AVX-512 just as they do on AVX2 (they run on the 256-bit FMA / int execution port which doesn't widen on Zen 5). The transpose network also needs an additional cross-block lane crossing (vshufi32x4) that AVX2's 2-block-equivalent doesn't pay. Net: the per-instruction throughput of the DCT is similar on AVX2 and AVX-512; the only win is amortising fixed per-call overhead across two blocks, which the perf data shows is small.

This matches the pattern in ADR-0138 §"Follow-up" (iqa_convolve AVX-512 — same suspected ceiling on 8×8 windows; T3-9 (c) bench scheduled), ADR-0179 (float_moment — memory-bound reduction at 4096 pixels), and ADR-0180 (audit-level close-out for both kernels).

6. Decision

T3-9 (a) closes as AVX2 ceiling. No psnr_hvs_avx512.c is added; ADR-0350 records the close-out; ADR-0160 §Consequences gets a status-update appendix pointing at ADR-0350; T3-9 row in BACKLOG.md flips sub-row (a) to DONE-as-ceiling.

T3-9 (b) (SSIMULACRA 2 IIR blur + picture_to_linear_rgb AVX-512+NEON drift audit) and (c) (iqa_convolve AVX-512) remain open; they bench independently in their own follow-up PRs.

7. Reproducer

cd "$REPO/libvmaf"
meson setup build -Denable_cuda=false --buildtype=release
ninja -C build

mkdir -p /tmp/vmaf_test
cp ../python/test/resource/yuv/src01_hrc00_576x324.yuv /tmp/vmaf_test/ref_576x324.yuv
cp ../python/test/resource/yuv/src01_hrc01_576x324.yuv /tmp/vmaf_test/dis_576x324.yuv

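# Wall-clock isolation (n=10 each); the two baseline rows are what the
# psnr_hvs rows are subtracted against to get the isolated increment.
for label in 'AVX2 baseline' 'scalar baseline' 'AVX2 + psnr_hvs' 'scalar + psnr_hvs'; do
  case "$label" in
    'AVX2 baseline')     FLAGS='' ;;
    'scalar baseline')   FLAGS='--cpumask 0xfffffffe' ;;
    'AVX2 + psnr_hvs')   FLAGS='--feature psnr_hvs' ;;
    'scalar + psnr_hvs') FLAGS='--feature psnr_hvs --cpumask 0xfffffffe' ;;
  esac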
  for i in $(seq 1 10); do
    t0=$(date +%s%N)
    build/tools/vmaf -r /tmp/vmaf_test/ref_576x324.yuv \
                     -d /tmp/vmaf_test/dis_576x324.yuv \
                     -w 576 -h 324 -p 420 -b 8 --threads 1 \
                     --no_prediction $FLAGS \
                     -o /tmp/bench.json --json >/dev/null 2>&1
    t1=$(date +%s%N)
    awk -v a=$t0 -v b=$t1 -v l="$label" 'BEGIN { printf "%-20s %.4fs\n", l, (b-a)/1e9 }'
  done
done

# Cycle-share breakdown
perf record -F 4000 -g -o /tmp/perf.data -- \
  build/tools/vmaf -r /tmp/vmaf_test/ref_576x324.yuv \
                   -d /tmp/vmaf_test/dis_576x324.yuv \
                   -w 576 -h 324 -p 420 -b 8 \
                   --threads 1 --no_prediction --feature psnr_hvs \
                   -o /tmp/bench.json --json
perf report -i /tmp/perf.data --stdio --no-children -g none | head -10
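
The wall-clock loop prints one line per run, so the n=10 means still have to be computed. A minimal sketch of that step follows, assuming the loop's output has been captured or piped in; `mean_by_label` is a hypothetical helper name, and the sample timings are placeholders for illustration, not measurements.

```shell
# Average the per-run timings printed by the wall-clock loop.
# Expects lines shaped like the loop's printf: "<label, space-padded> <seconds>s".
# mean_by_label is a hypothetical helper name.

mean_by_label() {
  awk '{
    t = $NF; sub(/s$/, "", t)               # last field, minus the "s" suffix
    label = $0; sub(/ +[^ ]+$/, "", label)  # everything before the last field
    if (!(label in n)) order[++k] = label   # remember first-seen order
    sum[label] += t; n[label]++
  }
  END {
    for (i = 1; i <= k; i++)
      printf "%s: %.4f s (n=%d)\n", order[i], sum[order[i]] / n[order[i]], n[order[i]]
  }'
}

# Placeholder timings for illustration only, not measurements.
mean_by_label <<'EOF'
AVX2 + psnr_hvs      0.0740s
AVX2 + psnr_hvs      0.0746s
scalar + psnr_hvs    0.1090s
scalar + psnr_hvs    0.1102s
EOF
```

The label is recovered by stripping the last field rather than by splitting on whitespace, because the labels themselves contain spaces.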

8. References

ADR-0350 — the close-out ADR companion to this digest.
ADR-0180 — original ceiling decision (2026-04-26); this digest re-validates it.
ADR-0179 — sister bandwidth-bound kernel (float_moment).
ADR-0159 / ADR-0160 — AVX2 + NEON bit-exactness contracts that lock the scalar tail.
ADR-0138 / ADR-0139 — per-lane-scalar float reduction discipline.
T3-9 row at .workingdir2/BACKLOG.md:549.
