ADR-0160: psnr_hvs NEON port — bit-exact DCT vectorization (T3-5-neon)
- Status: Accepted
- Date: 2026-04-24
- Deciders: Lusoris, Claude (Anthropic)
- Tags: simd, neon, aarch64, psnr-hvs, bit-exact, performance
Context
Sister port to ADR-0159. The AVX2 variant landed first (T3-5-avx2, PR #96) per the popup split that mirrored the T3-4 motion_v2 precedent (AVX2 then NEON as a separate PR). This ADR covers the aarch64 NEON follow-up.
The bit-exactness invariant set by ADR-0159 carries through: od_coeff (int32) DCT output and final psnr_hvs_{y,cb,cr,psnr_hvs} feature scores are byte-identical to the scalar reference in core/src/feature/third_party/xiph/psnr_hvs.c on every Netflix golden pair.
The AVX2 implementation exposed one subtle bit-exactness hazard that was fixed pre-merge (ADR-0159 era, commit c8e63d45): the per-block accumulate_error helper used a local float ret = 0 accumulator and returned the per-block total, which the caller added to the outer cross-block ret. IEEE-754 add is non-associative, so that re-ordered the float summation tree vs scalar's inline ret += ... and drifted the Netflix golden by ~5.5e-5. The fix threads ret through by pointer so every contribution hits the outer accumulator directly. This NEON port inherits the pointer-based accumulate_error signature; rebase-notes.md §0052 invariant #3 documents the rule for future ISA ports.
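The summation-order hazard described above can be reproduced in a few lines of portable C. This is a minimal sketch, not the project's `accumulate_error`: the function names and the input values are hypothetical, and it assumes `float` arithmetic is evaluated in single precision (no x87 excess precision, no `-ffast-math`).

```c
#include <assert.h>
#include <stddef.h>

/* Pattern that drifted: sum the block locally, then add the subtotal. */
static float block_error_local(const float *err, size_t n)
{
    float ret = 0.0f;
    for (size_t i = 0; i < n; i++)
        ret += err[i];
    return ret;
}

/* Pattern after the fix: add every term straight into the caller's total. */
static void block_error_threaded(const float *err, size_t n, float *ret)
{
    assert(ret != NULL);
    for (size_t i = 0; i < n; i++)
        *ret += err[i];
}

/* Scalar-style reference: one accumulator, one inline `ret += ...`. */
float total_inline(float ret, const float *err, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ret += err[i];
    return ret;
}

/* Outer total when the helper returns a per-block subtotal. */
float total_local(float ret, const float *err, size_t n)
{
    ret += block_error_local(err, n);
    return ret;
}

/* Outer total when the helper receives the accumulator by pointer. */
float total_threaded(float ret, const float *err, size_t n)
{
    block_error_threaded(err, n, &ret);
    return ret;
}
```

Starting from `1e8f` with eight terms of `1.0f`, the inline and pointer-threaded variants both stay at 100000000, because each `1.0f` is below half an ulp (the ulp is 8 at that magnitude). The local-subtotal variant adds `8.0f` in one step and lands on 100000008. Same inputs, different summation tree, different bits.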
Decision
Port calc_psnrhvs to aarch64 NEON in a new TU core/src/feature/arm64/psnr_hvs_neon.c under the same byte-for-byte bit-exactness contract as the AVX2 variant.
Vectorization strategy — one-to-one mirror of AVX2 with lane-width adjusted to NEON's 4-wide int32x4_t:
- DCT butterfly: load the 8×8 block as 16 `int32x4_t` registers (row k → `r_k_lo` holding cols 0-3, `r_k_hi` holding cols 4-7). Run the 30-butterfly `od_bin_fdct8_simd` twice per DCT pass: once for the low half, once for the high half. The 8×8 transpose decomposes into four 4×4 `transpose4x4_s32` calls plus a block-level swap of top-right ↔ bottom-left (because after the quadrant transposes, the top-right 4×4 holds cols 0-3 of transposed rows 4-7, which belong in the lo halves of rows 4-7). A second butterfly + transpose completes the DCT.
- Fixed-point arithmetic: every scalar `(x * k + round) >> shift` becomes `vmulq_s32` + `vaddq_s32` + `vshlq_s32(_, -shift)`. `OD_UNBIASED_RSHIFT32` is implemented via the canonical uint32 logical shift + signed add + arith shift combo; the helpers `od_dct_rshift_neon` and `od_mulrshift_neon` mirror `od_dct_rshift_avx2` and `od_mulrshift_avx2` line-for-line.
- Float accumulators stay scalar (ADR-0159 rule): means, variances, mask, per-coefficient error accumulation reuse the scalar per-block loop verbatim. `accumulate_error` threads the outer `ret` by pointer; see rebase-notes §0052 invariant #3.
- FMA off: `#pragma STDC FP_CONTRACT OFF` at the TU header. Note: aarch64 GCC emits `-Wunknown-pragmas` for this pragma because its support is compiler-specific; however aarch64 GCC does not contract `a + b * c` across statements by default, so the effect is preserved. The pragma is kept for portability with other toolchains and with the AVX2 sibling TU.
- 4×4 transpose idiom: aarch64 lacks the armv7 `vtrnq_s64` intrinsic; the port uses separate `vtrn1q_s64` / `vtrn2q_s64` instead. The transpose is written as two stages (32-bit `vtrn1q_s32` / `vtrn2q_s32` of row pairs, then 64-bit `vtrn1q_s64` / `vtrn2q_s64` of the resulting pairs), yielding a 4×4 transpose in 8 trn instructions (4 per stage).
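The transpose scheme in the list above can be checked on any host by modelling the trn lane semantics in scalar C. The sketch below is an illustration under stated assumptions, not the code in `psnr_hvs_neon.c`: `vec4` is a hypothetical stand-in for `int32x4_t`, the `trn*` helpers emulate the corresponding intrinsics, and all function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for int32x4_t so the sketch runs on any host. */
typedef struct { int32_t v[4]; } vec4;

/* Lane semantics of the aarch64 trn intrinsics, modelled in scalar C. */
static vec4 trn1_s32(vec4 a, vec4 b) { vec4 r = {{a.v[0], b.v[0], a.v[2], b.v[2]}}; return r; }
static vec4 trn2_s32(vec4 a, vec4 b) { vec4 r = {{a.v[1], b.v[1], a.v[3], b.v[3]}}; return r; }
static vec4 trn1_s64(vec4 a, vec4 b) { vec4 r = {{a.v[0], a.v[1], b.v[0], b.v[1]}}; return r; }
static vec4 trn2_s64(vec4 a, vec4 b) { vec4 r = {{a.v[2], a.v[3], b.v[2], b.v[3]}}; return r; }

/* Two-stage 4x4 transpose: 32-bit trn on row pairs, then 64-bit trn. */
static void transpose4x4(vec4 r[4])
{
    vec4 t0 = trn1_s32(r[0], r[1]);
    vec4 t1 = trn2_s32(r[0], r[1]);
    vec4 t2 = trn1_s32(r[2], r[3]);
    vec4 t3 = trn2_s32(r[2], r[3]);
    r[0] = trn1_s64(t0, t2);
    r[1] = trn1_s64(t1, t3);
    r[2] = trn2_s64(t0, t2);
    r[3] = trn2_s64(t1, t3);
}

/* lo[k] holds cols 0-3 of row k, hi[k] holds cols 4-7. */
void transpose8x8(vec4 lo[8], vec4 hi[8])
{
    assert(lo && hi);
    transpose4x4(&lo[0]); /* top-left quadrant */
    transpose4x4(&hi[0]); /* top-right quadrant */
    transpose4x4(&lo[4]); /* bottom-left quadrant */
    transpose4x4(&hi[4]); /* bottom-right quadrant */
    for (int k = 0; k < 4; k++) { /* swap top-right <-> bottom-left */
        vec4 tmp = hi[k];
        hi[k] = lo[k + 4];
        lo[k + 4] = tmp;
    }
}

/* Returns 1 when the quadrant scheme matches a naive element-wise transpose. */
int transpose8x8_matches_naive(void)
{
    vec4 lo[8], hi[8];
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 4; c++) {
            lo[r].v[c] = r * 8 + c;     /* M[r][c] = 8r + c */
            hi[r].v[c] = r * 8 + c + 4;
        }
    transpose8x8(lo, hi);
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++) {
            int32_t got = (c < 4) ? lo[r].v[c] : hi[r].v[c - 4];
            if (got != c * 8 + r) /* expect M[c][r] */
                return 0;
        }
    return 1;
}
```

Without the final swap loop the check fails, because the transposed top-right quadrant would still sit in the hi halves of rows 0-3 instead of the lo halves of rows 4-7.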
Runtime dispatch: psnr_hvs.c's init() gains an ARCH_AARCH64 branch that picks calc_psnrhvs_neon when flags & VMAF_ARM_CPU_FLAG_NEON. The AVX2 branch is unchanged.
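A rough sketch of that selection logic follows. Only `calc_psnrhvs_neon` and `VMAF_ARM_CPU_FLAG_NEON` are names from this ADR. The flag's bit value, the function-pointer signature, the scalar stub name, and `select_calc_psnrhvs` are hypothetical placeholders, and the real branch sits behind the `ARCH_AARCH64` guard, which is omitted here so the sketch compiles on any host.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bit value for illustration; the project defines the real one. */
#define VMAF_ARM_CPU_FLAG_NEON (1u << 0)

/* Hypothetical simplified signature: the stubs only report which path ran. */
typedef const char *(*calc_psnrhvs_fn)(void);

static const char *calc_psnrhvs_scalar(void) { return "scalar"; }
static const char *calc_psnrhvs_neon(void) { return "neon"; }

/* Start from the scalar path and upgrade only when the NEON flag is set.
 * In the real init() this branch would sit behind the ARCH_AARCH64 guard. */
calc_psnrhvs_fn select_calc_psnrhvs(unsigned flags)
{
    calc_psnrhvs_fn fn = calc_psnrhvs_scalar;
    if (flags & VMAF_ARM_CPU_FLAG_NEON)
        fn = calc_psnrhvs_neon;
    assert(fn != NULL);
    return fn;
}
```

Defaulting to the scalar pointer keeps the fallback explicit: a build or CPU without the flag still gets a valid, bit-exact implementation.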
NOLINT accounting (all with inline ADR-0141 citations):
- `od_bin_fdct8_simd` exceeds `readability-function-size`: the 30-butterfly network must stay together for line-by-line diff against scalar `od_bin_fdct8`.
- Two `sqrt` calls in `compute_mask` strip `performance-type-promotion-in-math-fn`: `sqrt(double)` matches scalar's `float`→`double` promotion; switching to `sqrtf` would break the bit-exact contract.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Bit-exact DCT via 4-wide int32 NEON, halved per pass (this ADR) | Preserves Netflix golden numerically; mirrors AVX2 structurally so reviewers can diff the two SIMD TUs side-by-side | 2× butterfly calls per DCT pass (vs AVX2's 1×) means ~2× the static code in the inline butterfly | Chosen — bit-exactness discipline + reviewability |
| Process 4 rows in parallel (not 8) | Simpler — one register per row | 2× more passes through the transpose network; higher register pressure across 4 half-passes | Rejected — symmetric 8-in-parallel-split-into-halves keeps the AVX2/NEON diff trivial |
| Float DCT with tolerance | Simpler intrinsics; potentially faster on some cores | Requires separate Netflix-golden tolerance ADR; breaks ADR-0159 precedent | Rejected — fork rule is "SIMD must match scalar" absent explicit ADR loosening |
| SVE / SVE2 variant | Length-agnostic; future-proof | Very few aarch64 consumer cores ship SVE2 as of 2026-Q2; QEMU user-mode SVE support is patchy; would need a separate feature-detect path | Deferred — revisit when real SVE2 hardware is common on CI |
| Defer NEON until later | Zero work now | Backlog flagged the matrix gap explicitly; CI already builds aarch64, letting scalar ride is pure perf loss | Rejected — matrix-parity with AVX2 is the established expectation |
Consequences
- Positive:
    - `psnr_hvs` on aarch64 now gets NEON bit-exact parity with scalar. Matches ADR-0159's Netflix-golden contract.
    - ISA-parity matrix for psnr_hvs closes: scalar + AVX2 + NEON.
    - New unit test `test_psnr_hvs_neon.c` pins the bit-exactness contract via DCT-level scalar-vs-SIMD diffs on 5 reproducible inputs; registered in `meson test -C build-aarch64`.
- Negative:
    - NEON TU is ~500 lines, similar in size to AVX2. Mitigated by line-for-line diff-ability against the AVX2 sibling.
    - The two SIMD TUs (AVX2 + NEON) now need to move in lockstep on any future psnr_hvs scalar change from upstream Xiph. Documented in rebase-notes.md §0052.
    - `#pragma STDC FP_CONTRACT OFF` is ignored by aarch64 GCC (non-fatal `-Wunknown-pragmas`). Reviewed as safe: aarch64 GCC does not fold `a + b * c` across statements at default optimization levels, and the scalar float accumulators are all inside a single expression with a single `+=` operator.
- Neutral / follow-ups:
    - SVE2 variant tracked informally under the gap-fill queue; revisit when CI hardware support matures.
    - AVX-512 `psnr_hvs` intentionally not scheduled (AVX2 covers the x86_64 baseline; adding 512 requires re-verifying bit-exactness against a different reduction tree).
Verification
- Unit test `test_psnr_hvs_neon` under `qemu-aarch64-static -L /usr/aarch64-linux-gnu`: 5/5 subtests pass (3 random 12-bit seeds + delta + constant).
- Netflix golden pair (scalar vs NEON via `VMAF_CPU_MASK=0` vs default on aarch64 under QEMU):
    Per-frame `psnr_hvs_y/cb/cr/psnr_hvs` values match byte-for-byte; only the `<fyi fps="…" />` header field (wall-clock timing) differs between scalar and NEON runs, as expected.
- The 1080p 10-bit checkerboard pairs segfault in `qemu-aarch64-static` with the default memory map under heavy threadpool allocations. This is a known QEMU user-mode limitation, not a defect in the port. These pairs will be validated by the native-aarch64 CI job (ARM ubuntu runner) and the Netflix CPU Golden Tests required check.
- `ninja -C build-aarch64` → clean; one pre-existing `-Wpedantic` warning about `float mask[8][8]` qualifier passing to a `const float mask[8][8]` helper (also present in AVX2 TU, inherited from scalar signature).
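The shape of a scalar-vs-SIMD bit-exactness check like the unit test above can be sketched as follows. This is not the contents of `test_psnr_hvs_neon.c`: the input generator, the specific delta and constant values, the `block_fn` signature, and the toy transforms are all hypothetical, chosen only to show the `memcmp`-based comparison on reproducible inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transform signature: 8x8 block in, 8x8 coefficients out. */
typedef void (*block_fn)(int32_t out[64], const int32_t in[64]);

/* Input families named after the subtests: random 12-bit, delta, constant. */
enum input_kind { INPUT_RANDOM12, INPUT_DELTA, INPUT_CONSTANT };

static void fill_block(int32_t blk[64], enum input_kind kind, uint32_t seed)
{
    for (int i = 0; i < 64; i++) {
        if (kind == INPUT_RANDOM12) {
            seed = seed * 1664525u + 1013904223u; /* LCG: fixed seed, same data */
            blk[i] = (int32_t)(seed >> 20);       /* top 12 bits: 0..4095 */
        } else if (kind == INPUT_DELTA) {
            blk[i] = (i == 0) ? 4095 : 0;         /* single non-zero sample */
        } else {
            blk[i] = 2048;                        /* flat block */
        }
    }
}

/* Returns 1 only if both implementations produce identical bytes. */
int outputs_bit_exact(block_fn ref, block_fn simd, enum input_kind kind, uint32_t seed)
{
    int32_t in[64], a[64], b[64];
    assert(ref && simd);
    fill_block(in, kind, seed);
    ref(a, in);
    simd(b, in);
    return memcmp(a, b, sizeof a) == 0;
}

/* Toy stand-ins for "scalar" and "SIMD": same result, different arithmetic. */
void toy_times2(int32_t out[64], const int32_t in[64])
{
    for (int i = 0; i < 64; i++) out[i] = in[i] * 2;
}

void toy_add_self(int32_t out[64], const int32_t in[64])
{
    for (int i = 0; i < 64; i++) out[i] = in[i] + in[i];
}

/* Deliberately wrong variant: off by one on odd inputs. */
void toy_biased(int32_t out[64], const int32_t in[64])
{
    for (int i = 0; i < 64; i++) out[i] = in[i] * 2 + (in[i] & 1);
}
```

Comparing whole output buffers with `memcmp` rather than with a tolerance is what makes the contract byte-for-byte: `toy_times2` and `toy_add_self` pass on every input family, while `toy_biased` fails on the delta block.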
References
- ADR-0159 — AVX2 variant (sister port, bit-exactness contract source of truth).
- ADR-0138 — AVX2 convolve bit-exact via double accumulators.
- ADR-0139 — SSIM per-lane scalar-float reduction for bit-exactness.
- ADR-0141 — touched-file lint-clean rule (scope of the NOLINTs above).
- ADR-0145 — NEON-after-AVX2 port precedent (motion_v2 NEON followed AVX2; this PR mirrors that split for psnr_hvs).
- Xiph/Daala DCT source: `core/src/feature/third_party/xiph/psnr_hvs.c` (BSD-licensed, `Copyright 2001-2012 Xiph.Org`).
- rebase-notes 0052: upstream-sync invariants (shared with ADR-0159; NEON TU added to the `Touches` list and invariant #3 expanded).
- User direction 2026-04-24: "alter go on" after PR #96 merged, confirming T3-5-neon sister-port execution per ADR-0159's "NEON follow-up PR" commitment.
Status update 2026-05-09
The §Consequences bullet "AVX-512 psnr_hvs intentionally not scheduled" has now been empirically validated under the T3-9 bench-first methodology and closes as an AVX2 ceiling. ADR-0350 carries the re-bench, the per-symbol cycle-share breakdown (calc_psnrhvs_avx2 78.42 % scalar tail vs od_bin_fdct8x8_avx2 14.82 % DCT) and the Amdahl ceiling calculation that puts a realistic 16-lane DCT at 1.07–1.08× over the current AVX2 path — well below the 1.3× T3-9 ship gate. The original ADR-0160 body is unchanged per the ADR-0028 / ADR-0106 immutability rule; this appendix only records that the deferral has graduated from "intentionally not scheduled" to "ceiling-confirmed by re-bench" and points forward to ADR-0350 / ADR-0180 as the authoritative close-outs. T3-9 (a) is DONE-as-ceiling in BACKLOG.md.
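The ceiling argument can be sanity-checked with Amdahl's law. The sketch below is an illustration, not ADR-0350's calculation: it assumes the 14.82 % DCT share is the only part that speeds up and that doubling the lane count gives at most a 2× gain on that share. The function name is hypothetical.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the runtime
 * is accelerated by a factor s and the rest is unchanged. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With `p = 0.1482` and `s = 2.0` this gives about 1.080, in line with the upper end of the range quoted above. Even an arbitrarily large `s` is bounded by `1 / (1 - 0.1482)`, about 1.174, which is still below the 1.3× gate.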