Research digest 0014 — psnr_hvs NEON sister port (T3-5-neon)¶
- Status: Active (captures the decision path for ADR-0160)
- Related ADRs: ADR-0159 (AVX2, prior art), ADR-0160 (this port), ADR-0139 (float-accumulator rule), ADR-0141 (NOLINT scope), ADR-0145 (precedent: NEON after AVX2 as separate PRs)
- Related rebase-notes: §0052 covers both SIMD TUs jointly.
The question¶
Can we match the AVX2 psnr_hvs port (ADR-0159) on aarch64 under the same byte-for-byte Netflix-golden bit-exactness contract, and if so, what is the smallest-surface diff to audit side-by-side against the AVX2 TU?
What we already had¶
- Scalar reference:
third_party/xiph/psnr_hvs.c— Xiph/Daala 8×8 integer DCT + contrast-sensitivity weighting masking, floats only for per-block means / variances / mask / error accumulators. - AVX2 sister:
x86/psnr_hvs_avx2.c— vectorizes the DCT 8 columns in parallel via__m256i(one register per row, 8× int32). Non-SIMD helpers (load_block_and_means,compute_vars,compute_masksminus the DCT call,accumulate_error,calc_psnrhvs_avx2) are plain scalar C, bit-exact to scalar by construction. - AVX2 bug fix mid-flight:
accumulate_errororiginally used a localfloat ret = 0accumulator, returning the per-block total to the caller which added it to the outerret. IEEE-754 non-associativity broke scalar summation order and drifted the Netflix golden by ~5.5e-5. Fix: threadretby pointer. Lesson captured in rebase-notes §0052 invariant #3. This port inherits the fixed signature. - NEON precedent:
arm64/motion_v2_neon.c(ADR-0145) — confirmed the NEON-after-AVX2 split works well and the test harness layout (newtest_psnr_hvs_neon.carch-gated inmeson.build) is reusable.
Design axis — lane-width bridging (AVX2 8-wide → NEON 4-wide)¶
| Strategy | Code shape | Diff against AVX2 TU | Correctness risk | Picked? |
|---|---|---|---|---|
Half-wide split: each 8-row register becomes lo (cols 0-3) + hi (cols 4-7); butterfly runs twice per DCT pass, transpose is 4×4×4 + block swap | Mirror AVX2 line-for-line with int32x4_t substituted for __m256i; add one extra butterfly call per DCT pass | Almost one-to-one — each AVX2 butterfly call becomes two NEON calls, nothing else changes | Low — lane-wise SIMD is commutative over the pure-int32 butterfly, and the transpose rearranges lanes without touching values | Yes |
| Process 4 rows in parallel (not 8) | 4 rows × int32 = one register per row; run butterfly once per pass but on only 4 rows, then another 4 | Two 4-row passes through the transpose; higher register pressure | Same bit-exactness story but the diff vs AVX2 becomes "restructure pass layout" instead of "halve the lane count" | No — harder to audit against AVX2 |
| Inline 8×8 DCT scalar + only SIMD the mask / error loops | Doesn't touch DCT; still a speedup on mask reductions | Very different from AVX2; doesn't match the DCT-is-hot-kernel profile | Float SIMD reductions violate ADR-0139 without per-lane scalar tails | No — DCT is the hot path by a wide margin |
| SVE / SVE2 port | Length-agnostic; future-proof | Completely different SIMD style; whilelt, svwhilelo etc. | Very few aarch64 consumer cores ship SVE2 as of 2026-Q2; CI hardware doesn't have it | No — deferred; track for a later PR when hardware catches up |
Chosen: half-wide split. The AVX2 TU and NEON TU sit side-by-side in reviewers' editors and diff trivially. Each AVX2 od_bin_fdct8_simd(r0, ..., r7, ...) becomes two NEON calls (..._simd(r0_lo, ..., r7_lo, ...) then ..._simd(r0_hi, ..., r7_hi, ...)). The transpose is the only genuinely NEW piece — four 4×4 vtrn*q_* stages + a top-right ↔ bottom-left block swap; the correctness argument for the swap is written inline in transpose8x8_s32's doc comment.
Design axis — aarch64-specific gotchas¶
vtrnq_s64 doesn't exist on aarch64¶
armv7 NEON had vtrnq_s64 returning int64x2x2_t in one call. aarch64 NEON split this into vtrn1q_s64 and vtrn2q_s64 (returning single int64x2_t each). First compilation surfaced the error; fix was mechanical. Documented inline in the transpose4x4_s32 doc comment so it doesn't re-appear on a future rebase from a reference that happens to use armv7 forms.
#pragma STDC FP_CONTRACT OFF is ignored by aarch64 GCC¶
Surfaces as -Wunknown-pragmas warning. Investigated:
- Without the pragma, does aarch64 GCC ever emit FMA fusion on the scalar float accumulators? Tested with
-O3on the per-block loops: nofmadd/fmlaemitted for theerr * csfsquared-then-summed pattern because each multiplication is in a separate expression from the addition (ret += (err * csf[i][j]) * (err * csf[i][j])). The inner(a * b) * (a * b)is a two-step multiply with no subsequent add in the same statement;+= ...is structurally a separate operation. - aarch64 GCC's default FP-contract level is "off" across statements per the GCC manual (
-ffp-contract=offis the default when no-std=...forces otherwise). The pragma is redundant on this compiler but harmless, and is kept for portability with toolchains that honour it (MSVC on ARM, clang).
QEMU user-mode limits on 1080p × 10-bit¶
The 1080p 10-bit golden pairs (checkerboard_1920_1080_10_3_{0,1,10}_0.yuv) segfault in qemu-aarch64-static when run under the framework's threadpool. Verified this happens on master too with any SIMD path enabled — it's a QEMU memory-map limit, not a regression. Native-aarch64 CI + the Netflix CPU Golden Tests gate will cover those cases.
Verification plan¶
- Unit test:
test_psnr_hvs_neonwith 5 subtests (3 random 12-bit seeds + delta + constant). Must produce byte-identical DCT output to the scalar reference implementation duplicated inside the test TU. - End-to-end 8-bit golden diff:
src01_hrc00_576x324.yuvvssrc01_hrc01_576x324.yuvwithVMAF_CPU_MASK=0(scalar) vs default (NEON). Allpsnr_hvs_{y,cb,cr,psnr_hvs}per-frame values must be byte-identical; only<fyi fps>differs (timing). - CI native-aarch64: the existing ubuntu-arm runner covers the 1080p × 10-bit pairs and the Netflix CPU Golden Tests required check.
- Clang-tidy: zero warnings on the new NEON TU + scalar dispatch update (changed hunks only, per ADR-0141).
Outcome¶
Landed as PR (pending) on branch port/psnr-hvs-neon-t3-5, commit TBD. ADR-0160 accepted; rebase-notes §0052 carries both SIMD sister TUs. ISA-parity matrix for psnr_hvs now covers scalar + AVX2 + NEON; AVX-512 and SVE2 remain unscheduled.
Open questions / follow-ups¶
- AVX-512
psnr_hvs: would extend AVX2 pattern (16 int32 lanes per register, single butterfly call per pass). Not scheduled — AVX2 covers the x86 baseline and re-verifying bit-exactness against a different reduction tree is real work for no consumer benefit today. - SVE2
psnr_hvs: revisit when CI ships SVE2 hardware routinely. Would exercisewhilelt+ predicated arith; a different auditability story vs. the fixed 4-wide NEON TU. - Perf micro-benchmark on native aarch64: this PR did not ship a microbench (QEMU timings are not meaningful). Planned as a follow-up once the native-aarch64 runner is enrolled on shared CI; the AVX2 TU's 3.58× DCT speedup on x86 is the upper-bound reference.