Research-0013: SIMD DX framework — audit + NEON bit-exactness port
- Status: Active
- Workstream: ADR-0140
- Last updated: 2026-04-21
Question
After ADR-0138 and ADR-0139, the fork holds two auditable bit-exactness patterns that every future SIMD kernel must preserve:
- Single-rounded `float * float` → widen to `double` → `double +=` (no FMA).
- Per-lane scalar-double reduction for kernels whose scalar C loop type-promotes via a `double` literal (`2.0 * ...`).
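A minimal scalar C sketch of the first pattern, assuming `FLT_EVAL_METHOD == 0`. The function names are hypothetical and are not taken from the fork. The second function is there only to show the drift that appears when the widening happens before the multiply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not the fork's code.
 * Pattern 1: multiply in float (one rounding to float), widen the rounded
 * product to double, then accumulate in double. No fused multiply-add. */
static double dot_single_rounded(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        float prod = a[i] * b[i]; /* rounded to float here */
        acc += (double)prod;      /* widen, then double += */
    }
    return acc;
}

/* Contrast: widening before the multiply keeps the product's low bits,
 * so the sum can differ from the pattern above. */
static double dot_widened_first(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += (double)a[i] * (double)b[i];
    }
    return acc;
}
```

With both inputs set to the float just above 1.0, the single-rounded product drops the low-order term that the widened-first product keeps, so the two sums differ in the last bits of the double.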
Both patterns were written longhand in AVX2 / AVX-512 TUs. We wanted to answer two questions at once:
- Q1. What is the minimum fork-internal scaffolding that lets the next SIMD kernel reuse these patterns without re-deriving them (and without importing Highway / simde / xsimd, which the fork's SIMD policy rules out; see memory `feedback_simd_dx_scope.md`)?
- Q2. Does NEON's current SSIM / convolve SIMD on aarch64 match the same bit-exactness bar as AVX2 / AVX-512 after PR #18 + PR #76, or does it carry ADR-0138 / ADR-0139-class drift that was never surfaced because CI has no aarch64 runner?
Sources
- Scalar references: `_iqa_convolve` and `ssim_accumulate_default_scalar`.
- Existing AVX2 / AVX-512 SIMD: `ssim_avx2.c` / `ssim_avx512.c` / `convolve_avx2.c` / `convolve_avx512.c` as of commit `6db63310` (PR #76 HEAD).
- Existing NEON SIMD: `ssim_neon.c` as of commit `be1f74d1` (PR #76 pre-DX).
- Upstream baseline: Netflix/vmaf `origin/master` has no SSIM SIMD on any ISA. `convolve_neon` doesn't exist upstream; the existing `iqa_convolve_avx2` / `iqa_convolve_avx512` are fork-only. First NEON SSIM touch is fork commit `81fcd42e`.
- Measurement (aarch64):
  - `aarch64-linux-gnu-gcc` cross toolchain.
  - `qemu-aarch64-static -L /usr/aarch64-linux-gnu` exe_wrapper (meson cross-file `build-aux/aarch64-linux-gnu.ini`).
  - `vmaf --cpumask 255`: all SIMD blocked (scalar path).
  - `vmaf`: native (NEON active).
  - `--precision max` forces `%.17g` IEEE-754 round-trip.
- Related ADRs: ADR-0125 (bit-exactness ground rule), ADR-0138 / ADR-0139 (x86 patterns), ADR-0140 (this workstream's decision).
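The `%.17g` round-trip in the measurement setup matters because fewer digits can hide drift of the size this audit looks for. The sketch below shows that with the two frame-0 `float_ssim` values recorded in this note. The helper names are hypothetical and this is not how `vmaf` formats its output internally.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: print v with `digits` significant digits. */
static int format_score(char *buf, size_t cap, double v, int digits)
{
    return snprintf(buf, cap, "%.*g", digits, v);
}

/* Returns 1 if a and b print to the same text at the given precision. */
static int same_text(double a, double b, int digits)
{
    char sa[64], sb[64];
    format_score(sa, sizeof sa, a, digits);
    format_score(sb, sizeof sb, b, digits);
    return strcmp(sa, sb) == 0;
}

/* Returns 1 if printing with 17 significant digits and parsing the text
 * back yields the identical double. */
static int round_trips_17g(double v)
{
    char buf[64];
    format_score(buf, sizeof buf, v, 17);
    return strtod(buf, NULL) == v;
}
```

At 7 significant digits the scalar and NEON values print identically, so a text diff would miss the drift. At 17 digits they differ and each parses back to the same double.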
Findings
SIMD-gap inventory (9 gaps; shaped PR #A + PR #B scope)
Audit of core/src/feature/ against the fork's existing AVX2 / AVX-512 / NEON coverage:
| # | Feature | AVX2 | AVX-512 | NEON | Gap class |
|---|---|---|---|---|---|
| 1 | iqa_convolve (SSIM / MS-SSIM inner) | ✓ | ✓ | missing | New NEON TU |
| 2 | ssim_accumulate (per-lane reduce) | ✓ | ✓ | present but ADR-0139 drift | NEON bit-exactness port |
| 3 | ssimulacra2 | missing | missing | missing | New 3-ISA |
| 4 | motion_v2 | present (upstream) | present (upstream) | missing | New NEON TU |
| 5 | vif_statistic | present (upstream) | missing | present (partial) | AVX-512 + NEON gap |
| 6 | float_ansnr | missing | missing | missing | New 3-ISA |
| 7 | moment | missing | missing | missing | New 3-ISA |
| 8 | luminance_tools | missing | missing | missing | New 3-ISA |
| 9 | DX macros + scaffold skill | n/a | n/a | n/a | Tooling |
Gap #9 is the tool that makes gaps #1-#8 tractable. PR #A demonstrates it on gaps #1 and #2 (real code, both demos shipped). Gaps #3-#8 are PR #B scope.
Drift audit under QEMU (ADR-0139 check on NEON)
Ran scalar vs NEON on the Netflix 576×324 pair + 1920×1080 checkerboard (1-px) pair, cross-compiled with aarch64-linux-gnu-gcc, run under qemu-aarch64-static.
Before the NEON per-lane-scalar-double fix:
Netflix pair, frame 0: float_ssim 0.92502313852310181 (scalar)
0.92502307891845703 (NEON) Δ ≈ 6e-08
Netflix pair, frame 4: float_ssim 0.85594284534454346 (scalar)
0.85594278573989868 (NEON) Δ ≈ 6e-08
Netflix pair, frame 7: float_ms_ssim 0.95786428224536158 (scalar)
0.95786428986614292 (NEON) Δ ≈ 8e-09
Root cause: same as ADR-0139 on AVX2 / AVX-512. Scalar computes lv = (2.0 * rm * cm + C1) / ... in double (because the 2.0 C literal promotes its float operands to double and lv is double). The NEON code kept everything in float32x4_t, then widened the final l * c * s to double — a different numerical pipeline by construction.
After applying the ADR-0139 fix (per-lane scalar-double reduction using SIMD_ALIGNED_F32_BUF_NEON + SIMD_LANES_NEON), the XML diff collapses to the <fyi fps="..."/> line only (runtime metadata).
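The root cause and the fix can be sketched in portable C. This is an illustration under stated assumptions, not the fork's code: the function names and `LANES` are hypothetical, the plain arrays stand in for lanes spilled from a vector register, and the denominator is the textbook SSIM luminance term, which the note above elides. It assumes `FLT_EVAL_METHOD == 0`.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* hypothetical: stands in for one 4 x f32 vector */

/* Scalar reference shape: the 2.0 literal is a double, so the numerator
 * is evaluated in double. The denominator (an assumption here) stays in
 * float until the division promotes it. */
static double lum_scalar(float rm, float cm, float c1)
{
    return (2.0 * rm * cm + c1) / (rm * rm + cm * cm + c1);
}

/* Float-only pipeline: same formula, but every step rounds to float.
 * This is the kind of pipeline that drifts from the scalar reference. */
static float lum_float_only(float rm, float cm, float c1)
{
    return (2.0f * rm * cm + c1) / (rm * rm + cm * cm + c1);
}

/* Per-lane scalar-double reduction: once the lanes are stored to a
 * buffer, each lane goes through the scalar expression unchanged, so the
 * result matches the scalar loop by construction. */
static double lum_sum_per_lane(const float *rm, const float *cm, float c1)
{
    assert(rm != NULL && cm != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < LANES; i++) {
        sum += lum_scalar(rm[i], cm[i], c1);
    }
    return sum;
}
```

With `rm == cm` set to the float just above 1.0 and `c1 == 0`, the float-only version returns exactly 1, while the promoted version returns a value slightly above 1. The gap is small per sample, but it is the same class of difference that the measurements above show.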
Convolve NEON port — bit-exact by construction
New NEON `iqa_convolve_neon` built on top of `SIMD_WIDEN_ADD_F32_F64_NEON_4L`. Matches scalar under `FLT_EVAL_METHOD == 0`:
- 11-tap Gaussian (odd kernel, `kw_even == 0`): exact.
- 8-tap box (even kernel, `kw_even == 1`): exact.
- Sizes: 11×11 / 12×12 / 19×19 / 20×20 / 25×25 / 33×17 / 61×41 / 576×324 / 1920×1080 / 8×8 / 16×16 / 21×13.
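One way to see why a lane-parallel convolve can match scalar exactly is that each output sample keeps its own accumulator and sees the taps in scalar order, so no additions are reordered. The sketch below emulates four lanes in plain C on a 1-D convolution and compares the outputs bitwise. All names are hypothetical, and the 1-D shape and `float` output are simplifications, not a description of `iqa_convolve_neon`.

```c
#include <stddef.h>
#include <string.h>

#define LANES 4 /* hypothetical: emulates one 4 x f32 vector */

/* Scalar reference: one double accumulator per output, taps in order.
 * src must hold n_out + taps - 1 samples. */
static void conv1d_scalar(const float *src, size_t n_out,
                          const float *k, size_t taps, float *dst)
{
    for (size_t x = 0; x < n_out; x++) {
        double acc = 0.0;
        for (size_t t = 0; t < taps; t++) {
            float prod = src[x + t] * k[t]; /* single rounding to float */
            acc += (double)prod;            /* widen, then double += */
        }
        dst[x] = (float)acc;
    }
}

/* Lane-emulated variant: LANES adjacent outputs advance together, tap by
 * tap. Each lane owns its accumulator, so per-output order is unchanged. */
static void conv1d_lanes(const float *src, size_t n_out,
                         const float *k, size_t taps, float *dst)
{
    size_t x = 0;
    for (; x + LANES <= n_out; x += LANES) {
        double acc[LANES] = { 0.0, 0.0, 0.0, 0.0 };
        for (size_t t = 0; t < taps; t++) {
            for (size_t l = 0; l < LANES; l++) {
                float prod = src[x + l + t] * k[t];
                acc[l] += (double)prod;
            }
        }
        for (size_t l = 0; l < LANES; l++) {
            dst[x + l] = (float)acc[l];
        }
    }
    if (x < n_out) { /* scalar tail for the leftover outputs */
        conv1d_scalar(src + x, n_out - x, k, taps, dst + x);
    }
}

/* Bitwise comparison: stricter than ==, it separates +0.0f from -0.0f. */
static int bit_exact(const float *a, const float *b, size_t n)
{
    return memcmp(a, b, n * sizeof *a) == 0;
}
```

A horizontal reduction across lanes, by contrast, would change the order of additions and could not be expected to match the scalar loop bit for bit.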
test_iqa_convolve extended to cover aarch64 under QEMU — now runs in the arm64 / aarch64 arch filter alongside the existing x86_64 / x86 coverage.
Cross-compile infrastructure
Added build-aux/aarch64-linux-gnu.ini so local meson setup --cross-file=... works. Host requires aarch64-linux-gnu-gcc + qemu-user-static + an aarch64 sysroot under /usr/aarch64-linux-gnu. QEMU is invoked with -L /usr/aarch64-linux-gnu so the dynamic linker /lib/ld-linux-aarch64.so.1 resolves.
One pre-existing test-gate bug surfaced during this audit: core/test/dnn/test_cli.sh invokes $VMAF_BIN directly from bash. Meson's exe_wrapper is not applied to env-provided binaries inside shell scripts, and the host's binfmt_misc entry doesn't know about the aarch64 sysroot prefix — so qemu fails to load the aarch64 dynamic linker. Fixed by gating the test_cli registration on not meson.is_cross_build() in core/test/dnn/meson.build. Unrelated to the NEON work; fix is in-scope for PR #A because PR #A introduces the aarch64 cross-compile lane.
Decision emitted
See ADR-0140. Two-part framework: a header (simd_dx.h) with ISA-specific macros + an upgraded /add-simd-path skill that scaffolds new SIMD TUs from a short kernel-spec declaration.
Dead ends / not chosen
- Cross-ISA portability layer (Highway / simde / xsimd). Writes once, runs everywhere, but hides the ISA-specific bit-exactness rules (FMA availability, rounding mode, lane ordering) inside the abstraction. The fork's SIMD policy (user memory `feedback_simd_dx_scope.md`) explicitly rules this out.
- Generic cross-ISA `SIMD_WIDEN_ADD_F32_F64(acc, a, b)` macro. A single name behind `#if __AVX2__` / `__AVX512F__` / `__ARM_NEON` dispatch would read cleaner in-line, but it conceals which intrinsics a given line emits and breaks reviewer intuition on bit-exactness trade-offs. ISA-suffixed names (`_AVX2_4L`, `_AVX512_8L`, `_NEON_4L`) let a reviewer audit the intrinsics without opening the header.
- Codegen-only skill (no header). The skill alone would have cut the per-TU bootstrap cost, but the bit-exactness patterns would stay copy-pasted and easy to lose silently on the next port.
- Dedicated `ssim_accumulate_*` unit test (AVX2 + AVX-512 + NEON uniformly). Good idea but broader than PR #A: it spans 3 ISAs and would inflate the review surface. Deferred into PR #B.
Status
Active. Will be closed when PR #B lands and the macros have been consumed by ≥3 additional kernels (the ssimulacra2 + motion_v2 + vif_statistic set).