Skip to content

ARM NEON / SVE2 backend

libvmaf's aarch64 path uses ARMv8-A NEON intrinsics by default and upgrades to ARMv9-A SVE2 at runtime when the host CPU advertises HWCAP2_SVE2. Unlike the GPU backends, NEON is always built when the host (or cross) compiler targets aarch64 — there is no -Denable_neon toggle. Kernels live under core/src/feature/arm64/ and are dispatched at runtime via vmaf_get_cpu_flags().

The SVE2 path is purely additive: when the build-time probe (cc.compiles(... -march=armv9-a+sve2)) succeeds, the SVE2 sister TUs compile alongside the NEON ones and the dispatch table picks SVE2 if the runtime probe (getauxval(AT_HWCAP2) & HWCAP2_SVE2) fires. Otherwise the binary keeps the NEON dispatch entries unchanged. SVE2 today covers SSIMULACRA 2 and float_moment — see ADR-0213 and ADR-0584. Adding more SVE2 ports follows the same pattern; per-extractor coverage is in the table below.

Build

meson setup build           # NEON sources compile automatically on aarch64
ninja -C build

The only switch that affects NEON code generation is the global enable_asm flag — -Denable_asm=false disables every SIMD path (NEON included) and falls back to scalar C. See ../../development/build-flags.md.

Runtime control

NEON dispatch is per-extractor. To force scalar fallback (debugging, A/B against the reference) mask out the NEON ISA bit at the CLI:

vmaf --cpumask 0 ...        # disable every CPU SIMD ISA, scalar only

--cpumask accepts a 64-bit hex value matching the bits returned by vmaf_get_cpu_flags(); passing 0 is the simplest "scalar only" override.

There is no per-feature NEON disable flag — extractors that have a NEON kernel will pick it whenever --cpumask allows it.

Per-feature coverage

The table below tracks which extractors have a NEON kernel. Coverage matches the Backends column in ../../metrics/features.md.

Feature NEON kernel SVE2 kernel Notes
vif yes no matches AVX2 path bit-for-bit
adm yes no matches AVX2 path bit-for-bit
motion yes no fixed-point legacy motion
motion_v2 yes no pipelined fused-blur variant
float_moment yes yes 1st/2nd moment reduction; SVE2 VLA f32→f64 path (ADR-0584)
float_motion yes no float-pipeline twin
float_adm yes no float-pipeline twin
float_psnr yes no per-plane float PSNR
ciede yes no YUV → CIELAB ΔE
psnr yes no fixed-point per-plane
psnr_hvs yes no bit-identical to scalar — see ADR-0160
ssim / float_ssim yes no shared decimate kernel
float_ms_ssim yes no 9-tap 9/7 wavelet decimate via ms_ssim_decimate_neon
ssimulacra2 yes yes bit-identical to scalar (NEON and SVE2 produce byte-equal output); see ADR-0161, ADR-0162, ADR-0163, ADR-0213
cambi yes no scalar fallback also retained

Bit-exactness

NEON outputs are byte-identical to the scalar C reference for the features that ship a determinism contract:

  • psnr_hvs — pinned by ADR-0160; verified across all three Netflix golden pairs.
  • ssimulacra2 — pinned by ADR-0161 / ADR-0162 / ADR-0163 / ADR-0213; cross-host determinism via vmaf_ss2_cbrtf and the sRGB-EOTF LUT. The SVE2 sister TU is locked to a fixed 4-lane predicate (svwhilelt_b32(0, 4)) so its arithmetic order matches the NEON output regardless of the host's runtime vector length — wider lanes simply stay false in the predicate. Validated under qemu-aarch64-static -cpu max via the cross-file build-aux/aarch64-linux-gnu-sve2.ini.
  • ms_ssim_decimate — pinned by ADR-0125; per-lane vfmaq_n_f32 with broadcast coefficients matches the scalar fmaf chain exactly.

Other extractors are numerically equivalent to their scalar twins within places=4 of the snapshot tolerance but do not carry an explicit byte-identity contract. The Netflix golden CPU gate (make test-netflix-golden) is the cross-arch correctness check.

Build / CI matrix

The Build — Ubuntu ARM clang (CPU) job in the libvmaf build matrix (libvmaf-build-matrix.yml) runs on ubuntu-26.04-arm against clang and exercises the full unit-test + tox suite on real aarch64 hardware (not qemu).

make test-netflix-golden runs on aarch64 in the same matrix and must remain green — see docs/principles.md § 8 (Netflix golden gate).

Limitations

  • No per-feature override: every NEON kernel runs whenever --cpumask permits. To bisect a suspected NEON regression use --cpumask 0 to drop to scalar across all extractors at once, then re-enable per-extractor by running individual --feature invocations.
  • No discrete GPU path on aarch64 yet. The CUDA / SYCL / HIP backends compile for x86_64 only in the current matrix; on Apple Silicon the Metal backend (metal/index.md) is the aarch64 GPU surface. (The Vulkan backend was removed in ADR-0726.)