ADR-0584 — `float_moment` SVE2 port¶

Field	Value
Status	Accepted
Date	2026-05-16
Deciders	lusoris
Tags	arm64, sve2, simd, float_moment, bit-exactness

Context¶

float_moment (the 1st-moment / 2nd-moment feature extractor) has a NEON kernel (moment_neon.c) but no SVE2 sibling. The NEON path processes four f32 lanes per iteration using float32x4_t and accumulates into float64x2_t pairs — a fixed-width design that cannot exploit SVE2 registers wider than 128 bits on Neoverse V2 / Cortex-X4 and later hardware.

ssimulacra2_sve2.c (ADR-0213) established the fork's SVE2 pattern. float_moment is the simplest candidate for a second SVE2 port: two functions, pure reduction, no libm or branch-heavy code, and an existing f64-accumulator contract (ADR-0179) that maps cleanly onto svfloat64_t.

Decision¶

Add core/src/feature/arm64/moment_sve2.c + moment_sve2.h implementing compute_1st_moment_sve2 and compute_2nd_moment_sve2 using the SVE2 VLA f32→f64 widening pattern:

Inner loop steps by svcntd() f32 elements per iteration (= number of f64 lanes per register).
Each iteration loads up to svcntd() f32 elements under a svwhilelt_b32 predicate, squares (for 2nd moment) in f32, then widens with svcvt_f64_f32_x under a matching svwhilelt_b64 predicate.
Accumulates into svfloat64_t dsum; per-row svaddv_f64 collapses to scalar double.
No scalar tail loop: the predicated load handles row tails natively.

The step is svcntd() and NOT svcntw() because svcvt_f64_f32_x widens only the lower svcntd() f32 lanes. Using svcntw() would silently skip the upper half of the f32 register on SVE2 hardware with registers wider than 128 bits.

Dispatch¶

float_moment.c selects SVE2 over NEON when VMAF_ARM_CPU_FLAG_SVE2 is set:

#if HAVE_SVE2
if (cpu_flags & VMAF_ARM_CPU_FLAG_SVE2) {
    s->moment1 = compute_1st_moment_sve2;
    s->moment2 = compute_2nd_moment_sve2;
}
#endif

NEON remains the fallback. The SVE2 path is purely additive.

Build gate¶

The moment_sve2.c TU is compiled in its own static library with -march=armv9-a+sve2 -ffp-contract=off, registered under the existing if is_sve2_supported block in core/src/meson.build. Darwin forces is_sve2_supported = false per ADR-0419; the TU is therefore never linked on Apple Silicon.

Parity test¶

core/test/test_moment_simd.c gains four SVE2 test cases (test_sve2_seed_{a,b}, test_sve2_aligned_w, test_sve2_tiny) using the existing SIMD_BITEXACT_ASSERT_RELATIVE macro at MOMENT_REL_TOL = 1e-7. The test cases are guarded by #if ARCH_AARCH64 && HAVE_SVE2 at compile time and by a vmaf_get_cpu_flags() check at runtime so a NEON-only host passes without executing the SVE2 path.

Alternatives considered¶

No alternatives: the implementation is a mechanical port of the NEON pattern to the canonical SVE2 VLA f32→f64 accumulation pattern. The only non-obvious choice (step size svcntd() vs svcntw()) is dictated by the semantics of svcvt_f64_f32_x and is documented inline in the source.

Consequences¶

ARM Graviton (Neoverse V1/V2) and Cortex-X4 guests running the CI ARM lane will exercise the SVE2 path when HWCAP2_SVE2 is set.
NEON host CI (e.g. ubuntu-24.04-arm on M1-era hardware) skips the SVE2 test cases at runtime with a log message; the NEON tests still run.
x86-64 host builds ignore both NEON and SVE2 paths entirely (compile-time #if ARCH_AARCH64 guards).

References¶

req: "Add ARM SVE2 SIMD path for ONE feature extractor that has NEON but not SVE2 yet"
ADR-0213 — first SVE2 port, established the pattern
ADR-0419 — Darwin SVE2 opt-out
ADR-0179 — float_moment NEON bit-exactness contract
ADR-0138 — SIMD bit-exactness general rule
ADR-0245 — simd_bitexact_test.h harness

ADR-0584 — float_moment SVE2 port¶