ADR-0784: AVX2 SIMD path for integer SSIM horizontal moment accumulation
- Status: Accepted
- Date: 2026-05-29
- Deciders: lusoris
- Tags: simd, x86, avx2, ssim, performance, fork-local
Context
The ssim feature extractor (core/src/feature/integer_ssim.c) was entirely scalar, with no SIMD dispatch. The per-pixel horizontal moment accumulation in ssim_accumulate_row() dominates runtime: for each of the w output pixels it runs a 9-tap kernel loop that accumulates five int64 moments (mux, muy, x2, xy, y2). At 1080p this is 1920 × 1080 × 9 × 5 ≈ 93M int64 multiply-accumulate operations per frame.
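The loop structure described above can be sketched in portable C. This is a minimal illustration, not the code in integer_ssim.c: the function name, the RowMoments layout, the kernel argument, and the assumption that input rows are padded by 8 samples are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SSIM_TAPS 9 /* 9-tap horizontal kernel, as stated in the Context */

/* Hypothetical output layout: one int64 per moment per output pixel. */
typedef struct {
    int64_t *mux, *muy, *x2, *xy, *y2;
} RowMoments;

/* Scalar sketch of the per-row accumulation. Input rows are assumed to be
 * padded to w + SSIM_TAPS - 1 samples so that x + k is always in range. */
static void accumulate_row_sketch(const uint8_t *ref, const uint8_t *dis,
                                  const int32_t kern[SSIM_TAPS], size_t w,
                                  RowMoments out)
{
    assert(ref && dis && kern);
    for (size_t x = 0; x < w; x++) {
        int64_t mux = 0, muy = 0, x2 = 0, xy = 0, y2 = 0;
        for (int k = 0; k < SSIM_TAPS; k++) {
            const int64_t c = kern[k];
            const int64_t r = ref[x + k];
            const int64_t d = dis[x + k];
            mux += c * r;     /* first-order moments */
            muy += c * d;
            x2 += c * r * r;  /* second-order moments */
            xy += c * r * d;
            y2 += c * d * d;
        }
        out.mux[x] = mux;
        out.muy[x] = muy;
        out.x2[x] = x2;
        out.xy[x] = xy;
        out.y2[x] = y2;
    }
}

/* Multiply-accumulate count per frame: w * h * taps * 5 moments. */
static int64_t macs_per_frame(int64_t w, int64_t h)
{
    return w * h * SSIM_TAPS * 5;
}
```

For a 1920 × 1080 frame, macs_per_frame returns 93,312,000, which is the ~93M figure quoted above.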
PR #111 identified this as the highest-priority gap in the x86 SIMD coverage. The CUDA twin (integer_ssim_score.cu) already uses a two-pass separable design on GPU, confirming the approach is sound.
The numerical contract (docs/principles.md §bit-exactness) requires that SIMD output be bit-identical to scalar for integer arithmetic.
Decision
We add an AVX2 implementation in core/src/feature/x86/integer_ssim_avx2.c that processes 8 output pixels in parallel (8bpc path) or 4 output pixels (16bpc path, which requires 64-bit products). Runtime dispatch via vmaf_get_cpu_flags() selects the AVX2 paths when VMAF_X86_CPU_FLAG_AVX2 is set; boundary pixels fall back to scalar within the same function.
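A rough sketch of what an 8-wide x-loop with a scalar tail might look like for the 8bpc path follows. It is an illustration under stated assumptions, not the contents of integer_ssim_avx2.c: the Moments struct, all function names, the row padding of 8 samples, and the premise that every per-pixel sum fits in int32 are hypothetical. GCC/Clang's `__builtin_cpu_supports` and the `target("avx2")` attribute stand in for the project's vmaf_get_cpu_flags() check and build setup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

#define TAPS 9

typedef struct { int64_t *mux, *muy, *x2, *xy, *y2; } Moments; /* hypothetical */

/* Scalar path for pixels [x0, w); rows are assumed padded to w + TAPS - 1. */
static void row8_scalar(const uint8_t *ref, const uint8_t *dis,
                        const int32_t *kern, size_t x0, size_t w, Moments o)
{
    for (size_t x = x0; x < w; x++) {
        int64_t s[5] = {0};
        for (int k = 0; k < TAPS; k++) {
            const int64_t c = kern[k], r = ref[x + k], d = dis[x + k];
            s[0] += c * r;
            s[1] += c * d;
            s[2] += c * r * r;
            s[3] += c * r * d;
            s[4] += c * d * d;
        }
        o.mux[x] = s[0]; o.muy[x] = s[1];
        o.x2[x] = s[2]; o.xy[x] = s[3]; o.y2[x] = s[4];
    }
}

#if SKETCH_X86
/* Load 8 bytes and zero-extend them to 8 x int32 lanes. */
__attribute__((target("avx2")))
static inline __m256i load8_u8(const uint8_t *p)
{
    return _mm256_cvtepu8_epi32(_mm_loadl_epi64((const __m128i *)p));
}

/* Sign-extend 8 x int32 lanes to int64 and store them (widen at store time). */
__attribute__((target("avx2")))
static inline void store8_i64(int64_t *dst, __m256i v)
{
    _mm256_storeu_si256((__m256i *)dst,
                        _mm256_cvtepi32_epi64(_mm256_castsi256_si128(v)));
    _mm256_storeu_si256((__m256i *)(dst + 4),
                        _mm256_cvtepi32_epi64(_mm256_extracti128_si256(v, 1)));
}

/* 8 output pixels per iteration; assumes each per-pixel sum fits in int32. */
__attribute__((target("avx2")))
static void row8_avx2(const uint8_t *ref, const uint8_t *dis,
                      const int32_t *kern, size_t w, Moments o)
{
    size_t x = 0;
    for (; x + 8 <= w; x += 8) {
        __m256i mux = _mm256_setzero_si256();
        __m256i muy = mux, x2 = mux, xy = mux, y2 = mux;
        for (int k = 0; k < TAPS; k++) {
            const __m256i c = _mm256_set1_epi32(kern[k]);
            const __m256i r = load8_u8(ref + x + k);
            const __m256i d = load8_u8(dis + x + k);
            const __m256i cr = _mm256_mullo_epi32(c, r);
            const __m256i cd = _mm256_mullo_epi32(c, d);
            mux = _mm256_add_epi32(mux, cr);
            muy = _mm256_add_epi32(muy, cd);
            x2 = _mm256_add_epi32(x2, _mm256_mullo_epi32(cr, r));
            xy = _mm256_add_epi32(xy, _mm256_mullo_epi32(cr, d));
            y2 = _mm256_add_epi32(y2, _mm256_mullo_epi32(cd, d));
        }
        store8_i64(o.mux + x, mux);
        store8_i64(o.muy + x, muy);
        store8_i64(o.x2 + x, x2);
        store8_i64(o.xy + x, xy);
        store8_i64(o.y2 + x, y2);
    }
    row8_scalar(ref, dis, kern, x, w, o); /* tail pixels stay scalar */
}
#endif

/* Runtime selection; __builtin_cpu_supports is a stand-in for the
 * vmaf_get_cpu_flags() / VMAF_X86_CPU_FLAG_AVX2 check named in the ADR. */
static void row8_dispatch(const uint8_t *ref, const uint8_t *dis,
                          const int32_t *kern, size_t w, Moments o)
{
    assert(ref && dis && kern);
#if SKETCH_X86
    if (__builtin_cpu_supports("avx2")) {
        row8_avx2(ref, dis, kern, w, o);
        return;
    }
#endif
    row8_scalar(ref, dis, kern, 0, w, o);
}
```

Parallelising over x keeps every lane on the same tap k, so each load is a contiguous 8-byte read and no gather is needed. Comparing row8_dispatch against row8_scalar on the same input is the kind of check a bit-exactness test would perform.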
All accumulation is performed in the same integer domain as the scalar reference (int32 intermediates widened to int64 at store time for 8bpc; direct int64 accumulation for 16bpc). No floating-point is introduced. Integer addition is exact and order-independent, so as long as the int32 intermediates on the 8bpc path cannot overflow, the output is bit-identical to scalar by construction.
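The split between int32 intermediates (8bpc) and int64 accumulation (16bpc) comes down to worst-case headroom, which can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the kernel coefficients passed in are placeholders, since the real weights are not stated in this ADR.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS 9

/* Worst-case value of a second-order moment (x2, xy or y2) for one output
 * pixel: every tap sees the maximum sample value in both inputs.
 * Coefficients are assumed non-negative. */
static int64_t worst_case_moment(int64_t max_sample, const int32_t kern[TAPS])
{
    int64_t sum = 0;
    for (int k = 0; k < TAPS; k++) {
        assert(kern[k] >= 0);
        sum += (int64_t)kern[k] * max_sample * max_sample;
    }
    return sum;
}

/* Nonzero when the value can be held in an int32 lane without overflow. */
static int fits_int32(int64_t v)
{
    return v <= INT32_MAX;
}
```

With 8-bit samples a single product is at most 255 × 255 = 65,025, so int32 lanes have headroom as long as the coefficient sum stays small enough. With 16-bit samples a single product can reach 65,535 × 65,535 = 4,294,836,225, which already exceeds INT32_MAX before any coefficient is applied, hence the 64-bit products on the 16bpc path.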
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Vectorise the k-loop (inner) | Parallelise tap sums | k-loop is only 9 wide; short SIMD bodies; boundary handling still scalar | Chosen approach parallelises the x-loop instead — much wider |
| Vectorise both x and k loops | Maximum SIMD utilisation | Complex gather-scatter for the per-pixel offset addressing | Not needed for the projected 4-6x gain |
| AVX-512 path in addition | 16-wide x-loop for 8bpc | Requires -mavx512bw, adds a separate static lib carve-out | Deferred to follow-up if AVX-512 profiling shows further gain |
| Rewrite using existing float SSIM AVX2 | Reuse float_ssim kernel | Would change the output metric from integer to float moments | Ruled out: different metric, breaks bit-exact contract |
Consequences
- Positive: projected 4–6x speedup on the hot path for AVX2 hosts. Hosts without AVX2 (older x86, arm64, other non-x86) are unaffected and keep the scalar path.
- Positive: dispatch is runtime-safe; no ISA gating at the call site.
- Negative: two new files (integer_ssim_avx2.c, integer_ssim_avx2.h) must be kept in sync with the scalar reference.
- Neutral: integer_ssim.c gains a function-pointer struct in IntegerSsimState (48 bytes) and two dispatch pointers in init().
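The dispatch pattern behind these consequences can be sketched as follows. This is an assumption-laden illustration rather than the actual IntegerSsimState: the struct names, the flag constant, the function signature, and the placeholder kernels (which only return an id) are all hypothetical, and the sketch does not try to reproduce the 48-byte layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag and signature; the real ones live in the code base. */
#define DEMO_CPU_FLAG_AVX2 (1u << 0)
typedef int (*row_fn)(const void *ref, const void *dis, size_t w);

enum { ID_SCALAR_8, ID_SCALAR_16, ID_AVX2_8, ID_AVX2_16 };

/* Placeholder kernels: each returns an id so the selection is observable. */
static int row_8bpc_scalar(const void *r, const void *d, size_t w)
{ (void)r; (void)d; (void)w; return ID_SCALAR_8; }
static int row_16bpc_scalar(const void *r, const void *d, size_t w)
{ (void)r; (void)d; (void)w; return ID_SCALAR_16; }
static int row_8bpc_avx2(const void *r, const void *d, size_t w)
{ (void)r; (void)d; (void)w; return ID_AVX2_8; }
static int row_16bpc_avx2(const void *r, const void *d, size_t w)
{ (void)r; (void)d; (void)w; return ID_AVX2_16; }

typedef struct {
    row_fn row_8bpc;
    row_fn row_16bpc;
} DemoSsimDispatch;

typedef struct {
    DemoSsimDispatch fn; /* other extractor state elided */
} DemoSsimState;

/* Pointers are chosen once at init time from the CPU flags. */
static void demo_init(DemoSsimState *s, unsigned cpu_flags)
{
    assert(s);
    s->fn.row_8bpc = row_8bpc_scalar;
    s->fn.row_16bpc = row_16bpc_scalar;
    if (cpu_flags & DEMO_CPU_FLAG_AVX2) {
        s->fn.row_8bpc = row_8bpc_avx2;
        s->fn.row_16bpc = row_16bpc_avx2;
    }
}

/* Call site: no ISA check, only an indirect call through the struct. */
static int demo_extract_row(const DemoSsimState *s, int bpc,
                            const void *ref, const void *dis, size_t w)
{
    return (bpc > 8 ? s->fn.row_16bpc : s->fn.row_8bpc)(ref, dis, w);
}
```

Because the choice is made once in the init step, the per-row call site stays identical on every host, which is what keeps ISA gating out of the hot path.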
References
- PR #111 (priority gap analysis, integer_ssim item #1).
- ADR-0139 (ssim_avx2 bit-exact contract: double-precision reduction for float SSIM; not applicable here since moments are integer).
- ADR-0784 stub: docs/adr/0784-integer-ssim-avx2.md.stub.
- core/test/test_integer_ssim_simd.c: bit-exactness test.
- docs/backends/x86/integer-ssim-avx2.md: user-facing documentation.